Compositional Reasoning In LLMs Is Far From Being Solved
There's still a lot of research to do for improving compositional reasoning in large (and small) language models.
Modern user-facing LLMs such as Gemini and GPT-4o have made significant advancements in tackling tasks that require reasoning. Recently, with the remarkable performance of OpenAI’s o3 on the ARC-AGI challenge, there has been even more interest in understanding and pushing the reasoning capabilities of such LLMs. However, even with all the promising improvements in LLM reasoning, there are other areas that warrant more research - for example, compositional reasoning. As we will see in this post, when LLMs face tasks requiring compositional reasoning, their performance is often suboptimal. My goal is to provide an overview of some of the latest advancements in compositional reasoning in LLMs and some challenges that remain.
What is Compositional Reasoning?
Compositional reasoning is the ability of an LLM to combine the solutions of sub-tasks to solve a higher-level task that is a composition of those sub-tasks.
I also like to refer to the sub-tasks as atomic tasks, and the higher-level tasks as composite tasks. As shown in the picture below, a composite task can be formed using one or more atomic tasks.

We can say that an LLM is successful at compositional reasoning when it is able to combine the skills and abilities acquired from the atomic tasks to solve the composite task.
Why should we care about improving compositional reasoning?
It is tedious (and nearly impossible) to collect training data that covers every possible use case. For example, in a code generation setting we might only have access to training samples corresponding to single function calls, but at test time we might need code that combines those single function calls. The model should be able to learn that the solution to a composite problem is a composition of many fundamental function calls (a toy sketch of this scenario follows these three points).
Making models more human-like: Humans have the remarkable ability to combine knowledge from atomic facts / tasks to solve a composite task, and to do so in a sample-efficient way. LLMs, on the other hand, still exhibit failure modes in compositional reasoning (as we will see later). Using this insight, we can train LLMs to become better at compositional reasoning, similar to humans.
It can obviate further model training on compositional data, thus reducing costs and electricity usage (which I discuss more here).
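To make the code generation point concrete, here is a toy, hypothetical sketch (the function names and values are mine, not from any of the papers discussed below): the training data only ever shows single function calls, yet the desired test-time behavior is a program that chains them.

```python
# Toy illustration (hypothetical functions and values, not from any paper discussed here).
# Suppose the training data only contains examples of single, "atomic" function calls.
def get_user(user_id: int) -> dict:
    # Atomic call 1: look up a user record (stubbed for illustration).
    return {"id": user_id, "name": f"user{user_id}"}

def get_orders(user: dict) -> list[float]:
    # Atomic call 2: fetch the user's order totals (stubbed for illustration).
    return [19.99, 5.50]

# At test time we may need code that composes the atomic calls, even though no
# training sample ever showed them chained together.
def total_spend(user_id: int) -> float:
    user = get_user(user_id)
    return sum(get_orders(user))

print(total_spend(42))  # 25.49
```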
Next, I discuss some of the latest advancements in LLM compositional reasoning and the existing challenges.
Challenges and Advancements in Compositional Reasoning
I will first discuss the idea of the compositionality gap - increasing model size is not necessarily a solution for improving model performance on compositional reasoning tasks. Then I’ll discuss two categories of tasks that expose failure modes of modern LLMs, along with some approaches that have been tried to overcome these failures (prompting, fine-tuning, and training from scratch).
Compositionality Gap: Current LLMs are bigger than ever and their ability to memorize knowledge is impressive. Nevertheless, as these models get larger, their ability to solve increasingly complex tasks has not improved accordingly. This is referred to as the compositionality gap, defined as the fraction of compositional questions for which the model answers the individual sub-questions correctly but fails to answer the composed question. The authors of the compositionality gap paper show that as model size increases (within the GPT-3 family), the compositionality gap does not decrease. The gap was particularly observable when the model was prompted using standard prompting and Chain-of-Thought (CoT) prompting.
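To make the definition concrete, here is a minimal sketch of how the gap could be computed from per-question correctness flags (my own paraphrase of the definition above, not code or numbers from the paper):

```python
# Minimal sketch: compute the compositionality gap from per-question results.
# (My paraphrase of the definition above; not code or numbers from the paper.)
def compositionality_gap(results: list[dict]) -> float:
    # Each entry: {"subs_correct": bool, "composite_correct": bool}
    subs_right = [r for r in results if r["subs_correct"]]
    composite_wrong = [r for r in subs_right if not r["composite_correct"]]
    return len(composite_wrong) / len(subs_right)

toy_results = [
    {"subs_correct": True,  "composite_correct": True},
    {"subs_correct": True,  "composite_correct": False},
    {"subs_correct": False, "composite_correct": False},
    {"subs_correct": True,  "composite_correct": False},
]
print(compositionality_gap(toy_results))  # 2 of 3 -> ~0.67
```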
To reduce the compositionality gap, the authors introduce the ‘Self-Ask’ prompting technique (building on top of CoT prompting), where the prompt is modified to make the LLM ask explicit ‘follow-up’ questions and answer them before producing the final answer. When comparing standard prompting, CoT, and Self-Ask on a carefully curated multi-hop compositional QA dataset, the authors show that Self-Ask significantly improves model performance. All in all, they conclude that increasing model size is not a general solution for compositional reasoning, and that more sophisticated prompting is a promising avenue to explore.
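For intuition, below is a rough sketch of what a Self-Ask-style prompt looks like (paraphrasing the demonstration format from the paper; the exact wording may differ slightly):

```python
# Rough sketch of the Self-Ask prompt format (paraphrased from the paper's
# in-context demonstrations; exact wording may differ).
SELF_ASK_DEMO = """\
Question: Who lived longer, Theodor Haecker or Harry Vaughan Watkins?
Are follow up questions needed here: Yes.
Follow up: How old was Theodor Haecker when he died?
Intermediate answer: Theodor Haecker was 65 years old when he died.
Follow up: How old was Harry Vaughan Watkins when he died?
Intermediate answer: Harry Vaughan Watkins was 69 years old when he died.
So the final answer is: Harry Vaughan Watkins.
"""

def build_self_ask_prompt(question: str) -> str:
    # The new question is appended after the demonstration; the model is expected to
    # generate its own follow-up questions and intermediate answers before the final answer.
    return f"{SELF_ASK_DEMO}\nQuestion: {question}\nAre follow up questions needed here:"
```

The follow-up questions explicitly decompose the composite question into its atomic sub-questions, a decomposition that standard prompting leaves implicit.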
Although ‘Self-Ask’ prompting reduces the compositionality gap, a gap between the model’s ability to solve atomic tasks and composite tasks still remains, calling for more research in compositional reasoning.
String Manipulation: Beyond question answering, LLMs also struggle with compositional reasoning in other domains. For example, a recent paper showed that user-facing LLMs (closed and open source) struggle with string manipulation tasks that require chaining operations. The authors created a dataset with two categories of string manipulation tasks: atomic and composite. The atomic tasks consist of single string operations such as slicing, reversing, or checking whether a string is title case. The composite tasks contain samples where atomic tasks are combined (e.g., reversing and uppercasing a string). From these atomic and composite tasks, the authors build the StringBench dataset - a benchmark for testing the string manipulation capabilities of LLMs.
Similar to the compositional QA work, the authors compared model performance on compositional string manipulation tasks. They evaluated well-known closed and open-source models such as GPT-4o, Gemma-2-9b, Mistral-7B-v0.3, and Llama-3.1-8B (to name a few), using both prompting and fine-tuning. They applied three prompting techniques: standard prompting, CoT, and Program-of-Thought (PoT). With standard prompting and CoT, the language model directly generates the output string. With PoT, the model generates Python code that executes the string manipulation operations described in the input prompt (for example, reversing and uppercasing an input string). Not surprisingly, almost all the models performed better on StringBench with PoT than with standard prompting or CoT. This is because these LLMs have impressive code generation capabilities, and executing the generated Python code is deterministic, unlike directly generating the final string (i.e., without producing intermediate code).
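To see why PoT helps here, this is the kind of program a model might generate for a composite instruction like “reverse the string, then uppercase it” (an illustrative sketch, not an actual model output or a StringBench sample):

```python
# Illustrative PoT-style output for "reverse the input string, then uppercase it"
# (a sketch, not an actual model output or a StringBench sample).
def solve(s: str) -> str:
    reversed_s = s[::-1]       # atomic operation 1: reverse
    return reversed_s.upper()  # atomic operation 2: uppercase

print(solve("hello world"))  # -> "DLROW OLLEH"
```

Once the right program is produced, the interpreter carries out the composite transformation deterministically, whereas generating the final string token by token leaves room for errors at every character.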
That being said, PoT still fell well short of human-level performance (the best-performing LLM reached a maximum of ~71% accuracy, while humans reached around ~98%). This shows that LLMs are not yet able to achieve human-level performance on compositional string manipulation tasks.
What about fine-tuning? Fine-tuning does improve model performance. The authors fine-tuned three LLMs: Gemma-2-9b, Mistral-7B-v0.3, and Llama-3.1-8B, and observed significant improvements over the non-fine-tuned versions. When using PoT on the fine-tuned models, Gemma-2-9b, Mistral-7B-v0.3, and Llama-3.1-8B showed average gains of 38.80, 56.51, and 42.87 accuracy points, respectively, on StringBench.
Although fine-tuning helped the models approach human-level performance, a gap between human and LLM performance still exists, and it warrants more research.
Algorithmic Tasks: Beyond compositional QA and string manipulation, another category of tasks I would like to discuss is algorithmic operations. A recent paper explored the limits of compositionality in LLMs through the lens of sample efficiency in compositional learning - the number of samples required for a model to learn a compositional algorithmic task compared to the number required to learn the corresponding atomic tasks. For their experiments, the authors created two synthetic tasks: Pointer Execution Neighbor (PEN) and Pointer Execution Reverse Multicount (PERM). These are based on pointer value retrieval tasks, which are designed to probe the generalization capabilities of deep learning systems. The tasks are compositional in nature and can therefore be broken down into atomic sub-tasks. Pointer value retrieval tasks also have the benefit of forcing the model to learn the actual algorithmic solution instead of finding a shortcut to the answer. In addition to the pointer execution tasks, the authors also explored compositional reasoning on two other algorithmic tasks (a dynamic programming task and a multiplication task) from previous work on the robust generalization capabilities of transformer models. A full explanation of each task is out of scope for this post, but I encourage the reader to go through the papers for a better understanding of the tasks.
Prompting results: The authors prompted two models, GPT-4 and Gemini-Pro, using several prompting techniques, including CoT, few-shot CoT, and Code Interpreter prompting. On the PEN task, GPT-4 reached a maximum accuracy of 19%, while Gemini-Pro failed entirely (0% accuracy). On the PERM task, GPT-4 reached 42% task accuracy, while Gemini-Pro reached 9%. These results show that sophisticated prompting alone does not yield good results on compositional algorithmic tasks (though it is also possible that the models themselves are simply not well suited to compositional problem-solving, rendering sophisticated prompting ineffective). All in all, in-context learning with a few samples is insufficient for compositional task solving.
Interestingly, the authors also prompted OpenAI’s o1-preview model on the PEN and PERM tasks and found that it performs relatively well (reaching close to 85% and 90% task accuracy, respectively) when using Sub-task CoT prompting, in which the model is also given a description of the sub-tasks in the prompt. It would be interesting to see results for the new OpenAI o3 model on these algorithmic tasks (and on the other compositional tasks discussed above - StringBench and compositional QA).
What about training from scratch? We saw earlier that fine-tuning benefits model performance (at least on string manipulation), but how do models perform at compositional reasoning when trained from scratch? To answer this question, the authors train a 150M-parameter decoder model (with the Llama architecture) from scratch on the synthetic algorithmic tasks described above. Specifically, they trained the model on both the atomic and the compositional tasks simultaneously using standard i.i.d. training, varying only the number of samples for each atomic task and the corresponding compositional task. The model is said to have learned a task (atomic or compositional) if it reaches close to 100% accuracy. The authors found that the number of samples required to learn the compositional task is greater than the sum of the samples required for the individual atomic sub-tasks. This shows that compositional learning in (smaller) models is highly sample-inefficient.
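To illustrate the comparison being made (with made-up numbers of my own, not the paper’s results): if each atomic task needs on the order of tens of thousands of samples, an efficient compositional learner should need at most roughly their sum for the composite task, yet the observed requirement was larger.

```python
# Schematic illustration of the sample-efficiency comparison.
# All numbers are made up for illustration; they are not the paper's results.
samples_to_learn = {
    "atomic_task_A": 10_000,       # hypothetical
    "atomic_task_B": 15_000,       # hypothetical
    "composite_A_then_B": 60_000,  # hypothetical
}

atomic_sum = samples_to_learn["atomic_task_A"] + samples_to_learn["atomic_task_B"]
composite = samples_to_learn["composite_A_then_B"]

# The paper's qualitative finding, expressed in these terms: composite > atomic_sum,
# i.e., the model does not reuse what it learned on the sub-tasks.
print(f"sum of atomic samples: {atomic_sum}, composite samples: {composite}")
```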
A note on model size: The model has just 150M parameters, which is small compared to models like GPT-3 (or GPT-4). One may argue that the representational capacity of a 150M-parameter model is insufficient to effectively learn the compositional task (which I think is a fair point), leading to low accuracy. That being said, the model’s capacity is sufficient to learn the atomic tasks; it just cannot reuse the information from the sub-tasks to learn the compositional task. I personally think we should invest time in researching compositionality in smaller models, as working with larger models is often inaccessible and requires a plethora of training data.
Conclusion
Compositional reasoning represents a critical frontier in advancing the capabilities of large language models. While these models excel in memorization and have demonstrated remarkable performance on tasks requiring reasoning, compositional tasks reveal fundamental limitations in their ability to combine knowledge from atomic tasks to solve higher-order problems. The compositionality gap persists despite increased model sizes, and innovative prompting techniques, such as Chain-of-Thought, Self-Ask, and Program-of-Thought, have shown promise but remain imperfect solutions.
Fine-tuning and training from scratch offer potential pathways to improving compositional reasoning, yet they also expose inefficiencies in current approaches, particularly for smaller models. Moreover, recent results highlight the inadequacy of sophisticated prompting alone for algorithmic tasks, pointing to the need for rethinking how we train and design models for compositional generalization.
Ultimately, bridging the gap between human-like compositional reasoning and the capabilities of LLMs will require continued research into more sample-efficient training paradigms, better architectural designs, and novel evaluation benchmarks. Achieving this will not only make models more robust and human-like but also reduce the need for costly data collection and retraining, thereby making AI more accessible and sustainable. As we push the boundaries of compositional reasoning, we bring AI closer to its goal of general intelligence - one composite task at a time.