Scaling can lead to compositional generalization
Scaling neural networks leads to compositional generalization if the training distribution sufficiently covers the task space.
Abstract
Reviews and Discussion
The paper investigates whether standard neural networks can achieve compositional generalization through scaling data and model size. It presents theoretical results suggesting that MLPs can approximate compositional task families with linear scaling in the number of neurons relative to task modules. Empirically, it demonstrates that scaling data and model size improves compositional generalization across various task encodings, provided the training distribution sufficiently covers the task space. The paper also shows that task constituents can be linearly decoded from hidden activations in models that successfully generalize compositionally, correlating this decodability with the success of image generation models.
Strengths and Weaknesses
Strengths:
- The paper offers a formal definition of compositional tasks.
- The authors conduct experiments across different task encodings and model sizes.
Weaknesses:
- This paper lacks novelty. The problem it explores—whether scale can bring about compositional generalization—is not of significant importance to the industry. Moreover, we cannot gain any insights from it.
- The writing quality of this paper is very poor: the figures are poorly made and the typesetting is subpar, making it difficult to follow.
- The motivation of this paper is not clear enough. The abstract mentions that “there are still frequent failure cases that raise doubts,” but no more specific support is provided in the introduction. Moreover, it is not clear why this would lead to a focus on the issue of scale.
- The experiments in this paper are too small in scale and the experimental setting is not clearly described. They are not extensive enough to be convincing.
- The writing of the related work section is also unclear, making it difficult to understand the positioning of this work within the field.
Questions
- The paper mentions that underrepresentation of certain modules leads to failures. Could the authors provide more detailed analysis on the types of failures observed and potential strategies to mitigate them beyond ensuring sufficient training data coverage?
- How do the findings translate to real-world applications where task specifications are not fully known or are ambiguous?
- I believe that these conclusions should be verified in more popular scenarios, such as in language models, to see whether there are tasks that a small model cannot solve but a scaled-up model can. I suggest referring to the experiments in "Compositional task representations for large language models" to provide a more in-depth analysis.
Limitations
yes
Formatting Concerns
N/A
Thank you for your review. We regret that you did not find our main research question, whether scale leads to compositional generalization, to be of significance. We hope our response helps to clear up potential misunderstandings and better contextualize the work.
This paper lacks novelty. The problem it explores—whether scale can bring about compositional generalization—is not of significant importance to the industry.
The motivation of this paper is not clear enough. The abstract mentions that “there are still frequent failure cases that raise doubts,” but no more specific support is provided in the introduction. Moreover, it is not clear why this would lead to a focus on the issue of scale.
The writing of the related work section is also unclear, making it difficult to understand the positioning of this work within the field.
We have revised the introduction and related work section to more clearly lay out the motivation and significance of our research question. In addition, please allow us to copy our response to Reviewer hkqv to summarize the main arguments:
It is a decades-old question whether connectionist models are fundamentally able to capture compositional structure [1-4]. Despite the success of large-scale models, we observe many failure cases that indicate deficits in capturing compositional structure, making this a valid concern still today [5-9] (also see Section 4.2 for systematic failures of image generation models). A number of recent papers argue that for compositional generalization to succeed, neural networks need to be endowed with architectural priors for modularity [e.g., 10-12]. Our results challenge this position, since even monolithic neural networks (MLPs) can achieve compositional generalization at sufficient scale. We believe this finding is actionable since it assigns greater importance to the structure of the training data, relative to the model architecture, than was previously thought: it matters how diverse the training data is (Section 3.1) and how it is distributed (Section 3.4). At the same time, our results show that compositional generalization can ultimately be achieved via scaling and does not constitute a fundamental shortcoming of connectionist models.
We refined these points in the introduction, related work section and discussion of the paper.
[1] Connectionism and cognitive architecture: A critical analysis. Fodor & Pylyshyn, Cognition 1988
[2] Connectionism, Constituency and the Language of Thought. Paul Smolensky, Meaning in Mind 1991
[3] Systematicity in connectionist language learning. Hadley, Mind & Language 1994
[4] Connectionism and the problem of systematicity. Phillips 1995
[5] Faith and Fate: Limits of Transformers on Compositionality. Dziri et al., NeurIPS 2023
[6] The Reversal Curse: LLMs trained on “A is B” fail to learn “B is A”. Berglund et al., ICLR 2024
[7] Not All LLM Reasoners Are Created Equal. Hosseini et al., NeurIPS 2024 Workshop MATH-AI
[8] Do Large Language Models Have Compositional Ability? An Investigation into Limitations and Scalability. Xu et al., ICLR 2024 Workshop ME-FoMo
[9] GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models. Mirzadeh et al., ICLR 2025
[10] Discovering modular solutions that generalize compositionally. Schug et al., ICLR 2024
[11] Compositional Generative Modeling: A Single Model is Not All You Need. Du & Kaelbling, ICML 2025
[12] Breaking Neural Network Scaling Laws with Modularity. Boopathy et al., ICLR 2025
The paper mentions that underrepresentation of certain modules leads to failures. Could the authors provide more detailed analysis on the types of failures observed and potential strategies to mitigate them beyond ensuring sufficient training data coverage?
We assume that you are referring to the intervention on the training task support that we term “unpopular modules” in Section 3.3 and describe in more detail in Appendix C.2; please let us know if you had something else in mind:
Please allow us to clarify that this is an experimental intervention designed to elucidate why module imbalance can lead to failures in compositional generalization, as has also been observed in prior work [1,2]. Together with the “popular modules” training task support ablation, we find that such failures are due to underrepresentation of certain modules among the training tasks. Modules being disproportionately overrepresented is not an issue in this case. We can illuminate the type of failure more closely by noting that tasks with unpopular modules still achieve a low training loss/high training R² score, indicating that the network has memorized the specific task compositions present in the training data without having learned the underlying modules which would be required to generalize to held-out compositions that contain these modules.
[1] Attribute-Centric Compositional Text-to-Image Generation. Cong et al., IJCV 2025
[2] Compositional Abilities Emerge Multiplicatively: Exploring Diffusion Models on a Synthetic Task. Okawa et al., NeurIPS 2023
How do the findings translate to real-world applications where task specifications are not fully known or are ambiguous?
We discuss this point openly in the limitations section. In the case of ambiguous or unknown task specifications, it is no longer possible to cleanly measure compositional generalization without confounding it with failures in task inference. Since compositional generalization is the subject of our submission, we do not consider this case.
I believe that these conclusions should be verified in more popular scenarios, such as in language models, to see whether there are tasks that a small model cannot solve but a scaled-up model can. I suggest referring to the experiments in "Compositional task representations for large language models" to provide a more in-depth analysis.
The suggested paper proposes a method for learning and combining soft-prompts for pretrained language models to perform new tasks. It uses the term compositional generalization in the sense that soft-prompts are composed from a fixed number of codebook vectors. The model is then evaluated on whether it can perform a particular task, such as Natural Language Inference, using a combination of soft-prompts optimized for other tasks such as Summarization, Sentiment Analysis or Question Answering [1]. Unfortunately, we see no direct connection to our current submission, since it is not clear to us how these tasks allow for assessing compositional generalization.
[1] Multitask Prompted Training Enables Zero-Shot Task Generalization. Sanh et al., ICLR 2022.
The paper focuses on compositional generalization -- whether a model can generalize correctly to new combinations of previously learned components.
The paper makes a very bold statement: just giving a model larger scale in terms of parameters and data is sufficient to learn compositional tasks -- without any explicit priors (say architectural priors) to learn compositionality.
The main contributions are:
- Empirical results that compositional generalization emerges naturally as standard MLPs grow larger, provided the training data covers the compositional space of the task family.
- A theoretical result showing that MLPs can approximate compositional task families efficiently, with neuron count growing linearly rather than exponentially with the number of modules.
- Introduction of "linear decodability," a metric that quantifies the extent to which compositional structure can be recovered linearly from internal activations.
Strengths and Weaknesses
Quality
- The experiments are carefully designed, showing that scaling standard MLPs indeed leads to compositional generalization under some conditions.
- A theoretical contribution showing that the number of neurons required scales only linearly with number of modules.
- Rather artificial/synthetic compositional tasks. Not convincing that the main result of the paper would have broader applicability.
- The theoretical result is derived under a single hidden layer assumption. Very unclear if this would capture well what's happening with deeper or more complex architectures such as transformers.
Clarity
- The paper is well organized, and for the most part easy to follow.
- Very good figures and other visualizations.
- I would prefer to see clearer/more formal definitions early on of key concepts such as "compositional support".
- It seems that the paper assumes that all readers are familiar with Kolmogorov complexity.
Significance
- The paper challenges a very prevalent assumption about the necessity of specialized architectures or explicit inductive biases for compositional generalization.
- Is it true, however? I am not convinced, because both the theoretical result and the empirical results are for very simplified settings. So it may be that the main "message" of this paper is challenged by others in the next 1-2 years. Of course this is not necessarily bad -- this is how research progresses.
- The paper could contrast its insights with recent theoretical frameworks (such as the Coverage Principle or the Random Hierarchy Model -- see references below) that define conditions for compositional generalization.
Originality
- If the main result is true, the paper is definitely making a very original contribution: "scale is all you need" for compositional generalization
- There are a couple of highly relevant papers that are neither cited nor discussed: https://arxiv.org/pdf/2307.02129 (the random hierarchy model) and https://arxiv.org/pdf/2505.20278 (the coverage principle).
Questions
- I am not a mathematician but it seems to me that the theoretical contribution is rather informal. It appears less precise compared to say the analysis in Schug et al. where explicit theoretical conditions (connected and compositional support) and identifiability conditions were formally proven. Do you agree? If so, can you clarify/refine the theoretical contribution?
- As I wrote earlier, a major concern is that the tasks are rather synthetic and the results only applicable to MLPs. Can you at least provide some insights -- why do you think that the same major result would be true say for transformers and for much more complex real-world tasks?
- I like what you propose about "linear decodability". However, it is unclear what linear decodability exactly captures about the structure learned by the model—particularly how it relates or contrasts with previously introduced measures such as "coverage" or "synonymic invariance" (see the earlier links I gave you).
- You suggest that scale alone is enough for compositional generalization. However the paper that you cite (Schug 2024) shows that too much overparameterization can affect negatively compositional generalization. I cannot quite reconcile these two results/observations. Can you help?
Limitations
yes
Formatting Concerns
NA
Thank you for your very positive review of our work. We are happy that you appreciated the design of our experiments and the significance of our findings. Your feedback has prompted us to extend Theorem 3.1 to the multilayer setting, run additional transformer experiments and clarify several parts in our manuscript. We are looking forward to hearing your thoughts on our detailed response below.
The theoretical result is derived under a single hidden layer assumption. [...] what's happening with deeper or more complex architectures such as transformers.
Theorem 3.1 is indeed stated for a hyperteacher with a single hidden layer but, just to avoid any confusion, we do not make a single-hidden-layer assumption for the MLP.
Prompted by your remark, we were able to extend Theorem 3.1 to the L-layer hyperteacher setting and have added it to our updated version of the manuscript. The main finding remains unchanged in the multilayer setting: the number of neurons required by the MLP to approximate the hyperteacher to arbitrary precision scales linearly in the number of modules M. To see this intuitively, consider a construction where, at each layer, we dedicate M neurons of the MLP to “pipe through” the task constituents and repeat, L times, the machinery we use in the submission to implement a single-hidden-layer hyperteacher. What becomes more complicated in the L>1 setting is keeping track of the constants needed to bound the approximation error as a function of L, but this has no impact on the scaling in M.
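Schematically (the symbols below are shorthand for this sketch only, not the notation of the formal construction), each student layer can be thought of as

$$
h^{(\ell)} \;=\; \Big[\,\underbrace{z}_{M \text{ pass-through units}}\,,\;\; \underbrace{b^{(\ell)}\big(z,\, h^{(\ell-1)}\big)}_{O(M) \text{ units of single-layer machinery}}\,\Big], \qquad \ell = 1, \dots, L,
$$

so the width of each layer stays O(M) and the total neuron count is O(L·M), i.e. still linear in M for any fixed depth L.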
In addition, we see the use of MLPs in Theorem 3.1 as a strength: it allows extending the result to more complex architectures that are built from MLPs. The transformer is a straightforward example since it explicitly uses MLPs in its feedforward blocks, and thus transformers, too, can approximate the hyperteacher to arbitrary precision using a number of neurons linear in M.
I would prefer to see more clear/formal definitions early on about key concepts such as "compositional support"
Thank you for this suggestion. We agree that the paper becomes more self-contained and easier to understand by restating these definitions from [1] and [2] early on. Accordingly, we have made changes to the manuscript and now introduce the concept of compositional and connected support more comprehensively.
- I am not a mathematician but it seems to me that the theoretical contribution is rather informal. It appears less precise compared to say the analysis in Schug et al. [...]
Please allow us to clarify this important point. We present two formal theorems (Theorem 3.1 and Theorem 4.1) which we prove in Appendix A. Please let us know if you have any concerns about specific technical flaws, but these are fully formalized proofs rather than informal arguments. We briefly summarize Theorem 3.1 in the following and contrast it with the analysis in [2] that you mention.
For Theorem 3.1, we provide a constructive proof in Appendix A.2 showing that there exists a standard MLP that can approximate a general class of compositional task families to arbitrary precision using only a number of neurons linear in the number of task modules. This is noteworthy because there are potentially exponentially many different tasks in such a task family, and it is a priori not clear whether an MLP solving all of them would therefore require an exponential number of neurons.
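To give a concrete sense of this gap (assuming, as in our hyperteacher experiments, that each task is specified by selecting K of the M modules; the exact task space depends on the configuration):

$$
\#\{\text{tasks}\} = \binom{M}{K}, \qquad \text{e.g. } \binom{16}{3} = 560, \qquad \text{and up to } 2^{M} \text{ tasks if arbitrary subsets of modules are allowed},
$$

whereas the MLP constructed in the proof requires only O(M) hidden neurons.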
We can compare this result to Theorem 1 in [2], which considers a more constrained setting, namely that the student, i.e. the model learning the compositional task family, has the same architecture as the teacher. In that case the student has the compositional structure of the data built in, and the question of compositional generalization becomes one of identifiability. Given this stronger and arguably less general assumption, [2] shows a strong result that guarantees compositional generalization using only a number of tasks linear in the number of modules when the training loss is zero.
Ultimately, this demonstrates a typical tradeoff in theoretical work: stronger assumptions can allow for stronger implications at the expense of generality. We would like to emphasize that this does not make either theoretical contribution more or less formal. In our submission, we pursued the more general question, i.e. whether standard neural networks can compositionally generalize, and we complement the theoretical existence proof with empirical evidence that MLPs can compositionally generalize in practice.
- As I wrote earlier, a major concern is that the tasks are rather synthetic and the results only applicable to MLPs. Can you at least provide some insights [...]
You raise two good questions:
a) Do the results extend beyond MLPs, e.g. to transformers?
As we argued in our response above, our theoretical result obtained with MLPs can be extended to transformers. We agree, however, that this point deserves more attention, so in response to your remark (as well as that of reviewer bGxw), we conducted a new set of experiments and reproduced the data scaling plot at the top of Figure 1 with the MLP replaced by a transformer. We qualitatively reproduce our main result, i.e. as data is scaled, transformers achieve compositional generalization. Transformers are (perhaps unsurprisingly) even more sample efficient than MLPs at doing so, and to further emphasize this point we created a plot comparing the minimum number of training tasks required to achieve compositional generalization (OOD R² > 0.95) as the total number of possible tasks is increased (by increasing the number of modules M). On this plot, both the MLP and the transformer clearly require a sub-exponential (in fact, sub-quadratic) number of training tasks relative to the total number of tasks, with the transformer consistently requiring fewer tasks than the MLP. We regret that this time around we cannot show plots in the rebuttal, but we have added these plots to the appendix of our updated version of the manuscript and refer to them in Section 3.1. Please let us know if there are specific numbers for these results you would like to see in tabular form here.
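To make "sub-quadratic" concrete, here is a minimal sketch of how such a scaling exponent could be estimated from the measured minima (reading sub-quadratic as an exponent below 2 when fitting against the number of modules M; the numbers below are placeholders for illustration, not our measured values):

```python
# Post-processing sketch: estimate alpha in  min_train_tasks ~ M**alpha
# from measured minima; alpha < 2 corresponds to sub-quadratic growth in M.
import numpy as np

def scaling_exponent(num_modules, min_train_tasks):
    """Least-squares slope of log(min_train_tasks) against log(num_modules)."""
    slope, _ = np.polyfit(np.log(num_modules), np.log(min_train_tasks), deg=1)
    return slope

# Placeholder values purely for illustration -- not measured results.
print(scaling_exponent(np.array([8.0, 16.0, 32.0]), np.array([40.0, 120.0, 400.0])))
```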
b) How can synthetic tasks help us learn something about compositional generalization in real-world tasks?
In our submission, we consider a number of different settings to reduce the likelihood that our results are specific to any one of them: the hyperteacher and the compositional preference task family in Section 3, and the compositional image generation task in Section 4.2. In response to Reviewer bGxw, we have in addition reproduced the findings of Section 3.1 for a deeper hyperteacher.
That being said, an inherent limitation of investigating compositional generalization is that one needs precise control over the data distribution (a) to be able to specify what is considered a composition on which to measure generalization and (b) to systematically hold out data from training to measure it. This is what makes it tricky to investigate compositional generalization beyond synthetic tasks. We have extended our discussion of this crucial point in the limitations section.
- I like what you propose about "linear decodability". [...]
Thank you for pointing us to these references. “Synonymic invariance” as used in [4] refers to the property that certain parts of the input in the data represent the same value, e.g. two inputs x1, x2 to a non-injective function that are mapped to the same element f(x1)=f(x2). Similarly, coverage in [3] refers to the set of inputs observed during training and all inputs that are functionally equivalent to them. Both are closely related to what is called “substitutivity” in linguistics where replacing a word with a synonym should not change the meaning of the whole [5]. In the settings we consider here, there are by construction no “synonyms”, i.e. each module that we consider for composition has a different meaning.
We are instead focused on assessing whether seeing a subset of compositions during training can lead to generalization to all possible compositions even though these held-out compositions are specifically not “synonymic” but result in an unobserved task. Linear decodability then captures whether a model, despite only being exposed to compositions of modules, internally represents the ground-truth modules that make up each task.
- You suggest [...] However the paper that you cite (Schug 2024) shows that too much overparameterization can affect negatively compositional generalization. [...]
This is a very important observation and indeed part of what makes our results interesting. The theoretical teacher-student analysis conducted in [2] suggests that overparameterization might hurt compositional generalization, since there are counterexamples to their Theorem 1 in this setting. As a result, compositional generalization cannot be guaranteed in their setting if the student is overparameterized. However, this does not mean that the student cannot compositionally generalize; such a solution still exists. The question then becomes which solution the optimization algorithm finds. Indeed, the empirical results of [2] already show that in practice overparameterization does not always hurt (e.g. see Figure 2D in [2]). In our setting, we train a standard MLP (or transformer, see above) with SGD, instead of using gradient-based meta-learning to train a linear hypernetwork as in [2]. It turns out that in this arguably more common setting, overparameterization is beneficial to compositional generalization.
[1] Compositional Generalization from First Principles. Wiedemer et al., NeurIPS 2023.
[2] Discovering modular solutions that generalize compositionally. Schug et al., ICLR 2024
[3] The Coverage Principle. Chang et al., arXiv 2025
[4] The Random Hierarchy Model. Cagnetta et al., Phys. Rev. X 2024
[5] Compositionality decomposed: how do neural networks generalise? Hupkes et al., JAIR 2020
Let me first thank the authors for the thorough rebuttal. It has largely addressed my concerns and I still think that the paper should be accepted (score: 5).
I also read carefully the other reviews and I admit that I will respectfully disagree with the more negative reviewer (239V) that the paper is poorly written, lacks novelty, or that is not clearly motivated.
This paper investigates the impact of data and model scale on compositional generalization by considering a teacher-student setup. Specifically, they consider compositional tasks and a hypernetwork teacher (that receives as input the specific compositional task identity). The student receives some function of the compositional task identity as well as the given input. The authors find that with increasing model size and data scale, the network improves in its generalization to novel combinations of tasks. They tie this to an argument that MLPs can efficiently represent such compositional functions and therefore prefer learning such a solution over memorizing solutions for each task combination. They then suggest that MLPs that can learn such compositional functions will have representations that enable decoding the task identity. Notably, they replicate this phenomenon in image generation models.
Strengths and Weaknesses
Strengths
This paper is a neat demonstration of how large data and model size can enable compositional generalization. This is particularly intriguing as the authors consider a scenario very similar to the one investigated by Schug et al., who showed that modularity can be sufficient for generalization --- thus demonstrating that modularity is not necessary given sufficient data. I also appreciated that the authors provided theoretical arguments for why this might be the case. While this study focuses on relatively simple networks, I think such a simplified scenario can be a strength in itself as it can provide a useful case study. Finally, I thought the paper was well written and easy to follow.
Weaknesses
Theorem 3.1 does not actually show that neural networks in practice learn a compositionally generalizing solution; rather it demonstrates that this is an efficiently learnable solution. This is an important gap between the theory and the practical scenario. That said, theoretically understanding deep neural networks trained with backpropagation is often quite complicated and so in my mind, this is an acceptable gap. However, it would be nice to provide some justification for why Theorem 3.1 might be useful for understanding networks trained in practice, e.g. by drawing on the literature showing that even overparameterized neural networks often end up learning a solution with a sparse hidden layer (indicating that they might be more likely to learn a solution that can be parameterized with few hidden neurons) [e.g 1-3].
Notably, all of the results related to scale are based on simple MLPs and relatively simple teacher networks. While I think this is an interesting scenario to investigate (as noted above), I do think the authors should more strongly qualify their findings based on this limitation, in particular by making the title more specific to the scenario they are investigating.
Questions
- It is quite surprising to me that the linear decoder of the task identity doesn't work. Given that the hidden layer of the neural network is fairly large, I would imagine that even a random nonlinear embedding would allow for memorizing the different tasks with a linear decoder. Can you give me an intuition for why this might happen? Is it related to the fairly large regularization parameter associated with the decoder? Does the neural network actually end up collapsing the representation?
- I think it would be important to scale back the generality of the claims (as noted in my comments above). An example title would be "Scale leads to compositional generalization: a teacher-student perspective" --- more broadly, just giving a sense of the specific setup in the title would be good. There are many ways to study compositional generalization and the impact of scale and it would be helpful to be specific about what your contribution here is.
Limitations
Yes (though see comments about scope of the work above).
Justification for Final Rating
I continue to think that this is a valuable paper that sheds light on how sufficient training data can give rise to compositional generalization and I would like to see it accepted to Neurips.
Formatting Concerns
No concerns.
Thank you for your positive evaluation of our work and for having taken the time to provide informative and actionable feedback, which has led us to revise several parts of the manuscript and triggered two new sets of experiments. We address the points you raise individually below and are looking forward to engaging in further discussion with you.
Theorem 3.1 does not actually show that neural networks in practice learn a compositionally generalizing solution; rather it demonstrates that this is an efficiently learnable solution. This is an important gap between the theory and the practical scenario. That said, theoretically understanding deep neural networks trained with backpropagation is often quite complicated and so in my mind, this is an acceptable gap. However, it would be nice to provide some justification for why Theorem 3.1 might be useful for understanding networks trained in practice, e.g. by drawing on the literature showing that even overparameterized neural networks often end up learning a solution with a sparse hidden layer (indicating that they might be more likely to learn a solution that can be parameterized with few hidden neurons) [e.g 1-3].
Theorem 3.1 demonstrates the existence of a compositionally generalizing solution in standard neural networks. This is noteworthy since such compositional task families consist of exponentially many tasks and it is a priori not clear if neural networks would require an exponential number of neurons to learn all tasks. But as you correctly point out, proving that stochastic gradient descent will actually discover this solution is difficult in a general setting and hence this result is complemented with our empirical findings that show circumstances under which neural networks can learn a compositionally generalizing solution in practice.
Following your suggestion, we added a paragraph to the limitations section to more openly discuss this point:
While Theorem 3.1 reveals the existence of a solution that enables compositional generalization without requiring an exponential number of neurons, identifying the theoretical conditions under which this solution is guaranteed to be discovered by stochastic gradient descent remains an open question. Prior results indicate that deep neural networks trained with stochastic gradient descent often display a preference towards simple solutions [1-4]. In the context of compositional generalization, the complexity of the memorizing solution is by definition exponential in the number of modules, which might help explain why empirically we observe the discovery of the generalizing solution, which in contrast scales linearly in the number of modules.
[1] How do infinite width bounded norm networks look in function space? Savarese et al., PMLR 2019
[2] Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss. Chizat & Bach, PMLR 2020
[3] Gradient flow dynamics of shallow ReLU networks for square loss and orthogonal inputs. Boursier et al. NeurIPS 2022
[4] Deep Learning is Not So Mysterious or Different. Wilson, arXiv 2025
Notably, all of the results related to scale are based on simple MLPs and relatively simple teacher networks. While I think this is an interesting scenario to investigate (as noted above), I do think the authors should more strongly qualify their findings based on this limitation [...]
Please allow us to clarify that, as part of the original submission, we reproduce all results of Section 3 on the compositional preference task family, an optimal-policy imitation learning task over grid worlds with compositional reward functions (see L112-113 and Appendix B.1). We focus on the hyperteacher in the main text due to space constraints and to keep the exposition clear, but we understand that this aspect is easy to miss and we will make the reference more prominent in the manuscript.
Prompted by your remark on the complexity of the teacher networks, we conducted a new set of experiments and reproduced the data scaling plot at the top of Figure 1 for a deeper hyperteacher with three hidden layers. This makes the task noticeably more difficult, as evidenced by the need for more data to achieve compositional generalization, similar to the results on the compositional preference task family. Importantly, the qualitative finding that scaling data eventually leads to compositional generalization is reproduced in this setting. We regret that this time around we cannot show figures in the rebuttal: this plot looks qualitatively the same as the top of Figure 1, with all curves slightly shifted to the right due to the increased task difficulty.
In addition, we would like to emphasize that we see the simplicity of using MLPs as a strength of the work, since MLPs are arguably basic building blocks of a variety of other neural network architectures. For instance, transformers explicitly use MLPs in their feedforward blocks. To convince our readers that this is not only a theoretical consideration, and in response to your remark (as well as that of reviewer kRyd), we conducted a new set of experiments and reproduced the data scaling plot at the top of Figure 1 with the MLP replaced by a transformer. We qualitatively reproduce our main result, i.e. as data is scaled, transformers achieve compositional generalization. Transformers are (perhaps unsurprisingly) more sample efficient than MLPs at doing so, and to further emphasize this point we created a plot comparing the minimum number of training tasks required to achieve compositional generalization (OOD R² > 0.95) as the total number of possible tasks is increased (by increasing the number of modules M). In this plot, both the MLP and the transformer clearly require a sub-exponential (in fact, sub-quadratic) number of training tasks relative to the total number of tasks, with the transformer consistently requiring fewer tasks than the MLP. Again, we regret that we cannot show the actual plots in the rebuttal, but we have added these figures to the appendix of our updated version of the manuscript and refer to them in Section 3.1 of the main text. Please let us know if there are specific numbers for these results you would like to see in tabular form here.
It is quite surprising to me that the linear decoder of the task identity doesn't work. Given that the hidden layer of the neural network is fairly large, I would imagine that even a random nonlinear embedding would allow for memorizing the different tasks with a linear decoder. Can you give me an intuition for why this might happen? Is it related to the fairly large regularization parameter associated with the decoder? Does the neural network actually end up collapsing the representation?
Your intuition is correct: the linear decoder does work when decoding the task modules, and thereby the task identity, on the in-distribution tasks (i.e. the R² scores on in-distribution tasks are > 0.99). However, after having trained the linear decoder to predict the constituent modules of the training tasks from the hidden representations of the neural network, we report whether it generalizes to predicting the constituent modules of the held-out, out-of-distribution tasks on which we measure compositional generalization. Thus, the instances you describe where the linear decoder fails are instances where it fails to generalize to held-out tasks, indicating that in these cases the ground-truth modules are not linearly represented in the hidden representation of the neural network.
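To make the protocol concrete, here is a minimal sketch of this kind of probe (the details below are assumptions for illustration, e.g. a ridge-regression readout, hidden activations of shape [samples, d_hidden] and binary module-membership targets of shape [samples, M]; this is not our exact implementation):

```python
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

def probe_decodability(h_train, z_train, h_ood, z_ood, alpha=1.0):
    """Fit a linear readout on in-distribution tasks and report how well it
    transfers to held-out (out-of-distribution) task compositions."""
    probe = Ridge(alpha=alpha).fit(h_train, z_train)
    r2_in = r2_score(z_train, probe.predict(h_train))   # typically ~1, memorization suffices
    r2_ood = r2_score(z_ood, probe.predict(h_ood))      # the quantity reported as linear decodability
    return r2_in, r2_ood
```

The in-distribution score can be near-perfect even for networks that fail to generalize, which is exactly why only the out-of-distribution score is informative about whether the ground-truth modules are linearly represented.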
I think it would be important to scale back the generality of the claims (as noted in my comments above). An example title would be "Scale leads to compositional generalization: a teacher-student perspective" --- more broadly, just giving a sense of the specific setup in the title would be good. There are many ways to study compositional generalization and the impact of scale and it would be helpful to be specific about what your contribution here is.
We understand your concern that the current title might sound very general, but we hope that our response above has also helped to clarify that the suggested subtitle “a teacher-student perspective” might not accurately reflect that we run experiments outside the teacher-student setting, i.e. on the compositional preference task family in Section 3 as well as with image generation models in Section 4.2. We would like to maintain a concise title that is easy to understand while not sounding overly authoritative. In an attempt to accommodate both desiderata, we will change the title to “Scaling can lead to compositional generalization”.
Thank you for your extensive response! I really appreciate the experiments you have added; I think they lend substantial additional support to your claims.
Regarding the linear decoder: thank you for the clarification, I think your experimental setup makes perfect sense there. I think it might make sense to add a brief clarification to the main text that decodability here refers to out-of-distribution generalization (I think right now, that's not made clear?).
Regarding the title: I like your proposed title and agree that it describes the scope of your study more accurately than my proposition.
Overall, I continue to think that this is a valuable paper that sheds light on how sufficient training data can give rise to compositional generalization and I would like to see it accepted to Neurips.
This paper investigates whether MLP neural networks can achieve compositional generalization through scaling alone, without compositionality enforced in the architecture. Using synthetic "hyperteacher" tasks where networks must compose modules to solve problems, the authors show that simply scaling data and model size leads to compositional generalization. The authors claim that the number of neurons needed for the generalizing solution scales linearly with the number of task modules, making it asymptotically more efficient than memorization. They also find that compositional generalization emerges across different task encodings (identity, orthogonal, language, few-shot examples) and illustrate the shapes of the training distribution that are required to support compositional generalization. Finally, the authors find that models which successfully generalize can linearly decode task constituents from their hidden activations. The authors validate this on real image generation models, showing that diffusion models with better compositional abilities have more linearly decodable task representations.
Strengths and Weaknesses
S1: This paper addresses an important problem: compositional generalization. The paper discusses this topic all the way from theory to synthetic experiments to real experiments.
S2: The paper is well formatted and easy to follow, with well curated figures.
S3: The results on compositional support are well illustrated.
S4: The linear decodability result is often assumed, but it's useful to see it tested on both synthetic and real image data.
W1: This is my main concern. It is hard to determine whether, as the authors claim, “scale” alone leads to compositional generalization. Recently, several papers (albeit on different topics, e.g. https://arxiv.org/abs/2412.01003, https://arxiv.org/abs/2410.16531) have pointed out that it can be misleading to attribute an effect to scale when networks are trained for the same number of steps, since wider networks are known in practice and in theory to learn faster. Would the authors clarify whether all experiments are run in equal-FLOPs settings? (This does not require more data for smaller models, but might need more epochs.)
W2: Another concern is that, together with W1, it seems like the title is too general (kind of the “attention is all you need” vibe), as it's more the case that compositional support, data diversity, enough optimization, and scale are needed for compositional generalization.
W3: It would be much better if the experimental data is more explicitly described with examples.
W4: The results are definitely useful and are well illustrated, but it is unclear to what degree these results are unexpected, surprising or actionable.
Questions
Q1: Are the runs for different model sizes FLOPs matched? (equal compute).
Q2: Do you have any FLOPs-to-loss plots which demonstrate how saturated the training is?
Q3: Figure 3 is very interesting! Are compositional support and connected support both R² ~ 1?
Limitations
Yes
Justification for Final Rating
This is a thorough paper, ranging from theoretical investigation to empirical validation.
The authors have also responded to my concerns adequately, I confirm my final score of 5.
Formatting Concerns
None.
Thank you for your positive and thoughtful review. We appreciate your detailed feedback that has helped us clarify important points in the manuscript and led us to conduct additional experiments, both of which we believe will help address your main concern. We are looking forward to engaging further with you throughout the discussion period.
W1: This is my main concern. It is hard to determine whether, as the authors claim, “scale” alone leads to compositional generalization. Recently, several papers (albeit on different topics, e.g. https://arxiv.org/abs/2412.01003, https://arxiv.org/abs/2410.16531) have pointed out that it can be misleading to attribute an effect to scale when networks are trained for the same number of steps, since wider networks are known in practice and in theory to learn faster. Would the authors clarify whether all experiments are run in equal-FLOPs settings? (This does not require more data for smaller models, but might need more epochs.)
Q1: Are the runs for different model sizes FLOPs matched? (equal compute).
Thank you for raising this important point. Indeed, when comparing the performance of models of varying size, it is important not to disadvantage smaller models by allocating them less compute (FLOPS). In our submission, we want to assess whether compositional generalization is at all possible at a given scale. This means we are giving smaller models as much compute as our budget allows, training them far beyond the point where the training loss has converged. As a result they are allocated more FLOPS than what is required for larger models to converge to the reported numbers. This point was not sufficiently explained in our initial submission and we have now updated the main text in section 3.1 to add this information.
To further illustrate this point, we performed two additional experiments:
- We ran training on a subset of the runs with smaller models for 100x longer after the training R² score reached 0.999, in an attempt to capture potential “grokking”-like behavior, but found no such phenomenon to occur, i.e. the compositional generalization R² remained comparable to that obtained with 100x less training.
- We trained models of sizes ranging from 90K to 84M parameters on a fixed FLOPS budget of 5.16e16 on the hyperteacher with M=16, K=3 for five different fractions of tasks held out from training (0.25, 0.75, 0.9, 0.95, 0.98). This experiment shows that the compositional generalization R² score increases with model size but eventually saturates at a level that depends on the fraction of tasks held out from training. We will add a corresponding plot to the main text but, for the sake of the rebuttal, which does not allow showing figures this year, here are the results in tabular format:
Table 1: ISOFLOP analysis comparing models with different numbers of parameters (each entry shows the compositional generalization R² score on held-out tasks at the end of training with the same FLOPS budget of 5.16e16):
| Fraction held-out | 90K | 342K | 1.3M | 5.3M | 21M | 84M |
|---|---|---|---|---|---|---|
| 0.98 | -0.01 | 0.24 | 0.51 | 0.53 | 0.53 | 0.52 |
| 0.95 | 0.25 | 0.66 | 0.76 | 0.82 | 0.82 | 0.83 |
| 0.90 | 0.64 | 0.82 | 0.90 | 0.93 | 0.94 | 0.95 |
| 0.75 | 0.89 | 0.95 | 0.97 | 0.99 | 0.99 | 0.99 |
| 0.25 | 0.98 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
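For reference, here is a rough sketch of how the number of training steps per model can be chosen under a fixed FLOPs budget (the 6·N·D approximation and the batch size below are illustrative assumptions rather than our exact FLOPs accounting):

```python
# Allocate training steps so that every model size consumes the same FLOPs budget,
# using the common approximation FLOPs ~= 6 * parameters * examples processed.
def steps_for_budget(flops_budget, n_params, batch_size):
    flops_per_step = 6 * n_params * batch_size  # ~2N forward + ~4N backward per example
    return int(flops_budget // flops_per_step)

budget = 5.16e16
for n_params in [90e3, 342e3, 1.3e6, 5.3e6, 21e6, 84e6]:
    steps = steps_for_budget(budget, n_params, batch_size=128)  # batch size is a placeholder
    print(f"{int(n_params):>10,} params -> {steps:,} steps")
```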
Q2: Do you have any FLOPs-to-loss plots which demonstrate how saturated the training is?
Thank you for this suggestion that will help to make the previous point more obvious in the manuscript. We added FLOPS vs. loss plots corresponding to the results of Figure 1 to the appendix. Again, since we cannot show figures in the response, here is a sample table that corresponds to the row above where 90% of the tasks are held-out from training.
Table 2: Training loss as more FLOPS are consumed comparing models with different numbers of parameters.
| Training steps | 90K | 342K | 1.3M | 5.3M | 21M | 84M |
|---|---|---|---|---|---|---|
| 25% | 3.1e-3 | 8.4e-4 | 5e-4 | 3.3e-4 | 3.8e-4 | 1e-3 |
| 50% | 2.9e-3 | 7.5e-4 | 3.6e-4 | 2.8e-4 | 2.3e-4 | 8e-4 |
| 75% | 2.8e-3 | 6.8e-4 | 3e-4 | 2.4e-4 | 2.2e-4 | 6.5e-4 |
| 100% | 2.6e-3 | 6.2e-4 | 2.8e-4 | 2e-4 | 2.0e-4 | 3.6e-4 |
W2: Another concern is that, together with W1, it seems like the title is too general (kind of the “attention is all you need” vibe), as it's more the case that compositional support, data diversity, enough optimization, and scale are needed for compositional generalization.
The “right” training data is indeed needed for scaling to lead to compositional generalization, and we understand your point that the current title might be too absolute to accurately reflect this. At the same time, we would like to maintain a concise title that is easy to understand. In an attempt to accommodate both desiderata, we will change the title to “Scaling can lead to compositional generalization” to do better justice to the fact that not any kind of data and model scaling yields such a result.
W3: It would be much better if the experimental data is more explicitly described with examples.
Thank you for your suggestion, we have extended section 2.4 and section B.2 respectively with a more explicit description of the experimental data we use. In addition, we added a conceptual figure that illustrates Definition 2.1 of the general class of compositional task families we consider.
W4: The results are definitely useful and are well illustrated, but it is unclear to what degree these results are unexpected, surprising or actionable.
We believe that these results are both surprising to the community and actionable. Please allow us to briefly restate our main arguments:
It is a decades-old question whether connectionist models are fundamentally able to capture compositional structure [1-4]. Despite the success of large-scale models, we observe many failure cases that indicate deficits in capturing compositional structure, making this a valid concern still today [5-9] (also see Section 4.2 for systematic failures of image generation models). A number of recent papers argue that for compositional generalization to succeed, neural networks need to be endowed with architectural priors for modularity [e.g., 10-12]. Our results challenge this position, since even monolithic neural networks (MLPs) can achieve compositional generalization at sufficient scale. We believe this finding is actionable since it assigns greater importance to the structure of the training data, relative to the model architecture, than was previously thought: it matters how diverse the training data is (Section 3.1) and how it is distributed (Section 3.4). At the same time, our results show that compositional generalization can ultimately be achieved via scaling and does not constitute a fundamental shortcoming of connectionist models.
We refined these points in the introduction, related work section and discussion of the paper.
[1] Connectionism and cognitive architecture: A critical analysis. Fodor & Pylyshyn, Cognition 1988
[2] Connectionism, Constituency and the Language of Thought. Paul Smolensky, Meaning in Mind 1991
[3] Systematicity in connectionist language learning. Hadley, Mind & Language 1994
[4] Connectionism and the problem of systematicity. Phillips 1995
[5] Faith and Fate: Limits of Transformers on Compositionality. Dziri et al., NeurIPS 2023
[6] The Reversal Curse: LLMs trained on “A is B” fail to learn “B is A”. Berglund et al., ICLR 2024
[7] Not All LLM Reasoners Are Created Equal. Hosseini et al., NeurIPS 2024 Workshop MATH-AI
[8] Do Large Language Models Have Compositional Ability? An Investigation into Limitations and Scalability. Xu et al., ICLR 2024 Workshop ME-FoMo
[9] GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models. Mirzadeh et al., ICLR 2025
[10] Discovering modular solutions that generalize compositionally. Schug et al., ICLR 2024
[11] Compositional Generative Modeling: A Single Model is Not All You Need. Du & Kaelbling, ICML 2025
[12] Breaking Neural Network Scaling Laws with Modularity. Boopathy et al., ICLR 2025
Q3: Figure 3 is very interesting! Are compositional support and connected support both R² ~ 1?
Yes! We ensure for both the “Random” and “Balanced” task distributions that their support is compositional and connected. We revised the caption to better reflect this point.
Thank you for the thorough reply.
It is comforting to see the FLOPS matched (and extended) experiments!
I also acknowledge the changes to the paper presentation.
I updated my score!
Dear reviewers,
For the reviewers who have not yet read the authors' rebuttal and the other reviews, please do so now. Per the NeurIPS guidelines, the reviewers must comment if the authors' response did or did not address their concern, and the reviewers cannot enter the "mandatory acknowledgment" before they have sent a comment on whether the authors' responses did/didn't address their concern.
Please read the reviews and rebuttal, let the authors know, and submit your final score accordingly. The NeurIPS chairs specifically directed the AC to ask you not to wait till the last moment for this in order not to defeat the purpose of the reviewer-author discussion period.
Thank you for your service!
-Area Chair
The submission focuses on the compositionality of tasks in the context of neural networks. More specifically, it asks if and how a neural network, which is continuous in nature, can model a task that has an inherently discrete and compositional structure. The submission addresses this question from an experimental and partially theoretical angle and reports interesting findings, such as that scaling the data and model size can lead to compositional generalization.
All reviewers but one approve of the paper and recommend acceptance. The AC also agrees that the paper has clear merits and should appear at NeurIPS. One review was deemed to be of insufficient quality and was disregarded in the decision-making process. That review was reported for disciplinary action due to poor quality and failure to engage with the process despite multiple requests.