PaperHub

Rating: 6.0/10 · Poster · 4 reviewers
Scores: 8, 3, 8, 5 (min 3, max 8, std 2.1)
Confidence: 3.3 · Correctness: 3.3 · Contribution: 2.8 · Presentation: 3.0
ICLR 2025

Breaking Neural Network Scaling Laws with Modularity

Submitted: 2024-09-14 · Updated: 2025-03-04
TL;DR

We show theoretically that modular neural networks trained on modular tasks can generalize to high-dimensional tasks with a fixed number of training points; we propose a learning rule to exploit this advantage empirically.

Abstract

Keywords

scaling laws, modularity, neural network, generalization, compositionality, combinatorial generalization

Reviews and Discussion

Official Review
Rating: 8

In this paper, the authors study modular neural networks and their ability to learn functions that are themselves modular insofar as they have either a compositional nature or a modular composition in their statistics. The authors first provide a theorem outlining expected scaling laws for sample complexity and training accuracy in a curated linear task setup. They then corroborate the predictions from this theorem with numerical experiments. Following this, they present a task setup with explicit ground-truth modular structure and prove a second theorem outlining similar scaling properties for this novel task setup, along with a particular modular NN architecture. The result is reduced scaling complexity for such tasks when NNs are appropriately modular. The authors then propose an initialization scheme for NNs which first learns module initializations in a self-supervised fashion using task statistics. Finally, the authors show that this approach behaves as expected on non-trivial tasks such as compositional CIFAR, and that the proposed method works well on other modular architectures beyond the one used for their theoretical results.

Strengths

This paper presents an excellent set of theoretical results outlining expected scaling laws for modular networks when tasks have a modular structure. It also presents a practical initialization scheme for potential models that is based on self-supervised alignment with task statistics. Lastly, it validates predictions with non-trivial and relevant experiments with architectures that go beyond the ones used for theory, outlining the potential generality of the result.

Weaknesses

The paper is sound. There is potential for improvement in a few key areas:

  1. In experiments, it is unclear if the generalization advantage of the modular networks remains if one factors in the pre-training (i.e. learning module initialization). In other words, for the same total compute, would a monolithic model do as well as the modular one for which some of the compute budget went toward initialization? My apologies if I missed it, but this is a key point that would factor into real-world scaling laws.

  2. What happens if there is a mismatch between the task modularity and the NN module count? In experiments, the models could accurately recover task modularity when the number of modules is known. In contrast, in other tasks, this is not known a priori. How would the models behave in this mismatched environment? To be fair, the authors acknowledge this issue, but I wonder if some rapid experiments could outline expected behaviors in this case. Once again, apologies if this has been explored and I missed it.

  3. While this is no doubt very relevant for the field, its overall impact with respect to modern architectures remains unclear. Some discussion about the use of such methods in modern settings, such as in the presence of attention, could greatly enhance the scope of the result. This is not necessary, however, as this is a good paper in the current scope.

Questions

See above

Comment

We really appreciate your positive feedback on our submission and your suggested directions for improvement.

Fixed compute budget

This is an important question of great practical relevance. We have not tried any experiments in which the compute budget for the randomly initialized vs. pre-initialized networks is fixed. In fact, in this work, we do not optimize our kernel-based modular learning rule for computational efficiency at all.

However, we note that approximating kernel methods for computational efficiency is a well-established field of research which can reduce the computational cost of kernel methods down to linear time in the size of a dataset (such as using random Fourier features). We expect that combining these methods with our proposal can help make our method significantly more competitive with random initialization under a fixed compute budget.
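
For illustration, a minimal random Fourier features sketch (with a hypothetical RBF kernel, bandwidth `gamma`, and feature count, not the exact configuration we would use) could look like the following; kernel evaluations then reduce to inner products of finite-dimensional features, so downstream regression scales linearly in the dataset size:

```python
import numpy as np

def random_fourier_features(X, n_features=256, gamma=1.0, seed=0):
    """Map inputs X of shape (n, d) to random features approximating an RBF kernel."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# k(x, x') = exp(-gamma * ||x - x'||^2) is approximated by z(x) @ z(x'),
# so kernel regression becomes ordinary linear regression in the feature space.
X = np.random.randn(500, 10)
Z = random_fourier_features(X)
K_approx = Z @ Z.T
```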

Mismatch between architectural module count and task module count

This is an insightful point.

In our experiments, in general, the number of architectural modules is always greater than the number of modules in the task: thus, these do not need to match. For example, in the case of Compositional CIFAR-10, we fix the number of architectural modules to 32 and vary the number of task modules from 1 to 8. Figure 10 shows an additional experiment demonstrating the effect of choosing different numbers of architectural modules on the sine wave regression task. We find that adding more architectural modules helps performance even beyond the number of task modules.

Relevance to modern architectures

We appreciate this point as well. In section 5, we have expanded our discussion of how our results apply to modern architectures.

In terms of practical relevance, we believe our results show that when training modular architectures, more emphasis should be placed on the optimization methods used to train the modules. Naively applying gradient descent may not be sufficient for effective training of these architectures. Our results also suggest a potential theoretical basis for why modularity is so effective in practice: namely, modularity breaks up a large high-dimensional task into easier to learn low-dimensional tasks.

Comment

thank you for the response. I am happy with the arguments and maintain my score.

Official Review
Rating: 3

The authors present both theoretical and algorithmic results regarding the generalization ability of a particular form of modular architectures. They first highlight theoretical results on generalization showing exponentially large sample complexity as a function of the input dimension. Then they present a particular class of modular architectures as a sum of experts and assume that the training data was generated by this same architecture. They show that thanks to each module making a low-dimensional projection before processing the data, the scaling behavior is better behaved. Then they present a kernel-style algorithm to initialize such an architecture, to be then fine-tuned by usual supervised learning and SGD. Finally, they show results on a toy 1-dim sine-wave regression task and on the recently introduced compositional MNIST task.

Strengths

Much remains to be understood about the generalization behavior of neural nets, especially the types that have a modular architecture, so advances on the theory (in special cases) that are presented do seem useful.

Weaknesses

(1) the theoretical results are not surprising: projecting the m-dim input to several b-dim low-dimensional representations unsurprisingly reduces the exponential badness from m to b. The theory is also of fairly limited scope, with lots of unreasonable assumptions (e.g., of linearity wrt parameters) that may not tell us as much as we would like for more general forms of modular architectures.

(2) the proposed algorithm is unlikely to scale well in terms of computational efficiency beyond small-size problems and into frontier AI, given the use of kernel methods in the novel part of the method

(3) the fact that all modules are initialized independently and using the same (randomized) procedure suggests that a significant part of the advantage could come from an ensemble effect (which always helps generalization)

(4) the paper seems to overclaim in multiple places, e.g., suggesting that their results track the empirical behavior of modern neural nets (even the empirical comparisons don't match the theory, e.g., fig 2 bottom right).

(5) I did not find numerical comparisons against benchmark results from other papers, and when I look at Jarvis et al 2023, their figures show much lower errors. Hence the experimental results may not be that good after all.

Questions

(1) I was confused by the results in figure 1, whereby test error INCREASES with larger datasets. This seems incompatible with empirical observations and traditional statistical analyses of generalization.

(2) I did not understand in what sense the y_j could be considered independent (and what is the random variable), after eq 3.

(3) eqn 4 seems wrong: on the LHS y is a function of the linear projection U x, whereas on the RHS U and x only interact via the presumably non-linear function phi.

(4) why should we expect eqn 15 to give the minimum norm solution? (i.e. why is it a solution and why is it minimum norm, from what class)

(5) You should add the citation of the original mixture of expert papers (e.g. Jacobs et al 1991).

Comment

Thank you for your detailed review and for highlighting areas where our work can be strengthened. We address your concerns point by point below.

Novelty of theoretical results

We understand that projecting high-dimensional data into lower-dimensional subspaces is a well-known technique. However, our contribution lies in providing a quantitative analysis of how modular architectures affect sample complexity in neural networks.

Our work is the first to derive explicit, non-asymptotic expressions quantifying how modular architectures can circumvent the exponential sample complexity associated with high-dimensional inputs. This provides a theoretical foundation for designing modular networks that are efficient in practice.

While our analysis assumes linearity and specific modular structures, these simplifications are essential for analytical tractability. Future work can build upon our framework to explore more general, nonlinear architectures and investigate how modularity affects their generalization properties.

Scalability of the proposed algorithm

You raise a valid concern regarding scalability.

We note that approximating kernel methods for computational efficiency is a well-established field of research which can reduce the computational cost of kernel methods down to linear time in the size of a dataset (such as using random Fourier features). We do not explore these methods in this work as it is orthogonal to our contributions; however, we believe combining these well-known methods with our proposal is a fruitful direction.

Potential ensemble effects

This is an interesting point.

In our compositional CIFAR-10 experiments, however, the modules all have the same weights (they are weight tied). Thus, the improved performance of the modular architecture cannot be attributed to an ensemble effect in this case.

Overclaiming and empirical comparisons

We apologize if any claims appeared overstated.

While our theoretical model captures the general trends observed empirically, certain deviations occur due to factors not accounted for in the simplified model, such as optimization dynamics. We discuss these factors in Section 3.3.

We are happy to soften any specific claims in our submission to match our empirical results.

Comparison to benchmark results

We emphasize that Jarvis et al. test on Compositional MNIST, while we test on Compositional CIFAR-10, a significantly harder task. Thus, their performance metrics cannot be compared with ours.

Confusion about test error increasing with larger datasets

We understand the confusion.

In Figure 1, the increase in test error with more data reflects the double descent phenomenon (Belkin et al.). This is a now well-established phenomenon in which test error increases with the number of training points until the interpolation threshold is reached (training points = number of parameters). Beyond this point, the test error decreases with the number of training points; this is also clearly illustrated in Figure 1. We are happy to provide further references on this phenomenon if helpful.

Independence of the $y_j$

In equation 3, we wish to rescale the summation such that the left hand side has constant magnitude as the number of modules varies. As a heuristic, if we treat each of the terms in the summation as independent under the assumption that $x$ is a random variable, we then find that the variance of the summation scales with the number of modules $K$. Thus, it is natural to divide the summation by $\sqrt{K}$ to normalize.
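
Concretely, under this independence heuristic (and assuming for illustration that each term has a common variance $\sigma^2$):

$$
\operatorname{Var}\Big(\sum_{j=1}^{K} y_j\Big) = \sum_{j=1}^{K}\operatorname{Var}(y_j) = K\sigma^2,
\qquad
\operatorname{Var}\Big(\frac{1}{\sqrt{K}}\sum_{j=1}^{K} y_j\Big) = \sigma^2 .
$$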

Equation 4 clarification

That's correct: as mentioned in the text before and after equation 4, we make a linearizing assumption on the parameters of the model. We assume that the model output is a linear function of the model parameters (which now include both the module parameters and the module input projection). This is, in fact, the same assumption made in Section 3.

Equation 15 clarification

Equation 15 simply solves equation 14 for $\theta$. Note that equation 14 is a linear equation in $\theta$. It is well known that the minimum norm solution for a variable in a linear equation (if a solution exists) can be computed by the pseudoinverse (denoted by $\dagger$ in our notation): the solution to $Y = X\theta$ minimizing $\|\theta\|$ is $\theta = X^\dagger Y$. We are happy to provide further references if helpful.
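
As a quick numerical check (a sketch with hypothetical shapes, not code from the paper), the pseudoinverse solution can be compared against `numpy.linalg.lstsq`, which also returns the minimum-norm least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))   # underdetermined: fewer equations than unknowns
Y = rng.normal(size=(20, 1))

theta_pinv = np.linalg.pinv(X) @ Y                  # theta = X^dagger Y
theta_lstsq = np.linalg.lstsq(X, Y, rcond=None)[0]  # minimum-norm solution

assert np.allclose(X @ theta_pinv, Y)        # it solves Y = X theta exactly
assert np.allclose(theta_pinv, theta_lstsq)  # and matches the minimum-norm solution
```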

Mixture of Experts paper reference

Thank you for pointing this out. We have fixed this in our latest revision.

Comment

Q1: I wrote that I was concerned that "test error INCREASES with larger datasets" in figure 1 and you answered that this is the double descent phenomenon. From what I understand, the double descent phenomenon is about test error decreasing with larger capacity p, not dataset size n.

Q2: Your answer to explain why we should treat the \hat{y}_j's as independent does not make sense to me, sorry. All the \hat{y}_j's are deterministic functions of the same x, so they are highly dependent, and thus not independent random variables.

Q3: My question was not about linearity wrt theta but about \hat{y} being a function of U x on the LHS but not on the RHS.

Comment

Thank you for your response.

Q1: In fact, the double descent phenomenon can occur both with respect to larger capacity $p$ and dataset size $n$: test error can increase with $n$ in certain regimes. Nakkiran et al. 2020 and Schaeffer et al. 2023 both provide empirical examples of this, and this is also supported by theory (e.g. D'Ascoli et al., 2020, Rocks et al. 2022). In summary, this is because the large spike in test error when $p \approx n$ can lead to non-monotonic behavior of the test loss with respect to $n$. We are happy to provide further references on this point.
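
As an illustrative simulation of this effect (hypothetical dimensions and noise level, not an experiment from the paper), minimum-norm linear regression with a fixed number of parameters $p$ and noisy labels typically shows test error peaking as $n$ passes through $p$:

```python
import numpy as np

rng = np.random.default_rng(0)
p, noise = 50, 0.5
w_true = rng.normal(size=p) / np.sqrt(p)

def test_error(n, trials=50):
    errs = []
    for _ in range(trials):
        X = rng.normal(size=(n, p))
        y = X @ w_true + noise * rng.normal(size=n)
        w_hat = np.linalg.pinv(X) @ y          # minimum-norm least squares
        X_test = rng.normal(size=(1000, p))
        errs.append(np.mean((X_test @ (w_hat - w_true)) ** 2))
    return np.mean(errs)

for n in [10, 25, 45, 50, 55, 100, 400]:
    print(n, round(test_error(n), 3))          # error typically peaks near n ≈ p = 50
```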

Q2: Certainly, we agree that the $\hat{y}_j$ are not independent. However, given that each $\hat{y}_j$ is a function of a separate subspace $\hat{U}_j^T x$ of the original $x$, we believe it is reasonable to approximate the effect of each module $\hat{y}_j$ as being independent. On the other hand, if all the modules were perfectly correlated, then the appropriate scaling factor would be $\frac{1}{K}$ instead.

We emphasize that this argument is simply made to justify the scaling factor in Equation (3); in fact, any scaling factor can be used here (including no scaling factor) and will not change the validity of our theory.

Q3: We understand your concern regarding equation 4: it may seem strange that on the left hand side, the expression is a function of $Ux$ while on the right hand side, there is no $Ux$.

We again emphasize that we are making a modeling assumption in equation (4): we are assuming the function is linear in $U$. This, in fact, directly implies the right hand side for some choice of features $\varphi$. We do not claim that equation 4 holds in general without this linearity assumption.

Official Review
Rating: 8

The paper shows that the sample complexity associated with training modular neural networks is independent (under certain conditions) of the input dimensionality and does not follow the same exponential increase with input dimension as in the case of monolithic or traditional neural networks (NNs).

First, the authors present a derivation of training and test error in monolithic NNs when the task is non-modular. The task and the NN are modeled linearly based on features generated using the input. The authors then empirically validate that NNs with different architectures (loosely) match the theoretical trends when varying the input dimension, number of samples and the number of parameters. Note that the task considered for this experiment is modular.

The authors then theoretically compute the training and generalization errors of modular NNs given that the underlying task is also modular with the same structure. Each module is associated with a small NN (modeled linearly) and an input projection mechanism that reduces the module input dimensionality. The task is also modeled in the same way where the parameters are randomly initialized.

The resulting closed form solution shows that the training error is independent of the input dimension and the test error under the condition of under-parametrization is also independent of the input dimension. This result hinges on the input projection associated with each module, where the dimensionality associated with the overall input is reduced. The authors then propose a method to initialize (or learn) the parameters associated with the input projections. Once initialized the modular NNs are trained end-to-end.

Empirical results show that modular NNs that are learned using the proposed initialization mechanism achieve significant improvements over monolithic NNs and modular NNs conventionally trained.

Strengths

The theoretical derivations are sound and very well done, and the paper is well written. The authors did a good job showcasing how modular networks can be related to monolithic networks (based on certain linear assumptions).

Intuitively, modeling of the data, and learning of parameters related to module input bottlenecks make sense. This is similar to learning the connectivity associated with module input or limiting the number of module inputs to avoid module collapse.

The experiments show a clear trend that the input projection mechanism results in better performance and sample complexity, as compared to monolithic NNs and end-to-end trained modular NNs.

Weaknesses

The major weakness of the paper is the consideration of a single layer of modules and data generating system. In such a system the output is a linear composition of the module outputs. This may not be true for many real world systems where multiple such modular layers can exist in a hierarchy.

Questions

Continuing with the previous weakness, the algorithm to learn or initialize the input projection parameters may not work in such a case as it is dependent on the initial module NN weights.

The generalization performance for the compositional CIFAR-10 experiment can be divided into input class permutations present in the training data vs. not present in the training data to further dissect the difference between monolithic NNs and modular NNs. The sample complexity experiments with compositional CIFAR-10 tasks are not present and should be added to further strengthen the claims.

For individual tasks considered, there appears to be a large amount of tuning of methodology to train the modular NNs and the module input projections. (Referring to appendix)

How would the solution to training and test errors change if the input projections from each module were removed, and the modules considered the input x in its entirety? This is consistent with current mixture-of-experts (MoE) models.

Is there a validation set used for the experiments or is the generalization performance reported from the last training iteration?

Do the modular NNs treat the number of modules as a hyper-parameter and tune it to improve performance? Or is it an architectural characteristic, such as the width or depth in monolithic NNs, that is fixed?

The CIFAR-10 experiments are run only for a single epoch, will increasing the number of epochs result in better performance for networks?

Minor: Equations are referred to the appendix when they are also present in the main part of the paper.

Ethics Concerns

none

Comment

We really appreciate your thoughtful and positive feedback on our submission. We address the points you raised below:

How to extend to setting with hierarchical modules

We think this is an excellent point: indeed, our paper only considers a relatively simple form of modularity in which we use a single layer of modules and the module outputs are linearly combined to form the model output. As we note in the discussion section, practical modular architectures are often more complex, involving some level of hierarchy. We believe our results are a step in the direction of more theoretical analysis of how modularity can benefit generalization more broadly.

We do note, however, that our experimental results include the case of nonlinear module projections (in which the module inputs are nonlinear projections of the task inputs). This is a step towards generalizing our results to more practical settings.

Concerns on CIFAR-10 experiment

This is a nice suggestion. As we note in Appendix E though, given the very large number of possible class permutations ($10^k$ where $k$ is the number of images), we expect that for large $k$, all class permutations in the test data will be unseen. However, in Table 3 we explicitly test accuracies in the case where test inputs have distinct class combinations and find similar results.

For this task, we opt to use accuracy (fixing training samples) instead of sample complexity (fixing accuracy) as our performance metric since it is more natural to treat the Compositional CIFAR-10 dataset as unlimited (given the combinatorial number of samples that can be drawn).

Comments on tuning

Our experiments involve a number of hyperparameters. Unless otherwise mentioned, we set the hyperparameters to the most natural choice without tuning them. For certain hyperparameters, we do sweep over different values, with the sweep range indicated in Appendix E. We are happy to clarify any specific hyperparameter choices if needed.

Experiment with no input projection

We appreciate this interesting suggestion. We note, however, that in our Compositional CIFAR-10 experiment, all modules have the same weights (they are weight-tied). Thus, if they all had the same inputs, their outputs would also be the same. The final model output is the concatenation of the module outputs, which in this case would perform very poorly.

Comment on validation set

For both the Compositional CIFAR-10 and sine wave regression tasks, we use a separate validation set. Note that for the sine wave regression task, each input is drawn completely independently from an infinite sized dataset; thus, any points not trained on can be treated as validation data.

Number of architectural modules

We treat the number of architectural modules as a (potentially tunable) architectural characteristic. It is fixed (at 32) for the Compositional CIFAR-10 task and tuned over for the sine wave regression task. Figure 10 illustrates performance for different choices of the number of architectural modules for the sine wave regression task.

Number of epochs for Compositional CIFAR-10 task

In this task, the number of possible training inputs can be treated as virtually infinite (for large $k$). Thus, we believe it is more practically reasonable to fix the number of epochs at one rather than repeat training on the same input multiple times.

Comment on equation references

Thank you for pointing this out. We have fixed this in our latest revision.

Comment

I would like to thank the authors for their clear responses.

I still believe (also after reading the rest of the reviews -- and the authors' responses) that this is an important and solid paper that will be valuable to others if published.

Of course, mostly because of the restrictive assumptions that I also mention in the review, I cannot describe the paper as "groundbreaking" -- and for that reason I prefer to keep my score as is.

Official Review
Rating: 5

This paper (1) constructs a simplified theoretical model of generalization (focused on the case of linear regression from what I believe to be a set of not-strictly-defined features), (2) provides empirical demonstrations that, at least in broad strokes, a sine wave regression task follows the predictions made by that model, (3) argues that in cases where the simple problem under consideration is modular (here meaning that it has k modules of size b which interact, rather than a full set of P features), better sample complexity can be obtained by using an explicitly modular parameter structure, and (4) provides some empirical support for this generalization behavior on another set of (this time modular) sine wave regression tasks.

Strengths

  • The paper focuses on an empirical task that lets them validate their ideas, but is realistic in terms of scope
  • The paper engages with an interesting problem of the structure of weights that allow for better generalization

Weaknesses

  • The paper's notation, particularly in the initial presentation of the theoretical model, was confusing and felt under-explained. In particular, I was confused by the feature matrix (how did it get constructed from the inputs? What assumptions are being made about it? Why is it a matrix to begin with rather than a feature vector?), and this confusion made it hard to understand future claims made in the paper (especially since the central claim was about the effect on generalization of input dimension, which is mediated by the function implied in this feature matrix)
  • The forms of the expected training and test loss could have been broken down in a clearer and more intuitive way, rather than simply being presented as not-very-comprehensible formulas
  • This paper assumes that the only way to benefit from the generalization behavior of modularity is to have explicitly modular structure; it would have been interesting if it had also engaged with whether modular data gives you generalization benefits without a parallel parameter structure (since in practice modern models seem to generalize well without the benefit of this)

Questions

  • What is the explicit definition of modularity being used? This concept was referenced without ever being really explicitly defined in a general-but-still-technical sense (and attention was given as an example). I ended up being confused about whether the focus was on independently functioning parts of a network, or shared weights in a more general sense.
  • As mentioned in "Weaknesses": what assumptions are made in general about the structure of the feature matrix? It is indicated in the modular version of the model that arbitrary nonlinear transforms of the input are considered as valid feature matrices, but this isn't clarified for the first treatment of the model
  • Do the benefits cited require being correct about the number (k) of modules in the underlying data? How much do positive results depend on being correct in your choice of k relative to what is present in the underlying data?

Comment

Thank you for your thoughtful review and for highlighting areas where our paper can be improved. We appreciate your insights and address your concerns point by point below.

Confusion with notation and explanation of the feature matrix

We apologize for the confusion caused by the notation in our theoretical model, especially regarding the feature matrix. In our model, the feature matrix arises from applying a feature mapping to the input data, transforming each input $x$ lying in $m$ dimensions into a higher-dimensional feature matrix $\phi(x)$ lying in $d \times P$ dimensions. Here, $d$ denotes the number of output dimensions of the model and $P$ denotes the number of features per output dimension. We do not make any particular assumptions about exactly how $\phi$ is constructed (it can be arbitrarily nonlinear and complex); the main property it must satisfy is that the features $\phi(x)$ are distributed as a Gaussian.

The reason why we use a feature matrix instead of a flattened feature vector is that it allows us to easily account for non-scalar output sizes ($d > 1$) with a fixed number of parameters $P$. Another option to account for multi-dimensional outputs is to have $d \times P$ parameters (one set of parameters per output dimension) while the number of features is fixed at $P$. This is also valid; however, our choice is more consistent with some prior literature (e.g. Jacot et al.).
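
As a shape-level sketch of this setup (with hypothetical dimensions and a stand-in feature map, purely to make the bookkeeping concrete):

```python
import numpy as np

m, d, P = 8, 3, 100                  # input dim, output dim, features per output dim
rng = np.random.default_rng(0)
proj = rng.normal(size=(d * P, m))   # fixed random projection standing in for phi's internals

def phi(x):
    """Hypothetical feature map: input x in R^m -> feature matrix in R^{d x P}."""
    return np.tanh(proj @ x).reshape(d, P)

W = rng.normal(size=P)               # parameter vector shared across the d output dimensions
x = rng.normal(size=m)
y = phi(x) @ W                       # model output lives in R^d
print(y.shape)                       # (3,)
```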

We are happy to revise any particular points of confusion in our revision.

Clarity of expected training and test loss formulas

We clarify the forms in Theorem 1 as follows: the test loss consists of three terms. The first term is largest near the interpolation threshold ($dn \approx p$) and decreases away from it; this causes a spike near the interpolation threshold. Intuitively, this corresponds to overfitting to noise in the training data. The second term is negative and corresponds to the reduction in loss caused by capturing information about the true model. In the overparameterized regime, it linearly grows in magnitude with the number of training points, and in the underparameterized regime it is constant with the amount of training data (as information in more training points can't be captured by the model). The last term is a constant offset corresponding to the loss when the number of parameters approaches infinity.

The training loss is a product of two terms: the first corresponds to the information about the underlying target function not captured by the model and the second corresponds to the amount of training data available that can't be captured by the model. If we have more parameters than training data, then the training loss is zero since all the information can be captured. Otherwise, there will be information loss proportional to the excess training data times the remaining information about the target function.

We are happy to revise any particular points of confusion in our revision.

Does modular data alone help without a modular architecture?

The reviewer raises an interesting question about whether it is possible to benefit from merely modular data without a modular architecture. Our experiments on Compositional CIFAR-10 (see Figure 4) suggest the answer is no: a non-modular architecture performs relatively worse on higher-dimensional modular tasks than a modular architecture. Of course, it is possible that the non-modular architecture has learned the modular structure of the task to some extent (just not as well as a modular architecture) and we think this is an interesting future direction to explore.

Definition of modularity used

We appreciate the need for a clear definition of modularity.

In this paper, we define modularity as the decomposition of a task or function into distinct, independently operating components or modules. Each module processes a subset of the input dimensions and contributes to the final output in a compositional manner. This structure allows for specialized processing within modules and recombination to solve complex tasks. Our modular neural network architecture mirrors this by having separate subnetworks (modules) that handle specific input projections.

We also highlight that in other literature, modularity may be used in other, potentially different ways.

Dependency on correct number of modules

This is an insightful question.

In our experiments, in general, the number of architectural modules is always greater than the number of modules in the task: thus, these do not need to match. For example, in the case of Compositional CIFAR-10, we fix the number of architectural modules to 32 and vary the number of task modules from 1 to 8. Figure 10 shows an additional experiment demonstrating the effect of choosing different numbers of architectural modules on the sine wave regression task. We find that adding more architectural modules helps performance even beyond the number of task modules.

Comment

Thank you to the authors for engaging with my questions in such depth.

Modularity definition: I'm still confused about how attention is a modular architecture by your definition of modularity (which you describe it as in your first paragraph), since, while attention mechanisms share weights between sequence elements, they do not have the property of modularity described here where the input features are explicitly subdivided and passed to different components

While the authors have added more detailed explanation of both the confusing formulation of the feature matrix and the training/test loss in the comment here, the paper itself does not seem to have had these clarifications added, so my issue with the paper on that front still stands. I realize that this feature matrix formulation might be standard for those with specific familiarity with the Jacot paper mentioned, but I believe it is still quite unintuitive to the average ML researcher or practitioner, and without explanation this conceptual merging of features and parameters will continue to be confusing (if the feature matrix already has an output dimension of d, what is the W matrix doing? Is it just a dxd matrix mapping within output space? The shape of W is not explicitly stated, making this unclear).

I have increased my score slightly, since the authors provided evidence that the benefits shown do not require the k of the underlying data to be known ahead of time. However, I still think on the whole this paper is confusing, unclear in its notation, and hard to read or reason about.

Comment

Thank you for your continued engagement with our work and for providing additional feedback. We are grateful that you have increased your score and appreciate the opportunity to further clarify and improve our paper.

Modularity Definition and Attention Mechanisms

We apologize for the confusion regarding our inclusion of attention mechanisms as an example of modular architectures. Our intent was to illustrate that certain aspects of attention mechanisms exhibit modular properties, but we understand that this may not have been clear within the context of our specific definition.

We define modularity as the decomposition of a task or function into distinct, independently operating components (modules), each processing a subset of the input features and contributing to the final output in a compositional manner. In self-attention, each output token is computed as a linear combination of input tokens, with the coefficients of the linear combination being the computed self-attention "weights" (computed via softmax). In many NLP and CV tasks, the self-attention "weights" are sparse: each output token is primarily sensitive to only a small number of input tokens. In this sense, we may consider self-attention to be performing modular computation.
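
As a minimal illustration (not code from the paper), single-head self-attention can be written so that each output token is a softmax-weighted combination of input tokens; when those weights are near-sparse, each output depends on only a few inputs:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (T, d) token embeddings; returns (T, d) outputs."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)   # attention "weights": each row sums to 1
    return A @ V                           # each output is a weighted sum of input tokens

rng = np.random.default_rng(0)
T, d = 5, 16
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
# When the rows of A are close to one-hot, each output token depends on only a
# few input tokens, which is the sense of "modular computation" described above.
```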

Clarity of Notation and Feature Matrix Explanation

We apologize for not incorporating the detailed explanations provided in our response into the paper itself. It is important to us that the paper is clear and accessible to all readers, regardless of their familiarity with prior work. We have revised the paper to include the clarifications regarding the feature matrix. Specifically, we have:

  • Clearly defined how the feature matrix is constructed from the input data, including any assumptions made about its properties.
  • Explicitly stated the shapes and roles of all matrices involved, including the feature matrix $\varphi(x)$ and the weight vector $W$, to eliminate ambiguity.
  • Provided an explanation of why we use a feature matrix instead of a feature vector.

Regarding your question about the feature matrix already having an output dimension of $d$ and the role of the weight vector $W$, we realize that the notation may have been confusing. In our formulation, the feature matrix $\varphi(x)$ maps the input $x$ to a feature space of dimension $d \times P$. The weight vector $W$ then maps these features to the output space, and has dimensions $P \times 1$. The output is computed as $y = \varphi(x) W$. We have made sure to clearly state the dimensions of all quantities and explain how they interact to produce the output.

Once again, we appreciate your feedback and are committed to making our paper as clear and comprehensible as possible. We believe that the revisions we made will address your concerns and enhance the overall quality of our work.

AC Meta-Review

This paper derives a theoretical quantitative analysis of the generalization abilities of modular architectures. The analysis shows that under certain assumptions (linearity, specific modular structure of architecture and data) models can avoid the exponential sample complexity associated with high-dimensional inputs. The paper corroborates this with empirical results on two toy tasks.

Reviewers agree that the studied problem is interesting, but disagree substantially about the relevance of the results. The main concern is that the assumptions might be too restrictive to apply to realistic cases and modern architectures. The authors defend their paper as a stepping stone towards a better understanding of modular generalization.

From the reviews it seems clear that the results will be of interest to at least part of the community. I lean towards counting the paper's controversial reception as a potential advantage in this case, and therefore recommend accepting this paper.

Additional Comments on Reviewer Discussion

A big part of the discussion focused on the clarity of the theory, both in terms of notation and definitions. The authors have incorporated this feedback to improve and clarify their paper, but not all reviewers are satisfied with the presentation.

Several questions about the paper, such as the dependence on the number of modules, were raised and answered satisfactorily by the authors.

Apart from clarity the main open concern is about the relevance of the results. While uK9v remains unconvinced, vQzj and 2nhB are convinced that these results are impactful.

Final Decision

Accept (Poster)