PaperHub

Overall rating: 6.8/10 · Rejected · 4 reviewers (lowest 5, highest 8, standard deviation 1.3)
Individual ratings: 8, 6, 5, 8
Confidence: 3.5 · Correctness: 2.8 · Contribution: 3.0 · Presentation: 3.0
ICLR 2025

Towards a formal theory of compositionality

OpenReview · PDF
Submitted: 2024-09-14 · Updated: 2025-02-05
TL;DR

We present a formal, mathematically-precise definition of compositionality. We argue that the theory accounts for and extends our intuitions about compositionality, and can inspire novel methods for learning compositional representations.

Abstract

Keywords

compositionality, complexity, deep learning, representation, generalization

Reviews and Discussion

Review
Rating: 8

The paper proposes a novel metric to measure representational compositionality from the perspective of algorithmic information theory. Specifically, by estimating the Kolmogorov complexity of the representations (Z) and the corresponding message matrix (W), the paper defined compositionality as the ratio of K(Z) to K(Z|W). Compared with earlier colloquial definitions of compositionality, the proposed metric is well-founded on algorithmic information theory. It has the potential to inspire a more detailed analysis of different practical systems. Compared with existing metrics like topological similarity, the proposed metric aligns better with our intuitions of compositional generalization, and also has the potential to be extended to more general scenarios. The paper also showcased how to estimate such a metric using standard deep-learning tools. The experimental results all support the analysis well. To the best of my knowledge, this is the first paper that formally defined compositionality using Kolmogorov complexity and the authors did a good job of practically estimating this metric in different scenarios. I hence suggest an acceptance.

Strengths

  • The paper is well-written and easy to follow.
  • The discussion of limitations of colloquial compositionality highlights the necessity of a formal definition.
  • Reaching the definition of compositionality step-by-step makes it easier to understand how different components of the final metric are formulated.
  • The experiments are conducted at different levels (i.e., synthetic representations, emergent communication, and natural language).

Weaknesses

  • The experiments could be more practical. It might be helpful to consider some LLMs. For example, if we can obtain the representations of some LLMs, how would the compositionality of these representations evolve during finetuning?
  • Although the authors discussed it in the conclusion part, more discussions on how the proposed metric can inspire the new algorithm design can make the paper stronger. For example, are there particular machine learning tasks or model architectures where you believe optimizing for this compositionality metric could lead to improvements?
  • The role played by W and Z might be a bit confusing. It might be more helpful to draw a system diagram combining representation learning (i.e., how Z is estimated) and the sentence W together. IIUC, if we call our high-dim input X, consider an unsupervised representation learning as an example, our system should be X→Z→X’. We then have the hidden representation of input signals. After that, when we evaluate how good Z is, we introduce the messages W and f to evaluate the corresponding Kolmogorov complexity.

Questions

  • For the first line of equation (1), if $p_w, W, f$ are all empty strings, then the right-hand side is still minimized. The equation degenerates to $K(Z)=K(Z)$. Is that the case?
  • Is there any particular reason to define C(Z) as the ratio of two K terms? What about the K(Z)-K(Z|W)? What’s the difference between these two options?
  • It might be more helpful to provide a more formal definition of topological similarity. It appears in Figures 2 and 3. But defining them in the main context might be helpful.
  • Still about topsim. This metric is notoriously known for its high complexity because we must calculate the pair-wise distance of all the examples. What about the complexity of the proposed method?
  • Figure 3 caption: iterated learning is an inductive bias … I guess it should be “iterated learning amplifies the inductive bias of model’s learning that more compositional mappings are learned faster”?
  • In section 4.2, the last paragraph, what is the “normal language” system trained without iterated learning?
  • Results in section 4.3 demonstrate that different languages have different compositionality. Are there any results in linguistics that can verify this claim? Furthermore, the experiments show that Japanese has a negative topsim, is there any intuitive way to understand what this means?
Comment

We thank the reviewer for their excellent points, which we hope we have addressed. We apologize for the long response, as we wanted to respond to every point.

The experiments can be more practical. It might be helpful to consider LLMs and finetuning.

Our framework can indeed be used to study the compositionality of the representations of LLMs, but this is complicated by the variable representation dimensionality in these models, which depends on the number of input tokens. For simplicity, we instead considered sentence-embedding models that encode any variable-length linguistic sequence to a fixed-size vector representation. Note that the sentence-embedding model we used was large (300 million parameters; the largest multilingual sentence embedding model on HuggingFace) [1] and widely used on HuggingFace (200 million downloads in the past month) [2].
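For concreteness, this is roughly how fixed-size sentence representations $Z$ can be obtained from the model in [2] (a sketch only; not necessarily the exact preprocessing used in our experiments):

```python
# Minimal sketch: obtaining fixed-size sentence representations Z from the
# multilingual sentence-embedding model referenced in [2]. The corpus and any
# further preprocessing here are illustrative assumptions.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

sentences = [
    "The cat sat on the mat.",
    "Le chat est assis sur le tapis.",
]

# Each sentence maps to a fixed-size vector regardless of its length,
# so Z is an (N, d) matrix suitable for the compression analysis.
Z = np.asarray(model.encode(sentences))
print(Z.shape)  # e.g. (2, 768)
```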

Regarding the evolution of compositionality in representations over the course of finetuning, we are curious as well! However, we believe that the question of compositionality across different natural languages that we investigated is also very interesting, current, and practical, especially for fields that have been studying compositionality for a long time (such as linguistics). We also studied compositionality in emergent language systems arising in multi-agent settings, which is an important and current field of study in reinforcement learning. We therefore respectfully argue that our experiments do, in fact, consider practical or real-world use cases, although we agree that the core contribution of this work is conceptual and theoretical.

Discussions on how the proposed metric can inspire new algorithms would make the paper stronger

We have now added an additional section, Appendix F, that outlines two very different but equally promising approaches for learning compositional representations that are directly inspired by our definition. The first approach directly regularizes $K(Z|W)$ using the loss of a discrete auto-encoder, and the second approach relies on multi-task training. We believe that this suggestion has made our paper much stronger, especially for ML practitioners who wish to build on it.

Include a system diagram to clarify that $Z$ comes from some source like a pretrained model, and the role of $W$ is to compress it

We have edited Fig. 1b to show that $Z$ comes from some external source, and that the role of the components described in Section 2 is to compress this $Z$. We have also clarified the conceptual role played by different components of the compression program in the figure.

For the first line of equation (1), if $p_w, W, f$ are all empty strings, then the right-hand side is still minimized. The equation degenerates to $K(Z)=K(Z)$. Is that the case?

Yes, but only because the shortest program for $K(Z \mid f=\text{null}, W=\text{null})$ would itself decompose $Z$ according to the program form we have outlined, and would internally define $p_w, W, f$. In practice, to estimate the minimization in Eq. 1, we must put constraints on the different models so that they learn to represent the correct corresponding program component (e.g., parameterise $p_w$ using an autoregressive sequence model, model $K(Z \mid W, f)$ as a Gaussian, etc.). We felt that including this discussion in the paper would add too much complexity, but it is an important caveat.

Is there any particular reason to define $C(Z)$ as the ratio of two $K$ terms? What about a difference?

We believe that a ratio is more intuitive: it asks how much more compressible the representation becomes when it is described as a function of parts (2x, 3x, etc.). A ratio is also unitless, which helps with interpretation across distinct model implementations. However, we are not strongly committed to this particular choice, and there could be reasons why a difference is more appropriate or intuitive in certain circumstances. What we believe is most important are the component $K$ terms emphasized in our definition. Importantly, when controlling for all representations of fixed complexity $K(Z)$, both alternatives agree about how to order the compositionalities of these representations. We would be keen to consider additional reasoning and recommendations from the reviewer on this front.
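To spell out the last point (our notation; the labels $C_{\text{ratio}}$ and $C_{\text{diff}}$ are introduced only for this comparison):

```latex
% Both candidates are strictly decreasing in K(Z|W) once K(Z) is held fixed,
% so they rank any two representations Z_1, Z_2 with K(Z_1) = K(Z_2) = K identically.
\[
  C_{\text{ratio}}(Z_i) = \frac{K}{K(Z_i \mid W_i)}, \qquad
  C_{\text{diff}}(Z_i) = K - K(Z_i \mid W_i),
\]
\[
  K(Z_1 \mid W_1) < K(Z_2 \mid W_2)
  \;\Longleftrightarrow\;
  C_{\text{ratio}}(Z_1) > C_{\text{ratio}}(Z_2)
  \;\Longleftrightarrow\;
  C_{\text{diff}}(Z_1) > C_{\text{diff}}(Z_2).
\]
```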

Include the formal definition of topological similarity in the main text

Thank you for the suggestion: we have added this at the start of Section 4.

Figure 3 caption: change “iterated learning is an inductive bias” to “iterated learning amplifies the inductive bias of model’s learning that more compositional mappings are learned faster”

This was in the main text (not Figure 3), but we have made the reviewer’s suggested change in Section 4.2 when introducing iterated learning. Thank you for the catch!

Comment

In section 4.2, the last paragraph, what is the “normal language” system trained without iterated learning?

“Normal” refers to a language system that emerged from training without iterated learning (i.e., just ordinary reinforcement learning for multi-agent communication, without periodic resets).

Topsim is notoriously known for its high complexity because we must calculate the pairwise distance of all the examples. What about the complexity of the proposed method?

Our proposed metric is tractable, as it only involves fitting neural networks to data with gradient descent. We have edited Appendix B to include details in a new paragraph.
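For intuition about what such an estimate involves, here is a rough sketch of a prequential (online MDL) estimate of $K(Z|W)$ in the spirit of [3]; the block structure, the model interface, and the Gaussian code length are illustrative assumptions, not our exact implementation:

```python
# Sketch of a prequential (online MDL) estimate of K(Z|W): train a predictor
# of Z from W on a growing prefix of the data and sum the code lengths
# (negative log-likelihoods) it assigns to each next block before seeing it.
# The cost of the very first block under a default model is omitted for brevity.
import numpy as np

def prequential_codelength(W, Z, fit_model, block_size=256):
    """Estimate K(Z|W) in nats via prequential coding (illustrative sketch)."""
    total_nats = 0.0
    n = len(Z)
    for start in range(block_size, n, block_size):
        model = fit_model(W[:start], Z[:start])           # train on the prefix
        mu, sigma = model.predict(W[start:start + block_size])
        z = Z[start:start + block_size]
        # Gaussian code length of the next block under the fitted predictor.
        nll = 0.5 * np.sum(((z - mu) / sigma) ** 2 + 2 * np.log(sigma)
                           + np.log(2 * np.pi))
        total_nats += nll
    return total_nats
```

The cost is dominated by fitting the predictor a handful of times with gradient descent, rather than by computing all pairwise distances as topsim requires.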

The experiments show that Japanese has a negative topsim. Is there any intuitive way to understand what this means?

The fact that Japanese has a negative topsim is a highly unexpected and unintuitive result, and indeed highlights the flaws of topsim. What a negative topsim literally means is that when two sentences were more similar in sentence-space, they tended to be more dissimilar in meaning-space. Why this was the case, we do not know, but as our $C^L$ metric shows, the mapping from $W \rightarrow Z$ is nevertheless still highly systematic, with lower $K(Z|W)$ than any of the other languages we considered. We further articulate this point in the revised text at the very end of Section 4.
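For readers less familiar with the metric: topsim is typically computed as the Spearman correlation between pairwise distances in sentence-space and meaning-space, so a negative value literally means the two distance structures are anti-correlated. A minimal sketch (the distance choices here are illustrative assumptions, not the exact ones in our experiments):

```python
# Illustrative computation of topological similarity (topsim): the Spearman
# correlation between pairwise distances in sentence-space and meaning-space.
# A negative value means sentences that are close as strings tend to be far
# apart in meaning. Distance choices here are illustrative assumptions.
from itertools import combinations
import numpy as np
from scipy.spatial.distance import cosine
from scipy.stats import spearmanr

def edit_distance(a, b):
    """Plain Levenshtein distance between two token sequences."""
    d = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    d[:, 0] = np.arange(len(a) + 1)
    d[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1,
                          d[i - 1, j - 1] + (a[i - 1] != b[j - 1]))
    return d[-1, -1]

def topsim(sentences, meanings):
    """Spearman correlation of pairwise sentence vs. meaning distances."""
    pairs = list(combinations(range(len(sentences)), 2))
    sent_d = [edit_distance(sentences[i], sentences[j]) for i, j in pairs]
    mean_d = [cosine(meanings[i], meanings[j]) for i, j in pairs]
    return spearmanr(sent_d, mean_d).correlation
```

Note that this requires all O(N^2) pairs, which is the tractability concern raised above.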

Results in section 4.3 demonstrate that different languages have different compositionality. Are there any results in linguistics that can verify this claim?

Around the mid-20th century the consensus among linguists was that all languages were equally compositional, but this view has recently been challenged and there is currently significant debate [4], in part because there is no prevailing quantitative definition for compositionality, which is why we chose to study this question. We articulate this point in the first paragraph of Section 4.3. It would be interesting to collaborate with linguists to test hypotheses they may have in future studies.

References

[1] Reimers, N. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv preprint arXiv:1908.10084.

[2] Sentence-embedding model HuggingFace model card: https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2

[3] Bornschein, J., Li, Y., & Hutter, M. (2022). Sequential learning of neural networks for prequential MDL. arXiv preprint arXiv:2210.07931.

[4] Joseph, J. E., & Newmeyer, F. J. (2012). 'All Languages Are Equally Complex'. Historiographia linguistica, 39.

Comment

My queries have been answered and I stand by my score. Thank you for your response. I'm looking forward to seeing the final version and the released code for calculating this metric.

Review
Rating: 6

This paper proposes a formal definition of representational compositionality based on Kolmogorov complexity from the perspective of compression. The paper verifies that the definition aligns with human intuition on some examples in synthetic and real datasets.

Strengths

  1. The motivation of this paper is greatly appreciated. It focuses on proposing a new formal definition of compositionality, which is sometimes a vague notion in previous literature. The definition based on compression is also novel.

  2. Some parts of the definition match well with intuition. For example, if the function $f$ used to construct the representation $Z$ has no structure and is very complex, to the extent of being a look-up table, then intuitively this representation is not compositional. Accordingly, the proposed metric also shows low compositionality.

  3. The proposed metric shows the extent to which the representation is compositional, rather than a binary indicator.

Weaknesses

  1. My biggest concern for this paper is its presentation. Although the notion of compositionality is abstract in nature, I think the presentation in Sections 2 and 3 could be clearer if properly accompanied by concrete examples. I struggled a bit with abstract words like "representations," "constituents," "semantics," "sentences," "language," "systematicity," and "modularity." They prevented me from reading through the whole section smoothly.

I find the examples in Appendix D useful for understanding the proposed definition, and I encourage the authors to move some of them into the main text. On the other hand, however, these examples are still not concrete enough. For example, Example 5 says "$f$ is modular." Then, what is a modular function? It would look much better if the authors could write out the specific formula of the function (maybe in a toy setting) and use a figure to illustrate how this function structure contributes to high compositionality.

  2. The assumption that the representation $Z$ follows a normal distribution seems arbitrary. Is there previous work that provides theoretical support for this assumption? Or is there any empirical evidence?

Questions

  1. In Line 270, what is the definition of modularity? Is modularity a specific form of systematicity in Line 253? If so, I think it is not appropriate to put these two terms in parallel.

  2. In the experiment on lookup table representations, do the sentence length and vocabulary size correlate with each other? For example, if the vocabulary size is very large, then the sentence length is likely to be 1.

  3. Are there any practical implications of this theoretical definition of representational compositionality? In the conclusion, it says that compositionality can be used to score tokenization schemes. Could you elaborate on this?

I’m also curious about whether this definition of representational compositionality can be readily applied to vision tasks. There have been some advancements in forcing CNNs to learn compositional parts [c1,c2], and object-centric learning [c3]. I wonder if the definition can be further validated under these settings.

  4. Some recent work studies compositional generalization in object-centric learning, e.g., reference [c4]. Although it focuses on a more concrete setting rather than a general definition as in this paper, it provides insight into how we can prove compositional generalization in a formal way. Could you briefly discuss how the definition of compositionality in this paper relates to or differs from that in [c4], and why viewing compositionality from the view of compression is "better"?

[c1] Zhang et al. Interpretable Convolutional Neural Networks. CVPR 2018.

[c2] Interpretable Compositional Convolutional Neural Networks. IJCAI 2021.

[c3] Locatello et al. Object-Centric Learning with Slot Attention. NeurIPS 2020.

[c4] Wiedemer et al. Provable Compositional Generalization for Object-Centric Learning. ICLR 2024.

Details of Ethics Concerns

NA

Comment

We thank the reviewer for their excellent points, which we hope we have addressed. We apologize for the long response, as we wanted to respond to every point.

The clarity of Section 2 can be improved with clearer definitions of the program components and concrete examples

Thank you for the suggestion. We have addressed this in 2 ways:

  1. We have added a table in Section 2, where for each term we give: its notational symbol, its shape, and a concrete example. We believe this will serve as a useful point-of-reference for readers where all concepts are concretely summarized in one place.
  2. We have edited Fig. 1b to show that $Z$ comes from some external source, and that the role of the components described in Section 2 is to compress this $Z$. We have also added additional descriptors in Fig. 1b to clarify the conceptual role played by components of the compression program.

Define systematicity

Systematicity is a term from cognitive science describing a property of compositional systems: that the meaning of a whole (e.g., a sentence) is derived from the meanings of the parts (e.g., the words) in a structured or non-arbitrary way. Idioms are a classic counter-example for systematicity [1]. We have added this definition in Section 3 when first introducing the term.

The examples of high/low $C(Z)$ in the Appendix are helpful for understanding the proposed definition. Can these be moved to the main text, and can they be made more concrete?

We agree that the examples are very helpful for gaining intuition. However, we believe that for most readers a large number of examples in the main text will prove distracting. We therefore summarized a small number of prototypical examples in the main text that were supported by our empirical experiments: (1) lookup-table representations and (2) modular grammar representations. Crucially, these are the most important examples to contextualize our definition in the prior literature on compositionality: as we cite early in the second paragraph of the introduction, ML often considers compositionality as synonymous with disentanglement or object-centric representations (lookup-table representations), and both ML and cognitive science think of compositionality in terms of languages, grammars, and modularity.

We agree that our examples in the Appendix could have been further fleshed out. We have now edited them to be more concrete, and in particular have followed the reviewer’s suggestion by adding a new Fig. D.1 to illustrate each example described in the Appendix.

What is the definition of “modularity” used in Section 3?

Modularity refers to a system which can be decomposed into interacting sub-parts that can be understood separately [2]. An example in ML is mixture-of-experts models. We have added the definition and example to Section 3.

How does modularity relate to compositionality?

Modularity relates to compositionality because if a function $f$ decomposes according to reusable modules, those modules only need to be defined once, making $f$ more compressible and $K(f)$ lower (which appears in the denominator of $C(Z)$). An example is that a modular program that defines reusable functions can be much shorter than one that does not. We have clarified this in the main text and in Appendix D.
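As a toy illustration of this point (ours, purely for intuition): the same mapping written with a reusable word-embedding module has a short, fixed-size description, whereas spelling it out as a lookup table grows with the number of possible sentences.

```python
# Toy illustration (not from the paper): the same mapping from two-word
# "sentences" to vectors, written once with a reusable module and once as an
# explicit lookup table. The modular program stays short no matter how many
# sentences there are; a literally spelled-out table grows combinatorially,
# which is the sense in which K(f) is lower for the modular f.
import numpy as np

VOCAB = ["red", "green", "blue", "circle", "square", "triangle"]

def embed_word(word):
    """Reusable module: every word is embedded the same way."""
    rng = np.random.default_rng(hash(word) % (2**32))
    return rng.standard_normal(4)

def f_modular(sentence):
    """Meaning of a sentence = concatenation of its parts' embeddings."""
    return np.concatenate([embed_word(w) for w in sentence])

# The non-modular alternative: enumerate every sentence separately.
f_lookup = {
    (w1, w2): np.concatenate([embed_word(w1), embed_word(w2)])
    for w1 in VOCAB for w2 in VOCAB
}  # 36 entries here; grows with vocabulary size and sentence length

print(np.allclose(f_modular(("red", "circle")), f_lookup[("red", "circle")]))
```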

The assumption that the representation $Z$ follows a normal distribution seems arbitrary.

We believe there was a misunderstanding here. Representations $Z$ come from some model (e.g., a latent layer in a pretrained image classifier) and are certainly unlikely to be normally distributed, but we do not assume that they are. Perhaps the normal distribution referred to by the reviewer is $p(z \mid f(w))$: given $f(w)$, the representation is normally distributed. This is a mixture-of-Gaussians distribution with as many possible mixture components as there are possible sentences (which is exponentially large), and mixtures of Gaussians can approximate any distribution arbitrarily well given enough mixture components [3].

In the experiment on lookup table representations, are the sentence length and vocabulary size correlated with each other?

No, sentence length and vocabulary size are uncorrelated. When varying one, we keep the other fixed. We wanted to see how each of these attributes independently affects $C(Z)$.

Is modularity a specific form of systematicity?

No: as specified above, modularity refers to a function with distinct components, whereas systematicity refers to a function that behaves in a predictable way for novel combinations of inputs. For instance, natural language is systematic in how it represents the meaning of a phrase like “he kicked the ball”, but not systematic (i.e., arbitrary) in how it represents the meaning of an idiom like “he kicked the bucket”. While modularity can in some circumstances help with systematicity, this is not always the case [4].

Comment

Are there practical implications of this theoretical definition of representational compositionality? Could you elaborate on how it can be used to score tokenization schemes?

There are practical implications both for scoring different machine learning methods (architecture, learning rules, etc.) on the compositionalities of their trained representations, as well as for regularizing representations to be more compositional by turning the definition into a loss function. We briefly mentioned these ideas in the Conclusion section, and have added a new section Appendix F discussing inductive biases inspired by our definition.

Regarding tokenization schemes, training a language model using a particular tokenizer produces sentences $W$ (the tokens) and representations $Z$. In addition to downstream task performance, one way to judge the quality of the tokenization scheme is by whether or not it produced representations with high $C^L$. We have updated the text to clarify this point on lines 521-522.

What is the relationship between our definition of compositionality and object-centric learning in computer vision?

Object-centric learning attempts to learn representations of scenes that are factorized into “slots” representing individual objects. This enables the design of decoder architectures that can provably generalize compositionally. In our definition, slot-based representations are a particular case of compositionality related to the disentanglement and lookup-table representations we discussed: they are more compressible as a function of parts because the representation can be broken down into a set of “object” representations, and the way these “objects” are represented is shared across all slots. There are other examples of object-centric learning with representations that can be compressed even more, for instance by factorizing the object representations themselves into attributes that can be recombined across objects [5].

The compression view is useful because it is more general. Slot-based models build compositional representations using rigid constraints on the architecture and representational format, and might suffer in terms of performance. Other approaches might attempt to learn representations that generalize compositionally through data augmentation (e.g., constructing synthetic scenes with more variation in object combinations) or simply by training on more data. Additional losses might also serve as good inductive biases for compositional generalization. Even though these other methods might still learn representations that generalize well compositionally, this cannot be explained in terms of "slots" with explicitly factorized object representations. $C(Z)$ can explain the success of varied methods because it defines compositionality using compression, which abstracts across the architecture, learning details, and particular representational format (e.g., whether or not it is separated into "slots"). We have edited the paper to clarify these points on lines 530-533 and in a new Appendix section E.

Note that we are very interested in using our definition to measure the compositionalities of representations in different vision architectures that were designed with compositional generalization in mind, and hope that others are also motivated to do so after reading our work.

References

[1] Weinreich, U. (1969). Problems in the analysis of idioms. Substance and structure of language, 23(81), 208-264.

[2] Poole, D. L., & Mackworth, A. K. (2010). Artificial Intelligence: foundations of computational agents. Cambridge University Press.

[3] Everitt, B. (2013). Finite mixture distributions. Springer Science & Business Media.

[4] Mittal, S., Bengio, Y., & Lajoie, G. (2022). Is a modular architecture enough?. Advances in Neural Information Processing Systems, 35, 28747-28760.

[5] Singh, G., Kim, Y., & Ahn, S. (2022). Neural systematic binder. arXiv preprint arXiv:2211.01177.

Comment

I would like to thank all authors for their response. Most of my questions are addressed. I have raised my score accordingly.

Comment

We thank the reviewer for their assessment of our improvements and are grateful for their score adjustment. As the discussion period continues, we remain available to discuss any remaining points that prevent the reviewer from further strengthening their support.

Review
Rating: 5

In machine learning applications, we're often interested in understanding whether a model has learned to solve a problem using a compositional computation, or in evaluating whether it has discovered the latent compositional structure underlying some domain of interest. This paper describes a procedure for answering these questions quantitatively. It provides tools for quantifying the compositionality of an arbitrary collection of vector-valued datapoints $z$, using algorithmic information theory.

From a technical perspective: we first posit that data results from a generative process of the following form:

$\textrm{words } w \sim p(w), \quad \textrm{statistics } (\mu, \sigma) = f(w), \quad \textrm{true data } z \sim N(\mu, \sigma)$

Next, we optimize $p$ and $f$ to minimize the overall Kolmogorov complexity:

$K(Z) = K(p) + K(f) + \sum_i \log p(w_i) + \log p(z_i | w_i)$

where

$p(z_i | w_i) = N(z_i ; f(w_i))$

Finally, we measure the compositionality of the representation system as:

$C(Z) = K(Z) / K(Z | W) = K(Z) / (K(f) + \log p(z_i | w_i))$

for $p$ and $f$ as chosen above. Intuitively, a representation $z_i$ is "more compositional" if it is easy to reconstruct given a simple "parts-based" representation $w_i$.

At a high level: I really like what this paper is trying to do---this is an important topic and I'm very excited about the idea of trying to better ground intuition about compositionality in terms of algorithmic information theory. However, I think there are a few major issues with the formulation presented in the current paper, and I don't think this work is quite ready to publish in its current form.

RELATION TO EXISTING WORK

The paper claims that nobody has attempted to offer a quantitative, graded measure of compositionality targeted at modern representation learning applications. This is not quite right: people were thinking about similar issues as part of the general late-2010s multi-agent communication fad, and I would encourage the authors to check out https://arxiv.org/abs/1902.07181 and follow-ups for some earlier work trying to answer related questions. The technical approach proposed in the current paper is substantially different from this past work---in particular, it successfully infers latent compositional descriptions of data---but many of the experiments are very similar and it might be informative to discuss some relationships to representation-reconstruction techniques in addition to the existing discussion of topological similarity.

DEGREES OF FREEDOM

My primary technical concern with the paper is the following: there's some symmetry between $p$ and $f$ in the optimization problem described above, and as a consequence different minima will give very different answers to the question of whether a given representation system is compositional. Here's an argument that I think captures the essence of the issue (we could be more precise about some constant factors, but it wouldn't change the overall conclusion).

Suppose I have a set of $N$ random bitstrings of length $M$. This set is incompressible, so no matter how I set $p$ and $f$ I'm going to have $K(Z) = MN$.

[For simplicity, let's also assume that $p(z ; f(w))$ isn't Gaussian, but instead a sequence of $M$ Bernoulli distributions whose parameters are given by $f(w)$. The choice of Gaussians is described as arbitrary in the paper, and it wouldn't be too much more work to rephrase the argument below with that parameterization.]

Now consider two different schemes for setting $p$ and $f$.

Scheme 1: $p(w)$ is a deterministic distribution that places its mass on a single word $w$, and $f(w)$ outputs the all-0.5s vector (so both functions are simple, and $K(p)$ and $K(f)$ are both negligible). Then $K(W | p) = 0$ (it's deterministic), and $K(Z | W, f) = -\sum_{i=1}^N \sum_{j=1}^M \log p(z_{ij}) = -\sum_{ij} \log 1/2 = MN$ (it's the log-probability of a sequence of fair coin flips). So $K(Z) = MN$ as expected, and $C(Z) = 1$ (since all terms but $K(Z | W, f)$ are negligible or zero).

Scheme 2: Each word $w$ is just a copy of its associated bitstring $z$; $p(w)$ places a uniform distribution over all length-$M$ bitstrings, and $f$ is the identity function (so a 0 in a bitstring gets turned into a Bernoulli that always outputs 0). $K(p)$ and $K(f)$ are negligible as before. But now $K(W | p) = -\sum_{i=1}^N \log p(w_i) = -\sum_{i=1}^N M \log 1/2 = MN$, while $K(Z | W, f) = 0$ (since all decoding is deterministic). $K(Z) = MN$ as before, but $C(Z)$ is arbitrarily large as both terms in the denominator are negligible relative to $MN$.

So in fact this definition of compositionality allows us to conclude that incompressible noise is either highly compositional, or highly non-compositional.

CONCEPTUAL CONSIDERATIONS

Taking a step back: there's quite a lot of work---in fields ranging from formal semantics to category theory---that offers a precise mathematical treatment of compositionality (just not one that's graded or approximate in the way this paper and the above work is aiming at). One of the big differences between the existing literature and current paper is that compositionality is usually thought of as a property of interpretation functions, not sets of representations: if we have a space of "sentences" and "meanings", each equipped with some algebraic structure, then a mapping from sentences to meanings is compositional if it is a homomorphism with respect to this structure (see e.g. https://plato.stanford.edu/entries/montague-semantics/).

This is closer to the paper's Section 3.1, which fixes the encoding scheme $W$ and its prior, and just tries to optimize the function $f$ that translates $w$s into $z$s. But in the experiments using this definition, $K(Z)$ is fixed to a constant, so we're really just estimating something like $1/K(Z|W)$, which is just a measure of how predictable $z$s are from $w$s, without saying anything about the structure that relates them.

And I think this gets at the fundamental conceptual issue with the current version of the paper---that the definitions of $C$ and $C^L$ don't actually say anything about parts, or putting them together. They're just measuring a predictability relationship between $z$s and $w$s, and we can make systems look more or less "compositional" under these definitions by moving the cost of that prediction operation into $p$ or $f$. The brief mention that "structure-preserving maps have low Kolmogorov complexity" does all the work of relating these operations to compositionality as usually understood, such that we should maybe think of the paper as arguing that low Kolmogorov complexity is necessary and sufficient for high compositionality (which I don't think is what was intended, and in any case orthogonal to the current experiments and technical presentation).

NEXT STEPS

So what would it look like to have a definition of compositionality grounded in algorithmic information theory? I think the answer has to focus on mappings rather than representations---fixing both $Z$ and $W$ (as in 3.1, and existing work on the subject). Then plausibly we could introduce a new latent variable that doesn't describe an encoding of the data, but the structure of the mapping. Maybe something like:

  • Given $w$, compute $f(w) = w_1, ..., w_n$
  • translate each $w_i$ into $g(w_i) = z_i$
  • combine all $z_i$ into $h(z_1, ..., z_n) = z$

Then a system is compositional if this decomposition is useful for implementing a low-complexity mapping, which you could characterize by e.g. finding the most aggressive decomposition that remains close to the unconstrained complexity $K(Z|W)$. This is of course all speculative! The main thing I want to communicate is that, while there are issues with the current proposal, I think the high-level idea is really promising, and could probably be made to work if re-oriented around a definition of compositionality as a property of maps rather than sets.

Strengths

  • Very interesting problem formulation and approach

Weaknesses

  • Fundamental identifiability questions make current problem formulation non-meaningful
  • Unclear connection to other formal theories of compositionality

Questions

  • Did I miss something in the argument above?
Comment

Hi authors! Thank you for the detailed response. Unfortunately, it doesn't address my main concerns---I still think the paper has fundamental technical issues and isn't ready to publish in its current form. Briefly:

The identifiability argument in the original review used discrete distributions to avoid all the complications that come up with entropies of continuous distributions. But if the Gaussian piece of this is important, let's just call $G$ the entropy / average number of bits needed to encode a sample from $\mathcal{N}(0, 1)$. Now suppose my data consists of $N$ i.i.d. draws from $\mathcal{N}(0, 1)$; the Kolmogorov complexity of this set is $NG$. Now as above we have two choices:

We can let $p = \mathcal{N}(0, 1)$ and $f(x) = (x, \epsilon)$

or we can let $p = \mathcal{N}(0, \epsilon)$ and $f(x) = (0, 1)$

In the former case compositionality scales with $NG$, and you pay $NG$ to represent the encodings $E$. In the latter case compositionality is constant, and you pay $NG$ to account for prediction noise. Both choices achieve the same value of the objective in Equation 1. As the author response notes, these two implementations are morally "the same". This is exactly the problem with the proposed definition---a useful definition of compositionality should be invariant to these implementation details. The constraints given in the first part of the author response don't actually help us break the tie here: we can implement both $p$s above as trivial transformers (constraint 1), the matrix of discrete representations $E$ contains either the binary encoding of all points or the trivial all-zeroes matrix in the two constructions above (constraint 2), $f$ outputs the parameters of a normal distribution in both constructions (constraint 3), and $p(z; f(w))$ is a Gaussian (constraint 4).

And despite the (helpful!) conceptual scaffolding in the author response, I'm genuinely unsure which of these two schemes the authors would consider to be the "right" one for encoding a bunch of unstructured noise. Is high-dimensional noise compositional or non-compositional? And given how much intuition / discretion seems required to compute $\mathcal{C}$, it seems a bit premature to refer to this as a "formal" definition. For something as fundamental as compositionality, it seems a little worrying for that definition to depend on things like transformers and even Gaussians rather than more basic computational primitives like Turing machines---in this sense, the version presented in the first paper draft seemed closer to the right answer than the revised one.

All that said, I have been convinced by the author response that it's worthwhile to quantify the structure in collections of representations, even independent of mappings (I might call the subject of the current paper something like "combinatoriality" to avoid confusion with all the existing definitions of "compositionality"). I think the definitional issues are likely fixable (even if I'm not sure which fix is compatible with the authors' intuition), I'm still excited about this line of work, I look forward to seeing a version of this paper at a future conference!

Comment

Realized I didn't respond to the point about predictability vs compositionality in sec 3.1. Suppose I have some highly compositional data with an extremely transparent encoding scheme, e.g. data are pairs of letters (e.g. $(a, b)$) and encodings $E$ are ASCII representations of these pairs (e.g. $[01100001, 01100010]$). Now suppose I take my letter pairs and pass them through a hash function, such that it's not possible to look at any individual part of the hashed data and say that it corresponds to the $a$ or the $b$. I would think of the hashed data as a canonical set of representations that is "predictable, but non-compositional" (and we can make $K(Z|W)$ really small / $\mathcal{C}^L(Z)$ really big). Is the intuition in the paper that this data is in fact compositional? If so, is there any difference between compositionality and predictability? If so, what do the separating cases look like?

Comment

We thank the reviewer for their excellent points, which we hope we have addressed. We apologize for the long response, as we wanted to respond to every point. As the reviewer points out, formally quantifying compositionality is an important topic for the community and a daunting task. We hope the reviewer sees this contribution as a step toward this goal that is worthy of publication so that future work can build on it.

There exist related definitions of compositionality in terms of representation reconstruction that, while different from the proposed method, should be discussed and cited

Thank you for pointing this out. We have edited our paper to acknowledge this line of work in Section 3 under the paragraph "Constituent 'parts' are intrinsic to $Z$", as this feature of our definition is one of the core differences to others. We also note that our definition remains more agnostic to the structure of the mapping $f$ from $w$ to $z$, as it only considers the complexity of $f$, whereas other work puts more rigid constraints on the mapping (e.g., linear additive, hierarchical grammar, etc.).

Degrees of freedom in the definition

The reviewer raises an excellent point, and indeed simply reading Eq. 1 our definition appears to be ill-defined. Crucially, however, this is a problem about how to delineate, or "label", different parts of the program that optimally compresses $Z$. For instance, in the reviewer's example, both schemes actually implement the same program (a uniform distribution over strings of length $MN$), and we are only changing which parts of that program we label as $p_w$, $p(z ; f(w))$, and so on. Eq. 1 must therefore be read with the entirety of Section 2 in mind, which specifies how to label the different program components. With this in mind, $C(Z)$ is well-defined and can be estimated in practice by putting constraints on the different components of the generative model during training. We will explain this in the case of a continuous $Z$, where we have given more thought to these labelings and constraints, as opposed to the discrete case in the reviewer's example. It might be helpful here to think of the generative model as a mixture of Gaussians with modes indexed by $W$.

  1. $p_w$ conceptually is the part of the program that defines the distribution over mode indices in $Z$. For compositional representations, these modes are structured, so the indices should be structured as well using symbolic sequences. Constraint: the function specifying $p_w$ should be a sequence model taking discrete inputs and outputting a scalar (e.g., an auto-regressive Transformer).
  2. $W$ conceptually is the part of the program that specifies the structured mode indices for each row of the representation. Constraint: rows of $W$ should be discrete sequences.
  3. $f$ conceptually is the part of the program that maps the structured mode indices to their vector embeddings representing the mode location and size. Constraint: $f$ should be a function that takes a discrete sequence as input and outputs two real-valued vectors (mode mean and standard deviation).
  4. $p(z ; f(w))$ conceptually represents the prediction error, or how far each row of $Z$ was from its predicted mode $f(w)$. Constraint: $p(z ; f(w))$ should be a Gaussian with mean and standard deviation given by $f(w)$. The decision to use a Gaussian is not arbitrary, but is in fact an important constraint, and we have edited Section 2 accordingly.

While technically very important, we felt that including these details in the paper would have added too much complexity. Please let us know if this has addressed your concern, and if so whether you would like it to be included in the paper.
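To make the constraints above concrete, the following is a rough sketch of how the four components could be parameterized with standard deep learning tools; the module choices, sizes, and naming are our illustrative assumptions, not a prescription or our released implementation:

```python
# Rough sketch (illustrative, not the paper's implementation) of the four
# constrained components: an autoregressive prior p_w over discrete sequences,
# discrete sentences W, a decoder f mapping a sentence to a Gaussian mode
# (mean, std), and a Gaussian error term p(z ; f(w)).
import torch
import torch.nn as nn

VOCAB, SEQ_LEN, D_Z, D_H = 16, 6, 32, 64

class SentencePrior(nn.Module):
    """p_w: autoregressive sequence model over discrete tokens."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_H)
        self.rnn = nn.GRU(D_H, D_H, batch_first=True)
        self.out = nn.Linear(D_H, VOCAB)

    def log_prob(self, w):                       # w: (B, SEQ_LEN) long tensor
        h, _ = self.rnn(self.embed(w[:, :-1]))   # predict token t from tokens < t
        logp = torch.log_softmax(self.out(h), dim=-1)
        return logp.gather(-1, w[:, 1:, None]).squeeze(-1).sum(-1)

class Decoder(nn.Module):
    """f: discrete sentence -> (mean, std) of the corresponding mode in Z."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_H)
        self.mlp = nn.Sequential(nn.Flatten(), nn.Linear(SEQ_LEN * D_H, D_H),
                                 nn.ReLU(), nn.Linear(D_H, 2 * D_Z))

    def forward(self, w):
        mu, log_sigma = self.mlp(self.embed(w)).chunk(2, dim=-1)
        return mu, log_sigma.exp()

def gaussian_nll(z, mu, sigma):
    """Code length of z under p(z ; f(w)), the Gaussian error term."""
    return 0.5 * (((z - mu) / sigma) ** 2 + 2 * sigma.log()
                  + torch.log(torch.tensor(2 * torch.pi))).sum(-1)

# Total description length for a batch: -log p_w(w) + NLL(z | f(w)).
prior, f = SentencePrior(), Decoder()
w = torch.randint(0, VOCAB, (8, SEQ_LEN))
z = torch.randn(8, D_Z)
mu, sigma = f(w)
codelength = (-prior.log_prob(w) + gaussian_nll(z, mu, sigma)).mean()
```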

Comment

The authors' definition of language system compositionality $C^L$ in Section 3.1 measures $1/K(Z|W)$ and is therefore just about predictability; it does not consider the structure of the mapping from $\mathcal{W} \rightarrow \mathcal{Z}$, such as whether or not it is a homomorphism, nor the "parts" in $W$

We believe that the reviewer is incorrect about this particular point, but please correct us if we have misunderstood. $K(Z|W)$ decomposes according to $K(f) + K(Z|f,W)$. The $K(Z|f,W)$ term (modeled by a Gaussian) is only there to account for approximation errors when mapping from a discrete $w$ to a continuous $z$, so for the sake of argument here just assume that it is $=0$ in the optimal compression $K(Z|W)$. What $K(f)$ quantifies is precisely the structure of the mapping $\mathcal{W} \rightarrow \mathcal{Z}$, as we argue in Section 3 and all our experiments. For instance, if $f$ is a homomorphism it will be simple to describe. Crucially, though, our definition generalizes the special case of a homomorphic $f$ to other kinds of functions in an intuitive way: the more structure there is in $f$ (low $K(f)$), the higher $C^L(Z)$ is. Notice that while natural language is not a homomorphism, we still consider it compositional, and a definition of compositionality should account for this. In our definition, this is because semantics in natural language aligns with syntax, and syntax has a small number of reusable rules which makes it compressible (Section 3 on modularity and Section 4.1 on synthetic grammars).

Next steps: should a definition of compositionality fix both W and Z and only consider the mapping between them?

We don't dispute that a useful notion of compositionality considers both $W$ and $Z$ as fixed. In fact, this is a special case which is readily tractable, and is exactly why we introduced $C^L$: it helps us talk about the compositionality of a language system, or the compositionality of some representation w.r.t. some externally-defined parts (e.g., some underlying task variables that we think the model should be representing in a compositional manner). However, the "set" perspective that led us to introduce representational compositionality $C(Z)$ is also useful when we don't have such externally-defined parts. It is rooted in the intuition that a representation might have some "intrinsic" parts-based representation, which we formalize using optimal compression.

To take an example outside of neural representations in order to build complementary intuition, consider a program (whose semantics are analogous to $Z$). We can ask how compressible this program is when broken down into some given library of functions (analogous to $W$), but we can also ask what the optimal library or refactoring would be such that the total description length of the program and library together is minimized ($W$ is not fixed). This example, and $C(Z)$, are particularly relevant to the Language of Thought hypothesis and empirical studies surrounding it, such as [1].

We therefore believe that both notions of compositionality (fixing $Z$ and $W$, as well as only fixing $Z$) have value, and that important discussions like this one that are prompted by our work are reasons for publication.

Regarding a definition of compositionality in terms of the structure of the mapping from $W$ to $Z$, how about embedding each word as $z_i = g(w_i)$ and combining these embeddings $z = h(z_1, \ldots, z_n)$, then seeing if the lowest-complexity decomposition of this form remains close to $K(Z|W)$?

We may have misunderstood (please correct us if we have), but it seems as if this definition is already intimately tied to the language-system version $C^L(Z)$ we proposed, although ignoring the numerator. For a fixed $(W, Z)$ we are estimating $K(Z|W)$ by fitting the simplest function $f: \mathcal{W} \rightarrow \mathcal{Z}$ that also minimizes the prediction error $K(Z|W,f)$. Our mapping is therefore the same as the one proposed by the reviewer, with the exception that we remain more general: $f$ does not necessarily assume each word has to be independently embedded, and in natural language there are clear examples of this where the meaning of one word in a sentence depends on the other words in a context-sensitive way (although more independent embeddings lead to simpler functions; see our disentanglement results in Fig. 2b). If this decomposition of $Z$ into parts $W$ admits a simple mapping $f$ that achieves low error $K(Z|W,f)$, then we say that the resulting mapping is compositional. Note that if $(W, Z)$ admits no such simple mapping, then $K(Z|W) = K(f) + K(Z|W,f)$ will be large and compositionality low according to $C^L(Z)$.

References

[1] Dehaene, S., Al Roumi, F., Lakretz, Y., Planton, S., & Sablé-Meyer, M. (2022). Symbols and mental programs: a hypothesis about human singularity. Trends in Cognitive Sciences, 26(9), 751-766.

Comment

Thank you for the quick replies! We'll respond to the predictability/compositionality question first, as it's quicker.

In your example each $w \in W$ is a pair of letters $(a,b)$ and $f(w) :=$

  1. Define a lookup table with the ASCII characters for $a$ and $b$
  2. Index the lookup table based on the words in $w$
  3. Concatenate the embeddings

If we applied a hash function $h$ to each $w \in W$ (let's assume that it's invertible so that we maintain perfect predictability), then the semantics would need to change. We would now have $f'(w) :=$

  0. Invert the hash to retrieve $w' = h^{-1}(w)$
  1. Define a lookup table with the ASCII characters for $a$ and $b$
  2. Index the lookup table based on the words in $w'$
  3. Concatenate the embeddings

Notice that the extra step 0 can add complexity, which will depend on $K(h^{-1})$. If the hash function was benign (e.g., swap the two letters in $w$), then it actually doesn't matter (we can just define the lookup table in step 1 differently). But, if the hash function does what you say, where "it's not possible to look at any individual part of the hashed data and say that it corresponds to the $a$ or the $b$", then step 0 is actually relevant and adds complexity.

So, in your example, our intuition is the same as yours and our $C^L(Z)$ agrees with it: the hash scenario is less compositional because $K(f') > K(f)$ (note that $K(Z|W) = K(f)$ or $K(f')$ in this case, due to perfect predictability).
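To make the two programs concrete, here is a toy sketch (ours, purely illustrative; the scrambling table stands in for the hash):

```python
# Toy rendering of the two decoders discussed above (purely illustrative).
# f decodes transparent letter-pair encodings; f_prime must first undo a
# scrambling step, and the description of that inverse is what makes
# K(f') > K(f).
def f(w):
    """Transparent case: w is a pair of letters, e.g. ('a', 'b')."""
    return [format(ord(ch), "08b") for ch in w]   # ASCII bit patterns

# Stand-in for an invertible but opaque hash over letter pairs.
SCRAMBLE = {('a', 'b'): 'Qx', ('b', 'a'): 'Zk'}   # ... extended over all pairs
UNSCRAMBLE = {v: k for k, v in SCRAMBLE.items()}

def f_prime(w_hashed):
    """Hashed case: step 0 (inverting the hash) must be described too."""
    w = UNSCRAMBLE[w_hashed]                      # extra machinery = extra K(f')
    return [format(ord(ch), "08b") for ch in w]

print(f(('a', 'b')))          # ['01100001', '01100010']
print(f_prime('Qx'))          # same output, longer program description
```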

Comment

Given the extensive exchange above, we would be curious to know whether our comments have addressed your concerns about predictability vs. compositionality with regards to $C^L$ and, more generally, if there are any further points that would increase your support for our paper.

Comment

OK, but just to clarify---should I interpret this comment as adding additional stipulations to the definition that aren't present in the paper? Either:

  • Models "aren't overfit". What makes scheme 1 overfit and scheme 2 not overfit? More generally, what does "overfitting" mean here, given that none of the definitions presently make any reference to generalization?

  • The particular programming model with respect to which we're measuring $K$ complexity happens to have a short program for Dirac deltas and a long program for Gaussians. Kolmogorov complexity is typically only defined up to a constant factor, so if these constant factors are really important for choosing solutions to Eq 1, we're not really talking about Kolmogorov complexity in a general sense---we're tied to some specific model of computation, but it's currently not clear what that is (or why it's better than any of the other models that algorithmic information theory generally treats as interchangeable).

Again, I don't think these issues are insurmountable! But they really need to be locked down if the goal is to enable apples-to-apples between different sets of representations. My concern is that, even if we can agree about how to break the tie for this specific hypothetical, I'd be worried that there are other distributions that still have identifiability problems, and we need yet a different set of extra rules to break the tie in the right way.

Comment

Should I interpret this comment as adding additional stipulations to the definition that aren't present in the paper?

The comment about scheme 1 and 2 having different total complexities is a consequence of elements already in the paper (they give different results for Eq. 1). You're right that Kolmogorov complexity is defined up to a constant factor and that the difference in complexity between defining a Dirac and Gaussian in code is very small for most programming languages (we're splitting hairs here for a very specific example).

The comments we made about "overfitting" are not elements of the main paper. Those were about the actual implementation of our framework to try and estimate $C(Z)$ in the wild using deep learning tools, which we only briefly discuss (one possible approach) in Appendix 2, and which we do not attempt to implement. We only bring it up because whether or not the definition proves useful will depend on whether estimates of $C(Z)$ based on deep learning tools adhere to intuitions about compositionality.

Overall, we acknowledge that when we bring in things like labels and delineations of the program components to make the definition identifiable, things get more complicated, and it might be possible to further refine the definition to handle these things more cleanly as you say (especially in corner cases like the ones you bring up). Like you, we are optimistic that the definition can still be further built upon, and we are hoping we and others will be able to do so more easily by getting into these dialogues at the conference. We also want to reiterate that this discussion with you has already been very helpful, and we greatly appreciate all the critical thought you're putting into our work.

Comment

Now responding to your new identifiability argument. Please correct us if we've misunderstood your example, but we believe the two different schemes here have different total complexities, and so only one of them satisfies Eq. 1 (scheme 2 in this case).

For scheme 1

  1. $p_w$ is (a discrete floating-point approximation of) a Gaussian
  2. $\mathbb{E}[K(w|p_w)]$ is the entropy of the Gaussian
  3. $f$ is trivially simple as it just outputs its input, so $K(f) \approx 0$
  4. $\mathbb{E}[K(z|w,f)] = 0$ because $w = z$

For scheme 2

  1. $p_w$ puts all its mass on a single $w$ of constant symbols
  2. $\mathbb{E}[K(w|p_w)] = 0$ as the trivial constant $w$ has probability $1$
  3. $f$ is trivially simple as it just outputs a constant $(0,1)$, so $K(f) \approx 0$
  4. $\mathbb{E}[K(z|w,f)]$ is the entropy of a Gaussian (the same as step 2 of scheme 1)

Comparison of the two schemes

In the end, scheme 1 step 2 cancels with scheme 2 step 4, and then in both schemes the remaining steps contribute zero complexity except for step 1. For scheme 1, $K(p_w) =$ the complexity of defining a Gaussian distribution. For scheme 2, $K(p_w) =$ the complexity of a point-mass distribution, which is simpler than a Gaussian. We therefore have that scheme 2 is the only valid decomposition according to Eq. 1 in our paper, and compositionality is therefore $C(Z) = K(Z|W,f) / K(Z|W,f) = 1$ (as we would intuitively expect, structureless noise is not compositional).

Again we can explain what's going on by thinking of a mixture of Gaussians, with modes indexed by $W$. If there are no such modes (as in structureless noise), then $p_w$ and $W$ will not pay the complexity cost of representing other bits in $Z$, since they can simply be captured by the error term.

Comment

Thanks for the quick reply! The above description is exactly right; the issue is:

the complexity of a point-mass distribution, which is simpler than a Gaussian

In what sense is a point mass simpler than a Gaussian? Both are computed by constant-sized programs, and I thought the proposal above was to implement them as neural sequence models from the same class.

Comment

Both are constant-sized programs, but if we're talking about the formal definition and whether it is identifiable in this thought experiment, the length of that constant-sized program needs to be taken into account.

For actual implementation with neural networks, there would be no issue for your example: neural networks for $p_w$ and $f$ (if not overfit) would just learn a point-mass distribution for $p_w$ and a constant output for $f$. This is because such a solution would give the lowest training error $K(Z|W,f)$ that can be achieved without overfitting to $Z$.

Comment

Hi---thanks for all the helpful responses! It seems like both of the issues I raised above really stem from the same source, which is a set of unstated assumptions about which functions are "simple" or "complex". These assumptions are in fact critical for determining which representation systems are compositional or non-compositional. In particular:


Compositionality vs predictability: Sorry if this wasn't clear; I was imagining applying the hash after performing the lookup (so calling $H$ rather than $H^{-1}$), but I think it also works the way you've described here. The only difference between the two schemes is a single invocation of the (inverse) hash function, which takes constant space to represent, so the difference between the two schemes will in general be negligible for large numbers of representations. (And who's to say we're not computing $K$ w/r/t some esoteric programming language where $H$ gets applied by default on top of every function call, so that scheme 1 actually has a longer program?)

Here I think it would be really helpful if you could give an example of two systems, both of which have the property that representations are predictable from encodings, but where the difference in compositionality between the two systems can be made arbitrarily large.


Identifiability: It seems like we're on the same page here, in the sense that we agree that Eq 1 has two minima, one saying random noise is highly compositional, and the other saying that random noise is highly non-compositional. The only way to break the tie is to stipulate that some kinds of distributions (arbitrarily) have longer descriptions than others.


One of the earlier comments refers to these points as "splitting hairs", which I really want to push back against. In a "formal definition of compositionality", the formalism needs to be fully spelled out. Right now, it takes several paragraphs of handwaving and special constant factors to correctly describe a set of i.i.d. samples from a Gaussian---an extremely simple and extremely fundamental object of study. Standard ways of measuring Kolmogorov complexity ignore these constant factors altogether, but here they are critically necessary. But right now there's nothing stated explicitly enough to argue about, and I don't know how I would generalize the stipulations given in the comments above to any other distribution.

The authors agreed in a previous comment that

Kolmogorov complexity is defined up to a constant factor

But they also agree that many crucial predictions from this approach depend on constant-factor differences in Kolmogorov complexity. So the approach is not well-defined.

This is a fundamental problem that must be fixed before we have the "formal theory of compositionality" promised by the title. All that said, I hope this discussion has been helpful! If the paper gets in this cycle, I hope the authors at least see fit to mention these issues in a footnote, and if not I look forward to seeing how they are resolved in a future version of the paper.

I really want to echo what the other reviewers have said about this being a clearly written and conceptually interesting piece of work. I would love to see it at a future ML conference, where I think it has the potential to be one of the better pieces of work. But I also feel very strongly that it should not be published in its current form.

Comment

Thank you for your continued interaction!

Overall, we believe that we’re on the same page with respect to the tricky parts of the definition. The identifiability issues you have raised are important to resolve in a neater way that rests on simpler arguments than the ones provided in our responses (although we stand by those arguments). The limitations that apply to Kolmogorov complexity in general, which you’ve mentioned (e.g., being defined only up to a constant, dependence on the programming language), also apply to our work, which makes significant use of Kolmogorov complexity.

Where we still disagree, of course, is whether or not the paper is ready for publication despite these limitations. We believe that it is.

  1. The conceptual link between compositionality, compression, and meaning being a simple function of parts has been hinted at in some prior work, but we believe we are the first to dissect it in a systematic and direct way, with both theoretical machinery and concrete examples. As a whole, we believe that this perspective is both novel and useful to the field, with the potential to spawn follow-up works that build on our ideas.
  2. Our definition of language system compositionality, which pertains to a W and Z that are both given (a setting that the reviewer considers important), does not have identifiability issues in the current version of our paper.
  3. We believe that using deep learning machinery to estimate compositionality as it is conceptualized in our paper will not suffer from identifiability issues in practice, as specifying architectures for different components of the generative model imposes significant additional constraints. We also believe that it will give results consistent with intuition; the framework resembles reconstruction-based methods described by the reviewer which have already proven useful in estimating compositionality, but without making strong assumptions such as linear additivity of part embeddings that our definition suggests are unnecessary.

All this being said, we are very grateful to the reviewer for their excellent feedback and are actively working on addressing it. We perhaps should have put more emphasis on the “Towards…” part of our title (we will take the reviewer’s advice into account and do so in the next version we upload), and we hope that we can refine the compression-based perspective of representational compositionality through more productive conversations like these.

Note regarding a concrete example where two representations that are equally predictable from their parts can have arbitrarily large differences in compositionality

We believe that our paper already contains such examples using synthetically-generated representations. For instance, consider lookup table representations with varying degrees of “disentanglement”, where the representation is a function of n-grams. If n is made larger, the lookup table defining f becomes more complex (has more parameters needed to define it) and the representation is less compositional. We don’t have to think too hard about the Kolmogorov complexity in this case; if a lookup table grows substantially in size, a longer program will be needed to define it in any reasonable programming language (in short, a large lookup table is less compressible). While we can always consider an esoteric programming language in which the large lookup table is pre-defined, this sort of thing is a limitation of Kolmogorov complexity in general, not only of our work, and it is more artificial than substantial (no such programming languages actually exist). If we consider increasingly long sentences W, we can make the complexity of f arbitrarily high (and compositionality arbitrarily low) for large n (but not n=1), despite these cases all having equal predictability (they are all deterministic). Our synthetic grammar experiments admit similar examples (e.g., grammars with arbitrarily many production rules defining f).
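To spell out the scaling, here is a toy sketch (the variable names and the concatenation scheme are illustrative, not our exact synthetic setup): the representation stays a deterministic function of W for every n, but the lookup table defining f has V^n entries, so the program describing f grows quickly with n.

```python
from itertools import product
import numpy as np

rng = np.random.default_rng(0)
V, L, d = 10, 6, 8  # vocabulary size, sentence length, embedding dim per n-gram

def build_table(n):
    """Lookup table assigning an arbitrary embedding to every possible n-gram."""
    return {gram: rng.standard_normal(d) for gram in product(range(V), repeat=n)}

def represent(sentence, table, n):
    """Deterministic representation: concatenate embeddings of consecutive n-grams."""
    grams = [tuple(sentence[i:i + n]) for i in range(0, L, n)]
    return np.concatenate([table[g] for g in grams])

for n in (1, 2, 3):
    table = build_table(n)
    z = represent([0, 1, 2, 3, 4, 5], table, n)  # Z is perfectly predictable from W for every n
    print(f"n={n}: lookup table defining f has {len(table):>4} entries (V**n)")
```

The per-example predictability is identical across n (zero error in every case); only the description length of f changes, which is exactly the quantity our compositionality score is sensitive to.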

Review
8

The authors present a definition for representational compositionality, aiming to turn an old and only vaguely defined concept into a precise mathematical definition. They ground their definition in Kolmogorov complexity and demonstrate that for synthetic datasets, this quantity conforms well with intuitive expectations of how representational compositionality should vary with different hyperparameters. They then consider two empirical representations and use an approximation to Kolmogorov complexity to compute their compositionality.

Strengths

I think this is a generally well-written and thought-provoking paper. Compositionality and compositional generalization are important topics across a number of different fields, and I agree that a better formal understanding of what we mean by them is important. The authors present a conceptual framework that is well thought out and clearly presented and explain the intuitions behind it well --- I really appreciated the clarity of explanation, especially given the highly conceptual and theoretical nature of the work. The empirical results demonstrate that this could be a useful metric and provide interesting use cases to think about.

A few other points I liked about the paper:

  • I liked that the authors explicitly spelled out what bearing their definition has on various related concepts in the field.
  • I appreciated the point of having a useful quantitative point of reference (i.e. compositionality of 1 indicating a complete lack of compositionality).
  • I also thought the example of the grammar representations was a compelling argument for this definition.

Weaknesses

I think the manuscript in its current form has two points that weaken its case for the proposed definition. One concerns discussing how this definition connects to compositional generalization and one concerns further demonstrating its practical usefulness and providing a concrete way for follow-up work to use this definition.

Relationship to systematic generalization

As the authors note in their introduction, compositionality is a highly relevant topic in part because it can enable systematic out-of-distribution generalization. This suggests a complementary approach to the question, i.e. compositional representations are representations that enable systematic generalization. There are a number of prior works that have laid out formal or conceptual frameworks relating different representations, architectures, and inductive biases to systematic/compositional generalization (albeit sometimes under different terms) [e.g. 1-4; as well as some papers that the authors already cite, e.g. Ren et al., 2023], and it would be useful to discuss at least some of them in terms of how they relate to this approach, e.g.: Do they get at the same concept of compositionality or do they address a different problem? How specific would you predict your definition to be, i.e. will neural networks that generalize well have highly compositional representations according to your definition?

I note that the authors already discuss systematicity and generalization in p. 252-262, but so far this discussion seems to focus more on generalization of the functional embedding than on the entire network of which Z may be a part (or on the input).

Practical usefulness

I found the authors' suggested framework in Appendix B very intriguing and I think having such a concrete method for estimating compositionality could further increase the impact of the authors' proposed definition. I also think empirically evaluating the relationship between representational compositionality and generalization would be really interesting (e.g. across different network architectures that yield better or worse generalization, is there a correlation between compositionality and generalization?). I think adding this kind of practical method or evaluation would further increase the impact of this paper. I also understand, though, that addressing this may not be realistic during the revision period, and I think the paper is still interesting and valuable without this --- as the authors note, the primary contribution of this paper is of a theoretical/conceptual nature.

  1. Hupkes, Dieuwke, et al. "Compositionality decomposed: How do neural networks generalise?." Journal of Artificial Intelligence Research 67 (2020): 757-795.
  2. Jarvis, Devon, et al. "On the specialization of neural modules." arXiv preprint arXiv:2409.14981 (2024).
  3. Abbe, Emmanuel, et al. "Generalization on the unseen, logic reasoning and degree curriculum." International Conference on Machine Learning. PMLR, 2023.
  4. Lippl, Samuel, and Kim Stachenfeld. "When does compositional structure yield compositional generalization? A kernel theory." arXiv preprint arXiv:2405.16391 (2024).

Questions

  • See weaknesses.
  • To what extent are the changes in topological similarity across the different lookup-table factors due to the fact that varying them implicitly changes the embedding dimensionality of each individual symbol? For smaller dimensionality (e.g. due to longer sentences), there will be more variance in how correlated different symbols are with each other, leading to lower correlation, whereas for larger dimensionality, most of them will be approximately orthogonal, leading to higher correlation, correct?
  • Could you apply the prequential coding method to the synthetic examples in section 4.1, to further validate it?
  • Could you specify the numerical values on the y axis in Fig. 2?

Minor comments:

  • L. 151: superfluous “a”
  • L. 178: “who’s” -> “whose”
  • L. 374 “rather” -> “rather than”
  • L. 400: “who’s” -> “whose”

Comment

We thank the reviewer for their excellent points, which we hope we adequately address.

What is the relationship between representational compositionality and compositional generalization, and does it provide any additional insight?

We hypothesize that representations which are compositional according to our definition should enable better compositional generalization. This is because the representation of constituent parts is systematic: the semantics mapping constituent parts to the representation is a simple function that will generalize better to novel part combinations (i.e., it will assign them a meaningful rather than arbitrary representation, which downstream functions should be able to leverage). One of our central goals for future work is to test this hypothesis empirically, where we measure the compositionalities of many model representations using our definition and then correlate this score with compositional generalization ability. We strongly believe that this work would be out of scope for our current paper, which already introduces many novel ideas.

Furthermore, we believe that representational compositionality relates to other important topics in machine learning outside of compositional generalization. For instance, representational compositionality has significant implications for generative models that reason in latent space (akin to thought in human cognition), since this generation can be done in the space of constituent parts in a way that generalizes through simple semantics.

We have now edited our paper to include these important discussions (as well as connections to papers cited by the reviewer) in a new Appendix section E, which describes hypotheses regarding compositional generalization, ideas for future work to test these hypotheses, and relations to other topics outside of compositional generalization.

Implementing the framework for estimating compositionality suggested in Appendix B and empirically evaluating the relationship between representational compositionality and compositional generalization would increase the impact of the paper.

We completely agree that both of these are very interesting research directions. We are actively working on implementing the method for estimating C(Z) described in Appendix B as well as studying the empirical relationship between C(Z) and compositional generalization (which we now discuss in a new Appendix E, as explained above). However, the core contribution of the current paper is theoretical in nature, with the goal of introducing and justifying our definitions. We wish to stress that these are novel ideas that require a standalone paper; measuring representational compositionality using new deep learning tools and evaluating its empirical relationship to compositional generalization would themselves be novel and substantive contributions.

Is topological similarity very sensitive to the dimensionality of the word embeddings, given that in high-dimensions vectors tend to be more orthogonal?

This is absolutely correct: for example, as the reviewer points out, in the sentence-length experiments, having shorter sentences with higher-dimensional word embeddings will lead to the resulting representations more often being orthogonal. In general, topological similarity is highly sensitive to dimensionality effects (like orthogonality in high dimensions) that greatly influence geometry and distance metrics. This is an important part of the motivation to define a novel metric that does not suffer from this and other flaws.
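As a quick illustration of this effect (not taken from the paper), pairwise cosine similarities of random unit vectors concentrate around zero as dimensionality grows, which directly shifts any metric built on correlations between pairwise distances:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500  # number of random representations
for d in (4, 32, 256, 2048):
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # random unit vectors
    cos = X @ X.T
    off_diag = np.abs(cos[~np.eye(n, dtype=bool)])
    print(f"d={d:>4}: mean |cosine| = {off_diag.mean():.3f}, max = {off_diag.max():.3f}")
```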

Could you apply the prequential coding method to the synthetic examples in Section 4.1, to further validate it?

Thank you for the suggestion. We are actively running these experiments and will reply in this thread when we have edited the paper accordingly. In the past, we have looked at the effect of disentanglement on prequential code length and found that it strongly correlated with the true underlying complexity K(Z|W), so we expect this to hold with our other results.
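For readers unfamiliar with the method, here is a minimal sketch of the prequential coding idea (illustrative only: it substitutes a ridge-regression predictor with a Gaussian residual code for a neural sequence model, and assumes W has already been converted to a numeric feature matrix; all names are hypothetical):

```python
import numpy as np

def prequential_code_length(W, Z, block_size=100, lam=1e-2, delta=0.01):
    """Prequential (online) code length of Z given W, in bits: each block of Z is
    encoded with a model fit only on previously seen (W, Z) pairs, then added to
    the training set. Z is treated as quantized to bins of width `delta`."""
    n, k = W.shape
    total_bits = 0.0
    for start in range(0, n, block_size):
        W_seen, Z_seen = W[:start], Z[:start]
        W_new, Z_new = W[start:start + block_size], Z[start:start + block_size]
        if start == 0:
            resid, sigma2 = Z_new, 1.0                     # first block: unit-Gaussian prior
        else:
            A = W_seen.T @ W_seen + lam * np.eye(k)
            B = np.linalg.solve(A, W_seen.T @ Z_seen)      # ridge regression weights
            resid = Z_new - W_new @ B
            sigma2 = max(np.mean((Z_seen - W_seen @ B) ** 2), delta ** 2)
        # -log2 of the (quantized) Gaussian likelihood of the new block's residuals.
        nll_nats = 0.5 * np.sum(resid ** 2 / sigma2 + np.log(2 * np.pi * sigma2))
        total_bits += nll_nats / np.log(2) + resid.size * np.log2(1.0 / delta)
    return total_bits

# Toy check: Z that is a simple function of W should be much cheaper to encode
# given W than structureless noise of the same shape.
rng = np.random.default_rng(0)
W = rng.integers(0, 2, size=(1000, 20)).astype(float)
Z_structured = W @ rng.standard_normal((20, 8)) + 0.1 * rng.standard_normal((1000, 8))
Z_noise = rng.standard_normal((1000, 8))
print(f"structured Z: {prequential_code_length(W, Z_structured):,.0f} bits")
print(f"noise Z:      {prequential_code_length(W, Z_noise):,.0f} bits")
```

The bookkeeping (encode each new block under a model trained only on previously seen data, then fold the block into the training set) is the essential part; the choice of predictor is where a neural sequence model would enter.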

Could you specify the numerical values on the y axis in Fig. 2?

We did not include these mainly for aesthetic reasons, since it would have compressed the graph horizontally as we would have required different scales for each plot, and different scales for each metric (representational compositionality (ours) and topological similarity). We didn’t feel this was an issue, as the only thing we wanted to highlight in these experiments was the trends in compositionality scores rather than absolute values. As a compromise, though, we have now updated Fig. 2 to include numerical values showing the min/max of each metric for each plot, which preserves the horizontal space of the plots. Please let us know if this is sufficient, and if not we can attempt to do some more reorganization of the figure as a whole to make more space for numerical values along the entire axes.

Comment

Thank you for your response, which has addressed my questions. I have also read through the other responses. Overall, I think this is an interesting conceptual contribution to the field. My sense is that it is usually somewhat difficult to connect Kolmogorov complexity to concrete measures (though I note that I'm only superficially familiar with these methods) and so I think that without concrete methods for estimating this, debates about whether the measure is well-defined will easily arise (as they have with reviewer ddLm). From my understanding, the current paper certainly does not settle this debate, but I think it introduces a novel and interesting definition and provides a coherent argument for this definition. I therefore think it should be accepted and have increased my score to "accept" to indicate so.

Comment

We thank the reviewer for their assessment of our responses as well as their insightful comments on our paper, and are grateful for their score adjustment. We remain available for additional discussion if the reviewer should have additional questions/concerns.

Public Comment

We thank the authors for this interesting submission. While the authors propose an interesting definition of compositionality based on representations and Kolmogorov complexity (the main novel contribution of this paper), we would like to bring to their attention previous definitions of compositional functions such as the one discussed in Pagin & Westerståhl (2010), which has recently been built upon to provide a quantification of compositionality (Ram et al., 2024). The generality of their definition is demonstrated by applying it to various off-the-shelf sequence-processing models such as recurrent, convolutional and transformer-based neural networks. Their notion of compositional complexity is then theoretically tied to the expressivity and systematic generalization of any compositional model.

A new definition is always interesting and can provide a new perspective, but it is also useful to understand how they compare and contrast to existing mathematical definitions.

P. Pagin, and D. Westerståhl. "Compositionality I: Definitions and variants." Philosophy Compass 5.3 (2010): 250-264. https://doi.org/10.1111/j.1747-9991.2009.00228.x

P. Ram, T. Klinger, and A. G. Gray. "What makes Models Compositional? A Theoretical View." IJCAI 2024. https://www.ijcai.org/proceedings/2024/0533.pdf

Comment

Thank you for the comment and for your interest in our work! Compositionality of functions is indeed a very important topic in machine learning, and we agree that it can be and has been nicely formalized using ideas from compositional generalization. However, what we attempt to define here is the compositionality of representations, rather than of functions. For instance, a representation might be the latent activity in some intermediate layer of a neural network, or in some brain region. There is no current definition for the compositionality of representations, although we have intuitions about them (for instance, "disentangled" representations are a paradigmatic example). In this paper, we are attempting to formalize and generalize these intuitions.

Note that in our reply to other reviewers and in our updated paper, we plan to clarify the relationship between representational compositionality (as we define it) and compositional generalization, and indeed this will relate to the "function compositionality" of the function processing the representation. In particular, we hypothesize that increased representational compositionality puts fewer demands on the structure (or compositionality) of the downstream function required for compositional generalization. As we will explain in those other responses, however, representational compositionality also relates to other important topics apart from compositional generalization (e.g., generative models in latent space, multi-task adaptation, etc.). Ultimately, we believe that both of these notions of compositionality (of representations and of functions) are crucial to understand and formalize for machine learning and cognitive science.

Public Comment

We thank the authors for taking the time to engage with us, and for the clarification. To the best of my understanding, this paper then focuses on the same topic as Wiedemer et al. (2023) (cited in the current paper), where they define compositional representations (based again on the intuition of disentanglement and identifiability) and provide subsequent theoretical compositional generalization guarantees.

Wiedemer, T., Mayilvahanan, P., Bethge, M., & Brendel, W. (2023). Compositional generalization from first principles. Advances in Neural Information Processing Systems, 36. https://openreview.net/pdf?id=LqOQ1uJmSx

Thank you again for your clarification!

Comment

We greatly appreciate the feedback from the reviewers, and the degree to which they engaged with our paper. We would like to highlight that reviewers recognized that a formal definition of representational compositionality is important in machine learning, and that our work brings a novel, thought-provoking, and useful perspective. In fact, most of the weaknesses and questions raised were rooted in scientific curiosity about the work rather than dismissals of it, prompting the kinds of discussions that one might have at a poster session: how the proposed definition relates to topics like out-of-distribution generalization, what kinds of inductive biases for compositionality can be inspired by the definition, and so on. We found these discussions very fruitful and stimulating (as we hope the reviewers do), and believe that a fundamental goal of conference publications is to prompt productive discussions of this nature. We believe the changes we made to the paper as a result of this exchange considerably improve it, and that it would be valuable to share these ideas with the general community, as compositionality is a timely topic in AI with a need for formal grounding.

Below we summarize common points raised by reviewers and the steps we have taken to address them, with more details in our responses. With these changes, we believe that we have addressed the reviewers’ main concerns.

What is the relationship between the proposed definition of representational compositionality and compositional generalization, and does it provide any additional insight? [6ztP04, DENj02]

We hypothesize that representations which are compositional according to our definition should enable better compositional generalization. This is because the representation of constituent parts is systematic: the semantics mapping constituent parts to the representation is a simple function that will generalize better to novel part combinations (i.e., it will assign them a meaningful rather than arbitrary representation, which downstream functions should be able to leverage). One of our central goals for future work is to test this hypothesis empirically, where we measure the compositionalities of many model representations using our definition and then correlate this score with compositional generalization ability. We strongly believe that this work would be out of scope for our current paper, which already introduces many novel ideas.

Furthermore, we believe that representational compositionality relates to other important topics in ML outside of compositional generalization. For instance, representational compositionality has significant implications for generative models that reason in latent space (akin to thought in human cognition), since this generation can be done in the space of constituent parts in a way that generalizes through simple semantics.

We have now edited our paper to include these important discussions in a new Appendix section E, which describes hypotheses regarding compositional generalization, ideas for future work to test these hypotheses, and relations to other topics outside of compositional generalization.

Sections 2 and 3 introduce significant novel vocabulary that negatively impacts clarity. [DENj02, N71Y27]

Reviewers made several suggestions regarding how to improve the clarity of Sections 2 and 3. We have followed their suggestions by:

  1. Improving Fig. 2 to (a) show where representations Z come from and (b) better define the various components of the compression program.
  2. Adding a summary table to Section 2 that specifies each term we introduced along with its symbol and, crucially, a concrete example.
  3. Defining the terms “modularity” and “systematicity” when they are first introduced in Section 3.
  4. Adding a new figure to Appendix D that visually illustrates the concrete examples of high/low representational compositionality to provide better intuition.

What are the practical implications of the definition for improving ML models? [DENj02, N71Y27, 6ztP04]

Reviewers wondered how insights from our definition of representational compositionality could improve ML models, particularly in regards to compositional generalization.

First, as mentioned above, we have added a new Appendix section E detailing theoretical connections between representational compositionality and compositional generalization, as well as how future experiments can empirically test these connections.

Second, using our definition it is possible to (1) validate existing ML methods designed with compositionality in mind (e.g., object-centric representations) and (2) design theoretically-motivated inductive biases for compositionality. While pursuing these directions is out of scope for our current paper (which aims to introduce novel theory), we have added a new Appendix section F which describes two inductive biases directly inspired by our theory, which we hope can be explored in future work by us and by others.

AC Meta-Review

(a) summary

This paper investigates how to define and measure compositionality mathematically. It defines a compositional representation as a function of components, and proposes a novel metric to measure representational compositionality from the perspective of algorithmic information theory. Experiments on both real and synthetic data are conducted to validate the definition.

(b) strengths

  • The problem formulation and the approach of studying compositionality from the perspective of algorithmic information theory are interesting.
  • The proposed metric provides not only a binary indication but also a degree of compositionality.
  • It is well-written and easy to follow.

(c) weaknesses

  • It needs more discussion on how the metric can inspire new algorithms for improving compositional generalization.
  • Some assumptions in the theory need to be justified, such as those about the distributions of representations.
  • It does not conduct experiments with more practical models such as LLMs to demonstrate its usefulness.
  • It is not clear how to connect the proposed definition to other formal theories of compositionality, for example by comparing with a missing reference: Jacob Andreas, "Measuring compositionality in representation learning", ICLR 2019.

(d) decision

Although the high-level idea of the paper is interesting, the definition and mathematical claims in the current form are not solid enough for publication at ICLR. Please keep the reviewers' comments in mind when preparing a future version of the manuscript.

Additional Comments from Reviewer Discussion

The reviewers agree that the paper is well-motivated by a formal definition of representational compositionality, and that it brings a novel, thought-provoking, and useful perspective. However, the reviewers also shared some concerns about how to connect the proposed definition to other formal theories of compositionality, and how to apply it to designing new algorithms. The authors' rebuttal helped address some of the concerns; the reviewers still think the paper needs another round of revision and review to make it stronger.

Final Decision

Reject