PaperHub
5.5 / 10
Poster · 4 reviewers
Reviewer scores: 4, 3, 3, 2 (min 2, max 4, std 0.7)
ICML 2025

Contextures: Representations from Contexts

OpenReview | PDF
Submitted: 2025-01-22 · Updated: 2025-07-24
TL;DR

We develop the contexture theory that clarifies that representations are learned from the association between the input and a context variable.

Abstract

Keywords
representation learning, pretraining, learning theory, foundation models, scaling law

Reviews and Discussion

Review
Rating: 4

The authors propose a framework for understanding representation learning through the concept of contextures, which are the top singular functions of an operator induced by a context variable. The goal is to characterize learned representations across supervised, self-supervised, and manifold learning paradigms. The main contributions are: a unifying theoretical framework for representation learning, a demonstration that scaling model size yields diminishing returns once optimal contextures are approximated, and a task-agnostic metric for evaluating the usefulness of contexts. The study shows that learned representations align with the top singular functions, and the proposed metric correlates well with downstream performance across multiple datasets.

Questions for Authors

  1. What is the computational cost of estimating the top singular functions for large-scale datasets? A discussion on efficiency and scalability would clarify the feasibility of implementing contextures in practical settings.
  2. How does the proposed task-agnostic metric compare to existing evaluation methods for representation learning (e.g., probing classifiers, alignment metrics)? A direct comparison would help justify the usefulness of the new metric.
  3. How does the contextures framework perform when the context is noisy? Would the learned representations still be useful, or would they degrade in quality?

Claims and Evidence

The authors claim that contextures unify representation learning by framing learned representations as top singular functions of a context-induced operator. They provide theoretical proofs, empirical validation showing neural networks approximate these functions, and a task-agnostic metric that correlates with downstream performance.

Methods and Evaluation Criteria

The proposed methods and evaluation criteria are well-suited for analyzing representation learning through the contextures framework. The use of singular function decomposition provides a theoretical foundation, and the proposed task-agnostic metric aligns with the goal of evaluating context usefulness. The correlation analysis with downstream performance is also strong.

Theoretical Claims

No immediate flaws are evident, and the theoretical claims appear sound.

Experimental Design and Analysis

  1. The paper evaluates the contextures framework through neural network scaling experiments to test its alignment with theoretical predictions.
  2. The authors compare learned representations with the top singular functions of the context-induced operator in order to validate their claims on scaling limits.
  3. The authors also evaluated their proposed task-agnostic metric by measuring its correlation with downstream task performance across multiple datasets.
  4. The use of canonical correlation analysis and KNN-based metrics provides a structured approach to evaluating representation quality.

Supplementary Material

I reviewed the supplementary material, namely the theoretical proofs, experimental details, and extended discussions on contextures.

Relation to Prior Work

The framework proposed by the authors utilizes singular functions to characterize learned representations, which aligns with SVD techniques used in Principal Component Analysis (PCA) and Latent Semantic Analysis (LSA). The proposed task-agnostic metric for evaluating context quality is conceptually similar to CCA.

Missing Essential References

The paper's key contributions are grounded in established concepts in representation learning, particularly spectral methods and feature space learning. Most of the essential references were cited in the paper.

Other Strengths and Weaknesses

Strengths:

  1. Clear description of background knowledge and motivations needed to understand the proposed representation model.
  2. Clear exposition of the proposed method.
  3. The authors introduce the contextures framework that provides a unified mathematical perspective on representation learning across multiple paradigms.
  4. The analysis of scaling laws and the diminishing returns of increasing model size offers valuable insights for future neural network training strategies.
  5. The proposed task-agnostic metric for evaluating context usefulness is a great contribution.

Weaknesses:

The paper could provide more discussion on the computational complexity and scalability of the proposed methods.

Other Comments or Suggestions

  1. Overall the paper is well-written.
  2. Consider discussing potential limitations of the contextures framework in handling highly heterogeneous data distributions.
Author Response

Thank you for your thoughtful review. We are glad that you find our theoretical analysis and the proposed metric valuable. We would like to answer your questions as follows:

1. Efficiency

For extracting the top singular functions, the time complexity of kernel PCA is the same as that of an eigendecomposition, which is about $O(m^3)$, where $m$ is the number of pretraining samples. The time complexity of extracting the singular functions using a pretraining objective (the deep learning method) depends on the optimizer, and there is a rich body of work on the convergence rate of popular optimizers such as Adam and SGD. Moreover, there exist many deep-learning-based methods that can extract the eigenfunctions, including VICReg, which is used in the experiments, NeuralEF (Deng et al., 2022), and NeuralSVD (Ryu et al., 2024). The complexity of these methods grows linearly with the size of the dataset and does not suffer from the scalability issues of kernel PCA. For instance, for a mini-batch of size $B$, NeuralSVD only has complexity $O(B^2 d + d^2 B)$, where $d$ is the dimension of the representation. In other words, the deep learning method is scalable and much faster than kernel PCA, though it is not guaranteed to find the exact singular functions because the optimization problem is non-convex. We will add this discussion to the paper.

Deng et al. "NeuralEF: Deconstructing kernels by deep neural networks." ICML 2022.

Ryu et al. "Operator SVD with neural networks via nested low-rank approximation." ICML 2024.
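
To make the complexity contrast concrete, here is a minimal kernel-PCA sketch; the RBF kernel, sample size, and dimensions are illustrative assumptions, not the paper's actual contexts or code.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 32))             # m pretraining samples (illustrative size)

def rbf_kernel(A, B, gamma=0.05):
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq)

# Exact route: kernel PCA requires the full m x m Gram matrix and an
# eigendecomposition, i.e. O(m^2) memory and O(m^3) time, which is what
# limits the kernel-PCA comparisons to ~30K samples.
K = rbf_kernel(X, X)
m = K.shape[0]
H = np.eye(m) - np.ones((m, m)) / m          # centering matrix
evals, evecs = np.linalg.eigh(H @ K @ H)     # eigenvalues in ascending order
top_d = evecs[:, ::-1][:, :128]              # sample values of the top-128 eigenfunctions

# Deep-learning route (VICReg / NeuralEF / NeuralSVD): train an encoder on
# minibatches instead; per-step cost is roughly O(B^2 d + d^2 B) for batch
# size B and representation dimension d, so total cost grows linearly with
# dataset size, at the price of only approximating the eigenfunctions.
```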

2. Evaluation metric

We evaluate a context with a task-agnostic metric because we would like the encoder to be transferable to various tasks. Moreover, foundation models are often applied to tasks they were not designed for, so we might not know all the tasks at pretraining time. By contrast, a probing classifier only evaluates the context on a specific task, so it cannot reflect the transferability of the encoder. There is a strong connection between our theory and alignment metrics, which we discuss in Section 4.2. Alignment metrics compare two encoders (focusing on their relative relationship), while our metric evaluates the contexture of one context (assessing its intrinsic information-capturing properties).

3. Noisy contexts

The effect of a noisy context depends on how the noise is defined. We can think of two possible definitions.

  1. The noise reduces the compatibility between the context and the task (Section 4.1). In this case, our analysis shows that the learned representation will have a lower quality.
  2. The noise weakens the association between $X$ and $A$. Section 5 shows that a good context should have a moderate association. If the association level of the original context is already moderate, then the noise will make it worse. However, if the association level of the original context is strong, then adding noise might even have a positive effect. It is widely observed in self-supervised learning that strong augmentations make the learned representations better; for example, the authors of SimCLR attribute its success largely to the aggressive crop ratio and color distortion it uses.

4. Paper updates

We added a new limitation paragraph in the conclusion (see below), and a new related work section (please see our response to reviewer R5n1).

We appreciate your feedback and hope that our response answers all your questions. We are happy to address any follow-up questions.

Updated limitation and open problem paragraph (to be added to conclusion)

Our analysis has three limitations, which lead to three open problems. First, our analysis focused on the minimizers of the objectives. However, Cohen et al. (2021) showed that deep models trained by popular gradient methods do not find the minimizers, but instead oscillate around the edge of stability. The open problem is how this phenomenon affects our results. Second, we did not discuss the impact of the inductive bias of the model architecture, such as the translation invariance of convolutional neural networks. Such inductive biases can affect the context and, therefore, the encoder. We pose how to integrate the effect of these biases into our theory as an open problem. Third, our theory assumes that $P_X$ is fixed. In practice, however, there is always a data distribution shift from upstream to downstream. Refining our theory to handle such distribution shifts is an exciting direction for future work.

Review
Rating: 3

The authors propose a framework that encapsulates common representation learning methods as learning the joint distribution of inputs and contexts. This joint distribution can be decomposed with an eigendecomposition, which shows that optimal representations are learned by recovering the subspace spanned by the top-d singular functions. Thus many tasks benefit from the learned representation whenever the encoder recovers the span of the top-d singular functions. The authors explain that contexts with either too weak or too strong an association with the inputs are less effective, and propose a task-agnostic metric to evaluate context usefulness. Finally, the authors show empirically that the proposed contexture metrics correlate with downstream performance.

Questions for Authors

N/A

Claims and Evidence

The observation on optimal representation under learning the contextures is theoretically supported.

Methods and Evaluation Criteria

Yes. The authors compare a variety of context types and there are also controlled experiments on the level of association between inputs and contexts.

Theoretical Claims

I closely verified section 2.

Experimental Design and Analysis

Figures 1 & 2 make sense.

Supplementary Material

No.

Relation to Prior Work

This paper proposes a theoretical framework of contextures which explains why the learned representations are useful for downstream tasks. I see this framework as quite generally applicable to a variety of settings, as listed in Section 2.1.

Missing Essential References

Canatar, A., Bordelon, B. & Pehlevan, C. Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks. Nat Commun 12, 2914 (2021). https://doi.org/10.1038/s41467-021-23103-1

  • This paper discusses similar ideas of learning via eigenfunction alignment.

Other Strengths and Weaknesses

Strength

  • The contexture framework provides a perspective for understanding why representation learning is useful for downstream tasks. For example, it is not clear why the representation learned via self-supervised learning in vision helps reduce generalization error on downstream supervised tasks. The top-d singular function learning perspective provides an angle for understanding these phenomena.

Weakness

  • Figure 1 results are a bit weak. It is still unclear to me why the correlation decreases as width goes up.

Other Comments or Suggestions

N/A

Author Response

Thank you for your review.

1. Experiments in Section 4.2 (scaling) are updated

We changed the embedding dimension from $d=16$ to $d=128$ (which is more common in practice) and reran the experiment. The results are plotted at https://i.postimg.cc/FRncq3Zb/alignment.jpg

The main observation is that the alignment becomes much higher (CCA generally $>0.8$ and can be $\approx 0.9$). Also, as the model becomes wider/deeper, the alignment first increases and then decreases. We interpret the results as follows:

  • Effect of $d$: The previously used $d=16$ was too small. For instance, the $16^{th}$ eigenvalue is quite large ($>0.3$) and close to the $25^{th}$ eigenvalue. Thus, it is hard to distinguish which of the top-$25$ singular functions belong to the top-$16$, which is why the original alignment was low. In contrast, the $128^{th}$ eigenvalue is close to zero, so when $d=128$, the network can learn the top singular functions quite well. Prior work mostly used $d \ge 128$ (e.g., Kornblith et al., 2019; Huh et al., 2024), so using $d = 128$ is more reasonable. Moreover, CCA $\approx 0.9$ is considered very high in prior work.
  • Alignment trend: When the model gets larger, the alignment first increases, because the function class of $\Phi$ becomes closer to the entire $L^2$ space, so the minimizer of the pretraining objective is closer to the top eigenfunctions. However, once the model is already large enough, making it larger makes optimization harder. Hence, with the same number of pretraining steps, a larger model will be farther away from the minima, and the alignment decreases.

2. Updated related work

Based on the suggestions of the reviewers, we extend the discussion of the present paper's relation to prior work as follows; this will be moved to the main body in the final version.

Related Work

Understanding the representations learned by an encoder has long been a central topic in machine learning research. Prior work has developed quite different theoretical frameworks for different pretraining methods, but these frameworks are typically not transferable across different settings. This paper provides a unified framework that has a very broad scope, covering manifold learning, SSL based on augmentation, supervised learning, graph representation learning, and more.

In the pre-deep-learning-boom era, early work on manifold learning revealed the connection between representation learning and extracting the top eigenfunctions of a kernel (Bengio et al., 2004; Coifman & Lafon, 2006). Moreover, using the eigenvectors of the graph Laplacian as node representations on graphs was a classical technique in graph applications (Belkin & Niyogi, 2002; Zhu et al., 2003).

On the theoretical understanding of deep representations, there are two lines of work that are closely related to this paper. The first line studies representation alignment (Kornblith et al., 2019; Canatar et al., 2021; Huh et al., 2024; Fumero et al., 2024). Representation similarity has also been studied in neuroscience (Kriegeskorte et al., 2008). These papers mainly focus on comparing two representations. While we aim to evaluate a single representation and the context on which it is trained, we use the tools developed by these papers in our analysis.

The second line develops the spectral theory of self-supervised learning (SSL). SSL has achieved remarkable success in recent years (Radford et al., 2019; Chen et al., 2020; Zbontar et al., 2021; Bardes et al., 2022; He et al., 2022; Assran et al., 2023). HaoChen et al. (2021); Johnson et al. (2023) related contrastive learning to the spectrum of the augmentation graph and the positive-pair kernel, and Munkhoeva & Oseledets (2023) related SSL to matrix completion. Later, Zhai et al. (2024) extended the spectral theory to all SSL methods, not just contrastive ones. The present work further extends this theory to representation learning in its most general form, beyond SSL.

In addition, other prior theoretical work on SSL has studied its training dynamics (Damian et al., 2022; Jing et al., 2022; Tian, 2022) and built connections to information theory (Achille & Soatto, 2018; Balestriero & LeCun, 2022; Shwartz-Ziv et al., 2023).

Another line of work for characterizing the representations aims to learn disentangled or causally associated representations (Higgins et al., 2018; Scholkopf et al., 2021). It is shown that such representations can be provably recovered, provided there is sufficient variability in the environments, for instance by indexing the environments via sufficiently varying auxiliary variables or interventions (Khemakhem et al., 2020; Varıcı et al., 2024; Buchholz et al., 2023; Yao et al., 2024). Some of these results further require stringent parametric assumptions.

Reviewer Comment

Thank you for the additional results and the newly added related work section. The explanation of width scaling makes sense to me. I think the contexture theory is quite interesting, but a theory is only interesting to the extent that it is predictive of empirical performance. I am wondering if your approach can be used as a principle for model selection, or whether it can be predictive of scaling laws. There has been a recent attempt at comparing the scaling laws of a CLIP-style visual encoder and a DINO-style purely vision SSL-based encoder: https://arxiv.org/abs/2504.01017. It would be interesting to see if the proposed alignment metric can be predictive of scaling laws. This is by no means required given the limited amount of time and compute, but I'm just curious what the authors think about this. I maintain my score.

Author Comment

Thank you for your response! We are glad that you found the contexture theory quite interesting. Indeed, our aim is to build a theory with the explanatory power that can account for many empirical observations, and the predictive power that can provide insights into how to make further progress in pretraining.

The reviewer cites a very interesting paper, which shows that an encoder trained on images alone with SSL can be used with LLMs for visual QA tasks, and can even perform better than vision-language models like CLIP. They also observed that the scaling law plateaus more quickly for CLIP than for visual SSL. According to our theory, one possible reason is that the context of CLIP has a weaker association than that of visual SSL; that is, the eigenvalues of CLIP decay faster, making it easier to capture all the eigenfunctions with large eigenvalues.

Another related empirical observation was made by Huh et al. (2024), who reported that visual SSL models and vision-language models learn highly aligned representations. These phenomena might be quite baffling because images and text are commonly perceived as fundamentally different. Our theory provides an explanation: the visual SSL context (where $A$ is an augmented image) and the vision-language context (where $A$ is a text caption) are in fact highly aligned, in the sense that many of their top eigenfunctions are shared!

We will add the above discussions to the paper. They show that the contexture theory can provide insightful explanations for many empirical observations that are otherwise hard to understand. In future work, we will verify these explanations with experiments.

We believe that our theory is a valuable tool for choosing the right context (pretraining objective) or training hyperparameters. For example, when choosing the hyperparameters in self-supervised learning such as crop ratio, mask ratio and so on, we can efficiently estimate the spectrum of the context under each combination of hyperparameters using the method described in the paper, and then use our quantitative metric to choose the one with the best decay rate (neither too fast nor too slow). This allows us to avoid training a large model for every possible hyperparameter combination.

(Huh et al., 2024) Position: The platonic representation hypothesis, ICML 2024.

Review
Rating: 3

The manuscript presents a framework called contextures, showing that many representation learning methods aim to capture the top singular functions of an operator defined by the relationship between inputs and a context variable. It shows that such representations are optimal for tasks that align with the context and that further improvements require better context design rather than just scaling up model size.

Update after rebuttal

I have increased the score to Weak Accept from Weak Reject, primarily due to additional experiments on MNIST and CIFAR-10.

Questions for Authors

No additional questions.

Claims and Evidence

Some practical claims, particularly the implications for neural scaling laws and the benefits of designing better contexts, are less extensively validated. These claims are based on experiments with a limited set of small datasets.

Methods and Evaluation Criteria

The experimental validation relies on relatively small and simple datasets (up to 21k samples).

Theoretical Claims

Theoretical claims are sound. Additional experimentation on these assumptions and their implications in real-world scenarios can enhance the work.

Experimental Design and Analysis

The experiments rely on relatively simple datasets (up to 21k samples) again.

Supplementary Material

Mostly proofs and context evaluation.

Relation to Prior Work

The paper builds on SSL representation learning methods. It frames many self-supervised learning (SSL) approaches as the recovery of the top singular functions of a context-induced operator, with a focus on extracting dominant eigenfunctions. This perspective unifies diverse SSL methods (contrastive, non-contrastive, masked autoencoders) as ways of learning low-dimensional representations.

Missing Essential References

For example, recent unification frameworks in self-supervised learning provide an important context that is not discussed.

One work by Balestriero and LeCun (2022) shows that many SSL methods—both contrastive (e.g., SimCLR) and non-contrastive (e.g., Barlow Twins, VICReg)—can be seen as recovering top eigenfunctions of certain operators, akin to classical spectral methods like Laplacian Eigenmaps or PCA.

Balestriero, Randall, and Yann LeCun. "Contrastive and non-contrastive self-supervised learning recover global and local spectral embedding methods." Advances in Neural Information Processing Systems 35 (2022): 26671-26685.

Similarly, Munkhoeva et al. (2023) interpret SSL losses as implicitly performing Laplacian-based eigenfunction learning under data augmentations.

Munkhoeva, Marina, and Ivan Oseledets. "Neural harmonics: Bridging spectral embedding and matrix completion in self-supervised learning." Advances in Neural Information Processing Systems 36 (2023): 60712-60723.

The paper misses several methods, such as GPT, I-JEPA, data2vec 2.0, and DINO v2, that have significantly advanced self-supervised learning in language and vision.

Other Strengths and Weaknesses

  • While the manuscript attempts to unify many approaches across different domains, it does not perform experiments or analyses on real datasets, for example with the discussed SSL models. The experiments were performed on very small datasets with no more than 21,613 samples. There is a lack of application of this theory to more common datasets like ImageNet, or at least CIFAR-10 and STL-10.
  • In Figure 1, even if we train models with different depths and widths, it is not clear how useful the representation will be with respect to correlations.
  • The classical canonical correlation analysis (CCA) has been shown not to work well with neural network representations (Kornblith et al., 2019).
  • There are only vague details on how the models were trained. It is hard to reproduce.

Kornblith, Simon, et al. "Similarity of neural network representations revisited." International conference on machine learning. PMLR, 2019.

Other Comments or Suggestions

The manuscript is trying to cover too many topics and ideas, specifically neural collapse and neural scaling laws, which should be discussed in separate publications. Then, you will have space to experiment on SSL and natural image datasets.

Author Response

Thank you for your thoughtful review. We address your concerns as follows:

1. Our results are applicable to larger datasets

We conduct more experiments on larger datasets such as MNIST and CIFAR-10. We initially used datasets with $<$ 30K samples (largest: 28,155; see App. F) since most experiments compare the pretrained $\Phi$ with kernel PCA, which does not scale well to large datasets. However, our theoretical results are independent of dataset size, and our metric is applicable more broadly. Here we show it on MNIST and CIFAR-10.

On MNIST, we use LeNet as $\Phi$ and random cropping with various crop ratios as contexts. Spectra of these contexts are estimated using the method in lines 356–368. For instance, see the estimated spectrum for a crop ratio of $0.3$: https://i.postimg.cc/bYQP1N46/mnist1.jpg.

We plot the prediction error of a linear probe on $\Phi$ (blue) alongside $\tau_d$ (orange, scaled) at crop ratio 0.3 here: https://i.postimg.cc/3rg4SGkm/mnist3.jpg, which shows that $\tau_d$ tracks the actual error across different $d$. We further vary the crop ratio from 0.1 to 0.9 and compare $\tau$ with the prediction error of a linear probe trained on $\Phi$. The result (https://i.postimg.cc/2yzTMwGX/mnist2.png) shows a strong correlation.

On CIFAR-10, we use ResNet-18 with SimCLR augmentation as the context. See the estimated spectrum at https://i.postimg.cc/hGTcDLW0/cifar1.jpg. In https://i.postimg.cc/ZqQcWJpL/cifar2.png, we plot the prediction error of a linear probe trained on $\Phi$ (blue) with $\tau_d$ (orange, scaled to fit the image). Again, $\tau_d$ tracks the actual error, supporting its usefulness in choosing the dimension $d$.
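
For reference, a minimal sketch of the linear-probe protocol described above, using synthetic stand-in features; the spectrum estimation and the computation of $\tau_d$ itself are not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Stand-ins for frozen encoder outputs; in the experiments above these would
# come from LeNet on MNIST or ResNet-18 on CIFAR-10 (sizes are illustrative).
Phi_train, y_train = rng.normal(size=(5000, 128)), rng.integers(0, 10, 5000)
Phi_test,  y_test  = rng.normal(size=(1000, 128)), rng.integers(0, 10, 1000)

def linear_probe_error(Ztr, ytr, Zte, yte):
    """Train a linear classifier on frozen features and return its test error."""
    probe = LogisticRegression(max_iter=1000).fit(Ztr, ytr)
    return 1.0 - probe.score(Zte, yte)

# Error as a function of the number of leading embedding dimensions kept;
# this is the error curve that the (scaled) tau_d values are compared against.
errors = {d: linear_probe_error(Phi_train[:, :d], y_train, Phi_test[:, :d], y_test)
          for d in (8, 16, 32, 64, 128)}
print(errors)
```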

We will add these experiments to the paper.

2. The benefits of designing better contexts

We proved that for a fixed context, making the model larger has diminishing returns, so it follows as a natural logical consequence that better contexts are necessary for further improvement. Thus, we only intended our experiments to provide a simple empirical illustration rather than corroboration. We also note the increasing empirical evidence in the literature: many recent advancements in pretraining are due to better contexts. For example, as the authors of SimCLR noted, its success owed greatly to its aggressive crop ratio and color distortion.

3. The usefulness of representations

The experiment in Section 4.2 aims to show that as the model gets larger, $\Phi$ becomes more aligned with the top-$d$ eigenfunctions. However, a better alignment does not necessarily imply better performance at a downstream task. Whether $\Phi$ is good on a particular task depends on the compatibility between the context and the task, which we prove in Section 4.1.

We also improved this experiment. See point 1 in our response to Reviewer R5n1.

4. CCA versus CKA

We use CCA but not CKA because our setting is different from that in Kornblith et al. In their setting, they want the alignment metric to be invariant under orthogonal transformations, but not under other invertible linear transformations on $\Phi$. That's why they used CKA. In contrast, we want the metric to be invariant under invertible linear transformations on $\Phi$ because they do not affect the performance of the downstream linear probe, so we use CCA.

Moreover, CCA is equivalent to linear CKA if both representations are centered and whitened. We also evaluate CKA when the representation is $[s_1 \mu_1, \cdots, s_d \mu_d]$, that is, each singular function multiplied by its singular value. We run the experiment in Section 4.2 with a fixed width of 512 and varied depths. The result is plotted at https://i.postimg.cc/Jzrbc00M/Picture1.png. The plot shows that the trends of CCA and CKA are very similar.
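
A small numerical check of the invariance argument above (standard textbook formulas for mean CCA and linear CKA, not the paper's code): mean CCA is unchanged by an invertible linear transformation of the representation, while linear CKA generally is not.

```python
import numpy as np

def mean_cca(X, Y):
    """Mean canonical correlation between the column spaces of two (n x d) matrices."""
    X = X - X.mean(0); Y = Y - Y.mean(0)
    Qx, _ = np.linalg.qr(X); Qy, _ = np.linalg.qr(Y)
    return np.linalg.svd(Qx.T @ Qy, compute_uv=False).mean()

def linear_cka(X, Y):
    """Linear CKA (Kornblith et al., 2019) between two (n x d) matrices."""
    X = X - X.mean(0); Y = Y - Y.mean(0)
    return np.linalg.norm(Y.T @ X, 'fro') ** 2 / (
        np.linalg.norm(X.T @ X, 'fro') * np.linalg.norm(Y.T @ Y, 'fro'))

rng = np.random.default_rng(0)
Phi = rng.normal(size=(500, 32))    # a representation
Ref = rng.normal(size=(500, 32))    # a reference representation (e.g. top singular functions)
M = rng.normal(size=(32, 32))       # generic invertible linear transformation

print(mean_cca(Phi, Ref), mean_cca(Phi @ M, Ref))      # equal up to numerical error
print(linear_cka(Phi, Ref), linear_cka(Phi @ M, Ref))  # generally different
```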

5. Reproducibility

We will add details to the appendix. We also kindly point out that we have attached our code in the supplementary material, and all experiments are run with fixed random seeds, so all results can be exactly reproduced.

6. Neural collapse and scaling laws

Our remarks show that our theory can explain many empirical phenomena and cover different ML paradigms. We believe that including these remarks is important since they show that our theory has real explanatory power. They should be viewed similarly to brief remarks in a later section, such as the conclusion, where such discussions are common. We are happy to move them to a later section if the reviewers see fit.

7. Paper updates

We have updated the related work section. See our response to Reviewer R5n1.

We appreciate your feedback and hope that our response answers all your questions. We are happy to answer any follow-up questions.

Reviewer Comment

Thank you for addressing some of the questions. I have decided to increase the score to Weak Accept, primarily due to additional experiments on MNIST and CIFAR-10.

Author Comment

Thank you for your response! We are glad that our rebuttal addresses your questions.

Review
Rating: 2

The paper introduces a theoretical framework to characterize the representations learned by a neural network: each data sample is represented by an input variable and a context variable (e.g., labels in classification settings, augmentations in self-supervised learning, or its K nearest neighbors). The authors propose to consider an operator, called the contexture, induced by the data and its context and defined via an average over the context. They show that certain classes of representation learning algorithms, such as supervised classification or self-supervised learning, can be phrased within the presented theory. In particular, they claim that most algorithms aim at extracting the largest singular components of the contexture operator. To define performance, they evaluate whether the performance of a representation averaged over the context matches that of the sample. They propose a metric, based on computing the first d singular values of the contexture operator, to estimate downstream performance without having direct access to the downstream task itself. Depending on the relation of the context to the downstream task (i.e., how well one can solve the downstream task from the context), the metric should predict better performance. They validate the theory on a collection of small-scale datasets from the OpenML repository.

Questions for Authors

  • Why consider just kernel PCA and VICReg as encoders? How much does the contexture operator depend on these choices?

  • In the experiments in Figure 2 and Table 1: do the learned encoders solve the task? I.e., are the kernel PCA embeddings sufficient to solve the regression and classification tasks on the test set?

  • How much does the context variable depend on which encoder one learns? I.e., for a given data context variable, one could optimize for an encoder that controls the level of association while still solving the training task.

  • In the setting of supervised classification, what is $g$ in practice? Can it just be a selection operator, or is it an encoder for the labels?

  • Figure 2: why are the $\tau$ scores divided by 5?

Claims and Evidence

The paper claims the following contributions:

  • The theory and formalization look generally good and support the following claims (see below), although it is not completely clear to me how much this corresponds to practical settings and how it relates to previous work in detail (see the related work section).

    • (i) Representation learning captures the relationship between the inputs and a context variable, supported by the formalization in Sections 3 and 4, although it should be related to previous works (e.g., [8, 4, 5]).

    • (ii) Contexture theory encompasses a large variety of learning paradigms, supported by the corresponding proofs in Section 3.

    • (iii) Optimal representations (representations minimizing the loss) are the ones in which the contexture is learned, supported by the corresponding proofs in Section 4.

    • (iv) Dependence on the association between task and context: supported by Section 5.1. How much is this related to the encoder considered? (See the related questions in the questions section.)

Points (i) and (iii) could be related to the causal representation learning and disentanglement point of view on the topicality of representations; see e.g. [1, 2, 3].

  • The experimental claims are less obvious and convincing; see the experiments section of the review for more details:

    • The claim that the main role of scaling up the model size is bringing the learned representations close to the top singular functions of the contexture operator: the experiments provided are small-scale, the evidence is sometimes weak, and the motivation is not discussed in depth (for example, the not-very-high correlation scores in Figure 1 are attributed to non-convex optimization, without deeper analysis or an explanation of why this could be the case).

    • Reflecting the current status of deep learning models: while the theory is interesting and valuable, there is no direct evidence that it reflects the current status of deep learning models. Experiments are performed on small-scale datasets that are far smaller than even simple benchmark datasets such as MNIST (60k samples).

    • Metric: it is not clear from the experiments how the proposed metric should be compared and contextualized with respect to previous work, e.g. [2, 3].

Methods and Evaluation Criteria

  • The use of these datasets for empirical evaluation is not justified in the paper, in particular given that all datasets are very small-scale and do not reflect the current status of deep learning datasets and models.

Theoretical Claims

  • I didn't check the correctness of the proofs in detail, but the formalization looks good at first glance.

Experimental Design and Analysis

Weaknesses:

  • All experiments are performed on small-scale datasets (on the order of a few hundred to 20k samples), not reflecting the current status of deep learning models trained on datasets at much larger scales.

  • Figure 1: The trend of the correlation plot is not explained in depth. What is the reason the correlation decreases at first? What explains that the correlation is always below 0.5? It is hard to disentangle what is affected by the optimization from what the authors want to show here from the theory. It should be verified that training with different seeds attains low-variance results. One strategy could be to show on synthetic data that the correlation trend is always increasing, if that is what is expected.

  • The Figure 2 plots in the second row are not very readable: a correlation plot like the ones in Figure 3, between errors and the $\tau$ measure, would make the association level easier to understand.

  • Results in Figure 3 and Table 1: although failure cases are highlighted, no deep analysis or explanation is provided of why the metric fails to predict the downstream error on certain datasets.

Supplementary Material

I reviewed the related work, the code (to inspect details of the neural architectures used), and part of the formalization/proofs.

Relation to Prior Work

The related work section in the appendix is quite short and does not cover many relevant works. A more comprehensive discussion of what has been proved in (Zhai, 2024) and how it has been extended should be included, in order to better understand the contribution of the paper.

Some additional work would need to be discussed. Some examples: [1], which proposes a map between eigenspaces of operators defined on graphs and the latent spaces of arbitrary neural models; [2, 3], which discuss representation quality by measuring the decay of the eigenspectrum of the covariance operator in representation space. [6] introduces relative representations, where each data point is represented as a function of other points in representation space (close to the context variable idea), and shows that this representation is universal across models, architectures, and training algorithms. Earlier, [7] introduced the idea of using second-order information to compare representations. It would be interesting to relate the unification perspective proposed by contextures to the one analyzed in the causal representation learning and disentanglement setting [4, 5, 8].

[1] Fumero, M., Pegoraro, M., Maiorca, V., Locatello, F., & Rodolà, E. (2024). Latent Functional Maps: a spectral framework for representation alignment, NeurIPS 2024

[2] Agrawal, Kumar K., et al. "α-ReQ: Assessing Representation Quality in Self-Supervised Learning by measuring eigenspectrum decay." Advances in Neural Information Processing Systems 35 (2022)

[3] Nassar, Josue, et al. "On 1/n neural representation and robustness." Advances in neural information processing systems 33 (2020)

[4] Yao, Dingling, et al. "Unifying Causal Representation Learning with the Invariance Principle." arXiv preprint arXiv:2409.02772 (2024)

[5] Zimmermann, Roland S., et al. "Contrastive learning inverts the data generating process." International conference on machine learning. PMLR, 2021.

[6] Moschella, Luca, et al. "Relative representations enable zero-shot latent space communication." ICLR 2023

[7] Kriegeskorte, Nikolaus, Marieke Mur, and Peter A. Bandettini. "Representational similarity analysis-connecting the branches of systems neuroscience." Frontiers in systems neuroscience 2 (2008)

[8] Achille, Alessandro, and Stefano Soatto. "Emergence of invariance and disentanglement in deep representations." Journal of Machine Learning Research 19.50 (2018)

Missing Essential References

As discussed in the previous paragraph, a more comprehensive literature review would be needed; see the examples [1]–[8] listed and discussed in the previous section.


Other Strengths and Weaknesses

Strengths

  • Significance: the paper targets the very important goal of considering different learning algorithms under a unified framework and of evaluating the quality of representations learned by deep learning algorithms.

  • Broad theory: the proposed theory is broad and applies to many settings of deep learning and representation learning.

  • Failure cases: the demonstration of failure cases for the proposed metric in Figure 3 (d) and (e) is useful and appreciated.

Weaknesses

  • Clarity: parts of the text could be improved in clarity; for example, in the section introducing contextures, integral operators, and kernels, accompanying practical examples would help the reader's understanding.

  • Related work: Related work section should be improved. See specific section.

  • Experimental evidence: the experimental validation of the proposed theory is not very strong and is far from the models and datasets used in practice (see the experimental section).

Other Comments or Suggestions

I spotted the following typos:

  • line 131: "The kernels captures" -> "The kernels capture"
Author Response

Thank you for your thoughtful review. We're glad that you find our theory generally good, interesting and valuable. We address your concerns below:

1. Experiments in Section 4.2 (scaling) updated

We improved the setting and observed much higher alignments. Due to the 5000-character limit, please see point 1 in our response to Reviewer R5n1.

All experiments use 15 different random seeds. The standard deviation plot (of one experiment) at https://i.postimg.cc/2651NYg4/Picture3.png shows low variation.

2. Our results are applicable to larger datasets

We added experiments on larger datasets such as MNIST and CIFAR-10. See point 1 in our response to Reviewer RMLf.

3. Figure 2

This figure illustrates how the association between $X$ and $A$ affects the spectrum's decay rate, and how $\tau_d$ tracks the prediction error of a $d$-dimensional encoder. We divide $\tau$ by 5 for visualization. We'll enlarge the figures to improve readability.

4. Metrics in [2, 3] do not fit our problem

There is a fundamental difference between our metric $\tau$ and those in [2, 3]. They use the eigenvalues $\lambda_i$ of $\Phi \Phi^{\top}$: $\langle \Phi, \Phi \rangle f_i = \lambda_i f_i$. In contrast, our theory is based on the spectrum of the context: $\langle \Phi, T_{P^+}\Phi\rangle f_i = s_i^2 \langle\Phi, \Phi\rangle f_i$. These spectra are fundamentally different: $s_i^2$ is invariant under invertible linear transformations on $\Phi$, whereas $\lambda_i$ is not. Since such transformations don't affect downstream linear probe performance (the weights can be adjusted accordingly), the metrics in [2, 3] are not suitable for our setting.
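
An illustrative finite-sample analogue of this distinction (not the paper's estimator): with a PSD matrix standing in for the context-induced operator, the generalized spectrum is unchanged by an invertible linear transformation of $\Phi$, while the ordinary eigenvalues are not.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
Phi = rng.normal(size=(200, 8))                # toy representation matrix (n x d)
W = rng.normal(size=(200, 200)); W = W @ W.T   # PSD stand-in for the context-induced operator
M = rng.normal(size=(8, 8))                    # generic invertible linear transformation

def spectra(Z):
    A, B = Z.T @ W @ Z, Z.T @ Z
    gen = eigh(A, B, eigvals_only=True)        # generalized problem  A f = s^2 B f
    plain = np.linalg.eigvalsh(B)              # ordinary eigenvalues of Z^T Z
    return np.sort(gen)[::-1], np.sort(plain)[::-1]

gen1, plain1 = spectra(Phi)
gen2, plain2 = spectra(Phi @ M)
print(np.allclose(gen1, gen2))      # True: the generalized spectrum is invariant
print(np.allclose(plain1, plain2))  # False: the ordinary eigenvalues change
```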

5. Elaboration on the two failure cases of our metric

Both failure cases stem from the fact that the metric depends only on the spectrum (singular values) of the context.

  • Case 1: The association of the context is too weak/strong. Our metric will indicate that the context is bad, but it is still possible that for a particular task, the context is good.
  • Case 2: The metric might not be able to compare different types of contexts. It may indicate similarity, though one context may perform well on a specific task while the other does not.

However, the experiment in Sec.5 shows that such failure cases are rare.

6. Why we use kernel PCA and VICReg

Kernel PCA is the standard way to extract the exact top eigenfunctions. It is not scalable, and we compare deep learning based methods with it.

We only use VICReg in the experiments because, as shown in Section 3, many objectives, such as the spectral contrastive loss and masked autoencoders, are equivalent to VICReg in that they are all minimized by the top-d singular functions. To demonstrate this, we run the experiment in Sec. 4.1 with both VICReg and the spectral contrastive loss (with width 512, varied depths). Results at https://i.postimg.cc/Hncjj2Yn/Picture2.png show very similar performance, differing slightly due to optimization. Hence, we use a single method in the experiments, given their equivalence.

7. Whether the methods "solve the task"

No encoder solves all tasks. While we proved that many pretraining objectives learn the contexture, they are not guaranteed to solve arbitrary tasks. In Sec. 4.1, we proved that the encoder’s performance depends on the compatibility between the task and the context. If they are compatible, then an encoder that learns the contexture is guaranteed to succeed.

8. Whether the context variable depends on the encoder

In general, the context is defined prior to training the encoder, so it does not depend on the encoder. There are two ways to encode additional context in the encoder:

  • The inductive bias of the architecture (such as translation invariance of CNNs)
  • "Smoothed" encoders, such as adding Gaussian noise to the input

The first method changes the model and the second method changes the input. In both cases, adding additional context weakens the association, hence implicitly changes the level of association. Our theory does not analyze the effect of the inductive bias, and we pose it as an open problem.

9. What is $g$ in supervised learning?

In supervised learning, $\Phi$ extracts the eigenfunctions of $T_{P^+}\Lambda T_{P^+}^*$, so $g(a) = (T_{P^+}^* \Phi)(a) = E[\Phi(X) \mid \text{class } a]$. By neural collapse, the representations of samples from the same class $a$ collapse to a cluster, and $g(a)$ is the center of that cluster.
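
As a concrete illustration (a toy sketch with synthetic embeddings and labels, not the paper's code), $g$ can be estimated as the per-class mean of the representations:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins: encoder outputs Phi(x) and class labels a for 1000 samples
# (sizes and values are illustrative, not taken from the paper's experiments).
Phi = rng.normal(size=(1000, 64))
labels = rng.integers(0, 10, 1000)

# g(a) = E[Phi(X) | class a], estimated by the per-class mean embedding; under
# neural collapse this is the cluster center that class a's representations collapse to.
g = np.stack([Phi[labels == a].mean(axis=0) for a in range(10)])
print(g.shape)  # (10, 64)
```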

10. Paper updates

We have updated the related work section. See our response to Reviewer R5n1.

We also added a new limitation paragraph. See our response to Reviewer rV84.

We appreciate your feedback and hope that we've answered all your questions. We are happy to address any remaining concerns.

Final Decision

There was healthy discourse among the authors and reviewers for this paper. The primary concerns raised in the review process were the lack of experiments on larger datasets, some concerns about the related work, and requests for further analysis. The authors have responded to these comments in their rebuttal, and the reviewers are in agreement that this paper merits publishing. Reviewer asLG did not respond to the rebuttal, but I believe their concerns have been addressed.