PaperHub

Rating: 6.0/10 · Decision: Poster · 4 reviewers
Scores: 3, 4, 5, 3 (min 3, max 5, std 0.8) · Confidence: 2.5
Novelty: 3.3 · Quality: 2.8 · Clarity: 3.3 · Significance: 2.3

NeurIPS 2025

The Indra Representation Hypothesis

OpenReview · PDF
Submitted: 2025-04-28 · Updated: 2025-10-29

Abstract

Keywords
Foundation Models · Representation Learning · Representation Alignment · Multimodal Alignment

Reviews and Discussion

Official Review (Rating: 3)

This paper introduces the "Indra Representation Hypothesis," which states that the representations learned by unimodal foundation models converge towards a shared relational structure. The authors formalize this hypothesis using the V-enriched Yoneda embedding from category theory. This leads to a practical method where a sample's representation is redefined as its vector of distances to all other samples in a given dataset. This distance vector, termed the Indra representation, is computed from the outputs of pre-trained unimodal models. The authors provide theoretical claims that this representation is unique, complete, and structure-preserving. The method is evaluated on cross-model and cross-modal matching tasks involving vision, language, and audio, where it is shown to consistently improve matching accuracy over the original model embeddings.
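To make the construction described in this summary concrete, here is a minimal NumPy sketch of the distance-vector idea: each sample's Indra representation is its vector of distances to all samples in the dataset. The unit normalization, the angular-distance formula, and the function name are our own illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def indra_representation(E: np.ndarray) -> np.ndarray:
    """Map each embedding (row of E) to its vector of angular distances
    to every sample in the dataset, yielding an n x n matrix whose i-th
    row is the n-dimensional Indra representation of sample i."""
    X = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize
    cos = np.clip(X @ X.T, -1.0, 1.0)                 # cosine similarities
    return np.arccos(cos) / np.pi                     # angular distance in [0, 1]

# toy example: 4 samples with 3-dimensional embeddings
E = np.random.default_rng(0).normal(size=(4, 3))
R = indra_representation(E)
```

Note the quadratic footprint the reviewer criticizes below is visible directly in the sketch: the full n × n matrix is materialized.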

Strengths and Weaknesses

The paper's primary contribution is a novel and elegant conceptualization of representation convergence. The analogy to Indra's Net, combined with the formalization using category theory, provides a thought-provoking new lens through which to view the properties of embeddings from large-scale models. The proposed method is a simple, post-hoc transformation that requires no retraining of the foundation models. Its generality allows it to be applied to any model that produces vector embeddings, which is a significant practical advantage. Across the conducted experiments in single-modality and cross-modal settings, the use of the Indra representation demonstrates a modest improvement in matching performance over the baseline of using raw embeddings.

The proposed method is fundamentally non-scalable. To construct representations for a dataset of n samples with embedding dimension d, one must compute an n × n distance matrix, which has a computational complexity of at least O(n^2 d) and a memory complexity of O(n^2). The resulting Indra representation for each sample is an n-dimensional vector. This scaling behavior makes the method computationally infeasible for datasets beyond the small-scale validation sets used in the paper (e.g., n > 10^5). This is, in essence, a kernel method that computes the full kernel matrix. The paper shows a critical lack of rigor by failing to acknowledge this.

The appeal to enriched category theory, while elegant, does not appear to offer concrete guidance for the method's implementation. The Yoneda Lemma guarantees that the mapping to a functor category is fully faithful, preserving all information contained within the chosen cost function d. However, this is a tautology: the representation is complete only with respect to the chosen distance metric. The theory provides no principles for selecting an optimal cost function, which is the single most important design choice in the entire method. Any valid metric space could be used to motivate the same procedure, making the complex theoretical overhead seem more justificatory than generative. The empirical evaluation is weak because it only compares the Indra representation against the original embeddings. To rigorously assess a new method for post-hoc representation alignment, it must be benchmarked against established techniques. By omitting these comparisons, the paper fails to demonstrate that its proposed method offers a significant advantage over simpler, well-known alternatives.

Questions

1. The proposed method has a computational complexity of O(n^2 d) and produces n-dimensional representations. Please comment on the method's viability for large-scale datasets. More importantly, why was the extensive literature on kernel approximation methods, which are designed to mitigate these exact scalability issues, not considered, implemented, or benchmarked against?

2. The theoretical framework guarantees that the Indra representation preserves the structure induced by the cost function d. However, it provides no guidance on how to select d. How does the theory inform this crucial choice? For instance, your ablation shows Euclidean distance also works; does the theory suggest when angular distance might be superior to Euclidean, or vice versa?

3. Could you refine your claim regarding the "structural myopia" of existing representations? How do you distinguish the relational information captured by your method (inter-sample distances) from the relational information already captured by mechanisms like self-attention (inter-token dependencies)?

Limitations

Yes, the authors provided a limitations section in the supplementary material. However, it is wholly inadequate. The authors completely sidestep the critical issue of scalability. Instead, they frame the limitation as their method being "post-training adaptation" and suggest incorporating the ideas into pretraining as future work. This is not a meaningful limitation but rather a description of the work.

Final Justification

The paper presents an interesting hypothesis but does not pursue it in a scientific way. It proposes a non-scalable method that is equivalent to computing a kernel matrix without acknowledging or engaging with the literature on kernel methods. It fails to compare against any established baselines, making its empirical claims difficult to assess. The attempt to excuse these omissions by reframing the paper's goals post-submission is not a valid defense. Therefore, I conclude that the paper, in its current form, does not meet the standards for publication.

Formatting Issues

None

Author Response

Dear Reviewer 7Tyf,

We sincerely thank you for providing thoughtful and constructive comments.

We are glad that you recognize our idea as novel and elegant, and that it provides a thought-provoking new lens through which to view the properties of embeddings of foundation models.

Below, we address your remaining concerns.

W1&Q1 Scalability & Complexity

[Complexity of Exact Computation] We agree with you that constructing exact Indra representations requires a computational complexity of $\mathcal{O}(n^2 d)$ and a memory complexity of $\mathcal{O}(n^2)$ for a dataset with $n$ samples and embedding dimension $d$. This quadratic scaling potentially limits the direct applicability of the exact Indra representations to large-scale datasets.

[Our Research Focus] However, we respectfully clarify that the goal of this work is not to propose a new application-driven multi-modal alignment method. Rather, our focus lies in understanding the fundamental nature of the representations learned by existing unimodal foundation models. Specifically, we aim to investigate the forms these representations tend to converge to and to identify their ultimate convergent targets.

In this paper, we argue that unimodal foundation models converge toward a form of Indra representation, which implicitly reflects the relational structure underlying reality (i.e., The Indra Representation Hypothesis). All our experiments are designed to validate the effectiveness of this form.

[Application-Oriented Solutions] Moreover, the scalability concern is addressable in practice. In the literature, there exists a rich body of work on approximating pairwise distances efficiently. For example, approximate nearest neighbor search (e.g., FAISS, HNSW), landmark-based approximation (e.g., K-means centroids, random subsampling), hashing-based methods (locality sensitive hashing), and sparsified graph constructions (as suggested by Reviewer 6ioa). From an application view, these techniques can be readily adapted to approximate the Indra representation at scale without sacrificing its structural interpretation.

[Experimental Validation] To evaluate the effectiveness of approximated Indra representations, we adopt a simple landmark-based approximation via random subsampling. We assess this approach on two tasks: image classification on the CIFAR-100 dataset and image-text matching on the large-scale CC3M dataset. For CIFAR-100, we use the standard data split provided by torchvision.datasets and employ logistic regression (linear probing) to evaluate classification accuracy. For CC3M, we sample 300,000 image-text pairs for evaluation. We extract image features with DINOv2 and text features with RoBERTa. For each image, we compute mean CLIP scores with the top-$k$ candidate texts and compare the performance of the original features against that of the approximated Indra representations.

Due to space limitation, we kindly refer you to our response to Reviewer yaBd (W1 & Q3) for the experimental details and results.
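The landmark-based approximation described above can be sketched in a few lines. This is our own hedged reading (the function name `approx_indra`, the random-subsampling strategy, and the angular cost are illustrative choices): distances are computed only to m randomly chosen landmark samples, reducing the cost from O(n^2 d) to O(nmd) and the representation from n to m dimensions.

```python
import numpy as np

def approx_indra(E: np.ndarray, m: int, seed: int = 0) -> np.ndarray:
    """Landmark-based Indra approximation: angular distances to m
    randomly subsampled landmarks only, giving an n x m representation
    at O(n*m*d) time and O(n*m) memory."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(E), size=m, replace=False)   # random landmarks
    X = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize
    cos = np.clip(X @ X[idx].T, -1.0, 1.0)            # n x m cosine block
    return np.arccos(cos) / np.pi

# 1000 samples, 64-dim embeddings, 50 landmarks
E = np.random.default_rng(1).normal(size=(1000, 64))
R = approx_indra(E, m=50)
```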

W2&Q2 Distance metric

We believe there is a misunderstanding regarding the role of enriched category theory in our method. The theory is not intended to guide the selection of an optimal cost function for practical applications. Instead, its purpose is to establish that the Indra representation obtained through our construction is both faithful and complete with respect to the chosen cost function. The Yoneda Lemma indeed cannot prescribe an optimal cost function for the downstream tasks.

In practice, it is not feasible to define a universally optimal cost function across different tasks. As our experimental results demonstrate, angular distance may outperform Euclidean distance in some settings, while the opposite may be true in others. This highlights that the choice of cost function should be task-dependent, and the theory is not meant to replace empirical validation.

Furthermore, not all distance functions are valid cost functions in our framework. To be used as a cost function for the Indra representation, d should satisfy the following properties:

Reflexivity: $d(x, x) = 0$
Non-negativity: $d(x, y) \in [0, \infty)$
Triangle inequality: $d(x, z) \leq d(x, y) + d(y, z)$
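These axioms can be checked numerically for a candidate cost function. The sketch below (our illustration, not from the paper) verifies all three for the angular (geodesic) distance on unit-normalized vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 8))
X /= np.linalg.norm(X, axis=1, keepdims=True)
D = np.arccos(np.clip(X @ X.T, -1.0, 1.0))  # angular (geodesic) distance

# Reflexivity: d(x, x) = 0
reflexive = bool(np.allclose(np.diag(D), 0.0, atol=1e-3))
# Non-negativity: arccos maps into [0, pi]
nonneg = bool((D >= 0).all())
# Triangle inequality: d(x, z) <= d(x, y) + d(y, z) for all triples (i, j, k)
triangle = bool((D[:, None, :] <= D[:, :, None] + D[None, :, :] + 1e-6).all())
```

The same check would reject, for example, the squared Euclidean distance, which violates the triangle inequality.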

In summary, enriched category theory is not used to justify the method post hoc, but to guarantee structural faithfulness and completeness within a mathematically grounded setting.

W3 Baselines

Existing alignment methods, e.g., [50], need additional modules on top of frozen backbones, such as linear projections and adapters, and require substantial fine-tuning on large-scale datasets to learn alignment functions. Once trained, these models become explicitly aligned and are thus more appropriately compared to models like CLIP, which are also designed with explicit alignment.

In contrast, our work aims to investigate the convergent forms and targets of representations learned by unimodal foundation models. Our objective fundamentally differs from that of alignment-driven methods: while they actively design and train new modules to enforce alignment, our approach passively explores whether frozen models already contain a more suitable internal representation that supports better cross-modal alignment.

We uncover that even without any additional training, unimodal foundation models may admit a more optimal representational form than their raw output features. This key difference in motivation, methodology, and training requirements makes a direct empirical comparison with those methods unfair and potentially misleading.

To the best of our knowledge, this is the first work to demonstrate that frozen representations without any task-specific tuning can achieve improved alignment performance through our proposed representation method, not only within a single modality, but also across vision-language and speech-language modalities.

Q1 Kernel approximation methods

We kindly remind the reviewer that the primary goal of this paper is to investigate the representational forms that unimodal foundation models converge to, and to identify their ultimate convergent targets, rather than to accelerate the proposed method through approximation.

In this work, we hypothesize that these models tend to converge toward a form we define as the Indra representation. Accordingly, all our experiments are designed to validate the effectiveness of this representation, demonstrating that it consistently achieves better alignment than the original model outputs without any additional training.

In fact, using approximation techniques (e.g., kernel approximation methods) to construct the Indra representation and empirically demonstrate its performance primarily serves as a validation of the Indra representation hypothesis introduced in this paper.

Q3 Structural myopia & self-attention

Prior studies on representation convergence often treat the representations produced by foundation models as isolated carriers of information. They analyze convergence purely based on model outputs in a pointwise or instance-level manner. This perspective overlooks the underlying structural relationships embedded within the broader data manifold. In contrast, we argue that model outputs do not directly represent the final converged form, but rather provide the initial scaffold upon which convergence should be understood. We contend that truly convergent representations must be structure-dependent.

While both our method and self-attention involve pairwise relationships, they differ fundamentally in nature and purpose. Our method constructs a representation for each sample based on its static, global relationship to others in the dataset, using a predefined cost function. This results in a complete, non-parametric embedding that faithfully captures the sample's position within the overall data geometry. The representation is invariant across contexts and tasks, and fully determined by the relational structure of the data itself.

In contrast, self-attention models dynamic, context-dependent interactions among tokens, where the strength of each relationship is learned via parameterized query-key-value projections and varies across different inputs and layers. Self-attention embeddings are not faithful to the original data geometry; instead, they adapt to downstream objectives and are shaped by the model's learned parameters.

In short, our method is grounded in global structure and faithful representation, while self-attention captures how a token should change based on its current context and specific tasks.

L1 Limitation section revision

Thank you for the thoughtful comments. Our experimental results validate the hypothesis by demonstrating improved alignment performance using frozen unimodal foundation models. We believe it is also meaningful to explore this hypothesis in the pretraining stage, to investigate how learning dynamics evolve when the training objectives align with the representational convergent targets.

We have addressed related concerns in the responses above and will revise the limitations section accordingly. In particular, we will clarify our focus on theoretical insights and empirical findings, explicitly acknowledge the computational challenges associated with computing exact Indra representations, and include a discussion of scalable approximation strategies along with the corresponding experimental results.

We hope the above response adequately addresses your concerns. Once again, thank you for your thoughtful feedback and for helping us improve the quality of our paper.

Comment

The authors argue that comparing their method to other alignment techniques (e.g., those using linear projections) is "unfair" because their method is "passive" and requires no training. This argument is unpersuasive.

  • A simple, learnable linear projection is a standard and essential baseline for any representation alignment task. It establishes a baseline of what is achievable with minimal additional parameters and training.
  • Other non-parametric, training-free alignment methods exist (e.g., Canonical Correlation Analysis).
  • By refusing to benchmark against these established techniques, the authors are shielding their method from rigorous evaluation. The claim that their method is superior cannot be substantiated without these comparisons. The burden of proof is on the authors to demonstrate that their computationally expensive $\mathcal{O}(n^2)$ method offers benefits over simpler, well-known alternatives. They have not done so.

I appreciate the authors' candor in agreeing that the enriched category theory framework does not guide the crucial choice of the distance metric d. However, this confirms my initial criticism: the complex theoretical overhead seems more justificatory than generative. The central guarantee, that the representation is "complete" with respect to the chosen distance metric, is a direct consequence of its construction and can be understood with basic principles of metric geometry. The heavy theoretical machinery does not appear to provide additional insight that justifies its inclusion.

Comment

We thank the reviewer for the thoughtful feedback and would like to clarify a few points that may have been misunderstood:

1. Our primary goal is not to propose a new alignment method, but rather to demonstrate that the proposed Indra representations exhibit stronger convergence behavior than the original network outputs.

Specifically, we study the representation convergence problem, where we investigate the nature of the representations to which unimodal foundation models tend to converge. Through experiments across various modalities, we show that the original network outputs may not be the most suitable endpoints for representation convergence. Instead, Indra representations serve as more consistent and convergent targets.

That said, we agree with the reviewer that including comparisons with standard alignment methods such as learnable linear projections is valuable for contextualizing our findings. In response, we have now included comparisons with the linear projection baseline [50]. The classification and matching results are presented in Tables 1, 2, and 3, respectively.

As shown in our results, the learnable Linear Projection baseline improves alignment performance compared to the original network outputs, but still underperforms relative to our method. Notably, it achieves stronger results on the image-to-text (I->T) task, because it explicitly projects image features into the text space. However, this forced alignment comes at a cost: aligning image representations to a different modality degrades their utility for classification, particularly under noisy conditions. This indicates reduced robustness and suggests that linear projection may compromise the integrity of the original modality's semantics.

Table 1 (Office-Home classification accuracy under Gaussian feature noise σ; domains: Art (A), Clipart (C), Product (P), Real-World (R)):

| Method | A σ=0.0 | A σ=3.0 | A σ=5.0 | A σ=7.0 | C σ=0.0 | C σ=3.0 | C σ=5.0 | C σ=7.0 | P σ=0.0 | P σ=3.0 | P σ=5.0 | P σ=7.0 | R σ=0.0 | R σ=3.0 | R σ=5.0 | R σ=7.0 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Linear Projection [50] | 83.95 | 48.35 | 26.34 | 13.17 | 84.88 | 39.52 | 17.75 | 8.71 | 95.38 | 67.34 | 38.74 | 19.93 | 91.06 | 72.59 | 42.89 | 24.08 |
| ViT | 80.25 | 64.40 | 44.03 | 22.63 | 73.20 | 50.40 | 28.64 | 15.23 | 92.34 | 80.74 | 61.15 | 35.25 | 89.22 | 82.11 | 60.09 | 35.32 |
| Ours (Angular) | 79.63 | 65.02 | 43.62 | 27.57 | 69.76 | 54.98 | 33.10 | 18.21 | 89.75 | 81.53 | 64.08 | 40.77 | 87.16 | 83.49 | 63.65 | 40.48 |
| Ours (Euclidean) | 80.04 | 63.37 | 43.21 | 25.51 | 70.33 | 55.78 | 34.48 | 18.33 | 89.64 | 80.74 | 64.75 | 40.77 | 87.16 | 83.37 | 64.56 | 41.28 |
Table 2 (CIFAR-10 classification accuracy under Gaussian feature noise σ):

| Method | σ=0.0 | σ=3.0 | σ=5.0 | σ=7.0 |
| --- | --- | --- | --- | --- |
| Linear Projection [50] | 85.91 | 22.29 | 15.19 | 12.08 |
| ViT | 93.98 | 87.75 | 79.77 | 68.15 |
| Ours (Angular) | 94.72 | 88.08 | 79.84 | 68.16 |
| Ours (Euclidean) | 94.84 | 89.51 | 80.84 | 68.71 |
Table 3 (matching results on MS-COCO and NOCAPS):

| MS-COCO | Top-5 T->I | Top-5 I->T | Top-10 T->I | Top-10 I->T | Top-30 T->I | Top-30 I->T | Top-50 T->I | Top-50 I->T |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Linear Projection [50] | 0.460 | 0.517 | 0.963 | 1.052 | 2.845 | 3.183 | 4.890 | 5.293 |
| ViT+BERT | 0.482 | 0.483 | 0.967 | 0.966 | 2.911 | 2.905 | 4.863 | 4.846 |
| Ours | 0.663 | 0.832 | 1.303 | 1.613 | 3.787 | 4.426 | 6.199 | 7.036 |

| NOCAPS | Top-5 T->I | Top-5 I->T | Top-10 T->I | Top-10 I->T | Top-30 T->I | Top-30 I->T | Top-50 T->I | Top-50 I->T |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Linear Projection [50] | 0.496 | 0.524 | 1.058 | 1.042 | 3.037 | 3.099 | 4.938 | 5.147 |
| ViT+BERT | 0.479 | 0.474 | 0.956 | 0.947 | 2.864 | 2.844 | 4.769 | 4.742 |
| Ours | 0.701 | 0.667 | 1.375 | 1.293 | 3.960 | 3.712 | 6.449 | 6.069 |

2. We would like to clarify that canonical correlation analysis is not a non-parametric, training-free alignment method.

CCA computes explicit linear projections that map two datasets into a shared latent space. These projection parameters are derived by solving a generalized eigenvalue problem based on sample covariance matrices, which have a closed-form solution. Although it doesn't involve gradient descent, CCA still involves fitting to the training data and requires estimating projection parameters that depend on the input data.
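This point can be illustrated with the standard closed-form construction of CCA (a generic sketch, not the exact pipeline behind Table 4): the projection weights come out of sample covariance matrices, so they are parameters fitted to the data even though no gradient descent is involved.

```python
import numpy as np

def fit_cca(X, Y, k, reg=1e-8):
    """Closed-form CCA: whiten each view via its sample covariance,
    then take the SVD of the whitened cross-covariance. The returned
    projections are estimated from the data, i.e., fitted parameters."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    n = len(X)
    def inv_sqrt(C):  # C^{-1/2} via eigendecomposition
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    Wx = inv_sqrt(X.T @ X / n + reg * np.eye(X.shape[1]))
    Wy = inv_sqrt(Y.T @ Y / n + reg * np.eye(Y.shape[1]))
    U, s, Vt = np.linalg.svd(Wx @ (X.T @ Y / n) @ Wy)
    # projection matrices for each view + canonical correlations
    return Wx @ U[:, :k], Wy @ Vt[:k].T, s[:k]

rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 3))                               # shared latent factors
X = Z @ rng.normal(size=(3, 20)) + 0.05 * rng.normal(size=(500, 20))
Y = Z @ rng.normal(size=(3, 12)) + 0.05 * rng.normal(size=(500, 12))
A, B, corrs = fit_cca(X, Y, k=3)
```

Because `A` and `B` depend on the covariance estimates, applying CCA to new data reuses parameters fitted on the training sample, which is the sense in which it is not training-free.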

We thank the reviewer for suggesting a concrete method for us to compare, and we have added comparisons with CCA. The results are shown in Table 4, including comparisons with different numbers of canonical components (e.g., CCA-100 refers to using 100 canonical components).

Please refer to the next page for the continuation of our response.

Comment

3. We respectfully emphasize that our method addresses a fundamentally different research question from the standard alignment approaches referenced by the reviewer.

Methods such as learnable projections and CCA are explicitly designed to optimize alignment between modalities through training or closed-form solutions. These methods aim to answer the question: “How can we better align two representations?”

In contrast, our work investigates the convergent behavior of unimodal foundation models. Specifically, what form their representations tend to converge to. We ask whether the proposed Indra representations can serve as the convergent targets, and naturally yield better alignment without requiring any additional training or projection. That means, our work seeks to answer a distinct question: “Should Indra representations be considered the final convergent targets of unimodal foundation models?”

Table 4 (MS-COCO matching results with CCA baselines; CCA-100 denotes 100 canonical components):

| MS-COCO | Top-5 T->I | Top-5 I->T | Top-10 T->I | Top-10 I->T | Top-30 T->I | Top-30 I->T | Top-50 T->I | Top-50 I->T |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ViT+BERT | 0.482 | 0.483 | 0.967 | 0.966 | 2.911 | 2.905 | 4.863 | 4.846 |
| CCA-100 | 0.498 | 0.503 | 0.985 | 1.001 | 2.951 | 3.005 | 4.934 | 5.010 |
| CCA-200 | 0.501 | 0.498 | 0.992 | 1.001 | 2.974 | 3.004 | 4.954 | 5.001 |
| CCA-300 | 0.501 | 0.501 | 0.991 | 1.001 | 2.966 | 3.006 | 4.946 | 5.011 |
| CCA-500 | 0.487 | 0.502 | 0.973 | 0.999 | 2.931 | 2.998 | 4.859 | 4.844 |
| Ours | 0.663 | 0.832 | 1.303 | 1.613 | 3.787 | 4.426 | 6.199 | 7.036 |
| ViT+RoBERTa | 0.486 | 0.491 | 0.970 | 0.981 | 2.912 | 2.927 | 4.853 | 4.874 |
| CCA-100 | 0.513 | 0.519 | 1.014 | 1.028 | 2.999 | 3.019 | 4.976 | 5.016 |
| CCA-200 | 0.510 | 0.492 | 1.009 | 0.992 | 2.992 | 2.991 | 4.982 | 4.999 |
| CCA-300 | 0.504 | 0.489 | 1.006 | 0.992 | 2.992 | 2.986 | 4.962 | 4.981 |
| CCA-500 | 0.512 | 0.492 | 1.021 | 0.990 | 3.016 | 2.975 | 4.997 | 4.968 |
| Ours | 1.048 | 0.880 | 2.065 | 1.749 | 5.970 | 5.149 | 9.702 | 8.446 |
| ConvNeXt+BERT | 0.396 | 0.474 | 0.837 | 0.950 | 2.603 | 2.851 | 4.412 | 4.755 |
| CCA-100 | 0.503 | 0.495 | 1.003 | 0.995 | 2.986 | 2.996 | 4.965 | 4.994 |
| CCA-200 | 0.506 | 0.497 | 1.006 | 0.996 | 2.992 | 2.984 | 4.977 | 4.972 |
| CCA-300 | 0.509 | 0.499 | 1.008 | 1.001 | 2.993 | 2.996 | 4.985 | 4.992 |
| CCA-500 | 0.507 | 0.496 | 1.001 | 0.995 | 2.991 | 2.990 | 4.976 | 4.980 |
| Ours | 0.612 | 0.537 | 1.127 | 1.022 | 3.182 | 2.875 | 5.242 | 4.783 |
| ConvNeXt+RoBERTa | 0.492 | 0.480 | 0.985 | 0.964 | 2.962 | 2.889 | 4.940 | 4.824 |
| CCA-100 | 0.499 | 0.511 | 0.993 | 1.009 | 2.967 | 2.986 | 4.949 | 4.969 |
| CCA-200 | 0.481 | 0.507 | 0.961 | 1.007 | 2.902 | 2.993 | 4.869 | 4.975 |
| CCA-300 | 0.486 | 0.509 | 0.974 | 1.004 | 2.910 | 2.992 | 4.862 | 4.982 |
| CCA-500 | 0.517 | 0.510 | 1.025 | 1.005 | 3.010 | 2.999 | 5.002 | 4.984 |
| Ours | 1.005 | 0.616 | 1.930 | 1.217 | 5.247 | 3.538 | 8.267 | 5.790 |

4. The reviewer’s comment may reflect a misunderstanding of the role that enriched category theory plays in our work.

Our proposed Indra representation is motivated by the metaphor of Indra’s Net: each sample is defined not in isolation, but through its relational pattern with other samples. The enriched Yoneda embedding formalizes this idea. It shows that an object in an enriched category is fully and faithfully represented by its morphisms to and from all other objects, with respect to the given enrichment. In our setting, an object is fully determined by how it relates to all other objects in a category given the cost function. In other words, this theory tells us we can understand an object best by understanding its interactions with all other objects.

We respectfully disagree with the claim that this theoretical structure is merely justificatory. While the completeness result may be partially intuited via classical metric geometry, the enriched Yoneda embedding generalizes beyond metric spaces: any suitable monoidal structure can serve as the enrichment. This flexibility is crucial for extending beyond metric embeddings, and the enriched framework ensures that the relational representation is coherent and principled in all such cases.

Indeed, metric spaces can be viewed as a special case of enriched categories. While classical metric geometry can ensure that certain representations are complete with respect to a chosen distance function, the enriched Yoneda embedding goes further. It is not merely about comparing objects or measuring distances. It’s about reconstructing an object’s identity entirely through its relationships with others, which is central to the philosophy and implementation of our approach. In contrast, standard metric embeddings only glimpse part of that.

We hope the above response adequately addresses your concerns. Once again, thank you for your thoughtful feedback and for helping us improve the quality of our paper. Please feel free to let us know if any part of our response remains unclear or if you have any further questions!

Comment

Dear Reviewer 7Tyf,

As the discussion deadline approaches, we would like to kindly ask whether our responses have addressed your remaining concerns.

Following your suggestions, we have conducted additional experiments to compare with the learnable linear projection baseline and canonical correlation analysis. The results are provided in Tables 1–4 in Parts 1 and 2 of our above response. We have also further clarified the role of enriched category theory in our work.

We really appreciate your thoughtful and constructive comments, and sincerely hope you could check our response. Thank you so much for your time!

Sincerely,

The Authors

Official Review (Rating: 4)

The paper proposes a new theoretical and empirical framework for representation learning inspired by the metaphor of Indra’s Net. The central hypothesis is that unimodal foundation models, despite being trained independently, converge to representations that reflect a shared relational structure of reality. This is formalized using enriched category theory, where each sample is represented by its vector of distances to all other samples under a specific cost function (e.g., angular distance). The resulting Indra representation is theoretically shown to be unique, complete, and structure-preserving. The authors instantiate the method on pretrained models from vision, language, and audio modalities, and evaluate it on cross-model and cross-modal matching tasks. They also test an alternative cost function (Euclidean distance) to assess robustness. The method is applied post hoc, without retraining or supervised alignment.

Strengths and Weaknesses

Strengths

  • The paper introduces a novel and compelling framing of representation learning using the Indra’s Net metaphor.
  • The use of category theory and the Yoneda embedding provides an interesting foundation for relational representations.
  • The approach is post-hoc and model-agnostic, so it can be applied to any pretrained unimodal encoder without retraining or supervision.
  • Empirical validation is provided across multiple modalities (vision, language, audio), model architectures, and datasets, showing consistent performance gains.
  • The method is shown to be somewhat robust to the choice of cost function (angular vs. Euclidean distance).

Weaknesses

  • The method requires computing and storing $n \times n$ distance matrices (Eq. 4), resulting in quadratic complexity in dataset size. The paper does not analyze runtime or memory scaling, nor does it explore sparse or approximate alternatives and their impact on theoretical guarantees.
  • The paper could benefit from qualitative visualizations or interpretability analyses to provide intuition on what kinds of relations are being captured.
  • The evaluation is limited to retrieval-style tasks using similarity-based matching metrics. The effectiveness of Indra representations on other downstream tasks (e.g., classification, generation, etc.) is not explored. Evaluating such tasks would help clarify whether the benefits of the relational structure extend beyond similarity matching.
  • The paper cites prior work demonstrating that lightweight post-hoc alignment methods, such as linear projections or Procrustes alignment, can effectively bridge pretrained unimodal encoders (e.g., Merullo et al. [50], Sharma et al. [62]). However, it does not include these as baselines in the empirical evaluation. Comparing against these methods, especially in low-data or zero-shot variants, could clarify whether the gains arise uniquely from the relational structure or could be matched by simpler geometric transformations.

Questions

  1. Have the authors considered evaluating Indra representations on downstream tasks beyond retrieval, such as classification (e.g., via linear probing) or caption generation?
  2. Can the authors provide formal or empirical analysis of the time and space complexity of Indra representation computation, especially in large-scale settings?
  3. Have the authors considered whether sparse approximations (e.g., k-nearest neighbor graphs) could reduce computational cost while preserving the theoretical guarantees (uniqueness, completeness, structure preservation) of the Indra representation?
  4. Can the authors provide qualitative examples or case studies where Indra representations yield more semantically aligned or informative matches than original embeddings?
  5. How robust is the Indra representation to the quality of base embeddings? Specifically, how does it behave when the embeddings are noisy or derived from undertrained or small-capacity models?
  6. The paper cites prior work on lightweight post-hoc alignment methods, such as linear projections (e.g., [50]), which operate under broadly similar goals. Could the authors elaborate on the decision not to include these as empirical baselines?

Limitations

Yes, the authors have addressed the limitations and potential negative societal impact of their work. One suggestion for improvement would be to include a discussion of scalability limitations, especially regarding the complexity of computing full pairwise distance matrices.

Final Justification

The rebuttal addressed most of my concerns, especially regarding the scalability of the method through approximate techniques and the added runtime analysis. The additional experiments on robustness and downstream classification tasks strengthen the empirical evidence for the utility of the Indra representation. While some aspects remain underexplored (i.e., the qualitative understanding of the captured relations and comparisons to baseline alignment methods), I find the core idea novel and the analysis interesting.

Formatting Issues

No major formatting issues observed.

Author Response

Dear Reviewer 6ioa,

We sincerely thank you for providing useful suggestions and for recognizing our idea as novel and compelling.

Below, we address your remaining concerns.

W1&Q2&Q3 Complexity

[Complexity of Exact Computation] Due to space limitations, we kindly refer you to our response to Reviewer yaBd (W1 & Q3) for the complexity analysis.

[Application-Oriented (Scalable) Solutions] The scalability concern is addressable in practice. In the literature, there exists a rich body of work on approximating pairwise distances efficiently. For example, approximate nearest neighbor search (e.g., FAISS, HNSW), landmark-based approximation (e.g., K-means centroids, random subsampling), hashing-based methods (locality sensitive hashing), and sparsified graph constructions. From an application view, these techniques can be readily adapted to approximate the Indra representation at scale without sacrificing its structural interpretation.

[Experimental Validation] Due to space limitations, we kindly refer you to our response to Reviewer yaBd (W1 & Q3) for the corresponding results and analysis.

[Runtime Analysis] The table below reports the runtime for constructing Indra representations on the CC3M dataset using a landmark-based approximation with varying numbers of landmarks. All experiments are conducted on an NVIDIA L40S GPU, and each setting is run 20 times to obtain the average runtime.

| # of landmarks | 1000 | 2000 | 3000 | 5000 |
|---|---|---|---|---|
| runtime (s) | 0.0476 | 0.0666 | 0.0888 | 0.1401 |

[Theoretical Guarantees] Approximate solutions, such as landmark-based methods, may not guarantee uniqueness, completeness, or structure preservation, unless the selected subcategory of landmarks is sufficiently large or dense to distinguish all objects in our sample category. In such cases, we may consider leveraging categorical tools, such as Kan extensions, to extend the behavior of objects defined on the subcategory to the entire category, thereby approximating the full Indra representation. However, this direction lies beyond the scope of the current paper. We will explore this in our future work.

W2 Intuition on captured relations

Thank you for the helpful suggestion. The types of relations captured by our method are influenced by both the choice of cost function and the foundation model, and are not deterministic. To better understand these relations, we perform visualization analyses using t-SNE and clustering algorithms. Due to rebuttal policy constraints, we are unable to include qualitative visualizations at this stage, but we will incorporate them in the revised version of the paper.

W3&Q1&Q5 Other downstream tasks & robustness

We additionally report image classification results on the CIFAR-10, CIFAR-100, and Office-Home datasets under various distance metrics. For CIFAR-10 and CIFAR-100, we use the standard data splits provided by torchvision.datasets. For Office-Home, we evaluate classification accuracy across four domains: Art (A), Clipart (C), Product (P), and Real-World (R), using an 80/20 split for training and testing. Across all datasets, we adopt logistic regression (linear probing) to assess the quality of the extracted representations.

To investigate robustness under feature noise, we inject Gaussian noise into the features with varying standard deviations σ ∈ {0.0, 1.5, 3.0, 5.0, 7.0}. For each noise level, we perturb the features accordingly and train a linear classifier on the noisy representations. This allows us to assess how classification performance degrades as the features are increasingly corrupted by noise. The corresponding results are presented below.

| Model | CIFAR-10 σ=0.0 | σ=1.5 | σ=3.0 | σ=5.0 | σ=7.0 | CIFAR-100 σ=0.0 | σ=1.5 | σ=3.0 | σ=5.0 | σ=7.0 |
|---|---|---|---|---|---|---|---|---|---|---|
| ViT | 93.98 | 89.60 | 87.75 | 79.77 | 68.15 | 79.45 | 70.40 | 54.69 | 35.76 | 27.45 |
| Ours (Angular) | 94.72 | 92.90 | 88.08 | 79.84 | 68.16 | 80.51 | 74.48 | 61.12 | 42.29 | 28.49 |
| Ours (Euclidean) | 94.84 | 93.15 | 89.51 | 80.84 | 68.71 | 80.09 | 77.36 | 69.00 | 51.59 | 32.74 |
| ConvNeXt | 97.00 | 94.37 | 85.89 | 80.10 | 65.85 | 85.77 | 78.56 | 62.79 | 34.39 | 21.28 |
| Ours (Angular) | 97.20 | 95.22 | 90.77 | 80.18 | 65.93 | 85.74 | 80.30 | 66.04 | 41.58 | 24.45 |
| Ours (Euclidean) | 97.21 | 96.06 | 92.86 | 81.59 | 66.64 | 85.64 | 82.25 | 72.16 | 51.51 | 30.25 |
| DinoV2 | 99.19 | 98.58 | 95.21 | 85.57 | 76.54 | 91.97 | 89.24 | 82.21 | 63.06 | 40.16 |
| Ours (Angular) | 99.19 | 98.80 | 96.59 | 87.39 | 76.68 | 91.77 | 89.72 | 83.72 | 67.53 | 47.80 |
| Ours (Euclidean) | 99.14 | 98.72 | 96.87 | 89.73 | 77.92 | 91.93 | 89.76 | 84.83 | 74.29 | 58.67 |

| Model | A σ=0.0 | A σ=3.0 | A σ=5.0 | A σ=7.0 | C σ=0.0 | C σ=3.0 | C σ=5.0 | C σ=7.0 | P σ=0.0 | P σ=3.0 | P σ=5.0 | P σ=7.0 | R σ=0.0 | R σ=3.0 | R σ=5.0 | R σ=7.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ViT | 80.25 | 64.40 | 44.03 | 22.63 | 73.20 | 50.40 | 28.64 | 15.23 | 92.34 | 80.74 | 61.15 | 35.25 | 89.22 | 82.11 | 60.09 | 35.32 |
| Ours (Angular) | 79.63 | 65.02 | 43.62 | 27.57 | 69.76 | 54.98 | 33.10 | 18.21 | 89.75 | 81.53 | 64.08 | 40.77 | 87.16 | 83.49 | 63.65 | 40.48 |
| Ours (Euclidean) | 80.04 | 63.37 | 43.21 | 25.51 | 70.33 | 55.78 | 34.48 | 18.33 | 89.64 | 80.74 | 64.75 | 40.77 | 87.16 | 83.37 | 64.56 | 41.28 |
| ConvNeXt | 89.71 | 62.76 | 27.98 | 12.14 | 83.62 | 54.07 | 20.85 | 9.74 | 96.62 | 84.91 | 44.26 | 19.37 | 93.46 | 82.11 | 38.30 | 17.78 |
| Ours (Angular) | 87.86 | 59.88 | 28.81 | 14.20 | 82.70 | 57.85 | 25.09 | 11.34 | 96.73 | 85.92 | 45.61 | 22.18 | 93.35 | 84.63 | 40.71 | 19.61 |
| Ours (Euclidean) | 87.45 | 58.02 | 26.75 | 12.96 | 82.93 | 58.30 | 23.83 | 11.23 | 96.28 | 86.26 | 45.38 | 21.73 | 93.35 | 84.75 | 40.14 | 19.04 |
| DinoV2 | 87.65 | 73.05 | 46.91 | 27.78 | 88.43 | 75.14 | 51.09 | 31.04 | 96.73 | 93.24 | 83.33 | 60.70 | 92.78 | 87.39 | 71.44 | 48.51 |
| Ours (Angular) | 87.04 | 70.99 | 47.53 | 27.37 | 87.29 | 76.63 | 54.75 | 33.56 | 96.40 | 92.79 | 84.46 | 60.59 | 92.89 | 88.53 | 73.17 | 49.89 |
| Ours (Euclidean) | 86.21 | 68.52 | 44.65 | 27.16 | 87.29 | 76.75 | 55.21 | 34.59 | 96.51 | 92.57 | 84.12 | 60.02 | 92.55 | 88.53 | 73.39 | 49.54 |

The results clearly show that stronger backbone models (e.g., DINOv2) lead to better performance for Indra representations across all noise levels. For instance, on CIFAR-100 with σ=0.0, Indra (Euclidean) achieves 91.93% accuracy using DINOv2 features, compared to 85.64% with ConvNeXt and 80.09% with ViT. This performance gap persists and even widens under higher noise; at σ=7.0, Indra (Euclidean) with DINOv2 maintains 58.67%, while ConvNeXt and ViT drop to 30.25% and 32.74%, respectively. In addition, as Gaussian noise increases, our Indra representations consistently retain higher classification accuracy compared to the original features, highlighting their robustness. The performance gains of Indra representations hold across multiple backbone architectures (ViT, ConvNeXt, DINOv2), indicating the broad applicability of the proposed method.
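The noise-injection linear-probing protocol above can be sketched as follows (a toy illustration under assumed splits and solver settings, not the authors' exact pipeline):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_accuracy(X, y, sigma, seed=0):
    """Linear probing on features corrupted by Gaussian noise with
    standard deviation sigma; returns held-out accuracy."""
    rng = np.random.default_rng(seed)
    Xn = X + rng.normal(scale=sigma, size=X.shape)  # perturb features
    Xtr, Xte, ytr, yte = train_test_split(
        Xn, y, test_size=0.2, random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
    return clf.score(Xte, yte)

# Toy sanity check: two well-separated clusters are perfectly
# separable when no noise is added.
X = np.vstack([np.zeros((50, 8)), np.full((50, 8), 10.0)])
y = np.array([0] * 50 + [1] * 50)
acc = probe_accuracy(X, y, sigma=0.0)
```

In the rebuttal setting, `X` would be either the original backbone features or the Indra representations, and `sigma` would sweep {0.0, 1.5, 3.0, 5.0, 7.0}.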

W4&Q6 Baselines

The cited methods such as [50] are application-driven with a primary goal of improving cross-modal alignment. They achieve this goal by training additional modules on top of frozen backbones, such as linear projections and adapters. As a result, these methods typically require substantial fine-tuning on large-scale datasets to learn optimal alignment mappings. Once trained, they become explicitly aligned models and are therefore more appropriately compared to models like CLIP, which are also designed with explicit alignment.

In contrast, our work is exploratory in nature, with the main focus on investigating what kind of representation unimodal foundation models inherently converge to. While prior works suggest that different unimodal models exhibit convergence in their output representations, they generally assume that the last-layer outputs themselves constitute this convergent form. We challenge this assumption and hypothesize that there exists a more structured and alignment-friendly representation, i.e., what we propose as the Indra representation. To validate this hypothesis, we compare Indra representations with the original model outputs, showing that Indra representations lead to better alignment performance without any model retraining.

It is therefore clear that our objective fundamentally differs from that of those methods. While they proactively design and train new modules to enforce alignment, our approach seeks to passively uncover whether existing, frozen models already contain a more suitable representation structure when viewed from a relational perspective.

In summary, while these alignment methods achieve improved performance through training additional layers, our method reveals that even without training, unimodal foundation models may admit a more optimal representational form than their raw outputs. This key difference in motivation, methodology, and training requirements makes a direct empirical comparison with those methods unfair and potentially misleading.

Q4 Qualitative examples

Thank you for the constructive feedback. We have conducted qualitative comparisons demonstrating that Indra representations yield more semantically aligned matches than the original embeddings. Due to rebuttal policy constraints, we are unable to include qualitative examples at this stage, but we will incorporate them in the revised version of the paper.

We hope the above response adequately addresses your concerns. Once again, thank you for your thoughtful feedback and for helping us improve the quality of our paper.

Comment

Dear Reviewer 6ioa,

Thank you very much for your time and effort in reviewing our paper.

As the discussion deadline approaches, we would greatly appreciate your thoughts on whether our responses have addressed your concerns. Your feedback is extremely valuable to us, and we would be grateful for any further comments or suggestions you might have.

Thank you once again for your time and consideration!

Sincerely,

The Authors

Comment

I’d like to thank the authors for the detailed rebuttal. The responses are clear, and I appreciate the authors’ efforts in addressing the points and questions I raised. I believe the authors have addressed my concerns. The paper presents an interesting idea, and the revisions have positively influenced my assessment.

Comment

Dear Reviewer 6ioa,

Thank you so much for your positive assessment. We are glad to hear that our response has addressed your concerns!

We sincerely appreciate your valuable suggestions and constructive feedback, especially your comments on the evaluation of classification tasks and the robustness of the Indra representation. These insights are important in helping us further explore and validate the potential of our proposed approach.

We truly value your input and will do our best to revise the manuscript accordingly. Once again, thank you for your thoughtful comments and for helping us improve the quality of our paper!

Best regards,

The Authors

Review
5

This paper introduces The Indra Representation Hypothesis, a novel theoretical and empirical framework aimed at understanding representation convergence in unimodal foundation models. Drawing philosophical inspiration from the metaphor of Indra’s Net, the authors propose that foundation models, despite being trained on different modalities and objectives, tend to learn internal representations that converge towards a shared relational structure.

The key contribution is the formalization of this idea through category theory, specifically the V-enriched Yoneda embedding, which leads to the definition of the Indra representation. The Indra representation encodes each sample through its relational profile with respect to all others, using a distance metric (angular distance in this work) between model-generated embeddings. Theoretical guarantees are provided, showing that this representation is unique, complete, and preserves structural relationships.

The paper then instantiates this formulation practically and demonstrates its utility across a wide range of experiments involving cross-model and cross-modal matching tasks, including vision, language, and audio domains. Results show that Indra representations significantly improve matching performance between independently trained unimodal models, outperforming the original embeddings and narrowing the gap with jointly trained multimodal baselines like CLIP and CLAP.

Through both its mathematical grounding and extensive empirical validation, the paper positions the Indra representation as a general, theoretically sound, and practically beneficial approach for understanding and leveraging latent relational structures in foundation models.

Strengths and Weaknesses

Strengths:

  1. Quality & Clarity: The paper presents a well-grounded and mathematically rigorous framework by introducing the Indra representation using V-enriched category theory. The authors provide theoretical guarantees—uniqueness, completeness, and structure preservation—supported by formal theorems and proofs.

  2. Originality: The analogy to Indra’s Net offers a fresh and philosophically motivated perspective on representation learning, distinguishing this work from traditional embedding-centric approaches.

  3. Significance: The proposed method is applied to vision, language, and audio models, showing strong performance improvements across cross-model and cross-modal matching tasks, without requiring any retraining or supervision.

Weaknesses:

  1. Insufficient limitation discussion: It fails to address computational cost, storage footprint (Indra vectors require pairwise distances over the entire dataset), sensitivity to the choice of distance metric, and scalability to real downstream tasks, among other issues. Overall, the limitation discussion is too narrow in scope and pays insufficient attention to potential engineering and data challenges.

  2. Minor lack of clarity in the proof of Theorem 2: the proof of the separation axiom (d(x,y)=0 ⇒ x=y) seems to be missing.

Questions

  1. Really nice proof! However, one thing I find confusing in the proof of Theorem 2: you seem to mix the concept of the Identity Property with the Separation Axiom. Would you mind explaining what your identity property refers to?
  2. Your experiment part is almost excellent. Besides these scores, can you show some real-world downstream task performance of your method?
  3. My main concern is that your method has O(N^2) computational complexity. Under this assumption, many of us are interested in how large the computation and storage costs become on truly large-scale datasets. Could you discuss more about that?

Limitations

I suppose you should talk more about limitation. It fails to address computational cost, storage footprint, sensitivity to the choice of distance metric, and scalability to real downstream tasks, among other issues. Overall, the limitation discussion is too narrow in scope and pays insufficient attention to potential engineering and data challenges.

Final Justification

The rebuttal experiment has already addressed my concerns. I believe it deserves to be accepted.

Formatting Issues

No.

Author Response

Dear Reviewer yaBd,

We sincerely thank you for your helpful suggestions and for recognizing our idea as fresh and distinguishable from traditional approaches.

Below, we address your remaining concerns.

W1&Q3 Limitation discussion

[Complexity of Exact Computation] Constructing exact Indra representations requires O(n^2 d) computation and O(n^2) memory for a dataset with n samples and embedding dimension d. This quadratic scaling potentially limits the direct applicability of exact Indra representations to large-scale datasets.
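For concreteness, the exact construction under angular distance can be sketched as follows (our illustration; function names and the specific distance are assumptions, not the authors' released code):

```python
import numpy as np

def indra_representation(E: np.ndarray) -> np.ndarray:
    """Exact Indra representation: row i is sample i's vector of
    angular distances to every sample in the dataset.
    E: (n, d) matrix of model embeddings.
    Cost: O(n^2 d) time and O(n^2) memory for the output."""
    # Normalize rows so inner products become cosine similarities.
    U = E / np.linalg.norm(E, axis=1, keepdims=True)
    cos = np.clip(U @ U.T, -1.0, 1.0)
    # Angular distance, rescaled to [0, 1].
    return np.arccos(cos) / np.pi

rng = np.random.default_rng(0)
R = indra_representation(rng.normal(size=(8, 4)))
```

The full n x n matrix makes the quadratic time and memory costs explicit: both scale with the square of the dataset size.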

[Application-Oriented (Scalable) Solutions] The scalability concern is addressable in practice. The literature offers a rich body of work on approximating pairwise distances efficiently, including approximate nearest neighbor search (e.g., FAISS, HNSW), landmark-based approximation (e.g., K-means centroids, random subsampling), hashing-based methods (locality-sensitive hashing), and sparsified graph constructions (as suggested by Reviewer 6ioa). From an application view, these techniques can be readily adapted to approximate the Indra representation at scale without sacrificing its structural interpretation.

[Experimental Validation] To evaluate the effectiveness of approximated Indra representations, we adopt a simple landmark-based approximation via random subsampling. We randomly select m < n landmarks as representative samples, reducing the computational complexity to O(nmd) and the memory complexity to O(nm). We assess this approach on two tasks: image classification on the CIFAR-100 dataset and image-text matching on the large-scale CC3M dataset.
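A minimal sketch of this random-subsampling approximation (our illustration; the angular-distance choice mirrors the paper's default, and the function name is ours):

```python
import numpy as np

def approx_indra(E: np.ndarray, m: int, seed: int = 0) -> np.ndarray:
    """Approximate Indra representation: distances from each of the n
    samples to m randomly chosen landmark samples only, reducing cost
    from O(n^2 d) time / O(n^2) memory to O(n m d) / O(n m)."""
    rng = np.random.default_rng(seed)
    landmarks = E[rng.choice(len(E), size=m, replace=False)]
    U = E / np.linalg.norm(E, axis=1, keepdims=True)
    V = landmarks / np.linalg.norm(landmarks, axis=1, keepdims=True)
    # Angular distance to each landmark, rescaled to [0, 1].
    return np.arccos(np.clip(U @ V.T, -1.0, 1.0)) / np.pi

A = approx_indra(np.random.default_rng(1).normal(size=(100, 16)), m=10)
```

Each sample is now an m-dimensional relational profile rather than an n-dimensional one, which is what makes the runtime table below feasible at CC3M scale.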

For CIFAR-100, we use the standard data split provided by torchvision.datasets and employ logistic regression (linear probing) to evaluate classification accuracy. To assess robustness, we inject Gaussian noise with different standard deviations σ into the extracted features.

For CC3M, we sample 300,000 image-text pairs for evaluation. We extract image features with DINOv2 and text features with RoBERTa. For each image, we compute mean CLIP scores with the top-k candidate texts and compare the performance of the original features against that of the approximated Indra representations.
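The top-k evaluation protocol can be sketched as follows (a precomputed score matrix stands in for the real CLIP model, and the retrieval metric and names are our assumptions):

```python
import numpy as np

def mean_topk_score(img_feats, txt_feats, clip_scores, k=5):
    """For each image, retrieve its k most similar texts in the given
    feature space (original or Indra), then average the (precomputed)
    CLIP scores of those retrieved pairs over the whole dataset."""
    I = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    T = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    sim = I @ T.T                            # cosine similarity
    topk = np.argsort(-sim, axis=1)[:, :k]   # top-k text indices
    rows = np.arange(len(img_feats))[:, None]
    return float(clip_scores[rows, topk].mean())
```

Higher mean CLIP scores over the retrieved top-k texts indicate that the feature space ranks semantically matching captions higher.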

Results under varying numbers of landmarks are presented below.

| CIFAR-100 | σ=0.0 | σ=1.5 | σ=3.0 | σ=5.0 | σ=7.0 | CIFAR-100 | σ=0.0 | σ=1.5 | σ=3.0 | σ=5.0 | σ=7.0 | # of landmarks |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DinoV2 | 91.97 | 89.24 | 82.21 | 63.06 | 40.16 | DinoV2 | 91.97 | 89.24 | 82.21 | 63.06 | 40.16 | --- |
| Ours (Euclidean) | 90.46 | 88.56 | 82.49 | 66.00 | 40.46 | Ours (Angular) | 91.75 | 89.55 | 84.03 | 67.83 | 46.90 | 1000 |
| Ours (Euclidean) | 91.83 | 90.08 | 84.31 | 71.38 | 49.33 | Ours (Angular) | 92.09 | 90.25 | 85.37 | 72.02 | 53.66 | 2000 |
| Ours (Euclidean) | 91.85 | 89.28 | 84.91 | 70.39 | 52.02 | Ours (Angular) | 92.34 | 90.34 | 85.53 | 73.74 | 55.18 | 3000 |
| Ours (Euclidean) | 92.06 | 90.12 | 85.81 | 71.98 | 54.56 | Ours (Angular) | 92.10 | 90.20 | 85.48 | 73.40 | 55.15 | 5000 |
| Ours (Euclidean) | 92.20 | 90.12 | 85.56 | 73.23 | 56.91 | Ours (Angular) | 92.24 | 89.99 | 85.14 | 72.03 | 53.68 | 10000 |

| | # of landmarks | k=5 | k=10 |
|---|---|---|---|
| DinoV2+RoBERTa | --- | 25219.78 | 25208.54 |
| Ours | 5000 | 28225.07 | 26836.45 |

The experimental results demonstrate that the approximated Indra representations still achieve better classification and alignment performance in real-world applications. Specifically, the Indra representations consistently outperform the original representations under increasing levels of Gaussian noise, particularly at higher noise levels, highlighting their superior robustness. Furthermore, increasing the number of landmarks can yield further improvements in classification accuracy.

W2&Q1 Proof of Theorem 2

Thank you for carefully reviewing the proof. You are absolutely right: our statement that "the cost function d satisfies the identity property" should be revised to "the cost function d satisfies the T₀ separation axiom." The identity property requires only that d(X_i, X_i) = 0; it does not guarantee that d(X_i, X_j) = 0 implies X_i = X_j. We will revise the theorem and proof accordingly to incorporate this stricter condition on the cost function.
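In symbols (our phrasing of the fix, subject to the authors' final wording), the weaker and stricter conditions read:

```latex
% Identity property (reflexivity): holds for any cost function d
d(X_i, X_i) = 0 \quad \text{for all } i
% Separation (the stricter condition Theorem 2 requires):
d(X_i, X_j) = 0 \;\Longrightarrow\; X_i = X_j
```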

Q2 Downstream tasks

We additionally report image classification results on the CIFAR-10, CIFAR-100, and Office-Home datasets under various distance metrics. For CIFAR-10 and CIFAR-100, we use the standard data splits provided by torchvision.datasets. For Office-Home, we evaluate classification accuracy across four domains: Art (A), Clipart (C), Product (P), and Real-World (R), using an 80/20 split for training and testing. Across all datasets, we adopt logistic regression (linear probing) to assess the quality of the extracted representations. To evaluate robustness, we inject Gaussian noise into the features with varying standard deviations σ. The corresponding results are presented below.

| Model | CIFAR-10 σ=0.0 | σ=1.5 | σ=3.0 | σ=5.0 | σ=7.0 | CIFAR-100 σ=0.0 | σ=1.5 | σ=3.0 | σ=5.0 | σ=7.0 |
|---|---|---|---|---|---|---|---|---|---|---|
| ViT | 93.98 | 89.60 | 87.75 | 79.77 | 68.15 | 79.45 | 70.40 | 54.69 | 35.76 | 27.45 |
| Ours (Angular) | 94.72 | 92.90 | 88.08 | 79.84 | 68.16 | 80.51 | 74.48 | 61.12 | 42.29 | 28.49 |
| Ours (Euclidean) | 94.84 | 93.15 | 89.51 | 80.84 | 68.71 | 80.09 | 77.36 | 69.00 | 51.59 | 32.74 |
| ConvNeXt | 97.00 | 94.37 | 85.89 | 80.10 | 65.85 | 85.77 | 78.56 | 62.79 | 34.39 | 21.28 |
| Ours (Angular) | 97.20 | 95.22 | 90.77 | 80.18 | 65.93 | 85.74 | 80.30 | 66.04 | 41.58 | 24.45 |
| Ours (Euclidean) | 97.21 | 96.06 | 92.86 | 81.59 | 66.64 | 85.64 | 82.25 | 72.16 | 51.51 | 30.25 |
| DinoV2 | 99.19 | 98.58 | 95.21 | 85.57 | 76.54 | 91.97 | 89.24 | 82.21 | 63.06 | 40.16 |
| Ours (Angular) | 99.19 | 98.80 | 96.59 | 87.39 | 76.68 | 91.77 | 89.72 | 83.72 | 67.53 | 47.80 |
| Ours (Euclidean) | 99.14 | 98.72 | 96.87 | 89.73 | 77.92 | 91.93 | 89.76 | 84.83 | 74.29 | 58.67 |

| Model | A σ=0.0 | A σ=3.0 | A σ=5.0 | A σ=7.0 | C σ=0.0 | C σ=3.0 | C σ=5.0 | C σ=7.0 | P σ=0.0 | P σ=3.0 | P σ=5.0 | P σ=7.0 | R σ=0.0 | R σ=3.0 | R σ=5.0 | R σ=7.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ViT | 80.25 | 64.40 | 44.03 | 22.63 | 73.20 | 50.40 | 28.64 | 15.23 | 92.34 | 80.74 | 61.15 | 35.25 | 89.22 | 82.11 | 60.09 | 35.32 |
| Ours (Angular) | 79.63 | 65.02 | 43.62 | 27.57 | 69.76 | 54.98 | 33.10 | 18.21 | 89.75 | 81.53 | 64.08 | 40.77 | 87.16 | 83.49 | 63.65 | 40.48 |
| Ours (Euclidean) | 80.04 | 63.37 | 43.21 | 25.51 | 70.33 | 55.78 | 34.48 | 18.33 | 89.64 | 80.74 | 64.75 | 40.77 | 87.16 | 83.37 | 64.56 | 41.28 |
| ConvNeXt | 89.71 | 62.76 | 27.98 | 12.14 | 83.62 | 54.07 | 20.85 | 9.74 | 96.62 | 84.91 | 44.26 | 19.37 | 93.46 | 82.11 | 38.30 | 17.78 |
| Ours (Angular) | 87.86 | 59.88 | 28.81 | 14.20 | 82.70 | 57.85 | 25.09 | 11.34 | 96.73 | 85.92 | 45.61 | 22.18 | 93.35 | 84.63 | 40.71 | 19.61 |
| Ours (Euclidean) | 87.45 | 58.02 | 26.75 | 12.96 | 82.93 | 58.30 | 23.83 | 11.23 | 96.28 | 86.26 | 45.38 | 21.73 | 93.35 | 84.75 | 40.14 | 19.04 |
| DinoV2 | 87.65 | 73.05 | 46.91 | 27.78 | 88.43 | 75.14 | 51.09 | 31.04 | 96.73 | 93.24 | 83.33 | 60.70 | 92.78 | 87.39 | 71.44 | 48.51 |
| Ours (Angular) | 87.04 | 70.99 | 47.53 | 27.37 | 87.29 | 76.63 | 54.75 | 33.56 | 96.40 | 92.79 | 84.46 | 60.59 | 92.89 | 88.53 | 73.17 | 49.89 |
| Ours (Euclidean) | 86.21 | 68.52 | 44.65 | 27.16 | 87.29 | 76.75 | 55.21 | 34.59 | 96.51 | 92.57 | 84.12 | 60.02 | 92.55 | 88.53 | 73.39 | 49.54 |

L1 Limitation section revision

Thank you for the constructive comments. We have addressed these potential issues in the discussion above and will revise the limitations section accordingly. Specifically, we will explicitly acknowledge the computational challenges of computing exact Indra representations and include a discussion of scalable approximation strategies, supported by corresponding experimental validation.

We hope the above response adequately addresses your concerns. Once again, thank you for your thoughtful feedback and for helping us improve the quality of our paper.

Comment

I thank the authors for their comprehensive response. I think the response addresses my concerns. This is an interesting idea, and I believe it deserves to be accepted, so I decided to raise the score to 5.

Comment

Dear Reviewer yaBd,

Thank you very much for recommending acceptance!

We truly appreciate your recognition of our novelty and your thoughtful feedback. Your comments are highly valuable, and we will do our best to revise the manuscript accordingly. Once again, thank you for helping us improve the quality of our work!

Best regards,

The Authors

Review
3

Motivated by recent findings that unimodal foundation models often exhibit strong correlations between their feature representations, this paper introduces the concept of the Indra Representation. Inspired by the ancient notion of Indra’s Net—a metaphysical idea from Eastern philosophy suggesting that each entity is a reflection of all others—the authors formalize a representation that captures the relational structure underlying model features. This formulation provides a unified framework for interpreting and aligning features across different models and modalities. The effectiveness of the proposed representation is validated through experiments conducted across diverse datasets and settings.

Strengths and Weaknesses

Strengths:

  1. Building upon the ancient philosophical concept of Indra’s Net as a foundational metaphor, this work formalizes and instantiates it within a modern machine learning framework. I find this conceptual grounding to be highly novel and intellectually compelling.
  2. The paper presents comprehensive experiments demonstrating that the proposed Indra representation achieves stronger alignment across features from different unimodal foundation encoders, compared to using raw features directly from each encoder. This provides solid empirical support for the effectiveness of the proposed framework.

Weaknesses:

  1. While the proposed representation is theoretically elegant and empirically validated, the paper could benefit from a clearer articulation of its practical utility. Specifically, it remains somewhat unclear in what real-world scenarios such cross-model, cross-modality feature alignment is necessary. Furthermore, the authors do not address why existing well-aligned models, such as CLIP or more recent MLLM like GPT-4o, would not already suffice in such contexts. Providing stronger motivation for why this new representation is needed in practice would strengthen the impact of the work.
  2. The use of number notes such as ① and ② throughout the paper feels informal and somewhat inconsistent with the typical academic writing style expected in formal publications. It may be advisable to adopt more standard notation or formatting for clarity and professionalism.
  3. While the proposed method shows strong performance on the selected benchmarks, the experiments are primarily conducted on datasets with a limited number of categories. In real-world applications, it may be challenging to define the appropriate scope and number of relevant entities for constructing the Indra representation, potentially limiting its scalability and practicality in large-scale or open-domain settings.

Questions

You can directly refer to the weaknesses part.

Limitations

yes

Formatting Issues

no

Author Response

Dear Reviewer enB4,

We sincerely thank you for your constructive comments and are glad to see that you recognize our idea as highly novel and intellectually compelling.

Below, we address your remaining concerns.

W1 Practical utility

Thank you for providing constructive suggestions. We would like to clarify that the primary goal of this work is not to demonstrate that our proposed Indra representations outperform well-aligned models such as CLIP or GPT-4o in practical applications, nor is it to argue that such models are insufficient.

Instead, our core contribution is to investigate what form the learned representations of large-scale unimodal foundation models converge to, after training on unimodal data. Prior studies have shown that such models tend to converge in representation space, but they do not explain what the representations ultimately converge to. Our work offers a new representation convergence hypothesis and provides empirical results to validate this hypothesis.

The practical utility of our proposed Indra representation lies in its ability to improve cross-modality alignment without the need of additional adapter or large-scale finetuning. In domains like medicine or other specialized fields, there may not exist multimodal foundation models, nor is it always feasible to train them. In such settings, leveraging existing unimodal models through our Indra representation offers a compelling solution for cross-modal tasks.

As demonstrated in our experiments, the proposed Indra representation improves performance not only in unimodal tasks, but also across vision-language and speech-language modalities. Furthermore, the added classification experiments show that our representation also enhances classification accuracy and robustness (kindly refer to our response to Reviewer yaBd (W1 & Q3)).

In summary, our work is not intended to replace well-aligned models but to uncover a foundational mechanism that explains how unimodal models can be structurally leveraged or adapted for cross-modal settings, especially where aligned models are unavailable or infeasible.

W2 Notation

We appreciate the reviewer’s feedback regarding the use of numbered notes. We understand that this stylistic choice may come across as informal or inconsistent with conventional academic norms. We have revised the manuscript to adopt a more standard academic style. We thank the reviewer for highlighting this and will ensure the final version adheres to expected standards.

W3 Scalability and practicality

[Application-Oriented (Scalable) Solutions] The scalability concern is addressable in practice. The literature offers a rich body of work on approximating pairwise distances efficiently, including approximate nearest neighbor search (e.g., FAISS, HNSW), landmark-based approximation (e.g., K-means centroids, random subsampling), hashing-based methods (locality-sensitive hashing), and sparsified graph constructions (as suggested by Reviewer 6ioa). From an application view, these techniques can be readily adapted to approximate the Indra representation at scale without sacrificing its structural interpretation.

[Experimental Validation] To evaluate the effectiveness of approximated Indra representations, we adopt a simple landmark-based approximation via random subsampling. We randomly select m < n landmarks as representative samples, reducing the computational complexity to O(nmd) and the memory complexity to O(nm). We assess this approach on two tasks: image classification on the CIFAR-100 dataset and image-text matching on the large-scale CC3M dataset.

For CIFAR-100, we use the standard data split provided by torchvision.datasets and employ logistic regression (linear probing) to evaluate classification accuracy. To assess robustness, we inject Gaussian noise with different standard deviations σ into the extracted features.

For CC3M, we sample 300,000 image-text pairs for evaluation. We extract image features with DINOv2 and text features with RoBERTa. For each image, we compute mean CLIP scores with the top-k candidate texts and compare the performance of the original features against that of the approximated Indra representations.

Results under varying numbers of landmarks are presented below.

| CIFAR-100 | σ=0.0 | σ=1.5 | σ=3.0 | σ=5.0 | σ=7.0 | CIFAR-100 | σ=0.0 | σ=1.5 | σ=3.0 | σ=5.0 | σ=7.0 | # of landmarks |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DinoV2 | 91.97 | 89.24 | 82.21 | 63.06 | 40.16 | DinoV2 | 91.97 | 89.24 | 82.21 | 63.06 | 40.16 | --- |
| Ours (Euclidean) | 90.46 | 88.56 | 82.49 | 66.00 | 40.46 | Ours (Angular) | 91.75 | 89.55 | 84.03 | 67.83 | 46.90 | 1000 |
| Ours (Euclidean) | 91.83 | 90.08 | 84.31 | 71.38 | 49.33 | Ours (Angular) | 92.09 | 90.25 | 85.37 | 72.02 | 53.66 | 2000 |
| Ours (Euclidean) | 91.85 | 89.28 | 84.91 | 70.39 | 52.02 | Ours (Angular) | 92.34 | 90.34 | 85.53 | 73.74 | 55.18 | 3000 |
| Ours (Euclidean) | 92.06 | 90.12 | 85.81 | 71.98 | 54.56 | Ours (Angular) | 92.10 | 90.20 | 85.48 | 73.40 | 55.15 | 5000 |
| Ours (Euclidean) | 92.20 | 90.12 | 85.56 | 73.23 | 56.91 | Ours (Angular) | 92.24 | 89.99 | 85.14 | 72.03 | 53.68 | 10000 |

| | # of landmarks | k=5 | k=10 |
|---|---|---|---|
| DinoV2+RoBERTa | --- | 25219.78 | 25208.54 |
| Ours | 5000 | 28225.07 | 26836.45 |

The experimental results demonstrate that the approximated Indra representations still achieve better classification and alignment performance in real-world applications. Specifically, the Indra representations consistently outperform the original representations under increasing levels of Gaussian noise, particularly at higher noise levels, highlighting their superior robustness. Furthermore, increasing the number of landmarks can yield further improvements in classification accuracy.

We hope the above response adequately addresses your concerns. Once again, thank you for your thoughtful feedback and for helping us improve the quality of our paper!

Comment

Dear Reviewer enB4,

Thank you very much for your time and effort in reviewing our paper.

As the discussion deadline approaches, we would greatly appreciate your thoughts on whether our responses have addressed your concerns and if they have helped in re-evaluating our work. Your feedback is extremely valuable to us, and we would be grateful for any further comments or suggestions you might have.

Thank you once again for your time and consideration!

Sincerely,

The Authors

Comment

Dear Reviewers,

Thank you very much for your thoughtful and constructive feedback on our submission.

We are encouraged to see a consistent recognition of this paper’s novelty across all reviews:

  • Reviewer enB4: "Highly novel and intellectually compelling."

  • Reviewer yaBd: "Fresh and philosophically motivated, distinguishable from traditional approaches."

  • Reviewer 6ioa: "Novel and compelling, an interesting foundation for relational representations."

  • Reviewer 7Tyf: "A novel and elegant conceptualization, providing a thought-provoking new lens."

To facilitate your re-engagement with the paper and address any remaining concerns, we briefly summarize the core contributions and new experimental findings below:


Core Contributions

  1. The Indra Representation Hypothesis: Drawing inspiration from the philosophical metaphor of Indra’s Net, we propose that unimodal foundation models naturally converge toward a form of representation that implicitly reflects the underlying relational structure of reality.

  2. Formalization via Enriched Category Theory: We formalize these representations using V-enriched Yoneda embeddings, ensuring that the resulting representations are unique, complete, and structure-preserving.

  3. Empirical Validation Across Modalities: Extensive experiments demonstrate that Indra representations lead to improved alignment performance without retraining. This effect generalizes beyond unimodal models to vision-language and speech-language settings.


New Findings in Response to Reviewer Suggestions

In response to reviewer feedback, we conducted additional experiments on new tasks and settings. Notably, we find that:

  • Indra representations improve classification performance, demonstrating increased accuracy and robustness even under highly noisy conditions.

  • Landmark-based approximations remain effective on large-scale datasets, enhancing the practicality and scalability of the proposed Indra representations.


We hope these clarifications and additional results address your concerns and further support the significance of our contributions. As the discussion period approaches, we would be very grateful if you could take a moment to review our responses.

We sincerely appreciate your time and dedication in helping improve the quality of our work!

Warm regards,

The Authors

Final Decision

The paper proposes the Indra Representation Hypothesis: unimodal models converge to a relational structure formalized with a V-enriched Yoneda embedding. It builds distance-based “Indra” vectors and shows post-hoc gains for cross-model and cross-modal matching, plus robustness, using landmark/ANN approximations. Two reviewers praise the novelty and the stronger rebuttal, and recommend accept or borderline accept. Two others question practical need versus multimodal models, O(n²) cost and storage, and guidance on metric choice; one barely engaged.

I recommend accept, but this is borderline from my perspective. The idea is fresh and now better supported by added baselines (linear projection, CCA) and runtime/large-scale results. Scaling and metric-choice risks remain, but approximations and sensitivity checks help, and the camera-ready should expand limitations and fix notation. With split reviews and moderate confidence, I still see clear value for the community.