Understanding the Gain from Data Filtering in Multimodal Contrastive Learning
We theoretically analyze the benefit of filtering a noisy training dataset on model performance in multimodal contrastive learning, and identify two regimes with different amounts of gain.
Abstract
Reviews and Discussion
This paper provides a theoretical analysis of teacher-based data filtering in the context of multimodal contrastive learning. It builds on prior linear models of vision-language representation learning and studies how data filtering (particularly using a trained model to score and retain high-quality samples) affects learning performance under a stochastic corruption model.
Strengths and Weaknesses
Strengths
- Theoretical novelty: The result that filtering can outperform an “oracle” that uses only clean samples is surprising and nontrivial. The analysis highlights the interaction between stochastic corruption and contrastive inner product objectives.
- Relevance to practice: Teacher-based filtering is increasingly used in real-world large-scale multimodal pretraining (e.g., CLIP, DataComp), yet lacked formal justification.
- Clarity of analysis: The paper is well-structured and first builds intuition before going into formal details. The progression from baseline to teacher-based methods is systematic.
Weaknesses
- Restrictive assumptions: The analysis relies on linear encoders, Gaussian noise, and exact knowledge of the latent dimension. These assumptions simplify analysis but limit applicability to realistic settings involving deep networks and complex noise.
- Limited empirical validation: While a synthetic experiment is provided, real-world empirical evaluation is absent. The paper refers to past work (e.g., Fang et al.) but doesn’t test its theory in new real-world settings.
- Teacher training is idealized: The teacher is assumed to generalize well in the corrupted regime, which may not hold in practice (especially when the clean fraction η is small). While this is acknowledged, more discussion on when this assumption breaks would help.
Questions
- How limiting is the linear assumption? Or, in other words, how far from real-world experiments are we when using the linear assumption? I understand the necessity/simplicity of using a linear assumption. Nevertheless, I would be curious to hear the authors' opinion on this.
- Is filtering mainly related to the multimodal pairing, or also to noisy samples in general? Does it make a difference at all to distinguish between the two cases?
Limitations
yes
Final Justification
Accept
Paper Formatting Concerns
We thank the reviewer for their detailed and constructive feedback. We appreciate the positive comments about the theoretical novelty of the analysis, and the practical relevance of the problem of data filtering. In the following, we address the concerns raised.
Weaknesses
(1) Restrictive assumptions.
Linearity + Gaussianity: The goal of our work is to provide theoretical justification for the common empirical observation that data filtering using a teacher CLIP model improves performance. The setup of linearity and Gaussian noise provides a tractable setting to study this phenomenon, and is motivated by past works that make the same assumptions [6, 7]. Beyond CLIP filtering, other phenomena in deep learning have also been successfully characterized in the linear setup, e.g., benign overfitting [8] (also known as double descent), scaling laws [11], and implicit bias [14, 15]. Overall, the linear assumption, though not directly applicable in some real cases, provides a useful abstraction that often preserves the primary problem characteristics and helps to build a deeper understanding through a theoretical lens. We believe our work (and the assumptions we make), although simple, is within the typical assumptions made in theoretical analyses of deep learning, which have proven to be productive and insightful (e.g., [8, 11, 14, 15]).
Latent dimension: In practice, the latent dimension (i.e., the embedding space dimension) is typically a design choice and is therefore known at training time. Theoretically, assuming the underlying latent dimension is known allows us to isolate the effects of data filtering from the separate, well-studied problem of subspace rank estimation (e.g., in [12]). Thanks for pointing this out; we think it is useful to add this comment at Line 150.
(2) Limited empirical validation.
The teacher-based filtering (Figure 1c, Algorithm 1) has been relatively well-studied empirically, for instance in [1, 2, 3]. The focus of this paper is to theoretically analyze this approach, and demonstrate a provable benefit. Since this involved no new algorithmic modifications, we relied on synthetic experiments (Figure 3a) and reused a real-data experiment (Figure 3b), instead of a suite of new experiments.
That said, we plan to run small-scale CLIP training experiments (equivalent to the small scale in DataComp [1]) with a focus on data quality, particularly the misalignment in image-text pairs. The main idea of this experiment is to have two data sources (one clean and one noisy), similar to Figure 3(b) replicated from [2]. We can sample data with η and 1-η proportions to create the training dataset. From this, we can observe the variation in the learnt model's performance with varying η.
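For illustration, here is a small sketch of how such a mixed training set could be assembled; the pool names and the sampling routine are hypothetical placeholders, not the planned experimental code:

```python
import random

def mix_sources(clean_pool, noisy_pool, eta, n, seed=0):
    """Build a training set of size n with an eta fraction drawn from the
    clean source and the rest from the noisy source (names are hypothetical)."""
    rng = random.Random(seed)
    n_clean = round(eta * n)
    data = rng.sample(clean_pool, n_clean) + rng.sample(noisy_pool, n - n_clean)
    rng.shuffle(data)
    return data
```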
(3) Teacher training is idealized. What happens when η is small?
Meta comment: We are somewhat unclear about what this question asks. Below we provide a response, but please clarify the question further if this does not answer it. In particular, please also point out which line number you are referring to when you say "while this is acknowledged".
Response: Under the modeling assumptions, the teacher achieves a small error, as shown in Corollary 1 for an appropriate choice of the threshold. We would like to stress the condition in Corollary 1, as it shows that even the current simplistic setup captures that η needs to be large enough for the teacher to learn something meaningful (the rightmost points in Figure 3(a) validate this empirically, via the growing error bars and deviation from the theoretical trend). That said, you are right to believe that the threshold in practice will be much higher than the one predicted by Corollary 1. We believe the below two factors will play a crucial role in this:
- Introduction of non-linearity, since parameter estimation becomes statistically harder with non-linear function classes.
- The model in Section 3.1 only assumes a binary distinction of clean vs. noisy to model data quality. Real datasets display a broader range of data quality across samples.
We can include a discussion about this in the revised manuscript. However, a detailed study would involve a significant modification to the setup in Section 3.1, and falls outside the scope of the current work.
Questions
(1) Comment on the limitations of the linear assumption.
Linear models have been surprisingly successful in explaining important empirical phenomena in highly nonlinear deep neural networks. This includes benign overfitting (also known as double descent) [8], scaling laws [11], and implicit bias [14, 15]. Closest to our problem might be [9], where the authors analyze linear regression to demonstrate the gain of data filtering. For a detailed justification of using linear models to study deep neural networks, we refer to a nice lecture note [16] by Andrea Montanari at Stanford.
Despite these successes, linear models have notable limitations. They assume that learning operates in a regime where the deep neural network can be well approximated by its first-order Taylor expansion. It's crucial to understand that training within this regime is significantly less powerful than fully training neural networks. While it's possible to train neural networks in a way that makes them amenable to this first-order approximation, this often occurs as a side effect of very specific parametrization choices.
This implies that while we can restrict hyperparameter choices to make linear models more accurate at approximating actual deep neural network training dynamics, doing so would likely compromise the practical performance of the resulting models. Although the high-level insights derived from linear modeling have proven surprisingly successful, the specific theorems themselves may not accurately predict the behavior of nonlinear model training. Therefore, the lessons learned from linear models should generally be interpreted qualitatively, serving more as a justification or general guidance rather than providing immediate predictive insights.
In this paper, we observed the same thing with respect to the real data experiment in Figure 3(b). As indicated in lines 73-74, the qualitative message of the improved dependence on η, derived from the linear modeling theory, holds true with nonlinear models as well.
(2) Discussion of data filtering beyond unaligned pairings.
This is an insightful question, thank you for raising this! Indeed, the community is increasingly observing the utility of data filtering in various settings. We will answer this in two parts:
- Data filtering for multimodal data beyond unaligned pairings.
Indeed, beyond unaligned pairings, data filtering is also used to remove other kinds of noisy data (i.e., simply bad image or text quality). For example, empirically, the creation of large-scale datasets (e.g., LAION-5B, DataComp-1B) involved various heuristic-based filters to remove 'bad' samples in individual modalities.
Theoretically, we believe it does make sense to distinguish between these two cases. A better model to study the latter would try to capture that a significant portion of the marginal distributions are individually noisy beyond the faulty pairings. The generative model described in Section 3.1 does not capture this due to a common prior for both modalities.
Overall, this would be akin to studying this case in the new setup described above, and asking: does filtering still help? This is a very valid direction for future work, but formulating the right mathematical assumptions and the subsequent analysis falls outside the current scope.
Note that our main result -- improving the error dependence from 1/η to 1/√η (or better) -- is a statement about solving the pairing problem. This gain cannot be achieved by simply addressing per-modality noise. Hence, it is necessary to distinguish this case from others studied in the past. Of course, the practical reality is a mixture of these two cases (unaligned pairs + noisy samples in individual modalities). In this work, we focus on the alignment problem only.
- Data filtering for general data.
Beyond multimodal data, data filtering has also been studied in the usual statistical framework, assuming access to samples (not necessarily multimodal) [9, 10]. Again, it is worth noting that none of these frameworks capture the pairing problem. Further, these works also largely study the linear case for their theoretical contributions.
References
[1] DataComp: In search of the next generation of multimodal datasets, arXiv:2304.14108.
[2] Data Filtering Networks, arXiv:2309.17425.
[3] CLIPLoss and Norm-Based Data Selection Methods for Multimodal Contrastive Learning, arXiv:2405.19547.
[6] Understanding Multimodal Contrastive Learning and Incorporating Unpaired Data, arXiv:2302.06232.
[7] The Power of Contrast for Feature Learning: A Theoretical Analysis, arXiv:2110.02473.
[8] Benign Overfitting in Linear Regression, arXiv:1906.11300.
[9] Towards a statistical theory of data selection under weak supervision, arXiv:2309.14563.
[10] Iterative Least Trimmed Squares for Mixed Linear Regression, arXiv:1902.03653.
[11] Scaling Laws in Linear Regression: Compute, Parameters, and Data, arXiv:2406.08466.
[12] Optimal Estimation and Rank Detection for Sparse Spiked Covariance Matrices, arXiv:1305.3235.
[14] The Implicit Bias of Gradient Descent on Separable Data, arXiv:1710.10345.
[15] In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning, arXiv:1412.6614.
[16] Six Lectures on Linearized Neural Networks, arXiv:2308.13431.
I would like to thank the authors for the detailed reply and explanations. I raised my score accordingly.
Thank you and please let us know if there are any more questions!
This paper derives bounds for the error in the contrastive learning setup when the data is filtered using a teacher that is trained from the same data.
The contrastive learning setup is simplified to make it amenable to mathematical analysis: the data pairs are assumed to be low-rank signals (i.e., an underlying low-dimensional latent vector is projected into the higher-dimensional space where the signal is observed) to which zero-mean isotropic Gaussian noise is added. With probability η the observed pair is projected from the same latent vector, and with probability 1-η the two views are sampled from independent latent vectors (in both cases drawn from a standard multivariate Gaussian). The contrastive training process is modeled as learning the projection that maps the observations back to the latent space. The data filtering is modeled as first learning a linear projection on half of the data, and then doing threshold-based filtering of the pairs after applying the learnt mapping. Finally, the error is defined as the chordal distance between the subspaces of the learned projection and the generating projection.
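To make the setup concrete, here is a minimal sketch of this generative model and the chordal-distance error metric; the symbol names (latent dimension r, observed dimension d, projections A and B, noise scale sigma) are assumed for illustration and are not necessarily the paper's notation:

```python
import numpy as np

def generate_pairs(n, r, d, eta, sigma=0.1, rng=None):
    """Sample n observation pairs under the assumed bimodal model: with
    probability eta both views share one latent vector, otherwise the two
    views use independent latents (a 'noisy' pair)."""
    rng = np.random.default_rng(rng)
    A = rng.standard_normal((d, r))                  # assumed ground-truth projections
    B = rng.standard_normal((d, r))
    z1 = rng.standard_normal((n, r))
    z2 = rng.standard_normal((n, r))
    clean = rng.random(n) < eta                      # which pairs are aligned
    z_y = np.where(clean[:, None], z1, z2)           # shared latent iff the pair is clean
    X = z1 @ A.T + sigma * rng.standard_normal((n, d))
    Y = z_y @ B.T + sigma * rng.standard_normal((n, d))
    return X, Y, A, B, clean

def chordal_distance(U, V):
    """Chordal distance between the column spaces of U and V."""
    Qu, _ = np.linalg.qr(U)
    Qv, _ = np.linalg.qr(V)
    s = np.linalg.svd(Qu.T @ Qv, compute_uv=False)   # cosines of the principal angles
    return np.sqrt(max(U.shape[1] - np.sum(s**2), 0.0))
```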
In this simplified setup, the authors show that the error scales as 1/η in the unfiltered setup, and as 1/√η (when η is large) or independently of η (when η is small) in the filtered setup. Note that the latter rate is even better than the rate achieved when training on data that is filtered by an oracle (which is also of order 1/√η).
Strengths and Weaknesses
Disclaimer: I am a ML practitioner and have experience with data filtering setup with contrastive models, but I am a novice when it comes to their mathematical analysis. I have read all the parts from the main paper and appendix sections A, B, H – but have skipped appendix sections C-G because I could not give good feedback on the mathematical derivations. I hope that my review of the other parts is still useful and look forward to the discussions from the more mathematically oriented reviewers.
Strengths
- The paper studies a practically relevant problem: contrastive learning is still very relevant, for example most state-of-the-art vision encoders are trained with a contrastive loss on image/text pairs. Data filtering is often an integral part in this context and it's important to understand its effects on the learning process.
- The paper is well structured and clearly written. The assumptions are clearly outlined, the main mathematical findings are presented in a concise manner that makes it possible to understand how the results were derived, while keeping all of the details for the long derivations in the appendix.
- The derivations build on top of an existing literature, like [5] and [22], and extend these previous studies by providing new insights into what happens when the data is filtered using a teacher trained on the same data distribution.
Weaknesses
- From a practical perspective, all the simplifying assumptions that make the setup amenable to mathematical analysis create a large gap with a realistic setup used to train relevant models. It is hard to translate the final results to something other than "data filtering can help in the contrastive learning setup", which is also what is shown qualitatively in Figure 3b. There is a substantial difference between the theoretical results with all the strong simplifications, and a realistic setup like the ones cited in the introduction. Furthermore, the practical setup (Figure 3b) – which does not make reference to the derived bounds anymore but simply shows that performance without filtering decreases more quickly when η becomes smaller – operates at extremely low performance (the best achieved accuracy is below 30%).
Questions
- What would the "Slope = 1" and "Slope = 0.5" lines look like on Figure 3b? Is it possible to translate the error bounds as used in this paper (chordal distances of subspaces) to error rates that can be measured in a more realistic setup? Maybe it would be insightful to only remove some of the simplifications but keep others (e.g. generated data with more realistic non-linear training setup) to see which of the findings hold true?
- The paper argues that in the small-η regime, teacher-based filtering has a better bound than oracle-based filtering. The authors explain in appendix H why this is the case. Under which circumstances could something similar be observed in a more realistic and practically relevant setup? Are there any insights from the presented analysis that would explain when filtering can be harmful?
- The introduction (lines 22-23) states "smaller but higher quality subsets of the data have been observed to result in better models". But there are other studies that show that similar models can be trained directly on noisy large scale datasets (e.g. ALIGN). The submission argues that filtering with a teacher that is trained on the same data distribution is better than oracle-based filtering. Could it be argued that oracle-based filtering is more similar to the manual filtering that was applied to the CLIP data, while teacher-based filtering could in fact be a mechanism that is more similar to a partly trained network that can learn to ignore noisy data from the same distribution?
Small comments:
- Legend of Figure 2: consider using a different notation here, since the current label (set in italic font) reads like a variable.
- Lines 62-69, lines 223-234: Maybe also refer to Section H in this context since that section has some additional information about why something can be learned from corrupted samples.
- Line 541: Even if the code is simple, I would still find it useful to have it published. That would be beneficial to check some implementation details, and useful to build work on top of what is presented in the paper.
Limitations
yes
Final Justification
I remain with my original rating 4=borderline accept.
The authors have been very responsive in the rebuttal period and given clear answers to all of my questions. Even though I'm not too familiar with the subject matter (theoretical analysis), after the review process I am fairly confident that the presented work is technically sound and well anchored in the existing literature.
The reason for not increasing my original rating is mainly that the authors have still not entirely convinced me about the practical usability of the presented bounds – the one weakness that I already mentioned in my original review. My concerns could have been resolved if the authors had shown how the quantitative results from the paper could be applied to improve practically important decisions like "dataset selection" or "valuating data augmentation" (as the authors have mentioned in some of the rebuttals, e.g. the rebuttal to my review). To clarify this point I have created a new thread about the Practical Applicability of the Theoretical Findings, in which the authors have replied that this kind of practical decision cannot be improved directly with the theoretical findings from the paper, but they promised to add more results to the paper that try to bridge the gap partially, which I think would strengthen the paper.
Paper Formatting Concerns
NA
We thank the reviewer for their detailed and constructive feedback. We appreciate the positive comments about the relevance of the theoretical analysis, and the practical impact of data filtering. In the following, we address the concerns raised.
Weaknesses
(1) Connection between theory and practice.
In the following, we will make two points to justify our choices and their usefulness.
1. Why is the linear setup analysis useful?
The goal of our work is to provide theoretical justification for the common empirical observation that data filtering using a teacher CLIP model improves performance. The setup with linearity and Gaussian noise provides a tractable setting to study this phenomenon, and is motivated by past works that make the same assumptions [6, 7]. Beyond CLIP filtering, other phenomena in deep learning have also been successfully characterized in the linear setup, e.g., benign overfitting [8] (also known as double descent), scaling laws [11], and implicit bias [14, 15]. These topics are foundations of what we call the theory of deep learning, and are textbook material in courses taught at top institutions. Overall, the linear assumption, though not directly applicable in some real cases, provides a useful abstraction that often preserves the primary problem characteristics and helps to build a deeper understanding through a theoretical lens. We believe our work (and the assumptions we make), although simple, is within the typical assumptions made in theoretical analyses of deep learning, which have proven to be productive and insightful (e.g., [8, 11, 14, 15]). For a detailed justification of using linear models to study deep neural networks, we refer to a nice lecture note [16] by Andrea Montanari at Stanford.
2. What are the practical implications?
Beyond its theoretical contributions, Theorem 1 also offers some useful guidance to practitioners by formalizing the trade-off between data quantity (n) and quality (via the clean fraction η). Our analysis allows practitioners to move from qualitative intuition to principled, quantitative decision-making in common data curation scenarios. The following two examples illustrate this in detail:
- Dataset selection: Consider a practitioner faced with the choice to pick a dataset among many alternatives. Often the choice involves a large, noisy dataset (large n, small η) versus a smaller, curated one (small n, large η). Further, the high-level aggregates (like n and an approximate η) are usually available, but the actual datasets might be behind a paywall. By plugging these values into the error bounds from Theorem 1 (i.e., comparing which option yields a smaller error), one can make a more informed decision about which dataset is likely to produce a better model.
- Valuating Data Augmentation: When considering whether to invest resources in collecting a small amount of clean data to augment a large, noisy corpus, our theorem provides quantitative guidance. It helps estimate the expected performance gain from the improved η and increased n, allowing a practitioner to assess if the return on investment is worthwhile.
Crucially, the two distinct error regimes identified in Theorem 1 provide non-obvious insights. The η-independent error rate in the low-η regime suggests that for very noisy datasets, simply increasing data volume (n) can be surprisingly effective, a practical takeaway that is not immediately intuitive.
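To illustrate how such a comparison could be carried out, here is a toy sketch in which placeholder rate functions stand in for the exact bounds of Theorem 1; the sqrt(d/n) factor, the constant C, and the cap on the small-η rate are illustrative assumptions rather than the bounds from the paper:

```python
import math

def error_bound(n, eta, d, filtered=True, C=1.0, cap=3.0):
    """Toy stand-in for the Theorem 1 rates: ~1/eta without filtering and
    ~min(1/sqrt(eta), cap) with filtering (an eta-independent floor in the
    small-eta regime). The sqrt(d/n) factor, C, and cap are assumptions."""
    base = C * math.sqrt(d / n)
    if not filtered:
        return base / eta
    return base * min(1.0 / math.sqrt(eta), cap)

# Hypothetical "dataset selection": a large noisy corpus vs. a smaller curated one.
big_noisy   = error_bound(n=1_000_000_000, eta=0.05, d=512)
small_clean = error_bound(n=50_000_000, eta=0.60, d=512)
print("prefer the large noisy corpus" if big_noisy < small_clean
      else "prefer the small curated corpus")
```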
Lastly, in relation to a point raised by the reviewer, we would like to argue that a 30% zero-shot accuracy on ImageNet is non-trivial, since no explicit classifier was trained on top of the CLIP embeddings (ImageNet has 1000 classes, so a random guesser would get 0.1% accuracy).
Questions
(1a) About the theoretical metric (chordal distance) vs. real-world metrics.
Since the metrics in Fig 3(a) and 3(b) are different (note the different y-axes), the "slope 1" and "slope 0.5" lines would not make sense for the metric used in Fig 3(b). Relatedly, note that the metric used in Figure 3(b) is not a theoretical distance like the chordal distance. Instead, it is 1 - ImageNet Accuracy, a downstream metric that practitioners care about.
(1b) Weakening the assumptions (e.g., removing linearity).
Indeed, weakening the assumptions can be insightful by showing the differences in the observed trends. For example, what would Theorem 1 look like without the linearity assumption? A full answer requires non-trivial analysis that falls outside the scope of this work. That said, linear models have been successfully used to explain many phenomena in modern deep learning, e.g., benign overfitting [8] (also known as double descent), scaling laws [11], and implicit bias [14, 15]. The lessons learned from linear models generally serve as a useful qualitative guide to more complex scenarios. In this paper, we observed the same with respect to the real-data experiment in Figure 3(b). As indicated in lines 73-74, the qualitative message of the improved dependence on η, derived from the linear modeling theory, holds true with nonlinear models as well.
(2) Can teacher-based filtering be harmful?
This is a great question, thank you for raising this. Indeed, it is a fruitful direction to study how filtering behaves in setups other than the one presented in Section 3.1. One such idea is relaxing Assumption 2 to a general covariance matrix for the noise. This exercise indicates that teacher-based filtering can introduce an algorithmic bias. That is, the error with data filtering stays bounded away from zero even as n → ∞ (i.e., a statistically inconsistent estimate). Note that the no-filtering algorithm does not suffer from this.
Future work could study the precise behaviour of this bias, and how to mitigate this potential harmful effect of data filtering. For instance, this might reveal that the standardization of data in the right basis could mitigate this (e.g., normalizing the images in certain ways). This is a very interesting direction, but requires a detailed investigation from both theoretical and empirical angles.
(3a) Could it be argued that oracle-based filtering is more similar to the manual filtering that was applied to the CLIP data?
Note that oracle-based filtering assumes knowledge of the ground-truth parameters (Figure 1b). In this sense, it is hard to compare it with manual filtering, which involves applying heuristics like the length of the text, the quality of the image, etc. to filter the dataset. Note that the datasets are huge (billions of samples), meaning that manual filtering cannot involve human annotation. In light of this, we don't expect any manual (heuristic-based) filtering to be able to perform like oracle filtering.
(3b) Teacher-based filtering could in fact be a mechanism that is more similar to a partly trained network that can learn to ignore noisy data from the same distribution?
This is indeed true. Lines 201-203 posit that the teacher learns a useful signal despite the presence of noisy samples, which helps in data filtering. The reason teacher-based filtering can outperform even oracle filtering is the incidental alignment in some noisy samples (Lines 223-225).
Small comments
Thank you for raising points 1 and 2; we will include them in the revised manuscript. For 3, we indeed plan to release the code publicly on GitHub!
References
[1] DataComp: In search of the next generation of multimodal datasets, arXiv:2304.14108.
[2] Data Filtering Networks, arXiv:2309.17425.
[3] CLIPLoss and Norm-Based Data Selection Methods for Multimodal Contrastive Learning, arXiv:2405.19547.
[4] Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP, arXiv:2208.05516.
[5] Improving Multimodal Datasets with Image Captioning, arXiv:2307.10350.
[6] Understanding Multimodal Contrastive Learning and Incorporating Unpaired Data, arXiv:2302.06232.
[7] The Power of Contrast for Feature Learning: A Theoretical Analysis, arXiv:2110.02473.
[8] Benign Overfitting in Linear Regression, arXiv:1906.11300.
[9] Towards a statistical theory of data selection under weak supervision, arXiv:2309.14563.
[10] Iterative Least Trimmed Squares for Mixed Linear Regression, arXiv:1902.03653.
[11] Scaling Laws in Linear Regression: Compute, Parameters, and Data, arXiv:2406.08466.
[12] Optimal Estimation and Rank Detection for Sparse Spiked Covariance Matrices, arXiv:1305.3235.
[13] Going Beyond Nouns With Vision & Language Models Using Synthetic Data, arXiv:2303.17590.
[14] The Implicit Bias of Gradient Descent on Separable Data, arXiv:1710.10345.
[15] In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning, arXiv:1412.6614.
[16] Six Lectures on Linearized Neural Networks, arXiv:2308.13431.
I would like to thank the authors for their interesting and thoughtful rebuttal to my review, which answered most of my questions satisfactorily.
I noticed that all reviewers raised questions about the degree to which the theoretical findings can be applied in a practical setting, so I created a new thread Practical Applicability of the Theoretical Findings for discussion.
Other than the points raised in that separate thread, I only have one remaining question with respect to the response (2) Can teacher-based filtering be harmful?: The authors mention in their rebuttal that the exercise of relaxing Assumption 2 to a general covariance matrix for the noise indicates that teacher-based filtering can introduce an algorithmic bias. Where exactly can I find this derivation in the manuscript? I think the discussion of this point (which was also mentioned in the rebuttal to reviewer AyoJ), and its practical relevance, might be an interesting addition to the paper.
Thank you for your response, and we are glad to have answered most of your questions. Thanks also for creating the common thread. We will respond to those questions raised in the thread itself.
Regarding the algorithmic bias, as mentioned in our rebuttal response, this is an interesting direction but requires a detailed investigation from both theoretical and empirical angles. We do not have this in the submitted manuscript. After submission, we worked on the analysis for the case where the noise is not isotropic, and identified that this creates an extra term in the error that corresponds to a bias. Since this analysis is very similar to the one in the appendix of the submission (but with a more general assumption on the noise covariance), we are planning to add this result in the revision. At a high level, for a general noise covariance, the error after filtering in Theorem 1 gains an additional bias term.
Both the bias and the original error terms also depend on the dimensions, but we suppress that for clarity. Importantly, the bias does not depend on n, but it goes to zero when η = 1 (since we only have clean data in that case), when the threshold is set so that nothing is filtered out (since that corresponds to no filtering), and when the noise covariance is isotropic.
We agree that this will be an interesting addition to the literature of data filtering, and we will add discussion of this result and its practical relevance in the revision also. However, empirical investigation of this in the realistic setting is outside the scope of this paper. We believe it is an important topic of interest for future research with respect to how such theoretical insights can help improve data filtering in practice.
This paper explores the advantage of data filtering in improving the performance of contrastive learning in multimodal representation learning. Specifically, the paper provides a theoretical explanation for the empirical success of teacher-based data filtering, which uses a pre-trained model to filter out low-quality data from large-scale multimodal datasets. The authors establish bounds on the performance improvement obtained through data filtering and demonstrate the advantages of this method over traditional contrastive learning without filtering.
Strengths and Weaknesses
Strength:
- The paper provides a detailed theoretical analysis of the benefits of data filtering in contrastive learning.
- The idea that removing low-quality data pairs improves multimodal contrastive learning is intuitive, and its analysis has great potential.
Weakness:
- The idea of filtering out low-quality data pairs to improve multimodal contrastive learning is highly intuitive. It's a common approach in many domains, like active learning, where selecting high-quality data is known to improve model performance. While the theoretical analysis is valuable, it doesn’t lead to the development of any new or better algorithms. The potential for algorithmic innovation stemming from the theory is not explored in the paper. How to leverage this theory to design more efficient or scalable algorithms for data filtering in real-world scenarios?
- The theoretical results provided by the authors are not sufficiently linked to real-world applications. It remains unclear how the theory could directly inform practices in real-world cases. For instance, how would one practically implement a better teacher-based filtering approach?
- The experiments are conducted only on synthetic data and one real-world dataset. A more diverse set of experiments is needed to validate the theoretical results.
Questions
Please check the former section.
Limitations
Please check the former section.
Final Justification
After the rebuttal, my major concerns are addressed by the authors, and I keep my positive score.
Paper Formatting Concerns
no formatting concerns
We thank the reviewer for their detailed and constructive feedback. We appreciate the positive comment about the comprehensiveness of the theoretical contributions, and also about the surprising finding of the independent-of-η regime in Theorem 1. In the following, we address the concerns raised.
(1) How to use the theory to get better algorithms?
This is a great question, thank you for raising this. To the best of our knowledge, this is the first work that provably shows a benefit of teacher-based data filtering. We believe this is a significant contribution, which lays the necessary theoretical groundwork that future algorithmic advancements can build upon.
One such idea is relaxing Assumption 2 to a general covariance matrix for the noise. This exercise indicated that teacher-based filtering can introduce an algorithmic bias. That is, the error with data filtering stays bounded away from zero even as n → ∞ (i.e., a statistically inconsistent estimate). Future work could study the precise behaviour of this bias, which could in turn lead to algorithmic improvements. For instance, this might reveal that standardization of the data in the right basis could mitigate it (e.g., normalizing the images in certain ways). This is a very interesting direction, but requires a detailed investigation from both theoretical and empirical angles.
(2) What are the practical implications?
Beyond its theoretical contributions, Theorem 1 also offers some useful guidance to practitioners by formalizing the trade-off between data quantity (i.e., n) and quality (via the clean fraction η). This allows practitioners to move from qualitative intuition to quantitative decision-making in common data curation scenarios. The following two examples illustrate this in detail:
- Dataset selection: Consider a practitioner faced with the choice to pick a dataset among many alternatives. Often the choice involves a large, noisy dataset (large n, small η) versus a smaller, curated one (small n, large η). Further, the high-level aggregates (like n and an approximate η) are usually available, but the actual datasets might be behind a paywall. By plugging these values into the error bounds from Theorem 1 (i.e., comparing which option yields a smaller error), one can make a more informed decision about which dataset is likely to produce a better model.
- Valuating Data Augmentation: When considering whether to invest resources in collecting a small amount of clean data to augment a large, noisy corpus, our theorem provides quantitative guidance. It helps estimate the expected performance gain from the improved η and increased n, allowing a practitioner to assess if the return on investment is worthwhile.
Crucially, the two distinct error regimes identified in Theorem 1 provide non-obvious insights. The η-independent error rate in the low-η regime suggests that for very noisy datasets, simply increasing data volume (n) can be surprisingly effective, a practical takeaway that is not immediately intuitive.
(3) Limited empirical validation.
The teacher-based filtering (Figure 1c, Algorithm 1) has been relatively well-studied empirically, for instance in [1, 2, 3]. The focus of this paper is to theoretically analyze this approach, and demonstrate a provable benefit. Since this involved no new algorithmic modifications, we relied on synthetic experiments (Figure 3a) and reused a real-data experiment (Figure 3b), instead of a suite of new experiments.
That said, we plan to run small-scale CLIP training experiments (equivalent to the small scale in DataComp [1]) with a focus on data quality, particularly the misalignment in image-text pairs. The main idea of this experiment is to have two data sources (one clean and one noisy), similar to Figure 3(b) replicated from [2]. We can sample data with η and 1-η proportions to create the training dataset. From this, we can observe the variation in the learnt model's performance with varying η.
References
[1] DataComp: In search of the next generation of multimodal datasets, arXiv:2304.14108.
[2] Data Filtering Networks, arXiv:2309.17425.
[3] CLIPLoss and Norm-Based Data Selection Methods for Multimodal Contrastive Learning, arXiv:2405.19547.
[4] Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP, arXiv:2208.05516.
[5] Improving Multimodal Datasets with Image Captioning, arXiv:2307.10350.
[6] Understanding Multimodal Contrastive Learning and Incorporating Unpaired Data, arXiv:2302.06232.
[7] The Power of Contrast for Feature Learning: A Theoretical Analysis, arXiv:2110.02473.
[8] Benign Overfitting in Linear Regression, arXiv:1906.11300.
[9] Towards a statistical theory of data selection under weak supervision, arXiv:2309.14563.
[10] Iterative Least Trimmed Squares for Mixed Linear Regression, arXiv:1902.03653.
[11] Scaling Laws in Linear Regression: Compute, Parameters, and Data, arXiv:2406.08466.
[12] Optimal Estimation and Rank Detection for Sparse Spiked Covariance Matrices, arXiv:1305.3235.
[13] Going Beyond Nouns With Vision & Language Models Using Synthetic Data, arXiv:2303.17590.
[14] The Implicit Bias of Gradient Descent on Separable Data, arXiv:1710.10345.
[15] In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning, arXiv:1412.6614.
Thanks for the authors' response. Some of my concerns are addressed. But I haven't been following the recent literature on this topic very closely, so I would recommend that the AC place more emphasis on the feedback from the other reviewers.
Thank you for your response and please let us know if there are any specific concerns still remaining, that you would like us to elaborate on!
This paper studies the theory of teacher-based data filtering for multimodal contrastive training. The study analyzes the performance of filtered contrastive learning within the standard bimodal data generation framework. Compared with unfiltered contrastive learning, whose error grows in proportion to the inverse of the clean-pair fraction, teacher-based filtering reduces this dependence. In highly corrupted settings (i.e., a very small fraction of clean data), the teacher-based filtering setup achieves an error rate that no longer depends on the clean data fraction. Synthetic experiments validate the corollary and theorem, and supporting real experiments from Fang et al. [10] also show that the qualitative conclusions drawn from the theory hold with real image-text data too.
Strengths and Weaknesses
Strengths:
- Novel theoretical analysis of data filtering for multimodal contrastive learning. Data filtering has been widely adopted in contrastive learning. This paper analysed theoretically why teacher-based filtering improves multimodal contrastive learning. The finding mentioned in #2, that filtering can surpass the "oracle" that uses only clean data when the clean data fraction is low, is surprising and novel.
- This paper includes validation experiments with synthetic data, showing that results from synthetic experiments align well with the theoretical rates from the corollary and the theorem. They also connect the results with the results from real experiments from Fang et al. [10].
Weaknesses:
- The experimental evidence is limited: the synthetic experiments may not capture all factors that matter in real-world datasets, and the real experiments in Figure 3(b) are created with an artificial setup that mixes CC-12M (high quality) and CommonPool (low quality). As a result, it is unclear how the theoretical analysis translates to actual improvements in downstream tasks or metrics.
- The selected two-stage training procedure is very practical and relies on a lot of empirical iteration. The theoretical guarantees rely on an optimal threshold that depends on unknown data parameters (e.g., the clean data fraction). In practice, with noisy web-scale data, choosing this threshold is non-trivial: one must ensure that neither too many bad samples are kept nor too many good samples are dropped. It would be nice to demonstrate how the analysis could be useful for practitioners.
Questions
See questions listed in the above weaknesses section.
Limitations
yes
Final Justification
I appreciate the value of theoretical analysis. The additional experiments from the authors showed the robustness of choosing a reasonable threshold θ. I also read other reviewers' comments and noticed the shared point around practical applicability as summarized by reviewer qb7P. Taking that into consideration, this paper still falls slightly above the acceptance threshold and I would keep my original rating.
Paper Formatting Concerns
No Paper Formatting Concerns
We thank the reviewer for their detailed and constructive feedback. We appreciate the positive comments about the novelty of the theoretical contributions, and about the surprising finding of the independent-of-η regime in Theorem 1. In the following, we address the concerns raised.
(1) Experiments are in a controlled setting, theory's applicability to real data and real metrics is unclear.
It is true that both Figures 3(a) and 3(b) explicitly control for the clean fraction η, and real data in the wild displays other factors of variation (e.g., data quality in the individual modalities -- image and text). Firstly, we'd like to point out that teacher-based filtering is known to work well in the wild too, with real data (e.g., [1, Figure 2], [2, 3]).
The controlled experimental setting in this paper was a deliberate choice to rigorously isolate the dependence on η. Because we varied only η in Figure 3(b), it demonstrates (with real data) that teacher-based filtering can improve the dependence of downstream performance on η. Note the y-axis for Figure 3(b) is 1 - ImageNet Accuracy, a key downstream metric that practitioners care about. This empirically validates what our theory predicts: (1) the fraction of aligned pairs η is a critical factor that determines downstream model performance, and (2) teacher-based filtering improves the dependence on η.
On the theoretical side, we believe it is indeed possible to extend the framework to include other factors of variation for a more complete picture. For instance, one could try to capture that a significant portion of the marginal distributions are individually noisy beyond the faulty pairings (i.e. many images and captions are simply garbage). This is an interesting and useful direction, but capturing this with the right mathematical assumptions and the subsequent analysis falls outside the scope of the current work.
(2a) The theory uses the optimal threshold θ; how to select it practically?
Indeed, the choice of the filtering threshold is an important hyperparameter and requires care in its selection. Let us explain two points in response to this query:
- We note that the error achieved by teacher-based filtering is fairly robust to the choice of θ. Our synthetic experiment in Figure 3(a) was conducted with a fixed, untuned threshold. Further, we conducted an experiment measuring the sensitivity of the final error with respect to the choice of θ. The experiment details and results are below (###).
- In practice, this scalar hyperparameter can be tuned like other commonly used hyperparameters in machine learning. For example, convex optimization theory states its theorems for the optimal learning rate, and in practice the learning rate needs to be tuned. Similarly, we state our theorem with the optimal threshold θ, and in practice the threshold needs to be tuned. In practice, one should tune the size of the retained dataset rather than the parameter θ directly (see the sketch right after this list). A practical rule of thumb is to filter out between 70% and 80% of the training data in the case of internet-scale datasets [1].
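As a concrete illustration of tuning the size of the retained dataset rather than θ itself, here is a minimal sketch; the teacher scores are simulated and the function name is a hypothetical choice, not code from the paper:

```python
import numpy as np

def filter_by_retention(scores, keep_frac=0.25):
    """Keep the top `keep_frac` of pairs by teacher similarity score; this
    implicitly sets the threshold theta to the (1 - keep_frac) quantile of
    the empirical score distribution."""
    theta = np.quantile(scores, 1.0 - keep_frac)
    return scores >= theta, theta

# Hypothetical usage: scores[i] would be the teacher's inner-product similarity
# for pair i; here random numbers serve as a stand-in.
scores = np.random.default_rng(0).standard_normal(10_000)
mask, theta = filter_by_retention(scores, keep_frac=0.25)
print(f"retained {mask.mean():.0%} of pairs at threshold {theta:.3f}")
```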
### Experiment for error robustness to the choice of filtering threshold θ: In the setting of Figure 3(a), we fix η to a value in line with the empirically observed clean fraction in CLIP data [1] and vary the filtering threshold θ of the teacher-based filtering. Note that we implicitly vary θ by explicitly varying the fraction of data retained in the filtering step. The table below shows that the error of teacher-based filtering is relatively flat for values of θ in the vicinity of the optimal threshold. An analogous experiment on real data [1, Figure 2] makes a similar observation.
| Fraction of data retained | Mean error ( 1 ) |
|---|---|
| 1% | () |
| 10% | () |
| 20% | () |
| 30% | () |
| 40% | () |
| 50% | () |
| 100% | () |
(2b) What are the practical implications of the analysis?
Beyond its theoretical contributions, Theorem 1 can offer some useful guidance to practitioners by formalizing the trade-off between data quantity (n) and quality (via the clean fraction η). This allows practitioners to move from qualitative intuition to principled, quantitative decision-making in common data curation scenarios. The following examples illustrate this in detail:
- Dataset selection: Consider a practitioner faced with the choice to pick a dataset among many alternatives. Often the choice involves a large, noisy dataset (large n, small η) versus a smaller, curated one (small n, large η). Further, let's assume that the dataset provider also provides an approximation of the data quality in terms of an estimated η, but the actual datasets might be behind a paywall. By plugging these values into the error bounds from Theorem 1 (i.e., comparing which option yields a smaller error), one can make a more informed decision about which dataset is likely to produce a better model.
- Valuating Data Augmentation: When considering whether to invest resources in collecting a small amount of clean data to augment a large, noisy corpus, our theorem provides quantitative guidance. It helps estimate the expected performance gain from the increased η and increased n, allowing a practitioner to assess if the return on investment is worthwhile.
Crucially, the two distinct error regimes identified in Theorem 1 provide non-obvious insights. The η-independent error rate in the low-η regime suggests that for very noisy datasets, simply increasing data volume (n) can be surprisingly effective, a practical takeaway that is not immediately intuitive.
References
[1] DataComp: In search of the next generation of multimodal datasets, arXiv:2304.14108.
[2] Data Filtering Networks, arXiv:2309.17425.
[3] CLIPLoss and Norm-Based Data Selection Methods for Multimodal Contrastive Learning, arXiv:2405.19547.
[4] Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP, arXiv:2208.05516.
[5] Improving Multimodal Datasets with Image Captioning, arXiv:2307.10350.
[6] Understanding Multimodal Contrastive Learning and Incorporating Unpaired Data, arXiv:2302.06232.
[7] The Power of Contrast for Feature Learning: A Theoretical Analysis, arXiv:2110.02473.
[8] Benign Overfitting in Linear Regression, arXiv:1906.11300.
[9] Towards a statistical theory of data selection under weak supervision, arXiv:2309.14563.
[10] Iterative Least Trimmed Squares for Mixed Linear Regression, arXiv:1902.03653.
[11] Scaling Laws in Linear Regression: Compute, Parameters, and Data, arXiv:2406.08466.
[12] Optimal Estimation and Rank Detection for Sparse Spiked Covariance Matrices, arXiv:1305.3235.
[13] Going Beyond Nouns With Vision & Language Models Using Synthetic Data, arXiv:2303.17590.
[14] The Implicit Bias of Gradient Descent on Separable Data, arXiv:1710.10345.
[15] In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning, arXiv:1412.6614.
I thank the authors for the response. I appreciate the value of theoretical analysis. The additional experiments showed the robustness of choosing a reasonable threshold θ. I also read other reviewers' comments and noticed the shared point around practical applicability as summarized by reviewer qb7P. I will update the final justification accordingly.
As fewer than 48 hours are left in the discussion period, we would like to follow up on this. Please let us know if you have any unresolved questions or further concerns, and we would be happy to answer!
The paper investigates the benefits of using teacher-based data filtering in multimodal contrastive learning. It provides a theoretical analysis of how data filtering can improve the performance of contrastive learning models when dealing with noisy, web-scale datasets. The authors characterize the performance of filtered contrastive learning under a standard bimodal data generation model and show that teacher-based filtering can significantly reduce the error dependence on the fraction of clean data. Specifically, they demonstrate that the error rate can improve from 1/η to 1/√η in the large η regime and become independent of η in the small η regime, where η represents the fraction of correctly matched modality pairs. The paper also includes synthetic experiments that validate the theoretical findings and highlight the effectiveness of teacher-based filtering in enhancing model robustness and accuracy. The contributions provide a deeper understanding of the role of data quality in multimodal representation learning and offer insights into practical data curation strategies.
Strengths and Weaknesses
Strength:
- This paper provides a theoretical explanation of the benefits of data filtering in multimodal contrastive learning. By analyzing the error dependence of contrastive learning before and after filtering, it proves the effectiveness of teacher-model-guided data filtering in improving data quality.
- This paper shows that when the data quality is poor, teacher-guided data filtering can achieve an error rate independent of data quality, which breaks the dependence on data quality in traditional contrastive learning and provides a new perspective for dealing with large-scale noisy data.
- The theoretical results are verified by synthetic experiments. The experimental results are consistent with the theoretical analysis, which further confirms the performance improvement of data filtering guided by the teacher model in different data quality scenarios, and provides guidance for data filtering strategies in practical applications.
Weakness:
1. For some emerging techniques related to data filtering, such as diffusion model-based data augmentation methods, the discussion may not be in-depth enough. These methods may have potential advantages when dealing with noisy data, but the paper does not compare the differences and connections between these methods and teacher model-guided data filtering in multimodal contrastive learning in detail.
2. The paper assumes a linear contrastive learning framework and a Gaussian noise model, which may be limited in practical applications. For example, the noise in real data may have a more complex distribution, and the relationship between multimodal data may not be completely linear.
3. The teacher-model-guided data filtering methods mentioned in the paper rely on a pre-trained teacher model, which may require additional computational resources and data to train and may not be practical enough for resource-limited scenarios. And how much impact do different pre-trained teacher models have on the results? Moreover, the paper does not discuss in detail how to select the optimal filtering threshold, which may be a critical issue in practical applications.
4. To better demonstrate the effectiveness of the approach, one may consider conducting experiments on real-world datasets to evaluate the performance in more complex and diverse data environments.
- In addition, only the influence of data quality on model performance is considered in the experiment, and the influence of other factors (such as data volume, model architecture, etc.) on the results is not explored. The experiment can be further extended to analyze the interaction between these factors and data quality.
Questions
see weaknesses.
Limitations
see weaknesses.
Final Justification
The author has addressed some of my concerns and I will maintain the score.
Paper Formatting Concerns
no
We thank the reviewer for their detailed and constructive feedback, and we appreciate the positive comments about the theoretical contributions of the paper. In the following, we address the concerns raised.
(1) Did not compare with synthetic data augmentation methods.
The empirical study of data curation for multimodal learning has progressed rapidly, from (i) filtering data using a teacher model (e.g., [2, 3]), to (ii) augmenting with synthetic data (e.g., [13]); and finally to (iii) the combination of the above [5]. Meanwhile, theoretical understanding of the above progression has lagged significantly behind. Our contribution is to build a foundational understanding of the first phase, i.e., teacher-based filtering.
However, we believe some of the models we study can be extended to study the second and third phases. For instance, synthetically generated 'clean' pairs could be modeled as new samples augmented to the original dataset, with the original and synthetic samples following different distributions through distinct priors on the latent variable and/or different properties of the noise. By controlling the distribution of the synthetic samples, one could attempt to explain some empirically observed phenomena, such as that adding synthetic data helps only up to some point, after which the gain either plateaus or starts to hurt performance. This is likely a fruitful direction for future research, but a rigorous investigation of these more complex scenarios is beyond the scope of this paper.
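To make this modeling idea concrete, below is a small hypothetical sketch; the shifted prior, the noise scale sigma, the function name, and the projections A and B are illustrative assumptions, not a construction taken from the paper:

```python
import numpy as np

def sample_synthetic_pairs(m, A, B, prior_shift=0.5, sigma=0.05, rng=None):
    """Hypothetical model of synthetic 'clean' pairs: always aligned, but with a
    shifted latent prior and a different noise level than the web data.
    A and B are the (assumed) ground-truth projections of the two modalities."""
    rng = np.random.default_rng(rng)
    r = A.shape[1]
    z = prior_shift + rng.standard_normal((m, r))      # distinct prior on the latent
    X_syn = z @ A.T + sigma * rng.standard_normal((m, A.shape[0]))
    Y_syn = z @ B.T + sigma * rng.standard_normal((m, B.shape[0]))
    return X_syn, Y_syn

# One could then train on the original noisy pairs concatenated with
# (X_syn, Y_syn), sweeping m to probe when synthetic data stops helping.
```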
(2) Real data may not satisfy the assumptions about linearity and Gaussian noise.
The goal of our work is to provide theoretical justification for the common empirical observation that data filtering using a teacher CLIP model improves performance. The setup with linearity and Gaussian noise provides a tractable setting to study this phenomenon, and is motivated by past works that make the same assumptions [6, 7]. Beyond CLIP filtering, other phenomena in deep learning have also been successfully characterized in the linear setup, e.g., benign overfitting [8] (also known as double descent), scaling laws [11], and implicit bias [14, 15]. These topics are foundations of what we call the theory of deep learning, and are textbook material in courses taught at top institutions. Overall, the linear assumption, though not directly applicable in some real cases, provides a useful abstraction that often preserves the primary problem characteristics and helps to build a deeper understanding through a theoretical lens. We believe our work (and the assumptions we make), although simple, is within the typical assumptions made in theoretical analyses of deep learning, which have proven to be productive and insightful (e.g., [8, 11, 14, 15]).
(3a) Reliance on a pre-trained teacher model which requires additional data and compute.
Our analysis does not assume the need for an external pre-trained teacher model. As mentioned in Lines 60-61 and detailed in Algorithm 1, our framework involves training the teacher also using the given (noisy) dataset. That is, the "Train-Filter-Train" approach uses the same initial dataset, making it a self-contained approach that reflects real-world usage.
One subtle difference is regarding sample splitting. For theoretical convenience (as mentioned in Lines 204-206), we split the initial dataset into two halves of n/2 samples each, to leverage the independence of random variables. In practice, such data splitting is not necessary, and it works just fine to use the entire dataset to train the teacher model and then filter the same dataset to train the student model.
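For concreteness, here is a minimal sketch of the train-filter-train pipeline in the linear setting; the estimator in fit_linear_encoders and the quantile-based thresholding are simplified assumptions for illustration, not the exact procedure of Algorithm 1:

```python
import numpy as np

def fit_linear_encoders(X, Y, r):
    """Crude stand-in for the linear contrastive step: estimate a rank-r
    alignment between the two modalities from the empirical cross-covariance.
    (The paper's actual estimator may differ.)"""
    C = X.T @ Y / len(X)
    U, _, Vt = np.linalg.svd(C)
    return U[:, :r], Vt[:r].T                # left/right linear maps

def train_filter_train(X, Y, r, keep_frac=0.25):
    n = len(X)
    # Step 1: train the teacher on one half of the (noisy) data.
    Xa, Ya, Xb, Yb = X[: n // 2], Y[: n // 2], X[n // 2 :], Y[n // 2 :]
    U_t, V_t = fit_linear_encoders(Xa, Ya, r)
    # Step 2: score the other half by teacher similarity; keep the top fraction.
    scores = np.sum((Xb @ U_t) * (Yb @ V_t), axis=1)
    mask = scores >= np.quantile(scores, 1.0 - keep_frac)
    # Step 3: retrain the student on the filtered subset only.
    return fit_linear_encoders(Xb[mask], Yb[mask], r)
```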
On the other hand, as you correctly pointed out, the approach does require additional data and compute (in our case, 2x, since the teacher and student require equal resources, being trained on n/2 samples each). This need for additional data/compute aligns with practical usage (e.g., [1, 2, 3]).
(3b) How to select the optimal threshold θ practically?
This is a great question, thank you for raising this. We would like to expand on the choice of the filtering threshold by explaining two points:
- We note that the error achieved by teacher-based filtering is fairly robust to the choice of θ. Our synthetic experiment in Figure 3(a) was conducted with a fixed, untuned threshold. Further, we conducted an experiment measuring the sensitivity of the final error with respect to the choice of θ. The experiment details and results are below (###).
- In practice, this scalar hyperparameter can be tuned like other commonly used hyperparameters in machine learning. For example, convex optimization theory states its theorems for the optimal learning rate, and in practice the learning rate needs to be tuned. Similarly, we state our theorem with the optimal threshold θ, and in practice the threshold needs to be tuned. In practice, one should tune the size of the retained dataset rather than the parameter θ directly. A practical rule of thumb is to filter out between 70% and 80% of the training data in the case of internet-scale datasets [1].
### Experiment for error robustness to the choice of filtering threshold θ: In the setting of Figure 3(a), we fix the clean fraction η to match the value empirically observed in CLIP data [1] and vary the filtering threshold θ of the teacher-based filtering. Note that we vary θ implicitly by explicitly varying the fraction of data retained in the filtering step. The table below shows that the error of teacher-based filtering is relatively flat for values of θ in the vicinity of the optimal threshold. An analogous experiment on real data [1, Figure 2] makes a similar observation.
| Fraction of data retained | Mean error ( 1 ) |
|---|---|
| 1% | () |
| 10% | () |
| 20% | () |
| 30% | () |
| 40% | () |
| 50% | () |
| 100% | () |
(4) Limited experimental validation.
The teacher-based filtering approach (Figure 1c, Algorithm 1) has been relatively well studied empirically, for instance in [1, 2, 3]. The focus of this paper is to theoretically analyze this approach and demonstrate a provable benefit. Since it involves no new algorithmic modifications, we relied on synthetic experiments (Figure 3a) and reused a real-data experiment (Figure 3b) instead of running a suite of new experiments.
That said, we plan to run small-scale CLIP training experiments (equivalent to the small scale in DataComp [1]) with a focus on data quality, particularly on misalignment in image-text pairs. The main idea of this experiment is as follows:
- Similar to Figure 3(b), replicated from [2], we would like to have two data sources: one clean and one noisy. We can mix samples from the two sources in proportions η and 1 − η to create the training dataset, and observe how the learnt model's performance varies with η (a minimal sketch of this mixing step is given after this list).
- Further, we plan to study the effect of changing the clean data source (similar to the idea in [4]), model architecture and size, and the total dataset size.
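A minimal sketch of the planned mixing step is below; the id lists are toy placeholders standing in for the clean and noisy image-text sources, and the sizes and seed are arbitrary.

```python
# Hedged sketch of the mixing protocol: draw an eta fraction of the training set
# from a clean source and the rest from a noisy source, then shuffle.
import random

def mix_sources(clean, noisy, n_total, eta, seed=0):
    rng = random.Random(seed)
    n_clean = round(eta * n_total)
    sample = rng.sample(clean, n_clean) + rng.sample(noisy, n_total - n_clean)
    rng.shuffle(sample)
    return sample

# Example with toy id lists standing in for the two data sources.
clean_ids = [f"clean_{i}" for i in range(1000)]
noisy_ids = [f"noisy_{i}" for i in range(5000)]
train_ids = mix_sources(clean_ids, noisy_ids, n_total=2000, eta=0.3)
```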
(5) Other factors like data volume, model architecture, etc. not analyzed.
The central research question of our paper is to isolate and understand the specific role of the clean data fraction η, a critical but less understood variable in the context of web-scale datasets. Our goal is to show that the train-filter-train paradigm achieves a better dependence on η compared to the baseline of [6]. To study its effect rigorously, we deliberately kept other factors constant in our experiments. However, we agree that factors like data volume and model architecture are important variables in model performance (note that the theory does capture the dependence on data volume, i.e., the number of samples).
References
[1] DataComp: In search of the next generation of multimodal datasets, arXiv:2304.14108.
[2] Data Filtering Networks, arXiv:2309.17425.
[3] CLIPLoss and Norm-Based Data Selection Methods for Multimodal Contrastive Learning, arXiv:2405.19547.
[4] Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP, arXiv:2208.05516.
[5] Improving Multimodal Datasets with Image Captioning, arXiv:2307.10350.
[6] Understanding Multimodal Contrastive Learning and Incorporating Unpaired Data, arXiv:2302.06232.
[7] The Power of Contrast for Feature Learning: A Theoretical Analysis, arXiv:2110.02473.
[8] Benign Overfitting in Linear Regression, arXiv:1906.11300.
[9] Towards a statistical theory of data selection under weak supervision, arXiv:2309.14563.
[10] Iterative Least Trimmed Squares for Mixed Linear Regression, arXiv:1902.03653.
[11] Scaling Laws in Linear Regression: Compute, Parameters, and Data, arXiv:2406.08466.
[12] Optimal Estimation and Rank Detection for Sparse Spiked Covariance Matrices, arXiv:1305.3235.
[13] Going Beyond Nouns With Vision & Language Models Using Synthetic Data, arXiv:2303.17590.
[14] The Implicit Bias of Gradient Descent on Separable Data, arXiv:1710.10345.
[15] In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning, arXiv:1412.6614.
Thanks for the responses. I will maintain my score.
Reviewer cV4q mentioned that the linear contrastive learning framework with a Gaussian noise model may be limited in practical applications and suggested conducting experiments on real-world datasets. Reviewer vAKC mentioned that it is unclear how the theoretical analysis translates to actual improvements in downstream tasks or metrics. Reviewer AyoJ criticized that the theoretical results are not sufficiently linked to real-world applications. Reviewer qb7P wondered whether the simplifying assumptions create a large gap with a realistic setup and whether the findings might be hard to translate. Reviewer M3vJ finally remarked that the restrictive assumptions simplify the analysis but limit applicability to realistic settings.
From all these reviews, I think the point about practical applicability of the findings is key to the review of this paper, and for this reason I think it makes sense to discuss these points separately from the individual reviews, hoping that the authors and other reviewers join for a fruitful discussion.
In response to these points, the authors have replied with the following arguments (copied from the rebuttal to qb7P):
1. Why is analysis in the linear setup useful?
The goal of our work is to provide theoretical justification for the common empirical observation that data filtering using a teacher CLIP model improves performance. The setup with linearity and Gaussian noise provides a tractable setting to study this phenomenon, and is motivated by past works that make the same assumptions [6, 7]. Beyond CLIP filtering, other phenomena in deep learning have also been successfully characterized in the linear setup, e.g., benign overfitting [8] (also known as double descent), scaling laws [11], and implicit bias [14, 15]. These topics are foundations of what we call the Theory of Deep Learning, and are textbook material in courses taught at top institutions. Overall, the linear assumption, though not directly applicable in some real cases, provides a useful abstraction that often preserves the primary problem characteristics and helps to build a deeper understanding through a theoretical lens. We believe our work (and the assumptions we make), although simple, is within the typical assumptions made in theoretical analyses of deep learning, which has proven to be productive and insightful (e.g., [8, 11, 14, 15]). For a detailed justification of using linear models to study deep neural networks, we refer to a nice lecture note [16] by Andrea Montanari at Stanford.
2. What are the practical implications?
Beyond its theoretical contributions, Theorem 1 also offers some useful guidance to practitioners by formalizing the trade-off between data quantity (the number of samples) and quality (the clean fraction η). Our analysis allows practitioners to move from qualitative intuition to principled, quantitative decision-making in common data curation scenarios. The following two examples illustrate this in detail:
- Dataset selection: Consider a practitioner faced with the choice of a dataset among many alternatives. Often the choice is between a large, noisy dataset (large size, small η) and a smaller, curated one (small size, large η). Further, high-level aggregates (such as the dataset size and an approximate clean fraction) are usually available even when the actual datasets are behind a paywall. By plugging these values into the error bounds from Theorem 1 (i.e., comparing which option yields a smaller error), one can make a more informed decision about which dataset is likely to produce a better model (a hedged numerical sketch of this comparison is given below).
- Valuing data augmentation: When considering whether to invest resources in collecting a small amount of clean data to augment a large, noisy corpus, our theorem provides quantitative guidance. It helps estimate the expected performance gain from the improved clean fraction and the increased dataset size, allowing a practitioner to assess whether the return on investment is worthwhile.
Crucially, the two distinct error regimes identified in Theorem 1 provide non-obvious insights. The error rate in the low-η regime suggests that for very noisy datasets, simply increasing the data volume can be surprisingly effective, a practical takeaway that is not immediately intuitive.
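To make the dataset-selection example concrete, here is a hedged sketch of the comparison. The two rate functions are illustrative placeholders that only reflect the qualitative dependence discussed in this thread (error roughly proportional to 1/η without filtering vs. 1/√η with filtering, both shrinking with sample size); they are not the exact bounds of Theorem 1, and the candidate numbers and constant C are made up.

```python
# Hedged sketch of the "dataset selection" use case with placeholder rates.
import math

def err_no_filter(n: int, eta: float, C: float = 1.0) -> float:
    # Illustrative stand-in for the no-filtering bound (NOT Theorem 1 verbatim).
    return C / (eta * math.sqrt(n))

def err_with_filter(n: int, eta: float, C: float = 1.0) -> float:
    # Illustrative stand-in for the train-filter-train bound (NOT Theorem 1 verbatim).
    return C / (math.sqrt(eta) * math.sqrt(n))

# Candidate A: large but noisy; candidate B: small but curated (made-up numbers).
candidates = {"A (web crawl)": (500_000_000, 0.05), "B (curated)": (10_000_000, 0.6)}
for name, (n, eta) in candidates.items():
    print(f"{name}: no-filter ~ {err_no_filter(n, eta):.2e}, "
          f"train-filter-train ~ {err_with_filter(n, eta):.2e}")
```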
The experimental Section 7 of the paper shows two things:
- Fig 3(a) verifies the bounds with synthetic data. This lends support to the mathematical derivations, but says nothing about applicability in a realistic setup. The trend for "no filtering" indeed has a slope of 1, while the teacher-based filtering quickly deviates from the expected slope of 0.5 – it would be interesting to know whether the trend would be a better fit if more synthetic data were generated, as suggested by the authors in the caption of Fig 3.
- Fig 3(b) shows that also in a realistic setup the error slope is steeper without filtering, but the quantitative findings cannot be verified in this setting because the error metric is different (a practical 1 − accuracy vs. the theoretical chordal distance) – or, as the authors state, "the qualitative message of the improved dependence, derived from the linear modeling theory, holds true with nonlinear models as well".
I think the interesting question, though, is not whether filtering improves the dependence on η, which is already known from previous work, but whether the theoretical insights from the submission (i.e., the different bounds with and without filtering) can be used for quantitative decision-making, or whether they are artifacts induced by the simplifying assumptions.
If the authors could provide examples where their predictions are used for dataset selection or data augmentation (as mentioned in their reply above), this would indeed give credence to the point that the predictions can be put to practical use.
Similarly, running experiments on DataComp (as mentioned on the rebuttal to M3vJ) could also underline the practical usefulness of the derived bounds.
We agree with the reviewers that practical implications are important, but we would like to better explain the main contributions of our theoretical results. We respectfully argue that our theoretical analysis is still important and significant, even if it might not directly lead to a quantifiable practical implication. In other words, we would like to take this opportunity to answer "Why are we theoretically studying stylized models?"
- First of all, the quantitative decision-making examples of dataset selection and valuing data augmentation are meant to be plausible scenarios where theoretical insights could help quantitatively, but they are not the main contributions of this paper.
- We also agree with the reviewers that the empirical fact that filtering helps in CLIP training is not, by itself, the main point of this paper or of our analysis.
- The importance and significance of our results come from the fact that we provide an understanding of WHY filtering helps. For the first time, we are able to give some answers to the question: "Why does filtering in multi-modal representation learning give improved performance?" We agree that our explanation is not perfect; it is theoretical, it is based on stylized models, and it still leaves many questions open. However, we humbly argue that this is how discoveries and advances happen in theoretical research: slowly, over multiple papers, building upon one another.
- If we agree that (principled) explanations and (deeper) understandings can be important contributions, then our main contribution is the discovery that "filtering helps when there are mismatched modalities in the samples, because filtering removes those misaligned examples". This is opposed to how filtering was understood previously (e.g., [9, 10] and many others), namely that filtering removes the noisier examples (i.e., points with a "bad" image or text). Without our analysis, the only principled explanation of why filtering works in CLIP training would have been that "noisy data are removed in filtering", which is not the complete picture. Of course, one could empirically test such hypotheses and arrive at similar conclusions. But we hope the reviewers also see the value in explaining even an already known phenomenon theoretically: a theoretical analysis gives a principled explanation and perhaps deeper understanding. In short, we are the first to explain in a principled way why filtering works, through the lens of misalignment.
- One way to interpret the significance of our results is by comparison to related theoretical work. Closest to our work (and the main motivations for it) are [7] and [6]. Ji et al. [7] studied the spiked covariance model as we do, but focused on the question: "Is there an advantage in unimodal contrastive learning over other feature learning methods (autoencoders and GANs)?" In particular, [7] did not consider noisy data or the misalignment problem at all, which is highly relevant for practical large-scale datasets. Following up on this, Nakada et al. [6] considered the misalignment problem in the multimodal setup, but only studied the no-filtering algorithm [6, Theorem 3.1] and established the rate of our baseline (Corollary 1). In this progression, the analysis of data filtering, and how it can provably achieve a better rate by filtering out samples with mismatched modalities, is our main contribution. In this series of works, i.e., [7]-->[6]-->[our submission], we believe our work explains the more practically interesting phenomenon that "filtering helps significantly in CLIP training", with emphasis on the reason for the gain: misaligned modalities. This explanation is particularly interesting since misaligned modalities can only be observed in multi-modal settings, and we are the first to study the gains of filtering in this setting.
We highly appreciate Reviewer qb7P for initiating this discussion.
We would first like to briefly clarify the point: "trend is a better fit if more synthetic data was generated, as suggested by the authors in the caption of Fig 3".
We alluded to the required condition on the sample size in the caption of Fig 3 not to mean that more synthetic data would "improve the fit". We meant that more data would make the trend hold even further toward the right side of the two curves in Fig 3a. This was in connection with the error bars growing at the rightmost point of Fig 3a. Adding more synthetic data would make the curve smooth even at that point and would shrink the confidence interval.
We would like to add one more point:
- Training DataComp-scale CLIP models is outside the scope of this theoretical paper. However, we could run more extensive experiments in a setting that is more complex than our synthetic setup, but still simpler than DataComp CLIP training. This is in the spirit of [9], which received an honorable mention for the ICLR 2024 Outstanding Paper Award. Let us first explain what [9] is about. Kolossov et al. [9] study the stylized problem of generalized linear regression (with linear models on a single modality) and provide theoretical analyses of running a single linear regression on the full data vs. running linear regression on filtered data. It is shown that, perhaps surprisingly, filtering out data can provide an improvement even in linear regression. The "real" experiments in [9] are still linear-model training, but on real data, and extensive experimental results confirm their theoretical findings. In a similar spirit, we believe we could achieve similar experimental confirmation by studying the following setting: linear representation learning on real data. The main challenge is finding a dataset where meaningful representations can be learned with a linear model. For example, we could take existing encoders for images and texts and train only a linear layer on top of those embeddings (a hedged sketch is given below). Note that in the case of [9], finding datasets where linear and logistic regression give meaningful predictions is not too hard. We plan to add such real but still manageable experiments to confirm our theoretical findings.
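As one plausible instantiation of this idea, the sketch below trains only linear projection heads on top of frozen (precomputed) embeddings with a CLIP-style symmetric InfoNCE loss, and compares training on all pairs against training on pairs retained by a teacher's similarity score. The random tensors, sizes, temperature, and optimizer settings are stand-ins, not a prescription; our theory analyzes a linear contrastive objective, and this is only meant to approximate that setting on real embeddings.

```python
# Hedged sketch: train-filter-train with linear heads on frozen embeddings.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d_img, d_txt, r = 8192, 512, 384, 64     # illustrative sizes only
img = torch.randn(n, d_img)                 # stand-in: frozen image embeddings
txt = torch.randn(n, d_txt)                 # stand-in: frozen text embeddings

def train_linear_heads(img_emb, txt_emb, epochs=5, bs=256, lr=1e-3):
    """Train linear projection heads with the symmetric InfoNCE (CLIP-style) loss."""
    W1 = torch.nn.Linear(img_emb.shape[1], r, bias=False)
    W2 = torch.nn.Linear(txt_emb.shape[1], r, bias=False)
    opt = torch.optim.Adam(list(W1.parameters()) + list(W2.parameters()), lr=lr)
    for _ in range(epochs):
        perm = torch.randperm(len(img_emb))
        for i in range(0, len(img_emb), bs):
            idx = perm[i : i + bs]
            zi = F.normalize(W1(img_emb[idx]), dim=-1)
            zt = F.normalize(W2(txt_emb[idx]), dim=-1)
            logits = zi @ zt.T / 0.07                      # fixed temperature (assumed)
            labels = torch.arange(len(idx))
            loss = 0.5 * (F.cross_entropy(logits, labels)
                          + F.cross_entropy(logits.T, labels))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return W1, W2

# No-filtering baseline on all pairs.
W1_b, W2_b = train_linear_heads(img, txt)

# Teacher on one half, filter the other half by teacher similarity, train the student.
half = n // 2
W1_t, W2_t = train_linear_heads(img[:half], txt[:half])
with torch.no_grad():
    s = (F.normalize(W1_t(img[half:]), dim=-1)
         * F.normalize(W2_t(txt[half:]), dim=-1)).sum(-1)
keep = s >= s.quantile(0.75)                               # keep the top 25% (rule of thumb)
W1_s, W2_s = train_linear_heads(img[half:][keep], txt[half:][keep])
print("retained pairs:", int(keep.sum()))
```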
We thank Reviewer qb7P again for creating this discussion. We hope this resolves some concerns about the importance of the result. Please let us know if there are any more questions!
I would like to thank the authors again for their detailed and informative reply.
I appreciate the further clarifications on the first point; they have resolved my initial questions.
I concur with the authors' assessment that their work is well-situated within the existing literature, a point I also noted in my initial review.
Regarding the practical applicability of the findings, I understand the challenges in bridging the theoretical analysis based on linear assumptions with real-world complexities. I acknowledge the authors' sentiment that conducting experiments on a dataset like DataComp falls outside the scope of this paper. A more detailed discussion of these limitations in the manuscript would be beneficial for clarity (e.g., what kinds of practical questions can or cannot be answered with the derived bounds). The suggestion to include experiments that partially bridge the gap, such as training a linear layer on top of existing representations from real data, is an excellent one, and I believe that incorporating such results would substantially strengthen the paper's contributions.
We thank the reviewer for leading the discussions and for the constructive feedback throughout the interaction. We plan to discuss the challenges we face in running large-scale experiments and to bridge the gap with intermediate-scale experiments.
The paper was reviewed by 5 experts in the field. After extensive discussions, all reviewers reached a consensus to accept the paper. The AC read the paper, the reviews, and the authors' responses carefully. The AC shares a similar concern with reviewer qb7P on the practical implications of the findings in this work. However, the AC understands the difficulty of dealing with the complexity of modern deep neural networks in a theoretical work. Overall, the AC believes this paper is a solid theoretical work that provides a nice way of explaining the effectiveness of model-based data filtering for training CLIP-style contrastive models.