PaperHub
Rating: 7.0/10 · Spotlight · 3 reviewers (scores 7, 8, 6; min 6, max 8, std 0.8)
Confidence: 4.0 · Correctness: 3.3 · Contribution: 3.0 · Presentation: 3.3
NeurIPS 2024

CLIPLoss and Norm-Based Data Selection Methods for Multimodal Contrastive Learning

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2024-11-06
TL;DR

We design universal data selection methods for CLIP pretraining and achieve near-SOTA results with less than 10% of the preprocessing resources. Combining our methods with the current best one, we can achieve a new state-of-the-art.

Abstract

Data selection has emerged as a core issue for large-scale visual-language model pretraining (e.g., CLIP), particularly with noisy web-curated datasets. Three main data selection approaches are: (1) leveraging external non-CLIP models to aid data selection, (2) training new CLIP-style embedding models that are more effective at selecting high-quality data than the original OpenAI CLIP model, and (3) designing better metrics or strategies universally applicable to any CLIP embedding without requiring specific model properties (e.g., CLIPScore is one popular metric). While the first two approaches have been extensively studied, the third remains under-explored. In this paper, we advance the third approach by proposing two new methods. Firstly, instead of classical CLIP scores that only consider the alignment between two modalities from a single sample, we introduce $negCLIPLoss$, a method inspired by CLIP training loss that adds the alignment between one sample and its contrastive pairs as an extra normalization term to CLIPScore for better quality measurement. Secondly, when downstream tasks are known, we propose a new norm-based metric, $NormSim$, to measure the similarity between pretraining data and target data. We test our methods on the data selection benchmark, DataComp [Gadre et al., 2023]. Compared to the best baseline using only OpenAI's CLIP-L/14, our methods achieve a 5.3% improvement on ImageNet-1k and a 2.8% improvement on 38 downstream evaluation tasks. Moreover, both $negCLIPLoss$ and $NormSim$ are compatible with existing techniques. By combining our methods with the current best methods DFN [Fang et al., 2023] and HYPE [Kim et al., 2024], we can boost average performance on downstream tasks by 0.9%, achieving a new state-of-the-art on the DataComp-medium benchmark.
Keywords
contrastive learning · visual-language pretraining · data selection · CLIP

Reviews and Discussion

Official Review
Rating: 7

This paper introduces a novel metric, negCLIPLoss, for selecting high-quality data. Additionally, the paper proposes a norm-based metric, NormSim, which offers an improved measure of data quality and is compatible with existing methods. Both negCLIPLoss and NormSim demonstrate significant performance improvements, outperforming state-of-the-art methods, while maintaining low preprocessing time. Theoretical interpretations are provided for NormSim within the framework of a linear model.

Strengths

  1. The paper is intuitive, well-motivated and well-written.
  2. The proposed methods are simple and effective.
  3. The experiments are sufficient.

Weaknesses

  1. In Line 86, the authors assert that NormSim does not explicitly consider diversity but provide no further explanation. Since diversity is often linked to the generalization performance of models, it is unclear how the proposed methods implicitly connect with diversity.

Questions

  1. Algorithm 1 requires knowledge of the batch size $B$ and the parameter $\tau$ from the teacher model. If the teacher model is private and both $B$ and $\tau$ are not accessible (for example, only an API is provided), are the proposed methods still workable? How critical is the batch size $B$ to model performance? Can the parameter $\tau$ be estimated?

Limitations

NA

Author Response

Thank you for your recognition of our paper and your constructive feedback. We have responded to your concerns and will revise our paper based on the discussions. We would also appreciate it if you could let us know if our response addresses your concerns.

Q1: In Line 86, the authors assert that NormSim does not explicitly consider diversity but provide no further explanation. Since diversity is often linked to the generalization performance of models, it is unclear how the proposed methods implicitly connect with diversity.

A1: We address this concern with the following points:

  1. Many top baselines, such as DFN and T-MARS, also don't explicitly consider diversity, yet they still provide good performance. Devil [1] even shows that valuable data is worth sampling multiple times, which they call 'quality duplication'. Therefore, one important reason why NormSim works well without explicitly considering diversity may be that when the computing budget is limited, as in the DataComp benchmark, the model first needs to learn the most useful and representative data, which should be similar to some target data.
  2. Moreover, we chose validation data from 24 downstream tasks ranging from ImageNet to EuroSAT, which may have covered a sufficiently diverse range of target examples for NormSim to calculate similarity against. The diversity of the target data consequently results in the diversity of the selected subset.
  3. An additional reason may be that our proposed negCLIPLoss already implicitly selects more diverse data, as shown in Figure 1 of the main paper. If some training data are diverse, they will match less well with other data and thus have a lower normalization term $R$. This results in a larger negCLIPLoss and a higher probability of being sampled.

Thank you for raising this concern; we will add these discussions to the NormSim section in the revised paper.

[1] Yu, Haichao, et al. "The devil is in the details: A deep dive into the rabbit hole of data filtering." arXiv preprint arXiv:2309.15954 (2023).

Q2: Algorithm 1 requires knowledge of the batch size $B$ and the parameter $\tau$ from the teacher model. If the teacher model is private and both $B$ and $\tau$ are not accessible (for example, only an API is provided), are the proposed methods still workable? How critical is the batch size $B$ to model performance? Can the parameter $\tau$ be estimated?

A2: This is a good concern about the limitations of our method. First, we note that most CLIP models are either closed-source (no API, like the SOTA filtering model of DFN) or fully open-source (providing model weights, like OAI CLIP, OpenCLIP, LAION, etc.), so our method should be workable for most current CLIP models. Besides, when only an API is provided, the recommended values for $B$ and $\tau$ are 32768 and 0.01, respectively. The reasons are as follows:

  1. In general, similar to the training stage, a larger batch size results in better performance for negCLIPLoss filtering since it contains more contrastive data pairs in a batch. 32768 is the training batch size of the OAI CLIP model, and a batch of this size fits into a single 24G GPU in the CLIP forward pass. In A1 of the ‘reply to all reviewers’ part, we also theoretically show that with a larger batch size, negCLIPLoss has a smaller approximation error.

  2. For $\tau$: when the model is accessible, we can read it directly from the model parameters since it is learnable (note that their temperature is the reciprocal of our definition). However, the learned values of $\tau$ are almost always 0.01. The reason is that the CLIP training setup imposes a manually set lower bound on the trainable $\tau$ (an upper bound of 100 under the original definition), and after training it always reaches this bound. Therefore, when the model parameters are unavailable, we recommend first trying $\tau = 0.01$ and then sampling a small subset and tuning $\tau$ around it. For tuning the hyperparameters, besides training a small-scale model, we also recommend sampling a small subset, calculating negCLIPLoss on it with different hyperparameter settings, and then visualizing the results (as in Figures 6-11 in the main paper) to choose; a minimal sketch of reading $\tau$ from an open checkpoint is given right after this list. Details are shown in Appendix C.5.
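
As a concrete illustration of reading $\tau$ from an open-sourced checkpoint, here is a minimal sketch (ours, not part of the paper; the model name and pretrained tag are placeholders), assuming the implementation exposes the learnable `logit_scale` parameter as the OpenAI CLIP and OpenCLIP codebases do:

```python
import torch
import open_clip  # assumes the open_clip package is available

# Placeholder checkpoint; any open-sourced CLIP exposing logit_scale works the same way.
model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="laion2b_s34b_b79k")

with torch.no_grad():
    # CLIP parameterizes the temperature in log space; logits = logit_scale.exp() * similarity.
    logit_scale = model.logit_scale.exp().item()  # typically saturates near the bound of 100
    tau = 1.0 / logit_scale                       # tau under our definition, usually ~0.01
print(f"estimated tau = {tau:.4f}")
```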

Moreover, to show how the batch size $B$ influences model performance, we run an ablation study on $B$ and $\tau$. Due to limited time and resources, we mainly focus on the OAI CLIP-B/32 model. The results are shown in A2 of the ‘reply to all reviewers’ part (Table R1). In Table R1 we can see that, in general, $|B| = 32768$ is better than $|B| = 16384$, and $\tau = 0.01$ performs best for both batch sizes. These results support our claims above. We will add these ablation studies and the discussion to the revised paper.

Official Review
Rating: 8

This work proposes two new approaches for data selection for vision-language pre-training. The first approach, negCLIPLoss, adds the contrastive loss as a normalization term on top of the existing CLIPScore metric. The second approach, NormSim, further improves the performance if examples from target task distribution are available. Together, the methods achieve state-of-the-art results on ImageNet-1K and 38 downstream tasks with DataComp-medium without any external data or model. Several important theoretical justifications and interpretations are provided for the methods.

Strengths

Quality: the quality is overall high. Without external resources (on which previous methods rely), the proposed approaches improve evaluation performances by 5.3% on ImageNet and 2.8% on average of 38 downstream tasks. Further, there are many theoretical results that focus on the guarantees of NormSim (though with strong assumptions).

Clarity: this paper is very well written. It is well-motivated, the distinctions of previous approaches are succinctly laid out, the methods are well presented, and the results have a clear structure.

Significance: this paper will bring significant impacts. The data selection problem has been increasingly vital for training higher-quality vision-language models. The proposed approaches, which focus on metrics instead of models or data, are compatible with different techniques that can be combined with advanced models in the future. The approaches also provide significant efficiency improvements (e.g., from 70 L40 hours to 5 L40 hours). The theoretical analyses can provide useful tools for future research as well.

Weaknesses

Quality: this is a minor complaint, but in Lines 229 - 230 the authors state that "the results of baselines on the leaderboard do not apply to our datasets, and we reproduce all the top baselines on the leaderboard with their public UIDs of the selected data" because some URLs of images become invalid. The leaderboard scores of baselines seem higher than the reproduced results in the submission. Could the authors also include the DataComp leaderboard results in the Appendix for fair comparison?

There are also some minor questions below.

Questions

  1. In Lines 135 - 136, the inaccessible batch division $B^*$ from teacher CLIP models is different from the actual batch $B_k$ in this work, in terms of both the actual image-text pairs and the batch size. Are there any potential theoretical guarantees or approximations to show that such a difference is reasonably negligible?

  2. Could the authors further show the derivations of the discussions on the two important NormSim instances? 1) Lines 179 - 180 ($p=2$, equivalent to selecting a subset that aligns with the principal components), and 2) Lines 181-182 ($p=\infty$, a sample will be selected if it has high similarity to any target)? These may help other readers to understand NormSim better.

Limitations

The authors discussed the limitations of this work.

Author Response

Thank you for your recognition of our paper together with your valuable comments and suggestions. We will revise our paper according to your comments. We respond to your questions below and would appreciate it if you could let us know if our response addresses your concerns.

Q1: this is a minor complaint, but in Lines 229 - 230 the authors state that "the results of baselines on the leaderboard do not apply to our datasets, and we reproduce all the top baselines on the leaderboard with their public UIDs of the selected data" because some URLs of images become invalid. The leaderboard scores of baselines seem higher than the reproduced results in the submission. Could the authors also include the DataComp leaderboard results in the Appendix for fair comparison?

A1: Thanks for your advice! We will include the DataComp leaderboard results in the appendix of the revised version.

Q2: In Lines 135 - 136, the inaccessible batch division $B^*$ from teacher CLIP models is different from the actual batch $B_k$ in this work, in terms of both the actual image-text pairs and the batch size. Are there any potential theoretical guarantees or approximations to show that such a difference is reasonably negligible?

A2: Thanks for mentioning this. We construct a theorem using a concentration inequality to show that when the batch size is sufficiently large, the normalization term $\mathcal{R}^{B_k}$ obtained from the actual batch $B_k$ can approximate $\mathcal{R}^{B^*}$ calculated using the ground-truth batch $B^*$ quite well, i.e., $\mathcal{R}^{B_k} = (1+o(1))\mathcal{R}^{B^*}$. The details are shown in A1 of the ‘reply to all reviewers for the major concern’ part. Here we assume that $B^*$ and $B_k$ are i.i.d. for simplicity, since the claim cannot hold if the teacher batch is very different from the actual batch. We also assume that $|B|=|B^*|$. In practice, we claim that a larger batch size is better since it can contain more contrastive pairs in a batch, and we run ablation studies, shown in A2 of the ‘reply to all reviewers for the major concern’ part (Table R1), to support this claim.

Q3: Could the authors further show the derivations of the discussions on the two important NormSim instances? 1) Lines 179 - 180 (p=2, equivalent to selecting a subset that aligns with the principal components), and 2) Lines 181-182 (p=∞, a sample will be selected if it has high similarity to any target)? These may help other readers to understand NormSim better.

A3: Thanks for your advice; we show the derivations below and will add them to the revised paper. For convenience, let $f(x_t)$ denote the image embedding of a target sample $x_t \in X_T$, and $f(x_s)$ the image embedding of a training sample $x_s \in X_S$. Then the definition of NormSim for a sample $x_s$ is

$$NormSim_p(X_T, x_s) = \left(\sum_{x_t \in X_T} \left[f(x_t)^\top f(x_s)\right]^p\right)^{1/p} \qquad (R1)$$

Then, when $p=2$, we have

$$NormSim_2(X_T, x_s) = \left(\sum_{x_t \in X_T} \left[f(x_s)^\top f(x_t)\right]\cdot\left[f(x_t)^\top f(x_s)\right]\right)^{1/2} = \left(f(x_s)^\top \cdot \sum_{x_t \in X_T} \left[f(x_t) f(x_t)^\top\right] \cdot f(x_s)\right)^{1/2}$$

Note that $\Lambda = \frac{1}{|X_T|}\sum_{x_t \in X_T} f(x_t) f(x_t)^\top$ is the variance matrix of the target image embeddings. Then, using $NormSim_2$ for filtering, we have

$$S = \arg\max_{|S|=N} \sum_{x_s \in S} NormSim_2(X_T, x_s) = \arg\max_{|S|=N} \sum_{x_s \in S} f(x_s)^\top \Lambda\, f(x_s) \qquad (R2)$$

Take $\Lambda = U\Sigma U^\top$ as the eigendecomposition of $\Lambda$, where $\Sigma = \mathrm{diag}(s_1,\ldots,s_r)$ with $s_1 > \ldots > s_r$ is the diagonal matrix of eigenvalues and $U = [u_1,\ldots,u_r] \in \mathbb{R}^{d\times r}$ contains the corresponding eigenvectors, i.e., the principal component directions. Note that the column vectors of $U$ and $f(x_s)$ are all unit vectors, so Eqn. (R2) means that $\text{NormSim}_2$ selects the data that best match the principal components of the target variance.

Besides, when $p=\infty$, from Eqn. (R1) and the definition of the infinity norm, we have $NormSim_{\infty}(X_T, x_s) = \max_{x_t \in X_T} f(x_t)^\top f(x_s)$, which measures the maximum similarity between the sample $x_s$ and any target datum $x_t \in X_T$. Therefore, a sample will be selected if it has high similarity to any target data.
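
For concreteness, the following self-contained sketch (ours, with random placeholder embeddings rather than real CLIP features) computes $NormSim_2$ and $NormSim_\infty$ for a pool of training embeddings and numerically checks the $p=2$ identity used in Eqn. (R2):

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Placeholder embeddings: 1000 target images, 5000 training images, dim 512.
f_target = l2_normalize(rng.normal(size=(1000, 512)))   # f(x_t), x_t in X_T
f_train  = l2_normalize(rng.normal(size=(5000, 512)))   # f(x_s), x_s in X_S

sim = f_train @ f_target.T                               # inner products f(x_t)^T f(x_s), shape [5000, 1000]

norm_sim_2   = np.sqrt(np.sum(sim ** 2, axis=1))         # Eqn. (R1) with p = 2
norm_sim_inf = sim.max(axis=1)                           # p = inf: max similarity to any target

# p = 2 is a quadratic form in the target second-moment matrix
# (the 1/|X_T| factor is dropped here; it does not change the ranking).
Lambda = f_target.T @ f_target                           # sum_t f(x_t) f(x_t)^T
quad   = np.einsum("nd,de,ne->n", f_train, Lambda, f_train)
assert np.allclose(norm_sim_2 ** 2, quad)

# Selection: keep the N training samples with the largest scores.
N = 1000
selected = np.argsort(-norm_sim_2)[:N]
```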

We will add these discussions to the revised paper.

Comment

The reviewer thanks the authors for the global and the specific responses. The reviewer is satisfied with the response and will maintain the score.

Comment

We sincerely thank you for your time and constructive advice on improving our work!

Official Review
Rating: 6

Data selection is crucial in the pretraining stage to clean web-crawled, large, and noisy pretraining datasets. Typically, existing methods use embeddings to compute CLIPScore in order to assess the alignment quality of each data sample. This paper introduces two new methods to enhance this measurement:

  1. negCLIPLoss: a better adjustment to reduce bias within a given batch.
  2. NormSim: provides additional information when downstream tasks are known, allowing the selection of samples that are close to the target downstream tasks. Empirical results demonstrate that these proposed methods can be easily combined with existing filtering approaches. The authors also illustrate that their approach yields state-of-the-art results on the DataComp leaderboard.

Strengths

  • Originality: Most of the work in data curation relies heavily on the original CLIP score. It's a new idea to adapt the CLIP score and elevate this measurement for better use.
  • Quality: The resulting performance is solid and achieves the top position on the leaderboard (medium-scale).
  • Clarity: The motivation behind the two approaches is clear, but some areas need further clarification. Questions are listed below.
  • Significance: Data selection in the pretraining dataset is important to the field, and they have demonstrated that their approaches are effective in achieving state-of-the-art results.

Weaknesses

  1. I think we need more clarification on how to interpret the Top X% in three different metrics in Figure 1. Can the authors provide a more detailed description? Also, how is the R score derived from the batched data? How to find the proper batched data to use?
  2. It seems that negCLIPLoss is not incorporated into the training loss; we use it as a measurement when CLIP embeddings are provided. In this scenario, how do we determine the batch data $B$ for subtracting the regularization term? Would the size of the batched data affect the measurement? The sampling method for finding batched data is unclear to me.
  3. I am unclear about the process for greedily selecting samples using NormSim, especially when the raw data pool is massive, and how to define the size of S.
  4. I would suggest moving algorithm steps from the Appendix into the main body, or showing some steps in the main body. They are helpful for understanding the filtering steps.

Questions

  1. In Figure 1, R scores on the left side are in the top 100%, while on the right side they are in the top 10%. How should these be interpreted and categorized as underestimates or overestimates of quality?
  2. When the downstream targets are not accessible, we may use the current filtered dataset as a reference, but how do we find the first-round reference dataset as a proxy to compute NormSim?
  3. I would like to list several papers that I found and read for data selection.

https://arxiv.org/abs/2405.15613, https://arxiv.org/abs/2401.12225, https://arxiv.org/abs/2302.03169, https://arxiv.org/abs/2401.04578, https://arxiv.org/abs/2404.07177

Limitations

I didn't see any potential negative societal impact of their work.

Author Response

Thank you for your constructive feedback to help us improve our paper. We will revise our paper based on your feedback. We detail our response below and please kindly let us know if our response addresses your concerns.

Q1: I think we need more clarification on how to interpret the Top X% in three different metrics in Figure 1. Can the authors provide a more detailed description? (How should these be interpreted and categorized as underestimates or overestimates of quality?)

A1: Thanks for mentioning this. We show the modified Figure 1 in the one-page supplementary based on your advice, and we explain it in detail as follows. We use the ‘Top X%’ of a metric to denote a score that ranks in the top X% among all scores of that metric in the data pool. For example, in Figure 1, the R scores on the left side are Top 100%, indicating that these examples have the smallest R in the dataset.

Besides, in this case we describe them as ‘CLIPScore can underestimate the quality’ mainly because their CLIPScore is relatively low (e.g., Top 78%) while their negCLIPLoss is high (e.g., Top 34%). As we can see from both the visualization and the experimental results, these data have high quality that is underestimated by CLIPScore. Similar claims hold for the overestimation cases. In Lines 154-165 of the main paper, we further show the intuition behind the normalization term $R$.
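
To make the ‘Top X%’ convention concrete, here is a small sketch (ours, with synthetic scores; the 30-point gap used to flag a sample is an arbitrary illustration, not the paper's rule):

```python
import numpy as np

def top_percent(scores):
    """Top X% position of each sample: a smaller X means a higher score within the pool."""
    order = np.argsort(-scores)            # descending
    rank = np.empty_like(order)
    rank[order] = np.arange(len(scores))
    return 100.0 * (rank + 1) / len(scores)

rng = np.random.default_rng(0)
clip_score = rng.normal(size=1_000_000)                       # placeholder metric values
neg_clip_loss = clip_score + 0.5 * rng.normal(size=1_000_000)

cs_pct, ncl_pct = top_percent(clip_score), top_percent(neg_clip_loss)
i = 0
# A sample whose CLIPScore percentile is much worse than its negCLIPLoss percentile
# (e.g., Top 78% vs. Top 34%) is one whose quality CLIPScore underestimates.
if cs_pct[i] - ncl_pct[i] > 30:
    print(f"sample {i}: CLIPScore Top {cs_pct[i]:.0f}% vs. negCLIPLoss Top {ncl_pct[i]:.0f}%")
```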

Q2: How is the R score derived from the batched data? How to find the proper batched data to use?

A2: We summarize how we form random batches and obtain the $R$ score and negCLIPLoss from different batched data as follows:

(1) We split the whole dataset into batches randomly, from which we obtain batches $\{B_1,\ldots, B_k\}$.

(2) For each batch $B_s$, we calculate the cross image-text similarity between the data in the batch, i.e., $f_l(x^l_i)^\top f_v(x^v_j)$ for any $i, j \in B_s$.

(3) Using these scores, we calculate the metrics for all the data in this batch from Eqn. 1-2, and we record them for each sample.

(4) We repeat (1)-(3) K times (note that each sample will have multiple $R$ values calculated from the K different batches that contain it); we then take the mean of these K $R$ scores and negCLIPLoss values and use them to approximate the ground-truth values.

Details can be found in Algorithm 1 in Appendix C.1, and a minimal sketch is given below. We note that this is not the only way to form the random batches; we choose it mainly to avoid recomputing the cross image-text similarities.
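
The following sketch of steps (1)-(4) is our own illustration, not the paper's Algorithm 1; it assumes pre-computed, L2-normalized image/text embeddings from the teacher model and uses placeholder sizes:

```python
import torch

def negclip_loss_scores(img_emb, txt_emb, tau=0.01, batch_size=32768, num_repeats=10):
    """negCLIPLoss ~= CLIPScore - R, with R averaged over K random batch partitions."""
    n = img_emb.shape[0]
    clip_score = (img_emb * txt_emb).sum(dim=1)            # per-sample image-text alignment
    r_sum = torch.zeros(n)
    for _ in range(num_repeats):                           # step (4): repeat K times
        perm = torch.randperm(n)                           # step (1): random batch split
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            sim = img_emb[idx] @ txt_emb[idx].T            # step (2): cross image-text similarities
            # step (3): R_i = tau/2 * [logsumexp over row i + logsumexp over column i]
            r_sum[idx] += 0.5 * tau * (torch.logsumexp(sim / tau, dim=1)
                                       + torch.logsumexp(sim / tau, dim=0))
    return clip_score - r_sum / num_repeats                # higher = better estimated quality

# Usage with placeholder embeddings (normally produced by the teacher CLIP model):
img_emb = torch.nn.functional.normalize(torch.randn(50_000, 512), dim=1)
txt_emb = torch.nn.functional.normalize(torch.randn(50_000, 512), dim=1)
scores = negclip_loss_scores(img_emb, txt_emb, batch_size=4096, num_repeats=3)
keep = scores.topk(int(0.3 * len(scores))).indices         # e.g., keep the top 30%
```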

Q3: negCLIPLoss is not incorporated into the training loss. We use it as a measurement when CLIP embeddings are provided...Would the size of the batched data affect the measurement?

A3: Yes, we use negCLIPLoss only for data filtering rather than training. We want to emphasize that the main focus of our paper is data selection with a fixed training pipeline. In A2 of the ‘reply to all reviewers’ part, we show how the batch size affects the measurement. In A1 of the ‘reply to all reviewers’ part, we also theoretically show that with a larger batch size, negCLIPLoss has a smaller approximation error.

Q4: the process for greedily selecting samples using NormSim, especially when the raw data pool is massive

A4: We note that NormSim, like CLIPScore, is determined by each sample on its own, so ‘greedily selecting samples using NormSim’ simply means selecting the data with the top NormSim scores. We use the word ‘greedily’ because, for this particular NormSim-D algorithm (details in Algorithm 2 in Appendix C.3), we should in theory solve a harder optimization problem, but here we use a greedy approximation (selecting the top scores). In the revised paper we will change the wording to prevent confusion.

Q5: how to define the size of S

A5: In general, for all the top filtering methods, like CLIPScore, HYPE, and T-MARS, the target size of the filtered dataset always needs to be set manually. In DataComp, for example, all these top baselines use downsampling ratios ranging from 15% to 30%. Our method with OAI CLIP first selects the data with the top 30% negCLIPLoss and then selects the top 66.7% NormSim scores, keeping 20% of the original pool. We do not tune the target size carefully here, for fair comparison.

In practice, this remains an open problem for all leading baselines when dealing with a large raw data pool. Here we find that a simple but very useful way to define $S$ is to randomly sample a small subset (e.g., 1000 samples) from the large pool and visualize these data based on their scores, as in Figures 6-11 in the main paper. From this we can determine a filtering threshold for the metric scores and thus the target size (for example, we find 0.7~0.75 to be a good threshold for NormSim). Details are shown in Appendix C.5.
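
A minimal sketch of this thresholding heuristic (ours, not the procedure in Appendix C.5; `normsim_scores` stands in for pre-computed scores over the pool):

```python
import numpy as np

rng = np.random.default_rng(0)
normsim_scores = rng.beta(5, 3, size=1_000_000)   # placeholder NormSim scores for the pool

# Randomly sample a small subset and inspect its score distribution (quantiles or a histogram).
subset = rng.choice(normsim_scores, size=1000, replace=False)
for q in (0.5, 0.6, 0.7, 0.8, 0.9):
    print(f"quantile {q:.1f}: {np.quantile(subset, q):.3f}")

# Choose a threshold from the inspection, then read off the implied target size.
threshold = 0.7
keep_ratio = float((normsim_scores >= threshold).mean())
print(f"threshold {threshold} keeps ~{keep_ratio:.1%} of the pool")
```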

But overall, deciding a proper $S$ is beyond the scope of this paper. We agree that this can be a meaningful direction for future research. We are also aware of some recent works [1] that suggest there are scaling laws for data filtering, indicating that the target size for filtering depends strongly on the computing budget.

[1] Goyal, Sachin, et al. "Scaling Laws for Data Filtering--Data Curation cannot be Compute Agnostic." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

Q6: I would suggest moving algorithm steps from the Appendix into the main body

A6: Thanks for your advice! We will add them in the revised version.

Q7: how do we find the first-round reference dataset as a proxy to compute NormSim in NormSim-D?

A7: For the first round, we simply use the whole original dataset as the proxy for calculating $\text{NormSim}_2$. For efficiency, we randomly downsample only 10% of the data for calculating $\text{NormSim}_2$, and the results are similar to using all the data.

Q8: list several related papers

A8: Thanks for your advice! We would cite all these papers in the revised version.

Comment

Dear Authors,

I have read your general response and individual comments. Thanks for your reply. Thanks for addressing studies on batch size and clarifying some details in the paper. In general, this paper gives a new idea and a good adjustment to replace CLIPScore, but some places lack detailed descriptions. I support this paper and keep my original score here. Thanks.

Comment

Thanks for your reply and support, and we will add the suggested details mentioned in the rebuttal in the next version. Thanks for taking the time to make our paper better!

Author Response

Reply to all reviewers for the major concern

We sincerely appreciate all reviewers for their insightful and constructive feedback to make our paper better. We will revise our paper according to these comments. Here we will address the most common concerns of the reviewers and will put other responses in separate rebuttals.

Most of the reviewers have concerns related to whether there is any (theoretical) guarantee that we can use a random batch from the pretraining dataset to approximate the inaccessible ground-truth batch when calculating $\mathcal{R}$ and negCLIPLoss, and to how the batch size and temperature affect our method negCLIPLoss. We answer these questions as follows.

A1: Concentration of the Normalization Term $\mathcal{R}$

We construct a theorem using a concentration inequality to show that when the batch size is sufficiently large, the normalization term $\mathcal{R}^{B_k}$ obtained from the actual batch $B_k$ can approximate $\mathcal{R}^{B^*}$ calculated using the ground-truth batch $B^*$ quite well. The details are as follows:

We assume that the pretraining dataset $\mathcal{D}$ is sampled i.i.d. from a distribution $\mathcal{P}$. Besides, to use a pretraining-data batch to approximate the ground-truth batch, one necessary condition is that their distributions are similar. Here, for simplicity, we assume that they are also i.i.d.

Assumption R1: We assume that the ground-truth batch of data $B^*$ used by the teacher model is i.i.d. with the pretraining dataset $\mathcal{D}$ that is to be filtered.

For simplicity, denote by $s_{ij} = \bar f_{v}(x^v_i)^\top \bar f_{l}(x^l_j),\ i, j \in B$ the cross image-text similarities in the batch $B$. Then the normalization term can be written as $\mathcal{R}^B_i = \frac{\tau}{2}\left[\log\left(\sum_{j \in B} \exp(s_{ij}/\tau)\right) + \log\left(\sum_{j\in B}\exp(s_{ji}/\tau)\right)\right]$. Note that $s_{ij} \in [-1,1]$. We show that $\mathcal{R}_i^B = (1+o(1))\mathcal{R}_i^{B^*}$ for all $i$ when $|B|$ is sufficiently large, which means that we can use the random batch to approximate the ground-truth batch.

Theorem R1: If Assumption R1 holds and the batch size satisfies $|B|=|B^*|$, then we have $\mathcal{R}_{i}^B=\Theta(\log(|B|))$ while $|\mathcal{R}_i^B - \mathcal{R}_i^{B^*}| = O\left(\frac{1}{\sqrt{|B|}}\right)$ for any $i \in B \cap B^*$.

Proof: Since $s_{ij} \in [-1,1]$, it is immediate that $\mathcal{R}_i^B=\Theta(\log(|B|))$.

Let $\alpha_{ij} := e^{s_{ij}/\tau} - E_j[e^{s_{ij}/\tau}]$; then $\alpha_{ij}$ is zero-mean. Since the data are i.i.d., so are the $\alpha_{ij}$. We therefore denote $\gamma := E_{j}[\alpha_{ij}^2]$. Noting that $|\alpha_{ij}|\leq e^{1/\tau} =: M$, Bernstein's inequality gives

$$\mathbb{P}\left(\Big|\sum_{j \in B}\alpha_{ij}\Big| \geq t\right) \leq 2\exp\left(-\frac{\frac{1}{2}t^2}{|B|\gamma + \frac{1}{3}Mt}\right)$$

A similar conclusion holds for $B^*$. These imply that, with probability at least $1-\eta$, we have

$$\Big|\sum_{j \in B}\alpha_{ij}\Big| \leq \max \left( 2\sqrt{|B|\gamma\ln\tfrac{2}{\eta}},\ \frac{4}{3}M\ln\tfrac{2}{\eta} \right) =: t(|B|,\gamma, \eta, M)$$

Thus we have $\left|\sum_{j\in B}\exp\left(\frac{s_{ij}}{\tau}\right)-\sum_{j\in B^*}\exp\left(\frac{s_{ij}}{\tau}\right)\right| \leq 2\, t(|B|,\gamma, \eta, M)$. Furthermore, for any $x_1, x_2 > 1$, it is easy to show that $|\log(x_1)-\log(x_2)| \leq \frac{|x_1 - x_2|}{\min(x_1, x_2)}$. Therefore, $\left|\log\left(\sum_{j\in B}\exp\left(\frac{s_{ij}}{\tau}\right)\right)-\log\left(\sum_{j\in B^*}\exp\left(\frac{s_{ij}}{\tau}\right)\right)\right| \lesssim O\left(\frac{1}{\sqrt{|B|}}\right)$, and a similar claim thus holds for $|\mathcal{R}_i^B - \mathcal{R}_i^{B^*}|$.
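
The rate in Theorem R1 can also be checked numerically; the sketch below (ours, with synthetic unit-norm embeddings standing in for CLIP features) computes $\mathcal{R}_i$ for one fixed sample $i$ on two independent random batches of the same size and reports the gap, which should shrink roughly like $1/\sqrt{|B|}$:

```python
import numpy as np

rng = np.random.default_rng(0)
tau, dim = 0.01, 512

def unit(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def logsumexp_over_tau(s):
    # Stable log(sum_j exp(s_j / tau))
    m = s.max()
    return m / tau + np.log(np.exp((s - m) / tau).sum())

# Fix one image-text pair i that belongs to both batches B and B*.
img_i, txt_i = unit(rng.normal(size=dim)), unit(rng.normal(size=dim))

def r_i(batch_size):
    """R_i computed on a fresh random batch (containing sample i) of the given size."""
    img = np.vstack([img_i, unit(rng.normal(size=(batch_size - 1, dim)))])
    txt = np.vstack([txt_i, unit(rng.normal(size=(batch_size - 1, dim)))])
    s_row = img_i @ txt.T     # s_ij for all j in the batch
    s_col = img @ txt_i       # s_ji for all j in the batch
    return 0.5 * tau * (logsumexp_over_tau(s_row) + logsumexp_over_tau(s_col))

for b in (1024, 4096, 16384):
    gaps = [abs(r_i(b) - r_i(b)) for _ in range(20)]
    print(f"|B| = {b:5d}: mean |R_i^B - R_i^B*| = {np.mean(gaps):.5f}")
```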

A2: Ablation study on batch size and the temperature.

All the reviewers raised concerns about the choice of batch size. We claim that, in general, similar to the training stage, a larger batch size results in better performance for negCLIPLoss filtering since it contains more contrastive data pairs per batch and can therefore check image-text matching against more varied data. We thus use the largest batch size, 32768, that fits into a single 24G GPU in the CLIP forward pass; we note that this is also the training batch size OpenAI used for CLIP.

To support this claim, we run ablation studies on $B$ and $\tau$. Due to limited time and resources, we mainly focus on the OAI CLIP-B/32 model. Results are shown in Table R1:

Table R1: Ablation study of $B$ and $\tau$ using the OpenAI CLIP-B/32 model on DataComp-medium.

| negCLIPLoss | Dataset Size | ImageNet (1) | ImageNet Dist. Shift (6) | VTAB (11) | Retrieval (3) | Avg. (38) |
| --- | --- | --- | --- | --- | --- | --- |
| $\|B\|=16384, \tau=0.01$ | 33M | 28.8 | 25.0 | 32.5 | 26.2 | 33.0 |
| $\|B\|=16384, \tau=0.02$ | 33M | 28.6 | 24.8 | 33.3 | 25.3 | 33.1 |
| $\|B\|=16384, \tau=0.07$ | 33M | 28.0 | 24.2 | 33.5 | 25.1 | 32.6 |
| $\|B\|=32768, \tau=0.005$ | 33M | 28.5 | 25.0 | 33.6 | 27.0 | 33.0 |
| $\|B\|=32768, \tau=0.01$ | 33M | 28.8 | 25.1 | 33.7 | 26.6 | 33.6 |
| $\|B\|=32768, \tau=0.02$ | 33M | 28.5 | 24.8 | 33.6 | 26.2 | 32.9 |
| $\|B\|=32768, \tau=0.07$ | 33M | 28.2 | 24.5 | 32.8 | 25.2 | 32.7 |
| negCLIPLoss $\cap$ NormSim | | | | | | |
| $\|B\|=16384, \tau=0.01$ | 22M | 32.4 | 27.4 | 34.5 | 26.1 | 34.7 |
| $\|B\|=16384, \tau=0.02$ | 22M | 31.8 | 26.7 | 35.0 | 24.9 | 34.2 |
| $\|B\|=16384, \tau=0.07$ | 22M | 31.0 | 26.3 | 35.0 | 25.5 | 33.9 |
| $\|B\|=32768, \tau=0.005$ | 22M | 32.2 | 27.2 | 35.3 | 26.5 | 34.8 |
| $\|B\|=32768, \tau=0.01$ | 22M | 32.4 | 27.4 | 35.9 | 26.3 | 35.2 |

We can see that, in general, negCLIPLoss with the larger batch size ($|B|=32768$) indeed has better or comparable downstream performance. Nevertheless, $|B|=16384, \tau=0.01$ still performs well when combined with NormSim (and $\tau=0.01$ performs well for both batch sizes). These results match our theoretical findings in A1: with a larger batch size, negCLIPLoss has a smaller approximation error.

Final Decision

The submission proposes two data selection methods for large-scale visual-language model pretraining (e.g., CLIP). All three reviewers agree that this submission is well-written and intuitive with extensive experiments. The AC therefore recommends accepting the paper and asks the authors to include their discussions with the reviewers in the final manuscript.