LEMoN: Label Error Detection using Multimodal Neighbors
We propose a method to automatically identify label errors in image captions using the multimodal neighborhood of image-caption pairs.
Abstract
Reviews and Discussion
LEMoN is a method designed to identify mislabeled image-caption pairs in large vision-language datasets, which often contain noisy data scraped from the web. Unlike previous approaches that rely solely on image-caption embedding similarity for filtering, LEMoN leverages multimodal neighborhood information in the latent space of contrastively pretrained models to detect label errors. The authors theoretically justify and empirically validate LEMoN across eight datasets and ten baselines, demonstrating that it improves label error detection by over 3% and enhances downstream captioning performance by 2 BLEU points.
Questions For Authors
None
Claims And Evidence
YES
Methods And Evaluation Criteria
YES
Theoretical Claims
YES
Experimental Designs Or Analyses
YES
Supplementary Material
YES
Relation To Broader Scientific Literature
YES
Essential References Not Discussed
YES
Other Strengths And Weaknesses
Strengths:
- The author utilizes multimodal neighbor detection to identify mislabeled data, a simple and effective method that is easy to follow.
- The author employs theoretical analysis to prove the effectiveness of the LEMON method.
- Extensive experiments demonstrate that mislabeled data can degrade model performance, providing significant insights for future research.
Weaknesses:
- I noticed that the most recent dataset used was published in 2020. Is the method still effective on datasets published in the last two years? Do mislabeled data still exist in these recent datasets?
- The author compares the method with LLaVA but does not provide fine-tuned results for LLaVA. Since it is known that mislabeled data degrade model performance, and LLaVA does not address mislabeled data through fine-tuning, the comparison with large models lacks persuasiveness.
- The author only considers image and text modalities. However, could the method generalize to more widely used modalities such as video and audio? Can relevant experiments be provided to support this?
- While the paper claims novelty in using multimodal scoring for label noise detection, a similar approach has recently been explored in "VDC: Versatile Data Cleanser based on Visual-Linguistic Inconsistency by Multimodal Large Language Models," ICLR 2024.
Other Comments Or Suggestions
The study does not provide results on fine-tuning large models such as LLaVA. Since mislabeled data can degrade model performance, it is important to examine whether filtering with LEMON improves the performance of fine-tuned large models. A comparison with fine-tuned LLaVA would strengthen the persuasiveness of the results.
Additionally, the current work focuses on image and text modalities. Expanding the study to include other widely used modalities, such as video and audio, would help demonstrate the generalizability of the method. Conducting experiments on multimodal datasets beyond image-caption pairs could further validate LEMoN's effectiveness in broader applications.
Thank you for the insightful review and constructive feedback!
I noticed that the most recent dataset used was published in 2020. Is the method still effective on datasets published in the last two years? Do mislabeled data still exist in these recent datasets?
We clarify that we evaluated our method on CC3M [1] (from 2021) in Appendix I.9 and the DataComp benchmark [2] (from 2023) in Appendix I.10. As we have motivated in the introduction, and as has been highlighted in many prior works [2-4], the issue of mislabeled data is only growing with time due to the use of billion-sample-scale datasets collected by scraping the web.
Finally, we note that the published work which the reviewer later references [5] uses no datasets from later than 2015.
The author compares the method with LLaVA but does not provide fine-tuned results for LLaVA. Since it is known that mislabeled data degrade model performance, and LLaVA does not address mislabeled data through fine-tuning, the comparison with large models lacks persuasiveness.
The study does not provide results on fine-tuning large models such as LLaVA. Since mislabeled data can degrade model performance, it is important to examine whether filtering with LEMON improves the performance of fine-tuned large models. A comparison with fine-tuned LLaVA would strengthen the persuasiveness of the results.
We have already conducted several experiments showing that filtering with LEMoN improves the performance of downstream large models. In particular, we have finetuned GenerativeImage2Text models in Section 6.2, pretrained CLIP models on MIMIC-CXR in Section 6.4, pretrained CLIP models on CC3M in Appendix I.9, and pretrained CLIP models on DataComp in Appendix I.10. We believe this sufficiently addresses the reviewer's concern.
The author only considers image and text modalities. However, could the method generalize to more widely used modalities such as video and audio? Can relevant experiments be provided to support this?
Additionally, the current work focuses on image and text modalities. Expanding the study to include other widely used modalities, such as video and audio, would help demonstrate the generalizability of the method. Conducting experiments on multimodal datasets beyond image-caption pairs could further validate LEMoN's effectiveness in broader applications.
We believe that demonstrating LEMoN's effectiveness on image-text pairs is a sufficient contribution, and extending it to other modalities is out of scope for this paper. We will note this as an area of future work in the revision.
While the paper claims novelty in using multimodal scoring for label noise detection, a similar approach has recently been explored in "VDC: Versatile Data Cleanser based on Visual-Linguistic Inconsistency by Multimodal Large Language Models," ICLR 2024.
We have already compared our method to this baseline in Table 2. We strongly dispute the claim that VDC is a "similar approach" to LEMoN. VDC entirely relies on prompting LLMs and VLLMs. In contrast, our method does not utilize any prompt engineering, and instead utilizes the neighborhood information in image and text representations of contrastively pretrained models. As a result, not only does VDC perform worse empirically (Table 2), it also has much higher runtime (Table I.11). We emphasize that VDC does not utilize multimodal neighbors, or even embeddings at all, in any form.
[1] Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts. CVPR 2021.
[2] DataComp: In search of the next generation of multimodal datasets. NeurIPS 2023.
[3] Multimodal datasets: misogyny, pornography, and malignant stereotypes. arXiv:2110.01963.
[4] What's In My Big Data? ICLR 2024.
[5] VDC: Versatile Data Cleanser based on Visual-Linguistic Inconsistency by Multimodal Large Language Models. ICLR 2024.
This paper tries to apply a neighbor-based noisy sample detection method to a multimodal dataset (image-text pairs dataset) with the help of a pre-trained vision-language model. The authors also provide theoretical proof to show that their method has better noise detection capability than random detection. Experiments on various datasets have shown the proposed method's efficacy and robustness to hyper-parameters.
Questions For Authors
Please check the Experimental Designs Or Analyses and Claims And Evidence.
Claims And Evidence
- Assumption 2: in appendix A.2, the authors only present visualization results for classification tasks, which is a much easier case compared with tasks involving image captioning, where the text can be natural language and can have a more diverse space on J(x).
- For assumption 2, though appendix A.2 shows the distribution of the whole dataset, it would be much better to show sample-wise visualization results, since assumption 2 is a sample-wise claim. Say, a clean sample can belong to a Gaussian distribution with a lower mean value compared with its noisy version or noisy neighbors; however, does this claim still hold when comparing two samples whose distributions have different μ and σ?
- The conclusion that "our proposed multimodal neighborhood score provides a better than random signal at detecting mislabeled samples" is tricky. As long as there are slight differences between the two distributions (say, different μ) in appendix A.2, this conclusion still holds, but it does not provide significant insight into how good the detection method actually is.
- In the theoretical part, the authors set γ1 = γ2 = 0 for the proof, and the final AUROC result is still higher than the random-signal case. Could the authors please show experimental results when directly setting γ1 = γ2 = 0 in the algorithm?
- Regarding the experiments with different noise ratios: given the neighbor-based intuition of the proposed method, and that the neighbors come from the original dataset, my concern is that when the noise level increases, the neighbor pairs can also contain many noisy pairs, which makes the algorithm less reliable since it heavily relies on the quality of the neighbor pairs. However, the impact of the noise level is not presented in the theoretical part, and the authors directly claim that the proposed method achieves better noise detection performance than the random-signal case. Thus, the impact of the noise level in the neighbor pairs should also be considered in the theoretical part, including how this noise level impacts the final AUROC.
- Regarding the experiment in Figure I.1 in the Appendix with different noise levels, please explain the phenomenon and tendency of F1/AUROC on the mother datasets. It seems that when the noise level increases, F1 drops and AUROC becomes very unstable (large variance). Could you please incorporate these observations into the theoretical part, or explain how to understand this phenomenon based on the theory?
Methods And Evaluation Criteria
Yes. The authors follow the classic noise-annotation setting from image classification tasks for vision-language datasets, and also follow the experimental setting from previous work on noisy vision-language data.
Theoretical Claims
Yes. I have checked the theoretical part. Please see the feedback in Claims And Evidence.
Experimental Designs Or Analyses
I have checked the experimental design. Related concerns:
- In appendix A.2, the authors only present visualization results for classification tasks, which is a much easier case than tasks involving image captioning, where the text can be natural language and have a more diverse space on J(x). For the experiments on different noise levels, please explain the phenomenon and tendency of F1/AUROC on the mother datasets. It seems that when the noise level increases, F1 drops, and AUROC becomes very unstable (large variance). Could you please incorporate these observations into the theoretical part, or explain how to understand this phenomenon based on the theory?
Supplementary Material
Yes. All of the appendix except the complete detailed proofs.
Relation To Broader Scientific Literature
- This paper applies a neighbor-based noisy-sample detection method to a multimodal dataset (image-text pairs) with the help of a pre-trained vision-language model.
- Though previous work uses neighbor-based methods for unimodal datasets with noise, this paper is the first to apply neighbor-based methods in the multimodal setting.
- The authors also provide a theoretical proof showing that their method has better noise detection capability than random detection.
Essential References Not Discussed
- Line 065, left column, "While prior techniques utilize unimodal neighbors for label error detection": please add a reference for this sentence.
- Line 131, right column, "Prior works have alternatively aimed to maximize the F1 score": please add a reference.
Other Strengths And Weaknesses
Strengths:
- Clearly written; the paper is easy to follow and understand.
- Originality: though the paper draws insights from many related previous works, its originality is sufficient for publication.
Weaknesses: please check the Experimental Designs Or Analyses and Claims And Evidence sections.
Other Comments Or Suggestions
N/A
Thank you for the insightful review and constructive feedback!
In appendix A.2, authors only present the visualization results for classification tasks, which is a much easier case than tasks involving image captioning, where the text can be natural language and have more diverse space on J(x).
For assumption 2, though appendix A.2 shows the distribution of the whole dataset, it is much better to show the sample-wise visualization results, since assumption 2 is a sample-wise claim.
Thank you for pointing this out! To address these two concerns simultaneously, we conduct an experiment on captions from the flickr30k dataset. We select 20 random captions, then use Llama 3.1-8B-instruct to generate 50 paraphrasings of each caption (via sampling with temperature), corresponding to 50 samples from J(x) for each caption. For the negative samples, we randomly select 50 other captions from the dataset. To match the support of the Gaussian, we take the distance function to be the log cosine distance (note that this does not change the ordering of the score across samples). We compute this distance using the text encoder from OpenAI CLIP ViT-B/32, and plot histograms for each caption. The results are shown here. Running the same Shapiro-Wilk test from Appendix A.2, we find that 8/20 of the positive samples and 16/20 of the negative samples are consistent with a Gaussian. Thus, there is some evidence that the Gaussianity assumption holds for natural language and complex paraphrase functions.
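For concreteness, a minimal sketch of this kind of per-caption normality check, assuming the paraphrases and random negative captions have already been collected; the exact definition of the log cosine distance and the helper names are assumptions rather than the implementation used for the reported numbers.

```python
import torch
from scipy.stats import shapiro
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed(texts):
    # L2-normalized CLIP text embeddings.
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    return torch.nn.functional.normalize(model.get_text_features(**inputs), dim=-1)

def log_cos_dist(anchor, others):
    # One reading of "log cosine distance": log of (1 - cosine similarity).
    sims = (embed(others) @ embed([anchor]).T).squeeze(1)
    return torch.log(1.0 - sims + 1e-8)

def gaussianity_check(anchor, paraphrases, random_captions, alpha=0.05):
    # Shapiro-Wilk: p > alpha means Gaussianity cannot be rejected.
    p_pos = shapiro(log_cos_dist(anchor, paraphrases).numpy()).pvalue
    p_neg = shapiro(log_cos_dist(anchor, random_captions).numpy()).pvalue
    return p_pos > alpha, p_neg > alpha
```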
The conclusion that "our proposed multimodal neighborhood score provides a better than random signal at detecting mislabeled samples" is tricky. As long as there are slight differences between the two distributions (say, different μ) in appendix A.2, this conclusion still holds, but it does not provide significant insight into how good the detection method actually is.
We emphasize that Lemma 4.2 is a specialization of Theorem 4.1 meant to demonstrate that only loose conditions are necessary to obtain non-random signal. Our Theorem 4.1 provides the exact expression for the AUROC of the score as a function of the distribution parameters.
Could authors please show the experiment results on directly setting γ1=γ2=0 in the algorithm?
We have provided results for setting γ1 = γ2 = 0 in Table I.9.
However, it seems that the impact of noise level is not presented in the theoretical part, and authors directly claim that the proposed method can achieve better noise detection performance than the random signal case.
The impact of the noise level in the neighbors is accounted for in Theorem 4.1 through the noise-rate term in the AUROC expression.
Notice the experiment in Figure I.1 in Appendix for different noise levels, please explain the phenomenon and tendency on F1/AUROC on mother datasets.
To better match the theoretical setting, we examine the performance of the individual neighborhood scores (the image-neighbor and text-neighbor terms) on their own, without the pairwise CLIP-distance term. Looking at the influence of the noise rate in Theorem 4.1, we find that as the noise rate approaches 1, the AUROC approaches 0.5. As the noise rate approaches 0, taking the distribution parameters to be fixed (i.e., the μ's, σ's, etc.), the AUROC approaches a fixed constant, which, under the assumptions of Lemma 4.2, is greater than 0.5.
In between these extremes, the AUROC is strictly decreasing in the noise rate. Thus, from the theory, we would expect the AUROC to decrease with a higher noise rate, going down to 0.5 as the noise rate approaches 1. Empirically, we do observe this decrease in AUROC, with a faster decrease for mscoco than mmimdb (which, according to the theory, is due to dataset-specific parameters such as the μ's, σ's, and the moments of the distance distributions).
Regarding variance: we would like to note that the result of our Theorem 4.1 is for the "population" AUROC without finite sample considerations. In practice, as in our experiments, AUROC is estimated using finite samples from a fixed dataset. The variance that is observed is due to the variance of this statistical estimator. The variance of empirical AUROC is related to the variance of a Mann-Whitney U statistic, and has been characterized in [1]. This statistical variance is independent of our theorem.
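As an illustration (using the classical Hanley-McNeil approximation rather than the exact bounds of [1]), the variance of the empirical AUROC scales roughly as

$$\operatorname{Var}(\widehat{\mathrm{AUROC}}) \approx \frac{A(1-A) + (n_+ - 1)(Q_1 - A^2) + (n_- - 1)(Q_2 - A^2)}{n_+ \, n_-}, \qquad Q_1 = \frac{A}{2-A}, \quad Q_2 = \frac{2A^2}{1+A},$$

where $A$ is the population AUROC and $n_+$, $n_-$ are the numbers of mislabeled and clean samples. In particular, as the noise rate approaches 0 or 1, one of $n_+$ or $n_-$ becomes small and the variance of the estimate grows, consistent with the instability observed at extreme noise levels.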
Finally, our theory does not provide an explanation for the F1 score. This is trickier to characterize theoretically, as F1 is computed given a particular threshold on the score. This threshold is selected to be the one that maximizes the F1, and the F1 is a non-concave function of this threshold.
Essential References Not Discussed
Thank you for pointing these out. We have added [2] and [3] respectively to address these.
[1] Confidence Intervals for the Area under the ROC Curve. NeurIPS 2004.
[2] Deep k-NN for Noisy Labels. ICML 2020.
[3] Detecting Corrupted Labels Without Training a Model to Predict. ICML 2022.
The paper presents LEMoN, a method to detect label errors in paired image-text data by using a pretrained CLIP model. Given a dataset of image-text pairs, LEMoN constructs a score which is a weighted combination of the CLIP score of the pair (x, y) and two nearest-neighbor-based intra-modal scores. The intuition is that if (x, y) is mislabeled, then i) captions corresponding to images similar to x will be mismatched with y, and ii) the corresponding images of captions similar to y are far away from x. The paper provides a theoretical justification for this scoring function. Experiments are performed on 4 classification datasets and 4 image captioning datasets (with artificial noise added) where the proposed scoring outperforms relevant baselines in detecting noisy samples. The paper also reports downstream classification and captioning performance following filtering. Experiments are also performed on the real-world datasets CC3M and Datacomp.
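For concreteness, a minimal sketch of this kind of score (pairwise CLIP distance plus two neighbor terms); the function names, the plain averaging over neighbors, and the default values of k, gamma1, gamma2 are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def cosine_dist(a, b):
    # Pairwise cosine distance between rows of a and rows of b.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return 1.0 - a @ b.T

def lemon_style_score(img_emb, txt_emb, ref_img_embs, ref_txt_embs,
                      k=30, gamma1=0.5, gamma2=0.5):
    """Higher score = more likely mislabeled. Assumes the candidate pair
    itself has been removed from the reference embeddings."""
    # (1) Pairwise image-text distance of the candidate pair (x, y).
    d_pair = cosine_dist(img_emb[None], txt_emb[None])[0, 0]
    # (2) Captions paired with the k images most similar to x, compared with y.
    img_nn = np.argsort(cosine_dist(img_emb[None], ref_img_embs)[0])[:k]
    s_img_nn = cosine_dist(txt_emb[None], ref_txt_embs[img_nn]).mean()
    # (3) Images paired with the k captions most similar to y, compared with x.
    txt_nn = np.argsort(cosine_dist(txt_emb[None], ref_txt_embs)[0])[:k]
    s_txt_nn = cosine_dist(img_emb[None], ref_img_embs[txt_nn]).mean()
    return d_pair + gamma1 * s_img_nn + gamma2 * s_txt_nn
```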
### update after rebuttal: I have gone through the rebuttal and the comments of the other reviewers, and have raised my score to 3. The rebuttal addressed a few of my concerns regarding design choices and comparisons with LNL algorithms, which I hope will be incorporated in the revision. However, echoing reviewer qofm, I have concerns over the downstream applicability of LEMoN (based on performance on CC3M and Datacomp).
Questions For Authors
- As described in weaknesses, a discussion on simpler ways to incorporate k-NN information would be informative.
- When are filtration based approaches preferred over techniques that learn with label noise?
- Empirical performance on realistic downstream tasks is unsatisfactory. Perhaps fine-tuning under limited data (few-shot training samples) is better suited to evaluate these methods?
Claims And Evidence
The paper theoretically justifies their score by deriving an expression for the detection AUROC. Extensive empirical evidence on noise simulated datasets is also provided for the same. However, it is not clear if this performance is translated to datasets with realistic noise, as evidenced by the CC3M experiment.
Methods And Evaluation Criteria
Experiments are performed on 8 datasets covering both classification and captioning. The paper evaluates efficacy of the proposed score on label error detection (AUROC & F1 score) as well as downstream impact of filtering noisy samples.
Theoretical Claims
Theorem 4.1 (AUROC of k-NN score) is not specific to the proposed score, and may be valid for other kinds of scoring (one such alternative is proposed below). I have not checked correctness of Thm A.1 in the appendix.
Experimental Designs Or Analyses
Yes, the experimental design is sound
Supplementary Material
I reviewed parts of the appendix referenced in the main paper
Relation To Broader Scientific Literature
The paper extends existing work on filtering noisy labels to incorporate nearest-neighbor consistency in a multimodal fashion. This is a novel contribution. The paper also performs a thorough empirical analysis. However, there is no empirical comparison with a related body of work on learning with noisy correspondences [1,2].
[1] Radenovic, Filip et al. “Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training.” 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023): 6967-6977.
[2] Huang, R., Long, Y., Han, J., Xu, H., Liang, X., Xu, C.,and Liang, X. Nlip: Noise-robust language-image pretraining. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 926–934, 2023
Essential References Not Discussed
Some discussion on methods that learn from noisy correspondences is missing [1, 3, 4].
[1] Radenovic, Filip et al. “Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training.” 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023): 6967-6977.
[3] Andonian, Alex et al. “Robust Cross-Modal Representation Learning with Progressive Self-Distillation.” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022): 16409-16420.
[4] Chen, Hao et al. “Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks.” ArXiv abs/2309.17002 (2023): n. pag.
Other Strengths And Weaknesses
Strengths:
+ Preprocessing large scale web-crawled datasets is an essential step in the training/fine-tuning large foundational models. The paper explores a novel neighborhood consistency based approach to filter out noisy correspondences from datasets, which can help in improving downstream performance.
+ The proposed approach is motivated well, and the paper is well written. However, some further discussion is needed to understand why simpler cross modal alternatives are not explored (see weaknesses below).
+ Empirical evaluation is thorough, and the proposed scoring rule outperforms baselines in detecting noisy samples (improved F1 and AUROC metrics) for both classification and captioning datasets synthetically augmented with noise.
Weaknesses:
- Empirical performance on realistic noise. Downstream performance on CC3M and Datacomp is almost the same as Clip Similarity baseline. Although a human study on noise detection is performed in Section 6.5, it is not clear if the proposed score translates to downstream performance.
- Simpler ways to incorporate k-NN information. For example, the average CLIP score between the caption and the neighboring images of x (and vice versa). This avoids the extra weighting hyperparameter, but ignores the paired information of neighbors. It is not clear to me if the proposed approach is the optimal way of adding multimodal consistency. A clarification of the same would be appreciated.
- It is not clear if filtering noisy data is preferred to techniques that learn in the presence of label noise. Empirical comparison with [2] on captioning datasets would strengthen the paper.
Other Comments Or Suggestions
- L1731 should refer to Table I.14.
- In table I.14, unfiltered outperforms both kinds of filtering on average
Thank you for the insightful review and constructive feedback!
Empirical performance on realistic noise. Downstream performance on CC3M and Datacomp is almost the same as Clip Similarity baseline.
First, we would like to emphasize that we evaluate our method against the baselines on several other datasets with human noise, including CIFAR-10N, CIFAR-100N [1], and StanfordCars and MiniImageNet [2]. These datasets contain noise from human annotations collected on Amazon Mechanical Turk and the Google Cloud Data Labeling Service respectively. Second, we would like to note that, in addition to CC3M and Datacomp, we also observe increased downstream performance of training on LEMoN filtered datasets in CIFAR-10N and CIFAR-100N (Figure 3) and mscoco (Table 3).
Empirical performance on realistic downstream tasks is unsatisfactory. Perhaps fine-tuning under limited data (few-shot training samples) is better suited to evaluate these methods?
Thank you for this suggestion! We have conducted additional experiments for these CC3M-pretrained models by linear probing on the VTAB benchmark, first in the few-shot setting where we select 5 random samples per class, and next where we finetune on the standard training split of each dataset. Our results can be found here. Overall, we find similar trends as the zero-shot setting, where LEMoN marginally outperforms the baseline, with both underperforming the model that has been pretrained on the whole corpus.
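A minimal sketch of the few-shot linear-probing protocol referenced here (5 random samples per class on frozen features); feature extraction and dataset loading are assumed upstream, and the helper name is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def few_shot_linear_probe(train_feats, train_labels, test_feats, test_labels,
                          shots=5, seed=0):
    # Sample `shots` indices per class, fit a linear probe on frozen features,
    # and report test accuracy.
    rng = np.random.default_rng(seed)
    idx = []
    for c in np.unique(train_labels):
        cls_idx = np.flatnonzero(train_labels == c)
        idx.extend(rng.choice(cls_idx, size=min(shots, len(cls_idx)), replace=False))
    probe = LogisticRegression(max_iter=1000).fit(train_feats[idx], train_labels[idx])
    return probe.score(test_feats, test_labels)
```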
Simpler ways to incorporate k-NN information. For example, the average CLIP score between the caption and the neighboring images of x (and vice versa). This avoids the extra weighting hyperparameter, but ignores the paired information of neighbors.
As described in weaknesses, a discussion on simpler ways to incorporate k-NN information would be informative.
Thank you for this suggestion! We have implemented the alternate way of integrating neighbor information suggested by the reviewer: the paired neighbor terms are replaced with the average CLIP distance between the caption and the nearest-neighbor images of x, and, symmetrically, between the image and the nearest-neighbor captions of y. This drops the extra weighting hyperparameter, as the reviewer suggested. To maintain fairness, we use the same model selection strategy and the same hyperparameter grid for the remaining hyperparameters as LEMoN. We evaluate this alternate neighborhood score against LEMoN, and these results can be found here. We find that LEMoN outperforms this alternate neighbor method on the majority of datasets.
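A minimal sketch of this alternate score, with illustrative helper names (an assumption about the exact form, not the implementation used for the reported numbers).

```python
import numpy as np

def cosine_dist(a, b):
    # Pairwise cosine distance between rows of a and rows of b.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return 1.0 - a @ b.T

def alternate_nn_score(img_emb, txt_emb, ref_img_embs, ref_txt_embs, k=30):
    # Average cross-modal distance to the k nearest neighbors in each modality,
    # with no per-neighbor weighting hyperparameter.
    img_nn = np.argsort(cosine_dist(img_emb[None], ref_img_embs)[0])[:k]
    s_img = cosine_dist(txt_emb[None], ref_img_embs[img_nn]).mean()  # caption vs. neighbor images
    txt_nn = np.argsort(cosine_dist(txt_emb[None], ref_txt_embs)[0])[:k]
    s_txt = cosine_dist(img_emb[None], ref_txt_embs[txt_nn]).mean()  # image vs. neighbor captions
    return (s_img + s_txt) / 2.0
```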
Finally, we note that we have also discussed some alternate ways of integrating neighbor information in Appendix B when we compare LEMoN conceptually and empirically with a baseline which uses semantic neighborhood information for a different purpose. We will add more exposition to this discussion in the revision.
It is not clear if filtering noisy data is preferred to techniques that learn in the presence of label noise. Empirical comparison with [2] on captioning datasets would strengthen the paper.
When are filtration based approaches preferred over techniques that learn with label noise?
In our view, noise-robust training algorithms are a disjoint field of work from noisy-label identification. In particular, identifying noisy labels is a more flexible approach, with applications beyond just removing these samples for downstream model training. By identifying mislabeled samples, we can also characterize systematic errors or biases in datasets (such as in Figures I.4 and I.5), which can then be fixed, both by repairing existing data and by improving future data collection practices. This is especially important for practitioners looking to release high-fidelity datasets for others to train (and especially evaluate) on, as mislabeled samples in test sets have been shown to destabilize ML benchmarks [3]. We will add some discussion on this to the revised version of the paper.
Finally, per-sample mislabel identification methods such as LEMoN can also be deployed to flag incorrect human inputs in an online setting. One particular example might be flagging simple mistakes made by radiologists when writing notes from chest X-rays (as motivated by our MIMIC-CXR setting).
[1] Learning with Noisy Labels Revisited: A Study Using Real-World Human Annotations. ICLR 2022.
[2] Beyond Synthetic Noise: Deep Learning on Controlled Noisy Labels. ICML 2020.
[3] Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks. NeurIPS 2021.
Thank you for addressing my comments and for conducting additional experiments. I am satisfied with the explanations and have thus increased my score.
This paper presents LEMoN, a method for detecting label errors in image-text pair datasets. The authors define a scoring function in the CLIP embedding space that combines the pairwise image-text distance with distances to nearest neighbors in both the image and text modalities. Specifically, the score integrates multimodal distances and distance scores with neighbors in each modality (text, image). The proposed method is theoretically justified and empirically validated across eight classification and captioning datasets (including the healthcare dataset), consistently outperforming existing baselines. The authors conduct comprehensive ablation studies, with a strong focus on demonstrating the method’s robustness to variations in hyperparameters.
Questions For Authors
Wrote them above
Claims And Evidence
- I believe the proposed method to be well-motivated and theoretically sound, supported by Theorem 4.1 and Lemma 4.2. The empirical results are strong across several datasets, and the experiments are particularly thorough in evaluating the robustness to hyperparameter choices—a critical aspect given the number of hyperparameters involved.
- While the proposed method demonstrates strong performance compared to existing baselines, its practical advantages remain somewhat unclear.
- For instance, BLIP-based filtering methods may not be prohibitively expensive, as they can be fine-tuned on a domain-specific dataset (e.g., movie or biomedical domains) with relatively modest computational costs. Such fine-tuning could potentially outperform the proposed method—as suggested by results on the MS-COCO dataset. A direct comparison of computational overhead between LEMoN and these BLIP-based approaches would strengthen the paper’s claims. Furthermore, CapFilt appears to perform reasonably well even in a zero-shot setting (i.e., without fine-tuning on MS-COCO), wondering about this simple zero-shot CapFilt result. Clarifying the necessity of LEMoN in contrast to these simpler alternatives would help reinforce its practical relevance.
- At first glance, LEMoN's strong performance on mimiccxr seems to highlight its generalizability and ease of adaptation to out-of-domain data—a potential advantage over BLIP-based methods. However, since the proposed method also relies on a domain-specific encoder (BiomedCLIP), the comparison may not be entirely fair. I believe the advantage over the BLIP variant should be more clearly explained —particularly through comparisons with zero-shot CapFilt and fine-tuned CapFilt (e.g., on domain-specific datasets), along with an analysis of their relative computational costs. In addition, exploring whether the proposed scoring function could be integrated into or combined with existing methods like CapFilt might potentially enhance its practical utility.
- Finally, many recent vision-language pipelines rely on synthetically generated high-quality captions (e.g., post-BLIP processing). In such scenarios, the role and added benefit of LEMoN seems less obvious.
- Limited impact on CC3M. I believe one of the most practical use cases for label error correction lies in large-scale web-crawled image-text datasets. However, the marginal performance gain of the proposed method over the baseline—especially when it underperforms the default (unfiltered) setting—raises concerns about its practical effectiveness.
I believe that points 2 and 3 must be thoroughly addressed in the rebuttal.
Methods And Evaluation Criteria
I wrote them above
Theoretical Claims
I checked them
Experimental Designs Or Analyses
The experimental design seems fine. I wrote my comments above.
Supplementary Material
I reviewed all the parts.
Relation To Broader Scientific Literature
Essential References Not Discussed
I believe the paper includes relevant citations overall; however, the main topic is also closely related to the issue of false negatives in vision-language models. The authors may consider including references such as [Chun et al., ECCV 2022] and [Byun et al., MAFA, CVPR 2024].
Other Strengths And Weaknesses
Wrote them above
Other Comments Or Suggestions
Wrote them above
Thank you for the insightful review and constructive feedback!
For instance, BLIP-based filtering methods may not be prohibitively expensive, as they can be fine-tuned on a domain-specific dataset (e.g., movie or biomedical domains) with relatively modest computational costs. Such fine-tuning could potentially outperform the proposed method—as suggested by results on the MS-COCO dataset. A direct comparison of computational overhead between LEMoN and these BLIP-based approaches would strengthen the paper’s claims. Furthermore, CapFilt appears to perform reasonably well even in a zero-shot setting (i.e., without fine-tuning on MS-COCO), wondering about this simple zero-shot CapFilt result. Clarifying the necessity of LEMoN in contrast to these simpler alternatives would help reinforce its practical relevance.
CapFilt Inference Runtime: We compare the inference time (per sample runtime, milliseconds) of LEMoN with CapFilt using the same setup as Table I.11. We find that the two methods have generally comparable inference runtimes.
| Method | mscoco | flickr30k | mimiccxr | mmimdb |
|---|---|---|---|---|
| LEMoN | 18.8 (1.8) | 35.9 (1.2) | 52.2 (2.7) | 21.1 (1.4) |
| CapFilt | 21.4 (9.9) | 28.7 (23.8) | 31.6 (0.2) | 33.8 (3.0) |
Advantages of LEMoN over CapFilt: We clarify that for the results reported in the paper, we utilized the pretrained “Salesforce/blip-itm-base-coco” checkpoint. This model was trained on the clean training split of MSCOCO, which is why we refer to it as the "oracle". This training includes minimizing image-text matching loss, which is a binary classification objective designed to predict whether a caption matches a given image. LEMoN does not ever need access to the clean dataset (except optionally for hyperparameter tuning), only the noisy data. As such, we clarify that CapFilt is not applied in a “zero shot” setting, especially not for MSCOCO.
Further, we note that CapFilt appears to be more domain specific than LEMoN. As CapFilt has been trained on clean MSCOCO, it does well on MSCOCO and Flickr30k (both contain COCO‐style captions), but does worse than LEMoN on mmimdb.
At first glance, LEMoN's strong performance on mimiccxr seems to highlight its generalizability and ease of adaptation to out-of-domain data—a potential advantage over BLIP-based methods. However, since the proposed method also relies on a domain-specific encoder (BiomedCLIP), the comparison may not be entirely fair. I believe the advantage over the BLIP variant should be more clearly explained —particularly through comparisons with zero-shot CapFilt and fine-tuned CapFilt (e.g., on domain-specific datasets), along with an analysis of their relative computational costs. In addition, exploring whether the proposed scoring function could be integrated into or combined with existing methods like CapFilt might potentially enhance its practical utility.
We clarify that all other baselines on MIMIC-CXR except CapFilt do also utilize BiomedCLIP (where applicable), and LEMoN outperforms all such baselines. We highlight that we have also explored label error detection without an external domain-specific encoder for MIMIC-CXR in Table 4. Finally, as the reviewer points out, one can leverage representations from BLIP for label error detection with LEMoN, and LEMoN's score could also be combined with other mislabel scores (e.g. through ensembling). We highlight this as one area of future work.
Finally, many recent vision-language pipelines rely on synthetically generated high-quality captions (e.g., post-BLIP processing). In such scenarios, the role and added benefit of LEMoN seems less obvious.
We highlight that such vision-language pipelines assume that high quality synthetic caption generators already exist. LEMoN is designed to improve these synthetic caption generators by providing them with better real training data to start with (e.g. Section 6.2). Additionally, LEMoN may be used to detect errors in synthetic captions as well, thus adding the potential of improving the filtering component of such vision-language pipelines.
I believe the paper includes relevant citations overall; however, the main topic is also closely related to the issue of false negatives in vision-language models. The authors may consider including references such as [Chun et al., ECCV 2022] and [Byun et al., MAFA, CVPR 2024].
Thank you for these references – we will add them in the revised version of the paper!
Thank you for the rebuttal. I will maintain my score (though I lean more toward a borderline recommendation), as I still have concerns regarding the practical use case of the proposed method and the marginal results on the CC3M benchmark. While I don't consider these to be paper-killing issues, I strongly recommend that the authors clarify these points in the final version if the paper is accepted.
Thank you again for the constructive feedback! Regarding the CC3M results, we believe that the improvement is only marginal as CC3M is already filtered to some extent --- it has gone through four filtering steps as described in [1 Section 3]. We note that we do also conduct experiments on another even larger scale dataset (DataComp) in Table I.13, where LEMoN outperforms the CLIP similarity baseline as well as unfiltered training.
Additionally, we also emphasize that identifying incorrectly labeled data points has utility beyond just removing these samples for downstream model training. For example, we can detect systematic errors or biases in datasets (such as in Figures I.4 and I.5), and improve data collection strategies. This is especially important for practitioners looking to release high-fidelity datasets for others to train and evaluate on.
We will clarify this in the revision if the paper is accepted. Thank you again for engaging with us during the rebuttal period!
[1] Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning. ACL 2018.
We have a rather solid consensus (4 weak accepts) towards acceptance of the paper. The AC agrees, having checked the paper, reviews, and rebuttals.