Debiased Orthogonal Boundary-driven Efficient Noise Mitigation
A debiased, orthogonal-boundary-driven, low-cost, model-agnostic noise mitigation paradigm that is simple to deploy across multiple tasks.
Abstract
Reviews and Discussion
This paper exploits the properties of high-dimensional orthogonality to identify a robust and effective boundary in cone space for separating clean and noisy samples. The authors propose One-Step Anti-noise (OSA), a model-agnostic noisy-label mitigation paradigm that employs an estimator model and a scoring function to assess the noise level of input pairs through just one-step inference. The method demonstrates enhanced training robustness, improved task transferability, streamlined deployment, and reduced computational overhead across diverse benchmarks and models.
Questions for Authors
I have five questions for the authors. First, how many pairs are used to calculate the space shift?
Second, as far as I know, the pre-trained CLIP has limitations and can make mistakes: in some cases, positive pairs receive low similarity scores and negative pairs receive high ones. Will this have a big impact on the identification of noisy and clean data?
Third, is the CLIP you used as an estimator model off the shelf? For example, importing the OpenAI CLIP and calculating the similarity.
Fourth, does calculating the similarity cost a lot of time? When I tried it once, it was a little slow.
Fifth, for image classification, how do you use CLIP as an estimator model? That is, how do you identify a noisy pair: by filling the label into a fixed sentence and then calculating the similarity between the image and the sentence?
update after rebuttal
The author's response has partially addressed my concerns, and after reviewing the other reviewers' comments, I support the acceptance of this paper. Therefore, I maintain my original score.
Claims and Evidence
I think the claims made in the submission are supported by clear and convincing evidence. The paper claims that the intersection boundary is highly likely to be a shifted orthogonal boundary in cone space. Fig. 1 shows that the empirically optimal decision boundary deviates significantly from the theoretical orthogonal threshold of zero, and that the intersection points of the clean and noisy distributions remain consistent for the same model across different datasets, suggesting the existence of a stable, dataset-irrelevant boundary.
Methods and Evaluation Criteria
I think the proposed methods and evaluation criteria make sense for the problem. The method uses a pre-trained CLIP as an estimator model. During training, each pair is fed to the estimator, which outputs a similarity score; this score is compared against the space shift and converted into a weight for the loss. If the pair is noisy, the weight is close to zero; otherwise, the further the similarity lies beyond the space shift, the larger the loss weight. I think the method is simple and cost-effective.
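To check my understanding, the pipeline I describe above can be sketched as follows (a hypothetical illustration only; the cubic shaping, the names, and the 0.215 boundary value are my own assumptions, not the paper's implementation):

```python
import numpy as np

def noise_weights(similarities, space_shift, degree=3):
    """Map estimator cosine similarities to loss weights.

    Pairs at or below the shifted boundary get weight 0; weights grow
    as similarity moves past the boundary. The cubic shaping is an
    assumption for illustration, not the paper's scoring function.
    """
    margin = np.clip(similarities - space_shift, 0.0, None)
    return margin ** degree / (margin.max() ** degree + 1e-8)

sims = np.array([0.10, 0.25, 0.40, 0.55])   # estimator similarities
losses = np.array([1.2, 0.9, 0.8, 0.7])     # per-pair training losses
w = noise_weights(sims, space_shift=0.215)  # 0.215: an assumed boundary
weighted_loss = (w * losses).sum() / (w.sum() + 1e-8)
```

Under this sketch, the first pair (similarity below the boundary) contributes nothing to the loss, while cleaner, higher-similarity pairs dominate the update.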
Theoretical Claims
I did not check the correctness of the proofs.
Experimental Design and Analysis
The paper evaluates the method on three downstream tasks with noisy labels. In the analysis of the results on MSCOCO, it says "table2 show that OSA outperforms all previous approaches on all metrics with a huge gap", but the data show that NPC outperforms OSA on many occasions.
Supplementary Material
I have reviewed the supplementary material, it's well-written. The proof is comprehensive and the additional experimental results are adequate.
Relation to Prior Work
The paper relates to two subjects: noisy-label learning and the application of multimodal foundation models. I think the problem of how to accurately identify noise based solely on cosine similarity scores is worth discussing, since I once read a paper stating that a pair is considered positive when the similarity is around 0.3, and at the time I was confused about why the threshold was around 0.3. Besides, as CLIP becomes widely used in many different domains, its application to mitigating noisy labels is nice.
Missing Essential References
Related works that are essential to understanding the (context for) key contributions of the paper are discussed in the supplementary material.
Other Strengths and Weaknesses
Strengths: The motivation is good, and the method seems useful and applicable to many real-world scenarios. Besides, the paper dives into the cone effect and provides verification of the space shift. Weaknesses: I think the method is quite simple.
Other Comments or Suggestions
I think it would be better to have a comprehensive introduction to related work in the main text. I could not gain enough knowledge about the compared methods while reading the main text.
Besides, on lines 186-187, is "brings x_i and y_i closer when c_i = 1" a typo? It should be 'c_i = 0'.
We thank the reviewer DHCR for the positive, patient and professional review, as well as the valuable suggestions for improvement. Our responses to the reviewer’s questions are below:
W1 : The claim that "table2 show that OSA outperforms all previous approaches on all metrics with a huge gap" is not fully adequate, since NPC outperforms OSA on many occasions.
A1: Thank you for your careful review. In Table 2, OSA outperforms NPC in all metrics of R@1, which is the most important metric in image-text matching. For R@5 and R@10, OSA also outperforms NPC in most cases with a significant margin, especially in noisy scenarios. To accurately clarify this, we will revise the claim from 'on all metrics' to 'on most metrics' in the revised version.
W2 : The method is quite simple.
A2: The objective of our work is to develop a general and easily adaptable anti-noise method. Therefore, we hope the framework of our method is as concise as possible to ensure its practicality in complex real-world scenarios. To achieve this, we focus more on exploring anti-noise principles than on sophisticated techniques, mitigating the loss of generality arising from complex methods.
W3 : It may be better to have a comprehensive introduction of related work in the main text to help readers understand the existing methods used for comparison.
A3: We are sorry for this confusion. Following your valuable reminder, we believe it would be beneficial to include more related work in the main text to enhance readers' understanding. In our next revision, we will move some content from the related-work section into the main text.
W4 : How many pairs are used to calculate the space shift?
A4: During our training process, we randomly sample 256 images and 256 texts separately, forming pairs. We also evaluate different sample sizes, ranging from 64x64 to 1024x1024, and find that the boundary remains stable. Therefore, we ultimately set the sampling size to match the batch size (256) in most of our experiments.
| Scale | 64x64 | 128x128 | 256x256 | 512x512 | 1024x1024 |
|---|---|---|---|---|---|
| Mean | 0.215 | 0.216 | 0.215 | 0.214 | 0.214 |
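A minimal sketch of this estimation procedure (illustrative only; the function name and the isotropic Gaussian vectors in the demo are assumptions, since in practice the embeddings come from the estimator, e.g. CLIP):

```python
import numpy as np

def estimate_space_shift(image_embs, text_embs, n_pairs=256, seed=0):
    """Mean cosine similarity of randomly formed (image, text) pairs.

    In practice the embeddings come from the estimator (e.g. CLIP);
    the random vectors in the demo below are a stand-in.
    """
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(image_embs), n_pairs)
    j = rng.integers(0, len(text_embs), n_pairs)
    a = image_embs[i] / np.linalg.norm(image_embs[i], axis=1, keepdims=True)
    b = text_embs[j] / np.linalg.norm(text_embs[j], axis=1, keepdims=True)
    return float((a * b).sum(axis=1).mean())

# Demo with isotropic random vectors: no cone effect, so the shift is ~0.
# Real CLIP embeddings would instead yield a clearly positive shift
# (~0.215 in the table above).
rng = np.random.default_rng(1)
imgs = rng.normal(size=(1000, 512))
txts = rng.normal(size=(1000, 512))
shift = estimate_space_shift(imgs, txts)
```

The stability across sample sizes in the table is consistent with this being a simple mean estimate, whose variance shrinks quickly as the number of random pairs grows.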
W5 : There are some cases in which positive pairs get low similarity scores and negative pairs get high similarity scores; will this have a big impact on the identification of noisy and clean data?
A5: Thank you for your constructive question. To explore this, we further conduct experiments on a very large real-world dataset, CC3M, which contains 3 million image-text pairs. These samples are collected from webpages and filtered by Google AI based on instance matching. This suggests that noise in this dataset is rare and semantically relevant, which can somewhat represent negative pairs with high similarity scores. Additionally, we found that zero-shot CLIP performs poorly on this dataset, achieving only about 29 R@1. This indicates that the dataset is somewhat out-of-domain for CLIP, and that there are likely some clean samples with low performance. We report the performance of zero-shot CLIP, the Baseline (CLIP fine-tuned on CC3M), and OSA applied to the Baseline in the table below:
| Model | i2t R@1 | i2t R@5 | i2t R@10 | t2i R@1 | t2i R@5 | t2i R@10 |
|---|---|---|---|---|---|---|
| zero-shot CLIP | 29.25 | 50.47 | 59.47 | 28.80 | 51.04 | 60.38 |
| Baseline | 42.41 | 66.70 | 75.56 | 42.45 | 67.83 | 76.46 |
| OSA | 43.34 | 67.48 | 75.79 | 43.46 | 68.33 | 76.58 |
We observe that OSA still provides a noticeable performance improvement in this challenging scenario. This phenomenon further demonstrates the effectiveness and robustness of OSA. Therefore, cases where some positive pairs have low similarity scores and negative pairs have high similarity scores do not significantly impact OSA's performance.
W6 : Is the CLIP you used as an estimator model off the shelf? For example, importing the OpenAI CLIP and calculating the similarity.
A6: Yes, we use the off-the-shelf CLIP model released by OpenAI.
W7 : Will it cost a lot of time to calculate the similarity? Once I have tried, it's a little bit slow.
A7: We evaluate the time cost on an NVIDIA RTX 3090, processing the MS-COCO dataset (566,435 pairs) with a batch size of 4096 takes about 153 seconds, using ~24 GB of GPU memory. At this rate, processing 1 billion samples would take approximately 75 hours on a single RTX 3090. In addition, this process can be further accelerated through parallel inference. We think this is an acceptable overhead for real-world industrial training.
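For transparency, the extrapolation above follows from simple arithmetic:

```python
pairs = 566_435            # MS-COCO image-text pairs processed
seconds = 153              # measured time at batch size 4096 on one RTX 3090
rate = pairs / seconds     # roughly 3,700 pairs per second
hours_for_1b = 1e9 / rate / 3600  # ~75 hours for 1 billion samples
```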
W8 : How to identify the noisy pair using CLIP in image classification?
A8: We follow the same image classification pipeline as shown in Figure 1(b), which is also the format used in the CLIP paper [1]. The specific format is: "This is an image of [CLS]."
Refs:
[1] Alec Radford et al, "Learning Transferable Visual Models From Natural Language Supervision", ICML, 2021.
This paper proposes One-Step Anti-noise (OSA), a model-agnostic noise mitigation paradigm leveraging high-dimensional orthogonality and cone effects in pre-trained models (e.g., CLIP) to distinguish noisy and clean samples. Key contributions include: 1) It identifies a shifted orthogonal boundary in cone space as a stable decision threshold, supported by proofs showing that contrastive learning separates clean/noisy samples on opposite sides of the boundary. 2) A one-step inference scoring function that reweights loss based on debiased cosine similarity, reducing computational overhead. 3) This paper also demonstrates SOTA performance on cross-modal matching (MSCOCO, Flickr30K), classification (WebFG-496), and retrieval (CARS98N) under high noise ratios.
Questions for Authors
I find some of the discoveries in this paper quite interesting and valuable. I have a few questions that concern me.
- If the target domain is drastically different, can zero-shot CLIP/ALIGN still provide a reliable boundary? The authors need not provide extra experiments; sharing some experience would suffice.
- How should moderate overlap between the clean/noisy distributions near β be handled? In the experiments, there seems to be little distribution overlap around the boundary. If, for a particular dataset, there were moderate overlap of clean/noisy samples' cosine similarity near β, would you recommend a different shaping function, or a more cautious threshold?
- I hope the authors can fix the open-source repository they provided.
Claims and Evidence
1) Boundary Stability: Empirical results (Fig. 1(c)-(f)) show consistent intersection points across datasets for the same model, aligning with the theoretical analysis of cone effects.
2) Efficiency: OSA reduces training overhead by 90% compared to dual-backward methods like NPC (Table 12).
3) Model Agnosticism: Validated across ResNet, VGG, and ViT architectures (Table 4).
Methods and Evaluation Criteria
Methods: Using CLIP/ALIGN as zero-shot noise detectors is justified by their semantic alignment capabilities. Non-linear weighting based on the shifted boundary also effectively suppresses noisy samples.
Evaluation Criteria: Standard metrics (R@K, accuracy) align with task goals. Noise ratios (20%-60%) and real-world datasets (CC120K) ensure practical relevance.
Theoretical Claims
Probability calculations (Appendix D.1) and Gaussian feature distribution proofs (Appendix D.3) are rigorous.
Experimental Design and Analysis
The paper provides comprehensive benchmarks across tasks and noise ratios.
Supplementary Material
Appendix B: Implementation details (e.g., batch size, optimizer) are sufficient for reproducibility.
Appendix D: Theoretical proofs are logically sound but rely on idealized assumptions (e.g., Gaussian weights).
Appendix F: Additional experiments (e.g., real-world CC120K) strengthen claims.
Relation to Prior Work
This work is built on CLIP’s cross-modal alignment but introduces noise-aware boundary shifts. It improves NPC by replacing dual-backward passes with one-step inference.
Missing Essential References
None
Other Strengths and Weaknesses
Strengths:
- This paper provides a strong theoretical and empirical foundation for why a shifted orthogonal boundary emerges in high-dimensional embedding spaces. This is a refreshing perspective, as many noise-robust methods focus on heuristics in loss space; by contrast, this work demonstrates a rigorous approach toward understanding cosine-space separation.
- OSA is proposed as an inference-only noise-mitigation strategy, independent of specific network architectures. Experiments show that it integrates with various architectures and tasks with minimal changes to the training pipeline. This broad adaptability is a significant practical advantage.
Weaknesses:
- While the experiments simulate large-scale scenarios (e.g., MSCOCO, CC120K), the evaluated datasets are much smaller than the dataset used to train CLIP. Denoising labels would be more meaningful when validated on large-scale datasets.
- Although the authors show zero-shot CLIP/ALIGN are good estimators, OSA's effectiveness depends heavily on the estimator's domain alignment. If the estimator is weak or too far out of domain, the boundary between noisy and clean samples might become less reliable. Additional discussion of failure cases in severely domain-mismatched scenarios would strengthen the narrative.
- The repository the authors provided in the abstract has expired.
Other Comments or Suggestions
None
We thank the reviewer Ho5e for the positive, patient and professional review, as well as the valuable suggestions for improvement. Our responses to the reviewer’s questions are below:
W1 : It is more meaningful when validated on large-scale datasets.
A1: Thank you for your insightful review. To further explore the effectiveness of our method in real-world scenarios, especially on large-scale and out-of-domain training, we conduct experiments on a very large real-world dataset, CC3M, which contains 3 million image-text pairs collected from webpages and filtered by Google AI. We find that zero-shot CLIP does not perform well on this dataset, achieving only 29.25 R@1 on i2t and 28.80 R@1 on t2i. Therefore, we believe this dataset can somewhat represent practical large-scale and out-of-domain scenarios. We report the performance of zero-shot CLIP, the Baseline (CLIP fine-tuned on CC3M), and OSA applied to the Baseline in the table below:
| Model | i2t R@1 | i2t R@5 | i2t R@10 | t2i R@1 | t2i R@5 | t2i R@10 |
|---|---|---|---|---|---|---|
| zero-shot CLIP | 29.25 | 50.47 | 59.47 | 28.80 | 51.04 | 60.38 |
| Baseline | 42.41 | 66.70 | 75.56 | 42.45 | 67.83 | 76.46 |
| OSA | 43.34 | 67.48 | 75.79 | 43.46 | 68.33 | 76.58 |
Although the samples in CC3M are filtered and have a lower noise ratio compared to natural ones, we observe that OSA still brings noticeable performance improvement. This phenomenon further demonstrates the effectiveness and robustness of OSA in real-world scenarios. Given the breadth and complexity of real-world domains, fully exploring all possible scenarios is extremely challenging. However, we believe this experiment could somewhat provide evidence and insights into OSA's effectiveness and robustness in real-world scenarios.
W2 : OSA’s effectiveness depends heavily on that estimator’s domain alignment, and may become less reliable in severely domain-mismatched scenarios.
A2: As mentioned in A1, zero-shot CLIP achieves only modest performance on the CC3M dataset, which may somewhat represent domain-mismatched scenarios. In these scenarios, OSA still achieves improvements, indicating its reliability across various domains. Furthermore, we also provide an optional domain adaptation (DA) solution in Section 3.2.1 to address edge-domain challenges in real-world scenarios, and it achieves significant improvements in noise detection accuracy in Table 5.
W3 & Q3: The repository authors provided in the abstract is expired. Hope the authors can fix the open-source repository they provided.
A3: We are sorry for this mistake. We have re-opened the repository, and it is now accessible!
W4 : If the target domain is drastically different, can zero-shot CLIP/ALIGN still provide a reliable boundary?
A4: Actually, we think the boundary is an inherent property of the shared space. For this reason, we calculate the boundary using simulated images and texts to eliminate the influence of specific data domains, which suggests that the boundary itself is stable and independent of the target domain.
In cases where the target domain is drastically different, such that CLIP is almost entirely unable to understand it, samples are likely to be distributed around the orthogonal boundary, which may lead to substantial overlap. In such cases, domain adaptation may be necessary to enhance the estimator's recognition ability for the target domain. However, based on our experiments on the Stable Diffusion domain and CC3M, we find it uncommon for CLIP to entirely fail to understand the target domain in real large-scale training scenarios. As long as CLIP can recognize the target domain to some extent, the orthogonal boundary can effectively separate clean and noisy samples.
W5 : How to handle moderate overlap between clean/noisy distributions near β? Would a different shaping function or a more cautious threshold be better in such cases?
A5: This is an important question and highlights a common challenge in the field. In practice, it is hard to perfectly separate overlapping clean and noisy samples based on a threshold boundary. But in our work, we identify an inherent decision boundary in the model space with theoretical significance. Compared to the common strategy of using a strict threshold, this approach allows us to design more sophisticated and accurate methods for handling overlap. For instance, our high-degree scoring function, where the gradient trends follow the probability trends of random vectors near the orthogonal boundary, achieves nearly optimal weight ranking in Table 11, suggesting the effectiveness of the function design.
We therefore believe that designing a more suitable function based on the theoretical properties of the boundary is a better solution. Additionally, the overlap mainly arises from unfamiliarity with the target domain. As mentioned in A2, we also propose a domain adaptation technique to help the estimator better adapt to real-world scenarios.
This paper proposes One-Step Anti-noise (OSA), a model-agnostic noise mitigation method that addresses label noise in large-scale pre-training tasks. It leverages the high-dimensional orthogonality of pre-trained models and the cone effect, which shifts the orthogonal boundary in the embedding space at which clean and noisy samples intersect. OSA computes cosine similarity and designs a scoring function to adjust sample weights, effectively mitigating the impact of noisy samples. Experimental results show that OSA performs well across datasets and tasks, especially in high-noise conditions, improving model performance while reducing computational overhead.
Questions for Authors
It would be beneficial if the author could provide additional insights into the design of a score function, especially for the high-degree one.
Claims and Evidence
- Shifted Orthogonal Boundary: Pre-trained models like CLIP have an intersected boundary between clean and noisy samples due to the cone effect. The paper shows that this boundary deviates from the theoretical orthogonal boundary, verified through experiments on datasets like MSCOCO and SDM.
- Model-Agnostic Noise Mitigation: OSA is a model-agnostic noise mitigation method that works across various models and tasks, including image-text matching and image retrieval. It can be applied to pre-trained models like CLIP without model-specific modifications and performs well across different architectures.
- One-Step Inference for Noise Detection and Reduced Computational Overhead: OSA uses a one-step inference process to detect noisy samples, reducing computational overhead compared to existing methods. It can effectively assess noise levels with just a single inference pass, achieving comparable or better performance with significantly less computational cost.
Methods and Evaluation Criteria
Methods:
OSA, a model-agnostic paradigm, reduces noisy sample impact on model training.
- Pre-trained models like CLIP map sample pairs into an embedding space and evaluate noise levels by calculating cosine similarity.
- OSA constructs random sample pairs, processes them through the estimator, and calculates the average cosine similarity to obtain the spatial shift used for scoring.
- OSA designs a scoring function based on the orthogonal boundary: samples with lower cosine similarity are assigned lower weights, while those with higher similarity are assigned higher weights.
- OSA weights sample losses with the scoring function during target-model training. Noisy samples receive lower weights, while clean samples receive higher weights, guiding accurate parameter updates.
- OSA's adaptability is enhanced by adding a weight coefficient to the training loss function, making it suitable for models with different architectures.
Evaluation metrics such as recall, accuracy, precision, and mAP were used to evaluate OSA's performance on various tasks.
Theoretical Claims
The paper claims that high-dimensional orthogonality in pre-trained models (like CLIP) can identify noise samples. The orthogonal boundary shifts due to the cone effect, distinguishing between clean and noisy samples.
The proof uses vector space properties and neural network embedding characteristics. It analyzes vectors in the embedding layer’s high-dimensional space, showing how the shifted boundary classifies samples as noisy or clean using cosine similarity calculations.
Experimental Design and Analysis
- Dataset Selection: Multiple datasets, including MSCOCO, Flickr30K, and CC120K, were selected for image-text matching, retrieval, and classification. These datasets cover diverse image and text content and allow a comprehensive evaluation of the proposed One-Step Anti-noise (OSA) method. However, they may not fully represent all possible real-world scenarios.
- Baseline Comparison: OSA was compared with existing noise mitigation methods using common metrics to measure performance. This comparison provides a clear benchmark for evaluating OSA's superiority.
- Ablation Experiments: Ablation experiments were conducted to study the contributions of different OSA components, such as the estimator model and the scoring function. By removing or modifying components, the authors gained insights into how they interact and contribute to overall performance.
Supplementary Material
I’ve reviewed parts A, B, C, and F of the supplementary material. The supplementary materials are rich and detailed, including experimental dataset information, implementation details, a review of related work, the theoretical proof process, and additional experimental results. They provide strong support for readers to understand the paper’s research content.
Relation to Prior Work
The paper reviews noise mitigation literature in cross-modal matching, image classification, and image retrieval. It highlights existing method limitations, such as hyperparameter reliance, poor adaptability, and high computational cost. OSA addresses these limitations through innovative method design, enhancing noise mitigation and contributing to research progress.
Missing Essential References
The paper cites relevant literature thoroughly, with no essential references missing.
Other Strengths and Weaknesses
Strengths:
- Originality: The One-Step Anti-noise (OSA) method is novel. It leverages the orthogonality in high-dimensional pre-trained model spaces to design a scoring function based on the cone effect, breaking limitations of traditional noise mitigation methods. OSA is model-agnostic and can adapt to various architectures.
- High Efficiency: OSA accurately completes noise detection with a single inference, outperforming popular multi-model or multiple-inference noise detection schemes.
- Comprehensive Experimental Validation: OSA has been tested on classic datasets like MSCOCO and Flickr30K in various scenarios, including image-text matching and image classification tasks. It accurately identifies noise samples and mitigates interference, demonstrating strong generalization.
Weaknesses:
- Limited Exploration in Scoring Functions: The study on employing high-degree scoring functions is insufficiently comprehensive.
Other Comments or Suggestions
- In section 2.2, the highlighted expression “Contrastive learning empowers the separation of clean and noisy samples” seems rather abrupt. Prior to this statement, the text focuses on verifying whether the origin of the intersection boundary is a shifted orthogonal boundary. However, there is a lack of sufficient lead-in and logical connection for this claim about contrastive learning enabling sample separation.
- Is the expression of “brings x_i and y_i closer when c_i=1” in section 3.1 (Line 187) a mistake? A noisy sample x_i should be far away from y_i.
We thank the reviewer PeYE for the positive, patient and professional review, as well as the valuable suggestions for improvement. Our responses to the reviewer’s questions are below:
W1 : Although the evaluation datasets cover diverse image and text content, they may not fully represent all possible real-world scenarios.
A1: Thank you for your valuable insights. To further explore the effectiveness of our method in real-world scenarios, especially on large-scale and out-of-domain training, we conduct experiments on a very large real-world dataset, CC3M, which contains 3 million image-text pairs collected from webpages and filtered by Google AI. We find that zero-shot CLIP does not perform well on this dataset, achieving only 29.25 R@1 on i2t and 28.80 R@1 on t2i. Therefore, we believe this dataset can somewhat represent practical large-scale and out-of-domain scenarios. We report the performance of zero-shot CLIP, the Baseline (CLIP fine-tuned on CC3M), and OSA applied to the Baseline in the table below:
| Model | i2t R@1 | i2t R@5 | i2t R@10 | t2i R@1 | t2i R@5 | t2i R@10 |
|---|---|---|---|---|---|---|
| zero-shot CLIP | 29.25 | 50.47 | 59.47 | 28.80 | 51.04 | 60.38 |
| Baseline | 42.41 | 66.70 | 75.56 | 42.45 | 67.83 | 76.46 |
| OSA | 43.34 | 67.48 | 75.79 | 43.46 | 68.33 | 76.58 |
Although the samples in CC3M are filtered and have a lower noise ratio compared to natural ones, we observe that OSA still brings noticeable performance improvement. This phenomenon further demonstrates the effectiveness and robustness of OSA in real-world scenarios. Given the breadth and complexity of real-world domains, fully exploring all possible scenarios is extremely challenging. However, we believe this experiment could somewhat provide evidence and insights into OSA's effectiveness and robustness in real-world scenarios.
W2 : Limited Exploration in Scoring Functions: The study on employing high-degree scoring functions is insufficiently comprehensive.
A2: The scoring function is a crucial component of our work, and we conduct an initial exploration in Appendix F.1. Specifically, we compare three types of functions: Linear, Cosine, and High-Degree functions. In Table 6, we observe that our carefully designed high-degree function (based on orthogonal boundary properties) outperforms other methods.
The rationale behind our design of the current high-degree function is as follows:
- For cosine similarity values lower than the orthogonal boundary, there is a high probability that the sample is noise due to the huge gap caused by the orthogonal boundary in anisotropic space. To mitigate the impact of noise, we assign these samples a weight of zero to prevent them from influencing training.
- On the positive side of the orthogonal boundary, the probability of a sample being noise decreases rapidly as the cosine similarity increases. Therefore, the gradient should initially increase rapidly as cosine similarity moves away from the orthogonal boundary. However, for samples with relatively high cosine similarity, we assume they have already been well-learned, so we assign them a lower weight to prevent overfitting.
Considering these factors, we designed a high-degree function in the form of Eq. 34. When plotted, our high-degree function exhibits a curve over the range 0-1 that rises slowly at first, then steeply, before gradually decreasing after 0.7 on the positive side. We will provide a more detailed explanation and supplement these design considerations in the updated version.
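For illustration, a hypothetical function with this qualitative shape (a placeholder, not our Eq. 34; the name, the polynomial form, and the 0.215 boundary are assumptions) could look like:

```python
import numpy as np

def high_degree_weight(sim, boundary=0.215):
    """Illustrative high-degree shaping (a placeholder, not Eq. 34).

    Weight is zero at or below the shifted orthogonal boundary; above it,
    the curve rises slowly, then steeply, peaks in the upper range
    (around s = 0.75 here), and then decreases to avoid over-weighting
    already well-learned pairs.
    """
    s = np.clip((sim - boundary) / (1.0 - boundary), 0.0, 1.0)
    return (s ** 3) * (1.0 - s)
```

The cubic factor gives the slow-then-steep rise near the boundary, and the (1 - s) factor produces the gradual decay for very high similarities.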
W3 : The statement “Contrastive learning empowers the separation of clean and noisy samples” in Section 2.2 appears abrupt, lacking a clear lead-in and logical connection to the preceding discussion on the intersection boundary.
A3: We sincerely appreciate your pointing out the confusion possibly caused by the placement of this statement. The claim aims to explain why the orthogonal boundary in the shared space of a pre-trained model can accurately and naturally separate clean and noisy samples. Inspired by your valuable suggestion, we believe this discussion would be better placed in Section 2.3, 'Qualitative Analysis of Robustness and Applicability,' to improve the logical flow. We will make this correction in our revised version.
W4 : Is the expression of “brings x_i and y_i closer when c_i=1” in section 3.1 (Line 187) a mistake? A noisy sample x_i should be far away from y_i.
A4: We sincerely appreciate your careful review. This is indeed a typo; it should be "brings x_i and y_i closer when c_i = 0." We will correct this in our revised version.
W5 : It would be beneficial if the author could provide additional insights into the design of a score function, especially for the high-degree one.
A5: Thank you for your constructive suggestion. As mentioned in A2, there are two factors behind our high-degree function design. We would like to include all of these discussions in our revision.
The paper proposes a new approach to cope with label noise in settings involving large pre-trained models. The basic idea is that for such models with large embedding dimensions, the decision boundary between clean and noisy samples may be shifted, a claim that is verified both empirically (Figure 1) and theoretically (Section 2.2). To exploit this, the paper proposes to compute cosine similarities for a set of sample pairs to compute a candidate boundary threshold. This is then used to appropriately re-weight samples in the loss, thus reducing the effect of label noise.
Reviewers were unanimously supportive of the paper, finding it to present a novel solution to an important problem, with new theoretical insights and strong empirical findings. Reviewers were particularly positive about the technique's efficiency, and applicability to a generic model family. Initial concerns about the applicability of the findings to large-scale settings were addressed with results on CC3M.
Overall, we uphold the consensus and recommend the paper for publication, but only if room is available.
From the SAC:
This paper has spurred some discussion among the reviewers and the AC in charge. The outcome of this discussion placed the paper a bit on the borderline. Unfortunately, after careful calibration and due to the high level of competition among submissions, this paper does not clear the cut of this year's ICML. Nonetheless, we would like to encourage the authors to improve their paper according to the suggestions received by both the reviewers and the AC, and consider sending it to the next high profile ML venue.