Handling Label Noise via Instance-Level Difficulty Modeling and Dynamic Optimization
Abstract
Reviews and Discussion
This paper proposes IDO (Instance-level Difficulty Modeling and Dynamic Optimization), a two-stage training framework for learning with noisy labels. The core contribution is a novel metric called “wrong event”, which tracks the frequency of model misclassification during training to distinguish clean, noisy, and hard samples without requiring extra hyperparameters. A Beta Mixture Model (BMM) is fitted to the distribution of wrong event scores to probabilistically model both sample cleanliness and difficulty. The method dynamically adjusts loss contributions based on these measures, achieving strong generalization performance while being computationally efficient and scalable.
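As a rough illustration of this modeling step (not the authors' code; the fitting procedure and all variable names below are assumptions), a two-component Beta mixture can be fitted to normalized wrong event scores with a simple moment-matching EM:

```python
import numpy as np
from scipy.stats import beta

def fit_two_component_bmm(scores, n_iter=30, eps=1e-4):
    """Fit a 2-component Beta mixture to scores in [0, 1] via a simple
    moment-matching EM. Returns per-component (a, b), mixture weights,
    and per-sample posterior responsibilities (column 1 ~ "noisy")."""
    x = np.clip(scores, eps, 1 - eps)
    # initialise responsibilities by splitting at the median score
    resp = np.stack([(x <= np.median(x)).astype(float),
                     (x > np.median(x)).astype(float)], axis=1)
    params, weights = np.zeros((2, 2)), np.full(2, 0.5)
    for _ in range(n_iter):
        for k in range(2):  # M-step: weighted method of moments per component
            w = resp[:, k] / (resp[:, k].sum() + eps)
            m = np.sum(w * x)
            v = np.sum(w * (x - m) ** 2) + eps
            common = max(m * (1 - m) / v - 1, eps)
            params[k] = (max(common * m, eps), max(common * (1 - m), eps))
        weights = resp.mean(axis=0)
        # E-step: posterior responsibility of each component per sample
        pdfs = np.stack([weights[k] * beta.pdf(x, *params[k]) for k in range(2)], axis=1)
        resp = pdfs / (pdfs.sum(axis=1, keepdims=True) + 1e-12)
    return params, weights, resp

# usage sketch: scores = wrong_event / max(wrong_event.max(), 1)
```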
Strengths and Weaknesses
Strengths: The paper proposes instance-level dynamic loss weighting that captures both sample difficulty and label cleanliness. It avoids the extensive hyperparameter tuning required by prior methods like DivideMix or ELR and provides clear theoretical rationale and empirical validation for the stability of the proposed metric.
Weaknesses: In the first stage of this framework, the paper assumes that "prior knowledge, i.e., coarse distribution of wrong event, is obtained". However, the prior knowledge cannot be easily obtained in reality.
In this paper, the performance improvement may come from different components/terms of the loss function. An ablation study is missing.
Questions
In the caption of Figure 2, the authors mentioned "Since wrong events are monotonically increasing based on historical statistics instead of current model prediction, when model overfits the dataset, wrong event values for all samples do not change, rather than converging to zero as loss values typically do.". It is not clear what the "wrong event values" are. Why, when wrong events are monotonically increasing, do wrong event values not change?
Limitations
The authors didn't compare the results with very recent work defending against label noise, e.g., "Noise attention learning: Enhancing noise robustness by gradient scaling, NeurIPS 2022".
Final Justification
The authors have provided an ablation study and added additional references, improving the clarity. I've increased the clarity score. But after carefully considering the authors' arguments, my concern is still that the novelty and significance haven't changed much. So I maintain the overall score.
Formatting Issues
NA
Overall Response:
We sincerely thank the reviewer for their insightful feedback and detailed comments. We have carefully studied your questions and, based on your suggestions, have conducted additional analysis and experiments. We are confident that incorporating these discussions and new results will significantly strengthen the final manuscript.
Question 1: In the first stage of this framework, the paper assumes that "prior knowledge, i.e., coarse distribution of wrong event, is obtained". However, the prior knowledge cannot be easily obtained in reality.
Response:
Thanks for your concerns! We apologize for the confusing terminology around "prior knowledge". We would like to clarify that this "prior knowledge" is not assumed or required beforehand, but is generated internally during Stage 1 of our framework. It specifically refers to the initial wrong event statistics gathered during Stage 1, which are then used to initialize the dynamic optimization in Stage 2. Our method does not require any pre-existing knowledge of dataset-specific characteristics. The results on real-world noisy datasets such as CIFAR-100N, WebVision and Clothing1M also show the practical applicability of our method. We will add a more precise description of "prior knowledge" in the revised manuscript to avoid misunderstanding.
Question 2: The performance improvement may be from different components/terms of the loss function. Ablation Study is missing.
Response:
We completely agree that a thorough ablation study is essential for understanding the source of performance improvements. We included these studies in our original submission and appreciate the opportunity to highlight their findings here. The performance gains from IDO stem from two main contributions:
First, Comprehensive Loss Function: As you noted, our three-part loss function models clean, noisy, and difficult samples respectively. Our ablation study in Table 5 demonstrates that each of the three components (the clean, noisy, and consistency loss terms) contributes positively to the final performance, and removing any of them leads to a drop in accuracy.
Second, Superiority of the wrong event Metric: Our proposed wrong event metric provides more accurate and stable information for noise modeling than other metrics. The ablation study in Table 4 compares wrong event against Single Loss [1], EMA Loss [2], Forgetting Event (FE) [3], and First k-epoch Learning (FkL) [4]. The results show that wrong event consistently outperforms these other metrics across different training stages.
We will ensure these key ablation results are more prominently signposted in the revised manuscript.
Question 3: In Figure 2, the authors mentioned "Since wrong events are monotonically increasing based on historical statistics instead of current model prediction, when model overfits the dataset, wrong event values for all samples do not change, rather than converging to zero as loss values typically do.". It is not clear what the "wrong event values" are. Why, when wrong events are monotonically increasing, do wrong event values not change?
Response:
Thank you for this question. We agree the original phrasing in the Figure 2 caption was a bit confusing and appreciate the chance to clarify. Specifically, the wrong event value of a sample is the cumulative number of disagreements between the model's prediction and the given label during training. At each epoch, the wrong event value of a sample either stays the same (if the prediction matches the given label) or increments by one (if they differ), so wrong event values are monotonically non-decreasing. As the model overfits and memorizes almost all the given labels, the wrong event value of each sample nearly stabilizes (stops increasing), unlike the loss, which converges to zero. This property preserves the discriminative power of the wrong event distribution during the overfitting phase.
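As an illustration only (variable names are ours, not from the paper), the per-epoch bookkeeping could be implemented as:

```python
import numpy as np

def update_wrong_event(wrong_event, probs, given_labels):
    """One update per epoch: increment the counter of every sample whose
    current prediction disagrees with its (possibly noisy) given label.
    Counters never decrease, so the metric can only stay flat or grow."""
    preds = probs.argmax(axis=1)  # (N,) predicted classes from softmax outputs
    return wrong_event + (preds != given_labels).astype(wrong_event.dtype)

# usage sketch: wrong_event = np.zeros(num_samples, dtype=np.int64),
# then call once per epoch with that epoch's softmax outputs.
```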
We will revise the caption in Figure 2 to make this explanation clearer.
Question 4: The authors didn't compare the results with "Noise attention learning: Enhancing noise robustness by gradient scaling, NeurIPS 2022".
Response:
Thanks for providing related works! We agree that comparing with more recent work is important. We include InstanceGM [6] and NAL [5] as two additional baselines in 3 new experiments on CIFAR-100.
| Methods | Sym. 60% | Asym. 40% | Inst. 40% |
|---|---|---|---|
| Standard | 41.1 | 53.4 | 58.8 |
| InstanceGM | 80.5 | 76.3 | 83.1 |
| NAL | 80.9 | 77.6 | 81.3 |
| IDO | 81.4 | 78.2 | 83.8 |
The results show that while InstanceGM is strong in its target setting (instance noise), it underperforms in symmetric and asymmetric noise because InstanceGM is too cautious in judging noise which leads to insufficient label correction. NAL performs well but is slightly behind IDO, particularly in the most challenging instance-dependent scenario. IDO demonstrates consistently state-of-the-art performance across all three settings. We will include a more comprehensive comparison with InstanceGM and NAL in our revised manuscript.
Once again, we thank you for your valuable and insightful feedback, which has significantly improved our work. We sincerely hope that our analysis and experiments address your confusion and questions, providing a clearer understanding of our work. If you have any questions about our paper, please feel free to point them out and we will try to address them as soon as possible. We would like to express our sincere gratitude for your valuable comments again. Thanks, and we look forward to your reply!
[1] DivideMix: Learning with Noisy Labels as Semi-supervised Learning. ICLR'20
[2] Robust Curriculum Learning: from clean label detection to noisy label self-correction. ICLR'21.
[3] An Empirical Study of Example Forgetting During Deep Neural Network Learning. ICLR'19.
[4] Late Stopping: Avoiding Confidently Learning from Mislabeled Examples. ICCV'23.
[5] Noise attention learning: Enhancing noise robustness by gradient scaling. NeurIPS'22.
[6] Instance-dependent noisy label learning via graphical modelling. CVPR'23.
Dear Reviewer HFBm,
Greetings from the authors!
To begin with, we sincerely appreciate your acknowledgement of the following aspects of our work:
- Strong generalization performance while being computationally efficient and scalable
- Avoids the extensive hyperparameter tuning required by prior methods
- Provides clear theoretical rationale and empirical validation for the stability of the proposed metric
Regarding the prior knowledge, the reasons for performance improvements, the caption clarification, and the additional baselines, we have conducted multiple experiments and included these experiments and analyses in the revised paper. We will include them in the final version and provide the link to the code repository upon acceptance.
If you have any questions about our paper, please feel free to point them out and we will try to address them as soon as possible. We would like to express our sincere gratitude for your insightful comments again. Looking forward to your reply! Thanks!
Dear reviewer HFBm,
Thank you for your thoughtful efforts in reviewing our work! As the discussion period nears its end, we kindly hope you can take a moment to review our rebuttal. Looking forward to your feedback!
Best Wishes!
NeurIPS Authors
Dear Reviewer HFBm,
Hope this message finds you well. As the discussion period is nearing its end with less than one day remaining, we kindly appreciate your service and wonder if our response has addressed your review feedback. If there are any additional points or feedback you'd like us to consider, please let us know. Your insights are invaluable to us, and we're eager to address any remaining issues to improve our work.
Thank you for your time and effort in reviewing our paper.
NeurIPS Authors
Thanks for the authors explanation and the thorough comparison with the existing work. I will increase the rating.
We appreciate your reply to our response! We will include all the explanations and comparisons with existing work in the final version. Thanks very much!
The paper studies the problem of label noise learning through the lens of empirical observation. In particular, the paper proposes a metric, coined wrong event, which counts the number of times the model mispredicts the given label after each training epoch. Such a newly-observed metric has a profound impact on identifying clean, noisy and difficult samples. This distinguishes the proposed method from previous methods, which may lump noisy and difficult samples together. Based on the wrong event metric, a mixture of 2 Beta distributions is inferred to divide samples into two groups, each with their own nature (i.e., clean or noisy) and their difficulty. These factors are then taken into account to design a loss function consisting of three components: clean loss, noisy loss and a consistency loss between two different data augmentations. Empirical evaluation achieves state-of-the-art results compared to previous methods on several benchmarks.
Strengths and Weaknesses
Strengths
- The motivation of separating or weighting noisy and difficult samples is sound. It is actually the main weakness of the majority methods in label noise learning, which often rely on small loss hypothesis. Tackling such a problem is one of the major strengths of the paper.
- The paper proposes a simple metric to quantify both the cleanness and difficulty of each training sample. This metric has been empirically shown to work well in practice.
Weaknesses
The paper is mainly a pure-observation based approach. There is limited justification for what is proposed in the paper, and in particular the wrong event. Although it has been demonstrated through several empirical results, it is unclear whether the metric is effective in identifying label noise. In addition, despite being claimed with theoretical analysis, such an analysis is mostly incorrect and too limited to shed any insight into the proposed wrong event.
The claim of the theoretical analysis at line 132 and in the Appendix C2: is incorrect. The left hand side term should be in . A slightly similar concern on the loss. Since the loss is positive, .
Questions
My main concern is the limited theoretical justification and analysis of the proposed metric wrong event. In its current form, it is unclear how it would identify samples as clean or noisy, each with their own difficulty. At the end of the day, such a metric largely depends on the model used to estimate it. Then, if the model is biased, the metric is also biased, rendering it ineffective.
Limitations
Yes
Final Justification
The authors acknowledge the flaw in the theoretical analysis presented in Appendix C2. Since the flaw does not affect much of the conclusion from the paper, I update my rating accordingly.
Formatting Issues
None
Overall Response:
We sincerely thank you for your critical and insightful review. Your insightful questions have helped us to further strengthen the paper. We have conducted several new experiments based on your suggestions and are confident that incorporating these results and discussions will significantly improve the final manuscript. Below, we address each of your points in detail.
Question 1: The paper is mainly a pure-observation based approach with limited justification for what is proposed in the paper, and in particular the wrong event.
Response:
Thank you for the incisive and critical feedback. We fully acknowledge that wrong event is a mainly empirical, observation-driven metric and totally agree with the reviewer that a rigorous theoretical analysis provides the most comprehensive understanding of a method. We view this as a valuable direction for our future work.
Nevertheless, we respectfully argue that in deep learning, providing extensive, consistent evidence for a novel metric and framework's superiority constitutes a significant contribution in its own right. An empirical, observation-driven research approach is a characteristic shared by many successful methods in Learning with Noisy Labels (LNL). The small-loss hypothesis [1] you mentioned and the memorization effect hypothesis [2] were mainly proposed through empirical observation. The metrics used to model noise, such as loss [3], EMA loss [4], forgetting event (FE) [5], fluctuation event [6], first k-epoch learning (FkL) [7] and label wave [8], were also mainly proposed through empirical observation.
Our validation in the original manuscript goes far beyond simple observation. We demonstrate the superiority of wrong event not only through state-of-the-art accuracy on six diverse benchmarks (Tables 1, 2) but also through direct, multi-faceted comparisons against other metrics:
Noise Modeling Ability (Table 4): We show wrong event achieves superior final accuracy, demonstrating its robust modeling capability across all training phases (early, middle and late stages).
Stability and Robustness (Fig. 7): We analyze the change range of normalized metrics, proving that wrong event is significantly more stable than loss.
Clean Sample Selection (Fig. 2,5,6,8, Tables 9-12): We provide a comprehensive comparison against loss using visualizations of histograms and AUC-ROC curves and calculations of precision, recall, F1-score, and AUC. These results confirm the superior selection ability of wrong event across different training strategies (from-scratch and pre-trained) and throughout all training phases (early, middle and late stages).
We believe our method, backed by this extensive evidence, offers valuable insights to the community. We will explicitly state the lack of a rigorous theoretical analysis, such as a full proof of convergence, as a limitation in the revised manuscript, highlighting it as a promising avenue for future theoretical research.
Question 2: It is unclear whether the metric is effective in identifying label noise, because such a metric largely depends on the model used to estimate it. Then, if the model is bias, the metric is also bias, rendering its ineffectiveness.
Response:
This is an excellent point, and we acknowledge your valid concern regarding the interplay between the metric and potential model bias. We designed our experimental protocol specifically to test this.
Our method was validated on six diverse synthetic and real-world benchmarks (CIFAR-10/100, Tiny-ImageNet, CIFAR-100N, Clothing1M and WebVision), across various noise types (symmetric, asymmetric, instance-dependent and real world noise), architectures (ResNet18/50, ViT, ConvNeXt) and training strategy (from-scratch / pretrained). IDO consistently outperforms state-of-the-art methods in these varied settings, demonstrating the generalizability and robustness of our approach.
More directly, our analyses in Figures 4, 8 and Tables 9-12 demonstrate the metric's resilience to model state and bias:
Robustness to Model State: wrong event maintains its discriminative power in early, middle, and late training stages, regardless of whether the model is underfitting or has begun overfitting (i.e., become biased by) the noisy labels.
Robustness to Model Initialization: The metric performs effectively in both randomly initialized and powerful pre-trained settings, showing it is not dependent on a strong initial model.
Extensive experiments demonstrate the consistent efficacy of wrong event across diverse settings, particularly its robustness to various models. This comprehensive evidence demonstrates that wrong event is not fragilely dependent on a perfectly unbiased model but is a robust metric across a wide spectrum of practical conditions. We hope this fully addresses your concern.
Question 3: The claim of the theoretical analysis at line 132 and in the Appendix C2: is incorrect. The left hand side term should be in . A slightly similar concern on the loss. Since the loss is positive, .
Response:
We sincerely apologize for this error and are very grateful for your sharp attention to detail. This was an oversight on our part. We have thoroughly re-examined this point and provide a corrected, more rigorous analysis below.
The relative change rate of the loss is indeed unbounded in every epoch, with frequent and unpredictable mutations. A commonly seen phenomenon named forgetting event [5] (correct at the last epoch, wrong at this epoch) can cause the probability of the given label to plummet from a high value (e.g., ~0.9) to a small one (e.g., ~0.1), causing the loss to skyrocket and the relative change to become enormous. This can happen randomly and repeatedly for any sample.
However, for wrong event, the change rate needs more discussion. Firstly, denoting the wrong event value of a sample after epoch $t$ as $w_t$, we have $w_{t+1} \ge w_t$ with $w_{t+1} - w_t \in \{0, 1\}$, meaning the wrong event value is monotonically non-decreasing. The relative change rate is $\frac{w_{t+1} - w_t}{w_t}$. We analyze this in two distinct cases:
The "0 to 1" Transition: You are correct to identify the unique case where the metric's stability might be questioned. When a sample is misclassified for the very first time, $w_t$ transitions from 0 to 1. At this point, the relative change is undefined (division by zero). This is a singular, one-time event for any given sample, marking its initial entry into the set of "ever-misclassified" samples.
All Subsequent Changes ($w_t \ge 1$): For every subsequent misclassification, the denominator is at least 1. The relative change is $\frac{w_{t+1} - w_t}{w_t} = \frac{1}{w_t}$. This value is strictly bounded within the range $[0, 1]$, far smaller than the change rate of the loss. More importantly, as the model continues to misclassify a sample, $w_t$ increases, causing the maximum possible relative change ($1/w_t$) to decrease. The metric is therefore self-stabilizing.
This analysis provides a strong theoretical basis for the empirical stability we observed in Figure 7. The relative change of wrong event is well-behaved, bounded, and self-stabilizing after a single initial event. Conversely, the relative change of loss is unbounded and perpetually susceptible to explosive volatility. This fundamental difference in their rate of change is why wrong event serves as a far more reliable and stable signal for modeling sample characteristics in noisy environments.
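Written compactly with the notation above ($w_t$ is the wrong event count of a sample after epoch $t$), the two cases are:

```latex
w_{t+1} - w_t \in \{0, 1\}, \qquad
w_t \ge 1 \;\Rightarrow\; 0 \le \frac{w_{t+1} - w_t}{w_t} \le \frac{1}{w_t} \le 1, \qquad
w_t = 0 \;\Rightarrow\; \frac{w_{t+1} - w_t}{w_t} \text{ is undefined.}
```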
We thank you again for pushing us to analyze our metric's stability more deeply. We will completely revise this section in the manuscript to reflect this rigorous and correct analysis.
Once again, we thank you for your valuable and insightful feedback, which has significantly improved our work. We sincerely hope that our analysis and experiments address your confusion and questions, providing a clearer understanding of our work. If you have any questions about our paper, please feel free to point them out and we will try to address them as soon as possible. Thanks, and we look forward to your reply!
[1] Co-teaching: Robust Training of Deep Neural Networks with Extremely Noisy Labels. NIPS'18
[2] A closer look at memorization in deep networks. ICML'17
[3] DivideMix: Learning with Noisy Labels as Semi-supervised Learning. ICLR'20
[4] Robust Curriculum Learning: from clean label detection to noisy label self-correction. ICLR'21
[5] An Empirical Study of Example Forgetting During Deep Neural Network Learning. ICLR'19
[6] Self-filtering: A noise-aware sample selection for label noise with confidence penalization. ECCV'22.
[7] Late Stopping: Avoiding Confidently Learning from Mislabeled Examples. ICCV'23
Thank the authors for the reply. However, the flaw in the initial theoretical analysis cannot simply be resolved through a discussion, since it changes the initial conclusion, especially in the cases where the wrong event is zero across multiple iterations. In that case, further ablation studies are required to assess this claim as well as to understand why those samples behave this way. Hence, additional rounds of review on the revision of the paper are necessary. As a result, I maintain my rating. The final decision will be made by the AC.
We sincerely thank you for your continued engagement and for providing constructive feedback on our theoretical analysis. We understand your concern that a theoretical argument requires more than just discussion. We would like to provide a step-by-step response that we hope fully addresses your concerns, including a formal resolution to the theoretical issue and a clarification on the specific sample behaviors you highlighted.
1. Clarifying the Change Rate and Isolating the Edge Case
First, we want to clarify the behavior of the wrong event change rate. For any sample where the wrong event count remains persistently at zero over multiple epochs, its value is constant, and thus its change rate is trivially zero. In this state, the metric is perfectly stable and its change rate is far smaller than the change rate of loss.
The only instance of an undefined (infinite) relative change occurs during the singular transition when a sample's wrong event count goes from 0 to 1 for the first time. Your concern rightfully focuses on the mathematical soundness of handling this specific edge case.
2. A Feasible Subtle Modification for a Theoretically Sound Metric
To formally address this specific edge case, we analyzed the metric further and found that a feasible, subtle modification resolves the issue mathematically without altering the method's practical mechanism.
Proposed Modification: We simply add one to wrong event (effectively, initializing all counters at 1 rather than 0).
Mathematical Implication: With this change, the denominator in the relative change rate, $w_t + 1$, is now always $\ge 1$. This means the relative change is strictly bounded within the range $[0, 1]$. This provides a sharp, theoretically sound contrast to the unbounded change rate of the loss, whether for a sample's first misclassification or any subsequent one.
Impact on the Method: This modification merely shifts the entire wrong event distribution by one unit, which does not affect the relative shape that our Beta Mixture Model (BMM) fits. The core mechanism of identifying clean, hard, and noisy samples remains unchanged.
This analysis demonstrates that the theoretical foundation of wrong event is robust, and its stability is not compromised by the 0-to-1 edge case, which can be elegantly resolved.
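In symbols, with the shifted counter written as $\tilde{w}_t = w_t + 1$ (notation ours), the bound holds in every epoch with no undefined case:

```latex
\tilde{w}_t \;=\; w_t + 1 \;\ge\; 1
\quad\Rightarrow\quad
0 \;\le\; \frac{\tilde{w}_{t+1} - \tilde{w}_t}{\tilde{w}_t}
\;=\; \frac{w_{t+1} - w_t}{w_t + 1} \;\le\; 1 .
```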
3. Analysis of Samples with Persistently Zero Wrong Events
You also raised an excellent point about the samples where wrong event remains zero. We would like to clarify that this behavior is not a flaw, but rather an intended and successful feature of our IDO framework.
These samples are precisely the easy, clean samples that the model learns correctly from the very beginning. Our framework is designed to handle them smoothly:
(1) They are assigned the highest clean probability by the BMM.
(2) They receive a difficulty score close to zero (the lowest difficulty).
(3) Consequently, our loss function (Eq. 8) trains them almost exclusively with the standard cross-entropy loss, which is the ideal treatment for high-confidence clean data.
Thus, this behavior confirms our framework correctly identifies and leverages the most reliable data points, and a specific ablation study on these samples is not necessary as their handling is a core part of the validated framework design.
4. Summary of Existing Experimental Evidence for Superiority
Finally, we would like to re-emphasize that the superiority of wrong event is already extensively supported by a wide array of experiments in our manuscript, which demonstrate its practical effectiveness from multiple angles:
Superior Noise Modeling: Table 4 shows that using wrong event leads to higher final accuracy compared to using loss and other LNL metrics.
Superior Clean Sample Selection: Tables 9, 10, 11, and 12, along with Figure 8, quantitatively prove via F-score and AUC that wrong event is a far better and more stable selector of clean samples throughout all training stages.
Superior Stability: Figure 7 provides a direct visual and quantitative comparison of the normalized change, showing wrong event is significantly more stable (i.e., has lower variance) than loss.
Robustness to Overfitting: Figures 2, 5, and 6 visually demonstrate that wrong event maintains clear separation long after loss has failed due to the model's overfitting bias.
We will integrate this detailed reasoning into our revision. We hope that this formal resolution of the theoretical edge case, the clarification of the framework's mechanics, and the extensive existing experimental evidence fully address your concerns.
Once again, we thank you for your valuable and insightful feedback, which has significantly improved our work. If you have any questions about our paper, please feel free to point them out and we will try to address them as soon as possible.
Once again, we sincerely thank you for your diligent feedback. We agree that theoretical soundness is paramount, and we appreciate you holding our work to a high standard.
We would like to respectfully clarify our perspective on the proposed revisions. We believe that correcting the theoretical analysis does not "change the initial conclusion" of our paper. On the contrary, it provides a more robust and mathematically sound foundation for the very same conclusions we initially presented.
- The Core Conclusion Remains Unchanged: Our central claim has always been that wrong event is fundamentally more stable and effective than loss. This conclusion was originally established based on extensive empirical evidence (from 6 datasets, various noise types, and multiple architectures) that clearly demonstrated this property.
- The Revision Strengthens, Not Alters, the Justification: The flaw you identified was in our mathematical description of why this stability exists. Our proposed correction, whether by focusing on the normalized absolute change (see Figure 7) or by trivially initializing the counter at 1, serves only to make the mathematical argument for this pre-existing, empirically observed stability more rigorous. The metric's practical behavior, its implementation, and the outstanding experimental results it produces all remain identical.
- The Study of Specific Samples Already Exists: For samples whose wrong event stays at zero across multiple epochs, this behavior is not an unexamined flaw, but rather an intended and successful feature of our IDO framework already mentioned in Section 3.2 of the original manuscript. These are the "easy, clean samples." Our framework is designed to handle them perfectly: they are assigned the highest clean probability and the lowest difficulty score, and are subsequently trained with an almost standard cross-entropy loss. This is the ideal treatment. The fact that wrong event stays at 0 for these samples confirms our method works as designed in identifying the most reliable data.
To use an analogy, we have presented strong observational evidence for a phenomenon. You correctly pointed out a weakness in our initial theoretical explanation for it. By fixing that explanation, we have made the overall case for the phenomenon stronger and more complete, without changing the phenomenon itself.
In summary, we have addressed the theoretical soundness by proposing a simple, elegant fix that reinforces our claims, and clarified that the sample behaviors you questioned are evidence of our framework's success. Given that the paper's core claims and all empirical results remain unchanged—and are now supported by a more rigorous theoretical argument—we are confident these clarifications are straightforward to verify. Your insights have been incredibly valuable in strengthening our work, and should you have further thoughts, we would welcome the opportunity to discuss these points in greater depth.
Given that these revisions serve to reinforce and clarify our original claims rather than altering them, we are confident that the core contributions of the paper stand firm. We hope this clarification assures you that the paper's foundation is sound and that these important, but targeted, improvements can be verified without necessitating a full re-review.
Thank you once again for helping us improve the quality and rigor of our paper.
Thank authors for detailed explanation. I think that the theoretical flaw is not too serious. In light of this, I raise my rating accordingly.
We appreciate your reply to our response!
Thank you very much for your time and for reconsidering our paper. Your rigorous questions and insightful comments were instrumental in helping us identify and strengthen the theoretical foundations of our work. We will include all the analysis from the rebuttal in the final manuscript.
Again, we truly appreciate your willingness to engage in this discussion and for raising your rating.
Best wishes!
NeurIPS Authors
This paper addresses the problem of Learning with Noisy Labels (LNL) by introducing a metric, Wrong Event, a framework, Instance-level Difficulty Modeling and Dynamic Optimization (IDO), and a custom loss function. The main contribution is the development and usage of the Wrong Event metric. The Wrong Event metric is an indicator of label cleanliness (whether a label is correct or not), and is computed as the cumulative count (over epochs) of how often a given training example is incorrectly classified, as per its as-labelled class. The idea is that easy & noisy samples will consistently be incorrectly predicted and thus have a high Wrong Event score, and that difficult & noisy (or clean) samples will have a medium Wrong Event score. With this insight, the Wrong Event metric feeds into the IDO framework as a parameter from which cleanliness and difficulty (at the per-sample level) are estimated (using a two-component beta mixture model). The IDO framework itself works in two stages (the first stage is used to generate prior distributions for the Wrong Event scores, and the second is for actual robust model training, making use of the custom loss function and Wrong Event metric). The paper evaluates its method on several synthetically corrupted datasets (CIFAR-10, CIFAR-100, Tiny ImageNet) and real-world noisy datasets (CIFAR-100N, Clothing1M, WebVision), with compelling results. Though classification accuracy is only marginally improved vs. existing LNL methods, the improvements seem to persist across all datasets and rates of label noise.
Strengths and Weaknesses
Quality:
- The paper seems to be technically sound, but some claims are not well supported:
- “wrong event can clearly separate noisy and clean data at all training stages” – unclear how this is demonstrated, can this be shown in fig 2, can this work after just one epoch, is there evidence for fitting to noisy samples?;
- comparing the variance of loss vs wrong event – these have different units, do you mean relative variance?;
- line 129 “with their outcomes frequently flipping between similar classes” – it would be nice to see evidence of this, and would most of these flipped-between classes not also be incorrect?;
- line 205 “With more accurate model output, the clean and noisy distribution gradually separate.” – is this true in terms of being able to discern which distribution a sample is from? Please demonstrate.
- L_{SIM} is only the squared error (you need to do this over all samples to get MSE). Because of how it is used in the overall training objective (8), this is NOT MSE.
- Results are from 5 random trials, but only means are reported. Please include error bars (std).
- Methods are appropriate, several datasets with several types of label noise (symm, asymm, instance-dept) are used.
- Authors have not touched on weaknesses of their work
Clarity:
- I found this paper difficult to read, follow, and understand. I spent considerable time to carefully read this paper to the best of my ability. There are several notational issues that lead to difficulties in understanding:
- Line 115 says f() is a model. Does f output a list of predicted probabilities or does it output the predicted class directly? In Eq (1), it seems that f(x) outputs a categorical label (should this be argmax of f?). In Eq (10), the output from f on two augmentations is averaged, but how can we average two categories (should the output be lists of probabilities?)? Also in Eq (10), it would seem that f is defined recursively based on the right hand side of the equation (is the left hand side the definition of f, or is it just a new pseudo-label and should be given a new variable). In Eq (11), it is unclear how the mathematical operations are performed on either a categorical or list of predicted probabilities output from f (is log evaluated on all probs of all classes?).
- Is noise rate given a variable of \mu or r?
- Issues with Figures:
- Fig 2 axes - Why is horizontal axis "Loss / Wrong event Distribution", What is the horizontal axis meant to be?, In (d), why is the vertical axis on the right?;
- Fig 2 - says Mean but caption says median, which is it?;
- Fig 4 - Why use line chart here? The methods on the horizontal axis are categorical and not interpolatable. Is the vertical axis meant to be number of examples, or accuracy?
- Minor grammatical issues are found throughout (use of singular vs plural, “Following setting in [12,13].”, etc.) are distracting.
Significance:
- This paper improves upon existing LNL work in a meaningful way, given the comprehensive outperformance of existing methods on all datasets and noise rates.
Originality:
- This paper is original in its presentation of the Wrong Event metric, framework, and multi-part custom loss function.
Questions
- My interpretation was that the main contribution is the Wrong Event metric; however, a rather elaborate framework (and loss function) are also developed here. What is the main takeaway here? Is it this metric (such that other creative frameworks may be used) or the framework (whereby other creative metrics may be used) that is the main contribution?
- Why does the method use only 10 or 15 epochs when the earlier figures show 100 epochs? Does the intuition described in the earlier figures still hold in this case?
- What is the meaning of f() in your paper? Is this the model? What does f() output - the list of predicted probabilities or the predicted class?
Limitations
- No error bars even though the checklist answers item 7 as "Yes".
- Limitations and Broader Impacts are discussed in Appendix F according to justification in checklist item 2. However, as reviewer, I am not obligated to review the appendices. It is my understanding that limitations should be discussed in the main paper.
Final Justification
The authors have responded to all my concerns, most prominent of which was from the 'Clarity' category. I have left my original review but increased my rating from 4 to 5. See my response to rebuttal for more context.
Formatting Issues
None.
Overall Response:
We sincerely thank you for your detailed and constructive feedback. Your insightful questions have helped us to further strengthen the paper. We have conducted several new experiments based on your suggestions and are confident that incorporating these results and discussions will significantly improve the final manuscript. Below, we address each of your points in detail.
Question 1: On the evidence for the claim that "wrong event can clearly separate noisy and clean data at all training stages."
Response:
We appreciate your question regarding this core claim. Fig. 2, 5, 6, 8 (visualizations of histograms and AUC-ROC curves) and Tab. 9, 10, 11, 12 (calculations of precision, recall, F-score and AUC) show that wrong event can clearly separate noisy and clean data at all training stages, with either pretrained or randomly initialized models. For the extremely early training stage, i.e., after one epoch, Tab. 4 shows that wrong event achieves an accuracy of 80.2%, the same as loss.
Question 2: On the comparison of variance between Loss and Wrong Event, and their different units.
Response:
Yes, we are comparing the relative variance. The loss and wrong event are not in the same range, so we normalize the two metrics and track the range of value changes during training. As shown in Fig. 7, the results show that the range of loss has no clear upper bound, confirming its instability, while the range of wrong event converges as training progresses, demonstrating superior stability. Furthermore, due to mutations in the loss, the mean change of loss is also larger than that of wrong event.
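A rough sketch of such a computation (our own illustration; the exact normalization and statistic used in Fig. 7 may differ) is:

```python
import numpy as np

def epochwise_change_range(metric_history, eps=1e-12):
    """metric_history: (T, N) array, one row per epoch, one column per sample.
    Normalise each epoch's values to [0, 1], then report the largest
    epoch-to-epoch change observed for each sample during training."""
    m = np.asarray(metric_history, dtype=float)
    lo = m.min(axis=1, keepdims=True)
    hi = m.max(axis=1, keepdims=True)
    normed = (m - lo) / (hi - lo + eps)                  # per-epoch min-max normalisation
    return np.abs(np.diff(normed, axis=0)).max(axis=0)   # (N,) max change per sample
```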
Question 3: On providing evidence for "outcomes frequently flipping between similar classes" for hard samples.
Response:
Thanks for this question! This characteristic of hard samples is widely recognized in the field and has been demonstrated in several prior works. Previous papers [1][2][3][4][5][10] have shown that hard-sample predictions frequently flip between similar classes, by visualizing the learned representations with t-SNE 2D embeddings, by giving the confusion matrix of predictions and labels, or by reporting statistics of the softmax outputs of certain samples, such as chiffon/shirt, windbreaker/jacket, 1/7, deer/horse, whale/dolphin, etc. In our single-label classification task, although these classes may be semantically related, each sample officially belongs to only one correct class. Therefore, such "flipping" is still counted as a misclassification. We will add these citations to the manuscript to support this point.
Question 4: On demonstrating the process of "the clean and noisy distribution gradually separate."
Response:
Thank you for the valuable question. We can empirically illustrate the process as follows. At initialization, the wrong event value of every sample is 0. During early training, the model learns correct patterns, so its predictions for clean samples tend to match the given labels and their wrong event values grow slowly or not at all. Conversely, for noisy samples, the model's predictions based on true features mismatch the given noisy labels, causing their wrong event values to increase steadily. This differential growth progressively separates the distributions of clean and noisy samples, making it easier to distinguish which distribution a sample comes from. Even if the model later memorizes the entire dataset, the wrong event values are "frozen" at their current, separated states, thus preserving the established distinguishability.
Question 5: On the correction that L_{SIM} is the per-sample squared error, not MSE.
Response:
We really appreciate your attention to detail. L_{SIM} is indeed the per-sample squared error, not MSE. We will correct this in the manuscript.
Question 6: On issues with figures and grammar.
Response:
Thanks for your attention to detail. For Fig. 2, we will clarify that the horizontal axis represents the "Value of Wrong Event / Loss", move the y-axis of subplot (d) back to the left, and confirm that the figure reports the median, not the mean. For Fig. 4, we will replace the line chart with a more appropriate format and clarify that the y-axis is accuracy (%). We will correct all the issues with figures and grammar in the final manuscript.
Question 7: On including error bars (std) for experimental results.
Response:
Thanks for your concern! Reporting statistical significance is crucial for demonstrating robustness. We report error bars for the CIFAR-100 experiments as follows:
| Methods | Sym. 60% | Asym. 40% | Inst. 40% |
|---|---|---|---|
| Standard | 41.08±0.87% | 53.42±0.95% | 58.79±0.77% |
| DivideMix | 80.66±0.55% | 66.62±0.56% | 81.27±0.48% |
| ELR | 73.94±0.74% | 75.61±0.63% | 79.70±0.59% |
| DeFT | 79.91±0.43% | 69.94±0.80% | 82.51±0.40% |
| IDO | 81.44±0.51% | 78.20±0.44% | 83.82±0.51% |
We will include the error bars (std) for all experiments in the revised manuscript to report the statistical significance and robustness of our 5-run experiments.
Question 8: On discussing the weaknesses/limitations of our work in the main paper.
Response:
Thanks for the suggestion! We ran 6 new experiments under a wider range of noise levels across the three noise types to verify IDO's robustness and potential weaknesses. For Sym. noise, 80% is added. For Asym. noise, 20% and 45% are added. For Inst. noise, 20% and 60% are added. Besides, a noise-free (0%) baseline is included. All experiments train ResNet-50 on CIFAR-100 following the implementation details in the paper.
| Methods | Sym. 0% | Sym. 80% | Asym. 20% | Asym. 45% | Inst. 20% | Inst. 60% |
|---|---|---|---|---|---|---|
| Standard | 82.8% | 35.2% | 68.5% | 46.2% | 70.4% | 31.6% |
| UNICON | 84.8% | 64.0% | 83.5% | 73.2% | 84.9% | 58.2% |
| DeFT | 85.1% | 57.6% | 78.9% | 65.2% | 82.4% | 56.5% |
| IDO | 85.8% | 62.5% | 85.4% | 74.1% | 85.1% | 60.8% |
The results show that IDO performs well and robustly within a noise range that far exceeds the 8%–38.5% observed in real-world datasets [6]. Degradation only appears under extremely high noise ratios, which are rare in practice, because, as discussed in line 670, the highly imbalanced wrong event distribution prevents the BMM and the model from converging. We will move the "Limitations" section from the appendix to the main paper.
Question 9: What is the meaning of f() and noise rate in the paper?
Response:
Thanks for pointing this out! We apologize for the confusing notation around f(·). f(·) outputs the softmax probabilities of the model. We will explicitly define the output of f(·) in each context (predicted class vs. softmax probabilities) to remove all ambiguity in Equations (1), (10), and (11). The noise rate is a given number in each task.
Question 10: On clarifying whether the main contribution is the Wrong Event metric or the IDO framework.
Response:
Thank you for the question that helped us clarify our core contribution. We regard the two components as an inseparable and mutually reinforcing system that together forms our main contribution. Current frameworks suffer from high computational costs, heavy hyperparameter tuning, and coarse-grained optimization, which our lightweight, hyperparameter-free IDO framework is designed to address. The dynamic loss coefficients of IDO rely on robust noise modeling, but existing metrics such as loss [7], EMA loss [8], FE [9], and FkL [10] have known limitations (Lines 76, 296, 584). We therefore introduce wrong event, which supplies accurate noise-modeling information at no extra computational cost. While wrong event could in principle be plugged into other frameworks and IDO could adopt alternative metrics, the two components are designed to complement each other and achieve optimal performance when used jointly.
Question 11: On the question of why the method uses 10/15 epochs vs. 100 epochs shown in figures.
Response:
Thank you for this detailed observation. The difference in epoch numbers stems from the model initialization settings: Fig. 2 shows 100 epochs for randomly initialized ResNet-18 models, while our method uses 10 or 15 epochs for pretrained ResNet-50 models. For randomly initialized models, the visualizations of histograms and AUC-ROC curves are in Fig. 2, 5, 6, 8, and the calculations of precision, recall, F1-score, and AUC are in Tab. 9, 11. For pretrained models, the visualizations are in Fig. 8 and the calculations are in Tab. 10, 12. The results confirm that the core intuition behind our framework holds in both scenarios.
Once again, we thank you for your valuable and insightful feedback, which has significantly improved our work. We sincerely hope that our analysis and experiments address your confusion and questions, providing a clearer understanding of our work. If you have any questions about our paper, please feel free to point them out and we will try to address them as soon as possible. Thanks, and we look forward to your reply!
[1] Symmetric Cross Entropy for Robust Learning With Noisy Labels. ICCV'19
[2] Beyond Class-Conditional Assumption: A Primary Attempt to Combat Instance-Dependent Label Noise. AAAI'21
[3] Learning from Massive Noisy Labeled Data for Image Classification. CVPR'15
[4] Beyond Synthetic Noise: Deep Learning on Controlled Noisy Labels. ICML'20
[5] Tripartite: Tackle Noisy Labels by a More Precise Partition. arXiv'22
[6] Learning from noisy labels with deep neural networks: A survey. CoRR'20
[7] DivideMix: Learning with Noisy Labels as Semi-supervised Learning. ICLR'20
[8] Robust Curriculum Learning: from clean label detection to noisy label self-correction. ICLR'21
[9] An Empirical Study of Example Forgetting During Deep Neural Network Learning. ICLR'19
[10] Late Stopping: Avoiding Confidently Learning from Mislabeled Examples. ICCV'23
Thank you for your detailed responses to my concerns. My main issue with the paper was clarity. You have provided further explanation for all points for which I was unclear, and thus I will raise my rating to 5 (accept). Please do incorporate these explanations into your main paper. Also, note that the citation for noise rates "observed in real-world datasets [6]" itself cites several papers - I strongly suggest also citing these original sources.
We appreciate your reply to our response! We will include all the explanations, experiments and citations in the final version. Thanks very much!
Dear Reviewer 2oRY,
Greetings from the authors!
To begin with, we sincerely appreciate your acknowledgement of the following aspects of our work:
- Idea of IDO is well-motivated and technically sound.
- Comprehensive outperformance of existing methods on all datasets and noise rates.
Regarding the evidence for some claims, the generalization of the intuition, and other concerns, we have conducted multiple experiments and included these experiments and analyses in the revised paper. We will include them in the final version and provide the link to the code repository upon acceptance.
If you have any questions about our paper, please feel free to point them out and we will try to address them as soon as possible. We would like to express our sincere gratitude for your insightful comments again. Looking forward to your reply! Thanks!
Dear reviewer 2oRY,
Thank you for your thoughtful efforts in reviewing our work! As the discussion period nears its end, we kindly hope you can take a moment to review our rebuttal. Looking forward to your feedback!
Best Wishes!
NeurIPS Authors
This paper proposes a two-stage framework, IDO (Instance-level Difficulty Modeling and Dynamic Optimization), for robust training on noisy labeled data. It introduces wrong event, a metric counting incorrect predictions per sample, which effectively distinguishes clean, noisy, and hard samples throughout training. Experiments on synthetic and real-world datasets show that IDO outperforms state-of-the-art methods in accuracy, efficiency, and scalability.
Strengths and Weaknesses
The method is well-motivated and empirically sound. The paper is clearly written, with thorough explanations, visualizations, and ablation studies. In addition, the proposed approach is thoughtfully designed to reduce reliance on hyperparameter tuning.
However, the paper does not sufficiently discuss potential failure cases, such as performance under a wider range of noise levels, especially in instance-dependent noise settings. Moreover, although obtaining prior knowledge in Stage 1 is crucial—particularly in capturing the label wave number—there is a lack of detailed analysis on its impact and sensitivity.
Questions
- How does the method perform when class boundaries are highly entangled?
- Similar to question 1, additional experiments across a wider range of noise levels, especially under instance-dependent noise, would strengthen the paper. These results would offer a more realistic understanding of the method's robustness and practical applicability.
- According to Table 5, the similarity loss seems to have a larger effect in more challenging noise settings (e.g., Asym. 40%, Inst. 40%). Does this imply that the method relies more heavily on consistency learning under difficult conditions? What happens if the difficulty score is fixed to a high value, i.e., treating all samples as difficult? Is it desirable for easy samples to receive a low weight for similarity loss? A deeper analysis of the behavior of the difficulty-based weighting would be helpful.
- Including more baselines, such as InstanceGM, would provide a more comprehensive comparison and help contextualize the proposed method's performance more clearly.
- Garg, Arpit, et al. "Instance-dependent noisy label learning via graphical modelling." Proceedings of the IEEE/CVF winter conference on applications of computer vision. 2023.
Limitations
As the paper mentioned, the situation with a high noise ratio differs from the actual data environment, but analyzing performance will help in evaluating the robustness of the methodology.
Final Justification
The authors have provided a detailed and thoughtful rebuttal with substantial additional experiments and analyses. Their responses addressed the main concerns raised in my initial review, particularly regarding robustness under diverse noise conditions, the sensitivity of Stage 1, and the behavior of the difficulty-based weighting mechanism. The inclusion of additional baselines such as InstanceGM and NAL also strengthened the empirical evaluation.
Based on the updated understanding, I have raised my score to borderline accept. I look forward to seeing the revised version reflect these improvements.
Formatting Issues
.
Overall Response:
We sincerely thank you for your detailed and constructive feedback. Your insightful questions have helped us to further strengthen the paper. We have conducted several new experiments based on your suggestions and are confident that incorporating these results and discussions will significantly improve the final manuscript. Below, we address each of your points in detail.
Question 1: The paper does not sufficiently discuss potential failure cases, such as performance under a wider range of noise levels, especially in instance-dependent noise settings and high noise ratio settings. More experiments help to evaluate the robustness and practical applicability of the method.
Response:
Thank you for this excellent suggestion. To thoroughly evaluate IDO's robustness and practical applicability, we agree that testing on a wider range of noise levels is crucial. We have conducted 6 new experiments on CIFAR-100 using pretrained ResNet-50 with more noise levels. For symmetric noise, a high noise ratio of 80% is added. For asymmetric noise, 20% and 45% are added. For instance-dependent noise, 20% and 60% are added. Besides, a noise-free (0%) baseline is included. The results are summarized below.
| Methods | Sym. 0% | Sym. 80% | Asym. 20% | Asym. 45% | Inst. 20% | Inst. 60% |
|---|---|---|---|---|---|---|
| Standard | 82.8% | 35.2% | 68.5% | 46.2% | 70.4% | 31.6% |
| UNICON | 84.8% | 64.0% | 83.5% | 73.2% | 84.9% | 58.2% |
| DeFT | 85.1% | 57.6% | 78.9% | 65.2% | 82.4% | 56.5% |
| IDO | 85.8% | 62.5% | 85.4% | 74.1% | 85.1% | 60.8% |
These results demonstrate that IDO maintains a significant performance advantage across a wide spectrum of noise types and levels, including the challenging 45% asymmetric noise and 60% instance-dependent noise scenario. This confirms the robustness of our approach, even under conditions far more extreme than the 8%–38.5% noise typically found in real-world datasets [1]. Impressively, IDO also achieves the best performance in the noise-free (0%) setting, showing it does not degrade performance on clean data.
As we discussed in our limitations (line 670), performance degradation is only observed at an extremely high noise ratio of 80%, which is expected due to the severe imbalance in the wrong event distribution that affects the BMM and model convergence. We believe these new results comprehensively address the concern about robustness and practical applicability.
Question 2: Although obtaining prior knowledge in Stage 1 is crucial, particularly in capturing the label wave number, there is a lack of detailed analysis on its impact and sensitivity.
Response:
We appreciate the reviewer's focus on the criticality of Stage 1. We agree that its impact and sensitivity are vital for the framework. In our original manuscript (Table 4), we investigated how the duration of Stage 1 influences capturing the label wave number and, in turn, the final performance.
For the reviewer's convenience: Table 4 shows that our method's final performance remains consistently high regardless of whether Stage 1 has begun to fit noise, across early (1/2 epochs), middle (4 epochs), or late (8 epochs) stopping points, demonstrating robustness throughout. This shows that our framework, especially the step of obtaining the prior knowledge, is not sensitive to the exact stopping point of Stage 1. This robustness is also supported by the original Label Wave paper [2], whose experiments on various noise settings and datasets established its stability.
To make this point clearer for the reader, we will add a dedicated paragraph in the final manuscript to explicitly discuss the sensitivity and impact of Stage 1, referencing these findings.
Question 3: How does the method perform when class boundaries are highly entangled?
Response:
This is a very insightful question. Highly entangled class boundaries are precisely the challenge posed by asymmetric, instance-dependent, and real-world noise. Our framework handles this through two parts: 1. The wrong event metric helps identify these hard examples, which tend to have fluctuating predictions and thus cluster in the middle of the wrong event distribution. 2. The difficulty coefficient then dynamically up-weights the consistency loss for these specific samples.
This forces the model to learn more robust and consistent representations for samples near the decision boundary, effectively disentangling them. As demonstrated in our new experiments for Question 1, IDO's superior performance under 45% asymmetric and 60% instance-dependent noise directly demonstrates its effectiveness in handling such entangled boundaries.
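To make the weighting mechanism concrete, here is a simplified per-sample weighting in the spirit of the framework (illustrative only: it omits the noisy-sample correction term and is not the paper's exact Eq. (8); the function and variable names are ours):

```python
import torch
import torch.nn.functional as F

def difficulty_weighted_loss(logits_w, logits_s, given_labels, p_clean, difficulty):
    """logits_w / logits_s: model outputs on weak / strong augmentations, shape (B, C).
    p_clean and difficulty are per-sample scores in [0, 1] estimated from the BMM
    over wrong event values. Clean-looking samples are driven mainly by the
    cross-entropy term; difficult samples get a larger consistency weight."""
    ce = F.cross_entropy(logits_w, given_labels, reduction="none")  # (B,) supervised term
    p_w = F.softmax(logits_w, dim=1)
    p_s = F.softmax(logits_s, dim=1)
    consistency = ((p_w - p_s) ** 2).sum(dim=1)                     # per-sample squared error
    return (p_clean * ce + difficulty * consistency).mean()
```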
Question 4: The similarity loss seems to have a larger effect in more challenging noise settings. Does this imply that the method relies more heavily on consistency learning under difficult conditions? What happens if the difficulty score is fixed to a high value—i.e., treating all samples as difficult? Is it desirable for easy samples to receive a low weight for similarity loss? A deeper analysis of the behavior of the difficulty-based weighting would be helpful.
Response:
We appreciate the reviewer's sharp observation and agree that this is a crucial aspect of our method. Your intuition is correct: IDO relies more heavily on consistency learning under difficult conditions. To analyze the behavior of the difficulty-based weighting, we conducted 4 new experiments comparing our dynamic weight against several fixed-weighting methods.
| Weighting Method | Sym. 60% | Asym. 40% | Inst. 40% |
|---|---|---|---|
| Without L_SIM (coefficient = 0) | 79.5 | 70.4 | 77.3 |
| Fixed coefficient = 0.25 | 81.3 | 76.5 | 83.3 |
| Fixed coefficient = 0.5 | 80.8 | 77.5 | 82.4 |
| Fixed coefficient = 1 | 80.3 | 76.8 | 82.8 |
| Dynamic coefficient (IDO) | 81.4 | 78.2 | 83.8 |
The results show that dynamic weighting > fixed weighting > dropping the loss term, and the performance shows no clear correlation with the fixed value of the difficulty coefficient. This supports our core hypothesis, much like the principle behind Focal Loss [3], $\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma}\log(p_t)$: a weighting term that enlarges the loss of difficult data is truly necessary and better than treating all samples the same. We consider that for easy data, the cross-entropy loss serves as a more direct and effective learning signal than the consistency loss, while for difficult data the consistency loss provides an effective learning signal. Both dropping the similarity loss and fixing the difficulty coefficient decrease performance.
Question 5: Including more baselines, such as InstanceGM [4], would provide a more comprehensive comparison and help contextualize the proposed method’s performance more clearly.
Response:
Thanks for providing related works! We agree that comparing with more recent work is important. We include InstanceGM [4] and NAL [5] as two additional baselines in 3 new experiments on CIFAR-100.
| Methods | Sym. 60% | Asym. 40% | Inst. 40% |
|---|---|---|---|
| Standard | 41.1 | 53.4 | 58.8 |
| InstanceGM | 80.5 | 76.3 | 83.1 |
| NAL | 80.9 | 77.6 | 81.3 |
| IDO | 81.4 | 78.2 | 83.8 |
The results show that while InstanceGM is strong in its target setting (instance noise), it underperforms in symmetric and asymmetric noise because InstanceGM is too cautious in judging noise which leads to insufficient label correction. NAL performs well but is slightly behind IDO, particularly in the most challenging instance-dependent scenario. IDO demonstrates consistently state-of-the-art performance across all three settings. We will include a more comprehensive comparison with InstanceGM and NAL in our revised manuscript.
Once again, we thank you for your valuable and insightful feedback, which has significantly improved our work. We sincerely hope that our analysis and experiments address your confusion and questions, providing a clearer understanding of our work. If you have any questions about our paper, please feel free to point them out and we will try to address them as soon as possible. We would like to express our sincere gratitude for your valuable comments again. Thanks, and we look forward to your reply!
[1] Learning from noisy labels with deep neural networks: A survey. CoRR'20.
[2] Early stopping against label noise without validation data. ICLR'20.
[3] Focal Loss for Dense Object Detection. ICCV'17.
[4] Instance-dependent noisy label learning via graphical modelling. CVPR'23.
[5] Noise attention learning: Enhancing noise robustness by gradient scaling. NeurIPS'22.
Thank you for the detailed response. I have carefully considered both the authors’ rebuttal and the comments from the other reviewers. The additional results and explanations have improved my understanding of the paper. I hope these points will be clearly incorporated into the revised version. Overall, my concerns have been sufficiently addressed in the rebuttal, and I am raising my score to borderline accept.
We appreciate your reply to our response! We will improve our paper according to the rebuttal and comments in the final version. Thanks very much!
Dear Reviewer hZpr,
Greetings from the authors!
To begin with, we sincerely appreciate your acknowledgement of the following aspects of our work:
- Idea of IDO is well-motivated and empirically sound.
- Clearly written with thorough explanations, visualizations, and ablation studies.
- Thoughtfully designed to reduce reliance on hyperparameter tuning.
Regarding the potential failure cases, difficulty-based weighting, more baselines and other concerns, we have conducted multiple experiments and included these experiments and analysis in the revised paper. We will include them in the final version and provide the link to code repository upon acceptance.
If you have any questions about our paper, please feel free to point them out and we will try to address them as soon as possible. We would like to express our sincere gratitude for your insightful comments again. Looking forward to your reply! Thanks!
The paper tackles the problem of learning with noisy labels. It devises a two-stage process to first identify samples that are clean, noisy, or difficult, and secondly trains a noise-robust loss-reweighted model given the information from the first stage. It shows improved performance over a large set of baselines across various benchmark datasets.
Four expert reviewers evaluated the paper. They found the approach well-motivated, and some found the paper clearly written with several illustrations. They particularly appreciated the simple "difficulty metric" and the hyperparameter-free design. They also had concerns regarding the lack of a wider range of (instance-dependent) noise benchmarks, a potential error in the equations in the appendix, and missing ablation studies and related work.
The authors provided a thorough rebuttal to the criticisms that was attended to by all the reviewers. Reviewers further discussed the rebuttal with the authors iteratively. The discussion led to all reviewers leaning towards acceptance.
The AC believes there is no outstanding major issue with the paper and therefore defers to the unanimous suggestion of the reviewers and suggests acceptance.