Learning Imbalanced Data with Beneficial Label Noise
Abstract
Reviews and Discussion
This paper proposes a new data-level approach called Label-Noise-based Re-balancing (LNR) to solve the imbalanced learning issue. LNR utilizes the introduction of asymmetric label noise to adjust decision boundaries and improve classifier performance, particularly for minority classes. Unlike existing approaches, LNR highlights the potential of beneficial label noise without introducing generative errors or losing information.
update after rebuttal
After reading the authors' responses, I have raised my score to 3.
Questions For Authors
- In the field of imbalanced learning, several recent mixup-based approaches [1][2] have been proposed. Similar to this paper, these approaches perform mixing either at the feature level or across labels. From certain perspectives, the LNR proposed in this paper could be viewed as a special case of such approaches. However, this paper fails to mention or discuss these approaches.
- The paper claims that existing data-level approaches lead to information loss or generative errors, whereas the proposed LNR alleviates these issues. However, if a portion of the majority class instances is flipped into the minority class, why would this process not also cause information loss for the majority class?
- The experimental section lacks recent baselines, particularly in the binary classification setting (the latest baseline considered was published in 2017). This makes it challenging to accurately evaluate the contributions of this paper. More recent approaches, such as [1][2], should be included to provide a comprehensive comparison.
- The paper frequently mentions “synthetic/simulated and real-world datasets”, yet the experimental section in the main text provides no description of the experiments conducted on synthetic datasets. It is unreasonable to confine these experiments solely to the appendix, as they are necessary for the completeness of the paper and should be mentioned in the main text.
[1] Remix: Rebalanced Mixup. ECCV 2020. [2] Selective Mixup Fine-Tuning for Optimizing Non-Decomposable Objectives. ICLR 2024.
Claims And Evidence
Claims
There are two main claims made in this paper:
- Existing data-level approaches often lead to information loss or generative errors, and existing algorithm-level approaches are often tailored to specific models or problem settings.
- LNR alleviates the issues of information loss and generative errors and can be integrated seamlessly with any classifier or algorithm-level approaches.
Evidence
For Claim 1, the related work section of this paper describes some existing data-level and algorithm-level approaches, and these descriptions support Claim 1. For Claim 2, the proposed LNR does not generate instances, so it does not lead to generative errors. Experimental results show that LNR can be integrated with existing algorithm-level approaches. However, considering that LNR flips some majority class instances into the minority class, I am unsure whether this might lead to information loss for the majority class.
Methods And Evaluation Criteria
Yes. The proposed LNR utilizes the introduction of asymmetric label noise to adjust decision boundaries, thereby improving classifier performance. In terms of evaluation criteria, this paper employs the F1 score, G-mean, AUC, and the average accuracy across Many-shot, Medium-shot, Few-shot, as well as the overall dataset. These evaluation metrics are widely used in other studies on imbalanced learning.
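For reference, F1 and AUC are available directly in scikit-learn, while G-mean is usually computed by hand; a minimal sketch of the latter (my own illustration, not code from the paper):

```python
import numpy as np
from sklearn.metrics import recall_score

def g_mean(y_true, y_pred):
    # G-mean: geometric mean of per-class recalls; in the binary case
    # this equals sqrt(sensitivity * specificity).
    recalls = recall_score(y_true, y_pred, average=None)
    return float(np.prod(recalls) ** (1.0 / len(recalls)))
```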
Theoretical Claims
Yes. The paper provides detailed theoretical insights into the impacts of imbalance ratios on decision boundaries. I have reviewed these insights, and they seem to be correct.
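For context, the textbook prior-shift argument behind such results can be stated as follows (standard notation, not necessarily the paper's): with class priors $\pi_0 > \pi_1$, the accuracy-optimal Bayes rule is

$$
\hat{y}(x) = \mathbf{1}\!\left[\,\log\frac{p(x \mid y=1)}{p(x \mid y=0)} \ \ge\ \log\frac{\pi_0}{\pi_1}\,\right],
$$

so the log-likelihood-ratio threshold grows with the imbalance ratio $\pi_0/\pi_1$, pushing the accuracy-optimal boundary deeper into the minority-class region.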
Experimental Designs Or Analyses
Yes. The experimental section partially validates the effectiveness of LNR. However, there are several issues in the current experimental section: for example, it lacks recent comparison baselines, and there is no mention of synthetic experiments in the main text.
Supplementary Material
No. The authors have not provided the supplementary material.
Relation To Broader Scientific Literature
This paper claims that existing data-level approaches often lead to information loss or generative errors, while algorithm-level approaches are often tailored to specific models or problem settings. To address these limitations, it introduces a novel data-level approach called LNR. LNR leverages asymmetric label noise to adjust decision boundaries, effectively mitigating issues of information loss and generative errors.
Essential References Not Discussed
Yes. This paper lacks references to some more recent and relevant studies in the field of imbalanced learning. I have listed some of them as follows: [1] Remix: Rebalanced Mixup. ECCV 2020. [2] Selective Mixup Fine-Tuning for Optimizing Non-Decomposable Objectives. ICLR 2024.
Other Strengths And Weaknesses
Strengths:
- This paper proposes a new approach called LNR for imbalanced learning. LNR is applicable to both binary and multi-class imbalanced problems and can be seamlessly integrated with any classifier or algorithm-level approach.
- This paper provides detailed theoretical insights into the impacts of imbalance ratios on decision boundaries and introduces asymmetric label noise to mitigate these impacts.
- This paper validates the effectiveness of LNR on multiple datasets using various evaluation metrics. Experimental results show that LNR outperforms its competitors in several settings.
Weaknesses:
- In the field of imbalanced learning, several recent mixup-based approaches [1][2] have been proposed. Similar to this paper, these approaches perform mixing either at the feature level or across labels. From certain perspectives, the LNR proposed in this paper could be viewed as a special case of such approaches. However, this paper fails to mention or discuss these approaches.
- The paper claims that existing data-level approaches lead to information loss or generative errors, whereas the proposed LNR alleviates these issues. However, if a portion of the majority class instances is flipped into the minority class, why would this process not also cause information loss for the majority class?
- The experimental section lacks recent baselines, particularly in the binary classification setting (the latest baseline considered was published in 2017). This makes it challenging to accurately evaluate the contributions of this paper. More recent approaches, such as [1][2], should be included to provide a comprehensive comparison.
- The paper frequently mentions “synthetic/simulated and real-world datasets”, yet the experimental section in the main text provides no description of the experiments conducted on synthetic datasets. It is unreasonable to confine these experiments solely to the appendix, as they are necessary for the completeness of the paper and should be mentioned in the main text.
[1] Remix: Rebalanced Mixup. ECCV 2020. [2] Selective Mixup Fine-Tuning for Optimizing Non-Decomposable Objectives. ICLR 2024.
Other Comments Or Suggestions
- On lines 73-74 of page 2, there are too many references (Goldberger & Ben-Reuven, 2022; Liu & Tao, 2015; ...; Zhu et al., 2003) that are unrelated to this paper.
- On line 240 of page 5, the end of Lemma 4.1 is missing a period.
- Algorithm 1 and Figure 3 are placed too far from the main text where they are mentioned.
Reply to Information Loss caused by label flipping in Claims And Evidence and Weakness 2
- We sincerely appreciate your insight on potential information loss due to label flipping. In LNR, the majority-class samples selected for flipping are primarily outliers that have deeply encroached into the minority-class region beyond the decision boundary. In this way, LNR enriches the minority class while preserving the core distribution of the majority class.
- Compared to undersampling approaches that discard large numbers of majority-class samples (especially in extreme imbalance scenarios), LNR requires significantly fewer data edits (only 94 label flips on CIFAR10-LT), mitigating information loss.
- The number of flips can be regulated through the threshold parameter, which can be optimally selected through cross-validation to avoid unexpected negative impacts on majority-class performance (a minimal selection sketch follows this list). Sensitivity analysis (Appendix C.3) and Tables 1-2 show that with a well-calibrated threshold, LNR enhances minority-class performance without compromising overall accuracy, a justified trade-off given the empirical gains.
- In the revised version, we will adjust our claim on "information loss," acknowledging that trade-offs (and some information loss) exist but can be well controlled.
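A minimal sketch of such a cross-validated threshold search, assuming numpy arrays with binary labels and a hypothetical `fit_and_flip(X, y, t)` helper that applies LNR at threshold `t` and returns a trained classifier (the grid and metric are illustrative, not the paper's procedure):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score

def select_flip_threshold(X, y, fit_and_flip, candidates=np.linspace(0.1, 0.9, 9)):
    # Pick the flip threshold that maximizes cross-validated F1.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    mean_scores = []
    for t in candidates:
        scores = []
        for tr, va in cv.split(X, y):
            clf = fit_and_flip(X[tr], y[tr], t)  # hypothetical LNR + training step
            scores.append(f1_score(y[va], clf.predict(X[va])))
        mean_scores.append(np.mean(scores))
    return candidates[int(np.argmax(mean_scores))]
```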
Reply to comment in Essential References Not Discussed section and Weakness 1
- We appreciate your valuable feedback on incorporating more recent literature. We will add discussions of the latest approaches in imbalanced learning (please refer to our response to Reviewer AMhG for details).
- To strengthen our comparative analysis, we have included comparisons with state-of-the-art methods such as MiSLAS, ReMix, and SelMix. However, we encountered some practical challenges in implementation: ReMix does not provide official code, and SelMix lacks reproducible code for its supervised version. Experiments on CIFAR100 and ImageNet require more time for proper adaptation and reproduction due to framework-specific complexities. We will include these results during the rebuttal period once they are available, to provide a more comprehensive comparison. Below are the CIFAR10 (IR=100) results:
CIFAR10-LT (IR=100):

| Method | Overall Accuracy | Many-shot | Medium-shot | Few-shot | ECE ↓ |
|---|---|---|---|---|---|
| MiSLAS-stage2 | 82.1 | 91.0 | 80.2 | 75.7 | 3.70 |
| MiSLAS+ReMix | 82.9 | 90.0 | 79.8 | 79.8 | 19.6 |
| MiSLAS+SelMix* | 83.3 | - | - | - | - |
| MiSLAS+LNR | 83.4 | 87.6 | 80.1 | 83.6 | 4.26 |
It is worth highlighting that SelMix requires an additional balanced validation set of size 5000, while our LNR does not, yet still achieves performance comparable to, or better than, SelMix.
Reply to Weakness 1
We thank the reviewer for noting the conceptual similarities between our method and SelMix. While both methods share the common goal of improving imbalanced classification, key differences exist:
- SelMix computes a gain matrix using class centroids from a balanced validation set to guide its mixup sampling at the class level. While this shares some conceptual similarities with our label noise selection process based on feature similarity ranks, LNR differs fundamentally: we only flip labels of majority-class samples that exhibit instance-level feature similarity to minority classes, without requiring any auxiliary data and without any feature mixing or sample generation (see the sketch after this list).
- In LNR, any sample with features similar to those of the minority class may be flipped, regardless of how far or close the feature centroids of the two classes are. For instance, in Figure 1, although the feature centroid of majority class 2 is relatively closer to class 9, the samples flipped to class 9 by LNR primarily come from classes 4 and 8, as these show stronger instance-level similarity to the minority class.
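To make the distinction concrete, here is a minimal sketch of instance-level selection (my illustration with hypothetical names and a cosine-similarity ranking, not the paper's exact estimator):

```python
import numpy as np

def select_flip_candidates(feats_major, feats_minor, top_k):
    # Rank each majority-class sample by its maximum cosine similarity
    # to any individual minority-class sample (instance level, not
    # centroid level) and nominate the top_k for label flipping.
    a = feats_major / np.linalg.norm(feats_major, axis=1, keepdims=True)
    b = feats_minor / np.linalg.norm(feats_minor, axis=1, keepdims=True)
    sim = a @ b.T                      # (n_major, n_minor) cosine similarities
    score = sim.max(axis=1)            # similarity to the closest minority instance
    return np.argsort(-score)[:top_k]  # indices of the most minority-like samples
```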
Notably, as shown in our CIFAR-10 experiments, LNR achieves comparable performance to SelMix while eliminating its dependency on balanced validation data. This makes our approach more practical for real-world scenarios where such balanced datasets may be unavailable.
We will expand our discussion of these comparisons in the revised manuscript (Section 3.2) to better highlight LNR's unique contributions to the field.
Reply to Weakness 3&4
- W3: Our KEEL experiments with classic ML methods (KNN, CART) primarily validate our theoretical results and demonstrate LNR's model- and data-agnostic advantages, offering higher performance than classical methods on tabular data. The comparison with recent methods is conducted on long-tailed multi-class image data using DNN models, where these methods are explicitly targeted. Our results show consistent gains over algorithm-level methods and SOTA performance over data-level methods, highlighting our contributions.
- W4: Kindly refer to the results and analysis of synthetic data in Appendix C.7 (after the references), including tables of full results.
Thanks for the responses. I have raised my score to 3.
Dear Reviewer k9AK,
We are grateful for the opportunity to address the reviewer's valuable concerns and will meticulously incorporate all suggested improvements in our final revision. The constructive feedback and encouraging recognition of our work are deeply appreciated.
We would like to kindly point out that you may review our newly updated comparison results with ReMix and SelMix (provided in the response to Reviewer AMhG's latest comments). The experimental results demonstrate that our method and the feature-dependent asymmetric label noise model achieve superior performance to the previous SOTA SelMix without utilizing any additional data, whereas SelMix's optimal performance relied on an extra 10k balanced validation set. We are deeply grateful for your suggestions, which have significantly improved our work.
Best wishes,
Authors of #9268 submission
In this paper, the authors study the problem of class imbalance. To be specific, they propose using asymmetric label noise in favor of the minority classes to mitigate the bias on the decision boundary between majority and minority classes. To this end, the authors formulate Bayesian optimal decision boundaries in an imbalanced setting for accuracy and F1 scores. Then, they introduce a label noise rebalancing approach based on the cardinalities of samples in classes. The proposed approach is evaluated across a diverse set of problems.
update after rebuttal
I've read the comments by other reviewers and the rebuttal provided by the authors. The rebuttal addressed my concerns, and I increased my recommendation accordingly.
Questions For Authors
Please see Weaknesses.
Claims And Evidence
I would say yes, to a large extent.
Methods And Evaluation Criteria
Yes, to a large extent.
Theoretical Claims
No, since most of these are already known in the literature.
Experimental Designs Or Analyses
Yes.
Supplementary Material
Yes, the additional results.
Relation To Broader Scientific Literature
The paper does not sufficiently position its analysis with prior work providing the same or similar analyses / findings on the impact of class imbalance on the decision boundary.
Essential References Not Discussed
The following work which has analyzed the impact of imbalance on the decision boundary:
Bishop's book or articles such as "Adjusting the Outputs of a Classifier to New a Priori Probabilities: A Simple Procedure", "Rethinking Class Imbalance in Machine Learning", "To be Robust or to be Fair: Towards Fairness in Adversarial Training".
Other Strengths And Weaknesses
Strengths:
- The proposed approach is novel and well-motivated.
- The improvements over the baseline methods are very strong, especially on minority classes.
- The paper is well-written and easy to follow.
Weaknesses:
- "Motivated by the decision boundary distortion due to class imbalance," => But this is already known in the literature. It is not clear how much of the analysis of the impact of imbalance on the decision boundary is known in the literature (even in textbooks like Bishop's book or articles such as "Adjusting the Outputs of a Classifier to New a Priori Probabilities: A Simple Procedure", "Rethinking Class Imbalance in Machine Learning", "To be Robust or to be Fair: Towards Fairness in Adversarial Training") and how much is new in the paper.
- Some arguments are not surprising (and it is unclear whether they are new), such as: "This paper theoretically shows that optimizing for accuracy leads to a decision boundary in binary classification misaligned with metrics like the F1 score."
- The baselines should have included a simple decision-boundary-based approach similar to Reject Option Classification, as well as more advanced and recent logit-adjustment approaches such as DRO-LT.
- The experimental evaluation should have included more challenging long-tailed datasets such as iNaturalist, ImageNet-LT, and Places-LT.
Other Comments Or Suggestions
None.
Reply to Relation To Broader Scientific Literature & Essential References Not Discussed & Weakness 1 and 2
- We sincerely appreciate you providing these books and the two recent articles; they are very insightful. We recognize that some of these prior studies share similar conclusions with the theoretical portion of our paper, and we will adequately acknowledge these overlaps in our revised manuscript. Additionally, we will supplement our discussion with existing findings on how class imbalance affects decision boundaries, including the articles you kindly pointed out.
- However, we would like to kindly emphasize that the core contribution of our theory lies in proving the impact of our proposed asymmetric instance-dependent label noise model on decision boundaries under class imbalance. Like other imbalanced learning methods, the mechanism of how data imbalance biases the classifier's decision boundary serves as the motivation for our theoretical framework. Yet, by introducing asymmetric label noise through our defined beneficial label noise model, we theoretically demonstrate how our beneficial asymmetric noise modifies decision boundaries and empirically validate that this approach effectively and beneficially corrects biased decision boundaries.
Reply to Weakness 3 and 4
- W3: We appreciate your suggestions about our baseline comparisons and relevant works.
  - In response to Reviewer AMhG, we have expanded the discussion of related literature (i.e., logit-adjustment approaches and metric-optimization approaches).
  - We also included new experimental comparison results with MiSLAS, ReMix, and SelMix (the most recent SOTA) in our response to Reviewer k9AK. We further clarified some fundamental differences between our method and SelMix. We hope these revisions and explanations adequately address your concerns, and we are deeply grateful for your insightful feedback.
- W4: We sincerely appreciate your suggestion regarding comparative experiments on large-scale datasets. We fully acknowledge that such comparisons could provide a more comprehensive evaluation of our approach.
  - We are currently conducting additional comparisons on ImageNet with state-of-the-art methods. As these experiments require more time to complete, we will promptly share the results with you should they become available during the rebuttal period.
  - In addition, we would like to re-emphasize that our primary research focus has been on validating the feasibility and effectiveness of feature-dependent asymmetric label noise in addressing class imbalance, which we have thoroughly demonstrated through extensive experiments on CIFAR-10/100 under both step-wise and long-tailed imbalance scenarios.
I would like to thank the authors for providing a rebuttal. The answers provided by the authors do address my concerns, and therefore, I will increase my recommendation accordingly.
Dear Reviewer VaeL,
We sincerely thank you for your insightful comments and are deeply grateful for your acknowledgment of our efforts. We greatly value your constructive feedback and will thoroughly address all concerns in our final manuscript revision.
Best wishes,
Authors of #9268 submission
Updates on Experimental Results
Dear Reviewer VaeL,
We have carefully reimplemented both ReMix and SelMix, and conducted comprehensive comparative experiments on CIFAR-10/100. We would like to take this opportunity to update you and the other reviewers with our latest comparison results against state-of-the-art methods. We are currently finalizing the ImageNet experiments, which require more time to complete given the dataset's scale and complexity. We commit to sharing these additional results as soon as they become available.
CIFAR10-LT (Imbalance ratio = 100):

| Method | Overall Accuracy | Many-shot | Medium-shot | Few-shot | ECE ↓ |
|---|---|---|---|---|---|
| MiSLAS-stage2 | 82.1 | 91.0 | 80.2 | 75.7 | 3.70 |
| +ReMix | 82.9 | 90.0 | 79.8 | 79.8 | 19.6 |
| +SelMix(10k) | 83.3 | 86.8 | 80.5 | 83.5 | 2.75 |
| +SelMix(1k) | 82.7 | 81.6 | 79.9 | 87.6 | 2.40 |
| +SelMix(imb) | 81.8 | 82.2 | 81.4 | 81.9 | 6.36 |
| +LNR | 83.4 | 87.6 | 80.1 | 83.6 | 4.26 |

CIFAR100-LT (Imbalance ratio = 100):

| Method | Overall Accuracy | Many-shot | Medium-shot | Few-shot | ECE ↓ |
|---|---|---|---|---|---|
| MiSLAS-stage2 | 46.85 | 62.05 | 48.42 | 26.07 | 5.43 |
| +ReMix | 46.59 | 59.06 | 49.22 | 27.93 | 18.28 |
| +SelMix(10k) | 47.20 | 61.19 | 51.60 | 24.51 | 1.36 |
| +SelMix(1k) | 46.04 | 61.27 | 50.82 | 21.34 | 1.32 |
| +SelMix(imb) | 45.65 | 56.66 | 51.17 | 25.31 | 3.01 |
| +LNR | 47.23 | 60.28 | 50.66 | 26.90 | 5.28 |
- LNR achieves state-of-the-art performance on both CIFAR10-LT and CIFAR100-LT benchmarks, with overall accuracies of 83.4% and 47.23% respectively, while crucially eliminating the need for balanced validation data, a significant advantage over the previous state-of-the-art method SelMix, whose performance deteriorates substantially when using smaller (1k) or imbalanced validation sets.
- LNR demonstrates comprehensive improvements on few-shot classes, as evidenced by:
  - (a) on CIFAR10-LT, achieving better few-shot accuracy (+3.8% over ReMix);
  - (b) on CIFAR100-LT, delivering superior few-shot results (26.90% vs. 24.51%) compared to SelMix(10k).
- SelMix's performance is sensitive to specific validation set characteristics, as shown by two critical limitations:
  - (a) while SelMix(10k) with ideal balanced validation data achieves comparable overall accuracy, its performance may drop below the baseline MiSLAS when using either smaller (1k) or imbalanced validation sets;
  - (b) the requirement of 10k balanced validation data (equivalent to 1,000 and 100 additional samples per class for CIFAR10 and CIFAR100 respectively, representing 20 times the tail-class training samples) creates substantial practical implementation barriers.
- Model calibration (Expected Calibration Error, ECE): LNR maintains better model calibration (ECE = 4.26) than ReMix's severely compromised calibration (ECE = 19.6). Although SelMix provides better ECE performance, this relies on additional balanced auxiliary data, and the ECE consequently increases when the auxiliary data fails to meet balance conditions: on CIFAR10-LT, SelMix(imb)'s ECE = 6.36 is higher than LNR's ECE = 4.26. This demonstrates that our feature-dependent asymmetric label noise model enables more precise decision boundary correction while preserving model reliability, without requiring any external validation data.
- These experimental results collectively validate the effectiveness and contribution of LNR's novel model/data-agnostic label noise model in solving imbalanced learning by:
  - eliminating dependency on external balanced data;
  - delivering state-of-the-art performance while establishing class fairness;
  - maintaining model calibration, representing a significant advancement over existing approaches.
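For reference, the ECE values above follow the standard binned definition of Expected Calibration Error; a minimal sketch, assuming numpy arrays of top-label confidences and 0/1 correctness indicators:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    # Bin predictions by confidence, then sum the |accuracy - confidence|
    # gap per bin, weighted by the fraction of samples falling in the bin.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece
```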
The paper introduces a novel method called Label-Noise-based Re-balancing (LNR) to address imbalanced classification problems by incorporating beneficial label noise. This approach involves flipping labels of majority class samples to minority classes to adjust decision boundaries and enhance classifier performance, particularly for minority classes. The authors provide a theoretical analysis focused on binary classification and extend the method to multi-class settings. Experiments on synthetic and real-world datasets demonstrate LNR’s effectiveness and its compatibility with various classifiers.
update after rebuttal
I believe the authors have adequately addressed my concerns. There are no remaining major concerns from my side, so I have adjusted my score accordingly.
Questions For Authors
Theoretical Generalization: Can you extend the theoretical analysis to demonstrate how LNR optimizes a wider range of performance metrics, particularly for multi-class settings, beyond the binary case?
Metric-Specific Comparisons: Why were methods that directly optimize the evaluated metrics (e.g., F1-score, overall accuracy) not included in the experiments, and how would LNR fare against them?
Feature Extraction and Fairness: How do you ensure that label noise does not disrupt feature extraction or compromise class fairness across diverse datasets, and can you provide additional evidence beyond the CIFAR-10-LT confusion matrix?
Claims And Evidence
The primary claim is that introducing controlled label noise can mitigate the effects of data imbalance on decision boundaries, thereby improving classifier performance, especially for minority classes. The authors support this with theoretical derivations showing how label noise influences decision boundaries and with experimental results reporting improved metrics such as accuracy and F1-score compared to baseline methods. However, the evidence is insufficiently convincing. The theoretical contributions are limited and not novel, focusing on specific metrics without broader generalization. Empirically, the improvements may be inflated by the choice of metrics that do not fully reflect overall performance, raising doubts about the method’s true effectiveness.
Methods And Evaluation Criteria
LNR employs an asymmetric label noise model, where labels of majority class samples are flipped to minority classes based on a flip-rate estimator, typically implemented via a modified MLP classifier. The evaluation uses standard classification metrics like accuracy and F1-score across various datasets and classifiers. However, the choice of these metrics is questionable: they fail to convincingly demonstrate LNR's superiority, particularly in terms of overall performance. The lack of metrics aligned with methods that directly optimize performance measures of interest further weakens the evaluation framework, suggesting that the reported gains may not generalize beyond the specific experimental setup.
Theoretical Claims
The paper offers a theoretical analysis for binary classification, illustrating how label noise adjusts decision boundaries to favor minority classes. While this provides some insight, these results are largely known and restricted to two specific metrics, lacking generalization to a broader class of performance measures. Additionally, there is no clear theoretical analysis demonstrating how well LNR optimizes any given metric of interest. This limitation undermines the claim of a significant theoretical contribution, as the analysis does not extend meaningfully to multi-class scenarios or diverse evaluation criteria beyond the binary case.
Experimental Designs Or Analyses
The experimental design spans synthetic and real-world datasets, testing LNR with multiple classifiers to showcase its versatility. Results indicate performance improvements, particularly for minority classes, and the authors provide a confusion matrix for CIFAR-10-LT to address fairness concerns. However, there is an issue here: the absence of comparisons with methods that directly optimize the performance metrics evaluated in the paper. This omission makes it difficult to assess LNR's relative effectiveness against state-of-the-art approaches. While the authors claim integration with methods like GCL and LDAM, the lack of direct metric-specific benchmarks limits the empirical rigor and persuasiveness of the findings.
Supplementary Material
Yes.
Relation To Broader Scientific Literature
The paper discusses traditional resampling techniques and some algorithm-level methods but falls short in engaging with the broader scientific literature. It lacks a comprehensive comparison with recent advancements in imbalanced learning, especially post-2022 methods that might directly optimize the same performance metrics. The authors’ response mentions integration with methods like Mixup and SelMix but does not sufficiently differentiate LNR or contextualize it within the latest data-level or noise-based approaches. This inadequate discussion weakens the paper’s positioning within the field and its claim of opening a new avenue for imbalanced learning.
Essential References Not Discussed
Some key references are missing, particularly those related to methods that directly optimize the performance metrics used in the paper’s evaluations (e.g., accuracy, F1-score). Examples might include recent works on metric-specific optimization for imbalanced data, such as advanced re-weighting schemes or generative methods beyond Mixup, which are not cited or compared.
Other Strengths And Weaknesses
Strengths:
- Innovative Concept: The use of beneficial label noise to address imbalance is a creative departure from traditional resampling, avoiding information loss and generative errors.
- Generality: LNR's compatibility with various classifiers and datasets is a practical advantage.
- Theoretical Insights: The binary classification analysis offers a foundation, even if limited.
Weaknesses:
- Limited Theoretical Scope: The analysis does not generalize beyond specific metrics or robustly extend to multi-class settings.
- Insufficient Comparisons: The lack of benchmarks against metric-optimizing methods undermines empirical claims.
- Under-Explored Impacts: Feature extraction impacts and fairness are under-explored beyond the CIFAR-10-LT confusion matrix.
Other Comments Or Suggestions
To strengthen the paper, the authors should:
- Expand Theoretical Analysis: Provide a broader analysis covering multiple performance metrics and a clearer link to multi-class optimization.
- Enhance Comparisons: Include empirical comparisons with state-of-the-art methods that directly optimize the evaluated metrics, such as recent re-weighting or augmentation techniques.
- Address Feature and Fairness Impacts: Conduct a deeper investigation into how label noise affects feature extraction and class fairness, beyond the provided confusion matrix, possibly with additional datasets or metrics like fairness indices.
- Update Related Work: Incorporate and discuss post-2022 literature to better situate LNR within the current research landscape.
Reply to Claims And Evidence and Methods And Evaluation Criteria
Thank you for your thoughtful comments. We appreciate your feedback and would like to clarify that our evidence sufficiently supports our claims.
- Our theoretical analysis focuses on the deviation between the optimal F1 decision boundary and the optimal Bayes decision boundary (rather than other metrics) because the F1-score is typically the primary evaluation criterion for imbalanced binary classification tasks, especially when minority-class recognition is crucial, as in fraud detection scenarios.
- Through a comparative analysis of how optimal decision boundaries are affected by both label noise and imbalance ratio, we designed a feature-dependent asymmetric label noise model. Our primary theoretical contribution lies in LNR's ability to make the label noise's effect on decision boundaries counterbalance the impact of class imbalance, effectively correcting the boundary shift (a simplified illustration follows this list).
- We would also like to kindly emphasize that our evaluation metrics are comprehensive and widely accepted. These include F1, G-Mean, and AUC for binary classification tasks, and Many/Medium/Few-shot metrics, which are broadly used for imbalanced multi-class classification and acknowledged by Reviewer k9AK and other reviewers.
- The experimental results in Tables 1-2, along with a comparison against the latest SOTA methods in our response to Reviewer k9AK, provide sufficient evidence of the effectiveness of our approach. In our latest comparison with ReMix/SelMix based on MiSLAS, we also included Expected Calibration Error (ECE) to address potential concerns regarding model calibration when modifying the data. As shown in the results, LNR, due to its zero feature editing of the data, not only enhances performance but also achieves a significantly lower ECE than ReMix.
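To make the counterbalancing intuition above concrete, consider a simplified instance-independent version of the noise model (our illustration for intuition only; the paper's model is feature-dependent). If majority (class 0) labels are flipped to the minority (class 1) at a constant rate $\rho < 1/2$, the noisy class mixtures become

$$
\tilde{\pi}_1\,\tilde{p}(x \mid 1) = \pi_1\,p(x \mid 1) + \rho\,\pi_0\,p(x \mid 0),
\qquad
\tilde{\pi}_0\,\tilde{p}(x \mid 0) = (1-\rho)\,\pi_0\,p(x \mid 0),
$$

so the Bayes rule on the noisy labels predicts class 1 iff $\log\frac{p(x \mid 1)}{p(x \mid 0)} \ge \log\frac{(1-2\rho)\,\pi_0}{\pi_1}$: the threshold shrinks as $\rho$ grows, moving the boundary back toward the majority class and counteracting the imbalance-induced shift.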
Reply to Relation To Broader Scientific Literature & Essential References Not Discussed
We sincerely appreciate your valuable feedback regarding improvements to our literature review. In response to your suggestion, we will expand the discussion of methods related to our work, including generative and mixup-based approaches, in the revised manuscript. We will revise the literature review as described in our response to Reviewer AMhG for your reference. In our response to Reviewer k9AK, we have included updated comparative results with the recent ReMix and SelMix approaches. We have also clarified the fundamental distinctions between our method and SelMix in greater detail. We sincerely hope that these additional analyses and explanations will help address your concerns.
Reply to Weaknesses and Questions
- W1/Q1: We sincerely appreciate the reviewer's insightful suggestion regarding multi-class generalization, which is indeed an important direction for future research. While extending the theory to multi-class settings represents valuable future work, our current study specifically focuses on using theoretical analysis to motivate our carefully designed feature-dependent asymmetric label noise model. In multi-class scenarios where LNR flips labels from a majority class i to a minority class j, our binary classification analysis of decision boundaries remains applicable to understanding the boundary between classes i and j specifically.
- W2/Q2: In our response to Reviewer k9AK, we have added a comparison with SelMix, the latest state-of-the-art metric-optimizing method. Although our method, LNR, does not assume the availability of a sufficiently large balanced validation set for metric optimization, its performance remains comparable to that of SelMix. This makes LNR more advantageous in practical applications.
- W3/Q3:
  - Thank you for your insightful comment. In Section 4.3, we emphasize that LNR mitigates the impact on feature extraction by postponing the introduction of label noise to the fine-tuning stage, thus avoiding any potential threat to feature representations. As such, LNR does not pose risks to feature extraction. We would be happy to include a comparison of feature representations after the noise introduction in the appendix of the revised manuscript to better address your concern.
  - The fairness changes after introducing noise, as demonstrated on the CIFAR-10 confusion matrix, were primarily intended to show that this noise does not disrupt the existing fairness; on the contrary, it contributes to improved model fairness. These conclusions are also evident in the results of the many/medium/few-shot tasks on CIFAR-100. If there are specific fairness metrics you would like us to report, we would be pleased to address them during the rebuttal period.
I would like to thank the authors for their comprehensive response. As many of the gray areas are clarified for me, I am adjusting the score accordingly. All the best,
Dear Reviewer VQ6R,
We are very pleased to address your concerns and will further refine our work in the final revised manuscript. We sincerely appreciate your insightful suggestions and kind recognition of our efforts.
Best wishes,
Authors of #9268 submission
This paper introduces a novel Label-Noise-based Re-balancing (LNR) approach to mitigate the decision boundary bias caused by data imbalance. Numerical experiments in both binary and multi-class imbalanced settings demonstrate the effectiveness of the approach.
Questions For Authors
- Could the authors identify any recent relevant work (published after 2022) that addresses data imbalance and provide a fair comparison of the proposed LNR method with it?
- Could you explain Algorithm 1: why is u sampled from Uniform(0, 1)?
- Although flipping labels from the majority to the minority class might improve minority-class accuracy, does it decrease the accuracy of the majority class?
Claims And Evidence
Yes
Methods And Evaluation Criteria
Yes.
Novelty: Instead of relying on traditional data augmentation, the authors propose a novel label-flipping method to improve minority class prediction accuracy in imbalanced datasets.
Experimental Validation: The experiments clearly demonstrate that the proposed method improves the accuracy of minority class prediction in both binary and multi-class settings.
Theoretical Claims
The paper provides a theoretical justification for how imbalanced data can lead to biased decision boundaries, highlighting a critical issue in classification tasks.
Experimental Designs Or Analyses
The experimental design is generally strong. However, the definitions of the imbalance ratio categories (Many-shot, Medium-shot, Few-shot) presented in Table 2 are unclear.
Supplementary Material
n/a
Relation To Broader Scientific Literature
- This paper makes a valuable contribution by introducing a novel Label-Noise-based Re-balancing (LNR) method, which creatively mitigates the challenge of decision boundary bias in imbalanced datasets.
- This paper theoretically demonstrates the relationship between the imbalance ratio and decision boundary bias.
- Experimental results confirm the efficacy of the LNR method in enhancing prediction accuracy for the minority class.
Essential References Not Discussed
It appears that the paper lacks recent related work (especially work published after 2022), which is essential for placing the contribution in the context of current research.
Other Strengths And Weaknesses
Strengths:
- The paper is well-organized and presented with clear, accessible writing.
- The LNR method offers a novel approach to addressing data imbalance.
- The paper provides comprehensive mathematical proofs that elucidate the impact of class imbalance on decision boundary bias, strengthening the theoretical underpinning of the proposed method.
Weaknesses:
- The paper lacks a comprehensive review of recent (post-2022) literature. This is my main concern. If the authors could resolve this, I would be happy to adjust my rating.
- There is a lack of clear definition and handling of imbalance ratios, as terms like Many-shot, Medium-shot, and Few-shot are not precisely delineated.
- Check the definition of FN/N in Section 3.2.
- Typos and formatting issues: the paper contains typos (e.g., "many-shot" in Table 2).
Other Comments Or Suggestions
Regarding the organization, it might be beneficial to integrate the "Motivation and Contributions" section into the introduction to create a more cohesive narrative flow. This is merely a suggestion for improving readability.
Response to ‘Essential References Not Discussed’, Weakness 1 and Question 1
We sincerely thank the reviewer for highlighting the need to include recent methodological advancements in our literature review. As suggested, we will carefully revise the 'Related Work' section in our manuscript:
- Algorithm-level methods refine loss functions or training paradigms to improve tail-class accuracy, primarily by decoupling feature learning from classifier training to separately enhance feature representation and classifier fine-tuning. Contrastive learning-based methods, including DRO-LT (2021), TSC (2021), BCL (2022), and SBCL (2023), leverage contrastive losses during feature learning to boost feature discriminability and model robustness under imbalance. For classifier optimization, margin-based approaches like LDAM-DRW (2019) and τ-norm (2019) employ loss engineering to create larger decision margins for tail classes, while logit adjustment methods such as GCL (2022b) address softmax saturation by expanding the tail classes' embedding space through increased cloud sizes.
- Data-level methods often leverage generative models or autoencoders such as the Δ-encoder (2018), DGC (2020), and RSG (2021) to synthesize few-shot samples. These approaches typically depend on high-quality pre-trained models, which can introduce additional challenges, especially on scarce tail-class data, limiting the ability to generate diverse or meaningful samples. Instead of involving a generative model, Mixup (2018) interpolates features and labels via a single mixing ratio λ shared by both, empirically demonstrating its effectiveness for data augmentation. Building on this, Remix (2020) introduces separate mixing ratios for features (λ_x) and labels (λ_y) to rebalance class distributions, though it retains random sampling (see the sketch following this list). More recently, SelMix (2024) advanced this direction by selectively sampling pairs for mixing based on the gain on non-decomposable metrics (e.g., recall, G-mean), thereby enabling targeted improvements in specific metrics. However, SelMix's gain matrix relies on a balanced augmented auxiliary set (balance ratio 1, i.e., fully balanced; 5,000 samples for CIFAR). With imbalanced or small auxiliary data, its metric optimization fails to meet theoretical constraints. This limitation is particularly acute in practice, where validation data is often scarce and inherently skewed, a gap our method intentionally addresses with a carefully designed noise model, which requires neither feature editing nor a balanced auxiliary dataset.
- Multi-expert ensemble methods (e.g., RIDE (2021), TLC (2022), SADE (2023), and BalPoE (2023)) allocate specialized "experts" to model head- and tail-class features separately, achieving notable gains. While these approaches fall outside the scope of our work, they highlight the potential of ensemble learning for tackling class imbalance.
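As referenced above, a schematic sketch contrasting the Mixup and Remix mixing rules (simplified to numpy vectors with one-hot labels; the rule and parameter names follow our reading of the Remix paper):

```python
import numpy as np
rng = np.random.default_rng(0)

def mixup(x1, y1, x2, y2, alpha=1.0):
    # Mixup: a single ratio lambda mixes both features and one-hot labels.
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

def remix(x1, y1, x2, y2, n1, n2, alpha=1.0, kappa=3.0, tau=0.5):
    # Remix: features still use lambda_x, but the label ratio lambda_y
    # is pushed fully toward the minority sample when the pair's class
    # sizes differ by more than a factor of kappa and lambda_x is small.
    lam_x = rng.beta(alpha, alpha)
    if n1 / n2 >= kappa and lam_x < tau:
        lam_y = 0.0          # label goes entirely to the minority sample y2
    elif n2 / n1 >= kappa and (1 - lam_x) < tau:
        lam_y = 1.0          # label goes entirely to the minority sample y1
    else:
        lam_y = lam_x
    return lam_x * x1 + (1 - lam_x) * x2, lam_y * y1 + (1 - lam_y) * y2
```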
Response to Weaknesses 2-4
- W2: Regarding the many/medium/few-shot categories, we acknowledge these thresholds vary across datasets (a common convention is sketched after this list). For clarity, we have detailed these specifications in Appendix C.7 and will add explicit cross-references in the main text. We sincerely appreciate you bringing these issues to our attention.
- W3: We have carefully re-examined the definition of False Negatives (FN): "A False Negative occurs when a ground-truth positive sample is incorrectly predicted as negative." Based on this standard definition, the formulation of FN/N in Section 3.2 is mathematically correct. Should there be any misunderstanding on our part regarding your specific concern, we would be grateful if you could clarify your perspective further so we can address it precisely.
- W4: All typographical errors identified in the manuscript, including those in Table 2, have now been carefully corrected. We are truly grateful for your meticulous review.
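For W2, a widely used convention (e.g., for ImageNet-LT) groups classes by training-set size; exact thresholds vary by dataset, so the values below are illustrative only:

```python
def shot_group(n_train, many_thresh=100, few_thresh=20):
    # Common convention: classes with more than ~100 training samples are
    # "many-shot", those with fewer than ~20 are "few-shot", the rest
    # "medium-shot"; thresholds are dataset-dependent.
    if n_train > many_thresh:
        return "many"
    return "few" if n_train < few_thresh else "medium"
```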
Response to Questions 2-3
Q2: We sincerely appreciate your thoughtful suggestion regarding Algorithm 1. To clarify the notation: U ~ Uniform(0, 1) indicates sampling a number U from the range [0, 1], and label flipping occurs when U falls below the estimated flip rate. As you insightfully pointed out, this could alternatively be expressed as flipping each candidate label with probability equal to its estimated flip rate. If this alternative formulation better serves clarity, we would be delighted to incorporate this change in the manuscript.
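A minimal illustration of this sampling step (function and variable names are ours, not from Algorithm 1):

```python
import numpy as np
rng = np.random.default_rng(0)

def maybe_flip(label_majority, label_minority, flip_rate):
    # Draw U ~ Uniform(0, 1) and flip the label when U < flip_rate,
    # so each candidate sample is flipped with probability flip_rate.
    u = rng.uniform(0.0, 1.0)
    return label_minority if u < flip_rate else label_majority
```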
Q3: To our knowledge, all class imbalance methods inevitably improve minority class performance at some cost to majority class accuracy. This trade-off is fundamental because:
- A model's strong majority-class performance on imbalanced data inherently comes at the expense of minority classes;
- Therefore, improving minority-class recognition necessarily reduces the "overprivileged" majority-class performance.
Our fairness analysis (confusion matrices) demonstrates this trade-off: on CIFAR-10, LNR introduced 94 label flips, exchanging 16 true positives (TP) lost from head classes for 204 TP gained in tail classes.
Thanks for the authors' responses. I believe many of my concerns have been resolved. As such, I am happy to raise my score to 3.
Dear Reviewer AMhG,
We sincerely appreciate your insightful comments and are pleased to address all raised concerns. Your constructive feedback has been invaluable in strengthening our work, and we will carefully implement all suggested improvements in our final manuscript.
To further substantiate our claims, we have carefully reimplemented both ReMix and SelMix and conducted comprehensive comparative experiments on CIFAR-10/100 benchmarks. We would be happy to share these updated comparison results with you and other reviewers. Currently, we are finalizing the ImageNet experiments, which require additional time due to the dataset's scale and complexity, and will promptly share these results once available.
CIFAR10-LT (Imbalance ratio = 100):

| Method | Overall Accuracy | Many-shot | Medium-shot | Few-shot | ECE ↓ |
|---|---|---|---|---|---|
| MiSLAS-stage2 | 82.1 | 91.0 | 80.2 | 75.7 | 3.70 |
| +ReMix | 82.9 | 90.0 | 79.8 | 79.8 | 19.6 |
| +SelMix(10k) | 83.3 | 86.8 | 80.5 | 83.5 | 2.75 |
| +SelMix(1k) | 82.7 | 81.6 | 79.9 | 87.6 | 2.40 |
| +SelMix(imb) | 81.8 | 82.2 | 81.4 | 81.9 | 6.36 |
| +LNR | 83.4 | 87.6 | 80.1 | 83.6 | 4.26 |

CIFAR100-LT (Imbalance ratio = 100):

| Method | Overall Accuracy | Many-shot | Medium-shot | Few-shot | ECE ↓ |
|---|---|---|---|---|---|
| MiSLAS-stage2 | 46.85 | 62.05 | 48.42 | 26.07 | 5.43 |
| +ReMix | 46.59 | 59.06 | 49.22 | 27.93 | 18.28 |
| +SelMix(10k) | 47.20 | 61.19 | 51.60 | 24.51 | 1.36 |
| +SelMix(1k) | 46.04 | 61.27 | 50.82 | 21.34 | 1.32 |
| +SelMix(imb) | 45.65 | 56.66 | 51.17 | 25.31 | 3.01 |
| +LNR | 47.23 | 60.28 | 50.66 | 26.90 | 5.28 |
- LNR achieves state-of-the-art performance on both CIFAR10-LT and CIFAR100-LT benchmarks, with overall accuracies of 83.4% and 47.23% respectively, while crucially eliminating the need for balanced validation data, a significant advantage over the previous state-of-the-art method SelMix, whose performance deteriorates substantially when using smaller (1k) or imbalanced validation sets.
- LNR demonstrates comprehensive improvements on few-shot classes, as evidenced by:
  - (a) on CIFAR10-LT, achieving better few-shot accuracy (+3.8% over ReMix);
  - (b) on CIFAR100-LT, delivering superior few-shot results (26.90% vs. 24.51%) compared to SelMix(10k).
- SelMix's performance is sensitive to specific validation set characteristics, as shown by two critical limitations:
  - (a) while SelMix(10k) with ideal balanced validation data achieves comparable overall accuracy, its performance may drop below the baseline MiSLAS when using either smaller (1k) or imbalanced validation sets;
  - (b) the requirement of 10k balanced validation data (equivalent to 1,000 and 100 additional samples per class for CIFAR10 and CIFAR100 respectively, representing 20 times the tail-class training samples) creates substantial practical implementation barriers.
- Model calibration (Expected Calibration Error, ECE): LNR maintains better model calibration (ECE = 4.26) than ReMix's severely compromised calibration (ECE = 19.6). Although SelMix provides better ECE performance, this relies on additional balanced auxiliary data, and the ECE consequently increases when the auxiliary data fails to meet balance conditions: on CIFAR10-LT, SelMix(imb)'s ECE = 6.36 is higher than LNR's ECE = 4.26. This demonstrates that our feature-dependent asymmetric label noise model enables more precise decision boundary correction while preserving model reliability, without requiring any external validation data.
- These experimental results collectively validate the effectiveness and contribution of LNR's novel model/data-agnostic label noise model in solving imbalanced learning by:
  - eliminating dependency on external balanced data;
  - delivering state-of-the-art performance while establishing class fairness;
  - maintaining model calibration, representing a significant advancement over existing approaches.
The revised manuscript will incorporate these comprehensive comparative results and analyses, including expanded methodological details that highlight our advancements over previous state-of-the-art approaches. We are truly grateful for your time and expertise in evaluating our work.
Best regards,
Authors of #9268 submission
The paper proposes a novel Label-Noise-based Re-balancing (LNR) approach to address the challenge of imbalanced learning in classification tasks. The authors theoretically analyze how class imbalance biases decision boundaries and demonstrate that introducing carefully designed asymmetric label noise can correct these biases, improving classifier performance, particularly for minority classes. Unlike traditional resampling methods, LNR avoids information loss and generative errors by flipping labels of majority-class samples that exhibit feature similarity to minority classes. The method is model-agnostic and can be integrated with existing classifiers or algorithm-level approaches. Extensive experiments on synthetic and real-world datasets (including CIFAR-10/100 and KEEL datasets) validate LNR's superiority over baselines.
The paper received generally positive reviews, with reviewers acknowledging its novelty, theoretical grounding, and empirical effectiveness. While initial concerns were raised, the authors' rebuttal and additional experiments convincingly addressed these issues, leading to improved reviewer scores. Thus, I recommend acceptance.