PaperHub
5.5 / 10
Poster · 3 reviewers
Ratings: lowest 2, highest 4, standard deviation 0.8
Individual ratings: 4, 3, 2
ICML 2025

Improved Algorithm for Deep Active Learning under Imbalance via Optimal Separation

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-07-24

Abstract

Keywords
Deep Learning · Active Labeling · Class Imbalance

Reviews and Discussion

Review
Rating: 4

The paper proposes DIRECT, an algorithm for deep active learning under the dual challenges of class imbalance and label noise. DIRECT reduces the multi-class problem to a set of one-dimensional agnostic active learning subproblems by identifying an “optimal separation threshold” for each class. Annotating examples close to these thresholds is intended to yield a more balanced and informative labeled set. DIRECT is designed to support parallel annotation while still benefiting from guarantees drawn from classic active learning results. Experiments on several imbalanced datasets (with and without label noise) claim that DIRECT saves over 60% of the annotation budget compared to state-of-the-art methods and over 80% compared to random sampling.

update after rebuttal

Thank you for your response. I appreciate that most of my concerns have been addressed. In light of these improvements, I will raise my rating.

Questions for the Authors

(1) Training Strategy: How is the model retrained at each active learning iteration? Specifically, is a balanced data loader used during retraining to mitigate class imbalance, and how does this affect performance?

(2) Comparison with SIMILAR: Given that SIMILAR also targets imbalanced datasets, can you provide a direct experimental comparison in the settings of the SIMILAR paper's figures, or a detailed discussion of how DIRECT's performance differs in similar settings?

Claims and Evidence

The paper makes strong claims regarding label efficiency improvements, robustness to label noise, and scalability via parallel annotation. Experimental results on multiple datasets generally support these claims. However, some aspects remain less clearly supported:

The absence of a direct comparison with SIMILAR—even though SIMILAR also addresses imbalanced data—is a gap that makes it difficult to assess the relative benefits of DIRECT in similar settings.

Methods and Evaluation Criteria

The methodological approach is innovative, particularly the reduction of the active learning problem to a one-dimensional thresholding task. The evaluation criteria—balanced accuracy and annotation cost—are appropriate for imbalanced learning scenarios. Nevertheless, the paper does not clearly explain the training procedure in each active learning iteration. For example, it is ambiguous whether a balanced data loader is used during retraining, which is crucial for mitigating imbalance during model updates.

Theoretical Claims

The theoretical contribution builds on existing agnostic active learning results (e.g., from ACED) to justify that the probability of misidentifying the optimal threshold decays exponentially with the annotation budget. While the reduction to a one-dimensional problem is elegant, some concerns persist:

(1) The proofs (provided in the appendix) assume that the behavior of deep neural network outputs is amenable to a threshold classifier analysis. In practice, the effect of label noise and imbalance on such outputs might be more complex.

(2) A more detailed discussion on how the theoretical guarantees translate into deep learning contexts would strengthen the paper’s claims.

Experimental Design and Analysis

The experimental section is extensive and evaluates DIRECT under various noise levels and across different architectures (ResNet-18 and CLIP ViT-B32). The experiments support the claim of improved label efficiency. However:

(1) The starting point for active learning (e.g., the initialization strategy and subsequent retraining details) is not thoroughly clarified, raising questions about reproducibility.

(2) As mentioned, the experiments lack a direct comparison with SIMILAR, which handles imbalanced datasets effectively.

Supplementary Material

The supplementary material contains additional experimental results, detailed proofs, and analyses (including the time complexity comparison with GALAXY and BADGE).

Relation to Prior Work

The paper builds on and extends several strands of active learning research:

(1) It leverages classic agnostic active learning theory to tackle deep learning challenges under imbalance and noise. (2) It directly builds on prior work such as GALAXY. (3) The reduction to one-dimensional threshold learning represents a creative attempt to bridge theory and practice in active learning for deep neural networks.

Missing Important References

The paper has cited relevant works.

Other Strengths and Weaknesses

Strengths:

(1) The paper aims to address a very practical and challenging scenario in real-world applications.

(2) The reduction to a one-dimensional active learning problem is novel and well-motivated.

(3) The ability to support parallel annotation is a significant practical advantage.

Weaknesses:

(1) Experimental results that connect the empirical observations with the theoretical understanding could help.

Other Comments or Suggestions

Expanding on the computational cost analysis and the discussion of other limitations in the main text (or summarizing key points from Appendix C) would strengthen the practical implications of the work.

Author Response

Experimental Settings Compared to SIMILAR

Our settings actually closely mirror the settings in SIMILAR. In SIMILAR, the rare-class setup is very close to the long-tail distribution setups in our paper. SIMILAR's setting reduces the number of examples in some of the classes to form rare classes. Our long-tailed setup also reduces the number of examples in different classes, but following a long-tail distribution. In fact, the latter half of the classes receive far fewer examples than the largest class, effectively making them the rare classes. We chose the long-tailed version for our experiments since this setting has been much more widely adopted in the deep learning literature than the construction in SIMILAR. Also, real-world data often follow a long-tailed distribution in practice. In our experiments, we included CIFAR-10LT and CIFAR-100LT in the original draft. Additionally, as suggested by Reviewer 7FVV, we also conducted extra experiments on the ImageNet-LT (https://ibb.co/fVq3wsYJ) and iNaturalist (https://ibb.co/F4fqyfdQ) datasets.

The setup we adopted from GALAXY, where we combine multiple classes into a single "other" or "out-of-distribution" class, directly mirrors the out-of-distribution setting in SIMILAR. This is also known as the open-set classification setting, which is widely studied in the deep learning literature.

Model Retraining Strategy

As mentioned in our paper, we use a reweighting strategy, weighting the loss of each example by the inverse of its class frequency. Reweighting is usually preferred over resampling strategies (the balanced data loader suggested by the reviewer) for deep learning, as we want to reduce the repetition of examples during neural network training to avoid overfitting.
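For concreteness, here is a minimal sketch of this reweighting (illustrative PyTorch code with hypothetical names, not an excerpt from our implementation):

```python
import torch
import torch.nn as nn

def reweighted_criterion(labeled_y: torch.Tensor, num_classes: int) -> nn.CrossEntropyLoss:
    """Cross-entropy whose per-class weights are inverse class frequencies of the
    currently annotated set; classes not yet observed get a count of 1 so that
    the weights stay finite."""
    counts = torch.bincount(labeled_y, minlength=num_classes).clamp(min=1).float()
    weights = counts.sum() / (num_classes * counts)  # inverse frequency, mean-normalized
    return nn.CrossEntropyLoss(weight=weights)

# In each retraining round (illustrative usage):
#   criterion = reweighted_criterion(labeled_y, num_classes=K)
#   loss = criterion(model(x_batch), y_batch)
```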

Complete Time Complexity Analysis

First, as a reminder, as discussed in Appendix C of our paper, the dominating computational cost has always been neural network training and inference, which take up more than 90% of the total computational cost.

As for the data selection algorithms, let $K$ be the number of classes, $N$ the pool size, $B_{\text{train}}$ the batch size, $D$ the penultimate-layer embedding dimension, and $T$ the number of batches. Below, we detail the data selection cost of each algorithm we consider.

  • DIRECT: $O(T(KN\log N + B_{\text{train}}N))$
  • GALAXY: $O(T(KN\log N + B_{\text{train}}KN))$
  • BADGE: $O(TB_{\text{train}}N(K + D))$
  • Margin sampling / most likely positive / confidence sampling: $O(TKN)$
  • Coreset: $O(T^2 B_{\text{train}}ND)$
  • SIMILAR: $O(TB_{\text{train}}ND)$
  • Cluster margin: $O(N^2\log N + TN(K + \log N))$
  • BASE: $O(TN(D + B_{\text{train}}))$

Theoretical Guarantee

We are not sure what you mean by "In practice, the effect of label noise and imbalance on such outputs might be more complex." As the agnostic active learning algorithm specifically targets the label noise scenario, and the optimal 1-D threshold classifier is defined to address the imbalance issue, we think our theoretical argument applies exactly to this 1-D reduction setting. Concretely, the agnostic active learning procedure in Algorithm 2 is exactly trying to recover the empirical risk minimizer (ERM) when freezing all of the neural network weights and only looking at the sigmoid/softmax score space. The agnostic active learning algorithm has been proven to be noise robust and to identify the ERM efficiently.

To bridge the theoretical guarantee and the effect for deep learning, in deep active learning, almost all algorithms are trying to balance between querying uncertain examples and diverse examples. In our case, we utilize the margin scores as an uncertainty measure. For diversity, we want to provide better data coverage in our annotation. This has been traditionally done in the representation space using penultimate layer embeddings or gradient embeddings. However, as we show in our paper, these methods (such as BADGE and Coreset) underperform DIRECT in imbalance scenarios. DIRECT focuses on class diversity, combined with uncertainty sampling, labeling a more balanced dataset of uncertain examples (i.e. ones around the optimal separation threshold). Our theoretical guarantee directly translates into how well we are ensuring class-balancedness of the annotated examples. With an accurate threshold, our annotation around the threshold would result in higher class-diversity in annotated examples.
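To make the selection step concrete, here is a simplified sketch (our own illustration with hypothetical function names, not the exact procedure of the algorithms in the paper; it assumes the separation threshold is represented as a position in the pool sorted by the per-class margin score):

```python
import numpy as np

def margin_scores(probs: np.ndarray, k: int) -> np.ndarray:
    """One-vs-rest margin for class k: softmax score of class k minus the best
    competing class. probs has shape (N, K); scores near zero mark the most
    uncertain examples for class k."""
    others = np.delete(probs, k, axis=1)
    return probs[:, k] - others.max(axis=1)

def select_near_threshold(probs: np.ndarray, k: int, threshold_rank: int, budget: int) -> np.ndarray:
    """Return `budget` pool indices whose rank in the class-k margin ordering is
    closest to the estimated separation threshold. The sort is the O(N log N)
    step; repeating it for every class gives the O(KN log N) term above."""
    order = np.argsort(-margin_scores(probs, k))   # pool sorted by class-k margin, descending
    ranks = np.arange(len(order))
    nearest = np.argsort(np.abs(ranks - threshold_rank))[:budget]
    return order[nearest]
```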

Review
Rating: 3

This paper studies active learning under both class imbalance and label noise. An improved algorithm for agnostic active learning is proposed, referred to as DIRECT, which can be considered an advanced version of GALAXY. Various experiments are conducted to validate the properties of DIRECT, showing its superiority under imbalance and label noise. The results are promising.

update after rebuttal

Questions for the Authors

Q1: Can you consider any other metrics (Eqn. 2) in the multi-class problem, such as using the mean instead of the maximum of the remaining classes' probabilities?

Q2: Can you provide the computational costs for all algorithms considered in the paper?

Claims and Evidence

Using the objective function of agnostic active learning to find optimal separation points, and annotating the unlabeled data points near these points, can improve active learning for imbalanced datasets in the agnostic active learning setting. The experimental results show that this claim is achieved.

Methods and Evaluation Criteria

The proposed algorithm is conventional and sound in active learning for assessing uncertainty. Performance is usually measured by prediction accuracies, which are well presented in the paper.

Theoretical Claims

None

Experimental Design and Analysis

The experimental design and analysis are logical and acceptable for studying active learning. I have two points that, if possible, should be supplemented.

  1. The effect of the number of classes, together with more detailed analyses such as a confusion matrix showing the prediction accuracy per (majority/minority) class.
  2. A sensitivity analysis ranging from balanced to severely imbalanced settings on the toy dataset. These experiments could provide more insight into the proposed algorithm.

Supplementary Material

Only code

Relation to Prior Work

The study is closely related to efficient learning and obtaining more representative data points, which is essential in the development of AI.

Missing Important References

None

Other Strengths and Weaknesses

DIRECT can be applied to the problem of agnostic active learning. When label noise exists, the advantages of DIRECT can be dominant. The proposed algorithm's merits are small for balanced data with little noise. However, more robustness to the imbalance ratio may be required in some cases due to limited prior knowledge.

Other Comments or Suggestions

Please see the comments under Experimental Design and Analysis.

Author Response

Thank you for providing the insightful review. We address your concerns below.

Using a different scoring in Eqn. 2

As you suggested, we could indeed subtract the mean instead of the max of the per-class softmax scores. The mean will simply be a constant for all examples, which makes the scoring equivalent to the confidence score. We fully agree that different scores can be used to rank the examples. For confidence scores, however, some of our initial experiments found them to be less effective than margin sampling. Notably, margin sampling has been shown to be a superior scoring function over confidence and entropy in several previous large-scale benchmarking papers (e.g. [1] and [2]). We will nevertheless add a future work direction to test out different scoring methods beyond these scores.
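As a small numeric illustration (a hypothetical script, not code from the paper): whether the subtracted mean is taken over all K classes (the constant 1/K) or over the remaining classes ((1 − p_k)/(K − 1), monotone in p_k), the resulting ranking coincides with ranking by the class score itself, i.e. confidence-style scoring.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 1000, 10
logits = rng.normal(size=(N, K))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax scores

k = 3                                                 # class under consideration
others = np.delete(probs, k, axis=1)
mean_subtracted = probs[:, k] - others.mean(axis=1)   # subtract mean of remaining classes
confidence = probs[:, k]                              # plain class-k score

# The mean-subtracted score is a strictly increasing function of p_k, so the
# two orderings are identical.
assert np.array_equal(np.argsort(mean_subtracted), np.argsort(confidence))
```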

Computation cost of all algorithms

First, as a reminder, as discussed in Appendix C of our paper, the dominating computational cost has always been neural network training and inference, which take up more than 90% of the total computational cost.

As for the data selection algorithms, let $K$ be the number of classes, $N$ the pool size, $B_{\text{train}}$ the batch size, $D$ the penultimate-layer embedding dimension, and $T$ the number of batches. Below, we detail the data selection cost of each algorithm we consider.

  • DIRECT: $O(T(KN\log N + B_{\text{train}}N))$
  • GALAXY: $O(T(KN\log N + B_{\text{train}}KN))$
  • BADGE: $O(TB_{\text{train}}N(K + D))$
  • Margin sampling / most likely positive / confidence sampling: $O(TKN)$
  • Coreset: $O(T^2 B_{\text{train}}ND)$
  • SIMILAR: $O(TB_{\text{train}}ND)$
  • Cluster margin: $O(N^2\log N + TN(K + \log N))$
  • BASE: $O(TN(D + B_{\text{train}}))$

[1] Zhang, J., Chen, Y., Canal, G., Mussmann, S., Das, A. M., Bhatt, G., ... & Nowak, R. D. (2023). Labelbench: A comprehensive framework for benchmarking adaptive label-efficient learning. arXiv preprint arXiv:2306.09910.

[2] Bahri, D., Jiang, H., Schuster, T., & Rostamizadeh, A. (2022). Is margin all you need? An extensive empirical study of active learning on tabular data. arXiv preprint arXiv:2210.03822.

Sensitivity Analysis

Thank you for the suggestion. We think our experiments already cover a wide range of imbalance ratios, as shown in Table 1, even when fixing the model to ResNet-18, and we have shown that our algorithm is superior across all of these ratios. Interestingly, when comparing against the performance of random sampling, we see that DIRECT saves an increasing amount of annotation cost as the dataset becomes more imbalanced. We will definitely add this observation to our paper.

More Detailed Analysis

Thank you for the suggestion! We think this is indeed beneficial. We have to rerun some of the experiments for this. We will send over the analysis during the discussion period as soon as possible, once we get these numbers.

Reviewer Comment

The authors' response resolves many issues. Thanks for your reply. However, I'll keep my score since more insightful experiments and analyses are required for a better paper.

Author Comment

Thank you for giving us a chance to supplement our further findings, and we have conducted an additional experiment as you suggested. Please see this plot where we plot the average accuracy in blocks of classes for the ImageNet-LT dataset. Specifically, the class indices are arranged so that class #1 is the most frequent class while class #1000 is the least frequent class. We can see DIRECT outperforms baseline algorithms on less frequent classes (301-1000), which explains the overall better performance of DIRECT in our experiments, despite having slightly worse accuracies on more frequent classes. This corroborates our balancedness result in our original paper, where we see DIRECT labels more samples from rare classes.

Overall, DIRECT achieves our goal by improving on the vast majority of the (rare) classes, while only sacrificing a slight performance drop on a small number of the most frequent classes.
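For reference, a minimal sketch of how such block-averaged accuracies can be computed (illustrative helper, not our exact analysis script):

```python
import numpy as np

def block_averaged_accuracy(per_class_acc: np.ndarray, class_counts: np.ndarray,
                            block_size: int = 100) -> np.ndarray:
    """Average per-class accuracy over contiguous blocks of classes ordered from
    most frequent (class #1) to least frequent (e.g. class #1000 for ImageNet-LT)."""
    order = np.argsort(-class_counts)                  # most frequent class first
    acc = per_class_acc[order]
    n_blocks = int(np.ceil(len(acc) / block_size))
    return np.array([acc[b * block_size:(b + 1) * block_size].mean()
                     for b in range(n_blocks)])
```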

Review
Rating: 2

update after rebuttal

After reading the rebuttal and the other reviews, the reviewer maintains the initial recommendation.

The paper introduces DIRECT, a new algorithm for deep active learning under class imbalance and label noise. The main contribution is a reduction of the imbalanced classification problem into a set of one-dimensional agnostic active learning problems, allowing the algorithm to identify optimal separation thresholds for each class. DIRECT selects the most uncertain examples near these thresholds for annotation, ensuring a more class-balanced and informative labeled set. The authors claim that DIRECT significantly reduces annotation costs by over 60% compared to state-of-the-art active learning methods and over 80% compared to random sampling, while maintaining robustness to label noise.

Questions for the Authors

  1. Why is the method called DIRECT? Is it an abbreviation? The paper does not seem to provide an explanation for the name.
  2. Is the optimal separation threshold proposed in Definition 4.1 a novel contribution, or is it based on prior work?
  3. Intuitively, in multi-class classification, $\hat{p}_i^k$ can take negative values, which seems unreasonable for probabilities.
  4. Could the authors explain the motivation behind this paper? Specifically, in active learning, what makes certain unlabeled samples more beneficial to label, and how does this principle change when the training data is imbalanced?

Claims and Evidence

Refer to the other parts.

Methods and Evaluation Criteria

The proposed methods and evaluation criteria are appropriate for the problem at hand, but there are some limitations:

  1. The authors use fine-tuning of ResNet-18 on imbalanced datasets but do not explain why fine-tuning is preferred over training from scratch. Additionally, the pre-training details of the model are not provided, which could be important for reproducibility.
  2. The experiments do not include commonly used imbalanced datasets like ImageNet-LT and iNaturalist, which could provide additional validation of the algorithm's effectiveness.
  3. The paper does not use the more common method of constructing imbalanced datasets by creating a long-tailed distribution [1]. Furthermore, it is unclear whether the test datasets are imbalanced or uniformly distributed, which could affect the evaluation of the algorithm's performance. [1] Long-tail learning via logit adjustment

Theoretical Claims

The theoretical claims in the paper are partially supported. The proofs in the appendix establish the equivalence between the ERM solution based on the learner's output and the "best threshold" on the same training set. However, this does not directly address the key claim that this solution serves as the optimal threshold for active learning on unlabeled data. The theoretical justification for the algorithm's robustness to label noise is also lacking.

Experimental Design and Analysis

Yes. Refer to the other parts.

Supplementary Material

The supplementary material was reviewed, and it includes additional details on the experimental setups, theoretical proofs, and further results.

Relation to Prior Work

The key contributions of the paper are well-situated within the broader literature on active learning and class imbalance.

Missing Important References

No.

Other Strengths and Weaknesses

Strengths:

  1. The approach of combining class separation thresholds with one-dimensional active learning is innovative and provides a fresh perspective on tackling these issues.
  2. The experimental results are comprehensive and demonstrate significant improvements over existing methods.
  3. The paper is well-organized.

Weaknesses:

  1. The authors claim that DIRECT addresses both class imbalance and label noise issues, and the experiments include extensive evaluations under noisy settings. However, the algorithm itself does not appear to include any explicit mechanism designed to handle label noise. This is, in my opinion, the most significant weakness of the paper.
  2. Recent research has shown that majority and minority classes often have different learning speeds during training. The proposed method does not seem to account for this, which makes the estimation of thresholds for minority classes particularly challenging.
  3. There is some confusion in the notation used in the paper. In Section 4.1, the label space is described as [0,1], while in Equation 1, the labels y take values of 1 and 2. This inconsistency is puzzling and could lead to misunderstandings.
  4. The authors do not clearly explain the difference between active learning for imbalanced datasets and general active learning. Intuitively, for an imbalanced problem, labeling samples from the minority class should yield higher benefits than labeling samples from the majority class. While Definition 4.1 seems intuitively effective, the authors do not provide a detailed explanation or experimental results to demonstrate the characteristics of the samples selected by active learning.

Other Comments or Suggestions

Refer to weakness and questions.

Author Response

Thank you for the insightful and detailed review. We make the following clarifications to address your concerns.

Handling Label Noise by the Agnostic Active Learning Algorithm

The large body of classic agnostic active learning literature studies exactly the active learning under label noise scenario (please see [1-3] as examples). These papers study sample complexities for identifying the optimal hypothesis when annotations are noisy. DIRECT applies an instance of the agnostic active learning algorithm to find the optimal separation threshold, which inherits the noise robustness, with theoretical guarantees of being minimax and instance optimal (shown in [1] and [3]). We will include some of these discussions in our paper.

[1] Balcan, M. F., ... & Langford, J. (2006, June). Agnostic active learning. ICML.

[2] Dasgupta, S., ... & Monteleoni, C. (2007). A general agnostic active learning algorithm. NeurIPS.

[3] Katz-Samuels, ... & Jamieson, K. (2021). Improved algorithms for agnostic pool-based active classification. ICML.

Motivation and Justification behind DIRECT

In active learning, almost all algorithms are trying to balance between querying uncertain examples and diverse examples. Uncertainty can come in many forms, such as entropy score, margin score, epistemic uncertainty, or even more recent methods such as local model smoothness and influence functions. In our paper, we choose to use margin sampling as it is computationally efficient, and has been shown to perform no worse than other more advanced methods in various benchmarking efforts [4-5]. We would also note that our method can use any of the scoring methods above in place of margin score.

The other beneficial philosophy is diversity in sampling. In other words, we want to provide better data coverage in our annotation. This has been traditionally done in the representation space using penultimate layer embeddings or gradient embeddings. However, as we show in our paper, these methods (such as BADGE and Coreset) underperform DIRECT in imbalance scenarios. DIRECT focuses on class diversity, combined with uncertainty sampling, labeling a more balanced dataset of uncertain examples (i.e. ones around the optimal separation threshold).

As we demonstrated in our experiments, we include plots for both accuracy and number of minority class labels, where the latter clearly shows DIRECT labels more class-diverse sets of examples.

Furthermore, to address your concerns, the optimal separation threshold is a novel proposal in our paper in the context of active learning.

[4] Zhang, J., Chen, Y., ... & Nowak, R. D. (2023). Labelbench: A comprehensive framework for benchmarking adaptive label-efficient learning. DMLR Journal.

[5] Bahri, D., ... & Rostamizadeh, A. (2022). Is margin all you need? An extensive empirical study of active learning on tabular data.

Different Learning Speed of Minority Classes

We think the different-learning-speed argument for minority class examples contributes to the motivation of DIRECT. First, as DIRECT labels a more class-balanced set of examples, the learning speed of minority classes will be much faster compared to other active learning algorithms. Furthermore, as we adaptively label more examples when identifying the optimal separation threshold (Algorithm 2), we think these additional labels can significantly help in estimating the threshold. Without these additional labels, as you suggested, the decision boundary would be very inaccurate, which is exactly what the other active learning algorithms are experiencing. In other words, the slow learning speed of minority classes results in a poor estimate of the uncertainty threshold (Figure 2a), and DIRECT mitigates this issue by adaptively labeling to find the optimal separation threshold, instead of relying only on the current decision boundary.

Additional Experiments Suggested by the Reviewer

We actually do have the standard long-tail experiments in our paper, for CIFAR-10LT and CIFAR-100LT. In addition, we have conducted extra experiments under the LabelBench benchmark for the ImageNet-LT (https://ibb.co/fVq3wsYJ) and iNaturalist (https://ibb.co/F4fqyfdQ) datasets. In both cases, we clearly see DIRECT dominating the other active learning algorithms.

Training Details

Our pretrained ResNet-18 is the standard checkpoint in the PyTorch library. As pretrained models are much more widely available, we believe fine-tuning neural networks provides a better signal for practice.

Notation Issues

In our paper, the label space is defined as $[K] = \{1, 2, \ldots, K\}$. In the binary case, the label space is therefore $\{1, 2\}$. In Section 4.1, as we mention right after the $[0, 1]$, $\widehat{p}$ maps to sigmoid scores, which is different from the label space.

We just like the name DIRECT for our algorithm.

The margin score $\widehat{p}_i^k$ is indeed confusing as it is not a probability. We will change the notation to $\widehat{s}_i^k$ instead.

Reviewer Comment

Since the authors failed to adequately address most of our concerns, particularly regarding label noise and theoretical claims, we have decided to maintain our original rating.

Author Comment

We would like to further clarify how the label noise is handled by the agnostic active learning algorithm in addition to what we mentioned above.

In our setting, there is an underlying conditional probability function $P(y_i|x_i)$ for each data example $x_i$. We first consider a dimensionality reduction $x_i \rightarrow s_i$, where $s_i$ is the real-valued sigmoid score of $x_i$. This gives us an ordered set of 1-dimensional features $\{s_i\}_{i=1}^n$. This map produces a distribution $P(y_i|s_i)$, $\forall s_i \in S$. The probability therefore encodes the label noise.

In addition, we also have a set of classifiers: 1-dimensional threshold classifiers $\{h_j\}$ on the real line. Our goal is to find the hypothesis that minimizes the probability of error w.r.t. $P(y_i|s_i)$. In our notation, $j^\star$ corresponds to the threshold classifier that minimizes the empirical error w.r.t. $P(y_i|s_i)$.

This is precisely what our agnostic active learning algorithm does. The key point is that we make no assumptions about $P(y_i|x_i)$, and hence $P(y_i|s_i)$, so the algorithm handles any possible noise model, as expressed by $P(y_i|s_i)$.
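To make the target of this reduction concrete, the sketch below (an illustration only, not Algorithm 2, which identifies the same threshold adaptively with far fewer label queries) computes the empirical-error-minimizing threshold from 1-D scores and possibly noisy binary labels:

```python
import numpy as np

def best_threshold(scores: np.ndarray, labels: np.ndarray) -> float:
    """Return the threshold minimizing empirical error for the rule
    'predict positive iff score >= threshold'. Labels (0/1) may be noisy;
    no noise model is assumed, matching the agnostic setting."""
    order = np.argsort(scores)
    s, y = scores[order], labels[order]
    n = len(y)
    # For a cut just before position i: positives to the left are predicted
    # negative, negatives to the right are predicted positive.
    pos_left = np.concatenate(([0], np.cumsum(y)))                     # length n + 1
    neg_right = np.concatenate((np.cumsum((1 - y)[::-1])[::-1], [0]))  # length n + 1
    errors = pos_left + neg_right
    i = int(np.argmin(errors))
    if i == 0:
        return float(s[0] - 1.0)    # predict everything positive
    if i == n:
        return float(s[-1] + 1.0)   # predict everything negative
    return float((s[i - 1] + s[i]) / 2)
```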

We will include some of this discussion in the final version of our paper.

Final Decision

The manuscript makes a significant contribution, but it would increase the paper's accessibility and impact if the authors could address a few points to clarify the benefits of the methodology.

First, in the context of imbalanced classes, although balanced accuracy implicitly reflects worst-group accuracy, it would be nice to report the latter explicitly in some plots as well, and perhaps to show ablations as a function of $\gamma$. Further, the authors should compare accuracy against simple reweighting methods or JustTrainTwice.
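For concreteness, the requested worst-group number can be read off the per-class recalls alongside balanced accuracy, e.g. (illustrative helper, not from the paper):

```python
import numpy as np

def balanced_and_worst_group_accuracy(y_true: np.ndarray, y_pred: np.ndarray,
                                      num_classes: int):
    """Per-class recalls; balanced accuracy is their mean, worst-group accuracy
    their minimum. Classes absent from y_true are skipped via NaN."""
    recalls = np.array([
        (y_pred[y_true == k] == k).mean() if np.any(y_true == k) else np.nan
        for k in range(num_classes)
    ])
    return float(np.nanmean(recalls)), float(np.nanmin(recalls))
```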

Second, the reduction to agnostic active learning draws heavily on theoretical results from the ACED literature. However, readers unfamiliar with those works may not find it clear how ACED / VReduce mitigates noisy annotations (see the reviewers' comments). Describing this in words, or even giving a self-contained summary of ACED's key sample-complexity guarantees under label noise (perhaps in the preliminaries), would help readers understand how DIRECT inherits and adapts these guarantees.

Further, while the theoretical comparison shows GALAXY is likely to sample around extra cuts under noise (M_b ≥ 1), the lower bound remains somewhat loose. To bolster the theoretical claims, the authors could either tighten the bound (e.g., show M_b > 1) or present empirical evidence explicitly demonstrating how the probability of misidentifying the optimal threshold decays with the query budget b and noise rate η. Also, the parallelization with B_parallel = 5 (in the experiments) does not really feel to me like a major point that should be considered the main contribution in an elevator pitch. It is a constant-factor gain, which is nice nonetheless, but it doesn't change the inherently sequential nature of the method (the m iterations in VReduce need to be done sequentially, as in GALAXY).

Finally, Appendix E currently holds a wealth of results that could be better integrated into the main text. The authors should summarize the most important ablations—across datasets, noise levels, and model sizes—in a concise table or paragraph, highlighting where DIRECT’s advantages emerge. In particular, exploring B_parallel values beyond 1 and 5 (for example 10 or 20) in the same plot would illustrate the trade‑off between parallel annotation and sample‐complexity shrinkage factor c.

With these revisions, I believe this work will offer both strong practical results and a clear path for readers to understand and build upon the algorithmic and theoretical innovations.

(Minor point: as currently written, it is not clear what is actually input to the VReduce algorithm: line 279 says "unlabeled y_i hidden to learner", whereas line 236 implies the y are the labels corresponding to q(i) …? Presumably x_i is unlabeled and y_i is hidden to the learner. It doesn't quite matter, since the y_i with i not in L are not used, but the imprecision is a bit confusing and can mislead people who do not read carefully.)