ASPEST: Bridging the Gap Between Active Learning and Selective Prediction
Propose a new learning paradigm called active selective prediction and a novel method ASPEST for this new learning paradigm.
Abstract
Reviews and Discussion
The paper proposes a new machine learning paradigm called "active selective prediction" which combines selective prediction and active learning. Within this new setting, a new method called ASPEST is proposed which utilizes checkpoint ensembles to help reduce overfitting and overconfidence during fine-tuning, and self-training with soft pseudo-labels to reduce overconfidence.
Experiments on image, text and tabular datasets with distribution shift show ASPEST outperforms prior selective prediction and active learning methods. More specifically, on SVHN it improves AUACC from 79.36% to 88.84% with a small labeling budget.
Strengths
- Formulates a new learning paradigm joining selective prediction and active learning, which is beneficial but challenging. New evaluation metrics are proposed for this setting.
- The proposed method addresses key issues, such as overfitting and overconfidence, that arise in the active selective prediction setting.
- The proposed method achieves improved accuracy and coverage over prior methods on distribution-shifted datasets.
Weaknesses
- The motivation of this work is under-discussed. I am not convinced that such a setting is needed in the first place.
- The proposed method itself is simple and built almost entirely on existing ideas. If the setting itself is questionable, such a method provides fewer insights to the field.
- There are ambiguities in the explanation of details such as the ensemble and self-training components.
Questions
For the sample selection strategy based on margin, is there any theoretical justification on why it helps with selective prediction, in addition to empirical evidence? And have you experimented with other sample selection strategies tailored for this problem?
Details of Ethics Concerns
N/A
This paper studies a new learning paradigm called active selective prediction, which aims to query more informative samples from the shifted target domain while increasing accuracy and coverage. This problem can be viewed as a combination of active learning and selective prediction. To solve it, the paper proposes a simple method called ASPEST, which uses ensembles of model snapshots together with self-training, with the aggregated ensemble outputs serving as pseudo-labels. Extensive experiments are conducted to demonstrate the effectiveness of the proposed method.
Strengths
The problem setting, called active selective prediction, is quite new and, to the best of my knowledge, is studied for the first time in this paper.
Weaknesses
My biggest concern is that the novelty of the proposed method is very limited. I cannot see any new insights from the proposed method. Only one sentence in the abstract describes the proposed method, yet, surprisingly, that sentence seems sufficient to present the whole method.
Another major drawback lies in that no theoretical analyses are provided. Many previous papers on selective prediction have provided theoretical guarantees for the proposed method. However, the proposed method in this paper is entirely heuristic, which would affect the quality of this paper.
Although the experimental results seem to support the proposed method, this is largely because previous methods cannot solve the new problem studied in this paper. I therefore consider that the key contribution of this paper should lie in the novelty and theoretical analysis of the proposed method, both of which should be further improved.
Questions
See the above weaknesses.
This paper introduces an active learning framework that is based on a combination of deep ensemble, confidence margin, and self-training techniques:
- Sample Selection Based on Deep Ensemble (checkpoint ensemble): fine-tune N models using SGD with different random seeds. An ensemble of these models is used to compute the average confidence for each sample. The unlabelled test samples with the lowest confidence margins are selected for human labelling.
- Active Learning: All the N models are fine-tuned on these selected test samples and the original training dataset using Cross Entropy loss to adapt to test distribution and prevent forgetting previously learned information.
- Self-Training: After active learning, unlabeled samples from the test set with confidence exceeding a certain threshold (e.g., 0.95) are selected for self-training. The average label provided by the ensemble of models is assigned as the sample's pseudo label. Then the authors train the N models using these samples with KL divergence loss.
- Computational Efficiency on Large Test Sets: When dealing with an extensive unlabeled test set, to reduce computational costs, the average confidence and pseudo labels for each test sample are updated only after multiple epochs.
The resultant N models trained through this framework exhibit high ensemble accuracy. Additionally, their average confidence proves effective when applied to selective classification tasks.
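To make the sample-selection step in this summary concrete, here is a minimal sketch, assuming each ensemble member outputs softmax probabilities and that the "confidence margin" is the gap between the two largest averaged class probabilities. The function name `select_by_margin` and the toy inputs are hypothetical and are not taken from the paper.

```python
import numpy as np

def select_by_margin(member_probs: np.ndarray, budget: int) -> np.ndarray:
    """Return indices of the `budget` samples with the smallest ensemble margin.

    member_probs has shape (n_models, n_samples, n_classes) and holds the
    softmax outputs of each ensemble member (an assumed input format).
    """
    # Average the members' probabilities to get the ensemble prediction.
    avg_probs = member_probs.mean(axis=0)
    # Margin = largest averaged probability minus the second largest.
    top2 = np.sort(avg_probs, axis=1)[:, -2:]
    margin = top2[:, 1] - top2[:, 0]
    # Smallest margins first: these are the samples sent for labelling.
    return np.argsort(margin, kind="stable")[:budget]

# Toy example: 2 models, 3 unlabelled test samples, 3 classes.
member_probs = np.array([
    [[0.90, 0.05, 0.05], [0.40, 0.35, 0.25], [0.50, 0.45, 0.05]],
    [[0.80, 0.10, 0.10], [0.30, 0.45, 0.25], [0.60, 0.30, 0.10]],
])
selected = select_by_margin(member_probs, budget=2)
```

In this toy input the averaged margins are roughly 0.775, 0.05 and 0.175, so the second and third samples would be queried first.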
Strengths
Originality/Significance: The author presents an effective framework for active learning under distribution shift, which also mitigates overconfidence.
Quality: This paper utilizes a wide range of experimental datasets, covering various types such as image and text. From this perspective, the empirical evidence is quite comprehensive, and the ablation study is also supportive.
Clarity: Overall, the paper is easy to read. Each element, such as the introduction of a particular loss function, is accompanied by intuitive justifications.
Weaknesses
Originality/Significance: 1) Method: Prior work such as [1] already uses an ensemble's uncertainty scores, e.g., average confidence or variants such as the confidence margin, to select samples for active learning. Likewise, leveraging high-confidence pseudo-labels for self-training to improve accuracy has been described in many papers, such as [2]. Combining these methods can indeed improve performance, but in terms of bringing new methods or ideas to these fields, I find the contribution rather limited. 2) Framework/Task: The new framework (i.e., task) proposed by the authors, which combines selective classification and active learning, seems somewhat unconvincing: one pertains to the training framework, while the other relates to the inference phase. The authors' objective is, in my view, to enhance model accuracy during training while also considering confidence calibration to mitigate overconfidence, the latter of which the ensemble will naturally satisfy. Besides, I think this goal is sometimes already built into the active learning framework, since such methods usually require an accurate confidence score for sample selection.
[1] Beluch, William H., Tim Genewein, Andreas Nürnberger, and Jan M. Köhler. "The power of ensembles for active learning in image classification." In CVPR, 2018.
[2] Lee, Dong-Hyun. "Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks." In Workshop on Challenges in Representation Learning, ICML, 2013.
Clarity: The introduction and Sec. 3.1-3.3 can be somewhat confusing, making it hard for readers to quickly grasp how selective classification and active learning are integrated. For example, in Figure 1, the low-confidence samples chosen by selective classification are also sent for human labelling. Readers will wonder how these samples, labeled at the inference stage, benefit the model, and whether the system is dynamic. Full understanding is achieved only upon seeing the specific algorithm implementation.
Questions
- For Eq. 10, how is the ground truth label in the KL divergence obtained? Is it directly using the average confidence from the ensemble as its label, or is it based on majority voting?
- The last sentence in Sec 4 seems to indicate that this method cannot generalize to other test samples that haven't appeared in Ux. Is my interpretation accurate in believing that the model can directly utilize the learned ensemble models to make new predictions on unseen test data points?
- How is checkpoint ensemble implemented? How does it fundamentally differ from deep ensemble? What makes checkpoint ensemble unique? Is my understanding correct that the approach of checkpoint ensemble involves fine-tuning N different models using SGD and varying random seeds, and then using T-round active learning to train every ensemble model? What does "checkpoint" indicate?
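To make the distinction raised in the first question concrete, the sketch below contrasts the two candidate targets: the averaged ensemble distribution (a soft label) and a one-hot majority vote. It is only an illustration under assumptions; the helper names are hypothetical, and the direction KL(target || prediction) is an assumed reading, not something confirmed by Eq. 10.

```python
import numpy as np

def soft_pseudo_labels(member_probs: np.ndarray) -> np.ndarray:
    """Soft target: the ensemble's averaged class distribution per sample."""
    return member_probs.mean(axis=0)

def majority_vote_labels(member_probs: np.ndarray) -> np.ndarray:
    """Hard target: one-hot label of the class most members predict."""
    n_classes = member_probs.shape[2]
    votes = member_probs.argmax(axis=2)           # (n_models, n_samples)
    counts = np.eye(n_classes)[votes].sum(axis=0)  # (n_samples, n_classes)
    return np.eye(n_classes)[counts.argmax(axis=1)]

def kl_divergence(target: np.ndarray, pred: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """KL(target || pred) per sample; the direction is an assumption."""
    target = np.clip(target, eps, 1.0)
    pred = np.clip(pred, eps, 1.0)
    return np.sum(target * (np.log(target) - np.log(pred)), axis=-1)

# Toy example: 3 models, 1 sample, 2 classes. Two members weakly prefer
# class 0, while one member strongly prefers class 1.
member_probs = np.array([[[0.55, 0.45]], [[0.55, 0.45]], [[0.10, 0.90]]])
soft = soft_pseudo_labels(member_probs)    # averaged distribution
hard = majority_vote_labels(member_probs)  # one-hot majority vote
loss_vs_soft = kl_divergence(soft, member_probs[0])
```

On this input the two targets disagree: the average favours class 1 with probability 0.6, while the majority vote picks class 0, which is why the choice of target matters for the loss.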
The paper addresses the very interesting problem of active selective prediction. In this task, the labeling procedure requires high prediction accuracy and coverage (lower coverage requires more human resources). The question is how to label the test data, which usually has a different distribution from the training data. The author(s) use ensembles of checkpoints to obtain better-calibrated confidence. These checkpoints are used to query samples; they are carefully constructed by fine-tuning the trained models on labeled samples from the training and test data, followed by a self-training procedure. The self-training procedure trains on the training data and on subsampled data with pseudo-labels. The performance mainly depends on the sampling procedure for self-training. The experimental results are also promising, since the method achieves high accuracy, coverage, and AUACC (an AUC-like measure based on coverage and accuracy).
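As a rough illustration of an AUC-like measure built from coverage and accuracy, the sketch below ranks samples by confidence, computes the accuracy of the top-k most confident predictions for every k, and averages over k as a simple approximation of the area under the accuracy-coverage curve. This is an assumed definition for illustration only; the paper's exact AUACC formula may differ, and the function name `auacc` is hypothetical.

```python
import numpy as np

def auacc(confidence, correct) -> float:
    """Approximate area under the accuracy-coverage curve.

    confidence: per-sample confidence scores.
    correct: 1 if the prediction is right, 0 otherwise.
    """
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Most confident samples are accepted first as coverage grows.
    order = np.argsort(-confidence, kind="stable")
    cum_correct = np.cumsum(correct[order])
    # Accuracy among the k most confident samples, for k = 1..n.
    accuracy_at_k = cum_correct / np.arange(1, len(order) + 1)
    # Mean over all coverage levels k/n.
    return float(accuracy_at_k.mean())

# An error ranked third out of four pulls the score below the case where
# the only error is the least confident prediction.
score_mixed = auacc([0.9, 0.8, 0.6, 0.3], [1, 1, 0, 1])
score_ranked = auacc([0.9, 0.8, 0.6, 0.3], [1, 1, 1, 0])
```

Under this definition, a model is rewarded both for being accurate and for assigning its lowest confidence to the predictions it gets wrong.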
Strengths
The task is relatively new, and the effect of self-training is well demonstrated. Calibration under OOD data is difficult in general. This framework can be valuable for tackling the OOD problem from various angles, such as active learning and calibration. The experiments seem extensive enough to validate the algorithm.
Weaknesses
How to choose cannot be easy. Could you consider cross-validation or other strategies? There are many hyper-parameters, and although the method may be robust to many of them, their selection can be problematic. Is there a simple solution to this issue?
Questions
Q1: The setup of the hyper-parameters is not explained clearly. Is the number of epochs for self-training?
Q2: Is there any reason to use the intermediate predictions during epochs in self-training?
Q3: When conventional active learning acquisition functions such as Random, entropy, and BALD are used, what are the results for active selective prediction?
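For reference, the three conventional acquisition functions named in Q3 can be sketched from ensemble outputs as follows. This is a minimal illustration, assuming softmax probabilities from each ensemble member and the standard ensemble approximation of BALD (entropy of the averaged prediction minus the average per-member entropy); the function names are hypothetical and do not describe the paper's experiments.

```python
import numpy as np

def entropy(p: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Shannon entropy along the last (class) axis."""
    p = np.clip(p, eps, 1.0)
    return -np.sum(p * np.log(p), axis=-1)

def acquisition_scores(member_probs: np.ndarray, rng: np.random.Generator) -> dict:
    """Higher score = more desirable to label.

    member_probs has shape (n_models, n_samples, n_classes).
    """
    avg = member_probs.mean(axis=0)
    predictive_entropy = entropy(avg)
    expected_entropy = entropy(member_probs).mean(axis=0)
    return {
        "random": rng.random(avg.shape[0]),
        "entropy": predictive_entropy,
        # BALD: mutual information between prediction and model choice.
        "bald": predictive_entropy - expected_entropy,
    }

# Sample 0: both members are uncertain and agree.
# Sample 1: both members are confident but disagree.
member_probs = np.array([
    [[0.50, 0.50], [0.99, 0.01]],
    [[0.50, 0.50], [0.01, 0.99]],
])
scores = acquisition_scores(member_probs, np.random.default_rng(0))
```

In this toy input both samples have the same predictive entropy, but only the second has a high BALD score, since BALD responds to disagreement between members rather than to uncertainty shared by all of them.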
Details of Ethics Concerns
None.