PaperHub
ICLR 2024 · Decision: Rejected
Average rating: 5.0/10 from 5 reviewers (ratings 5, 5, 6, 6, 3; min 3, max 6, std 1.1)
Average confidence: 4.0

Selective Prediction via Training Dynamics

Submitted: 2023-09-22 · Updated: 2024-02-11
TL;DR

We propose a novel approach for selective prediction based on the prediction disagreement evolution of intermediate models with the final model's prediction.

Abstract

Keywords
selective prediction, training dynamics, example difficulty, forging, reject option, uncertainty quantification, reliability

Reviews and Discussion

Review
Rating: 5

This paper proposes SPTD (Selective Prediction based on neural network Training Dynamics), a new approach to the selective prediction problem. SPTD retains the final model along with many intermediate models produced during SGD-style training. It runs these intermediate models and measures the disagreement between the final model's prediction and the intermediate models' predictions. With a weighting scheme, a threshold-based gating function is then introduced to estimate the selection region. Given that this method introduces no architectural changes, it has no train-time impact (the intermediate checkpoint storage is an additional overhead). This also means that such an approach can be utilized not only for classification tasks, but also for other tasks such as regression.

Strengths

  • No architectural changes, which implies no change to training time
  • Applicable not only to selective classification but also to other tasks such as regression and time-series forecasting

Weaknesses

  • Added storage overhead for the intermediate models
  • Added inference cost for the prediction using the intermediate models compared to other selective prediction models that only require one forward pass through the architecture and the gating mechanism.
  • It is unclear which of the many intermediate checkpoints should be used for the inference stage.

Questions

  • Have you plotted other baselines for Figure 2 to see what impact these baselines have compared to SPTD?
  • How do you select which of the intermediate checkpoints should be used for the inference stage? It is possible to design clever selection strategies during training to reduce the storage and inference cost rather than storing intermediate points at fixed checkpoint intervals.
  • Have you tried other weighting schemes than (t/T)^k?
  • Have you compared the inference cost of SPTD with other methods (which only require one forward pass through the network and some gating mechanism)?

Ethics Concerns

N/A

Comment

We thank the reviewer for their feedback on our work and address individual concerns below:

Added storage overhead for the intermediate models. Added inference cost for the prediction using the intermediate models compared to other selective prediction models that only require one forward pass through the architecture and the gating mechanism.

The reviewer is right that our approach requires storage of additional models and also has an increased inference-time cost due to forward-propagating inputs through all models. However, Deep Ensembles, the best competing approach to date, likewise requires storing multiple models and incurs an increased inference-time cost. We also note that many past works from the selective classification domain [1,2,3,4] do not explicitly consider Deep Ensembles as a competing method. In summary, we observe that the added storage and inference-time cost do contribute to better selective prediction performance.

References

[1] Feng, Leo, et al. "Towards Better Selective Classification." The Eleventh International Conference on Learning Representations. 2023.

[2] Huang, Lang, Chao Zhang, and Hongyang Zhang. "Self-adaptive training: beyond empirical risk minimization." Advances in neural information processing systems 33 (2020): 19365-19376.

[3] Liu, Ziyin, et al. "Deep gamblers: Learning to abstain with portfolio theory." Advances in Neural Information Processing Systems 32 (2019).

[4] Geifman, Yonatan, and Ran El-Yaniv. "Selectivenet: A deep neural network with an integrated reject option." International conference on machine learning. PMLR, 2019.

It is unclear which of the many intermediate checkpoints should be used for the inference stage.

See discussion below.

Have you plotted other baselines for the Figure 2 to see what impact these baselines have compared to SPTD?

We provide this extended experiment as part of the updated PDF in Figure 8. We see that all methods reliably improve over the SR baseline. At the same time, we notice that SAT and DE still assign higher confidence away from the data due to limited use of decision boundary oscillations. SPTD addresses this limitation and assigns more uniform uncertainty over the full data space.

How do you select which of the intermediate checkpoints should be used for the inference stage? [...]

Our method relies on first computing a very detailed approximation of the training dynamics and then subsampling (see Checkpoint Selection Strategy on page 8) and weighting the subsampled checkpoints using $v_t = (t/T)^k$. You can think of the weighting as prioritizing which part of the training dynamics to pay attention to. As we show below, our weighting is an appropriate choice for the problems we consider and outperforms alternative non-convex or uniform weighting strategies. We agree with the reviewer that other weighting strategies with potentially stronger performance might exist. However, these weightings might require deliberate distributional assumptions. Our work demonstrates that $v_t = (t/T)^k$ with a flexible choice of $k$ enables SOTA selective classification performance across a wide range of datasets without much hyper-parameter tuning.
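To make this concrete, here is a minimal sketch of the subsample-then-weight step (our own code and naming, not the authors' implementation; evenly spaced selection is assumed as one natural fixed-interval choice):

```python
import numpy as np

def subsample_and_weight(num_stored, num_used, k=2.0):
    """Evenly subsample `num_used` of `num_stored` stored checkpoints and
    weight them with v_t = (t/T)^k (convex for k > 1, favoring late checkpoints)."""
    idx = np.linspace(0, num_stored - 1, num=num_used, dtype=int)  # always includes the final checkpoint
    t = idx + 1                      # 1-based checkpoint times
    weights = (t / num_stored) ** k  # v_t = (t/T)^k with T = num_stored
    return idx, weights

# e.g. keep 25 of ~1000 stored checkpoints with the convex weighting k = 2
idx, w = subsample_and_weight(num_stored=1000, num_used=25, k=2.0)
```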

Have you tried other weighting schemes than (t/T)^k?

We have experimented with other weightings besides $v_t = (t/T)^k$ with $k \in [0,\infty)$ but found that our choice of $v_t$ performs best across our experimental panel. This weighting encourages us to be particularly sensitive to late disagreements while at the same time incorporating diversity from models across the full training spectrum. We have updated the submission with a new experiment shown in Figure 13 in which we also consider concave weightings $k \in (0,1]$ as well as a uniform weighting assigning the same weight to all checkpoints. It is evident that our weighting choice performs better than other approaches. Finally, we want to remind the reviewer that, as described in Section 3.2, our choice for $v_t$ is not arbitrary but inspired by recent results on sample difficulty [1,2,3].

References:

[1] Baldock, Robert, Hartmut Maennel, and Behnam Neyshabur. "Deep learning through the lens of example difficulty." Advances in Neural Information Processing Systems 34 (2021): 10876-10889.

[2] Zhang, Chiyuan, et al. "Understanding deep learning (still) requires rethinking generalization." Communications of the ACM 64.3 (2021): 107-115.

[3] Jiang, Ziheng, et al. "Characterizing structural regularities of labeled data in overparameterized models." arXiv preprint arXiv:2002.03206 (2020).

Comment

Have you compared the inference cost of SPTD with other methods (which only require one forward pass through the network and some gating mechanism)?

As is typical for ensemble-based methods, our method naturally incurs a higher inference-time cost compared to models that do not require access to multiple models. While an ensemble consisting of $M$ models incurs approximately $M$ times the inference cost of a model only relying on a single forward pass, we remark that ensemble-based methods (Deep Ensembles, Selective Prediction Training Dynamics, as well as their combination) provide significantly stronger selective accuracy at a fixed coverage level than single-forward-pass methods (see Table 1).


We hope that we have addressed the reviewer’s concerns and that the reviewer considers raising their score as a result.

Comment

Thank you for the rebuttal. I feel this still does not address the associated inference and storage costs of the proposed scheme. Since it is clear that these costs are significantly larger than those of schemes that utilize a single forward pass, it is imperative that such metrics be clearly plotted and discussed in the context of this work.

Comment

We thank the reviewer for sharing their additional concerns with us.

In response to these concerns about clear documentation of the associated cost (and gains) for each selective classification method, we have now compiled Table 3 (and have updated the paper accordingly) which states the space and time complexities for all SC methods for both training and inference. We also include a column which details the SC performance ranking based on Table 1 and Figure 5. We conclude that although SR and SAT are the cheapest methods to run, they also perform the poorest at selective classification. SPTD is significantly cheaper to train than DE and achieves competitive performance at $T \approx M$. Although DE+SPTD is the most expensive model, it also provides the strongest performance.


We hope that this clarification addresses the reviewer’s remaining concerns and hope that they consider increasing their score as a result.

Review
Rating: 5

This paper proposes a new metric for selective prediction -- SPTD. The metric is based on the training dynamics of a sample and is applicable to classification, regression, and time-series forecasting. It is an inference-time method -- it does not need any specialized training, although it does need some checkpoints to be stored -- which makes it usable in combination with existing selective prediction methods.

Strengths

  • The method is novel in that it does not need specialized training, and it can be applied on top of existing methods.
  • Reasonable baselines and ablations (I especially like the analysis of checkpoint granularity)

Weaknesses

  • Fails to cite some very related works on training dynamics (e.g., https://arxiv.org/pdf/2009.10795.pdf) as well as on using training dynamics to analyze test samples (https://proceedings.mlr.press/v163/adila22a/adila22a.pdf)
  • Considering the two works mentioned above, the novelty of the work now seems smaller. If the authors can come up with a convincing argument on this, I would not be opposed to raising my score
  • The numerical improvement over the baselines (Table 1) seems very small.
  • The distribution-of-g evaluation (Figure 4) is only done for the proposed method. If other baselines show the same pattern, the relative efficacy of the method becomes questionable

Questions

  • See Weaknesses
  • Have the authors tried the method on OOD datasets? Can it reliably reject OOD samples?
Comment

We thank the reviewer for their feedback on our work and address individual concerns below:

Fail to cite some very related works on training dynamics (e.g., https://arxiv.org/pdf/2009.10795.pdf) as well as using training dynamics to analyze test sample (https://proceedings.mlr.press/v163/adila22a/adila22a.pdf)

We thank the reviewer for pointing us to these works. Here we clarify that while both suggested papers look at “training dynamics”, the quantities we use are different: in both of the referenced papers the authors use variability, while our metrics focus on convergence rates. Furthermore, note that variability is defined with respect to the correct label, whereas in selective prediction we do not know the ground-truth label (and need to measure convergence in label prediction). To further set our method apart, we also demonstrate the applicability of our approach beyond classification and its composability with other selective classification approaches, including composing on top of common ensembling techniques. Due to these key differences, we are confident that our work continues to be a valuable contribution to the selective prediction and uncertainty quantification communities. We have updated our related work section with a paragraph discussing this and other related training-dynamics-specific works.

Considering the two works mentioned above, the novelty of the work seems less now. If the authors can come with a convincing argument on this, I would not be opposed to raising my score

The novelty in our method is the specific training dynamics metric we propose for selective prediction as well as our evaluation in conjunction with other approaches and on tasks beyond classification. The cited work uses the variance of the logit output for the true label; while we do not have access to the true label, we can use the final predicted label as a proxy. Doing so, we present results in Figure 14, along with additional discussion in Appendix D.2.6, which show that our metric (which emphasizes convergence) performs better than the training dynamics metrics proposed in the cited work.

The numerical improvement over baseline (Table 1) seems very small.

We understand the reviewer’s concern about the small improvements shown in Table 1. At the same time, we would ask the reviewer to put the presented results into context:

  • Many past works from the selective classification domain [1,2,3,4] do not consider Deep Ensembles explicitly as a competing method. Since we mainly consider our method as an alternative to DE, these improvements can appear small.
  • Moreover, many past works [2,3] do not explicitly accuracy-align models at full coverage, which can give the proposed methods an unfair head start and, as a result, overestimate the method's effectiveness. We make sure to compare all methods on an equal footing by disentangling selective prediction performance from gains in overall utility. We highlight the presence of accuracy alignment in the caption of Table 1.
  • Finally, we highlight that our method's advantages extend beyond the SPTD results reported in Table 1. These include (i) transparency w.r.t. the training stage; (ii) retroactive applicability; as well as (iii) composability with existing SC approaches.

References

[1] Feng, Leo, et al. "Towards Better Selective Classification." The Eleventh International Conference on Learning Representations. 2023.

[2] Huang, Lang, Chao Zhang, and Hongyang Zhang. "Self-adaptive training: beyond empirical risk minimization." Advances in neural information processing systems 33 (2020): 19365-19376.

[3] Liu, Ziyin, et al. "Deep gamblers: Learning to abstain with portfolio theory." Advances in Neural Information Processing Systems 32 (2019).

[4] Geifman, Yonatan, and Ran El-Yaniv. "Selectivenet: A deep neural network with an integrated reject option." International conference on machine learning. PMLR, 2019.

Distribution on g evaluation (Figure 4) is only done for the proposed method. If other baselines have the same pattern, the relative efficacy of the method becomes questionable

We provide an extended study on all methods in Figure 10. Since all methods are designed to address the selective prediction problem, they all manage to separate correct from incorrect points (albeit with varying success). We see that SPTD spreads the scores for incorrect points over a wide range with little overlap. We observe that for SR, incorrect and correct points both have their mode at approximately the same location, which hinders effective selective classification. Although SAT and DE show larger bumps at higher score ranges for incorrect points, the separation from correct points is weaker, as correct points also result in high scores (i.e., a longer blue tail) more often than for SPTD.

Comment

Have the authors try the method on OOD datasets? Can it reliably reject OOD samples?

The reviewer is right that the field of OOD detection is an important discipline in trustworthy ML related to SC. We have therefore already provided preliminary evidence in Figure 15 in the Appendix that our method can be used for detecting OOD examples (and also adversarial examples). While these results are already encouraging, we remark that adversarial and OOD samples are less well defined than incorrect data points and can come in a variety of different flavors (e.g., various kinds of attacks or various degrees of OOD-ness). As such, we strongly believe that future work is needed to determine whether a training-dynamics-based approach to SC can be reliably used for OOD and adversarial sample identification. In particular, a study of the exact observed training dynamics for both types of samples seems vital to ensure improved detectability. We remark that in our extended synthetic experiments in Figure 8, we observe that SAT and DE assign higher confidence away from the data due to limited use of decision boundary oscillations. SPTD addresses this limitation and assigns more uniform uncertainty over the full data space, enabling better OOD detectability.


We hope that we have addressed the reviewer’s concerns and that the reviewer considers raising their score as a result.

Comment

Thank you for the detailed response to the review! The authors have addressed my main concern on novelty. I have increased my score.

Comment

We thank the reviewer for their swift response to our rebuttal and for raising their score as a result. Should the reviewer still have any additional concerns with respect to our work then we will be happy to address those!

Review
Rating: 6

The paper proposes a new approach to selective prediction, which is a learning setup that allows the model to abstain from making a prediction. The main idea is to keep a set of checkpoint models during training and compute a weighted average of the prediction discrepancy between the checkpoint and final models. The entire process can be described concisely as follows:

$g(x) = \sum_{t \in S} (t/T)^k \cdot L(f_t(x), f_T(x))$, where $S$ is the set of checkpoint indices, $L$ is a distance function (such as 0-1 loss for classification or mean absolute error for regression), and $k$ is a hyper-parameter. If $g(x) \geq \tau$: abstain; otherwise, use $f_T$ to make the prediction.
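For illustration only (this is our own sketch under the notation above, not the authors' code; for simplicity, t indexes the retained checkpoints directly), the rule can be written as:

```python
import numpy as np

def sptd_score(checkpoint_preds, final_pred, k=2.0, task="classification"):
    """g(x) = sum_t (t/T)^k * L(f_t(x), f_T(x)) for a single input x, with
    L = 0-1 disagreement (classification) or absolute error (regression)."""
    preds = np.asarray(checkpoint_preds)       # checkpoint predictions f_t(x), in training order
    T = len(preds)
    weights = (np.arange(1, T + 1) / T) ** k   # v_t = (t/T)^k
    if task == "classification":
        dist = (preds != final_pred).astype(float)
    else:
        dist = np.abs(preds - final_pred)
    return float(np.sum(weights * dist))

def predict_or_abstain(checkpoint_preds, final_pred, tau, **score_kwargs):
    g = sptd_score(checkpoint_preds, final_pred, **score_kwargs)
    return None if g >= tau else final_pred    # None signals abstention
```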

This is a simple but interesting idea. It seems to provide strong empirical results too. However, despite the presence of extensive experiment results, the fundamental reason why this method has an advantage over methods that calibrate uncertainty directly is still not clear to me. Furthermore, I have some practical concerns regarding the proposed approaches as well.

Overall, I believe this is an interesting work but it still lacks a deep insight into why this machinery is expected to work better than prior approaches. This is why I currently rate this paper a bit below the acceptance bar but surely, if the authors address my concerns convincingly, I will be happy to upgrade my rating -- it is possible that I might have missed something important here.

Strengths

Looking at the training dynamics to gauge prediction reliability at a test point is a refreshingly interesting idea. Despite its simple formulation, I consider the idea novel -- in fact, simplicity of implementation is a plus to me.

The paper is also reasonably well-written. It is a pleasure to read this paper. All discussion points & experiment highlights are well-organized, which makes the core idea very digestible.

I also appreciate the extensive results with a lot of ablation studies. There are also some pretty interesting theoretical results in the appendix. I think some of these results do provide theoretical insight into how the variation of a certain performance metric across checkpoint models can be related to the probability of correct classification, which could help strengthen the discussion in the main text if the corresponding assumptions are well-justified.

I like that part of the appendix also provides interesting discussion points relating the proposed approach to other lines of research.

Weaknesses

Despite the above strengths, I still have a few doubts regarding the practicality of this paper:

First, the results are presented in a way that gives the impression that one can control the coverage.

How is it possible in practice? I understand that the threshold can be adjusted to meet a certain coverage level on the training set but I am not sure how we could do that for the unseen test set.

In other words, I feel that setting tau algorithmically should be part of the solution.

Second, if I understand the main point correctly, the exploitation of the checkpoint models is mainly to help with uncertainty calibration. Then, what makes it perform better than methods that do so explicitly? I think we need a more insightful discussion here.

Third, does the tuning of the parameter k depend on test data?

Fourth, part of the claim is that the proposed method can be applied on top of existing models. But is it always effective? Could we demonstrate such synergy between SPTD and other baselines? Furthermore, I wonder how well a simple probabilistic method with explicit prediction uncertainty, such as a Bayesian neural net or Gaussian process, would fare against SPTD -- we can either do that experiment or point out previous findings in the literature that already shed light on this.

Last, will SPTD still be robust with respect to checkpoint resolution if the model complexity increases? ResNet18 is probably not a SOTA model for hard classification tasks such as CIFAR-100 -- how does SPTD work on larger models, such as ViT?

Questions

I have raised some questions in the Weakness section. In addition, I also have a few other minor questions:

How do you set the threshold? In your experiment, was it set with respect to the training or test set?

I find this statement confusing: "checkpoint each model after processing 50 mini-batches of size 128. All models are trained over 200 epochs" -- by this statement alone, if you are making 200 passes over the entire training set of CIFAR-10, that means looping over a total of 50K * 200 / 128 batches -- so if you checkpoint every 50 batches, approximately > 1K checkpoints need to be stored -- this seems a bit too extravagant

That being said, I feel like I have missed something here since the later context clearly points out that the no. of checkpoint models used in the experiment is between 25-50

In addition, I wonder whether an explicit probabilistic method (such as BNN & GP) would be over-confident in the synthetic experiment. Can we do a quick check on this?

Comment

We thank the reviewer for their detailed assessment of our work and address individual concerns below:

First, the results are presented in a way that gives the impression that one can control the coverage. [...]

As part of our evaluation, we directly compute the accuracy/coverage tradeoff on the test set. This is consistent with how prior works evaluate their selective classification approaches [1,2,3,4] and allows us to highlight relative performance improvements of our method over competing works.

References

[1] Feng, Leo, et al. "Towards Better Selective Classification." ICLR, 2023.

[2] Huang, Lang, Chao Zhang, and Hongyang Zhang. "Self-adaptive training: beyond empirical risk minimization." NeurIPS 33 (2020).

[3] Liu, Ziyin, et al. "Deep gamblers: Learning to abstain with portfolio theory." NeurIPS 32 (2019).

[4] Geifman, Yonatan, and Ran El-Yaniv. "Selectivenet: A deep neural network with an integrated reject option." ICML, 2019.

Second, if I understand the main point correctly, the exploitation of the checkpoint model is mainly to help with uncertainty calibration. [...]

The reviewer is right that our method helps with uncertainty calibration. Although we think that the precise reason for the effectiveness of our particular method (in particular theoretical guarantees) should be part of future work, we do have some evidence that our method provides improved results due to diversity ensembling. Note that while model ensembling can be achieved in various different ways, many past works have found that a key ingredient to well-performing ensembling is sufficient diversity between ensemble members [1,2,3]. We do provide some empirical evidence for this connection in Figure 8 where we see that the decision boundaries considered by SPTD are significantly more diverse than the boundaries derived by DE or SAT.

References

[1] Kuncheva, Ludmila I., and Christopher J. Whitaker. "Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy." Machine learning 51 (2003): 181-207.

[2] Sollich, Peter, and Anders Krogh. "Learning with ensembles: How overfitting can be useful." NeurIPS 8 (1995).

[3] Morishita, Terufumi, et al. "Rethinking Fano’s Inequality in Ensemble Learning." ICML, 2022.

Third, does the tuning of the parameter k depend on test data?

Although the parameter $k$ can be tuned on a per-dataset basis, we found that on the datasets we consider the particular choice of $k$ is robust in the interval $[1,3]$ (see Figure 12). Ablating the parameter in this interval delivers comparable performance. We also consider even more choices for the weighting $v_t$ in Figure 13 and find that concave and uniform weighting schemes do not outperform the convex choice proposed in the main section of the paper.

Fourth, part of the claim is the proposed method can be applied on top of existing models. But is it always effective? Could we demonstrate such synergy between SPTD and other baselines? [...] Furthermore, I wonder how well a simple probabilistic method with explicit prediction uncertainty [...]

Table 1 already provides evidence that employing SPTD on top of DE yields improved performance. We also experiment with using SPTD on top of SAT and find that SPTD also improves performance in combination with SAT (see Figure 11). This result reinforces our claim that applying SPTD on top of existing approaches yields improved performance across various methods.

We are skeptical that additional results using BNNs or GPs would provide insightful alternative baselines. The reason is that Deep Ensembles, for which we provide results, have been found to dominate these methods [1,2]. Bayesian Neural Networks only allow for unimodal uncertainty quantification (they approximate a single mode in the loss landscape) while Deep Ensembles allow for multimodal uncertainty estimation (approximating multiple modes in the loss landscape). Moreover, to the best of our knowledge, Gaussian Processes are rarely used as an uncertainty quantification method for high-dimensional image classification with large amounts of data. Inverting the Gram matrix is typically intractable in these use cases, and approximations of the Gram matrix and its inverse reduce the quality of the provided uncertainty.

References

[1] Fort, Stanislav, Huiyi Hu, and Balaji Lakshminarayanan. "Deep ensembles: A loss landscape perspective." arXiv preprint arXiv:1912.02757 (2019).

[2] Ovadia, Yaniv, et al. "Can you trust your model's uncertainty? evaluating predictive uncertainty under dataset shift." NeurIPS 32 (2019).

Comment

Last, will SPTD still be robust regarding checkpoint resolution if the model complexity increases? [...]

We have previously experimented with VGG architectures and have found our results to be consistent regardless of the chosen architecture. We remark that ViT models have not yet been shown to consistently provide better utility than ResNets on the datasets we consider [1] and hence we opted against ViT experimentation in our work. To still showcase our method at larger model complexity, we show results for CIFAR-100 on a ResNet-50 architecture in Table 3. We observe that the effectiveness of SPTD carries over to this larger architecture.

References

[1] Zhu, Haoran, Boyuan Chen, and Carter Yang. "Understanding Why ViT Trains Badly on Small Datasets: An Intuitive Perspective." arXiv preprint arXiv:2302.03751 (2023).

How do you set the threshold? [...]

Across all experiments, $\tau$ is chosen to achieve a desired targeted coverage level. This is done by first computing the selection score $g$ across all data points, ranking the data points based on $g$ (effectively sorting them in ascending order), and then picking $\tau$ such that the first c% of test points are accepted. We don't report these values for our experiments as they are method-dependent (different methods have different selection score distributions and therefore different thresholding values) and typically less interpretable than a specific targeted coverage level (for example, “accept 90% of data points” is often easier to understand than “accept all data points with selection score smaller than 0.5”). As described above and as is consistent with prior work, we compute the threshold on the test set directly.
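For concreteness, a minimal sketch of this coverage-targeted thresholding (our own code and names; it assumes lower scores indicate higher confidence, matching the ascending-sort acceptance described above):

```python
import numpy as np

def threshold_for_coverage(scores, coverage):
    """Pick tau such that the `coverage` fraction of points with the lowest
    selection scores g is accepted (accept if g <= tau, abstain otherwise)."""
    scores = np.sort(np.asarray(scores))
    cutoff = max(int(np.ceil(coverage * len(scores))) - 1, 0)
    return scores[cutoff]

g_test = np.random.rand(10000)                        # placeholder selection scores
tau = threshold_for_coverage(g_test, coverage=0.9)
accepted_fraction = float(np.mean(g_test <= tau))     # ~0.9
```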

I find this statement confusing: "checkpoint each model after processing 50 mini-batches of size 128. All models are trained over 200 epochs" [...]

The reviewer is right that we do indeed initially approximate the training dynamics on a very granular scale, leading to more than 1k checkpoints per dataset. As forward-propagating through > 1k checkpoints is evidently prohibitive, we show in the Checkpoint Selection Strategy paragraph on page 8 that subsampling 50, 25, or even 10 checkpoints still leads to strong selective classification. Especially for high targeted coverage (>50%), the performance across all checkpointing resolutions is indistinguishable from each other (Figure 5). This insight allows us to reduce the computational overhead of our method to the same cost as Deep Ensembles at test time while having a considerably less expensive training stage. Note that there is no need to compute the fine-grained trajectory and subsequently downsample to fewer checkpoints; this was merely done for our experiment to understand the limiting behavior of our approach. Practical implementations of our method can directly store checkpoints at a more coarse-grained resolution.

In addition, I wonder whether an explicit probabilistic method (such as BNN & GP) would be over-confident in the synthetic experiment. [...]

We provide an extended experiment as part of the updated PDF in Figure 8. We see that all methods reliably improve over the SR baseline. At the same time, we notice that SAT and DE still assign higher confidence away from the data due to limited use of decision boundary oscillations. SPTD addresses this limitation and assigns more uniform uncertainty over the full data space. We also provide a result using Bayesian linear regression in Figure 9 which gives results comparable to deep ensembles.


We hope that we have addressed the reviewer’s concerns and that the reviewer considers raising their score as a result.

Comment

Thank you for the detailed response. Your response has addressed many of my points but some remain.

First, I still think the authors' response on uncertainty calibration is a bit weak given that it is one of the main points of the paper. Please do consider generating more diverse experiments to support this point in the next revision of this paper. But I meant this as constructive feedback only, and I can take the current response under the scope of this rebuttal.

Second, I am now quite concerned with the statement that "As described above and as is consistent with prior work, we compute the threshold on the test set directly" -- if I understand this correctly, that means the authors have peeked into the test set to configure the learning algorithm. This seems to align with my point earlier that ultimately, we actually do not have a way to control the coverage. I suppose this has to be computed based on some statistics on the training set somehow.

Can the authors please elaborate more on this without deferring to prior work? It would even be better if the authors can explain the rationale here & highlight particularly how this algorithm will be used in a scenario where we do not know the test set in advance?

In addition, what will be the result if the authors use a threshold computed on the training dataset instead?

In a different discipline, it is generally not acceptable to configure the prediction algorithm based on statistics of the test set so I really have to press on this. Please address the above points.

Comment

We thank the reviewer for considering our rebuttal and were glad to hear that our response addressed most of their concerns.

First, I still think the authors' response to the uncertainty calibration is a bit weak given that it is one of the main point of the paper. Please do consider generate more diverse experiments to support this point in the next revision of this paper. But, I meant this as a constructive feedback only and I can take the current response under the scope of this rebuttal.

As per the reviewer's suggestion, we will provide an expanded discussion (including additional experimental evidence) on the reasons for the effectiveness of our method in a camera-ready revision of the paper.

In the following, we address the remaining concern regarding setting thresholds on the training / test set.

[...] that means the authors have peeked into the test set to configure the learning algorithm.

We first clarify that the learning algorithm has no access to the test set. That is, model training is not influenced by any test data points. We do however specify the threshold on the test set which happens at evaluation time.

This seems to align with my point earlier that ultimately, we actually do not have a way to control the coverage. I suppose this has to be computed based on some statistics on the training set somehow.

The reviewer rightfully points out that labels are unavailable at test time and that a realistically deployable approach has to compute thresholds based on a validation set (and not the training or test set). In the case of selective classification, the training, validation, and test sets follow the i.i.d. assumption, which means that an approach that sets the threshold based on a validation set should also perform well on the test set. Under consistent distributional assumptions, estimating thresholds on a validation set is an unbiased estimator of accuracy/coverage tradeoffs on the test set. By the same virtue, setting thresholds directly on the test set and observing the SC performance on that test set should be indicative of additional test samples beyond the provided test set. It is important to remark that the validation set should only be used for setting the thresholds and not for model selection / early stopping, which would indeed cause a potential divergence between SC performance on the validation and test sets. Note that violations of the i.i.d. assumption can lead to degraded performance due to mismatches in attainable coverage, as explored in [1].

To confirm this intuition, we present an experiment in Figure 16 (for SPTD) and Figure 17 (for SAT) where we select 50% of the samples from the test set as our validation set (and keep the other 50% of samples as our new test set). We first generate 5 distinct such validation-test splits, set the threshold $\tau$ based on the validation set, and then evaluate selective classification performance on the test set using the thresholds derived from the validation set. We compare these results with our main approach, which sets the thresholds based on the test set directly (ignoring the validation set). We see that the results are statistically indistinguishable from each other, confirming that this evaluation practice is valid for the selective classification setup we consider.
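A condensed sketch of this protocol (our own simplification, not the authors' code; it takes per-example scores and correctness as inputs, and all names are ours):

```python
import numpy as np

def val_then_test(scores, correct, coverage=0.9, seed=0):
    """Split the labeled test pool 50/50, set tau on the validation half,
    then report coverage and selective accuracy achieved on the test half."""
    scores, correct = np.asarray(scores), np.asarray(correct)
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(scores))
    val, test = perm[: len(perm) // 2], perm[len(perm) // 2:]
    tau = np.quantile(scores[val], coverage)   # accept the lowest-scoring `coverage` fraction
    accepted = scores[test] <= tau
    cov = float(accepted.mean())
    acc = float(correct[test][accepted].mean()) if accepted.any() else float("nan")
    return cov, acc
```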


We hope that this response addresses the reviewer’s concerns but are happy to provide further clarification if needed.


References

[1] Bar-Shalom, Guy, Yonatan Geifman, and Ran El-Yaniv. "Window-Based Distribution Shift Detection for Deep Neural Networks." Thirty-seventh Conference on Neural Information Processing Systems. 2023.

Comment

The authors said that:

We first clarify that the learning algorithm has no access to the test set. That is, model training is not influenced by any test data points. We do however specify the threshold on the test set which happens at evaluation time.

But then, that threshold will be given to the model? In a perfectly non-leaking scenario, I tend to think that you would still have to use the threshold that you computed on either the training or the validation set.

I think your experiment where 50% of the samples from the test set is used to compute the threshold and the model is evaluated on the remaining 50% is a step in the right direction. But, your following statement might over-generalize a bit: the indistinguishable results could be because you are computing the statistics from the same test distribution.

Why wouldn't you use a fraction of the training data instead?

--

Overall, the key experiments should have been remade with the threshold computed from only training data. That would have been the best response to me.

Otherwise, the response seems to both agree (the 2nd part) and disagree (the 1st part) with my point that the test set has been leaked & the concluding experiment attempts to prove that the leak had not had any strong impact on the performance, which is over-generalizing as I mentioned above.

As such, I still cannot increase my score but this is not my final score yet. I will also consult with the other reviewers during the discussion to see what they think of this issue & will adjust my score accordingly.

Comment

We thank the reviewer for their active participation in the rebuttal. We address their remaining concerns below.

As stated above, our training and test sets follow the i.i.d. assumption, meaning that both distributions are indistinguishable from each other. As a result, subsampling points from the training set for the validation set yields the same distributional pattern as subsampling from the test set. We opted for an experiment that subsamples from the test set as it does not require us to retrain our models (we are not allowed to train on points on which we later evaluate) and allows us to maintain competitive full-coverage accuracy scores. Hence, our take-away from Figures 16 and 17, namely that tuning the threshold on a validation set sampled from the test distribution works optimally (w.r.t. the optimal performance curve on the test set), holds regardless of whether we set the threshold on a subset of the training distribution or a subset of the test distribution, as the two are the same.

To reinforce this, we provide an additional experiment similar to Figure 16 in Figure 18, where we did indeed separate out a partition of 5000 points from the training set instead of the test set. As expected, the distributions are once again statistically indistinguishable, meaning that setting a threshold on a validation set sampled from the training set yields the optimal performance the method achieves on the test set (given by computing the method's ordering of test points, which is what our main experiments looked at).

To summarize this discussion, we want to clarify that what is being studied in our main experiments is which selective classification method is best at separating incorrect and correct test points. To evaluate this, one applies the method to the test set to see how it separates points (which is concisely described by the performance-over-coverage plots, or the distribution-of-scores plot), and uses this to compare the optimal selective prediction power of each method. What the reviewer is asking is whether these optimal selective prediction performances (described by the separation in scores the methods give) are achievable by using validation sets to pick thresholds, and in particular, whether this changes the takeaways about which method performs best. As shown in our response above, the performance achievable by using a validation set sampled from the test or training distribution matches the optimal performance for SPTD, and hence its achievable performance will beat the optimal performance of the other methods, as we already know its optimal performance does.


We hope that this clarification addresses the reviewer’s remaining concerns and hope that they consider increasing their score as a result.

Comment

Thank you for the clarification. I think empirically, it does address my concern to a reasonable extent. As a matter of principle, I increase the score to 6.

However, I honestly am still a bit doubtful regarding the fact that with such different choices of validation set, the results still remain almost the same as shown in those figures. I know that the authors stated that the distributions of the validation and the test set are indistinguishable but it still feels strange for the performance to match that well.

Wouldn't this argument apply to generic supervised learning setting as well, where a model that benefits from leaked information from the test set tends to perform better?

Please consider discussing this point in the revised paper.

Review
Rating: 6

This work introduces SPTD, a novel selective prediction method that relies on measuring, for a test sample $x$, the disagreement between predictions obtained from multiple checkpoints of the model. More precisely, the disagreement measures $a_t(x)$ for checkpoints $t$, $1 \leq t \leq T$, are combined with weights $v_t$: $g(x) = \sum_{t \in [T]} v_t a_t(x)$. They propose simple formulations of $a_t(x)$ and $v_t$ which work for both the classification and regression cases. They test their approach on 4 vision datasets and 3 regression tasks. On all of those tasks, they show that their method can, alone or in combination with deep ensembles, outperform other baselines in providing a better utility/coverage tradeoff.

Strengths

I find the paper well written, clearly presenting each relevant concept and experiment. The method is simple, which facilitates its adoption by ML practitioners. The experiments are convincing.

Weaknesses

  • The novelty of the method is limited, the ideas of re-using past checkpoints to form an ensemble can be found in e.g. [1]
  • The results for SPTD and Deep Ensemble (DE) are both relatively close to one another and it would be nice to derive conditions under which one method is expected to be better than the other.
  • It is unclear how the performance of SPTD is tied to optimization noise. Especially, regression experiments use full-batch gradient descent, how would the results evolve when using smaller batches?
  • The values for $v_t = (t/T)^k$ seem a bit arbitrary.

References:

[1] Checkpoint Ensembles: Ensemble Methods from a Single Training Process

Questions

  • See above.
  • How would your method compare to using conformal-based selective prediction [2, section 5.5]?

References:

[2] A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification

Comment

We thank the reviewer for their positive assessment of our work and address individual concerns below:

The novelty of the method is limited, the ideas of re-using past checkpoints to form an ensemble can be found in e.g. [1]

We thank the reviewer for alerting us to this relevant work! We remark that past work on ensembling (including [1]) considers averaging the predictions (and/or computing second moments) to get a better estimator of a model's prediction (and/or uncertainty). Instead, we investigate using checkpoints/ensembles as a means to check whether a data point follows temporal patterns typical of data points we should accept from an example-difficulty viewpoint, leading to novel aggregation schemes for ensembles. In other words, although [1] also uses training dynamics implicitly by constructing an averaged checkpoint ensemble, the goal in [1] is not selective prediction but boosting overall accuracy. Our approach focuses on selective prediction, which requires us to derive a different aggregation scheme than [1]. We also demonstrate the applicability of our approach beyond classification and its composability with other selective classification approaches, including composing on top of common ensembling techniques. Due to these key differences, we are confident that our work continues to be a valuable contribution to the selective prediction and uncertainty quantification communities. We have updated our related work section with a paragraph discussing this and other related training-dynamics-specific works.

The results for SPTD and Deep Ensemble (DE) are both relatively close to one another and it would be nice to derive conditions under which one method is expected to be better than the other.

We understand the reviewer’s concern about the small improvements shown in Table 1. We also agree with the reviewer that deriving theoretical conditions for the effectiveness of SPTD would be interesting future work. At the same time, we would ask the reviewer to put the presented results into context:

  • Many past works from the selective classification domain [1,2,3,4] do not consider Deep Ensembles explicitly as a competing method. Since we mainly consider our method as an alternative to DE, these improvements can appear small.
  • Moreover, many past works [2,3] do not explicitly accuracy-align models at full coverage, which can give the proposed methods an unfair head start and, as a result, overestimate the method's effectiveness. We make sure to compare all methods on an equal footing by disentangling selective prediction performance from gains in overall utility. We highlight the presence of accuracy alignment in the caption of Table 1.
  • Our current intuition for the strong performance of SPTD (and DE+SPTD) is based on increased diversity. Note that while model ensembling can be achieved in various different ways, many past works have found that a key ingredient to well-performing ensembling is sufficient diversity between ensemble members [5,6,7]. We provide some empirical evidence for this connection in Figure 8, where we see that the decision boundaries considered by SPTD are significantly more diverse than the boundaries derived by DE or SAT.
  • Finally, we highlight that our method's advantages extend beyond the SPTD results reported in Table 1. These include (i) transparency w.r.t. the training stage; (ii) retroactive applicability; as well as (iii) composability with existing SC approaches.

References

[1] Feng, Leo, et al. "Towards Better Selective Classification." The Eleventh International Conference on Learning Representations. 2023.

[2] Huang, Lang, Chao Zhang, and Hongyang Zhang. "Self-adaptive training: beyond empirical risk minimization." Advances in neural information processing systems 33 (2020): 19365-19376.

[3] Liu, Ziyin, et al. "Deep gamblers: Learning to abstain with portfolio theory." Advances in Neural Information Processing Systems 32 (2019).

[4] Geifman, Yonatan, and Ran El-Yaniv. "Selectivenet: A deep neural network with an integrated reject option." International conference on machine learning. PMLR, 2019.

[5] Kuncheva, Ludmila I., and Christopher J. Whitaker. "Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy." Machine learning 51 (2003): 181-207.

[6] Sollich, Peter, and Anders Krogh. "Learning with ensembles: How overfitting can be useful." Advances in neural information processing systems 8 (1995).

[7] Morishita, Terufumi, et al. "Rethinking Fano’s Inequality in Ensemble Learning." International Conference on Machine Learning. PMLR, 2022.

Comment

It is unclear how the performance of SPTD is tied to optimization noise. Especially, regression experiments use full-batch gradient descent, how would the results evolve when using smaller batches?

As per the reviewer's suggestion, we have run our regression experiments using mini-batches and found that the results are statistically indistinguishable from the results with full-batch gradient descent. Although the results are the same for the regression experiment, we do hypothesize that the randomness added by SGD is helpful in yielding more diverse intermediate models. As discussed above, diversity is a desirable property for ensemble models, which in turn enables improved selective prediction performance.

The values for $v_t = (t/T)^k$ seem a bit arbitrary.

We note that the weighting $v_t$ is chosen to reflect sample difficulty patterns as described by past work [1,2,3]. As we describe in Section 3.2, prior work has found that easy-to-optimize samples are learned early in training and converge faster. Our convex weighting is inspired by this insight as well as the convergence patterns derived in Figure 6. We have experimented with other weightings besides $v_t = (t/T)^k$ with $k \in [0,\infty)$ but found that our choice of $v_t$ performs best across our experimental panel. This weighting encourages us to be particularly sensitive to late disagreements while at the same time incorporating diversity from models across the full training spectrum. We have updated the submission with a new experiment shown in Figure 13 in which we also consider concave weightings $k \in (0,1]$ as well as a uniform weighting assigning the same weight to all checkpoints. It is evident from this experiment that our weighting choice performs better than other approaches.

References:

[1] Baldock, Robert, Hartmut Maennel, and Behnam Neyshabur. "Deep learning through the lens of example difficulty." Advances in Neural Information Processing Systems 34 (2021): 10876-10889.

[2] Zhang, Chiyuan, et al. "Understanding deep learning (still) requires rethinking generalization." Communications of the ACM 64.3 (2021): 107-115.

[3] Jiang, Ziheng, et al. "Characterizing structural regularities of labeled data in overparameterized models." arXiv preprint arXiv:2002.03206 (2020).

How would your method compare to using conformal-based selective prediction [2, section 5.5]?

The conformal-based selective prediction method in [2] is equivalent to softmax response (SR); they pick a threshold on $\hat{P}(x) = \max(\hat{f}(x))$. In particular, following typical conformal prediction methods, they use a calibration set to pick the threshold, while we (aligning with past selective prediction work) pick the threshold that gives a specific coverage/accuracy on the test set (so as to compare with other methods). In other words, the conformal prediction in [2] deals with how to pick the threshold for softmax response, while our work shows there is a metric that gives a better signal for correctness than SR. Future work could try to apply conformal prediction to our SPTD metric as a means to pick the threshold.
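To illustrate the distinction, here is a hedged sketch of picking a calibration-set threshold on the softmax-response score (our own simplified variant, without the finite-sample correction a rigorous conformal procedure would add; `calib_probs` is assumed to be an N x C array of softmax outputs and all names are ours):

```python
import numpy as np

def sr_threshold_from_calibration(calib_probs, calib_labels, alpha=0.05):
    """Pick the smallest softmax-response threshold whose selective error on a
    held-out calibration set is at most alpha; accept x if max prob >= threshold."""
    conf = calib_probs.max(axis=1)                 # P_hat(x) = max softmax probability
    correct = calib_probs.argmax(axis=1) == calib_labels
    for thr in np.sort(conf):                      # from most to least permissive
        accepted = conf >= thr
        if 1.0 - correct[accepted].mean() <= alpha:
            return thr
    return np.inf                                  # abstain on everything if target is unattainable

def accept(test_probs, threshold):
    return test_probs.max(axis=1) >= threshold
```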


We hope that we have addressed the reviewer’s concerns and that the reviewer considers raising their score as a result.

Comment

I thank the authors for their rebuttal. I appreciate the additional experiments. Overall, I do believe the method has its merits, being a simple approach offering good performance. I feel my score is adequate, but I raise my confidence.

Comment

We thank the reviewer for considering our rebuttal and for increasing their confidence in their positive score. Should the reviewer still have any additional concerns with respect to our work then we will be happy to address those!

Review
Rating: 3

The paper presents SPTD, a new method for Selective Prediction (SP) based on an ensemble approach using checkpoints from the training dynamics. Unlike several of the recent methods, the proposed approach works for several tasks including classification, regression, and time series. The paper presents a comparison between SPTD and different recent state-of-the-art methods.

Strengths

The points of strengths include:

1- The method works for several tasks including classification, regression, and time series.

2- The method seems to outperform the previous state-of-the-art selective classification methods.

3- Several experimental results presented

Weaknesses

The points of weaknesses include:

1- The proposed idea lacks novelty as it is very similar to using ensembles of models. The difference here is that the ensembles are generated on a fixed schedule from the training dynamics.

2- Checkpoints are chosen based on a fixed schedule which can correspond to models of bad performance. A better approach is to follow the approach from [Huang et al. 2017] which constructs an ensemble by choosing points of good performance using a cyclic learning rate. Using good checkpoints removes the need for the complicated weighted aggregation of the disagreement functions as all the snapshots are good models.

3- The proposed method has several hyperparameters including the number of checkpoints and the weights to calculate the selection function $g$ from the disagreement function $a$.

4- Several choices are not clear as described in the following section.

Huang, Gao, et al. "Snapshot ensembles: Train 1, get m for free." ICLR 2017.

Questions

1- How was $\tau$ chosen?

2- "Checkpoint each model after processing 50 mini-batches of size 128", how many checkpoints are chosen?

3- In Table 1, the name of the baseline is SAT but in the text, it is mentioned that the baseline is SAT with Entropy Regularization (ER) and Softmax Response (SR) Selection from [Feng et al. 2023]. Which one is the baseline? If the latter, then please update the table as SAT and SAT+ER+SR are 2 different methods.

4- Why not add SelectiveNet as a baseline for the regression task?

5- How were $g$ and $\tau$ chosen for the Deep Ensembles (DE)?

6- What is the intuition of SPTD performing better than DE for some coverages? It seems counter-intuitive as DE consists of high-performing models vs SPTD which has fixed checkpoints that do not have to be high-performing.

7- How are the disagreement function, $g$, and $\tau$ chosen for DE+SPTD?

Comment

We thank the reviewer for their feedback on our work and address individual concerns below:

The proposed idea lacks novelty as it is very similar to using ensembles of models. The difference here is that the ensembles are generated on a fixed schedule from the training dynamics.

See discussion for next point.

Checkpoints are chosen based on a fixed schedule which can correspond to models of bad performance. A better approach is to follow the approach from [Huang et al. 2017] which constructs an ensemble by choosing points of good performance using a cyclic learning rate. Using good checkpoints removes the need for the complicated weighted aggregation of the disagreement functions as all the snapshots are good models.

We thank the reviewer for alerting us to [Huang et al. 2017]! We remark that past work on ensembling (including [Huang et al. 2017]) considers averaging the predictions (and/or computing second moments) to get a better estimator of a model's prediction (and/or uncertainty). Instead, we investigate using checkpoints/ensembles as a means to check whether a data point follows temporal patterns typical of data points we should accept from an example-difficulty viewpoint, leading to novel aggregation schemes for ensembles. In other words, although [Huang et al. 2017] also uses training dynamics implicitly by constructing an averaged checkpoint ensemble, the goal in [Huang et al. 2017] is not selective prediction but boosting overall accuracy. Our approach focuses on selective prediction, which requires us to derive a different aggregation scheme than [Huang et al. 2017]. We also demonstrate the applicability of our approach beyond classification and its composability with other selective classification approaches, including composing on top of common ensembling techniques. Due to these key differences, we are confident that our work continues to be a valuable contribution to the selective prediction and uncertainty quantification communities. We have updated our related work section with a paragraph discussing this and other related training-dynamics-specific works.

The proposed method has several hyperparameters including the number of checkpoints and the weights to calculate the selection function $g$ from the disagreement function $a$.

We want to clarify that our method has the same number of hyper-parameters (checkpointing resolution, weighting parameter $k$) as other competing approaches. While SR is a hyper-parameter-free method, Deep Gamblers (reward, pre-training duration), SAT (pre-training duration, SAT momentum for the moving average), as well as Deep Ensembles (number of members, aggregation weighting) require tuning two hyper-parameters, just like SPTD.

Several choices are not clear as described in the following section.

We address these concerns below.

How was $\tau$ chosen?

Across all experiments, $\tau$ is chosen to achieve a desired targeted coverage level. This is done by first computing the selection score $g$ across all data points, ranking the data points based on $g$ (effectively sorting them in ascending order), and then picking $\tau$ such that the first c% of points are accepted. We don't report these values for our experiments as they are method-dependent (different methods have different selection score distributions and therefore different thresholding values) and typically less interpretable than a specific targeted coverage level (for example, “accept 90% of data points” is often easier to understand than “accept all data points with selection score smaller than 0.5”). Note that this procedure of reporting coverage instead of $\tau$ is the default way to evaluate selective prediction approaches [1,2,3,4].

References

[1] Feng, Leo, et al. "Towards Better Selective Classification." The Eleventh International Conference on Learning Representations. 2023.

[2] Huang, Lang, Chao Zhang, and Hongyang Zhang. "Self-adaptive training: beyond empirical risk minimization." Advances in neural information processing systems 33 (2020): 19365-19376.

[3] Liu, Ziyin, et al. "Deep gamblers: Learning to abstain with portfolio theory." Advances in Neural Information Processing Systems 32 (2019).

[4] Geifman, Yonatan, and Ran El-Yaniv. "Selectivenet: A deep neural network with an integrated reject option." International conference on machine learning. PMLR, 2019.

评论

"Checkpoint each model after processing 50 mini-batches of size 128", how many checkpoints are chosen?

Like $\tau$, the total number of checkpoints is also dataset-dependent and we report the exact values for the different datasets in Table 2 in the Appendix. Note that in the paper we first aim to derive a very detailed characterization of the training dynamics, leading to hundreds of checkpoints for each model, but then downsample from this full set of checkpoints to obtain a more practical algorithm that remains performant. We discuss this selection in our Checkpoint Selection Strategy paragraph on page 8. As discussed there, subsampling 10-25 checkpoints is enough for strong selective prediction on the datasets we consider. Moreover, when targeting the high-coverage regime (>50%) in particular, we find that SPTD provides SOTA selective classification performance across a wide range of checkpointing resolutions.
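As a purely illustrative sketch of such a subsampling step (not the exact selection code used in the paper), evenly spaced checkpoints can be drawn from the stored trajectory as follows:

```python
import numpy as np

def subsample_checkpoints(num_saved: int, num_keep: int) -> np.ndarray:
    """Return `num_keep` roughly evenly spaced checkpoint indices out of
    `num_saved` checkpoints recorded along the training trajectory."""
    idx = np.linspace(0, num_saved - 1, num=num_keep)
    return np.unique(np.round(idx).astype(int))

# Keep only 10 checkpoints out of a trajectory with 1600 saved models.
print(subsample_checkpoints(1600, 10))
```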

In Table 1, the name of the baseline is SAT but in the text, it is mentioned that the baseline is SAT with Entropy Regularization (ER) and Softmax Response (SR) Selection from [Feng et al. 2023]. Which one is the baseline? If the latter, then please update the table as SAT and SAT+ER+SR are 2 different methods.

We thank the reviewer for their attention to detail and have updated our draft with their suggestion.

Why not add SelectiveNet as a baseline for the regression task?

We have not benchmarked SelectiveNet on either the classification or the regression task since ensemble-based methods (which we do include in our comparison) dominate SelectiveNet in our classification experiments. We have now also run SelectiveNet (SN) on the regression task and observe in our updated Figure 7 that SelectiveNet outperforms parametric outputs but does not reach the selective utility levels obtained by DE or SPTD.

How were $g$ and $\tau$ chosen for the Deep Ensembles (DE)?

Consistent with the Deep Ensembles paper [1], we choose $g(x) = \frac{1}{M} \sum_{m \in [M]} \max f_m(x)$. Note that $f_m(x)$ corresponds to the full $C$-class logit output and $\max f_m(x)$ extracts the maximum logit value. $\tau$ is chosen as described above.
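A minimal sketch of this aggregation (array names are illustrative; `outputs[m, i]` is assumed to hold member $m$'s $C$-class scores for input $i$):

```python
import numpy as np

def deep_ensemble_score(outputs: np.ndarray) -> np.ndarray:
    """outputs: shape (M, N, C) with each member's C-class scores f_m(x).
    Returns one selection score g(x) per input: the average over the M
    members of the per-member maximum class score."""
    return outputs.max(axis=-1).mean(axis=0)

# M=10 members, N=5 inputs, C=100 classes (toy values).
outputs = np.random.rand(10, 5, 100)
print(deep_ensemble_score(outputs))
```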

References

[1] Lakshminarayanan, Balaji, Alexander Pritzel, and Charles Blundell. "Simple and scalable predictive uncertainty estimation using deep ensembles." Advances in neural information processing systems 30 (2017).

What is the intuition behind SPTD performing better than DE for some coverages? It seems counter-intuitive, as DE consists of high-performing models vs. SPTD, which uses fixed checkpoints that do not have to be high-performing.

Although we believe that future work should discuss this question in more detail, our current intuition for the strong performance of SPTD (and DE+SPTD) is based on increased diversity. Note that while model ensembling can be achieved in various different ways, many past works have found that a key ingredient to well-performing ensembling is sufficient diversity between ensemble members [1,2,3]. We do provide some empirical evidence for this connection in Figure 8 where we see that the decision boundaries considered by SPTD are significantly more diverse than the boundaries derived by DE or SAT.

References

[1] Kuncheva, Ludmila I., and Christopher J. Whitaker. "Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy." Machine learning 51 (2003): 181-207.

[2] Sollich, Peter, and Anders Krogh. "Learning with ensembles: How overfitting can be useful." Advances in neural information processing systems 8 (1995).

[3] Morishita, Terufumi, et al. "Rethinking Fano’s Inequality in Ensemble Learning." International Conference on Machine Learning. PMLR, 2022.

How are the disagreement function, $g$, and $\tau$ chosen for DE+SPTD?

For DE+SPTD, we first compute SPTD for multiple models $m \in [M]$ as follows: $g_{m}(x) = \sum_{t} v_t a_{m,t}(x)$. Note that $a_{m,t}(\cdot)$ now corresponds to the disagreement at time $t$ for model $m$. In a second step, we combine all SPTD scores into a single DE+SPTD score: $\frac{1}{M} \sum_{m \in [M]} g_{m}(x)$. This procedure is informally described in the Accuracy/Coverage Trade-off paragraph on page 7. Once again, $\tau$ is chosen as described above.
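A minimal sketch of this two-step aggregation (assuming `disagreements[m, t, i]` stores $a_{m,t}(x_i)$ and `weights[t]` stores $v_t$; the array layout is an assumption for illustration):

```python
import numpy as np

def de_sptd_score(disagreements: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """disagreements: shape (M, T, N) with a_{m,t}(x_i); weights: shape (T,).
    Step 1: per-member SPTD score g_m(x) = sum_t v_t * a_{m,t}(x).
    Step 2: average the per-member scores over the M ensemble members."""
    per_member = np.einsum("t,mtn->mn", weights, disagreements)  # (M, N)
    return per_member.mean(axis=0)                               # (N,)

M, T, N = 10, 25, 4
weights = (np.arange(1, T + 1) / T) ** 2      # v_t = (t/T)^k with k = 2
disagreements = np.random.rand(M, T, N)       # toy disagreement values
print(de_sptd_score(disagreements, weights))
```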


We hope that we have addressed the reviewer’s concerns and that the reviewer considers raising their score as a result.

评论

I would like to thank the authors for their detailed responses to all the comments and their additional experimental results. I still have concerns about the paper.

Regarding the General Response

Reason for better performance (especially compared to ensembles): Being an ensemble method, our work benefits from diversity between individual members. Preliminary experimentation confirms that our approach yields models that are significantly more diverse than fully converged models from Deep Ensembles.

I do not agree with the authors that SPTD is likely to have more diversity compared to DE. One missing point here is how the checkpoints are chosen. If there are 25 checkpoints, are those generated from the last 25 epochs of training? Or are those selected at random by subsampling the large number of checkpoints? If the former, then the models will be similar. If the latter, then there are no guarantees on the quality of the checkpoints. On the other hand, DE and Snapshot Ensembles (SE) [Huang et al. 2017] rely on visiting different parts of the search space by either using random restarts (for DE) or by increasing the learning rate close to local optima (for SE). This should yield more diverse models than SPTD that uses models that are close in the optimization landscape.

Choice of $v_t$: We note that our proposed convex weighting is inspired by results from example difficulty as well as our observed convergence patterns.

DE uses uniform weighting, so there is no need to choose either the functional form of the weights or the hyper-parameter $k$. This makes DE (and other ensemble methods) more practical to use.

Regarding the response to my questions:

the total number of checkpoints is also dataset-dependent and we report the exact values for the different datasets in Table 2 in the Appendix.

For SPTD, only $T$ and $k$ are reported, not the number of checkpoints. Could you please clarify how many checkpoints were used?

We want to clarify that our method operates with the same number of hyper-parameters

There is also the choice of the weighting function, which is arbitrary. The results shared in Figure 13 confirm the importance of the weighting function, which can be dataset-dependent.

Additional questions:

1- Why do all the methods have identical performance at 100% coverage, when the performance at 100% should only depend on the training method used and should not be affected by the selection criteria? For example, SR and SAT+SR+ER both score 77.6% accuracy for StanfordCars, although they should have different training mechanisms. [Feng et al. 2023, Table 3] reported a 5% performance gap at 100% coverage between SR and SAT+SR+ER.

2- Based on the question above, could you please explain how the SR baseline was implemented? Which training method, loss function, etc., was used?

3- For regression (Figure 7), could you please share the Mean Squared Error comparison?

References

  • Huang, Gao, et al. "Snapshot ensembles: Train 1, get m for free." ICLR 2017.
  • Feng, Leo, et al. "Towards Better Selective Classification." The Eleventh International Conference on Learning Representations. 2023.
评论

We thank the reviewer for considering our rebuttal and for sharing their additional concerns with us. We address these below.

I do not agree with the authors that SPTD is likely to have more diversity compared to DE. [...]

Our models are picked over the full optimization trajectory and not just the final checkpoints (this is described in the Checkpoint Selection Strategy paragraph, where we mention subsampling). On the issue of this not guaranteeing quality, we want to emphasize that our method shows it is still useful to have a diverse ensemble of not-necessarily-performant models when we can leverage temporal relations. That is, while the usual intuition for ensembles is that one wants many different performant models to estimate uncertainty in decision boundaries, we show that an effective method for selective classification is to use less performant models with the knowledge that, over time, the sequence of predictions on any given input should converge for correct points. This convergence then gives us another estimate of the uncertainty in decision boundaries.

Moreover, our experimental results in Figure 8 show that the decision boundaries for DE are more aligned than the ones yielded by SPTD (which exhibit significantly more non-linear decision boundaries), giving the intuition that our approach of using temporal patterns leads to more diverse uncertainty estimates (which are still high quality given our selective classification results). We believe that the theoretical underpinnings for the effectiveness of our method, as well as the precise role that model diversity plays for SC, should be further explored in future work.

DE uses uniform weighting, so there is no need to choose either the functional form of the weights or the hyper-parameter [...]

We remark (and further clarify below) that our weighting is robust to the chosen task (see Figures 12 and 13). As such, the tuning effort for SPTD is of minor concern as setting $k \in [1,3]$ delivers strong performance across a wide range of experimental setups.

For SPTD, only $T$ and $k$ are reported, not the number of checkpoints. Could you please clarify how many checkpoints were used?

Our main results use the full training trajectory, i.e. the number of checkpoints reported in Table 2. As forward-propagating through > 1k checkpoints might be prohibitive, we show in the Checkpoint Selection Strategy paragraph on page 8 that subsampling 50, 25, or even 10 checkpoints still leads to strong selective classification. Especially at high targeted coverage (>50%), the performance across all checkpointing resolutions is essentially indistinguishable (see Figure 5). This insight allows us to reduce the computational overhead of our method to the same cost as Deep Ensembles at test time (10 forward passes) while having a considerably less expensive training stage.

There is also the choice of the weighting function, which is arbitrary. [...]

We clarify that the choice of weighting is not arbitrary but, as we explain in Section 3.2, informed by example-difficulty patterns presented in prior work (as well as our convergence plots from Figure 6). Using this particular weighting, we only observe negligible differences for the particular choices of $k \in [1,3]$, as we show in Figure 12 and Figure 13. Figure 12 shows that, across datasets, the performance is comparable for any $k \in [1,3]$. We expand on this in more detail in Figure 13, showing that for CIFAR-10 the largest possible AUC deviation within the $[1,3]$ interval is negligibly small (rightmost subplot; note the y-axis scale). Hence, our experiments lead us to conclude that there is only a weak dataset dependence in how one picks the parameter $k$ in our weighting function.
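For reference, a toy sketch of the weight profiles $(t/T)^k$ for the values of $k$ discussed here (values are illustrative only; it is the downstream accuracy/coverage results in Figures 12 and 13, not these raw profiles, that demonstrate the robustness):

```python
import numpy as np

T = 25
t = np.arange(1, T + 1)
for k in (1, 2, 3):
    v = (t / T) ** k
    v /= v.sum()   # normalize so the profiles are directly comparable
    # All three profiles place most of the mass on late checkpoints,
    # which is what the convergence-based argument relies on.
    print(f"k={k}, weight on last 5 checkpoints: {v[-5:].round(3)}")
```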

Why do all the methods have identical performance at 100% coverage [...]

This is due to us explicitly aligning the base accuracy of all models. Many past works, including [Feng et al. 2023], do not explicitly accuracy-align all models at full coverage. Not aligning full-coverage accuracy can give better generalizing models an unfair head-start in the accuracy vs. coverage curve. As a result, this can overestimate the method’s effectiveness at selective classification. We explicitly make sure to compare all methods on an equal footing by disentangling selective prediction performance from gains in overall utility. This is done by early stopping model training when the accuracy of the worst performing model is reached. We note the presence of accuracy alignment in the caption of Table 1.

评论

Based on the question above, could you please explain how the SR baseline was implemented? [...]

Our implementation of softmax response, as outlined in the first paragraph of the related work section, is based on [1]: a threshold $\tau$ is applied to the maximum response of the softmax layer, $\max_{y \in Y} f(y|x)$. This simply corresponds to standard training of a default ResNet-18 architecture with SGD and a cross-entropy loss, followed by thresholding the maximum softmax value. See Datasets & Training in Section 4.1 for more details.

[1] Geifman, Yonatan, and Ran El-Yaniv. "Selective classification for deep neural networks." Advances in neural information processing systems 30 (2017).
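For illustration, a minimal sketch of such a max-softmax acceptance rule (generic NumPy code, not the exact training script used in the paper):

```python
import numpy as np

def softmax_response(logits: np.ndarray, tau: float):
    """logits: shape (N, C) from a standard classifier.
    Accept an input whenever the maximum softmax probability exceeds tau."""
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = exp / exp.sum(axis=-1, keepdims=True)
    confidence = probs.max(axis=-1)          # max_y f(y|x)
    return probs.argmax(axis=-1), confidence >= tau

logits = np.random.randn(5, 10)              # toy logits, N=5, C=10
preds, accept = softmax_response(logits, tau=0.5)
print(preds, accept)
```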

For regression (Figure 7), could you please share the Mean Squared Error comparison?

We remark that $R^2(y,\hat{y}) = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y_i})^{2}}{\sum_{i=1}^{n} (y_i - \bar{y})^{2}}$ is a normalized version of $MSE(y,\hat{y}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y_i})^{2}$. A selective classification approach that yields a higher (lower) $R^2$ score necessarily yields a lower (higher) MSE. As a result, the ordering of SC methods as well as the relative magnitudes are consistent regardless of whether (R)MSE or $R^2$ is used as a performance metric. Hence, we are confident that MSE results would not provide additional evidence that is not already captured by our current $R^2$ results. If the reviewer has additional reasons for requesting an MSE comparison, we would be happy to run the experiment.
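This relationship can be verified with a short numeric sketch (toy values, not data from the paper):

```python
import numpy as np

y     = np.array([3.0, 1.5, 2.0, 4.5])   # toy targets
y_hat = np.array([2.8, 1.7, 2.4, 4.0])   # toy predictions

mse = np.mean((y - y_hat) ** 2)
r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

# On a fixed evaluation set, R^2 = 1 - MSE / Var(y): ranking methods by
# higher R^2 is therefore equivalent to ranking them by lower MSE.
print(mse, r2, 1.0 - mse / np.var(y))
```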


We hope that this addresses the reviewer’s concerns but we are happy to provide further clarification if needed.

评论

Thank you for the clarification. I still have concerns regarding:

Our main results use the full training trajectory, i.e. the number of checkpoints reported in Table 2. As forward-propagating through

If I understand correctly, $T$ is the number of checkpoints. Does this mean that you used 1600 checkpoints for CIFAR, 2200 for Food101, and 800 for StanfordCars for the results in Table 1? If yes, then this makes SPTD very expensive to run. Moreover, it gives SPTD an unfair advantage over DE, which only uses 10 models. Also, the gains over SAT+ER+SR are small, so if such a large number of checkpoints is required, then it will be hard to use SPTD in practice. I understand that Figure 5 shows the effect of the number of checkpoints, and there are some small differences in performance for CIFAR and StanfordCars, so why report the results in Table 1 with a very large number of checkpoints?

This is due to us explicitly aligning the base accuracy of all models. [......]

This is done by early stopping model training when the accuracy of the worst performing model is reached.

I am very concerned about this point. In practice, the goal is to use the best possible model. Training all the methods until the best possible accuracy is achieved will give a clearer picture of the performance. If the concern is that SAT+ER+SR provides a strong baseline at 100% coverage, this should not hurt SPTD in any way, as SPTD can use the training checkpoints from SAT+ER+SR for the selection process on the best final model achieved by SAT+ER+SR. Similarly, if DE has very strong performance, we should not evaluate it at a lower-performing point.

评论

We appreciate the reviewer’s active participation during the discussion phase and address their additional concerns below.

If I understand correctly, $T$ is the number of checkpoints. [...]

We understand and agree with the reviewer’s concern that forward-propagating through hundreds of models is expensive. As the reviewer rightfully points out, we are indeed able to reduce our method’s footprint to the same inference-time cost as Deep Ensembles while incurring only a minor cost in terms of selective prediction performance. Our original reason for reporting the larger number of checkpoints was to follow the narrative of first presenting the most “complete” version of the method, and then presenting how to simplify it. We agree with the reviewer that this leads to unnecessary confusion and, as a result, we will update Table 1 with a reduced number of checkpoints comparable to DE. However, we respectfully disagree with the reviewer that a more comprehensive checkpoint trajectory gives SPTD an unfair advantage over Deep Ensembles. After all, Deep Ensembles trains many distinct models, which amounts to a considerably more expensive training stage (10x the cost of SPTD).

I am very concerned about this point. In practice, the goal is to use the best possible model. [...]

We expand on the important point that many previous approaches conflate both (i) generalization performance and (ii) selective prediction performance into a single score: the area under the accuracy/coverage curve. This metric can be maximized either by improving generalization performance (choosing different architectures or model classes) or by actually improving the ranking of points for selective prediction (accepting correct points first and incorrect ones last). As raised by a variety of recent works [1,2,3], it is not possible to fairly assess whether a method performs better at selective prediction (i.e., determining the correct acceptance ordering) without normalizing for these accuracy differences, which arise as a side effect of the various SC methods. In other words, an SC method with lower base accuracy (smaller correct set) can still outperform another SC method with higher accuracy (larger correct set) in terms of the selective acceptance ordering (an example of which is given in [4, Table 3]). Accuracy normalization allows us to eliminate these confounding effects between full-coverage utility and selective prediction performance by identifying which models are better at ranking correct points first and incorrect ones last. This is of particular importance when comparing selective prediction methods which change the training pipeline in different ways, as is done for the methods presented in Table 1.

Nevertheless, as pointed out by the reviewer, when just comparing SPTD to one other method, we do not need to worry about accuracy normalization. In this direction, we thank the reviewer for their relevant suggestion on running SPTD on top of an unnormalized SAT+ER+SR run and provide these experiments in Figure 11. We see that the application of SPTD on top of SAT+ER+SR allows us to further boost performance (similar to the results where we apply SPTD on top of DE in Table 1). So to conclude, experimentally, when using the best model, we see that SPTD still performs better at selective prediction than the relevant baseline for that training pipeline. We wish to reiterate that this issue of accuracy normalization highlights another merit of SPTD, which is that it can easily be applied on top of any training pipeline (including those that lead to the best model) and allows easy comparison to the selective classification method that training pipeline was intended to be deployed with.

References

[1] Geifman, Yonatan, Guy Uziel, and Ran El-Yaniv. "Bias-reduced uncertainty estimation for deep neural classifiers." ICLR 2019.

[2] Rabanser, Stephan, et al. "Training Private Models That Know What They Don't Know." NeurIPS 2023.

[3] Cattelan, Luis Filipe, and Danilo Silva. “How to fix a broken confidence estimator: Evaluating post-hoc methods for selective classification with deep neural networks”, arXiv preprint arXiv:2305.15508.

[4] Liu, Ziyin, et al. "Deep gamblers: Learning to abstain with portfolio theory." NeurIPS 2019.


We hope that this addresses the reviewer’s concerns but we are happy to provide further clarification if needed.

评论

I would like to thank the authors for their detailed clarification.

The difference in performance between SAT+SR+ER and SPTD in Table 1 ranges from 0.2% to 1.5% across different datasets. On the other hand, in Figure 5, the difference in performance between SPTD Full and SPTD-10 can reach 1%. This is why the evaluation in Table 1 does not provide the full picture regarding the performance of SPTD vs. the other, cheaper baselines.

Although the idea is interesting and has potential, and the authors have put effort into running several experiments, I believe the paper requires more work on the experimental analysis to fully understand the benefits that SPTD brings and the costs required.

My main concerns are:

1- The large number of checkpoints (which requires multiple forward passes as well as storing all the model weights) used in Table 1 is not justified by the small improvements achieved compared to the other baselines.

2- The accuracy normalization issue, which does not provide a clear understanding of the best-performing method. In practice, if we want to deploy a selective model, we will be looking for the model that achieves the best performance for a specific coverage, not the model that achieves the best selective classification performance relative to 100% coverage.

Consequently, I am lowering my score.

评论

We thank all reviewers for their assessment of our work and have performed a number of changes to improve our paper as a result of their feedback. First and foremost, we were glad to see reviewers acknowledge the simplicity and flexibility of our method, clarity of our presentation, as well as the comprehensiveness of our experiments. As part of our rebuttal, we have addressed the following key points of criticism in particular (details in individual responses):

  • Novelty of our method: We clarify that our method is novel in terms of its application to the selective prediction problem, the precise choice of aggregation of individual members, its applicability beyond classification, as well as its composability with existing selective prediction approaches. We have acknowledged prior and related work in a new paragraph in our background section and have added an experimental comparison to the logit variance approach raised by reviewer jRFU.
  • Reason for better performance (especially compared to ensembles): Being an ensemble method, our work benefits from diversity between individual members. Preliminary experimentation confirms that our approach yields models that are significantly more diverse than fully converged models from Deep Ensembles.
  • Choice of $\tau$: We clarify that, as is typical in selective prediction works, we do not report $\tau$ directly but instead compute the accuracy/coverage tradeoff (which in turn sweeps over all possible choices of $\tau$).
  • Choice of $v_t$: We note that our proposed convex weighting is inspired by results from example difficulty as well as our observed convergence patterns. We have added new results on alternate weightings that showcase that our current choice of weighting enables strong selective prediction.
  • Cost of our method: The inference-time complexity (both space and time) of our method is comparable to Deep Ensembles while having a significantly leaner training stage.

Updated PDF: We have updated our submission PDF and have color-coded our changes for easy inspection. We highlight shortened passages in orange, modifications/fixes in red, and new additions in blue.

We are happy to further engage with reviewers as part of the discussion phase and hope that the reviewers consider raising their scores.

AC 元评审

This paper studied the problem of selective prediction where the learned model is allowed to abstain from making prediction. The key idea behind the proposed approach is to keep a set of checkpoint models during training and compute a weighted average of the prediction discrepancy between checkpoint and final models.

The reviewers found the idea interesting, but raised a number of questions around when/why this method is expected to work well. There was a lot of discussion, but the reviewers were not convinced by some of the responses. Specifically, the two outstanding concerns are:

  1. The large number of checkpoints (which requires multiple forward passes as well as storing all the model weights) used in Table 1 is not justified by the small improvements achieved compared to the other baselines.
  2. The accuracy normalization issue, which does not provide a clear understanding of the best-performing method. In practice, if we want to deploy a selective model, we will be looking for the model that achieves the best performance for a specific coverage, not the model that achieves the best selective classification performance relative to 100% coverage.

Therefore, I recommend rejecting the paper and strongly encourage the authors to revise the paper based on the review comments for re-submission.

为何不给更高分

Significant weaknesses as mentioned in the meta review.

为何不给更低分

N/A

最终决定

Reject