$\nu$-ensembles: Improving deep ensemble calibration in the small data regime
We use unlabeled data to improve deep ensemble diversity and calibration for small to medium-sized training sets.
Abstract
Reviews and Discussion
This paper introduces ν-ensembles, a novel deep ensemble algorithm that achieves both efficiency and conceptual simplicity. When presented with an unlabeled dataset, ν-ensembles generate a distinct labeling of it for each ensemble member; each member then fits both the training data and its randomly labeled data.
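For concreteness, a minimal sketch of the label-assignment step as described above (function and variable names are ours, not from the paper; labels are drawn without replacement across ensemble members for each unlabeled point):

```python
import numpy as np

def assign_random_labels(n_unlabeled, n_members, n_classes, seed=0):
    """For each unlabeled point, draw one distinct pseudolabel per ensemble member
    (sampling without replacement across members, as described in the summary)."""
    rng = np.random.default_rng(seed)
    # labels[i, m] is the pseudolabel of unlabeled point i for ensemble member m.
    return np.stack([rng.choice(n_classes, size=n_members, replace=False)
                     for _ in range(n_unlabeled)])

# Example: 5 unlabeled points, 3 ensemble members, 10 classes.
pseudo = assign_random_labels(n_unlabeled=5, n_members=3, n_classes=10)
# Member m would then be trained on the labeled set plus (unlabeled data, pseudo[:, m]).
print(pseudo)
```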
Strengths
The strength of ν-ensembles lies in their ability to enhance deep ensemble diversity and calibration without significantly increasing computational demands. Key strengths include improved calibration in both in-distribution and out-of-distribution settings, achieved without complex implementation or extensive hyperparameter tuning. This method maintains the efficiency of standard deep ensembles, ensuring diversity through a straightforward process of assigning random labels to unlabeled data points. The theoretical grounding via PAC-Bayesian analysis provides a guarantee of diversity, accuracy, and calibration on test data, making ν-ensembles a promising and efficient technique for enhancing deep neural network ensembles.
Weaknesses
- The paper lacks a discussion of related work on other calibration methods, such as train-time calibration losses and post-hoc calibration, which are very important in this domain.
- In my experience, the ECE measurement can be very unstable when classification accuracy is low. For the CIFAR-100 experiments in Table 1, the accuracy is very low, and the results may not be reliable.
- The experiments lack comparisons with SOTA methods such as Focal Loss Calibration and Adaptive Label Smoothing.
Questions
In Table 1, how many times did the authors run the experiments? Since the ECE measurement can be very unstable for low-accuracy models, the ECE reported in the table can have very large variance. Please report the variance over multiple runs to verify the effectiveness of your method.
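For reference, a standard binned ECE and its spread over seeds can be computed along the following lines (a generic sketch; the authors' exact evaluation code may differ):

```python
import numpy as np

def ece(probs, labels, n_bins=15):
    """Expected calibration error: binned |accuracy - confidence| weighted by bin mass."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            total += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return total

# Reporting mean and standard deviation over seeds would then look like:
#   eces = [ece(probs_of_seed[s], labels_of_seed[s]) for s in seeds]
#   print(np.mean(eces), np.std(eces))
```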
The experiments are limited to the CIFAR datasets. Since the authors mention that the small-dataset regime often arises in the medical domain, it would be better to verify the algorithm on small medical datasets.
We would like to thank the reviewer for their detailed and careful review. We will respond in detail in the upcoming days. For the moment, could you please provide references for the "Focal Loss Calibration" and "Adaptive Label Smoothing" methods, as well as the medical datasets you are referring to? (Ideally ones available through tensorflow datasets or torchvision, which would be very convenient.)
Sure.
Focal Loss Calibration: Mukhoti, Jishnu, et al. "Calibrating deep neural networks using focal loss." Advances in Neural Information Processing Systems 33 (2020): 15288-15299.
Dual Focal Loss: Tao, Linwei, Minjing Dong, and Chang Xu. "Dual Focal Loss for Calibration." arXiv preprint arXiv:2305.13665 (2023).
Adaptive Label Smoothing: Ghosh, Arindam, Thomas Schaaf, and Matthew Gormley. "AdaFocal: Calibration-aware Adaptive Focal Loss." Advances in Neural Information Processing Systems 35 (2022): 1583-1595.
Medical datasets: I do not work in the medical area, but I believe some small medical datasets exist. My apologies if none are available.
Thank you for your detailed and useful review!
Addressing the weaknesses:
We would first like to point out that our method is a diversity-promoting method for ensembles, not a regularization method or a new loss function for single networks. Thus, in principle, it can be combined with a number of calibration-improving methods for single networks, such as the focal loss, augmentation, and label smoothing. Indeed, in the original submission we included standard augmentation (flips + crops), which improves ν-ensembles.
We read the mentioned related works in detail. To the best of our understanding, the primary contributor to improved calibration for single networks is temperature scaling. For example, in the most recent focal loss variant (dual focal loss) [1], temperature scaling results in large improvements over standard evaluation (~10-20%), whereas the dual focal loss itself only gives small additional improvements (~1%). We thus implemented temperature scaling, which, combined with ν-ensembles, yields the best-calibrated ensemble in most cases.
Addressing the reviewer questions:
We provide additional experiments with multiple seeds in the Appendix. The results remain largely unchanged. We also note the following: 1) when the training set is small, our method also outperforms on other calibration metrics such as TACE, the Brier score, and the NLL, with TACE in particular considered a more robust variant of ECE; 2) one can observe in Figure 2 that we continue to get relatively robust gains not only in the small-training-data regime but also at moderate to large training set sizes (10000 to 40000), where the accuracy is within an acceptable range. Please note that using multiple seeds for all experiments is prohibitive, even though the initial setup uses a small training set.
Unfortunately, we are not aware of medical datasets that would be relevant to our setting. We thus evaluate our approach on two additional standard image classification datasets, STL-10 and SVHN. We include these experiments in the Appendix.
We would be happy to provide any additional clarifications.
Tao, Linwei, Minjing Dong, and Chang Xu. "Dual Focal Loss for Calibration." arXiv preprint arXiv:2305.13665 (2023).
This paper introduces an ensembling technique for making use of unlabeled data in multi-class classification. Namely, the authors suggest training an ensemble of models, each of which sees a different (randomly selected without replacement) label for each unlabeled data point. In this way, at least one model is guaranteed to have trained on a correct data point (since we exhaust all labels). The authors show that this approach can have benefits with respect to calibration metrics such as ECE when compared to other ensembling approaches on small-scale datasets.
Strengths
- Originality: Although the proposed method is quite simple, to the best of my knowledge I have not seen a similar approach analyzed empirically or theoretically in the literature on ensembling.
- Quality: The paper motivates the proposed method with a simple example and attempts to provide theoretical justification with a PAC-Bayes bound and related analysis. The algorithm and associated experiments in the paper are described well, but I have reservations about the quality of experiments as detailed in weaknesses below.
- Clarity: The experiments in the paper are easy to follow, but the theoretical aspect of the work is not as clear.
- Significance: Improving ensembles is an important problem, and the idea of diversifying ensembles has received much attention over the past few years. As such, the paper considers a significant problem, but I question the progress made on this problem by the proposed method.
Weaknesses
Main Weaknesses
-
Insufficient experimental setup for proposed method. The authors claim that for small-scale datasets their method preserves the performance boost of standard ensembling but results in better calibration, while maintaining the same level of efficiency (as opposed to joint training methods that are compared to). However, this comparison seems incomplete - firstly, my understanding is that the compared-to ensembling approaches do not make use of the additional unlabeled data (at least standard ensembling does not). In Table 1 (the main table in the paper) the results are with respect to a training size of 1000 data points, but the unlabeled data size and validation size are 5000 points each. As a result, this comparison seems unfair - one should at least consider some other pseudo-labeling scheme for the unlabeled data, since it makes up the majority of the data being considered.
Additionally, even this part aside, the authors should compare the method to training ensembles with some kind of data augmentation (label smoothing [1], Mixup [2], etc.), since these methods are not only known to improve feature learning diversity but also regularize predicted confidences. Furthermore, training with these methods is going to be even more efficient than the proposed approach, and I expect they would perform better. My reasons for expecting this are two-fold: firstly, the proposed approach intuitively regularizes confidence by having the ensemble uncertainty be high on the unlabeled data (essentially these points should be predicted uniformly at random based on how the ensembles are trained), but Mixup and label smoothing can do this as well. Additionally, and more importantly, the authors themselves note that their approach does not work (and can even hurt) for larger dataset sizes, but the aforementioned data augmentations are known to improve calibration even in that regime.
-
Theoretical approach needs greater clarity. The theory here needs significantly more clarification in my view. For example, the authors define the ensemble posterior to be a uniform combination of point masses on different weights (and even here the notation should be made more precise, since it is not defined a priori), and then claim that a certain quantity is the empirical variance of the ensemble, but that is not what it represents in Equation (2), which is the variance of the predictive distribution of the ensemble when evaluated with respect to the true labels of the data points in U. Furthermore, for the predictive distribution the authors use in Equation (2), it should be clarified at that point that the label corresponds to the true label of the data point (which the authors mention later). More importantly, the proof of Proposition 1 is very hard to make sense of. What is the indicator variable of the true label not being in the random labels? Aren't the random labels supposed to be exhaustive? Even ignoring this, how does the second term become zero when passing from the first line to the second line?
Recommendation
Overall I do not think the merits of the proposed approach are significant enough to warrant acceptance, so my recommendation is reject. It is possible I misunderstood some aspects of the theory and I am happy to correct some of my statements here upon author clarification, but I feel even with that the authors would need more comprehensive experimental comparisons to emphasize the usefulness of the approach.
Questions
My main questions are stated above as part of weaknesses.
We would like to thank the reviewer for their detailed and careful review. We will respond in detail in the upcoming days. For the moment, we kindly ask that you provide the references for label smoothing [1] and Mixup [2] so that we can formulate our response more accurately (we understand that there are a number of works on both topics).
My apologies for the delayed response and leaving out the references in the original review. They are below, along with some additional related references.
[1] When Does Label Smoothing Help? (https://arxiv.org/abs/1906.02629)
[2] mixup: Beyond Empirical Risk Minimization (https://arxiv.org/abs/1710.09412)
[3] On Mixup Training: Improved Calibration and Predictive Uncertainty for Deep Neural Networks (https://arxiv.org/abs/1905.11001)
[4] When and How Mixup Improves Calibration (https://arxiv.org/abs/2102.06289)
Thank you for your detailed and helpful review!
Addressing the weaknesses:
- Insufficient experimental setup for proposed method.
Please note that the agree-to-disagree method does use unlabeled data; however, it has both lower accuracy and worse calibration compared to our approach. We discuss the use of pseudo-labelling in the related works section. We do not dispute that pseudo-labelling could improve standard ensembles, perhaps significantly. However, this is natural, as our method is extremely simple and makes minimal assumptions about the unlabeled data, namely only that it has been generated i.i.d. from the source distribution. Pseudo-labelling, as implemented in [1] for example, requires multiple rounds of pseudo-labelling where a small percentage of the unlabelled data is labelled at a time. To obtain the best final performance, a final distilled network needs to be trained. This makes the implementation quite involved and introduces many hyperparameters. By contrast, we propose a method that is almost as simple as temperature scaling or data augmentation, with a single hyperparameter.
We emphasize that our approach is not a regularization method for individual networks but a diversity-promoting method for ensembles. Thus, in principle, it can be combined with any of the methods proposed by the reviewer (mixup, focal loss, temperature scaling). In particular, we already apply standard data augmentation (flips + crops), which benefits ν-ensembles. Finally, we include in the revised version experiments on temperature scaling + ν-ensembles, which achieve the best ECE and other calibration metrics compared to the alternatives in most cases.
- Theoretical approach needs greater clarity.
To the best of our knowledge, this quantity is defined as the variance of the predictive distribution in our text. If there is some point where we erroneously define it otherwise, we would be happy to correct this.
To the best of our knowledge, it is fairly common to overload the notation for y and similar objects such that they represent both variables and their instantiations with specific values [2]. We have moved the proof to the Appendix, where we spell out every proof step in much more detail.
We would like to clarify that for each data point the pseudolabels we create are exhaustive up to the number of ensemble members. Thus, if we have 3 ensemble members and 10 classes, we create K=3 unique labels. The more ensemble members we have, the more unique labels we have and the larger our variance term becomes. This both makes intuitive sense and is the reason the variable K appears in Proposition 1.
If the true label y is not among the K randomized labels, then all ensemble members assign it zero probability, i.e. p(y|x, f_i) = 0 for every member f_i. Intuitively, we have assumed that each ensemble member perfectly fits its randomized label. Then, since no ensemble member fits the true label, they all have p(y|x, f_i) = 0.
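As a small numeric illustration of this argument (our own sketch, under the idealized assumption that each member puts probability 1 on its randomized label): with K members and labels drawn without replacement, the ensemble assigns probability 1/K to the true label if it was among the sampled labels, and 0 otherwise.

```python
import numpy as np

n_classes, K = 10, 3                 # 10 classes, K = 3 ensemble members
true_label = 3
rng = np.random.default_rng(0)
sampled = rng.choice(n_classes, size=K, replace=False)   # K distinct randomized labels

# Idealized members: each one-hot on its own randomized label, shape (K, n_classes).
member_probs = np.eye(n_classes)[sampled]
ensemble_prob_true = member_probs[:, true_label].mean()

# ensemble_prob_true is 1/K if true_label is among `sampled`, and 0 otherwise.
print(sampled, ensemble_prob_true)
```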
We would be happy to provide any additional clarifications.
[1]: Saachi Jain, Dimitris Tsipras, and Aleksander Madry. Combining diverse feature priors. In International Conference on Machine Learning, pp. 9802–9832. PMLR, 2022.
[2]: Bishop, Christopher M., and Nasser M. Nasrabadi. Pattern Recognition and Machine Learning. Vol. 4. No. 4. New York: Springer, 2006.
Thank you for the clarifications; Proposition 1 is easier to follow now. I have some follow-up questions regarding the experiments: for the new temperature scaling results, why does standard + temperature scaling (TS) have different accuracy than standard alone? TS should not affect the accuracy. Additionally, I'm still concerned regarding the relative improvement (which seems minor when it occurs) of the proposed method over just standard + TS, which does not even use unlabeled data if I understand correctly.
The slight difference is due to a mismatch between the seeds we used. We will correct these negligible differences either by the end of the discussion or in the final version, after we receive feedback from the other reviewers.
We would like to point out that our improvements are in line with other SOTA results in the literature. We copy here the results from the recent "Dual Focal Loss for Calibration" (ICML 2023), page 6:
ECE, CIFAR-100, ResNet-110
| Tempering | Weight Decay | Dual Focal |
|---|---|---|
| Yes | 0.0443 | 0.029 |
(After a careful look at the literature, the improvements reported in these papers are similar for label smoothing, focal loss, and different architectures when tempering is activated.)
At the same time, our method is not a regularization method for individual networks but a diversity-inducing method for ensembles (as exemplified by our analysis of sampling with versus without replacement). As such, it can in principle be combined with any number of calibration-improving techniques for individual networks. We primarily compare against other diversity-inducing techniques and outperform them.
Finally, our method has a crucial advantage over the calibration-improving methods mentioned here, including tempering: in the simplest case we need access neither to the training objective nor to the pre-activation logits. These are necessary for the focal loss and temperature scaling, for example, but can be cumbersome to access or modify when using standard ML libraries such as scikit-learn, among others.
The paper proposes a very neat method for improving the diversity of deep ensembles: It assigns random labels to a set of unlabelled data and lets each ensemble component fit different random labels such that these ensemble components can be diverse. The paper further provides theoretical guarantees for the resulting ensembles' behavior on test samples. The empirical results further show that the method acquires significantly better calibration on small training dataset regime, without sacrificing accuracy. Importantly, the method only introduces little extra training overhead while outperforming baseline approaches that are way more complicated. Overall, I think the proposed idea is novel, interesting, easy-to-use, and could be of great impact.
Strengths
-
The proposed method is easy! It is much easier and more efficient to implement than other methods for enhancing ensemble diversity, such as Stein-based methods.
-
The proposed method comes with theoretical guarantees: although the method sounds like a heuristic, the authors provide PAC-Bayes bounds for its performance on test data.
-
The empirical performance improvement is significant: The results show that the proposed method improves the calibration error to a great extent for both in-distribution test data and out-of-distribution data (i.e. corrupted data), without hurting the accuracy.
Weaknesses
-
The method "Sample y randomly without replacement", however, when the number of ensemble is larger than the number of classes, it is unclear to me how the method should be applied.
-
Since the method assumes having access to a validation dataset, a baseline worth considering would be temperature scaling.
-
The presentation of the results can be improved: there is no legend for the lines in Figure 2, and the usage of bold font in Table 1 is inconsistent and confusing.
Questions
Why does the method become less effective when we have access to more data?
If I understand correctly, the method assigns random labels to in-distribution data. This seems strange to me, as it implies that the ensemble would have high uncertainty on these in-distribution samples. One could also consider introducing OOD samples into training and assigning random labels to them for each ensemble member.
Thank you for your detailed and useful review!
Addressing the weaknesses:
When the number of ensemble members is larger than the number of classes, we can sample the labels randomly with replacement. We provide a theoretical analysis of this setting in Proposition 2. Our analysis shows that the variance of the ensemble should continue to grow (albeit slowly). We note that, to the best of our knowledge, settings with more than 5-10 ensemble members are rare in practice.
We include experiments using temperature scaling in the revision. The combination of ν-ensembles + temperature scaling is the best in terms of ECE and other calibration metrics in most experiments. We also note that ν-ensembles are almost as simple to implement as temperature scaling, and can also be used in settings where applying standard temperature scaling is difficult, for example when using machine learning libraries that make modifying models difficult and/or output only the softmax activations instead of the logit pre-activations (as is the case with some scikit-learn classification models).
Addressing individual questions:
"Why does the method become less effective when we have access to more data?"
We make very simple assumptions about our unlabeled data. When we have little training data, the unlabeled data likely cover parts of the input space that the training data do not. We push our predictors to be uncertain on these parts of the input space. As the training dataset grows, training samples start "hitting" parts of the input space covered by unlabeled samples, and it no longer makes sense to assume that we know nothing about those regions. Our regularization parameter \beta captures this tradeoff and becomes close to 0 when we encounter this problem. We also include Figure 1 in the Appendix, which uses a toy example to illustrate these tradeoffs.
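For reference, the kind of per-member objective we have in mind looks roughly as follows (a schematic sketch with made-up names; the exact loss and weighting in the paper may differ):

```python
import numpy as np

def member_loss(loss_train, loss_random, beta):
    """Schematic combined objective for one ensemble member: average loss on the
    labeled training set plus beta times the loss on its randomized labeling of
    the unlabeled set. beta = 0 recovers a standard deep ensemble member."""
    return np.mean(loss_train) + beta * np.mean(loss_random)

# e.g. with per-example cross-entropy losses computed by any framework:
print(member_loss(loss_train=np.array([0.2, 0.4]),
                  loss_random=np.array([2.3, 2.3, 2.3]),
                  beta=0.5))
```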
"I think one can also consider introducing OOD samples into training."
Ensembles tend to work well on in-distribution data, while on OOD data other methods such as weight ensembles and model soups tend to work best. Having high uncertainty on in-distribution parts of the input space is actually very natural. We are primarily motivated by the intuition provided by Gaussian processes, which provide relatively high uncertainty on in-distribution parts of the input space that are nevertheless far from training samples.
We have tried to incorporate the rest of the reviewer's suggestions in our revision.
We also note that ν-ensembles are almost as simple to implement as temperature scaling, and can also be used in settings where applying standard temperature scaling is difficult, for example when using machine learning libraries that make modifying models difficult and/or output only the softmax activations instead of the logit pre-activations.
This sounds very appealing! I recommend the authors include this argument in the revisions.
I maintain my score after reading the response; I am leaning towards acceptance but am open to the other reviewers' opinions.
The authors present a method for improving the calibration of deep neural network ensembles in the small data regime when access to an unlabelled data set is assumed. In particular, they propose the counterintuitive idea of randomly labelling the unlabelled dataset (distinctly for each ensemble member) and training the deep ensemble on the joint supervised and randomly labelled data. The randomly labelled data promotes ensemble diversity. A PAC bound which relates generalisation performance to ensemble diversity is derived while the diversity of the ensemble is demonstrated to be related to the ensemble size. Experiments on various slices of CIFAR-10 and CIFAR-100 show that while the method does not improve accuracy relative to standard ensembles, there are substantial gains on calibration. Calibration does not improve consistently over more complicated/expensive diversity promoting ensemble methods.
Strengths
- The paper is very well written and clear.
- The idea behind the method, using randomly labelled unsupervised data to promote ensemble diversity, is simple, cheap, and easy to implement, and makes sense insofar as promoting diversity is desirable.
- Some theoretical results are presented in which the ensemble diversity is related via a PAC bound to the generalization performance (I have some other comments on these results below).
- The experimental results are convincing that, at least in the small data regime with relatively little unsupervised data, calibration is significantly improved relative to standard ensembles.
Weaknesses
Please see my questions in the section below for potential weaknesses that can be addressed through further experiments.
- The method is targeted solely at the small data regime; gains in calibration go to zero as the amount of labelled data increases.
- The method introduces a new hyperparameter which must be tuned.
- The experiments are presented without error bars, and it is unclear whether they come from a single run or are averaged over multiple seeds. Standard practice, especially given the relatively small datasets considered in this paper, is to run experiments with multiple random seeds and present averages and standard deviations of the metrics of interest (or, better yet, other forms of statistical significance testing).
- Experiments are conducted on small slices of CIFAR-10 and CIFAR-100. While performance in the large data regime is alluded to in the paper, an experimental evaluation of this setting (ImageNet, for example, is fairly standard in the ensemble literature) would be much appreciated.
- From Equation 3, it seems that as the number of classes (c) increases, the gains in ensemble diversity go to zero, so the method is likely to give no gains in both the large-data and the large-number-of-classes regimes.
- The primary theoretical motivation for the method is Equation 1, a PAC bound on the generalization performance; it is difficult to get a sense of how tight this bound is and to what extent the various terms in the bound compete with one another.
Small things (didn't affect rating):
- Typo: "coincides we standard weight decay" -> "coincides with standard weight decay"
- When reading the paper printed out, it took me a while to realise that there are two colours plotted in the left-hand side of Figure 2, as the orange is almost fully hidden by the red; making this clear in the figure or caption would be helpful to readers.
Questions
- While the experimental results do not show big drops in accuracy, I am quite concerned that given vastly more unlabelled data the method would lead to overfitting the random labels and thereby harm test set accuracy (a well-known phenomenon in the noisy label literature). More formally, one could imagine that vast amounts of unlabelled data would promote the diversity term on the RHS of Equation 1, but, given results in the noisy labels literature, I would find it hard to believe that this would not come at a corresponding cost in the first term on the RHS of Equation 1. Could the authors please comment on this concern? Experimentally, I would be interested in seeing an experiment on ImageNet, for example, where the labelled set is of size 50k and the unlabelled set is 950k examples, a standard ResNet-50 or similar-capacity model is used with 4 ensemble members (as per other papers in the literature), and a comparison to standard ensembles in terms of accuracy and calibration is given. This is a significant concern for me, as usually with methods that make use of an unsupervised dataset, the expectation is that as the unlabelled dataset grows, the gains from using it grow too. I fear this will not be the case for this method, which would limit it to the small dataset, small number of classes, and small unlabelled dataset regime. I recognise that the hyperparameter can to a certain extent control this trade-off, so if further experiments are conducted to address this concern, please report results over the hyperparameter range.
Thank you for your detailed and helpful review!
Addressing the weaknesses:
We include in the Appendix additional experiments with multiple seeds for the standard and ν-ensembles cases. The conclusions are the same as in the main text. We understand that the reviewer views the focus of our method on the small data regime as a weakness; at the same time, the small data regime is quite widespread in practice and the subject of considerable research in both ML conferences and journals. PAC-Bayes bounds are in general loose, except for specific instantiations that have been optimized numerically to be tight. At the same time, they are known to correlate well with out-of-sample performance [1], and they have provided useful objectives for improving deep neural networks [2].
Addressing the questions:
Our method is extremely simple and does not incorporate any prior information about the unlabeled data apart from the fact that they have been generated i.i.d. from the source distribution. Thus, we claim that it works for small to medium training set sizes. The converse (as the reviewer proposes) is also true: the unlabeled set should not be too large, as in that case we would need more complicated assumptions about how to label the unlabeled data. We created Figure 1 in the Appendix to illustrate these tradeoffs. For a small training set and a small unlabeled set, we can fit the training set while also forcing diversity out-of-sample. For very large unlabeled sets, we underfit the training data, but this is natural as our assumptions are very simple. Conversely, for very large training sets, the unlabeled data are not useful (or might even hurt the final predictor). The PAC-Bayes bound captures this tradeoff with a complexity term.
We want to note that the above is not a significant problem for our algorithm in the following sense: it suffices to do a hyperparameter search over \beta using a validation set to eliminate the issue. If we include \beta = 0 in our grid, the hyperparameter search can default to standard ensembles in these pathological cases by choosing \beta = 0. In fact, this is what happens when our ECE gains become 0 as the training set grows in Figure 2 of the main text.
We would be happy to provide any additional clarifications.
[1]: Jiang, Yiding, et al. "Fantastic Generalization Measures and Where to Find Them." International Conference on Learning Representations. 2019.
[2]: Foret, Pierre, et al. "Sharpness-aware Minimization for Efficiently Improving Generalization." International Conference on Learning Representations. 2020.
The paper introduces a method to enhance the calibration of deep ensembles, particularly in situations where there is a small amount of labeled data and some unlabeled data. For each point in the unlabeled dataset, the ensemble members are trained with different randomly selected labels. The authors provide a theoretical justification for this approach, drawing on PAC-Bayes bounds to argue that it leads to lower negative log-likelihood and higher ensemble diversity on test samples. Empirically, they demonstrate that ν-ensembles outperform standard ensembles in terms of diversity and calibration, especially when the training dataset is small or moderate in size.
Strengths
- The paper gives a method to improve calibration error for deep ensembles using unlabeled data. The use of unlabeled data to improve the calibration error of deep ensembles has not been explored much before, as most prior works have focused on joint training approaches, which can be memory- and compute-intensive.
- The paper is overall well written and easy to understand.
- The paper supports its method with both theory and experiments.
Weaknesses
- One major weakness of the paper is that the method only improves calibration error, not accuracy, yet the authors have not compared against any other calibration technique such as temperature scaling.
- The other issue is that the method appears very similar to the Agree-to-Disagree work mentioned in the paper, where unlabeled data are also used to maximize diversity, so the idea seems incremental. Can the authors please explain in detail how exactly Agree-to-Disagree maximizes diversity on the unlabeled set?
- Another limitation is that this method only improves calibration in the small data regime.
- Another limitation is that there are only two datasets used in the paper - CIFAR-10 and CIFAR-100. It would be nice to have additional datasets.
Questions
- The paper says that the labels for unlabeled data points are chosen without replacement. What happens if we sample with replacement? One should expect the same empirical results to hold but maybe the theoretical argument will not hold?
- I understand the text at the bottom of Figure 1, but I don't understand the figure itself. What are the 3 columns in the figure?
- One part that is not clear to me: when we force the models to make random predictions on unlabeled data from the same distribution, why does this not hurt the accuracy or the cross-entropy loss of the model? When the training set is small and the unlabeled set is larger, can the authors share their regularization parameters and whether they had to put a small weight on the regularization term?
- The colors used in Figures 2 and 3 are very similar, and it is hard to distinguish the different lines.
- There are other works that also use this idea of diversifying via unlabeled data points for other problems, for example "Diversify and Disambiguate: Out-of-Distribution Robustness via Disagreement". Can the authors please compare to this work as well?
- Did the authors try using unlabeled data from different distributions, such as random Gaussian noise? One benefit would be that fitting random labels on such a dataset would not interfere with learning on the original distribution.
Thank you for your detailed and helpful review!
Addressing the weaknesses:
Please note that in the revised version we include experiments with temperature scaling as well as additional datasets (SVHN and STL-10) in the Appendix. In most cases ν-ensembles + temperature scaling results in the best calibration error, and the results for SVHN and STL-10 match those for CIFAR-10 and CIFAR-100. Additionally, we would like to make two other points: 1) our method increases the diversity of a deep ensemble and is not a regularization method, so in principle it can be combined with any other regularization method for individual networks (such as temperature scaling, mixup augmentation, etc.); 2) our method is also applicable in cases where it is difficult to modify the training objective or access the logits, for example when using standard libraries such as scikit-learn. This is in contrast to many calibration-promoting approaches such as tempering.
Our method has significant differences from, and improvements over, the Agree-to-Disagree method. We describe these in the main text but recap them here:
The Agree-to-Disagree method trains an ensemble one member at a time. In step one, they train a network to fit the labeled training set. In step two, they train a second network that fits the labeled training set while making different predictions from the first network on an unlabeled set; the regularizer pushes the logits on unlabeled data to differ between the two networks. The method proceeds in this manner, adding one network to the ensemble at a time. By contrast, ν-ensembles create K pseudolabeled datasets and can then train K ensemble members in parallel using the standard deep ensemble algorithm. Our method is much faster, simpler to implement, and trivially parallelizable. Also, Agree-to-Disagree was evaluated only on distribution-shift datasets in the original paper. In our experiments we significantly outperform Agree-to-Disagree ensembles, which underfit the data. Finally, the Agree-to-Disagree paper offers only a very limited theoretical justification for its approach; by contrast, we include a very detailed analysis based on a PAC-Bayes bound.
| metric | Agree-to-Disagree | ν-ensembles |
|---|---|---|
| embarrassingly parallel | no | yes |
| best test performance | no | yes |
| theoretical justification | minimal | PAC-Bayes bound |
| implementation | modified objective | standard objective with randomized labels |
Smaller comments:
As we discuss in the main paper, the small data regime occurs frequently in practice and is a well-studied setting in machine learning conferences and journals.
We include the additional datasets STL-10 and SVHN in the revised Appendix.
Addressing the reviewer questions:
We include in the revised text a detailed analysis of the setting where we sample with replacement (Proposition 2). Our theoretical analysis implies that, for the same number of ensemble members, we would obtain worse diversity and thus worse calibration. We include experiments in the Appendix that validate this prediction: \nu-ensembles without replacement result in improved diversity and test calibration.
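To illustrate the mechanism behind this prediction (our own sketch, not the paper's experiment): sampling with replacement can repeat labels across members, so fewer distinct labels are covered per point on average, which is what reduces diversity.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, K, n_points = 10, 5, 10_000

def mean_distinct_labels(replace):
    """Average number of distinct labels assigned per unlabeled point."""
    if replace:
        draws = rng.integers(0, n_classes, size=(n_points, K))
    else:
        draws = np.stack([rng.choice(n_classes, size=K, replace=False)
                          for _ in range(n_points)])
    return np.mean([len(set(row)) for row in draws])

print("with replacement:   ", mean_distinct_labels(True))    # ~4.1 distinct labels per point
print("without replacement:", mean_distinct_labels(False))   # exactly 5 distinct labels per point
```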
The three columns of the figure represent three ensembles of different sizes: column 1 is an ensemble of size 1, column 2 is an ensemble of size 2, and column 3 is an ensemble of size 4. The true label here (the cat) is y=3. Ensemble member 1 fits label y=1, ensemble member 2 fits label y=4, ensemble member 3 fits label y=2, and ensemble member 4 fits label y=3. Thus only ensemble member 4 has p(y=3|x,f) = 1 (it is the only one that fits the correct label).
Diversify and Disambiguate is a method developed for out-of-distribution generalization and is not based on an ensemble. In general, we note that the weight-averaging line of papers has made it clear that deep ensembles do not necessarily generalize better to out-of-distribution data. Thus, the two methods are not directly comparable.
We would be happy to provide any more clarifications.
Additional experiments to address reviewer questions:
-
We include a detailed theoretical analysis of the case where we sample with replacement. We predict that sampling with replacement results in worse diversity and worse test calibration than sampling without replacement. We include experiments on CIFAR-10 and CIFAR-100 in the Appendix that validate this prediction. In addition, these experiments support our claim that our method is a diversity-inducing method for ensembles and not a regularization method for individual ensemble members.
-
We include additional experiments on the SVHN and STL10 datasets in the Appendix. The results agree with the results on the CIFAR-10 and CIFAR-100 datasets.
-
We include additional experiments on temperature scaling. Temperature scaling + ν-ensembles results in the best calibration in most cases.
-
We include in the Appendix additional experiments with multiple random seeds for the standard and ν-ensembles cases. The experiments agree with those in the main text.
We would also like to highlight the following:
Our method is not a calibration-promoting method for individual networks but a diversity-promoting method for ensembles. Thus, in principle, it can be combined with any number of other calibration-promoting methods for individual networks; indeed, we combine it with standard augmentation and temperature scaling and get improvements in both cases. Additionally, our method is applicable when modifying the training objective and/or model architecture is cumbersome (for example when using libraries such as scikit-learn that in certain cases only expose softmax outputs rather than logits and hide the optimization objective). Notably, this makes our method usable when methods such as tempering, focal loss, etc., might be more cumbersome to apply.
The reviewers and meta reviewer all carefully checked and discussed the rebuttal. They thank the authors for their response and their efforts during the rebuttal phase.
The reviewers and meta-reviewer all acknowledge that the submission studies an interesting topic relevant to the community: constructing well-calibrated ensembles in a simple fashion. The response helped with some concerns (e.g., the characterization of the variance without replacement and the results with multiple seeds, although no error bars are unfortunately provided to assess differences at the third decimal place).
At this stage, however, the submission still has some arguably important concerns that warrant further consolidation/investigation, for instance:
- [limited significance/scope] The paper focuses on a very specific setting with small datasets and small architectures, which directly impacts the significance of the results.
- [critical role of \beta and unlabeled data] Several reviewers were confused about the exact way of defining the set of unlabeled data and its impact on the learned models (e.g., as its scale increases). Moreover, the role (and tuning) of \beta seems critical. More sensitivity analysis and insights are needed.
- [plug-in argument against temperature] It is not true that we cannot apply temperature scaling from a library outputting softmax predictions. If Z is the matrix of logits and P = softmax(Z), it is easy to check that softmax(log(P)/T) = softmax(Z/T) for a temperature T, since the softmax is invariant to an additive constant. This is often used in libraries, e.g., for robustness metrics. (A short numerical check is sketched after this list.)
- [role of the variance for a large number of classes] When the number of classes increases, the variance term shrinks to zero. It would be important to test the approach on public datasets such as ImageNet-1k or, even better, ImageNet-21k.
- [clarity of the theoretical part] Theorem 1 is difficult to comprehend without looking at the rest of the paper (e.g., the definition of U is only clarified in Proposition 1).
- [ability of the model to fit the random labels] A recurring argument of the paper lies in the ability of the models to closely fit the random labels. This assumption is not trivial and hinges on the model size and the properties of the datasets. Again, this assumption requires more insights and analysis, at various scales.
- [further comparison] Given that random search (RS) was performed to tune the models, a comparison with Wenzel et al. (2020b) seems not only important but also easy to get (the different members are already generated by RS).
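Regarding the temperature-scaling point above, the identity can be checked numerically in a few lines (an illustrative sketch, not from the submission):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 10))   # logits
P = softmax(Z)                 # softmax outputs, e.g. as returned by a library
T = 2.5                        # temperature

# log(P) equals Z minus a per-row constant, so tempering log(P) tempers the logits:
assert np.allclose(softmax(np.log(P) / T), softmax(Z / T))
print("temperature scaling from softmax outputs matches scaling the logits")
```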
Because of the competitive landscape of the submissions and the long list of remaining concerns (see above), the paper stays under the cut and has ultimately not been selected for acceptance.
We are convinced that the suggestions above will help strengthen the paper for a future resubmission, which the reviewers and meta reviewer all encourage.
Why not a higher score
- Limited significance and scope
- A critical hyperparameter is introduced without a full understanding of its role and sensitivity
- Clarity of the manuscript
- Some additional baselines could be considered
- Several arguments/assumptions are unsupported by analysis and/or experiments (e.g., ability of the models to fit random labels)
Why not a lower score
N/A
Reject