PaperHub

Overall rating: 7.5 / 10 (Spotlight · 4 reviewers; min 7, max 8, std 0.5)
Individual ratings: 7, 8, 7, 8
Confidence: 3.3
Correctness: 3.5, Contribution: 3.5, Presentation: 3.5
NeurIPS 2024

On the Use of Anchoring for Training Vision Models

OpenReview | PDF
Submitted: 2024-05-14, Updated: 2024-11-06

Abstract

Keywords
Anomaly Detection, OOD Generalization, ML Safety, Anchoring, Deep Neural Networks

Reviews and Discussion

Official Review
Rating: 7

This paper identifies a major problem with anchored training, that the performance of anchored training does not increase with increasing reference set size, and proposes a simple regularization approach to overcome this problem. This approach is evaluated on OOD generalization, calibration and anomaly rejection, and task adaptation, and various facets of anchored training are analyzed.

Strengths

The paper makes the interesting finding that the performance of anchored training does not increase with increasing reference set size, and that this problem is not alleviated by more sophisticated inference strategies. The paper also proposes a simple reference-masking regularization technique to help alleviate this problem. The experiments show the effectiveness of the proposed approach, and there is also analysis of how the method interacts with data augmentation and noisy labels. An ablation study of the α parameter is also performed. Training recipes are also provided, making the paper easier to reproduce.

Weaknesses

One weakness is that the reference set selection strategy and reference set sizes are not explained for the experiments.

The impact/novelty is a bit limited because of the lack of comparisons to non-anchored training works.

Minor points: in the tables, decreases in performance could be colored in a color other than pink. Figure 1 could be improved with error bars. One highlighting was missed in Table 3. The abbreviation LP is not defined.

Questions

  1. Is there any explanation for why the accuracy stays relatively constant (Figure 1) regardless of reference set size?
  2. Does training for more epochs help alleviate the reference set size problem?

Limitations

The limitations of this work are discussed by the authors at the end of the paper. Negative societal impact is probably not a concern for this paper.

Author Response

We thank you for your positive feedback. We hope our responses address the questions you have raised.

1. Accuracy Remains Constant Regardless of Reference Set Size

We would like to clarify that this is precisely the problem with the original anchored training protocol, which we solve in this paper. As shown in Fig. 1 of the paper, the original anchoring protocol does not fully leverage the diversity of reference-residual pairs with increasing reference set size and maintains a relatively constant accuracy regardless of the reference set size. We hypothesize that this is because as the size of the reference set increases, the number of reference-residual pairs grows combinatorially. For example, when the reference set is the entire dataset D, there are |D| choose 2 pairs, making it impractical to explore all pairs within a fixed number of training iterations. This results in insufficient sampling of reference-residual pairs, increasing the risk that anchored training may overlook the reference and make predictions based solely on the residuals, leading to non-generalizable shortcuts. This is problematic as a sample should not be identifiable without knowing the reference.
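As a rough back-of-the-envelope illustration of this mismatch (the figures below are CIFAR-scale assumptions chosen for illustration, not numbers reported in the paper):

```python
from math import comb

# Illustrative assumption: the full 50k-image training set doubles as the reference set,
# and training samples one random reference per image per epoch for 200 epochs.
ref_set_size = 50_000
distinct_pairs = comb(ref_set_size, 2)     # |D| choose 2, roughly 1.25e9 reference-residual pairs

epochs = 200
pairs_sampled = epochs * ref_set_size      # roughly 1e7 pairs actually seen during training

print(f"distinct pairs:         {distinct_pairs:.2e}")
print(f"pairs seen in training: {pairs_sampled:.2e}")
print(f"fraction explored:      {pairs_sampled / distinct_pairs:.2%}")  # well under 1%
```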

2. Alleviating the Reference Set Size Problem

One possible way of alleviating this problem is by reducing the reference set size. However, this reduces the diversity of the reference-residual pairs exposed during training and can lead to a poor solution. While the issue of diversity can be combated with large reference set sizes, increasing the number of epochs alone does not solve the problem as there exists a combinatorially large number of reference-residual pairs which cannot be practically explored, and the model will still be vulnerable to shortcuts. Moreover, modifying the number of training epochs results in non-trivial modifications in the training hyper-parameters (e.g., learning rate schedules) and can lead to poorly convergent models if the hyper-parameters are chosen incorrectly. Hence, we propose a reference masking regularizer for anchored training that helps mitigate shortcut decision rules while also being computationally efficient.
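For concreteness, a minimal PyTorch-style sketch of one training step with such a reference-masking regularizer, as described in this discussion (mask the reference with probability α and push the masked predictions toward a uniform, high-entropy output). The channel-wise concatenation of reference and residual, the per-sample masking, and the KL-to-uniform loss are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def anchored_training_step(model, x, y, ref_bank, alpha=0.2, num_classes=10):
    """One training step for an anchored model f([r, x - r]) with reference masking.

    model    : network whose first layer accepts 2*C input channels
    x, y     : batch of images (B, C, H, W) and labels (B,)
    ref_bank : tensor of candidate references (N, C, H, W)
    alpha    : probability of masking the reference for a given sample
    """
    # Sample one random reference per input and form the (reference, residual) pair.
    idx = torch.randint(len(ref_bank), (x.size(0),), device=x.device)
    r = ref_bank[idx]
    residual = x - r

    # Reference masking: with probability alpha, zero out the reference channel.
    mask = (torch.rand(x.size(0), device=x.device) < alpha).float().view(-1, 1, 1, 1)
    r_masked = r * (1.0 - mask)

    logits = model(torch.cat([r_masked, residual], dim=1))
    log_probs = F.log_softmax(logits, dim=1)

    # Unmasked samples: standard cross-entropy.
    ce = F.nll_loss(log_probs, y, reduction="none")
    # Masked samples: push predictions toward the uniform (maximum-entropy) distribution.
    uniform = torch.full_like(log_probs, 1.0 / num_classes)
    kl_to_uniform = F.kl_div(log_probs, uniform, reduction="none").sum(dim=1)

    m = mask.view(-1)
    loss = ((1.0 - m) * ce + m * kl_to_uniform).mean()
    return loss
```

Setting α = 0 in this sketch recovers the original anchored training protocol.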

3. Reference Set Selection Strategy

For all experiments in Section 4, we utilize the entire training dataset as the reference set and train both the original and the proposed anchored models. During inference, we randomly select a single reference from the reference set and perform evaluation on the different test datasets. We will better clarify this in the final version of the paper.
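A short sketch of this inference protocol, reusing the illustrative input layout from the training sketch above (draw one random reference, form the residual, and predict):

```python
import torch

@torch.no_grad()
def anchored_predict(model, x, ref_bank):
    """Evaluate a batch with a single randomly drawn reference from the reference set."""
    r = ref_bank[torch.randint(len(ref_bank), (1,), device=x.device)].expand_as(x)
    logits = model(torch.cat([r, x - r], dim=1))
    return logits.argmax(dim=1)
```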

4. Impact/Novelty

Anchoring is a framework that is agnostic to any training strategy, domain or application and can be wrapped along with other strategies (data augmentation, loss functions, regularizers, ensembling) that help improve model performance. While this is attractive, our paper identifies the shortcomings of the existing protocol and deals with a fundamental problem of how to train and make predictions with anchored models in practice. We develop a novel reference masking protocol for training anchored models that can significantly improve overall model generalization. We demonstrate significant quantitative performance improvements over the standard and the original anchored training protocols across different datasets, tasks and architectures. In particular, we find that our proposed algorithm leads to a wider and flatter optimum corresponding to superior solutions (Fig. 4), can be used on top of augmentation strategies (Fig. 5a), and better handles training label noise (Fig. 5b). We systematically establish the empirical efficacy of our approach on OOD generalization and model safety tasks ranging from calibration and anomaly rejection to task adaptation and domain generalization. We expect to foster interesting research directions with anchoring and even impact applications in other domains (e.g., text, graphs) where generalization and model safety are paramount concerns.

5. Formatting Issues and Typos

We will make the changes to the tables and figures in the final version of the paper.

Comment

We thank the reviewer for taking the time to review our paper and providing useful feedback. As the discussion phase is ending soon, we will greatly appreciate if the reviewer can check our response and let us know if there are any additional questions.

Comment

Thank you for the detailed response addressing my questions about the accuracy, increasing the number of epochs, and impact and novelty, which have clarified and provided additional insight on the paper, as well as those of the other reviewers. Therefore, I would like to increase my score. I have one follow-up question on the accuracy remaining constant regardless of reference set size in vanilla anchored training: in the paper, it is mentioned that when |R| ≤ 50, the model will likely see all combinations of reference and sample, but in Figure 1, we see either no increase in accuracy between |R| = 5 and |R| = 50 (CIFAR-10) or only a slight increase (CIFAR-100). Is there an explanation for this? One might expect that the accuracy will increase until |R| increases to the point where not all combinations can be seen by the model.

Comment

We thank the reviewer for checking our rebuttal and considering an increase in the score.

In anchoring, the quality of the converged optimum depends upon the diversity of the reference-residual pairs (which induce a rich family of functions) exposed during training. However, we note that the reference set size is only a surrogate for diversity, and more importantly, we do not use any sophisticated strategy for reference set selection (random sampling). As a result, even when all combinations are exposed, it is not guaranteed that the diversity of functions at |R| = 50 is significantly higher than that at |R| = 5. If there were a better way of picking reference sets that are guaranteed to lead to diverse functions, we could expect stronger performance gains at |R| = 50. However, it is not clear how to design such a reference selection protocol. Instead, we recommend the use of very large reference sets (the entire training data, or even the training data along with its augmented versions). That is where the problem of under-exposure of all combinations kicks in, thus motivating our regularizer.

Comment

Thank you to the authors for the explanation. I will continue to recommend acceptance.

Comment

We appreciate the reviewer's recommendation for acceptance. If our responses have addressed the questions adequately, we would like to inquire if there is a possibility of revising the score.

Official Review
Rating: 8

The authors analyze the effect of anchored training through a series of small experiments and find that, contrary to claims in prior works, increasing the size of the reference set is not beneficial and that this shortcoming cannot be mitigated through existing inference strategies. The authors provide a simple yet efficient fix by randomly masking out the reference during training, and forcing the model to make high entropy predictions in those cases. This solution does not incur any training overhead, and the authors demonstrate in extensive experiments that the fix is applicable to different models and datasets, yields improvements for OOD performance over various distribution shifts, and improves calibration and anomaly resilience.

Strengths

  1. The paper is very well written and structured and is overall easy to follow. The initial experiments highlight the studied problem well.
  2. The authors showcase an important limitation to existing anchoring techniques that was unknown to the community.
  3. The proposed solution is simple and is demonstrated to consistently improve performance across models and datasets.
  4. The experiment section is extensive and covers both OOD performance as well as safety-relevant metrics. The results convincingly demonstrate the effectiveness of the proposed method.

Weaknesses

The paper is very well written; I don't see any major weaknesses that would prevent an accept.

Minor weakness: The optimal α is determined when using the entire dataset as a reference set. However, as is clear from the motivation, risk of spurious shortcuts is larger with a smaller reference set. Wouldn't this imply that the optimal α would be larger for smaller reference sets? How should this value be chosen in practice and for datasets larger than ImageNet-1k?

Questions

  1. Tab. 2 Do you have any insights why the improvements on ImageNet-S and ImageNet-R are drastically different for SWIN transformers and ViT?
  2. (minor) The formatting of paragraph headers in the introduction is inconsistent, with underlining used for some headers but not others.
  3. (minor) Erroneous comma in L.165
  4. (minor) It is hard to visually assess from Fig. 4 whether the optima are significantly different.

Limitations

Limitations were sufficiently addressed, especially the empirical nature of the work.

Author Response

We thank you for your positive feedback. Here are our responses to your questions. We plan to incorporate some of these clarifying comments to the manuscript as well.

1. Choice of α

We want to clarify that at low reference set sizes, there is a high likelihood of exposing the model to all possible combinations of samples and references, and hence the risk of learning shortcuts is minimal. In this case, overemphasizing the reference masking probability (i.e., increasing α) can significantly inhibit this exposure. Consequently, this leads to underfitting as the model is tasked with learning solely from the residuals, which is undesirable in practice (blue curve for reference set sizes ≤ 50 in Fig. 1 of the paper). Reducing α can combat this behavior, as evidenced by the original anchored training (the special case of reference masking with α = 0; red curves for reference set sizes ≤ 50 in Fig. 1).

Now, with larger reference sets (e.g., datasets at the scale of ImageNet-1K), the number of reference-residual pairs grows combinatorially, making it impractical to expose the model to all diverse pairs in a fixed number of training iterations. In such a scenario, reducing α can increase the risk of learning shortcuts and lead to suboptimal performance. Increasing α, on the other hand, can in fact aid training as it systematically avoids these shortcuts and improves generalization. In summary, the optimal α value depends both on the reference set size and the convergence behavior of model training.

2. Performance Improvements on ImageNet-R/S with VITb and SWINv2B

Anchored training with ImageNet-1K involves exposing the model to a significantly large and diverse set of reference-residual combinations. Our results show that modeling such a large joint distribution and leveraging the diversity of the reference set requires, along with the proposed masking regularizer, higher-capacity networks (VITb & SWINv2B). We hypothesize that this helps such networks better handle challenging, far out-of-distribution datasets such as ImageNet-R/S. For instance, we observe an average of 1.5% and 1.7% improvements in ImageNet-R and ImageNet-S accuracies respectively when using such architectures over lower-capacity models.

3. Formatting

We will make changes to the paragraph header, correct underline inconsistencies in the introduction and improve Fig. 4 in the final version of the paper.

Comment

I thank the authors for their explanation and clarification. I believe a succinct discussion of alpha in different scenarios like the one provided here would be useful to include in the paper.

I have no further questions and continue to recommend acceptance.

Comment

We sincerely appreciate the reviewer for going over our response and championing our paper!

Official Review
Rating: 7

In this paper, the authors propose a new strategy to train anchoring-based models, significantly improving performance, training efficiency, and model generalization compared to previous approaches. The key to the method is the added masking strategy that allows the model to better profit from anchoring-based training. The authors demonstrate that modifications only in inference (using several samples or searching for the best references) or the number of used references do not improve model performance, while the application of the masking procedure significantly improves it, as shown on various image classification datasets, specifically CIFAR-10, CIFAR-100, and ImageNet, using different architectures (both CNN and attention-based). The experiments demonstrate the effectiveness of the proposed method and the significant benefit of using it for improved generalization.

Strengths

  • The paper is clearly written and easy to follow. The idea is intuitive and easy to grasp. The related work section provides an adequate discussion of existing approaches to anchoring-based training. The analysis narrative, with the presented drawbacks of existing methods, is very clear and easy to understand.

  • The idea of masking the reference input argument is very clear and logical. The intuition behind why the problem could occur seems correct: 1) the space of possible reference arguments grows combinatorially, and therefore 2) the model could learn to ignore the reference argument. This is further clearly supported by the experiments.

  • The authors provided an extensive evaluation of their approach, spanning different datasets and architectures, which provides a solid grounding to support the proposed method.

Weaknesses

  • It seems that the evaluation could benefit from an additional comparison with other existing state-of-the-art OOD/uncertainty methods to better represent the quality of the results (not just in comparison with former anchoring-based approaches, but overall).

  • From the perspective of the experimental evaluation, I would be curious to see evidence that the behavior demonstrated in the paper would hold in other domains, such as text, graphs, or more complicated vision tasks (e.g., segmentation), rather than being limited to the image classification task.

Questions

  • Why do the authors focus on vision models when the method seems to be very generic and applicable to other domains as well?

  • One of the claims the authors make is that the proposed masking procedure helps with the problem of the model ignoring reference input. They support this claim with, for example, Figure 2—an experiment showing that without this masking, we do not observe improvements in terms of performance, which is only a proxy for the claim. Is it possible to measure the sensitivity of the model with regard to reference inputs (for example, by adding noise to it and measuring the change in the outputs)?

  • As far as I understand from the method description, the final method in Section 4 uses only one reference image for inference. How does the performance change with an increased number of references? The lack of improvements in performance (e.g., as in the right plot in Figure 2) seems strange to me since we would observe the opposite behavior in all existing ensembling approaches (e.g., [1, 2, 3, 4]). How would one explain such behavior? Additionally, it would be good to see some comparisons with these methods or at least include them in the discussion.

[1] Lakshminarayanan, Balaji, Alexander Pritzel, and Charles Blundell. "Simple and scalable predictive uncertainty estimation using deep ensembles." NeurIPS 2017

[2] Wen, Yeming, Dustin Tran, and Jimmy Ba. "Batchensemble: an alternative approach to efficient ensemble and lifelong learning." ICLR 2020

[3] Durasov, Nikita, et al. "Masksembles for uncertainty estimation." CVPR 2021

[4] Laurent, Olivier, et al. "Packed-ensembles for efficient uncertainty estimation." ICLR 2023

Limitations

No

Author Response

We thank you for your positive feedback. We hope our responses address your questions.

1. Generic Applicability of Anchoring

Thank you for this question. We concur with you that anchoring is a protocol for training deep neural networks that can be used in any domain and for any application (e.g., text, graphs, or vision tasks such as segmentation). In this paper, our goal was to identify the shortcomings of the original anchored training [1] and develop algorithms to improve upon them, and vision was chosen only as a domain of convenience. Though we have performed initial experiments on using anchored training for other data domains (e.g., graphs, text), we did not include them in order to restrict the scope of the current submission. We find that our proposed anchoring approach produces performance improvements even in non-vision tasks. In the final version of the paper, we will include this discussion as part of the concluding remarks. In summary, anchoring is a domain-agnostic, architecture-agnostic, and task-agnostic training strategy.

2. Clarifying ‘Model Ignoring Reference Input’

It must be noted that the anchoring principle as demonstrated in [1] implicitly explores a large family of functions during training due to the lack of shift invariance of the underlying neural tangent kernel when the input is translated by a reference. We believe that this strategy produces a local optimum similar in spirit to stochastic weight averaging [2], which averages multiple solutions along the trajectory of gradient descent. Unique to anchoring, the quality of the converged optimum depends upon the diversity of the reference-residual pairs (which induce a rich family of functions) exposed during training.

However, we observe that the original anchored training, even with large reference sets, does not fully leverage the reference-residual diversity and converges to a poor local optimum (Fig. 4b). We attribute this to the anchored model relying on shortcuts to make predictions. Please note that the usage of the term shortcut (ignoring the reference) in the context of anchoring is different from convention. Shortcuts manifest during anchored training when the model can predict well only with certain arbitrary references but on average converges to a poor optimum. Basically, the functions induced by ignoring the reference are entirely different from the ones obtained without ignoring them, making the model eventually converge to a sub-optimal (implicitly averaged) local optimum. We will better clarify this in the final version of the paper.
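One way to probe the reviewer's question directly (our illustration, not an experiment from the paper) is to perturb only the reference channel while holding the residual fixed and measure how much the softmax output moves; a near-zero shift would suggest the shortcut behaviour described above. A rough sketch, reusing the illustrative input layout from the earlier sketches:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def reference_channel_sensitivity(model, x, ref_bank, noise_std=0.1, trials=8):
    """Perturb only the reference channel (residual held fixed) and measure the output shift.

    A near-zero shift would suggest the model ignores the reference and predicts from the
    residual alone -- the shortcut behaviour discussed above.
    """
    r = ref_bank[torch.randint(len(ref_bank), (x.size(0),), device=x.device)]
    residual = x - r
    base = F.softmax(model(torch.cat([r, residual], dim=1)), dim=1)
    shifts = []
    for _ in range(trials):
        r_noisy = r + noise_std * torch.randn_like(r)
        probs = F.softmax(model(torch.cat([r_noisy, residual], dim=1)), dim=1)
        shifts.append((probs - base).abs().sum(dim=1).mean())  # mean L1 shift per sample
    return torch.stack(shifts).mean()
```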

3. Clarifying the Anchoring Inference Mechanism

We would like to emphasize that anchored training (i) enforces prediction consistency of a sample with any reference; (ii) produces a single model; and (iii) converges to a local optimum in a manner akin to stochastic weight averaging [2]. Moreover, when the diversity of the reference-residual pairs is well leveraged during training, it allows the model to converge to a wider optimum, improving model generalization (Fig. 4 in the main paper). Therefore, the 'quality' of the optimum governs performance during inference, and as a result the inference strategy (e.g., choosing K random references for inference) does not alter the (mean) model performance. It must be noted that anchoring must not be viewed under the lens of model ensembles, which train multiple models where each member explores a different, possibly diverse local optimum. While [1] measures discrepancies in the predictions of a sample with different references as a notion of epistemic uncertainty, we find that the mean performance does not change. Fig. 2 in the main manuscript compares different inference protocols (1 random anchor, K anchors, and transduction) and finds no significant differences in accuracy.
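To make the comparison of inference protocols concrete, a hedged sketch of K-reference inference (average the softmax outputs over K random references; the spread across references can serve as the discrepancy-based uncertainty notion from [1]). The names and input layout follow the earlier sketches and are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def k_reference_predict(model, x, ref_bank, k=10):
    """Average predictions over k random references; the spread is an uncertainty proxy."""
    probs = []
    for _ in range(k):
        r = ref_bank[torch.randint(len(ref_bank), (1,), device=x.device)].expand_as(x)
        probs.append(F.softmax(model(torch.cat([r, x - r], dim=1)), dim=1))
    probs = torch.stack(probs)              # (k, B, num_classes)
    mean_probs = probs.mean(dim=0)          # mean prediction; accuracy tracks single-reference inference
    spread = probs.std(dim=0).sum(dim=1)    # per-sample discrepancy across references
    return mean_probs.argmax(dim=1), spread
```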

4. Comparison with Existing OOD/Uncertainty Estimation Methods

Thank you for this very important question. While a systematic evaluation of OOD detection against state-of-the-art methods is most essential, it is beyond the scope of this paper. Our paper aims to establish anchoring as a useful training protocol and demonstrate its efficacy across a spectrum of tasks and model architectures. Although [1] provided evidence of the efficacy of epistemic uncertainties from anchoring for OOD detection, we will be conducting a large-scale study as a part of our immediate future work.

[1] Thiagarajan et al. "Single Model Uncertainty Estimation via Stochastic Data Centering." NeurIPS 2022

[2] Izmailov et al. "Averaging Weights Leads to Wider Optima and Better Generalization." UAI 2018

Comment

Thank you for your detailed rebuttal and for addressing the questions and concerns raised. I appreciate the additional insights provided regarding the generic applicability of anchoring, the clarification on the model's interaction with reference inputs, and the explanation of the inference mechanism. Your responses have clarified several key aspects of the paper, particularly the distinction between your approach and traditional ensembling methods.

While the paper focuses on anchoring within the vision domain, I understand the rationale for this choice and acknowledge the potential for broader applicability in other domains. The planned inclusion of discussions on non-vision tasks in the final version will be a valuable addition.

Given the solid contributions of your work and the thorough responses provided, I am maintaining my original rating.

Comment

We appreciate the reviewer for going over our response and recommending acceptance.

Official Review
Rating: 8

This paper presents a thorough discussion on the use of anchoring for training vision models. In particular, the paper 1) tackles the problem of reference diversity when training with anchoring, to explain how superior generalization can be achieved, 2) addresses the problem of spurious correlations learnt from the residual alone, and 3) studies how different inference-time strategies can enable greater out-of-support generalization. Overall, this comprehensive study of anchoring provides useful guidelines for how anchoring should be applied to extract maximum performance. The paper empirically confirms this via the proposed anchoring scheme outperforming prior work noticeably.

Strengths

  1. Clarity: The paper is very clearly written and easy to follow. Readers unfamiliar with the literature like myself are able to understand what anchoring is, how it can be useful for (out of support) generalization and how current methods fail to apply anchoring in the most effective way.

  2. Thoroughness of Evaluation: The paper conducts thorough ablations on several components of the anchoring pipeline: reference diversity, reference masking, the inference procedure, etc.

Weaknesses

No obvious weaknesses.

Questions

Have the authors compared the out of support generalization of anchoring procedures to other methods for domain generalization (which tackles a similar problem)? Considering datasets and baselines from In Search of Lost Domain Generalization (https://arxiv.org/abs/2007.01434) can further broaden the impact of this paper.

Limitations

N/A

Author Response

We thank you for your positive comments and feedback. We hope our response addresses your concern.

Domain Generalization Benchmarks

Thank you for this question. We would like to highlight that we performed experiments on DomainNet which is one of the benchmarks from DomainBed [1] (Line 293 of the main paper). Following the setup in [1], we trained on one of the domains (source) and evaluated the model on the remaining domains (target). While it is common in the domain generalization literature to train models end to end on the source dataset, we instead trained a linear probe on top of an anchored feature extractor pre-trained on ImageNet. This was motivated by the need to investigate the impact of anchored training in producing better generalizable feature extractor backbones. In particular, we trained a linear probe with ERM [1] using the ‘real’ and ‘sketch’ (source) splits respectively from DomainNet. We then evaluated performance on the other (target) domains and observed performance improvements over the non-anchored variant.
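A rough sketch of this linear-probing setup under illustrative assumptions (the anchored backbone is frozen and assumed to return pooled features for already-anchored inputs; ERM here is plain cross-entropy on the source split):

```python
import torch
import torch.nn as nn

def train_linear_probe(backbone, loader, num_classes, feat_dim, epochs=10, lr=1e-3):
    """Train a linear classifier (ERM) on frozen features from a pre-trained backbone."""
    backbone.eval()                        # the anchored backbone stays frozen
    probe = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:                # x: anchored inputs from the source split
            with torch.no_grad():
                feats = backbone(x)        # assumed to return pooled features (B, feat_dim)
            loss = criterion(probe(feats), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return probe
```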

With respect to the domain generalization specific baselines and training strategies (e.g., DRO, IRM) used in [1], we would like to emphasize that our proposed anchored training protocol can be simply used as a wrapper on all such methods and we expect it to improve overall performance similar in spirit to our analysis on augmentation methods in Section 3.2 (Fig 5a of the main paper). We plan to perform an extensive analysis on domain generalization as a part of our future work.

[1] Gulrajani, Ishaan, and David Lopez-Paz. "In Search of Lost Domain Generalization." ICLR 2021

Comment

I have read the rebuttal and continue to recommend acceptance for this work.

Final Decision

This paper investigates anchoring-based training of vision models. It indicates that reference diversity is important for anchored training to achieve generalization, and that existing methods tend to learn spurious shortcuts due to insufficient sampling. It further proposes a masking strategy to improve performance and efficiency. Extensive experiments demonstrate the effectiveness of the proposed approach.

Four experts reviewed this paper and all recommended acceptance. The reviewers liked that the work presents an important limitation to existing anchoring techniques that was unknown to the community, and proposed a simple and effective method to improve both OOD and safty in anchoring training. Moreover, the discussion is comprehensive and provides useful guidelines for how to use the proposed strategy to achieve maximum performance. The paper is very well-written and easy to follow. Based on the reviewers' feedback, the decision is to recommend the paper for acceptance.