Vision Transformer Neural Architecture Search for Out-of-Distribution Generalization: Benchmark and Insights
This work introduces OoD-ViT-NAS, the first systematic ViT NAS benchmark designed for out-of-distribution (OoD) generalization, and provides analytical insights into how ViT architectures impact OoD performance.
Abstract
Reviews and Discussion
The paper shows that in-domain accuracy and training-free NAS accuracy predictions correlate poorly with out-of-domain accuracy, while model characteristics such as FLOPs, #Params, and embed-dim (number of channels) correlate much better with out-of-domain accuracy. To do this, they train a supernet that contains all possible architectures (subnets), then validate these subnets on 9 datasets: ImageNet/C/D/P/O/A/R/Sketch/Stylized.
Strengths
They explore various ways of increasing OOD accuracy, which is most important for real-world problems. They show that conventional approaches, such as training-free NAS or relying on in-domain accuracy, do not work, while FLOPs, #Param, and embed-dim correlate well with OOD accuracy. To do this, they introduce the OoD-ViT-NAS benchmark for NAS research on ViT's OOD generalization, which includes 3000 diverse ViT architectures evaluated on 9 datasets. This allows us to look for new and better approaches for finding model structures that improve OOD accuracy.
Weaknesses
You show that the embedding dimension has the highest impact among ViT architectural attributes, while network depth has only a slight impact on OoD generalization. However, the paper lacks a discussion that, although embed-dim (number of channels) correlates with OOD accuracy better than depth, this does not necessarily mean that only embed-dim should be increased to raise OOD accuracy, since the computational and memory costs of increasing embed-dim and depth may differ. Therefore, to create either the largest and most accurate model that fits in memory, or the model that is optimal in terms of speed and accuracy, it may be optimal to increase the depth rather than embed-dim. There are also papers [1,2,3] that show that it is necessary to simultaneously increase several model parameters (depth, number of channels, resolution) in optimal proportions, as in the compound-scaling rule below.
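For reference, EfficientNet's compound-scaling rule [1] couples these factors through a single coefficient $\phi$ (quoted here from [1], not from the paper under review):

$$\text{depth: } d = \alpha^{\phi}, \quad \text{width: } w = \beta^{\phi}, \quad \text{resolution: } r = \gamma^{\phi}, \quad \text{s.t. } \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2, \; \alpha, \beta, \gamma \ge 1.$$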
You present charts of OOD accuracy versus number of parameters (Fig. 4, K19 - K24), but it would also be great to present charts of OOD accuracy versus latency and OOD accuracy versus GPU memory consumption, because these three model characteristics (OOD accuracy, latency, memory consumption) are the most important for real-world problems.
- EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, Mingxing Tan, Quoc V. Le, 2019
- EfficientNetV2: Smaller Models and Faster Training, Mingxing Tan, Quoc V. Le, 2021
- EfficientDet: Scalable and Efficient Object Detection, Mingxing Tan, Ruoming Pang, Quoc V. Le, 2019
Questions
- Have you tried plotting OOD accuracy versus latency or GPU memory consumption while increasing various model parameters such as embed_dim, depth, MLP-ratio, number of heads, etc.? Or finding correlations between OOD accuracy and these parameters, taking into account their impact on latency and memory consumption?
- Have you tried to find optimal model scaling factors (similar to works [1, 2, 3]) to increase OOD accuracy in an optimal way with respect to latency or memory consumption?
Limitations
It might be worth adding to the limitations that the results obtained on ImageNet classification datasets may not correlate with results on real-life problems. ImageNet classification requires predicting the class of only one, usually large, object in a low-resolution image without predicting its location (box, polygon, or mask), whereas real problems usually require predicting the class and position of many objects, including very small ones in high-resolution images. Moreover, the ImageNet classification task is ambiguous: the class of which of the many objects in the image should be predicted? Thus, a continuation of this research direction may be finding correlations, NAS approaches, new network structures, their parameters, and scaling factors to achieve the highest OOD accuracy on tasks closer to real ones (such as dense prediction tasks: segmentation, detection, etc.) in a way that is optimal with respect to latency and memory consumption.
We appreciate the positive feedback and the valuable comments of the Reviewer.
Q1: You show that the embedding dimension has the highest impact among ViT architectural attributes, while network depth has only a slight impact on OoD generalization. However, the paper lacks a discussion that, although embed-dim (number of channels) correlates with OOD accuracy better than depth, this does not necessarily mean that only embed-dim should be increased to raise OOD accuracy, since the computational and memory costs of increasing embed-dim and depth may differ. Therefore, it may be optimal to increase the depth rather than embed-dim. There are also papers [1,2,3] that show that it is necessary to simultaneously increase several model parameters (depth, number of channels, resolution) in optimal proportions.
A1: Thank you for the insightful comments. Taking the memory constraint into account, we design an experiment based on human-designed ViT architectures that vary these attributes, to address the reviewer's concern. The results can be found in Table R.1 in the rebuttal PDF. They are generally consistent with our findings: increasing the embedding dimension is the most effective way, among ViT architectural attributes, to improve OoD generalization.
Optimally increasing multiple ViT structural attributes simultaneously is very interesting; we will study this direction in future work.
Q2: You present charts of OOD accuracy versus number of parameters (Fig. 4, K19 - K24), but it would also be great to present charts of OOD accuracy versus latency and OOD accuracy versus GPU memory consumption, because these three model characteristics are the most important for real-world problems.
A2: Following your suggestion, we provide charts of OOD accuracy versus latency and OOD accuracy versus GPU memory consumption in Figure R.3 and Figure R.4 in the rebuttal PDF.
Q3: Have you tried plotting OOD accuracy versus latency or GPU memory consumption while increasing various model parameters such as embed_dim, depth, MLP-ratio, number of heads, etc.? Or finding correlations between OOD accuracy and these parameters, taking into account their impact on latency and memory consumption?
A3: Following your suggestion, we provide charts plotting OOD accuracy versus FLOPs while increasing various ViT structural attributes in Figure R.2 in the rebuttal PDF.
Q4: Have you tried to find optimal model scaling factors (similar to works [1, 2, 3]) to increase OOD accuracy in an optimal way with respect to latency or memory consumption?
A4: We have not tried this setup yet. In this study, we focus on the impact of individual ViT structural attributes on OoD generalization. Optimal scaling of multiple ViT structural attributes is an interesting direction for our future research.
Q5: It might be worth adding to the limitations that the results obtained on ImageNet classification datasets may not correlate with results on real-life problems. ImageNet classification requires predicting the class of only one, usually large, object in a low-resolution image without predicting its location (box, polygon, or mask), whereas real problems usually require predicting the class and position of many objects, including very small ones in high-resolution images. Moreover, the ImageNet classification task is ambiguous: the class of which of the many objects in the image should be predicted? Thus, a continuation of this research direction may be finding correlations, NAS approaches, new network structures, their parameters, and scaling factors to achieve the highest OOD accuracy on tasks closer to real ones (such as dense prediction tasks: segmentation, detection, etc.) in a way that is optimal with respect to latency and memory consumption.
A5: We believe this is a common issue among most OoD generalization methods, which focus on image classification. We agree with the reviewer's perspective on this limitation and will add it to the limitations section in the revision. This is an interesting direction for our future work.
We sincerely hope that the reviewer could consider increasing the rating if our responses have addressed all the questions.
Thanks to the authors for the additional comparison results, which add confidence to the conclusions of the paper. It can be seen that increasing embed-dim increases OOD accuracy in an optimal way relative to the increase in inference latency.
Two minor remarks to the rebuttal:
- It would be better in Tables R.1 and R.2 to use GPU memory consumption instead of the number of parameters, which does not always correlate well with GPU memory consumption.
- In Figure R.2, it would be better to use larger ranges for MLP-Ratio and #Heads, so that curves are visible instead of points clustered in one place.
We sincerely appreciate the reviewer's positive feedback and the time taken to review our rebuttal. Thank you for your suggestions; we will update our paper in the revision.
This paper introduces OoD-ViT-NAS, a benchmark designed for evaluating ViT architectures' ability to generalize under Out-of-Distribution (OoD) shifts. This paper reveals that ViT design significantly impacts OoD generalization, In-Distribution (ID) accuracy does not reliably predict OoD performance, and simpler metrics like parameter count can sometimes predict OoD accuracy better than complex training-free NAS methods.
Strengths
This paper proposes a new benchmark for ViT NAS.
Weaknesses
- While the paper introduces a new benchmark, it doesn't sufficiently discuss the practical implications of the findings or potential applications. Including a section on how these insights can influence future ViT designs would add significant value. Furthermore, assessing whether ViTs designed based on these insights outperform those designed by humans would be meaningful to include in experiments and results.
- The analysis of the experimental results is somewhat superficial. It presents observations without delving into deeper discussions. There should be more profound insights and detailed discussions on why certain ViT architectural attributes perform better under OoD conditions. This could involve providing theoretical justifications or proposing hypotheses that could be explored in future research.
- The paper states that simple proxies like #Param and #Flop outperform more complex Training-free NAS methods but does not provide enough insight into why this might be the case. A deeper analysis of these findings, including possible reasons and implications, would strengthen the argument.
Questions
Please address the weaknesses mentioned above.
Limitations
No potential negative societal impact.
We appreciate the positive feedback and the valuable comments of the Reviewer.
Q1: While the paper introduces a new benchmark, it doesn't sufficiently discuss the practical implications of the findings or potential applications. Including a section on how these insights can influence future ViT designs would add significant value.
A1: Thank you for your suggestion. We clarify that, in addition to the new benchmark, our study produces significant insights for guiding the design of ViT architectures. Specifically, among ViT structural attributes, increasing the embedding dimension can generally improve the OoD generalization of ViT architectures. This insight leads to a simple method that can achieve ViTs outperforming well-established human-designed ViT architectures. Please refer to Q2 for the details.
Besides, our investigation also provides valuable ViT architectural guidelines for OoD generalization:
- First, departing from existing OoD generalization methods, our study is the first to show that ViT architecture design has a considerable influence on the OoD generalization of ViTs. This observation encourages future research to put more focus on ViT architecture research for OoD generalization.
- Second, our study suggests that current architectural insights based on ID accuracy might not translate well to OoD generalization.
- Third, when utilizing training-free NAS to search for an OoD-generalizable ViT, the state-of-the-art zero-cost proxy methods are ineffective. Instead, simple proxies like #Param or #FLOPs could be the best options available so far.
Q2: Furthermore, assessing whether ViTs designed based on these insights outperform those designed by humans would be meaningful to include in experiments and results.
A2: Thank you for the insightful feedback. Following the reviewer's suggestion, we include a comparison between ViT architectures designed based on our insights and well-established human-designed ViT architectures, as shown in Table R.2 in the rebuttal PDF. The results demonstrate that architectures based on our insights outperform those designed by humans.
For example, scaling up a ViT architecture by hand (e.g., from ViT-B-32 to ViT-L-32) typically involves compound scaling of multiple ViT structural attributes. However, our findings suggest that not all ViT structural attributes need to be increased to benefit OoD generalization. Among these attributes, increasing the embedding dimension is the most crucial factor for improving OoD generalization. By only increasing the embedding dimension (e.g., of ViT-B-32, as sketched below), our ViT architectures are significantly more efficient and outperform compound-scaled architectures (e.g., ViT-L-32).
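To illustrate the contrast (the standard ViT-B/32 and ViT-L/32 configurations follow Dosovitskiy et al., 2021; the embed-only variant is a hypothetical illustration of our insight, not the exact architecture reported in Table R.2):

```python
# Standard configs; the embed-only variant is hypothetical, for illustration only.
vit_b_32   = dict(embed_dim=768,  depth=12, num_heads=12, mlp_ratio=4)
vit_l_32   = dict(embed_dim=1024, depth=24, num_heads=16, mlp_ratio=4)  # compound scaling
embed_only = dict(vit_b_32, embed_dim=1024)  # scale only the embedding dimension
```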
Q3: Providing theoretical justifications or proposing hypotheses that could be explored in future research.
A3: Following the Reviewer's suggestion, in this rebuttal we provide an additional frequency analysis of our main finding, showing that increasing embedding dimensions helps ViT learn more High-Frequency Components (HFC), leading to better OoD generalization.
In particular, in the literature, a model achieving higher accuracy on the HFC of test samples indicates that the model has learned more HFC [c1, c4]. By learning more HFC, the model improves OoD generalization [c1, c2, c3].
Our hypothesis is that increasing the embedding dimension helps ViTs learn more HFC, resulting in improved OoD generalization. We strictly follow the experiment setups in [c3] for the HFC filtering process. As shown in Figure R.1 in the rebuttal PDF, when increasing embedding dimensions, the performance obtained on the HFC of test samples improves. According to [c1, c4], this observation supports that increasing the embedding dimension helps ViT learn more HFC, which leads to better OoD generalization following [c1, c2, c3].
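A minimal sketch of the HFC filtering step, in the spirit of [c3] (the radius r and preprocessing details here are assumptions, not our exact setup):

```python
import torch

def high_frequency_component(images: torch.Tensor, r: float) -> torch.Tensor:
    """Zero out all frequencies within radius r of the spectrum center,
    keeping only the high-frequency component. images: (N, C, H, W)."""
    freq = torch.fft.fftshift(torch.fft.fft2(images), dim=(-2, -1))
    _, _, h, w = images.shape
    ys = torch.arange(h, dtype=torch.float32) - h // 2
    xs = torch.arange(w, dtype=torch.float32) - w // 2
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")
    mask = ((yy ** 2 + xx ** 2).sqrt() > r).to(freq.dtype)  # keep high frequencies
    hfc = torch.fft.ifft2(torch.fft.ifftshift(freq * mask, dim=(-2, -1)))
    return hfc.real

# Each subnet's accuracy is then measured on high_frequency_component(x, r)
# for several radii r, as in Figure R.1.
```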
Q4: Simple proxies like #Param and #Flop outperform more complex Training-free NAS methods but does not provide enough insight into why this might be the case. A deeper analysis of these findings, including possible reasons and implications, would strengthen the argument.
A4: Thank you for your feedback. Our study, the first on NAS for ViT OoD generalization, initially shows that #Param or #Flop outperforms more complex training-free NAS methods. A possible reason is that existing training-free NAS methods are specifically designed for ID performance, not for OoD generalization. In future work, we aim to further understand this observation.
We sincerely hope that the reviewer could consider increasing the rating if our responses have addressed all the questions.
[c1] Bai et al. "Improving vision transformers by revisiting high-frequency components." ECCV 2022.
[c2] Gavrikov et al. "Can Biases in ImageNet Models Explain Generalization?." CVPR 2024.
[c3] Wang et al. "High-frequency component helps explain the generalization of convolutional neural networks." CVPR 2020.
[c4] Shao et al. "On the adversarial robustness of vision transformers." arXiv 2021.
Thank you for your responses and the latest results provided in A2.
A2 and A3 have largely addressed my concerns. A1 has partially addressed my concerns. Although I think the author's response in A1 is not perfect, I acknowledge that this is an open issue worthy of further discussion, so what the authors have done so far is acceptable.
One remaining concern is regarding #Param and #Flop in A4. From the results in Fig. 4 of the paper, it appears that the model's performance is generally monotonically related to #Param and #Flop. Does this mean that for this task, fine-grained architectural design might not be necessary, and that simply increasing the model size could improve performance? Besides, the conclusion that larger models perform better seems very intuitive, but aside from demonstrating that the architecture scales well, what further conclusions can be drawn from this? Additionally, this correlation trend is overall, but within certain #Param and #Flop ranges, there is some variance. So, when model size and computational cost are in certain ranges, does this #Param and #Flop-based approach become ineffective? Is it difficult to more accurately assess the performance differences between two models of similar size?
QA-Q1: A2 and A3 have largely addressed my concerns. A1 has partially addressed my concerns. Although I think the author's response in A1 is not perfect, I acknowledge that this is an open issue worthy of further discussion, so what the authors have done so far is acceptable.
QA-A1: We are deeply grateful to the reviewer for taking the time to review our rebuttal. We are glad that we have addressed the reviewer's concerns.
QA-Q2: One remaining concern is regarding #Param and #Flop in A4. From the results in Fig. 4 of the paper, it appears that the model's performance is generally monotonically related to #Param and #Flop. Does this mean that for this task, fine-grained architectural design might not be necessary, and that simply increasing the model size could improve performance?
QA-A2: Thank you for your comment. We note that the variance in Fig. 4, which the Reviewer also mentions, suggests that simply increasing the model size may not always improve performance; this is also supported by Tab. 2 in our main paper, showing that #Param and #Flop are not accurate training-free NAS proxies. Importantly, other factors need to be considered (e.g., fine-grained architecture design, as we investigate in our paper).
QA-Q3: Besides, the conclusion that larger models perform better seems very intuitive, but aside from demonstrating that the architecture scales well, what further conclusions can be drawn from this?
QA-A3: From Fig. 4, larger models perform better in general, but there is some variance (also mentioned by the Reviewer); thus, it is not necessarily true that scaling up an architecture yields better performance. Our analysis of architectural attributes suggests that architectures need to be scaled up carefully, e.g., with more focus on the embedding dimension, as our work points out.
QA-Q4: Additionally, this correlation trend is overall, but within certain #Param and #Flop ranges, there is some variance. So, when model size and computational cost are in certain ranges, does this #Param and #Flop-based approach become ineffective?
QA-A4: Yes, they do. As shown in Fig. 4, within certain ranges of #Param, some architectures outperform others. This suggests that #Param- and #Flop-based approaches become ineffective for predicting architecture performance within specific model size ranges. To further address the Reviewer's question, we analyze the correlation of #Param for models in certain ranges, with results presented in Tab. C1.a-b below. Specifically, to illustrate the effect of constraining #Param, we conduct a Kendall τ correlation analysis on our benchmark using architectures sampled from AutoFormer-Small, applying different constraints on the #Param range (a sketch of this computation follows Table C1.b).
Table C1.a. Kendall τ ranking correlation between the accuracies (ID and OoD) and the #Param proxy. To illustrate the constraint on #Param, we gradually reduce the upper bound of the #Param range.
| #Param range (M) | Correlation with ID Acc | Correlation with OoD Acc |
|---|---|---|
| 14-34 | 0.7885 | 0.5789 ± 0.2944 |
| 14-30 | 0.7851 | 0.5754 ± 0.2819 |
| 14-26 | 0.7633 | 0.5741 ± 0.2314 |
| 14-22 | 0.7728 | 0.5487 ± 0.2003 |
| 14-18 | 0.5692 | 0.3056 ± 0.1988 |
Table C1.b. Kendall τ ranking correlation between the accuracies (ID and OoD) and the #Param proxy. To illustrate the constraint on #Param, we investigate different ranges of #Param.
| #Param range (M) | Correlation with ID Acc | Correlation with OoD Acc |
|---|---|---|
| 14-34 | 0.7885 | 0.5789 ± 0.2944 |
| 30-34 | 0.1577 | 0.1419 ± 0.0818 |
| 26-30 | 0.4591 | 0.1998 ± 0.2906 |
| 22-26 | -0.0471 | 0.0513 ± 0.2215 |
| 18-22 | 0.7247 | 0.4036 ± 0.2225 |
| 14-18 | 0.5605 | 0.2955 ± 0.1945 |
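A minimal sketch of this range-constrained analysis, assuming a benchmark table with per-architecture #Param and accuracies (the column names are hypothetical):

```python
import pandas as pd
from scipy.stats import kendalltau

def constrained_kendall(df: pd.DataFrame, lo: float, hi: float):
    """Kendall tau between #Param and accuracy, restricted to a #Param range (M)."""
    sub = df[(df["params_m"] >= lo) & (df["params_m"] <= hi)]
    tau_id, _ = kendalltau(sub["params_m"], sub["id_acc"])
    tau_ood, _ = kendalltau(sub["params_m"], sub["ood_acc"])
    return tau_id, tau_ood

# e.g., compare the full range against a narrow slice of similar-size models:
# full = constrained_kendall(bench, 14, 34)
# slim = constrained_kendall(bench, 22, 26)
```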
QA-Q5: Is it difficult to more accurately assess the performance differences between two models of similar size?
QA-A5: Yes, this is challenging, but our work makes a contribution here. Specifically, by creating the OoD-ViT-NAS benchmark with fine-grained architectural differences and a range of ViT attributes, we enable the assessment of performance differences between two models of similar size. This is possible as hinted by Fig. 4: we can specify the range of models of similar size and examine these models' OoD accuracy.
We hope that the reviewer can consider increasing the rating if our response addresses your concerns.
Thank you for your response. I believe most of my concerns have been addressed. I also believe this paper may have a moderate-to-high impact on the field. Therefore, I would adjust my rating to a weak accept.
We are deeply grateful to the reviewer for the positive comments and the increased rating. We are glad that our response has addressed your concerns.
Once again, we thank the reviewer for their valuable time, which we appreciate a lot. If convenient, we would really appreciate it if the initial category assessments (Soundness, Presentation, Contribution) could be reviewed.
Thanks for the reminder. I've updated the scores accordingly.
This paper studies the out-of-distribution generalization of vision transformer architecture designs. Specifically, the paper studies how in-distribution accuracies relate to OOD accuracies for a set of 3000 architectures. In addition, the paper studies the correlation between different zero-cost proxies and OOD accuracies (on different variants of ImageNet), observing that simple metrics such as parameters or FLOPs are good proxies. Furthermore, the paper also studies the impact of different architectural decisions on OOD accuracies, discovering that embedding dimension is of utmost importance for OOD generalization. The authors release code and the raw results to reproduce their experiments on the AutoFormer-T, B, and S spaces.
Strengths
- The paper is the first one to study out-of-distribution generalization of architectures discovered by NAS
- The paper also studies zero cost proxies on the ViT space showing ineffectiveness of modern/newer ZCPs
- The paper is written in a clear and easy to understand manner
- Experiments on ZCPs are thorough and the results are insightful
Weaknesses
- Currently, I am missing the potential use-cases of the benchmark. The paper is more of a study than a method (and fits better in the datasets and benchmarks track of NeurIPS). Also, though the paper is posed as a benchmark, I miss the intended use-case of the benchmark. Do the authors intend on releasing surrogates based on proxies, to allow API-based querying of these metrics for any possible architecture?
- The authors only study 3000 architectures, which are too few to lead to generalizable insights.
- Analysis of architecture importance: The search space of AutoFormer is a bit biased in the sense that there is larger (factor of two) variability in embedding dimension and less variability in the number of layers. I wonder if this is the reason embedding dimension has a higher correlation with OOD accuracy compared to other architectural dimensions like the number of heads or number of layers?
- Multi-Objective Search: Since inheriting weights from a supernet and evaluating is fast, generally one would perform multi-objective random or evolutionary search directly on the pretrained supernet. This would lead to different Pareto fronts for different dataset distributions. Could the authors also present the results of multi-objective search directly on the OOD datasets?
- What is the cost/search-time difference in computing Pareto fronts based on MO-search versus zero-cost proxies (a lot of these proxies are actually not zero-cost)?
Questions
- Do the authors train their own supernet, or train a one-shot supernet from scratch?
- How does this benchmark contribute to aiding development of future NAS techniques for OOD generalization?
- Check weakness
Limitations
The authors discuss limitations and broader impact sufficiently.
We appreciate the positive feedback and valuable comments.
Q1: The paper is more of a study and not a method (and fits better in the datasets and benchmark track of NeurIPS).
A1: We justify that our work is suitable for the NeurIPS main track as we develop novel insights leading to a simple yet effective method to improve OoD generalization for ViT. Specifically, our insight that increasing embedding dimensions improves the OoD generalization of ViT leads to a simple yet effective method that can produce ViT architectures with high OoD performance. This method can achieve ViT architectures that outperform well-established human-designed ViTs (see Table R.2 in the rebuttal PDF).
We created the benchmark due to the lack of comprehensive data for our in-depth analysis of ViT architectural attributes for OoD generalization. This benchmark is also valuable for future research and other use-cases.
In contrast, papers in the benchmark and dataset track (e.g., [b2, b3]) usually merely construct datasets, propose benchmark frameworks for specific tasks, or discuss improvements in dataset development.
Q2: Also though the paper is posed as a benchmark, I miss the intended use-case of the benchmark.
A2: Our OoD-ViT-NAS serves as a comprehensive benchmark for training-free NAS, akin to other benchmarks [10, 11, 17, 61]. It allows the community to assess the effectiveness of new training-free NAS methods. Compared to other benchmarks, OoD-ViT-NAS emphasizes high-resolution inputs and ViT search spaces, addressing not only ID performance but also, for the first time, OoD generalization.
Additionally, our benchmark allows us to analyze and discover new and improved approaches for finding ViT architectures that improve OoD accuracy.
Q3: Do the authors intend on releasing surrogates based on proxies, to allow API based querying these metrics for any possible architecture?
A3: Thank you. We will consider your suggestion when releasing our benchmark publicly.
Q4: The authors only study 3000 architectures, which are too few to lead to generalizable insights.
A4: We study 3000 ViT architectures, significantly more than existing research on ViT architectures for OoD generalization (3 ViT architectures in [13], 10 in [14], and 22 in [15]).
Our architectures consider important ViT structural attributes for model capacities [6] and support a wide range of model sizes.
Q5: Is the bias in the AutoFormer search space the reason embedding dimension has a higher correlation with OOD accuracy compared to other architectural dimensions like the number of heads or number of layers?
A5: We argue that the bias of ViT structural attributes in Autoformer does not affect our main finding on embedding dimensions.
The Reviewer is correct that there is greater variability in embedding dimension than in other structural attributes within the AutoFormer search space, as we utilize pre-trained AutoFormer supernets to efficiently and effectively produce a large number of ViTs, enabling us to deeply study ViT architectural attributes for OoD generalization.
To address this bias, we conduct experiments on human-designed ViTs. To achieve large variability in depth, MLP ratio, and #Heads, we increase these attributes to the extent that the capacities of the architectures altered along these attributes exceed those of the architectures altered along the embedding dimension. As shown in Table R.1 in the rebuttal PDF, the results further confirm our finding that embedding dimension is the most important ViT structural attribute for OoD generalization.
Q6: Could the authors also present the results of multi-objective search directly on the OOD datasets?
A6: Following the reviewer's suggestion, we provide additional Evolutionary Search (ES) results directly on OoD datasets in Table 1B below.
To clarify, our main purpose is to produce a large number of ViTs with trained weights to analyze the impact of ViT structural attributes on OoD generalization. Therefore, we did not conduct multi-objective random or evolutionary search, whereas other NAS papers conduct such search algorithms to find the architecture with the best ID accuracy.
In response, we provide ES results directly on OoD datasets, conducting ES on the AutoFormer-Small search space. We strictly follow the ES setting in [6], except that we use an OoD dataset instead of the ID dataset during ES (a sketch of the loop follows Table 1B).
Table 1B. Our ES experiment: OoD accuracy of the architecture found when ES is guided by each metric.
| Metrics during ES | IN-R OoD Acc | IN-Sketch OoD Acc |
|---|---|---|
| OoD Acc | 32.09 | 33.81 |
| DSS | 29.85 | 30.44 |
| #Param | 31.65 | 33.28 |
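A minimal sketch of such a search loop, assuming hypothetical helpers sample_config, mutate, and evaluate (the latter scoring an inherited-weight subnet on the chosen OoD validation set):

```python
import random

def evolutionary_search(supernet, ood_loader, pop_size=50, iters=20, k=10):
    """Simple top-k evolutionary search over subnet configs, scored on OoD data."""
    population = [sample_config() for _ in range(pop_size)]  # random initialization
    scored = [(evaluate(supernet, cfg, ood_loader), cfg) for cfg in population]
    for _ in range(iters):
        scored.sort(key=lambda t: t[0], reverse=True)
        parents = [cfg for _, cfg in scored[:k]]             # keep the top-k configs
        children = [mutate(random.choice(parents)) for _ in range(pop_size - k)]
        scored = scored[:k] + [(evaluate(supernet, cfg, ood_loader), cfg)
                               for cfg in children]
    return max(scored, key=lambda t: t[0])[1]                # best config found
```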
Q7: What is the cost/search-time difference in computing Pareto fronts based on MO-search versus zero-cost proxies (a lot of these proxies are actually not zero-cost)?
A7: While the MO-search setup for AutoFormer in particular is faster than other setups such as [b1], evaluating each candidate still takes longer than computing the "Zero-Cost-Proxies" for that candidate.
We apologize for any confusion on the naming. To clarify, we consistently use the term “Zero-Cost-Proxies” as in previous works [10, 17, 60, 62].
Q8: Do the authors train their own supernet, or train a one-shot supernet from scratch?
A8: To clarify, we did not train the supernet ourselves; instead, we used pre-trained supernets from AutoFormer [6] to efficiently and effectively produce a large number of ViTs with trained weights.
Q9: How does this benchmark contribute to aiding development of future NAS techniques for OOD generalization?
A9: Please refer to A1
We hope the reviewer could consider increasing the rating if our responses have addressed all questions.
[b1] Real et al. "Regularized evolution for image classifier architecture search." AAAI 2019.
[b2] Zheng et al. "Graph robustness benchmark: Benchmarking the adversarial robustness of graph machine learning." NeurIPS Benchmark & Dataset 2021
[b3] Croce et al. "Robustbench: a standardized adversarial robustness benchmark." NeurIPS Benchmark & Dataset 2021
I thank the authors for their response. I am increasing my score to 6.
We sincerely appreciate Reviewer pt1z for the positive feedback and the increased rating.
Dear Reviewer pt1z,
We would like to summarize the contributions of our work following the discussion:
- Our goal is to conduct an in-depth investigation into the impact of ViT architecture on OoD generalization.
- To achieve this, we construct a comprehensive benchmark due to the lack of existing data that would allow such an in-depth analysis. This benchmark is not only crucial for our research but also valuable for future studies and other use cases. We are pleased that the discussion allowed us to clarify some potential applications of this benchmark.
- With this benchmark at hand, we conduct ViT architectural studies on OoD generalization through the lens of training-free NAS and ViT structural attributes.
- We are glad that you found our study of training-free NAS thorough and insightful. We also appreciate your comments on this study and were glad to provide additional results for evolutionary search directly on the OoD dataset.
- Our key insight, that increasing embedding dimensions enhances the OoD generalization of ViTs, leads to a simple yet effective method that can produce high-performing ViT architectures. This finding has been validated not only within our benchmark but also on established human-designed ViTs.
- Your comment regarding the bias in the AutoFormer search space for ViT structural attributes is greatly appreciated. During the discussion, we provided additional results to confirm that our findings are not affected by such bias.
Once again, we sincerely appreciate your valuable time, insightful comments, and for raising the overall rating. If convenient, we would be grateful if you could review the initial assessments of Soundness, Presentation, and Contribution.
Thank you very much.
This work presented OoD-ViT-NAS, a comprehensive NAS benchmark with a focus on out-of-distribution generalisation of vision Transformer architectures. The authors created a benchmark with 3000 diverse ViT architectures evaluated on 8 common large-scale OoD datasets, and provided the first comprehensive investigation of how ViT architectures affect OoD generalisation. They also conducted the first study exploring NAS for ViT's OoD generalisation, analysing the impact of individual ViT architectural attributes on OoD generalisation.
Strengths
This study introduces the first large-scale benchmark for ViT OoD generalisation, which could be a valuable resource to the NAS field.
The authors conducted a thorough investigation on the impact of ViT architectures in relation to OoD generalisation, covering multiple aspects such as architectural attributes, correlation between ID and OoD performance and the effectiveness of training-free NAS methods.
This study reported several findings, such as the low correlation between ID and OoD accuracy, the superiority of simple proxies, i.e., model params and FLOPs, over more computationally complex training-free metrics. Although the latter has been revealed in other work on computer vision tasks, the authors further validate that point on the OoD tasks.
The reproducibility of this work is good as the authors provided a substantial amount of details of their methodology, hyper-parameters and computational resources.
Weaknesses
This study is limited to the AutoFormer architecture space, which may limit the generalisability of the findings to other ViT architectures. It also compromises the novelty of this work.
Although the empirical studies are comprehensive, there are not many theoretical explanations for the observed phenomena. More analysis and insights should be provided to support this work.
The presentation can be improved. For example, Fig 1 appears on Page 2, but is first mentioned at Line 226, Page 7.
Questions
Line 119 shows that the work is based on Autoformer. What is the reason behind this choice? What is new?
Line 145, OoD Classification Accuracy: what is the difference from the ID Acc in Line 142? Maybe a formula could clearly explain that.
Line 154, we make use of the One-Shot NAS approach. Why not use zero-shot NAS instead? What are the benefits that one-shot NAS can bring but not zero-shot NAS?
Table 2 shows correlations of some zero-cost proxies. Can SWAP be applied here, as it demonstrates a much higher correlation than those in the table? SWAP-NAS: Sample-Wise Activation Patterns for Ultra-fast NAS. Peng et al., ICLR'24.
Limitations
The authors discussed the limitations of this work, but in the appendix instead of the main text.
We appreciate the positive feedback and the valuable comments of the Reviewer.
Q1: This study is limited to the Autoformer architecture space, which may limit the generalisability of the findings to other ViT architectures.
A1: AutoFormer is the most widely used search space in many recent ViT works [7, 65, 60, 10, 11, 17, 66, 67]. In addition, a comparison with other ViT search spaces is summarized in Table A1 below. In summary, AutoFormer allows us to effectively and efficiently pool a large number of ViTs for our OoD-ViT-NAS benchmark and enables us to deeply study ViT architectural attributes for OoD generalization.
Table A1. Comparison of AutoFormer with other ViT search spaces.
| Criteria | Autoformer [6] | S3 [16] | BossNAS [a1] | NAS-ViT [12] | PiT [a6] |
|---|---|---|---|---|---|
| One-shot NAS w/o the need of further finetuning/retraining | ✓ | ✓ | X | ✓ | X |
| Availability of range of ViT architectural attributes in the sampled ViT | ✓ | ✓ | X | X | ✓ |
| Accessible source code/models | ✓ | X | ✓ | ✓ | ✓ |
Our findings could be applicable to numerous ViT works utilizing this search space. Additionally, the Autoformer search space is accessible and comprehensive, including important ViT structural attributes for model capacities [6] and accommodating a wide range of model sizes.
Furthermore, we confirm our main findings on human-designed ViT architectures. As shown in Table R.1 in the rebuttal PDF, the observation is consistent with our findings in the main paper: embedding dimension is the most important ViT structural attribute for OoD generalization.
Q2: More analysis and insights should be provided to support this work.
A2: Following the Reviewer's suggestion, in this rebuttal we provide an additional frequency analysis of our main finding, showing that increasing embedding dimensions helps ViT learn more High-Frequency Components (HFC), leading to better OoD generalization.
In particular, in the literature, a model achieving higher accuracy on the HFC of test samples indicates that the model has learned more HFC [a2, a5]. By learning more HFC, the model improves OoD generalization [a2, a3, a4].
Our hypothesis is that increasing the embedding dimension helps ViTs learn more HFC, resulting in improved OoD generalization. We strictly follow the experiment setups in [a4] for the HFC filtering process. As shown in Figure R.1 in the rebuttal PDF, when increasing embedding dimensions, the performance obtained on the HFC of test samples improves. This observation holds across setups varying the radius r used when filtering HFC. According to [a2, a5], this observation supports that increasing the embedding dimension helps ViT learn more HFC, which leads to better OoD generalization following [a2, a3, a4].
Q3: Fig 1 appears on Page 2, but is first mentioned at Line 226, Page 7.
A3: We will improve the presentation as suggested in the revision.
Q4: What is the reason behind the choice of AutoFormer? What is new?
A4: Our reasons for selecting AutoFormer are that it is the most widely used search space [7, 65, 60, 10, 11, 17, 66, 67] and that it allows us to effectively and efficiently pool a large number of ViTs for our OoD-ViT-NAS benchmark and to study a range of ViT structural attributes for OoD generalization.
Please refer to Q1 for the details.
Q5: What is the difference between OoD Acc and ID Acc?
A5: The OoD Acc and ID Acc metrics are computed identically, differing only in the evaluation data (the ID dataset for ID Acc and an OoD dataset for OoD Acc).
Specifically, the classification accuracy (Acc) is the number of correct predictions divided by the total number of data points:

$$\mathrm{Acc}(f, \mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{i=1}^{|\mathcal{D}|} \mathbb{1}\left[ f(x_i) = y_i \right],$$

where $f$ is the classifier and $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{|\mathcal{D}|}$ is the evaluation dataset, with $x_i$ the input and $y_i$ the label of the $i$-th data point; $|\mathcal{D}|$ is the number of data points in $\mathcal{D}$. The difference between ID Acc and OoD Acc is simply the dataset $\mathcal{D}$ on which Acc is computed.
Q6: What are the benefits that one-shot NAS can bring, but not zero-shot NAS, for constructing OoD-ViT-NAS?
A6: One-Shot NAS allows us to efficiently produce a large number of ViTs (i.e., 3,000) with trained weights by sampling from the supernet [20, 6] (see the sketch below). In contrast, the Zero-Shot NAS approach does not provide ViTs with trained weights. Therefore, Zero-Shot NAS is unable to construct the benchmark necessary for our investigation.
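A minimal sketch of this pooling process (the attribute values are loosely based on the AutoFormer-Tiny space, and supernet.extract is a hypothetical API, not AutoFormer's actual interface):

```python
import random

# Illustrative per-attribute choices in an AutoFormer-like search space.
CHOICES = {
    "embed_dim": [192, 216, 240],
    "depth":     [12, 13, 14],
    "mlp_ratio": [3.5, 4.0],
    "num_heads": [3, 4],
}

def sample_subnet(supernet):
    """Pick one value per attribute and slice out the corresponding subnet,
    which inherits the supernet's trained weights (no retraining needed)."""
    cfg = {name: random.choice(values) for name, values in CHOICES.items()}
    subnet = supernet.extract(cfg)  # hypothetical weight-inheritance call
    return cfg, subnet

# Repeating this (with deduplication) pools thousands of trained subnets
# that can be evaluated directly on ID and OoD test sets.
```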
Q7: Results for the SWAP proxy.
A7: We provide additional results for SWAP. It achieves promising results for CNN search spaces. However, as shown in Table A2 below, when dealing with ViT-based search spaces, SWAP becomes less effective than ViT-design proxies and simple proxies such as #Params or FLOPs, which is consistent with our observations.
Table A2. Experimental results for the SWAP proxy.
| Training-free NAS | Originally proposed for (metric) | Originally proposed for (architecture) | Correlation with ID Acc | Correlation with OoD Acc |
|---|---|---|---|---|
| SWAP | ID Acc | CNNs | 0.2651 ± 0.3381 | 0.1201 ± 0.1069 |
Q8: The authors discussed the limitations of this work, but in the appendix instead of the main text.
A8: Thank you; we will move the limitations discussion to the main text if possible.
We sincerely hope that the reviewer could consider increasing the rating if our responses have addressed all the questions.
[a1] Li et al. "Bossnas: Exploring hybrid cnn-transformers with block-wisely self-supervised neural architecture search." ICCV 2021.
[a2] Bai et al. "Improving vision transformers by revisiting high-frequency components." ECCV 2022.
[a3] Gavrikov et al. "Can Biases in ImageNet Models Explain Generalization?." CVPR 2024.
[a4] Wang et al. "High-frequency component helps explain the generalization of convolutional neural networks." CVPR 2020.
[a5] Shao et al. "On the adversarial robustness of vision transformers." arXiv 2021.
[a6] Heo et al. "Rethinking spatial dimensions of vision transformers." ICCV 2021.
The score is increased to 7 as the rebuttal has addressed most of my concerns and questions.
We are deeply grateful to the reviewer for the positive comments and the increased rating. We are glad that our response has addressed your concerns.
We thank all reviewers for their valuable time and effort to review our work. We appreciate Reviewers’ kind comments, such as:
- "The authors conducted a thorough investigation on the impact of ViT architectures in relation to OoD generalisation, covering multiple aspects such as architectural attributes, correlation between ID and OoD performance and the effectiveness of training-free NAS methods" (Reviewer HQRy)
- "They explore various ways for increasing OOD accuracy, which is most important for real-world problems" (Reviewer f7Da)
- "The paper is the first one to study out-of-ditribution generalization of architectures discovered by NAS" (Reviewer pt1z)
- "Experiments on ZCPs (Zero-Cost Proxies) are thorough and the results are insightful" (Reviewer pt1z)
- "This study introduces the first large-scale benchmark for ViT OoD generalisation, which could be a valuable resource to the NAS field" (Reviewer HQRy)
- "OoD-ViT-NAS benchmark allows us to look for new and better approaches for finding new models structures to improve OOD accuracy" (Reviewer f7Da)
- "The reproducibility of this work is good as the authors provided a substantial amount of details of their methodology, hyper-parameters and computational resources" (Reviewer HQRy)
- "The paper is written in a clear and easy to understand manner" (Reviewer pt1z)
We would also like to express our appreciation to all the Reviewers for giving us the opportunity to clarify our work, as well as the constructive comments.
Based on Reviewers’ suggestions, in this rebuttal, we include:
- Additional results to further validate our findings on human-designed ViT setups
- Additional analysis to further understand why increasing ViT embedding dimension can improve OoD generalization
- Additional results on the SWAP zero-cost proxy
- Additional results on evolutionary search directly on OoD datasets
- Additional observations on Latency and Memory consumption
Importantly, additional results are consistent with our findings in the main paper.
In what follows, we provide comprehensive responses to all questions. We could provide more details if there are further questions. We hope that our responses can address the concerns.
Thank you, AC, for handling our paper. We thank the reviewers for all of their insightful comments and for taking the time to thoroughly review our paper. We would like to briefly summarize our work and the Author-Reviewer discussion.
Main contribution. In this work, we conduct an in-depth investigation into the impact of ViT architecture on OoD generalization. To achieve this, we first construct a comprehensive OoD-ViT-NAS benchmark covering 8 large-scale OoD datasets and 3,000 ViT architectures. This benchmark allows us to thoroughly study the impact of ViT architecture on OoD generalization, revealing that ViT architectures have a considerable impact on it. Then, we conduct ViT architectural studies on OoD generalization through the lens of (training-free) NAS and ViT structural attributes. Importantly, we discover that, among ViT structural attributes, increasing embedding dimensions improves OoD generalization, leading to a simple yet effective method that can produce ViT architectures with high OoD performance.
Reproducibility. Our submission includes the program code, OoD-ViT-NAS benchmark, and detailed instructions for reproducibility purposes. All of these will be released to the public to facilitate future OoD research.
Rebuttal. In total, we received four reviews of our work. We are particularly grateful for the reviewers' constructive feedback to further improve it. This feedback has led to:
- Additional experiments validating our finding on the embedding dimension's effectiveness for established human-designed ViTs, e.g., ViT [1] and Swin [3]
- Additional High-Frequency analysis to support our embedding dimension finding
- Additional latency and model size analysis
- Additional experiment on the comparison of Training-Free NAS and Evolutionary Search
Importantly, additional results consistently support our findings in the main paper.
Final Reviewers' Decision. After we submitted our rebuttal, we received positive feedback from all reviewers. In the end, we received ratings above the acceptance threshold across the board (i.e., 6, 6, 7, 8), including a "Strong Accept" rating (Reviewer f7Da) and feedback noting the moderate-to-high impact on the field (Reviewer Lm5n). This encouraging feedback motivates us to further explore this research direction.
Once again, we sincerely thank all reviewers and the AC for their valuable time on this work.
Best Regards.
While initial reviewer scores were mixed, the rebuttal helped clarify and alleviate most concerns. In the end, all reviewers suggest acceptance, and the AC and SAC see no reason to overrule their suggestions, hence we accept the paper.
However, we would like to remind the authors to follow through on their duty to include the clarifications and additional results from the rebuttal in the camera-ready release; they really did help improve the clarity of understanding and confidence in the method.