AdanCA: Neural Cellular Automata As Adaptors For More Robust Vision Transformer
Neural Cellular Automata can be inserted between the middle layers of Vision Transformers (ViTs) to improve ViTs' robustness on image classification.
Abstract
Reviews and Discussion
The work presents a neural network architecture that combines vision transformers and neural cellular automata. The work shows that this hybrid architecture performs competitively at the image classification task on a downsampled version of ImageNet (224x224). Furthermore, the results show a small improvement in performance against adversarial attacks and OOD tests relative to the chosen Vision Transformer baselines.
Strengths
- The paper makes use of a hybrid architecture (NCA and ViT), a combination that remains underexplored in the current machine learning literature.
- The presented model performs competitively against the baselines.
- The paper is clear, well written, and the results are well presented, with a thorough appendix detailing the training setup.
Weaknesses
- Readability of Table 1 could be improved: it is not indicated what the values for adversarial attacks and OOD inputs represent: accuracy? error? In some cases high numbers are bold, in others low numbers are bold.
- Please report whether the results are for the single best trained model.
- The argument put forward by the authors to motivate the AdaNCA architecture is its competitive performance; however, the magnitude of the performance improvement is limited, notably keeping in mind that the work is demonstrated on a downsampled version of ImageNet.
- Since the architecture is presented as plug-and-play, it's unclear why AdaNCA is not simply added to pre-trained ViT models. Having a simple way of improving the robustness of ViTs with a quick training of the NCA would make the model much more impactful and scalable.
Typos:
- Plug-in-play -> Plug-and-play
Questions
- How does AdaNCA differ from the Vision Transformer Cellular Automata (ViTCA) model from "Attention-based Neural Cellular Automata", Tesfaldet et al. 2022?
- What's the motivation for using channel-weighted convolution for the perception part of the NCA (what you call "dynamic interaction")? Wouldn't this be equivalent to the transformation that the update MLP does on concatenated channel vectors from standard depth-wise conv layers?
- Why train AdaNCA from scratch rather than adding it to a pretrained ViT model? Could you report performance on this?
- The paper conjectures that it is the stochasticity of the NCA that may be positively contributing to the AdaNCA performance for adversarial attacks. Have you tried introducing noise during training to the ViT baselines to compare performance?
- Line 163: "NCA typically queries the cell states at a random time step T". I don't understand what you mean by "NCA queries a cell state".
Limitations
- No code is provided.
We appreciate your valuable comments and positive feedback. We now address your questions below.
- Comparison with ViTCA
Please refer to the global rebuttal "AdaNCA and ViTCA".
- The difference between MLP processing concatenated channel vectors and Dynamic Interaction
Please see the global rebuttal "The usefulness of Dynamic Interaction".
- Integrating AdaNCA into pre-trained ViT models
Please see the global rebuttal "Integrating AdaNCA into pre-trained ViT models".
- Noisy training of baseline ViT models
We are keen to explore the possibilities of integrating NCA training strategies into the current ViT training pipeline, as we believe there are connections between the two model families (Section 3.1.1). Nevertheless, we want to point out that our focus is the integration of NCA and ViT. We made a first attempt at incorporating one training strategy, the stochastic update as you indicated, into current ViTs.
| | Clean Acc. (↑) | Attack Failure Rate (↑) |
|---|---|---|
| Swin-Tiny Baseline | 86.56 | 12.29 |
| Swin-Tiny StocU | 85.80 | 13.40 |
The stochastic update improves the model's robustness while undermining its clean performance. We assume this is because of the incompatibility between NCA, which uses a local recurrent interaction scheme, and ViT, which uses a global single-pass one. Despite this, it supports the notion that stochasticity during training indeed improves robustness.
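For concreteness, the sketch below shows one way such a stochastic update can be grafted onto a ViT block during training. The exact masking scheme used for Swin-Tiny StocU is not detailed in this thread, so the per-token Bernoulli mask and the function names are illustrative assumptions rather than our actual implementation.

```python
import torch

def stochastic_block_update(x, block, p=0.5, training=True):
    """NCA-style stochastic update applied to a transformer block (sketch).

    x: tokens of shape (batch, num_tokens, dim); `block` is any module returning
    a residual update of the same shape. During training, each token applies its
    update with probability p; at test time all tokens are updated synchronously.
    """
    delta = block(x)  # residual update proposed by the block
    if training:
        # per-token Bernoulli mask: 1 = apply the update, 0 = keep the old state
        mask = (torch.rand(x.shape[0], x.shape[1], 1, device=x.device) < p).float()
        return x + mask * delta
    return x + delta
```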
- Magnitude of performance
We wish to highlight that AdaNCA can contribute to over 10% absolute improvement in adversarial robustness, as shown in Table 1, Section 4 in our main manuscript, with a small cost. For instance, ConViT-B-AdaNCA uses merely 3% more parameters and 7% more FLOPS to achieve more than 10x improvement compared to the enlarged baseline model (* sign). More importantly, our method achieves on-par or better results than a recent advanced modification of ViTs, namely the TAPADL method. Therefore, AdaNCA provides a good trade-off between the performance and computational costs, facilitating its deployment.
- Clarification
-- Line 163, NCA queries cell states:
We are happy to clarify more details of the NCA evolution. Following Equation 6 in Section 3.1, NCA evolves in a recurrent manner for certain steps. Instead of evolving for a fixed number of steps, NCA typically runs for random steps, where the number is randomly selected from a range. After evolution, the cell states are used for downstream tasks. In our case, AdaNCA outputs activations for the next ViT layer. Hence, the usage of the cell states happens at a random step, and we refer to this process as NCA querying the cell states. We will improve the text and make it clearer.
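A minimal sketch of this evolve-then-query procedure is given below; `nca_step` is a placeholder for one recurrent AdaNCA update, and the step range of 2-3 follows the discussion of NCA steps elsewhere in this thread.

```python
import random

def evolve_and_query(nca_step, state, t_min=2, t_max=3):
    """Evolve cell states for a randomly sampled number of steps, then return
    ("query") the final states for the downstream task.

    nca_step: callable implementing one recurrent NCA update, state -> state.
    The sampled step count T plays the role of the random query time step.
    """
    T = random.randint(t_min, t_max)  # random number of recurrent steps
    for _ in range(T):
        state = nca_step(state)
    return state                      # cell states queried at step T
```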
-- Readability of Table 1:
Thank you for your suggestion. We will improve the readability of Table 1 by adding arrows indicating the direction of performance growth. We clarify that IM-C uses a different metric than the other benchmarks, which use accuracy, as stated in L257-258. It corresponds to a relative classification error compared to a pre-trained AlexNet, and thus lower is better.
-- Single Best Model:
Yes, our model performance is from a single best trained model, to facilitate the release of the pre-trained weights. The best models we obtained almost all fall within the last 10 epochs of training, with some exceptions in our ablation studies. All models are from the last 15 epochs of training.
Thank you for the clarifications and the work put on running the extra experiments.
Regarding the Dynamic Interaction, beyond the better results found empirically, I still fail to understand how that transformation is mathematically not equivalent to the MLP transformation on the concatenated channels—but this might be a limitation on my side.
I think your work makes for a nice contribution; I will update my rating accordingly.
Dear Reviewer ZhPF,
Thank you very much for your comment! We are really grateful for your endorsement! We will add a paragraph for the discussion about the comparison between our Dynamic Interaction and the MLP transformation on the concatenated channels in our revised manuscript. We will also include all the additional results in our paper. Thanks again for your valuable feedback!
Sincerely,
Authors
The authors propose the introduction of NCAs into Vision Transformers to improve robustness against adversarial inputs as well as out-of-distribution data. This improves the basic ViT architectures by up to 10% against specific adversarial attacks. In addition to this modification, they propose Dynamic Interaction, for faster communication between cells/tokens. As the choice of the right layer to insert the NCAs is not trivial, they introduce a method to find the most efficient position based on redundancy. The whole method has been evaluated against a good number of methods and on OOD data as well as adversarial inputs.
Strengths
- I highly appreciate the effort of integrating NCAs into existing architectures, to combine the established method with the specific traits that NCA can bring, such as the authors have shown: robustness.
- The authors show generally strong robustness improvements when comparing the "classical" ViT variants, RVT-B, FAN-B, Swin-B and ConViT-B, with their AdaNCA variants.
- The proposed method for determining the best position for inserting the NCAs is sound.
- The manuscript is generally well written and easy to follow. The appendix adds valuable extra information.
Weaknesses
- My major concern is the choice of baselines. While generally a good selection has been made, I strongly dislike the choice of removing the additional ADL loss (according to [22] the ADL loss is more important than TAP). The only logical explanation I could find for this choice is that it performs better than the NCA-based method.
- This is especially misleading since you call your baseline "TAPADL-RVT", but the "ADL" in this method stands for the loss.
- Considering that you do not use the "ADL" loss, Table 2 is particularly unfair, as these are OOD examples, something the loss particularly optimizes for. Especially considering that the difference is not very big even without it.
- Why does Table 2 once compare with the standard baseline and once with TAPADL?
- The argumentation in Appendix D for why the comparison with classical / concatenating NCAs is missing is unclear. Why would this require NCAs with 10 million parameters? In general, it is not clear why the NCAs are this big.
More minor remarks:
- "enables the modeling of global cell representations" in the abstract is confusing
- The choice of hyperparameters is unclear. Why, e.g., is the EMA different between AdaNCA setups?
Questions
- I very strongly encourage including the results for the original "TAPADL" [22], with the correct loss for training. I do not see it as a problem if you perform worse, which is likely the case, but it is absolutely necessary for completeness.
- Instead of making this misleading modification to the TAPADL, did you try to add the ADL loss to your modified model?
- How is it possible that the NCAs have such a huge number of parameters? One major advantage of them is that they do not need many parameters, so I'm curious why you chose to make them 1-2.5M parameters big.
- Why did you choose the number of steps for NCAs so low? It is very atypical for NCAs to run for 2-3 steps.
- It is not one of my major points, but can you call something an NCA if it runs for 2-3 steps?
- Your statement about the release of the code is unclear to me. Will you release it in case of acceptance?
Limitations
The authors have adequately addressed the limitations
We thank you for your valuable feedback on our experiments. We will improve the clarity of the sentence in the abstract. We now address your major concerns below.
- Choice of baselines
We fully agree with you that the original results reported in the TAPADL paper should be included for completeness, and we will add corresponding results to our manuscript. However, we'd like to clarify that we choose the original TAPADL models, instead of the ones without the ADL loss, and download them from the official code repository. We test them on our own machine, which might lead to different results from the original paper due to differences in hardware and platforms. Importantly, the test results of our own code match the ones produced by their official code, which are the results reported in our paper. We do not train the TAPADL models as our focus is on the improvement of AdaNCA-enhanced ViTs compared with the baselines. We include the results from the TAPADL models to showcase the strong robustness improvement that AdaNCA can achieve. In fact, our reported results match most of the reported ones in their main paper (citation [22] in our paper), with differences only in the IM-A results of both TAPADL models and the IM-R results of TAPADL-FAN. We assume these differences might originate from different data processing methods. However, we emphasize that we test our AdaNCA-enhanced models using the same setting as we test the TAPADL models, which follows the hyperparameters of the corresponding baseline models.
- Table 2 comparison with TAPADL
We'd like to point out that we do not have access to the official RVT model trained on ImageNet1K, hence we use the TAPADL model as a proxy in non-adversarial robustness evaluation. We state this compromise in Section C.8 in the Appendix. We will add this clarification to our main manuscript. In all other similar comparisons, as provided in Table 13 in the Appendix, we use the standard baselines.
- The usage of ADL loss
We clarify that we do not use ADL loss to train our models since we focus on our proposed architectural changes, as stated in L235-237.
- NCA parameters
We highlight that AdaNCA works with cells that have much higher dimensionalities. For example, NCA in texture synthesis typically adopts a cell dimensionality of 16, and ViTCA operates in 128-dim space. In ViTs, however, the cell dimensionality can be 768 or 1024, resulting in a large number of parameters in the MLP of NCA. We choose the hidden dimensionality of the MLP to be equal to the input dimensionality, as we do not want to have an information bottleneck (Section C.3 in the Appendix). Hence, an NCA working with a cell dimensionality of 1024 can have 1024 x 1024 x 2 ≈ 2M parameters in the MLP part. Moreover, we insert 2-3 AdaNCA models into ViT to achieve a good balance between the computational costs and performance improvement. Therefore, the final increase in parameters can fall in the range of 1-2.5M, as you indicated. Regarding the concatenation scheme, NCA parameters can exceed 10M, as shown in the global rebuttal "The usefulness of Dynamic Interaction". This is because the input dimensionality to the MLP becomes 4x larger if we use 4 kernels for interaction. Considering the above example where the cell dimensionality is 1024, the weight tensor of the first MLP layer becomes 4096x4096 if avoiding the information bottleneck (Section C.3, Appendix), which is already ~16M parameters. We believe more efficient instantiations of the Update stage in NCA are worth exploring, and we are happy to add a paragraph discussing this future direction in our paper.
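As a back-of-the-envelope illustration of these counts (assuming a two-layer MLP without biases, the hidden widths described above, and K=4 interaction kernels, which matches the kernel count mentioned later in this rebuttal):

```python
def mlp_params(in_dim, hidden_dim, out_dim):
    # two-layer MLP, bias terms omitted for simplicity
    return in_dim * hidden_dim + hidden_dim * out_dim

C, K = 1024, 4  # cell (token) dimensionality and number of interaction kernels

# Dynamic Interaction: the K filter responses are fused back to a C-dim vector
# before the update MLP, so the MLP stays C -> C -> C.
dyn = mlp_params(C, C, C)             # ~2.1M parameters

# Concatenation scheme: the K responses are stacked, so the MLP input is K*C,
# and avoiding an information bottleneck makes the hidden layer K*C as well.
concat = mlp_params(K * C, K * C, C)  # ~21M parameters; first layer alone ~16.8M

print(f"Dynamic Interaction MLP: {dyn / 1e6:.1f}M, concatenation MLP: {concat / 1e6:.1f}M")
```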
- NCA steps
We agree that our choice of NCA steps differs from all previous NCA models. We make this compromise as the computational costs increase linearly as the NCA step grows. We aim to minimize the increase in the number of parameters and FLOPS for scalability and we do not want the source of improvement to merely stem from the increase in the size and computation of the models. We show that more NCA steps will indeed contribute to the model's performance, as shown in the following table, while it introduces extra FLOPS. The model setting is the same as in our ablation study in Section 4.3.
| | # Params (M) | FLOPS (G) | Clean Acc. (↑) | Attack Failure Rate (↑) |
|---|---|---|---|---|
| Step=4 | 27.94 | 4.7 | 87.18 | 22.35 |
| Step=5 | 27.94 | 4.8 | 87.22 | 23.46 |
We want to underscore that it is the architecture and evolution scheme that define an NCA, as introduced in Section 3.1, rather than the number of recurrent steps. Admittedly, fewer steps lead to a coarser path to the target state and can potentially undermine the model's performance. Notably, with our scheme, AdaNCA has achieved generally strong robustness improvements compared to the baselines, indicating that our choice of the number of recurrent steps is a good balance between computational costs and performance.
- Choice of hyperparameters
Our choice of training hyperparameters follows the settings of the corresponding baseline models, as listed in Section C.6, L718-719. Hence, the differences in their hyperparameters result in the differences in our experiments. We want to highlight that AdaNCA can adapt to different training hyperparameters, e.g., learning rates, EMA (on/off, different decay rates), batch size, etc., which indicates the strong adaptability of AdaNCA across training settings as well as its promise for combination with any ViT model.
- Code release
Yes, we will release cleaned and well-documented code in case of acceptance, as well as the pre-trained models.
Thank you for providing additional clarifications and results.
Choice of Baselines / Use of ADL Loss:
Initially, it was unclear that the baseline models used were pretrained. This led me to interpret the phrase as suggesting that you retrained these models without the ADL loss and compared them against your approach. This would have effectively compared against TAP rather than TAPADL. The phrase in question is: "Note that the SOTA method involves training with an additional loss (ADL) while we do not incorporate it in our training, since our focus is on the effect of architectural changes." Please revise this statement to explicitly clarify that the baseline models used were pretrained, as this is crucial for accurately understanding the comparison.
In my opinion, for the sake of completeness, you should have included an experiment in which you test your method, including ADL loss.
NCA Steps and the Impact of Dropout:
After reviewing the results with more NCA steps, I have the impression that the observed benefits are primarily due to the use of dropouts during training rather than the NCA itself. It would have strengthened your argument for using NCAs to isolate this effect.
Although I still have some reservations about whether NCAs have been utilized to their full potential in this work, I no longer have any major concerns. I will therefore increase my rating by 1.
Dear Reviewer oE4b,
Thank you very much for your reply! We greatly appreciate your endorsement! We will modify the statement according to your comments to clarify our comparison strategy and will include the results from the original TAPADL paper for completeness.
Moreover, we will surely consider running experiments on ADL loss with AdaNCA-enhanced models. However, due to the large amount of hardware resource requirements and time consumption, we cannot present the results during the rebuttal and discussion period. We will try to include the results in our future revisions. Nevertheless, we fully agree that it is worth exploring and are confident in its effectiveness.
Thanks again for your constructive comments!
Sincerely,
Authors
This paper introduces Adaptor Neural Cellular Automata (AdaNCA), a plug-and-play module designed to enhance the robustness and performance of Vision Transformers (ViTs). The innovation lies in integrating Neural Cellular Automata (NCA) as intermediary adaptors between the layers of ViTs. The paper demonstrates that AdaNCA can significantly improve the robustness of ViTs against adversarial samples and out-of-distribution inputs. The authors also propose a Dynamic Interaction mechanism to reduce computational costs and provide an algorithm to optimize AdaNCA placement within ViTs.
Strengths
- The integration of NCA into ViTs is a novel approach that addresses the robustness issue prevalent in current ViT architectures.
- AdaNCA shows improvement in robustness with only a small increase in parameters
- The results show improvements across 8 robustness benchmarks and 4 different ViT architectures.
- AdaNCA demonstrates improvement in accuracy under adversarial attacks on the ImageNet1K benchmark
Weaknesses
- The experiments are primarily conducted on image classification tasks. It would be ideal to see how AdaNCA performs in other domains or tasks, such as object detection or segmentation, to assess its generalizability.
- The scalability of AdaNCA to very large-scale datasets and higher-resolution images is not yet studied.
Questions
- Would be good to elaborate a bit more on how the Dynamic Interaction works and why it is novel
- How does the developing pattern of the NCA look over time? Are differences visible between a version with and without dynamic interaction?
Limitations
Yes.
We thank you for your positive feedback on our work. We now address your questions.
- The generalizability and scalability of AdaNCA
We fully agree with you that applying AdaNCA in other computer vision tasks and scaling it to larger datasets and higher-resolution images are worthwhile. Given the fact that AdaNCA can scale up to much larger sizes than all previous NCA models and its effectiveness in large-scale image classification tasks, we are confident in AdaNCA's competitiveness in tasks such as robust semantic segmentation. In fact, we are actively looking into robustness benchmarks such as CityScape-C [1] and ACDC [2] and we are keen to explore these problems in the future.
- Elaboration of Dynamic Interaction
We are happy to explain our proposed Dynamic Interaction in more detail. Assume we have a token state of shape H x W x C. After each C-dim token interacts with its neighbors using K different depth-wise convolutions, we have K tensors with the same shape as the input token state. Our Dynamic Interaction first computes a token-wise scalar weight for each interaction result using a two-layer CNN; hence, the output of the CNN is of shape H x W x K. The input of the CNN is the H x W x C token map. The CNN consists of two 3x3 convolution layers with a batch normalization layer in the middle. The interaction results are then combined using the right-hand side of Equation 8, in which a weighted sum is performed over the interaction results. Each token has a different set of weights for aggregating the interaction results. Hence, the tokens dynamically adjust the weights of the combination of the interaction results based on both their own state and the states of their neighbors, which we term Dynamic Interaction. We will improve the text of Section 3.2.1 to make it clearer.
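A minimal PyTorch-style sketch of this computation is given below. It is an illustration rather than our exact implementation: the hidden width of the two-layer CNN and the softmax normalization of the per-token weights are assumptions, and the depth-wise kernels stand in for the perception filters of Equation 8.

```python
import torch
import torch.nn as nn

class DynamicInteraction(nn.Module):
    """Sketch of Dynamic Interaction: K depth-wise neighbor interactions,
    aggregated per token with weights predicted by a small two-layer CNN."""

    def __init__(self, dim, num_kernels=4, kernel_size=3):
        super().__init__()
        self.num_kernels = num_kernels
        # K depth-wise convolutions gathering neighborhood information
        self.perceive = nn.ModuleList([
            nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
            for _ in range(num_kernels)
        ])
        # two 3x3 conv layers with batch norm in the middle, producing one
        # scalar weight per token per kernel (hidden width is an assumption)
        self.weight_net = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1),
            nn.BatchNorm2d(dim),
            nn.Conv2d(dim, num_kernels, 3, padding=1),
        )

    def forward(self, s):                                 # s: (B, C, H, W) token map
        responses = [conv(s) for conv in self.perceive]   # K x (B, C, H, W)
        # softmax normalization over the K weights is an assumption
        w = self.weight_net(s).softmax(dim=1)             # (B, K, H, W)
        out = sum(w[:, k:k + 1] * responses[k] for k in range(self.num_kernels))
        return out                                        # (B, C, H, W), fed to the update MLP
```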
- Novelty of Dynamic Interaction
Please see the global rebuttal ”The usefulness of Dynamic Interaction”.
- Developing patterns of AdaNCA
Please refer to Figure 1 in the R-PDF.
[1] Michaelis, Claudio, et al. "Benchmarking robustness in object detection: Autonomous driving when winter is coming." arXiv preprint arXiv:1907.07484 (2019).
[2] Sakaridis, Christos, Dengxin Dai, and Luc Van Gool. "ACDC: The adverse conditions dataset with correspondences for semantic driving scene understanding." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
Thank you for the clarifications!
Dear Reviewer 3x5z,
Thank you again for your constructive feedback! Your endorsement is highly appreciated!
Sincerely,
Authors
This paper proposes a strategy for improving the image classification robustness of Vision Transformers (ViT) through the use of specialized networks that are inserted at strategically placed layers within the ViT model. These networks are called Adapter Neural Cellular Automata (AdaNCA) and are intentionally chosen due to the proven robustness characteristics of NCA on various tasks such as image generation and classification. Although there has already been a connection made between NCA and ViT models (ViTCA from Tesfaldet et al.), AdaNCA differentiates itself by not trying to be a ViT in and of itself (as ViTCA does), but by acting as an adapter that can be placed at various layers within a ViT to improve its robustness. Through exhaustive experimentation, the authors prove the viability of using NCA in a much larger scale setting than ever before (as far as I know) through this adapter-type approach, thus providing a new pathway for the NCA community to consider when it comes to practical applications. Another difference between ViTCA and AdaNCA worth mentioning is the downstream task at hand: small-scale image denoising vs. large(r)-scale image classification.
A quick summary of what NCA are: they're a fairly recent (circa 2019-20) computational paradigm that build upon the much older model of Cellular Automata (CA). Both NCA and CA consist of a connected lattice of stateful cells whose states are recurrently updated through the repeated application of an update rule. The update rule consists of two stages: an interaction stage, where for each cell, information is gathered from its neighbouring cells; and an update stage, where this information is processed to produce a cell update, which can be applied in a residual manner or be directly treated as the new cell state. CA use a handcrafted update rule, with a popular one being Conway's Game of Life, while NCA use a learned update rule in the form of a neural net with convolutions in the interaction stage and an MLP in the update stage. The NCA update rule is trained via some downstream task, where the cell grid / lattice is evaluated against some target state after a certain number of cell updates. Due to the way NCA are trained (stochastic application of cell updates, repeated evaluation against target state at various points of cell lifetime via pool-based training, hidden states to facilitate cell message passing, etc.), they end up being fairly robust models, able to correct themselves in the presence of adversarial attacks (e.g., various types of structured or unstructured noise) and adapt to OOD situations.
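To make the above concrete, here is a bare-bones PyTorch-style sketch of such an update rule; all sizes, the learned perception filters, and the fire rate are illustrative rather than taken from any particular paper.

```python
import torch
import torch.nn as nn

class MinimalNCA(nn.Module):
    """Bare-bones NCA in the style summarized above.

    Interaction stage: depth-wise convolutions gather neighborhood information.
    Update stage: a per-cell MLP (1x1 convs) maps the perception vector to a
    residual update, applied stochastically with probability `fire_rate`.
    """
    def __init__(self, channels=16, hidden=128, fire_rate=0.5):
        super().__init__()
        self.fire_rate = fire_rate
        # three depth-wise filters per channel (classic NCA fix these to
        # identity/Sobel kernels; here they are simply learned for brevity)
        self.perceive = nn.Conv2d(channels, channels * 3, 3, padding=1,
                                  groups=channels, bias=False)
        self.update = nn.Sequential(
            nn.Conv2d(channels * 3, hidden, 1), nn.ReLU(),
            nn.Conv2d(hidden, channels, 1, bias=False),
        )

    def forward(self, state, steps=8):
        for _ in range(steps):
            dx = self.update(self.perceive(state))
            # stochastic mask: only a subset of cells fires at each step
            mask = (torch.rand_like(state[:, :1]) < self.fire_rate).float()
            state = state + mask * dx          # asynchronous residual update
        return state
```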
The three main contributions of this paper are as follows:
- Presenting AdaNCA, a NCA-based small model that is able to be inserted at various points within a ViT to improve its robustness on image classification. On average, the classification accuracy gains on clean, noisy, and OOD data outweigh the added parameter and computational cost in a non-negligible (statistically significant) manner.
- Introducing a new type of cell interaction strategy called "Dynamic Interaction" (and a Multi-Scale variant) that uses far fewer parameters and FLOPS when compared to a vanilla concatenation approach (typically used by previous NCA), while being more parameter efficient than a simpler summation-based approach. This is the part where you combine the information from the various filter responses in the interaction stage before providing it to the MLP that performs the update stage.
- Proposing an effective strategy for picking the best ViT layers for inserting AdaNCA. This strategy is backed by an exhaustive analysis that proves the strategy's effectiveness in maximizing the robustness gains of inserting AdaNCA within a ViT. In short, they show that AdaNCA is best inserted between sets of layers where each set consists of highly redundant layers. Basically, AdaNCA helps information flow between two sets of layers where the two sets are not redundant from one another.
There are some smaller things I'm leaving out of this summary due to how extensive the paper is in some areas, but this is the overall approach.
优点
- Originality:
- The authors embark on the challenging task of proving the feasibility of NCA to a crowd where many don't believe in their usefulness and where many more have not even heard of NCA. Instead of pushing tired areas of machine learning research, these authors have taken a risk in pursuing a niche and not-yet-proven area of machine learning research and delivering on showing promising results on practical applications that the community typically cares for. In relation to other NCA research, the authors take the first meaningful step in applying them to a much larger scale setting than before (image classification on ImageNet1K and various other versions of it).
- Several advancements are proposed: a Dynamic Interaction strategy that's quite original and seemingly effective (although I do have some gripes on how effective it really is), a dynamic programming algorithm for choosing the best layers to insert AdaNCA, new metrics (Set Cohesion Index, among other related ones), and the idea of using an NCA as an adapter to improve ViT robustness.
- Quality:
- The submission is technically sound for the most part. Most claims are well supported by exhaustive experimentation and analyses, all appropriately chosen to further their narrative.
- The authors are honest about evaluating both the strengths and weaknesses of their approach. Although there are areas I have gripes with (which I'll mention in the Questions and/or Weaknesses sections).
- Clarity:
- For the most part, this is a well-written paper. Clear, concise, detailed, and thoughtful. The authors clearly put in a lot of time and effort in trying to leave no stone unturned and I commend them for that. The main manuscript and the appendix are not only exhaustive in trying to prove the benefits of their model, but the experiments chosen, the experimental details they listed, and the commentary provided are all indicative of a thoughtful team that really cares to push this area of research forward.
- The appendix was clearly not a throwaway. I'd like to commend the authors for putting the same effort into the appendix as they did into the main manuscript. Like the main manuscript, it was clearly and concisely written. Great work.
- Significance:
- Are the results important?
- Yes. These results have given me a new perspective on how to apply NCA and I think the community at large would also appreciate these results. I believe this will convince some researchers, especially those specializing on model robustness, to pay closer attention to NCA due to the favourable cost-benefit tradeoff of using an NCA for improving the robustness of a ViT on image classification.
- However, I absolutely do want the authors to answer my questions provided in the Questions section below. Especially the part on the usefulness of Dynamic Interaction and its multi-scale counterpart and if it's possible to use AdaNCA with a mostly frozen and pretrained ViT.
- Are others (researchers or practitioners) likely to use the ideas or build on them?
- I am confident that this work will see adaptations of it in the near future, although I do have concerns regarding Dynamic Interaction, Multi-scale Dynamic Interaction, and if it's required to jointly train AdaNCA with a ViT from scratch. These concerns are detailed in the Weaknesses and Questions sections below.
- Does the submission address a difficult task in a better way than previous work?
- Yes. They propose an alternative and effective approach to utilizing NCA for improving robustness on downstream tasks (specifically on image classification, but I'm sure this can be applied on other tasks).
- Does it advance the state of the art in a demonstrable way?
- Although I'm not sure if it advances the state of the art, as the authors don't make any clear indication that it does, they do demonstrate that it improves many ViT-based models on image classification, particularly on noisy or OOD input. One of the models they apply AdaNCA to is amongst the state of the art (state of the art rapidly changes these days anyways), so I'm convinced this approach would also improve whatever the actual state of the art ViT-based model is on image classification.
- Does it provide unique data, unique conclusions about existing data, or a unique theoretical or experimental approach?
- As far as I can tell, yes.
I just want to say that my favourite part of the experimentation in the main manuscript and the appendix was the layer similarity analyses. Everything from the motivation for focusing on layer similarity, to introducing the Set Cohesion Index, to the figures showing the similarity structure before and after AdaNCA is applied, to the plots and commentary proving a relationship between layer redundancy and robustness, to the dynamic programming algorithm for placing AdaNCA in the best spot based on this proven relationship, and to comparing it with a no-prior approach, were all well done. Also, I'd like to commend the authors for providing the following:
- Providing detailed training hyperparameter information in the appendix (Table 4, 5, 6, 8, 9, 10).
- Providing detailed information of datasets in the appendix (C.14).
- Providing model and code license information in the appendix (C.15).
Weaknesses
- Originality:
- Citation and comparison with related works needs some improvement. My concern here is elaborated upon in the Questions section below. Fortunately for the authors, I think these are easily addressable.
- Quality:
- Some incorrect / incomplete assumptions were made in various parts of the paper, particularly on parts commenting on where NCA get their robustness from. This concern of mine is detailed in the Questions section below.
- The dropout-like compensation technique shown in Eq. 7 was not ablated and thus was not sufficiently motivated. I go into more detail about my concerns about this in the Questions section below.
- The Dynamic Interaction strategy was not sufficiently motivated. It should be compared against the typical NCA approach of concatenation. Comparing Dynamic Interaction against a simple summation does not motivate the use of Dynamic Interaction over concatenation.
- The Multi-scale Dynamic Interaction strategy was not sufficiently motivated as it hurt performance at three scales, which indicates a potential fundamental flaw in the implementation. The authors provide a brief and insufficient explanation as to why it may be failing at three scales. Some more digging would be helpful here, as multi-scale approaches in neural nets typically go beyond two scales.
- Clarity:
- All of my concerns regarding clarity are listed below in the Questions section. In short, there are some issues regarding redundancy and lack of explanation in some of the figures, tables, and equations; there's a lack of explanation of how NCA are typically trained; and there's a lack of commentary on the ablation study, making it difficult to appreciate the ablation results shown in Table 3.
- Significance:
- The Dynamic Interaction strategy was not compared against the typical NCA approach of concatenation, and so its significance as a technique to increase NCA interaction stage efficiency is unclear. Furthermore, it could have been compared against other similar approaches instead of against a simple summation approach (which would obviously be worse). For example, comparing it against a 1x1 conv applied on depthwise conv outputs or comparing it against using a non-depthwise conv.
- The Multi-Scale Dynamic Interaction strategy hurt performance at three scales compared to a single-scale Dynamic Interaction, which limits the significance of Multi-Scale Dynamic Interaction. Theoretically, W_Ms should be able to zero out / ignore filter responses from useless scales, so why is it hurting performance when three scales are used?
- It seems limiting to have to train an entire ViT alongside AdaNCA from scratch. To really push the adapter narrative, I would suggest adapting AdaNCA to a frozen or partly-frozen ViT. I've suggested an experiment in the Questions section below.
Questions
- L40-41: This statement is missing one crucial piece. It's not only the stochasticity and modulation of local information that make NCA robust against noisy input, it's also the recurrent steps within a single training step that make them robust (coupled with the pool-based training that extends a cell grid's lifetime to many training steps). The recurrent steps allow the cells to explore a wide variety of states, many of them perturbed by the model itself (especially early on during training when the model is fairly untrained). Tab 3 in your paper also proves this, with the ablation of "Recur" causing the biggest drop in attack failure rate compared to the other ablations. So my suggestion is to state the importance of recurrent updates. As a bonus, here's a paper I suggest taking a look at [1]. You'll be surprised to see the similarities between training an NCA (Mordvintsev-style) and training an Energy-based Model (EBM) using the technique in [1]. Just like the pool-based training technique that Mordvintsev et al. used, they use a sample replay buffer here. It shows how crucial the exploration and recurrence are for having a robust model. [1] Du, Y., Mordatch, I. Implicit Generation and Generalization in Energy-Based Models. In NeurIPS 2019.
- For the introduction (L26) and related works (L101-103, L89), I'd like to see a mention of how ViTCA (from Tesfaldet et al.) improves the robustness of ViT architectures. I understand that a brief mention of ViTCA was given in L111-113 but nothing about how it achieved a more robust ViT model (albeit with its own unique limitations). On that note, I find it a bit surprising that nothing is said about ViTCA under the context of ViT and robustness and how that compares with AdaNCA and its approach to ViT robustness. Both contribute to a more robust ViT but in different ways. I'd like to see a deeper comparison made between ViTCA and AdaNCA as well as an acknowledgement of ViTCA already having improved ViT robustness using an NCA (albeit with a different downstream task in mind).
- L107: You forgot to cite Gilpin [2]! If I recall correctly, he was the first to realize CA under conv nets. I may be wrong, but either way, I think it's necessary to cite and comment on his work if you're bringing up Mordvintsev et al ([43] in the paper). [2] Gilpin, W. Cellular Automata as Convolutional Neural Networks. In Physical Review E 2019.
- Eq 3: Please put a quick explanation of the depthwise convolution operator as you did with the channel-wise concat operator in L142. Something like "where (*) is the depthwise convolution operation." It's just to cover your bases in terms of equation clarity.
- Eq 4 and L142-143: I would suggest clarifying how S_out is used to update cells. "where cells get the updated states" is not clear. Perhaps you may want to show that S_out can be used as a residual to update cell state S, or that S_out is itself the new cell state. Either way, it's important to give this extra bit of clarity to the reader.
- The Neural Cellular Automata paragraph in Section 3: This section is great for the most part, but I would definitely like to see an explanation of how the NCA update rule is typically trained. So far, only the rough architectural design and computations have been explained, but not how to train them, which is incredibly important. For example, perhaps you can mention at the end that after a certain number of recurrent steps (within a train step), a subset of each cell's state may be evaluated against some downstream task's loss function, or something to that effect. You can also make it clear that one may not necessarily need to use all parts of a cell's state, briefly mentioning the hidden channels that Mordvintsev et al (and many other NCA) use, which has been shown to be extremely helpful in facilitating cell communication.
- Related to the above point, I would recommend clarifying some crucial differences between AdaNCA and other "more traditional" NCA that more closely follow Mordvintsev et al's approach. In other words, I would suggest talking about why there are no hidden channels in AdaNCA, what the initial cell states are during training (i.e., Mordvintsev et al. and other NCA approaches start with either a constant or randomly initialized cell grid, whereas here with AdaNCA it seems like the initial cell grid state is just the activations from the layer it's inserted after—correct me if I'm wrong), and why there's seemingly no pool-based training.
- L158-162: "In previous NCA works, stochasticity is maintained during testing." ViTCA showed that this was not necessary and that it was possible to train with stochastic updates and test with synchronous updates. I would recommend mentioning this and contrasting it with how you've approached test-time synchronous updates. In particular, I'd mention how ViTCA does not use the dropout-style approach and merely switches to synchronous with no compensation.
- On this note, I would like to see an ablation between using the dropout-style compensation vs not using it. Can you quickly run an experiment to compare the two settings? Basically, remove the 1/p in "Train" in Eq 7 and see how it compares with having 1/p. I'm very curious to see how useful this is.
- L159-160 states "Such a scheme is problematic in discriminative tasks since reliable numerical outputs are essential." Isn't the whole point of training the NCA update rule to be agnostic to asynchronous cell updates is so that they can be robust to such a scenario during testing? In other words, it's been shown that one can achieve more or less the same results in an asynchronous testing setting by merely updating the cells for more iterations to compensate for those that are behind. The texture synthesis work from Niklasson et al. (Self-Organizing Textures, citation [45] in the paper), showed this in their time synchronization experiment. ViTCA from Tesfaldet et al. showed this in their ablation study, where the results were more or less the same at 50% update rate compared to 100% if you doubled the number of cell updates. Self-Classifying MNIST Digits from Randazzo et al. also showed that stochastic updates are no problem when it comes to reliable discriminative outputs. So I'm wondering why you say stochasticity during testing is problematic for discriminative tasks when other works have shown that not to be the case.
- L165-166: "Hence, the trained model can effectively handle the variability and unpredictability of the input, thus being robust against noisy input." I would suggest rewording this sentence. Specifically, I would highlight that it's not just the randomness of when cells get updated that contributes to their robustness, but also that the recurrent updates coupled with the stochastic masking allows the update rule to explore many cell states, teaching it how to recover from such states and move cells towards a good target state. It's also important to note the update rule's own influence on the perturbation of cells early in training, where the update rule has not been sufficiently trained and so acts as a source of noise to itself. So in essence, you're also training the update rule to be robust against perturbations caused by itself.
- Fig 2 & 3: I would suggest either removing (b) from Fig 2 and having Fig 3 explain the architectural details, or removing Fig 3 and providing the architectural details in Fig 2. You'll save a lot of space as there's a lot of redundant information between the two figures. Also, I would put a more detailed explanation of the architecture rather than "Overview of AdaNCA architecture". That means explaining the computation pipeline and the different computational blocks so that the reader can easily refer to the caption to explain anything they don't understand in the figure. This will mean some repeat of what was said in the main manuscript, but it will immensely help with clarity when someone wants to quickly glean your paper and understand what's going on. A good rule of thumb is to have the gist of the paper be understood from reading the figures and captions alone.
- Eq 8: Please put a quick explanation of the convolution operator.
- For Multi-scale Dynamic Interaction, is W_I shared across scales? I'm assuming it is, but just wanted a quick confirmation if it is or not.
- For Multi-scale Dynamic Interaction (and the single-scale version), I would like to see a comparison with a vanilla concatenation approach. I want to know if the cost of more parameters and FLOPS is worth it with the concatenation approach. Otherwise it's difficult to judge the usefulness of Dynamic Interaction. Comparing it with a simple filter response summation scheme and showing improvements doesn't really motivate Dynamic Interaction in the context of typical NCA approaches. If you can show that Dynamic Interaction (and Multi-Scale) is more computationally efficient than the concatenation scheme (the relative reduction of accuracy is made up for by a greater relative reduction of parameters and FLOPS), then it solidifies the benefits of Dynamic Interaction and possibly sets a new standard for the NCA interaction stage.
- For Multi-scale Dynamic Interaction, have you tried comparing the weighted summation from W_I and W_Ms with a 1x1 conv approach to mix the depthwise info (basically this is how ConvNeXT does it, depthwise conv followed by pointwise conv)? Have you also tried comparing Dynamic Interaction with a non-depthwise spatial conv? If not, can you provide some quick commentary on what you believe the pros and cons would be for each approach and how it would compare with Dynamic Interaction?
- For Multi-scale Dynamic Interaction, have you tried using a larger filter instead of a larger filter dilation? I have a feeling that the sparsity induced by the dilation coupled with the sparsity of cell state updates contributed to the model having difficulty with three scales.
- For Multi-scale Dynamic Interaction, have you tried modifying the two-layer convnet that produces W_Ms such that its receptive field covers the same receptive field of the max dilation? In other words, if there are three scales, that means you would need the two-layer convnet to have a receptive field of at least 7x7, but the problem is that its receptive field is 5x5. Can you verify if it's required to match the two-layer convnet's receptive field with the max dilation in order to see an improvement with three or more scales?
- This could explain why it worked with two scales, as the max receptive field would be 5x5, which is within the receptive field of the two-layer convnet.
- L224: I suggest explaining the "K(r = 0.6938, p < 0.001)" part.
- Jointly training the ViT model and AdaNCA seems costly. Have you tried a training approach where you use a pretrained ViT model and have every layer but the one before and after AdaNCA insertion points frozen? Considering your layer set redundancy experiments, it seems like these "boundary" layers would be the only ones required to adapt to an AdaNCA that's also being trained. If this works, then this would really push the adapter narrative of AdaNCA.
- Another experiment on top of this would be to have all ViT layers frozen and train AdaNCA to act as a feature cleaner where OOD and noisy inputs have their features adapted to the part of the feature manifold that each layer set expects.
- Table 1: I suggest providing citations beside each of the metrics under "Adversarial Inputs" and "OOD Inputs". You can simply reuse the ones provided in 4.1.
- Table 1: I suggest bolding the lowest param count and lowest FLOPS within each row triplet.
- Table 1: I suggest changing the bolding of your model to italicizing or some other visual indicator so as to not confuse the reader with the intention of the bolded numbers.
- Table 1: I suggest explaining what the bold means in the caption.
- Table 2: Same recommendations as the three above.
- Table 3: I suggest explaining the bolding in the caption.
- Table 3: I suggest removing the bolding from "Ours." It's already obvious that it's your model.
- Figure 6: I suggest not using yellow as the outline colour as it's very difficult to tell amongst all the other yellow-ish colours. Use a strong dotted black or some other colour-blind-safe colour.
- Figure 6: I suggest clarifying what "Frequency of Noise" and "Magnitude of Noise" means in the caption.
- Figure 6: I suggest specifying what kind of noise is being shown here.
- Section 6: Negative impacts were not considered. Please provide brief commentary on potential negative impacts, even if you believe there are none. It's important to show that you've at least considered it.
- Figure 4: I suggest briefly commenting in the caption on how AdaNCA does not negatively affect the layer-wise similarity in Swin-B. In other words, providing a smaller version of what was said in 4.2 about how AdaNCA preserves the original layer sim structure.
- Figure 5: I suggest providing some brief commentary on (a) and (b) in the caption.
- L288: "while overly large scale can lose local information." How so? Theoretically, the model should learn to discount larger scales if they're not useful, so why does including more scales hurt performance and how would it cause a loss in local information? To me, this suggests that W_Ms isn't working the way it's intended to, so there must be some sort of bug or something else entirely. It's just really odd to me that merely adding a third scale is worse than keeping it at a single scale. I'm keen to hear your thoughts on this and if you've looked into this any deeper.
- L300: I'd like to see commentary on each of the ablations. "Our design choices contribute to model performance and robustness" is not enough. There should be deeper commentary on each of the ablations, explaining any gains or loss in performance from each (e.g., why does ablating RandS and Recur lead to the highest accuracy?), and comparing relative gains between each (e.g., which component contributes the most to model performance?).
- L298: "Ablate the Dynamic Interaction that the..." -> "Ablate the Dynamic Interaction so that the..."
- L271: "...networks to conduct the analysis." -> "...networks to conduct this analysis."
- "FLOPs" -> "FLOPS".
I would like to see my questions and suggestions above addressed to the best of the authors' abilities. My largest concerns lie with the lack of a clear and detailed comparison with the only other NCA model that operates under a ViT setting (ViTCA from Tesfaldet et al.), the relatively insufficient evidence of the usefulness of Dynamic Interaction and its multi-scale counterpart, and the lack of commentary on the ablation study.
Limitations
The authors have addressed most of the limitations of their approach, as shown in the Limitations section (Section 5) of the main manuscript and across various points of the main manuscript and appendix. I would like to point out that the potential negative impacts of their approach have not been considered, and so I recommend they comment on this in Section 6 (Broader Impact), even if there are none.
We thank you for the valuable and detailed suggestions as well as the acknowledgment of the originality, significance, and novelty of our work. We now address your questions below.
- Importance of recurrent updates
We fully agree with you that the recurrent update scheme of NCA is crucial for the model to be robust. We will modify L40-41 and L165-166 to include this point. We want to clarify that we do not ignore the recurrence but mention it in L37-38 to state the intuition behind the advantage of using NCA, and also in our ablation study (Section 4.3) on Recur.
- Comparison with ViTCA
Please refer to the global rebuttal "AdaNCA and ViTCA".
- Difference between AdaNCA and other NCA
We agree that a discussion on this topic can help readers better understand our method, including the training, hidden state usage, initial cell states, and the pooling strategy. As we do not include them in AdaNCA training, we will add them to the Appendix.
- Stochasticity in AdaNCA
We'd like to highlight that stochasticity during testing can hinder the evaluation of adversarial robustness by producing obfuscated gradients ([3] in the main paper), leading to the circumvention of adversarial attacks (L244-245 and Section C.7.1 in the Appendix). Table 5 in the R-PDF illustrates this issue, where we test the classification accuracy under CW attack. Stochasticity also results in inconsistent outputs. This is problematic for practical image classification tasks where reliable output is critical, unlike applications focusing on visual effects (Self-Organizing Textures (SOT), ViTCA) or collective behaviors (Self-Classifying MNIST (SCM)). The change in Clean Acc. in Table 5 indicates that given the same image, the stochasticity can lead to different decisions, hindering the deployment of the trained models in real-world scenarios. We also conducted your suggested experiment by removing the 1/p during training (Table 6, R-PDF). The drop in Clean Acc. and robustness suggests that downstream ViT layers struggle with varying NCA output magnitudes. Our dropout-like compensation scheme improves AdaNCA’s performance, which might be due to the different application scenario compared to previous NCA works.
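A minimal sketch of this train/test asymmetry is given below; the variable names are illustrative, the rescaling by 1/p is the dropout-like compensation of Eq. 7, and removing it corresponds to the ablation in Table 6 of the R-PDF.

```python
import torch

def nca_update(state, delta, p, training):
    """Stochastic cell update with dropout-style compensation (cf. Eq. 7).

    During training, each cell applies its update with probability p and the
    applied updates are rescaled by 1/p; at test time all cells update
    synchronously without any mask, so the expected update magnitude matches
    between the two regimes.
    """
    if training:
        mask = (torch.rand_like(state[:, :1]) < p).float()
        return state + (mask / p) * delta
    return state + delta   # synchronous, deterministic update at test time
```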
- The usefulness of Dynamic Interaction and its multi-scale counterpart
First, we clarify that W_I is shared across scales.
-- Comparison with the concatenation scheme
Please see the global rebuttal "The usefulness of Dynamic Interaction".
-- Comments on other convolutions
Pros: Both spatial and 1x1 conv incorporate channel mixing in the interaction stage, potentially improving the model capacity of NCA.
Cons: Adding channel mixing can cause a drastic increase in the number of parameters and FLOPS, hindering scalability. For example, using spatial conv will result in ~10M more parameters when the token is 384-dim and the number of kernels is 4 (the conv will be 1536x384x3x3 for a single scale, and we use two scales). It will also modify the paradigm used in ViT and NCA of separating token mixing from channel mixing. This modification can potentially lead to redundant information.
-- The usefulness of multi-scale Dynamic Interaction
Our motivation for using multi-scale interaction is to perform more efficient token interaction learning over local scales, since ViT already contains global information. Enlarging the neighborhood size will: 1) complicate the process of selecting the neighbors to interact with; 2) introduce excessive noise, making it difficult for tokens to accurately acquire neighbor information; and 3) repeatedly acquire the global information provided by ViT. The recurrence further amplifies the noise. We point out that ViTCA also suffers from the neighborhood size issue in image generation (Table 5 in their Appendix). Theoretically, self-attention can discount far-away information, as the tokens also perform a weighted sum over the neighborhood values with weights generated by themselves, yet it still struggles with large neighborhood sizes. In contrast, our model gains performance when S=2 (5x5 neighborhood), indicating the usefulness of our multi-scale module. Although increasing model capacity (as you suggested, by using larger filters or increasing the receptive field of W_Ms) can improve the performance (Table 7, R-PDF), AdaNCA struggles with overly large neighborhoods. We will add a paragraph to discuss this issue. We note, however, that our work pioneers the application of multi-scale interaction in large-scale image classification tasks.
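To illustrate the scales involved, and assuming the larger scales are realized with dilated 3x3 depth-wise convolutions as discussed above, the following snippet shows how each dilation maps to the effective neighborhood size (S=1 -> 3x3, S=2 -> 5x5, S=3 -> 7x7):

```python
import torch
import torch.nn as nn

# Receptive field of a 3x3 depth-wise convolution at increasing dilations:
# dilation=1 covers a 3x3 neighborhood (scale S=1),
# dilation=2 covers 5x5 (scale S=2),
# dilation=3 covers 7x7 (scale S=3), where performance was observed to degrade.
dim = 8
x = torch.randn(1, dim, 14, 14)
for dilation in (1, 2, 3):
    conv = nn.Conv2d(dim, dim, kernel_size=3, padding=dilation,
                     dilation=dilation, groups=dim)
    rf = 1 + 2 * dilation  # effective receptive field side length for a 3x3 kernel
    print(f"dilation={dilation}: output {tuple(conv(x).shape)}, receptive field {rf}x{rf}")
```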
- Integrating AdaNCA into pre-trained ViT
Please see the global rebuttal "Integrating AdaNCA into pre-trained ViT models".
- Comment on the ablation study
Due to the space limit of the main paper, we did not include the analysis of the ablation study. Here, we provide our analysis, and we plan to add it as a section in the Appendix: As shown in Table 3 (main paper), the highest robustness improvement is achieved with all components. Among them, turning off the recurrence leads to the largest drop in robustness, as recurrence allows the model to explore more cell states than finishing the update in a single step. While this variant achieves the highest clean accuracy, it uses ~4x more parameters than our method, and the improvement in clean performance is likely due to the additional parameters. Without either of the two sources of randomness, the stochastic update and the random steps, the model cannot adapt to the variability of the inputs and thus exhibits vulnerabilities against adversarial attacks. Finally, turning off our Dynamic Interaction causes drops in both clean accuracy and robustness, as tokens cannot decide their unique interaction weights and thus cannot generalize to noisy inputs.
- Citation, clarification, and typos
We highly appreciate your thorough and constructive feedback on the texts, figures, tables, and citations. We plan to incorporate all of them in our revised manuscript.
Hi authors, thank you for the detailed rebuttal. I've read through your answers (as well as your global answers) and have some additional comments to provide.
First, I'm glad to see that you'll be integrating my suggested changes. I trust that you'll do so. I'm also glad to see in the rebuttal PDF that you've done some of the experiments I suggested, such as comparing with the concatenation approach, removing the dropout-like compensation, trying a larger receptive field for W_M (I'm guessing you tried a receptive field of 7x7? You should indicate the specific size.), among others. Interesting results! I'm happy to see commentary on the ablation as well. I'm particularly surprised to see that compensating for the larger receptive field requirement of 3 scales by increasing the receptive field of W_M still gave worse results compared to 2 scales!
Most of my concerns have been addressed by your rebuttal, but I do have one suggestion to make, particularly about comparing AdaNCA with ViTCA, as I believe your global response comparing the two was still insufficient:
> AdaNCA focuses on robustness in image classification, whereas ViTCA is primarily designed for image reconstruction. These different goals necessitate different structural designs in the two frameworks.
I don't see how the two different tasks necessitated different structural designs as both dealt with tokens from imagery as inputs / outputs. If you feel this to be the case, then you should be able to explain how the different tasks motivated a different structural design, because from what I can tell, ViTCA can be used in an AdaNCA manner, as you've shown in your rebuttal PDF.
> ViTCA studies the behaviors of tiny NCA (~100K) on small datasets such as MNIST, while AdaNCA-enhanced ViT can deal with sizes larger than 90M and can operate on ImageNet1K.
Correct, but this is only a lead to the more meaningful difference between the two approaches.
> ViTCA's usage of self-attention for token interaction learning results in a higher parameter count compared to our proposed Dynamic Interaction approach (see Table 3 in R-PDF).
Correct, although I'd probably also say: "ViTCA uses self-attention, AdaNCA uses Dynamic Interaction."
> Tokens in ViTCA correspond to pixels instead of image patches.
Tokens in ViTCA can trivially correspond to image patches, so this isn't really a meaningful difference.
With the above being said, I'd like to point to a sentence in my summary in my review:
> AdaNCA differentiates itself by not trying to be a ViT in and of itself (as ViTCA does), but by acting as an adapter that can be placed at various layers within a ViT to improve its robustness.
Coupled with the fact that ViTCA uses a localized self-attention for cell interaction while AdaNCA uses Dynamic Interaction, I'd say the above quote is the most meaningful difference between ViTCA and AdaNCA and so my one remaining suggestion would be to highlight this in your paper.
Aside from that, I'm pleased with the paper overall and your rebuttal. I'll be updating my score to reflect this.
Dear Reviewer h6zC,
Thank you very much for your reply! We highly appreciate your endorsement and will highlight the difference between ViTCA and AdaNCA in our paper according to your comments. Moreover, we will ensure that our revision includes all your suggested changes and the additional experiments, as well as the related analysis. Thank you again for your thorough and constructive feedback!
Sincerely,
Authors
We thank all reviewers for their valuable comments and the acknowledgment of our contributions. We are glad to note that all reviewers (h6zC, 3x5z, oE4b, ZhPF) agree on:
- The novelty and significance of our method in integrating NCA into ViT.
- The effectiveness of AdaNCA in improving the performance of ViT, including clean accuracy and robustness against adversarial attacks and OOD inputs.
Moreover, reviewers h6zC and oE4b remark that our method for determining the best insertion positions of AdaNCA is sound. Reviewers h6zC and 3x5z mention the thoroughness of our experiments, and reviewer 3x5z recognizes the small cost AdaNCA adds relative to the improvement it achieves.
We identify some common questions from the reviewers:
- h6zC, 3x5z, ZhPF: The usefulness of our proposed Dynamic Interaction and its difference from the existing NCA concatenation method.
- h6zC, ZhPF: Comparison between AdaNCA and ViTCA.
- h6zC, ZhPF: The integration of AdaNCA into a pre-trained ViT model instead of training from scratch.
We address these below. We conduct all experiments using the same setting as in our ablation study (Section 4.3), namely Swin-Tiny on ImageNet100, except for the integration of AdaNCA into pre-trained ViTs, which uses Swin-base on ImageNet1K.
- The usefulness of Dynamic Interaction
Our motivation for developing the Dynamic Interaction module is the high dimensionality of feature vectors in modern ViT models, as stated in L43-51 and Section D of the Appendix. We qualitatively show in Figure 1 of the rebuttal PDF (R-PDF) that Dynamic Interaction helps robustify AdaNCA when facing noisy inputs. We underscore that any operation involving a linear transformation of the concatenated interaction results leads to a drastic increase in computational cost, as shown in Table 1 of the R-PDF. Note that the concatenation scheme nearly doubles the FLOPS of RVT compared to the baseline, making training difficult; such an increase renders scalability a challenge. We agree that more parameters can contribute to better performance (Table 2, R-PDF; numbers in parentheses indicate the improvement over the Baseline). Note, however, that our Dynamic Interaction retains ~70% of the robustness improvement of the original NCA concatenation scheme (10.06/14.62) and ~60% of its clean-accuracy improvement (0.62/1.06), with merely ~10% of the parameters and FLOPS (0.35/2.99 for # Params and 0.2/2.3 for FLOPS). Dynamic Interaction therefore offers a good trade-off between performance and computational cost. It is capable of scaling up, allowing us to insert AdaNCA into even larger models, such as current vision-language models, where the token dimensionality is even higher.
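To illustrate the cost gap, here is a deliberately simplified comparison (the weighted-sum variant is a stand-in rather than the exact Dynamic Interaction module, and the dimensions are hypothetical): concatenating K perception results and projecting them back costs on the order of K*C^2 parameters, while a per-token weighted sum over the K results needs only on the order of K*C.

```python
import torch
import torch.nn as nn

C, K = 768, 4  # hypothetical token dimension and number of perception kernels

# Concatenation scheme: stack the K perception results along the channel axis
# and mix them with a dense projection; parameters grow as K * C * C.
concat_proj = nn.Linear(K * C, C)

# Weighted-sum scheme (simplified stand-in for Dynamic Interaction): each token
# predicts one scalar weight per kernel from its own state and sums the
# perception results; parameters grow only as C * K.
weight_net = nn.Linear(C, K)

def concat_interaction(perceptions):       # perceptions: (B, N, K, C)
    b, n, k, c = perceptions.shape
    return concat_proj(perceptions.reshape(b, n, k * c))

def weighted_interaction(x, perceptions):  # x: (B, N, C) token states
    w = weight_net(x).softmax(dim=-1)      # (B, N, K) per-token kernel weights
    return (w.unsqueeze(-1) * perceptions).sum(dim=2)

print(sum(p.numel() for p in concat_proj.parameters()))  # ~2.4M parameters
print(sum(p.numel() for p in weight_net.parameters()))   # ~3K parameters
```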
- AdaNCA and ViTCA
We acknowledge that ViTCA is the first Vision Transformer architecture to incorporate NCA. However, we highlight some key differences between AdaNCA and ViTCA.
- AdaNCA focuses on robustness in image classification, whereas ViTCA is primarily designed for image reconstruction. These different goals necessitate different structural designs in the two frameworks.
- ViTCA studies the behaviors of tiny NCA (~100K) on small datasets such as MNIST, while AdaNCA-enhanced ViT can deal with sizes larger than 90M and can operate on ImageNet1K.
- ViTCA's usage of self-attention for token interaction learning results in a higher parameter count compared to our proposed Dynamic Interaction approach (see Table 3 in R-PDF).
- Tokens in ViTCA correspond to pixels instead of image patches.
Nevertheless, we conducted experiments using ViTCA's local recurrent attention scheme to explore the potential benefit of integrating ViTCA into our framework, with the same training setting as our ablation study (Section 4.3). Results are given in Table 3 of the R-PDF. Despite having more parameters, the ViTCA-like scheme performs worse in clean accuracy and on par with AdaNCA in robustness. This nevertheless shows that ViTCA can be incorporated into our framework, which we consider a promising direction to explore.
- Integrating AdaNCA into pre-trained ViT models
Implementing AdaNCA as a plug-and-play module for pre-trained ViT models would certainly speed up training. However, the current NCA might not adapt well to such a scheme: it may struggle to effectively transmit information between two pre-trained ViT layers, as these layers have already established strong connections. In contrast, training the model from scratch allows the NCA and the ViT to synergistically adapt to feature variability, resulting in better overall performance. To explore this, we experimented with inserting AdaNCA into a pre-trained Swin-base model on ImageNet1K by 1) freezing all ViT layers; 2) training only the boundary layers; and 3) finetuning all layers. Results are given in Table 4 of the R-PDF; none of these schemes performs as well as training from scratch. Nevertheless, this direction is worth exploring in the future.
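As a concrete illustration of scheme 1 (all names, including the "adanca" substring, are hypothetical placeholders rather than our actual code), one could freeze every pre-trained parameter and leave only the inserted adapter modules trainable:

```python
import torch
import torch.nn as nn

def trainable_adapter_params(model: nn.Module, adapter_keyword: str = "adanca"):
    # Scheme 1: freeze all pre-trained ViT weights; only parameters whose name
    # contains the adapter keyword (i.e., the inserted AdaNCA modules) stay
    # trainable. Scheme 2 would additionally unfreeze the ViT blocks adjacent
    # to each adapter; scheme 3 finetunes everything.
    for name, param in model.named_parameters():
        param.requires_grad = adapter_keyword in name.lower()
    return [p for p in model.parameters() if p.requires_grad]

# Usage (hypothetical model name):
# optimizer = torch.optim.AdamW(trainable_adapter_params(swin_with_adanca), lr=1e-4)
```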
Accept with minor revisions: The paper demonstrates significant potential, especially in terms of improving robustness and efficiency. However, there are gaps in the experimental setup (e.g., missing baselines and the need for more extensive comparisons) and in the explanation of key contributions such as Dynamic Interaction. These issues are not major roadblocks but should be addressed in a revision to align with NeurIPS' standards. The paper offers promising contributions to the community and could serve as a foundation for further research in robust ViTs.