Learning Where to Edit Vision Transformers
Abstract
Reviews and Discussion
The paper presents a novel approach for editing Vision Transformers (ViTs) to correct predictive errors, specifically addressing subpopulation shifts. It introduces a learning-to-learn methodology utilizing a hypernetwork that identifies and modifies a small set of critical parameters in response to erroneous samples. This method is distinguished by its focus on localized edits, ensuring minimal disruption to unrelated model functions while generalizing corrections to similar errors.
Strengths
The concept of utilizing a hypernetwork to identify editing locations in ViTs for image processing tasks is innovative, addressing a gap in the literature concerning efficient and localized model editing. The proposed method is rigorously validated through the introduction of new benchmarks, showing significant improvements over existing techniques. The paper is well-organized and clearly written, with technical details and methodologies elaborately explained, making it accessible to readers familiar with the field. This research is highly significant as it enhances the practical utility of ViTs in real-world applications, particularly where predictive reliability is crucial, such as in autonomous driving and medical imaging.
Weaknesses
The success of the proposed method heavily relies on the hypernetwork's ability to accurately predict edit locations. Misidentifications could lead to suboptimal edits, affecting model reliability.
Questions
Could a strategy similar to the one proposed by [1] be adapted to apply a low-rank update to the weight matrix for memory editing in ViTs?
[1] Meng, Kevin, et al. "Locating and editing factual associations in GPT." Advances in Neural Information Processing Systems 35 (2022): 17359-17372.
Limitations
The CutMix technique employed may not adequately represent real-world shifts, which could limit the applicability of the findings to practical scenarios.
We sincerely appreciate your constructive comments on improving our paper. We detail our response below point by point. Please kindly let us know if our response addresses the questions you had for this paper.
[W1] Reliability of the hypernetwork
We argue that identifying the 'correct' parameters to edit (i.e., editing locations) is a central challenge in any localization-based editing model, not just a particular challenge for our hypernetwork-based localization method.
- To tackle this common challenge, we have proposed the following two strategies to improve the reliability of our hypernetwork for localization:
Meta-learn to edit: Our meta-trained hypernetworks leverage transferable knowledge (inductive bias) from related training editing episodes to enhance the editing performance on each test editing sample. As shown in Fig. 5b and lines 361-363, the similarity between editing regions output by the hypernetworks correlates with the similarity in the input data. In contrast, previous editing baselines, such as FT-L2, determine the editing region for each image solely based on that particular test editing sample, which is prone to overfitting (resulting in low generalization) or overwriting irrelevant concepts (resulting in low locality).
Hypernetwork Optimization Decoupling Trick: As described in Appendix B.3.2, by introducing intermediate variables as the target editing region for the hypernetwork's prediction, this technique successfully ensures optimization success and provides extra supervision signals to guide the hypernetworks in learning to output optimal editing masks.
- On the empirical evaluation side, we conduct extensive experiments to demonstrate the reliability and superiority of our methods across various scenarios.
- We evaluate the editing methods on various datasets: the natural dataset, which consists of sixteen types of errors; the AI-generated datasets, which include two types of shifts (c.f. Fig. 4); and the fine-grained classification dataset (c.f. Fig. r4 of the rebuttal PDF).
- We apply our methods to various pre-trained models: ViT/S-16 (c.f. Fig. K in appendix), ViT/B-16 (c.f. Fig. 4), and SwinV2-g (c.f. Fig. r1 of the rebuttal PDF), under different pre-training strategies, including supervised pre-training for ViT/S-16 and ViT/B-16, and SimMIM self-supervised pre-training [1] for SwinV2-g.
[Q1] Adaptation of ROME with LoRA
Following the reviewer's suggestion, we adapt ROME with LoRA for model editing experiments on two groups of the natural dataset using ViT/B-16. The results are presented in the table below.
Although the combination of ROME and LoRA demonstrates an improved trade-off between generalization and locality compared to using LoRA alone, our method still outperforms this combined approach by 13.58% and 15.31% in generalization at similar locality rates in the 609-586 and 890-430 groups, respectively, even though ROME and ROME+LoRA require access to pre-training data.
| 609-586 group in the natural dataset | Generalization | Locality |
| --- | --- | --- |
| ROME | 78.36 | 96.56 |
| LoRA | 76.07 | 91.27 |
| ROME+LoRA | 81.27 | 91.13 |
| Ours | 94.85 | 90.79 |

| 890-430 group in the natural dataset | Generalization | Locality |
| --- | --- | --- |
| ROME | 66.10 | 97.70 |
| LoRA | 65.21 | 91.30 |
| ROME+LoRA | 71.75 | 91.84 |
| Ours | 87.06 | 92.83 |
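For concreteness, below is a minimal PyTorch-style sketch of the low-rank update underlying the ROME+LoRA adaptation; the class name, rank, and initialization are illustrative assumptions rather than the exact configuration used in these experiments.

```python
import torch
import torch.nn as nn


class LowRankFFNEdit(nn.Module):
    """Additive rank-r update W + B @ A for a single FFN weight matrix.

    A minimal sketch: the pre-trained weight is frozen and only the low-rank
    factors A and B are optimized during the edit.
    """

    def __init__(self, weight: torch.Tensor, rank: int = 4):
        super().__init__()
        out_dim, in_dim = weight.shape
        self.register_buffer("weight", weight)                 # frozen pre-trained weight
        self.A = nn.Parameter(torch.zeros(rank, in_dim))       # zero init => no change before editing
        self.B = nn.Parameter(torch.randn(out_dim, rank) * 0.01)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, in_dim); the effective weight is W + B @ A.
        return x @ (self.weight + self.B @ self.A).t()
```

Only `A` and `B` are passed to the optimizer, so each edited matrix contributes `rank * (in_dim + out_dim)` trainable parameters while the pre-trained weight stays fixed.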
[1] On data scaling in masked image modeling, CVPR 2023.
I appreciate the authors' response, which addressed all my questions. I have increased my score to 6.
Thank you very much for the positive feedback and increasing the score! We reiterate our deep appreciation for the reviewer's dedicated time and effort in reviewing our paper and providing invaluable feedback.
The paper explores and discusses ViT editing and is the first work in this field. It proposes a learning-to-learn approach that identifies a small set of critical parameters to edit in order to correct the prediction on an error example. The paper considers reliability and generalization during ViT editing. In experiments, the paper curates two benchmarks to validate that the framework can correct errors and provide customization.
Strengths
- The problem formulation and method are clearly defined and explained, and the method is easy to follow.
- The problem of ViT editing and the proposed method based on bi-level optimization are new in this field and meaningful.
- The paper contributes a benchmark for the problem, including natural data and generated data, which is beneficial to later research.
- The experiments are extensive and validated on both real data and generated data. The ablations and analyses cover various metrics and form strong evidence.
Weaknesses
- The motivation for CutMix and its influence are not clearly discussed or demonstrated. For example, there are also other data augmentation methods; are they the same as CutMix? Can they generate the same effect?
- The paper mentions that editing should not influence irrelevant examples. However, this is not clearly discussed in the paper. I have questions about the Locality metric. Specifically, if the predicted probability of the irrelevant examples changes, does this indicate that the model has changed?
Questions
Please refer to the weakness part.
Limitations
The authors address the limitations.
We thank the reviewer for the valuable feedback. We address your concerns below point by point. Please kindly let us know whether you have any further concerns.
[W1] Motivation and influence of CutMix
We appreciate the reviewer's insightful question regarding the motivation and influence of CutMix, as well as the comparison with other data augmentation methods. To answer these questions:
Motivation and influence: As stated in lines 190-191, the primary motivation for using CutMix is to efficiently generate pseudo-editing training episodes that emulate the actual editing process, thereby enabling the hypernetwork to learn-to-localize the most crucial parameters for editing during meta-training. Specifically:
- Given an original image, we generate an associated pre-editing image through CutMix by pasting a small patch of another image (ranging in size from 48 × 48 to 128 × 128) onto the original image at a random position.
- This newly introduced patch by CutMix simulates distribution shifts in backgrounds, contextual objects, and novel attributes of an object, resulting in a difference in the pre-trained model's predictive distribution between the pre-editing and original samples.
- By aligning the model's predictive distributions on the CutMix-generated pre-editing image with that of the original image through fine-tuning only a subset of pre-trained parameters identified by the hypernetwork, as stated in our editing objective in Eq. (2), the hypernetwork learns to locate the most crucial parameters in the pre-trained models that account for the patch introduced by CutMix. This achieves our goal of learning where to edit in vision Transformers.
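For illustration, a minimal sketch of the pseudo pre-editing sample generation described above (function and variable names are ours; the exact patch-sampling scheme in the paper may differ slightly):

```python
import torch


def make_pre_editing_image(original: torch.Tensor, donor: torch.Tensor,
                           min_size: int = 48, max_size: int = 128) -> torch.Tensor:
    """Paste a random patch of `donor` onto `original` at a random position.

    Both inputs are (C, H, W) tensors. The pasted patch simulates a shift in
    background, contextual objects, or object attributes, so the pre-trained
    model's prediction on the result may differ from that on `original`.
    """
    _, h, w = original.shape
    ph = torch.randint(min_size, max_size + 1, (1,)).item()
    pw = torch.randint(min_size, max_size + 1, (1,)).item()
    top = torch.randint(0, h - ph + 1, (1,)).item()
    left = torch.randint(0, w - pw + 1, (1,)).item()
    edited = original.clone()
    edited[:, top:top + ph, left:left + pw] = donor[:, top:top + ph, left:left + pw]
    return edited
```

The original image then serves as the target whose predictive distribution the edited model should match when fine-tuning the hypernetwork-selected parameters.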
Comparison to other data augmentations: To address the reviewer's query, we refer to our ablation study in our paper where we have compared CutMix with PGD, a gradient-based augmentation strategy known for capturing diverse distribution variations (see line 381). As shown in Fig. 6(a), the performance of hypernetworks meta-trained with CutMix rivals that of more computationally expensive gradient-based augmentation strategies. This indicates that CutMix is not only effective but also efficient in simulating the necessary distribution shifts for our model editing tasks.
[W2] Clarification on the locality metric
Due to space constraints, we have included the discussion on the locality metric in Appendix B.1 of our manuscript. To further improve the clarity of the paper, we will explicitly highlight this in the revised manuscript. Specifically, we would like to emphasize the following points from our discussion:
- Our definition and evaluation of the locality metric are strictly consistent with the ones in prior works.
- As in [1, 2], the locality metric (see Eq. (6)) measures whether the edited model can maintain the prediction accuracy on a set of irrelevant data at the level of the pre-trained model (refer to the Metrics section in [1] and the evaluation of specificity, which is equivalent to locality, in [2]).
- That said, to address the reviewer's question: In line with the definition of locality metric in prior works, a change in the edited model's predicted probability on an irrelevant sample, as long as it does not affect the accuracy of the predicted label, reflects a change in the edited model but does not impact the value of the locality metric.
- We have designed our evaluation benchmark to better reflect changes in the edited model, even with the above definition of locality metric.
- To better capture model changes under slight predicted probability variations, we collect sensitive validation images from ImageNet, ImageNet-R, and ImageNet-Sketch to evaluate locality.
- These sensitive images are characterized by: 1) being correctly predicted by the pre-trained model, meaning that they have the highest predicted probability on the label class, and 2) having a difference less than 0.05 in predicted probability between the label class and the class with the second highest predicted probability.
- As a result, prediction accuracy, hence locality, on these sensitive images is affected even by small changes in predicted probability.
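For reference, a minimal sketch of the sensitive-image filter and the locality metric described above (function names are ours; only the 0.05 margin follows the text, and data loading is omitted):

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def is_sensitive(model, image: torch.Tensor, label: int, margin: float = 0.05) -> bool:
    """Keep images the pre-trained model classifies correctly but with a
    top-1 vs. top-2 probability gap smaller than `margin`."""
    probs = F.softmax(model(image.unsqueeze(0)), dim=-1).squeeze(0)
    top2 = probs.topk(2)
    correct = top2.indices[0].item() == label
    return correct and (top2.values[0] - top2.values[1]).item() < margin


@torch.no_grad()
def locality(edited_model, images: torch.Tensor, labels: torch.Tensor) -> float:
    """Fraction of sensitive irrelevant images that remain correctly
    classified after the edit."""
    preds = edited_model(images).argmax(dim=-1)
    return (preds == labels).float().mean().item()
```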
[1] Fast model editing at scale, ICLR 2022.
[2] Locating and editing factual associations in GPT, NeurIPS 2022.
The paper explores a model-editing task for ViTs that is similar to model editing in LLMs. For ViT model editing, the paper proposes a meta-learning-based approach to selecting parameters for fine-tuning updates. The proposed selection method trains a hyper-network that selects the parameters to be learned, using episodes made by CutMix augmentation. The proposed method exhibits superior performance compared to similar methods on the newly constructed editing benchmarks.
Strengths
- Exploring model editing for ViT with a meta-learning approach is new and interesting.
- The paper proposes a new benchmark for model editing, including datasets and evaluation metrics.
Weaknesses
- The explanation and justification for the necessity of model editing in the vision domain are insufficient.
- Although model editing is a powerful and important tool for LLM tuning, the paper doesn't explain the need for model editing in vision models.
- Vision models have relatively few parameters, and FLOPs are considered a more important cost than memory and the number of parameters. Model editing might be effective in reducing parameters and memory, but I'm not sure it is also beneficial in the vision domain.
- The hypernetwork requires too high a cost to select parameters for model editing.
- To the best of my knowledge, the major goal of model editing is to reduce the cost of full fine-tuning, like LoRA. But the proposed method requires heavy computation, and the training process might be more expensive than full fine-tuning.
- For the 12-block ViT-B, the proposed method trains a hyper-network with 5 transformer blocks using multiple training episodes with inner- and outer-loop optimization, which might require huge computation costs compared to the original network.
- The hyper-network is trained only for a single pre-trained network. Thus, other hyper-networks must be trained for other pre-trained networks. This whole process doesn't look efficient.
Questions
- The authors argue the importance of model editing based on LLM scenarios. But I think vision domains are different from LLM scenarios. What are the benefits of general model editing in the vision domain? Apart from the performance improvement made by the proposed method, does model editing have a unique role in the vision domain? This is important for evaluating the contribution as a new editing benchmark and the first model-editing work in vision.
- The proposed method requires training a hyper-network with two-loop optimization, which seems to be a high-cost computation. Does this method offer any computational cost advantages compared to other existing methods?
- Can the hyper-network be used to select parameters for another pre-trained network?
- I think hyper-network training with multiple episodes is a significant advantage; for a fair comparison, do other methods also use similar processes?
- The proposed method uses 5 transformer blocks to select just 3 FFNs. It is not parameter efficient and can't be expanded to large-scale networks beyond a billion parameters. Is there any way to expand the proposed method for a larger selection space?
Limitations
I didn't find a limitation section in the paper.
We sincerely appreciate your comments on our paper. Please find our responses to your concerns below. If you have any further concerns, we would be grateful if you could let us know.
[W1 & Q1] Explanation and justification for the necessity of model editing in the vision domain
- We would like to humbly clarify that the motivation for model editing in the visual domain aligns with that in LLMs, i.e., enabling data-efficient updates of a pre-trained model to correct errors (c.f. the last sentence of the first paragraph in the introduction in [2] and [3]).
- Necessity to correct errors: analogous to outdated knowledge in LLMs, vision models also encounter errors over time, such as failing to recognize images with subpopulation shifts. Correcting these errors, particularly in safety-critical applications like self-driving, is urgent.
- Necessity for data-efficient updates: the optimal strategy for correcting errors without sacrificing generalization capabilities is to re-train a foundation model (including LLMs and vision foundation models) with both pre-training and error data. However, pre-training data is often inaccessible [1], and re-training on it is computationally prohibitive due to its huge volume.
- The motivation of being data-efficient, rather than parameter/memory-efficient, has led to several lines of model editing methods that do not necessarily reduce parameters or memory. For example:
- fine-tuning all parameters [5] and hypernetwork-based methods [6] update all parameters;
- memory-based [2] and T-patcher [7] even introduce additional memory and parameters, respectively.
- Our proposed method falls under the locate-then-edit line [8],
- whose primary motivation is to address the generalization-locality trade-off, which vision models also face as shown in Fig. 4, by precisely targeting a minimal number of parameters to update; in Fig. 4, ours shows an impressive improvement in the balance between generalization and locality;
- which meanwhile enjoys the benefits of reducing parameters, memory, and also FLOPs of vision models. Compared to full fine-tuning, which consumes around 52.8G FLOPs per editing iteration, ours consumes 35.5G FLOPs for the first iteration and 26.4G FLOPs for subsequent iterations. Please refer to [R2] of the general response for calculation details.
[W2 & Q2 & Q3] Computational efficiency concerns regarding the hypernetwork
- Please kindly refer to [R2] of the global response.
[Q4] Clarification on the fair comparison
For a fair comparison, we have indeed compared against the state-of-the-art hypernetwork-based editing baselines, including KE [6] and MEND [4], in our experiments. The two baselines also involve training a hypernetwork with multiple episodes, although their objectives, and therefore the inputs/outputs of their hypernetworks, differ from ours:
- Hypernetworks in the two baselines learn how to edit, while ours first learns where to edit.
- Their inputs and outputs are parameter gradients, while ours takes an editing sample as input and outputs a binary mask.
[Q5] Extension to large-scale networks
We humbly clarify that our method generalizes well to larger-scale networks.
- First, even as the size of pre-trained networks grows, our method via greedy search reduces the optimal layers to edit (i.e., the hypernetwork output space) to a minimal number (c.f. Line 201). During the response period, we found that editing the 19th to 21st layers of SwinV2-g (1B) is sufficient to achieve a good balance between generalization and locality.
- Second, at the hypernetwork side, its size does not necessarily increase with the size of the pre-trained model:
- For the same pre-trained model of ViT/B-16, a small hypernetwork can achieve almost comparable performance to larger hypernetworks. Decreasing the number of blocks in the hypernetwork from 5, to 3, to 1, did not result in a pronounced performance drop (c.f. Fig. r3 of the rebuttal PDF).
- Utilizing the same hypernetwork consisting of 5 transformer blocks still helps precisely locate edits for even billion-scale models like SwinV2-g, evidenced in Fig. r1(b-c) of the rebuttal PDF.
- Third, even if more layers are identified to edit, the size of the hypernetwork only marginally increases. This is achieved by including one more trainable input token for each FC layer without changing the architecture of the hypernetwork.
[1] Transferring pre-trained multimodal representations with cross-modal similarity matching, NeurIPS 2022.
[2] Memory-based model editing at scale, ICML 2022.
[3] Editing large language models: Problems, methods, and opportunities, EMNLP 2023.
[4] Fast model editing at scale, ICLR 2022.
[5] Modifying memories in Transformer models, ArXiv 2020.
[6] Editing factual knowledge in language models, EMNLP 2021.
[7] Transformer-patcher: one mistake worth one neuron, ICLR 2023.
[8] Knowledge neurons in pretrained transformers, ACL 2022.
Thank you for your response.
I have a few additional requests and need clarification on rebuttal comments.
- [W2 & Q2 & Q3] Computational efficiency concerns regarding the hypernetwork
- It is hard to estimate how much `9 hours on a single RTX A6000 GPU (48G)` is. Could you give me some reference numbers, such as the computation time for 1 epoch of ViT-B training on a single RTX A6000? It might help me to understand how much the hypernetwork training costs.
- [Q5] Extension to large-scale networks
- I understand that your method's computation can be reduced while maintaining performance. But it is still not enough to resolve my concern.
- At first, I need numbers that make the computation costs look small. `9 hours on a single RTX A6000` and `52.8G FLOPs per editing iteration` can be precise reports, but they don't look small. I expect comments like `it just takes 0.n% of fine-tuning computation` or `smaller than the computation for a single inference` that look small without analyzing and calculating computation costs.
- My concern about large-scale networks is scaling. 3 FFNs might be enough for ViT-B because it is 25% of 12 FFNs. But, for SwinV2-G as an example, 3 FFNs are only 6% of 50 FFNs. Also, the hypernetwork should be bigger to cover large channel sizes. Then, the impact of your method might be reduced on a large network, which means it is not scalable.
- In Fig. r1, the improvement compared to FT-L2 is much smaller than that for ViT-B, so it is not enough to resolve my concern about large-scale networks. Reducing the block size can reduce the total computation, but the hypernetwork still has to be larger when the channel size increases, and the impact of 3 FFNs is also restricted when the depth increases. So, I think this method can't be expanded to large-scale networks since it costs more while the effect is reduced, which can be a major weakness of the method.
We appreciate the reviewer for raising your remaining concerns. Please find our response below, and kindly let us know if it satisfactorily addresses your concern.
[D1] Computational efficiency regarding the hypernetwork
- For reference, in response to the reviewer's suggestion, we provide the computation time for training a ViT-B on one RTX A6000 (48G) with ImageNet-21k here. The training time is 25.55 hours per epoch, given the image size of 224x224 and the batch size of 256.
- Thus, training the hypernetwork accounts for 35% of the time required for 1 epoch of ViT-B training, which is fairly acceptable.
- Last but not least, as detailed in our general response [R2], hypernetwork training is performed only once for a pre-trained model prior to editing. It can be viewed as a brief (35% of 1 epoch) extended stage of pre-training, being both computationally acceptable and worthwhile.
[D2] Extension to large-scale networks
We understand the reviewer's remaining concern, which seems to arise from two main aspects: the (I) computation cost and (II) performance of the proposed method on larger-scale networks. We would like to humbly clarify the following.
- (I) The computational cost of our method on larger-scale networks
- Above all, we humbly clarify again that the primary motivation for model editing is to enable data-efficient updates, not necessarily to reduce computational cost. If full fine-tuning offers the best generalization-locality trade-off, it should indeed remain the preferred approach.
- When editing ViT/B-16, our method demonstrates greater efficiency than some SOTA baselines: ours consumes 50% of the FLOPs of full fine-tuning, while KN consumes 77%.
- When editing SwinV2-g (1B), which has 1134% more parameters than ViT/B-16 (86M), the FLOPs of our method only marginally increase by 435% (from 26.4G to 141.5G), while the FLOPs of full fine-tuning increase by 976% (from 52.8G to 567.9G).
- (II) The performance of our method on larger-scale networks. The reviewer's concern seems to stem from the following hypotheses: (1) larger networks require more FFNs to edit, (2) more FFNs to edit would necessitate significantly larger hypernetworks, and (3) the relatively small performance improvement in Fig. r1(b-c) is due to an insufficiently large hypernetwork and, consequently, not enough FFNs being edited. However, we humbly clarify each point below.
- Larger networks do not necessarily require more FFNs to edit.
We focus on localizing and editing a visual concept; the neurons that describe a visual concept are (1) concentrated in the middle to late layers [3], as the early layers encode distributed representations of very primitive concepts (e.g., geometric patterns) [1]; and (2) compact -- the number of neurons associated with a single concept does not largely increase as the model size grows [2], with the additional model capacity likely benefiting the representation of more diverse concepts.
During the discussion period, we conducted experiments by increasing the number of FFNs edited in SwinV2-g. The results, showing only marginal improvement when editing more FFNs, also corroborate that editing just 3 FFNs is sufficient to balance generalization and locality.
Natural 933-923 group:

| Number of FFNs | Identified layers | Generalization | Locality |
| --- | --- | --- | --- |
| 3 | 19~21 | 94.77 | 75.95 |
| 5 | 18~22 | 94.39 | 77.69 |
| 8 | 17~24 | 95.92 | 75.24 |

AI oil painting:

| Number of FFNs | Identified layers | Generalization | Locality |
| --- | --- | --- | --- |
| 3 | 19~21 | 87.99 | 72.53 |
| 5 | 19~23 | 91.30 | 72.36 |
| 8 | 17~24 | 88.18 | 71.23 |

Our practice of editing 3 FFNs has also been validated as successful and scalable in the context of LLMs by the prior hypernetwork-based method MEND [4], which successfully edits 3 FFNs across various scales of LLMs, including distilGPT-2 (82M), GPT-Neo (2.7B), and GPT-J (6B).
- Increasing the number of FFNs to edit requires only a marginal increase in the size of the hypernetwork.
- To accommodate larger channel sizes, the changes in the hypernetwork are limited to the input and output layers (a minimal sketch is given after this list): (1) adding an input projection layer (e.g., 3584x768 for SwinV2-g) that projects the encoded image features to a lower dimension (the input dimension of the hypernetwork), and (2) increasing the output dimension of the output projection layer. For editing SwinV2-g, the size increase resulting from these two changes amounts to 29.3% of the original size of the hypernetwork.
- To accommodate more FFNs for editing, the only change required in the hypernetwork is the addition of one more trainable input token for each FC layer.
- The relatively small performance improvement in Fig. r1(b-c) is attributed to the difference between SwinV2-g and ViT across various classes.
- We observed significant differences in the error samples produced by SwinV2-g and ViT for each class, reflecting distinct model behaviors.
- The two groups reported in Fig. r1(b-c) were selected because our method for editing ViT showed the largest improvements (by up to 18% GR at a similar TRR) over the baselines (including FT-L2) on these groups. Although the improvement for SwinV2-g on these specific groups appears modest, during the discussion period we found that our method for editing SwinV2-g achieved a 16% GR improvement at a similar TRR on other groups, including the Vase group in the scene-light dataset and the 407-654 group in the natural dataset.
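To make the input/output adaptations described above concrete, here is a minimal PyTorch-style sketch; the module and variable names, the trunk interface, and all dimensions other than the 3584-to-768 projection are illustrative assumptions rather than our exact implementation.

```python
import torch
import torch.nn as nn


class HypernetHead(nn.Module):
    """Wraps a fixed-width hypernetwork trunk for a larger backbone.

    Only the input projection, the per-layer query tokens, and the output
    projection change with the backbone; the trunk (transformer blocks of
    width `hidden_dim`) stays the same.
    """

    def __init__(self, trunk: nn.Module, feat_dim: int = 3584, hidden_dim: int = 768,
                 num_edit_layers: int = 3, rows_per_layer: int = 3072):
        super().__init__()
        self.trunk = trunk
        self.in_proj = nn.Linear(feat_dim, hidden_dim)          # e.g., 3584 -> 768 for SwinV2-g features
        self.layer_tokens = nn.Parameter(torch.zeros(num_edit_layers, hidden_dim))
        self.out_proj = nn.Linear(hidden_dim, rows_per_layer)   # per-layer mask logits

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (N, feat_dim) features of the editing sample from the backbone.
        n = feats.size(0)
        tokens = torch.cat([self.layer_tokens.unsqueeze(0).expand(n, -1, -1),
                            self.in_proj(feats).unsqueeze(1)], dim=1)
        hidden = self.trunk(tokens)                             # (N, num_edit_layers + 1, hidden_dim)
        return self.out_proj(hidden[:, :self.layer_tokens.size(0)])  # (N, layers, rows)
```

Editing more FFNs only adds rows to `layer_tokens`, and a wider backbone only changes `in_proj`/`out_proj`, which is consistent with the marginal size increases reported above.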
[1] Identifying interpretable subspaces in image representations, ICML 2023.
[2] Labeling neural representations with inverse recognition, NeurIPS 2023.
[3] Interpreting CLIP's image representation via text-based decomposition, ICLR 2024.
[4] Fast model editing at scale, ICLR 2022.
Thank you for your additional comments.
They have addressed my concerns, and I have adjusted my rating to borderline accept. The paper offers valid contributions.
However, I still question the value of model editing in the vision domain compared to its impact on LLMs. Additionally, the novelty of the hyper-network design is limited and not particularly surprising. This is why I don't rate higher than borderline accept.
Nonetheless, I appreciate the authors' active discussion.
We sincerely thank the reviewer for the valuable feedback and for increasing the score! We greatly appreciate the opportunity to discuss and refine our research work.
Regarding the value of model editing in the vision domain, we anticipate its broad impact for several reasons: (1) correcting subpopulation shifts, which is prevalently encountered by vision foundation models, is an urgent need; (2) re-training a whole vision pre-trained model typically requires access to the full pre-training data (e.g., JFT-3B [1]), which is often inaccessible; and (3) our proposed editing method can be readily applied to diffusion models (e.g., SD-XL) for visual generation, which we are actively exploring.
Meanwhile, we would also like to highlight the novelty of our hypernetwork design: (1) our approach is the first to learn where to edit, while previous hypernetwork-based methods [2, 3] focus on how to edit; this is reflected in the completely different inputs/outputs of our hypernetwork compared to earlier methods. (2) Optimizing such a hypernetwork poses significant challenges, which motivates us to propose (a) the construction of pseudo editing episodes, which is the key to learning how to localize, and (b) the decoupling of hypernetwork optimization (see Appendix B.3.2), which is crucial for avoiding trivial mask solutions.
Thank you once again for your thoughtful evaluation and for recognizing our contributions.
[1] Scaling Vision Transformers, CVPR 2022.
[2] Editing factual knowledge in language models, EMNLP 2021.
[3] Fast model editing at scale, ICLR 2022.
The paper investigates a novel method for editing Vision Transformers (ViTs) to enhance their performance by rectifying predictive errors. Specifically, it proposes training an additional ViT to locate specific rows in the weight matrices of several feedforward network (FFN) layers for efficient fine-tuning. The authors evaluate their approach using benchmarks such as MAD-searched natural images and AI-generated image datasets. Compared with several state-of-the-art (SOTA) baselines, this work achieves the best locality/generalization tradeoff, offering flexible editing choices and accurate localization of where to edit.
Strengths
Originality:
- The paper introduces a novel approach to model editing in vision tasks, which is an emerging area with significant potential.
- It combines existing techniques in a new way to address specific challenges in model performance enhancement.
- The method's ability to identify and edit specific parts of models is innovative and could be valuable for various applications.
Quality: The submission is technically sound with comprehensive question-raising and experimental support.
Clarity: The paper is well-written and organized, making it easy to follow the methodology and results.
Significance:
- The results are important as they demonstrate a new way to enhance vision models, which could have significant implications for both research and practical applications.
- The method addresses a challenging task and advances the state of the art in model editing and vision tasks.
Weaknesses
Originality: The main objective, learning where to edit, is not a brand-new idea in NLP. The novelty could be better emphasized by comparing more thoroughly with recent advancements in model editing and vision tasks.
Quality:
- The application of the proposed method is somewhat limited to plain ViTs. It would be interesting to see the application to hierarchical ViTs like Swin and PVT.
- The evaluation is somewhat limited to specific datasets. It would be better to explore other datasets like CUB-200-2011, NABirds, Oxford Flowers, Stanford Dogs, and Stanford Cars.
Clarity: No further issues.
Significance: The impact of the work would be more convincing with broader validation across diverse datasets and tasks.
Questions
- For the comparison with FT-L2, is it possible to compare with FT-L1 or apply L2 in the proposed method to have a clearer comparison on the regularization term?
- Can the authors provide more detail on the computational complexity and resource requirements of their method? How does the additional complexity of the hypernet compare to the saving in fine-tuning?
- In line 150, is there a detailed description of the challenge that the authors focus on? The flow seems to be interrupted here.
Limitations
Limitations are addressed under originality and quality in the weaknesses section.
We sincerely appreciate your constructive comments on this paper. We detail our response below point by point. Please kindly let us know if our response addresses the issues you raised in this paper.
[W1] Application to hierarchical ViTs
We greatly appreciate the reviewer's suggestion to test our method with different ViT architectures beyond the plain ViTs. We note that our approach is designed to be broadly applicable across various ViT architectures, including hierarchical ViTs. This is because our method meta-learns the optimal locations for editing within the parameter space of any given pre-trained ViT, without imposing strict assumptions on the specific architecture of the models. To verify our claim:
- We follow the reviewer's suggestion and apply our method to a 1-billion-parameter SwinV2-g, which is self-supervised pre-trained via SimMIM [1].
- As shown in Fig. r1 of the rebuttal PDF, our method achieves the best Pareto front when editing two groups of the natural dataset.
- Due to the time limit and the model's size, we only included the most recent editing methods as baselines for comparison, but we will include all baseline methods and their results in our revised manuscript.
[W2] Application to more datasets
We follow the reviewer's suggestion and conduct more editing experiments on the Stanford Cars dataset. Experimental results in Fig. r4 of the rebuttal PDF show the superiority of our method over baselines on this fine-grained classification dataset. The details of the experimental setup are as follows:
- Following [2], we first train a classification head by linear probing on the Stanford Cars training set and evaluate it on the Stanford Cars testing set, achieving an accuracy of 50.69%.
- To construct the editing dataset, we collect incorrectly predicted images from the Stanford Cars testing set and group them by their labels. For each error group, one image is used to edit the model, while the remaining images in the same group are used to evaluate the generalization. Additionally, correctly predicted images from the Stanford Cars testing set are used to assess the locality.
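For clarity, a minimal sketch of this grouping protocol (data structures and names are ours):

```python
from collections import defaultdict


def build_editing_groups(predictions, labels, image_ids):
    """Group misclassified test images by ground-truth label: within each
    group, the first image performs the edit and the rest measure
    generalization; correctly predicted images are kept for locality."""
    error_groups, locality_set = defaultdict(list), []
    for img_id, pred, label in zip(image_ids, predictions, labels):
        if pred == label:
            locality_set.append(img_id)
        else:
            error_groups[label].append(img_id)

    episodes = []
    for label, imgs in error_groups.items():
        if len(imgs) < 2:
            continue  # need at least one editing and one generalization image
        episodes.append({"label": label, "edit": imgs[0], "generalization": imgs[1:]})
    return episodes, locality_set
```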
[Q1] Comparison of the regularization term
We follow the reviewer's suggestion and conduct additional ablation studies to provide a detailed comparison of the regularization term. Our observations are as follows:
- FT-L1 generally outperforms FT-L2: As shown in Fig. r2(a) of the rebuttal PDF, FT-L1 generally outperforms FT-L2 across varying levels of regularization strength. This is because L1 regularization encourages sparse updates, allowing some pre-trained parameters to remain unchanged. Consequently, it better preserves the locality of the pre-trained model and updates only the most crucial parameters linked to editing success.
- Our method outperforms both FT-L1 and FT-L2: Despite the effectiveness of L1 regularization, our proposed method outperforms FT-L1 thanks to our meta-learned hypernetworks which can leverage transferable knowledge (inductive bias) from related training editing episodes to enhance the editing performance on each test editing sample (see Fig. 5b and lines 361-363). In contrast, FT-L1 performs each edit solely based on that particular test editing sample.
- Locality vs generalization trade-offs for ours + L2 regularization: To further illustrate the effect of L2 regularization, we conducted editing experiments using our proposed method with an additional L2 term. We explored two sets of balancing ratios between the cross-entropy (CE) loss and L2 regularization: 1:1 and 1:10. As shown in Fig. r2(b-c) of the rebuttal PDF, increasing the strength of L2 regularization enhances locality but compromises generalization performance.
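For reference, a minimal sketch of the regularized fine-tuning objective underlying the FT-L1 / FT-L2 comparison above (names and the interface are ours, not the baselines' exact implementations):

```python
import torch
import torch.nn.functional as F


def regularized_edit_loss(model, pretrained_params, image, label,
                          penalty: str = "l1", lam: float = 1.0) -> torch.Tensor:
    """Cross-entropy on the editing sample plus an L1 or L2 penalty on the
    deviation from the pre-trained parameters; `lam` balances the two terms."""
    ce = F.cross_entropy(model(image.unsqueeze(0)), torch.tensor([label]))
    dev = 0.0
    for name, p in model.named_parameters():
        diff = p - pretrained_params[name]                     # deviation from the frozen copy
        dev = dev + (diff.abs().sum() if penalty == "l1" else diff.pow(2).sum())
    return ce + lam * dev
```

The L1 penalty drives many deviations exactly to zero, which is consistent with the sparse-update behavior noted above.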
[Q2] Computational efficiency concerns regarding the hypernetwork
- Please kindly refer to [R2] of the global response.
[Q3] Clarification on line 150
We would like to clarify that the central challenge we wish to address in line 150 pertains to current localization strategies in model editing, which are predominantly designed for large language models (LLMs), such as GPTs. Strategies such as Knowledge Neurons (KN) and causal tracing (ROME) are not readily transferable to editing Vision Transformers due to the inherent differences between LLMs and ViTs:
- (i) Input tokenization. LLMs use word embeddings as input tokens, whereas ViTs use cropped image patches as input tokens;
- (ii) Attention mechanisms. In GPT, each token attends only to the preceding tokens in the sequence (causal attention), while in ViTs every token attends to all tokens;
- (iii) Hierarchical structures. ViT variants such as Swin and PVT possess hierarchical structures not present in LLMs.
As a result, prior localization methods for LLMs yield suboptimal editing results when applied to Vision Transformers (c.f. Fig. 4, where ours outperforms ROME and KN). Thus, we are highly motivated to design a new localization strategy for pinpointing where to edit in ViTs.
[1] On data scaling in masked image modeling, CVPR 2023.
[2] Visual prompt tuning. ECCV 2022.
Thank you for the detailed response. I have updated my vote to accept.
We are pleased that the concerns raised by the reviewer have been addressed. Once again, we would like to express our gratitude for your constructive comments and positive feedback on our work.
We sincerely thank all the reviewers and ACs for your diligent efforts and high-quality reviews. If there are any additional questions or if further clarification is needed, please feel free to let us know. Your insights are highly valued.
We are delighted to note that reviewers find that:
- our method is novel (Reviewer wreN) and innovative (Reviewer MKpn), with clear and easy-to-follow writing (Reviewers wreN, sCvZ, MKpn)
- we contribute a new benchmark for model editing (Reviewers 91WR, sCvZ, and MKpn), which is beneficial to later research (Reviewer sCvZ)
In response to your valuable suggestions, we conduct additional experiments and the supplementary rebuttal PDF includes the new results for your convenience:
- Fig. r1: We add experiments on SwinV2-g [1], a large-scale (suggested by Reviewer 91WR) hierarchical (suggested by Reviewer wreN) ViT.
- Fig. r2: Ablation study for a clearer comparison of different regularization terms (suggested by Reviewer wreN).
- Fig. r3: Ablation study of the number of blocks in the hypernetwork (suggested by Reviewer 91WR).
- Fig. r4: We add experiments on the Stanford Cars dataset to further evaluate our method (suggested by Reviewer wreN).
Finally, due to character limits, we put responses to commonly raised questions below.
[R1] Application to large-scale hierarchical ViTs
We note that our approach is designed to be broadly applicable across various ViTs, including large-scale and hierarchical ViTs. To verify our claim:
- First, our method meta-learns the optimal locations for editing within the parameter space of any given pre-trained ViT, without imposing strict assumptions on the specific architecture of the models.
- Second, even as the size of pre-trained networks grows, our method via greedy search reduces the optimal layers to edit (i.e., the hypernetwork output space) to a minimal number (c.f. Line 201). The results in Fig. r1(a) of the rebuttal PDF indicate that editing the 19th to 21st layers of SwinV2-g (1B) is sufficient to achieve a good balance between generalization and locality.
- We compare our method with the most recent editing methods for editing SwinV2-g in Fig.r1(b-c) of the rebuttal PDF. Our proposed method demonstrates superior performance compared to all baselines, achieving the best balance between generalization and locality.
- Due to the time limit and the large scale of the model, we only included the most recent editing methods as baselines and will include all baseline methods in the revised manuscript.
[R2] Computational efficiency regarding the hypernetwork (To Reviewer wreN and 91WR)
First, we would like to outline the major computational processes of the proposed method in the following table.
| (a) Before Editing (Section 3.3) | (b) Test-time Editing (Section 3.4) | (c) Test-time Editing (Section 3.4) |
| --- | --- | --- |
| Meta-learning the hypernetwork with Eqn. (2) | Forward pass of the hypernetwork to obtain the mask in Eqn. (3) | Training only the parameters activated by the mask (c.f. Lines 238-242) |
- We humbly clarify that, in line with hypernetwork-based model editing methods [2,3], the additional computational costs incurred by hypernetwork training (i.e., (a)) are acceptable and worthwhile for several reasons:
- Training the 5-block hypernetwork (31.9M) in our paper for a pre-trained ViT/B-16 takes approximately 9 hours on a single RTX A6000 GPU (48G).
- Given a target pre-trained network, the hypernetwork is trained prior to model editing (i.e., (b) + (c)) and this training only needs to be done once.
- The number of available popular vision pre-trained models (or even LLMs) is limited [4].
- The effectiveness of the learned hypernetwork in locating "where to edit" has been proved in subsequent editing tasks (c.f. Fig. 4 and Appendix C.1/C.2) and the comparison with random masks (c.f. Fig. 5(c)).
- During test-time editing (i.e., (b) + (c)), our method even reduces the computation compared to the editing method of full fine-tuning which consumes around 52.8G FLOPs per editing iteration;
In (b), one-shot inference with the hypernetwork to generate the mask takes only 277.1M FLOPs;
In (c), updating the parameters activated by the mask consumes 35.5G FLOPs in the first iteration and 26.4G FLOPs in subsequent iterations, either of which is significantly less than 52.8G FLOPs. The difference between iterations arises because the first iteration requires a one-time full forward pass through the pre-trained ViT to obtain features of the error samples, whereas subsequent iterations only update the optimal layers identified by our method (c.f. Line 201).
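For illustration, a minimal sketch of test-time editing steps (b) and (c) (the `forward_features` interface and the mask shapes are assumptions for this sketch, not our exact implementation):

```python
import torch
import torch.nn.functional as F


def masked_edit_step(vit, hypernetwork, ffn_weights, error_image, label, lr: float = 1e-3):
    """One test-time editing iteration: a single hypernetwork forward pass
    produces binary masks (step (b)), then only the mask-selected entries of
    the chosen FFN weights are updated (step (c))."""
    # (b) one-shot inference with the hypernetwork to obtain the masks.
    with torch.no_grad():
        feats = vit.forward_features(error_image.unsqueeze(0))
        masks = hypernetwork(feats)  # one binary mask per edited FFN weight matrix

    # (c) gradient step restricted to the parameters activated by the masks.
    loss = F.cross_entropy(vit(error_image.unsqueeze(0)), torch.tensor([label]))
    grads = torch.autograd.grad(loss, ffn_weights)
    with torch.no_grad():
        for w, g, m in zip(ffn_weights, grads, masks):
            w -= lr * g * m  # unmasked entries stay at their pre-trained values
    return loss.item()
```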
[1] On data scaling in masked image modeling, CVPR 2023.
[2] Editing factual knowledge in language models, EMNLP 2021.
[3] Fast model editing at scale, ICLR 2022.
[4] Battle of the backbones: A large-scale comparison of pretrained models across computer vision tasks, NeurIPS 2023.
This paper proposes a new method to edit vision transformers using a hypernetwork. It introduces a learning-to-learn methodology utilizing a hypernetwork that identifies and modifies a small set of critical parameters in response to erroneous samples.
The submission received scores of 7, 5, 7, 6, and the rebuttal and discussion between the reviewers and authors have cleared the remaining points. The AC recommends acceptance.