Machine Unlearning via Task Simplex Arithmetic
VLM unlearning via a closed-form ensemble of infinitely many functions whose parameters are uniformly sampled from a task arithmetic simplex.
Abstract
Reviews and Discussion
The paper proposes to improve unlearning by sampling task vectors from a simplex, which is composed of Q pivot task vectors. Authors discuss the potential of the method for reducing variance. Experiments are conducted on standard unlearning benchmarks with different VLMs.
Strengths and Weaknesses
Strengths
- Making task arithmetic more robust is an important question for unlearning (and model merging). The paper is well-motivated, and the empirical results are good.
Weaknesses
- Some notations are not clear (e.g., variance of function ensemble)
- Some baselines can be improved (the proposed method is a general approach and can be applied to other scenarios besides VLMs)
- Please see details in the question section.
Questions
- I would think that ensemble is a general approach that can be applied to other pretrained models (e.g., LLM). Furthermore, by reducing variance, it may also help other task arithmetic operations besides unlearning (e.g., learning instead of unlearning, adding task vectors from different tasks). Can we have a brief discussion on different models and tasks?
- On the variance of the ensemble
- Line 44, the notation of \sigma(x) is somewhat unclear. In my understanding, the randomness of the ensemble is from the subscript i (i.e., randomly selecting f \in R^C from Q possible task vectors), thus it is a multivariate random variable, and we should discuss the covariance rather than a scalar. Does the notation there assume independence among f's components, or is f a scalar (the probability of the ground truth)?
- Line 59, can you further explain the reference to Bienaymé formula? It basically says that the variance of a sum of independent variables is the sum of each r.v.'s variance. Here, the ensemble is a sum of f \in R^C (each of which is a predicted C-dimensional distribution), but each prediction is not a random variable (as my understanding above, the randomness is from selecting Q task vectors, not from the classifier's output). Would you please clarify which random variable is studied here?
- Line 151, the Hessian term may contain subscript c.
- Equation 2, I would suggest using integration, rather than summations, over \tau. For completeness, it is also helpful to specify how to compute the first-order term.
- Equation 10/11, the \sigma^2 term contains unclosed parentheses.
- On experiments:
- Figure 3, I would suggest improving the function ensemble baseline by further increasing Q
- Table 3, in the incremental unlearning setting, does the ensemble method still keep the retain set performance?
- It would also be better to discuss limitations of the proposed methods (e.g., additional costs of computing/storing Hessian)
- Regarding line 604 (paper checklist), it would be better to open the code to help reproduce the results.
Limitations
Please see the question section.
Final Justification
The authors' response has addressed some of my concerns, and I keep my original score.
Formatting Concerns
N/A
Response to Rev. wwLa
We thank the Reviewer for the constructive feedback.
1. Ensemble is a general approach that can be applied to other pre-trained models (e.g., LLM).
Thank you. Below is the table showing performance on a multimodal LLM, LLaVA‑1.5‑7B (with CLIP ViT‑L/14 as the vision encoder). We additionally validate our method on the multimodal CLEAR benchmark (Dontsov et al., 2024), which involves fictional author profiles with paired face images and captions. As shown in the table below, we report forgetting and retention accuracy on the VQA task. Our approach achieves the best forgetting while maintaining comparable retention performance.
| Method | Forget VQA Acc. (↓) | Retain VQA Acc. (↑) |
|---|---|---|
| Pre-trained Model | 69.2 | 55.7 |
| Standard Task Arithmetic | 42.7 | 49.4 |
| Uniform Merge | 37.9 | 50.5 |
| TIES-Merging | 36.1 | 49.7 |
| EMR-Merging | 34.8 | 49.3 |
| Function Ensemble | 35.6 | 50.0 |
| Ours | 31.5 | 50.2 |
Conceptually, our task‑vector framework extends to other multimodal settings, including open‑vocabulary CLIP tasks and image‑text retrieval. Adapting to those scenarios would require defining task vectors on the corresponding objective (e.g., fine‑tuning CLIP’s multimodal head to capture specific image‑caption associations) and applying our unlearning procedure in the same manner. We will explore these directions in future work.
2. Improving other task arithmetic operations besides unlearning.
While we focus predominantly on unlearning in this work, below we demonstrate an incremental (multi-task) learning scheme. The reason we focus on unlearning is that we model the simplex as capturing variations in task vectors for a given unlearning dataset. Therefore, we assume that:
- fine-tuning results only in sparse changes: each task vector has only a small fraction of non-zero entries relative to the total parameter count (86M parameters for ViT-B);
- the maximum diameter between any two task vectors is small.
These assumptions are required for modeling with the Dirichlet distribution as per Resp. 1 to Rev. uLHj. For CLIP ViT-B/16, we checked that only 4.6% of parameters changed (for unlearning), so as long as these assumptions are met (they are easily met for unlearning), we can also consider multi-task learning based on task-vector simplices by adding the sum of the task vectors of 8 tasks (datasets): Cars, DTD, SUN397, EuroSAT, GTSRB, MNIST, SVHN, and RESISC45, following Task Arithmetic (Ilharco et al., ICLR 2023). We report the average absolute accuracy (%) w.r.t. CLIP ViT-B/16 in the table below (a schematic of the task-vector addition is given after the table).
| Method | Absolute Acc. (↑) |
|---|---|
| Pre-trained Model | 55.2 |
| Standard Fine-tuning | 75.5 |
| Uniform Merge | 78.2 |
| Function Ensemble | 79.3 |
| Ours | 83.6 |
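Schematically, the multi-task merge follows Task Arithmetic (our notation; the per-task scaling used in the experiment may differ):

$$
\theta_{\text{multi}} \;=\; \theta \;+\; \lambda \sum_{t=1}^{8} \tau_t ,
$$

where $\tau_t$ is the task vector (or, in our case, the simplex-ensemble aggregate) for the $t$-th dataset and $\theta$ are the pre-trained CLIP parameters.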
3. Line 151, the Hessian term may contain subscript c.
Absolutely. Thank you.
4. Equation 2, I would suggest using integration, rather than summations.
Absolutely.
5. It is also helpful to specify how to compute the first-order term.
Absolutely. For the standard expansion, the first-order term is the inner product of the parameter gradient of f at the pre-trained parameters with the (expected) task vector, scaled by λ; we will spell out the corresponding term for Corollary 1 in the revision.
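For concreteness, a hedged sketch of the standard second-order expansion we refer to, in generic notation (the paper's Eq. (2) may differ in signs, weighting, and the integration over the simplex):

$$
f(x;\theta-\lambda\tau)\;\approx\; f(x;\theta)\;-\;\lambda\,\nabla_\theta f(x;\theta)^{\top}\tau\;+\;\tfrac{\lambda^{2}}{2}\,\tau^{\top}\nabla_\theta^{2} f(x;\theta)\,\tau ,
$$

so the first-order term only requires the gradient at the pre-trained parameters and the (mean) task vector.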
6. Parentheses in Eq. 10/11.
Duly noted.
7. Increasing Q in Figure 3.
Absolutely. In Figure 3, we provide the comparison between function-level ensembles from task vectors and our task simplex w.r.t. the number of task vectors Q for CLIP ViT-B/32. In the following table, we further include the performance for larger Q. We can notice the convergence of the unlearning performance.
| Forget Set Acc. (↓) at increasing Q |
|---|
| 15.58 |
| 15.20 |
| 15.07 |
| 14.98 |
8. In Table 3, does the ensemble method still keep the retain set performance?
Yes, within 95% of the original performance.
9. Discuss limitations of the proposed methods (e.g., additional costs of computing/storing Hessian).
10. Regarding line 604, it would be better to open the code.
Absolutely.
11. The randomness of the ensemble is from the subscript i: should we have a full covariance?
Thank you. This is a very interesting question. We assumed each per-class output f_c (for c = 1, ..., C) is independent for simplicity; f_c produces the likelihood of the c-th class. As some task vectors may be noisy for some classes, we assumed that limiting the per-class variance alone may be sufficient, and the improvements validate that. Reducing the covariance (off-diagonal terms) could indeed help further decorrelate the model; however, that would require seriously rethinking the Taylor expansion in Eq. (2) to produce the outer product as a closed-form solution because of the integration operator.
12. The Bienaymé formula.
For a fixed sample x, one may think of the network output as a transformation of a random variable (the task vector sampled from the Dirichlet distribution over the simplex) into another random variable living in the class space (the surface of the probability simplex), which follows the class distribution resulting from this transformation. The Bienaymé formula tells us what happens to the variance of each class (treated independently) as the number of ensembled functions grows; that number depends on the number of task vectors. However, the correlation among the ensembled functions is required to be low. The opposite means the functions are not independent: in the extreme case of identical functions, they clearly cannot reduce the variance. For that reason we investigated the idea of learning per-vertex weights, or even perturbing vertices within a small radius, to help reduce the variance. We agree as well that reducing the cross-terms in the covariance could strengthen this effect.
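For reference, the statement we invoke, written per class c under the independence assumption discussed above (our notation):

$$
\operatorname{Var}\!\Big(\tfrac{1}{n}\sum_{i=1}^{n} f_c(x;\theta-\lambda\tau_i)\Big)
=\frac{1}{n^{2}}\sum_{i=1}^{n}\operatorname{Var}\big(f_c(x;\theta-\lambda\tau_i)\big)
=\frac{\sigma_c^{2}(x)}{n},
$$

where the n ensembled predictions are treated as independent with a common per-class variance $\sigma_c^{2}(x)$; correlated (e.g., identical) functions break this $1/n$ decay.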
This work proposes a method for machine unlearning, which is an interesting and practically important problem. The authors point out that task vectors, a popular unlearning strategy, exhibit substantial sensitivity to various fine-tuning configurations, leading to unstable unlearning effectiveness. Obtaining and aggregating multiple task vectors can reduce prediction-level variance and improve unlearning results; however, this comes at a high computational cost.
This manuscript presents a method that captures the space of task vectors induced by diverse fine-tuning strategies. Specifically, it models this space through the convex hull of a (𝑄−1)-simplex whose vertices correspond to 𝑄 task vectors. Instead of sampling task vectors directly from the simplex, the authors derive a closed-form ensemble representing an infinite number of functions whose parameters are uniformly sampled from the simplex, thereby achieving enhanced unlearning performance in a computationally efficient manner.
The method appears to be supported by solid theoretical foundations. Furthermore, the experimental results and accompanying analyses demonstrate the effectiveness of the proposed approach.
Strengths and Weaknesses
Strengths: The paper is well-written and clearly organized. The proposed method is theoretically sound. The experimental results and analyses are thorough and convincing.
Weaknesses: The section titled “Proposed Method” contains many abstract equations, which may be difficult for readers who are not deeply familiar with this specific area to fully understand. Providing more intuitive explanations or illustrative examples could greatly improve accessibility.
Questions
Overall, I find this paper interesting and well-motivated. Minor clarifications and improvements, however, would further strengthen the work. (1) This manuscript appears to be very theoretical. Providing more intuitive explanations or illustrative examples could significantly improve accessibility for a broader audience. (2) Regarding the organization of the “Proposed Method” section: there is only one numbered subsection titled “Problem Formulation.” This structure may lead to the misunderstanding that all subsequent content falls under problem formulation. It would be clearer to divide this section into multiple meaningful subsections. (3) In the paragraph beginning with "Unlearning with Task Vectors", the sentence "Moreover, our closed-form solution achieves the lowest accuracy (the lower the better) on the forget set, surpassing the state-of-the-art approach by 3.3% on ViT-Base/32-based CLIP." is not directly supported by the results in Table 1. Additionally, in the paragraph beginning with "Unlearning with Linear Task Vectors", the sentence "Table 1 shows that the linearized task vectors consistently yield substantial reductions in the forget set accuracy across multiple merging methods and CLIP architectures, while maintaining fixed accuracy on the retain set relative to their standard (non-linearized) counterparts in Table 1." appears to mistakenly reference Table 1 twice, whereas one of the mentions likely should refer to a different table.
Limitations
Yes, but they put the discussion on limitations in Appendix F.2 instead of the main text.
Final Justification
The authors have provided a clear rebuttal that addresses my main concerns. I believe the contribution is sufficient for acceptance.
Formatting Concerns
No Paper Formatting Concerns
Response to Rev. vtZ9
We thank the Reviewer for the constructive feedback.
1. Providing more intuitive explanations.
Absolutely. The main idea is to devise a distribution that is easy to construct and sample from, given the huge parameter space (ViT-B). This is achieved by an ensemble of functions based on a Taylor expansion that jointly leverages the 0th, 1st and 2nd order statistics of the task vectors. The Taylor expansion is devised at the pre-trained parameters.
- We will illustrate such a Taylor expansion and the resulting constant, linear and quadratic terms. The linear term depends on the mean of the task vectors, and the quadratic term on specific second-order statistics, which we will also illustrate. Kindly note that the rebuttal does not permit figures (alas).
- We will also illustrate the ensemble in terms of the mean and variance, for which we can adjust outliers using per-vertex weights.
1. Divide "Proposed Method” into multiple meaningful subsections.
Absolutely. In general, we have the core "Problem Formulation":
- Problem definition with Closed-form Aggregation of Functions from Task Simplex,
followed by "Extensions" and "Theoretical Analyses":
- Advanced Aggregation Scheme.
- Distillation from Ensemble.
- Computing and Controlling Variance of Ensemble.
3. Sentence with "the state-of-the-art approach by 3.3% on ViT-Base/32-based CLIP" is not directly supported by Table 1.
Absolutely. We had a typo due to changes in the paper. We meant that, compared to the function ensemble [8], we achieve 7.55%, 7.72% and 6.34% improvements in unlearning on ViT-Base/32, ViT-Base/16 and ViT-Large/14, respectively.
3. "Table 1 shows..." repeats Table 1 twice.
Thank you. We have revised accordingly.
Thank you for your clarifications. I have no additional concerns. Kindly revise the manuscript accordingly.
Esteemed Reviewer,
We thank you for engaging with our rebuttal. Rest assured all suggestions will be incorporated into our paper. Meantime, if there is anything else we can improve, clarify or answer, kindly let us know.
Best regards,
Authors
This paper addresses the problem of machine unlearning in Vision-Language Models (VLMs) by proposing a closed-form function ensembling method based on the Task Arithmetic Simplex. Traditional task vector approaches suffer from unstable unlearning due to sensitivity to fine-tuning configurations, with prediction-level variance negatively correlating with unlearning performance. The authors model task vectors as vertices of a (Q-1)-dimensional simplex, deriving a closed-form ensemble of infinite functions via Dirichlet distribution and Taylor expansion, which effectively reduces variance and enhances unlearning. Experiments on 8 visual datasets validate the method’s superiority, especially in incremental unlearning and linear task vector scenarios.
Strengths and Weaknesses
Strengths:
- Comprehensive Ablation Studies: The paper conducts rigorous ablation experiments to decompose the contributions of key components (e.g., advanced aggregation, vertex importance weighting), clarifying the impact of each module on unlearning performance.
- Theoretical Innovation and Practical Efficacy: Modeling task vectors as a simplex and deriving closed-form ensembles using convex geometry and Dirichlet distribution addresses the high computational cost of traditional methods. Experiments demonstrate significant accuracy gains on forgetting sets (e.g., 9.98% on ViT-Large/14) with stable retained set performance, validating both theoretical rigor and practical utility.
Weaknesses:
- Unjustified Simplex Modeling of Task Vector Space: The authors propose modeling task vectors within the convex hull of a simplex but lack rigorous justification for this assumption. There is no mathematical proof or empirical analysis (e.g., geometric visualization of task vector distributions) to validate that task vectors indeed form a convex simplex. This raises questions about whether linear combinations of task vectors remain within the valid task space, especially given the non-linear dependencies inherent in real-world fine-tuning scenarios .
- Computational Cost and Dataset Limitations: While closed-form solutions avoid infinite sampling, generating Q task vectors (e.g., Q=30) requires multiple fine-tunings, imposing high costs for large models. Additionally, experiments are confined to vision datasets, lacking validation on language or cross-modal tasks.
- Theoretical Constraints on Taylor Expansion: The method relies on small perturbation assumptions (λ), but the paper omits comparative experiments under large perturbations, where second-order approximation validity may decline.
Questions
- Theoretical Justification for Simplex Modeling: Please provide mathematical proof or empirical evidence (e.g., t-SNE visualization of task vector convexity) to support the simplex modeling assumption. How do non-linear task dependencies affect this framework?
- Cross-Modal Generalization: Are there plans to validate the method on language or multi-modal models? Please include cross-modal experiments to demonstrate generalizability.
- Large Perturbation Robustness: Could you provide performance data under varying λ to validate the Taylor expansion’s accuracy beyond small perturbation scenarios?
Limitations
N/A
Final Justification
The authors' response has addressed many of my concerns, so I am raising my recommendation to Borderline Accept.
Formatting Concerns
N/A
Response to Rev. uLHj
We thank the Reviewer for the constructive feedback.
1. Validate task vectors form a convex simplex.
Thank you. This is a complex question.
Kindly note that we do not claim task vectors follow any specific distribution. Task vectors are extremely high-dimensional (CLIP ViT-B/16). For a low number of vectors, there exists no technique that can reliably estimate the distribution and tell its kind due to the curse of dimensionality: one would need far more samples than dimensions to estimate it reliably, while our Q is only 30.
As the rebuttal does not allow figures, we will add a t-SNE plot in the revision. But we chose the Dirichlet distribution (an educated choice) because, under large dimensionality, it offers:
- simplicity to set/estimate the PDF (for the Dirichlet, our Q task vectors define the PDF);
- the ability to sample from it (our Taylor derivation does that);
- compact support (not exceeding the observed min-max values of individual parameters) (1a);
- the ability to exploit 0th, 1st and 2nd order moments (simplex mean, implicit covariance) in modeling: the Dirichlet lets us use low-order moments in the Taylor expansion (1c); see the sampling sketch after this list.
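To ground the construction, here is a minimal Monte Carlo sketch of the simplex ensemble (hypothetical names such as `model_fn` are ours; the paper replaces this sampling loop with the closed-form Taylor-based solution):

```python
import torch

def simplex_ensemble_predict(model_fn, theta, task_vectors, x, lam=0.5,
                             concentration=1.0, n_samples=64):
    """Average predictions over task vectors drawn from a Dirichlet simplex.

    model_fn(x, params) -> class probabilities; `theta` and each task vector
    are flattened parameter tensors of the same shape (toy stand-ins here).
    """
    Q = len(task_vectors)
    dirichlet = torch.distributions.Dirichlet(concentration * torch.ones(Q))
    tau_stack = torch.stack(task_vectors)          # (Q, P) vertices of the simplex
    probs = 0.0
    for _ in range(n_samples):
        alpha = dirichlet.sample()                 # simplex weights, shape (Q,)
        tau = alpha @ tau_stack                    # interpolated task vector
        probs = probs + model_fn(x, theta - lam * tau)   # unlearning via subtraction
    return probs / n_samples
```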
1a. Intuitive reasoning.
Notice that:
- Few-epoch fine-tuning on the unlearning dataset yields a sparse change: only a small fraction of parameters departs from the pre-trained values. [A] confirms the sparsity of such task vectors. For CLIP ViT-B/16, we checked that only 4.6% of parameters changed.
[A]. Efficient Model Editing with Task Vector Bases: A Theoretical Framework and Scalable Approach, ICLR'25.
Thus, we assume that interpolating between such sparse differences captures a feasible parameter subset pertinent to the unlearning dataset and realizes several magnitudes of "active" parameters.
- As a counterexample, let the task vectors be Normally distributed with a diagonal covariance (off-diagonal terms set to 0, since a full covariance of this dimensionality cannot be estimated), with the mean and per-parameter variance estimated from our Q task vectors, from which we then sample task vectors.
The table below (CLIP ViT-B/16) shows the function-level ensemble under the Normal distribution (task vectors sampled from it):

| Distribution | Forget (↓) | Retain (↑) |
|---|---|---|
| Normal | 16.80 | 64.32 |
| Normal | 15.98 | 64.54 |
| Dirichlet (Ours) | 12.17 | 64.93 |

The Normal distribution performs worse than our Dirichlet model because its tails decay slowly toward infinity, whereas the Dirichlet distribution has compact support: it produces finite parameter values constrained by the simplex.
1b. Theory/analysis.
We provide a rigorous bound: the interpolated task-vector model deviates from the expected convex combination of the per-vertex models by a bounded amount.
The bound is small if:
- the task vectors differ by a small diameter D;
- λ is small;
- f is smooth (low Lipschitz constant L).
These conditions are met in our paper (small D due to the sparse changes between task vectors, small λ).
Theorem: Interpolation Performance Bound.
Let f be the unlearning function and let it be L-Lipschitz continuous in its parameters.
Given Q task vectors and any convex combination of them with weights on the simplex, the prediction of the interpolated model stays within the bound sketched below of the corresponding convex combination of per-vertex predictions.
Theorem's importance: any interpolated choice of task vector deviates from the expected convex combination by no more than this bound. We can provide the proof in the discussion (rebuttal space limits).
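A hedged reconstruction of the bound under the listed conditions (our notation; the paper's exact constants may differ):

$$
\Big\| f\big(x;\theta-\lambda\textstyle\sum_{i}\alpha_i\tau_i\big)-\sum_{i}\alpha_i\, f\big(x;\theta-\lambda\tau_i\big)\Big\|
\;\le\;\lambda L \max_i\|\tau-\tau_i\|\;\le\;\lambda L D,
$$

with $\tau=\sum_i\alpha_i\tau_i$, $\alpha\in\Delta^{Q-1}$, and $D=\max_{i,j}\|\tau_i-\tau_j\|$; the bound follows from the triangle inequality and the $L$-Lipschitz continuity of $f$ in its parameters.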
1c. Analyzing Taylor orders (how do non-linear task dep. affect framework).
- The vector task aggregation model [A] has the same 1st-order Taylor expansion as our method (it also equals the NTK linearization). As the 0th and 1st order moments of our technique are identical to those of [A] & [B], integration over the simplex is not an issue under this expansion, by virtue of methods [A] & [B].
[B]. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time, ICML'22
- Now we compare the 2nd-order expansion terms of:
  - approach [A];
  - the basic function ensemble [B];
  - our Simplex ensemble (a hedged sketch of all three terms is given after the table below).
As the 0th and 1st order terms are identical, and our Simplex ensemble interpolates between the two known approaches [A] ("Vector Uniform Merge" in our paper) and [B] ("Function Ensemble" in our paper) at the level of the 2nd-order expansion (convex functions), it is OK to integrate functions over task vectors uniformly drawn from the Dirichlet distribution (our simplex ensemble).
The concentration parameter of the Dirichlet distribution recovers [B] in one limit and [A] in the other, which allows smooth interpolation between [A] & [B]. Table (CLIP ViT-B/16):

| Dirichlet concentration | Forget (↓) | Retain (↑) |
|---|---|---|
| 0.1 | 14.65 | 64.49 |
| 0.5 | 12.83 | 64.26 |
| 0.8 | 12.05 | 64.68 |
| 1.2 | 12.48 | 64.85 |
| 1.5 | 12.70 | 64.92 |
| 2.0 | 13.29 | 64.81 |
| 1.0 (Ours) | 12.17 | 64.93 |

If we use Eq. (6) of the paper, the per-vertex weight further customizes the interpolation (and helps reduce the variance).
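A hedged sketch of the three 2nd-order terms in our notation ($H$ is the per-class Hessian of $f$ at the pre-trained parameters, $\bar\tau=\frac{1}{Q}\sum_i\tau_i$; the paper's exact equations may differ in constants):

$$
\text{[A]: }\tfrac{\lambda^{2}}{2}\,\bar\tau^{\top}H\bar\tau,\qquad
\text{[B]: }\tfrac{\lambda^{2}}{2}\cdot\tfrac{1}{Q}\sum_i\tau_i^{\top}H\tau_i,\qquad
\text{Simplex }(\alpha\sim\mathrm{Dir}(a,\dots,a)):\;
\tfrac{\lambda^{2}}{2}\cdot\frac{\sum_i\tau_i^{\top}H\tau_i+aQ^{2}\,\bar\tau^{\top}H\bar\tau}{Q(Qa+1)},
$$

which recovers [B] as $a\to 0$ and [A] as $a\to\infty$; uniform sampling over the simplex corresponds to $a=1$.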
1d. Parameter augmentation.
As averaging all task vectors gives viable unlearning (model [A]), and ensembling functions built on individual task vectors gives viable unlearning (model [B]), interpolating between the mean and the individual task vectors acts as "parameter augmentation". Based on 1b, if f changes rapidly in one region of the simplex and slowly in another (for neighborhoods of the same size), then intuitively the entropy of the ensembled prediction is lower in the smooth region, as smooth changes contribute coherently to the vote, whereas chaotic changes are incoherent and lead to cancellation of class peaks (the entropy is maximal). So our method leverages the smoothness of f w.r.t. this parameter augmentation. See also [C]:
[C]. Averaging Weights Leads to Wider Optima and Better Generalization, UAI'19
1e. Do task vectors form simplex?
They do not have to, so long as they are close to a simplex. We take 10 out of 30 task vectors fine-tuned on SUN397 and build a simplex (after reducing the dimensionality with count sketch). The remaining 20 task vectors lie within a small radius of that simplex, while the task vectors of Cars exceed that distance (they cluster around their own simplex).
2a. Cost: Many fine-tunings.
We fine-tune over a mere 6-35 epochs (not 1000). We can reduce Q to 10 for speed or fine-tune in parallel.
See sequential fine‑tuning time/unlearning inference time (our paper's setup):
Table (CLIP ViT-B/16):

| Task Vector Number | Forget (↓) | Retain (↑) | Fine-tuning Time | Unlearning Inference Time |
|---|---|---|---|---|
| 10 | 14.06 | 64.77 | 1.9h | 372s |
| 15 | 13.29 | 64.62 | 2.9h | 458s |
| 30 | 12.17 | 64.93 | 5.7h | 716s |
Below we reduce the epoch range (6-35) to 1/2 or 1/3:

| Fine-tuning Epoch Percentage | Forget (↓) | Retain (↑) | Fine-tuning Time |
|---|---|---|---|
| 1/3 of 6-35 epochs | 15.54 | 65.14 | 2.4h |
| 1/2 of 6-35 epochs | 14.28 | 65.06 | 3.2h |
| 6-35 epochs | 12.17 | 64.93 | 5.7h |
2b. Lang. or cross-modal task.
We apply our method on the multi-modal CLEAR benchmark (Dontsov et al., 2024: fictional author profiles with paired face images and captions). See the average accuracy on the VQA task (LLaVA‑1.5‑7B with CLIP ViT‑L/14 as the vision encoder). Task vectors are derived by fine‑tuning on the corresponding target data.
| Method | Forget VQA Acc. (↓) | Retain VQA Acc. (↑) |
|---|---|---|
| Pre-trained Model | 69.2 | 55.7 |
| Standard Task Arithmetic | 42.7 | 49.4 |
| Uniform Merge | 37.9 | 50.5 |
| TIES-Merging | 36.1 | 49.7 |
| EMR-Merging | 34.8 | 49.3 |
| Function Ensemble | 35.6 | 50.0 |
| Ours | 31.5 | 50.2 |
3. Constraints on the Taylor expansion.
Kindly note that λ² in the 2nd-order term decays quadratically, while the 1st-order term is linear in λ. In practice, λ is chosen by cross-validation. For low λ, our method reverts to the 1st-order expansion (as in [A] & [B]). Below we vary λ (CLIP ViT-B/16) on SUN397. Bold: best in cross-validation.
| λ | Forget (↓) | Retain (↑) |
|---|---|---|
| 0.1 | 61.4 | 67.6 |
| 0.2 | 57.5 | 66.7 |
| 0.3 | 53.6 | 66.2 |
| 0.4 | 49.8 | 65.7 |
| 0.5 | 45.9 | 64.8 |
| 0.6 | 42.0 | 65.1 |
| 0.7 | 40.1 | 62.8 |
| 0.8 | 39.2 | 60.3 |
Thank you for your clarifications, please revise the manuscript accordingly. I will adjust the score.
Esteemed Reviewer,
We thank you for engaging with our rebuttal. Rest assured we will revise the paper as per your suggestions. Meantime, if there is anything else we can improve, clarify or answer, kindly let us know.
Best regards,
Authors
This paper addresses the problem of machine unlearning in large vision-language models (VLMs) like CLIP. Machine unlearning refers to efficiently removing or “forgetting” the influence of certain training data (the forget set) from a model without retraining from scratch, to meet privacy regulations (e.g. the “right to be forgotten” under GDPR). The authors build on the concept of task vectors – differences in model parameters when fine-tuning with vs. without the target data (as used in Editing Models with Task Arithmetic by Ilharco et al., ICLR 2023). In prior work, subtracting a task vector from the model could forget a dataset in a plug-and-play way, but the effectiveness varied greatly depending on fine-tuning specifics. This paper’s key idea is to mitigate that variance by using ensembles of many task vectors. They propose modeling the space of possible task vectors (obtained under different fine-tuning conditions) as a simplex (convex hull) and derive a closed-form solution that aggregates an infinite ensemble of models sampled from this simplex. In simpler terms, instead of relying on one fine-tuned model difference, they analytically average out an infinite number of fine-tuning outcomes, which dramatically reduces prediction variance and improves forgetting performance.
Strengths and Weaknesses
The paper empirically demonstrates that increasing the number of task vectors in an ensemble improves unlearning, showing a clear negative correlation between ensemble prediction variance and forget-set accuracy. The authors introduce the task simplex, a geometric construct capturing a diverse set of task vectors, and show that sampling within this simplex effectively yields interpolations corresponding to "multi-task" unlearning. Additionally, they derive a closed-form ensemble method using a second-order Taylor expansion and properties of the Dirichlet distribution to integrate over infinitely many models from the simplex without brute-force sampling. This yields an analytic solution that can be applied to the original model to forget the data. Moreover, the paper extends this to a probabilistic aggregator (based on the probability of "at least one" model predicting a class) to further boost reliable forgetting. This work also demonstrates how to distill the effects of the infinite ensemble into a single model (a single "unlearning" task vector), making deployment practical.
Extensive experiments on 8 image classification datasets and the ImageNet retain set show that their method outperforms state-of-the-art unlearning methods, including single task vector subtraction (Ilharco et al. 2023), model weight merging techniques (e.g. Wortsman et al.’s Model Soup, ICML 2022), and other ensemble or linearization baselines. They also show their approach supports incremental unlearning – sequentially forgetting multiple datasets one after another – with strong results. The results are impressive. For example, on CLIP ViT-B/32 (Table 1), the average forget-set accuracy (lower is better, since 100% would mean the model still remembers everything) for a single task vector was 24.2%. Their method brings this down to 15.2% – a large improvement (the forget sets are small, so a lower accuracy means the model is doing poorly on them, which is good for forgetting). This outperforms even the strongest baseline (EMR merge got ~21.8%, function ensemble of 30 models ~22.7%). They achieve similarly large gains on the larger CLIP models: e.g., on ViT-L/14, they go from ~16.7% (best prior) down to ~9.98% forget accuracy – essentially halving the residual accuracy on sensitive data, which indicates a very thorough forgetting. Notably, their distilled single model version only slightly regresses (e.g., 15.66% vs 15.20% on ViT-B/32), showing that you don’t actually need to keep multiple models around — one can reap the benefits in a single model after distillation.
Overall, the paper introduces a novel ensemble-based unlearning strategy that is both theoretically grounded and empirically effective, pushing forward the capabilities for efficient post-hoc removal of training data from large models.
===
If any, one might point out that the method’s computational overhead is front-loaded (fine-tuning 30 models per task). The paper frames this in a positive light (no need to retrain the large model from scratch, which is indeed much more costly than 30 fine-tunings). But 30 fine-tunings with augmentations could still be heavy for very large models or very many forget requests. The authors don’t explicitly report how long those fine-tunings take; however, since they did it for 8 datasets and multiple architectures, it was clearly feasible.
Questions
Computing a full Hessian for CLIP seems infeasible - could you clarify how Equation (3) was implemented? Also, how important are the second-order terms versus first-order?
CLIP ViT-L/14 is large, but how about something like CLIP with Vision Transformer Huge or a future model with billions of parameters – would the method scale? The concern is memory/time for fine-tuning many large models. Do you foresee any obstacles in scaling up, or does the approach parallelize well enough (fine-tuning can be embarrassingly parallel for each task vector)? Also, for very large output spaces (say thousands of classes), does the advanced aggregator (Theorem 2) or variance computation become too slow? It might be helpful to discuss any scalability tricks or limitations.
Your approach makes it easier to remove knowledge of data, which is positive for privacy. Could it be misused in any way? For instance, one could intentionally “unlearn” important facts from a model (for example, sabotage a facial recognition system by making it forget certain identities). This would require access to the model and data, so it’s not a big threat model, but it’s worth considering if making models so flexible could have downsides. Another angle: might repeated unlearning degrade a model in unforeseen ways?
Limitations
A notable limitation is the computational cost and workflow complexity of the approach. It requires multiple fine-tuning runs and careful orchestration (especially if doing sequential unlearning, one must manage updating the base model). In environments with limited computational resources, this could be challenging. However, this is a trade-off: any effective unlearning that is easier than full retraining will have some cost, and here it’s parallelizable fine-tuning jobs which many organizations can manage.
As discussed, the approach focuses on classification benchmarks. It’s not directly tested on, say, open-vocabulary CLIP tasks or retrieval tasks. So one limitation is task specificity: forgetting is measured in terms of classification accuracy. If the requirement was to forget certain image-text pair associations in CLIP (like CLIP should not associate a particular image with a caption because the caption is private), one might need to adapt the method (maybe fine-tune CLIP’s multimodal head on that association). The method should conceptually extend, but it may require careful setup for different objectives.
Final Justification
The authors' rebuttal comprehensively addresses my points: HVP enables feasible second-order computation (3s per HVP), second-order terms are key (+2.2% over first-order), LoRA scales to ViT-Huge (8.74% forget acc.), aggregator remains fast (seconds, not hours), and new VQA results show multimodal applicability (31.5% forget vs. 34.8% best prior). Ethical misuse is now discussed, and repeated unlearning preserves ~90% retain acc. These additions (with tables) mitigate the front-loaded cost limitation and extend beyond classification, reinforcing the method's novelty, rigor, and impact in VLM unlearning. The authors did address all the queries with the rebuttal. This is a technically solid paper with high impact in AI privacy/unlearning, excellent evaluation, and no ethical issues. I stand by my original accept recommendation.
Formatting Concerns
The paper is well-formatted in general, following NeurIPS style. No required elements seem missing; they have references, sections, etc. Figures and tables are numbered and referenced in text. I did not catch any typos in the main text - it is polished. The checklist portion at the end might not be needed in final camera-ready, but for submission it’s fine.
Response to Rev. 5xSJ
We thank the Reviewer for the constructive feedback.
1. Computational overhead is front-loaded (fine-tuning 30 models per task).
Thank you. Kindly note we can vary Q. Moreover, we can reduce the number of fine-tuning epochs (currently in the range 6-35) to 1/2 or 1/3 (e.g., the range 2-12), which is significantly cheaper than model pre-training/training from scratch. Finally, obtaining the task vectors can be parallelized.
The tables below show results w.r.t. Q and the number of fine-tuning epochs, respectively.
Table (CLIP ViT-B/16):

| Task Vector Number | Forget (↓) | Retain (↑) | Fine-tuning Time | Unlearning Inference Time |
|---|---|---|---|---|
| 10 | 14.06 | 64.77 | 1.9h | 372s |
| 15 | 13.29 | 64.62 | 2.9h | 458s |
| 30 | 12.17 | 64.93 | 5.7h | 716s |
| Fine-tuning Epoch Percentage | Forget (↓) | Retain (↑) | Fine-tuning Time |
|---|---|---|---|
| 1/3 of 6-35 epochs | 15.54 | 65.14 | 2.4h |
| 1/2 of 6-35 epochs | 14.28 | 65.06 | 3.2h |
| 6-35 epochs | 12.17 | 64.93 | 5.7h |
1b. One must manage updating the base model.
Only fine-tuning from the pre-trained model to obtain the task vectors is needed. Unlearning itself is then performed by a mere Taylor expansion -- unless distillation is desired to "compact" the model.
2. Could you clarify how Equation (3) was implemented?
Eq. (3) uses the so-called Hessian-Vector Product (HVP), which requires the use of vmap, jacrev and hvp from torch.func and torch.autograd.functional. HVP never computes the Hessian matrix but directly and efficiently obtains the product of the Hessian with a task vector; that product can then be dot-multiplied with the task vector to form the quadratic term. An HVP takes 2-4x the cost of the Jacobian computation, which makes it extremely efficient. In our case, one HVP takes approx. 3 seconds.
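A minimal HVP sketch (illustrative, not the paper's exact Eq. (3) code; `loss_fn` and the toy dimensions are hypothetical stand-ins):

```python
import torch
from torch.autograd.functional import hvp

def loss_fn(theta):
    # Toy scalar objective standing in for the model output/loss at parameters `theta`.
    return (theta ** 2).sum() + theta.prod()

theta = torch.randn(5)   # pre-trained parameters (flattened, toy size)
tau = torch.randn(5)     # a task vector of the same shape as theta

# Hessian-vector product: returns H(theta) @ tau without ever forming H explicitly.
_, h_tau = hvp(loss_fn, theta, tau)

# Quadratic form tau^T H tau used in a second-order Taylor term.
quad_term = torch.dot(tau, h_tau)
print(quad_term)
```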
3. How important are the second-order terms versus first-order?
Resp. 1 to Rev. uLHj shows that, under the 1st-order Taylor expansion, our approach reduces to the 1st-order expansion of [A] ("Vector Uniform Merge" in our paper) and [B] ("Function Ensemble" in our paper), and to the NTK model.
[A]. Efficient Model Editing with Task Vector Bases: A Theoretical Framework and Scalable Approach, ICLR'25. [B]. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time, ICML'22
However, the 1st-order Taylor expansion is not the same as running [A] (think full expansion). The table below compares results (CLIP ViT-B/16) between the 1st and 2nd order solutions:

| Method | Forget (↓) | Retain (↑) |
|---|---|---|
| Ours (1st order only) | 17.19 | 64.52 |
| Ours (1st & 2nd order) | 12.17 | 64.93 |
4. Would the method scale to ViT Huge with billions of parameters?
Yes; in any case it is always cheaper to run a few epochs of fine-tuning than to retrain the entire ViT-H. Indeed, task vectors can be obtained in parallel. To save memory, ViT-H could be equipped with LoRA adapters, i.e., each weight update is factored into a product of two tall (low-rank) matrices (see the sketch after the table). For ViT‑H with LoRA adaptation (rank‑16), fine‑tuning reduces to ~12h, while inference remains ~14 minutes, since LoRA adds negligible overhead to the forward pass.
| Backbone | Forget (↓) | Retain (↑) |
|---|---|---|
| ViT-H (Pre-trained) | 67.40 | 79.13 |
| ViT-H (LoRA) | 8.74 | 73.59 |
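A hedged sketch of a rank-16 LoRA parameterization for a single linear layer, as mentioned above for ViT-H (class and variable names are our own, not the paper's code):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the pre-trained weights
            p.requires_grad_(False)
        d_out, d_in = base.weight.shape
        # Low-rank factors: the weight update (a compact, task-vector-like change) is B @ A.
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Usage: wrap a layer and fine-tune only A and B to obtain a compact task vector.
layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(4, 768))
```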
5. Does the advanced aggregator (Theorem 2) become too slow?
Thank you. No, the main cost is in obtaining the task vectors; the overall Taylor approximation takes seconds, not hours.
Generally, Theorem 2 requires additional jacrev runs on top of the hvp runs used by the basic approach. Variance minimization does not require extra computation, as the jacrev and hvp results are pre-computed and shared with the basic method. Storing these intermediate vectors requires 19GB of memory, which can be kept on the GPU until we move on to the next class. Where needed (for larger Q or ViT-H with 632M parameters), the vectors can be stored in CPU RAM and moved around, they can be computed on several GPUs (e.g., one hvp per GPU), or LoRA can be used to limit the number of parameters.
Below is a table showing the computation time (excluding task-vector fine-tuning) for several variants of our method.
Table (CLIP ViT-B/16; 1st order, 1st & 2nd order, Theorem 2 / variance tuning):

| Method | Forget (↓) | Retain (↑) | Unlearning Inference Time |
|---|---|---|---|
| 1st order w/o variance tuning | 17.19 | 64.52 | 393s |
| 1st & 2nd order w/o variance tuning | 14.36 | 64.79 | 471s |
| 1st & 2nd order w/ variance tuning (Ours) | 12.17 | 64.93 | 716s |
6. Removing knowledge is positive for privacy. Could it be misused in any way?
Thank you. Absolutely; as with any AI tool, there always exists a possibility of misuse. We will make this clearer in the Broader Impact and Limitations section. Indeed, a rogue actor with access to the model and the unlearning data could intentionally "unlearn" important facts from a model. Our approach provides the functionality only; it is not a mechanism that can monitor the fairness of unlearning requests.
7. Might repeated unlearning degrade a model in unforeseen ways?
Yes, this is a very interesting question. Unlearning could degrade the ability of the model to deal with some class labels that have a strong semantic correlation with the classes we unlearn. However, as unlearning is merely obtained by task arithmetic, we can always store the original parameters and the consecutive sets of task vectors to be able to unroll the changes.
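Schematically (our notation), each incremental step subtracts the current aggregated task vector, and storing the pair $(\lambda_t,\tau_t)$ allows an exact unroll:

$$
\theta_t=\theta_{t-1}-\lambda_t\,\tau_t,\qquad \theta_{t-1}=\theta_t+\lambda_t\,\tau_t .
$$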
Below is the retain-set accuracy under incremental unlearning using CLIP ViT-Base/32 (forget-set accuracy can be found in Table 3 of the main text):
| Method | Cars | +DTD | +EuroSAT | +GTSRB | +MNIST | +RESISC45 | +SUN397 | +SVHN |
|---|---|---|---|---|---|---|---|---|
| Ours | 60.7 | 60.4 | 60.0 | 59.4 | 58.8 | 58.1 | 57.6 | 57.0 |
Note that the pre‑trained model achieves 63.3% accuracy on the retain set; our incremental unlearning preserves roughly 90% of this performance on the test set and 95% on validation set.
8. Open-vocabulary CLIP tasks or retrieval tasks.
To demonstrate cross‑modal applicability, we additionally validate our method on the multimodal CLEAR benchmark (Dontsov et al., 2024), which involves fictional author profiles with paired face images and captions. As shown in the table below, we report forgetting and retention accuracy on the VQA task using LLaVA‑1.5‑7B (with CLIP ViT‑L/14 as the vision encoder). Our approach achieves the best forgetting while maintaining comparable retention performance.
| Method | Forget VQA Acc. (↓) | Retain VQA Acc. (↑) |
|---|---|---|
| Pre-trained Model | 69.2 | 55.7 |
| Standard Task Arithmetic | 42.7 | 49.4 |
| Uniform Merge | 37.9 | 50.5 |
| TIES-Merging | 36.1 | 49.7 |
| EMR-Merging | 34.8 | 49.3 |
| Function Ensemble | 35.6 | 50.0 |
| Ours | 31.5 | 50.2 |
Conceptually, our task‑vector framework extends to other multimodal settings, including open‑vocabulary CLIP tasks and image‑text retrieval. Adapting to those scenarios would require defining task vectors on the corresponding objective (e.g., fine‑tuning CLIP’s multimodal head to capture specific image‑caption associations) and applying our unlearning procedure in the same manner. We will explore these directions in future work.
9. Task specificity: forgetting is measured in terms of classification accuracy.
Thank you. Forgetting certain image-text pair associations in CLIP would require obtaining task vectors on a larger number of image-text unlearning pairs by simply fine-tuning with the CLIP loss. Then the class-probability output could be replaced with the feature output of CLIP, and the rest follows; the Taylor expansion can operate on feature spaces too.
Kindly see Resp. 8 above where we work with VQA instead of classification.
The paper considers machine unlearning in vision language models, where the problem is to 'forget' certain training examples. Prior works, which introduced task vectors, were computationally inefficient, unstable and highly dependent on fine tuning configurations. The current work models the variation in task vectors by considering an ensemble of task vectors on a simplex to model variations in fine tuning configurations. This leads to a tangible boost in unlearning performance.
Machine unlearning is a very important topic in the field, and principled approaches such as the one presented in the paper can be very impactful in practice. After the rebuttal phase, the reviewers have been positive about the paper. Thus, I recommend accepting the paper and encourage the authors to make the improvements proposed by the reviewers.