Machine Unlearning via Task Simplex Arithmetic
VLM unlearning via a closed-form ensemble of infinitely many functions whose parameters are uniformly sampled from a task arithmetic simplex.
Abstract
Reviews and Discussion
The paper proposes to improve unlearning by sampling task vectors from a simplex, which is composed of Q pivot task vectors. Authors discuss the potential of the method for reducing variance. Experiments are conducted on standard unlearning benchmarks with different VLMs.
Strengths and Weaknesses
Strengths
- Making task arithmetic more robust is an important question for unlearning (and model merging). The paper is well-motivated, and the empirical results are good.
Weaknesses
- Some notations are not clear (e.g., variance of function ensemble)
- Some baselines can be improved (the proposed method is a general approach and can be applied to other scenarios besides VLMs)
- Please see details in the question section.
Questions
- I would think that ensemble is a general approach that can be applied to other pretrained models (e.g., LLM). Furthermore, by reducing variance, it may also help other task arithmetic operations besides unlearning (e.g., learning instead of unlearning, adding task vectors from different tasks). Can we have a brief discussion on different models and tasks?
- On the variance of the ensemble
- Line 44, the notation of \sigma(x) is somewhat unclear. In my understanding, the randomness of the ensemble is from the subscript i (i.e., randomly selecting f \in R^C from Q possible task vectors), thus it is a multivariate random variable, and we should discuss the covariance rather than a scalar. Does the notation there assume independence among f's components, or is f a scalar (the probability of the ground truth)?
- Line 59, can you further explain the reference to Bienaymé formula? It basically says that the variance of a sum of independent variables is the sum of each r.v.'s variance. Here, the ensemble is a sum of f \in R^C (each of which is a predicted C-dimensional distribution), but each prediction is not a random variable (as my understanding above, the randomness is from selecting Q task vectors, not from the classifier's output). Would you please clarify which random variable is studied here?
- Line 151, the Hessian term may contain subscript c.
- Equation 2, I would suggest using integration, rather than summations, over \tau. For completeness, it is also helpful to specify how to compute the first-order term.
- Equation 10/11, the \sigma^2 term contains unclosed parentheses.
- On experiments:
- Figure 3, I would suggest improving the function ensemble baseline by further increasing Q
- Table 3, in the incremental unlearning setting, does the ensemble method still keep the retain set performance?
- It would also be better to discuss limitations of the proposed methods (e.g., additional costs of computing/storing Hessian)
- Regarding line 604 (paper checklist), it would be better to open the code to help reproduce the results.
Limitations
Please see the question section.
Final Justification
The authors' response has addressed some of my concerns, and I keep my original score.
Formatting Concerns
N/A
Response to Rev. wwLa
We thank the Reviewer for the constructive feedback.
1. Ensemble is a general approach that can be applied to other pre-trained models (e.g., LLM).
Thank you. Below is the table showing performance on a multimodal LLM, LLaVA‑1.5‑7B (with CLIP ViT‑L/14 as the vision encoder). We additionally validate our method on the multimodal CLEAR benchmark (Dontsov et al., 2024), which involves fictional author profiles with paired face images and captions. As shown in the table below, we report forgetting and retention accuracy on the VQA task. Our approach achieves the best forgetting while maintaining comparable retention performance.
| Method | Forget VQA Acc. (↓) | Retain VQA Acc. (↑) |
|---|---|---|
| Pre-trained Model | 69.2 | 55.7 |
| Standard Task Arithmetic | 42.7 | 49.4 |
| Uniform Merge | 37.9 | 50.5 |
| TIES-Merging | 36.1 | 49.7 |
| EMR-Merging | 34.8 | 49.3 |
| Function Ensemble | 35.6 | 50.0 |
| Ours | 31.5 | 50.2 |
Conceptually, our task‑vector framework extends to other multimodal settings, including open‑vocabulary CLIP tasks and image‑text retrieval. Adapting to those scenarios would require defining task vectors on the corresponding objective (e.g., fine‑tuning CLIP’s multimodal head to capture specific image‑caption associations) and applying our unlearning procedure in the same manner. We will explore these directions in future work.
2. Improving other task arithmetic operations besides unlearning.
While we focus predominantly on unlearning in this work, below we demonstrate an incremental (multi-task) learning scheme. The reason we focus on unlearning is that we model the simplex as capturing variations in task vectors for a given unlearning dataset. Therefore, we assume that:
- fine-tuning results only in sparse changes: each task vector has only a small fraction of non-zero entries relative to the total parameter count (86M parameters for ViT-B);
- the maximum diameter between any two task vectors is small.
These assumptions are required for modeling with the Dirichlet distribution as per Resp. 1 to Rev. uLHj. For CLIP ViT-B/16, we checked that only 4.6% of parameters changed (for unlearning), so as long as these assumptions are met (they are easily met for unlearning), we can also consider multi-task learning based on task-vector simplices by adding the sum of the task vectors of 8 tasks (datasets): Cars, DTD, SUN397, EuroSAT, GTSRB, MNIST, SVHN, and RESISC45, following Task Arithmetic (Ilharco et al., ICLR 2023). We report the average absolute accuracy (%) w.r.t. CLIP ViT-B/16 in the table below (a schematic of the task-vector addition is given after the table).
| Method | Absolute Acc. (↑) |
|---|---|
| Pre-trained Model | 55.2 |
| Standard Fine-tuning | 75.5 |
| Uniform Merge | 78.2 |
| Function Ensemble | 79.3 |
| Ours | 83.6 |
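Schematically, the multi-task merge follows Task Arithmetic (our notation; the per-task scaling used in the experiment may differ):

$$
\theta_{\text{multi}} \;=\; \theta \;+\; \lambda \sum_{t=1}^{8} \tau_t ,
$$

where $\tau_t$ is the task vector (or, in our case, the simplex-ensemble aggregate) for the $t$-th dataset and $\theta$ are the pre-trained CLIP parameters.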
3. Line 151, the Hessian term may contain subscript c.
Absolutely. Thank you.
4. Equation 2, I would suggest using integration, rather than summations.
Absolutely.
5. It is also helpful to specify how to compute the first-order term.
Absolutely. For the standard expansion, the first-order term is the inner product of the parameter gradient of f at the pre-trained parameters with the (expected) task vector, scaled by λ; we will spell out the corresponding term for Corollary 1 in the revision.
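For concreteness, a hedged sketch of the standard second-order expansion we refer to, in generic notation (the paper's Eq. (2) may differ in signs, weighting, and the integration over the simplex):

$$
f(x;\theta-\lambda\tau)\;\approx\; f(x;\theta)\;-\;\lambda\,\nabla_\theta f(x;\theta)^{\top}\tau\;+\;\tfrac{\lambda^{2}}{2}\,\tau^{\top}\nabla_\theta^{2} f(x;\theta)\,\tau ,
$$

so the first-order term only requires the gradient at the pre-trained parameters and the (mean) task vector.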
6. Parentheses in Eq. 10/11.
Duly noted.
7. Increasing Q in Figure 3.
Absolutely. In Figure 3, we provide the comparison between function-level ensembles from task vectors and our task simplex w.r.t. the number of task vectors Q for CLIP ViT-B/32. In the following table, we further include the performance for larger Q. We can notice the convergence of the unlearning performance.
| Forget Set Acc. (↓) at increasing Q |
|---|
| 15.58 |
| 15.20 |
| 15.07 |
| 14.98 |
8. In Table 3, does the ensemble method still keep the retain set performance?
Yes, within 95% of the original performance.
9. Discuss limitations of the proposed methods (e.g., additional costs of computing/storing Hessian).
10. Regarding line 604, it would be better to open the code.
Absolutely.
11. The randomness of the ensemble is from the subscript i: should we have a full covariance?
Thank you. This is a very interesting question. We assumed each per-class output f_c (for c = 1, ..., C) is independent for simplicity; f_c produces the likelihood of the c-th class. As some task vectors may be noisy for some classes, we assumed that limiting the per-class variance alone may be sufficient, and the improvements validate that. Reducing the covariance (off-diagonal terms) could indeed help further decorrelate the model; however, that would require seriously rethinking the Taylor expansion in Eq. (2) to produce the outer product as a closed-form solution because of the integration operator.
12. The Bienaymé formula.
For a fixed sample x, one may think of the network output as a transformation of a random variable (the task vector sampled from the Dirichlet distribution over the simplex) into another random variable living in the class space (the surface of the probability simplex), which follows the class distribution resulting from this transformation. The Bienaymé formula tells us what happens to the variance of each class (treated independently) as the number of ensembled functions grows; that number depends on the number of task vectors. However, the correlation among the ensembled functions is required to be low. The opposite means the functions are not independent: in the extreme case of identical functions, they clearly cannot reduce the variance. For that reason we investigated the idea of learning per-vertex weights, or even perturbing vertices within a small radius, to help reduce the variance. We agree as well that reducing the cross-terms in the covariance could strengthen this effect.
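For reference, the statement we invoke, written per class c under the independence assumption discussed above (our notation):

$$
\operatorname{Var}\!\Big(\tfrac{1}{n}\sum_{i=1}^{n} f_c(x;\theta-\lambda\tau_i)\Big)
=\frac{1}{n^{2}}\sum_{i=1}^{n}\operatorname{Var}\big(f_c(x;\theta-\lambda\tau_i)\big)
=\frac{\sigma_c^{2}(x)}{n},
$$

where the n ensembled predictions are treated as independent with a common per-class variance $\sigma_c^{2}(x)$; correlated (e.g., identical) functions break this $1/n$ decay.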
This work proposes a method for machine unlearning, which is an interesting and practically important problem. The authors point out that task vectors, a popular unlearning strategy, exhibit substantial sensitivity to various fine-tuning configurations, leading to unstable unlearning effectiveness. Obtaining and aggregating multiple task vectors can reduce prediction-level variance and improve unlearning results; however, this comes at a high computational cost.
This manuscript presents a method that captures the space of task vectors induced by diverse fine-tuning strategies. Specifically, it models this space through the convex hull of a (𝑄−1)-simplex whose vertices correspond to 𝑄 task vectors. Instead of sampling task vectors directly from the simplex, the authors derive a closed-form ensemble representing an infinite number of functions whose parameters are uniformly sampled from the simplex, thereby achieving enhanced unlearning performance in a computationally efficient manner.
The method appears to be supported by solid theoretical foundations. Furthermore, the experimental results and accompanying analyses demonstrate the effectiveness of the proposed approach.
Strengths and Weaknesses
Strengths: The paper is well-written and clearly organized. The proposed method is theoretically sound. The experimental results and analyses are thorough and convincing.
Weaknesses: The section titled “Proposed Method” contains many abstract equations, which may be difficult for readers who are not deeply familiar with this specific area to fully understand. Providing more intuitive explanations or illustrative examples could greatly improve accessibility.
Questions
Overall, I find this paper interesting and well-motivated. Minor clarifications and improvements, however, would further strengthen the work. (1) This manuscript appears to be very theoretical. Providing more intuitive explanations or illustrative examples could significantly improve accessibility for a broader audience. (2) Regarding the organization of the “Proposed Method” section: there is only one numbered subsection titled “Problem Formulation.” This structure may lead to the misunderstanding that all subsequent content falls under problem formulation. It would be clearer to divide this section into multiple meaningful subsections. (3) In the paragraph beginning with "Unlearning with Task Vectors", the sentence "Moreover, our closed-form solution achieves the lowest accuracy (the lower the better) on the forget set, surpassing the state-of-the-art approach by 3.3% on ViT-Base/32-based CLIP." is not directly supported by the results in Table 1. Additionally, in the paragraph beginning with "Unlearning with Linear Task Vectors", the sentence "Table 1 shows that the linearized task vectors consistently yield substantial reductions in the forget set accuracy across multiple merging methods and CLIP architectures, while maintaining fixed accuracy on the retain set relative to their standard (non-linearized) counterparts in Table 1." appears to mistakenly reference Table 1 twice, whereas one of the mentions likely should refer to a different table.
Limitations
Yes, but they put the discussion on limitations in Appendix F.2 instead of the main text.
Final Justification
The authors have provided a clear rebuttal that addresses my main concerns. I believe the contribution is sufficient for acceptance.
Formatting Concerns
No Paper Formatting Concerns
Response to Rev. vtZ9
We thank the Reviewer for the constructive feedback.
1. Providing more intuitive explanations.
Absolutely. The main idea is to devise a distribution that is easy to construct and sample from, given the huge parameter space (ViT-B). This is achieved by an ensemble of functions based on a Taylor expansion that jointly leverages the 0th, 1st and 2nd order statistics of the task vectors. The Taylor expansion is devised at the pre-trained parameters.
- We will illustrate such a Taylor expansion and the resulting constant, linear and quadratic terms. The linear term depends on the mean of the task vectors, and the quadratic term on specific second-order statistics, which we will also illustrate. Kindly note that the rebuttal does not permit figures (alas).
- We will also illustrate the ensemble in terms of the mean and variance, for which we can adjust outliers using per-vertex weights.
1. Divide "Proposed Method” into multiple meaningful subsections.
Absolutely. In general, we have the core "Problem Formulation":
- Problem definition with Closed-form Aggregation of Functions from Task Simplex,
followed by "Extensions" and "Theoretical Analyses":
- Advanced Aggregation Scheme.
- Distillation from Ensemble.
- Computing and Controlling Variance of Ensemble.
3. Sentence with "the state-of-the-art approach by 3.3% on ViT-Base/32-based CLIP" is not directly supported by Table 1.
Absolutely. We had a typo due to changes in the paper. We meant that, compared to the function ensemble [8], we achieve 7.55%, 7.72% and 6.34% improvements in unlearning on ViT-Base/32, ViT-Base/16 and ViT-Large/14, respectively.
3. "Table 1 shows..." repeats Table 1 twice.
Thank you. We have revised accordingly.
Thank you for your clarifications. I have no additional concerns. Kindly revise the manuscript accordingly.
Esteemed Reviewer,
We thank you for engaging with our rebuttal. Rest assured all suggestions will be incorporated into our paper. Meantime, if there is anything else we can improve, clarify or answer, kindly let us know.
Best regards,
Authors
This paper addresses the problem of machine unlearning in Vision-Language Models (VLMs) by proposing a closed-form function ensembling method based on the Task Arithmetic Simplex. Traditional task vector approaches suffer from unstable unlearning due to sensitivity to fine-tuning configurations, with prediction-level variance negatively correlating with unlearning performance. The authors model task vectors as vertices of a (Q-1)-dimensional simplex, deriving a closed-form ensemble of infinite functions via Dirichlet distribution and Taylor expansion, which effectively reduces variance and enhances unlearning. Experiments on 8 visual datasets validate the method’s superiority, especially in incremental unlearning and linear task vector scenarios.
Strengths and Weaknesses
Strengths:
- Comprehensive Ablation Studies: The paper conducts rigorous ablation experiments to decompose the contributions of key components (e.g., advanced aggregation, vertex importance weighting), clarifying the impact of each module on unlearning performance.
- Theoretical Innovation and Practical Efficacy: Modeling task vectors as a simplex and deriving closed-form ensembles using convex geometry and Dirichlet distribution addresses the high computational cost of traditional methods. Experiments demonstrate significant accuracy gains on forgetting sets (e.g., 9.98% on ViT-Large/14) with stable retained set performance, validating both theoretical rigor and practical utility.
Weaknesses:
- Unjustified Simplex Modeling of Task Vector Space: The authors propose modeling task vectors within the convex hull of a simplex but lack rigorous justification for this assumption. There is no mathematical proof or empirical analysis (e.g., geometric visualization of task vector distributions) to validate that task vectors indeed form a convex simplex. This raises questions about whether linear combinations of task vectors remain within the valid task space, especially given the non-linear dependencies inherent in real-world fine-tuning scenarios .
- Computational Cost and Dataset Limitations: While closed-form solutions avoid infinite sampling, generating Q task vectors (e.g., Q=30) requires multiple fine-tunings, imposing high costs for large models. Additionally, experiments are confined to vision datasets, lacking validation on language or cross-modal tasks.
- Theoretical Constraints on Taylor Expansion: The method relies on small perturbation assumptions (λ), but the paper omits comparative experiments under large perturbations, where second-order approximation validity may decline.
Questions
- Theoretical Justification for Simplex Modeling: Please provide mathematical proof or empirical evidence (e.g., t-SNE visualization of task vector convexity) to support the simplex modeling assumption. How do non-linear task dependencies affect this framework?
- Cross-Modal Generalization: Are there plans to validate the method on language or multi-modal models? Please include cross-modal experiments to demonstrate generalizability.
- Large Perturbation Robustness: Could you provide performance data under varying λ to validate the Taylor expansion’s accuracy beyond small perturbation scenarios?
Limitations
N/A
Final Justification
The authors' response has addressed many of my concerns, so I am raising my recommendation to Borderline Accept.
Formatting Concerns
N/A
Response to Rev. uLHj
We thank the Reviewer for the constructive feedback.
1. Validate task vectors form a convex simplex.
Thank you. This is a complex question.
Kindly note that we do not claim task vectors follow any specific distribution. Task vectors are extremely high-dimensional (CLIP ViT-B/16). For a low number of vectors, there exists no technique that can reliably estimate the distribution and tell its kind due to the curse of dimensionality: one would need far more samples than dimensions to estimate it reliably, while our Q is only 30.
As the rebuttal does not allow figures, we will add a t-SNE plot in the revision. But we chose the Dirichlet distribution (an educated choice) because, under large dimensionality, it offers:
- simplicity to set/estimate the PDF (for the Dirichlet, our Q task vectors define the PDF);
- the ability to sample from it (our Taylor derivation does that);
- compact support (not exceeding the observed min-max values of individual parameters) (1a);
- the ability to exploit 0th, 1st and 2nd order moments (simplex mean, implicit covariance) in modeling: the Dirichlet lets us use low-order moments in the Taylor expansion (1c); see the sampling sketch after this list.
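To ground the construction, here is a minimal Monte Carlo sketch of the simplex ensemble (hypothetical names such as `model_fn` are ours; the paper replaces this sampling loop with the closed-form Taylor-based solution):

```python
import torch

def simplex_ensemble_predict(model_fn, theta, task_vectors, x, lam=0.5,
                             concentration=1.0, n_samples=64):
    """Average predictions over task vectors drawn from a Dirichlet simplex.

    model_fn(x, params) -> class probabilities; `theta` and each task vector
    are flattened parameter tensors of the same shape (toy stand-ins here).
    """
    Q = len(task_vectors)
    dirichlet = torch.distributions.Dirichlet(concentration * torch.ones(Q))
    tau_stack = torch.stack(task_vectors)          # (Q, P) vertices of the simplex
    probs = 0.0
    for _ in range(n_samples):
        alpha = dirichlet.sample()                 # simplex weights, shape (Q,)
        tau = alpha @ tau_stack                    # interpolated task vector
        probs = probs + model_fn(x, theta - lam * tau)   # unlearning via subtraction
    return probs / n_samples
```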
1a. Intuitive reasoning.
Notice that:
- Few-epoch fine-tuning on the unlearning dataset yields a sparse change: only a small fraction of parameters departs from the pre-trained values. [A] confirms the sparsity of such task vectors. For CLIP ViT-B/16, we checked that only 4.6% of parameters changed.
[A]. Efficient Model Editing with Task Vector Bases: A Theoretical Framework and Scalable Approach, ICLR'25.
Thus, we assume that interpolating between such sparse differences captures a feasible parameter subset pertinent to the unlearning dataset and realizes several magnitudes of "active" parameters.
- As a counterexample, let the task vectors be Normally distributed with a diagonal covariance (off-diagonal terms set to 0, since a full covariance of this dimensionality cannot be estimated), with the mean and per-parameter variance estimated from our Q task vectors, from which we then sample task vectors.
The table below (CLIP ViT-B/16) shows the function-level ensemble under the Normal distribution (task vectors sampled from it):

| Distribution | Forget (↓) | Retain (↑) |
|---|---|---|
| Normal | 16.80 | 64.32 |
| Normal | 15.98 | 64.54 |
| Dirichlet (Ours) | 12.17 | 64.93 |

The Normal distribution performs worse than our Dirichlet model because its tails decay slowly toward infinity, whereas the Dirichlet distribution has compact support: it produces finite parameter values constrained by the simplex.
1b. Theory/analysis.
We provide a rigorous bound: the interpolated task-vector model deviates from the expected convex combination of the per-vertex models by a bounded amount.
The bound is small if:
- the task vectors differ by a small diameter D;
- λ is small;
- f is smooth (low Lipschitz constant L).
These conditions are met in our paper (small D due to the sparse changes between task vectors, small λ).
Theorem: Interpolation Performance Bound.
Let f be the unlearning function and let it be L-Lipschitz continuous in its parameters.
Given Q task vectors and any convex combination of them with weights on the simplex, the prediction of the interpolated model stays within the bound sketched below of the corresponding convex combination of per-vertex predictions.
Theorem's importance: any interpolated choice of task vector deviates from the expected convex combination by no more than this bound. We can provide the proof in the discussion (rebuttal space limits).
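A hedged reconstruction of the bound under the listed conditions (our notation; the paper's exact constants may differ):

$$
\Big\| f\big(x;\theta-\lambda\textstyle\sum_{i}\alpha_i\tau_i\big)-\sum_{i}\alpha_i\, f\big(x;\theta-\lambda\tau_i\big)\Big\|
\;\le\;\lambda L \max_i\|\tau-\tau_i\|\;\le\;\lambda L D,
$$

with $\tau=\sum_i\alpha_i\tau_i$, $\alpha\in\Delta^{Q-1}$, and $D=\max_{i,j}\|\tau_i-\tau_j\|$; the bound follows from the triangle inequality and the $L$-Lipschitz continuity of $f$ in its parameters.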
1c. Analyzing Taylor orders (how do non-linear task dep. affect framework).
- The vector task aggregation model [A] has the same 1st-order Taylor expansion as our method (it also equals the NTK linearization). As the 0th and 1st order moments of our technique are identical to those of [A] & [B], integration over the simplex is not an issue under this expansion, by virtue of methods [A] & [B].
[B]. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time, ICML'22
- Now we compare the 2nd-order expansion terms of:
  - approach [A];
  - the basic function ensemble [B];
  - our Simplex ensemble (a hedged sketch of all three terms is given after the table below).
As the 0th and 1st order terms are identical, and our Simplex ensemble interpolates between the two known approaches [A] ("Vector Uniform Merge" in our paper) and [B] ("Function Ensemble" in our paper) at the level of the 2nd-order expansion (convex functions), it is OK to integrate functions over task vectors uniformly drawn from the Dirichlet distribution (our simplex ensemble).
The concentration parameter of the Dirichlet distribution recovers [B] in one limit and [A] in the other, which allows smooth interpolation between [A] & [B]. Table (CLIP ViT-B/16):

| Dirichlet concentration | Forget (↓) | Retain (↑) |
|---|---|---|
| 0.1 | 14.65 | 64.49 |
| 0.5 | 12.83 | 64.26 |
| 0.8 | 12.05 | 64.68 |
| 1.2 | 12.48 | 64.85 |
| 1.5 | 12.70 | 64.92 |
| 2.0 | 13.29 | 64.81 |
| 1.0 (Ours) | 12.17 | 64.93 |

If we use Eq. (6) of the paper, the per-vertex weight further customizes the interpolation (and helps reduce the variance).
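A hedged sketch of the three 2nd-order terms in our notation ($H$ is the per-class Hessian of $f$ at the pre-trained parameters, $\bar\tau=\frac{1}{Q}\sum_i\tau_i$; the paper's exact equations may differ in constants):

$$
\text{[A]: }\tfrac{\lambda^{2}}{2}\,\bar\tau^{\top}H\bar\tau,\qquad
\text{[B]: }\tfrac{\lambda^{2}}{2}\cdot\tfrac{1}{Q}\sum_i\tau_i^{\top}H\tau_i,\qquad
\text{Simplex }(\alpha\sim\mathrm{Dir}(a,\dots,a)):\;
\tfrac{\lambda^{2}}{2}\cdot\frac{\sum_i\tau_i^{\top}H\tau_i+aQ^{2}\,\bar\tau^{\top}H\bar\tau}{Q(Qa+1)},
$$

which recovers [B] as $a\to 0$ and [A] as $a\to\infty$; uniform sampling over the simplex corresponds to $a=1$.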
1d. Parameter augmentation.
As averaging all task vectors gives viable unlearning (model [A]), and ensembling functions built on individual task vectors gives viable unlearning (model [B]), interpolating between the mean and the individual task vectors acts as "parameter augmentation". Based on 1b, if f changes rapidly in one region of the simplex and slowly in another (for neighborhoods of the same size), then intuitively the entropy of the ensembled prediction is lower in the smooth region, as smooth changes contribute coherently to the vote, whereas chaotic changes are incoherent and lead to cancellation of class peaks (the entropy is maximal). So our method leverages the smoothness of f w.r.t. this parameter augmentation. See also [C]:
[C]. Averaging Weights Leads to Wider Optima and Better Generalization, UAI'19
1e. Do task vectors form simplex?
They do not have to, so long as they are close to a simplex. We take 10 out of 30 task vectors fine-tuned on SUN397 and build a simplex (after reducing the dimensionality with count sketch). The remaining 20 task vectors lie within a small radius of that simplex, while the task vectors of Cars exceed that distance (they cluster around their own simplex).
2a. Cost: Many fine-tunings.
We fine-tune over a mere 6-35 epochs (not 1000). We can reduce Q to 10 for speed or fine-tune in parallel.
See sequential fine‑tuning time/unlearning inference time (our paper's setup):
Table (CLIP ViT-B/16):

| Task Vector Number | Forget (↓) | Retain (↑) | Fine-tuning Time | Unlearning Inference Time |
|---|---|---|---|---|
| 10 | 14.06 | 64.77 | 1.9h | 372s |
| 15 | 13.29 | 64.62 | 2.9h | 458s |
| 30 | 12.17 | 64.93 | 5.7h | 716s |
Below we reduce the epoch range (6-35) to 1/2 or 1/3:

| Fine-tuning Epoch Percentage | Forget (↓) | Retain (↑) | Fine-tuning Time |
|---|---|---|---|
| 1/3 of 6-35 epochs | 15.54 | 65.14 | 2.4h |
| 1/2 of 6-35 epochs | 14.28 | 65.06 | 3.2h |
| 6-35 epochs | 12.17 | 64.93 | 5.7h |
2b. Lang. or cross-modal task.
We apply our method on the multi-modal CLEAR benchmark (Dontsov et al., 2024: fictional author profiles with paired face images and captions). See the average accuracy on the VQA task (LLaVA‑1.5‑7B with CLIP ViT‑L/14 as the vision encoder). Task vectors are derived by fine‑tuning on the corresponding target data.
| Method | Forget VQA Acc. (↓) | Retain VQA Acc. (↑) |
|---|---|---|
| Pre-trained Model | 69.2 | 55.7 |
| Standard Task Arithmetic | 42.7 | 49.4 |
| Uniform Merge | 37.9 | 50.5 |
| TIES-Merging | 36.1 | 49.7 |
| EMR-Merging | 34.8 | 49.3 |
| Function Ensemble | 35.6 | 50.0 |
| Ours | 31.5 | 50.2 |
3. Constraints on the Taylor expansion.
Kindly note that λ² in the 2nd-order term decays quadratically, while the 1st-order term is linear in λ. In practice, λ is chosen by cross-validation. For low λ, our method reverts to the 1st-order expansion (as in [A] & [B]). Below we vary λ (CLIP ViT-B/16) on SUN397. Bold: best in cross-validation.
| λ | Forget (↓) | Retain (↑) |
|---|---|---|
| 0.1 | 61.4 | 67.6 |
| 0.2 | 57.5 | 66.7 |
| 0.3 | 53.6 | 66.2 |
| 0.4 | 49.8 | 65.7 |
| 0.5 | 45.9 | 64.8 |
| 0.6 | 42.0 | 65.1 |
| 0.7 | 40.1 | 62.8 |
| 0.8 | 39.2 | 60.3 |
Thank you for your clarifications, please revise the manuscript accordingly. I will adjust the score.
Esteemed Reviewer,
We thank you for engaging with our rebuttal. Rest assured we will revise the paper as per your suggestions. Meantime, if there is anything else we can improve, clarify or answer, kindly let us know.
Best regards,
Authors
This paper addresses the problem of machine unlearning in large vision-language models (VLMs) like CLIP. Machine unlearning refers to efficiently removing or “forgetting” the influence of certain training data (the forget set) from a model without retraining from scratch, to meet privacy regulations (e.g. the “right to be forgotten” under GDPR). The authors build on the concept of task vectors – differences in model parameters when fine-tuning with vs. without the target data (as used in Editing Models with Task Arithmetic by Ilharco et al., ICLR 2023). In prior work, subtracting a task vector from the model could forget a dataset in a plug-and-play way, but the effectiveness varied greatly depending on fine-tuning specifics. This paper’s key idea is to mitigate that variance by using ensembles of many task vectors. They propose modeling the space of possible task vectors (obtained under different fine-tuning conditions) as a simplex (convex hull) and derive a closed-form solution that aggregates an infinite ensemble of models sampled from this simplex. In simpler terms, instead of relying on one fine-tuned model difference, they analytically average out an infinite number of fine-tuning outcomes, which dramatically reduces prediction variance and improves forgetting performance.
Strengths and Weaknesses
The paper empirically demonstrates that increasing the number of task vectors in an ensemble improves unlearning, showing a clear negative correlation between ensemble prediction variance and forget-set accuracy. The authors introduce the task simplex, a geometric construct capturing a diverse set of task vectors, and show that sampling within this simplex effectively yields interpolations corresponding to "multi-task" unlearning. Additionally, they derive a closed-form ensemble method using a second-order Taylor expansion and properties of the Dirichlet distribution to integrate over infinitely many models from the simplex without brute-force sampling. This yields an analytic solution that can be applied to the original model to forget the data. Moreover, the paper extends this to a probabilistic aggregator (based on the probability of "at least one" model predicting a class) to further boost reliable forgetting. This work also demonstrates how to distill the effects of the infinite ensemble into a single model (a single "unlearning" task vector), making deployment practical.
Extensive experiments on 8 image classification datasets and the ImageNet retain set show that their method outperforms state-of-the-art unlearning methods, including single task vector subtraction (Ilharco et al. 2023), model weight merging techniques (e.g. Wortsman et al.’s Model Soup, ICML 2022), and other ensemble or linearization baselines. They also show their approach supports incremental unlearning – sequentially forgetting multiple datasets one after another – with strong results. The results are impressive. For example, on CLIP ViT-B/32 (Table 1), the average forget-set accuracy (lower is better, since 100% would mean the model still remembers everything) for a single task vector was 24.2%. Their method brings this down to 15.2% – a large improvement (the forget sets are small, so a lower accuracy means the model is doing poorly on them, which is good for forgetting). This outperforms even the strongest baseline (EMR merge got ~21.8%, function ensemble of 30 models ~22.7%). They achieve similarly large gains on the larger CLIP models: e.g., on ViT-L/14, they go from ~16.7% (best prior) down to ~9.98% forget accuracy – essentially halving the residual accuracy on sensitive data, which indicates a very thorough forgetting. Notably, their distilled single model version only slightly regresses (e.g., 15.66% vs 15.20% on ViT-B/32), showing that you don’t actually need to keep multiple models around — one can reap the benefits in a single model after distillation.
Overall, the paper introduces a novel ensemble-based unlearning strategy that is both theoretically grounded and empirically effective, pushing forward the capabilities for efficient post-hoc removal of training data from large models.
===
If any, one might point out that the method’s computational overhead is front-loaded (fine-tuning 30 models per task). The paper frames this in a positive light (no need to retrain the large model from scratch, which is indeed much more costly than 30 fine-tunings). But 30 fine-tunings with augmentations could still be heavy for very large models or very many forget requests. The authors don’t explicitly report how long those fine-tunings take; however, since they did it for 8 datasets and multiple architectures, it was clearly feasible.
Questions
Computing a full Hessian for CLIP seems infeasible - could you clarify how Equation (3) was implemented? Also, how important are the second-order terms versus first-order?
CLIP ViT-L/14 is large, but how about something like CLIP with Vision Transformer Huge or a future model with billions of parameters – would the method scale? The concern is memory/time for fine-tuning many large models. Do you foresee any obstacles in scaling up, or does the approach parallelize well enough (fine-tuning can be embarrassingly parallel for each task vector)? Also, for very large output spaces (say thousands of classes), does the advanced aggregator (Theorem 2) or variance computation become too slow? It might be helpful to discuss any scalability tricks or limitations.
Your approach makes it easier to remove knowledge of data, which is positive for privacy. Could it be misused in any way? For instance, one could intentionally “unlearn” important facts from a model (for example, sabotage a facial recognition system by making it forget certain identities). This would require access to the model and data, so it’s not a big threat model, but it’s worth considering if making models so flexible could have downsides. Another angle: might repeated unlearning degrade a model in unforeseen ways?
Limitations
A notable limitation is the computational cost and workflow complexity of the approach. It requires multiple fine-tuning runs and careful orchestration (especially if doing sequential unlearning, one must manage updating the base model). In environments with limited computational resources, this could be challenging. However, this is a trade-off: any effective unlearning that is easier than full retraining will have some cost, and here it’s parallelizable fine-tuning jobs which many organizations can manage.
As discussed, the approach focuses on classification benchmarks. It’s not directly tested on, say, open-vocabulary CLIP tasks or retrieval tasks. So one limitation is task specificity: forgetting is measured in terms of classification accuracy. If the requirement was to forget certain image-text pair associations in CLIP (like CLIP should not associate a particular image with a caption because the caption is private), one might need to adapt the method (maybe fine-tune CLIP’s multimodal head on that association). The method should conceptually extend, but it may require careful setup for different objectives.
Final Justification
The authors' rebuttal comprehensively addresses my points: HVP enables feasible second-order computation (3s per HVP), second-order terms are key (+2.2% over first-order), LoRA scales to ViT-Huge (8.74% forget acc.), aggregator remains fast (seconds, not hours), and new VQA results show multimodal applicability (31.5% forget vs. 34.8% best prior). Ethical misuse is now discussed, and repeated unlearning preserves ~90% retain acc. These additions (with tables) mitigate the front-loaded cost limitation and extend beyond classification, reinforcing the method's novelty, rigor, and impact in VLM unlearning. The authors did address all the queries with the rebuttal. This is a technically solid paper with high impact in AI privacy/unlearning, excellent evaluation, and no ethical issues. I stand by my original accept recommendation.
Formatting Concerns
The paper is well-formatted in general, following NeurIPS style. No required elements seem missing; they have references, sections, etc. Figures and tables are numbered and referenced in text. I did not catch any typos in the main text - it is polished. The checklist portion at the end might not be needed in final camera-ready, but for submission it’s fine.
Response to Rev. 5xSJ
We thank the Reviewer for the constructive feedback.
1. Computational overhead is front-loaded (fine-tuning 30 models per task).
Thank you. Kindly note we can vary Q. Moreover, we can reduce the number of fine-tuning epochs (currently in the range 6-35) to 1/2 or 1/3 (e.g., the range 2-12), which is significantly cheaper than model pre-training/training from scratch. Finally, obtaining the task vectors can be parallelized.
The tables below show results w.r.t. Q and the number of fine-tuning epochs, respectively.
Table (CLIP ViT-B/16):

| Task Vector Number | Forget (↓) | Retain (↑) | Fine-tuning Time | Unlearning Inference Time |
|---|---|---|---|---|
| 10 | 14.06 | 64.77 | 1.9h | 372s |
| 15 | 13.29 | 64.62 | 2.9h | 458s |
| 30 | 12.17 | 64.93 | 5.7h | 716s |
| Fine-tuning Epoch Percentage | Forget (↓) | Retain (↑) | Fine-tuning Time |
|---|---|---|---|
| 1/3 of 6-35 epochs | 15.54 | 65.14 | 2.4h |
| 1/2 of 6-35 epochs | 14.28 | 65.06 | 3.2h |
| 6-35 epochs | 12.17 | 64.93 | 5.7h |
1b. One must manage updating the base model.
Only fine-tuning from the pre-trained model to obtain the task vectors is needed. Unlearning itself is then performed by a mere Taylor expansion -- unless distillation is desired to "compact" the model.
2. Could you clarify how Equation (3) was implemented?
Eq. (3) uses the so-called Hessian-Vector Product (HVP), which requires the use of vmap, jacrev and hvp from torch.func and torch.autograd.functional. HVP never computes the Hessian matrix but directly and efficiently obtains the product of the Hessian with a task vector; that product can then be dot-multiplied with the task vector to form the quadratic term. An HVP takes 2-4x the cost of the Jacobian computation, which makes it extremely efficient. In our case, one HVP takes approx. 3 seconds.
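A minimal HVP sketch (illustrative, not the paper's exact Eq. (3) code; `loss_fn` and the toy dimensions are hypothetical stand-ins):

```python
import torch
from torch.autograd.functional import hvp

def loss_fn(theta):
    # Toy scalar objective standing in for the model output/loss at parameters `theta`.
    return (theta ** 2).sum() + theta.prod()

theta = torch.randn(5)   # pre-trained parameters (flattened, toy size)
tau = torch.randn(5)     # a task vector of the same shape as theta

# Hessian-vector product: returns H(theta) @ tau without ever forming H explicitly.
_, h_tau = hvp(loss_fn, theta, tau)

# Quadratic form tau^T H tau used in a second-order Taylor term.
quad_term = torch.dot(tau, h_tau)
print(quad_term)
```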
3. How important are the second-order terms versus first-order?
Resp. 1 to Rev. uLHj shows that, under the 1st-order Taylor expansion, our approach reduces to the 1st-order expansion of [A] ("Vector Uniform Merge" in our paper) and [B] ("Function Ensemble" in our paper), and to the NTK model.
[A]. Efficient Model Editing with Task Vector Bases: A Theoretical Framework and Scalable Approach, ICLR'25. [B]. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time, ICML'22
However, the 1st-order Taylor expansion is not the same as running [A] (think full expansion). The table below compares results (CLIP ViT-B/16) between the 1st and 2nd order solutions:

| Method | Forget (↓) | Retain (↑) |
|---|---|---|
| Ours (1st order only) | 17.19 | 64.52 |
| Ours (1st & 2nd order) | 12.17 | 64.93 |
4. Would the method scale to ViT Huge with billions of parameters?
Yes; in any case it is always cheaper to run a few epochs of fine-tuning than to retrain the entire ViT-H. Indeed, task vectors can be obtained in parallel. To save memory, ViT-H could be equipped with LoRA adapters, i.e., each weight update is factored into a product of two tall (low-rank) matrices (see the sketch after the table). For ViT‑H with LoRA adaptation (rank‑16), fine‑tuning reduces to ~12h, while inference remains ~14 minutes, since LoRA adds negligible overhead to the forward pass.
| Backbone | Forget (↓) | Retain (↑) |
|---|---|---|
| ViT-H (Pre-trained) | 67.40 | 79.13 |
| ViT-H (LoRA) | 8.74 | 73.59 |
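A hedged sketch of a rank-16 LoRA parameterization for a single linear layer, as mentioned above for ViT-H (class and variable names are our own, not the paper's code):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the pre-trained weights
            p.requires_grad_(False)
        d_out, d_in = base.weight.shape
        # Low-rank factors: the weight update (a compact, task-vector-like change) is B @ A.
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Usage: wrap a layer and fine-tune only A and B to obtain a compact task vector.
layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(4, 768))
```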
5. Does the advanced aggregator (Theorem 2) become too slow?
Thank you. No, the main cost is in obtaining the task vectors; the overall Taylor approximation takes seconds, not hours.
Generally, Theorem 2 requires additional jacrev runs on top of the hvp runs used by the basic approach. Variance minimization does not require extra computation, as the jacrev and hvp results are pre-computed and shared with the basic method. Storing these intermediate vectors requires 19GB of memory, which can be kept on the GPU until we move on to the next class. Where needed (for larger Q or ViT-H with 632M parameters), the vectors can be stored in CPU RAM and moved around, they can be computed on several GPUs (e.g., one hvp per GPU), or LoRA can be used to limit the number of parameters.
Below is a table showing the computation time (excluding task-vector fine-tuning) for several variants of our method.
Table (CLIP ViT-B/16; 1st order, 1st & 2nd order, Theorem 2 / variance tuning):

| Method | Forget (↓) | Retain (↑) | Unlearning Inference Time |
|---|---|---|---|
| 1st order w/o variance tuning | 17.19 | 64.52 | 393s |
| 1st & 2nd order w/o variance tuning | 14.36 | 64.79 | 471s |
| 1st & 2nd order w/ variance tuning (Ours) | 12.17 | 64.93 | 716s |
6. Removing knowledge is positive for privacy. Could it be misused in any way?
Thank you. Absolutely; as with any AI tool, there always exists a possibility of misuse. We will make this clearer in the Broader Impact and Limitations section. Indeed, a rogue actor with access to the model and the unlearning data could intentionally "unlearn" important facts from a model. Our approach provides the functionality only; it is not a mechanism that can monitor the fairness of unlearning requests.
7. Might repeated unlearning degrade a model in unforeseen ways?
Yes, this is a very interesting question. Unlearning could degrade the ability of the model to deal with some class labels that have a strong semantic correlation with the classes we unlearn. However, as unlearning is merely obtained by task arithmetic, we can always store the original parameters and the consecutive sets of task vectors to be able to unroll the changes.
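Schematically (our notation), each incremental step subtracts the current aggregated task vector, and storing the pair $(\lambda_t,\tau_t)$ allows an exact unroll:

$$
\theta_t=\theta_{t-1}-\lambda_t\,\tau_t,\qquad \theta_{t-1}=\theta_t+\lambda_t\,\tau_t .
$$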
Below is the retain-set accuracy under incremental unlearning using CLIP ViT-Base/32 (forget-set accuracy can be found in Table 3 of the main text):
| Method | Cars | +DTD | +EuroSAT | +GTSRB | +MNIST | +RESISC45 | +SUN397 | +SVHN |
|---|---|---|---|---|---|---|---|---|
| Ours | 60.7 | 60.4 | 60.0 | 59.4 | 58.8 | 58.1 | 57.6 | 57.0 |
Note that the pre‑trained model achieves 63.3% accuracy on the retain set; our incremental unlearning preserves roughly 90% of this performance on the test set and 95% on validation set.
8. Open-vocabulary CLIP tasks or retrieval tasks.
To demonstrate cross‑modal applicability, we additionally validate our method on the multimodal CLEAR benchmark (Dontsov et al., 2024), which involves fictional author profiles with paired face images and captions. As shown in the table below, we report forgetting and retention accuracy on the VQA task using LLaVA‑1.5‑7B (with CLIP ViT‑L/14 as the vision encoder). Our approach achieves the best forgetting while maintaining comparable retention performance.
| Method | Forget VQA Acc. (↓) | Retain VQA Acc. (↑) |
|---|---|---|
| Pre-trained Model | 69.2 | 55.7 |
| Standard Task Arithmetic | 42.7 | 49.4 |
| Uniform Merge | 37.9 | 50.5 |
| TIES-Merging | 36.1 | 49.7 |
| EMR-Merging | 34.8 | 49.3 |
| Function Ensemble | 35.6 | 50.0 |
| Ours | 31.5 | 50.2 |
Conceptually, our task‑vector framework extends to other multimodal settings, including open‑vocabulary CLIP tasks and image‑text retrieval. Adapting to those scenarios would require defining task vectors on the corresponding objective (e.g., fine‑tuning CLIP’s multimodal head to capture specific image‑caption associations) and applying our unlearning procedure in the same manner. We will explore these directions in future work.
9. Task specificity: forgetting is measured in terms of classification accuracy.
Thank you. Forgetting certain image-text pair associations in CLIP would require obtaining task vectors on a larger number of image-text unlearning pairs by simply fine-tuning with the CLIP loss. Then the class-probability output could be replaced with the feature output of CLIP, and the rest follows; the Taylor expansion can operate on feature spaces too.
Kindly see Resp. 8 above where we work with VQA instead of classification.
The paper considers machine unlearning in vision language models, where the problem is to 'forget' certain training examples. Prior works, which introduced task vectors, were computationally inefficient, unstable and highly dependent on fine tuning configurations. The current work models the variation in task vectors by considering an ensemble of task vectors on a simplex to model variations in fine tuning configurations. This leads to a tangible boost in unlearning performance.
Machine unlearning is a very important topic in the field, and principled approaches such as the one presented in the paper can be very impactful in practice. After the rebuttal phase, the reviewers have been positive about the paper. Thus, I recommend accepting the paper and encourage the authors to make the improvements proposed by the reviewers.