PaperHub
6.1 / 10
Poster · 4 reviewers
Scores: 3, 3, 4, 3 (min 3, max 4, std. dev. 0.4)
ICML 2025

TuCo: Measuring the Contribution of Fine-Tuning to Individual Responses of LLMs

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24
TL;DR

We give a principled metric quantifying how much the fine-tuning stage contributed to the output of an LLM, and explore its relationship to model behavior and safety.

Abstract

Past work has studied the effects of fine-tuning on large language models' (LLMs) overall performance on certain tasks. However, a way to quantitatively and systematically analyze its effect on individual outputs is still lacking. In this work, we propose a new method for measuring the contribution that fine-tuning makes to individual LLM responses, assuming access to the original pre-trained model. Our method takes into account the model's intermediate hidden states, giving a more fine-grained insight into the effects of fine-tuning than a simple comparison of the final outputs of pre-trained and fine-tuned models. We introduce and theoretically analyze an exact decomposition of any fine-tuned LLM into a pre-training component and a fine-tuning component. Empirically, we find that one can steer model behavior and performance by up- or down-scaling the fine-tuning component during the forward pass. Motivated by this finding and our theoretical analysis, we define the Tuning Contribution ($\mathrm{TuCo}$) in terms of the ratio of the magnitudes of the fine-tuning component and the pre-training component. We find that three prominent adversarial attacks on LLMs circumvent safety measures in a way that reduces the Tuning Contribution, and that $\mathrm{TuCo}$ is consistently lower on prompts where the attacks succeed compared to ones where they do not. This suggests that attenuating the effect of fine-tuning on model outputs plays a role in the success of these attacks. In short, $\mathrm{TuCo}$ enables the quantitative study of how fine-tuning influences model behavior and safety, and vice versa.
Keywords
Large Language Models · Interpretability · AI Safety

Reviews and Discussion

Review
Rating: 3

This paper proposes a novel method, Tuning Contribution (TuCo), to measure the contribution of fine-tuning to individual responses of large language models (LLMs). The authors introduce a decomposition framework that splits an LLM's response into a Pre-Training Component (PTC) and a Fine-Tuning Component (FTC), enabling a more fine-grained analysis of fine-tuning effects. Experimental results show that TuCo is sensitive to different inputs, which sheds more light on how to monitor and control the model's behavior after finetuning. However, as stated in the weakness part, the paper needs more carefully defined experiments to justify the proposed metric. In summary, I believe the paper is quite novel and has the potential to bring a bigger impact. But the current version is not good enough for ICML. I would be very happy to increase my evaluation if the core issues are addressed.

Questions for the Authors

Questions:

  1. I am not quite sure about how to understand $f_\theta(x, l)$. Could it be understood as the function of the network's first L layers? In other words, it converts the input $x$ to the hidden embeddings of the L-th layer. If my understanding is true, why do we need the "circuit modeling" in this paper?
  2. In Definition 4.3, what does the map $(x_1, \cdots, x_n) \mapsto x_n$ look like? Does it directly select the last column of FTC?
  3. Many experiments require a pretrained model and a finetuned version of it. But on what datasets are these models trained? I tried to find the answer in the paper but could not find it. It is important for many conclusions, e.g., those in Section 5.2: if the model is finetuned on web-text data rather than chat data, will the claim still hold?

Claims and Evidence

Not quite. See the weakness part.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

I checked the main text and part of the Appendix. The proofs look good.

Experimental Design and Analysis

Not good enough and could be improved. See the weakness part.

Supplementary Material

I read part of the appendix.

Relation to Prior Literature

The related work and background part is well written.

Missing Important References

I find there are plenty of papers discussing finetuning behaviors, like [1]. Discussing the differences from their theoretical framework would be helpful.

[1] Ren, Yi, and Danica J. Sutherland. "Learning dynamics of llm finetuning." ICLR 2025

Other Strengths and Weaknesses

Strengths:

  1. Unlike previous approaches that focus on benchmark-level fine-tuning effects, this work quantifies fine-tuning effects at the individual response level, which is quite novel and provides new perspectives on understanding the model’s behavior.
  2. The discussions about the relationship between jailbreak prompts and TuCo are inspiring. Given the fact that the FT model is trained on some safety-related dataset, the TuCo for this sensitive information should be large. However, a carefully designed jailbreak prompt can circumvent it by triggering some "un-updated" region of the original model. This finding could inspire more robust alignment strategies in the future.

Weaknesses:

  1. The authors claim in their introduction that instead of simply comparing its final hidden states, TuCo considers more detailed representations. However, the superiority of TuCo over this simple method (i.e., directly comparing last hidden states) is not well justified. Appendix B compares OutputCo and TuCo. But which one is better, and why? Plus, from Proposition 4.2 we know that TuCo is part of the upper bound of the L1 distance between the two final hidden states; then, why not directly observe $\|x_{FT} - x_{PT}\|$?
  2. I find the experiments in the current version cannot support the claims well. The dataset used in finetuning is very important in measuring the model’s behavioral change. So, ablation studies varying the finetuning data will make the conclusion more solid.

Other Comments or Suggestions

N/A

Author Response

We thank the reviewer for their thoughtful comments and constructive feedback. We would like to clarify some points:

Appendix B compares OutputCo and TuCo. But which one is better, and why?

They have different interpretations, and are most appropriate for answering different research questions.

OutputCo tells us how large the distance between the pre-trained and fine-tuned models' final hidden states is. This could be used in a similar way to, e.g., computing distribution distances between the final logprobs of each model.

Meanwhile, TuCo tells us how large the aggregate change in intermediate layer outputs due to fine-tuning is. In this sense, it gives a quantitative view of how the computation performed by the model is affected, rather than only the final outcome. We are not aware of comparable metrics in the literature.

I find the experiments in the current version cannot support the claims well. The dataset used in finetuning is very important in measuring the model’s behavioral change. So, ablation studies varying the finetuning data will make the conclusion more solid.

We would like to ask the reviewer if they could point to specific claims they consider are not well-justified, so that we can make the appropriate improvements to the manuscript.

Further, we clarify that, in our experiments, we seek to demonstrate the applicability of TuCo in the wild, i.e. on real-world widely-used open weight models, without relying on bespoke toy datasets.

Rather, in Section 5.1, we make controlled interventions by varying the magnitude of the fine-tuning component $FTC$. We demonstrate this can be used to control model behavior, and even improve its performance on certain MMLU tasks. This validates the relevance of measuring the magnitude of $FTC$ when studying the interactions between prompt content, model behavior and capabilities.
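
For concreteness, a minimal sketch of this kind of intervention (the per-layer callables `ft_layer` / `pt_layer` below are a hypothetical interface returning each model's residual update on the current hidden state; the paper's actual implementation may differ):

```python
def forward_with_scaled_ftc(ft_layers, pt_layers, x0, alpha=1.0):
    """Scale the fine-tuning component at every layer: alpha = 1 recovers the
    fine-tuned model's forward pass, alpha = 0 keeps only the pre-training
    component (hypothetical per-layer interface)."""
    x = x0
    for ft_layer, pt_layer in zip(ft_layers, pt_layers):
        pt_update = ft_update = None
        pt_update = pt_layer(x)           # pre-training component (PTC) at this layer
        ftc = ft_layer(x) - pt_update     # fine-tuning component (FTC) at this layer
        x = x + pt_update + alpha * ftc   # scaled residual-stream update
    return x
```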

[...] why not directly observe $\|x_{FT} - x_{PT}\|$?

As we argue in Appendix A, an effective metric should be interpretable for practitioners, useful for empirical analysis, and practical to compute.

The fact that TuCo is normalized (i.e. between 0 and 1) allows it to be more intuitively interpreted as a fraction (e.g. "30% contribution of fine-tuning"). An unnormalized metric, such as $\|x_{FT} - x_{PT}\|$, is potentially subject to significant changes in scale across models and prompts, harming its interpretability and usefulness.

Further, per Section 5.5 and the prior answer, TuCo is qualitatively distinct from simply comparing final hidden states, even if one uses a normalized metric (i.e. OutputCo).

We use TuCo to quantitatively show that jailbreaks attenuate the effect of fine-tuning (Section 5.3), that the attenuation is stronger for stronger jailbreaks (Section 5.3, MSJ results), and that successful jailbreaks show stronger attenuation (Section 5.4).

I am not quite sure about how to understand $f_\theta(x, l)$.

As pointed out in Section 3, most commonly-used GPT architectures have residual connections around self-attention and MLP layers. This means that, at a layer $l$ computing a function $f_{\theta_l}$ (where $\theta_l$ are the parameters of the layer), the residual stream is updated as $x_{out} \leftarrow x_{in} + f_{\theta_l}(x_{in})$.

Since there are $L$ layers, we have functions $f_{\theta_1}, \cdots, f_{\theta_L}$. For notational simplicity, instead of writing the function computed by the $l^{th}$ layer as $x \mapsto f_{\theta_l}(x)$, we write it as $x \mapsto f_\theta(x, l)$.

What does the map $(x_1, \cdots, x_n) \mapsto x_n$ look like? Directly select the last column of FTC?

Yes, that is correct: this map picks out only the hidden state for the final token of the prompt.

[...] But on what dataset are these models trained on?

We thank the reviewer for pointing this out as an area of improvement.

Llama 2, Llama 3 and Gemma use a combination of public, private, and synthetic instruction-tuning and preference data, including conversational data and safety data. Mistral and Vicuna are only fine-tuned for instruction following. Zephyr-Gemma is fine-tuned on synthetic chat and preference data. The preference ratings take honesty into account, but, per Tunstall et al. (2024), the samples are focused on helpfulness rather than harmlessness.

We have added a more detailed overview to the appendix.

if the model is finetuned on web-text data rather than chat-data, will the claim still hold?

In this case, we would expect not to see such a clear separation.

Conclusion

In the above, we hope to have addressed the concerns raised by the reviewer, with particular regard to providing more details on model data mixes, and on what TuCo contributes over a simple comparison of final hidden states.

Given the above, we would like to ask the reviewer to consider increasing their score. If any concerns remain, we are happy to provide further clarifications and improvements to the manuscript.

Review
Rating: 3

The authors seek to understand the effect of finetuning on a model. They propose to decompose the forward pass of a finetuned model into the pretrained component (PTC) and fine-tuned component (FTC). They then propose Tuning Contribution (TuCo) as a measure of the relative effect sizes. They subsequently analyze TuCo within many empirical settings.

The authors provide a constructive algorithm for calculating TuCo. Theoretically, they relate it to prior literature on transformer circuits. Empirically, they show that scaling the FTC can act like a form of steering. They also perform various other ad-hoc analyses, relating TuCo to jailbreaks and instruction tuning.

Update after review

The authors have addressed some of my concerns. While the method does not beat baselines on downstream tasks, I agree that it is an interesting proof of concept for a new analysis technique. Hence, I will update to a weak accept.

Questions for the Authors

N/A

Claims and Evidence

One of the authors' central claims is that the FTC approximates the effect of finetuning. However, no direct evidence is provided for this claim. One way to check this would be to take the FTC after 1 epoch of finetuning (FTC-1ep), and the FTC after 2 epochs of finetuning (FTC-2ep). Would FTC-2ep be approximately double the magnitude of FTC-1ep, while having the same direction?

It is also unclear why the authors settled on this definition of TuCo. From first principles, it seems much more natural to use other definitions, e.g. the difference in model weights between the two models under comparison.

Methods and Evaluation Criteria

In Section 5.1, the authors try controlling model behaviors by scaling the FTC, similar to existing work on activation steering. However, there are many methodological problems here.

The Model-written Evals dataset is not good.

In figure 2, the authors plot the change in aggregate model propensities ('agreement' across all samples) as a result of scaling the FTC. However, they do not report variance between individual examples. Prior work [2] indicates that, often, there is a large difference in the magnitude of the steering effect between individual samples. Looking only at the population aggregate obscures this effect, making steering look more effective than it actually is.

[1] https://arxiv.org/abs/2312.06681
[2] https://arxiv.org/abs/2407.12404

Theoretical Claims

The authors motivate TuCo from the perspective of 'generalized components'. It is very unclear what these 'generalized components' are and how they work. The authors should substantially revamp sections 4.2 and 4.3 to provide a clearer explanation.

I find the argument in 4.2 unclear and at times controversial. It is not clear that a transformer can be decomposed into a linear sum of circuits; the authors should explain their reasoning more clearly. It is also a very controversial claim that finetuning works by adding more circuits. I would like to see more justification of this perspective, preferably with references to existing empirical case studies.

The authors claim that TuCo is a generalization of earlier work on circuit analysis. I find this claim controversial, as they do not explain how their framework subsumes earlier theory such as https://transformer-circuits.pub/2021/framework/. Furthermore, the algorithm for computing TuCo (Algorithm 1) only uses model activations, and does not discuss circuit components.

Experimental Design and Analysis

Yes.

Supplementary Material

No.

Relation to Prior Literature

Understanding the effect of finetuning is generally valuable for building an empirical science of ML.

Missing Important References

The authors' work aims to develop insight by looking at changes in model activations before and after finetuning. However, they do not discuss related literature on model diffing. It seems important to discuss other related techniques like model stitching [1] and sparse crosscoder analysis [2], which have the same 'type signature'.
[1] https://arxiv.org/abs/2106.07682
[2] https://transformer-circuits.pub/2024/crosscoders/index.html

The authors also do not discuss

Other Strengths and Weaknesses

I am not convinced of the significance of TuCo. From a practical perspective, the authors demonstrate signs of life with activation steering, but do not compare to relevant baselines such as CAA. They also use flawed evaluation methods which raise significant concerns about the validity of results.

In the jailbreak setting, the authors show that TuCo is lower on successful jailbreaks, but this does not seem to yield any technique for preventing the jailbreak, nor does it provide an especially clear insight as to why specific jailbreaks work as opposed to others.

Overall, I am not convinced that "TuCo is a relevant interpretability tool", as it has not yet led to interesting insights. I encourage the authors to show how TuCo can be used for a practical problem of interest.

I also am not convinced by the claim that "Model developers can use TuCo to detect inputs where finetuning has less impact and adjust accordingly"; if this were the case I would encourage the authors to include a case study where they do this.

Other Comments or Suggestions

N/A

Author Response

We would like to address the queries raised. Some claims dismissing our experiments are incorrect and unjustified.

No direct evidence [...] FTC approximates finetuning

FTC is exactly the difference in layer outputs between the finetuned and pretrained models; if FTC is zero then FT=PT. Therefore it is a rigorous and universal notion of "effect of fine-tuning" for a given prompt.

difference in model weights [instead]?

Comparing model weights would be agnostic to the prompt, which is not our problem setting (Section 4.1).

TuCo quantifies how much a prompt's $FTC$ contributes to the final hidden state, as an interpretable fraction (0 to 100%). This seems natural.
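
Schematically, one way such a fraction can be accumulated over the forward pass looks as follows (a sketch assuming the same hypothetical per-layer interface as in the reply above; the paper's exact aggregation of the $PTC$ and $FTC$ magnitudes may differ):

```python
def tuco_fraction(ft_layers, pt_layers, x0):
    """Accumulate the magnitudes of the pre-training and fine-tuning components
    along the fine-tuned model's forward pass, and return the fine-tuning
    share as a fraction in [0, 1]."""
    x = x0
    ptc_norm, ftc_norm = 0.0, 0.0
    for ft_layer, pt_layer in zip(ft_layers, pt_layers):
        ft_update = ft_layer(x)   # update applied by the fine-tuned layer
        pt_update = pt_layer(x)   # update the pre-trained layer would apply here
        ptc_norm += float(pt_update.norm())                 # aggregate |PTC_l|
        ftc_norm += float((ft_update - pt_update).norm())   # aggregate |FTC_l|
        x = x + ft_update         # follow the fine-tuned model's residual stream
    return ftc_norm / (ptc_norm + ftc_norm)
```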

[In MWE] the 'answer matching behaviour' is always 'Yes'.

This is not true. The matching answers are balanced (50% "Yes" and 50% "No"). This is easily seen in https://github.com/anthropics/evals/blob/main/persona/subscribes-to-Christianity.jsonl.

[MWE has] data quality issues identified in [...]

This is an appendix of a blog post, which cannot be taken at face value. Moreover, it does not claim any issues with the Persona section of the MWE dataset, the only one we use in our evaluations. This source is both unreliable and inapplicable.

do not report variance [...] large difference in the magnitude of the steering

We added variance to the plots. But this is redundant: there is no "difference in magnitude" that can skew the mean estimator, since we are averaging booleans (constant magnitude).

unclear what these 'generalized components' are and how they work [...] generalization of earlier work on circuit analysis

For a circuit computing $g: (x_l, l) \mapsto g(x_l, l)$ at layer $l$, when the input hidden state is $x_l$, $g$ is a generalized component (Def. 4.1). Thus, Def. 4.1 applies to circuits in Elhage et al. (2021). We updated the paper to explicitly point this out.

We add that this seemed clear to other readers.

transformer [as] linear sum of circuits?

We do not claim full circuit decompositions exist or are known. We only make this assumption in the thought experiment in Section 4.2, in light of the great diversity of circuits identified in prior work.

only uses model activations, [not] circuit components

This is an important strength of our method: exact circuit decompositions need not exist or be known, but TuCo can nonetheless be computed for any model, because it only assumes access to intermediate model activations and to the pre-trained and fine-tuned models.

[discuss] model stitching and sparse crosscoder

We appreciate the suggestion, and updated our related work section. But the connection is indirect. These methods do not yield scalar-valued metrics, so their "type signatures" are different.

demonstrate signs of life [...] but do not compare to [...] CAA

We politely ask for a reference to this CAA. We also believe "signs of life" unnecessarily diminishes our results.

flawed evaluation methods [MWE]

As mentioned above, the reviewer's discrediting of MWE is unjustified and based on false claims. We kindly ask the reviewer to either clarify the perceived methodological flaws, or to remove the claim.

[no] technique for preventing the jailbreak

Per Section 5.4 and Appendix F.4, applying a threshold to TuCo detects jailbreaks, and model outputs can then be halted. Note TuCo is an analysis technique not designed to detect jailbreaks, and yet has out-of-the-box predictive power.

[no] insight as to why specific jailbreaks work

As pointed out in line 371, our results in Section 5.4 indicate that the attenuation of the contribution of fine-tuning to the model's final hidden state (which is what TuCo directly measures) is associated with jailbreak success.

not yet led to interesting insights

Our work yields novel scientific insights on the interplay between LLM jailbreaking and fine-tuning, which are both of widespread interest in the community. In this sense, we consider TuCo to have produced interesting insights.

For example, we quantitatively identify a clear link between jailbreaking and the attenuation of the effects of fine-tuning, which had been merely hypothesized in prior work (Kotha et al. 2023, Wei et al. 2024).

claim "Model developers can use TuCo [...] and adjust accordingly"

This claim is made in the Conclusion and Future Work section -- we leave to future work the application of TuCo to improving fine-tuning methodology and dataset construction.

Conclusion

We hope to have addressed points raised by the reviewer, and pointed out incorrect statements dismissing our experimental results.

In light of these clarifications and corrections, which address the basis for the negative review, we would like to ask the reviewer to consider increasing their score.

Reviewer Comment

Thank you for the extensive theoretical clarifications. I appreciate being corrected on incorrect claims re MWE and will update my opinion accordingly.

Here is your reference to CAA: https://arxiv.org/abs/2312.06681

We also believe "signs of life" unnecessarily diminishes our results.

I understand that the authors contribute significant theoretical work. I maintain that, without a comparison to strong baselines, the results remain a 'sign of life'. For example, in the steering experiments presented in Fig 2, there is no comparison to existing steering methods.

I think the steering experiments are important because they are an example of a causal intervention - the authors intervene on the magnitude of the FTC and show that this affects the model's likelihood of predicting the correct answer. Causal interventions are important to validate hypotheses generated through interpretability research.

In mechanistic interpretability and related fields, my prior remains that new bodies of theory should be validated against downstream tasks as soon as possible. While the analysis sections on jailbreaking and web text are interesting, these ancillary observations cannot serve as the primary support in favor of a new method.

Per Section 5.4 and Appendix F.4, applying a threshold to TuCo detects jailbreaks, and model outputs can then be halted. Note TuCo is an analysis technique not designed to detect jailbreaks, and yet has out-of-the-box predictive power.

Thank you for the clarification. This seems important if true, since it is another example of a causal intervention. If this is the case, then I did not understand section 5.4 on the first read through and I still do not understand it. Please help me understand how you halt model outputs based on TuCo and how effective this is at preventing jailbreaks.

Author Comment

We thank the reviewer for the engaging discussion, and appreciate their openness to updating on the claims about MWE in the initial review.

Here is your reference to CAA: https://arxiv.org/abs/2312.06681

The now-referenced method CAA (Contrastive Activation Addition; Panickssery et al., 2024) seems to be relevant related work, so we have included it in the related work section.

This method computes vectors that can be added to the residual stream to steer the model to exhibit a behavior. Panickssery et al. do this by averaging the difference in model hidden states between contrastive pairs of prompts, with one showcasing the behavior and the other not.

We remark that the purpose of including the $FTC_\alpha$-scaling experiments in the paper is not to show we have the "best" steering technique for LLMs. In fact, this is not the goal of the paper - rather, our goal is to propose a universal, prompt-level analysis technique for measuring the effects of fine-tuning. These experiments instead serve as validation of the relevance of our chosen "object of study" (i.e. the magnitude of $FTC$): by intervening on it, we show it can be used to control model behaviors and capabilities. As such, we do not seek to establish that $FTC_\alpha$ scaling is better than existing steering methods, since TuCo is not designed with this in mind.

Rather, it suffices to show that steering is possible and statistically significant, which we do in Section 5.1 across various tasks and models. As such, while we consider that a comparison with CAA would strengthen our work and will include such a comparison in a final revision, we see our current experiments in Section 5.1 as sufficient to prove our point.

steering experiments [...] should be validated against downstream tasks as soon as possible

There are in fact a variety of "downstream tasks": we assess the interventions on MMLU eval tasks encompassing 17 different areas (each with several subtasks; see Appendix F.1.1), including biology, CS, maths, and more, and with model sizes ranging from 7B to 13B (see Fig. 6, Appendix F.1.1). We also include MMLU humanities tasks, including logic, philosophy and history (Fig. 7). Interventions on more specific tasks in the social sciences, STEM and other areas are evaluated separately (Figs. 8-10). This is in addition to the MWE dataset we already discussed, which assesses deviations in tens of different viewpoints and biases (Appendix F.1.2).

While we subscribe to the reviewer's view that interventional experiments are crucial to establish causal explanations for interpretability, we believe that our suite of intervened tasks is already comprehensive. We would be keen to hear of specific additional downstream tasks the reviewer considers missing, and will strive to include them in the appendix of a final revision. With that said, the theoretical and observational results should not be dismissed, as they are all consistent and reinforce each other, together with the interventional results.

how you halt model outputs based on TuCo and how effective this is at preventing jailbreaks

In Section 5.4 and Appendix F.4, we report that applying thresholds to TuCo to predict jailbreak success yields an AUC score of over 0.8 for all models under consideration except for Vicuna v1.5 13B, where it is 0.78. This means one could in principle pick a threshold (depending on one's relative tolerance for false positives and false negatives) and use TuCo to detect jailbreaks, and obtain a non-trivially-performant classifier.

This means TuCo has meaningful jailbreak detection power. We remark, however, that TuCo is not intended as a jailbreak detection method. We include this experiment to display the relationship between jailbreaks being successful and them decreasing the effects of fine-tuning.

Still, as the reviewer points out, this does indicate our framework produces non-trivial performance in a downstream task, despite not being designed with it in mind. The effects we are observing through TuCo are useful for predicting an important characteristic of model outputs, before the output itself is even generated. This suggests such effects are not spurious or accidental.
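
To make the thresholding concrete, a small illustrative sketch (made-up numbers, using scikit-learn's `roc_auc_score`; lower TuCo is treated as evidence of jailbreak success; this is not the paper's actual evaluation code):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical per-prompt values: TuCo scores and whether the jailbreak succeeded.
tuco = np.array([0.08, 0.31, 0.12, 0.27, 0.15, 0.33])
success = np.array([1, 0, 1, 0, 1, 0])

# Lower TuCo is associated with success, so use -TuCo as the detection score.
auc = roc_auc_score(success, -tuco)

# An operating threshold trades off false positives against false negatives;
# prompts falling below it could have generation halted.
threshold = 0.2
flagged = tuco < threshold
print(f"AUC = {auc:.2f}, flagged prompts: {flagged.tolist()}")
```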

Conclusion

We thank you for the pointer to CAA, and hope to have addressed your concerns regarding interventional experiments and downstream task evaluations in our work, together with the concerns from the initial review. Given this, we would like to politely ask if the reviewer would consider increasing their score.

Review
Rating: 4

This paper investigates the impact that fine-tuning has on the forward pass representations of large language models (LLMs). The authors define the Tuning Contribution (TuCo) as a metric measuring the contribution of fine-tuned model representations as compared to pre-trained representations on the model’s forward pass for a specific input. The authors propose this metric as a tool to measure the degree of impact that fine-tuning has on individual model inputs. TuCo’s utility is assessed via empirical experiments focussing on a range of LLMs of up to 13 billion parameters, including Llama3, Gemma, and Vicuna. In a first experiment, the authors use TuCo to control model behavior by scaling the extent to which fine-tuning should contribute to the model’s final output. Second, the authors compare TuCo scores for web-crawled and chat-completion data and show that this score is substantially higher for chat-completion data. Finally, the paper shows that TuCo notably decreases when jailbreak attacks are applied to initially harmless prompts.

Update after rebuttal

I appreciate the authors' response to my questions and comments. I kept my score as it already indicates acceptance.

Questions for the Authors

None

Claims and Evidence

The claims stated in the paper are supported by empirical evidence.

Methods and Evaluation Criteria

The proposed evaluation criteria are comprehensive and make sense in the context of the paper's problem statement and proposed solution.

Theoretical Claims

I did not check the correctness of the proof for Proposition 4.2 in Appendix D in great detail.

Experimental Design and Analysis

The experiments reported in Section 5 are comprehensive and technically sound. They largely contribute to a better understanding of the paper's proposed method and provide empirical evidence of TuCo's utility.

Supplementary Material

I inspected the supplementary material but did not check / verify the provided code.

Relation to Prior Literature

The paper provides a brief but detailed overview of the related literature. Section 3 (Background) of the paper is largely redundant as knowledge of Transformers as well as pre-training and fine-tuning of LLMs can in my opinion be assumed by the reader. This space is better spent on moving additional details of the empirical results out of the appendix and into Section 5.

Missing Important References

None that I am aware of.

Other Strengths and Weaknesses

I overall found the paper to be very well-written and easy to understand, despite presenting a complex approach to better measure the contributions of pre-training and fine-tuning to model representations; as such, it represents a solid contribution. The empirical evaluations are detailed and comprehensive and demonstrate TuCo's utility. The paper spends too much time focussing on "setting the scene" and providing background information as well as deriving TuCo. I believe that it would benefit from moving parts of this into the appendix and instead increasing its focus on empirical evaluations in the main manuscript (critical tables and figures mentioned in Section 5 have been moved to the appendix but would help the reader understand the results better in the main manuscript).

Other Comments or Suggestions

None

Author Response

We sincerely thank the reviewer for recognizing the original contributions of our work, the comprehensiveness and soundness of our experiments, and the quality of our technical exposition.

In the following, we address the reviewer's points regarding the allocation of space to background and experiments in the manuscript.

The paper spends too much time focussing on "setting the scene" and providing background information as well as deriving TuCo. I believe that it would benefit from moving parts of this into the appendix and instead increase its focus on empirical evaluations in the main manuscript

We thank the reviewer for the suggestions on improving the focus of our paper. We agree that we would like to move some of the figures/tables from the appendix to the main paper, and that, for many readers, an extensive description of transformers is not required. However, because the later sections depend on equations introduced in the background, it also serves to introduce necessary notation. We will strive to reduce it while keeping essential notation, to make space for some of the appendix's content.

Review
Rating: 3

This paper introduces "Tuning Contribution" (TuCo), a new method to measure how much fine-tuning affects the outputs of a large language model (LLM) on a per-prompt basis. Formally, TuCo is calculated as the ratio of the total magnitude of the "fine-tuning component" to the sum of the "pre-training component" and the "fine-tuning component", each of which is computed using the model's hidden states at every layer. Empirical results demonstrate that TuCo aligns with the controllability of model behavior during fine-tuning and has implications for LLM safety (e.g., jailbreaks decrease the Tuning Contribution, especially successful attacks).

Update after rebuttal

The authors' rebuttal addresses my concerns. I will keep my score.

Questions for the Authors

(1) Given that TuCo is a model-dependent metric (e.g., depending on the model architecture), how do you suggest practitioners use the metric in real-world practice?

Claims and Evidence

Most of the claims are well supported (e.g., Empirical Evidence That TuCo Reflects Fine-Tuning Effects, Definition of Tuning Contribution).

However, some claims need more justification: (1) The Lipschitzness of the layers assumed in the theoretical bound may be a strong assumption in practice. (2) The paper tests a nice spread of open-source models (various LLaMA 2 and 3 sizes, Vicuna, Mistral, Gemma, etc.), but only up to 13B parameters; it is not yet shown how TuCo behaves on 30B, 70B, or even larger-scale systems. (3) Other model architectures, like MoE structures, also need to be taken into consideration.

Methods and Evaluation Criteria

The paper’s methods and evaluations (MMLU for academic performance, curated chat datasets for alignment style, multiple jailbreak tactics for adversarial stress-testing) align well with the goal of measuring fine-tuning’s real-time influence on the model’s output.

Theoretical Claims

Yes. No major issues are found.

Experimental Design and Analysis

Yes. No major issues are found.

Supplementary Material

Yes. Appendix D Proofs and E Experimental details.

Relation to Prior Literature

The paper might help contribute to the fields of mechanistic interpretability and LLM safety by tying these threads together into a computationally tractable framework for the effect of a single data point on fine-tuning.

Missing Important References

N/A.

Other Strengths and Weaknesses

N/A

Other Comments or Suggestions

N/A

Author Response

We thank the reviewer for their recognition of our extensive experimental suites and the relevance of our method to interpretability, as well as their thoughtful suggestions on areas of improvement. We would like to address some of the points raised:

(1) The Lipschitzness of the layers assumed in the theoretical bound may be a strong assumption in practice

In Appendix D.5, we rigorously justify our assumption of Lipschitzness for the commonly-used transformer layer with root-mean-square normalization applied before attention and MLP layers.

Intuitively, normalization ensures the input of attention and MLP layers is always of bounded norm, and such layers are locally Lipschitz (or, for MLP layers, globally Lipschitz). Further, the fact that a numerical $\epsilon$ is used during normalization (i.e. one normalizes $x \mapsto \frac{x}{\sqrt{\|x\|^2 + \epsilon}}$) ensures the normalization map itself is Lipschitz. Hence, the resulting layers are Lipschitz, and the boundedness of PTC also follows.
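
For intuition, a back-of-the-envelope bound on the $\epsilon$-regularized normalization map (an illustrative sketch, not the exact argument of Appendix D.5):

```latex
% For g(x) = x / \sqrt{\|x\|^2 + \epsilon}, the Jacobian is
\[
  Dg(x) \;=\; \frac{I}{\sqrt{\|x\|^2 + \epsilon}} \;-\; \frac{x x^{\top}}{(\|x\|^2 + \epsilon)^{3/2}},
  \qquad
  \|Dg(x)\|_{\mathrm{op}} \;\le\; \frac{1}{\sqrt{\|x\|^2 + \epsilon}} \;\le\; \frac{1}{\sqrt{\epsilon}},
\]
% so the normalization map is globally Lipschitz with constant at most 1/\sqrt{\epsilon}.
```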

(2) The paper tests a nice spread of open-source models (various LLaMA 2 and 3 sizes, Vicuna, Mistral, Gemma, etc.), but only up to 13B parameters; it is not yet shown how TuCo behaves on 30B, 70B, or even larger-scale systems.

The computation of TuCo requires access to model parameters and a modified forward pass. As such, we would need to host a model ourselves to run our experiments on it. Given GPU and budget constraints, we were unable to evaluate TuCo on models of larger scale. Instead, we sought to evaluate a large suite of models up to 13B parameters to demonstrate the general applicability of our method.

(3) Other model architectures, like MoE structures, also need to be taken into consideration.

TuCo is agnostic to the specific architecture of model layers, and applies without modification e.g. to MoE architectures. We will implement and evaluate TuCo for JetMoE-8B (https://huggingface.co/jetmoe), a recent MoE model of tractable size for which pre-trained and fine-tuned checkpoints are freely available. This includes modifying the HuggingFace implementation of the forward pass to support TuCo computation and running our experimental suite, which we were unable to complete in the short rebuttal period. We will include results in a camera-ready version if this work is accepted.

(1) Given that TuCo is a model-dependent metric (e.g., depending on the model architecture), how do you suggest practitioners use the metric in real-world practice?

We clarify that TuCo places very light assumptions on model architecture (i.e. only that the intermediate hidden states are updated as $x_{l+1} = x_l + f_\theta(x_l, l)$). In particular, TuCo does not depend on the use of any particular kind of layer (e.g. self-attention).

As mentioned in the conclusion (Section 9), we suggest practitioners use TuCo to detect inputs where fine-tuning is less effective, allowing them to adjust their datasets and mitigate potential vulnerabilities. This approach not only aids interpretability research by identifying prompts that attenuate finetuning effects but also lays the groundwork for integrating adversarial attack prevention in user-facing applications.

Conclusion

In the above, we hope to have addressed the reviewer's concerns regarding justifications of our theoretical assumptions, the scale and architectures of models considered, and downstream applications for TuCo.

We would like to ask if the reviewer would consider increasing their score in case their points have been addressed. Otherwise, we are happy to provide further clarification.

Final Decision

All reviewers recommend acceptance - the paper analyses the "mechanistic" effect of fine-tuning and makes concrete some hypotheses in prior work regarding jailbreaks. The reviewers all felt the paper was well written, and had some interesting insights. Personally, I agree with the concerns raised by JnQu about the lack of clear benefit on downstream tasks. As a result, I recommend accept.