PaperHub
Average rating: 4.3/10 (Rejected, 4 reviewers)
Ratings: 5, 3, 6, 3 (lowest 3, highest 6, std. dev. 1.3)
Confidence: 3.5 · Correctness: 2.5 · Contribution: 2.3 · Presentation: 2.0
ICLR 2025

Toward Trustworthy: A Method for Detecting Fine-Tuning Origins in LLMs

OpenReview · PDF
Submitted: 2024-09-16 · Updated: 2025-02-05

Abstract

Keywords

Fine-Tuning Origins Detection, LoRA, LLM

Reviews and Discussion

Review (Rating: 5)

This paper introduces a new method to verify whether large language models (LLMs) have been fine-tuned from a specified base model, addressing limitations in existing verification techniques. The approach can detect obfuscation tactics, such as permutations and scaling, that obscure a model’s origin. Additionally, the framework extracts the LoRA rank used during fine-tuning, offering a more robust verification system. The method is empirically validated on 29 diverse models, showing its effectiveness in challenging real-world scenarios.

Strengths

  • The problem is very important.
  • The method handles obfuscation, making it robust against manipulation.
  • It provides a formal framework focused on identifying fine-tuning origins.
  • There is extensive validation across multiple models.

Weaknesses

  • Missing related work and discussion: The procedure (line 279) in Algorithm 1 "Random Rank Extraction" seems very similar to its counterpart in Algorithm 1 "Hidden-Dimension Extraction Attack" in [1]. The "Rank Extraction Method" is also similar to that paper, yet the authors never reference it in the submission. Please provide more discussion of this.
  • Strong assumptions in the proof: The assumption in line 742 seems strong. Is there any support for this full-rank assumption? The same issue applies to the assumption in line 785. Unfortunately, some of these assumptions are not even shown in the main paper; they appear only in the appendix. I believe important assumptions should be clearly stated in the main paper. The linear-independence assumption in line 217 is also too strong.
  • Unreasonable insight: I question the rationality of the statement "if outputs are nearly identical, their corresponding intermediates are likely similar" in line 269. It is intuitive but not rigorous.

If the author's answers address my concerns, I will consider raising the score.

[1] Carlini N, Paleka D, Dvijotham K D, et al. Stealing part of a production language model. ICML 2024.

Questions

Typos and some minor issues:

  • The caption of Figure 2 should be modified to prevent the overlap.
  • $W$ in line 168 should be revised; should it be $W_c$?
  • "scalara" in line 200.
  • There should be a blank space between $R_\theta$ and "is" in line 186. There are similar issues in lines 194 and 200.
  • There should be a reference in line 223 for the Natural Language Toolkit (NLTK).
  • None of the formulas in this paper are numbered.
  • What is the symbol between $x_i$ and $y_i$ in line 575? I suggest the authors incorporate a formal definition of it, since this symbol is not as routine to the reader as addition (+), subtraction (−), multiplication (×), and division (÷).
  • The full name of "MLP" should be introduced in this paper.

Others:

  • I believe that there should be some reference in Section 3.3 to help readers better understand the problem. We cannot guarantee that all readers are familiar with this field. In particular, the readers may be confused about the sentence "The challenge posed by this scenario is encapsulated by the discrepancy in ranks of the parameter differences".
  • There should be some reference for PEFT in line 86.
Review (Rating: 3)

This paper aims to propose a method for rigorously determining whether a model has been fine-tuned from a specified base model.

Strengths

This is an important task, and the experiments appear to be thorough.

Weaknesses

The presentation of this paper is poor, which affects my overall understanding. For example, I don't understand the purpose of each subsection of the methodology. The writing is difficult to follow, making it challenging to grasp the key points. The methodology section is particularly confusing, as the steps are not well-explained, further hindering my comprehension of the paper. The experimental analysis should be more thorough. I would like to obtain a more detailed quantitative analysis.

The use of the term "trustworthy" in the title is unclear, leaving its meaning ambiguous. The title is also grammatically awkward.

I also question the claim that "Crucially, the method remains valid regardless of the permutations used, enabling accurate determination of the base model for any derivative." If the parameters are significantly altered, it becomes theoretically impossible to determine whether the model was trained from scratch or fine-tuned. This differs significantly from the problem definition, which is based on LoRA. Could the authors clarify more about this?

Questions

NA

Review (Rating: 6)

This paper introduces a novel method for detecting fine-tuning origins in large language models (LLMs), addressing the challenge of transparency and trust when obfuscation techniques are used to hide model lineage. The method, which is the first of its kind, can extract the Low-Rank Adaptation (LoRA) rank used during fine-tuning, providing a robust verification framework. It has been empirically validated on 29 diverse open-source models, demonstrating its effectiveness in real-world scenarios. The study contributes to enhancing the trustworthiness and accountability of AI model deployments by accurately documenting model origins and modifications.

Strengths

Robustness: The method's ability to recognize fine-tuned models even in the presence of obfuscation techniques (e.g., parameter permutations and scaling transformations) demonstrates its robustness against common means of obfuscation.

Accuracy: By extracting the LoRA rank, the method is able to accurately identify the differences between the fine-tuned model and the base model, providing a detailed quantitative measure of the origin of the model.
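The rank-extraction idea praised here can be sketched with a plain SVD on the weight delta. This is a toy illustration under the standard LoRA update W_fine = W_base + BA, not the paper's actual algorithm; all matrices and the tolerance are hypothetical:

```python
import numpy as np

def estimate_lora_rank(w_base, w_fine, tol=1e-6):
    """Estimate the LoRA rank as the numerical rank of the weight delta.

    A LoRA update W_fine = W_base + B @ A (with A of shape (r, d)) makes
    the difference rank at most r, so counting singular values above a
    relative tolerance recovers r.
    """
    delta = w_fine - w_base
    s = np.linalg.svd(delta, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

# Toy check: a rank-4 LoRA update on a 64x64 weight matrix.
rng = np.random.default_rng(0)
w_base = rng.standard_normal((64, 64))
b, a = rng.standard_normal((64, 4)), rng.standard_normal((4, 64))
w_fine = w_base + b @ a
print(estimate_lora_rank(w_base, w_fine))  # -> 4
```

Note this naive version would be defeated by the obfuscations the paper targets (permutation/scaling), which is exactly why a more robust extraction procedure is needed.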

Weaknesses

Scope limitation: The current method is mainly applicable to the case where the MLP layer is not modified during the fine-tuning process. If the parameters of the MLP layer are tuned or the architecture is changed, the effectiveness of the method may be reduced.

Challenges with small-amplitude outputs: For models with small output amplitudes, the efficiency of reverse-engineering intermediate states may suffer due to weak gradient signals, which limits the applicability of the method.

Computational complexity: The use of techniques such as SVD and gradient descent may entail high computational costs, especially when dealing with large models.

Lack of discussion of backdoor-based detection approaches such as [1].

[1] Double-I Watermark: Protecting Model Copyright for LLM Fine-tuning

Questions

The paper mentions that the methodology is mainly applicable to the case where the MLP layer has not been modified. How will future research be extended to accommodate cases where the MLP layer parameters are adjusted, or the architecture is changed?

How well does the paper's approach generalize to different types of models and fine-tuning strategies? Are there plans for more extensive experiments to validate this?

In the iterative optimization strategy, the update formula $y_{m+1} = y_m - \alpha \nabla \| f(y_m) - z_c \|^2$ depends on the gradient $\nabla \| f(y_m) - z_c \|^2$. If $f(y_m)$ is not smooth in the neighborhood around $y_m$, will this affect the stability of the gradient and the convergence of the reconstruction process? In the presence of multiple local minima, how can the gradient descent method be guaranteed to find the global minimum rather than getting stuck in a local minimum, especially when $z_c$ is affected by obfuscation techniques?
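The update rule in question can be made concrete on a toy reconstruction problem. Here f is a hypothetical stand-in for the network map (a single tanh layer, not the paper's model), and the gradient is derived analytically for this f:

```python
import numpy as np

rng = np.random.default_rng(1)
M = 0.3 * rng.standard_normal((8, 8))  # hypothetical weight matrix
f = lambda y: np.tanh(M @ y)           # stand-in for the layer map f

y_true = rng.standard_normal(8)
z_c = f(y_true)                        # observed (possibly obfuscated) output

def grad(y):
    # Gradient of the loss ||f(y) - z_c||^2 for this particular f:
    # 2 * M^T ((f(y) - z_c) * (1 - tanh(My)^2))
    r = f(y) - z_c
    return 2.0 * M.T @ (r * (1.0 - np.tanh(M @ y) ** 2))

y, alpha = np.zeros(8), 0.05
loss0 = np.sum((f(y) - z_c) ** 2)
for _ in range(5000):
    y = y - alpha * grad(y)            # y_{m+1} = y_m - alpha * grad(...)
loss = np.sum((f(y) - z_c) ** 2)
# The residual shrinks here because this toy loss is benign; with a
# saturating or non-smooth f, a poor start can stall in a local minimum.
```

This makes the reviewer's point tangible: the recursion itself is trivial to run, but its convergence hinges entirely on the smoothness and loss landscape of f near the iterates.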

Review (Rating: 3)

The paper proposes a technique to identify if a target model is a finetuned version of a base model, when the target model can be obfuscated. Gray box access to intermediate model weights is assumed.

Strengths

  1. I believe model attribution issues will only rise with the craze related to ML. Hence this paper is timely.
  2. For open source models, model obfuscation is perhaps an important way of hiding the base model information. Hence the focus on obfuscation is good.
  3. The paper considers many models in its experiments section including llama2, 3 and Mistral models.

Weaknesses

  1. The paper really needs to be rewritten. Here are some comments:
     a) Improve the caption of Fig. 1; readers are supposed to understand this figure without reading the paper.
     b) The Fig. 2 caption and body text are mixed together.
     c) How are Sections 4.1, 4.2, and 4.3 connected? It would help to add a paragraph before 4.1 explaining this.
     d) Alg. 1 should be self-sufficient; state what each notation means in the algorithm itself.
     e) Where is Theorem 4? Where does the proof end? Mark it with an end-of-proof symbol.
     f) The captions of all figures should be better.
     g) The explanation of results is very weak.
     h) There should be more focus on Sections 4 and 5.2.

  2. The paper is really specific to fine-tuning. It is not clear whether it works for distillation or compression.

  3. L49-51, where the authors claim to be the "first formal framework", seems a bit of an overclaim. Techniques such as fingerprinting have been shown to survive fine-tuning (and with only black-box access).

  4. I am not sure the technique really works in non-trivial cases. a) The paper excludes obfuscation in the MLP layers; why is this a reasonable assumption? Wouldn't it be easy for an adversary to manipulate MLP layers? b) When the LoRA config includes W_o, the technique completely fails, estimating ranks far from the given rank.

Questions

  1. Why is rank(W_c − W_b) ≫ s?
AC Meta-Review

While the reviewers appreciated the paper's motivation and initial experiments, their main concerns were with (a) the scope of the method, (b) the computational complexity, (c) the missing related-work discussion, (d) the strength of the required assumptions, and (e) overall clarity. There was no author response. For these reasons I vote to reject. The reviewers have given detailed feedback, and I recommend the authors closely follow and respond to their comments before submitting to a future ML venue. Addressing these issues would make for a much stronger submission.

Additional Comments on Reviewer Discussion

There was no reviewer discussion and no author rebuttal. All reviewers except one voted to reject, and that reviewer's concerns were not addressed.

Final Decision

Reject