PaperHub
Overall rating: 6.0 / 10 · Poster · 4 reviewers
Scores: 6, 4, 7, 7 (min 4, max 7, std 1.2)
Confidence: 3.3 · Correctness: 2.8 · Contribution: 2.5 · Presentation: 3.0
NeurIPS 2024

Spectral Editing of Activations for Large Language Model Alignment

OpenReview · PDF
Submitted: 2024-05-12 · Updated: 2024-12-26
TL;DR

We propose a novel inference-time editing method for LLMs' activations, namely spectral editing of activations (SEA), to align LLMs with the objectives of truthfulness and bias.

Abstract

Keywords
Large Language Model · Alignment · Spectral Decomposition · Representation Engineering · Model Editing

Reviews and Discussion

Review (Rating: 6)

This paper focuses on the problem of editing out undesirable behaviours at inference time without requiring any training. To that end, the authors present SEA, a method based on spectral editing of activations. To find the editing projections, the method requires collecting LLM activations for a number of neutral, positive and negative demonstrations. From those activations, SVD is applied to the covariance matrices between the neutral and negative, and neutral and positive activations, respectively. To allow for non-linearity, the authors use an invertible non-linear feature function.
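For readers unfamiliar with the procedure, a minimal sketch of what such a spectral projection search could look like is given below. This is an illustrative reconstruction from the summary above, not the paper's exact formulation: the function names, the covariance estimator, and the choice of which singular directions are kept or discarded are assumptions and should be checked against the paper's Eq. (1)-(2).

```python
import numpy as np

def find_editing_projections(H_neutral, H_pos, H_neg, K=0.99):
    """Sketch of a spectral search for editing projections.

    H_* are (n_demos, d) matrices of activations collected at one layer for
    neutral, positive and negative demonstrations (names are illustrative).
    """
    # Cross-covariances between neutral activations and positive / negative ones.
    C_pos = H_neutral.T @ H_pos / len(H_neutral)
    C_neg = H_neutral.T @ H_neg / len(H_neutral)

    def top_projector(C, keep_ratio):
        # Projector onto the left singular directions covering `keep_ratio`
        # of the spectrum's cumulative mass.
        U, S, _ = np.linalg.svd(C)
        k = int(np.searchsorted(np.cumsum(S) / S.sum(), keep_ratio)) + 1
        return U[:, :k] @ U[:, :k].T

    # Keep directions of maximal covariance with positive demonstrations ...
    P_pos = top_projector(C_pos, K)
    # ... and keep only the complement of the dominant negative directions.
    P_neg = np.eye(C_neg.shape[0]) - top_projector(C_neg, K)
    return P_pos, P_neg
```

At inference time, the edited activation would then be some normalised combination of `P_pos @ h` and `P_neg @ h`, with the non-linear variant applying the same recipe in an invertible feature space.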

The authors investigate how their method impacts truthfulness and bias on two benchmarks (TruthfulQA and BBQ), and show that scores on those benchmarks can be improved using SEA. Consistent improvements are observed across six distinct LLMs of various sizes and architectures, using only 25 demonstrations, while not degrading other model capabilities, with an inference-time overhead of around 4%.

The paper first presents results with Llama-2 7B (base and chat) on TruthfulQA, comparing with several baselines, including ICL and LoRA. Overall, SEA outperforms the other methods while maintaining a much better inference speed. With an ablation study, the authors show that activations edited with positive and negative projections likely complement each other and are not as effective on their own. They furthermore investigate the impact of feature normalization, showing that it is more effective than applying no normalization.

Next, the paper investigates the impact of SEA on bias, as measured by BBQ. They show that the accuracy improvement for linear SEA is moderate, while non-linear SEA gives larger improvements and the baselines do not. For bias, they furthermore show that the results generalise to other models (LLaMA-2-13B & 70B, Gemma-IT-2B & 7B, and Mistral-7B).

Last, the paper investigates how SEA scales with the number of demonstrations needed to calculate the editing projections. Experiments show that for MC1 a mere 25 demonstrations suffice for the first improvements (no results are shown for MC2), that for BBQ even fewer demonstrations can improve accuracy (the number listed is also 25?), and that the method has little effect on several other benchmarks unrelated to the edited behaviours.

Strengths

The paper discusses an important topic of making models more truthful and less biased. The proposed method seems to work better than previous methods (though see the weaknesses below) at a smaller cost in inference speed, while maintaining performance on benchmarks unrelated to the editing.

Weaknesses

  • There is no significance testing for benchmark scores. TruthfulQA in particular is not a very large benchmark; for the entire benchmark (averaging over subsets), the 95% confidence intervals would be around 3 points, making several (but not all) of the reported differences insignificant. This should be addressed or discussed.
  • The paper would be stronger if more evaluation benchmarks were considered for bias and truthfulness
  • Some of the result selection seems a bit arbitrary, which gives pause when considering the generalisability of the results. For instance, why are results for other model (families) shown for BBQ, but not for TruthfulQA? And why are scalability results for TruthfulQA shown only for MC1, and not for MC2?
  • The ToxiGen scores for LLaMA-2-chat-7B seem outrageously high; in the Llama 2 paper they are listed as between 20 and 30 for the pretrained model, and around 0 for the chat version. In Table 4, however, the scores are reported to be higher than 50 (!).
  • It is not entirely clear whether the method would scale to making models more truthful and unbiased at the same time; would that require different editing projections to be stacked on top of each other?

Some presentational issues:

  • Figure 4 is a bit difficult to read because of the scale. The text makes statements about values around 25, but this cannot be confirmed from the figure. Perhaps a log scale would be more suitable? Or, alternatively, let the plot go up to 50 rather than 1500/2000, as nothing is discussed about values higher than 25 anyway.
  • In Table 4, ToxiGen scores going down are reported in red, but for ToxiGen lower is better.

Questions

  • Could you explain why your ToxiGen scores differ so drastically from the scores reported in the Llama 2 paper?
  • It could be that I am mistaken, but it seems that several separate editing functions are needed for truthfulness and bias. Can this approach scale to a method where both are taken care of?

Limitations

The discussion of limitations is very limited, only discussing a specific performance degradation of non-linear SEA on control tasks (which is indeed a limitation, but really more of an experimental result). The limitations section is furthermore not referred to in the main text, but is tucked away in a far-down appendix.

Author Response

1. Comment: There is no significance testing for benchmark scores.

Response: On TruthfulQA, we did the pair-wise t-test on SEA vs the ICL baseline (in Table 1). We also confirm that SEA significantly outperforms LoRA-FT. We did not add more significance tests over other baselines, as some of the results are taken from [1]. On BBQ (Figure 3), we also performed a pair-wise t-test, which shows that the improvements of all SEA variants over ICL and LoRA-FT are significant.

[1] Alleviating Hallucinations of Large Language Models through Induced Hallucinations
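For concreteness, the paired test mentioned above can be run in a few lines once per-question scores for the two systems are available; the arrays below are hypothetical, not the paper's actual results.

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical per-question correctness (1 = best answer chosen) for the same items.
sea_scores = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
icl_scores = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 1])

t_stat, p_value = ttest_rel(sea_scores, icl_scores)  # paired (pair-wise) t-test
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```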

2. Comment: The paper would be stronger if more evaluation benchmarks were considered for bias and truthfulness

Response: First, TruthfulQA and BBQ are two popular benchmarks for evaluating truthfulness and fairness. TruthfulQA is used in almost all papers on improving LLMs' truthfulness, and BBQ is used for the fairness evaluation of Gemma, Mixtral, and PaLM.

For truthfulness, we would like to highlight that we use HaluEval [1] to calculate the editing projections and evaluate our method with other baselines on TruthfulQA. This allows us to compare with other methods on public benchmarks and also verifies the task generalization ability of SEA editing from one dataset to another.

For bias evaluation, we further conduct an evaluation on CrowS-Pairs [2], which assesses the model's tendency to generate biased outputs, as an additional evaluation of the editing for fairness. We report the percentage of more-stereotypical sentences (lower is better) that are rated as more likely by the model than their non-stereotypical counterparts, as follows. We observe that both variants of SEA reduce the tendency to output biased sentences for most bias categories. We would like to emphasise that Phi-SEA reduces the tendency to generate more-stereotypical sentences by around 7% on average. All these observations are consistent with our results on BBQ in Section 4.2.

| Model | age | autre | disability | gender | nationality | appearance | race_color | religion | sexual_orientation | socioeconomic | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA-2-chat | 75.82% | 72.73% | 73.85% | 61.56% | 61.11% | 72.22% | 53.15% | 75.68% | 86.02% | 71.58% | 64.16% |
| Linear-SEA-Fair | 74.73% | 72.73% | 72.31% | 62.19% | 60.19% | 70.83% | 53.35% | 75.68% | 86.02% | 72.11% | 64.10% |
| Phi-SEA-Fair | 78.02% | 72.73% | 67.69% | 59.06% | 52.31% | 70.83% | 45.47% | 67.57% | 77.42% | 62.11% | 57.96% |

[1] HaluEval: A Hallucination Evaluation Benchmark for LLMs
[2] CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models
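As a reading aid, the metric reported in the table above can be sketched as follows; `stereotype_preference_rate` and `sentence_logprob` are hypothetical names (the latter standing in for a helper that returns the model's total log-probability of a sentence), not part of the CrowS-Pairs release.

```python
def stereotype_preference_rate(pairs, sentence_logprob):
    """Percentage of pairs where the more-stereotypical sentence is rated
    as more likely than its non-stereotypical counterpart (lower is better).

    pairs: list of (stereotypical_sentence, non_stereotypical_sentence).
    """
    preferred = sum(
        sentence_logprob(stereo) > sentence_logprob(non_stereo)
        for stereo, non_stereo in pairs
    )
    return 100.0 * preferred / len(pairs)
```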

3. Comment: Why are results for other models (families) shown for BBQ but not for TruthfulQA? And why are scalability results for TruthfulQA shown only for MC1 and not for MC2?

Response: We will add the following additional results to the camera-ready version.

Model family generalisation: We include the results for model generalisation on TruthfulQA as follows. SEA improves all LLMs on MC1. Please note that due to time constraints, we were unable to perform an extensive hyper-parameter search.

| Model | MC1 | MC2 |
|---|---|---|
| ICL (LLaMA-2-chat-13b) | 37.7 | 55.7 |
| SEA (N=2000, K=99.8%, L=25) | 38.07 | 55.6 |
| ICL (LLaMA-2-chat-70b) | 37.7 | 59.0 |
| SEA (N=2000, K=99.8%, L=1) | 37.82 | 58.95 |
| ICL (Gemma-IT-2b) | 30.48 | 48.22 |
| SEA (N=2000, K=99%, L=21) | 30.72 | 48.26 |
| ICL (Gemma-IT-7b) | 34.39 | 52.97 |
| SEA (N=2000, K=99.99%, L=28) | 35.13 | 53.66 |
| ICL (Mistral-7b) | 55.81 | 72.18 |
| SEA (N=2000, K=99.99%, L=10) | 56.43 | 72.80 |

Scalability results: MC1 is the most direct metric to measure whether the model predicts the best answer in TruthfulQA. But we will also add the scaling results for MC2 as follows:

| #Demonstrations | 25 | 50 | 100 | 250 | 750 | 1000 | 1500 | 2000 |
|---|---|---|---|---|---|---|---|---|
| MC2 | 54.74 | 54.82 | 55.15 | 55.85 | 55.38 | 55.27 | 56.28 | 57.15 |

4. Comment: Could you explain why your toxigen scores differ so drastically from the scores reported in the Llama2 paper?

Response: Our evaluations are not directly comparable. In LLaMA's technical report, they report the percentage of generations that are deemed toxic by the metric; however, we follow lm-evaluation-harness, where ToxiGen is formulated as asking the model to label whether a given statement is hateful or toxic. The rationale behind our evaluation is that a safer or less toxic model should be more capable of identifying safe/unsafe responses. Additionally, we also include an extra fairness evaluation on stereotypical generation, as discussed in Comment 2.
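A rough sketch of this classification-style formulation is shown below. It is not the lm-evaluation-harness implementation; the prompt template and the `option_logprob` helper (returning the model's log-probability of a continuation given a prompt) are assumptions for illustration only.

```python
def toxigen_label_accuracy(examples, option_logprob):
    """examples: list of (statement, is_toxic) pairs with boolean gold labels."""
    correct = 0
    for statement, is_toxic in examples:
        prompt = f"Is the following statement hateful or toxic? {statement}\nAnswer:"
        # The model "labels" the statement by which answer it finds more likely.
        pred_toxic = option_logprob(prompt, " Yes") > option_logprob(prompt, " No")
        correct += int(pred_toxic == is_toxic)
    return 100.0 * correct / len(examples)
```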

5. Comment: Can your method make the model more truthful and unbiased at the same time? Can this approach scale to a method that takes care of both?

Response: This is indeed an interesting question. We conduct an extra experiment on merging the positive and negative demonstrations for both truthfulness and fairness, then apply the same SEA editing procedure to calculate a pair of projection matrices jointly editing for truthfulness and fairness on LLaMA-2-Chat-7B. Compared with LLaMA-2-Chat-7B, we found that a joint projection can improve both fairness and truthfulness.

However, compared with editing for a single target with the same number of demonstrations, the joint projection is not as effective as specialised editing. We think the potential reason is that the editing directions and degrees for truthfulness and fairness may differ, which can be seen from the spectra of the covariances of the activations on HaluEval and BBQ (Figure 1 in the additional rebuttal PDF). Thus, mixing the two goals for editing might lead to mutual interference to some extent.

| Methods | TruthfulQA MC1 | TruthfulQA MC2 | BBQ Accuracy |
|---|---|---|---|
| LLaMA-2-Chat-7B | 36.96 | 54.68 | 43.02 |
| Specialised Linear-SEA | 38.31 | 55.27 | 43.8 |
| Specialised Phi-SEA | / | / | 56.17 |
| Joint Linear-SEA | 36.84 | 54.81 | 43.17 |
| Joint Phi-SEA | 37.09 | 54.66 | 54.44 |
Comment

Thank you for confirming and running some analyses. I appreciate these responses and I think they support the judgement that this is a "technically solid paper with no major concerns with respect to evaluation, resources, reproducibility, ethical considerations."

Comment

We thank you for taking the time to review our paper and reading our rebuttal.

Review (Rating: 4)

This paper introduces Spectral Editing of Activations (SEA), which adjusts the internal activations of LLMs to enhance alignment with truthful and unbiased content. This technique involves projecting input representations to maximize correlation with positive examples (truthful content) while minimizing correlation with negative examples (biased or false content). The method can be applied during inference and is further extended to non-linear editing using feature functions. Comprehensive experiments were conducted on benchmarks related to truthfulness and bias.

Strengths

  • Research on representation engineering is very interesting and has great potential.
  • The experimental part is comprehensive and the effectiveness of the proposed method is evaluated on various benchmarks.
  • Paper is well written and easy to follow.

Weaknesses

  • Some recent works in representation engineering should be included in the article, such as TrFr[1], TruthX[2].
    In particular, as far as I know, TruthX uses auto-encoder and contrastive learning to learn the editing direction on LLM's representation. This sounds similar to the motivation of "SEA edits activations by keeping them highly correlated with activations associated with positive behavior (e.g., truthful) and decorrelated with negative behavior (e.g., hallucinated)".
    I suggest that the author can compare SEA with these methods in the article to highlight the novelty of the proposed method.
    [1] Truth Forest: Toward Multi-scale Truthfulness in Large Language Models through Intervention without Tuning
    [2] TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space

  • Some baselines on representation engineering, such as TrFr and TruthX, should be compared against in the TruthfulQA experiment. As far as I know, they all report MC1 and MC2 in their papers (some of this work open-sources the trained models, so it should not be complicated to evaluate their methods and compare them with SEA).

  • It is not sufficient to use only multiple choice tasks for TruthfulQA. The authors should further test it on open-ended generation tasks (like previous works), because in real applications we interact with LLM in a conversational manner rather than making multiple choices.

  • The statement about "training-free" needs to be more rigorous. In my understanding, SEA does not require training the LLM, but the process of "Finding the Editing Projections" is effectively a training process, albeit a very cheap one. Like previous work, I prefer to call it "inference-time".

  • The motivations for some settings lack in-depth explanations and experimental ablations; see the Questions section. I can understand that the authors made some heuristic designs/choices, which is acceptable, so this is not a core weakness. But I suggest that some specific explanations and experimental results (if possible) be added, which would make this research more insightful.

Questions

  • When extracting activations within the LLM, why use "activations at the last token position" instead of randomly selecting or taking the mean of all activations in the response?

  • Why use the output of each MLP layer as the activation, instead of an attention head like ITI, or attention and MLP like TruthX?

Looking forward to the author's response; I might consider raising the score if the relevant issues are addressed.

Limitations

N/A

Author Response

1. Comment: I suggest that the author compare SEA with TrFr and TruthX to highlight the novelty of the proposed method.

Response: We will add them to the related work and provide a comparison. We agree the general motivations for SEA and TruthX are similar, but there are many differences:

  1. methodology: Our methods are entirely different. We use spectral decomposition to search and apply the editing directions, while TruthX and TrFr require contrastive learning or probing to find them. Ours allows the editing projections to be calculated via a closed-form solution and has the advantage of training efficiency.
  2. fairness editing: we extend the linear method to non-linear editing for fairness with three kernels. The results of the bias evaluation confirm the advantage of SEA.
  3. efficiency: SEA is a very lightweight method that complements existing methods. As shown in Table 1, thanks to SEA's closed-form solution, its training time is significantly lower than that of gradient-based editing methods. This will make SEA a meaningful efficiency baseline for gradient-based editing methods in the future.

2. Comment: Some baselines on representation engineering, such as TrFr and TruthX, should be compared in the TruthfulQA experiment.

Response: Thanks again for suggesting these related works. We missed TruthX in our submitted manuscript because it was only recently accepted at ACL 2024, which takes place on August 11. In the camera-ready version, we would be happy to include them in the final result table to enable the community to better understand the progress in this field.

3. Comment: It is not sufficient to use only multiple-choice tasks for TruthfulQA.

Response: We also report the scores on the generation track. We use davinci-002 as the backbone for training GPT-Judge and GPT-Info, as curie is no longer maintained by OpenAI. To summarise, SEA has the highest truthfulness score among all methods. However, LoRA gets the best informativeness results, which is expected, as LoRA fine-tunes the model using instruction-following data, unlike SEA.

| Model | Info | Truth | Info*Truth |
|---|---|---|---|
| LLaMA-2-Chat-7B | 69.40% | 47.36% | 33.29% |
| LoRA (N=1K) | 91.06% | 48.59% | 42.59% |
| LoRA (N=2K) | 92.41% | 47.49% | 42.35% |
| SEA (N=1K) | 70.38% | 48.96% | 35.25% |
| SEA (N=2K) | 68.05% | 50.67% | 33.66% |

4. Comment: The statement about "training-free" needs to be more rigorous. I prefer to call it "inference-time".

Response: Thanks. We agree that SEA still leverages demonstrations to calculate the editing projections. We will change the term to inference-time editing.

5. Question: Why do we use "activations at the last token position" instead of randomly selecting or taking the mean of all activations in the response?

Response: We follow previous work [1,2] by using the activations at the last-token position, which has been shown to be effective in capturing the model's internal states over the entire sequence.

As a way to provide evidence for this claim, we also run an ablation study on the choice of activations in the two ways you recommended:

| TruthfulQA | MC1 | MC2 |
|---|---|---|
| last-position | 39.41 | 57.15 |
| mean | 36.96 | 54.55 |
| random | 36.96 | 53.6 |

[1] In-context Vectors: Making In-Context Learning More Effective and Controllable Through Latent Space Steering
[2] Improving text embeddings with large language models

The result is as expected: using the last-position activations works best. Our explanation is that the completion is generally shorter than the prompt, especially for the QA task with short answers like TruthfulQA. Using mean pooling of all tokens from the whole sequence may over-amplify the signal from the prompt rather than the relatively short positive/negative completions. Also, as we are dealing with a decoder-only model, the tokens in the prompt cannot attend to the completion during encoding. So, using them to contrast the model's behaviours from the positive and negative completions would not be meaningful.
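A minimal sketch of collecting such last-token activations with forward hooks is given below. The model name is only an example, and the module path `model.model.layers[i].mlp` assumes a LLaMA-style model in Hugging Face transformers; other architectures name their sub-modules differently.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-chat-hf"  # example checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

collected = {}

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # Keep only the activation at the last token position of the sequence.
        collected[layer_idx] = output[:, -1, :].detach()
    return hook

handles = [layer.mlp.register_forward_hook(make_hook(i))
           for i, layer in enumerate(model.model.layers)]

with torch.no_grad():
    batch = tok("Q: What happens if you crack your knuckles? A: Nothing harmful.",
                return_tensors="pt")
    model(**batch)

for h in handles:
    h.remove()
# collected[i] now holds the last-token MLP output of layer i for this demonstration.
```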

6. Question: Why use the output of each MLP layer as the activation, instead of an attention head like ITI, or attention and MLP like TruthX?

Response: Our main concern is efficiency. Attention has multiple heads and projections, which leads to 1) a considerable increase in the number of hyperparameters: editing attention would require an understanding of the underlying mechanisms of each attention head [1,2], making it a more challenging and less applicable approach; and 2) a decrease in inference efficiency: the complexity of editing each Transformer block's output is O(L), but O(L×H) for editing attention, where L is the number of layers and H is the number of attention heads.

Finally, the roles of attention and MLP are still an open research question. There is no absolutely correct paradigm, whether for LoRA fine-tuning or representation editing. There are works [3-5] editing the Transformer layers' outputs that also show promising performance. We suggest that users decide where to apply the edits according to their needs and budgets.

[1] Retrieval head mechanistically explains long-context factuality

[2] Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions

[3] In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering

[4] Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection

[5] Erasure of Unaligned Attributes from Neural Representations
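To make the O(L) point above concrete, the sketch below applies a per-layer edit to MLP outputs with one forward hook per layer (L hooks in total); editing per attention head would instead need on the order of L×H interventions. `P_pos`, `P_neg`, the mixing weight, and the LLaMA-style module path are hypothetical, carried over from the earlier sketch rather than taken from the paper.

```python
import torch

def make_editing_hook(P_pos, P_neg, alpha=0.5):
    def hook(module, inputs, output):
        # output: (batch, seq, hidden) MLP activations of one layer.
        edited = alpha * (output @ P_pos.T) + (1 - alpha) * (output @ P_neg.T)
        return edited  # returning a tensor from a forward hook replaces the output
    return hook

# One hook per layer, i.e. O(L) interventions in total (LLaMA-style module path):
# handles = [layer.mlp.register_forward_hook(make_editing_hook(P_pos[i], P_neg[i]))
#            for i, layer in enumerate(model.model.layers)]
```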

Comment

Dear reviewers, as we approach the end of the rebuttal, we hope our response has addressed all your concerns. If not, please let us know, and we would be happy to provide further explanation. Thank you very much.

Review (Rating: 7)

The paper introduces a novel method called Spectral Editing of Activations (SEA) to improve the alignment of LLMs by enhancing truthfulness and reducing bias. SEA operates at inference time, projecting input representations in ways that maximize correlation with positive demonstrations (truthful content) and minimize correlation with negative demonstrations (hallucinated content). The method leverages singular value decomposition for linear editing and extends to non-linear editing using feature functions. Extensive experiments on benchmarks for truthfulness and bias demonstrate SEA's effectiveness, generalizability, and efficiency across six different LLMs. The results highlight SEA's ability to improve model performance on tasks like TruthfulQA and the BBQ dataset with minimal impact on other model capabilities.

Strengths

The paper presents a unique inference-time editing method, SEA, which uses spectral decomposition to improve LLM alignment. This approach is novel compared to existing optimization-heavy methods.

The experimental design is robust, involving multiple benchmarks and diverse LLMs, demonstrating SEA's effectiveness in improving truthfulness and fairness while maintaining computational efficiency.

The paper is well-written and clearly explains the methodology, including the theoretical foundations of SEA and its practical implementation. The use of figures, like the one illustrating activation clusters, aids in understanding the concepts.

The ability to edit LLM activations to enhance desirable properties like truthfulness and reduce undesirable behaviors like bias has significant implications for the deployment of more reliable and fair NLP applications.

Weaknesses

The paper could benefit from experiments on a broader array of tasks to further validate SEA's effectiveness across different contexts. This would help in generalizing the findings beyond the current benchmarks.

Including visualizations of the distribution shifts in activations before and after applying SEA would provide more insight into the impact of the method and help in understanding the underlying mechanics.

Questions

Can you provide visualizations of the activation distribution shifts before and after applying SEA? This would help in understanding the impact of the method on the internal representations.

How does SEA perform on other important NLP tasks not covered in this study? Extending the evaluation to a wider range of tasks could further establish its generalizability.

Could you elaborate on the choice of benchmarks and how representative they are of real-world scenarios where LLM alignment is critical?

Limitations

While the paper demonstrates SEA's effectiveness on truthfulness and fairness benchmarks, a more comprehensive evaluation across a wider array of tasks and datasets would provide stronger evidence of its generalizability.

Author Response

1. Comment: The paper could benefit from experiments on a broader array of tasks to further validate SEA's effectiveness across different contexts. This would help generalize the findings beyond the current benchmarks. How does SEA perform on other important NLP tasks not covered in this study? Extending the evaluation to a wider range of tasks could further establish its generalizability.

Response: In this work, we focused on truthfulness and fairness, which we perceive as critical attributes that enhance the usefulness of LLMs. We agree with the reviewer that while TruthfulQA and BBQ are the "go-to" benchmarks for such evaluations, our paper can benefit from experimenting with additional datasets.

For this purpose, we further conducted an additional evaluation on CrowS-Pairs [2], which assesses the model's tendency to produce biased outputs. We report the percentage of more-stereotypical sentences (lower is better) that are rated as more likely by the model than their non-stereotypical counterparts, as follows. We observe that both variants of SEA reduce the tendency to output biased sentences for most bias categories. We would like to emphasise that Phi-SEA reduces the tendency to generate more-stereotypical sentences by around 7% on average. All these observations are consistent with our results on BBQ in Section 4.2.

| Model | age | autre | disability | gender | nationality | appearance | race_color | religion | sexual_orientation | socioeconomic | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA-2-chat | 75.82% | 72.73% | 73.85% | 61.56% | 61.11% | 72.22% | 53.15% | 75.68% | 86.02% | 71.58% | 64.16% |
| Linear-SEA-Fair | 74.73% | 72.73% | 72.31% | 62.19% | 60.19% | 70.83% | 53.35% | 75.68% | 86.02% | 72.11% | 64.10% |
| Phi-SEA-Fair | 78.02% | 72.73% | 67.69% | 59.06% | 52.31% | 70.83% | 45.47% | 67.57% | 77.42% | 62.11% | 57.96% |

[1] HaluEval: A Hallucination Evaluation Benchmark for LLMs
[2] CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models

2. Comment: Including visualizations of the distribution shifts in activations before and after applying SEA would provide more insight into the impact of the method and help understand the underlying mechanics. Can you provide visualizations of the activation distribution shifts before and after applying SEA? This would help in understanding the impact of the method on the internal representations.

Response: We have provided a visualization of Linear-SEA and Phi-SEA editing on BBQ in Figure 2 in the additional rebuttal PDF. We observe that Phi-SEA (right) removes the important directions of negative demonstrations while retaining the directions related to positives, which explains the editing qualitatively. We will include these visualizations in the revised manuscript.

3. Question: Could you elaborate on the choice of benchmarks and how representative they are of real-world scenarios where LLM alignment is critical?

Response: First, TruthfulQA and BBQ are two popular benchmarks for evaluating truthfulness and fairness. TruthfulQA is used in almost every paper on improving LLMs' truthfulness, and BBQ is used for the fairness evaluation of Gemma, Mixtral, and PaLM.

Secondly, both datasets cover a wide range of scenarios regarding truthfulness and fairness. TruthfulQA spans over 38 categories and focuses on evaluating the model's ability to generate factually accurate responses, which is essential for maintaining LLMs' credibility and reliability in real-world applications. BBQ is also hand-built and covers 11 types of common bias. It assesses biases in the model’s responses, helping ensure fairness and reducing harmful stereotypes.

Thirdly, these benchmarks are representative of real-world scenarios where alignment is crucial because their QA task formulation covers both the accuracy of the information seeking and safety/ethical considerations. This allows us to demonstrate the effectiveness of SEA in mitigating undesired model behaviours in a more real-world setup. The QA task formulation also helps us obtain the polarised positive and negative demonstrations for calculating SEA's projections.

Comment

Dear reviewers, as we approach the end of the rebuttal, we hope our response has addressed all your concerns. If not, please let us know, and we would be happy to provide further explanation. Thank you very much.

Review (Rating: 7)

Summary

  • This paper presents an inference-time alignment algorithm based on activation editing.
  • Their technique, named spectral editing of activations (SEA), projects the input representations onto directions with maximal covariance with positive demonstrations (truthful) and minimal covariance with negative demonstrations (hallucinations).
  • They use SVD to find projection directions that correlate maximally with positive and negative demonstrations.
    • Equation (1), (2) in the paper describe the technique well.
    • The idea is to keep the largest singular values for positive demonstrations and smallest singular values for negative demonstrations.
  • The positive and negative activation vectors after editing are merged together with a feature normalization factor, which the authors later show through ablation studies to be important.
  • In addition to linear transformations, the authors extend the method to a non-linear setting, based on the hypothesis that certain behaviors, like producing biased responses, may not exhibit linear separability in the activation space. To this end, they experiment with three non-linear kernels (see the sketch below).
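The three kernels used in the paper are not reproduced here; the sketch below only illustrates the general pattern of editing in an invertible feature space, using a hypothetical elementwise map `phi` with an exact inverse rather than the paper's actual feature functions.

```python
import numpy as np

def phi(h):
    # Hypothetical invertible elementwise feature map: sign(h) * log(1 + |h|).
    return np.sign(h) * np.log1p(np.abs(h))

def phi_inv(z):
    # Exact inverse of phi: sign(z) * (exp(|z|) - 1).
    return np.sign(z) * np.expm1(np.abs(z))

def nonlinear_edit(h, edit_in_feature_space):
    z = phi(h)                           # map activations into the feature space
    z_edited = edit_in_feature_space(z)  # e.g. apply the spectral projections there
    return phi_inv(z_edited)             # map the edited features back
```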

nits and typos

  • Line 239: optimisation -> optimization

Strengths


  • Compared to traditional activation engineering methods, which require iterative optimization, their proposed technique is training-free.
  • The paper is well written and easy to follow.
  • The experiment evaluating truthfulness and speed is convincing. While I understand that you are considering inference time methods as baselines, I'm curious to know how these methods compare to tuning-based methods like DPO. Do they come close in performance?
  • The experiment on Bias Evaluation with non-linear function somewhat supports the hypothesis about non-linear separability of bias.

Weaknesses


  • Line 294 is a strong claim. As shown in Figure 4 (BBQ), the performance plateaus. Consider re-phrasing.
  • Performance on control tasks is generally convincing, but one needs to be careful about applying this technique to common-sense tasks. The explanation given in lines 307-308 does not tell me why the lossy function does not apply to math tasks but only selectively to common-sense QA.
  • I am also surprised why the authors did not compare with Best-of-N alignment as a baseline which also does not require any training and is quite simple to compare against.

Questions


  • What is a good working value for the hyperparameter K (line 127)?
  • How do you enforce that k = r/2, where r is the rank of the matrix? Or is there no such constraint? Otherwise you end up double-summing the values for overlapping directions.
    • It seems from Table 1 that you are keeping the top 99% and bottom 99% of explained variance. Does this not lead to double summation for activations?
    • Also, this makes me wonder whether the spectrum of activations really decays exponentially.

Limitations

The authors have adequately addressed the limitations.

Author Response

1. Comment: Line 294 is a strong claim. As shown in Figure 4 BBQ, the performance plateaus. Consider re-phrasing.

Response: Thank you for your feedback. We agree that, compared with the results on TruthfulQA, the performance plateaus in Figure 4 for BBQ, which suggests that the benefits of additional demonstrations may saturate beyond a certain point. We will rephrase line 294 to reflect this.

2. Comment: Performance on control tasks is generally convincing, but we need to be careful about applying this technique to common-sense tasks. The explanation given in lines 307-308 does not tell me why the lossy function does not apply to math tasks but only selectively to common-sense QA.

Response: For all linear editing variants, which are guaranteed to be lossless, we indeed observe a very minimal negative effect on other control tasks, including commonsense QA. Hence, the major factor for performance degradation is non-linear transformations. Regarding the distinct degradation for commonsense QA and other tasks in non-linear editing, one possible reason is that we apply the editing on MLP's outputs, which has recently been found to be highly associated with LLM's storing and recalling commonsense knowledge [1-3].

[1] Locating and Editing Factual Associations in GPT.

[2] Language Models Implement Simple Word2Vec-style Vector Arithmetic.

[3] Knowledge Neurons in Pretrained Transformers.

3. Question: While I understand that you are considering inference time methods as baselines, I'm curious to know how these methods compare to tuning-based methods like DPO.

Response: We would like to clarify that SEA does not aim to compete with fine-tuning alignment, e.g., DPO/PPO. We would expect inference-only editing to lag behind fine-tuning alignment, as we discuss regarding the large improvement of RLHF in Table 1 and L221. Instead, we show that SEA can be applied on top of existing alignment, e.g., we apply SEA to the LLaMA-Chat model aligned with PPO, while providing lightweight (i.e., inference-only) and flexible (i.e., the user can define the alignment objective with very few positive/negative demonstrations) control over the model's output.

4. Comment: I am also surprised why the authors did not compare with Best-of-N alignment as a baseline which also does not require any training and is quite simple to compare against.

Response: First, we would like to clarify that we use the predicted likelihood of the candidate's answer to evaluate the model's truthfulness and fairness, making best-of-N baselines non-applicable for our original evaluation.

Second, our goal is to edit activations inside the model to control the model's behaviour, which is orthogonal to the best-of-N method, i.e., the gains brought by SEA and by best-of-N can be added together. Therefore, we conducted an additional round of experiments to further verify the effectiveness of SEA editing: we re-ran LLaMA-2-Chat-7B and its edited version, Truthful-SEA, on the generation track of TruthfulQA under the best-of-N setup. We find that a larger N leads to significantly higher scores in truthfulness and informativeness. For each N, SEA always has higher scores than the LLaMA baseline, except for the informativeness score when N=1. These experiments further consolidate the gains of SEA: they appear not only under nucleus sampling but also carry over to the best-of-N distribution.

| Model | Best-of-N | Info | Truth | Info*Truth |
|---|---|---|---|---|
| LLaMA-2-Chat-7B | 1 | 69.40% | 47.36% | 33.29% |
| LLaMA-2-Chat-7B | 2 | 76.50% | 57.03% | 44.55% |
| LLaMA-2-Chat-7B | 3 | 80.54% | 62.30% | 50.31% |
| Truthful-SEA | 1 | 68.05% | 50.67% | 33.66% |
| Truthful-SEA | 2 | 77.72% | 57.28% | 44.56% |
| Truthful-SEA | 3 | 82.01% | 63.04% | 51.30% |

Note that we follow previous work [1] to separately fine-tune GPT-3.5 as a truthfulness judge and as an informativeness judge.

[1] Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
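For reference, the best-of-N setup used in the comparison above reduces to sampling N completions and keeping the one preferred by a judge; `generate` and `judge_score` below are hypothetical stand-ins for nucleus sampling and the fine-tuned GPT judge.

```python
def best_of_n(prompt, generate, judge_score, n=3):
    """Sample n completions and return the one with the highest judge score."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=judge_score)
```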

5. Question: What is a good working value for the hyperparam K (line 127)?

Response: It depends on the task and model. We present the selected hyperparameters for all our experiments in Appendix D. We also present an analysis of the effect of different K in Appendix C. In the analysis, we show that increasing K leads the downstream performance to first increase and then decrease. Our explanation for the decrease is that a very large K (i.e., including more explained variance from the positives while less from the negatives) might let the projections capture noisy signal in the positive demonstrations while losing task-related information from the negative demonstrations.

6. Question: How do you enforce that k = r/2 where r is the rank of the matrix? Or is there no such constraint? Otherwise you end up double summing the values for overlapping directions.

  • q1: It seems from Table 1 that you are keeping the top 99% and bottom 99% of explained variance. Does this not lead to double summation for activations?

Response: No double summation takes place. Assuming we have K=99%, SEA keeps the top 99% of explained variance for the positive covariance while removing the bottom 99% of the negative covariance. The positive covariance and negative covariance are also two distinct matrices, as calculated in Eq. (2), so the directions in their projected subspaces are distinct as well. We will clarify this in the paper (see also the numerical sketch below).

  • q2: Also this makes me wonder if the spectrum of activations really decays exponentially?

Response: Yes, we observe exponential decay, as shown in Figure 1 in our attached additional rebuttal PDF. We present the spectra of both covariances for linear-SEA editing on truthfulness and Phi-SEA editing on fairness.
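Relating to the double-summation question above, here is a small numerical sketch with hypothetical spectra: each covariance is decomposed on its own, and the number of kept directions follows from its own cumulative explained-variance threshold, so no direction is counted twice.

```python
import numpy as np

def num_kept_directions(singular_values, K=0.99):
    explained = np.cumsum(singular_values) / singular_values.sum()
    return int(np.searchsorted(explained, K)) + 1

S_pos = np.array([9.0, 3.0, 1.0, 0.5, 0.1])   # hypothetical positive-covariance spectrum
S_neg = np.array([7.0, 4.0, 2.0, 0.3, 0.2])   # hypothetical negative-covariance spectrum
print(num_kept_directions(S_pos), num_kept_directions(S_neg))  # -> 4 5
```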

Comment

Dear reviewers, as we approach the end of the rebuttal, we hope our response has addressed all your concerns. If not, please let us know, and we would be happy to provide further explanation. Thank you very much.

Author Response

This additional PDF page contains two figures requested by the reviewers:

  1. Figure 1: Visualisation of the spectra of the covariances.
  2. Figure 2: Visualisation of the activation editing.
Final Decision

This work introduces a method called Spectral Editing of Activations (SEA) to improve the alignment of LLMs by enhancing truthfulness and reducing bias. SEA operates at inference time, projecting input representations in ways that maximize correlation with positive demonstrations (truthful content) and minimize correlation with negative demonstrations (hallucinated content). The method leverages singular value decomposition for linear editing and extends to non-linear editing using feature functions. There are evaluations on benchmarks for fairness and bias using several models.

The 4 reviewers and I are in agreement that the proposed method is timely and that the research question of improving truthfulness and reducing bias in LLMs is important. The proposed method is also much more computationally lightweight than previous methods for representation editing / concept erasure, which is an improvement. The evaluation is sufficient, though it is not clear what about the method is specific to the truthfulness and bias domain -- I understand that the authors decided to focus on these specific dimensions, but the paper could benefit from a discussion of whether and how the method can be extended to other tasks where human alignment is desirable.

During their rebuttal, the authors were mostly responsive to the major concerns of the reviewers and have now provided a more comprehensive set of results on model families and benchmarks. I recommend that these results be included in the paper, along with the other edits promised during the rebuttal.