PaperHub
5.5 / 10
Poster · 4 reviewers
Lowest: 3 · Highest: 3 · Std: 0.0
Ratings: 3, 3, 3, 3
ICML 2025

Constrain Alignment with Sparse Autoencoders

Submitted: 2025-01-23 · Updated: 2025-07-24

Abstract

Keywords
Alignment · SAE · LLM

Reviews and Discussion

Review
Rating: 3

The author proposes Feature-level Constrained Preference Optimization (FPO). FPO uses a SimPO objective plus a regularizer that compares features in a lower-dimensional space, rather than token probabilities in the high-dimensional vocabulary space. These features are obtained by a sparse autoencoder. In this way, FPO can still regularize the training relative to a reference model, thus improving stability relative to SimPO, while computing the regularizing term in a lower-dimensional space, thus improving efficiency relative to TDPO.
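
To make the described loss structure concrete, here is a minimal sketch of an FPO-style objective: a SimPO-style length-normalized preference margin plus an MSE penalty between policy and cached reference SAE features. All tensor names and the beta/gamma/alpha hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the loss structure described above (not the authors' code).
# Assumes per-sequence log-probs and pooled SAE features are already computed.
import torch

def fpo_style_loss(
    logp_chosen: torch.Tensor,    # (B,) summed token log-probs for chosen responses
    logp_rejected: torch.Tensor,  # (B,) summed token log-probs for rejected responses
    len_chosen: torch.Tensor,     # (B,) response lengths
    len_rejected: torch.Tensor,   # (B,)
    feat_policy: torch.Tensor,    # (B, F) pooled SAE features of the policy model
    feat_ref: torch.Tensor,       # (B, F) cached SAE features of the reference model
    beta: float = 2.0,            # scale of the preference margin (SimPO-style)
    gamma: float = 0.5,           # target reward margin (SimPO-style)
    alpha: float = 0.1,           # weight of the feature-level constraint
) -> torch.Tensor:
    # SimPO-style length-normalized implicit reward difference.
    margin = beta * (logp_chosen / len_chosen - logp_rejected / len_rejected) - gamma
    pref_loss = -torch.nn.functional.logsigmoid(margin).mean()

    # Feature-level constraint: MSE between policy and (frozen, precomputed)
    # reference SAE features, in place of TDPO's token-level KL term.
    feat_constraint = torch.mean((feat_policy - feat_ref.detach()) ** 2)

    return pref_loss + alpha * feat_constraint
```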

Questions for Authors

See "Methods And Evaluation Criteria" and "Other Strengths And Weaknesses".

Claims and Evidence

The claims and evidence are generally clear.

Methods and Evaluation Criteria

Based on the evaluation, FPO's performance is better than or comparable to TDPO's, but at lower cost. Essentially, FPO makes two changes relative to TDPO:

(i) length normalization in the log-probability difference and in the margin; (ii) SAE feature regularization.

Ideally, it would be great to understand the roles these choices play separately. For example, if we use only (i) but still use token-level regularization, can we increase the performance relative to TDPO? Or is (i) only useful with (ii)? Or maybe it suffices to use only (ii) without (i)?

Theoretical Claims

The paper has no proofs for theoretical claims.

Experimental Design and Analysis

The experimental designs are in general sound. See my comment in the "Methods And Evaluation Criteria" for suggestions on ablating the design choice relative to TDPO.

Supplementary Material

I did not review the supplementary material.

Relation to Prior Literature

Based on the evaluation, FPO's performance is better than or comparable to TDPO's, but at lower cost. Essentially, FPO makes two changes relative to TDPO:

(i) length normalization in the log-probability difference and in the margin; (ii) SAE feature regularization.

The approach is motivated by the merits and shortcomings of SimPO and TDPO:

  • SimPO is computationally efficient as it does not use a reference model. However, it is sometimes unstable.
  • TDPO is numerically more stable than SimPO thanks to the token-level KL regularizer. However, it is computationally expensive, especially when the models have a large vocabulary size.

The proposed FPO can be seen as an extension of TDPO: it regularizes not in the high-dimensional token probability space but in the low-dimensional feature space.

Missing Important References

NA

Other Strengths and Weaknesses

The main strength of the paper, in my opinion, is that it demonstrates it's unnecessary to regularize an aligned model in token space. Regularization in feature space—specifically, using features from an SAE as the authors do—is more cost-effective and yields a similar effect.

To me, a shortcoming of the paper is that it doesn't adequately motivate why an SAE is a natural choice for producing the features to be regularized. SAEs are primarily used in interpretability studies, but their interpretability doesn't seem to be relevant to this regularization. Furthermore, the authors use pooled sparse activations in Equation (10), which are no longer sparse. It is therefore possible that the SAE's role might simply be to generate embeddings for regularization, and the specific model (SAE or otherwise) may not be crucial. A simple test is to replace the ReLU in the SAE with a PReLU; this results in a standard autoencoder for embedding token features and may work just as well. Another option would be to use the pooled activations directly---without the SAE---for regularization. Alternatively, one could simply minimize the weight difference between the unaligned and aligned models. I'm not suggesting the authors explore all these variations, as the possibilities are endless. I raise these examples because the SAE doesn't appear to be the most obvious choice, and it would be beneficial if the authors could clarify this and provide evidence supporting the advantage of using an SAE.

Other Comments or Suggestions

NA

Author Response

We greatly appreciate the reviewer's suggestions. We note that the reviewer has raised three valuable concerns, and we would like to address them.

Q1: To me, a shortcoming of the paper .... doesn't seem to be relevant to this regularization.

We acknowledge this gap in our original submission and address it here by providing both motivation and experimental evidence. SAEs decompose activations into sparse, interpretable features. Unlike traditional token-level constraints, which apply broad regularization, SAEs allow us to pinpoint and manipulate individual abilities by selecting target features and constraining them. In our FPO method, we leverage this by applying a weighting parameter, beta, to these features. A higher beta strengthens the constraint, reducing the model's ability in that domain, while a lower beta preserves or enhances it.
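
As a minimal sketch of what such feature-selective weighting could look like (assuming the constraint is an MSE on pooled SAE features with a per-feature weight; the feature indices and beta values are hypothetical, not the paper's implementation):

```python
# Illustrative sketch: per-feature constraint strength on pooled SAE features.
import torch

def selective_feature_constraint(
    feat_policy: torch.Tensor,   # (B, F) pooled SAE features from the policy model
    feat_ref: torch.Tensor,      # (B, F) cached reference SAE features
    target_idx: torch.Tensor,    # indices of domain features (e.g., "JSON formatting")
    beta_target: float = 1.0,    # larger value = stronger constraint on those features
    beta_default: float = 0.1,   # smaller value elsewhere = looser constraint
) -> torch.Tensor:
    weights = torch.full((feat_policy.shape[-1],), beta_default)
    weights[target_idx] = beta_target
    # Weighted MSE: selected features are pulled more strongly toward the reference.
    return torch.mean(weights * (feat_policy - feat_ref.detach()) ** 2)
```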

We conducted large-scale experiments using the Gemma2-2B model with four tasks: (1) Instruction Following on IFEval with tasks in JSON Format, Capitalize, Highlight Text, Lowercase, Bullet List, and Quotation. (2) Multilingual Capability. 1,000 entries from MultiAlpaca and WildChat, limited to French, Canadian French, German, and Italian, with English questions from MKQA. (3) Safety. 300 question-answer pairs from Jailbreak Bench and AdvBench, completed and verified with DeepseekV3. (4) Sentiment. 1,000 samples from Twitter Financial News Sentiment, split evenly (500 positive, 500 negative). For each domain, we identified globally activated features by averaging activations across all tokens (sparsity details in Q2). Result tables can be found here: https://github.com/FPO-code/FPO-code

SAEs provide strong and accurate control in specific domains, e.g., enhancing the model's ability to generate JSON output while discouraging it from using bullet lists (please refer to our experiment tables). We believe FPO is the first method to extend the alignment process to the feature level, which is an important contribution to the community.

Q2: Furthermore, the authors use pooled sparse activations in Equation (10), which are no longer sparse.

While the initial step of average pooling across tokens yields pooled activations that are no longer strictly sparse, the subsequent application of the top-k function restores a form of sparsity. The top-k operation is both meaningful and critical to our approach. It aligns with common practices in mechanistic interpretability, such as difference-in-means analysis, where the focus is on the most significant features. This is particularly effective because our alignment datasets exhibit global features—consistent patterns that persist across tokens, such as structured response formatting (e.g., JSON formatting) or safety-related behaviors (e.g., avoiding unsafe responses).

The top-k operation ensures that our regularization focuses on these dominant features, allowing us to control alignment constraints at a high level while maintaining computational efficiency. Without this step, the MSE would be computed over all components of $\bar{c}^\ell$, including noise from less significant activations, diluting the regularization's effectiveness.
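
As a rough illustration of the pooling and top-k step (a sketch under our own assumptions: per-token SAE activations of shape (T, F), with top-k taken on the reference side; the exact selection rule in Equation (10) may differ):

```python
# Sketch of pooling + top-k + MSE on SAE activations (illustrative only).
import torch

def topk_feature_mse(c_pi: torch.Tensor, c_ref: torch.Tensor, k: int = 64) -> torch.Tensor:
    # Average-pool sparse per-token activations over tokens; pooled vectors are denser.
    bar_pi, bar_ref = c_pi.mean(dim=0), c_ref.mean(dim=0)
    # Restore a form of sparsity: keep only the k most active reference features,
    # so the MSE focuses on dominant, globally consistent features.
    idx = torch.topk(bar_ref, k).indices
    return torch.mean((bar_pi[idx] - bar_ref[idx]) ** 2)
```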

Here we provide some examples of extracted features in Q1 selected with our top-k algorithm (refer to 1. Can SAEs Identify Relevant Features from These Datasets?): https://github.com/FPO-code/FPO_code/blob/main/accurate_control.md.

Q3: It is therefore possible that the SAE's role might simply be to generate embeddings for regularization, ... simply minimize the weight difference between the unaligned and aligned models.

We appreciate these insightful suggestions and agree that exploring alternative regularization methods is valuable. Indeed, approaches like PReLU-based autoencoders, pooled activations, or weight difference minimization could provide some regularization benefits. However, our choice of SAEs is not arbitrary—it is motivated by their unique ability to decompose model representations into monosemantic, interpretable features, enabling precise and targeted control over specific capabilities during alignment. As detailed in Q1, this granularity surpasses what token-level constraints or broader weight-based methods can achieve.

To test the reviewer’s hypothesis, we compared our SAE-based FPO method with an alternative: TDPO with activation-level alignment, where regularization is applied directly to pooled activations (akin to the second suggestion). We used the same experimental setup as in Q1 and evaluated performance across the four domains: Instruction Following, Multilingual Capability, Safety, and Sentiment. For TDPO, we pooled activations from the model’s last layer and applied a constraint to minimize their divergence during alignment, simulating an embedding-like regularization. Result tables are here: https://github.com/FPO-code/FPO_code/blob/main/accurate_control.md

Review
Rating: 3

This paper proposes a DPO variant that is both well-performing and computationally efficient. They replace the expensive per-token KL-divergence regularization w/ an SAE-based regularization -- hinging the efficiency on the SAE sparsity.

Questions for Authors

See above.

Claims and Evidence

Somewhat -- the theory connecting the FPO loss (theirs) w/ the losses of previous DPO variants is convincing (how their loss makes training more stable and efficient). However, the empirical results are a bit lacking for the following reasons:

  • Only 2 models are used, both from the same family (Gemma-2)
  • The efficiency claim is somewhat questionable -- SAE training is not considered in the Efficiency experiment (5.1)

Methods and Evaluation Criteria

Yes, for the most part. Again, I am not convinced by the efficiency evaluation -- the SAE training is not included.

Theoretical Claims

Yes, the relationship between the FPO loss and those of DPO and other variants. Looks correct to me.

Experimental Design and Analysis

Yes, everything. As previously mentioned, I have 2 big issues w/ the experimental design (see above).

Supplementary Material

No

Relation to Prior Literature

They propose a DPO loss variant that is stable in training, computationally efficient, and high performing. Previous work in the literature only satisfies two out of the three.

Missing Important References

N/A

Other Strengths and Weaknesses

My biggest worries are the two issues stated before:

  1. Without including SAE training in the computational-efficiency evaluation, it is still questionable whether this method is really more computationally efficient than the baselines. Without the computational-efficiency edge, the claim is questionable.
  2. Using more model families is needed, to show that this method works well for other models as well.

Other Comments or Suggestions

I would be interested to see whether the use of an SAE can enable even more precise control over LLMs, such as increasing only certain traits.

Author Response

Q1: Only 2 models are used, both from the same family. We apologize for the lack of evaluation across other model families. During our experiments with FPO, the Gemma model series was the only one with a relatively complete set of SAE models. Consequently, we limited our training and testing to the Gemma models. Following the release of LLaMA-scope for LLaMA-3 8B, we conducted subsequent experiments. The following details outline our experimental setup, including baseline selection, benchmark selection, experimental parameters, and results: https://github.com/FPO-code/FPO_code/blob/main/LLaMA_Experiments.md.

Q2: Without including SAE training in the computational-efficiency evaluation, it is still questionable whether this method is really more computationally efficient than the baselines. Without the computational-efficiency edge, the claim is questionable.

We maintain that our methods offer significant efficiency advantages. We present two primary lines of evidence:

(1) SAEs are pre-trained on extensive datasets and can be reused across multiple downstream tasks (and can generalize across base and chat models). Therefore, their training cost can be considered a one-time investment rather than a recurring expense factored into each task's efficiency evaluation. This approach aligns with common practices in machine learning, such as leveraging pre-trained word embeddings without recalculating their training cost for every application. In such scenarios, efficiency assessments typically focus on the downstream task (e.g., alignment), as pre-training is treated as a standardized preliminary step. We also provide a detailed analysis of the wall-clock time and computational costs associated with the combined SAE training and alignment process.

(2) To further enhance efficiency, we have introduced a Multi-Layer Sparse Autoencoder (MLSAE) approach, which not only outperforms the original single-layer SAE method but also reduces training costs. By capturing cross-layer global features, MLSAE eliminates the need for extensive layer-specific searches, streamlining the process while improving alignment accuracy. Our updated efficiency analysis, now incorporating MLSAE, demonstrates lower computational overhead compared to baselines like DPO and TDPO-2, even when accounting for pre-training. These results reinforce our claim of computational efficiency, alongside the method’s superior performance, as evidenced in our revised experiments on Gemma and LLaMA models. Experimental details can be found at https://github.com/FPO-code/FPO_code/blob/main/LLaMA_Experiments.md.
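
For intuition, here is a rough sketch of what a cross-layer feature constraint could look like, assuming a single shared MLSAE encoder applied to each layer's pooled residual stream with the per-layer feature MSEs averaged; the authors' actual MLSAE may be architected and combined differently.

```python
# Rough sketch of a cross-layer feature constraint (assumption: mlsae_encode is a
# shared encoder mapping pooled residual activations to sparse features).
import torch

def cross_layer_constraint(
    resid_policy: list[torch.Tensor],  # per-layer pooled residual activations, each (B, D)
    resid_ref: list[torch.Tensor],     # same shapes, from the cached reference model
    mlsae_encode,                      # callable: (B, D) -> (B, F) sparse features
) -> torch.Tensor:
    # Constrain features layer by layer, then average across layers.
    losses = [
        torch.mean((mlsae_encode(p) - mlsae_encode(r).detach()) ** 2)
        for p, r in zip(resid_policy, resid_ref)
    ]
    return torch.stack(losses).mean()
```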

Q3: I would be interested to see if the use of SAE can enable even a more precise control over the LLMs, like increasing only certain traits?

Our work leverages SAEs to decompose model representations into features, offering a level of granularity that token-level methods lack. By applying MSE constraints to specific features, e.g., features controlling JSON-format output, FPO enables precise manipulation of model alignment. To substantiate this, we present experiments that accurately regulate specific capabilities during alignment, leaving others unaffected while achieving strong overall performance. We assessed four critical domains: Instruction Following with IFEval, Multilingual Capability with MultiAlpaca and WildChat, Safety with Jailbreak Bench and AdvBench, and Sentiment with Twitter Financial News Sentiment. Detailed experimental settings and results are provided: https://github.com/FPO-code/FPO_code/blob/main/accurate_control.md

Reviewer Comment

Thank you for the rebuttal! I have some extra questions. Q1: Thank you for providing these new results on Llama-3B

Q2: Your answer to Q1 kind of proves my original point in Q2: you can't really run FPO without having a pretrained SAE for the model you want to run it on. For "leveraging pre-trained word embeddings without recalculating their training cost for every application" => in most cases, we can use any pre-trained word embedding for downstream tasks, independent of the model we choose in the downstream task. But this is not the case for FPO, right? We need a specialized SAE for each model we want to run it on?

Q3: Can you explain how to read and interpret the new results? What's W/WO? What's IFEval Acc? Why is a lower number better? What is Output Rate % and how should the Ratio be interpreted?

Author Comment

Q2: We can use any pre-trained word embedding for downstream tasks, independent of the model we choose in the downstream task. But this is not the case for FPO, right? We need a specialized SAE for each model we want to run it on?

We acknowledge that SAEs indeed require independent training for a specific model. In our experiments, we did not pretrain any SAEs, but for models that are not widely used, pretraining an SAE may be necessary. However, we argue that the training overhead introduced by SAEs should not be considered a significant efficiency concern, for three compelling reasons:

  1. A wide array of pre-trained SAEs is already available, covering popular models such as Mistral, LLaMA, Gemma, and Qwen. These model families encompass the majority of architectures commonly fine-tuned by the research community. Consequently, when transitioning from other alignment methods to FPO, there is often no need to train an SAE, as existing pre-trained SAEs can be directly utilized. Ref: https://github.com/jbloomAus/SAELens
  2. SAEs can be shared between base and chat versions of a model. This reusability further reduces the need for redundant training.
  3. Even in cases where retraining an SAE is required, the process is highly manageable. Since the model's weights are frozen during SAE training, we can sample the model's outputs offline and train the SAE independently (see the sketch after this list). For instance, training an MLSAE on a model as large as LLaMA-70B can be accomplished with just two H100 GPUs using tensor parallelism. As model scaling continues, the computational cost of training SAEs becomes increasingly negligible, making it a minor consideration in the overall pipeline.
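
As a minimal illustration of this offline sampling (our own sketch with illustrative names, not the authors' pipeline; we interpret "outputs" as hidden-state activations that an SAE would be trained on):

```python
# Sketch: cache a frozen model's hidden-state activations to disk for later SAE training.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def cache_activations(model_name: str, texts: list[str], layer: int, out_path: str):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()  # weights frozen
    cached = []
    for text in texts:
        inputs = tok(text, return_tensors="pt")
        out = model(**inputs, output_hidden_states=True)
        # hidden_states[layer] has shape (1, T, D); store the per-token activations.
        cached.append(out.hidden_states[layer].squeeze(0).cpu())
    torch.save(torch.cat(cached, dim=0), out_path)  # (sum_T, D) activation matrix
```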

Q3: Can you explain how to read and interpret the new results? What's W/WO? What's IFEval Acc? Why is a lower number better? What is Output Rate % and how should the Ratio be interpreted?

The motivation of our new results: SAEs enable precise decomposition of model representations into monosemantic features, offering a granular level of control over alignment that token-level methods cannot achieve. This feature-level approach is critical for our FPO method, as it allows us to regulate specific model capabilities—such as instruction-following, multilingual proficiency, safety, or sentiment—by adjusting constraints on identified features. For example, SAEs can isolate features like "formatting text into JSON" or "generating fluent French responses", enabling targeted fine-tuning during alignment.

What's W/WO?

With (W): The model is provided with text instructions specifying the format of the output.

Without (WO): The model is not given any text instructions about the format and must infer the task requirements solely from the input.

What's IFEval Acc?

IFEval Accuracy refers to the accuracy metric used to evaluate the model’s performance on the IFEval dataset in the Instruction Following domain. The IFEval dataset tests a model’s ability to follow specific formatting instructions, such as "Format text into JSON structure," "Capitalize specified words," or "Structure text into bullet points." We measure accuracy as the percentage of outputs that correctly adhere to the specified format.

We use beta to control specific instruction following ability. For instance, in the table, FPO (beta=0) achieves 0.46 (46%) accuracy on "JSON Format (W)," matching the best baseline (TDPO at 0.46). When beta=1 on all format-related features, the accuracy drops to 0.03 (3%), demonstrating that FPO successfully suppresses the model’s ability to perform these tasks by applying strong constraints.

Why is a lower number better?

Attack Success Rate measures the percentage of instances where the model generates an unsafe response when subjected to adversarial prompts. A lower ASR means the model is more resistant to generating unsafe outputs, indicating better safety performance. By controlling beta on safety-related / harmful features, we can achieve a fully-aligned model (ASR 5%) or an uncensored model (ASR 80%).

What is Output Rate % and how should the Ratio be interpreted?

The Output Rate % indicates the percentage of responses correctly generated in the specified target language. When high-beta constraints are applied to French-related features, the FPO-aligned model exhibits a significantly reduced Output Rate for French, even when prompted with French questions.

Summary

We sincerely thank the reviewer for their quick response and valuable feedback. We recognize some weaknesses in our paper and will address them in future revisions. Still, we hope for the reviewer’s support in helping our paper get accepted. This paper introduces a new method for aligning and constraining LLMs at the feature level using SAEs. Our approach is unique compared to existing methods. We believe this work can inspire further research in the community. We hope the reviewer will see the innovation in our study and support its progress.

Review
Rating: 3

The paper proposes Feature-level constrained Preference Optimization (FPO), a novel method designed to improve the alignment of LLMs. FPO utilizes sparse features activated in a trained sparse autoencoder and employs feature-level offline reference to maintain the quality of sequential KL divergence. Experiment results demonstrate that FPO achieves a 5% absolute improvement in win rate compared to baselines while requiring lower computational resources.

Questions for Authors

Please refer to the weaknesses.

Claims and Evidence

The claims are generally fine.

Methods and Evaluation Criteria

The logical flow of the method looks good to me. For the concerns, please refer to Strengths and Weaknesses.

Theoretical Claims

I checked the proof in Appendix B.

Experimental Design and Analysis

For the experimental design, I have the following concerns:

The experiments are only conducted on Gemma models. What about other open-source models like Llama or Mistral?

Supplementary Material

I didn’t find the supplementary materials.

Relation to Prior Literature

FPO combines existing alignment methods including DPO, SimPO, and TDPO. Instead of using KL-divergence constraints, FPO uses SAEs to project the LLM's internal representations onto a sparse feature space to obtain feature vectors. FPO then uses MSE constraints on these feature vectors to replace the KL-divergence constraints in TDPO.

Missing Important References

The paper doesn’t have a related work section and doesn’t offer a sufficiently comprehensive overview of key prior work in LLM alignments. For example, [1], [2], [3], [4], [5] and [6].

[1] Azar, Mohammad Gheshlaghi, et al. "A general theoretical paradigm to understand learning from human preferences." International Conference on Artificial Intelligence and Statistics. PMLR, 2024.

[2] Rafailov, Rafael, et al. "From $r$ to $Q^*$: Your language model is secretly a Q-function." arXiv preprint arXiv:2404.12358 (2024).

[3] Kong, Lingkai, et al. "Aligning large language models with representation editing: A control perspective." Advances in Neural Information Processing Systems 37 (2024): 37356-37384.

[4] Ji, Xiang, et al. "Self-play with adversarial critic: Provable and scalable offline alignment for language models." arXiv preprint arXiv:2406.04274 (2024).

[5] Cen, Shicong, et al. "Value-incentivized preference optimization: A unified approach to online and offline rlhf." arXiv preprint arXiv:2405.19320 (2024).

[6] Richemond, Pierre Harvey, et al. "Offline regularised reinforcement learning for large language models alignment." arXiv preprint arXiv:2405.19107 (2024).

Other Strengths and Weaknesses

Strengths:

  1. The paper is well-written and easy to follow.
  2. FPO can reduce GPU memory consumption compared to baseline alignment algorithms.
  3. By using an SAE, FPO provides a potentially more interpretable and computationally efficient alternative to token-level alignment methods.

Weaknesses:

  1. According to Table 1 (left), the performance gains of FPO appear marginal. For example, on the Gemma-2-2B model evaluated with AlpacaEval-2 (805 questions), the method surpasses DPO by only 13 or 14 questions and surpasses SimPO by 8 or 9 questions. On the Gemma-2-9B model, the performance gap becomes even smaller, improving by only 3 or 4 over SimPO. In Table 1 (right), TDPO-2 achieves better performance than FPO.
  2. Although FPO can reduce GPU memory consumption, there is no theoretical guarantee that the proposed MSE constraint is better than the KL-divergence constraint.
  3. The hyperparameters in FPO include the SAE layer, the hyperparameter $\alpha$, and the stop-gradient operator. The hyperparameter search process is time-consuming and challenging.
  4. The example code is not provided, which limits the reviewers' ability to verify reproducibility.

Other Comments or Suggestions

No other comments.

Author Response

Q1: The paper doesn’t have a related work section and doesn’t offer a sufficiently comprehensive overview of key prior work in LLM alignments

We are sorry for omitting the related work section due to the page limit. Since our work lies at the intersection of alignment and mechanistic interpretability, we would like to provide a series of important related works from these two fields. We will add a related work section to the paper in subsequent versions.

Q2: According to Table 1 (left), the performance gains of FPO appear marginal. For example, on the Gemma-2-2B model evaluated with AlpacaEval-2 (805 questions), the method surpasses DPO by only 13 or 14 questions and surpasses SimPO by 8 or 9 questions. On the Gemma-2-9B model, the performance gap becomes even smaller, improving by only 3 or 4 over SimPO. In Table 1 (right), TDPO-2 achieves better performance than FPO.

We thank the reviewer for the observation regarding the seemingly marginal performance gains of FPO in Table 1 and the comparison with TDPO-2. We acknowledge that the initial results on Gemma showed modest improvements. To address this, we extended our experiments to include LLaMA models and tested an enhanced approach using Multi-Layer Sparse Autoencoders (MLSAE) to extract cross-layer global features, which yielded superior performance. MLSAE extracts features from the residual streams of all layers and thus makes the constraint deeper. We also updated the efficiency tests for this new method and show that MLSAE does not increase the wall-clock computational cost. Results can be found at: https://github.com/FPO-code/FPO_code/blob/main/LLaMA_Experiments.md

Also, we present experiments on the accuracy of FPO in controlling model capabilities. Our work leverages SAEs to decompose model representations into features, offering a level of granularity that token-level methods lack. By applying constraints to specific features, e.g., features controlling JSON-format output, FPO enables precise manipulation of model alignment, which goes beyond merely improved performance. The experimental details and results can be found at: https://github.com/FPO-code/FPO_code/blob/main/accurate_control.md

Q3: Although FPO can reduce GPU memory consumption, there is no theoretical guarantee that the proposed MSE constraint is better than the KL-divergence constraint.

We provide a bound on the KL divergence in terms of the MSE of sparse activations in Appendix B. Here we also give additional broad experimental verification of the comparison between MSE constraints and the original KL-divergence constraints: https://github.com/FPO-code/FPO_code/blob/main/mse_vs_kl.md.
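
For intuition, one standard route from an MSE-type quantity to a KL upper bound is sketched below; this is only an illustrative sketch under our own assumptions (p and q are probability vectors over d components with q_i ≥ ε > 0), and the actual argument in Appendix B may differ.

```latex
% Illustrative sketch (not necessarily the Appendix B argument), with
% MSE(p,q) := \frac{1}{d}\|p-q\|_2^2 and q_i \ge \varepsilon > 0 for all i.
\begin{align*}
\mathrm{KL}(p\,\|\,q)
  &= \sum_i p_i \log\frac{p_i}{q_i}
   \;\le\; \sum_i p_i\!\left(\frac{p_i}{q_i}-1\right)  && (\log x \le x-1)\\
  &= \sum_i \frac{(p_i-q_i)^2}{q_i}
   \;\le\; \frac{1}{\varepsilon}\,\|p-q\|_2^2
   \;=\; \frac{d}{\varepsilon}\,\mathrm{MSE}(p,q).
\end{align*}
```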

Q4: The hyperparameters in FPO include the SAE layer, the hyperparameter $\alpha$, and the stop-gradient operator. The hyperparameter search process is time-consuming and challenging.

Regarding the SAE layer search, we have updated our approach with an MLSAE, which achieves superior performance while eliminating the need to identify a specific layer for alignment (refer to Q2). This enhancement streamlines the process and boosts effectiveness. As for the stop-gradient operator and hyperparameter settings, we conducted a preliminary search and confirmed that the stop-gradient operator remains essential across different models, consistent with TDPO's findings. Consequently, we believe the cost of hyperparameter tuning is less burdensome than the reviewer might assume, as our initial exploration efficiently validates these key components.

Q5: The example code is not provided, which limits the reviewers' ability to verify reproducibility.

We thank the reviewer for the feedback regarding transparency and reproducibility. Here we provide the code and additional experimental settings and results: https://github.com/FPO-code/FPO_code.

Reviewer Comment

Thank you for your rebuttal. I believe the new experiments are very important to the paper. Please make sure that the updated results are included in the next version of the paper. I will increase my score.

Review
Rating: 3

This work proposes a feature-level direct preference optimization algorithm, FPO, for LLM preference learning. Specifically, it revises the token-level KL regularization of TDPO into a feature-level KL regularization, where the features are obtained from a pre-trained SAE.

The reference features and tokens are both precomputed and stored offline to reduce the memory footprint.

The results demonstrate that FPO can achieve performance comparable to TDPO2 with lower memory usage.

Questions for Authors

Please see Other Strengths And Weaknesses for questions.

Claims and Evidence

Why is the title "Accurately"? The key advantages of this work are its efficiency and reduced memory usage.

Methods and Evaluation Criteria

Please see Other Strengths And Weaknesses.

Theoretical Claims

The FPO feature distance bounds the feature-level KLD.

Experimental Design and Analysis

  • Need to present ablation on the revised loss components.
  • Others please see Other Strengths And Weaknesses.

Supplementary Material

All.

Relation to Prior Literature

This proposed approach can help reduce the memory footprint while maintaining equivalent alignment performance.

Missing Important References

None.

Other Strengths and Weaknesses

Strengths:

  • The writing is clear and easy to follow.
  • The proposed method is novel and neat. FPO can significantly reduce computational overhead and memory footprint.

Weaknesses:

  • Why is there no related work section?
  • The performance (alignment accuracy and diversity) of the proposed method shows almost no difference from TDPO-2. Technically, the change versus TDPO includes: (1) length normalization, and (2) computing a compressed feature-level KL (at the last layer) instead of a token-level KL. However, the ablation experiments do not decompose these two factors for analysis. How does each factor contribute to the performance?
  • The FPO results in Table 2 are not presented. I cannot see the effectiveness of the proposed method. From the descriptions, FPO seemingly cannot scale to larger models.
  • Figure 1 presents ambiguous visual information. It is difficult for readers to get the key idea of the paper by viewing this figure with its caption. I would suggest the authors refine this figure and expand the caption.
  • Lacks some implementation details. Is the pre-trained SAE fixed? If the pre-trained SAE is fixed, will the distribution shift of a layer's output features (the SAE's input) cause higher reconstruction error over the course of training?
  • It may seem odd that the authors claim alignment stability throughout while not including any quantitative comparisons. It would be better to include (in the appendix) statistics on token log-probability differences as [1] does. This could help readers realize the significance of this problem. How stable are TDPO, SimPO, and FPO, and what are the tradeoffs?

[1] Understanding Reference Policies in Direct Preference Optimization

Questions:

  • What motivates you to change the base model in the SimPO evaluation setting, considering you inherit most of it? If there is no special reason, could you please provide results on one arbitrary base model in that setting? It would help readers get a sense of the advantage of the FPO method through comprehensive comparisons and of its generalizability to various architectures.
  • In Figure 2 left, is this an example instance or statistics? What is the analysis setting?
  • In Fig. 3, how is the KL of the different methods computed? Since FPO's regularization is a feature-level KL, is the comparison at the feature level or the token level?
  • What does uniqueness mean in line 84?
  • What is the wall-clock computational time of FPO vs TDPO? I think the data loading process will also reduce the advantage of FPO.

Other Comments or Suggestions

None.

Author Response

Q1: Why is the title "Accurately"? The key advantages of this work are its efficiency and reduced memory usage.

We appreciate the reviewer’s question regarding the title’s use of “Accurately” and their observation that the work’s key advantages seem to lie in efficiency and reduced memory usage. While efficiency and memory benefits are indeed significant outcomes of FPO, we argue that “Accurately” reflects the precision in controlling model capabilities. Our work leverages SAEs to decompose model representations into features, offering a level of granularity that token-level methods lack. By applying MSE constraints to specific features e.g., features that controlling the json format output, FPO enables precise manipulation of model alignment. To substantiate this, we present experiments to accurately regulate specific capabilities during alignment, leaving others unaffected while achieving strong overall performance. We assessed four critical domains: Instruction Following with IFEval, Multilingual Capability with MultiAlpaca and WildChat, Safety with Jailbreak Bench and AdvBench, and Sentiment with Twitter Financial News Sentiment. Detailed experimental settings and results are provided: https://github.com/FPO-code/FPO_code/blob/main/accurate_control.md

Q2: Why is there no related work section?

We are sorry for omitting the related work section due to the page limit. Since our work lies at the intersection of alignment and mechanistic interpretability, we would like to provide a series of important related works from these two fields. We will add a related work section to the paper in subsequent versions.

Q3: The performance ... versus TDPO includes: (1) length normalization, and (2) computing compressed feature-level KL (at the last layer) instead of token-level KL ... contribute to the performance?

We analyze these two factors with a new ablation study in https://github.com/FPO-code/FPO_code/blob/main/tdpo_vs_fpo.md. The results show that (1) FPO performs better with length normalization and (2) computing the compressed feature-level KL has no obvious effect on downstream performance (but is more efficient).

Q4: The FPO results in Table 2 are not presented. I cannot see the effectiveness of the proposed method. From the descriptions, FPO seemingly cannot scale to larger models.

ΔScores in Table 2 are reported as the margin between FPO and the other methods (as described in Line 296, Evaluation Benchmarks). As for scaling results, we provide FPO results on different architectures (LLaMA, Gemma) and parameter scales from 2B to 9B: https://github.com/FPO-code/FPO_code/blob/main/LLaMA_Experiments.md

Q5: Figure 1 presents ambiguous visual information ... refine this figure and amplify the caption.

We thank the reviewer for the valuable feedback; we will refine this figure in the next version.

Q6: Lacks some implementational details. Is the pre-trained SAE is fixed? ... cause higher reconstruction error along the training?

SAEs are commonly robust under alignment, considering that the model architecture is fixed and the amount of alignment data is much smaller than that used in pretraining. We test the reconstruction error to verify this: https://github.com/FPO-code/FPO_code/blob/main/reconstruction_error.md.

Q7: It may seem weird that the author claims the alignment stability ... How stable are TDPO, SimPO, and FPO which tradeoffs?

We thank the reviewer for the suggestion regarding alignment stability. Here we perform a comparison of token log-probability differences: https://github.com/FPO-code/FPO_code/tree/main/FPO-code-main. We will add this to the appendix.

Q8: What motivates you to change the base model in SimPO evaluation setting ... on one arbitrary base model in that setting?

We changed the SimPO evaluation settings because we found the current settings perform better than those proposed in the SimPO paper. Here is the comparison: https://github.com/FPO-code/FPO_code/blob/main/simpo.md.

Q9: In Figure 2 left, is this an example instance or statistics? What is the analysis setting?

These are globally activated feature statistics over 1,024 examples from the UltraFeedback dataset, illustrating the sparsity of SAE features. We use the Gemma2-2B SAE with width 16k.

Q10: In Fig. 3, how KL of different methods are computed? Since FPO’s regularization is feature-level KL, is it compared at the feature or token level KL?

Feature-level KL is a more accurate and efficient approach compared with token-level KL. Here we provide a detailed comparison: https://github.com/FPO-code/FPO_code/blob/main/tdpo_vs_fpo.md.

Q11: What does uniqueness mean in line 84?

This means that FPO offers specific control over the alignment process (refer to Q1).

Q12: What is the wall-clock computational time of FPO vs TDPO?

Data loading does take more time. However, the efficiency gained in training saves more time (and memory). Details can be found at: https://github.com/FPO-code/FPO_code/blob/main/time.md.

Final Decision

This paper proposes Feature-level constrained Preference Optimization (FPO), a method for aligning large language models using sparse autoencoders (SAEs) to enforce alignment constraints in a lower-dimensional feature space. FPO aims to improve computational efficiency and stability by replacing token-level KL divergence with feature-level mean-squared error constraints derived from pretrained SAEs. The authors demonstrate that FPO achieves competitive alignment performance with lower memory overhead and add empirical results showing the method’s ability to selectively control model capabilities.

This paper presents a novel and promising direction for more interpretable and efficient LLM alignment through feature-level regularization. While not without weaknesses, the core idea is sound, the empirical results are encouraging, and the rebuttal demonstrates a strong commitment to addressing reviewer concerns. The ability to constrain specific capabilities via sparse features is especially compelling and opens up new avenues in fine-grained alignment. Given the positive trajectory of the discussion and the potential impact of the approach, the paper is recommended for acceptance once the authors incorporate all the reviewers' feedback in the final version.