PaperHub
ICML 2025 · Poster · 4 reviewers
Score: 2.7/10 · Reviewer ratings: 1, 1, 2, 3 (min 1, max 3, std 0.8)

Latent Preference Coding: Aligning Large Language Models via Discrete Latent Codes

Submitted: 2025-01-20 · Updated: 2025-07-24

Abstract

Keywords
large language model, preference modeling

Reviews and Discussion

Official Review (Rating: 1)

Authors claim that human preferences on LLM responses consist of multiple latent factors. They also claim these factors can vary across tasks. To capture such factors, authors propose a discrete latent variable model for preference learning. They employ standard methods for discrete latent variable modeling, such as Gumbel-softmax and variational inference. This approach is applied to DPO and its variants IPO and SimPO. Evaluations are conducted with Mistral-7B, Llama3-8B, and Llama3-8B-Instruct base models on Arc-Challenge, Arc-Easy, GSM-8K, TruthfulQA, and AlpacaEval 2.
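To make the summary concrete, here is a minimal sketch of the standard machinery it names (Gumbel-softmax sampling of a discrete code from a learnable codebook); the dimensions, pooling, and names are assumptions for illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentCodebook(nn.Module):
    """Toy sketch of a discrete latent code drawn from a learnable codebook."""

    def __init__(self, num_codes: int = 64, code_dim: int = 4096):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)  # trained jointly
        self.prior_head = nn.Linear(code_dim, num_codes)   # logits of p(z | x)

    def forward(self, prompt_hidden: torch.Tensor, tau: float = 1.0):
        # prompt_hidden: (batch, code_dim), e.g. a pooled prompt hidden state.
        logits = self.prior_head(prompt_hidden)
        # Straight-through Gumbel-softmax keeps the discrete sample differentiable.
        z_onehot = F.gumbel_softmax(logits, tau=tau, hard=True)
        z_embed = z_onehot @ self.codebook.weight           # (batch, code_dim)
        return z_embed, logits
```

In LPC, an embedding like `z_embed` would presumably condition the policy's likelihoods inside the DPO/IPO/SimPO loss.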

Update after rebuttal

My major concerns from the review are largely unaddressed in the rebuttal, as discussed in my reply to the rebuttal. Hence, I will maintain my score.

Questions For Authors

What was the size of the codebook used in most experiments? In Section 4.3, experiments on the codebook size were run, but in Sections 4.1 and 4.2 I don't think this crucial information was provided.

Claims And Evidence

One of the authors' major claims is that the factors in preference vary across tasks (line 055), but this is not actually analyzed in depth. Figure 2 (right) is related but does not convincingly show that the method captures them. It only suggests that different sources of data cluster together, but the clustering looks weak, and the source of data tells very little about the task type; prompts in a single source, for example FLAN v2, cover diverse tasks.

Authors also claim that their method captures multiple factors involved in human preference. However, it is unclear how well the proposed method actually captures these factors. The major evidence used to support this is Table 1. However, improvements are mostly on GSM-8K, especially with the Mistral-7B model, which performs very weakly on the benchmark. Improvements on other base models and other tasks tend to be minimal, less than 1 percentage point in most instances.

It's also worth noting that there are prior papers aiming to capture factors in human preference. In particular, HelpSteer 2 https://arxiv.org/abs/2406.08673 provides annotations on multiple factors, and the dataset has been used very actively in the preference-modeling literature. If the latent codes from the paper can be validated against human annotations like HelpSteer2 or models based on such annotations (for example Nemotron-RM https://arxiv.org/abs/2406.11704), it would greatly strengthen the claim of the paper. Alternatively, the authors could have conducted a qualitative analysis of the latent codes: annotating what each latent code represents, or how different types of prompts activate different latent codes.
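One hedged way to operationalize the suggested validation (a sketch only; the rows and scores below are synthetic, and a real check would populate them from the trained prior/posterior network plus HelpSteer2's attributes of helpfulness, correctness, coherence, complexity, and verbosity):

```python
import pandas as pd

# Synthetic illustration only: each row pairs a prompt/response's inferred
# latent code with HelpSteer2-style attribute scores.
rows = [
    {"code": 3, "helpfulness": 4, "correctness": 4, "coherence": 4, "complexity": 2, "verbosity": 2},
    {"code": 3, "helpfulness": 3, "correctness": 4, "coherence": 3, "complexity": 2, "verbosity": 1},
    {"code": 7, "helpfulness": 2, "correctness": 1, "coherence": 3, "complexity": 3, "verbosity": 4},
]
df = pd.DataFrame(rows)

# If latent codes capture distinct preference factors, their mean attribute
# profiles should differ noticeably across codes.
print(df.groupby("code").mean())
```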

Methods And Evaluation Criteria

Regarding the method: the authors employ standard techniques for discrete latent variable models. The method of representing prompts and responses using hidden states is also standard. Hence, overall, the authors use principled techniques which have good empirical support from the literature.

The set of benchmarks is OK but not ideal. https://arxiv.org/abs/2406.09279 can be a good reference for the selection of benchmarks. GSM8K and TruthfulQA make sense, but usually DPO-based alignment methods show stronger benefit on the AlpacaEval, MT-Bench, and IFEval benchmarks. The authors do analyze AlpacaEval improvements in Table 4, and this is discussed in the 'Experimental Designs Or Analyses' section.

In Table 3, preference accuracy is used to evaluate models, but this is not the standard method of evaluating alignment. DPO and IPO require base model predictions to compute the preference, and these preference predictions are not necessarily well correlated with the actual generation performance. Previous works (again https://arxiv.org/abs/2406.09279 is a good example) mostly rely on benchmarks which evaluate generation results. For truthfulness, for example, it would be better to evaluate on XSTest.

Theoretical Claims

I took a brief look at the derivation of the variational inference procedure in Appendix A, but it looked standard and intuitively made sense, so I didn't check the correctness line by line.

Experimental Designs Or Analyses

The authors do analyze AlpacaEval improvements in Table 4, but here the improvement with LPC is mostly with SimPO, and its raw win rate is very low. Also, these experiments are based on the Llama3-8B-Instruct model, and per the public leaderboard, the base model's win rate is 22.6%, which is much higher than the numbers reported in Table 4. Hence, all these methods are actually degrading model performance on AlpacaEval 2.0, and the base model performance wasn't reported for this particular experiment.

Figure 2 was discussed in the 'Claims And Evidence' section.

The robustness analysis in line 408 is a clever way of demonstrating how latent factors are captured.

Supplementary Material

I checked Appendix A for the derivation of the variational inference objective and the additional experimental results (Tables 5, 6, and 7).

Relation To Broader Scientific Literature

Latent coding is a well-established topic of research in machine learning. Making a solid connection between latent coding and LLM alignment will open up a lot of opportunities for both topics to benefit from methods and data from each other.

Essential References Not Discussed

Efforts to annotate multiple factors of preference, such as HelpSteer 1/2 https://arxiv.org/abs/2406.08673, and the effort to automatically learn the weighting (challenge 3 mentioned in line 063), ArmoRM https://arxiv.org/abs/2406.12845, should be referenced.

Other Strengths And Weaknesses

Most strengths and weaknesses are already discussed. The paper is clearly written, and easy to follow.

Other Comments Or Suggestions

There is too much introduction and literature review. The proposed method only starts being discussed on page 4.5, which is too late. The authors could have cut down on the literature review. In particular, Section 3.1 could focus just on DPO, since the authors only experiment with DPO variants and do not use online RL methods like PPO.

Author Response

Thanks for your review.

Regarding AlpacaEval results

...but usually DPO-based alignment methods show stronger benefit on AlpacaEval, MT-Bench and IFEval benchmarks... ...Hence, all these methods are actually degrading model performance on AlpacaEval 2.0, and the base model performance wasn't reported for this particular experiment.

In our experiments, we find that offline alignment algorithms do not always boost performance on these benchmarks. This finding is consistent with https://arxiv.org/abs/2406.08673

Regarding preference accuracy

In Table 3, preference accuracy is used to evaluate models, but this is not the standard method of evaluating alignment. DPO and IPO require base model predictions to compute the preference, and these preference predictions are not necessarily well-correlated with the actual generation performance

Thanks for your suggestion. We use preference accuracy as a metric mainly because of its computational efficiency. Though preference accuracy is not a direct measurement of generation quality, it can serve as an indicator of how well the model can distinguish good responses from bad ones. In the main experiments, we report results on downstream benchmarks. We will add more comprehensive results to the revision.
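For reference, a minimal sketch of what "preference accuracy" usually means in this setting, assuming a DPO-style implicit reward (the function name and toy numbers are illustrative, not the authors' code):

```python
import torch

def preference_accuracy(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Fraction of pairs where the implicit reward prefers the chosen response.

    Arguments are 1-D tensors of summed token log-probabilities for the
    chosen (w) and rejected (l) responses under the policy and the reference.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return (margin > 0).float().mean().item()

# Toy numbers: the first pair is ranked correctly, the second is not.
acc = preference_accuracy(torch.tensor([-12.0, -30.0]),
                          torch.tensor([-15.0, -25.0]),
                          torch.tensor([-13.0, -28.0]),
                          torch.tensor([-14.0, -27.0]))
print(acc)  # 0.5
```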

Regarding HelpSteer2 results

We have conducted experiments on the HelpSteer2 dataset following the same setting as https://arxiv.org/abs/2406.08673. In this setting, the preference dataset is constructed by taking the response with the higher helpfulness score as the chosen response, with the remaining response being the rejected response. Results on Llama3-8B are shown below.

| Model | Arc-Challenge | Arc-Easy | GSM-8K | TruthfulQA |
|---|---|---|---|---|
| Base | 50.43 | 80.05 | 49.51 | 43.82 |
| DPO | 52.53 | 82.13 | 53.43 | 46.16 |
| DPO+LPC | 53.10 | 81.85 | 55.01 | 46.93 |

As can be seen, LPC improves performance on most of the benchmarks. We will add more comprehensive results to the revision.

If latent codes from the paper can be validated with human annotations like HelpSteer2 or models based on such annotations (for example Nemotron-RM https://arxiv.org/abs/2406.11704), it would much strengthen the claim of the paper.

However, we want to clarify that the learned latent codes do not necessarily align well with human-annotated multi-view scores. This is because 1) latent codes are predicted based on the prompt, but human annotations are mostly scoring the response; 2) there is a gap between model-learned latent information and human-annotated explainable factors.

Regarding qualitative analysis of latent codes

...authors could've conducted qualitative analysis of latent codes. They could've annotated what each latent code represents, or how different types of prompts activate different latent codes.

We have already conducted a qualitative analysis of latent codes in Section 4.3, which shows how prompts from different sources activate different latent codes. For example, prompts from the Flan subset are usually more reasoning-intensive, and prompts from the truthful_qa and false_qa subsets are related to factual correctness. We will add more details in the revision.

Regarding codebook size

What was the size of the codebook used in most experiments? In Section 4.3, experiments on the codebook size were run, but in Sections 4.1 and 4.2 I don't think this crucial information was provided.

We set the codebook size to 64 in the main experiments. We will add this information to the revision.

Reviewer Comment

We have already conducted a qualitative analysis of latent codes in Section 4.3, which shows how prompts from different sources activate different latent codes.

In typical analyses of unsupervised clustering (for example topic models, which also learn a "codebook"), each cluster is labeled with the key phrases it captures, and the clusters are further summarized by their themes. I already discussed the limitations of the t-SNE plot the authors are mentioning here:

One of the authors' major claims is that the factors in preference vary across tasks (line 055), but this is not actually analyzed in depth. Figure 2 (right) is related but does not convincingly show that the method captures them. It only suggests that different sources of data cluster together, but the clustering looks weak, and the source of data tells very little about the task type; prompts in a single source, for example FLAN v2, cover diverse tasks.

In our experiments, we find that offline alignment algorithms do not always boost performance on these benchmarks. This finding is consistent with https://arxiv.org/abs/2406.08673

I understand this, but how can you demonstrate the benefit of your alignment algorithm if it is not making improvements, but just "least degrading"? New HelpSteer experiments seem to show improvement over base model on GSM8K/TruthfulQA, but the boost across benchmarks seems inconsistent/small. As I mentioned in the review, I would like to see results on benchmarks more standard for evaluating alignment: AlpacaEval, MT-Bench, IFEval, etc.

However, we want to clarify that the learned latent codes do not necessarily align well with human-annotated multi-view scores. This is because 1) latent codes are predicted based on the prompt, but human annotations are mostly scoring the response; 2) there is a gap between model-learned latent information and human-annotated explainable factors.

I understand the learned latent codes may not necessarily align well with existing data, but the responsibility is on the authors to demonstrate how well the latent codes capture human values, which is the authors' main claim. Here I wanted to offer constructive criticism by suggesting a method that I believe is a promising way to do this validation, though it may not be successful.

Official Review (Rating: 1)

This paper studies alignment and addresses the challenge of inconsistent human preferences, which stem from various underlying factors. The authors propose a latent codebook and integrate a method to learn the latent variable within standard alignment techniques like DPO.

They test their approach on standard benchmarks, claiming improvements over preference learning algorithms like DPO, IPO, and SimPO.

Questions For Authors

  1. Could you analyze the improved scores in Table 1? It’s unclear how the method enhances performance when both training and testing datasets are homogeneous—where preference labels are unique and consistent per query-response pair, and no customized steering is needed.

  2. Could you provide evidence that the method is not susceptible to posterior collapse, as I pointed out in the "Claims And Evidence" section? Plotting the KL term is a good starting point (see the monitoring sketch after this list).

  3. Besides the T-SNE plot in Fig. 2, could you further interpret the learned latent space? For example, can you probe the latent space to see if switching latent codes $z$ generates responses that align with different human preferences?
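Regarding question 2, a minimal monitoring sketch, assuming both the prior and posterior over codes are categorical (the tensor names are illustrative, not the paper's API):

```python
import torch
from torch.distributions import Categorical, kl_divergence

def mean_code_kl(posterior_logits: torch.Tensor, prior_logits: torch.Tensor) -> float:
    """Average KL( q(z | x, y_1, y_2) || p(z | x) ) over a batch of size B
    with K codes; both logits tensors have shape (B, K).

    A value that stays near zero throughout training is the classic symptom
    of posterior collapse: the decoder is ignoring z.
    """
    q = Categorical(logits=posterior_logits)
    p = Categorical(logits=prior_logits)
    return kl_divergence(q, p).mean().item()
```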

Claims And Evidence

Claim: The latent variable encodes factors underlying human preferences.

The paper’s main claimed contribution seems plausible but has a key issue: using variational autoencoders (VAEs) for binary (preference) classification risks posterior collapse, where the model assigns preference labels without leveraging the latent space. This problem is especially severe when preferences are unambiguous per sample (i.e., for each input $x, y_1, y_2$, there exists an unambiguous preference label, e.g., $y_1 > y_2$). In such cases, Eq. (3) leads the decoder to ignore $z$ and at the same time achieve zero KL. This concern is especially relevant given the primary dataset used in the experiments, UltraFeedback, where model-generated preferences are unique and consistent per sample.
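To make the collapse argument concrete, here is the generic ELBO-style form such an objective takes (written from standard VAE practice; it may not match the paper's Eq. (3) term for term):

$$
\log p_\theta(y_1 \succ y_2 \mid x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x, y_1, y_2)}\!\left[\log p_\theta(y_1 \succ y_2 \mid x, z)\right] - \mathrm{KL}\!\left(q_\phi(z \mid x, y_1, y_2)\,\|\,p_\theta(z \mid x)\right)
$$

If the likelihood term can already be maximized by a $z$-independent classifier, which is exactly the situation with a single unambiguous label per pair, the optimum sets $q_\phi = p_\theta$ and drives the KL to zero, leaving the latent unused.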

Methods And Evaluation Criteria

The proposed method makes sense for modeling diverse preferences, and it is important to evaluate this method in terms of how well it adapts to diverse test-time human preferences for the same set of queries. However, there are key limitations:

  1. The authors primarily model preferences using UltraFeedback, a model-generated dataset that lacks diverse and conflicting preferences, limiting its ability to learn meaningful latent codes. Although Section 4.4 is a good step toward understanding conflicting preference labels, the experiments are very limited.

  2. The authors primarily test on standard benchmarks, which may not be ideal for this purpose. While the results show some improvement, it's unclear whether the proposed approach is responsible or if other factors are at play. One possibility is that the added complexity of LPC acts as a regularizer rather than contributing in the intended way.

Theoretical Claims

The proof appears correct. Conceptually, the method is a VAE for binary preference classification, where the classifier is modeled as $\log p(\text{preference} \mid x, y_1, y_2)$ and further parameterized using language-modeling log-probs.

Experimental Designs Or Analyses

See "Methods And Evaluation Criteria".

Supplementary Material

Yes. Appendix A.

Relation To Broader Scientific Literature

Although the method's contribution is not fully validated, its motivation is significant. Previous work models aggregated human values from complex preferences, while this approach disentangles preferences into a latent space. This method can potentially help us understand human preferences better and enable language model steering at test time via different latent codes.

Essential References Not Discussed

No

Other Strengths And Weaknesses

Strengths:

  1. The idea behind the method is interesting: modeling complex and sometimes conflicting human preferences in an unsupervised way from off-the-shelf preference datasets.

Weaknesses: Some of the weaknesses have already been outlined, primarily in "Claims And Evidence" and "Methods And Evaluation Criteria". Here is a recap:

  1. The chosen dataset and benchmarks cannot validate the claimed benefit of the method.

  2. Lack of analysis and discussion of the improved scores on the standard benchmarks.

Other Comments Or Suggestions

Typo: line 051, "safty" -> "safety".

See "Questions For Authors" for suggestions.

Author Response

Thanks for your review.

Regarding the effectiveness of latent codes

The authors primarily model preferences using UltraFeedback, a model-generated dataset that lacks diverse and conflicting preferences, limiting its ability to learn meaningful latent codes.

Thanks for your suggestion; however, we do not agree that the UltraFeedback dataset lacks diverse preferences. Prompts in this dataset are collected from various sources, including math, reasoning, factual knowledge, creative writing, and other domains. The diversity of prompts is sufficient to learn meaningful latent codes, as reported in Section 4.3.

Could you provide evidence that the method is not susceptible to posterior collapse, as I pointed out in the "Claims And Evidence" section? Plotting the KL term is a good starting point.

Besides the T-SNE plot in Fig. 2, could you further interpret the learned latent space? For example, can you probe the latent space to see if switching latent codes $z$ generates responses that align with different human preferences?

We conduct a latent code switching experiment to show the effectiveness of the learned latent codes. Specifically, we collect 1000 original preference pairs and 1000 flipped preference pairs. Then, we substitute the latent codes of the original preference pairs with those of the flipped ones; the results are shown below.

| Preference | Latent code | Preference Accuracy |
|---|---|---|
| original | original | 66.8 |
| original | flipped | 61.4 |

As can be seen, using the wrong latent codes leads to a significant performance drop, which indicates that the learned latent codes are effective. We will add more careful experiments in the revision.
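A hypothetical sketch of how the described code-switching evaluation could be implemented (`infer_code` and `preference_margin` are placeholder callables standing in for the posterior inference and the implicit-reward margin; they are not the authors' API):

```python
def switched_preference_accuracy(pairs, flipped_pairs, infer_code, preference_margin):
    """Score original pairs while feeding them latent codes inferred from
    flipped pairs; a large accuracy drop versus the matched-code setting
    suggests the codes carry real preference information."""
    correct = 0
    for (x, y_w, y_l), flipped in zip(pairs, flipped_pairs):
        z_wrong = infer_code(*flipped)                     # code from a flipped pair
        correct += int(preference_margin(x, y_w, y_l, z_wrong) > 0)
    return correct / len(pairs)
```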

Regarding the performance in Table 1

Could you analyze the improved scores in Table 1? It’s unclear how the method enhances performance when both training and testing datasets are homogeneous—where preference labels are unique and consistent per query-response pair and no customized steering is needed.

The motivation of this work is to improve LLM alignment by modeling the complex human preferences behind the overall preference label. In this setting, the complex human preferences are implicitly determined by the query. In contrast, steering an LLM to generate responses under different preference directions models complex human preferences in an explicit way, which is not the focus of this work. We will add more details in the revision.

Reviewer Comment

Thanks for your suggestion; however, we do not agree that the UltraFeedback dataset lacks diverse preferences. Prompts in this dataset are collected from various sources, including math, reasoning, factual knowledge, creative writing, and other domains. The diversity of prompts is sufficient to learn meaningful latent codes, as reported in Section 4.3.

The issue is that datasets like UltraFeedback lack diverse preferences for the same queries. When preferences are consistent within queries but may vary across different queries, standard preference alignment algorithms can handle this perfectly well - note that both language models and reward models are query-conditioned, meaning they naturally adapt to different preferences based on the input prompt. For instance, in safety alignment, models learn to reject unsafe prompts and answer safe ones - this variation across different prompts (and for different values) is easily handled by standard alignment methods.

Your algorithm resembles a conditional VAE, and its advantage becomes apparent specifically when there are diverse preferences for identical queries. In such cases, we can leverage latent codes to adapt to different preferences even when responding to the same query.

We conduct a latent code switching experiment to show the effectiveness of the learned latent codes. Specifically, we collect 1000 original preference pairs and 1000 flipped preference pairs. Then, we substitute the latent codes of the original preference pairs with those of the flipped ones; the results are shown below.

As the paper states, you "add [FLIP] to the prompts associated with these flipped instances." This approach does not actually increase preference diversity for identical prompts. I recommend evaluating performance without adding the [FLIP] tag to better assess its impact.

Official Review (Rating: 2)

This paper studies the problem of complex and often conflicting human preferences for the alignment algorithm. To address this, the paper introduces Latent Preference Coding (LPC), a framework that models the implicit, multifaceted factors behind human preferences using discrete latent codes. LPC integrates seamlessly with offline alignment algorithms and autonomously infers underlying preference factors and their importance directly from data, eliminating the need for predefined reward functions or manual weight adjustments. Extensive experiments across multiple benchmarks demonstrate the effectiveness of the proposed algorithm.

Questions For Authors

Please refer to the weakness part.

Claims And Evidence

The motivation is reasonable and the writing is easy to follow.

Methods And Evaluation Criteria

Yes

Theoretical Claims

Yes

Experimental Designs Or Analyses

Yes

Supplementary Material

I reviewed all parts of the supplementary materials.

Relation To Broader Scientific Literature

This paper studies the problem of using latent codes to automatically learn human preferences. This idea can be used in complex scenarios with complex human preferences.

Essential References Not Discussed

There are also other papers about using latent representations to fine-tune LLMs:

  1. Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding

  2. Training Large Language Models to Reason in a Continuous Latent Space

Other Strengths And Weaknesses

Strengths

  1. The paper is well written and easy to follow.

  2. The experiments verifying the effectiveness of their model are reasonable, especially in complex scenarios for human preference alignment.

Weaknesses

  1. It seems that the model needs to load the encoder transformer and the decoder LLM during the training process. A critical practical concern is the computational overhead of jointly training an encoder alongside large decoder-only LLMs. The added memory footprint from parallel transformer modules could limit scalability.

  2. The posterior network conditions on the preference pair $y_w \succ y_l$, but the paper implements this via simple concatenation of the $y_w$ and $y_l$ embeddings. This raises questions about whether richer interaction modeling (e.g., explicit difference encoding, cross-attention) could better capture the relative preference structure. The authors should clarify the design choice or explore ablation studies on alternative relational encodings (see the sketch below, after the references).

  3. There are also other works using latent representations to improve the fine-tuning of LLMs, e.g., [1], [2].

[1] Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding

[2] Training Large Language Models to Reason in a Continuous Latent Space
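A minimal sketch contrasting the concatenation design in weakness 2 with the richer alternatives mentioned there (the hidden sizes, pooling, and names are assumptions, not the paper's code):

```python
import torch
import torch.nn as nn

class ConcatPosterior(nn.Module):
    """Concatenation-style posterior q(z | x, y_w, y_l) over K codes."""

    def __init__(self, hidden_dim: int = 4096, num_codes: int = 64):
        super().__init__()
        self.head = nn.Linear(2 * hidden_dim, num_codes)

    def forward(self, h_w: torch.Tensor, h_l: torch.Tensor) -> torch.Tensor:
        # h_w, h_l: pooled backbone hidden states for (x, y_w) and (x, y_l).
        return self.head(torch.cat([h_w, h_l], dim=-1))  # posterior logits

# Richer relational encodings could instead use an explicit difference,
# e.g. torch.cat([h_w - h_l, h_w, h_l], dim=-1), or let the two responses
# attend to each other via nn.MultiheadAttention.
```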

Other Comments Or Suggestions

Please refer to the Weaknesses part.

Author Response

Thank you for your review.

Regarding the training cost issue

It seems that the model needs to load the encoder transformer and the decoder LLM during the training process...

The additional computation of LPC is negligible compared to the baseline methods. This is because the policy model and the prior and posterior networks share the same backbone model, so most of the computation can be reused. We have briefly discussed this in Section 3.3. We will add more details in the revision.

Regarding modeling choice of posterior network

... This raises questions about whether richer interaction modeling (e.g., explicit difference encoding, cross-attention) could better capture the relative preference structure. ...

Thanks for your suggestion. We choose the simple concatenation design because we want to reuse the computation of forwarding $\langle x, y_w \rangle$ and $\langle x, y_l \rangle$ in the policy model. In this way, the additional computation of the posterior network is minimal. Using more sophisticated modeling is possible, but it would require forwarding the LLM again, which is inefficient. We will add more details in the revision.

Regarding related works

There are also other works using latent representation to improve the fine-tuning process of LLMs like [1], [2].

The literature you mentioned investigates latent reasoning without explicitly generating CoT tokens, which is different from our work. Although our work also uses a latent space, we focus on modeling complex human preferences using latent variables. We will add them to the related work section in the revision.

Reviewer Comment

Thanks for your reply! I will maintain my score.

Official Review (Rating: 3)

This paper addresses the challenge that a single reward model cannot reflect the full set of latent factors that determine user preference. It proposes LPC, a latent variable model in which the latent is a discrete variable fully predictable from the prompt. For any prompt, a prior network infers which latent factor distribution is most relevant, while a posterior network refines this distribution based on preference comparisons. Experiments show that LPC improves performance across various preference-optimization methods (e.g., DPO, IPO, SimPO) and with multiple base language models (e.g., Llama3, Mistral) when trained on a cross-domain preference dataset.
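In symbols, the structure this summary describes amounts to the following (a generic rendering inferred from the summary, not an equation copied from the paper):

$$
p_\theta(y_w \succ y_l \mid x) \;=\; \sum_{z} p_\theta(z \mid x)\, p_\theta(y_w \succ y_l \mid x, z),
$$

with a posterior network $q_\phi(z \mid x, y_w, y_l)$ used during training to infer which discrete code a given preference comparison reflects.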

Questions For Authors

The following points may need discussion in the main text:

  1. How does the training cost of LPC compare to that of the baseline methods?
  2. How is the codebook of the latent z set up? Are those embeddings fixed in advance or obtained by training?

Claims And Evidence

The main claim made by this paper is: modeling human preference demands considering latent variables, and with this idea the proposed LPC method improves baseline alignment methods.

Evidence: experiments test the effect of LPC when added on to DPO, IPO, and SimPO with multiple base language models (e.g., Llama3, Mistral). According to the results in Table 2, the improvement seems mostly marginal, especially for domains like Arc-Easy, but it is rather consistent.

Methods And Evaluation Criteria

The proposed method makes sense to me, as it is a straightforward application of the VAE method with architecture designs specific to LLMs. The evaluation criteria also make sense.

Theoretical Claims

No theory is provided in this paper.

Experimental Designs Or Analyses

The experimental designs are reasonable, and the analysis and results make sense.

Supplementary Material

No.

Relation To Broader Scientific Literature

There is existing literature considering latent variables in the preference model instead of relying on a single reward. The novelty of this paper is that it considers a latent variable reflecting the diverse domain context encoded in the prompt.

Essential References Not Discussed

I think the authors should be aware of https://arxiv.org/abs/2402.08925, which considers a mixture model that assigns each piece of data to one among a few distinct groups (a discrete latent also exists there). The authors should discuss whether a clustering algorithm like the EM algorithm proposed there can be applied to modeling the mixture of prompt domains studied in this paper, and include that method as a baseline if comparable.

Other Strengths And Weaknesses

My major concern is in the Essential References Not Discussed part.

Other Comments Or Suggestions

Please refer to the Questions For Authors part.

Author Response

Thank you for your review.

How does the training cost of LPC compare to that of the baseline methods?

The additional training cost of LPC is negligible compared to the baseline methods. This is because the policy model, prior and posterior networks share the same backbone model so most of the computation can be reused. We have briefly discussed this in section 3.3. We will add more details in the revision.

How is the codebook of the latent z set up? Are those embeddings fixed in advance or obtained by training?

The embeddings are trained jointly with the rest of the model parameters.

Essential References Not Discussed

Thanks for your suggestion. We have read the paper you mentioned. The method in that paper is based on PPO. However, our paper focuses on offline alignment algorithms, so it is not directly comparable. We will add it to the related work section in the revision.

Final Decision

This work addresses the problem of aligning LLMs with human preferences. It argues that human preferences are multifaceted and often dictated by conflicting factors, and that a single reward function does not necessarily capture this complexity or represent the multifaceted factors entailed by a prompt, particularly when those factors conflict. It further argues that the pivotal factors determining human preference vary with tasks and populations, and that even when the factors are well defined, assigning importance to the various factors for a given prompt can be challenging. The work proposes to model these implicit factors and their combinations in alignment using discrete latent codes, inferring the implicit factors and their importance from data without requiring pre-defined sub-rewards or hand-crafted weights. At the core of the proposed approach is a discrete latent space of codes representing factors influencing human preferences. The latent codes are learnt from the data: the weights of the latent codes are inferred from observed preference annotations by a posterior network, and the codes are predicted by a prior network. The work presents experimental results on multiple benchmarks to validate the effectiveness of the proposed approach to alignment.

The problem addressed by the work is important. Both the problem and the proposed approach are well-motivated. The idea to use latent codes for aligment is novel. The solution proposed is principled and is based on the very reasonable assumption that holistic human preference is a mixture of multiple unobserved factors. Results are very encouraging and competitive in several cases. The work discusses cases where the proposed approach helps and where it doesn't and provides preliminary explanation for the same. The paper is very well written and has adequate detail to reproduce and revalidate the experimental results. While the work can be strengthened further based on some of the suggestions made by the reviewers, it makes significant contribution to the field of alignment and is likely to be of interest to researchers and practitioners.