PaperHub
Overall score: 7.8/10
Poster · 4 reviewers
Ratings: 4, 5, 5, 5 (min 4, max 5, std dev 0.4)
Confidence: 3.5
Novelty: 2.8 · Quality: 2.8 · Clarity: 3.0 · Significance: 2.8
NeurIPS 2025

Counterfactual Reasoning for Steerable Pluralistic Value Alignment of Large Language Models

OpenReview · PDF
Submitted: 2025-05-09 · Updated: 2025-10-29
TL;DR

Achieve Pluralistic Value Alignment through Counterfactual Reasoning

Abstract

Keywords
Large Language Models; Pluralistic Alignment; Value Alignment

Reviews and Discussion

Review
Rating: 4

The paper introduces COUPLE, a three-stage framework that uses counterfactual reasoning with a structural causal model (SCM) to steer large language models toward pluralistic human values. The pipeline consists of three steps: (1) abduction of values, (2) intervention on the values, and (3) counterfactual prediction. Experiments on Touché23-ValueEval and DailyDilemma show better alignment than several prompting and fine-tuning baselines.

Strengths and Weaknesses

Strengths

Value complexity and value steerability are important issues with most existing alignment methods. I also think causality could be helpful, given its ability to model complex relationships between variables. Therefore, I believe this direction is both interesting and promising.

Weaknesses

Causality

I don’t fully understand how SCM is utilized in their framework. For example, in counterfactual reasoning, abduction is a hard task which requires inferring the exogenous noise given observed evidence. However, in Section 4.2, it seems to me the abduction step in this work is to infer $V$. Note that $V$ is explicitly defined as one of the endogenous variables in Section 3. Besides, how is the causal relationship modeled here?

In the second intervention step, does the value attributor always extract priorities for all predefined values? Why compare the estimated $v'$ with the target $v$ instead of just intervening based on the given $v$?

In the counterfactual reasoning step, what are variables in the causal model? It appears their causal model only contains $V$ but not $C$. However, they consider the estimation of $C$ as counterfactual prediction. Besides, I don’t think a relational graph and covariance matrix are sufficient to capture a causal model.

In summary, based on my understanding, the use of causality in their framework feels forced. Their methodology might still be valid without mentioning any causal concepts, however, it is hard to evaluate the soundness of their methodology given the current presentation.

Access to causal model

The authors try to include a “causal model” in their generation pipeline to improve value complexity. However, acquiring an accurate causal model can be quite difficult.

Questions

  1. What is the benefit of introducing concepts instead of directly generating responses based on values, considering that both just use an LLM?
  2. The authors mentioned in multiple places that a strong LLM is required to achieve certain tasks. Do we have any clear definition of "strong" LLMs?

Limitations

N/A

Final Justification

I appreciate the authors' effort in providing the explicit SCM. This makes the discussion and analysis much easier. I still find the counterfactual estimation framework to be hand-wavy, but I also agree that exact causal inference in the context of LLMs is difficult to perfect, and I have adjusted my score accordingly given the improved clarity.

Formatting Concerns

N/A

Author Response

Response to reviewer efFE:

Thank you very much for your insightful comments. We provide responses to your concerns about the causality usage (Weaknesses) and additional questions (Q) as follows.

W1: The use of causality in the framework feels forced.

The adoption of causality (a structural causal model) in our counterfactual reasoning framework is not forced but theoretically well-motivated by the core requirements of pluralistic value alignment. Its effectiveness is also empirically validated by the ablation study in Sec 5.4.

For the task of pluralistic value alignment, we need to model how different value profiles lead to different LLM responses. As stated in Line 29-31, value studies in social science and psychology (e.g., Schwartz’s Value Theory) claim that multiple value dimensions with priorities guide human behaviors. This inspires us to address value alignment by modeling the inherent causality between complex values and LLM behaviors. Therefore, we introduce the structural causal model (SCM) to capture the causal dependency between values and behaviors, as well as the interdependency among the value dimensions themselves.

Furthermore, Challenge 2 (Value Steerability) in Line 45-48 highlights that different value profiles may have only minor differences, such as a change in the priority of two value dimensions. We introduce counterfactual reasoning to simulate such shifts; without an explicit SCM, LLM prompting alone struggles to capture such nuanced adjustments.

In Sec 5.4, we conduct an ablation study to validate the significance of the SCM. Specifically: i) removing the SCM while retaining counterfactual reasoning leads to a significant performance drop; ii) removing counterfactual reasoning while keeping the SCM for direct behavior prediction leads to a smaller performance drop. The results demonstrate that the SCM is essential not only for capturing the causality between values and behaviors, but also for enabling more effective counterfactual reasoning.

W2: Acquiring an accurate causal model is quite difficult.

In this paper, we do not build a value-to-behavior causal model from scratch using extensively annotated (value profile, behavior) pairs, due to two main challenges: 1) large-scale datasets with comprehensive annotations of values and behaviors are rare and costly to obtain; 2) traditional approaches for building an SCM often struggle with high-dimensional, unstructured natural language inputs [7][8][9].

Instead, inspired by the facts that LLMs already embed rich knowledge about values and that LLMs have been regarded as a promising tool for causal inference on natural language data, such as causal discovery and counterfactual reasoning [2][3][4][5][6], we adopt an LLM to build the causal model, which encodes the structural relationships among multi-dimensional values and how they jointly determine the final model responses, i.e., $V \rightarrow R$ (as introduced in Sec 3.2).

We acknowledge that this is an approximate value-to-behavior causal model; however, it already captures rich knowledge about the relationship between values and behaviors. Its effectiveness has been validated by both quantitative experiments (Sec 5.2, Sec 5.3) and qualitative experiments (Sec 6). We consider learning a more faithful causal model an important direction for future work.

W3: How is the SCM utilized in the framework? (with several sub-items below)

W3-1:In the abduction step, the endogenous variables are inferred but not the exogenous noises.

Regarding abduction: Given the structure of our framework and the black-box nature of LLMs, inferring exogenous noise variables strictly is not feasible. We instead use counterfactual reasoning to avoid the influence of exogenous variables. More specifically, we focus on the transitions of endogenous variables (from value to intermediate concept to response) and use LLM inference as a proxy for capturing the underlying causal relationships. Concerning the “value attributor” and priorities: for each response, we estimate the attribution for all values, not just a subset, and compare the estimated values with the target values to ensure alignment with the model. The value extraction process itself also utilizes LLMs, following current NLP and causal reasoning work [7][8], where LLMs are employed to identify relevant latent concepts and perform causal reasoning, leveraging their inherent causal attribution abilities [4].

W3-2: Details about the second intervention step. After the abduction step, the value attributor extracts priorities for all predefined values, yielding the estimated value profile $v'$. Then, we intervene on the priority scores of $v'$ toward the target values $v$, i.e., $do(V=v)$ as described in Line 157.

Comparing the estimated $v'$ with $v$ simply serves to calculate the difference; when the difference satisfies $|v' - v| < \theta$, there is no need to conduct the following steps.

W3-3: Variables of the causal model in the counterfactual reasoning step

As introduced in Sec 3.2 (Line 130-131), our causal model aims to encode the relationship between the values $V$ and the model response $R$, thus we have $X = [V, R]$. As textual responses usually contain side information that may not be correlated with the values, such as rephrasings of the description, we introduce value concepts, which are behavioral indicators of values beyond redundant and noisy text, as a more essential proxy of the response (as introduced in Line 182-183). Thus, in the practical implementation, we replace the variable $R$ in the causal model with the concepts $C$, which are also used in the counterfactual reasoning step. In Line 208, we highlight that we generate the final textual responses based on the value concepts.

W3-4: Whether a relational graph and a covariance matrix are sufficient to capture a causal model.

We would like to clarify that the relational graph and covariance matrix in our method are not intended to capture the causal model; as described in Line 201-206, they are used to capture the complex dependencies between value dimensions (addressing Challenge 1 - Value Complexity). In the response to W2, we explained that the value-to-behavior causal model is built and captured by an LLM with rich knowledge about values.

More specifically, the relational graph captures the original correlations among value dimensions, i.e., congruent, opposite, or irrelevant, while the covariance matrix captures their priority scores.
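For illustration, the sketch below shows one way such a relational graph and priority structure over Schwartz dimensions could be represented and serialized for a prompt. The specific dimensions, relations, and scores are hypothetical placeholders, not the authors' actual data.

```python
import numpy as np

# Hypothetical subset of Schwartz value dimensions (illustrative only).
values = ["self-direction", "conformity", "benevolence", "power"]

# Relational graph: pairwise relation between dimensions
# (+1 congruent, -1 opposite, 0 irrelevant); entries are made up for this example.
relations = {
    ("self-direction", "conformity"): -1,  # opposite
    ("benevolence", "conformity"): 1,      # congruent
    ("self-direction", "power"): 0,        # irrelevant
}

# Priority scores per dimension, e.g. self-direction: 5 > conformity: 3,
# mirroring the example used elsewhere in this discussion.
priorities = {"self-direction": 5, "conformity": 3, "benevolence": 4, "power": 2}

# Matrix view of the graph, which could be appended to an LLM prompt.
G = np.zeros((len(values), len(values)), dtype=int)
for (a, b), rel in relations.items():
    i, j = values.index(a), values.index(b)
    G[i, j] = G[j, i] = rel

print(G)
print(sorted(priorities.items(), key=lambda kv: -kv[1]))
```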

Q1: Concern about why answers are not generated directly

This design—introducing explicit intermediate concepts or states for counterfactual reasoning—is now common in NLP (see, e.g., [relevant literature]). We introduce concepts/intermediate states because direct value-to-text mapping lacks interpretability and flexibility. Having an explicit intermediate stage allows for better control, counterfactual testing, and more nuanced understanding of model behavior. As shown in the ablation results in Table 4, removing this module leads to a significant drop in performance. The causal decomposition (value → concept → response) reflects the underlying logic of value-driven generation, and aligns with the way many recent controllable generation methods are designed.

Q2: Definition of "Strong" LLMs

By “strong” LLMs, we mean models that score higher on standard benchmarks of text understanding and instruction following (e.g., MMLU, BIG-Bench, GSM8K). There is no absolute threshold, but in practice, we select models with proven high performance in such evaluations. This is also common practice in the literature (e.g., OpenAI GPT-4, Google Gemini; see, e.g., [10,11]).

References:
[1] Pearl, J. (2009). Causality: Models, Reasoning, and Inference. Cambridge University Press.

[2] Jing Ma et al. (2024). Causal inference with large language models: A survey. arXiv preprint arXiv:2409.09822.

[3] Emre Kiciman et al. (2023). Causal reasoning and large language models: Opening a new frontier for causality. Transactions on Machine Learning Research.

[4] Zhijing Jin et al. (2023). Cladder: Assessing causal reasoning in language models. Advances in Neural Information Processing Systems, 36:31038–31065.

[5] Nick Pawlowski et al. (2023). Answering causal questions with augmented LLMs.

[6] Yuzhe Zhang et al. (2024). Causal graph discovery with retrieval-augmented generation based large language models. arXiv preprint arXiv:2402.15301.

[7] Amrita Bhattacharjee et al. (2023). Towards LLM-guided causal explainability for black-box text classifiers. arXiv preprint arXiv:2309.13340.

[8] Amrita Bhattacharjee et al. (2024). Zero-shot LLM-guided counterfactual generation for text. arXiv preprint arXiv:2405.04793.

[9] Amir Feder et al. (2023). Data augmentations for improved (large) language model generalization. Advances in Neural Information Processing Systems, 36:70638–70653.

[10] OpenAI (2023). GPT-4 Technical Report.

[11] Gemini Team, Google et al. (2023). Gemini: A Family of Highly Capable Multimodal Models. arXiv preprint arXiv:2312.11805.

Comment

Thank you for the rebuttal. However, the response does not address my core concern about the usage of causality. In fact, the provided clarification reinforces my concerns. I want to clarify that I am not questioning the significance of using an SCM for this task, but rather whether the proposed method actually achieves appropriate and technically sound causal reasoning (as claimed).

First, I don't see a properly defined SCM in the context of this task. The authors provide the original formal definition of SCM but fail to explain how it applies to the specific task at hand. Even though the actual structural equations might be difficult to estimate precisely, a proper definition is crucial. It seems to me there are several key endogenous variables: multiple values, concepts, and the final response. The authors do not provide a clear description of how the structural equations between these variables will be modeled, which leads to confusion. For instance, the authors claim that "We would like to clarify that the relational graph and covariance matrix in our method is not intended to capture the causal model, as described in Line 201-206, they are used to capture the complex dependencies between value dimensions." However, values and their relationships are inherently part of the SCM under consideration—in fact, they might be one of the most important components.

I feel like the proposed method simply uses a large black-box LLM with some constraints (the provided covariance matrix and relational graph) to perform some kind of "counterfactual reasoning." However, I don't believe this constitutes counterfactual querying as defined in the context of SCM (see Ch. 7 in [1]). In the rebuttal, the authors claim "We instead use counterfactual reasoning to avoid the influence of exogenous variables." However, inferring exogenous noise given observed evidence is a key component of proper counterfactual reasoning in SCM frameworks.

I think that the empirical benefits might stem from finer-grained control enabled by the constraints of multiple values and their predefined correlations. However, the current misconceptions regarding SCM and counterfactual reasoning represent a significant technical flaw in my view and make it hard to analyze why the proposed method might work or not.

Some other questions

  1. (from the rebuttal) “This design—introducing explicit intermediate concepts or states for counterfactual reasoning—is now common in NLP (see, e.g., [relevant literature]).” I am not sure what the authors are referring to as [relevant literature]
  2. In the "value intervention" step, is it necessary to provide all values? In other words, if we only specify "self-direction: 5," would the model be able to infer the remaining values automatically?

[1] Pearl, J. (2009). Causality: Models, Reasoning, and Inference. Cambridge University Press.

Comment

3. Clarification for other questions.

  • Regarding [relevant literature]: Sorry for this typo. The relevant literature corresponds to the recent papers [1][2] that explicitly introduce intermediate concepts or states for counterfactual reasoning in the NLP area.

  • For the "value intervention" step: For each specific question, we do not consider all values but only the value dimensions directly involved in the question (as described in Line 240-241, we consider up to the 5 most related dimensions to define the value profile $v$). For example, a scenario of "taking care of my parents" is related to Benevolence but irrelevant to Power. The model would only infer the related values and ignore the others. When the given question involves the value "self-direction", we can specify "self-direction: 5" in the target; otherwise, we can only intervene on the other related dimensions.

We hope we have addressed your concerns

We hope our responses above could address your concerns and we're more than willing to respond to any further questions.

We would sincerely appreciate it if you could read our responses, and kindly reconsider the assessment of our work.

References

[1] A. Bhattacharjee et al. (2023). Towards LLM-guided causal explainability for black-box text classifiers. arXiv:2309.13340.

[2] A. Bhattacharjee et al. (2024). Zero-shot LLM-guided counterfactual generation for text. arXiv:2405.04793.

[3] A. Feder et al. (2023). Data augmentations for improved (large) language model generalisation. Advances in Neural Information Processing Systems 36, 70638–70653.

[4] Emre Kiciman et al. (2023). Causal reasoning and large language models: Opening a new frontier for causality. Transactions on Machine Learning Research.

[5] Zhijing Jin et al. (2023). Cladder: Assessing causal reasoning in language models. Advances in Neural Information Processing Systems, 36:31038–31065.

[6] Jing Ma et al. (2024). Causal inference with large language models: A survey. arXiv preprint arXiv:2409.09822.

[7] Longxuan Yu et al. (2024). Causaleval: Towards better causal reasoning in language models. arXiv preprint arXiv:2410.16676.

[8] Alexander Marx and Jilles Vreeken. (2017) Telling cause from effect using mdl-based local and global regression. In 2017 IEEE International Conference on Data Mining (ICDM), pp. 307–316. IEEE.

[9] Natasa Tagasovska et al. (2020). Distinguishing cause from effect using quantiles: Bivariate quantile causal discovery. In Hal Daumé III and Aarti Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 9311–9323. PMLR, 13–18 Jul 2020.

[10] Kun Zhang and Aapo Hyvarinen. (2012). On the identifiability of the post-nonlinear causal model. arXiv preprint arXiv:1205.2599.

[11] Pengzhou Wu and Kenji Fukumizu. (2020). Causal mosaic: Cause-effect inference via nonlinear ICA and ensemble method. In International Conference on Artificial Intelligence and Statistics, pp. 1157–1167. PMLR.

[12] Kristy Choi et al. (2022). LMPriors: Pre-trained language models as task-specific priors. arXiv preprint arXiv:2210.12530.

Comment
| Step | Name | What happens | Output |
|---|---|---|---|
| 1 | Abduction | Since we can only observe the LLM response but not the underlying values, we should estimate both the underlying values $v'$ and the exogenous variables $\epsilon_1, \epsilon_2$ in this step. As detailed in Sec 4.2 Value Abduction, we first extract the key value concepts $C_r$ from the response and leave the other irrelevant text in the response as $\epsilon_2 (= r - C_r)$. Then, we infer the value priority scores to obtain $v'$. Considering that $\epsilon_1$ is latent during the concept generation process, we use the $v' \rightarrow C_r$ relation as the proxy of $\epsilon_1$ and append it in the LLM prompt for counterfactual reasoning to ensure the same $\epsilon_1$. | The underlying values $v'$, the value concepts $C_r$, the textual information apart from the core concepts that embeds $\epsilon_2 = r - C_r$, and the relation $v' \rightarrow C_r$ that embeds $\epsilon_1$. |
| 2 | Intervention | Apply the intervention $do(V = v)$. | $do(V = v)$. |
| 3 | Prediction | By providing the target values $v$, the relation $v' \rightarrow C_r$ that embeds $\epsilon_1$, and $(r - C_r)$ that embeds $\epsilon_2$, we leverage the LLM to follow the above two functions to conduct causal inference and produce $c_v$ and finally $r_v$. | The counterfactual estimate $r_v$. |

Sec 4.2 and Sec 4.3 in our submission detail how we leverage an LLM to implement the above formalized causal inference process respectively. Thus, our method is not just a large black-box LLM with some constraints, but a solid simulation of the counterfactual querying process as defined in the context of SCM. And extensive experiments verify the effectiveness of this simulated SCM counterfactual process.
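To make this three-step procedure concrete, here is a minimal Python sketch of the abduction-intervention-prediction loop. The helper functions (llm, extract_concepts, attribute_values, counterfactual_concepts) are hypothetical stand-ins for the LLM-prompted steps described in the table, not the authors' implementation, and the prompt wording is illustrative only.

```python
from typing import Dict, List

def llm(prompt: str) -> str:
    """Placeholder for a call to a strong instruction-following LLM."""
    raise NotImplementedError

def extract_concepts(question: str, response: str) -> List[str]:
    # Abduction, part 1: extract the key value concepts C_r from the response;
    # the remaining text is treated as the exogenous factor eps_2 = r - C_r.
    out = llm(f"List the value-relevant key concepts in this answer.\n"
              f"Question: {question}\nAnswer: {response}")
    return [line.strip() for line in out.splitlines() if line.strip()]

def attribute_values(concepts: List[str]) -> Dict[str, int]:
    # Abduction, part 2: infer the priority scores v' expressed by the concepts,
    # assuming the LLM answers with one "dimension: score" pair per line.
    raw = llm(f"Rate (1-5) the value dimensions expressed by: {concepts}")
    return {line.split(":")[0].strip(): int(line.split(":")[1])
            for line in raw.splitlines() if ":" in line}

def counterfactual_concepts(question: str, concepts: List[str],
                            inferred_v: Dict[str, int],
                            target_v: Dict[str, int]) -> List[str]:
    # Prediction: keep the observed v' -> C_r relation in the prompt (a proxy for
    # eps_1) and regenerate the concepts under the intervened values do(V = v).
    out = llm(f"Question: {question}\n"
              f"Original values {inferred_v} led to concepts {concepts}.\n"
              f"Rewrite the concepts so they reflect the target values {target_v}.")
    return [line.strip() for line in out.splitlines() if line.strip()]

def couple_like_pipeline(question: str, response: str,
                         target_v: Dict[str, int]) -> str:
    concepts = extract_concepts(question, response)   # Step 1: abduction
    inferred_v = attribute_values(concepts)
    if inferred_v == target_v:                        # Step 2: intervene only if needed
        return response
    new_concepts = counterfactual_concepts(question, concepts, inferred_v, target_v)
    # Step 3: aggregate the counterfactual concepts c_v into the final response r_v.
    return llm(f"Answer the question using these key points: {new_concepts}\n"
               f"Question: {question}")
```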

We will add the formal definition of SCM and the corresponding implementation steps into our revision to enhance the technical soundness of causal reasoning in our framework.

2. Justification for using LLMs to implement causal inference

We approximate the above SCM inference process using an LLM, allowing the model to implicitly handle both structural relationships and exogenous factors. We are not the first to conduct causal inference with LLMs; this has become a popular strategy, especially for natural language tasks with complex causality that are hard for traditional causal models, and has achieved even better performance than classical methods. We have reviewed most of these studies in Sec 2.2, such as [4–6], and we present further empirical evidence below.

This table shows the performance of LLMs on the causal discovery task, evaluated on the Tübingen cause-effect pairs dataset [4]. Notably, models such as GPT-4 achieve accuracy on par with or better than traditional algorithms.

| Model | Acc. | Wt. Acc. |
|---|---|---|
| Slope [8] | 0.75 | 0.83 |
| bQCD [9] | 0.68 | 0.75 |
| PNL-MLP [10] | 0.75 | 0.73 |
| Mosaic [11] | 0.83 | 0.82 |
| text-davinci-001 | 0.50 | 0.50 |
| LMPrior [12] | 0.83 | — |
| gpt-3.5-turbo (single prompt) | 0.89 | 0.92 |
| gpt-4 (single prompt) | 0.96 | 0.97 |

Beyond causal discovery, LLMs also demonstrate competitive performance across a range of causal reasoning tasks. This table summarizes results from [7], covering three major task categories: Causal Discovery (COPA, NPDS, e-CARE, Corr2Cause), Causal Inference (CLADDER), and Additional Causal Tasks (CRASS, MoCa, Tram).

| Model | COPA | NPDS | e-CARE | Corr2Cause | CLADDER | CRASS | MoCa | Tram |
|---|---|---|---|---|---|---|---|---|
| GPT-4o | 100.0 | 56.0 | 85.0 | 47.0 | 61.0 | 95.0 | 61.0 | 84.0 |
| o3-mini | 99.8 | 56.0 | 79.6 | 60.6 | 92.2 | 93.3 | 63.9 | 81.4 |
| Human | 95.8 | 97.7 | 92.0 | 94.5 | 94.8 | 98.2 | 92.0 | 98.8 |

These results confirm that LLMs are increasingly capable of addressing not only standard causal discovery but also a wide range of more complex causal inference and reasoning challenges.

Comment

Thanks for your constructive feedback. We address each of your main concerns as follows.

1. Formal definition of SCM in our framework

In our framework, an SCM $(X, \mathcal{F}, \epsilon)$ is built to encode the structural relationships among value dimensions and the final model responses to given questions. Thus, we mainly consider the question $q$, the value dimensions $v = (v_1, v_2, \ldots)$, the value concepts $c_v = (c_v^1, c_v^2, \ldots)$, which are behavioral indicators of values beyond redundant and noisy text (as introduced in Line 182), and the final response $r$ as the endogenous variables. Other variables ($\epsilon_1, \epsilon_2$) that influence the response generation process are treated as exogenous variables, as detailed in the table below.

| Variable | Direct parents | Description |
|---|---|---|
| $\epsilon_1$ | — | Factors that could affect the value $\rightarrow$ concept generation process, such as the model generation temperature, tone, language style, and so on. |
| $\epsilon_2$ | — | Factors that could affect the concept $\rightarrow$ answer generation process, such as the model generation temperature, tone, language style, and so on. |
| $v$ | — | The value profile, with a priority score on each dimension, i.e., $v = [(v_1, s_1), (v_2, s_2), \ldots]$. We mainly intervene on this variable, $do(V = v)$. |
| $q$ | — | The given question, to which the LLM is required to generate a response. |
| $c_v$ | $v$, $q$ | The value concepts, i.e., the core behaviors determined by the values under the given question context. |
| $r_v$ | $c_v$, $q$ | The final response shaped by the value concepts and the question. Compared to the value concepts, which only indicate the core behavior, the response should also consider coherence, fluency, and so on. |

All functions $\mathcal{F}$ capturing the relationships among the variables are as follows. We first describe the procedure for obtaining the concept variable:

$$c_v = \mathcal{F}_c(P_a(c_v), \epsilon_1) = \mathcal{F}_c(v, G_v, \Sigma_v, q, \epsilon_1),$$

where $G_v$ is the relational graph capturing the correlations among values, i.e., congruent, opposite, or irrelevant, and $\Sigma_v$ is the covariance matrix capturing the relative importance among values, such as self-direction: 5 > conformity: 3. This function is introduced in Line 206 (Eq. (2)).

We then present the modeling steps for obtaining the response variable.

$$r_v = \mathcal{F}_r(P_a(r_v), \epsilon_2) = \mathcal{F}_r(c_v, q, \epsilon_2).$$

As introduced in Line 208-210, we prompt an LLM to aggregate the concepts to generate the final response.

Since traditional causal reasoning methods mainly focus on tabular data and struggle with causal inference on natural language tasks, we follow recent studies [1][2][3] on causal inference with LLMs and implement the above two functions using a powerful LLM. Specifically, the three-step counterfactual reasoning is carried out as follows:

Comment

Dear Reviewer efFE,

Thank you again for your valuable comments and suggestions. We have posted further responses to address your concerns.

As the discussion period is nearing its end with less than two days remaining, we sincerely appreciate it if you could take some time to reply with further feedback on whether our responses have addressed all your concerns. If there are any other comments, we are more than willing to address them.

Best regards,

The authors

Comment

Sorry for the late response.

I appreciate the authors' effort in providing the explicit SCM. This makes the discussion and analysis much easier. I still find the counterfactual estimation framework to be hand-wavy, but I also agree that exact causal inference in the context of LLMs is difficult to perfect, and I have adjusted my score accordingly given the improved clarity (Overall rating: 2->4).

Comment

We sincerely thank the reviewer for the thoughtful follow-up and for recognizing the value of our explicit SCM formulation and improved clarity. We ensure that the clarifications from this discussion will be incorporated into the revised version to further improve the clarity of our method.

Review
Rating: 5

This work proposes COUPLE (COUnterfactual reasoning framework for PLuralistic valuE alignment). COUPLE is designed to address the challenges encountered by prompt-based and tuning-based pluralistic alignment methods: value complexity and value steerability. COUPLE is based on a Structural Causal Model and has two key properties (structural value modeling and fine-grained steerability) that address each respective challenge. These properties are incorporated via a three stage pipeline that consists of value abduction, value intervention, and counterfactual prediction. Experiments and ablations demonstrate the efficacy of COUPLE compared to prompt- and tuning-based approaches, and further validate the necessity of each component.

Strengths and Weaknesses

Strengths:

  • The experiments are thorough, including automatic and human evaluation, as well as ablations on the method components.
  • The framework presented and incorporation of a SCM to address pluralistic value alignment is novel and, based on the experiments, is effective.

Weaknesses:

  • Some aspects of the evaluation are questionable. In section 5.3 (lines 300-302), the authors state the responses are collected aligned to 4 different value objectives for the Touché23-ValueEval, yet in lines 219-220, it states the authors defined value objectives over 10 Schwartz basic value dimensions on this dataset. There is no justification for discarding 6 value objectives for this evaluation.
  • Analyses in section 6, particularly the interpretability analysis, are not clearly explained. For example, lines 343-347 propose an analysis on interpretability based on the most frequent words under each value and score level. The results in Table 5 are not clear from context. See more in the "Questions" section below. Additionally, the assertion that the results demonstrate that the concepts reflect the value/score and convey semantically meaningful content does not seem supported. For example, how are "not" and "over" semantically meaningful for the value dimension "Power"?
  • Overall, the clarity of the work can be improved. It was particularly difficult to parse section 5.1.2 as far as which baselines were actually used amidst commentary on the weaknesses of different methods (e.g. lines 248-249; lines 253-254).

Questions

  • Please provide more information on the human evaluation. Which value objectives were evaluated, and why were those selected while the rest were discarded?

  • Is there a typo in Table 4? w/o ValueConcepts/MAE/Touché23-ValueEval/DeepSeek-R1 is reported as 0.179.

  • Please provide more information on the interpretability analysis in section 6. Specifically, for Table 5, what is the "Priority" column vs. the value priority highlighted in red?

Limitations

Yes

Final Justification

The rebuttal from the authors sufficiently addressed my concerns and misunderstandings. The provided clarifications and planned revisions to the work are reflected in my increased score.

Formatting Concerns

N/A

Author Response

Response to Reviewer fCkw:

W1: Question (on value objectives and country alignment):

To clarify, in Section 5.3 (lines 300–302), responses are aligned according to four different countries/groups, not four value objectives. In all cases, the value objectives are defined as combinations of the ten Schwartz basic value dimensions.

  • The "ten" refers to the ten basic value dimensions defined by Schwartz, which form the full set of value objectives used throughout our evaluation. In all cases, no Schwartz value dimension was ever discarded or omitted.
  • The "four" refers to four alignment targets: two clusters/groups (Group 1 and Group 5, derived from our own user data for their clear differences), and two real-world country profiles (the UK and India).

The specific mappings for these four cases, along with their definitions, are provided in Appendix B.2.2 (lines 107–127), and the rationale for selecting these representative combinations is visualized in Appendix Fig. 1.

In summary, our evaluation always considers all ten Schwartz dimensions. The grouping is based on country-specific value profiles, not by omitting any objectives.

W2, Q3: Question (on interpretability analysis, Table 5): The reviewer finds the interpretability analysis in Section 6 and Table 5 unclear and questions the semantic meaning of certain frequent words.

Table 5 presents the most frequent words associated with each value and score level. Words marked in red indicate a strong correlation with the score polarity: for example, words frequently appearing in responses rated 5 (high score) tend to be associated with positive or supportive attitudes, while words appearing in responses rated 1 (low score) reflect negative or rejecting attitudes (e.g., "not").

Further examples can be found in Appendix C.5 (lines 271–274) and Appendix Table 17.

We will revise the Table and the revised text will explicitly explain how word polarity and context relate to each value and score, and clarify the interpretation of Table 5, including how to read the highlighted words.

W3: Question (on baseline clarity, Section 5.1.2):

Next, we will clarify baseline identification in Section 5.1.2:

  • All baseline models discussed and highlighted (in bold) in Section 5.1.2 were actually used in our experiments, and their results are explicitly reported in Table 2 and Table 3.
  • For fine-tuning-based baselines, experiments were conducted only on open-source models, as shown in Table 3. For closed-source models, only prompt-based baselines were tested, as shown in Table 2.

To improve clarity, we will revise the writing in Section 5.1.2 to explicitly state which models are used as baselines.

For full transparency, detailed descriptions of each baseline are provided in Appendix B.4 (lines 148–199).

Q1: Question (on human evaluation and value objectives)

Response:
Next, we discuss the question regarding our human evaluation and the selection of value objectives. In our study, we consider a total of 15 value objectives: five representative groups obtained by clustering real user data, five major European countries, and five major countries globally. For the human evaluation, we selected two representative groups (Group 1 and Group 5) and two countries (the UK and India) because their value profiles are more representative and diverse within our dataset.

For dataset alignment (e.g., with specific country profiles), we use established mappings as described in Appendix B.2.2 (lines 107–127). The full selection process and all value mappings are documented in the appendix.

Q2: Question (on Table 4, possible typo):

Response:
Sorry for the trouble caused by the typo. You are correct—there is a typo in Table 4. We will correct this in the revised version. We have also double-checked the other entries and can confirm that there are no further errors.

Comment

I appreciate the authors' thorough response to clarify my concerns and questions. The clarifications here as well as the planned clarifications in the paper sufficiently address my concerns. I will be raising my Overall Rating from a 3 --> 5 accordingly, as well as increasing my clarity and significance ratings.

Comment

Thank you very much for your positive update and for raising your ratings. We appreciate your careful review and are glad that our responses and planned clarifications have addressed your concerns. We will ensure that all relevant clarifications are clearly reflected in the revised version.

Thank you again for your thoughtful feedback and support!

Review
Rating: 5

This paper tackles the problem of pluralistic alignment, which addresses alignment to diverse (and potentially conflicting) values, in contrast to traditional alignment approaches which assume universal alignment to a set of shared values. The authors highlight two issues with pluralistic alignment, value complexity and value steerability, and propose a new approach called COUPLE, a counterfactual reasoning framework designed to tackle these issues. COUPLE leverages a structural causal model, which captures the different relationships between values and related concepts, and enables new forms of counterfactual reasoning and improved interpretability. The authors also evaluate their proposed approach across two different datasets, showing promising results along with ablation studies.

Strengths and Weaknesses

Strengths: Overall, this paper addresses the timely issue of pluralistic alignment, which is receiving increasing attention. In contrast to the majority of approaches, which are purely prompt-based, the use of a structural causal model enables better modeling of complex relationships between different values and concepts, and how they might impact downstream behaviors. The improved counterfactual reasoning and interpretability of such an approach are also advantages over other pluralistic alignment approaches. There are also quantified results with ablation studies across two different datasets, demonstrating performance improvement over prior approaches. The use of a human evaluation also adds additional credibility to the results, showing humans tend to prefer COUPLE outputs.

Weaknesses: Overall, the paper title and introduction motivate steerable value pluralism, which is the ability to align model responses to different sets of multi-dimensional value targets. However, the actual results on this are quite sparse (relegated to Figure 4 b/c/d with not a lot of corresponding explanation in the text). The majority of the paper is focused more on the value abduction side, i.e. being able to accurately infer a set of values from a given response, which is a slightly easier problem.

  • Additional ablation studies showing non-interference in the outputs when doing interventions on non-relevant concepts/values would be useful. This would demonstrate that the structural causal model has learned appropriate relationships, and that non-relevant values and concepts do not have a causal effect on model outputs when perturbed.

  • The interpretability results are a bit weak, simply highlighting common concepts used for particular value dimensions. Doing a user study to see if humans weight similar concepts or an ablation on these concepts and measuring the impact on performance would be interesting to understand whether these are indeed key concepts.

  • The paper is also quite light on qualitative results (even in the appendix), which makes it hard to really evaluate the outputs of the proposed COUPLE approach. Most of the qualitative results preserve the general priority/order of values, simply changing their relative magnitude. It would be interesting to see how outputs of the proposed COUPLE approach change if values priorities are reversed. I believe adding convincing qualitative results would help a reader understand the approach better.

Questions

I have the following questions:

  1. It looks like in the rating scale, the authors combine the relevance and valence of a particular value (see the Kaleido paper by Sorensen et al., '24 for definitions of these terms). Would it make sense to disentangle these two terms, e.g. for values that are not relevant to a given scenario?

  2. How sensitive is the approach to the number of intermediate key value concepts? As these concepts help inform relations between values, how they are defined may impact the structural causal model and downstream results.

  3. How is the threshold on value intervention defined? This seems to have important implications for when the counterfactual reasoning component of the system should be applied.

  4. For the Daily Dilemma dataset, it looks like alignment targets are defined in a scenario-specific manner, which makes it harder to quantify consistency in the alignment and value-conditioned responses across scenarios. Would it make more sense to cluster value profiles across the dataset, and treat these as shared alignment targets across sets of scenarios?

  5. I had a hard time parsing Figure 4 c/d based on the description in the caption and text, which could potentially be improved. Do these two subplots correspond to the two different datasets or something else?

Limitations

The authors list a few limitations in the final section of the paper, although the content is quite light. For example, limitations around the scalability of the approach should be discussed, as the results suggest that with more values, general accuracy degrades (Fig. 4a), which could affect real-world pluralistic alignment, which is often multi-dimensional. There are also potential limitations around using an LLM-based evaluator for measuring alignment, which may inherit the implicit biases related to the underlying model used for evaluation.

Final Justification

I believe the author's rebuttal and additional clarifications/experiments have addressed the majority of my concerns, hence I will increase my final score.

Formatting Concerns

No concerns.

Author Response

Response to Reviewer b5WM:

Thank you for your thorough and constructive feedback. Below we address each of your concerns directly:

W1: Clarify the paper’s focus on counterfactual reasoning.

While much of the paper discusses value abduction, counterfactual reasoning is also given equal attention in our framework. In fact, abduction is a prerequisite for meaningful counterfactual intervention: without accurate value attribution, downstream interventions would be arbitrary.

Furthermore, both the main experimental results and the human evaluation focus on measuring the alignment between our counterfactual generation answers and target values, rather than value abduction.

W2, Q5: Figure 4(b/c/d) explanation

  • Subplot 4b focuses on the alignment distance between the original LLM response and the target values. We group the cases by this distance, computed as $\sum_i |v_i^{\text{original}} - v_i^{\text{target}}|$, and calculate the MAE loss for each group. This shows that the greater the initial misalignment, the harder it is for the model to achieve value alignment—demonstrated by increasing MAE with distance.
    At the same time, our method achieves better alignment performance than the strongest baseline (Plan-and-Solve). Moreover, as the distance increases, the alignment loss of our method does not increase significantly, indicating better robustness.

  • Subplots 4c and 4d illustrate the difficulty of aligning the LLM’s original value on a given dimension to the target aligned value. Each cell in the heatmap (e.g., coordinate 5,1) represents the MAE loss when transforming responses with an original value of 1 to align with a target value of 5. This allows us to quantify how challenging it is to shift model outputs across different value levels. Subplot 4c shows the results for the strongest baseline, while 4d presents the results for our method.
    The results show that it is particularly challenging to change responses with a very high original value to a very low value (and vice versa). Nevertheless, our method achieves strong performance across all value transitions.

W3: Concern about the impact of question-irrelevant values.

We agree this is an important point. We conducted an experiment where we randomly set unrelated values as "important" and evaluated the impact on output MAE. The results are summarized below:

| Metric | Before Intervention | After Intervention |
|---|---|---|
| MAE | 1.28 | 1.33 |

Output MAE before and after randomly setting unrelated values as "important" (100 cases).

The negligible change in MAE demonstrates that interventions on non-relevant concepts/values do not significantly affect the model's outputs, further supporting the robustness of the causal structure learned by the model.

W4: Concern about the concept layer in interpretability results.

We conduct an ablation study to remove key concepts from the model and measure the impact on alignment performance. The results are summarized below:

For 100 randomly sampled questions, we report the average alignment score with the "Achievement" value objective for the US value profile, before and after ablation (removal of Achievement-related concepts):

| Metric | Before Ablation | After Ablation | Target Value |
|---|---|---|---|
| Achievement Alignment (Mean) | 4.18 | 3.36 | 4 |

Table : Average alignment scores for the "Achievement" value objective (US value profile) before and after ablation of Achievement-related concepts (100 random questions).

The observed drop in alignment after ablation quantitatively demonstrates the model’s reliance on key concepts for achieving value-specific alignment, further supporting the interpretability of the model.

W5: Concern about maintaining the priority between values in qualitative case studies results.

To facilitate understanding, we provide additional qualitative examples that illustrate how model outputs change when value priorities are reversed (not simply shifted in magnitude). For the 15 alignment objectives, which include 5 groups and 10 countries, we randomly selected 100 questions from each and evaluated the model's performance on two dimensions (Achievement vs Security).

Overall Statistics (Achievement vs Security, All Groups & Countries):

The table below summarizes the overall statistics for the value pair Achievement and Security across all groups and countries:

| Category | Count | Percentage |
|---|---|---|
| Total Questions | 1500 | 100.00% |
| Target Achievement > Security (Fail) | 87 | 5.80% |
| Target Achievement < Security (Fail) | 102 | 6.80% |
| Expected Relationship | 1311 | 87.40% |
| Total Change Rate | 189 | 12.60% |

In summary, 87.4% of the model outputs successfully maintained the expected value priority relationship between Achievement and Security after intervention, demonstrating that our approach reliably preserves the intended value order in the majority of cases.

Q1: Concern about the relevance and valence of values.

Thank you for pointing out the Kaleido paper. We acknowledge that our current rating combines both relevance and valence, and we will discuss the implications of this in the revision. In our approach, value relevance is determined at the question level (see example in Table 7 in the Appendix): For each question, the set of relevant values is limited—we first extract the possible values applicable to the scenario, and then the model is instructed to output only for these values. We will clarify this distinction in the revised text.

Q2: Sensitivity to number of key concepts

We have added a hyperparameter analysis to test sensitivity to the number of intermediate concepts.

Sensitivity Analysis: Number of Key Concepts. The table below summarizes the impact of varying the number of key concepts on the model performance, measured by MAE and Correlation:

| Number of Key Concepts | MAE | Correlation |
|---|---|---|
| 1 | 1.928 | 0.587 |
| 2 | 1.684 | 0.732 |
| 3 | 1.433 | 0.778 |
| 4 | 1.503 | 0.761 |
| 5 | 1.421 | 0.783 |

Table: Sensitivity analysis of the model performance with different numbers of key concepts, measured by MAE and Correlation.

As the number of key concepts increases, the model performance stabilizes. This is because the number of values involved in each question is fixed, so when the number of concepts is too low, it becomes difficult to generate a good answer. However, when the number of concepts is sufficiently large, performance improves and stabilizes.

Q3: Sensitivity to value intervention threshold

The value intervention threshold is set as a hyperparameter (currently 0, i.e., any change triggers intervention, making it very fine-grained). We will include an ablation showing performance under different thresholds and discuss the rationale and impact.

Formally, for each answer, we compute the sum of value differences across all dimensions where the value in the natural answer differs from the value in the target answer:

$$\text{DiffSum} = \sum_{i \in \mathcal{I}} |v'_i - v^{\text{target}}_i|,$$

where $\mathcal{I} = \{i \mid v^{\text{nat}}_i \neq v^{\text{target}}_i\}$, and $v^{\text{nat}}_i$ and $v^{\text{target}}_i$ denote the value of the $i$-th dimension in the natural answer and the target values, respectively.
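As a small illustration, the snippet below computes DiffSum over the differing dimensions and applies the threshold to decide whether the counterfactual steps should run; the function name and dictionary representation are assumptions for this sketch, not the authors' code.

```python
def needs_intervention(v_nat: dict, v_target: dict, threshold: float = 0.0) -> bool:
    """Sum |v'_i - v_target_i| over dimensions where the natural and target
    profiles differ, and trigger intervention only above the threshold."""
    diff_sum = sum(abs(v_nat[i] - v_target[i])
                   for i in v_target
                   if i in v_nat and v_nat[i] != v_target[i])
    return diff_sum > threshold

# With threshold 0 (the setting reported above), any mismatch triggers intervention.
print(needs_intervention({"security": 4, "achievement": 3},
                         {"security": 5, "achievement": 3}))  # True
```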

| Threshold | MAE | Correlation |
|---|---|---|
| 0 | 1.433 | 0.778 |
| 1 | 1.529 | 0.761 |
| 2 | 1.828 | 0.596 |
| 3 | 2.242 | 0.461 |

Model performance (MAE and Correlation) under different value intervention thresholds.

Q4: Concern about alignment targets for DailyDilemma

There is indeed a challenge here: the DailyDilemma dataset involves 50 dimensions of values, but each character portrayal actually focuses on a few distinct dimensions. Therefore, there is inevitably a difference in values between characters. While we currently adopt scenario-specific alignment targets, this method may make it more difficult to quantify consistency in alignment across different scenarios.

Of course, based on your suggestion, we would be happy to conduct an experiment on clustering the value profiles in the dataset and using them as shared alignment targets to see if it improves this issue.
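To sketch what such shared alignment targets could look like, the example below clusters scenario-level value profiles and reuses the cluster centroids as targets. The data is randomly generated purely for illustration, and the dimensionality (50) simply mirrors the DailyDilemma value set mentioned above; this is not the authors' planned experiment.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Illustrative stand-in: 200 scenario-specific profiles over 50 value dimensions.
profiles = rng.integers(1, 6, size=(200, 50)).astype(float)

# Cluster the profiles; the centroids would serve as shared alignment targets
# reused across all scenarios assigned to the same cluster.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(profiles)
shared_targets = kmeans.cluster_centers_   # (5, 50) shared value targets
scenario_to_target = kmeans.labels_        # cluster id per scenario

print(shared_targets.shape, scenario_to_target[:10])
```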

Limitations: scalability, LLM bias

We acknowledge that the limitations section could be further elaborated. We will expand discussion of:

  • Scalability: As the number of values increases, accuracy drops (see Fig. 4a); this will be elaborated as a key limitation for real-world use. However, we note that the ten Schwartz dimensions already form a comprehensive and widely-accepted value system. According to our statistics, most scenarios (i.e., for each question) typically involve only a small subset of these values, rather than all ten at once. This practical observation mitigates, to some extent, the scalability issue in realistic applications.

  • LLM-based evaluator bias: We will acknowledge this risk and its implications for evaluation fairness. To address this, we have also conducted human-annotated evaluations and survey-based assessments. In addition, we compared the results of LLM-based automatic evaluation with human judgment to ensure consistency between automated and human assessments.

Thank you again for your feedback. We hope the planned experiments and clarifications will fully address your concerns.

Comment

I have read the author's rebuttal, and thank them for their effort in addressing my feedback and questions. Their responses help me with better understanding the proposed approach and corresponding results. As a result, I will change my rating to a 5.

Regarding W5, my intention was to understand whether it was possible to "reverse" a model's output based on swapped value priorities, although the authors seem to show the model is quite robust and preserves relative value priorities across a set of questions. This may be due to the learned relationships in the SCM.

Comment

Thank you very much for updating your score and for your thoughtful engagement with our work. We appreciate your clarification on W5 and your interest in the reversibility of model outputs when value priorities are swapped. In the revised version, we will include additional experimental results and add all the various experiments we previously discussed.

Thank you again for your constructive feedback and support!

Review
Rating: 5

This paper addresses the challenge of aligning large language models (LLMs) with pluralistic human values. The authors propose a novel framework called COUPLE (COUnterfactual reasoning for PLuralistic valuE alignment), which aims to handle two primary challenges in value alignment: value complexity and value steerability. COUPLE leverages a structural causal model (SCM) to model the interdependencies among various human values and uses counterfactual reasoning to adjust LLM outputs based on fine-grained, prioritized value profiles.

The core of COUPLE's methodology involves three steps:

  1. Value Abduction: Infers the value profile underlying a model's response.
  2. Value Intervention: Alters the value priorities if the inferred profile deviates from the target values.
  3. Counterfactual Prediction: Generates a new response under the adjusted value profile, allowing for precise alignment.

The paper perform experiments on two distinct datasets with different value systems.

Strengths and Weaknesses

Strength

  1. Interesting question: the research question of the paper is interesting. Intervening on values with a structural causal model is fascinating.
  2. Completeness of the paper: the paper is well-structured and contains abundant elements.
  3. Nice figures: the figures are nice

Weakness

  1. Insufficient experiments: the authors only perform experiments with 2 backbones, which isn't enough. The authors should experiment with more backbones, including Claude, GPT-4o, etc.
  2. Unclear writing:
    • The full algorithm can be better stated with pseudocode, rather than natural language.
    • line 123: What is the exact meaning of "reflecting the priorities"? Can you write it mathematically?
    • line 127: What variables are selected as endogenous variables?
    • line 130-131: The init graph of SCM is not clearly written. How do you build the init graph
    • line 157: do(V=v) should be do(V=v)
    • line 189-191: What is the meaning of "infer the annotated data"
    • Line 218: The converting process isn't clearly stated.
  3. Unreliable experiments: from lines 189-191, we can see that the optimization objective remains the same as the reported criteria in Table 2. This raises doubts about reporting results with special designation.

Questions

See weakness.

Limitations

See weakness.

Final Justification

The authors have addressed my concerns about backbones and the reliability of the experimental results. Their explanation of multiple lines clearly clarified my confusion. They promised to add the clarifications to the main paper.

Furthermore, inducing the value graph of an LLM for downstream tasks is a good idea to me, and thus, I tend to accept the paper.

Formatting Concerns

No concerns.

Author Response

Response to reviewer e3SB:

W1: Regarding the concern on the number of experimental backbones

We acknowledge the reviewer’s view that having only two backbone models may be insufficient, and using more backbone models would make the paper more convincing. In fact, our current experiments already include two open-source models (Llama-3.1-8B-Instruct and Qwen-2.5-7B-Instruct) and three closed-source models (gpt-4.1-mini, o3-mini, Deepseek-R1) as backbones, covering a diverse range of architectures and model capacities. The results for the closed-source models (gpt-4.1-mini and Deepseek-R1) are reported in Tab.2 of the main text. Results for o3-mini can be found in Appendix C.1.1 (lines 227–229, Tab.5). The two open-source models are presented in Tab.3 and Appendix C.1.2 (lines 235–242, Appendix Tab.12).

Based on the reviewer's suggestion, we conducted experiments and obtained results from GPT-4o-mini. On this backbone, our model's performance is also significantly improved.

Touché23-ValueEval

| Method | MAE | Correlation |
|---|---|---|
| Raw Model | 3.009 | 0.292 |
| Role Prompt | 2.931 | 0.344 |
| Value Prompt | 2.454 | 0.509 |
| Tree of Thought | 2.416 | 0.588 |
| Plan and Solve | 2.589 | 0.456 |
| COUPLE | 2.099 | 0.681 |

DailyDilemma

| Method | MAE | Correlation |
|---|---|---|
| Raw Model | 0.883 | 0.158 |
| Role Prompt | 0.807 | 0.249 |
| Value Prompt | 0.563 | 0.594 |
| Tree of Thought | 0.549 | 0.602 |
| Plan and Solve | 0.582 | 0.579 |
| COUPLE | 0.417 | 0.745 |

W2: Regarding writing clarity about the method framework

The following paragraph provides clarifications regarding the writing. We will also revise and enhance readability further based on these points.

  • Algorithm: The algorithm is presented below in pseudocode with proper mathematical notation and conventions, following best practices from the literature. Due to the limitations of the Markdown environment, the table is somewhat simplified. The full version will use LaTeX in the final paper.
| Step | Action |
|---|---|
| Input | $q$, $r$, $v$ # $q$: question; $r$: natural response; $v$: target values |
| Output | $r_v$ # the generated response after value intervention |
| Step 1.1: Concept Extraction | $C_r \gets F_C(q, r)$ # extract concepts from $r$ |
| Step 1.2: Value Attribution | $v' \gets F_A(C_r)$ # extract original values $v'$ from $C_r$ |
| Step 2: Value Intervention | $v \gets do(v', v)$ # intervene on $v'$ to set it to the target values $v$ |
| Step 3.1: Concept Reasoning | $C_v \gets G_C(q, v, C_r, v')$ # predict new concepts based on $v$ |
| Step 3.2: Response Generation | $r_v \gets G(q, C_v)$ # generate the final response |
| Return | $r_v$ # the final response |
  • Line 123: Here, "priority" refers to the internal ranking among values. For example, if the alignment target places more emphasis on security than on achievement (e.g., $v_{\text{security}}=5$, $v_{\text{achievement}}=3$), then the response should reflect this priority order, with security being prioritized over achievement in the generated answer.

(i) Mathematically, let $s = [s_1, s_2, \ldots, s_n]$ denote the target value scores and $s' = [s_1', s_2', \ldots, s_n']$ the model output. Our objective is twofold:

  • Minimize the score differences: $\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^n |s_i - s_i'|$,
  • Preserve priority: for all $i, j$, if $s_i > s_j$ then $s_i' > s_j'$.

(ii) These objectives directly correspond to the evaluation metrics used in our experiments: MAE quantifies the absolute difference between targets and outputs, while Correlation assesses how well the relative priorities among values are preserved (e.g., via Spearman correlation).
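A small sketch of these two metrics follows, assuming the target and output scores are aligned lists over the same value dimensions; the data below is illustrative only.

```python
from scipy.stats import spearmanr

def alignment_metrics(target, output):
    """MAE measures the absolute score gap; Spearman correlation measures how
    well the relative priority ordering among values is preserved."""
    mae = sum(abs(t - o) for t, o in zip(target, output)) / len(target)
    corr, _ = spearmanr(target, output)
    return mae, corr

# Example: security prioritized over achievement in both target and output.
print(alignment_metrics([5, 3, 2], [4, 3, 1]))
```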

  • Line 127: We appreciate the reviewer’s point. However, as clearly stated in lines 130–131 of our paper, we treat both values and answers as endogenous variables and explicitly model the value-to-answer relationship. By specifying the value→answer mapping and performing interventions, we can conduct further inference (value→?) after perturbation, which helps avoid the influence of exogenous variables. This approach addresses the concern raised and clarifies that the value-to-answer modeling is already handled appropriately in our methodology.

  • Lines 130–131: Thank you for raising this point. To clarify:

(i) Description of the initial SCM graph:
As described in our paper, the initial causal graph in our Structural Causal Model (SCM) is constructed primarily based on domain knowledge and prior literature. Specifically, we define the main relationships—such as value→concept and concept→response—as the backbone structure for downstream causal inference. In our framework, we focus primarily on modeling the relationships among endogenous variables, as counterfactual interventions allow us to mitigate the influence of exogenous factors (e.g., linguistic style, tone, etc.). This approach ensures that our analysis targets the underlying causal mechanisms rather than superficial artifacts in the data.

(ii) Rationale for not explicitly displaying the full graph:
In line with existing work, both node-level and graph-level causal reasoning have been effectively addressed by leveraging the inherent knowledge within large language models (LLMs). Therefore, it is common practice not to explicitly enumerate the entire causal graph. Instead, recent studies typically specify only the key edges relevant to the target tasks, relying on the LLM’s internal representations to capture other dependencies as needed.

(iii) Relevant literature on LLMs + SCM:
This modeling approach is consistent with recent work that integrates LLMs with causal modeling or SCM frameworks. For further reference, please see:

[1] Jing Ma et al. Causal inference with large language model: A survey. arXiv preprint arXiv:2409.09822, 2024.

[2] Emre Kiciman et al. Causal reasoning and large language models: Opening a new frontier for causality. Transactions on Machine Learning Research, 2023.

[3] Zhijing Jin et al. Cladder: Assessing causal reasoning in language models. Advances in Neural Information Processing Systems, 36:31038–31065, 2023.

[4] Yuzhe Zhang et al. Causal graph discovery with retrieval-augmented generation based large language models. arXiv preprint arXiv:2402.15301, 2024.

[5] Amrita Bhattacharjee et al. Towards LLM-guided causal explainability for black-box text classifiers. arXiv preprint arXiv:2309.13340, 2023.

[6] Amrita Bhattacharjee et al. Zero-shot LLM-guided counterfactual generation for text. arXiv e-prints, pages arXiv–2405, 2024.

[7] Amir Feder et al. Data augmentations for improved (large) language model generalization. Advances in Neural Information Processing Systems, 36:70638–70653, 2023.

  • Line 157: We acknowledge the typo and will correct do(V=v) as required.

  • Lines 189–191: To clarify, we use our own annotated data and the "LLM as judge" approach to refine the evaluation metrics and enhance the accuracy of our model's performance assessment. The purpose of this approach is to iteratively improve the effectiveness of LLM as an automatic evaluator, allowing us to better assess model outputs.

While detailed information about this process is provided in Appendix A.2 (lines 23–35), we want to emphasize that this is already included in the original version of our paper. The reference to the appendix is meant to provide further supporting details, but we will explicitly explain this process in the revised manuscript to ensure the information is clear and accessible without requiring the reviewer to revisit the appendix.

  • Line 218: For Touché23-ValueEval, the dataset consists of statements expressing a point of view, such as "One should support homeschooling." We convert these statements into opinion-seeking questions (e.g., "Should we support homeschooling?") in order to elicit value-involving responses from the LLM using the GPT-4o-mini API.
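A minimal sketch of such a conversion step is shown below, assuming the openai Python client (v1-style chat completions API); the prompt wording and function name are illustrative, not the authors' exact prompt.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def statement_to_question(statement: str) -> str:
    # Convert a stance statement (e.g., "One should support homeschooling.")
    # into an opinion-seeking question (e.g., "Should we support homeschooling?").
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Rewrite the following statement as a neutral, "
                       "opinion-seeking question, keeping the topic unchanged:\n"
                       + statement,
        }],
    )
    return resp.choices[0].message.content.strip()

print(statement_to_question("One should support homeschooling."))
```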

W3: Concern about the reliability of experiments

There is a misunderstanding of our task definition. Our focus is on correction and improving accuracy, rather than simply optimizing a single automated metric. Moreover, in addition to automatic evaluations, we also conducted a survey-based evaluation using PVQ (Appendix C2.1, lines 243–253, Tab. 12), so our findings are not solely dependent on the automated evaluator.

Furthermore, we provide additional human evaluation results in the main text (Section 5.3, lines 299–306, Fig. 3), as well as in the appendix (Appendix C.3, lines 254–261, Appendix Fig. 2). Therefore, the concern about reporting results with special designation is unwarranted; our evaluation approach is transparent and multi-faceted.

Comment

I'd like to thank the authors for their clarification and additional experiments! My concerns about backbones and experimental results are addressed. I'll raise my score to 5, and I hope the clarification of writing problems can be added to the paper. Thanks!

Comment

Thank you for updating the score. We believe the quality and clarity of the paper have been improved following your constructive comments.

Comment

Dear Reviewers,

Thank you again for your valuable comments and suggestions, which are really helpful for us. We have conducted additional experiments and posted responses to the detailed concerns.

We understand that the current period is quite busy, as the reviewers may be responding to other assigned papers' rebuttals. We sincerely appreciate it if you could take some time to reply with further feedback on whether our responses have addressed all your concerns. If there are any other comments, we are more than willing to address them.

Best regards,

The authors

Final Decision

My recommendation is to accept the paper.

The paper presents a method for aligning responses to complex value profiles using counterfactual reasoning. The method maps concepts in responses back to underlying values, then intervenes on these values to generate new responses. The authors demonstrate the framework on two benchmarks with several LLM backbones.

Reviewers raised several points about clarity that seemed to be resolved during the discussion period. This resulted in consensus to accept. I would encourage the authors to follow through on promises made during the discussion period.