PaperHub

Overall rating: 6.8/10 · Poster · 4 reviewers
Ratings: 5, 5, 3, 4 (min 3, max 5, std 0.8)
Average confidence: 3.5
Novelty: 2.8 · Quality: 2.8 · Clarity: 3.3 · Significance: 2.5
NeurIPS 2025

Inference-Time Personalized Alignment with a Few User Preference Queries

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29
TL;DR

We propose a novel inference-time personalized alignment method that elicits the user's preferences with a few preference queries.

Abstract

Keywords
personalized alignment, inference-time alignment, user preferences, best of N, best-arm identification, logistic bandits

Reviews and Discussion

Official Review
Rating: 5

This paper proposes USERALIGN, a novel inference-time personalized alignment method for generative models. Unlike previous approaches that require either large amounts of user feedback or explicit text-based preference specification, USERALIGN efficiently elicits user preferences through a small number of pairwise response comparisons. Building on the theoretical framework of logistic bandits and best-arm identification, USERALIGN rapidly selects a response that best matches the user’s preferences from a candidate pool, treating user feedback as consistent and noise-free. Experiments across text and image generation tasks demonstrate that USERALIGN achieves strong personalized alignment with minimal user interaction, outperforming existing baselines. The method is practical, domain-agnostic, and enables fast adaptation to individual user needs at inference time.

Strengths and Weaknesses

Strengths:

  1. USERALIGN efficiently aligns generative model outputs to individual user preferences with only a few pairwise comparisons, greatly reducing the amount of required user interaction.

  2. The method is theoretically grounded, leveraging best-arm identification in logistic bandits to quickly and reliably find the best response.

  3. USERALIGN is domain-agnostic and can be applied to both text and image generation tasks, demonstrating strong performance across multiple application areas.

Weaknesses:

  1. USERALIGN relies on a fixed pool of pre-generated candidate responses, which limits the flexibility to generate new responses based on user feedback during the interaction process.

  2. The method assumes user feedback is always consistent, which may not be realistic in practical, real-world applications.

  3. Most experiments are conducted with simulated users instead of real human participants, so the real-world effectiveness and user experience of USERALIGN remain to be validated.

Questions

  1. How would USERALIGN perform if user feedback is inconsistent or noisy, and do you have any strategies in mind for handling such cases?

  2. Can USERALIGN be extended to dynamically generate new candidate responses during the interaction process, instead of relying on a fixed response pool?

  3. Can USERALIGN be extended to settings where user attributes or interaction history are provided for personalized alignment?

Limitations

NA

Final Justification

Concerns addressed.

Formatting Concerns

NO

Author Response

Thank you for carefully reviewing our paper! We greatly appreciate your feedback. Please see below our responses related to your comments.

About the use of a fixed candidate pool and potential for dynamic generation

We thank the reviewer for raising this point as it aligns closely with a direction we’ve been eager to explore. We see promising opportunities to extend USERALIGN with dynamic pool augmentation, where new candidates are synthesized based on the user’s inferred preference distribution -- specifically targeting underexplored regions of the embedding space. While our current approach uses a fixed candidate pool, it naturally selects responses that lie within the feasible (i.e., consistent) region. Additionally, the pool can be enriched on-the-fly by biasing generation toward examples aligned with previously preferred responses.

One compelling way to bias the generator and enforce feasibility is as follows: once the user selects a few preferred responses, we feed those back into the generator -- e.g., by conditioning it on the last k accepted outputs -- so that it produces candidates aligned with that style or direction. Instead of presenting every new output, we then apply USERALIGN’s version-space filter: only those responses that satisfy all prior pairwise comparison constraints (i.e., lie within the current feasible region defined by intersecting half-spaces) are shown to the user. This ensures that dynamically generated candidates both reflect the user’s evolving preferences and remain consistent with all previous feedback. On the theoretical side, accounting for the non-stationarity due to dynamically generated candidates would introduce significant complexity and, we believe, deserves a dedicated follow-up investigation.
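To make the version-space filter described above concrete, here is a minimal sketch (our own illustration, not the paper's implementation), assuming each past comparison (winner, loser) induces the half-space constraint that a consistent preference vector scores the winner at least as high as the loser; the function name `is_viable`, the `margin` slack, and the box bounds on the preference vector are hypothetical choices.

```python
# Hypothetical sketch of the version-space viability check described above.
# A newly generated candidate is shown to the user only if some preference
# vector theta, consistent with every past pairwise comparison, would still
# rank it above the current incumbent response.
import numpy as np
from scipy.optimize import linprog

def is_viable(cand_emb, incumbent_emb, winner_embs, loser_embs, margin=1e-6):
    """Feasibility of: theta . (w - l) >= margin for all past comparisons,
    and theta . (candidate - incumbent) >= margin."""
    diffs = [w - l for w, l in zip(winner_embs, loser_embs)]
    diffs.append(cand_emb - incumbent_emb)
    A_ub = -np.stack(diffs)                     # linprog expects A_ub @ x <= b_ub
    b_ub = -margin * np.ones(len(diffs))
    d = cand_emb.shape[0]
    res = linprog(c=np.zeros(d), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(-1.0, 1.0)] * d, method="highs")
    return res.status == 0                      # feasible => candidate is viable
```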

About the real‑world validity and a new user study

Our work focuses on short-term, task-specific interaction sessions, where users typically have clear and stable preferences. In such practical scenarios, assuming consistent and noise-free feedback is both reasonable and realistic. To tackle the reviewers’ concerns about real-world validity of our methodology, we would like to present results from a user study we have conducted. Given the limited time during rebuttal and high costs involved, we managed to conduct the user study for only one domain (food64d). Below we present the details of the study and the results. We will add the details about the user study and results to the updated version of the paper.

As part of this project, we had developed a lightweight web application to expose the methods considered in the paper through an interactive interface. We have now used this platform to conduct a user study with about 500 crowdsourcing participants to evaluate USERALIGN and IIDBest under realistic, noisy conditions in the food64d domain. Each participant was randomly provided with a question from the domain’s question set, and assigned either USERALIGN or IIDBest as their interaction method (blinded to the user). Participants were also placed in one of two personalization conditions: in the with persona condition, they saw a concise persona description (similar to our simulated users experiments) and were instructed to roleplay that persona; in the without persona condition, no persona was shown and participants made selections based on their own preferences.

Each participant interacted with the system in three stages. Stage 1 presented the question and either a persona (in the with‑persona condition) or an instruction to reflect on how the participant would personally like the question to be answered (in the without‑persona condition). Stage 2 consisted of a series of 10 pairwise comparisons, during which participants chose between two options that best fit the persona or their personal preference. In Stage 3, participants compared three final candidates against the zero‑temperature baseline response for their assigned question: the method’s selection at t = 10, the midpoint selection at t = 5, and a candidate chosen by the Random method (again, blinded to the user).

To ensure a fair comparison between methods and to keep the cognitive burden manageable for participants, we fixed the number of allowed comparisons during interaction for both methods to 10. This setting was chosen to allow enough interaction steps while staying within a manageable usage scenario. The candidate pool size was the same as that used in the main experiment, i.e., 20 generations.

For each of USERALIGN and IIDBest, we collected n = 120 evaluation sessions per setting; for Random, which does not depend on interaction steps, we recorded 2n = 240 samples. The tables below report the resulting win‑rates % across both personalization conditions.

  • food64d - with persona

| Method    | t = 5        | t = 10       |
|-----------|--------------|--------------|
| Random    | 44.17 (3.21) | 44.17 (3.21) |
| IIDBest   | 71.67 (4.11) | 78.33 (3.76) |
| USERALIGN | 82.50 (3.47) | 89.17 (2.84) |

  • food64d - without persona

| Method    | t = 5        | t = 10       |
|-----------|--------------|--------------|
| Random    | 43.75 (3.20) | 43.75 (3.20) |
| IIDBest   | 73.33 (4.04) | 76.67 (3.86) |
| USERALIGN | 79.17 (3.71) | 85.83 (3.18) |
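For reference, the win-rate percentages and the parenthetical values can be reproduced with the following minimal sketch, under the assumption (ours) that the parentheses report the standard error of a binomial proportion; e.g., 89.17 (2.84) is consistent with 107 wins out of n = 120 sessions.

```python
# Minimal sketch: win-rate percentage and its standard error, assuming the
# parenthetical values in the tables are binomial standard errors (our reading).
import math

def win_rate_with_se(wins, n):
    p = wins / n
    se = math.sqrt(p * (1 - p) / n)
    return 100 * p, 100 * se

print(win_rate_with_se(107, 120))  # approx (89.17, 2.84), matching USERALIGN at t = 10
```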

About personalization and the use of user attributes or history

We thank the reviewer for raising this point! We are intrigued by this idea and would like to further expand on it. Our current work focuses on short-term, task-specific interaction sessions, where user preferences are assumed to be stable and well-defined. Nonetheless, as the reviewer suggests, incorporating user attributes or interaction history is an interesting direction for future work. Such information could be used to bias the generation of the candidate pool during interaction. However, this would likely require adapting our method to handle the non-stationarity introduced by evolving user profiles. To maintain stability, one could also consider gradually discounting older user comparisons over time.
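As a purely hypothetical illustration of the discounting idea mentioned above (not part of the current method), older comparisons could receive exponentially decaying weights in the preference-learning objective:

```python
# Hypothetical sketch: exponential discounting of older comparisons, so that
# recent feedback dominates if the user's profile drifts over a long session.
import numpy as np

def discounted_weights(num_comparisons, gamma=0.9):
    t = num_comparisons
    return np.array([gamma ** (t - i) for i in range(1, t + 1)])

# e.g. 5 comparisons, gamma = 0.9 -> [0.6561, 0.729, 0.81, 0.9, 1.0]
```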

Comment

As the discussion period is ending soon, we are writing to thank the reviewer for their constructive feedback. We hope that our responses have addressed your concerns and are helpful in improving your rating. We will incorporate the reviewer's feedback and our responses (including additional experiments and discussions) in the paper.

Comment

Thanks for your response. I will raise my score.

Comment

We would like to thank the reviewer for their constructive feedback and appreciate the reviewer's input in helping us improve the paper.

Official Review
Rating: 5

The authors propose to conduct inference-time alignment with limited user interactions. The key idea is to actively pick suitable pairs for learning. Specifically, the "high quality" response is selected via MLE optimization on the features, and the "low quality" response is selected by constructing a confidence set, followed by MLE optimization on both parameters and responses. To overcome the slow optimization speed problem, the authors introduce a practical set to shrink the confidence set. Results show that the proposed method, USERALIGN, performs well w.r.t. the speed of alignment with limited interaction steps.

Strengths and Weaknesses

Strengths

  • The method is intuitively sound, theoretically backed up and novel.
  • The figure and algorithm box are clear and help explain the idea and motivation.
  • Results are good across selected baselines.

Weaknesses

  • The experiment results lack a direct 1-1 comparison. All results are computed against a fixed response; however, this does not explicitly tell the quality gap between different methods. For example, a high win rate does not denote a high-quality gap against a fixed response. Therefore, a direct comparison between different methods' generation is beneficial.

  • Following my question in the following section, the selection of y_2 is a major contribution. Because the current confidence-set-based implementation is quite complex, I wonder if a simpler approach will yield similar performance: we always set the confidence set to $\hat{\theta}_t$. Will the results be significantly affected in practice?

Questions

  • Can authors explain how line 8 in algorithm two is computed? To my understanding, authors need to repeat the MLE optimization across the whole candidate set. How is the optimization conducted in a constrained space? How long will the optimization take for this line?

Limitations

yes

Final Justification

My main concern about 1-1 comparison and efficiency is well justified by the authors through updated results.

Formatting Concerns

N/A

Author Response

Thank you for carefully reviewing our paper! We greatly appreciate your feedback. Please see below our responses related to your comments.

1. The experiment results lack a direct 1-1 comparison. All results are computed against a fixed response; however, this does not explicitly tell the quality gap between different methods. For example, a high win rate does not denote a high-quality gap against a fixed response. Therefore, a direct comparison between different methods' generation is beneficial.

We appreciate the reviewer’s concern and have given careful consideration to the design of our evaluation setup. We intentionally chose to compare each method’s selected response against a shared baseline, which serves as a consistent and neutral reference point across all methods. We believe this provides a meaningful and scalable proxy for relative quality: if a method consistently outperforms the baseline more often, it is producing responses that better align with user preferences (or those of the simulated model). This setup allows for fair comparisons across methods while maintaining a consistent evaluation scale.

Our choice follows best practices in the preference optimization literature, where win rates against a fixed baseline are commonly reported. For instance, SimPO explicitly reports win rates against a baseline model for the Arena-Hard benchmark (Meng et al., 2024). Similarly, Rafailov et al. (2023) evaluate their algorithms by win rate against a baseline policy, employing GPT-4 as a proxy evaluator, highlighting that theoretical analyses often rely on such a baseline as a reference point.

Nonetheless, as per the reviewer’s feedback, we conducted a preliminary experiment to check whether the performance results seen w.r.t. the baseline also translate into direct 1-1 comparisons. Across different settings, we have observed that a higher win-rate against the baseline translates to a higher win-rate in direct 1-1 comparisons. In the updated version of the paper, we will report results for direct 1-1 comparisons between methods.

2. Following my question in the following section, the selection of y_2 is a major contribution. Because the current confidence-set-based implementation is quite complex, I wonder if a simpler approach will yield similar performance: we always set the confidence set to $\hat{\theta}_t$. Will the results be significantly affected in practice?

We thank the reviewer for raising this point about how our method could be simplified. After careful analysis, we would like to note that always setting the confidence set to $\hat{\theta}_t$ will not provide us with a way to choose a pair of responses. The purpose of maintaining a confidence set is to systematically reduce uncertainty in $\hat{\theta}_t$, which is key to effective preference elicitation.

As discussed in Appendix E.1, our method (USERALIGN) remains computationally efficient, relying primarily on simple convex optimization.
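As a point of reference for what this convex optimization can look like, below is a minimal sketch (ours, not the authors' code) of fitting $\hat{\theta}_t$ by regularized logistic maximum likelihood on the comparisons collected so far; the helper name `mle_theta` and the ridge penalty are assumptions.

```python
# Sketch of a logistic MLE for the preference vector theta_hat_t: each past
# comparison contributes the feature difference phi(winner) - phi(loser), and
# the (regularized) negative log-likelihood below is convex in theta.
import numpy as np
from scipy.optimize import minimize

def mle_theta(winner_embs, loser_embs, reg=1e-3):
    X = np.stack([w - l for w, l in zip(winner_embs, loser_embs)])

    def neg_log_lik(theta):
        z = X @ theta
        return np.sum(np.log1p(np.exp(-z))) + reg * theta @ theta

    res = minimize(neg_log_lik, x0=np.zeros(X.shape[1]), method="L-BFGS-B")
    return res.x
```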

3. Can authors explain how line 8 in algorithm two is computed? To my understanding, authors need to repeat the MLE optimization across the whole candidate set. How is the optimization conducted in a constrained space? How long will the optimization take for this line?

Line 8 of Algorithm 2 involves solving a set of convex optimization problems -- one per candidate in the pool -- with a linear objective and a convex constraint set. These problems are computationally efficient to solve. We discuss computational efficiency in detail in Appendix E.
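Since the exact confidence set is specified in the paper rather than here, the following is only an illustrative sketch of the structure described above (one convex problem per candidate, with a linear objective over a convex constraint set); the L2-ball-around-$\hat{\theta}_t$ constraint, the radius `beta`, and the function name `best_case_gap` are our assumptions, not the authors' formulation.

```python
# Illustrative per-candidate convex program: maximize the candidate's advantage
# over the current champion across all theta in an assumed confidence set
# (here: an L2 ball around theta_hat intersected with past comparison half-spaces).
import cvxpy as cp
import numpy as np

def best_case_gap(theta_hat, beta, cand_emb, champ_emb, winner_embs, loser_embs):
    theta = cp.Variable(theta_hat.shape[0])
    constraints = [cp.norm(theta - theta_hat, 2) <= beta]
    constraints += [theta @ (w - l) >= 0 for w, l in zip(winner_embs, loser_embs)]
    prob = cp.Problem(cp.Maximize(theta @ (cand_emb - champ_emb)), constraints)
    prob.solve()
    return prob.value  # > 0 means the candidate could still beat the champion
```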

Comment

As the discussion period is ending soon, we are writing to thank the reviewer for their constructive feedback. We hope that our responses have addressed your concerns and are helpful in improving your rating. We will incorporate the reviewer's feedback and our responses (including additional experiments and discussions) in the paper.

Comment

Thanks for the rebuttal! I appreciate the authors' plan to report the 1-1 comparison in the future.

Regarding the efficiency: I appreciate the analysis in Appendix E, while I still think a better method is to report the wall-time of the optimization (same as the wall-time reported in data generation in Appendix E). Without it, I cannot have a clear picture of how the method's efficiency performs in the real world.

I will maintain my tendency to accept, but not to a higher score, because the current version lacks 1-1 comparison results and wall-time efficiency.

Comment

We sincerely thank the reviewer for their engagement during the discussion period. To further address the reviewer’s concerns, we would like to share the following results as a follow-up to our earlier rebuttal.

1. Direct 1-1 comparisons between methods

In the past few days, we completed additional experiments to compute direct 1-1 win-rates between USERALIGN and IIDBest. Below we report results for two domains (food64d and visual512d). These results confirm that the superior performance of USERALIGN vs. IIDBest (when measuring win-rate w.r.t. a fixed baseline) also translates to high win-rates of USERALIGN in direct 1-1 comparisons with IIDBest.

The results reported below complement the results in Tables 3 and 4 (Appendix C). To report the win-rates below, we conducted fresh GPT-based pairwise evaluations. We will add these direct 1-1 comparison results for all domains and methods in the updated paper.

  • Domain: food64d

| Method | Win-rate |
|--------|----------|
| USERALIGN ($\epsilon=0.0$, cost=19.13) vs. Baseline | 96.67% (1.47) |
| IIDBest (t=20) vs. Baseline | 88.00% (2.65) |
| IIDBest (t=25) vs. Baseline | 90.00% (2.45) |
| IIDBest (t=50) vs. Baseline | 90.67% (2.38) |
| USERALIGN ($\epsilon=0.0$, cost=19.13) vs. IIDBest (t=20) | 96.67% (1.47) |
| USERALIGN ($\epsilon=0.0$, cost=19.13) vs. IIDBest (t=25) | 97.33% (1.32) |
| USERALIGN ($\epsilon=0.0$, cost=19.13) vs. IIDBest (t=50) | 96.67% (1.47) |

  • Domain: visual512d

| Method | Win-rate |
|--------|----------|
| USERALIGN ($\epsilon=0.0$, cost=49.31) vs. Baseline | 96.67% (1.47) |
| IIDBest (t=20) vs. Baseline | 81.33% (3.18) |
| IIDBest (t=25) vs. Baseline | 76.67% (3.45) |
| IIDBest (t=50) vs. Baseline | 79.33% (3.31) |
| USERALIGN ($\epsilon=0.0$, cost=49.31) vs. IIDBest (t=20) | 85.33% (2.89) |
| USERALIGN ($\epsilon=0.0$, cost=49.31) vs. IIDBest (t=25) | 85.33% (2.89) |
| USERALIGN ($\epsilon=0.0$, cost=49.31) vs. IIDBest (t=50) | 87.33% (2.72) |

2. Wall-time results for the optimization

Below we report the average wall-time (in seconds) for different computations done by the algorithm USERALIGN in a single step $t$:

  • Computation of $\hat{\theta}_t$
  • Computation of $y^1_t$
  • Computation of $y^2_t$

| Domain | Wall-time Total (s) | Wall-time $\hat{\theta}_t$ (s) | Wall-time $y^1_t$ (s) | Wall-time $y^2_t$ (s) |
|--------|---------------------|--------------------------------|-----------------------|-----------------------|
| food64d (pool=20)    | 0.1287 | 0.0100 | 0.0002 | 0.1187 |
| visual512d (pool=40) | 0.8719 | 0.0328 | 0.0003 | 0.8392 |

As can be seen from these results, the wall-time in a single step $t$ is under a second, making it usable for real-world settings. In fact, our lightweight web application used for conducting the user study (as mentioned in our response to other reviewers) had the same implementation of the algorithm. We will update Appendix E with these additional details.


We hope that our responses address your concerns and are helpful for improving your rating. We appreciate the reviewer's input in helping us improve the paper!

Comment

I sincerely thank the authors for the extra experiment and the commitment to update the paper accordingly. The results resolve my concern, and I raised my score.

Comment

We sincerely thank the reviewer for their engagement during the discussion period and appreciate the reviewer's input in helping us improve the paper.

Official Review
Rating: 3

This paper presents USERALIGN, an inference-time personalized alignment method that efficiently elicits user preferences through a few pairwise response comparisons. By leveraging pretrained embeddings and version-space elimination in logistic bandits, USERALIGN rapidly identifies user-preferred responses from a candidate pool. Experimental results demonstrate that USERALIGN achieves strong personalization with minimal user interaction across diverse text and image generation tasks.

Strengths and Weaknesses

Strengths

  • Efficient Personalization: USERALIGN enables fast and sample-efficient inference-time personalization by identifying user-preferred outputs with only a few pairwise queries, reducing the amount of required user interaction.
  • Theoretical Guarantees: The method is grounded in a rigorous logistic bandit framework, providing theoretical analysis for convergence and query efficiency in preference identification.
  • Broad applicability: USERALIGN is demonstrated across diverse domains—including text and image generation—showing robust performance.

Weaknesses

  • Practicality: While USERALIGN claims to reduce user interaction cost, it is unclear if multiple pairwise queries are actually less burdensome than simply letting users select their favorite from the full candidate pool (best-of-N). For reasonable pool sizes, inspecting all options at once may be equally or more efficient, but the paper does not provide a direct comparison.
  • Concerns on Feature Extraction: The method depends on pretrained sentence/image embeddings to represent user preferences, but such models may not capture subtle semantic or nuanced differences between candidates. If the embedding space lacks sufficient granularity or expressiveness, the method may struggle to distinguish fine-grained preferences.
  • Lack of Human Evaluation: All experiments use GPT-based simulated users for preference feedback. There is no human study to validate whether the results generalize to real users or truly reflect actual human preferences.
  • The setup assumes a user will repeatedly provide feedback on the same pool of candidates at test time, which may be unrealistic or impractical in many real-world applications.

Questions

  • How can your method of multiple pairwise comparisons be more efficient compared to letting users select their favorite from the full candidate pool at once?
  • How can you justify the practicality of repeatedly collecting feedback on the same candidate pool at test time in real-world scenarios?
  • How can you ensure that your method still works if the embedding model fails to capture subtle differences in user preference?
  • How can you be confident that results obtained with simulated users will hold for real human preferences?

Limitations

Yes

Final Justification

The rebuttal clarified that the method can reflect real-world human preferences, supported by new human evaluation results. Some concerns remain regarding the practicality of generating a large candidate pool and the cost of multiple user interactions. Nevertheless, the method’s strong theoretical foundation, query efficiency, and solid empirical results justify an increased score, but it remains a borderline reject.

Formatting Concerns

No concern

Author Response

Thank you for carefully reviewing our paper! We greatly appreciate your feedback. Please see below our responses related to your comments.

About the practicality of pairwise comparisons vs. best-of-N selection

We appreciate the reviewer’s suggestion regarding best-of-N selection. Our approach is designed to efficiently infer user preferences and automatically select the best response, rather than relying on the user to manually choose from a full set of N generated options.

Presenting all N candidates at once can significantly increase cognitive load, requiring users to compare multiple options simultaneously and make fine-grained distinctions -- often resulting in less reliable feedback. In contrast, pairwise comparisons reduce this burden by allowing users to focus on just two options at a time, facilitating more consistent and focused judgments. This is supported by findings in the preference modeling literature, where pairwise comparisons are widely used due to their robustness and lower cognitive demand. For example, Ouyang et al. (2022) on InstructGPT and Rafailov et al. (2023) on DPO highlight the effectiveness of pairwise feedback in eliciting higher-quality preference data from human annotators.

Moreover, our method is designed to avoid showing the same pair multiple times, encouraging exploration of new generations while consistently tracking the most promising candidate. This strategy enables both diversity and convergence, making the interaction both efficient and user-friendly.

To empirically evaluate how our approach performs in settings where best-of-N selection becomes infeasible, we conducted experiments with a larger candidate pool (N = 1000). In such high-N settings, asking users to directly select the best response from the full pool would be cognitively unmanageable and practically infeasible. Instead, by relying solely on pairwise comparisons, our method remains both usable and effective. The results, shown below, demonstrate that USERALIGN maintains strong performance even in this high‑N regime, highlighting the method’s scalability and robustness. We will include these findings, along with an expanded motivation for pairwise comparisons, in the updated version of the paper.

  • food64d - pool size = 20

| Method    | t = 0        | t = 5        | t = 10        | t = 20        |
|-----------|--------------|--------------|---------------|---------------|
| Random    | 54.00 (4.08) | 54.00 (4.08) | 54.00 (4.08)  | 54.00 (4.08)  |
| IIDBest   | 54.00 (4.08) | 87.33 (2.72) | 88.67 (2.60)  | 90.00 (2.46)  |
| USERALIGN | 54.00 (4.08) | 98.00 (1.15) | 100.00 (0.00) | 100.00 (0.00) |

  • food64d - pool size = 1000

| Method    | t = 0        | t = 5        | t = 10       | t = 20       |
|-----------|--------------|--------------|--------------|--------------|
| Random    | 52.00 (4.09) | 52.00 (4.09) | 52.00 (4.09) | 52.00 (4.09) |
| IIDBest   | 52.00 (4.09) | 72.00 (3.68) | 76.00 (3.50) | 82.67 (3.10) |
| USERALIGN | 52.00 (4.09) | 86.67 (2.78) | 90.67 (2.38) | 96.67 (1.47) |

About the new user study and human evaluation

To tackle the reviewers’ concerns about real-world validity of our methodology, we would like to present results from a user study we have conducted. Given the limited time during rebuttal and high costs involved, we managed to conduct the user study for only one domain (food64d). Below we present the details of the study and the results. We will add the details about the user study and results to the updated version of the paper.

As part of this project, we had developed a lightweight web application to expose the methods considered in the paper through an interactive interface. We have now used this platform to conduct a user study with about 500 crowdsourcing participants to evaluate USERALIGN and IIDBest under realistic, noisy conditions in the food64d domain. Each participant was randomly provided with a question from the domain’s question set, and assigned either USERALIGN or IIDBest as their interaction method (blinded to the user). Participants were also placed in one of two personalization conditions: in the with persona condition, they saw a concise persona description (similar to our simulated users experiments) and were instructed to roleplay that persona; in the without persona condition, no persona was shown and participants made selections based on their own preferences.

Each participant interacted with the system in three stages. Stage 1 presented the question and either a persona (in the with‑persona condition) or an instruction to reflect on how the participant would personally like the question to be answered (in the without‑persona condition). Stage 2 consisted of a series of 10 pairwise comparisons, during which participants chose between two options that best fit the persona or their personal preference. In Stage 3, participants compared three final candidates against the zero‑temperature baseline response for their assigned question: the method’s selection at t = 10, the midpoint selection at t = 5, and a candidate chosen by the Random method (again, blinded to the user).

To ensure a fair comparison between methods and to keep the cognitive burden manageable for participants, we fixed the number of allowed comparisons during interaction for both methods to 10. This setting was chosen to allow enough interaction steps while staying within a manageable usage scenario. The candidate pool size was the same as that used in the main experiment, i.e., 20 generations.

For each of USERALIGN and IIDBest, we collected n = 120 evaluation sessions per setting; for Random, which does not depend on interaction steps, we recorded 2n = 240 samples. The tables below report the resulting win‑rates % across both personalization conditions.

  • food64d - with persona

| Method    | t = 5        | t = 10       |
|-----------|--------------|--------------|
| Random    | 44.17 (3.21) | 44.17 (3.21) |
| IIDBest   | 71.67 (4.11) | 78.33 (3.76) |
| USERALIGN | 82.50 (3.47) | 89.17 (2.84) |

  • food64d - without persona

| Method    | t = 5        | t = 10       |
|-----------|--------------|--------------|
| Random    | 43.75 (3.20) | 43.75 (3.20) |
| IIDBest   | 73.33 (4.04) | 76.67 (3.86) |
| USERALIGN | 79.17 (3.71) | 85.83 (3.18) |

About use of pretrained embeddings and feature granularity

In the paper, we demonstrate that our algorithm performs well in different preference spaces defined by pretrained sentence and image embeddings. In the additional user study we conducted, we further validated that these representations are sufficient to capture user preferences in a meaningful and robust way. While pretrained embeddings may have limitations in capturing fine-grained nuances, our method remains flexible -- specialized (e.g., handcrafted or learned) embeddings can easily be integrated to capture additional granularity when needed.
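For concreteness, a minimal sketch of embedding a candidate pool with a general-purpose pretrained text encoder (the specific checkpoint and the example candidates below are our assumptions, not necessarily those used in the paper):

```python
# Sketch: embed a candidate pool with a pretrained sentence encoder; the
# resulting vectors define the preference space USERALIGN operates over.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint
candidates = [
    "A quick vegetarian stir-fry with tofu and seasonal vegetables.",
    "A slow-braised short rib with a red wine reduction.",
]
embeddings = model.encode(candidates, normalize_embeddings=True)  # shape (N, d)
```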

Comment

Dear authors, thank you for the clarification and additional experiments!

Through your rebuttal, I acknowledge that the proposed method can reflect real-world human preferences.

Though I still have some concerns about the practicality in real-world settings, namely the inference cost of generating candidate outputs and the need for multiple rounds of user interaction (which also add latency and human effort).

However, I see that your framework is novel and has the potential to be useful for other applications, such as constructing high-quality preference datasets for training or adaptive pool generation (as mentioned in Section 6 of the paper). Importantly, the theoretical groundedness and superior results are promising.

Therefore, I raise my score.

Comment

We sincerely thank the reviewer for their constructive feedback and engagement during the discussion period! We will incorporate the reviewer's feedback and our responses (including additional experiments and discussions) in the paper. We appreciate the reviewer's input in helping us improve the paper!

Official Review
Rating: 4

This paper proposes a new algorithm, UserAlign, for online inference-time personalization. The algorithm is based on (1) the best-arm identification framework for logistic bandits and (2) version-space elimination under the assumption of consistent and noise-free user preferences. The authors analyse convergence via a loss-based confidence set and experiment on text data with a sentence transformer and image data with OpenCLIP. Their user personas and candidate responses are generated by prompting GPT. The advantages of UserAlign are (1) sample efficiency: it only requires a few user interactions to collect comparison preference data; and (2) model independence: it can be applied to different models and modalities.

Strengths and Weaknesses

Strengths:

  • The algorithm is sample efficient under certain assumptions and transferable across models and modalities
  • The theoretical analysis and experiment setup and results are clear and well described.

Weaknesses:

  • No real-world dataset experiment. It might be the case that real-world settings do not satisfy the consistent and noise-free assumption. In such a case, what would the sample complexity be? Can the assumptions be checked empirically in some way?
  • The method requires a pool of already generated responses that are diverse enough. This may not be easily satisfied all the time.

Questions

  • Is there a way to empirically validate / invalidate the core assumptions on user feedback (consistent and noise-free) used in the paper? For example, is it possible to identify the level of “consistent and noise-free” of the used dataset? How sensitive does the proposed method depend on the assumption?

  • Are there more details about the diversity of the candidate pool?

  • For preferences that have multiple dimensions, would the candidate pool ensure that all the preferences are satisfied? Is it possible to provide an example of candidate pool? Does the case “the algorithm will resort to i.i.d. sampling of y_(2) if no viable response remains” occur, and how often?

Limitations

Yes.

Formatting Concerns

No.

Author Response

Thank you for carefully reviewing our paper! We greatly appreciate your feedback. Please see below our responses related to your comments.

About the real‑world validity and a new user study

Our work focuses on short-term, task-specific interaction sessions, where users typically have clear and stable preferences. In such practical scenarios, assuming consistent and noise-free feedback is both reasonable and realistic. To tackle the reviewers’ concerns about real-world validity of our methodology, we would like to present results from a user study we have conducted. Given the limited time during rebuttal and high costs involved, we managed to conduct the user study for only one domain (food64d). Below we present the details of the study and the results. We will add the details about the user study and results to the updated version of the paper.

As part of this project, we had developed a lightweight web application to expose the methods considered in the paper through an interactive interface. We have now used this platform to conduct a user study with about 500 crowdsourcing participants to evaluate USERALIGN and IIDBest under realistic, noisy conditions in the food64d domain. Each participant was randomly provided with a question from the domain’s question set, and assigned either USERALIGN or IIDBest as their interaction method (blinded to the user). Participants were also placed in one of two personalization conditions: in the with persona condition, they saw a concise persona description (similar to our simulated users experiments) and were instructed to roleplay that persona; in the without persona condition, no persona was shown and participants made selections based on their own preferences.

Each participant interacted with the system in three stages. Stage 1 presented the question and either a persona (in the with‑persona condition) or an instruction to reflect on how the participant would personally like the question to be answered (in the without‑persona condition). Stage 2 consisted of a series of 10 pairwise comparisons, during which participants chose between two options that best fit the persona or their personal preference. In Stage 3, participants compared three final candidates against the zero‑temperature baseline response for their assigned question: the method’s selection at t = 10, the midpoint selection at t = 5, and a candidate chosen by the Random method (again, blinded to the user).

To ensure a fair comparison between methods and to keep the cognitive burden manageable for participants, we fixed the number of allowed comparisons during interaction for both methods to 10. This setting was chosen to allow enough interaction steps while staying within a manageable usage scenario. The candidate pool size was the same as that used in the main experiment, i.e., 20 generations.

For each of USERALIGN and IIDBest, we collected n = 120 evaluation sessions per setting; for Random, which does not depend on interaction steps, we recorded 2n = 240 samples. The tables below report the resulting win‑rates % across both personalization conditions.

  • food64d - with persona

| Method    | t = 5        | t = 10       |
|-----------|--------------|--------------|
| Random    | 44.17 (3.21) | 44.17 (3.21) |
| IIDBest   | 71.67 (4.11) | 78.33 (3.76) |
| USERALIGN | 82.50 (3.47) | 89.17 (2.84) |

  • food64d - without persona

| Method    | t = 5        | t = 10       |
|-----------|--------------|--------------|
| Random    | 43.75 (3.20) | 43.75 (3.20) |
| IIDBest   | 73.33 (4.04) | 76.67 (3.86) |
| USERALIGN | 79.17 (3.71) | 85.83 (3.18) |

About the diversity of the generated candidate pool

We appreciate the reviewer’s concern about candidate pool diversity. We agree that integrating with separate, domain-specific generation tools that explicitly aim to diversify generations across multiple or complex preference dimensions could make USERALIGN even more powerful. More generally, domains with known axes of variation (e.g., food2d) offer natural opportunities to enhance diversity during candidate generation. However, USERALIGN itself is fully agnostic to the generation process: it operates over any fixed set of candidates and makes no assumptions beyond the ability to embed them. In our experiments, we demonstrate that even when using general-purpose embedding models, USERALIGN remains effective across a variety of domains.

Regardless of the domain, increasing either the diversity or the size of the candidate pool would theoretically enhance alignment performance, as it becomes more likely to include responses closely matching nuanced user preferences. But even with relatively modest pool sizes (20 or 40 candidates), USERALIGN already achieves robust alignment, indicating our current generative strategy involving LLMs reasoning about potential interests provides a reasonable basis.

Finally, as shown in Appendix D, the algorithm remains effective even with unbiased pools, where we did not aim to induce a larger diversity through prompting. Examples of candidate pools are shown in Figure 1 and Appendix Figure 5.

Clarification on i.i.d. fallback behavior

We would like to clarify when USERALIGN resorts to i.i.d. sampling. By default, the algorithm stops once it reaches sufficient confidence and no viable challengers remain. However, in part of our experiments (see Section 5.2), we fixed the number of comparisons to T for consistent evaluation across methods. In those cases, once the preference space was exhausted, we continued by sampling comparisons uniformly at random. In real deployments, the algorithm would terminate early once it converges, without requiring i.i.d. sampling.

Comment

As the discussion period is ending soon, we are writing to thank the reviewer for their constructive feedback. We hope that our responses have addressed your concerns and are helpful in improving your rating. We will incorporate the reviewer's feedback and our responses (including additional experiments and discussions) in the paper.

Comment

Dear Reviewers,

Thank you for your time and effort in reviewing this manuscript. The authors have submitted their rebuttal, and we would greatly appreciate it if you could review their responses at your earliest convenience.

If you have any further questions or concerns regarding the rebuttal, please don't hesitate to discuss. Thanks for your contribution to NeurIPS 2025.

Best regards, AC

Final Decision

Summary:
This paper introduces UserAlign, an inference-time personalization method using pairwise comparisons within a logistic bandits framework to efficiently identify user preferences. It demonstrates strong empirical results across text and image domains with limited user interactions. The authors conducted a new user study during rebuttal to address concerns about real-world validity.

Pros:

  • Theoretically grounded and sample-efficient.
  • Effective across multiple modalities (text and images).
  • User study confirms performance with real human feedback.

Cons:

  • Relies on a pre-generated, diverse candidate pool, which may be impractical to ensure in some applications.
  • Assumes noise-free user preferences, and robustness to inconsistent feedback remains unclear.
  • Practical deployment costs (latency, interaction burden) are not fully quantified.