Learning to Plan with Personalized Preferences
We develop agents that learn human preferences from a few demonstrations and adapt their planning strategies to these preferences across multiple tasks.
Abstract
Reviews and Discussion
This paper proposes a framework/benchmark, named Preference-based Planning (PbP), for learning preferences from human behavior and subsequently planning actions guided by these learned preferences. PbP is an embodied benchmark built upon NVIDIA Omniverse and OmniGibson, and provides a large-scale, structured environment for evaluating preference learning and personalized planning. The framework/benchmark is evaluated mainly by benchmarking an extensive set of state-of-the-art (SOTA) algorithms.
----------------- Update after rebuttal -----------------
My concerns were addressed by the authors and I maintain my original score.
Questions for Authors
- There is often preference drift in practice; will the benchmark be able to handle evolving or conflicting user preferences over time?
- In practice, the amount of data may be limited. To achieve reliable preference inference in real-world applications, are there any estimates of the number of demonstrations required?
Claims and Evidence
There are several claims made in this paper, including but not limited to the following:
- Claim: Planning adaptability can be improved by learning human preferences from few-shot demonstrations.
- Evidence: This paper validates and supports this claim via empirical evaluations. The empirical results demonstrate that action prediction performance can be improved by incorporating learned preferences as intermediate representations.
- Claim: The PbP benchmark provides a comprehensive and systematic evaluation for preference-based planning.
- Evidence: This paper supports this claim by spanning 50 distinct scenes and encoding 290 unique preferences, with a comprehensive test set of 5,000 instances.
Most of these claims are supported by quantitative results. A potential limitation is that the results/claims are mainly validated via simulations, so the real-world applicability of PbP is not entirely clear.
Methods and Evaluation Criteria
The proposed methods and evaluation criteria make sense in general. For example, the generalization performance is evaluated by testing models in novel environments, which is a common approach widely used in the literature.
Theoretical Claims
No. There are no theoretical claims in this paper.
Experimental Design and Analysis
In general, the experimental designs are well-structured, since this paper mainly focuses on introducing a new benchmark. One potential issue is that there will clearly be error propagation in the two-stage learning, yet there is no comprehensive or deep analysis from this perspective. How does noisy preference prediction affect downstream planning?
Supplementary Material
I reviewed Sections B and C, but did not go into much depth on the baseline details in Section D.
Relation to Broader Scientific Literature
This work is based on preference learning, embodied AI benchmarks, etc.
Essential References Not Discussed
The paper discusses most relevant works.
Other Strengths and Weaknesses
- It is clear that there will be error propagation in the two-stage learning, yet there is no comprehensive or deep analysis from this perspective. How does noisy preference prediction affect downstream planning?
- Most of the claims in this paper are validated mainly via simulations, so the real-world applicability of PbP is not entirely clear.
Other Comments or Suggestions
- How diverse are user preferences in the benchmark? Does the benchmark sufficiently cover cross-cultural differences?
Dear Reviewer kiee:
Thank you for your thoughtful review and constructive feedback.
Some potential limitations include that these results/claims are mainly validated via simulations; the real-world applicability of PbP is not quite clear.
Most of the claims in this paper are validated mainly via simulations, so the real-world applicability of PbP is not entirely clear. We acknowledge this limitation regarding the lack of real-world demonstrations. We would argue, however, that our study has a different focus. The preference definitions, together with the simulation based on prior research in the community, can be viewed as a simplified simulation of real-world human-AI interactions. We did not ground the task in more challenging settings involving real humans because their preferences might change unpredictably and observations of their behaviors can be noisy and error-prone. Instead, we focused on fundamental reasoning and planning tasks where a stable preference can guide the entire process. Moving forward, however, it is indeed important to explore our current work in real-world scenes with diverse and unpredictable real-human behaviors. Adapting the pipeline to real-world scenarios is a valuable direction for future work.
One potential issue is that there will clearly be error propagation in the two-stage learning, yet there is no comprehensive or deep analysis from this perspective. How does noisy preference prediction affect downstream planning?
We show the results in Table 1, where the middle rows use previously inferred (noisy) preference labels and the bottom rows use ground-truth preference labels for downstream planning. There is a significant performance drop when using noisy preference labels.
How diverse are user preferences in the benchmark? Does the benchmark sufficiently cover cross-cultural differences?
This is a very important question. There are 290 pre-defined preferences in PbP. There are certainly corner cases or cross-cultural differences not covered within this scope. However, as defined, these preferences can be seen as high-level abstractions of action sequences, and the generalization of our preference modeling does not necessarily depend on the scope of the pre-defined preference definitions. The model learns from action sequences and ultimately outputs actions as well. The pre-defined preferences mainly serve as a guide to help us sample demonstrations and assist model planning. While they may not cover all corner cases, they provide a substantial enough range to serve as a benchmark for evaluating the baselines.
Besides, our environment naturally supports extensions such as cross-cultural differences or preference drift. As long as the task can be formulated as action sequences and sub-structures exist in the action distributions, the methodology can be applied, no matter how domains vary, how individual users differ in their unique preferences, or how preferences are organized according to cultural differences. Our proposed three-tiered hierarchical structure of preferences is designed from the perspective of how activities unfold in a household scenario.
There is often preference drift in practice; will the benchmark be able to handle evolving or conflicting user preferences over time?
Yes. Since the learned preferences are mainly high-level abstractions of action sequences, as long as the preference drift is demonstrated and can be observed in the user's behavior, the preference evolution can be learned and incorporated into the machine's policy.
In practice, the amount of data may be limited. To achieve reliable preference inference in real-world applications, are there any estimates of the number of demonstrations required?
Yes. This is exactly what Figure 7 is intended to show. We conduct an ablation study on the number of demonstrations and report results across various cases. Generally, more demonstrations mean better performance. In our experimental setting, we found that approximately five demonstrations are typically enough. However, real-world applications may require further validation, as they involve more complex perception challenges and more nuanced human behaviors than our controlled experimental environment. We also agree that a more rigorous study would be required to establish accurate scaling laws.
We sincerely welcome your further feedback.
This work attempts to develop agents capable of learning preferences from few-shot demonstrations and generalizing across diverse household task-planning scenarios. In that pursuit, the work also presents the 'Preference-based Planning' (PBP) benchmark, featuring a set of demonstrations rendered within a simulator and representing 290 different preferences spanning multiple levels, from low-level action execution to spatial constraints to sequence-level preferences. The findings indicate that learned preferences may be useful intermediate representations for planning and that pure language models show greater potential for scalability than vision-language models.
Update after rebuttal:
My major concerns and questions have been addressed. I do not have any further qualms about this work. Thus, I maintain my acceptance score.
Questions for Authors
- Section 5.3, two-stage learning: what do the auxiliary preference tokens look like? How do they differ for symbolic models vs. vision models?
- Does option-level include action-level preferences as well? If not, why are option-level and sequence-level preferences compared but not action-level preferences?
Claims and Evidence
- In Section 3, the motivations for formulating preference learning as few-shot learning from demonstrations are technically naive. Relevant literature in embodied AI and preference learning needs to be reviewed to back up the hypotheses and motivations for this approach, rather than the current evidence, which compares the abilities of humans and emphasizes the difficulty of preference collection.
Methods and Evaluation Criteria
- Methods and evaluation criteria are largely sound as far as I saw. Empirical and ablation studies have been conducted to understand the impact of the number of demonstration examples in the prompts for the models. There is a sufficient comparison of text-based and vision-based models.
Theoretical Claims
Section 3 - Formulating Preference-based Planning: The observations are stated to contain S_i, the egocentric observation sequence video, A_i, the action sequence, and M, the bird's eye view of the scene maps. It is unclear whether or how the M values - the bird's eye view of the scene maps - are utilized in the inputs by the models. More details are needed explaining whether there are any explicit mappings other than A_i between the S_i and M values. The work also needs to specify whether this was a choice of design or whether it is an accepted practice for such kinds of evaluations.
Experimental Design and Analysis
- Section 5.4, Generalization: experiment details are insufficiently explained. The description says that demonstration and test videos are rendered with the same objects and identical rooms. It's unclear what difference is being leveraged in the hypothesis of the experiment. A side-by-side rendering of the two instances would be useful to clear up the confusion.
Supplementary Material
Yes, I have reviewed Appendices A to C.
Relation to Broader Scientific Literature
The work's results are relevant for embodied AI and preference learning communities in that it highlights issues with preference learning with few-shot demonstrations and generalization of learned preferences across different settings, especially with multimodal models.
Essential References Not Discussed
Section 4.2 Constructing the PBP test set claims that the egocentric perspective is prioritized for certain reasons. It is essential to note that the VLMs have a frame of reference bias. Regardless of whether the authors knew about prior works proving the same such as [1], it is prudent to mention that such bias may well exist and that this work is aware of it when performing evaluations, especially as the work also presents a benchmark.
[1] Zhang, Zheyuan, et al. "Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities." ArXiv, 2024, https://arxiv.org/abs/2410.17385.
Other Strengths and Weaknesses
Other Weaknesses:
- An insufficient amount of work is done on failure analysis of the experiments. For example, it would have been useful to know which levels of preference learning are more prone to failure or more difficult for which kinds of models.
Other Comments or Suggestions
- Section 5.4, Table 3, Figure 6 - In some cases, 'gen' has been used to denote the generalization setting, and in other cases 'orig' has been used for the same. Are they different settings? If not, it is best to have consistent definitions.
Dear Reviewer rU32:
Thank you for your thoughtful review and constructive feedback.
the motivations for formulating preference learning as a few-shot learning from demonstrations are technically naive. Relevant literature in embodied AI and preference learning needs to be reviewed.
Our formulation is based on two considerations:
i) Humans, even infants, are found to have the ability to detect others' preferences from only a few demonstrations. There is a body of literature, especially in psychology, supporting this [1][2][3]. Recently in embodied AI, works like [7][8] demonstrate the importance of personalization in robot assistants and highlight the difficulty of preference inference and adaptation. So to advance embodied AI, it is necessary to test such an ability.
ii) Under our home-assistant setting, it is nearly impossible to collect a large number of demonstrations for a specific person and task, and it is too tedious for users to choose or rank preferred trajectories when performing everyday tasks. So in a realistic setting, learning from observations of user behavior is more natural than collecting explicit preference data. There is literature proposing similar views, such as reducing the number of queries as far as possible [5][6] or using few-shot learning from demonstrations for adaptation [4].
We will add more discussion of this literature in the Introduction and Related Work sections.
[1] Choi et al. How do 3-month-old infants attribute preferences to a human agent?. Journal of experimental child psychology.
[2] Duh et al. Infants detect patterns of choices despite counter evidence, but timing of inconsistency matters. Journal of Cognition and Development.
[3] Baker et al. Rational quantitative attribution of beliefs, desires and percepts in human mentalizing. Nature Human Behaviour.
[4] Verma et al. Adaptagent: Adapting multimodal web agents with few-shot learning from human demonstrations.
[5] Hejna et al. Few-shot preference learning for human-in-the-loop. PMLR.
[6] Yu et al. Few-shot in-context preference learning using large language models.
[7] Hwang et al. Promptable behaviors: Personalizing multi-objective rewards from human preferences. CVPR.
[8] Hellou et al. Personalization and localization in human-robot interaction: A review of technical methods. Robotics.
It is unclear whether or how M - the bird's eye view - is utilized in the inputs. More details are needed explaining any explicit mappings other than A_i between the S_i and M values. The work also needs to specify whether this was a design choice for such kinds of evaluations.
The scene maps M are not used as model input in our setting, as discussed in Sec. 4.2; they rather help illustrate the overall process of robot behaviors. Their exclusion from model inputs is a deliberate design choice to focus on egocentric perception. There are no explicit mappings other than A_i (which can be human or robot actions). For evaluation, we kindly refer to the examples and discussion in our response to Reviewer eCTV.
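To make this concrete, below is a minimal illustrative sketch of how an episode could be organized; the names and structure are our own simplification for this response, not the released code, with M retained only for visualization.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Episode:
    ego_frames: List[str]  # S_i: egocentric observation video, e.g., paths to frames
    actions: List[str]     # A_i: action sequence (human or robot actions)
    bev_map: str           # M: bird's-eye-view scene map, kept for visualization only

def model_inputs(ep: Episode):
    """Only the egocentric observations and actions are given to the models;
    the bird's-eye-view map M is deliberately excluded from the input."""
    return ep.ego_frames, ep.actions
```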
Generalization experiment
For the generalization experiment, scenes and objects can be randomly sampled for the same preference. We will add side-by-side renderings in the revision. Thank you for your suggestion.
It is essential to note that the VLMs have a frame of reference bias.
We will explicitly highlight this point, together with the references, in the revision.
It would have been useful to know which levels of preference learning are more prone to failure or are difficult for what kind of models.
Tables 1-3 in the paper report model performance at the different preference levels. Generally, all models struggle more with the sequence level than with the option level, because in long-horizon preference reasoning, dependencies between preference steps accumulate errors.
'gen' and 'orig' have been used for the same setting.
Sorry for the confusion. We will fix this inconsistency in the revision.
What do the auxiliary preference tokens look like? How do they differ for symbolic models vs vision models?
The auxiliary preference tokens are the preference labels we define in Section 4.1. They are the same for symbolic models and vision models in the two-stage experiment setting.
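To make this concrete, below is a minimal sketch of the two-stage setting; the label names and prompt wording are hypothetical placeholders, and the actual preference vocabulary and prompts follow Sections 4.1 and 5.3 of the paper.

```python
from typing import Callable, List

# Hypothetical label names; the real preference vocabulary comes from Sec. 4.1.
PREFERENCE_LABELS = ["place_cups_on_left_shelf", "tidy_room_before_cooking"]

def infer_preference(demos: List[str], model: Callable[[str], str]) -> str:
    """Stage 1: few-shot preference inference from demonstrated action sequences."""
    prompt = ("Demonstrations:\n" + "\n".join(demos) +
              "\nWhich preference best explains them? Options: " +
              ", ".join(PREFERENCE_LABELS))
    return model(prompt)  # expected to return one preference label (the auxiliary token)

def plan_with_preference(task: str, preference: str, model: Callable[[str], str]) -> List[str]:
    """Stage 2: planning conditioned on the (possibly noisy) predicted preference label."""
    prompt = f"Preference: {preference}\nTask: {task}\nList the action sequence, one per line."
    return model(prompt).splitlines()
```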
Does option-level include action-level preferences as well?
Yes. Action-level preferences relate to single actions. We did not explicitly include action-level preferences in the comparison because, as commonly defined, they concern specific actions, such as drinking juice vs. drinking coffee; a basic imitation policy can effectively learn these preferences, since models only need to remember and repeat. We would rather focus on more complex preferences that cannot be easily addressed through simple copying/imitation, and test models in scenarios where preferences involve more nuanced or context-dependent decisions.
We sincerely welcome your further feedback.
The paper introduces Preference-based Planning (PBP), a benchmark for learning human preferences and integrating them into AI planning. The framework enables AI agents to infer user-specific preferences from a few demonstrations and apply them in task planning.
Questions for Authors
No
Claims and Evidence
The following claims are made, with or without evidence:
- AI agents can learn user preferences from limited demonstrations and generalize them across diverse planning tasks.
- "Few-shot learning generalizes well across various scenarios" – No direct comparison with baseline retrieval-based or reinforcement learning based approaches.
- "Preference learning significantly improves AI adaptability" – While performance gains are observed, no real-user evaluations validate whether these improvements translate to better human-AI interactions.
- "PBP is a realistic simulation and real-time rendering of human preferences" – The benchmark relies on synthetic data, and no real-world demonstrations are included.
Methods and Evaluation Criteria
The proposed method leverages AI agents, with Levenshtein distance and accuracy as the metrics for performance quantification. However, more experiments are needed to see how the work aligns with the literature. There is a lot of research on RL-based preference learning (see the literature section).
Theoretical Claims
No theoretical claims.
Experimental Design and Analysis
They generated 50 distinct scenes, 290 unique preferences, and 5,000 test cases. Two experiments are designed:
- End-to-End Action Preference Learning – Models generate action sequences directly from past demonstrations.
- Two-Stage Learning & Planning – First, models predict user preferences, then use them for task planning.
Levenshtein distance and accuracy are used as performance evaluation metrics. However, a detailed comparison with SOTA is missing.
Supplementary Material
NA
Relation to Broader Scientific Literature
Yes. Preference learning is a trending topic in RL, and the current work leverages AI agents to learn preferences and actions.
Essential References Not Discussed
Though relevant literature is included, RL-based methods are common in the literature, and it would be interesting to see how the work compares to these methods. Many types of preferences are defined in the RL domain, but the paper does not mention which kind of preference it is using. See and cite the following papers for reference.
- Advances in Preference-based Reinforcement Learning: A Review
- Deep reinforcement learning from human preferences
- Models of human preference for learning reward functions
Other simulation benchmarks can also be cited:
- RoboCasa
Other Strengths and Weaknesses
The work leveraging embodied agents is interesting and shows good accuracy; however, a detailed comparison with RL-based methods is missing.
The paper also lacks real-world validation; the current evaluation is on the authors' own simulation. Do we have any results on existing benchmarks?
The paper has a limited scope of generalization – it only tests simulation-based preference learning without evaluating on other existing benchmarks.
Other Comments or Suggestions
See weaknesses.
Dear Reviewer mzDn:
Thank you for your thoughtful review and constructive feedback.
No direct comparison with baseline retrieval-based or reinforcement learning based approaches.
Other Strengths and Weaknesses: a detailed comparison with RL-based methods is missing.
Retrieval-based and RL-based approaches are intuitive candidates for this benchmark, but they face inherent limitations that make them unsuitable for the task. Retrieval-based methods rely on matching input queries to similar patterns in a memory pool. Although they can retrieve relevant demonstrations, they fail to address the core challenge of few-shot reasoning:
| | Stage 1 (openai) | Stage 2 (openai) | Stage 1 (jina-v2) | Stage 2 (jina-v2) |
|---|---|---|---|---|
| Option Level | 16.46 | 15.42 | 16.49 | 16.18 |
| Sequence Level | 14.01 | 24.08 | 14.72 | 26.54 |
These scores are notably poor—only marginally better than DAG-based methods—because retrieval-based systems, while capable of accessing more demonstrations, cannot effectively generalize from historical data to new tasks. We will add this baseline to our paper.
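To clarify the setup, here is a minimal sketch of the retrieval-based baseline idea; `embed` is a placeholder standing in for an off-the-shelf text encoder (e.g., the OpenAI or jina-v2 embedding models referenced in the table), and the memory-pool entries are toy examples rather than benchmark data.

```python
import numpy as np
from typing import List, Tuple

def embed(text: str) -> np.ndarray:
    """Placeholder encoder: in practice this would call an embedding model
    (e.g., OpenAI or jina-v2); here it just hashes the text into a fixed-size vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(64)

def retrieve_plan(query: str, memory: List[Tuple[str, List[str]]]) -> List[str]:
    """Return the action sequence of the stored demonstration most similar to the query."""
    q = embed(query)
    sims = [float(embed(desc) @ q /
                  (np.linalg.norm(embed(desc)) * np.linalg.norm(q) + 1e-8))
            for desc, _ in memory]
    return memory[int(np.argmax(sims))][1]

# Toy memory pool of (demonstration description, action sequence) pairs.
memory = [
    ("make coffee; user keeps mugs on the left shelf", ["grasp(mug)", "place(left_shelf)"]),
    ("set the table; user places forks first",         ["pick(fork)", "place(table)"]),
]
print(retrieve_plan("prepare coffee for the user", memory))
```

As the sketch suggests, such a system can only replay the nearest stored demonstration; it does not abstract a preference that transfers to a new task, which is why its scores remain low.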
RL-based approaches, especially meta-RL, on the other hand, often need millions of training steps and explicit reward functions for each task. This requirement contradicts the few-shot nature of our task: we need an agent that can adapt to new tasks given only a few demonstrations in a complex environment, where current RL-based or meta-RL-based methods have not shown satisfactory results. Moreover, designing a reward function for each preference is subtle and tedious, which is impractical as humans have very diverse and nuanced preferences.
no real-user evaluations validate whether these improvements translate to better human-AI interactions.
The benchmark relies on synthetic data, and no real-world demonstrations are included.
The paper also lacks Real-World Validation
We would argue that, as a first-of-its-kind study, our work focuses on developing an agent capable of the task in virtual environments before transitioning to real-world scenarios. The preference definitions, together with the simulation based on prior research in the community, can be viewed as a simplified simulation of real-world human-AI interactions. While not grounded in the real world, our design still focuses on fundamental reasoning and planning tasks where a stable preference can guide the entire process. Moving forward, however, it is indeed important to explore our current work in real-world scenes with diverse and unpredictable real-human behaviors. Adapting the pipeline to real-world scenarios is a key direction of our future work.
More experiments are needed to see how the used metrics align with the literature. There is a lot of research on RL-based preference learning (see the literature section).
RL baselines: see our response above. Levenshtein distance: we kindly refer to the examples and discussion in our response to Reviewer eCTV.
Relation To Broader Scientific Literature. Essential References Not Discussed.
Thank you very much for your kind reminder.
Our work aligns most closely with PbRL, as we learn from demonstrated action sequences that implicitly encode user preferences. PbRL mainly utilizes human preferences as expert feedback to replace numeric rewards and help models learn better [1]. Works like [2] explore goals defined in terms of human preferences between trajectory segments, and [3] proposes modeling human preferences as informed by each segment's regret. These are definitely works related to our research topic. We extend beyond traditional preference-based RL settings in several ways. First, while most preference-based RL methods require extensive human feedback through pairwise comparisons or explicit reward signals, we focus on learning from minimal demonstrations that implicitly convey preferences. Second, rather than learning a single reward function or policy, our approach aims to identify and abstract generalizable preference patterns across diverse tasks and scenarios.
We will add this discussion and cite these related papers in our revision, and we will also include RoboCasa in Sec. 2.3 of our paper.
[1] Advances in Preference-based Reinforcement Learning: A Review.
[2] Deep reinforcement learning from human preferences.
[3] Models of human preference for learning reward functions.
The paper has a limited scope of generalization.
As far as we know, there is a lack of benchmarks for embodied tasks that include systematically defined human preference or human behavior data. Thus, we propose a benchmark to address this gap and see it as one of the contributions of our work. We will release the benchmark together with our baselines for the community to test and explore.
We sincerely welcome your further feedback.
The paper introduces a framework to enhance embodied AI planning by incorporating personalized preferences learned from limited human demonstrations. It proposes the Preference-based Planning (PBP) benchmark and shows that learned preferences serve as effective abstractions, improving personalized plan generation for embodied AI agents.
update after rebuttal
The rebuttal has addressed my key concerns. I therefore update my rating from weak reject to weak accept.
Questions for Authors
Could you elaborate on the strengths of Levenshtein distance as an evaluation metric?
Claims and Evidence
The proposed benchmark is novel within embodied AI. However, its significance is unclear, as simple symbol-based approaches already perform fairly well (Table 2), surpassing video-based counterparts.
The primary finding from the benchmark—improved personalized planning through learned preferences—is clearly demonstrated, but not particularly novel on its own.
Methods and Evaluation Criteria
The proposed benchmark uses Levenshtein distance to measure discrepancies between generated and ground truth action sequences. However, the rationale for choosing Levenshtein distance is unclear and should be better justified.
Theoretical Claims
The paper does not have theoretical claims
Experimental Design and Analysis
The proposed benchmark incorporates a comprehensive set of baselines and evaluations.
Supplementary Material
The supplementary code provides implementations of the key baselines in the proposed benchmark.
Relation to Broader Scientific Literature
The contributions relate closely to recent literature on preference learning and embodied AI.
Essential References Not Discussed
I do not see any major related work missing.
Other Strengths and Weaknesses
There appears to be a mismatch between the paper's framing and its actual contribution. The paper is framed as a new method for preference-based planning, but its main contribution seems to be the empirical benchmark and evaluation.
Other Comments or Suggestions
No further comments
Dear Reviewer eCTV:
Thank you for your thoughtful review and constructive feedback. We appreciate your recognition of the benchmark’s novelty and the clarity of our primary findings. We address your concerns point by point:
The proposed benchmark is novel within embodied AI. However, its significance is unclear, as simple symbol-based approaches already perform fairly well (Table 2), surpassing video-based counterparts.
The video-based parts are important in the benchmark. We mainly incorporate the symbol-based parts to study how different modalities impact performance, for a more comprehensive evaluation. However, the benchmark's significance lies largely in its scalability and realism. To ensure our research can address real-world challenges in perception and noise, we need to simulate human behaviors and their preferences under complex, real-world conditions. Therefore, the video-based parts are more critical for potential future extension, as real-world agents must process raw sensory inputs to infer preferences dynamically. Visual cues can offer nuanced insights into user preferences that are not explicitly stated in text, making them invaluable for applications that require understanding subtle, non-verbal user behaviors.
Indeed, as you pointed out, the effectiveness of symbol-based LLMs in preference learning tasks, as demonstrated in our experiments, underscores the relative ease with which these models handle few-shot induction when provided with explicit action and preference labels. However, real-world grounding still presents a great challenge: since raw sensory information often contains a lot of noise, converting everything into well-organized text is nearly impossible. The poor performance of video-based approaches also illustrates these practical difficulties.
The primary finding from the benchmark—improved personalized planning through learned preferences—is clearly demonstrated, but not particularly novel on its own.
Thank you for acknowledging our findings. While the idea of improved personalized planning through learned preferences is intuitive, our main contribution lies in creating the first realistic benchmark that extends this concept to embodied AI, modeling human preferences through observable behaviors across hierarchical levels, and leveraging machine learning methods to solve preference-guided planning in a scalable and systematic way. We demonstrate comprehensive research across a broad range of settings, revealing key insights, for instance that symbol-based approaches show promise in scalability while significant challenges persist in both preference learning and planning, which had not been previously identified in the literature. We hope our work can serve as a foundation for future research.
The proposed benchmark uses Levenshtein distance to measure discrepancies between generated and ground truth action sequences. However, the rationale for choosing Levenshtein distance is unclear and should be better justified.
Could you elaborate on the strengths of Levenshtein distance as an evaluation metric?
Levenshtein distance is chosen because it quantifies the sequential alignment between generated and ground-truth actions, penalizing deviations that violate preferences. Consider, for example, a mismatch in task ordering such as AAABCBCA (ground truth) vs. AABCBCAA (predicted); note how similar they are. In this case, position-wise accuracy is only 37.5% (very low), but the normalized Levenshtein distance is 2/8 (smaller is better), indicating the prediction is close to the ground truth. Levenshtein distance is thus much more suitable in our setting: unlike rigid exact-match metrics, it accommodates valid variations in execution while ensuring preference adherence. We will add more explanation in Sec. 5.2. Levenshtein distance has also been widely used for sequence comparison [1-3]; a small worked example follows the references below.
[1] Yujian, Li, and Liu Bo. "A normalized Levenshtein distance metric." PAMI.
[2] Gu, Jiatao, Changhan Wang, and Junbo Zhao. "Levenshtein transformer." NeurIPS.
[3] Fanello, Sean Ryan, et al. "Keep it simple and sparse: Real-time action recognition." JMLR.
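For concreteness, here is a minimal, self-contained sketch of the metric on the toy example above. This is illustrative code, not the benchmark's evaluation script.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert / delete / substitute)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

gt, pred = "AAABCBCA", "AABCBCAA"  # toy action sequences from the example above
accuracy = sum(g == p for g, p in zip(gt, pred)) / len(gt)   # position-wise match rate
norm_lev = levenshtein(gt, pred) / max(len(gt), len(pred))   # normalized edit distance
print(f"accuracy={accuracy:.3f}, normalized Levenshtein={norm_lev:.3f}")
# -> accuracy=0.375, normalized Levenshtein=0.250: the edit distance better reflects
#    that the predicted sequence is only slightly misaligned with the ground truth.
```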
There appears to be a mismatch between the paper's framing and its actual contribution. The paper is framed as a new method for preference-based planning, but its main contribution seems to be the empirical benchmark and evaluation.
We appreciate this observation. While we agree the benchmark is a key contribution, we would argue that our main contribution lies in a series of studies on machine learning of human preferences. The benchmark serves as a testbed to rigorously evaluate preference-based planning, enabling us to demonstrate novel and comprehensive research across a broad range of settings. Through extensive experiments, we provide new insights into preference-based planning.
We sincerely welcome your further feedback and suggestions to strengthen our work.
This work presents a benchmark to study whether agents can learn preferences from few-shot demonstrations and generalize across diverse household task-planning scenarios. This is tested on the Preference-based Planning (PBP) benchmark, featuring a set of demonstrations rendered within a simulator and representing 290 different preferences spanning multiple levels, from low-level action execution to spatial constraints to sequence-level preferences. The paper's evaluation shows that preferences can help with planning and that LLMs improve over VLMs. At the end of the discussion period this paper was borderline, with no reviewer expressing interest in championing the work.
I think the benchmark is constructed well, but the framing of the paper needs work to improve what the community takes away from it. In its current form, the claims in the paper do not always match what is actually tested. For example, "we focus on developing agents capable of learning preferences ...." is a statement in the introduction that suggests the paper studies different strategies to infer preferences from data; however, the actual work focuses on studying existing LLMs on the proposed benchmark. For what it's worth, I think the benchmark does hit upon a useful and interesting contribution for the community, but the paper does need revision to highlight what is actually being done: the primary contributions are a benchmark and an initial study of how existing multimodal models perform on it. I think the paper is also missing comparisons with other benchmarks in the literature. While the other benchmarks do not explicitly focus on personalization (which is the focus here), it is important to contextualize why doing so might inform progress on them.
https://openreview.net/pdf?id=TavPBk4Zs9m
https://arxiv.org/abs/2502.09560v2
In addition, as was raised by the reviewers, I think the problem of frame-of-reference bias should be addressed, since it has been shown to be an important way in which vision-language models fail to make reliable predictions.