PaperHub
Rating: 6.4 / 10 · Poster · 5 reviewers
Individual scores: 3, 4, 4, 3, 3 (min 3, max 4, std 0.5)
ICML 2025

ELEMENTAL: Interactive Learning from Demonstrations and Vision-Language Models for Reward Design in Robotics

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-07-24
TL;DR

ELEMENTAL enables robots to learn user-aligned reward functions by combining user language instructions and demonstration through interactive self-reflection using vision-language models.

Abstract

Keywords
Learning from Demonstration · Vision-Language Models · Inverse Reinforcement Learning

Reviews and Discussion

Review (Rating: 3)

The paper proposes incorporating user demonstrations into LLM-based reward design methods in robotics. Their proposed approach is a direct contender of EUREKA. The main motivation of this work is that language can be ambiguous for task requirement specification and hence using user demonstrations is a good way to reduce this ambiguity and create a better interface for human task specifications.

For this purpose, the authors propose ELEMENTAL, a method that leverages demonstrations to achieve this exact goal. The method has three phases. In the first phase, a vision-language model is prompted with demonstrations and text and is expected to produce a feature-extraction function. The prompt also includes the environment code in addition to the demonstrations (presented as either superimposed images for locomotion or keyframes for manipulation). The second phase is an inverse RL phase, in which the agent learns a reward from the previously extracted features and learns a policy with the obtained reward using PPO. The final phase allows the agent to self-reflect on the quality of the feature extractor so that it can improve it.

For evaluation, the proposed method is evaluated using a series of IsaacGym environments, and compared to baselines from inverse RL and to Eureka. The results show an improvement in performance in comparison to the baselines and some hints of better generalization. All results are in simulation and using IsaacGym. The authors also ablate multiple design choices from their method and report wall clock time comparisons to Eureka, in which their method is around 2.5 times slower.

Update after rebuttal

The rebuttal successfully addressed most of my concerns. I have now raised my score to weak accept.

Questions for Authors

  • Would your method work if the keyframe selection were automated and hence some poorly chosen frames were used?
  • How would your method work with different VLMs? Ideally, it would be interesting to see it working with open-source VLMs.
  • Can you perform more experimental runs with random seeds (at least 5, ideally > 10)?
  • Why did you pick the reward to be a linear combination of the features instead of also prompting the VLM to write a reward function based on the features? How would your method compare to such an approach?

Claims and Evidence

The main claim of the paper is that integrating demonstrations can reduce the ambiguity of task specification and hence improve performance of LLM/VLM inverse RL. While the results do show an improvement in performance, the evaluation lacks a clear connection between demonstrations and ambiguity being reduced. Perhaps this can be shown with qualitative examples where the demonstrations clearly induce some reward components that could not have been produced by a language-only method.

A second major claim of this paper is that it improves generalization to out-of-distribution tasks. Here the main concern is that VLMs are trained on data from the internet and can only produce rewards within the support of rewards they have seen during training (a reasonable assumption). The authors attempt to validate this claim by designing some custom tasks within IsaacGym and showcase their method being successful on those as well. While this experiment is a good hint at generalization, it is not sufficient to understand the generalization capability of the method on tasks from completely unseen domains (different simulators with not much code online, different robots, etc.).

Methods and Evaluation Criteria

Integrating demonstrations into VLM-based reward design is a good idea and a promising direction. The proposed method is reasonable; the assumption that the reward is linear in the features is limiting but sufficient for a large set of tasks.

The evaluation benchmark is a good start to properly validate the method. I believe real-world results and different simulators would strengthen the claim of generalization and better highlight the contributions of the paper.

Theoretical Claims

None.

Experimental Design and Analysis

I do not like the usage of only 3 random seeds for testing the method and comparing it to baselines, especially since some of the results are not statistically significant, as can be seen in Figure 3. Besides that, the experimental design is good. The analysis is also mostly good, except that claiming generalization to out-of-distribution tasks is a bit of an overclaim given that the tested tasks are all very similar and from the same domain.

Supplementary Material

Supplementary material includes hyperparameters and some prompts used in the different stages. I would highly recommend adding side-by-side comparisons of how prompts + demos result in different rewards compared to prompts alone.

Relation to Prior Literature

The paper is a logical next step to previous work on LLM/VLM-based reward design.

Missing Essential References

None.

Other Strengths and Weaknesses

Strengths:

  • The paper is well-written and enjoyable to read.
  • Leveraging demonstrations for automated reward design is a reasonable next step and a good idea to reduce the ambiguity of language-based methods.
  • The experiments show good performance improvements.

Weaknesses:

  • The paper's novelty is quite limited.
  • Many design choices of the method are not well motivated, e.g., the choice of VLM and the choice of linear-feature rewards.
  • The paper lacks real world experiments or experiments with environments that are truly out-of-distribution, despite the paper boldly claiming generalization to such domains.
  • The paper clearly lacks a qualitative evaluation to showcase how the additional demos reduce ambiguity.
  • The paper relies on keyframes (selected by an expert) for manipulation tasks. Such information is not easy to get and a more automated approach is desirable.
  • The evaluation uses only 3 seeds.

Other Comments or Suggestions

None

Author Response

We thank the reviewer for the detailed feedback and for recognizing the value of combining demonstrations with language for reward design, as well as ELEMENTAL’s improved performance. All updated tables and figures are included in https://shorturl.at/YHEDU (referred to as Response Table and Response Figure), following ICML rules. Below, we address the reviewer’s concerns.

[Q1 (Claim Concern 1, Weakness 4, Supplementary)] Demonstrations reducing language ambiguity

[A1] In our user study (see Reviewer KANf [A2]), a participant taught the “mix bowl with spoon” skill using language that included vague temporal phrases (“first… then…”) and spatial instructions (“lower into the bowl”). While such language hints at the intent, it does not fully define it. Nonetheless, ELEMENTAL was able to extract temporal-spatial alignment from the visual input of the demonstration.

We argue that while it is possible to articulate complex relationships using language (e.g., with coordinate systems or math), doing so is burdensome and error-prone for users [1]. ELEMENTAL reduces ambiguity by grounding vague or underspecified language in visual demonstration, a more natural way to convey task details. We will add this example to the supplementary material to compare the features obtained from “prompts + demos” with the rewards obtained from “just prompts”.

[1] Doğan, F. I., Gillet, S., Carter, E. J., & Leite, I. (2020). The impact of adding perspective-taking to spatial referencing during human–robot interaction. Robotics and Autonomous Systems.

[Q2 (Claim Concern 2, Weakness 3, Experiment Concern)] Generalization Claim

[A2] In addition to our Ant generalization experiments in simulation, we conducted a real-world user study (see Reviewer tQa4 [A1]) on a salad mixing task using a Kinova JACO arm—a different robot and unseen domain.

Despite this domain shift and user-provided demonstrations, ELEMENTAL significantly outperformed Eureka in both task success (20.58 ± 4.93 vs. 12.42 ± 4.72, p < .001) and strategy alignment (19.83 ± 6.13 vs. 10.50 ± 4.32, p < .001). These results demonstrate ELEMENTAL’s ability to generalize to novel robots, real-world interactions, and imperfect user input.

[Q3 (Experiment Concern, Weakness 6, Question 3)] Random seeds and statistical significance

[A3] In our revised experiments (see Reviewer 5aJa [A1]), we increased the number of random seeds to 5 across all benchmark and generalization tasks. On average, ELEMENTAL outperforms Eureka by 122.5% in benchmark settings and 81.2% in generalization (better in 8/9 and 4/4 tasks, respectively). ELEMENTAL shows statistically significant improvement over Eureka in 5/9 benchmark tasks (p < .05) and in 2/4 generalization tasks (p < .05). Regarding Figure 3, we updated it in Response Figure 1 and confirmed ELEMENTAL has a significantly higher execution rate (p = .030).
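
For illustration only, a per-task significance check of this kind could be computed as sketched below, assuming an unpaired two-sample t-test over per-seed final scores; the numbers are synthetic placeholders, not the paper's results.

# Illustrative per-task significance test between two methods over random seeds
# (Welch's t-test on synthetic placeholder scores, not the paper's data).
import numpy as np
from scipy import stats

elemental_scores = np.array([9.1, 8.7, 9.4, 8.9, 9.0])  # hypothetical per-seed scores
eureka_scores = np.array([7.8, 8.1, 7.5, 8.3, 7.9])     # hypothetical per-seed scores

t_stat, p_value = stats.ttest_ind(elemental_scores, eureka_scores, equal_var=False)
print(f"Welch's t = {t_stat:.2f}, p = {p_value:.3f}")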

[Q4 (Weakness 1)] Novelty

[A4] To our knowledge, ELEMENTAL is the first to enable multimodal, self-improving reward learning with VLMs in robotics. ELEMENTAL combines VLMs with LfD, introducing a novel, three-phase framework that includes: (1) multimodal feature extraction from both demonstrations and text, (2) inverse reinforcement learning to optimize reward and policy, and (3) a self-reflective loop that iteratively revises the feature space using feedback from learned behavior. We show improvements over SOTA on standard benchmarks and real-world deployment, and would be grateful if the reviewer could specify which aspects they feel are insufficiently novel so we can better address them.

[Q5 (Weakness 2, Question 2, Question 4)] Design choices for VLM and linear reward

[A5] We chose GPT-4o in our experiment due to its strong multimodal reasoning capabilities, and we provide additional experiments with OpenAI’s o1 model in Response Table 1 (see Reviewer tQa4 [A4]). Preliminary results show that both ELEMENTAL and Eureka improve under o1, and ELEMENTAL continues to outperform Eureka in 7 out of 9 tasks—suggesting our framework is robust across some VLM choices. We will explore open-source VLMs in future work.

We agree that exploring richer reward representations is a promising direction for future work. We opted for linear combinations of features for potential human interpretability, providing insight into the importance of each feature. While prompting a VLM to write features and then write a reward function is possible (Eureka’s prompt effectively does this), we find that pairing feature construction with IRL leads to better alignment with demonstrations, as IRL naturally handles balancing feature weights through optimization rather than relying on one-shot reward drafting.
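
To make the linear-reward design choice concrete, below is a minimal sketch (not the authors' implementation) of a feature-matching weight update of the kind used by MaxEnt-IRL-style methods, where the gradient with respect to the weights is the gap between demonstration and policy feature counts; the data are random stand-ins.

# Sketch of a linear reward over VLM-drafted features with a feature-matching
# weight update (illustrative only; not the paper's exact optimization code).
import numpy as np

def linear_reward(features, w):
    # features: (T, d) per-step feature vectors; w: (d,) weights
    return features @ w

def irl_weight_step(w, demo_features, policy_features, lr=0.1):
    # Feature counts: average of phi(s) over the trajectory.
    mu_demo = demo_features.mean(axis=0)
    mu_policy = policy_features.mean(axis=0)
    # MaxEnt-IRL-style gradient: move weights to close the feature-count gap.
    w = w + lr * (mu_demo - mu_policy)
    return w / (np.linalg.norm(w) + 1e-8)  # weight normalization

# Toy usage with random stand-ins for demonstration and rollout features.
rng = np.random.default_rng(0)
w = np.zeros(3)
demo_phi = rng.normal(0.5, 0.1, size=(200, 3))
rollout_phi = rng.normal(0.0, 0.1, size=(200, 3))
for _ in range(50):
    w = irl_weight_step(w, demo_phi, rollout_phi)
print("learned weights:", w)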

[Q6 (Weakness 5, Question 1)] Keyframe

[A6] Our real-world user study uses 10 equally spaced captured frames, demonstrating that ELEMENTAL works well with simple, automated keyframing. Please refer to Reviewer tQa4 [A2] for our full response.

Reviewer Comment

The rebuttal successfully addressed most of my concerns. I have now raised my score.

Review (Rating: 4)

In this paper, the authors propose a framework that combines natural language guidance with visual user demonstration to align robot behavior. Using inverse RL and iterative self-reflection, ELEMENTAL improves task success by 41.3% over previous methods in out-of-distribution tasks.

In the first stage, task-related features are inferred through the VLM. In the next stage, a reward function is optimized using these feature functions to match the demonstrations, performing IRL. The final stage, called reflection, iteratively improves the feature functions created in stage one and completes the learning loop.

Update after rebuttal

The authors addressed my questions, and I am keeping my original score, 4:accept.

Questions for Authors

.

Claims and Evidence

The framework proposed in this paper appears convincing. In particular, Eureka, which is the most similar prior work, has already demonstrated the effectiveness of RL automation using VLMs. This study goes a step further by automating not only the reward function but also the feature extractor with the VLM, showing even better results.

However, regarding the process of improving the feature function based on the feature counts of the trajectories generated by the trained rollouts and the demonstration trajectories, it is a little unclear how exactly the features change across updates. This is understandable since this part leverages a black-box VLM, but if the authors could provide more insight into this process with a thorough analysis, it would be helpful for readers.

Methods and Evaluation Criteria

The proposed method, evaluation criteria, and the baselines are considered appropriate.

Theoretical Claims

.

Experimental Design and Analysis

The experimental design and analysis seem valid.

Supplementary Material

Checked Appendix

Relation to Prior Literature

This study represents one direction in the line of research on RL automation based on VLMs/LLMs and is part of the same context as Eureka, which the authors have cited. However, it achieves a higher level of automation compared to Eureka and can be considered a novel pipeline.

Missing Essential References

.

Other Strengths and Weaknesses

It is considered one of the studies that increases the efficiency in reward function automation, improving over previous research.

Other Comments or Suggestions

The paper is well-written and easy to follow. However, I believe readers could gain better insights if the following two points were included:

  • The process described in Appendix C.2., where the VLM autonomously updates the feature function, is helpful for understanding the effectiveness of the proposed pipeline. I suggest the authors visualize how the VLM adds new features over time, and how each feature's contribution to the reward evolves.
  • Provide a comparison showing how much more flexibility in reward design is achieved compared to Eureka.
Author Response

We thank the reviewer for the positive evaluation and for recognizing ELEMENTAL’s contribution in automating not only reward design but also feature construction. We are glad the reviewer found the paper clear and the method well-motivated. All updated tables and figures are included in https://shorturl.at/YHEDU (referred to as Response Table and Response Figure), following ICML rules. Below, we address the comments.

[Q1] Feature refinement through self-reflection

[A1] We agree that visualizing the evolution of the feature set and feature weights would improve reader understanding, and we will add such visualizations to the supplementary material. Below, we show further analysis of the Humanoid case study in Appendix C.2:

  • 1st round: The VLM proposes three features—forward_velocity, uprightness, and heading_alignment. The learned policy is overly conservative and slow, achieving low episode lengths and high uprightness and heading alignment.

  • 2nd round: The VLM revises the feature function by (1) adjusting normalization of the existing features and (2) introducing a new feature, lateral_velocity, to capture stride consistency and stabilize side-to-side movement.

  • Outcome: The revised reward weights assign positive weights to both forward_velocity and lateral_velocity, improving alignment with the demonstration. heading_alignment receives a smaller weight but still matches the demo, suggesting the overemphasis in the previous round was corrected. Episode length increases from 691 to 932, reflecting that the learned reward function is now better aligned with the ground-truth objective.

This example illustrates how the self-reflection loop enables meaningful revisions to the feature function and their relative importance. During each self-reflection round, ELEMENTAL compares the feature counts from the learned policy against those from the demonstration and feeds this discrepancy to the VLM. The VLM interprets this feedback to revise the feature function: adding missing features, modifying existing ones, or discarding those deemed unhelpful. While the VLM operates as a black box, the output feature code and IRL weights are transparent and human-readable—making it possible to inspect how ELEMENTAL adapts its reward representation over time.
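
As an illustration of this feedback step, a minimal sketch of how feature-count discrepancies could be turned into a textual message for the VLM is shown below; the message format, tolerance, and numbers are assumptions, not the actual self-reflection prompt.

# Illustrative construction of feature-count discrepancy feedback for the VLM
# (format, tolerance, and data are assumptions, not the authors' prompt).
import numpy as np

def feature_count_feedback(feature_names, demo_phi, rollout_phi, tol=0.05):
    mu_demo = demo_phi.mean(axis=0)
    mu_policy = rollout_phi.mean(axis=0)
    lines = []
    for name, d, p in zip(feature_names, mu_demo, mu_policy):
        gap = p - d
        if abs(gap) > tol:
            lines.append(f"- {name}: demo={d:.3f}, policy={p:.3f} (gap {gap:+.3f})")
    if not lines:
        return "Policy feature counts match the demonstration within tolerance."
    return ("The policy's feature counts differ from the demonstration:\n"
            + "\n".join(lines)
            + "\nPlease revise the feature function (add, remove, or renormalize features).")

# Toy usage with synthetic feature counts for a Humanoid-style feature set.
names = ["forward_velocity", "uprightness", "heading_alignment"]
rng = np.random.default_rng(1)
demo_phi = rng.normal([1.0, 0.8, 0.9], 0.05, size=(100, 3))
rollout_phi = rng.normal([0.4, 0.95, 0.9], 0.05, size=(100, 3))
print(feature_count_feedback(names, demo_phi, rollout_phi))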

[Q2] Flexibility in reward design compared to Eureka

[A2] We thank the reviewer for this insightful suggestion. A key advantage of ELEMENTAL over Eureka is its ability to construct richer, context-sensitive features by combining multimodal inputs (language + demonstrations). While Eureka interprets task objectives solely from text, ELEMENTAL leverages visual demonstrations to ground ambiguous or under-specified instructions, leading to more expressive reward features.

For example, in our real-world user study (see Reviewer tQa4 [A1]), one participant taught the robot to "mix bowl with spoon" using the instruction: "First, the robot should lower its gripper toward the inside of the bowl with the spoon pointing downward. Then, the robot should move in a way to make the spoon move in a circular motion for mixing."

This instruction contains temporal dependencies (e.g., “first...then”) and spatial relations that are difficult to resolve with language alone. The reward function from Eureka misses this temporal nuance and encodes the task with a static orientation reward:

from typing import Dict, Tuple

import torch

def compute_reward(ee_pos: torch.Tensor, bowl_position_tensor: torch.Tensor) -> Tuple[torch.Tensor, Dict[str, torch.Tensor]]:
    # other contents omitted due to space limit
    # (desired_orientation and orientation_reward_temp are defined in the omitted portion)

    # Orientation reward: reward alignment of the end-effector orientation with a fixed target
    ee_orientation = ee_pos[:, 3:7]
    dot_product = torch.abs(torch.sum(ee_orientation * desired_orientation, dim=-1))
    orientation_reward = torch.exp(orientation_reward_temp * (dot_product - 1))

In contrast, ELEMENTAL interprets the demonstration to encode timing and conditional dependencies. It defines a feature that encourages early reorientation only when distant from the bowl:

from typing import Dict

import torch

def compute_feature(obs_buf: torch.Tensor) -> Dict[str, torch.Tensor]:
    # other contents omitted due to space limit
    # (ee_orientation and distance_to_bowl are defined in the omitted portion)

    # 3. Reorient while distant to avoid collision: only reward a downward-pointing
    # end-effector while the gripper is still far from the bowl
    down_direction = torch.tensor([0.0, 0.0, -1.0], device=obs_buf.device)
    orientation_similarity_far = torch.nn.functional.cosine_similarity(ee_orientation[:, :3], down_direction.unsqueeze(0), dim=-1)
    is_far = distance_to_bowl >= 0.2
    reorientation_early = torch.where(is_far, orientation_similarity_far, torch.tensor(0.0, device=obs_buf.device))

This example highlights ELEMENTAL’s greater flexibility in reward design: it constructs temporally-aware and spatially-grounded features by aligning language with visual demonstrations, something difficult to express in language alone.

Participants in the study often asked, “Should I describe this (for example, moving to the left/right side) from my perspective or the robot’s?”—underscoring the inherent ambiguity in language-only reward design.

Reviewer Comment

Thank you for the detailed explanation of my question. In particular, the example the authors provided could greatly help readers' understanding of the difference between ELEMENTAL and EUREKA. I will maintain my current recommendation (4: Accept).

Review (Rating: 4)

This paper introduces ELEMENTAL, a framework for reward design in robotics that integrates vision-language models (VLMs) with an inverse reinforcement learning (IRL) backbone. The authors aim to address the shortcomings of purely language-based reward engineering, particularly the difficulty of specifying nuanced features and balancing them properly. Instead, they propose using visual demonstrations and language-based prompts to construct an initial feature function, then iteratively refine that feature function and the learned policy through a self-reflection loop that compares the policy’s behavior (in terms of feature values) to the demonstration. This loop adjusts the reward function so that the final policy better matches the user’s intended behavior. Empirical evaluations on challenging IsaacGym tasks (locomotion and manipulation) show that ELEMENTAL outperforms both standard IRL methods (that lack VLM-powered feature extraction) and prior language-based reward-design approaches like EUREKA.

Questions for Authors

My suggestions are stated above.

Claims and Evidence

The main claim of the paper is that incorporating a VLM and demonstrations enables better reward design than a pure LLM-based approach like EUREKA. This claim is supported effectively in the paper via various comparisons against EUREKA and ablations of the proposed method itself. For example, experimental ablations show that when ELEMENTAL is provided only with text demonstrations (or no demonstrations at all), performance drops significantly compared to the default setting with high-quality visual demos.

Methods and Evaluation Criteria

Yes, it makes sense. The IsaacGym benchmark used in the paper has been used in prior works (e.g., EUREKA).

Theoretical Claims

The paper does not introduce theoretical claims.

Experimental Design and Analysis

The paper compares ELEMENTAL to a variety of baselines: standard IRL, behavior cloning (BC), random policies, the ground-truth reward, and a prior language-based reward method (EUREKA). This set is sufficiently comprehensive to showcase where their method sits in terms of performance bounds (random and ground truth) as well as direct competitors in LfD and LLM-based approaches.

Supplementary Material

I reviewed the supplementary material and it looks fine.

Relation to Prior Literature

ELEMENTAL extends decades of IRL research by introducing a language-based mechanism to produce reward features, circumventing the heavy reliance on manually crafted feature representations. At the same time, ELEMENTAL uses a VLM to do reward specification via IRL, which is a less common usage of VLMs for robotics. So overall, I think this paper offers a nice combination of existing ideas.

Missing Essential References

Essential related works are discussed and compared in great length.

Other Strengths and Weaknesses

The proposed approach has a couple of weaknesses that I'd like to see the authors address during the rebuttal. First, I'd like to understand how sensitive ELEMENTAL is to the quality of the provided demonstrations. This is mentioned in the limitations section, but a table that shows the sensitivity would be helpful. An interesting question is whether future versions could incorporate “demonstration filtering” or handle partial demonstrations. Second, ELEMENTAL seems to require a lot of hyperparameter tuning. Some of the approach's success presumably hinges on certain design decisions (e.g., gradient and weight normalization). One might wonder how robust it is if we alter those normalizations or if a certain environment has a drastically different scale.

Other Comments or Suggestions

N/A.

Author Response

We thank the reviewer for the valuable feedback. We are glad that the reviewer found our integration of VLMs with IRL to be a compelling combination and appreciated our empirical comparisons and ablation studies. All updated tables and figures are included in https://shorturl.at/YHEDU (referred to as Response Table and Response Figure), following ICML rules. We respond to the reviewer’s insightful suggestions below:

[Q1] Sensitivity to demonstration quality

[A1] We thank the reviewer for this important question. We assess ELEMENTAL’s sensitivity to demonstration quality in both simulation and real-world settings:

  • High-quality visual demonstrations are informative for VLMs to extract meaningful task semantics, as illustrated by the ELEMENTAL w/ random visual demo condition in Table 1.
  • In our real-world user study (see Reviewer tQa4 [A1]), demonstrations were provided by human participants—potentially noisy and imperfect compared to RL-generated ones. ELEMENTAL still achieved significantly higher task and strategy scores than Eureka. As one participant noted when teaching the Go to mixture bowl skill: “Even if my demonstration was slightly to the left of the mixture bowl, ELEMENTAL can help me fix this when I give it feedback and successfully put ingredients in the mixing bowl.” This highlights ELEMENTAL’s ability to recover from imperfect input by constructing intent-aligned features and optimizing them through IRL.

We agree that handling low-quality or partial demonstrations is an important future direction. Techniques such as demonstration filtering (e.g., based on user confidence, VLM scoring, or automatic ranking algorithms) or learning-from-suboptimal-demonstration methods [1–3] could enhance robustness. We will include these discussions in the future work section.

[1] Brown, D., Goo, W., Nagarajan, P., & Niekum, S. (2019, May). Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations. In International conference on machine learning (pp. 783-792). PMLR.

[2] Chen, L., Paleja, R., & Gombolay, M. (2021, October). Learning from suboptimal demonstration via self-supervised reward regression. In Conference on robot learning (pp. 1262-1277). PMLR.

[3] Beliaev, M., Shih, A., Ermon, S., Sadigh, D., & Pedarsani, R. (2022, June). Imitation learning by estimating expertise of demonstrators. In International Conference on Machine Learning (pp. 1732-1748). PMLR.

[Q2] Hyperparameter and design choices sensitivity

[A2] We appreciate the reviewer’s concern. We validated our design choices and hyperparameters across nine simulated domains and the real-world salad mixing user study (see Reviewer tQa4 [A1]), which together span diverse robotic settings—locomotion, manipulation, and human-in-the-loop learning. We used the same ELEMENTAL hyperparameters and design components (e.g., gradient and weight normalization) across all tasks, demonstrating robustness without per-environment tuning.

That said, RL and IRL can still be sensitive to hyperparameters—an open challenge in the field [4-6]. ELEMENTAL is the first framework to integrate VLMs with IRL using multimodal inputs, and we agree future work can improve its IRL backend. Our current normalization strategies help stabilize IRL optimization, and more advanced approaches (e.g., AIRL) could further improve robustness. We will include these discussions in the future work section.

[4] Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., & Meger, D. (2018, April). Deep reinforcement learning that matters. In Proceedings of the AAAI conference on artificial intelligence (Vol. 32, No. 1).

[5] Hussenot, L., Andrychowicz, M., Vincent, D., Dadashi, R., Raichuk, A., Ramos, S., ... & Pietquin, O. (2021, July). Hyperparameter selection for imitation learning. In International Conference on Machine Learning (pp. 4511-4522). PMLR.

[6] Adkins, J., Bowling, M., & White, A. (2024). A method for evaluating hyperparameter sensitivity in reinforcement learning. Advances in Neural Information Processing Systems, 37, 124820-124842.

Reviewer Comment

Thank you for the response -- I will maintain my original acceptance score.

Review (Rating: 3)

The paper introduces ELEMENTAL, which combines VLMs with Learning from Demonstration (LfD) to address challenges in reward design for robotic tasks. ELEMENTAL leverages visual demonstrations and natural language descriptions to generate task-relevant feature functions, which are optimized through an enhanced Maximum Entropy Inverse Reinforcement Learning (MaxEnt-IRL) algorithm. The framework incorporates a self-reflection mechanism to iteratively refine feature functions, reward functions, and policies, ensuring alignment with user demonstrations. Experimental results on IsaacGym benchmarks show that ELEMENTAL outperforms state-of-the-art methods.

Questions for Authors

Question 1: How are the key frames obtained? Can they be derived or parsed using the VLMs? I am concerned that using key frames may make it difficult for this method to scale up, especially for complex manipulation-related tasks.

Question 2: What is the runtime of the ELEMENTAL algorithm, and how efficient are its individual modules compared to other methods? Providing these details would improve the paper.

Question 3: It would be better if the authors could provide results or experiments demonstrating deployment in real-world scenarios.

Question 4: This paper lacks ablation studies on prompt design and the selection of VLMs, which would also be very valuable.

Claims and Evidence

The paper demonstrates the effectiveness and generalization capabilities of ELEMENTAL through experiments and validates its individual components.

However, the experiments are conducted solely in the IsaacGym environment. It would be better if additional experiments were performed in other simulation environments to further validate the approach.

Additionally, can ELEMENTAL be deployed in real-world settings? Including real-world experiments would strengthen the paper.

Furthermore, I suggest that the authors add an additional column in Tables 1 and 2 to report the mean values for better clarity.

Methods and Evaluation Criteria

This paper primarily utilizes a VLM to design rewards and introduces LfD (IRL) to address the issues that VLMs struggle to balance the importance of different features, generalize poorly to out-of-distribution robotic tasks, and cannot properly represent the problem with text-based descriptions alone.

The evaluation mainly focuses on the rewards in IsaacGym Environments, which is reasonable.

Theoretical Claims

N/A

Experimental Design and Analysis

The experimental evaluation focuses mainly on nine tasks in IsaacGym. I suggest that the authors validate their approach in additional simulation environments or on more complex manipulation-related tasks.

If the authors could demonstrate its application in the real world, that would be even better.

Supplementary Material

Yes.

Relation to Prior Literature

This paper primarily builds upon the previous use of VLMs for reward design by introducing LfD (IRL). In a sense, this requires more supervision, but overall, it leads to better results.

Missing Essential References

As far as I know, the authors have discussed the related works.

Other Strengths and Weaknesses

Strengths 1: Combining inverse reinforcement learning (IRL) to enhance reward design with vision-language models (VLMs) is both novel and interesting, while also making sense conceptually.

Strengths 2: The experimental results demonstrate that the proposed ELEMENTAL framework is highly effective.

Strengths 3: The techniques used, such as self-reflecting on features, optimizing the reward function, and initial prompt design, are reasonable and convincing, and their effectiveness is validated through experiments.

Weaknesses: The main drawbacks lie in the scalability of the method and the experimental evaluation. For details, please see the "Questions For Authors" section.

Other Comments or Suggestions

N/A

Author Response

We thank the reviewer for the constructive feedback and for highlighting the strengths of our IRL-VLM integration, self-reflection mechanism, and experimental results. In response, we have added real-world user study results, runtime analysis, and experiments using OpenAI o1 model. All updated tables and figures are included in https://shorturl.at/YHEDU (referred to as Response Table and Response Figure), following ICML rules. We address each point below.

[A1 (Question 3)] Real-world experiment

We conducted a within-subject user study showing that ELEMENTAL achieves significantly better ratings from users than Eureka. In the study, 12 participants taught a Kinova JACO arm to complete a salad mixing task (illustration shown in Response Figure 2). To keep the study duration manageable, participants were asked to teach three core skills—Go grasp mushroom, Go drop at mixture bowl, and Mix bowl with spoon—while the remaining skills—Go grasp pepper, Go grasp tomato, Go to home—were predefined. At the beginning of the study, we informed participants of the skill set and how the skills would be composed into a final full-task execution during the evaluation phase.

Each skill was taught twice per participant, once using ELEMENTAL and once using Eureka (order randomized). For each skill, after the initial kinesthetic demonstration and a natural language description of intent, participants observed the learned robot policy and provided textual feedback. This observation–feedback cycle was repeated twice per algorithm, consistent with our simulated experiments. After teaching all three skills with both algorithms, participants observed blind executions of the full salad mixing task (using each method’s learned and predefined skills) and rated them using 7-point Likert scales on two criteria:

  • Task performance (i.e., whether the robot accomplishes the task)
  • Strategy alignment (i.e., whether the robot’s execution matches user intent/preferences)

Each criterion consisted of four questions, resulting in the summed scores ranging from 4 to 28.

The user study results showed ELEMENTAL outperformed Eureka significantly:

  • Task score: ELEMENTAL 20.58 ± 4.93 vs. Eureka 12.42 ± 4.72, t(11) = -4.65, p < .001
  • Strategy score: ELEMENTAL 19.83 ± 6.13 vs. Eureka 10.50 ± 4.32, t(11) = -4.20, p < .001

These results demonstrate ELEMENTAL’s superior alignment with user intent and effectiveness in real-world settings. To enable an interactive, real-time user study, both algorithms were tuned to complete each learning round in under 4 minutes by training via IsaacGym-based simulation on servers with NVIDIA A40 GPUs. This also demonstrates ELEMENTAL’s feasibility on real-world, out-of-distribution problems.

[A2 (Question 1)] Keyframe and scalability

In the four locomotion domains, we used temporally equally spaced frames and superimposed them into a single image (an automatic process). In our simulated manipulation domains, keyframes were selected by experts, though [1] shows that keyframe selection in robotics tasks is user-friendly. In the real-world user study, we used ten equally spaced frames captured via a ZED camera during each demonstration—an automatic and scalable process.
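
For concreteness, equally spaced keyframing of this kind could be implemented as sketched below, assuming OpenCV for frame reading; this is an illustrative sketch, not necessarily the capture pipeline used in the study.

# Extract N temporally equally spaced frames from a demonstration video
# (illustrative sketch using OpenCV; not necessarily the authors' pipeline).
import cv2
import numpy as np

def equally_spaced_keyframes(video_path, n_frames=10):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), n_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

# e.g., keyframes = equally_spaced_keyframes("demo.mp4", n_frames=10)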

Across all settings, ELEMENTAL performs robustly, suggesting low sensitivity to the keyframing method. We agree exploring automated keyframe selection via VLMs is a promising direction for future work.

[1] Akgun, B., Cakmak, M., Jiang, K., & Thomaz, A. L. (2012). Keyframe-based learning from demonstration: Method and evaluation. International Journal of Social Robotics.

[A3 (Question 2)] Runtime

As reported in Section 5.1, with the same number of policy training environment steps, Eureka averaged 68.2 minutes across the nine tasks, while ELEMENTAL averaged 168.4 minutes. Importantly, our user study demonstrates that ELEMENTAL can be deployed interactively in real time, with each learning round completing in under 4 minutes. We agree that reducing runtime is a valuable future direction, possibly via more advanced IRL algorithms, as discussed in Section 6.

[A4 (Question 4)] VLM ablation, prompt design, and reporting mean result values

For mean values, please refer to [A1] of Reviewer 5aJa, where we report results over five seeds with statistical tests, showing that ELEMENTAL significantly outperforms Eureka.

To study the effect of VLM choice, we include preliminary results using OpenAI’s o1 model in Response Table 1. While full 5-seed runs are ongoing due to time limits, current results show that both ELEMENTAL and Eureka improve with o1, and ELEMENTAL continues to outperform Eureka in 7 out of 9 tasks (on average 37% gain). This suggests that ELEMENTAL’s advantages are robust across some VLM choices. We will update the table once full results are available.

Regarding prompt design, our prompts are developed based on Eureka’s and kept similar (Supplementary Section A), minimizing the likelihood that performance differences arise from prompt tuning.

Reviewer Comment

I thank the authors for answering my questions. I will maintain my original positive score.

Review (Rating: 3)

The paper proposes an approach to inverse reinforcement learning (IRL) that uses the knowledge of a VLM to construct code that computes state features from the environment. These features are then used with MaxEnt IRL and iteratively refined online to match the demonstration trajectories. Experiments show better performance than past imitation (no language) and reward inference (no demos) methods.

Questions for Authors

"The two feature count vectors are then fed back to the VLM, which uses the feature count differences to revise the feature function ϕ(s)."

  • What is the theoretical justification for Eq. (7), which penalizes these differences?

  • Is there evidence that the VLM is able to improve these feature count differences (Eq. 7) in "phase 3"? The "self-reflection" prompt tells it to, and it improves overall performance, but there is no ablation showing that it actually improves the feature discrepancy.

Experiments

  • Why is Table 5 different from Table 3?
  • Which results are statistically significant?

Claims and Evidence

The paper claims that the method outperforms prior imitation, IRL, and reward design with language methods in terms of reward recovery and task success trained on recovered rewards. The results show quantitative comparisons on IsaacGym control environments that are consistent with these claims.

Issues:

  • Most of the results are reported without standard deviation/error or any statistical testing
  • When error bars are included (Figure 3, Table 5), they do not seem to show significant differences between the method and baselines
  • The baseline methods all either utilize the demonstrations or the language, never both. A natural baseline to include for a fair comparison would be some form of learning-from-demonstrations method (such as GAIL or BC) applied to the VLM-generated feature code.

Methods and Evaluation Criteria

Yes, the idea of combining IRL with VLM semantic knowledge is sound, and the environments and baselines are good, with the exception of the issues mentioned above.

Theoretical Claims

None provided, though the paper would benefit from more theoretical justification for "phase 3" (Eq. 7).

Experimental Design and Analysis

  • There appear to be no error bars or statistical significance testing, except in Table 5 and Figure 3.
  • It is unclear how the "successful code execution rate" per iteration reflects on the two methods. The paper claims that it shows that reward features are better than reward functions, but could this be a function of the prompts used? The figure also appears to have overlapping error bars at each point, making its value questionable.

Supplementary Material

Yes, I looked through the prompts, example outputs, and experimental details.

Relation to Prior Literature

The paper provides a useful synthesis of ideas in IRL and recent advances in using VLM knowledge for decision making.

Some related works that could also be discussed include existing approaches that have incorporated language into modeling an environment [1,2,3,4] or as a semantic prior for learning from other data [5,6].

References

[1] Lin, J. et al., 2024. ''Learning to Model the World With Language.'' ICML

[2] Ma, Y. J. et al., 2023. ''LIV: Language-Image Representations and Rewards for Robotic Control.'' ICML

[3] Fan, L. et al., 2022. ''MineDojo: Building Open-Ended Embodied Agents With Internet-Scale Knowledge.'' NeurIPS

[4] Nair, S. et al., 2022. ''R3m: A Universal Visual Representation for Robot Manipulation.'' CoRL

[5] Myers, V. et al., 2024. ''Policy Adaptation via Language Optimization: Decomposing Tasks for Few-Shot Imitation.'' CoRL

[6] Adeniji, A. et al., 2023. ''Language Reward Modulation for Pretraining Reinforcement Learning.'' arXiv:2308.12270

Missing Essential References

N/A

Other Strengths and Weaknesses

Strengths:

  • Reward ambiguity is a major challenge in IRL. The proposed approach is a novel attempt to use VLM knowledge as a semantic prior to resolve this ambiguity.

Weaknesses:

  • The overall results (without error bars) don't provide strong support for the claims
  • I was confused by the justification for the "phase 3" component

Other Comments or Suggestions

  • Line 777: Restuls ⇒ Results
Author Response

We thank the reviewer for the thoughtful and constructive feedback, and for recognizing that ELEMENTAL presents a novel approach to resolving reward ambiguity in IRL by using VLM knowledge as a semantic prior. In response to the reviewer’s comments, we have added new experiments, expanded statistical analysis, and revised figures to address the concerns raised. All updated tables and figures are included in https://shorturl.at/YHEDU (referred to as Response Table and Response Figure), following ICML rules. We address each point below.

[Q1 (Issues 1 & 2, Weakness 1, Questions 3 & 4)] Statistical significance

[A1] We thank the reviewer for highlighting the importance of statistical tests. We increased the number of random seeds from 3 to 5 for both ELEMENTAL and Eureka across all benchmark and generalization tasks. We report the mean and standard deviation in Response Table 1 (benchmark) and Response Table 3 (generalization), along with statistical tests. ELEMENTAL performs better in 8/9 benchmark tasks (5/9 statistically significantly, p < .05 or p < .01) and in 4/4 generalization tasks (2/4 statistically significantly, p < .05). Notably, ELEMENTAL achieves a 122.5% average gain across benchmarks and an 81.2% gain in generalization. We also: 1) updated original Tables 1 and 3 (max success across three seeds) to Response Tables 2 and 4 (across five seeds); 2) updated original Table 2 (max reward correlation across three seeds) to Response Tables 5 and 6 (mean and max reward correlation across five seeds).

The original Figure 3 used standard deviations, leading to large shaded areas. We have updated this in Response Figure 1 to report standard errors instead. A two-way repeated measures ANOVA (across nine paired tasks) shows significant main effects for algorithm, F(1, 8) = 7.00, p = .030, and round, F(2, 16) = 10.03, p = .002; the interaction is not significant, F(2, 16) = 2.21, p = .144. This indicates ELEMENTAL achieves statistically significantly higher code execution rates than Eureka.
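
For readers who wish to reproduce this kind of analysis, a two-way repeated-measures ANOVA of the form described above could be computed as sketched below; the data frame contents are placeholders, not the paper's execution-rate numbers.

# Illustrative two-way repeated-measures ANOVA on code execution rate, with
# task as the repeated "subject" factor (placeholder data, not the paper's).
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(0)
rows = []
for task in range(9):                      # nine paired tasks
    for alg in ["ELEMENTAL", "Eureka"]:
        for rnd in [1, 2, 3]:              # three feedback rounds
            rows.append({"task": task, "algorithm": alg, "round": rnd,
                         "exec_rate": rng.uniform(0.5, 1.0)})  # placeholder
df = pd.DataFrame(rows)

result = AnovaRM(df, depvar="exec_rate", subject="task",
                 within=["algorithm", "round"]).fit()
print(result)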

Regarding whether prompts impact execution rate: we agree prompt design can influence execution rates. However, as detailed in Supplementary Section A, our prompts are developed based on Eureka's and kept as similar as possible.

[Q2 (Issue 3)] Baseline combining LfD and VLM-generated feature code

[A2] We thank the reviewer for this valuable suggestion. We implemented a VLM+BC baseline that uses the same VLM-generated feature functions as ELEMENTAL, transforms observations into feature space, and trains a BC policy mapping features to actions. As shown in Response Table 1, this baseline performs poorly—more than 50% worse than both ELEMENTAL and Eureka. This highlights that combining demonstrations and language alone is insufficient: BC suffers from covariate shift and lacks ELEMENTAL’s self-reflection loop, which iteratively refines the feature function.
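
A features-to-actions behavior-cloning baseline of this kind could look like the following sketch; the MLP architecture, MSE loss, and data are illustrative assumptions, not the exact implementation used in the rebuttal experiments.

# Illustrative VLM+BC baseline: map VLM-generated features phi(s) to actions
# with an MLP and an MSE loss (assumed setup, not the authors' exact code).
import torch
import torch.nn as nn

class FeatureBCPolicy(nn.Module):
    def __init__(self, feat_dim, act_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, phi):
        return self.net(phi)

def train_bc(policy, demo_features, demo_actions, epochs=100, lr=1e-3):
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        loss = nn.functional.mse_loss(policy(demo_features), demo_actions)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy

# Toy usage with random stand-ins for demonstration features and actions.
demo_phi = torch.randn(256, 8)
demo_act = torch.randn(256, 4)
policy = train_bc(FeatureBCPolicy(feat_dim=8, act_dim=4), demo_phi, demo_act)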

We agree that exploring more advanced IRL methods (e.g., GAIL or AIRL) in place of Approximate MaxEnt-IRL would be a promising direction, as we noted in Section 6.

[Q3] More Related Works

[A3] We thank the reviewer for pointing out these relevant works. ELEMENTAL distinguishes itself from prior work by coupled reward inference and VLM-based feature drafting, as well as grounding in both demonstration and textual input. The referenced papers explore the use of language for modeling environments, reward shaping, or task decomposition, often treating language as a prior for pretraining or few-shot adaptation. In contrast, ELEMENTAL uniquely integrates VLMs into the IRL process by generating executable feature functions from visual-language prompts and iteratively refining them through self-reflection. We will incorporate discussion of these papers in the revised manuscript.

[Q4 (Weakness 2, Questions 1 & 2)] Justification and empirical support for Phase 3 (Self-reflection)

[A4] We thank the reviewer for these important questions. To clarify: the VLM does not directly penalize feature count discrepancies in Eq. (7). Instead, the discrepancies are provided as feedback, and the VLM interprets them—deciding whether to add, remove, or adjust features to better capture task-relevant behaviors and demonstration preferences.

We show an empirical evidence in Supplementary C.2 (Humanoid domain):

  • The 1st-round feature function (Box 1) included forward_velocity, uprightness, and heading_alignment.

  • Feedback (Box 2) showed underperformance in forward_velocity and overly conservative uprightness.

  • The VLM revised the feature function (Box 3), adding lateral_velocity and adjusting normalizations.

  • The 2nd-round result (Box 4) showed improved alignment: forward_velocity increased, and uprightness decreased.

This demonstrates that VLM self-reflection correctly improves feature alignment, and Eq. (7) provides the information the VLM needs to revise features based on this comparison.

Reviewer Comment

Thank you for your response. This addresses my main concerns; I have raised my score.

Final Decision

The paper introduces ELEMENTAL, a framework for reward design in robotics that integrates vision-language models (VLMs) with inverse reinforcement learning (IRL). It leverages visual demonstrations and natural language descriptions to generate task-relevant feature functions, addressing shortcomings of purely language-based reward engineering. The framework incorporates a self-reflection mechanism to iteratively refine feature functions, reward functions, and policies, ensuring alignment with user demonstrations. Experimental results on IsaacGym and real-world tasks show that ELEMENTAL outperforms state-of-the-art methods such as standard IRL methods (that lack VLM-powered feature extraction) and prior language-based reward-design approaches like EUREKA.

The main motivation of this work is that language can be ambiguous for task requirement specification, and hence using user demonstrations is a good way to reduce this ambiguity and create a better interface for human task specification. The authors provide good examples of this in the discussion with reviewers. All reviewers agree that the idea of combining IRL with VLM semantic knowledge is novel and sound, and that the environments and baselines are good. The experimental results demonstrate the effectiveness of the proposed approach. The paper is well-written and easy to follow.

The authors should include the new experiments, expanded statistical analyses, revised figures, real-world user study results, runtime analysis, and the discussion of feature refinement through self-reflection and of reward-design flexibility compared to Eureka as part of the final revised paper draft.