PaperHub
Overall rating: 6.8 / 10
Decision: Poster. 4 reviewers (ratings 4, 4, 4, 5; min 4, max 5, std 0.4)
Average confidence: 3.8
Originality 2.8 · Quality 3.5 · Clarity 3.5 · Significance 2.5
NeurIPS 2025

Hyper-GoalNet: Goal-Conditioned Manipulation Policy Learning with HyperNetworks

Submitted: 2025-05-08 · Updated: 2025-10-29

Keywords: Visuomotor Hypernetwork, Goal-Conditioned Policy

Reviews and Discussion

Official Review (Rating: 4)

This paper introduces Hyper-GoalNet, a framework that uses hypernetworks to generate task-specific policy network parameters from goal specifications for robotic manipulation. The authors also propose latent space shaping techniques using forward dynamics modeling and distance constraints to enhance performance. Experiments are performed on 6 Robosuite tasks, and comparisons are made with several baselines. Ablation studies on the use of a different hypernetwork (HyperZero) and the effectiveness of the two new latent space objectives are performed. The proposed approach is further tested on a real world robot, leveraging four evaluation tasks.

Strengths and Weaknesses

Originality

The proposed latent space shaping technique is simple but principled, and it shows benefits in the ablation studies. The use of hypernetworks in goal-conditioned behavioral cloning is novel. However, it is not clear whether the first claimed contribution is actually novel. In the Introduction, the authors say “we adapt optimization-inspired hypernetwork architectures for generating policy parameters conditioned on goal specifications, creating a framework that dynamically determines how current observations should be processed”, and again in Section 3 “To realize this insight, we leverage hypernetworks to dynamically generate task-specific policy parameters [...] tailored to the specified goals”. While the related work section briefly mentions recent RL approaches leveraging hypernetworks, saying they cannot be directly adapted to goal-conditioned behavioral cloning (lines 88-90), the ablation study in Section 4.3 compares Hyper-GoalNet to “Hyper-Zero [39], an alternative hypernetwork architecture, under identical training conditions”. The adaptation therefore seems straightforward.

Significance

The use of hypernetworks in goal-conditioned behavioral cloning might be of interest to the community, offering a well-motivated novel combination of existing techniques. However, while the related work section briefly mentions recent RL approaches leveraging hypernetworks, important references such as [1, 2] are missing. The relationship with previous hypernetwork-based RL approaches is unclear. The authors should expand the related work section, explaining why such approaches cannot be directly adapted to goal-conditioned behavioral cloning (lines 88-90) and how they differ from the proposed method. Otherwise, they should add comparisons to other hypernetwork-based approaches to assess the superiority of the proposed approach. Some concerns regarding the presented results for the chosen baselines raise doubts about the actual impact this work might have on the community (see the quality section for detailed comments and suggestions).

Quality

The paper includes experiments across multiple manipulation tasks (long- vs short-horizon, 3 difficulty levels) in simulation (Robosuite) and real robot experiments, with ablation studies. However, the results are not entirely convincing.

●​ In line 222, the authors claim that one of the advantages of Hyper-GoalNet is its ability to perform “natural goal completion detection”. However, they say they use “environment-provided terminal signals” for fair comparisons with the baselines (lines 278-281). Then, in section 4.3, they claim again that a key advantage of their approach is that “it enables reliable autonomous detection of task completion based on latent distances”. How is this verified?

● The discussion of the quantitative results in Tables 1-2-4 is too superficial, simply stating that the proposed approach outperforms the others. However, the baseline implementations seem weakened, with GCBC, Play-LMP, and HyperZero obtaining 0.0 success rates on all tasks and for all difficulty levels. The authors should describe how the baseline methods were re-implemented and how the hyperparameters were chosen, check for technical errors in their implementations, and carefully justify any anomalous results like the ones currently reported. The training curves in Figure 5 seem to suggest there are issues in HyperZero’s training, which might be the cause of its poor performance.

●​ Success rates are computed over 50 rollouts, but error bars are absent in all presented results. This is particularly problematic in Tables 6-8, where some of the baseline approaches have success rates that are closer to the ones obtained by Hyper-GoalNet.

Clarity

Overall, the paper is hard to follow. There are inconsistencies in terminology, missing information, and the organization can be improved.

●​ The section on “Hypernetwork training” is not very clear. What is ℓ in equation 5?

●​ In section 4.3, the authors compare against HyperZero, which is an alternative hypernetwork architecture, never introduced before. The authors should consider adding a brief description of such architecture where the other baselines are described, in section 4.2.

● In all experiments, it is not clear what the architecture of the policy network is and whether it is the same for the baselines. Could performance differences also be due to architectural differences?

● In lines 313-315, the authors say: “Our training strategy also matters – unfreezing the R3M visual encoder [32] only after 20 epochs ensures stable parameter generation.” This aspect was never mentioned before, the overall training procedure is unclear, and the code is unavailable at this time. The authors should consider adding this information to the Methods section.

● In line 314, the authors refer to epochs, while in Figure 3 they report time steps on the x-axis. Do the two terms refer to the same thing? Try to be consistent with the terminology.

●​ In Table 9, the authors report average inference time per step, and claim the superior efficiency of their method. What about training times? No information about the memory and total execution time needed to reproduce the experiments is provided.

●​ Overall, results are not reproducible with the information provided. Details about the architectures of the policy network and the hypernetwork are missing. The training procedure is not clear, and the authors do not specify how hyperparameters were chosen.

[1] Beck, J., Jackson, M. T., Vuorio, R., and Whiteson, S. "Hypernetworks in Meta-Reinforcement Learning." Conference on Robot Learning (CoRL), PMLR, 2023, pp. 1478-1487.

[2] Huang, Y., Xie, K., Bharadhwaj, H., and Shkurti, F. "Continual Model-Based Reinforcement Learning with Hypernetworks." 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi'an, China, 2021, pp. 799-805. doi: 10.1109/ICRA48506.2021.9560793.

Questions

1.​ Why do GCBC, Play-LMP, and HyperZero achieve 0% success across all tasks and difficulty levels? Please provide: (a) implementation details for each baseline, (b) verification that the original papers' results can be reproduced, (c) hyperparameter search procedures used, and (d) any debugging steps taken. If technical errors were found and fixed, I would reconsider the quality score. Without proper baseline validation, the claimed improvements cannot be trusted.

2.​ How does your approach differ from Beck et al. [1]'s hypernetwork-based meta-RL method? Your claim that "hypernetwork-based RL approaches cannot be directly adapted to goal-conditioned behavioral cloning" (lines 88-90) is contradicted by your own straightforward adaptation of HyperZero. Please clearly explain: (a) what makes your adaptation novel compared to [1], and (b) why a comparison with [1] is not included. Adding this comparison or clearly establishing novelty would address my originality concerns.

3.​ Can you provide complete architectural and training details for reproducibility? Specifically: (a) exact policy network architecture used for all methods, (b) complete hypernetwork architecture beyond Eq. 4, (c) the loss function ℓ in Eq. 5, (d) training procedure including the R3M unfreezing strategy, (e) hyperparameter selection process, and (f) training time/memory requirements. Providing these details would improve the clarity score and enable verification of results.

4.​ How do you verify the claimed "natural goal completion detection" capability? You claim this as a key advantage (lines 222, 313) but use environment-provided terminal signals in experiments. Please show experiments where: (a) your method autonomously detects task completion using latent distances, (b) comparison with environment ground truth, and (c) success/failure rates using autonomous vs. environment detection. Demonstrating this capability would strengthen the significance of your contribution.

5.​ What causes HyperZero's training instability (Figure 5) and how did you attempt to address it? The loss curves suggest HyperZero never properly converges. Did you try: (a) different learning rates or optimization strategies, (b) architectural modifications, (c) different initialization schemes? Understanding whether this is an inherent limitation of HyperZero or an implementation issue is crucial for fairly evaluating your approach's advantages.

Addressing questions 1 and 3 satisfactorily would improve Quality and Clarity scores. Resolving question 2 would improve Originality. Demonstrating question 4 would enhance Significance. A thorough response to all questions could potentially change the overall recommendation from Reject to Borderline Accept/Accept.

Limitations

The authors discuss some of the major limitations of their approach. They might add a discussion of failure modes when goal specifications lie outside the training distribution or when latent distances mispredict completion (if this feature was actually used), which could lead to unsafe behaviors.

Final Justification

I appreciate the authors' detailed rebuttal and additional details on the experiments. However, I remain somewhat concerned about the overall clarity and completeness of the paper.

Formatting Concerns

None

Author Response

We thank Reviewer nbKf for the insightful feedback and for recognizing the novelty of using hypernetworks in goal-conditioned behavioral cloning. Our responses below, supported by new experiments, aim to address your questions (also thanks for organizing the weaknesses into these structured questions) and resolve the primary concerns listed in the 'Weaknesses' section. Due to space limitations, we focus on the major points here. All other minor edits and clarifications will be incorporated directly into the revised manuscript.


Q.5 HyperZero Training Instability, Analysis, and Solutions

A. to Q.5: We appreciate the question about HyperZero's training instability. We address this first to clarify HyperZero’s issues and highlight our method’s architectural advantages.

(a) Diagnosis and Initial Comparison

Our initial attempts to stabilize HyperZero with varied learning rates and architectures failed. For the original submission, we avoided specialized initializations to ensure a fair comparison, as our method requires none. We attribute HyperZero’s instability to its architecture, which learns a direct goal-to-parameter mapping that is highly sensitive to initialization. In contrast, our architecture is inherently more robust because it simulates an optimization process, iteratively refining parameters through a series of blocks within a single forward pass.

(b) Enhanced Initialization Schemes for HyperZero

Based on your suggestion, we tested two initialization schemes:

  • ScalarInit: A learnable scalar to scale the hypernetwork’s output, stabilizing training by controlling initial weight scale.
  • Bias-HyperInit[1]: Implemented as recommended in [1] for its compatibility with high-dim conditioning inputs.

(c) Comparative Results

New results show that these initializations improve HyperZero, but our method remains superior without requiring such techniques.

Success Rate (%) on Coffee Tasks with Different Initializations

Method | D0 | D1 | D2
ScalarInit | 16% | 18% | 14%
Bias-HyperInit [1] | 30% | 18% | 0%
Ours (no special init) | 94% | 76% | 62%

These results confirm that such initialization schemes do improve HyperZero, yet our Hyper-GoalNet architecture remains superior and more robust.

Revision Plan: We will update Fig. 5 with new training curves and add this analysis to Sec. 4.3.


Q.1 Analysis of Baseline Performance

A. to Q.1: We thank the reviewer for this insightful question regarding the baselines. Below, we clarify our approach, provide analytical results, and outline revisions.

(a) HyperZero Analysis & Results

For a fair comparison, we initially used the official HyperZero implementation with the same setup as our method but without special initializations. Following your feedback, we found that HyperZero’s training is sensitive to initialization. Applying a specific initialization technique (see our Q.5 response) significantly improved its success rate. Our method still outperforms it, but these results ensure a fairer comparison.

Revision Plan: We will update Tab. 4 with new HyperZero results and add a note detailing the setup.

(b) GCBC and Play-LMP Analysis

  • Implementation: Without official code, we used reputable third-party implementations, adapted for our data pipeline with identical core hyperparameters.
  • Performance Diagnosis: We verified that the failure stems from design, not implementation errors. Both methods optimize the policy via expert action log-probability, leading to overfitting (low training loss, high/stagnant validation loss). This results in poor generalization to unseen test-set variations, especially for high-precision grasping/placing. This finding is consistent with other work (e.g., C-BeT). Our hypernetwork design avoids such overfitting through a structural bias that forces the prediction to follow an optimization process rather than simply memorizing the mapping.

Revision Plan: We will add a paragraph to Sec. 4.2 with these details, technical reasons for performance (likelihood-based vs. distance-based loss), and citing corroborating prior work. We have also added stronger baselines (e.g., Diffusion Policy; see response to Reviewer EGhA), which we believe addresses your concerns.


Q.2 Comparison with Hypernetwork-based Meta-RL

A. to Q.2: We thank the reviewer for this question, which helps us clarify our work's novelty relative to hypernetwork-based RL methods [1].

(a) Our Novelty and Clarification

  • Paradigm vs. Architecture: Our claim that RL methods "cannot be directly adapted" refers to the training paradigm. The full RL algorithms of [1, 2] are inapplicable in our reward-less BC setting. We only adapted architectures (e.g., HyperZero), not the learning methods themselves. We will clarify this on lines 88-90.
  • Architectural Design: Our method, inspired by iterative optimization, is inherently more stable and robust than methods that directly generate "sub-optimal" parameters and rely on initialization fixes like Bias-HyperInit [1].
  • Task Conditioning: Our use of high-dim goal images provides superior scalability and generalization over the low-dim one-hot vectors used for discrete tasks in [1], enabling continuous goal specification.

(b) Comparison and Planned Revisions

We did not include a comparison due to these fundamental differences (RL vs. BC). However, prompted by your review, we have now benchmarked against a key component of [1]. As shown in our Q.5 response, our method significantly outperforms a baseline with Bias-HyperInit [1], confirming our approach's benefits.

Revision Plan: We will:

  1. Distinguish the RL vs. BC paradigms in the introduction.
  2. Expand the Related Work to discuss [1] and [2], highlighting their key differences.
  3. Integrate these new results (HyperZero with Bias-HyperInit [1]) into our main experimental tables.

We believe these changes will clearly establish our work's novelty and contributions.


Q.3 Architectural and Training Details

A. to Q.3: We thank the reviewer for these important questions on reproducibility. We will add these details to the appendix.

(a) Policy Network Architecture

  • GCBC, Play-LMP, & HyperZero: Details are in our response to Q. 1.
  • C-BeT: We adapted its official, state-based implementation for our challenging image-based setting by adding a ResNet-18 encoder. This ensured a fair comparison.
  • MimicPlay-M: We used the official implementation within our identical task setup.

(b) Hypernetwork Architecture

Our hypernetwork architecture is inspired by [38] and works as follows: First, the goal image is encoded into a goal embedding. This embedding is then fed into a series of hypernetwork blocks. Each block executes a process analogous to a step of policy gradient estimation, iteratively optimizing a set of policy parameters as if performing gradient descent. The final, optimized parameters are then applied to a lightweight 3-layer MLP, which serves as the target policy network for predicting actions.

Due to space limitations, we cannot provide a full implementation breakdown here. However, we are committed to releasing our complete codebase upon publication to ensure full reproducibility.
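For illustration, here is a minimal PyTorch sketch of this iterative generation scheme. The module names, dimensions, and number of blocks are assumptions for exposition only, not our exact implementation:

```python
import torch
import torch.nn as nn

class HyperBlock(nn.Module):
    """One refinement block: maps (goal embedding, current parameter estimate)
    to an additive parameter update, loosely mimicking one optimization step."""
    def __init__(self, goal_dim, param_dim, hidden_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(goal_dim + param_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, param_dim),
        )

    def forward(self, goal_emb, params):
        delta = self.net(torch.cat([goal_emb, params], dim=-1))
        return params + delta  # iterative refinement within a single forward pass

class GoalConditionedHypernet(nn.Module):
    """Goal embedding -> flat parameter vector for a small target policy MLP."""
    def __init__(self, goal_dim=256, param_dim=4096, num_blocks=3):
        super().__init__()
        self.init_params = nn.Parameter(torch.zeros(param_dim))
        self.blocks = nn.ModuleList(
            [HyperBlock(goal_dim, param_dim) for _ in range(num_blocks)]
        )

    def forward(self, goal_emb):
        params = self.init_params.expand(goal_emb.size(0), -1)
        for block in self.blocks:   # analogous to successive optimization steps
            params = block(goal_emb, params)
        return params  # reshaped downstream into the weights of a 3-layer MLP policy
```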

(c) Loss Function in Eq. 5

The loss function ℓ in Eq. 5 is the Mean Squared Error (MSE) loss.

(d) Training Procedure

The model is trained end-to-end. The process per step is:

  1. The hypernetwork takes the goal image as input and generates the parameters for the target policy.
  2. The target policy takes the current observation as input and outputs an action.
  3. The final loss is a sum of the classic behavioral cloning loss (MSE) and our proposed latent shaping losses.

Unfreezing strategy: To protect the pretrained R3M vision encoder from noisy gradients early in training, we keep the encoder frozen for the first 20 epochs and then fine-tune it jointly with the rest of the model once the policy has stabilized. Crucially, this exact strategy was applied to all baseline methods for a fair comparison.
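For illustration, a minimal sketch of one training step as described above. The latent-shaping term shown here is a simplified stand-in for our distance-constraint objective (the full method also includes a forward-dynamics term), and the batch keys and module interfaces are assumptions:

```python
import torch.nn.functional as F

def training_step(batch, encoder, hypernet, policy, epoch, freeze_epochs=20):
    # Keep the pretrained R3M encoder frozen for the first `freeze_epochs` epochs,
    # then fine-tune it jointly with the rest of the model.
    for p in encoder.parameters():
        p.requires_grad = epoch >= freeze_epochs

    goal_emb = encoder(batch["goal_image"])        # goal image -> goal embedding
    obs_emb = encoder(batch["observation"])        # current observation embedding

    policy_params = hypernet(goal_emb)             # step 1: generate policy parameters
    pred_action = policy(obs_emb, policy_params)   # step 2: predict the action

    # step 3: behavioral cloning loss (MSE, the loss of Eq. 5) plus a simplified
    # stand-in for the latent shaping losses (distance-constraint term only).
    bc_loss = F.mse_loss(pred_action, batch["expert_action"])
    next_emb = encoder(batch["next_observation"])
    progress = (obs_emb - goal_emb).norm(dim=-1) - (next_emb - goal_emb).norm(dim=-1)
    shaping_loss = F.relu(-progress).mean()        # penalize moving away from the goal
    return bc_loss + shaping_loss
```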

(e) Hyperparameter Selection

We found our iterative, optimization-inspired method to be highly robust to the choice of hyperparameters. To maintain a fair comparison, we performed a light search and set the initial learning rate for all methods to 4e-5, using the Adam optimizer.

(f) Training Time & Memory Requirements

All experiments were conducted on a single NVIDIA RTX 4090 GPU and all methods completed training within one day. While our method requires slightly more resources than HyperZero, this is coupled with a substantial performance gain.

Metric | HyperZero | Ours
Training Time/Epoch | ~90 s | ~104 s
Memory (Frozen) | 3038 MB | 4916 MB
Memory (Unfrozen) | 13452 MB | 14844 MB

Q.4 Verifying Natural Goal Completion Detection

A. to Q.4: While our original submission qualitatively illustrated this concept of automatic detection (Figs. 3, 10, 11), we used environment signals in our main results to ensure a fair comparison with all baselines.

To quantitatively validate this capability, we now compare our autonomous detection (Auto SR) with the environment's ground truth (Env SR) using these metrics:

  • Auto SR: Success determined by our method's latent distance threshold.
  • Env SR: Success determined by the environment signal.
  • Accuracy: Agreement rate between Auto SR and Env SR.
  • Recall: Auto SR's ability to identify true successes from Env SR, a crucial metric for not missing completed tasks.

Results for Autonomous Goal Completion Detection (Coffee Tasks)

Task | Auto SR | Env SR | Accuracy | Recall
D0 | 0.96 | 0.94 | 94% | 98%
D1 | 0.78 | 0.76 | 90% | 95%
D2 | 0.74 | 0.62 | 76% | 90%
Mean | - | - | 86.6% | 94.3%

This strong alignment with the ground-truth signal, shown by an average accuracy of 86.6% and recall of 94.3%, confirms that our latent-distance-based approach is a reliable autonomous completion detector.
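For concreteness, a minimal sketch of how such a latent-distance detector and the two reported metrics can be computed (the threshold value and array layout are illustrative assumptions):

```python
import numpy as np

def auto_success(obs_latent, goal_latent, threshold=0.1):
    """Flag completion when the latent distance to the goal drops below a threshold."""
    return float(np.linalg.norm(obs_latent - goal_latent)) < threshold

def detection_metrics(auto_flags, env_flags):
    """Accuracy: agreement with the environment signal.  Recall: fraction of true
    successes (per the environment) that the detector also flags as successes."""
    auto = np.asarray(auto_flags, dtype=bool)
    env = np.asarray(env_flags, dtype=bool)
    accuracy = float(np.mean(auto == env))
    recall = float(auto[env].mean()) if env.any() else float("nan")
    return accuracy, recall
```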


We thank the reviewer again for these thoughtful questions and hope our clarifications reinforce the value and clarity of our contributions. We would be happy to address any further suggestions or questions you may have.

Comment

I thank the authors for the detailed rebuttal and additional details on the experiments. While I appreciate all of this, I remain concerned about the overall clarity and completeness of the paper.

While the core idea is promising and some of the new evidence is compelling, the current clarity and reproducibility (until code is released) do not meet the bar for a clear accept on my end. Nevertheless, I am raising my recommendation to weak reject.

Comment

Dear Reviewer nbKf,

Thank you very much for your detailed review, for taking the time to consider our rebuttal, and for raising your recommendation. We truly appreciate your constructive feedback.

We would like to briefly address your remaining concerns on clarity and reproducibility.

  1. On Clarity: We have done our best to clarify all points in our rebuttal. While we were encouraged that other reviewers found the paper easy to follow and well-written, we understand that clarity can be subjective and are committed to making our paper as clear as possible. If there are any specific sections or concepts that remain unclear to you, please let us know; we would be happy to provide further clarifications.

  2. On Reproducibility & Code: We agree that access to the code is the ultimate verification of reproducibility. To ensure we are fully compliant with conference policies on sharing external links during the discussion period, we have sent an anonymous link to our code repository to the Area Chair. We have asked for the AC's assistance in forwarding it to you, and we hope you will be able to access it soon.

Thank you again for your valuable feedback and engagement. We look forward to the possibility of further discussion.

Comment

Thanks for the additional follow-up. I think your rebuttal clarifies my doubts, but I need to understand how you're planning to incorporate these clarifications into the main paper. If you can provide details on that, as well as show the code, I'd be open to revising my recommendation to a weak accept.

Comment

Dear Reviewer nbKf,

Thank you for your prompt follow-up and for acknowledging that our rebuttal has clarified your doubts. We appreciate you providing a clear path forward. We are fully committed to incorporating all clarifications into the manuscript to make it more self-contained.

Below, we provide a detailed plan for how we will revise the main paper, followed by an update on the code accessibility.


Planned Revisions for the Main Paper

To directly address your concerns, we will make the following specific changes throughout the manuscript:

  1. Abstract:

    • We will add a link to our open-source code repository to ensure reproducibility.
  2. Introduction & Related Work:

    • We will expand our discussion of hypernetwork-based RL approaches [1, 2]. This will include:
      • A clearer delineation of the key distinction between the training paradigms (RL vs. our reward-free BC setting).
      • A more detailed explanation that while the full RL algorithms are not directly adaptable, their hypernetwork architectures (like HyperZero[3] and Bias-HyperInit[1]) can be, which we will clarify on lines 88-90.
  3. Method:

    • We will add targeted clarifications to enhance readability. This includes explicitly stating that "ℓ in Eq. 5 is the Mean Squared Error (MSE) loss" and providing a more detailed overview of the end-to-end training procedure.
  4. Experiments:

    • Hypernetwork Baselines Update: We will integrate the new comparative results for HyperZero, including the improved performance achieved with the ScalarInit and Bias-HyperInit [1] schemes. We will update the training curves (Fig. 5) and tables (Tab. 4) accordingly and add analysis to show that our proposed architecture remains superior even when baselines are enhanced with specialized initializations.
    • Baseline Analysis: We will add a detailed discussion explaining why likelihood-based baselines (GCBC, Play-LMP) underperform in our challenging, high-variation setting, attributing it to overfitting. This analysis will be supported by prior work to make our claims more robust.
    • Goal Completion Detection: To substantiate our claim of "natural goal completion detection," we will supplement the existing qualitative evidence presented in the paper (Fig. 3, Fig. 10, Fig. 11) with new quantitative results from our rebuttal. Specifically, we will incorporate the table comparing our autonomous detection against the environment's ground truth using Accuracy and Recall metrics, providing strong empirical evidence for this capability.
    • Clarity Enhancements: We will add a brief architectural description for HyperZero in the baselines section (Sec. 4.2) and clarify the x-axis of Fig. 3.
  5. Discussion:

    • We will add a brief discussion on potential failure modes and behavioral safety, as you suggested.
  6. Appendix:

    • The appendix will be expanded to include comprehensive details on:
      • More implementation specifics, hyperparameter choices, and training protocols for all methods to ensure full reproducibility.
      • A detailed breakdown of training time and memory costs for our method and key baselines.

[1] Beck, Jacob, et al. "Hypernetworks in meta-reinforcement learning." Conference on Robot Learning. PMLR, 2023.

[2] Huang, Yizhou, et al. "Continual model-based reinforcement learning with hypernetworks." 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021.

[3] Rezaei-Shoshtari, Sahand, et al. "Hypernetworks for zero-shot transfer in reinforcement learning." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 37. No. 8. 2023.


Code for Reproducibility

We agree that providing the code is important for verification. We understand and respect the conference policy regarding anonymity and the restriction on sharing external links in the rebuttal.

To address this, we have sent an anonymized link to our code repository to the Area Chair. We have respectfully requested the AC's assistance in forwarding this link to you for your review. We hope this will allow you to verify our implementation directly.


Thank you once again for your invaluable guidance throughout this process. We are confident that these comprehensive revisions will make the manuscript more polished and self-contained, in line with your helpful suggestions, and we are grateful for the opportunity to improve our work based on your feedback. We would be happy to provide any further clarifications.

Best,

Authors

Comment

Thanks, with this I will update my score to a weak accept (I'm confidently waiting for the code to be sent from the AC, although there are alternative ways to do that as well).

Comment

Dear Reviewer nbKf,

Thank you so much for your positive feedback and for your decision to update the recommendation to a positive score. We are truly grateful for your time, constructive guidance, and support for our work throughout this process.

We have confirmed that the code has been sent to the Area Chair and hope it reaches you soon.

Thank you once again for your invaluable engagement!

Best,

Authors

Official Review (Rating: 4)

This paper proposes a novel network architecture for goal-conditioned policy learning, where the goal is specified via an image. The central contribution is the use of a hypernetwork that generates the policy network’s weights conditioned on the goal image. The approach is evaluated through behavioral cloning on a set of manipulation tasks, originally introduced in the MimicGen benchmark. Additionally, the authors report results on a real robot, further demonstrating the feasibility of their method.

Strengths and Weaknesses

Strengths

  • The paper is clearly written and easy to follow. Figures are well-designed and contribute to the reader’s understanding. I love Figures 1 and 2; both are very well designed and executed.
  • The proposed architecture, involving a hypernetwork for goal-conditioning, appears to be novel in the context of visual goal-conditioned policies and robot manipulation. While I am not deeply familiar with all related work in this subdomain, the architectural choice seems original and well-motivated.
  • The experimental design strikes a good balance between simplicity and realism. The chosen tasks are tractable but still non-trivial, providing a reasonable testbed for evaluation.
  • The inclusion of real robot experiments is a strong point and adds practical credibility to the method, though more details would be welcome (see suggestions below).

Weaknesses

  1. Necessity of Goal Conditioning
    It is unclear whether the tasks presented in the paper truly require goal conditioning. In many of the simulated scenarios, the current observation already appears to encode sufficient information to determine the correct action. Goal images are typically most useful when the task involves ambiguity or requires disambiguating between multiple possible outcomes. As such, the relevance of the goal-conditioned architecture in these settings is somewhat questionable.

  2. Venue Fit
    While the work is technically sound and includes real-world experiments, its primary contributions lie at the intersection of robotics and representation learning. As such, it might be more naturally aligned with a conference like CoRL. That said, the paper could still be of interest to the NeurIPS community, particularly those working on goal-conditioned policies and representation learning for control.

Questions

  • Can the authors clarify in which of the presented tasks the goal image provides information not already present in the current state? In the absence of ambiguity, what is the motivation for using a goal-conditioned architecture?

  • Could the authors provide videos of the real robot experiments to support their results and enable a better qualitative assessment of the learned policies?

Limitations

yes

Final Justification

See my original review. Both of my main concerns remain unresolved (and honestly cannot be resolved within a rebuttal). I am going to keep my original score of weak accept.

Formatting Concerns

no

Author Response

Dear Reviewer uJn1,

Thank you for your insightful feedback and for recognizing the strengths of our work. We are grateful for your positive comments on the paper's clarity, the novelty of our proposed architecture, and the practical value of our real-world experiments. Your thoughtful assessment is much appreciated. We have carefully considered your questions and concerns, which we address point-by-point in the following.


W.1 & Q.1 Necessity of Goal Conditioning

A. to W.1 & Q.1: We thank the reviewer for this insightful question regarding the motivation for goal-conditioning in our chosen tasks. Our primary motivation for using a goal-conditioned architecture is not merely to solve a single task, but to learn a general and versatile policy that can be directed to achieve a wide spectrum of goal states, including critical intermediate steps of a task.

  1. Motivation Beyond Ambiguity: Learning a General Skill

The core advantage of our approach lies in its ability to generalize. Instead of learning a reactive policy that simply maps a state to an action for one specific outcome (e.g., "put the lid on the coffeemaker"), our goal-conditioned policy learns a much more fundamental and reusable skill: "given a goal image, manipulate the environment to match that image." This allows the policy to:

  • Target any valid goal state, not just the final one from the training data.
  • Understand and achieve intermediate goals, demonstrating a deeper comprehension of state progression rather than just overfitting to successful final trajectories.
  2. Autonomous Goal Completion Detection

A direct and powerful capability stemming from this architecture is autonomous goal completion detection. Our policy continuously compares the latent representation of the current observation with that of the goal image. When the latent distance falls below a threshold, the policy intrinsically "knows" the task is complete and can terminate. This capacity for self-assessment is impossible for a non-conditioned policy and is a key feature of our method, which we qualitatively illustrated in Figures 3, 10, and 11.

To quantitatively validate this capability, we performed a new analysis comparing our policy's autonomous success detection (Auto SR) with the environment's ground-truth success signal (Env SR). We introduce two key metrics:

  • Accuracy: The agreement rate between our method's decision and the ground truth.
  • Recall: Our method's ability to correctly identify all true successes. We emphasize recall as it is crucial for a policy not to miss a completed task and continue acting unnecessarily.

The results for the Coffee tasks are presented below:

Table: Quantitative Results for Autonomous Goal Completion Detection

Task | Auto SR | Env SR | Accuracy | Recall
D0 | 0.96 | 0.94 | 94% | 98%
D1 | 0.78 | 0.76 | 90% | 95%
D2 | 0.74 | 0.62 | 76% | 90%
Mean | - | - | 86.6% | 94.3%

The results show a strong alignment with the ground-truth environment signal, achieving an average accuracy of 86.6% and, most importantly, a high average recall of 94.3%. This confirms that our latent-distance-based mechanism is a reliable method for autonomous completion detection, a critical capability enabled directly by our goal-conditioned design.

Revision Plan: In our revised manuscript, we will:

  1. Clarify our motivation in the introduction, explicitly stating that our goal is to learn a generalizable policy capable of achieving a spectrum of goals and performing autonomous completion detection.
  2. Add this new quantitative analysis and table to the experimental section to provide strong empirical evidence for this key capability.

We believe this clarification and the new results effectively demonstrate that goal-conditioning provides a crucial advantage—generalization and self-awareness—that goes far beyond simply resolving ambiguity in the initial state.


W.2 Venue Fit

A. to W.2:

We thank the reviewer for their thoughtful consideration of our paper's positioning. We agree that our work has strong connections to the robotics community, as the real-world validation is a crucial component of our research.

However, as the reviewer insightfully points out, the primary contributions of our paper are deeply rooted in machine learning topics that are of broad interest to the NeurIPS community. Our work focuses on:

  1. Representation Learning for Control: At its core, our paper investigates how to learn effective representations from high-dimensional visual inputs (goal images) to condition policies.

  2. A Novel Generative Architecture: Our central contribution is the use of a hypernetwork to generate the parameters of an entire policy network. This positions our work within the broader study of generative models, exploring how one network can produce the weights for another.

We believe that robotics provides a challenging and meaningful testbed for these fundamental ML concepts. Proving the effectiveness of our representation learning and generative architecture on complex, real-world manipulation tasks demonstrates their practical viability and robustness.

Therefore, given this focus on goal-conditioned representation learning and generative hypernetwork architectures, we believe that our paper is well-aligned with the interests of the NeurIPS community and will be of significant value to it.


Q.2 Real Robot Experiment Videos

A. to Q.2: We thank the reviewer for this suggestion. We agree that videos are essential for qualitative assessment and have prepared them to be shared via a project website, which we will link in the camera-ready version.

Due to NeurIPS policy regarding anonymity, we cannot provide the link during the rebuttal period. In the interim, we would direct the reviewer to our Appendix (specifically the "Real Robot Experiment Setting" section), which details our hardware, setup, and procedures. We look forward to sharing the full videos, along with all code and experimental details, upon the paper's acceptance.

Comment

The rebuttal did not provide new information and it is still unclear whether the goal conditioning is needed in the performed experiments. I am keeping my score at weak accept.

Comment

Dear Reviewer uJn1,

Thank you for the follow-up and for maintaining your support for our paper. Your questions have been very valuable in helping us identify areas where we can further strengthen the paper's argumentation. We will use this valuable feedback to further improve the clarity of our motivation and experiments in the final manuscript. We are very grateful for your constructive engagement throughout this process and for your recommendation for acceptance.

Best,

Authors

Official Review (Rating: 4)

This paper introduces Hyper-GoalNet, a novel framework for goal-conditioned manipulation policy learning that uses hypernetworks to dynamically generate task-specific policy parameters based on goal specifications. Unlike conventional approaches that concatenate goal and state information as input to fixed-parameter networks, this method treats goals as specifications that determine how current observations should be processed. The approach is evaluated on multiple manipulation tasks in simulation and real-robot experiments.

Strengths and Weaknesses

Strengths:

  • The core idea of using hypernetworks to generate goal-specific policy parameters is innovative and well-motivated.
  • The paper includes extensive experiments across multiple manipulation tasks with varying difficulty levels, thorough ablation studies, and real-robot validation.
  • The paper is well-written and easy to follow.

Weaknesses:

  • The method is evaluated exclusively in single-view camera configurations. The authors do not demonstrate how their approach scales to multi-view camera setups, which are standard in most robot learning systems. Given that the method incorporates a latent shaping module, scaling to multiple cameras may introduce additional complexity and learning challenges.
  • The experimental baselines appear outdated and insufficient for comprehensive evaluation. For instance, the authors use C-BeT [1] but omit its more recent variant VQ-BeT [2]. Additionally, diffusion policies [3, 4], which have gained significant adoption in recent robot learning research, are not included in the comparison. While the authors' method employs behavior cloning loss, making direct comparison with diffusion-based approaches potentially unfair, they should clarify how their method could extend to other policies and justify the exclusion of diffusion-based baselines to strengthen their evaluation.
  • The real robot experimental results raise concerns, as all baseline methods exhibit surprisingly poor performance. Given that these appear to be standard robot learning tasks, the authors should clarify what specific aspects make these tasks challenging and explain the underlying factors contributing to the baselines' poor performance.
  • I am also concerned about the scaling ability of the method. With the number and complexity of tasks increasing, relying on the current observation and specific goals to predict the parameters of the policy could be challenging.

[1] Cui, Zichen Jeff, et al. "From play to policy: Conditional behavior generation from uncurated robot data." ICLR'23.

[2] Lee, Seungjae, et al. "Behavior generation with latent actions." ICML'24.

[3] Chi, Cheng, et al. "Diffusion policy: Visuomotor policy learning via action diffusion." RSS'23.

[4] Reuss, Moritz, et al. "Goal-conditioned imitation learning using score-based diffusion policies." RSS'23.

Questions

If I understand it right, the method only predicts the action of the current timestep, so why don’t the authors apply some techniques such as action chunking?

Limitations

The authors have discussed the limitations of the paper.

Final Justification

Based on the authors' rebuttal, which provided clarifications on experimental protocols and included new results with diffusion-based policies and video goals, I am increasing my assessment to a score of 4.

However, I maintain reservations regarding the benchmark comparisons. The baselines used in the paper are relatively old. While the authors have included results from diffusion policies in the rebuttal, these perform similarly to their proposed method, and it is unclear how much improvement the hypernetwork would bring when combined with diffusion-based policies. I believe it would be crucial to include a comprehensive comparison between a diffusion-based Hyper-GoalNet and a pure diffusion policy.

Formatting Concerns

I do not notice any formatting issues.

Author Response

Dear Reviewer EGhA,

Thank you for your constructive feedback and for recognizing the strengths of our paper. We are grateful for your positive comments on our core idea being "innovative and well-motivated," our "extensive experiments," and the paper's clarity as "well-written and easy to follow." We have carefully considered your concerns and address your comments and questions in the following.


W.1 Single-View Camera Configuration

A. to W.1: We thank the reviewer for this question about camera configurations. Our focus on a single-view setup is a deliberate choice, as we argue it is essential for achieving real-world deployability and generalization. A key to generalization in robotics is the ability to train on massive and diverse datasets. While multi-view systems are information-rich, their setup complexity presents significant barriers to this kind of large-scale data acquisition:

  1. Barrier to Scale: The need for complex calibration and the high deployment cost make it challenging to gather data from hundreds of different environments, fundamentally limiting data diversity.
  2. Barrier to Simplicity: Specifying geometrically consistent goals across multiple views adds significant complexity to the data collection and annotation pipeline, further hindering scalability.

In contrast, a single-view approach is far more scalable. It lowers the barrier for massive data collection (potentially including heterogeneous web data) and enables deployment in novel, un-instrumented environments. We argue that forcing the model to learn from this constrained but realistic input fosters more robust and generalizable representations, paving a more promising path towards agents that can operate "in the wild."

Moreover, the Hyper-GoalNet framework is flexible. Its core mechanism—generating policy parameters from a goal—is agnostic to the vision encoder. Features from multiple cameras could be readily integrated (e.g., via attention) and processed by our latent shaping module. This sensor fusion is a feasible extension but is orthogonal to our primary contribution.

Revision Plan: In the revised manuscript, we will clarify the motivation for our single-view focus, emphasizing its benefits for generalizability and scalability, and discuss multi-view extensions in the future work section.


W.2 Additional Experimental Baselines

A. to W.2: We thank the reviewer for this valuable suggestion. To strengthen our evaluation, we have performed new experiments on some tasks against a state-of-the-art Diffusion Policy (DP) and a strong Goal-Conditioned DP baseline. Regarding VQ-BeT, we also worked to include it as a baseline. We endeavored to follow the official implementation, but encountered significant implementation challenges (e.g., the only image-based model in the official release failed to converge, and our efforts to stabilize its VQ-VAE pre-training stage for our high-precision visuomotor data proved non-trivial), preventing us from obtaining stable results within the rebuttal period. We are continuing this effort and aim to include it in the final version.

The new results are summarized below:

Method | Coffee d0 | Coffee d1 | Coffee d2 | Mug d0 | Inference Type
Diffusion Policy (DP) | 0.94 | 0.74 | 0.44 | 0.68 | Iterative
Goal-Conditioned DP | 0.92 | 0.78 | 0.66 | 0.74 | Iterative
Hyper-GoalNet (Ours) | 0.94 | 0.76 | 0.62 | 0.78 | Single Pass

These results highlight a crucial trade-off between performance and efficiency. To provide a robust comparison, we benchmarked against a state-of-the-art Diffusion Policy (DP) and an even stronger Goal-Conditioned DP (GC-DP), which leverages the goal image via concatenation. While powerful, both diffusion baselines require a computationally intensive, iterative sampling process at inference time.

In contrast, our method is a single-pass, feed-forward network, making it significantly more efficient. Despite this key efficiency advantage, the results show our method achieves performance comparable to the strong, iterative GC-DP baseline. We attribute this compelling result to our core contribution: the hypernetwork's ability to directly generate high-quality, specialized parameters from the goal, bypassing the need for iterative refinement.

In addition to these policy-level comparisons, we have also added new baselines analyzing the hypernetwork architecture itself, the details of which can be found in our response to Reviewer nbKf's Question 5 (a) and (c).

Revision Plan: We will add this discussion and the new results into our revised manuscript, which we believe substantially strengthens our paper's evaluation.


W.3 Real Robot Performance Gap

A. to W.3: We thank the reviewer for this critical question. The significant performance gap on the real robot warrants this deeper explanation. The tasks require high precision, which penalizes policies that are not robust to the noise and observation shifts inherent to the real world. Our analysis points to two distinct failure modes for the baselines:

  1. GCBC & Play-LMP: Overfitting from Likelihood-based Training. These methods failed due to their likelihood-based training objective (i.e., maximizing the action log-probability), typically under a Gaussian assumption, which is more prone to overfitting than direct regression (e.g., MSE). We observed this overfitting empirically, as their validation loss failed to converge to very low values despite low training loss, indicating poor generalization capability. Similar failure cases also occurred in simulation, where these overfitted policies could only reproduce coarse trajectory motions and consistently failed at the crucial, high-precision steps. This weakness was naturally amplified on the real robot, where noise and observation shifts are unavoidable.
  2. C-BeT: Lack of Robustness to Compounding Errors. For C-BeT, the issue is more nuanced. Its powerful Transformer architecture excels at fitting the in-distribution training data. However, it struggles with robustness. In the real world, a small amount of sensor noise can push the robot into a slightly out-of-distribution (OOD) state. A fixed-parameter model like C-BeT makes a small prediction error, which leads to a subsequent state that is even further from the training distribution. This triggers a cascade of compounding errors throughout the trajectory, ultimately leading to task failure.

Why Hyper-GoalNet Succeeds:

Our method is inherently more robust to these challenges precisely because it does not rely on a single, fixed-parameter network. Instead, Hyper-GoalNet dynamically generates the parameters of the policy network on-the-fly, conditioned on the current (and potentially noisy) observation and the specific goal. This mechanism acts as a form of rapid, instance-specific adaptation. The ability to generate and refine high-quality parameters makes the resulting policy far more resilient to observation noise and helps prevent the accumulation of minor errors.

Revision Plan: We will add a detailed discussion of these distinct failure modes and the underlying reasons for our method's robustness to the appendix of our revised manuscript. Thank you for prompting this deeper analysis.


W.4 Scaling Ability

A. to W.4: We thank the reviewer for raising this forward-looking question about the scalability of our method. We agree that scaling to a larger number and greater complexity of tasks is a crucial long-term goal for any manipulation policy.

Our current work focuses on establishing the core principle of Hyper-GoalNet: dynamically generating policy parameters from a goal specification (i.e., a goal image) for a lightweight MLP policy. We view this as a foundational proof-of-concept. However, the framework itself is inherently designed for scalability. We believe the key to scaling is not to simply increase the capacity of a single, fixed model, but to enhance the components of our modular system. Specifically, our hypernetwork architecture is compatible with:

  1. More advanced guidance: Replacing the single goal image with richer goal specifications, such as videos.
  2. More powerful target policies: Generating the parameters for more advanced network architectures beyond a simple MLP, such as a Diffusion Policy.

To validate this potential, we conducted an experiment on scaling a single Hyper-GoalNet architecture across four distinct manipulation tasks. In this setup, we equipped the hypernetwork with a more powerful diffusion-based target policy and used video demonstrations as goal information. The success rates across the subtasks were as follows:

Task | Subtask 1 | Subtask 2 | Subtask 3 | Subtask 4
Scaled HyperNet | ~25/25 | ~25/25 | ~25/25 | ~24/25

Please note that this was an initial experiment to test the feasibility of scaling. The results are highly promising, demonstrating that when equipped with more advanced components, the HyperNet framework can achieve very strong performance across multiple, diverse tasks. This provides compelling evidence for the scalability of our approach.

Revision Plan: We are very excited by these initial findings and will make scaling a central focus of our future work. We will add a discussion of this to the paper to highlight the future potential of our framework.


Q.1 Action Chunking

A. to Q.1: Thank you for this insightful question. We apologize for the omission. In fact, we did employ action chunking in all of our experiments. Our policy predicts a short sequence of future actions at each timestep, and the agent executes this chunk sequentially before the policy is invoked again.

Revision Plan: We omitted this detail for brevity to maintain focus on our core contribution, the Hyper-GoalNet framework itself. We agree this is an important implementation detail and will add it to the experimental setup section in our revised manuscript. Thank you for pointing this out.

Comment

Thank you for your detailed rebuttal and additional experiments. I have several follow-up questions:

Could you provide the specific references for the Diffusion Policy (DP) and Goal-Conditioned DP methods you evaluated? Given that diffusion-based policies show comparable performance to Hyper-GoalNet, I'm curious whether combining Hyper-GoalNet with a diffusion policy backbone could yield further improvements. This hybrid approach could potentially strengthen both the performance and the contribution of your work.

While I appreciate your analysis of baseline failure modes, could you elaborate on the specific complexities of your real-robot tasks? In my experience, pick-and-place tasks can often be handled effectively by simple behavior cloning with action chunking. What additional challenges in your setup necessitate the proposed approach?

The scaling experiments are quite interesting. What are the specific sub-tasks included in your evaluation? How do you encode video demonstrations into goal guidance representations?

Comment

We sincerely thank the reviewer for their detailed consideration of our rebuttal and the additional experiments. We are encouraged by your continued engagement and believe your follow-up questions help to further clarify and strengthen our work. We provide our responses below.


Response to Follow-up Q.1

A1: Thank you for this question and for the insightful suggestion.

References and Implementation Details:

  • Our Diffusion Policy (DP) implementation is adapted from the original work by Chi et al. [1]. We followed the standard architecture, using a U-Net with hidden dimensions of [256, 512, 1024]. To ensure a fair comparison, all other experimental parameters, such as observation horizon and image resolution, were kept identical to our other baselines.
  • For the Goal-Conditioned DP (GC-DP), we were inspired by the approach in Reuss et al. [2], which conditions the policy on goal information. To create the fairest possible baseline, we adapted their idea by feeding both the observation and goal image into the diffusion backbone. However, unlike [2] which uses a Transformer, we used the same U-Net backbone as our vanilla DP implementation to ensure the performance gain came from the goal-conditioning itself, not a different architecture. As our results show, adding this goal information provided a performance benefit over the standard DP. We will add a detailed discussion of these baseline implementations to our revised manuscript.

On Combining Hyper-GoalNet with a Diffusion Policy: We agree that combining our method with a diffusion backbone is a very promising direction. Our approach (dynamic parameter generation) and diffusion policies (iterative action refinement) are orthogonal concepts. We hypothesize that using HyperNet to generate the parameters for a diffusion model could provide a much stronger conditional prior, enabling the diffusion backbone to model complex action spaces more effectively.

In fact, the "Scaled HyperNet" from our initial rebuttal already demonstrates this hybrid model: it uses a diffusion network as its target policy backbone. The strong performance of this hybrid model validates your suggestion and highlights the significant potential of this approach:

Task | Subtask 1 | Subtask 2 | Subtask 3 | Subtask 4
Scaled HyperNet | ~25/25 | ~25/25 | ~25/25 | ~24/25

We thank you for this valuable feedback.


Response to Follow-up Q.2

A2: Thank you for pushing for this important clarification. Our experimental setup was intentionally designed with several compounding challenges to rigorously test policy robustness and generalization, which we detail below:

  1. High-Variance, "In-the-Wild" Style Data: Our dataset originates from human VR teleoperation, not clean, scripted trajectories. This, combined with significant randomization in the initial poses of both the robot and goal objects, creates high-variance conditions. This prevents policies from simply memorizing paths and forces them to learn a more general mapping from states to actions.
  2. Complex Tasks with a Limited Observation Window: Our evaluation suite includes longer-horizon, dynamic tasks like drawer pulling and object sweeping, which require tracking continuous motion towards dynamic goals. The difficulty of these tasks is significantly amplified by our intentionally short observation horizon of only two timesteps. This limited window makes it hard for policies to infer long-term intent, demanding highly reactive and precise actions based on minimal context.

These factors—high initial state variance and complex tasks viewed through a short observation window—create a challenging environment where fixed-parameter policies are frequently pushed into out-of-distribution states, leading to failure. This is precisely the scenario where Hyper-GoalNet's ability to dynamically generate tailored policy parameters for the immediate context becomes necessary, providing the adaptability and robustness to succeed. We will clarify these details of our experimental design in the appendix.


Response to Follow-up Q.3

A3: Thank you for your question. We are happy to provide these specifics.

Sub-tasks: The four sub-tasks in our scaling experiment, all within a single tabletop environment, were:

  1. Placing a red cube on a plate.
  2. Placing a blue cube on a plate.
  3. Stacking the red cube on the blue cube.
  4. Stacking the blue cube on the red cube.

The single failed trial occurred during the final stacking task. We attribute this to its higher demand for precision in both grasping and placement, making it the most challenging task in the set.

Video Goal Encoding: We used a simple 4-layer 3D convolutional network to process the demonstration video. This network outputs a compact feature embedding that serves as the goal representation for our Hyper-GoalNet framework.
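For illustration, a minimal sketch of such a video goal encoder (channel widths, strides, and the embedding size are assumptions, not our exact configuration):

```python
import torch
import torch.nn as nn

class VideoGoalEncoder(nn.Module):
    """4-layer 3D-conv encoder: (B, 3, T, H, W) demonstration video -> goal embedding."""
    def __init__(self, out_dim=256):
        super().__init__()
        channels = [3, 32, 64, 128, 256]
        layers = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            layers += [nn.Conv3d(c_in, c_out, kernel_size=3, stride=2, padding=1),
                       nn.ReLU()]
        self.conv = nn.Sequential(*layers)
        self.head = nn.Linear(channels[-1], out_dim)

    def forward(self, video):             # video: (B, 3, T, H, W)
        feat = self.conv(video)           # (B, 256, T', H', W')
        feat = feat.mean(dim=(2, 3, 4))   # global average pooling over space-time
        return self.head(feat)            # (B, out_dim) goal embedding for the hypernetwork
```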

Comment

Thank you for your additional clarification. Based on the rebuttal, I will raise my score to 4.

My primary concern with the paper centers on the choice of baselines, which are quite old. While you have included results from diffusion policies, these perform similarly to your proposed method, and it is unclear how much improvement the hypernetwork would bring in combination with diffusion-based policies. I believe it would be crucial to include a comprehensive comparison between a diffusion-based Hyper-GoalNet and a pure diffusion policy. Such a comparison would significantly strengthen your paper's quality.

Btw, I think you forgot to add the references in the comment, but it's fine :-)

Comment

Dear Reviewer EGhA,

Thank you so much for your detailed feedback throughout this process and for raising your score for our paper. We are sincerely grateful for your constructive guidance; your insightful questions have been invaluable in helping us identify how to significantly strengthen our work.

We completely agree with your final assessment. Quantifying the benefits of a hybrid Hyper-GoalNet + Diffusion model is an important next step, and your feedback has been invaluable in clarifying this promising research direction. This will be a key focus as we continue to develop this work.

And thank you for the friendly reminder regarding the references! We apologize for the oversight in the comment form. We will be sure to add the full citations and a thorough discussion of new baselines to the Related Work and Experiments sections in the revised manuscript.

Once again, thank you for your time, expertise, and for helping us elevate the quality of our paper.

Best,

Authors

Review
5

In this work, the authors introduce a novel framework to learn goal-conditioned policies by separating the goal conditioning from the state parameters. Specifically, the approach uses hypernetworks to generate target policy parameters based on the goals. This is achieved by shaping the latent space so that future states remain predictable through a learned dynamics model and physical relationships are preserved through distance constraints, yielding monotonic progression towards goals. Through experimental validation, the authors show that their hypernetwork-based framework outperforms baseline goal-conditioned methods that use fixed network parameters. In addition, the authors conduct ablation studies on the hypernetwork architecture and the latent space shaping, and test their approach on a real robot.

Strengths and Weaknesses

The paper discusses a novel concept in detail with both theoretical and experimental justifications for each of the design choices in the Hyper-GoalNet framework, including the choice of the hypernetwork architecture as well as the latent space shaping approach. The experiment setup includes a robotics benchmark, ablation studies, and evaluations on a real robot, not to mention the additional benchmarks in the appendix. The paper is also well-written and easy to understand. Lastly, the results show significant improvements over the baselines, highlighting the success of the approach.

The paper could discuss the limitations in greater detail in the final discussion section instead of only describing them in the appendix. In addition, the experiment task variety or difficulty should be adjusted (particularly for the hypernetwork architecture ablation study) to allow a more detailed comparison, since HyperZero currently achieves a 0% success rate across all tasks.

Questions

Why do many of the goal-conditioned methods have a 0.00 success rate across all three difficulty levels? Is there a simpler benchmark on which they achieve at least some successes? The same question applies to the ablation study with HyperZero, which appears to have achieved no successes at all.

Limitations

Yes, the authors adequately address the limitations, though it is mainly in the appendix.

Final Justification

My concerns regarding the experiment results have been resolved in the detailed rebuttal by the authors. I am going to keep my original score of accept.

Formatting Concerns

No paper formatting concerns.

Author Response

Dear Reviewer bvF4,

We would like to sincerely thank the reviewer for their insightful feedback and positive assessment of our work. We are particularly grateful for your recognition of several key strengths, including the novelty of our concept with its detailed theoretical and experimental justifications, the comprehensive experiment setup spanning from benchmarks and ablation studies to real-robot evaluation, and the clarity of the paper as being 'well-written and easy to understand.' We are also encouraged that you highlighted the significant improvements our method achieves over baselines. We have carefully considered your comments and address your questions below.


W.1 & Q.1 The need for a more detailed comparison for HyperZero and other baselines.

A. to W.1 & Q.1: Thank you for this critical question. This point touches upon the core motivation for both our work and our experimental design.

First, as you noted, the manipulation tasks we focus on are intentionally challenging. The failure of some baselines on these complex tasks is precisely what motivates the need for a more robust architectural paradigm like our Hyper-GoalNet.

To provide the more detailed comparison you suggested, we have conducted further analysis:

1. Analysis of HyperZero:

In our original experiments, we directly adopted the official architecture of HyperZero for our tasks to ensure a fair comparison. The difference from our method was the hypernetwork architecture itself; all other settings, including the target policy network (a lightweight 3-layer MLP) and hyperparameters, were kept identical. To maintain fairness, we initially avoided any method-specific tricks like special initializations.

However, inspired by your feedback, we revisited the HyperZero ablation with an improved initialization trick. We attribute HyperZero's training instability to an inherent limitation in its design: its approach of directly generating an entire set of network weights makes the training process fragile and highly dependent on tricks like initialization to place the initial weights in a suitable distribution. In contrast, our method is inspired by the process of iterative optimization (akin to gradient descent), which allows for more efficient use of the hypernetwork's capacity and yields better-structured target parameters.
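To make this contrast concrete, the rough PyTorch-style sketch below juxtaposes a direct, one-shot weight generator (with the learnable output scale we describe in the next paragraph) against an optimization-inspired generator that refines parameters over a few goal-conditioned steps. All names and sizes are illustrative assumptions, not the official HyperZero code or our exact implementation.

```python
import torch
import torch.nn as nn

class DirectWeightGenerator(nn.Module):
    """HyperZero-style sketch: emit all target-policy weights in one shot.
    A learnable scalar rescales the output so the generated weights start
    in a reasonable range (the stabilization trick described below)."""

    def __init__(self, goal_dim, n_target_params):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(goal_dim, 512), nn.ReLU(),
                                 nn.Linear(512, n_target_params))
        self.scale = nn.Parameter(torch.tensor(0.01))  # learnable output scale

    def forward(self, goal):
        return self.scale * self.net(goal)


class IterativeWeightGenerator(nn.Module):
    """Optimization-inspired sketch: start from a shared initialization and
    apply a few goal-conditioned update steps, mimicking iterative refinement."""

    def __init__(self, goal_dim, n_target_params, n_steps=3):
        super().__init__()
        self.theta0 = nn.Parameter(torch.zeros(n_target_params))
        self.step = nn.Sequential(nn.Linear(goal_dim + n_target_params, 512),
                                  nn.ReLU(), nn.Linear(512, n_target_params))
        self.n_steps = n_steps

    def forward(self, goal):
        theta = self.theta0.expand(goal.shape[0], -1)
        for _ in range(self.n_steps):
            # Each step proposes an additive update conditioned on goal and
            # current parameters, akin to a learned gradient-descent step.
            theta = theta + self.step(torch.cat([goal, theta], dim=-1))
        return theta
```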

To stabilize HyperZero, we therefore introduced a learnable scalar that acts on the generated weights, effectively placing them in a suitable initial range. We found this technique stabilizes its training process, allowing the loss to converge to a lower value (around 0.14-0.18, though still not as low as our method). This indeed leads to non-zero success rates. The updated results on the coffee task provide the more detailed comparison you were looking for:

| Task | HyperZero (New) | Ours |
| --- | --- | --- |
| coffee (d0) | 0.16 | 0.94 |
| coffee (d1) | 0.18 | 0.76 |
| coffee (d2) | 0.14 | 0.62 |

These new results offer a more nuanced view, confirming that while HyperZero can be improved, our proposed architecture still significantly outperforms it across all difficulty levels. This reinforces our claim that our specific hypernetwork design and latent space shaping are critical for success.

2. Analysis of GCBC and Play-LMP:

The challenges with these baselines are multi-faceted:

  • Implementation: The original authors of these methods did not provide open-source code. We therefore relied on third-party GitHub implementations, making minimal modifications only to align with our experimental setting (e.g., optimizer, learning rate) while preserving the core algorithm. Notably, in the experiments of a recent paper (C-BeT), GCBC and Play-LMP also demonstrated near-zero success on similarly complex tasks, which corroborates our findings.
  • Fundamental Methodological Limitation: After careful inspection, we identified a key technical reason for their failure on our tasks. Both GCBC and Play-LMP frame policy learning as maximum likelihood estimation, optimizing the log-probability of actions under a learned Gaussian distribution. We observed that optimizing the likelihood in this way is more prone to overfitting on the training dataset than direct regression with a distance loss (e.g., MSE); a minimal sketch of the two objectives is given after this list. This was empirically validated: both methods achieved very low training loss, but their validation loss failed to converge to a low value, indicating poor generalization.
  • Observed Failure Mode: In practice, this overfitting prevents them from succeeding in novel test scenes with variations. While they could learn a coarse trajectory, they consistently failed at the fine-grained, precise movements required for successful grasping and placing, which are critical for task completion.
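For clarity, here is a minimal sketch of the two behavioral-cloning objectives discussed above; the tensors are random placeholders standing in for policy outputs and demonstration actions.

```python
import torch
import torch.nn as nn

pred_mean = torch.randn(32, 7)                          # predicted action mean
pred_log_std = torch.zeros(32, 7, requires_grad=True)   # predicted log std-dev
expert_action = torch.randn(32, 7)                      # demonstrated action

# (a) Maximum-likelihood objective used by GCBC / Play-LMP: minimize the negative
#     log-probability of the expert action under a learned Gaussian.
dist = torch.distributions.Normal(pred_mean, pred_log_std.exp())
nll_loss = -dist.log_prob(expert_action).sum(dim=-1).mean()

# (b) Direct regression objective: mean-squared error to the expert action,
#     which we found less prone to overfitting in our setting.
mse_loss = nn.functional.mse_loss(pred_mean, expert_action)
```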

To further strengthen our baseline validation, we have also added comparisons to additional state-of-the-art baselines, as detailed in our response to Reviewer EGhA (W2). The new results are summarized below:

| Method | Coffee d0 | Coffee d1 | Coffee d2 | Mug d0 | Inference Type |
| --- | --- | --- | --- | --- | --- |
| Diffusion Policy (DP) | 0.94 | 0.74 | 0.44 | 0.68 | Iterative |
| Goal-Conditioned DP | 0.92 | 0.78 | 0.66 | 0.74 | Iterative |
| Hyper-GoalNet (Ours) | 0.94 | 0.76 | 0.62 | 0.78 | Single Pass |

As these results show, our Hyper-GoalNet achieves performance that is highly competitive with, and on some tasks superior to, these powerful diffusion-based models. More importantly, our method accomplishes this in a single inference pass, whereas diffusion-based policies require a much more computationally expensive iterative denoising process. This highlights a significant practical advantage of our approach in terms of efficiency and suitability for real-time applications.
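As a schematic illustration of this efficiency difference (not actual code from our system or from any diffusion-policy codebase, and with placeholder callables), a hypernetwork-generated policy produces an action in one forward pass, whereas a diffusion policy must evaluate its denoiser once per refinement step:

```python
import torch

def hypernet_inference(policy, obs):
    # Single forward pass: goal-conditioned parameters were generated beforehand.
    return policy(obs)

def diffusion_inference(denoiser, obs, n_steps=50, act_dim=7):
    # Iterative refinement: the denoiser network is evaluated n_steps times
    # for every action produced; the update rule here is a toy stand-in
    # for a real DDPM/DDIM step.
    action = torch.randn(obs.shape[0], act_dim)
    for t in range(n_steps):
        predicted_noise = denoiser(obs, action, t)
        action = action - predicted_noise / n_steps
    return action
```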

In summary, we believe the failures of the initial baselines are not superficial but stem from fundamental limitations when faced with complex, long-horizon manipulation tasks. The new results for HyperZero provide the granular comparison you requested, and these additional comparisons against diffusion models further validate the effectiveness and efficiency of our proposed architecture. We will add this detailed discussion and the new results to the final version of the paper.

Comment

I thank the authors for their detailed rebuttal. The additional experiments and results address my concerns and strengthen the claims in the paper.

Comment

Dear Reviewer bvF4,

Thank you very much for your kind words and positive assessment. It is very rewarding for us to hear that our detailed rebuttal and the new experiments were effective in addressing your concerns. We deeply appreciate your contribution to improving the quality of our work and look forward to incorporating these enhancements into the final version.

Best,

Authors

Comment

Dear Reviewers,

As the discussion period comes to a close, we wish to extend our sincerest gratitude to all reviewers for your time, expertise, and dedication throughout this process. We are incredibly grateful for the insightful and detailed feedback you have provided.

We were very encouraged that from the outset, many of you recognized the core strengths of our work—highlighting our Hyper-GoalNet framework as a "novel concept" (Reviewers bvF4, EGhA, nbKf) with "extensive experiments" (Reviewers bvF4, EGhA), and commending the paper for being "well-written and easy to follow" (Reviewers bvF4, uJn1, EGhA). This initial positive feedback was a tremendous source of motivation for us.

The entire review process has been an immensely valuable and constructive dialogue. Your insightful questions pushed us to significantly improve our work. In direct response to your feedback, we conducted additional experiments with more recent baselines, provided deeper clarifications on our methodology, and engaged in a detailed discussion about our results. We are pleased that these efforts were well-received, with Reviewer bvF4 noting that our rebuttal and new results "address my concerns and strengthen the claims in the paper," and that our clarifications resolved the doubts of other reviewers (Reviewers EGhA, nbKf).

Most importantly, we are grateful that this collaborative process has resulted in a much stronger paper. We are fully committed to incorporating the fruits of this discussion into the final camera-ready version. Specifically, we will:

  1. Integrate the new baseline comparisons and the corresponding analysis directly into the main text.
  2. Weave the detailed clarifications from the rebuttal into the paper to make our arguments more robust and self-contained, as requested by Reviewer nbKf.
  3. Acknowledge the excellent suggestions for future work, such as the diffusion-based HyperNet approaches, as proposed by Reviewer EGhA.
  4. As promised, we will make our code publicly available to ensure reproducibility and facilitate future research.

Thank you once again for your invaluable contributions. Your guidance has not only improved this paper but has also enriched our perspective as researchers.

Best regards,

The Authors

Final Decision

The paper presents Hyper-GoalNet, a framework for goal-conditioned manipulation that uses hypernetworks to generate task-specific policy parameters from goal specifications instead of concatenating goals with states. The approach is evaluated on multiple manipulation tasks in simulation and in real-robot experiments. The proposed architecture, which relies on a hypernetwork for goal conditioning, appears to be novel in the context of visual goal-conditioned policies and robot manipulation. The paper is well written, and the results show significant improvements over baselines. One weakness pointed out by one of the reviewers is the question of whether goal conditioning is actually needed in the performed experiments; the authors are encouraged to consider this seriously and revise the paper accordingly. Overall, a strong contribution.