PaperHub
Rating: 4.7/10 (Rejected; 3 reviewers; min 3, max 6, std. dev. 1.2)
Individual ratings: 6, 3, 5
Confidence: 4.3
Correctness: 2.7 · Contribution: 2.0 · Presentation: 2.3
ICLR 2025

Flow-based Maximum Entropy Domain Randomization for Multi-step Assembly

OpenReview · PDF
Submitted: 2024-09-23 · Updated: 2025-02-05
TL;DR

We learn a neural sampling distribution for maximum-entropy domain randomization and use it for uncertainty-aware multi-step robotic assembly problems.

Abstract

Keywords: Reinforcement Learning, Domain Randomization, Uncertainty, Assembly, Planning

Reviews and Discussion

Official Review (Rating: 6)

This paper presents GoFlow, a novel method for domain randomization in reinforcement learning. In robotics, domain randomization is a common technique used to enhance the robustness and performance of sim-to-real reinforcement learning. In recent years, sim2real RL methods have enabled quadrupeds and humanoids to achieve impressive locomotion and manipulation skills [1][2][3]. However, domain randomization procedures often require extensive human engineering. The proposed method seeks to automate the tuning of domain randomization parameters using a normalizing flow architecture and the maximum entropy principle.

The method improves the robustness of control policies learned in simulation by adaptively randomizing simulated physical properties during training, producing policies that handle uncertainty more effectively. The authors demonstrate the approach on one multi-step, real-world, contact-rich robotic assembly task. Experiments in the paper suggest that it outperforms existing methods in terms of success rate in both simulated and real-world settings.
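
At a high level, maximum-entropy domain randomization of this kind optimizes a sampling distribution q_φ over simulator parameters ξ to trade off policy return against distributional entropy. A generic form of such an objective (a sketch only; the paper's Eq. 5 may include additional constraints or KL terms, and α is a trade-off coefficient) is

    \max_{\phi} \;\; \mathbb{E}_{\xi \sim q_{\phi}}\!\left[ J(\pi, \xi) \right] \; + \; \alpha \, \mathcal{H}(q_{\phi}),

where J(π, ξ) is the expected return of policy π in a simulator with physics parameters ξ and H denotes the differential entropy of q_φ, here represented by a normalizing flow.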

However, in its current form, I cannot recommend acceptance due to the following reasons:

  • experimental evaluations are limited, and cannot support the claims of the core contributions
  • real robot results are impressive in video, but not presented quantitatively or clearly in the paper

[1] Huy Ha et al. "UMI on Legs: Making Manipulation Policies Mobile with Manipulation-Centric Whole-body Controllers." CoRL 2024. [2] Takahiro Miki et al. "Learning Robust Perceptive Locomotion for Quadrupedal Robots in the Wild." Science Robotics 7, eabk2822 (2022). DOI: 10.1126/scirobotics.abk2822. [3] Cheng et al. "Expressive Whole-Body Control for Humanoid Robots." RSS 2024.

Strengths

Originality:

  • While prior works such as DORAEMON have explored automating domain randomization using maximum entropy principles to enhance the generalization capacity and performance of RL policies, this is the first work that combines normalizing flow and maximum entropy to perform domain randomization for RL, to the best of my knowledge. The authors also demonstrated the effectiveness of their method on a challenging robotics task: multi-step contact-rich assembly, which is new to this line of literature.

Quality:

  • The proposed algorithm, GoFlow, is shown to outperform prior works on 4 simulation tasks and 1 real robot task. While the presented success rate is seemingly low (ranging from 0.1-0.2), the real robot video seems convincing.

Clarity:

  • The paper is overall well written and easy to follow. The authors provided a comprehensive overview in the related works, background, and method sections. Graphics are helpful for understanding the motivation and strengths of the proposed method. Specifically, Figure 2 provides a good visualization of the multimodal ability of the normalizing flow approach, and Figure 5 demonstrates how GoFlow can be used to perform belief-space planning for multi-step gear insertion tasks on a real robot.

Significance:

  • From the perspective of more effective domain randomization for sim2real RL, this work shows it outperforms prior methods. From the angle of multi-step manipulation, this work also shows it is able to plan over belief space and perform contact-rich assembly tasks.

Weaknesses

  1. Experimental evaluations are limited:
  • For the comparisons against baseline methods, the range of tasks is too narrow (only 4 simulation tasks).
  2. Sim results are hard to interpret
  • Out of the 4 sim tasks, it is difficult to assess the effectiveness or significance of GoFlow's improvements or better performance, especially when all methods and tasks have relatively low success rates. For instance, in quadcopter, it is hard to draw a conclusion that 0.02 achieved by GoFlow is indeed better than baseline methods in a statistically significant manner.
  • For cartpole and gears, GoFlow performs similarly to some baseline methods.
  3. Real robot results are also limited
  • Although I truly appreciate the effort to design and perform such a challenging real robot experiment, it is hard to find the success rates or other evaluation metrics for the multi-step gear insertion task presented in the paper. Thus, it is difficult to assess the effectiveness of this method quantitatively, despite the cool results from the supplementary video.

Questions

  1. If the aim is to make domain randomization easier or better for RL, the experimental results need to demonstrate that on a wider range of tasks, such as tasks that are highly relevant to real robotic systems, such as quadrupeds / humanoid locomotion tasks, manipulation tasks, etc., or simulation benchmarks that contain more than 4 tasks.

  2. Related to the current experimental results, why are the success rates lower than the numbers presented in prior works and baseline papers (for example Doraemon)? It would really help if the authors could explain the difference in evaluation protocols and metrics in more detail.

  3. The proposed method seems to have a high variance in success rate and low success rate from Figure 3. Could authors elaborate on this?

  4. Could the authors perform more experimental evaluations on the real robot task? For example, it is hard to tell the success rate for the overall multi-stage task, or the success rates for each stage of the multi-step task.

Comment

Thank you for your careful review and for acknowledging the novelty of combining normalizing flows with maximum entropy principles. We hope our responses below, along with the updated manuscript that includes enhanced real-world experimental results, address your concerns.

For the comparisons against baseline methods, the range of tasks is too narrow (only 4 simulation tasks) … / … If the aim is to make domain randomization easier or better for RL, the experimental results need to demonstrate that on a wider range of tasks, such as tasks that are highly relevant to real robotic systems, such as quadrupeds / humanoid locomotion tasks, manipulation tasks, etc., or simulation benchmarks that contain more than 4 tasks.

The initial version of the paper had 5 simulation tasks, which included, as suggested, quadrupeds (Anymal domain) and manipulation (Gears domain). Humanoid locomotion was not in the initial task set, but we have now run additional analysis on a humanoid domain as well, bringing the total number of simulated domains to 6. This is comparable to the number of tasks in the cited baselines, such as DORAEMON (6 domains), LSDR (2 domains), and ADR (1 domain).

Out of the 4 sim tasks, it is difficult to assess the effectiveness or significance of GoFlow's improvements or better performance, especially when all methods and tasks have relatively low success rates. For instance, in quadcopter, it is hard to draw a conclusion that 0.02 achieved by GoFlow is indeed better than baseline methods in a statistically significant manner.

The success rate is low in some domains because we intentionally expanded the number and range of domain randomization parameters such that only a small subset of those ranges was feasible. This was done for two reasons. First, we aim to show how GoFlow performs better than baselines in such domains where the feasible parameter space is a small, off-center, and irregularly shaped distribution within the global sampling distribution. Second, small feasible parameter spaces illustrate the need for a planning system that can gather information to ensure that the agent is within the small preimage of a successful skill.

While the success rate under the global sampling distribution is small in some tasks, this does not have any bearing on the statistical significance of GoFlow’s performance against baselines. To ensure this, we performed statistical tests, which can be found in Appendix A.6.

Comment

For cartpole and gears, GoFlow performs similarly to some baseline methods.

Thank you for pointing this out. While GoFlow's performance on specific tasks like cartpole and gears is similar to some baseline methods, the overall results across all tasks demonstrate its superior generalization and robustness. Additionally, our real-world experiments show that GoFlow has better sim-to-real transfer in the Gears domain despite having similar simulation performance to ADR in that domain.

… it is hard to find the success rates or other evaluation metrics for the multi-step gear insertion task presented in the paper. Thus, it is difficult to assess the effectiveness of this method quantitatively, despite the cool results from the supplementary video … / … Could the authors perform more experimental evaluations on the real robot task? For example, it is hard to tell the success rate for the overall multi-stage task, or the success rates for each stage of the multi-step task.

We agree that this was a major limitation of the initial manuscript. For this reason, we performed real-world experiments comparing all baseline methods on a real-world gear insertion task across 10 trials. To induce gear position uncertainty, the gear was placed in the same starting location for every trial, and picked by the robot with a random offset before running the insertion policy. Our results show that GoFlow achieves better real-world generalization than all baseline methods (See Table 1 in the Appendix). We have included footage of these experiments in the supplemental materials.

Related to the current experimental results, why are the success rates lower than the numbers presented in prior works and baseline papers (for example Doraemon)? It would really help if the authors could explain the difference in evaluation protocols and metrics in more detail.

The exact empirical results from other papers such as ADR, DORAEMON, and LSDR are not directly comparable. In addition to testing on different simulators, the papers deviate significantly in their domain randomization ranges. Focusing on DORAEMON, the randomization ranges were selected to be within some window of a known-successful center point. In contrast, our ranges are much larger and not necessarily centered.

The proposed method seems to have a high variance in success rate and low success rate from Figure 3. Could authors elaborate on this?

We acknowledge the increased variance in our method as a limitation and have addressed it in the manuscript. We note this in the limitations sections with the following statement: “One limitation of our method is that it has higher variance due to occasional training instability of the flow. This instability can be alleviated by increasing β, but at the cost of reduced sample efficiency (see Appendix A.2).” Despite this higher variance, our statistical analysis shows that GoFlow outperforms baselines.

Comment

Thank you for the detailed responses to my questions and concerns.

I appreciate the effort to add additional experiments both in sim and real. These additional results provide more convincing experimental evidence for the proposed method than the initial version did, especially the full robot task evaluation video.

Taking these improvements into consideration, I would increase my overall ratings for this work.

One additional question: for the gear insertion task, “picked by the robot with a random offset before running the insertion policy”, how large is the randomization box?

Comment

Thank you! To answer your question, we perturb the end-effector pose by a random ±0.01 m translational offset along the x dimension during the pick. We expect some additional grasp pose noise due to control error and object shift during grasp. These details are now noted in the paper as well.

Official Review (Rating: 3)

The authors propose a method (GoFlow) for the automatic design of domain randomization distributions to facilitate sim-to-real transfer in the absence of real-world data. In particular, GoFlow finds the right balance between a maximum-entropy distribution and good policy performance within this distribution, while employing normalizing flows for extreme flexibility in capturing feasible patterns. The results demonstrate superior generalization performance when policies are trained with GoFlow.

Strengths

  • The method extends current state-of-the-art algorithms for Domain Randomization by employing flexible distribution representations through normalizing flows.
  • The method demonstrates that normalizing flows can be learned in this context to autonomously capture complex patterns among unobservable parameters, such as circular shapes in Fig. 2.
  • The authors successfully deploy a probabilistic pose estimation model that allows guiding the agent to seek additional information in partially observable settings.

Weaknesses

  • Poor description of the method and contextualization relative to existing works:

    • The authors present the method as a novel approach for learning DR distributions. However, to my understanding, the presented opt. problem in Eq. 5 has been recently proposed by Tiboni et al. [1] (aka DORAEMON) under an equivalent formulation, which, in turn, took inspiration from the self-paced curriculum learning opt. problem [2] (See Eq. 3 and Sec. B in [1], and see Eq. 12 and Eq. 20 in [2]). All these methods involve a joint opt. problem between the policy performance and an entropy/KL objective from a target distribution. However, the authors fail to make connections to these existing works, while proposing an equivalent formulation. The authors shall explain which parts of the method are novel, and which come from existing works in literature.
    • The final objective for optimizing the sampling distribution in line 9 of Algorithm 1 includes a third term beyond the entropy and the KL term which is not discussed in the main text. It's also unclear why this term is added. Perhaps more importantly, this term was introduced by LSDR [3] but no connection/citation is reported in the manuscript by the authors. In fact, the final algorithmic implementation in Algorithm 1 appears to be equivalent to LSDR, but extended to normalizing flows.
    • DORAEMON [1] and LSDR [3] provide clear motivations behind their choices of the respective objective functions. However, the authors here do not discuss how the opt. problem described in Eq. 5 translates to Algorithm 1. As a result, the algorithmic implementation in Algorithm 1 appears disconnected from the motivation and the general problem described by Eq. 5.
  • Validity of the experimental evaluation:

    • DORAEMON [1] recently showcased successful deployment of FullDR, ADR, and DORAEMON in the cartpole and other locomotion tasks. However, in this paper these methods seem to score critically low in the same environments. This raises questions on the validity of the experimental evaluation, if not properly analyzed and discussed. E.g. why are the success rates so low across all methods?
    • Are baseline methods such as LSDR, ADR and DORAEMON trained with privileged value functions as well? This is known to drastically affect the results, and should be implemented for all baselines for a fair comparison.
    • ADR and DORAEMON assume that the initial sampling distribution is a feasible one, i.e. a near-optimal policy may and should be learned before ever stepping the sampling distribution. Is this assumption satisfied across the experimental evaluation? If not, it should be made clear that this assumption is violated.
    • How does the entropy change for different methods across the training process? What is the starting distribution/entropy for each method? No insight on the varying training distributions is reported, making it harder to draw conclusions on why each method is performing the way it does.
    • Hyperparameter curves for ADR and Doraemon seem to only include one value in some plots of Fig. 9, 10, and 11.
  • Lack of proper citing or outdated citations:

    • Sec. 1: "too broad a distribution leads to unstable [...] while too narrow a distribution leads to poor real-world generalization". Citation required, this claim is not demonstrated in this paper.
    • Sec. 1: "many esisting methods rely on real-world rollouts from hardware experiments to estimate dynamics parameters". DROPO [4] and NPDR [5] are recent state-of-the-art methods that compare and beat the cited and older papers BayesSim and SimOpt. They should likely be cited as well.
    • Sec 1: "As shown in this paper and elsewhere". "Elsewhere" should be accompanied by one or more references.

[1] Tiboni, G. et al. "Domain Randomization via Entropy Maximization." ICLR 2024.

[2] Klink, Pascal, et al. "A probabilistic interpretation of self-paced learning with applications to reinforcement learning." Journal of Machine Learning Research 22.182 (2021): 1-52.

[3] Mozifian, Melissa, et al. "Learning domain randomization distributions for training robust locomotion policies." 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020.

[4] Tiboni, G., Arndt K., and Kyrki V. "DROPO: Sim-to-real transfer with offline domain randomization." Robotics and Autonomous Systems 166 (2023): 104432.

[5] Muratore, Fabio, et al. "Neural posterior domain randomization." Conference on Robot Learning. PMLR, 2022.

Questions

  • What is the starting distribution and entropy of GoFlow? How does it vary during the algorithm? E.g., does it start wide and then shrink to focus on the feasible regions, or vice versa?
Comment

We appreciate your detailed feedback, and thank you for highlighting the potential of our method to extend state-of-the-art domain randomization methods while identifying key areas for improved writing and deeper analysis.

The authors present the method as a novel approach for learning DR distributions. However, to my understanding, the presented opt. problem in Eq. 5 has been recently proposed by Tiboni et al. [1] (aka DORAEMON) under an equivalent formulation, which, in turn, took inspiration from the self-paced curriculum learning opt. problem [2] […] However, the authors fail to make connections to these existing works, while proposing an equivalent formulation. The authors shall explain which parts of the method are novel, and which come from existing works in literature.

Thank you for pointing this out. We agree that Section 4 would benefit from more explicit citations to these relevant works that have inspired our formulation. The optimization problem presented in Eq. 5 is indeed aligned with these prior methods and is not a novel contribution of our work. Our main novelty is a more flexible and expressive neural sampling distribution and showing how that distribution can be integrated into a multi-step planning framework.

We have revised Section 4 to acknowledge these works explicitly and clarify the scope of our contributions. Specifically, we have cited Tiboni et al. [1] and self-paced curriculum learning [2] and discussed how our approach builds on these ideas. The updated section also delineates the aspects of our method that are novel versus those that are extensions or adaptations of existing literature.

As a result, the algorithmic implementation in Algorithm 1 appears disconnected from the motivation and the general problem described by Eq. 5.

We agree that the explanation of Algorithm 1 was not sufficiently connected to the optimization problem defined in Eq. 5. We have elaborated on this connection in the updated manuscript.

The final objective for optimizing the sampling distribution in line 9 of Algorithm 1 includes a third term beyond the entropy and the KL term which is not discussed in the main text. It's also unclear why this term is added.

The first loss term in line 9 of Algorithm 1 is the gradient of the expected return with respect to the parameters of the normalizing flow, which pushes the sampling distribution to maximize reward. This expression is also used in [2], and it is related to the expression used in policy gradient methods with respect to the policy parameters. We have updated the paper to better explain this and cite relevant sources.
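
For concreteness, the standard score-function (REINFORCE-style) identity behind such a term, written generically rather than in the paper's exact notation, is

    \nabla_{\phi} \, \mathbb{E}_{\xi \sim q_{\phi}}\!\left[ J(\pi, \xi) \right] \;=\; \mathbb{E}_{\xi \sim q_{\phi}}\!\left[ J(\pi, \xi) \, \nabla_{\phi} \log q_{\phi}(\xi) \right],

so a Monte Carlo estimate only requires samples ξ_i drawn from q_φ, their estimated returns, and the gradients of the flow's log-density.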

[…] Perhaps more importantly, this term was introduced by LSDR [3] but no connection/citation is reported in the manuscript by the authors. In fact, the final algorithmic implementation in Algorithm 1 appears to be equivalent to LSDR, but extended to normalizing flows.

Yes, the extension of LSDR-style learned domain randomization from multivariate Gaussians to normalizing flows is one of our novel contributions. This representational choice proves to be very important in domains with irregular and uncentered feasible sampling regions. Our second novel contribution is demonstrating how learned control policies can be integrated into a belief-space planning framework that enables information gathering prior to policy execution.

DORAEMON [1] recently showcased successful deployment of FullDR, ADR, and DORAEMON in the cartpole and other locomotion tasks. However, in this paper these methods seem to score critically low in the same environments. This raises questions on the validity of the experimental evaluation, if not properly analyzed and discussed. E.g. why are the success rates so low across all methods?

The exact empirical results from other papers such as ADR, DORAEMON, and LSDR are not directly comparable. In addition to testing on different simulators, the papers deviate significantly in their domain randomization ranges. Focusing on DORAEMON, the randomization ranges were selected to be within some window of a known-successful center point. In contrast, our ranges are much larger and not necessarily centered. To show the relationship between domain ranges and success rate, we conducted an additional experiment comparing the range size to the success rate (Section A.3). From this experiment, you can see that GoFlow better handles larger domain sizes while still matching success rates like those reported in prior works for smaller ranges.

Further, low success rates imply that the policy is only feasible in a small sliver of the original domain ranges. The motivation behind the multi-step planning section (Section 6) was to show that if this sliver (or precondition) is identified, we can plan to move inside it, resulting in a policy with a high likelihood of success.

Comment

Are baseline methods such as LSDR, ADR and DORAEMON trained with privileged value functions as well?

Yes, all methods are trained with an identical RL algorithm and neural architecture. We have added a note about this in Section 5.2.

ADR and DORAEMON assume that the initial sampling distribution is a feasible one, i.e. a near-optimal policy may and should be learned before ever stepping the sampling distribution. Is this assumption satisfied across the experimental evaluation? If not, it should be made clear that this assumption is violated.

This is a correct assessment of one of the limitations of DORAEMON and ADR. The assumption that the midpoint of the randomization space is feasible holds in some of our tasks but not in others. Importantly, GoFlow does not make this assumption. We briefly touched on this in the third paragraph of Section 5.2, but have expanded this discussion to make it clearer.

How does the entropy change for different methods across the training process? What is the starting distribution/entropy for each method? No insight on the varying training distributions is reported, making it harder to draw conclusions on why each method is performing the way it does.

For added clarity and interpretability of the results, we have added entropy statistics during the training process to the appendix (see Figure 14).

Lack of proper citing or outdated citations

Thank you for pointing out these missing citations. We have incorporated them into our related work.

Hyperparameter curves for ADR and Doraemon seem to only include one value in some plots of Fig. 9, 10, and 11.

The curves are not missing, but rather overlapping. Given that we used the same seed for every experiment, in the event that a hyperparameter change did not impact the learning dynamics (for instance, if the success rate never reaches the threshold ε_D in DORAEMON), the curves will be identical.

Comment

I thank the authors for the thorough responses to my concerns.

The first loss term in line 9 of Algorithm 1 is the gradient of the expected return with respect to the parameters of the normalizing flow, which pushes the sampling distribution to maximize reward. This expression is also used in [2], and it is related to the expression used in policy gradient methods with respect to the policy parameters.

I understand now. Perhaps this is trivial, but I believe it could be discussed more thoroughly in the paper for a cleaner theoretical explanation and interpretation. As far as I understand, LSDR implements it in a slightly different way. In particular, you are really computing the gradient of the policy return w.r.t. the current normalizing flow, whereas LSDR attempts to find a distribution such that the policy's return is maximal in expectation over the fixed, wide target distribution.

Update concerns:

  • The method section changed quite significantly, and to me it seems like the paper could benefit from dedicating a bit more time into making clearer connections of GoFlow to the closely related DORAEMON, LSDR and self-paced papers.
  • I believe the authors should also discuss the tuning process for the hyperparameter alpha, as it's known to be crucial in similar applications (soft actor critic, DORAEMON). For instance, DORAEMON turns the problem into a constrained opt. problem to avoid dealing with this trade-off hyperparameter alpha, which is instead set through domain knowledge regarding tolerance to errors. This is more clearly discussed in Sec. 8 of [1]. In other words, setting alpha inevitably affects how much the entropy increases. In a real setting, it's unclear how such trade-off parameter should be chosen to maximize generalizability in the real world.
  • unfair baseline comparison: the novel Fig. 14 importantly sheds light on the behavior of the DORAEMON baseline, which is never really able to increase its own entropy. This is likely due to the violated assumption of the method, i.e. policy training should start on a feasible, fixed point in the parameter dynamic space. Likewise, the ADR baseline starts with a higher initial entropy (why? ADR is expected to start with uniform distributions collapsed to a point-estimate) and is often (3 envs out of 6) unable to increase the entropy. I'd suggest taking the time to re-run these baselines while making sure their respective assumptions are met.

Overall, the authors provided an improved version of the paper during the rebuttal phase, but multiple concerns prevent me from raising my score any higher. I believe taking additional time to improve various aspects of the paper (both theoretical and empirical) would make this a valuable contribution to the field.

[1] Klink, Pascal, et al. "A probabilistic interpretation of self-paced learning with applications to reinforcement learning." Journal of Machine Learning Research 22.182 (2021): 1-52.

Comment

Thank you for your response. Your feedback has been instrumental to improving the paper quality. Below we address your updated list of concerns.

I believe it could be discussed more thoroughly in the paper for a cleaner theoretical explanation and interpretation.

Thanks for this suggestion. We agree that it would be beneficial to show a derivation of this loss term. We have added a section to the appendix deriving it (Section A.7), and referenced that from the main text where the expression is used (Section 4).

the paper could benefit from dedicating a bit more time into making clearer connections of GoFlow to the closely related DORAEMON, LSDR and self-paced papers

To make this connection clearer, we have added another paragraph to the method section after the optimization problem has been defined to more clearly outline how other related works formulate and solve this problem. If this is not what was intended, could the reviewer please elaborate on what additional information would make this connection clearer?

I believe the authors should also discuss the tuning process for the hyperparameter alpha

The tuning process is discussed in the Appendix (Section A.2). We performed parameter sweeps for all baseline methods, including DORAEMON which doesn’t have the exact same hyperparameters, but instead has other relevant hyperparameters that need to be tuned to trade off entropy and reward.

unfair baseline comparison: the novel Fig. 14 importantly sheds light on the behavior of the DORAEMON baseline, which is never really able to increase its own entropy. This is likely due to the violated assumption of the method, i.e. policy training should start on a feasible, fixed point in the parameter dynamic space.

We acknowledge that DORAEMON performs effectively in domains where its assumptions hold, specifically that training starts at a feasible point in parameter space and that dimensions of variability can be characterized independently. However, these conditions may not always hold (See Figure 2). One of the primary goals of our work was to develop a method that does not rely on such assumptions.

However, we agree that this was not sufficiently conveyed in the initial draft. To address this, we now discuss in paragraph 3 of Section 5.2, as well as in the caption of Figure 2, that the test problems we use explicitly violate some of the assumptions underlying DORAEMON and other methods. Our updated draft also contains experiments in Section A.3 showing how the baselines degrade more quickly than GoFlow with increasingly wide and off-centered ranges.

Finally, regarding ADR, our implementation initializes the starting range to match the size of the interval increment, resulting in higher entropy at the beginning of training.

Official Review (Rating: 5)

This paper proposes a method to automatically discover a good sampling distribution for domain randomization in reinforcement learning in order to learn conformant policies. This is done by using normalizing flows to learn a neural sampling distribution that maximizes the return and the entropy of the sampling distribution. The approach is validated in simulation and on a gear-assembly task. Finally, the learned sampling distribution is used for out-of-distribution detection to enable uncertainty-aware manipulation planning.
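
As a purely illustrative sketch of how a learned density over simulation parameters could be used for such an out-of-distribution check (the paper's actual precondition estimation may differ; the density, interface, and threshold below are hypothetical stand-ins, with a Gaussian in place of a normalizing flow):

    import torch

    # Hypothetical stand-in for the learned sampling distribution over
    # simulation parameters; a trained normalizing flow exposing log_prob
    # would play this role in practice.
    density = torch.distributions.MultivariateNormal(
        loc=torch.zeros(2), covariance_matrix=torch.eye(2))

    def in_precondition(xi_estimate, log_prob_threshold=-4.0):
        # Treat low log-density under the learned distribution as
        # out-of-distribution, i.e. outside the skill's estimated precondition.
        return density.log_prob(xi_estimate) > log_prob_threshold

    # Example: decide whether to execute the skill now or gather more
    # information to reduce parameter uncertainty first.
    xi_hat = torch.tensor([0.3, -0.1])
    execute_skill = bool(in_precondition(xi_hat))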

Strengths

  • This paper addresses an important problem of improving robustness of control policies learned in simulation and would be of interest to the robotics community.

  • Experimental comparison with prior methods shows that GoFlow learns a better sampling distribution.

  • Using the learned distribution to estimate the belief space precondition is a useful contribution.

  • Experiments on real robots on a challenging gear assembly task indicate the efficacy of the approach.

Weaknesses

  • The main novelty of using normalizing flows to learn the parameter sampling distribution is somewhat limited. However, it has been shown to be quite effective.
  • The related work section is brief. I would like to see a more thorough discussion of the baselines and of methods that learn the domain randomization distribution.
  • Error in estimating the belief space precondition can lead to incompleteness in planning.

Questions

  • What is the true uncertainty distribution in the gear assembly task? Is it artificially induced or due to uncertainty in perception and actuation? How is it simulated during training?

  • How sensitive is the method to the regularization coefficient hyperparameter? The scale of the expected return J depends on the reward function. Does one need to tune the coefficient for every reward function?

Comment

Thank you for your thoughtful review and for recognizing the importance of improving control policy robustness, as well as the significance of our experimental validation in a challenging gear assembly task. Below we address each of the reviewer's concerns and questions.

The main novelty of using normalizing flows to learn the parameter sampling distribution is somewhat limited. However, it has been shown to be quite effective.

We believe the novelty of this work is two-fold. First, prior work on learning domain randomization distributions made the assumption that the learnable parameter distribution is centered in the parameter range and conforms to a standard parameterized probability distribution. Here, we remove those assumptions by using a more expressive representation (normalizing flows). Second, we demonstrate how learned control policies can be integrated into an uncertainty-aware planning framework that enables information gathering prior to policy execution.

Error in estimating the belief space precondition can lead to incompleteness in planning.

We agree with the reviewer’s statement that determinized probabilistic planners are incomplete. However, this does not limit their usefulness; in many problems they often outperform complete planners [1]. While our planner was primarily designed as a proof of concept for integrating learned preconditions into a larger decision-making framework, extending it to more advanced, and potentially complete, belief-space planners is an interesting direction for future work.

What is the true uncertainty distribution in the gear assembly task? Is it artificially induced or due to uncertainty in perception and actuation? How is it simulated during training?

In the gear assembly task, the uncertainty is in the relative pose between the gripper and the gear object. During simulation, the gear is placed in the robot’s gripper with a random offset. Because the gear pose is not in the observation space, the robot must design a conformant policy that handles arbitrary offsets.

How sensitive is the method to the regularization coefficient hyperparameter? The scale of the expected return J depends on the reward function. Does one need to tune the coefficient for every reward function?

Please see the appendix for details on our hyperparameter search. We find that sensitivity to these parameters varies per task. Empirically, we found that a similar range of alpha/beta values tended to work best across tasks.

As a side note, we performed hyperparameter search for all baselines according to their respective hyperparameters. Our results report the best-performing hyperparameter set per method.

The final objective for optimizing the sampling distribution in line 9 of Algorithm 1 includes a third term beyond the entropy and the KL term which is not discussed in the main text.

The first loss term in line 9 of Algorithm 1 is the gradient of the expected return with respect to the parameters of the normalizing flow, which pushes the sampling distribution to maximize reward. Because we are taking the gradient, we can equivalently use the log probability instead of the probability directly, which leads to better numerical stability. This trick is also used in [2] and in policy gradient RL, but with respect to the policy parameters. We have updated the paper to better explain this and cite relevant sources.
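
As a minimal sketch of such a log-probability-weighted surrogate loss (illustrative only; a diagonal Gaussian with learnable parameters stands in for the normalizing flow, and all names are hypothetical rather than taken from the paper):

    import torch

    # Learnable stand-in density over simulation parameters xi; in practice
    # a normalizing flow exposing log_prob would be used instead.
    mu = torch.zeros(3, requires_grad=True)
    log_std = torch.zeros(3, requires_grad=True)

    def surrogate_loss(xi_batch, returns):
        dist = torch.distributions.Normal(mu, log_std.exp())
        log_q = dist.log_prob(xi_batch).sum(-1)
        # Score-function surrogate: by the log-derivative identity
        # (grad q = q * grad log q), weighting log-probabilities by the
        # returns gives the gradient of the expected return while being
        # more numerically stable than weighting probabilities directly.
        return -(returns.detach() * log_q).mean()

    # Example with placeholder data: xi_batch are parameters sampled from the
    # (initially standard-normal) density, returns are estimated policy returns.
    xi_batch = torch.randn(64, 3)
    returns = torch.rand(64)
    surrogate_loss(xi_batch, returns).backward()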

[1] Yoon, S., Fern, A., & Givan, R. (2007). FF-Replan: A Baseline for Probabilistic Planning. Proceedings of the AAAI Conference on Artificial Intelligence.

[2] Mozifian, M., Gamboa Higuera, J. C., Meger, D., & Dudek, G. (2020). Learning Domain Randomization Distributions for Training Robust Locomotion Policies. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

Comment

Thank you to the reviewers and AC for their work in helping improve the quality of this paper. We appreciate the recognition of the importance of our work in improving control policy robustness through normalizing flows and in showing how they can be integrated into uncertainty-aware planning frameworks. Below, we summarize the key changes we made to address the reviewers' concerns:

  • We updated the related work and method section to include all relevant citations and more clearly explain the meaning of each term in the algorithm and how it connects to the original optimization problem.
  • We elaborated on the details of the baseline implementations and the root cause of their limitations and poor performance in comparison to our method.
  • We included entropy statistics during training in the appendix for increased interpretability of the results.
  • We expanded the set of simulated domains by including a humanoid task with randomized link masses.
  • We performed real-world experiments comparing performance across baselines on a gear insertion task. Footage of these experiments has been uploaded as supplemental materials.
  • We performed statistical tests on the simulated and real-world results to confirm the significance of our results.

More detailed responses to each reviewer are provided in the comments.

AC Meta-Review

This paper introduces a method combining normalizing flows and maximum entropy principles for domain randomization in reinforcement learning. The reviewers recognized the significance of the problem and the effort to tackle it, particularly with a challenging real-world robotic assembly task. However, they noted significant weaknesses in experimental evaluation, theoretical contributions, and connection to prior work.

Strengths:

  • The work addresses a critical problem in sim-to-real transfer for RL.

  • Reasonable use of normalizing flows for domain randomization.

  • Demonstrated applicability on a challenging real-world task.

Weaknesses:

  • Limited theoretical novelty compared to recent works like DORAEMON and LSDR.

  • Experimental design and baseline comparisons raise concerns about validity and fairness.

  • Insufficient quantitative evaluation of real-world results.

While the reviewers acknowledged the importance of the problem, the concerns about the rigor of the experiments, fair comparisons to baselines, and limited theoretical insights outweigh the strengths. Addressing these concerns would significantly enhance the paper’s impact. Given these considerations, I recommend rejection at this stage.

Additional Comments from the Reviewer Discussion

During the rebuttal, the authors made efforts to address concerns, including improving the connection to related work, clarifying their contributions, and adding experiments to strengthen their empirical validation. Despite these efforts, reviewers maintained concerns about the limited theoretical novelty, inconsistencies in experimental design and baseline comparisons, and insufficient task diversity. While the revisions improved the paper's clarity and provided stronger support for the proposed method, the overall contributions were deemed incremental, and experimental validation lacked the rigor expected for acceptance. Based on these considerations, I recommend rejection.

Final Decision

Reject