PaperHub
Overall rating: 4.3 / 10 (withdrawn; 4 reviewers, lowest 3, highest 6, standard deviation 1.3)
Individual ratings: 3, 3, 6, 5
Average confidence: 3.8
ICLR 2024

ComSD: Balancing Behavioral Quality and Diversity in Unsupervised Skill Discovery

OpenReview · PDF
Submitted: 2023-09-16 · Updated: 2024-03-26
TL;DR

Discover qualified and diverse unsupervised skills for fast adaptation by contrastive learning and dynamic weighting.

Abstract

Keywords

unsupervised reinforcement learning, skill discovery, self-supervised learning, multi-joint robot locomotion

Reviews and Discussion

Official Review (Rating: 3)

This work introduces a new unsupervised skill discovery method, ComSD. The main idea of ComSD is to decompose mutual information into $I(\tau; z) = H(\tau) - H(\tau|z)$ and to introduce a skill-dependent coefficient $\beta(z)$ to weight the second term. The authors employ a CIC-like contrastive estimator for the second term and a particle-based entropy estimator for the first term. They show that ComSD outperforms previous skill discovery methods (BeCL, CIC, APS, SMM, and DIAYN) in DMC locomotion environments in terms of both fine-tuning and hierarchical RL performances.
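For readers who want a concrete picture of the CIC-like contrastive estimator mentioned above, here is a minimal, hypothetical sketch (PyTorch-style; the function name, shapes, and temperature are illustrative assumptions, not the authors' code) of an NCE-style score between transition embeddings and skill embeddings:

```python
import torch
import torch.nn.functional as F

def contrastive_exploitation_reward(state_emb, skill_emb, temperature=0.5):
    # state_emb, skill_emb: (N, D) batches of encoded transitions and their skill vectors.
    # Positive pairs sit on the diagonal; the other skills in the batch act as negatives.
    state_emb = F.normalize(state_emb, dim=1)
    skill_emb = F.normalize(skill_emb, dim=1)
    logits = state_emb @ skill_emb.t() / temperature   # (N, N) cosine-similarity logits
    idx = torch.arange(state_emb.size(0))
    # Per-sample InfoNCE score of the positive pair: higher means the transition is
    # better aligned with its own skill, i.e. a lower estimated H(tau|z).
    return F.log_softmax(logits, dim=1)[idx, idx]
```

Such a per-sample alignment score would play the role of the second (exploitation) term, while the first term, $H(\tau)$, would be handled by a separate particle-based estimate.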

Strengths

  • The final objective in Eq. (12) is simple and intuitive.
  • The authors evaluate ComSD on both fine-tuning and hierarchical RL settings, where ComSD achieves better performance compared to five previous approaches.

Weaknesses

  • The contribution of the proposed method (ComSD) appears incremental compared to CIC. In particular, the previous version of CIC is almost identical to ComSD, with the same contrastive objective for $H(\tau|z)$ and the same particle-based entropy estimator for $H(\tau)$. The only difference is that ComSD additionally uses a skill-dependent linear coefficient (i.e., using $\beta(z)\,\alpha H(\tau|z)$ instead of $\alpha H(\tau|z)$), which, in my view, is too incremental to justify an ICLR publication.
  • Given the high similarity to CIC, I believe this work requires more thorough comparisons with CIC. Since ComSD uses $\beta(z)$ and $\alpha$ individually tuned for each environment, it should be compared against (the $H(\tau) - \alpha H(\tau|z)$ version of) CIC with individually tuned $\alpha$ to ensure a fair comparison.
  • There is a mismatch in CIC performance between this paper and the original CIC paper. Weirdly, their performances are the same in walker walk and walker stand, but different in walker run and quadruped run. Could the authors clarify these differences?

Minor issues

  • "(Liu & Abbeel, 2021a) first employs the second MI decomposition Eq. 2 for a better MI estimation." in Appendix B: (1) I would recommend using \citet for inline citations, and (2) to the best of my knowledge, DADS (Sharma et al., 2020) is the first such method.
  • Nitpick: \propto can't be used in Eq. (4) because the right-hand side is not technically proportional to the left-hand side.

Questions

  • As far as I understand, the authors use a fixed latent vector of $z = [0, 0.5, 0.5, \dots, 0.5]$ for fine-tuning experiments. In this case, why don't we just use a fixed coefficient of $\beta = w_{low}$ during training?
Comment

We would like to thank you for your careful reading and expert review. We will answer your questions one by one.

For weaknesses:

Q1. original version of CIC

We agree that the estimator employed by the original CIC is similar to our contrastive estimator. However, they still differ in the use of negative pairs, and our contrastive estimator is proven to be more accurate when the skill vector $z$ is sampled from a uniform distribution. In addition, CIC gives up this estimator in its final version and conducts corresponding experiments, which means they also found it harmful for exploration but did not solve the problem. The loss of diversity caused by giving up the conditioned entropy estimator is also ignored by CIC. Moreover, we notice that the CIC authors made experimental mistakes with the contrastive estimator in their first version (see their ICLR 2022 rebuttal): they did not actually employ this estimator in their experiments and then gave it up entirely in their methodology. In this situation, we think (i) employing a contrastive estimator can be regarded as a contribution, and (ii) successfully incorporating a more accurate estimator through a novel weighting algorithm for the exploration-exploitation balance is meaningful. It is a simple but effective method, like CIC and APS (CIC essentially enhances the APT reward with a state-skill contrastive encoder, and APS augments the APT reward with successor features). In addition, our evaluation experiments are much more comprehensive and reasonable than CIC's, which could help future work on skill evaluation.

Q2. Additional experiments with ComSD w/o SMW

Actually, we have conducted similar experiments (comparisons with ComSD w/o SMW) in our ablation study, covering both numerical results and skill analysis. As mentioned in Section 3.3, the intervention of the exploitation reward simultaneously improves behavioral diversity and reduces the activity of the learned skills, which is also observed in CIC's experiments. This means a tuned constant $\alpha$ cannot achieve a balance between diversity and quality, which explains why CIC gives up the contrastive estimator and motivates our SMW.

Q3. Mismatch of numerical results

These results come from BeCL. Concretely, for the results of DIAYN, APS, SMM, and CIC, we run their official implementations and adopt results similar to those reported in the URLB, CIC, and BeCL papers. Actually, we also notice that there are mismatches between these papers.

For minor issues:

  1. Thanks for your correction, and we will revise it! We intended this citation to be a parenthetical following the method name, which seems to have caused confusion.

  2. Thanks for your correction, and we will revise it!

For questions:

Q. Why not use a small and fixed $\beta$

As mentioned in Section 3.3, we use the exploitation reward to improve behavioral diversity, which is ignored in CIC. However, the naive intervention of the exploitation reward simultaneously improves behavioral diversity and reduces the activity of the learned skills. SMW is designed to maintain this diversity improvement while alleviating the harm to exploration. If we use a fixed low coefficient ($\beta = w_{low} = 0$), the intrinsic reward $r_{ComSD} = r_{exploration} + \beta(z) \cdot \alpha\, r_{exploitation}$ degenerates into $r_{ComSD} = r_{exploration} = r_{CIC}$, i.e., ComSD actually degenerates into CIC. Behavioral diversity would be ignored again, and the skill combination performance would drop severely. See Section 4.2 for skill combination performance and Appendix D for visualized skills of CIC and ComSD. If only skill fine-tuning evaluation is considered (i.e., behavioral diversity is ignored as in CIC), SMW is of no use because it is designed for the diversity-exploration balance. SMW enables ComSD to achieve competitive performance on both adaptation tasks simultaneously.

References

CIC: Unsupervised Reinforcement Learning with Contrastive Intrinsic Control. NeurIPS 2022.

BeCL: Behavior Contrastive Learning for Unsupervised Skill Discovery. ICML 2023.

URLB: Unsupervised Reinforcement Learning Benchmark. NeurIPS 2021.

CSD: Controllability-Aware Unsupervised Skill Discovery. ICML 2023.

APS: Active Pretraining with Successor Features. ICML 2021.

DIAYN: Diversity Is All You Need: Learning Skills without a Reward Function. ICLR 2019.

SMM: Efficient Exploration via State Marginal Matching. arXiv 2019.

Comment

Thanks for the answers. While I appreciate the detailed responses, my main concerns still remain unresolved. Given the similarity to CIC (including previous versions), I believe the novelty of this method lies in the adaptive coefficient $\beta(z)$. I would have appreciated this idea if this simple change alone indeed led to a significant performance gain. However, from the current results, I'm not convinced that this is the case. In order to empirically prove the effectiveness of the adaptive coefficient, I believe a thorough comparison against (the previous version of) CIC with a fixed coefficient $\alpha$ individually tuned for each environment (as the authors did for ComSD) is necessary, given that this is the only main novelty of the proposed method. I reviewed the results in Fig. 4, but (1) they are limited to only a single domain, and (2) "ComSD w/o SMW" seems to be not sufficiently tuned given its inferior performance to CIC (as CIC is a special case of "ComSD w/o SMW"). As such, I am unable to recommend acceptance in its current form.

Official Review (Rating: 3)

The authors propose an unsupervised skill discovery method named ComSD. With ComSD, the mutual information between states (or pairs of consecutive states) and skill vectors is estimated with an NCE-style contrastive learning loss and particle-based entropy estimation, and the coefficient weighting the two intrinsic reward terms is designed to differ across skill vectors. They also empirically compare ComSD with baseline methods in two representative skill discovery evaluation settings, skill combination and skill fine-tuning, on locomotion simulation tasks including DMControl, and provide further quantitative and qualitative analyses.

Strengths

  • The empirical evaluation is done in various settings. Also, compared to the selected set of baselines, the proposed method shows fair performance on the tasks.
  • The manuscript is basically easy to follow, and Fig. 1 and 2 help readers understand the proposed approach more quickly.

Weaknesses

  • Skill-based Multi-objective Weighting (SMW), which is the main contribution of this paper in my view, needs a rationale. Apart from the empirical confirmation that it improves the resulting performance, it is not clear why the weighting coefficient should be different for different skill vectors and why it needs to be structured like that. I'm not trying to argue that just a constant coefficient is enough; I'm rather asking the following questions: Exactly what benefits does that form provide and why? How does having different coefficients for different skills like that affect the overall learning objective?
  • I believe the statement about the proposed method's difference from CIC and APS (Sec. 4.1) needs further clarification. If the "contrastive results" employed for the state entropy maximization mean the state representations, CIC uses the state representations from the contrastive learning for estimating the state entropy.
  • A presentation issue: the use of "reasonable" and "unreasonable" for describing the MI estimation methods doesn't seem technical/scientific and is without appropriate backup.

Questions

Please check out the weakness section.

Comment

We would like to thank you for your careful reading and detailed reviews. We will answer your questions one by one.

Q1: SMW motivation.

As mentioned in Section 3.3, the intervention of the exploitation reward simultaneously improves behavioral diversity and reduces the activity of the learned skills. The larger the exploitation coefficient $\alpha$, the greater the skill diversity but the lower the behavioral activity (lazier exploration). With the proposed dynamic weight $\beta$ (SMW), the skill space is actually divided into different regions with different learning objectives. In the high-exploitation skill space ($flag(z)$ is large), the agent pays more attention to behavioral diversity, while in the low-exploitation skill space ($flag(z)$ is low), the agent tries its best to explore without considering diversity. In the middle interval ($flag(z)$ between $f_{low}$ and $f_{high}$), the agent can learn skills at different activity levels (corresponding to different $\beta$ values). With SMW, the agent takes both diversity and exploration into account, learning diverse skills at different activity levels.
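To make this division of the skill space concrete, a minimal sketch of such a piecewise weighting is given below (the exact flag function, thresholds $f_{low}$/$f_{high}$, and interpolation used in the paper may differ; the first-dimension flag and all constants here are illustrative assumptions):

```python
def smw_beta(z, w_low=0.0, w_high=1.0, f_low=0.25, f_high=0.75):
    # flag(z) is assumed here to be the first skill dimension (illustrative choice).
    flag = float(z[0])
    if flag >= f_high:
        return w_high                        # high-exploitation region: favour diversity
    if flag <= f_low:
        return w_low                         # low-exploitation region: favour exploration
    t = (flag - f_low) / (f_high - f_low)    # middle interval: intermediate activity levels
    return w_low + t * (w_high - w_low)

# Schematically, the skill-conditioned intrinsic reward would then be
# r_total = r_exploration + smw_beta(z) * alpha * r_exploitation.
```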

We will explain our motivation in more detail in the revision, thanks!

Q2: Difference from CIC.

I'm sorry, there's a typo in Section 4.1 causing a great misunderstanding. Thank you for pointing this out. The correct statement is "For CIC, ComSD follows it in state entropy estimation but first proposes to employ contrastive results for explicit conditioned entropy maximization." Concretely, CIC only uses a single exploration reward (for state entropy estimation) but doesn't design an exploitation reward (for conditioned entropy estimation), i.e., CIC doesn't explicitly maximize the conditioned entropy by RL. The intrinsic reward of CIC is $r_{CIC} = r_{exploration}$. By contrast, ComSD employs a contrastive exploitation reward to estimate the conditioned entropy and uses a dynamic weighting algorithm to alleviate its impact on dynamic behaviors. Its intrinsic reward is $r_{ComSD} = r_{exploration} + \beta(z) \cdot \alpha\, r_{exploitation}$.

Q3: non-technical expression

Thanks for your suggestions! We will modify our expression.

References

CIC: Unsupervised Reinforcement Learning with Contrastive Intrinsic Control. NeurIPS 2022.

Official Review (Rating: 6)

This paper points out that one of the major challenges in MI-based skill discovery is balancing two intrinsic reward terms, the state entropy for exploration and the negative conditioned state entropy for exploitation or state-skill alignment. On top of the prior skill discovery method, CIC, this paper proposes Skill-based Multi-objectives Weighting (SMW) to dynamically weight these two reward terms for different skill vectors. The proposed method outperforms prior MI-based skill discovery approaches in both skill composition and skill finetuning experiments on 4 URLB locomotion domains.

Strengths

  • The paper points out the challenge of balancing exploration and exploitation in skill discovery and then provides a simple practical solution, Skill-based Multi-objectives Weighting (SMW).

  • The exhaustive experiments demonstrate that the proposed method, ComSD, can discover diverse skills and adapt better to downstream tasks. Especially, ComSD significantly outperforms other methods on most skill combination tasks.

  • The paper is well-organized and easy to follow.

Weaknesses

  • In Section 3.3, it is clear that the weighting between two intrinsic reward terms is challenging. However, it is not straightforward to get how the proposed SMW resolves this issue. The choice of the dynamic weighting term in Equation 11 is not justified and explained sufficiently. As this is the main contribution of this paper, it has to be clearly stated and examined in the paper.

  • The approach of ComSD seems very similar to CIC except for the dynamic weighting (SMW), and it is unclear how ComSD is different from CIC other than SMW. In Section 4.1, the paper says "For CIC, ComSD follows it in state entropy estimation but first proposes to employ contrastive results for explicit state entropy maximization." but I cannot follow the "explicit state entropy maximization" part. Could the authors elaborate on this more?

  • Many MI-based methods show their limited applicability to domains other than simple locomotion environments. Although the strong skill discovery performances on the locomotion tasks are impressive, the proposed approach could overfit to the specific domain. Comparisons on manipulation tasks, as in (Park 2023) would make the claim of this paper much stronger.

  • Although the proposed method outperforms prior MI-based skill discovery approaches, recent non-MI-based skill discovery methods (Park 2021, Park 2023) have shown much more diverse skill sets. Thus, it is important to compare with these non-MI-based approaches.


Although the author response did not address all my concerns, it is clear that ComSD has resolved some issues in CIC implementation, which seems useful for future research. Thus, I increased my rating to borderline accept.

Questions

Please address the weaknesses mentioned above.

Minor questions and suggestions

  • In Equation 11, $w_{high}$ at the end should be $w_{low}$.

  • In Section 4.1, Evaluations, "skill combination in URLB" should be "skill finetuning in URLB".

  • Section 4.2 mentions that 6-10 different random seeds are used in skill combination and finetuning experiments. Does this mean a pre-trained skill policy and meta- or finetuned- policy for each random seed? Or, does this mean one pre-trained skill policy and 6-10 different meta- or finetuned- policies? The variances in Figure 3 and Table 1 seem a bit smaller than expected given the high-variance nature of skill discovery methods.

Comment

For minor questions and suggestions:

  1. Thanks for your careful reading, but in our settings, when $flag(z)$ equals $f_{high}$, $\beta$ is indeed $w_{high}$.

  2. Thanks for your correction, and we will revise it!

  3. To reduce instability, we pre-train several skill agents and conduct several adaptation experiments. For example, we pre-train two walker agents with different seeds. For each agent, we train three meta-controllers with different seeds.

References

CIC: Unsupervised Reinforcement Learning with Contrastive Intrinsic Control. NeurIPS 2022.

BeCL: Behavior Contrastive Learning for Unsupervised Skill Discovery. ICML 2023.

URLB: Unsupervised Reinforcement Learning Benchmark. NeurIPS 2021.

CSD: Controllability-Aware Unsupervised Skill Discovery. ICML 2023.

APS: Active Pretraining with Successor Features. ICML 2021.

DADS: Dynamics-Aware Unsupervised Discovery of Skills. ICLR 2020.

Comment

Thank you for your detailed responses.

I understood that ComSD has some improvement in its implementation, which has not been addressed in CIC. I'll increase my rating accordingly.

However, the current form of the paper is not sufficient for me to strongly support the acceptance of this paper. The writing about skill-wise dynamic weighting should be clearer and better motivated (e.g., why does it need to be skill-based, and why is using the first dimension for a flag function a good choice?). Also, as Reviewer YuZu mentioned, more thorough comparisons and ablation studies are required to justify SMW's necessity and effectiveness.

Comment

We would like to thank you for your careful reading and detailed suggestions. We will answer your questions one by one. We will also revise our paper according to your suggestions (e.g., more intuitive motivation and a detailed explanation of SMW).

For weaknesses:

Q1: SMW motivation.

As mentioned in Section 3.3, the intervention of the exploitation reward simultaneously improves behavioral diversity and reduces the activity of the learned skills. The larger the exploitation coefficient $\alpha$, the greater the skill diversity but the lower the behavioral activity (lazier exploration). With the proposed dynamic weight $\beta$ (SMW), the skill space is actually divided into different regions with different learning objectives. In the high-exploitation skill space ($flag(z)$ is large), the agent pays more attention to behavioral diversity, while in the low-exploitation skill space ($flag(z)$ is low), the agent tries its best to explore without considering diversity. In the middle interval ($flag(z)$ between $f_{low}$ and $f_{high}$), the agent can learn skills at different activity levels (corresponding to different $\beta$ values). With SMW, the agent takes both diversity and exploration into account, learning diverse skills at different activity levels.

Q2: Difference from CIC.

I'm sorry, there's a typo causing a great misunderstanding. Thank you for pointing this out. The correct statement is "For CIC, ComSD follows it in state entropy estimation but first proposes to employ contrastive results for explicit conditioned entropy maximization." Concretely, CIC only uses a single exploration reward (for state entropy estimation) but doesn't design an exploitation reward (for conditioned entropy estimation), i.e., CIC doesn't explicitly maximize the conditioned entropy by RL. The intrinsic reward of CIC is $r_{CIC} = r_{exploration}$. By contrast, ComSD employs a contrastive exploitation reward to estimate the conditioned entropy and uses a dynamic weighting algorithm to alleviate its impact on dynamic behaviors. Its intrinsic reward is $r_{ComSD} = r_{exploration} + \beta(z) \cdot \alpha\, r_{exploitation}$.

Q3: CSD and benchmark

  • We also notice that CSD says "MI-based methods often end up discovering 'static' skills with limited state coverage." But we don't completely agree with this view. In practice, MI-based methods can produce highly dynamic, continuous behaviors in challenging locomotion (CIC) and enable efficient adaptation across different fields (APS, DADS).
  • DMC (URLB) is challenging for unsupervised skill discovery. CIC found that DMC is much more difficult than OpenAI Gym (used in CSD) due to the lack of extrinsic signals on agent balance. According to CIC, most methods that succeed in OpenAI Gym fail in DMC. Among the methods that have been evaluated on DMC, only CIC and our ComSD can generate continuously dynamic robot behaviors.
  • We have also considered manipulation tasks. However, we find that the objectives of locomotion skill discovery and manipulation skill discovery are totally different. In challenging locomotion (DMC), we want the agent to learn different behaviors (postures) at different activity levels. Both dynamic behaviors and static postures are useful, which motivates our method design. The learned skills are hard to judge by moving distance alone. However, in manipulation, the objective is to move the arm or small objects as far as possible. Evaluation of skills is also limited to movement distance and direction (only moving far away is considered a good skill, while direction is the only indicator for distinguishing skills), which means many interesting behaviors are ignored. For example, continuous moving behaviors (like arms moving up and down) and moving over short ranges (or moving different distances) are regarded as useless skills in CSD's experimental settings. The ability to discover continuously dynamic behaviors and static postures can't be well evaluated in current manipulation experimental settings.

Q4: Comparison with LSD and CSD

We agree with you and will consider more comparisons. However, we did not limit the comparison to MI-based methods on purpose. Actually, we compare our method with the current SOTA methods (CIC, BeCL) on DMC (URLB) and employ all skill discovery baselines used in their papers. The comparison with five popular baselines, including the SOTA, is enough to support our conclusion. In addition, changing the benchmark for CSD leads to an unfair comparison (e.g., we don't have proper hyper-parameter settings). This may also explain why CSD was not compared with CIC and APS. Considering the reasons above, we didn't compare with LSD and CSD, following CIC and BeCL.

For minor questions and suggestions:

continued.

Official Review (Rating: 5)

In this work, the authors present a novel unsupervised skill discovery algorithm called ComSD (Contrastive Multi-Objective Skill Discovery) that uses contrastive learning and entropy estimation to learn skills in an unsupervised fashion in simulated environments. The primary insights from the authors are twofold: using contrastive learning to learn a similarity metric between skill latents $z$ and trajectories $\tau$, and using a coefficient to balance the quality of the policy against the diversity of the exploration.

In this work, the authors first explain their algorithm, which is to learn an estimated lower bound on $p(\tau, z)$ by using an NCE loss function to find the entropy of a skill, and to use a particle-based entropy estimator to estimate the entropy of the trajectories. Then, the authors explain their automatically rebalanced multi-objective weighting (SMW), which helps balance the quality and the diversity of the learned skills.
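As a rough illustration of the particle-based entropy estimator referred to here, the following is a hedged, APT/CIC-style k-nearest-neighbour sketch (not the authors' implementation; the embedding source and the value of k are assumptions):

```python
import torch

def particle_entropy_reward(embeddings, k=12):
    # embeddings: (N, D) batch of state (or state-pair) representations.
    # The distance to the k-th nearest neighbour within the batch serves as a
    # non-parametric proxy for the marginal state entropy H(tau).
    dists = torch.cdist(embeddings, embeddings, p=2)      # (N, N) pairwise distances
    knn, _ = dists.topk(k + 1, dim=1, largest=False)      # smallest k+1 (includes self, dist 0)
    return torch.log(1.0 + knn[:, -1])                    # per-particle entropy-style reward
```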

The authors then show experiments on standard unsupervised skill discovery environments from DM-Control. Their experiments cover two settings: skill fine-tuning, and skill combination using a hierarchical controller. Then, in an ablation experiment, they show that both the dynamic weighting and the contrastive encoding are important for ComSD to perform well. Finally, they show that ComSD also outperforms other comparable algorithms on state diversity metrics.

Strengths

The work shows a number of positive qualities:

  • The authors motivate their algorithm well; there are only so many ways of doing unsupervised skill discovery, but the authors identify an approach that differs from the baseline approaches and execute on it.
  • The trade-off between diversity and consistency is not explored often enough in the literature, but the authors identify it as an important factor, and consciously optimize for this trade-off across their skills.
  • They evaluate their algorithm on a good number of environments, and against a good number of baselines.
  • The authors also evaluate ComSD on two state diversity metrics, which is also quite important for an unsupervised skill discovery method.

Weaknesses

The primary flaw of this work is the incomplete comparison with prior state of the art. Unsupervised skill discovery is a crowded field of research, so it is natural that the authors may miss certain previous works during their literature review. However, a quite relevant work that the authors seem to have missed is [1]. [1] seems relevant to the authors' work because of (a) the information gain expansion used in the work, (b) the use of a point-based entropy estimator to compute the reward, and most importantly (c) the use of a balancing coefficient to trade off between diversity and consistency across different skills. Moreover, [1] seems to outperform the standard unsupervised skill discovery algorithms at the time, so it would be good for the authors to add the work as a baseline and/or explain the differences between ComSD and [1] and why they are not compatible for direct comparison.

Apart from that, there are some other issues of the work:

  1. The notations r_exploration and r_exploitation look quite similar to each other, and I was thrown off multiple times while reading the explanations. If the authors could change the notation, it would be much easier to read.
  2. The SMW objective is not motivated well. It seems like it was pulled out of nowhere. More motivation for this would be apt.
  3. The skills are trained for 2M steps, however, there is no clear reasoning for why this number was picked. How does this look in the limit, at say 10M steps?

[1] Shafiullah, Nur Muhammad Mahi, and Lerrel Pinto. "One After Another: Learning Incremental Skills for a Changing World." International Conference on Learning Representations. 2021.

Questions

See above. Specifically, what is similar vs not about previous works, more motivation for the SMW module, and the behavior in limiting number of environment steps would be good.

Comment

Thanks for your detailed review and suggestions! We appreciate your affirmation and will answer your questions one by one.

For main weakness:

Q: Additional related work DISk

Thanks for the additional reference. Surely we missed an important related work, and we will cite it in our background! However, we can't agree that we made an incomplete comparison, because we follow recent advanced works (including the SOTA) in our baseline selection. The concrete reasons are as follows:

  • First, we have compared our method with CIC and BeCL, the current SOTA methods on DMC (URLB). We also employ all unsupervised skill discovery baselines used in their papers. Comparisons with five popular baselines, including SOTA, are enough to support the conclusion.
  • Second, DISk provides a new perspective on skill discovery but is not suitable for direct comparison here. The main reason is that it sets different policy networks for different skills, while all methods in our paper use only one skill-conditioned agent. In addition, DISk is evaluated on OpenAI Gym but not on DMC. It is difficult to select an appropriate number of skills for DISk on a new benchmark and to ensure a fair comparison without proper hyper-parameters. This may explain why the current SOTAs (CIC and BeCL) on DMC also don't employ DISk as a baseline and why DISk is not compared with APS and SMM.

For other issues:

Q1: The notation r_exploration and r_exploitation look quite similar to each other.

Thanks for your advice. We will revise the notation!

Q2: SMW motivation.

As mentioned in Section 3.3, the intervention of the exploitation reward simultaneously improves behavioral diversity and reduces the activity of the learned skills. The larger the exploitation coefficient $\alpha$, the greater the skill diversity but the lower the behavioral activity (lazier exploration). With the proposed dynamic weight $\beta$ (SMW), the skill space is actually divided into different regions with different learning objectives. In the high-exploitation skill space ($flag(z)$ is large), the agent pays more attention to behavioral diversity, while in the low-exploitation skill space ($flag(z)$ is low), the agent tries its best to explore without considering diversity. In the middle interval ($flag(z)$ between $f_{low}$ and $f_{high}$), the agent can learn skills at different activity levels (corresponding to different $\beta$ values). With SMW, the agent takes both diversity and exploration into account, learning diverse skills at different activity levels.

Q3: pre-training steps.

We follow previous advanced works (CIC, BeCL) and the benchmark (URLB) for the training step settings.

References

DISk: One After Another: Learning Incremental Skills for a Changing World. ICLR 2022.

CIC: Unsupervised Reinforcement Learning with Contrastive Intrinsic Control. NeurIPS 2022.

BeCL: Behavior Contrastive Learning for Unsupervised Skill Discovery. ICML 2023.

URLB: Unsupervised Reinforcement Learning Benchmark. NeurIPS 2021.

APS: Active Pretraining with Successor Features. ICML 2021.

SMM: Efficient Exploration via State Marginal Matching. arXiv 2019.