Task Adaptation from Skills: Information Geometry, Disentanglement, and New Objectives for Unsupervised Reinforcement Learning
Abstract
Reviews and Discussion
This work targets the information geometry of unsupervised reinforcement learning, especially Mutual Information Skill Learning (MISL). Building on previous work, it first considers the diversity and separability of learned skills. As MISL cannot guarantee these properties, this work proposes LSEPIN to measure disentanglement, and then shows the connection between LSEPIN and downstream task adaptation cost. Moreover, it investigates the information geometry of Wasserstein-distance-based skill learning methods. Finally, it proposes PWSEP and theoretically shows that it can discover all optimal initial policies.
Strengths
-
Definitely an important problem to be tackling! Previous work [1] has established the connection between skill-based unsupervised RL and information geometry, and has shown that the KL-based metric can only find skills with the "largest radius", leaving how to find all vertices of the skill polytope as an open challenge. This work has elegantly solved this challenge, and I believe it will be of interest to researchers in unsupervised RL.
-
The description of the article is very clear, and the introduction to related work is very specific.
-
I have basically read all the proofs of the theorems, which are well-written.
Thanks to the authors for putting in the effort in doing this work!
[1] Eysenbach, Benjamin, Ruslan Salakhutdinov, and Sergey Levine. "The information geometry of unsupervised reinforcement learning." arXiv preprint arXiv:2110.02719 (2021).
Weaknesses
Overall, I think this paper is well written and its contribution is solid. I do not find clear weaknesses but I still have some questions (see Questions). I will adjust my score accordingly based on the author's response and other reviewers' comments.
Questions
-
This paper claims that "it is possible for WSEP to discover all vertices of the feasible polytope". As the theoretical results only show a property of the learned skills, rather than that WSEP can learn all skills, there is no direct evidence that WSEP can indeed find more skills than MISL (i.e., skills without maximum "distances"). Can the authors show that WSEP can find more skills or even all skills? Empirical or theoretical results, even in simple cases, would be really helpful.
-
All experiments are provided in the Appendix. It seems that providing the main experiments and analyses in the main text can help the readers to better understand this work.
-
What will happen if we change the W-distance to other distance metrics in PWSEP(i)? In my opinion, it seems that the proof of Thm 3.5 holds for any distance metric (satisfying non-negativity, symmetry, and the triangle inequality). Is that right? Or are there special properties of the W-distance necessary for proving Thm 3.5?
-
It would be better to provide proof sketches of the theorems, such as Thm 3.5, in the main text.
Thanks for the fast reply!
We hope the figure provides the intuition that WSEP can learn the vertices that are not with maximum "distances".
Here is a simple MDP that produces the feasible state distribution polytope in the figure, and the quantitative results below show why, in this case, WSEP learns more vertices than MISL.
An MDP example that produces the polytope in the figure
An MDP with for states and for actions. The transition matrix for action is:
The transition matrix for action is:
The transition matrix for action is:
Because the transition probabilities depend only on the actions, the state distribution is determined by the action distribution:
We can see that, in this case, the action distribution provides the convex coefficients of three probabilities, so the feasible state distribution lies in the convex polytope with these three probabilities as its vertices.
This MDP accords with the figure, where we have labeled the vertices accordingly.
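The transition matrices are not shown above, but the structure can be sketched with a hypothetical instance: when every action's transition matrix has identical rows (the next-state distribution depends only on the action taken), the resulting state distribution is exactly the convex combination of the per-action distributions, weighted by the action probabilities. The concrete numbers below are illustrative assumptions, not the actual matrices of the example.

```python
import numpy as np

# Hypothetical 3-state instance with the structure described above: each
# action's next-state distribution p_a is independent of the current state,
# so every transition matrix has identical rows.
p_a = np.eye(3)                  # action a deterministically reaches state a
pi = np.array([0.2, 0.4, 0.4])   # a state-independent action distribution

# The next-state distribution from ANY state is sum_a pi(a) * p_a, so the
# state distribution is the convex combination of the vertices p_a.
rho = pi @ p_a
P = np.tile(rho, (3, 1))         # transition matrix under pi: identical rows

print(rho)                       # [0.2 0.4 0.4]
print(np.allclose(rho @ P, rho)) # True: rho is stationary
```

With these one-hot vertex distributions, the action probabilities are exactly the convex coefficients of the resulting state distribution, which is the role the coefficients play in the polytope argument above.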
Quantitative comparison of I(S;Z) and WSEP
MISL only learns skills at :
For this MDP, the unique center of the "circle" with maximum "radius" (we showed in appendix D that it is uniquely determined by the MDP) would be [0.2, 0.4, 0.4].
MISL would learn skills at only two of the vertices, weighted so that the average state distribution lies at the unique center.
This solution maximizes I(S;Z).
Putting weight on the other vertex would lower I(S;Z): the average state distribution would then no longer be the center of the maximum "circle".
WSEP can learn skills at all vertices:
For WSEP, when the skill number is set to three and two skills are placed at the same vertices as MISL, putting the last skill at either of those two vertices would yield a lower WSEP than putting it at the remaining vertex, due to the triangle inequality.
Suppose the transportation cost between every two states is 1, then
The WSEP for two skills at the MISL vertices and the third skill duplicated at either of them takes the same value in both duplicated placements.
The WSEP for skills at the three vertices respectively is strictly larger, and this is the solution maximizing WSEP in this case (by convexity, moving any of the three vertices would lower its distance to the others).
Therefore, WSEP would favor learning all vertices instead of only two skills like MISL. In this case, MISL cannot discover the vertex that does not attain the maximum "distance", but WSEP can.
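The comparison above can be sketched numerically under hypothetical assumptions: one-hot state distributions at the three vertices, and WSEP taken as the sum of pairwise 1-Wasserstein distances between skills. With a unit transportation cost between every pair of distinct states, the 1-Wasserstein distance coincides with the total variation distance.

```python
import itertools
import numpy as np

def w1_unit_cost(p, q):
    # With unit transportation cost between any two distinct states, the
    # 1-Wasserstein distance equals the total variation distance.
    return 0.5 * np.abs(p - q).sum()

def wsep(skills):
    # WSEP taken as the sum of pairwise Wasserstein distances between the
    # skills' state distributions.
    return sum(w1_unit_cost(p, q)
               for p, q in itertools.combinations(skills, 2))

# Hypothetical one-hot distributions at the three polytope vertices.
A, B, C = np.eye(3)

print(wsep([A, A, B]))  # 2.0: third skill duplicated at a MISL vertex
print(wsep([A, B, C]))  # 3.0: one skill per vertex, the WSEP maximizer
```

Under these assumptions, placing one skill at each vertex strictly dominates any duplicated placement, matching the argument from the triangle inequality.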
Thanks again and we hope this addressed the question!
I have roughly checked the example and believe it is reasonable. Also, I believe that adding a more detailed version of this example to the paper (for example, strictly showing that [0.2, 0.4, 0.4] is exactly the center of the "circle") would make the paper more solid. I would like to keep my score and believe that this work will be a valuable contribution to the community.
We sincerely thank you for your thoughtful review and recognition of our work. We are delighted to address your questions. Please see our response below:
Q1: About whether "WSEP can find more skills or even all skills"
A1: We assume that by 'skills' you are referring to those at the vertices.
By "it is possible for WSEP to discover all vertices of the feasible polytope" we mean that WSEP is not restricted by maximum "distances", so it might discover vertices that cannot be discovered by MISL in certain cases. For example, in the situation shown by the figure from this anonymous link, MISL can only learn two of the vertices. Even when the skill number is set to three, the third skill learned by MISL will be duplicated at one of those two vertices rather than lie at the remaining one. Instead, WSEP can discover all 3 vertices in this case.
However, as mentioned in Section 3.3.4 and appendix F.6, unlike PWSEP, optimizing WSEP is not guaranteed to always discover all vertices. Among the vertices that cannot be discovered by WSEP, there can also be ones with maximum "distance" that can be discovered by MISL. This motivates us to look into PWSEP, which is guaranteed to always discover all vertices given enough skills.
Q3: What will happen if we change the W-distance to other distance metrics in PWSEP(i)?
A3: Thm 3.5 holds for any distance metric that satisfies non-negativity, symmetry, and the triangle inequality, together with strict convexity.
We briefly discussed this topic in appendix F.4 after the proof of theorem 3.5. Although total variation distance and Hellinger distance are also true distance metrics, they are less used in RL compared to KL divergence and Wasserstein distance, resulting in limited research on their efficient approximation for state distributions. In addition, we have not found literature supporting the strict convexity of total variation distance and Hellinger distance. Additionally, the Wasserstein distance has the potential to exploit the choice of transportation cost to provide smooth and stable measurements with meaningful information, as discussed in appendix G.4.
Q2&Q4: About experimental results in the main text and proof sketch
Thank you for your valuable suggestion on enhancing our paper's presentation. We will certainly take your feedback into account.
Thanks again for your time and attention!
I have read your reply and my Q1 still holds. I fully understand that MISL can only find skills restricted by maximum "distances" where WSEP might find other skills. My question is: Is there a situation in which WSEP indeed finds more skills than MISL? More specifically, can we find an MDP, maybe very simple, where WSEP can find more skills than MISL? The proposed figure in the rebuttal is not a strict example, right?
This paper studies the geometry of state distributions learned with mutual information skill learning for the purpose of theoretical task adaptation analysis. The authors propose Least SEParability and INformativeness (LSEPIN) to measure diversity and separability and show a relationship with the Worst-case Adaptation Cost (WAC), theoretically proving the connection between optimizing LSEPIN and WAC. Similarly, the authors also propose the Wasserstein-based distance metric WSEP, which is more suitable in a symmetric polytope, to measure the separability of skills. Analogously to LSEPIN and WAC, the relationship between WSEP and the Mean Adaptation Cost is studied.
Strengths
- The geometry perspective of task adaptation is interesting.
- The study of the geometry is promising and can motivate readers.
- The theoretical results are well justified.
Weaknesses
- Some terms (diversity and separability) are not defined well.
- The flow of the paper is weakly organized, which hurts readability (the first topic is LSEPIN and the second WSEP). The paper has no discussion of other perspectives.
- The findings from the theoretical derivation, relating WAC and adaptation cost, are not surprising.
- Most of the content depends heavily on the appendix. Although the authors studied various components, they are not well organized.
Questions
- Here are questions that we want to discuss with the authors.
- We might have incorrectly understood the contribution of this work. Why is the theoretical derivation of Theorem 3.1 important for task adaptation?
- What can we additionally learn from Theorem 3.1, and how can we use the theorem for other task adaptation works? To my understanding, the relationship is: increasing LSEPIN results in lower WAC.
- Why can this quantity be used to measure diversity and separability? I guess the two features should be computed among skills, but it is just a mutual information for a single code.
- In LSEPIN, the metric measures the least skill code. I guess an abundant skill code may hurt the calculation. Isn't that problematic? How could you ensure that all the codes are meaningful in the computation of LSEPIN?
- WAC is measured with the worst state distribution. How could this be meaningful and practical? That is, why do we need to measure worst-case adaptation? To the best of my knowledge, skill adaptation is applied to state distributions that are similar to the state distribution of the target skill. Therefore, the importance of this part in WAC is not persuasive.
- Could you please additionally describe the necessity of symmetry and triangle inequality?
- What is the limitation of WSEP?
- Could this study be combined with the in-distribution and out-of-distribution perspectives of task adaptation?
[Overall] Although the authors conducted several theoretical derivations and analyses, the choice of metrics and the relationships between them have little meaning, mostly because task adaptation might assume a close state distribution for a skill. However, the authors study the worst-case adaptation. Additionally, the organization of the paper is not well constructed and hard to follow.
We thank the reviewer for your valuable feedback and your willingness to engage in discussion. We would like to address your concerns and answer your questions. Please see the following for our response.
1. Regarding the concerns related to the "WAC" cost and Theorem 3.1
About:
- [Overall] Although the authors conducted several .. the authors study the worst-case adaptation.
- WAC is measured with the worst ... the importance of part in WAC is not persuasive.
- What can we .. For my understanding, the relationship: increasing LSEPIN results in lower WAC.
- The findings from the theoretical derivation are not surprising. WAC and adaptation cost.
1.1 Clarification of the definition of WAC
To address these concerns, we would like to first clarify the definition of the WAC cost and mitigate potential misunderstandings. The WAC is defined as: is the optimal state distribution for downstream task , and is the set of learned skills.
WAC indeed considers adaptation from the closest skill in the set of learned skills, via the inner minimization over skills. The worst case in WAC means choosing a downstream task whose optimal state distribution is far from all learned skills, even from the skill that is closest to it.
1.2 Practicality and meaningfulness
In practice, although we can adapt from the closest skill after knowing the downstream task, we do not have prior knowledge about it during unsupervised pre-training. Therefore, we can only prepare for the worst downstream task (the one far from even the closest learned skill).
Theorem 3.1 and the theoretical results in Section 3.2 shed light on what kind of skills are favored for downstream task adaptation and how to quantitatively measure them.
1.3 About the contribution of our findings
Following the above clarification of the WAC definition, we hope it is now clear that our analysis deals with the fundamental question of URL: "How can the learned skills be used for downstream task adaptation, and what properties of the learned skills are desired for better downstream task adaptation?"
The findings ... not surprising.
The findings may appear "not surprising", given that promoting diversity and separability of learned skills has been an intuitive heuristic in prior practical algorithms [2][3]. However, our unique contribution lies in offering a theoretical justification for this heuristic.
2. Regarding concerns related to "diversity and separability"
About:
- Why measuring can be used to measure diversity and separability? ..., but the is just a mutual information for a single code.
- Some Definitions of words are not defined well (diversity and separability).
First of all, LSEPIN is defined as , its mean is not (details in "formal difference" of appendix B.2).
As we have mentioned at the beginning of Section 3: "Separability means the discriminability between states inferred by different skills." In inequality (61) of appendix E, we show that increasing this quantity promotes the KL divergence between the skill's state distribution and the average state distribution of the other skills. As also discussed in appendix E, a higher KL divergence between them means the skill is more distinctive and has less overlap with other skills, which means better separability.
For a set of skills, they are diverse when all skills in the set are distinctive and have little overlap with each other (they are "far" from each other in terms of KL divergence). The standard MISL objective cannot guarantee the separability of each skill, as discussed in Section 3 after the list of informal results, so MISL without LSEPIN cannot guarantee diversity.
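To make the separability argument concrete, here is a small illustrative sketch (with made-up state distributions, not taken from the paper): a skill whose state distribution overlaps heavily with the others has a low KL divergence to the average state distribution of the remaining skills, which is exactly the kind of inseparability being discussed.

```python
import numpy as np

def kl(p, q):
    # KL divergence between discrete distributions (assumes q > 0 where p > 0).
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Hypothetical state distributions of three skills over four states.
skills = np.array([
    [0.7, 0.1, 0.1, 0.1],   # skill 0
    [0.1, 0.7, 0.1, 0.1],   # skill 1: well separated from skill 0
    [0.6, 0.2, 0.1, 0.1],   # skill 2: overlaps heavily with skill 0
])

for z in range(len(skills)):
    others = np.delete(skills, z, axis=0).mean(axis=0)
    print(f"skill {z}: KL to others' average = {kl(skills[z], others):.3f}")
# Skill 2 scores far lower than skill 1, flagging it as poorly separated.
```

Deleting the poorly separated skill changes the overall state coverage very little, which is why a per-skill separability measure is informative here.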
3. Regarding why focus on single skill code
In LSEPIN, the metric measures the least skill code. ... How could you ensure that all the codes are meaningful in the computation of LSEPIN?
It is important to ensure every learned skill is distinctive and separable from the others. A high value of this quantity means that the skill covers a specific region of the state space that is less covered by other skills. If a skill is not separable from the other skills, its state coverage could overlap heavily with theirs, so deleting this skill would make no difference for exploration or downstream task adaptation.
In [1], many existing algorithms are implemented with a low number of skills, e.g., only 4 skills for SMM. If one of those skills exhibits a low value, it may have a huge overlap with the other skills, leading to a situation where 25% of the skills are underutilized, offering no meaningful contribution to exploration and downstream task adaptation.
4. Regarding the necessity of symmetry and triangle inequality
Could you please additionally describe the necessity of symmetry and triangle inequality?
This topic is discussed in appendix B.3, and appendix F.7 shows an example of what could happen if we replace the Wasserstein distance in WSEP with KL divergence: due to the lack of the triangle inequality, maximizing the resulting metric could place skills too close together, harming diversity.
Symmetry is important when comparing distances between different pairs of points. With a non-symmetric measure, the two directions between a pair of points can disagree, so it is possible that one pair appears more separable than another in one direction but not in the other; from such measurements you cannot tell which pair is truly more separable.
Besides, the state distributions of two different skills could have different supports, which is a problem for a well-defined (finite) KL divergence.
5. Regarding limitations of WSEP
What is the limitation of WSEP?
-
Theoretically, as discussed in Section 3.3.4 and appendix F.6, although WSEP also discovers vertices, it is not guaranteed to discover all vertices with the given number of skills. This motivates us to propose the PWSEP algorithm, which solves the vertex discovery problem with that same number of skills.
-
Practically, the implementation of the Wasserstein distance depends on the choice of transportation cost. For continuous control, the transportation cost is commonly chosen as the L2 norm. However, the L2 norm between two states might not accurately measure how difficult it is to travel from one state to the other, because there might be obstacles between them.
As discussed in appendix G.4, an idea for future work is to learn state representations so that the L2 norm in the representation space can reflect the actual difficulty of traveling from one state to the other.
6. Regarding the organization of the paper and readability
- The flow of the paper is weakly organized and hurts readability (the first topic is LSEPIN and the second WSEP.) The paper have no discuss on other perspectives.
- Most contents have high dependency with appendix. Although the authors studied various components, they are not well organized.
The logic flow of our paper is like this:
- We try to answer the fundamental question of URL: "How can the learned skills be used for downstream task adaptation, and what properties of the learned skills are desired for better downstream task adaptation?"
- We found that LSEPIN captures the desired properties necessary for preparing skills for downstream task adaptation.
- We found that LSEPIN and MISL essentially optimize KL divergences, and that MISL with LSEPIN still has limitations such as the one mentioned in remark 3.2.1. This inspired us to investigate whether we can overcome the limitations of MISL by optimizing true metrics between state distributions.
- We found that WSEP and PWSEP with the true metric Wasserstein distance can overcome this limitation and discover potentially optimal skills that cannot be discovered by MISL.
Thank you for pointing out the readability and organization issues that can be further enhanced. We could replace some sentences with tables or diagrams.
7. Regarding the relation to in-distribution and out-distribution perspective of task adaptations
Could this study can be combined with in-distribution and out-distribution perspective of task adaptations?
First of all, in the unsupervised RL setting, the target task distribution is unknown during training; the agent can only learn from intrinsic motivations, such as intrinsic rewards for mutual information maximization or WSEP. Therefore, it is unlikely that the learned skills cover the target task distribution. The URL setting should be closer to the out-of-distribution meta-RL setting, where the skills learned by intrinsic rewards can be considered the training tasks and the downstream tasks can be considered out-of-distribution test tasks. One future research idea is how to combine URL approaches with training tasks so that the pretrained agent is well prepared for out-of-distribution downstream tasks.
Reference
[1] Laskin et al., 2021. URLB: Unsupervised Reinforcement Learning Benchmark.
[2] Eysenbach et al., 2019. Diversity Is All You Need: Learning Skills without a Reward Function.
[3] Laskin et al., 2022. CIC: Contrastive Intrinsic Control for Unsupervised Skill Discovery.
Thanks again for your constructive feedback!
Thanks again for your valuable feedback on our paper. Our rebuttal addresses your concerns, especially the ones related to the definition of the WAC cost. By WAC, we actually consider the practical adaptation procedure you mentioned, which is to adapt from the 'closest' skill. We hope this clarifies our main contribution.
We wonder whether you have any remaining concerns; we look forward to addressing any additional concerns you may have.
Thank you for the detailed explanations on specific questions.
- Regarding the concerns related to the "WAC" cost and Theorem 3.1: Thank you for clarifying the definition. The definition based on the unknown optimal distribution is persuasive.
- Thank you for clarifying the definition of LSEPIN and separability; please refer to B.2 in the paper. The definition helps me understand separability.
- Thank you for the additional explanation of the meaning of separable skill learning.
- I understand the impact of the lack of the triangle inequality. Thank you for the explanation. I appreciate your effort to provide additional comments on the questions, especially in sections 5.6 and 5.7. Thank you.
Here are additional comments to clarify the problem tackled in this work. The two terms "diversity" and "separability" are used jointly in this work. In my understanding, diversity is about the coverage of skills as in [1], while separability is defined with the proposed per-skill metric. To the best of my understanding, the main contribution is on separability. Is the contribution of this work on both properties? (related to question 6, organization of the paper) I also checked the reviews from the other reviewers, and I agree that including the main experimental results would improve readability. [1] Eysenbach, Benjamin, et al. "Diversity Is All You Need: Learning Skills without a Reward Function." International Conference on Learning Representations. 2018.
Thanks for the reply and your valuable perspective. We consider separability a prerequisite of diversity. Without separability, even a large number of skills could be close to each other and cover only a small region of the feasible state distribution polytope; skills need to be distinctive and separable before they can be diverse.
We showed at the beginning of Section 3, with the example of Fig. 2, that maximizing the mutual information alone does not necessarily guarantee separability, consequently resulting in limited diversity.
The approach of [1] was based on the heuristic that maximizing the mutual information together with the entropy of the skill distribution will result in diverse skills. However, they did not provide rigorous justification for this heuristic, and we showed in the response to reviewer ZXDZ, with the example from the anonymous link, that such maximization could still result in duplicated or inseparable skills, leading to limited diversity.
We genuinely value the suggestion of moving experimental results to the main paper for better readability. Since our results are mainly theoretical, there would be a trade-off between theoretical and experimental results. Nonetheless, we will carefully consider how to integrate some experiments without compromising the theoretical emphasis.
Dear Authors,
Thank you for the kind response.
Although the paper still exhibits weaknesses in its experimental support and the organization of its flow, the responses provided have clarified my major concerns. Therefore, I will raise my score to 'weak accept' (6).
Sincerely,
Reviewer j7aV
The paper provides a theoretical analysis of learning skills using unsupervised reinforcement learning (URL), which serves as an initialization for learning a policy for a downstream task. The paper shows that Mutual Information Skill Learning (MISL) does not guarantee diversity and separability of learned skills, and proposes to replace the uniform distribution of skills objective with the Least SEParability and INformativeness (LSEPIN) metric to promote informativeness and separability. Moreover, the paper proposes to replace the KL divergence in MISL with Wasserstein distance that exploits better geometric properties. Finally, it proposes another Wasserstein distance-based algorithm (PWSEP) that can theoretically discover all optimal initial policies.
The authors show theoretically that LSEPIN bounds the Worst-case Adaptation Cost (WAC) and show that the Wasserstein distance has better geometric properties (such as symmetry and the triangle inequality) that lead to better skill separation. In addition, the authors provide experiments in the appendix validating the proposed theoretical results.
Strengths
The paper investigates an important topic and provides a rigorous analysis of the proposed ideas. In addition, it proposes a practical algorithm that was tested empirically and demonstrates superior results compared to existing MISL methods.
Weaknesses
The paper is hard to read and follow, with lots of details.
Since most of the contribution of the paper is placed in the appendix, including all experimental results, it is hard to understand and assess its contribution without reading carefully the appendix.
Questions
I would like to ask the following questions:
-
Is it possible to rigorously prove that adding a loss that promotes uniform p(Z_input) to objective (1) does not promote any diversity? Or promotes less diversity than the LSEPIN loss in all cases?
-
Is there a measure for adaptation other than Worst-case Adaptation Cost?
-
Is there any advantage of the KL divergence over the Wasserstein distance, theoretically or computationally?
-
What are the limitations of the PWSEP algorithms (specifically the WSEP objective)?
We appreciate your time and attention. We thank you for the review and comments and please see our response to your questions below.
Q1: Is it possible to rigorously prove that adding a loss that promotes uniform to objective (1) does not promote any diversity? Or promotes less diversity than the LSEPIN loss in all cases?
A1: It might be hard to derive a quantitative bound showing that adding such a loss to objective (1) promotes less diversity than LSEPIN in all cases, because the diversity of state distributions depends not only on the skill distribution but also on the parameterization and the function class of the policy. However, we can illustrate why such a loss is not guaranteed to promote diversity like LSEPIN with the following example, where maximizing the objective with a uniform skill distribution results in limited diversity:
For a case of with , the feasible polytope allows a maximum "circle" centered at with a maximum "radius" of as this fig in the anonymous link shows. There are 5 vertices of , of which lie on the maximum "circle". Vertex and vertex are and .
Because is shared by all skills and , we can assume
When there are 4 skills to learn, a uniform skill distribution should assign equal probability to all skills. In order to achieve both a uniform skill distribution and maximization of the objective, the optimal skill set should be the one containing two skills at one of these two vertices and the other two skills at the other, as shown in the figure. Only with this skill set can the uniform average of the skills be the center of the maximum "circle".
We can see from this example that although this skill set maximizes both objectives, two pairs of skills are not separable, thus resulting in limited diversity.
In appendix E, we have shown that a higher per-skill value explicitly increases the divergence from the average state distribution of the other skills, promoting each skill to be separable from them. Therefore, LSEPIN promotes diverse skills explicitly. Compared to skills only at the two duplicated vertices, MISL with LSEPIN would favor skills at distinct vertices, and for the average to be the center of the maximum "circle", the skill distribution is not necessarily uniform.
Q2: Is there a measure for adaptation other than Worst-case Adaptation Cost?
A2: This question is inspiring and we have been also considering this question recently. One idea is to consider adapting from a convex combination of learned skills instead of from one of the learned skills because a good convex combination of learned skills can be "closer" to the optimal state distribution and practically feasible to obtain.
For example, in Figure 2 of our main paper, a convex combination of skills could be closer to the optimal state distribution than any single learned skill. In practice, the convex-combined state distribution can be obtained by sampling skills from the corresponding distribution.
For the practical adaptation procedure, we can first find the combination resulting in the best accumulated reward. Because we cannot directly update the parameters of the combined policy, we could use a new parametric model to perform offline RL on data collected by the combination and relabeled by the downstream task reward function. Through this approach, we concurrently distill knowledge from pretraining and perform adaptation for the downstream task.
Theoretical analysis of this adaptation procedure is an idea for future work.
Q3: Is there any advantage to the KL divergence over the Wasserstein distance? theoretically or computationally?
A3: Computationally, for non-parametric estimation with collected data, KL divergence may be preferred over Wasserstein distance. This is because KL divergence can be directly estimated from samples, whereas estimating the Wasserstein distance requires solving a linear program, introducing additional computational complexity. Theoretically, KL divergence is always strictly convex, while the strict convexity of the Wasserstein distance depends on the choice of transportation cost, as mentioned in our proofs in appendix F.
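The computational contrast can be sketched as follows (a toy illustration, not the paper's estimator): a plug-in KL estimate is a one-line sum over discrete distributions, while a discrete Wasserstein distance requires solving a linear program over transport plans, here via `scipy.optimize.linprog`.

```python
import numpy as np
from scipy.optimize import linprog

def kl(p, q):
    # Plug-in KL estimate between discrete distributions: a direct sum.
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def w1(p, q, cost):
    # Discrete Wasserstein distance: a linear program over transport plans
    # gamma >= 0 with row marginals p and column marginals q.
    n, m = len(p), len(q)
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0   # sum_j gamma[i, j] = p[i]
    for j in range(m):
        A_eq[n + j, j::m] = 1.0            # sum_i gamma[i, j] = q[j]
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=np.concatenate([p, q]),
                  bounds=(0, None), method="highs")
    return float(res.fun)

p = np.array([0.5, 0.5, 0.0])
q = np.array([0.0, 0.5, 0.5])
cost = 1.0 - np.eye(3)                     # unit cost between distinct states
print(w1(p, q, cost))                      # 0.5: move half the mass one step
print(kl(np.array([0.5, 0.3, 0.2]),
         np.array([0.4, 0.4, 0.2])))       # a single vectorized sum
```

As an aside, `kl(p, q)` would not even be finite for the pair used in the Wasserstein line, since their supports differ, which is why a different pair is used for the KL call.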
Q4: What are the limitations of the PWSEP algorithms (specifically the WSEP objective)?
A4: As mentioned in Section 3.3.4, although lemma 3.3 shows that maximizing WSEP also discovers vertices, the discovered vertices could be duplicated (shown in appendix F.6), so it might not be able to discover all vertices with a limited number of skills. This motivates us to propose the PWSEP algorithm, which iteratively optimizes a projected Wasserstein distance to make sure every new iteration learns a new skill.
Another limitation of WSEP is discussed in appendix F.5: although a higher WSEP lowers an upper bound on the adaptation cost, as our theoretical analysis shows, because of the gap between the upper bound and the actual adaptation cost, a high WSEP sometimes does not result in a lower adaptation cost.
As for PWSEP, despite its favorable theoretical property of discovering all vertices, in practice each iteration may only learn a locally optimal skill, as mentioned in the empirical results of appendix H.
Thanks again for the constructive feedback!
Thank you for your dedication in answering my questions.
-
I appreciate your effort to provide this illustrative example. It clarifies the difference between promoting uniform distribution of skills to promoting diversity. In my opinion, including this example in the paper/appendix would be valuable for the reader to clarify this point.
-
The idea of performing adaptation from a convex combination of learned skills instead of an adaptation from one of the learned skills sounds promising and has the potential to work better in practice. A theoretical analysis of this idea will most probably lead to the same conclusion as with adapting from one of the learned skills.
-
If I understand correctly, since the KL divergence is always strictly convex, for complex RL environments the KL divergence could be more computationally practical - whereas the Wasserstein distance will provide an optimal solution, but with additional computational effort. Am I correct?
-
Could you elaborate on the impact of PWSEP learning only a locally optimal skill in each iteration? In your answer, please relate to the empirical results (section H) and in general.
I have another small question about a detail that I probably missed while reading the paper - Why is your approach free from the “non-concyclic” assumption, while the previous work of Eysenbach et al. (2022) takes this assumption into account? (The assumption that limits the number of vertices on the same “circle” to be |S|)
In general, I think that the paper provides a worthy contribution to the community and should be accepted. The authors answered most of my concerns and I’m willing to increase my score.
That being said, the readability of the paper can be improved. I think that the paper would benefit from a rigorous definition of separability and diversity of skills at the beginning, accompanied by a few sentences dedicated to motivation and examples (see point #1). In addition, including the important empirical results in the main paper will motivate the applied RL community to make use of and build upon the proposed algorithms.
Thanks again for your detailed answers.
Thanks very much for your valuable feedback and suggestions to further improve our work.
- It's a good idea to mention this example in the main paper and include its details in the appendix; we will definitely consider adding it in the revision.
- Yes, it should be related to how much of the feasible polytope is covered by the convex hull of learned skills. Therefore intuitively, the learned skills should also be separable and diverse.
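To make the convex-hull intuition concrete, here is a minimal sketch (toy numbers, not from the paper; the function name and distributions are our own for illustration) that checks whether a target state distribution is covered by the convex hull of the learned skills' state distributions, using a small linear program:

```python
# Hypothetical sketch: test whether a target state distribution lies in the
# convex hull of per-skill state distributions via a feasibility LP.
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(skill_dists, target):
    """skill_dists: (k, |S|) array, one state distribution per skill.
    target: (|S|,) state distribution. Returns (feasible, weights)."""
    k = skill_dists.shape[0]
    # Find weights w >= 0 with sum(w) = 1 and w @ skill_dists = target.
    A_eq = np.vstack([skill_dists.T, np.ones((1, k))])
    b_eq = np.concatenate([target, [1.0]])
    res = linprog(c=np.zeros(k), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, 1)] * k, method="highs")
    return res.success, (res.x if res.success else None)

# Three toy skills, each concentrating on a different state.
skills = np.array([[0.8, 0.1, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.1, 0.1, 0.8]])
inside, w = in_convex_hull(skills, np.array([1/3, 1/3, 1/3]))   # covered
outside, _ = in_convex_hull(skills, np.array([0.9, 0.05, 0.05]))  # not covered
```

The uniform target is reachable as an equal-weight mixture of the three skills, while a target more concentrated than any single skill falls outside the hull; separable and diverse skills enlarge the set of coverable targets.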
- Yes, it is correct.
- We can look at Figure 5(e) in the empirical results of Appendix H. The red skill is learned last, so it moves away from all other skills instead of going downwards into the undiscovered branch of the tree maze, and it ends up at a possible local optimum.
About the question regarding the "non-concyclic" assumption:
The “non-concyclic” assumption basically restricts the solution set of MISL to be unique. Under this assumption, there is only one unique set of skills that maximizes the mutual information objective. Without this assumption, there could be different sets of skills learned by MISL, and the WAC cost favors the one with the best LSEPIN.
In practice, it is common that the "non-concyclic" assumption does not hold and the solution set for MISL is non-unique. For example, different sets of skills can all be considered to have maximized the mutual information objective as long as there is no overlap between any two of their state distributions, because the KL divergence can be considered maximized when there is the least overlap between distributions. Moreover, even if the MDP satisfies the "non-concyclic" assumption, the practical solutions could be suboptimal and lie on a "circle" with a smaller "radius", so all points on this "circle" are within the feasible polytope, resulting in non-unique suboptimal solution sets. LSEPIN could also benefit WAC in these suboptimal cases, as discussed in Appendix C.4.
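The point that KL saturates once supports stop overlapping, while the Wasserstein distance keeps tracking geometry, can be illustrated with a small sketch (toy distributions of our own, not from the paper):

```python
# Illustrative sketch: once two discrete state distributions have disjoint
# supports, KL divergence is already "maximal" regardless of how far apart
# the supports are, while the 1-D Wasserstein distance still grows with the
# separation between them.
import numpy as np
from scipy.stats import wasserstein_distance

states = np.arange(10)

def kl(p, q, eps=1e-12):
    """Smoothed KL divergence between discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

p = np.zeros(10); p[0] = 1.0            # skill 1 visits state 0
q_near = np.zeros(10); q_near[1] = 1.0  # skill 2 one state away
q_far = np.zeros(10); q_far[9] = 1.0    # skill 2 nine states away

# KL cannot distinguish "near" from "far" once supports are disjoint,
kl_near, kl_far = kl(p, q_near), kl(p, q_far)
# but the Wasserstein distance reflects the underlying geometry.
w_near = wasserstein_distance(states, states, p, q_near)
w_far = wasserstein_distance(states, states, p, q_far)
```

Here `kl_near` and `kl_far` are identical (both at the smoothing cap), whereas `w_near` is 1 and `w_far` is 9, matching the discussion above that KL treats all non-overlapping solution sets as equally "maximized".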
Regarding the presentation, we greatly appreciate the suggestion to add "a rigorous definition of separability and diversity of skills at the beginning, accompanied by a few sentences dedicated to motivation and examples", and this will be done in the revision. Since there is a trade-off between space for theoretical and experimental results, we will carefully consider how to integrate some experiments without compromising the theoretical emphasis.
Thanks again for your support and thoughtful advice.
Thank you for your detailed response!
It helps me very much to understand the details in the paper further. I raised my score to 8.
This paper analyzes unsupervised skill learning through a rigorous mathematical lens, focusing on the properties of the learned skills and their usefulness for downstream tasks. Most existing works use mutual information as the skill-learning objective, and Eysenbach et al. (2022) provide a mathematical analysis of the same. This work analyzes the mutual information-based skill learning paradigm with a focus on downstream task adaptability via the worst-case adaptation cost. It also introduces a complementary metric called LSEPIN to measure the diversity of learned skills. The authors show that maximizing MI or LSEPIN essentially optimizes the KL divergence between state distributions. As an alternative, they suggest using the Wasserstein metric, owing to its being a proper metric, and propose new skill learning objectives built upon the Wasserstein distance.
References:
Benjamin Eysenbach, Ruslan Salakhutdinov, and Sergey Levine. The information geometry of unsupervised reinforcement learning. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URL https://openreview.net/forum?id=3wU2UX0voE.
Strengths
There is a long line of work on unsupervised skill learning based on mutual information maximization between states and skills, most of it motivated by intuition and empirical performance. This work complements Eysenbach et al. (2022) by providing a rigorous understanding of the properties of the learned skills and offers useful insights. The analysis presented in this work is novel to the best of my knowledge, and comprises a fairly significant advancement of our understanding of this sub-area of RL. The quality of analysis and writing is satisfactory, with sufficient background and context provided before explaining the main results of the paper.
References:
Benjamin Eysenbach, Ruslan Salakhutdinov, and Sergey Levine. The information geometry of unsupervised reinforcement learning. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URL https://openreview.net/forum?id=3wU2UX0voE.
Weaknesses
This is a fairly strong submission which checks all the boxes. The only minor complaint is that the empirical results in Appendix H should ideally be part of the main paper, since including them makes the submission more well-rounded and gives empirical validation for the results presented in Section 3.
Questions
None
We sincerely appreciate your time and attention and thank you for your valuable feedback! We will consider your suggestion for the presentation.
This paper provides theoretical analysis for Unsupervised Reinforcement Learning (URL) and proposes a new metric, LSEPIN, and a new objective, WSEP, which have better theoretical properties than the commonly used Mutual Information Skill Learning (MISL). The theoretical analysis is validated on a few environments. This paper provides one of the earliest analyses of an important sub-field of RL. The quality of analysis and writing is satisfactory. We thus recommend acceptance.
Why not a higher score
Empirical validation can be made more comprehensive.
Why not a lower score
The paper addresses an important problem. Quality is great.
Accept (spotlight)