PaperHub
Overall rating: 5.0 / 10 (Poster · 4 reviewers · min 4 · max 6 · std 0.7)
Individual ratings: 5, 5, 4, 6
Confidence: 4.3 · Correctness: 2.8 · Contribution: 2.3 · Presentation: 2.3
NeurIPS 2024

Learning Successor Features the Simple Way

OpenReview · PDF
Submitted: 2024-05-14 · Updated: 2024-11-06
TL;DR

A simple approach for learning Successor Features from pixel-level observations for Continual Reinforcement Learning

Keywords

deep reinforcement learning, representation learning, continual reinforcement learning, successor feature, successor representation

Reviews and Discussion

Review (Rating: 5)

This paper presents a method for learning Successor Features (SFs) from pixel-level observations in reinforcement learning (RL) by combining a Temporal-Difference (TD) loss with a reward prediction loss. This approach simplifies the learning process, improves performance, and speeds up learning compared to existing methods.

Strengths

The proposed method is simple and easy to implement, which is a significant advantage in practical applications. The effectiveness of the approach is demonstrated in both simple 2D and 3D environments.

Weaknesses

  1. The writing quality of the paper could be improved. The Introduction section reads more like an extensive review of related work rather than setting the context for the proposed method.
  2. The first three subplots in Figure 1 are difficult to understand without detailed background information. It is recommended to move these figures to the experimental section.
  3. The statement in lines 61 to 62, "without any of the drawbacks," seems too absolute and should be toned down.
  4. The experiments only demonstrate effectiveness in a few simple 2D and 3D environments. To further validate the proposed method, it is recommended to test in more complex environments, such as Atari games, similar to APS.
  5. The experiments only tested the method with DQN, leaving its effectiveness with other RL methods unknown.

Questions

Have the authors tested the proposed method with other RL algorithms and in more complex environments?

Limitations

In addition to the limitations mentioned by the authors in the paper, please refer to my comments in the Weaknesses section for further details.

Author Response

Thank you for your feedback. We appreciate the opportunity to clarify and enhance our manuscript based on your observations. Please let us know if there is further clarification we can provide.

1. Balancing Context and Review of Related Work

Thank you for your feedback on the structure of our introduction. We will revise it to better balance the context setting with the review of related work, enhancing both readability and clarity. In response, we will incorporate the following into the introduction of our manuscript:

Successor Features (SFs) are crucial in continual RL for decoupling environmental dynamics from rewards. Yet, current SF implementations often face representation collapse when learning from pixels, due to reliance on predefined assumptions and extensive pre-training. Our approach addresses these limitations by integrating an innovative neural network architecture that enhances computational efficiency and scalability.

We have validated our method with experiments in 2D and 3D mazes and the Mujoco environment (added during the rebuttal phase), detailed in Figure 1 of the General Response (GR). Our findings show enhanced learning efficiency and adaptability, proving our model's broad applicability in RL scenarios.
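For context (not part of the proposed text above), the decoupling we refer to is the standard SF decomposition of Barreto et al. [2017]; the exact notation in our paper may differ slightly:

$
\psi^{\pi}(s,a) = \mathbb{E}_{\pi}\left[ \sum_{i=0}^{\infty} \gamma^{i}\, \phi(S_{t+i}, A_{t+i}) \,\middle|\, S_t = s, A_t = a \right], \qquad r(s,a) = \phi(s,a)^{\top} w, \qquad Q^{\pi}(s,a) = \psi^{\pi}(s,a)^{\top} w
$

so that the basis features $\phi$ (and hence the SFs $\psi$) depend only on the environment's dynamics, while all reward information is carried by the task-encoding vector $w$.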

2. Placement and Purpose of Representation Collapse Analysis Plots in Figure 1.

Thank you for your feedback on the placement of the representation collapse analysis plots in Figure 1. We positioned these plots early in the manuscript to establish the central motivation of our research. While representation collapse is a well-known issue in Machine Learning, its empirical analysis within the context of SFs is a novel aspect of our work, warranting prominent placement to set the stage for the discussions that follow.

Introducing these plots at the beginning ensures that readers immediately understand the significance of the challenge we are addressing. This approach supports a cohesive narrative by linking the theoretical motivations directly with our proposed solutions and experimental validations.

Moving these plots to a later section, such as the experimental results, could disconnect them from their theoretical context and reduce their impact on framing the research problem.

3. The Exclusion of Certain Drawbacks in Our Method

We appreciate the chance to clarify the use of “without any of the drawbacks” in our manuscript. To address your concern, we will amend the phrase to “without some of the drawbacks.”

4. Further evaluation in complex environments

Thank you for the suggestion to use Atari benchmarks. APS's Atari setup involves pre-training and fine-tuning in environments whose features and reward functions do not vary, which makes it less suitable for assessing continual learning capabilities.

Instead, we opted for a comprehensive evaluation in Mujoco, utilizing pixel-based observations [6], which further demonstrates our model's capabilities with continuous actions. We started in the half-cheetah domain, rewarding agents for running forward in Task 1. For Task 2, we introduced scenarios with running backwards, running faster, and switching to the walker domain. These are detailed in Figure 1 in the GR.

Across all scenarios, our model not only maintained high performance but consistently outperformed all baselines in both Task 1 and Task 2, highlighting its superior adaptability and effectiveness in complex environments. This contrasted sharply with other SF-related baseline models, which struggled to adapt under these conditions.

5. Exclusivity to DQN

Thank you for your comment on the scope of our experiments. There are two reasons why we chose the baselines that we did.

First, we did not compare only to DQN; we also compared to several other techniques for learning SFs, such as those using reconstruction or orthogonality constraints. We did this because learning SFs is the focus of our work, so these techniques are the most relevant baselines for the question of learning SFs.

Second, we chose DQN as a non-SF baseline because of its direct relation to the mathematical definition of SFs and Q-values, a common practice in SFs literature [1,2,4,5]. This choice helps clarify the specific contributions of our approach in the context of well-understood benchmarks like DQN and DDPG [7].

Moreover, our primary goal was to develop a straightforward method for learning SFs, not to conduct a comprehensive benchmark across various RL algorithms. More complex algorithms do not always lead to better performance, especially in settings with pixel-based observations, as shown in comparisons within the Mujoco environment where simpler algorithms like DDPG often outperform more complex ones like SAC [8] (see Figure 9a in [6]).

While we acknowledge the value of broadening our evaluations to include a wider array of RL algorithms, our focus was on demonstrating the efficacy of our SF learning approach. Exploring performance with additional algorithms remains an important future research direction to enhance the generalizability of our findings.

[1] Machado et al., 2020. Count-based exploration with the successor representation.

[2] Ma, et al., 2020. Universal successor features for transfer reinforcement learning.

[3] Touati et al., 2023. Does zero-shot reinforcement learning exist?

[4] Janz et al., 2019. Successor uncertainties: exploration and uncertainty in temporal difference learning.

[5] Barreto et al., 2017. Successor features for transfer in reinforcement learning.

[6] Yarats et al., 2021. Mastering visual continuous control: Improved data-augmented reinforcement learning.

[7] Lillicrap et al., 2015. Continuous control with deep reinforcement learning.

[8] Haarnoja et al., 2018. Soft actor-critic algorithms and applications.

Comment

Thank you to the authors for their answers to my questions and the additional experiments. This has helped me better understand the work. I am currently inclined to accept this paper and maintain my current scores.

Comment

We are very pleased to hear that our response helped to answer the reviewer’s questions, and that the reviewer is inclined to accept our paper. Given this, we wonder if the reviewer would be willing to raise their score to reflect the fact that we addressed their questions and put the score more clearly in the “accept” range.

Comment

Dear Reviewer nkgZ,

We hope this message finds you well. As today is the final day for the review discussion, we would like to kindly check in to see if our latest response has addressed your concerns. If the clarifications provided have resolved your questions, we would greatly appreciate it if you could update your score accordingly.

Thank you once again for your time and thoughtful feedback throughout this process. Your input has been invaluable, and we look forward to hearing from you soon.

Review (Rating: 5)

The paper proposes a simpler method to learn Successor Features that avoids representational collapse. For this, the authors decompose the loss function to learn the successor features and task encoding separately. This allows for keeping the basis features fixed while learning the successor features, thus avoiding representational collapse. The experiments involve a continual learning scenario where robustness to task changes is evaluated. The authors show that the method can better adapt to changing tasks.

Strengths

  • The problem is well motivated and the approach offers a simple solution to representational collapse when learning deep successor features.
  • The authors provided a number of insightful ablations, especially the finding that reconstruction-based SF methods have trouble learning a good representation in fully observed settings.
  • The writing is clear and the method is presented in an understandable manner.

Weaknesses

  • The tested environments seem to be perhaps too simple for comparison both from the representational and task difficulty perspective, since DQN also has very good relearning capabilities in these environments. Why is it that DQN is better than the successor feature counterparts for the continual learning setting? This seems counterintuitive to me since successor features should be more robust than pure DQN.

  • The presentation of the figures has issues. Some figures are pixelated, i.e., not vector graphics (e.g., Figure 4 and other environment figures). Also, I think Figure 1 could be split into two figures for better readability.

Questions

  • Should the method not be tested on environments where DQN itself cannot adapt to the new tasks at all? I wonder if the simple approach still holds when the transition dynamics become more complicated or the observations are more noisy.
  • For the Minigrid environment: Do you learn the successor features from pixels or do you use the built-in symbolic state representations?
  • Since DQN is also robust in the environments shown, I wonder how reward sparsity affects the performance of the different algorithms?

Limitations

The authors have adequately addressed limitations.

Author Response

Thank you for your feedback. We appreciate the opportunity to clarify and enhance our manuscript based on your observations. Please let us know if there is further clarification we can provide.

1. Is DQN better than SFs in the Continual RL setting?

Thank you for your observations regarding the experimental results in Figures 2 and 3. While average episode returns offer quick performance insights, they do not fully capture the long-term benefits of our model. Thus, we also analyzed cumulative total returns across all tasks, as shown in Figure 2 of the general response (GR).

These results confirm that our model quickly learns and maintains effective policies, especially in complex 3D environments where tasks recur (Figure 2c-d in GR). Our model significantly outperformed the baseline in cumulative returns, demonstrating its robustness and superior transfer capabilities compared to DQN, which showed little to no transfer effects and needed to re-learn tasks.

We will include these results in our manuscript to more comprehensively demonstrate our model's effectiveness in continual learning settings.

2. Figure 4 Quality

Thank you for your feedback on the graphics in our figures. While all our figures are created with vector-based graphics for high resolution and scalability, Figure 4 is an exception. It uses pixel-based graphics to accurately reflect the native format of the RL environments and the inputs our models process.

3. Figure 1

We appreciate the suggestion to split Figure 1 to enhance readability. Acknowledging the density of the current figure, we will implement several modifications:

  1. Simplification: We'll remove the loss functions from Figures 1d and 1e, with detailed descriptions retained in Appendix E and the main text, respectively. This will help focus attention on the structural content.

  2. Reorganization: Figure 1d will be moved to the Appendix as it primarily presents common approaches rather than our novel contributions, ensuring the main text remains focused on our work.

  3. Relabeling and Relocation: Figure 1e will be renamed as Figure 2 and relocated closer to Sections 4 and 5 where it is first mentioned, aligning it more closely with its textual references and enhancing narrative coherence.

  4. Visual Guidance Enhancements: We will replace terms like “Q-SF-TD loss” with “$L_\psi$: Q-SF-TD loss” and introduce color-coded information to improve figure-text integration, such as: “Pixel-level observations, $S_t$, are processed by a convolutional encoder to produce a latent representation $h(S_t)$, which is used to construct the basis features (indicated by a yellow box in Figure 2) and the SFs (indicated by a green box in Figure 2).”

We hope these changes will streamline the presentation and ensure the figures more effectively complement the text.

4. Complexity and noise in environments

Thank you for your comment on our model's effectiveness in complex, noisy environments.

Firstly, our model's resilience to noise was proven in the “3D Slippery Four Rooms environment” (Section 6.1.3), where agents faced altered actions in Task 2. The results (Figure 3) demonstrate our model's superior robustness to induced stochasticity compared to baselines.

Secondly, we expanded our evaluation during the rebuttal phase to include the Mujoco environments, using pixel-based observations and accounting for continuous action spaces. Following the setup in [1], we tested in scenarios like running backwards, running faster, and a major switch from the half-cheetah to the walker domain in Task 2. The outcomes (Figure 1 in GR) show our model consistently outperforming baselines across all scenarios, thereby showcasing its adaptability and effectiveness in more complex settings. These results affirm our model's advanced capability to robustly handle diverse and challenging environments, making it highly suitable for practical applications with complex dynamics and significant noise.

5. Pixel or Symbolic States observations for SFs?

Thank you for your question regarding the input modalities for Successor Features. In our work, we exclusively use pixel observations across all experiments. This choice is intentional, addressing a significant challenge in the field—the direct learning of Successor Features from high-dimensional sensory inputs such as pixels, which, as noted in [2], have historically posed difficulties for conventional methods and remain underexplored in the Successor Features literature [3, 4].

6. Sparse Rewards

Thank you for your question regarding sparse rewards. Like other DQN-based methods, our approach may face challenges in environments with sparse rewards, a recognized issue with bootstrapped learning methods. While our method is tailored for continual reinforcement learning, it is not specifically designed to address sparse rewards.

We acknowledge the need for mechanisms to better manage sparse rewards. Recent findings suggest that reconstruction-based objectives do not always capture task-relevant features effectively in such settings [5]. Integrating techniques that generate intrinsic rewards could help by providing more frequent learning signals.

However, exploring these techniques further is beyond the current scope of our work. Our primary focus remains on demonstrating the viability of our approach in typical continual learning environments, laying the groundwork for future research to more comprehensively tackle the challenges of sparse rewards.

[1] Yarats et al., 2021. Mastering visual continuous control: Improved data-augmented reinforcement learning.

[2] Machado, et al., 2020. Count-based exploration with the successor representation.

[3] Ma, et al., 2020. Universal successor features for transfer reinforcement learning.

[4] Touati et al., 2023. Does zero-shot reinforcement learning exist?

[5] Balestriero, 2024. Learning by Reconstruction Produces Uninformative Features For Perception.

Comment

Thank you for your additional ablations and experiments!

  • I still feel it is somewhat strange that in Figure 2, DQN is still outperforming other SR methods. At some point during a task change it even outperforms your proposed method. I feel the environments don't demonstrate precisely the effectiveness of your method when DQN is outperforming other SR methods.

  • The results regarding continual RL are encouraging.

I will keep my score, but increase my confidence to 4.

Comment

This is a re-submit as it seems that our earlier response did not notify the reviewers via email.

Thank you for taking the time to review our rebuttal. We sincerely appreciate your thoughtful comments and are glad to have the opportunity to provide further clarifications. Please don’t hesitate to reach out if you have any additional questions or concerns.

We appreciate your observation, but we are unclear which specific plot in Figure 2 you are referring to, as in all Continual RL plots (Figures 2e to 2g in our paper), our approach (orange) consistently outperforms DQN (blue). Additionally, if you refer to the plots generated using the total cumulative return in the same setup as Figures 2e to 2g, as shown in Figures 2a to 2c in the general response, it is clearly evident that our approach performed much better in the later tasks.

To emphasize why we presented (moving) average returns per episode instead of cumulative total return plots in our manuscript, it was to demonstrate that we allow learning for the first task to converge before introducing the second and subsequent tasks.

Furthermore, we acknowledge that the smaller size of these figures might make the trends less apparent. Therefore, we encourage you to refer to the larger illustrations in Appendix G (Figures 12 to 16), where the replay buffer is not reset to simulate conditions with less interference between task switches. Even under these conditions, our approach (orange) consistently demonstrates superior learning performance compared to DQN (blue).

While the performance improvements in the simpler 2D minigrid environments (Center-Wall and Inverted-LWalls) are less pronounced, they remain significant. In contrast, the more complex 3D Four Rooms environment shows a clearer advantage of our method, as seen in Figures 12 and 13. This trend highlights the robustness of our approach, particularly as task complexity increases, further validating the effectiveness of our method across diverse environments.

Moreover, the newly added results during the rebuttal phase, which utilize the more complex Mujoco environment, also show that our method (orange) outperforms DDPG (blue), a variant of DQN designed for continuous actions.

All these results clearly demonstrate that our method, Simple SF (orange), learns more effectively than DQN and DDPG (blue). This superior performance is due to our method's ability to better generalize and transfer knowledge between tasks, as evidenced by the larger improvements in cumulative total returns when the agent re-encounters the tasks (Exposure 2 in Figure 2 in the General Response).


Comment

Dear Reviewer Tczf,

We hope this message finds you well. As today is the final day for the review discussion, we would like to kindly check in to see if our latest response has addressed your concerns. If the clarifications provided have resolved your questions, we would greatly appreciate it if you could update your score accordingly.

Thank you once again for your time and thoughtful feedback throughout this process. Your input has been invaluable, and we look forward to hearing from you soon.

Review (Rating: 4)

This work presents a model architecture to learn successor features in reinforcement learning. It consists of optimizing Eqs. (5) and (6), i.e., a loss for learning the features and a loss for learning the task specific weights. It claims to avoid representation collapse. Experiments are conducted in common 2D and 3D tasks to show that the proposed method can achieve better performance and higher sample efficiency.

Strengths

  • A simple method that is easy to understand, and the presentation is easy to follow
  • Reasonable performance in the experiments

Weaknesses

  • Even though the paper claims that using the reward to train the task weights $w$ is new, this approach has been discussed before (Ma et al., 2020). Specifically, it has been shown that using the reward $r$ to train $w$ is inferior to using the $Q$ values (Appendix D, Ma et al., 2020). Of course, the two algorithms are not identical, but it remains unclear why using the reward to learn $w$ is the right choice in the current paper. It is necessary to discuss this prior work.
  • Several other design choices need further explanation. Lines 130-139 present multiple design choices for the model architecture without proper discussion. For example, why are the L2 normalization and the layer normalization required? Why do we need to stop the gradient at those specific places? It would be better to have ablation studies to show the importance of these choices.
  • Experiment results require further analysis. There is barely any analysis or reasoning for the proposed method. Each subsection in Sec.6 ends with "our method is better" without properly addressing why it can achieve better performance. This question is not answered in Sec.7 either. Moreover, the improvement over existing methods is only marginal (see Figs.2&3). More importantly, the quantitative results in Table 1 only show marginal improvement over Orthogonality with a significant performance overlap.

Minor comment: Eqs.(5)&(6) are for scalars so a norm is unnecessary.

Reference:

  • Ma, C., Ashley, D.R., Wen, J. and Bengio, Y., 2020. Universal successor features for transfer reinforcement learning. arXiv preprint arXiv:2001.04025.

Questions

See the weaknesses above.

Limitations

The authors adequately addressed the limitations

Author Response

Thank you for your feedback. We appreciate the opportunity to clarify and enhance our manuscript based on your observations. Please let us know if there is further clarification we can provide.

1. Important differences between the Universal SFs and our approach

While both our study and [1] utilize reward prediction loss, our approach integrates this with additional losses to directly learn SFs from pixel observations—a significant departure from [1].

Our model:

  1. Learns basis features directly from pixels, unlike [1] which uses pre-determined basis features.

  2. Does not assume prior knowledge of task specifics, contrasting with [1] where this is required, making our approach more applicable in continually changing environments.

  3. Integrates SFs directly into the Q-value function, simplifying and streamlining the learning without the need for redundant losses as seen in [1].

  4. Ensures SFs play a crucial role in performance, contrary to [1] where SFs play a minimal role due to the low weighting of SF loss.

2. Reward Integration & Stop-gradient Operator

The key distinction in our approach lies in the application of the stop-gradient operator during the learning of the task-encoding vector w with the reward prediction loss (Eq. 6 in our manuscript). Unlike in Universal SFs [1], where the basis features and the task-encoding vector w are learned concurrently, our approach prevents updates to the basis features during this learning phase using a stop-gradient operator. This difference is crucial, as we further demonstrated in an ablation study, 'Basis-Rewards' (Figure 4 in GR), since concurrent learning has been shown to degrade learning efficiency (Figure 10 in Appendix D of [1]).
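To make the decoupling concrete, here is a minimal PyTorch-style sketch of the two updates (module names, shapes, and the exact placement of the stop-gradients are illustrative assumptions, not our verbatim implementation):

```python
import torch
import torch.nn.functional as F

B, obs_dim, n_feat, n_act, gamma = 32, 100, 64, 4, 0.99

encoder = torch.nn.Linear(obs_dim, n_feat)          # stand-in for the convolutional encoder
sf_net = torch.nn.Linear(n_feat, n_feat * n_act)    # maps basis features to per-action SFs
w = torch.nn.Parameter(torch.randn(n_feat))         # task-encoding vector
opt = torch.optim.Adam(list(encoder.parameters()) + list(sf_net.parameters()) + [w], lr=1e-3)

obs, next_obs = torch.randn(B, obs_dim), torch.randn(B, obs_dim)
act, rew = torch.randint(0, n_act, (B,)), torch.randn(B)

phi = encoder(obs)                                   # basis features
psi = sf_net(phi).view(B, n_act, n_feat)             # SFs for every action
psi_a = psi[torch.arange(B), act]                    # SFs of the taken action

# Q-SF-TD loss (Eq. 5-style): the SFs (and encoder) are trained through Q = psi^T w.
# Detaching w here is our assumption; the reward-prediction loss below is what trains w.
q = (psi_a * w.detach()).sum(-1)
with torch.no_grad():
    q_next = (sf_net(encoder(next_obs)).view(B, n_act, n_feat) * w).sum(-1).max(-1).values
loss_psi = F.mse_loss(q, rew + gamma * q_next)

# Reward-prediction loss (Eq. 6-style): only w is updated; the detach() is the
# stop-gradient that keeps the basis features fixed during this update.
loss_w = F.mse_loss((phi.detach() * w).sum(-1), rew)

opt.zero_grad()
(loss_psi + loss_w).backward()
opt.step()
```

In the full agent, a replay buffer and target network would replace the toy tensors above; the point is only that the SFs never need their own canonical SF-TD loss and the basis features never receive gradients from the reward-prediction term.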

3. Insufficient analysis

We respectfully disagree with the assertion that our work contains “barely any analysis.” Our work goes beyond theoretical discussions of representation collapse by providing empirical evidence (Figures 1a-c) and a clustering analysis (Figure 1c) that validate our method's effectiveness. This is complemented by a mathematical proof sketch in Appendix C, which explicates the gradient projections in our model, enhancing its applicability in continual learning scenarios. Our comprehensive analysis also includes computational overhead comparisons (Figure 6) and ablation studies (Figure 4 in GR) that reinforce the efficiency and effectiveness of our approach. Additionally, we will include a proof sketch detailing conditions under which representation collapse can occur (see rebuttal to reviewer 7tgV).

4. Improvements are marginal

Thank you for your observations regarding the experimental results, specifically highlighted in Figures 2 and 3. Your feedback aligns with concerns previously noted by Reviewer Tczf regarding the apparent modest gains when measured using average episode returns.

While average episode returns offer quick performance insights, they don't capture the full benefits of our approach. Hence, we've also evaluated cumulative total returns across tasks, which better reflect the agent’s ability to quickly learn and maintain effective policies over time.

Our analysis, included in Figure 2 in the GR, consistently demonstrates significant improvements from our model compared to the baselines across various environments. Specifically, in the complex 3D environment, our model demonstrated significant improvement in cumulative returns, especially when the agent re-encountered previous tasks, highlighting its enhanced transfer capabilities and effectiveness in continual learning scenarios.

5. L2 normalization and layer-norm

L2-normalization is applied to both the basis features and the task-encoding vector w, as commonly done [2,5]. We also normalize w before it enters the Features-Task network. These normalizations ensure a consistent scale across inputs, enhancing optimization and training stability and preventing any single feature from disproportionately influencing the learning process due to scale differences.

Additionally, we use layer-normalization within the Features-Task network to address un-normalized outputs from the encoder, a practice well-established in deep reinforcement learning to improve model robustness by conditioning the gradients [3,4,6,7].
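A small sketch of where these normalizations sit, using the illustrative naming from the sketch above (not our exact architecture):

```python
import torch
import torch.nn.functional as F

n_feat = 64
features_task_net = torch.nn.Sequential(   # stand-in for the Features-Task network
    torch.nn.LayerNorm(n_feat),            # layer norm applied to the un-normalized encoder output
    torch.nn.Linear(n_feat, n_feat),
)

h = torch.randn(8, n_feat)                 # raw encoder output
w = torch.randn(n_feat)                    # task-encoding vector

phi = F.normalize(h, dim=-1)               # L2-normalized basis features
w_n = F.normalize(w, dim=-1)               # L2-normalized w before it enters the network
reward_pred = (phi * w_n).sum(-1)          # scale-consistent inner product used for reward prediction
print(reward_pred.shape, features_task_net(h).shape)
```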

6. Marginal correlation improvement over orthogonality

Thank you for your observation regarding the correlation improvements of SFs learned using our model. It's true that models enforcing orthogonality on features might show high correlations with discrete one-hot SRs due to their structured nature.

However, our empirical findings, presented in Figures 2 and 3 of the manuscript, highlight that despite possible high correlation, SFs learned with orthogonality constraints often suffer from significant learning deficiencies. This issue becomes even more evident in the challenging Mujoco environments, as detailed in Figure 1 of GR.

Furthermore, maintaining orthogonality constraints demands considerably more computational resources (Figure 6). Thus, while improvements in correlation might seem modest, our model offers a more balanced approach in optimizing both performance and computational efficiency.

7. Norm is unnecessary

Thank you for the comment and we will make the revision in the final version.

[1] Ma, et al., 2020. Universal successor features for transfer reinforcement learning

[2] Machado et al., 2020. Count-based exploration with the successor representation.

[3] Yarats et al., 2021. Improving sample efficiency in model-free reinforcement learning from images.

[4] Yarats et al., 2021. Mastering visual continuous control: Improved data-augmented reinforcement learning.

[5] Liu, et al., 2021. Aps: Active pretraining with successor features.

[6] Ball et al., 2023. Efficient online reinforcement learning with offline data.

[7] Lyle et al., 2024. Normalization and effective learning rates in reinforcement learning.

评论

I thank the authors for the additional results and clarifications. Yet, there are some concerns:

1. Differences

1. [1] also uses an NN model to learn the basis features $\phi$ of a state (see Fig.1 of [1]). Whether the feature extractor is pixel-based or not depends on the task.

3. [1] also integrated the SFs directly into the Q-value function (Eq.(3) of [1]).

4. There is no connection between "crucial role" and large weighting. There is no guarantee that larger weighting indicates better performance either, as the scales of different losses can be vastly different across tasks. In fact, I found this argument contradicts point 3 above and also defeats the main point of the current paper. The current paper argues that the canonical SF loss is problematic (Fig.1), but now the rebuttal said that one needs to have a larger weighting for the canonical SF loss so that the SFs can play a "crucial role," whatever that means.

4. Evaluation

It is unclear why the average episode returns and the cumulative total returns can show different trends. Isn't the former equal to the latter divided by the number of test/evaluation runs?

Comment

This is a re-submit as it seems that our earlier response did not notify the reviewers via email.

Thank you for taking the time to review our rebuttal. We sincerely appreciate your thoughtful comments and are glad to have the opportunity to provide further clarifications. Below, we respond to your excellent points. Please don’t hesitate to reach out if you have any additional questions or concerns.

1. Differences

1.1 Basis features

First, we now see that we made a mistake in our rebuttal: indeed, in [1] the basis features are learned. The reviewer is also correct that, in principle, there is no reason that the approach in [1] could not be applied to pixels. However, to the best of our knowledge, in the paper itself, the authors did not perform experiments and studies involving pixel-based observations. Instead, experiments in [1] were conducted using state inputs, which is likely why their architecture consisting of fully connected networks worked well (Appendix F in [1]). In addition, we ran experiments with the loss from [1] to make a more direct comparison. Please see below for the description of those experiments and the results.

1.3. Direct SF Integration in Q-Learning should eliminate the need for redundant Canonical SF Loss

Thank you for your observation. Indeed, [1] does integrate SFs directly into the Q-value function, similar to our approach, but there is a key difference. [1] relies on an additional SF loss (Eq. (4) in [1]), known as the Canonical SF-TD loss in our paper, which our method does not require. Our main contribution lies in the simple (but we believe elegant) architectural design that allows the SFs to be learned directly through the Q-learning loss, eliminating the need for a separate SF loss.

To highlight the impact of this difference, we conducted experiments comparing our approach to an agent that combines the Q-learning loss, SF loss, and reward prediction loss, similar to the setup in [1]. Notably, we included the reward prediction loss because, unlike [1], our method does not assume prior knowledge of task specifics, such as goals, which aligns with the expectation in continual learning scenarios. We named this approach “SF + Q-TD + Reward.”

Our results, presented in Figure 6a, demonstrate that the additional SF loss can impair learning efficiency, requiring more time steps to converge to a good policy in the complex 3D Four Rooms environment. Furthermore, this approach is significantly less computationally efficient, as shown by the slower computational speed and longer training duration in Figure 6b. For a detailed comparison of learning performance, please refer to Figures 17 to 21 in Appendix H.

We believe these findings underscore the advantages of our method, particularly in terms of efficiency and practicality for continual learning. Moreover, they help to illustrate the key differences between the formulation of our approach and that of [1]. We agree with the reviewer that our method clearly builds on [1] (which was a seminal paper), but we do feel that what we have built on top of it represents a novel contribution that can help the field learn SFs more efficiently, as evidenced by our data.

1.4. Avoiding arbitrary weighting adjustments of Separate SF loss

Thank you for your feedback. We appreciate the opportunity to clarify our position.

On reflection, any claim as to whether or not the SF loss plays a “crucial role” should not hinge on something as basic as the weighting term in the loss. Indeed, the lower weighting used in [1] ($\lambda > 0$) may be a result of differing scales among the losses or potential conflicts between the SF loss and the Q-learning loss.

But, from a practical point of view, it is fair to say that [1] requires a careful selection of the weighting coefficient $\lambda$. In contrast, our proposal, through careful algorithmic and architectural design, eliminates the need for a separate SF loss and any concerns about its weighting. In doing so, we directly mitigate the potential problems associated with down-weighting the SF loss, ensuring that SFs meaningfully contribute to the agent's performance.

We hope this addresses your concerns. After reading your reply, we feel that our initial rebuttal did not accurately capture the essence of why our contribution is unique and novel relative to [1]. But, as we described above, the key advantages of our method are that: (1) we do not need to provide the goal to the agent (it is learned); (2) we provide direct evidence that we can learn from pixel inputs; (3) we show that we do not need to include the SF loss; and (4) by eliminating the need for the SF loss, we reduce the number of hyperparameters required.

Comment

4. Evaluation

Thank you for your comments, and we apologize for any confusion. To clarify, the average episode return is calculated as a moving average over recent episodes—typically the last 100 episodes experienced by the agent in our case. This metric provides a more immediate snapshot of the agent’s recent performance.

In contrast, the cumulative total return is the sum of all returns accumulated from the moment the agent is first exposed to the current task until the end of the evaluation for that task. This metric reflects the overall performance across the entire evaluation period.

These two metrics can show different trends because the moving average episode return emphasizes recent performance, which may fluctuate, while the cumulative total return captures the long-term accumulation of rewards.
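A minimal sketch of the two metrics as described (variable names are ours; window size 100 as mentioned above):

```python
from collections import deque

def track_metrics(episode_returns, window=100):
    """Moving average episode return vs. cumulative total return (illustrative only)."""
    recent, moving_avg, cumulative, total = deque(maxlen=window), [], [], 0.0
    for g in episode_returns:
        recent.append(g)
        total += g
        moving_avg.append(sum(recent) / len(recent))   # recent-performance snapshot
        cumulative.append(total)                       # long-term accumulation since task exposure
    return moving_avg, cumulative

# Two agents that end at the same moving average can differ sharply in cumulative return:
fast = [0.0] * 10 + [1.0] * 190    # adapts quickly after a task switch
slow = [0.0] * 100 + [1.0] * 100   # re-learns from scratch
print(track_metrics(fast)[0][-1], track_metrics(slow)[0][-1])   # 1.0 vs 1.0: same recent average
print(track_metrics(fast)[1][-1], track_metrics(slow)[1][-1])   # 190.0 vs 100.0: different cumulative return
```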

To emphasize why we presented moving average returns per episode instead of cumulative total return plots in our manuscript, it was to demonstrate that we allow learning for the first task to converge before introducing the second and subsequent tasks.

We hope this explanation resolves the confusion, and we will ensure that these differences are clearly explained in the manuscript.

Comment

Dear Reviewer 2gcv,

We hope this message finds you well. As today is the final day for the review discussion, we would like to kindly check in to see if our latest response has addressed your concerns. If the clarifications provided have resolved your questions, we would greatly appreciate it if you could update your score accordingly.

Thank you once again for your time and thoughtful feedback throughout this process. Your input has been invaluable, and we look forward to hearing from you soon.



Review (Rating: 6)

This work introduces a new algorithm for training successor features in deep reinforcement learning. This is achieved by optimizing two separate metrics. The first requires the model to predict the cumulative reward following a full trajectory and optimizes the successor features and the basis feature. The second optimized metric requires the model to predict the reward at the next step using the basis feature and task vector, but in this case only optimizes the task vector. Experiments are conducted in grid worlds and a 3D Four Rooms domain, which demonstrate that the proposed algorithm learns more consistently than other SF baselines and also supports faster task switching. Finally, supporting experiments show that the proposed algorithm is faster algorithmically and in terms of wall clock time than SF alternatives and that it also results in more separable SFs which correlate better with successor representations.

Strengths

Originality

The decoupling of training into two separate equations is a new and intuitive idea. The authors note the inspiration from Liu et al. (2021); however, this algorithm is used as a baseline and is clearly out-performed empirically. Thus, it is clear that the changes made are material and have an impact on model performance.

Clarity

The paper is well written and sections are structured appropriately. Notation is intuitive, consistent and aids understanding. Figure captions are detailed which also aids clarity.

Quality

I particularly appreciate some of the additional experiments conducted in support of the algorithm, such as the correlation between the learned SFs and SRs. The core experiments appear sufficiently challenging to separate the proposed algorithm from the baselines, and the baselines which are used are appropriate to challenge the proposed algorithm. The results are interpreted fairly, as all algorithms do struggle on at least one domain where the simple SFs do not and consistently perform well.

Significance

I do think this work could lead to future work and provide a helpful step in improving SFs and making them more practical. The significance is aided by the originality and simplicity of the approach as it is likely to spur new ideas quickly as a result.

Weaknesses

Clarity

The figures in this work are laid out poorly and this significantly hinders the readability of the paper. At the least it would help if a reader does not have to look past unseen figures on their way to look at the one being referenced - such as when looking for Figure 6 which comes after Figure 5. Also having the architectures in Figure 1 far from where they are needed and completely out of context is jarring and unhelpful. These architecture diagrams are also very difficult to follow and there is no clear mapping from what is depicted for some of the pieces to any explanation in the text or caption. Most of what is depicted in Figure 1d is not in Section 3, and similarly Section 5 is not detailed enough for me to map onto Figure 1e. I think more detail could be added to the figure itself and to the caption here. Lastly, including the loss functions in the figure, especially ones which have not been explained in the text, like orthogonality loss, is confusing. If these losses are not necessary in the main text then I don't think they are necessary in the figure and so I would remove them. With respect to Proposition 1, it would be better if this was at least in the main text - some kind of proof sketch would be better - but in the interest of space I see why it was omitted. Once again I do then just point to this as a part which could do with more explanation of why this matters and intuition on why it is true. If I am correct, Proposition 1 is the reason why Equations 5 and 6 cannot be optimized with the representation collapse strategy?

Secondly, the Preliminaries section could have more elaboration. Equation 2 in particular is presented without any discussion and $\gamma$ is not introduced at all. A reader with experience in RL and SFs will be fine but less experienced readers will likely be alienated. It would be ideal if this section could set up the ideas to come, and this is done to a degree with it being noted that representation collapse can still optimize Equation 4. More of this insight would just be helpful. Similarly, I would appreciate more discussion on why representation collapse is now not able to optimize Equations 5 and 6. This is merely stated without reason on lines 113 to 115.

Quality

I am not certain I agree with the assessment from Section 7.2. While it is a worthwhile experiment, the result of simple SFs being more correlated just appears to be due to the fact that it has a more linear latent embedding. This can be seen in Figure 5. SF+Reconstruction has very separable and clear clusters but they are just not organised in a straight line. So correlations - a linear metric - will not work. It seems to me that it would be more appropriate to try to decode SRs from the SFs using a simple but nonlinear model and report the final accuracy.

Questions

I have asked some questions in my review above and would appreciate those being answered. I do not have any other questions at this time.

If my question on Proposition 1 is answered and makes sense to me, and the concerns under Quality are addressed, I would be likely to advocate for acceptance, with the understanding that the clarity would also be improved and the figures restructured.

Limitations

The limitations are stated in their own section, and a good deal of consideration is given to the broader impact of this work.

Author Response

Thank you for your feedback. We appreciate the opportunity to clarify and enhance our manuscript based on your observations. Please let us know if there is further clarification we can provide.

1. Layout of Figure 5 and Figure 6

Thank you for your feedback on the order of Figures 5 and 6 in our manuscript. We recognize that aligning figure placement with their respective sections will enhance the manuscript's readability and coherence.

Currently, Figure 6 is introduced in Section 7.1's “Efficiency Analysis,” and Figure 5 appears later in Section 7.2's “Comparison to Successor Representations.” To improve narrative flow, we propose to swap these sections. This rearrangement will position 'Comparison to Successor Representations' before 'Efficiency Analysis,' ensuring that the figures align more logically with their related discussions.

2. Modifications of Figure 1

Thank you for your valuable feedback regarding Figure 1. Due to space constraints in this rebuttal response, we invite you to refer to our detailed response provided to Reviewer Tczf under the section 'Splitting Figure 1' and in the general response above.

3. Motivations behind Proposition 1

Thank you for the suggestion. Proposition 1 aims to mathematically demonstrate that the gradients from optimizing the Q-SF-TD loss (Eq. 5) effectively project the gradients from the canonical SF-TD loss (Eq. 4) along the task-encoding vector w. This projection is crucial in Continual RL as it aligns the SFs with different tasks, enabling the agent to adapt more rapidly to varying tasks.
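As a rough illustration of this projection intuition (a sketch under the simplifying assumption $Q(s,a) = \psi(s,a)^{\top} w$ with $w$ treated as fixed, not the full argument in Appendix C): for a Q-SF-TD loss of the form

$
L_{\psi} = \frac{1}{2}\left(R + \gamma\, \psi(S', A')^{\top} w - \psi(S, A)^{\top} w\right)^{2}, \qquad \nabla_{\psi(S,A)} L_{\psi} = -\left(R + \gamma\, \psi(S', A')^{\top} w - \psi(S, A)^{\top} w\right) w,
$

so the update to $\psi(S,A)$ is a scalar TD error times $w$: the SFs move only along the direction of the task-encoding vector, in contrast to the canonical SF-TD loss (Eq. 4), whose gradient points along the full feature-space residual.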

However, your comment on Proposition 1 being the reason why representation collapse is being mitigated is incorrect. For clarity on why representation collapse can occur, we have included an additional proof sketch below.

4. Proof Sketch for Representation Collapse in Basis Features

Consider the basis features function $\phi(\cdot) \in \mathbb{R}^n$ and the SFs $\psi(\cdot) \in \mathbb{R}^n$, omitting the inputs for clarity. The canonical SF-TD loss (Eq. 4) is defined as:

$
L_{\phi, \psi} = \frac{1}{2} \left\| \phi(\cdot) + \gamma \psi(\cdot) - \psi(\cdot) \right\|^2
$

Assume both $\phi(\cdot)$ and $\psi(\cdot)$ are constants across all states $S$, such that $\phi(\cdot) = c_1$ and $\psi(\cdot) = c_2$. If $c_1 = (1-\gamma)c_2$, then:

$
L_{\phi, \psi} = \frac{1}{2} \left\| (1 - \gamma)c_2 + \gamma c_2 - c_2 \right\|^2 = 0
$

This scenario illustrates that if both the basis features and SFs become constants, particularly with $c_1 = (1-\gamma)c_2$, the system will satisfy the zero-loss condition, resulting in representation collapse. In this state, $\phi(\cdot)$ loses its ability to distinguish between different states effectively, causing the model to lose critical discriminative information and thus impair its generalization capabilities.

5. Introduction to RL in Preliminaries

Thank you for the comment. We will add the following text to section 3 to aid readers who may not be familiar with RL.

The RL setting is formalized as a Markov Decision Process defined by a tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma)$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $r: \mathcal{S} \rightarrow \mathbb{R}$ is the reward function, $p: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow [0,1]$ is the transition probability function, and $\gamma \in [0,1)$ is the discount factor, which balances the importance of immediate and future rewards.

At each time step $t$, the agent observes state $S_t \in \mathcal{S}$ and takes an action $A_t \in \mathcal{A}$ sampled from a policy $\pi: \mathcal{S} \times \mathcal{A} \rightarrow [0,1]$, resulting in a transition to the next state $S_{t+1}$ with probability $p(S_{t+1} \mid S_t, A_t)$ and the reward $R_{t+1}$.

6. Linear Latent Embeddings and Correlation Analysis

Thank you for your comment. We welcome the chance to clarify our use of UMAP for embeddings (Figure 5) and Spearman's rank correlation for analysis (Table 1).

First, we use UMAP [1], a non-linear dimension reduction technique, for its effectiveness in visualizing complex relationships within Successor Features (SFs) in 2D space. It's crucial to note that UMAP does not imply linearity; the spatial arrangement of clusters should not be interpreted as linear relationships among features.

Second, our correlation analysis employs Spearman's rank correlation coefficient [2], as outlined in Appendix K. This method assesses monotonic relationships, which need not be linear, making it suitable for our data's characteristics. Contrary to any suggestion of linearity, Spearman's correlation is non-parametric and does not assume linear relationships.
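For reference, a sketch of this kind of computation (hypothetical array shapes and stand-in data; Appendix K documents our exact protocol — here we simply compare pairwise state-similarity structure):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

sfs = rng.random((100, 64))    # learned SFs, one vector per discrete state (stand-in data)
sr = rng.random((100, 100))    # tabular successor representation (stand-in data)

sf_sim = sfs @ sfs.T           # state-similarity structure implied by the SFs
rho, _ = spearmanr(sf_sim.ravel(), sr.ravel())
print(f"Spearman rank correlation: {rho:.3f}")
```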

We will clarify these points in the manuscript to eliminate any ambiguity about our methods and to underscore the appropriateness and robustness of our analysis.

7. Decoding SRs from SFs

Thank you for suggesting we use a simple nonlinear model to decode SRs from SFs. We implemented a single-layer perceptron with ReLU activation, training it for 4000 iterations using a 0.001 learning rate and Adam optimizer to ensure convergence within the center-wall environment.
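In code, the decoder we describe amounts to roughly the following (illustrative dimensions and stand-in data; the real SF and SR shapes are environment-specific):

```python
import torch
import torch.nn as nn

sf_dim, sr_dim, n_states = 64, 100, 500          # illustrative sizes

decoder = nn.Sequential(nn.Linear(sf_dim, sr_dim), nn.ReLU())   # single-layer perceptron with ReLU
opt = torch.optim.Adam(decoder.parameters(), lr=0.001)
loss_fn = nn.MSELoss()

sfs = torch.randn(n_states, sf_dim)              # learned SFs (stand-in data)
srs = torch.rand(n_states, sr_dim)               # target successor representations (stand-in data)

for _ in range(4000):                            # trained to convergence, as in our setup
    opt.zero_grad()
    loss = loss_fn(decoder(sfs), srs)
    loss.backward()
    opt.step()

print(f"decoding MSE: {loss.item():.4f}")
```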

In Figure 3 of the General Response, we present the mean squared error (MSE) results for both fully-observable (allocentric) and partially-observable (egocentric) settings. Our model achieved notably low errors, outperforming baselines in both contexts, highlighting its robustness and the effectiveness of our successor features in varying observational settings. This consistency was not observed in baseline models, such as SF + Random (green) and SF + Reconstruction (red), which showed variable performance.

These results confirm the strength and reliability of our decoded successor representations across diverse settings. We will incorporate this analysis into the manuscript.

[1] McInnes et al., 2018. UMAP: Uniform manifold approximation and projection.

[2] Zar et al., 2005. Spearman rank correlation.

Comment

I thank the authors for their response. I make note of the agreed changes to the figures and their layout in the general comment, this rebuttal and in the discussion with Reviewer Tczf. These seem like appropriate changes and address my concerns with the figures.

On the points about the motivation of Proposition 1 and the proof sketch for representation collapse: I understand how the loss is minimized if $\phi$ and $\psi$ become constant. My point is that the insight of the proposition is missing. For example, the sentence in this rebuttal, "This projection is crucial in Continual RL as it aligns the SFs with different tasks, enabling the agent to adapt more rapidly to varying tasks.", is key, but even then the insight would come from showing how the projection aligns with the task encoding vector and showing, or at least arguing, why this helps with downstream tasks. I appreciate that the proof and some insight of this sort is in the appendix, but in reality this is the primary contribution and deeper insight here, and it comes in as a passing comment. Perhaps I am missing the point of the work, but the fact that the proposed method behaves in this way seems crucial. In addition, since collapse of the representation to the standard SR is noted as being the primary problem, showing how projecting along the task vector fixes this is key. Relying in the main text only on empirics denies the reader this deeper insight, or at least expects them to go and read the proof and obtain the insight themselves from it.

With respect to the RL preliminaries. I appreciate the authors agreeing to this, as it must feel like a nuisance. However, I think it is in the best interest of the broader readership of NeurIPS.

For the correlation analysis. I thank the authors for correcting me, and agree that mention of this would be helpful in the main text. As I mention this was one of my main concerns and I will be raising my score as a result. I also appreciate the new experiments added to the pdf draft and find this to be compelling evidence. I thank the authors for including the decoding experiment as well and find it convincing.

Ultimately, my lingering concern remains with Proposition 1. Essentially, lines 113 to 115 explain to us the outcome of the proposed setup, lines 116 to 118 then literally state Proposition 1 (its proximity to lines 113 to 115 led me to assume it was more related than it is), and then there is a throw forward to the empirical results. So across lines 113 to 119, where the main purpose of the proposed method is being summarized, a reader is told what to think but never shown it. This makes the entire section fall flat, and since this is the technical punchline it makes the paper fall slightly flat. I believe the experimental results (adjusting for my own misunderstandings and the new results) show the intended meaning and I am confident in the correctness of the claims of the work. But a deeper insight into how the proposed method really results in learned SRs and avoids representation collapse still seems missing. Proposition 1, or something of this nature, would likely address this. I would raise my score further if this were addressed in the coming days.

I once again thank the authors for their thoughtful response and new experiments. As my quality concerns were due to an error in my understanding, and this has now been corrected, I will raise my score to a 5 and also increase my confidence. I am also raising the soundness and clarity scores in light of the correction and improved figures. I look forward to further discussion on Proposition 1 if the authors are able.

Comment

This is a re-submission, as it seems that our earlier response did not notify the reviewers via email.

Thank you for taking the time to review our rebuttal. We sincerely appreciate your thoughtful feedback and the subsequent adjustment in your evaluation.

In addition, we would like to take this opportunity to further clarify certain points, specifically how our proposed method results in learned Successor Features (SFs) and avoids representation collapse, and the role of Proposition 1.

Improving Clarification on Overcoming Representation Collapse

We would like to thank the reviewer for engaging constructively with us, their input has been extremely helpful for improving our paper. Below, we propose two additional modifications to the manuscript to address the points raised by the reviewer in their response to our rebuttal.

First, we will incorporate the proof sketch regarding representation collapse into the main text near line 100, where we initially mention the scenario where the basis features ϕ\phi may become a constant vector when the loss is minimized.

Second, at the beginning of Section 4, “Proposed Method,” we will emphasize that the key insight from the proof sketch is that preventing representation collapse requires avoiding the scenario where the basis features ϕ\phi become a constant vector for all states, which would minimize the loss without contributing to meaningful learning.

Our approach addresses these constraints by not optimizing the basis features ϕ\phi within any loss functions used. Instead, we treat the basis features ϕ\phi as the normalized output from the encoder, which is learned using the Q-SF-TD loss (Eq. 5). When the basis features ϕ\phi are needed to learn the task encoding vector ww through the reward prediction loss (Eq. 6), we apply a stop-gradient operator to treat the basis features ϕ\phi as a constant. As we will demonstrate in section 7 “Analysis of Efficiency and Efficacy”, this inclusion of a stop-gradient operator is crucial. Without it, learning both the basis features ϕ\phi and the task encoding vector ww concurrently can lead to learning instability (as we explained to Reviewer 2gcv).
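To make the role of the stop-gradient concrete, below is a minimal PyTorch sketch of this reward-prediction update; the encoder, the linear reward model $\hat{r} = \phi^\top w$, and all tensor shapes are illustrative placeholders rather than our exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_obs, d_phi = 64, 32                             # placeholder dimensions
encoder = nn.Linear(d_obs, d_phi)                 # stands in for the pixel encoder
w = nn.Parameter(torch.zeros(d_phi))              # task-encoding vector
optimizer_w = torch.optim.Adam([w], lr=1e-3)

obs = torch.randn(256, d_obs)                     # placeholder observation batch
rewards = torch.randn(256)                        # placeholder R_{t+1}

phi = F.normalize(encoder(obs), dim=-1)           # basis features: normalized encoder output

# Reward-prediction loss (Eq. 6): detach() is the stop-gradient, so this loss
# updates only w and leaves the encoder to be shaped by the Q-SF-TD loss (Eq. 5).
reward_pred = (phi.detach() * w).sum(dim=-1)
loss_w = F.mse_loss(reward_pred, rewards)

optimizer_w.zero_grad()
loss_w.backward()
optimizer_w.step()
```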

Improving Clarification for Proposition 1

Regarding Proposition 1, on re-reading our own text, we must admit that we agree completely with the reviewer. Because the text jumps from the discussion of representational collapse in lines 113-115 straight to Proposition 1, it is natural for a reader to assume that Proposition 1 will deal with representational collapse, and yet it does not. We can see now that this would have caused confusion for readers, potentially suggesting a misleading connection between Proposition 1 and representation collapse.

To improve clarity, we propose the following amendments:

  1. Add a concluding sentence after line 115 where we will state, “Next, we will clarify how our approach relates to learning SFs, as they are defined mathematically.”

  2. Create a new subsection titled “4.1 Bridging Simple SFs and Universal Successor Features,” where we will expand on the insights related to Proposition 1 (expanding the text currently in lines 116-119).

In terms of expanding on the insights related to Proposition 1, we will highlight the fact that Proposition 1 explains why our approach ultimately produces true SFs. Proposition 1 does this by proving that minimizing our losses (Eqs. 5 & 6) also minimizes the canonical SF loss used in Universal Successor Features (Eq. 4). In order to tie this to the previous section, we will also note that our approach minimizes these losses in a manner such that setting the basis features ϕ\phi to a constant is not a solution. Specifically, we will note in the text that if one sets ψ=c2\psi = c_2 and ϕ=c1=(1γ)c2\phi = c_1 = (1 - \gamma) c_2, then Eqs. 5 & 6 are not minimized, due to the fact that y^\hat{y} and Rt+1R_{t+1} are not constants.
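Sketched out, the argument we will add reads roughly as follows, using the linear reward model $\hat{r} = \phi^\top w$ as an illustrative stand-in for Eq. 6:

$$
\phi \equiv c_1 \;\;\Longrightarrow\;\; \hat{r}(s) = c_1^\top w \ \text{is identical for every state},
$$

while the observed rewards $R_{t+1}$ vary across transitions, so the reward-prediction loss cannot be driven to zero; similarly, a constant $\psi = c_2$ cannot match the non-constant target $\hat{y}$ in Eq. 5. Collapsed features are therefore not a minimizer of our objectives.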

We believe these revisions will significantly enhance the clarity and robustness of our manuscript. Again, we thank you for your insightful feedback, and please feel free to reach out if any further clarification is needed. We hope you will consider raising your score once more based on these responses.


Comment

Dear Reviewer 7tgV,

We hope this message finds you well. As today is the final day for the review discussion, we would like to kindly check in to see if our latest response has addressed your concerns. If the clarifications provided have resolved your questions, we would greatly appreciate it if you could update your score accordingly.

Thank you once again for your time and thoughtful feedback throughout this process. Your input has been invaluable, and we look forward to hearing from you soon.

Author Rebuttal

We would like to thank the reviewers once again for their valuable feedback, which has guided clarifications and improvements that we will include in the final revision of our manuscript. We have attached a set of figures in this Author Rebuttal, which we denote as General Response (GR), to address the main concerns from the reviewers. The concerns fall broadly under the following themes:

1. Complexity of the environments

During the rebuttal phase, we further evaluated our model in more complex settings using the Mujoco environments with pixel-based observations. We consider this benchmark to show the potential of our model in continuous action spaces.

Following the established protocol in [1], we started with the half-cheetah domain in Task 1, where agents were rewarded for running forward. We then introduced three different scenarios in Task 2: agents were rewarded for running backwards (Figure 1a in GR), running faster (Figure 1b in GR), and, in the most drastic change, switching from the half-cheetah to the walker domain (same number of actions) with a forward running task (Figure 1c in GR). To ensure comparability across these diverse scenarios, we normalized the returns, considering that each task has a different maximum attainable return per episode.

In all tested scenarios, our model consistently outperformed all baselines in Task 1 and, in particular, Task 2, highlighting its superior adaptability and effectiveness in complex environments. This performance sharply contrasts with other SF-related baseline models, which struggled to adapt under similar conditions.

2. Marginal improvements

We initially used average episode returns to provide quick insights into short-term performance, but recognize that this metric may not fully capture the long-term benefits of our model. To address this, we also evaluated cumulative total returns across all tasks, which are illustrated in Figure 2 in GR.
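In symbols, the two metrics differ roughly as follows, where $G_i$ denotes the return of episode $i$ and the smoothing window $k$ is illustrative:

$$
\bar{G}_e = \frac{1}{k}\sum_{i=e-k+1}^{e} G_i
\qquad \text{versus} \qquad
G^{\mathrm{cum}}_e = \sum_{i=1}^{e} G_i .
$$

The moving average $\bar{G}_e$ shows how quickly a policy converges within each task, whereas the cumulative total $G^{\mathrm{cum}}_e$ accumulates every reward earned and therefore credits both fast adaptation and sustained performance across tasks.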

These results demonstrate that our model not only learns effective policies more rapidly but also sustains these improvements, particularly in complex 3D environments where tasks are re-encountered (Figure 2c-d in GR).

Overall, our model showed significant improvement in cumulative returns over the baseline models, highlighting its robustness and ability to transfer learning effectively across tasks. This contrasts with DQN, which exhibited little to no transfer effects and required re-learning from scratch, as evidenced by its performance in these scenarios.

3. Simple nonlinear decoder

Reviewer 7tgV recommended a simple non-linear decoder to assess which model’s SFs most effectively decode into Successor Representations (SRs). We conducted this evaluation using both allocentric (fully-observable) and egocentric (partially-observable) pixel observations within the center-wall environment. The results, depicted in Figure 3 in the GR, demonstrate consistently high accuracy across both settings. This contrasts sharply with SFs developed using reconstruction constraints or random basis features, which, while effective in egocentric settings, perform poorly in allocentric settings where feature sparsity is greater. This analysis highlights the robustness and versatility of our model's SFs in varied observational contexts.

4. Stop Gradient Operator

The comments from Reviewer 2gcv prompted us to conduct an additional ablation study to elucidate the effectiveness of the reward prediction loss (Eq. 6) in our approach, compared to prior work [2] that faced challenges with similar methods. A key differentiator in our model is the application of a stop gradient operator on the basis features during the learning process with reward prediction loss. We designed this study to specifically assess whether the stop gradient operator is essential for successful learning using reward prediction loss. The findings, presented in Figure 4a in GR, conclusively show that omitting the stop gradient operator leads to significantly reduced learning efficiency and policy effectiveness.

Additionally, visual analysis of the SFs in Figure 4b in GR further demonstrates that concurrently learned basis features and task-encoding vectors without a stop gradient operator result in SFs with poor discriminative capabilities, undermining effective policy learning. These results underscore the critical role of the stop gradient operator in maintaining the integrity and effectiveness of our learning process, confirming its necessity for achieving the robust performance we report.

5. Modifications to Figure 1

Lastly, there were additional concerns regarding the configuration and density of Figure 1. As previously detailed in individual rebuttal responses to Reviewers 7tgV and Tczf, and for broader awareness, we will implement the following modifications:

  • Simplification: We'll remove the loss functions from Figures 1d and 1e, with detailed descriptions retained in Appendix E and the main text, respectively. This will help focus attention on the structural content.

  • Reorganization: Figure 1d will be moved to the Appendix as it primarily presents common approaches rather than our novel contributions, ensuring the main text remains focused on our work.

  • Relabeling and Relocation: Figure 1e will be renamed as Figure 2 and relocated closer to Sections 4 and 5 where it is first mentioned, aligning it more closely with its textual references and enhancing narrative coherence.

  • Visual Guidance Enhancements: We will replace terms like “Q-SF-TD loss” with “LψL_\psi: Q-SF-TD loss” and introduce color-coded symbols to improve figure-text integration. For example, pixel-level observations, StS_t, will be described in the text with direct references to their visual representation in the newly labeled Figure 2.

[1] Yarats et al., 2021. Mastering visual continuous control: Improved data-augmented reinforcement learning.

[2] Ma, et al., 2020. Universal successor features for transfer reinforcement learning

Comment

Dear all,

As the discussion period comes to a close, we would like to express our sincere gratitude to the reviewers for their valuable feedback, which has greatly guided the clarifications and improvements we plan to include in the final revision of our manuscript.

Among the key updates is the inclusion of results from the complex Mujoco environment using pixel observations, which strongly validate the robustness of our approach. Furthermore, we have clarified that the average episode returns were computed using moving averages and have included cumulative total return figures to demonstrate that our approach exhibits superior transfer across different environments. These results provide strong evidence that our method achieves significant improvements, beyond marginal gains.

Once again, we would like to emphasize that we presented moving average returns per episode instead of cumulative total return plots in our manuscript to demonstrate that we allow learning for the first task to converge before introducing the second and subsequent tasks.

In response to Reviewer 7tgV's suggestion, we have extended our analysis by decoding the Successor Features (SFs) into Successor Representations (SRs) using a simple non-linear decoder. This additional analysis highlights the robustness and versatility of our model's SFs in both egocentric and allocentric observations, which other baseline models did not achieve.

Regarding the concerns raised by Reviewer 2gcv about the similarity between our work and that of Ma et al., 2020 [1], we would like to reaffirm the novel contributions our work makes: (1) Our approach does not require the agent to be provided with a goal—it is learned; (2) We provide direct evidence that our method works with pixel inputs; (3) We demonstrate that our approach eliminates the need for an SF loss; and (4) By removing the SF loss, we reduce the number of hyperparameters required, thereby simplifying the model.

In the manuscript (Figures 17 to 21 in Appendix H), we included results comparing our approach to an agent (named "SF + Q-TD + Reward") that combines the Q-learning loss, SF loss, and reward prediction loss, similar to the setup in [1], with the key difference being the inclusion of reward prediction loss, as we do not assume prior knowledge of task specifics. The results showed that our model achieved better learning performance, especially in the continual learning setting (Figure 6a), and was more computationally efficient (Figure 6b).

Another analysis we included in the general response was to verify the importance of using a stop-gradient operator on the basis features when learning the task-encoding vector with the reward prediction loss. We believe that incorporating the stop-gradient operator in our approach enables more effective learning. This is consistent with the findings in [1], which suggest that without the stop-gradient operator on the basis features, concurrently learning both the basis features and the task-encoding vector using the reward prediction loss can lead to learning difficulties.

Although Reviewer Tczf raised concerns about DQN outperforming SFs in the continual RL setting in Figures 2e to 2g, we were unable to address this directly, as the specific results being referred to were not clear. However, we invite the reviewers to refer to the plots generated using the total cumulative return in the same setup as Figures 2e to 2g (in our work), shown in Figures 2a-2c of the general response, where it is clearly evident that our approach performed much better in later tasks.

More concretely, across all our experimental results in the manuscript, including Figures 2, 3, 6, and 12 to 16, as well as in the complex Mujoco environment, our model, Simple SF, consistently outperforms DQN/DDPG, particularly after the first task.

Finally, we acknowledge the concerns regarding the configuration and density of Figure 1, as well as the need for better insights into Proposition 1. We agree with these observations and, as noted in the comments and general response, we will be making modifications to further improve the clarity and presentation of the manuscript.

The positive feedback on our manuscript's clarity and presentation has been very encouraging. We will incorporate the reviewers' suggestions into the final version of our manuscript. We believe these changes, arising from the discussions, will not only refine our manuscript but also emphasize the simplicity and practicality of our framework, paving the way for future research in the field of Successor Features.

We hope that these revisions address the raised concerns and further solidify the contributions of our manuscript.

[1] Ma, et al., 2020. Universal successor features for transfer reinforcement learning

Final Decision

Scores: 5, 5, 4, 6

This is a solid paper, that is right on the threshold.

Criticism included that the tested environments are too limited and that the paper failed to properly discuss previous work from Ma et al. 2020. The rebuttal added further ablations and experiments, which is good and helps reviewers increase their scores. However, reviewer 2gcv wasn't completely convinced by the rebuttal's response to the concern that a key part of the paper is not completely new but has already been done by Ma et al. 2020. Moreover, 2gcv didn't answer the final rebuttal comment.

Overall the paper is good and the rebuttal did address most concerns, but in the end it didn't convince all reviewers on all points. The resulting score is just at the boundary. I really appreciate papers trying SIMPLE approaches, so I tend to accept the paper.