ARC-RL: Self-Evolution Continual Reinforcement Learning via Action Representation Space
Abstract
Reviews and Discussion
This paper addresses the issue of varying action sets within the same domain. The proposed approach learns action representations through exploration and uses these representations to enhance reinforcement learning. The method then combines a reconstruction loss with an EWC loss from continual learning to train the encoder-decoder network. The method is evaluated in one MiniGrid environment (Empty) and one Procgen environment (Bigfish).
Strengths
1, This paper pushes in an interesting direction: generalizing policies to new action sets via action representations.
2, The proposed method is justified and explained very clearly.
3, Source code is included.
Weaknesses
1, Novelty. Using an encoder-decoder framework and a reconstruction loss to learn action representations has been discussed in many prior works ([1], [3]). This raises a major concern regarding the novelty of Eq. 6, as the loss simply combines this with the Elastic Weight Consolidation loss from continual learning.
2, Overly simple experiments. The method is only evaluated in one MiniGrid environment (Empty) and one Procgen environment (Bigfish). More experiments could strengthen the paper.
3, Related work. This paper is missing several works that are closely related. The authors appear to have missed the rich body of literature on varying action sets and learning action relations/representations.
The list is not by any means exhaustive:
[1] Jain A, Szot A, Lim J. Generalization to New Actions in Reinforcement Learning[C]//International Conference on Machine Learning. PMLR, 2020: 4661-4672.
[2] Chandak Y, Theocharous G, Nota C, et al. Lifelong learning with a changing action set[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2020, 34(04): 3373-3380.
[3] Jain A, Kosaka N, Kim K M, et al. Know your action set: Learning action relations for reinforcement learning[C]//International Conference on Learning Representations. 2021.
[4] Tennenholtz G, Mannor S. The natural language of actions[C]//International Conference on Machine Learning. PMLR, 2019: 6196-6205.
[5] Farquhar G, Gustafson L, Lin Z, et al. Growing action spaces[C]//International Conference on Machine Learning. PMLR, 2020: 3040-3051.
Questions
1, The experimental settings in Figure 4 and Figure 5 need a more detailed explanation. I think notation for the exploration stage and the learning stage is necessary. Meanwhile, how does the action set change in the experiments? The authors appear to train the policy on task 1, task 2, and task 3 sequentially, but should provide a more detailed explanation.
2, I am curious why the decoder maps the action representation e to action probabilities over any action space. Why not map the action representation to an action, as in [Yash Chandak et al., Learning action representations for reinforcement learning]?
We greatly appreciate your reading and thoughtful concern for our work. Below are our responses to each of your points:
W1: Novelty Contribution
- Thank you for your insightful feedback. We acknowledge that using encoder-decoder frameworks and reconstruction losses to learn representations is prevalent in reinforcement learning. While our work is not the first to explore action representation learning, our focus is on enabling agents to generalize policies across changing action spaces within CRL. To achieve this, we have adapted the encoder-decoder framework and integrated it with a regularization strategy from continual learning (a sketch of this combination is given after this list).
- Although Equation 6 is an important component of our proposed method, it is not the sole innovation in our paper. On the one hand, we extend existing CRL problem formulations by introducing a more complex and realistic scenario of varying action spaces. On the other hand, we empirically demonstrate that action space representations can facilitate policy generalization across different action spaces, mitigate catastrophic forgetting, and enhance knowledge transfer, thereby improving overall performance.
- As shown in the ablation study in Section 5.3, the primary contribution of regularization lies in stabilizing the policy, leading to enhanced overall performance. However, even without regularization, our method still performs well.
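For concreteness, here is a minimal sketch of how a reconstruction objective can be combined with an EWC-style penalty when fine-tuning an action encoder-decoder. This is not the paper's implementation of Eq. 6; the `autoencoder.encode`/`decode` interface, the one-hot action inputs, the `old_params`/`fisher` snapshots, and the coefficient `ewc_lambda` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def combined_loss(autoencoder, actions_onehot, old_params, fisher, ewc_lambda=1.0):
    """Reconstruction loss plus an EWC-style penalty (illustrative sketch only).

    autoencoder:    module exposing encode()/decode() for action representations
    actions_onehot: [batch, |A_k|] one-hot actions from the current action space
    old_params / fisher: parameter snapshot and Fisher estimates from the previous task
    """
    # Self-supervised reconstruction: encode actions, decode back to action logits.
    e = autoencoder.encode(actions_onehot)
    logits = autoencoder.decode(e)
    recon = F.cross_entropy(logits, actions_onehot.argmax(dim=-1))

    # EWC-style regularization: penalize drift of parameters that were important
    # for the previous action space, weighted by their Fisher information.
    penalty = sum((fisher[n] * (p - old_params[n]) ** 2).sum()
                  for n, p in autoencoder.named_parameters() if n in fisher)

    return recon + ewc_lambda * penalty
```

In this sketch, `ewc_lambda` stands in for the regularization strength whose effect is analyzed in the ablation study (Section 5.3 and Appendix C.4).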
W2: Experimental Validation
Thank you for your valuable suggestion. Since our problem differs from existing continual reinforcement learning problems, the current environments are not entirely suitable for evaluating related methods. Therefore, we need to design tasks within the environments to reflect the challenges of SE-CRL and clearly demonstrate the performance of different methods. This makes it challenging to use other environments for evaluation directly. However, to enhance the credibility of our experiments, we have also included an additional MiniGrid environment and an additional Procgen environment in Appendix C.7 of the revised version.
W3: Related Works
Thank you for pointing out our oversight. We have added a section on related works in the revised version to address the missing references and discuss their relevance to our study.
Q1: Experiment Setting and Task Description
- Thank you for your suggestion. We have included a more detailed explanation of the experimental setting in the revised version. You can refer to Appendix B.1 for references regarding different tasks in the experiments.
- Specifically, when the agent switches from one task to another, the available action space may expand or contract. The agent can only access the currently available action space. For instance, in Figure 4, when the agent is on Task 1, it can only take three actions: turn left, turn right, and move forward. However, when the agent is on Task 2, two new actions are added: move left and move right. The agent trains its policy sequentially in Task 1, Task 2, and Task 3, with each task being trained for 3M steps. While training on each task, testing is simultaneously conducted on all three tasks. The curves in Figures 4 and 5 show the expected return during testing, with the blue-shaded rectangular area indicating the training phase on the current task. (A sketch of this protocol is given after this list.)
- Although the exploration phase is important for our method, not all comparison methods have an exploration phase, so we did not explicitly mark it in the experimental results. Additionally, the number of steps in the exploration phase for each task (1e4) is relatively small compared to the total learning steps (3e6 for MiniGrid, 5e6 for Bigfish), making it difficult to visualize in the results. We have added a description of the hyperparameters for the exploration phase in Appendix B.4.
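To make the protocol described above concrete, here is a minimal sketch of the sequential train/test loop. The helpers `make_task`, `agent.train`, and `evaluate`, as well as the evaluation interval, are placeholder assumptions, not the paper's code.

```python
# Illustrative sketch of the SE-CRL evaluation protocol described above.
TASKS = ["task1", "task2", "task3"]   # each task exposes a different action set
STEPS_PER_TASK = 3_000_000            # 3M training steps per task (MiniGrid setting)
EVAL_EVERY = 50_000                   # hypothetical evaluation interval

def run_protocol(agent, make_task, evaluate):
    test_returns = {t: [] for t in TASKS}
    for task in TASKS:                                    # train on tasks sequentially
        env = make_task(task)                             # only the current action set is available
        for _ in range(0, STEPS_PER_TASK, EVAL_EVERY):
            agent.train(env, steps=EVAL_EVERY)
            for t in TASKS:                               # test on all tasks at each checkpoint
                test_returns[t].append(evaluate(agent, make_task(t)))
    return test_returns
```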
Q2: Decoder Map Function
- In the work of [1], the action representation is sampled from a probabilistic function, resulting in stochastic action representations, while the decoder's mapping is a deterministic function. This design is primarily due to that method's goal of generalizing over large action sets, rather than operating in CRL with continually changing action spaces. Therefore, the representations in that method can be stochastic.
- In the initial design phase of our method, we also considered mapping the decoder output to actions. However, given that our problem involves varying action spaces, the action representation encoder needs to produce a deterministic encoding. To preserve the stochasticity of the reinforcement learning policy, we designed a corresponding stochastic component in the overall mapping (Equation 4). Therefore, our method maps the action representation to action probabilities instead of actions.
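As an illustration of this design choice, below is a minimal sketch of a decoder head that maps a deterministic action representation to a probability distribution over the currently available actions. The class name, dimensions, and softmax head are assumptions for illustration, not the paper's exact Equation 4.

```python
import torch
import torch.nn as nn

class RepresentationDecoder(nn.Module):
    """Maps an action representation e to action probabilities (illustrative sketch)."""
    def __init__(self, repr_dim, num_available_actions):
        super().__init__()
        self.head = nn.Linear(repr_dim, num_available_actions)

    def forward(self, e):
        # A softmax over the current action set keeps the policy stochastic even
        # though the representation e itself is produced deterministically.
        return torch.softmax(self.head(e), dim=-1)

# Usage: sample an action from the decoded distribution.
decoder = RepresentationDecoder(repr_dim=8, num_available_actions=5)
e = torch.randn(1, 8)                      # deterministic representation from the policy
action = torch.distributions.Categorical(probs=decoder(e)).sample()
```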
Reference:
[1] Chandak, Yash, et al. Learning action representations for reinforcement learning. ICML. 2019.
Thank you for your response. I have carefully reviewed the revised version of the manuscript as well as other reviewers' comments. I will maintain my score and recommend rejecting this paper.
Here are my key concerns:
Lack of discussion on related work: The revised version still does not include a discussion of the works I mentioned in my review (if they have been added, please indicate the exact location). This omission raises doubts about the novelty of the paper. For instance, [1] and [3] primarily focus on scenarios where the tools (i.e., action spaces) used during training and testing differ. They employ a VAE architecture to represent actions, ultimately enabling the generalization of policies across changing action spaces.
While the authors claim that the paper focuses on CRL, the methods and problems addressed in their work are highly similar to those in the aforementioned papers. The lack of discussion on such related works in the Related Work, Discussion, and Experiments sections is inappropriate.
Experimental environment limitations: I understand that conducting experiments in more complex environments may not be feasible during the rebuttal period. However, the authors' claim that "the current environments are not entirely suitable for evaluating related methods" is not justified.
Changing action spaces can be explored in many complex experimental environments through straightforward setups. For example, [5] demonstrates improved experimental results in the StarCraft environment by gradually expanding the action space. Similarly, [1] and [3] adjust their settings to ensure that the tools and action sets used differ within the same task.
These works clearly show that conducting experiments in more complex environments is entirely feasible and would significantly strengthen the paper's persuasiveness.
The authors have introduced a new problem in reinforcement learning called Self-Evolution Continual Reinforcement Learning (SE-CRL), which focuses on changes only in the action space. They have formally defined this problem and proposed a novel approach, ARC-RL, to address it. Furthermore, the authors have developed a benchmark for evaluating methods on the SE-CRL problem.
Strengths
The authors have formalized the proposed SE-CRL problem with a well-defined problem statement that is both precise and standardized.
The schematic diagrams and experimental result presentation are clear and easy to understand.
The ARC-RL method proposed in the paper has clear and comprehensible formulas.
Weaknesses
1). Wei Ding's paper [1] also addresses the issue of changes in the action space, and it also extends the state space. The problem discussed in this paper, which involves changes only in the action space, can be considered a subset of the problem described above. The authors claim that this paper is the first to address the problem of changes in the action space, but I believe this statement may not be entirely accurate. The issue of robustness in robotic algorithms under action space changes has already been extensively studied.
2). The paper states: "Although a general policy can be obtained using the union of all action spaces, the previous global optimum may become a local one that does not fit the new action space." The claim made in the paper regarding the potential of an expanded action space to discover more optimal solutions is not clearly supported by Figure 4 or other results. The current presentation of the results does not adequately demonstrate how an expanded action space contributes to improved performance or enables superior solutions. It is recommended to either provide additional evidence or analysis that explicitly illustrates this claim or revise the statement to better align with the presented data.
3). Minor issue: "new neurons are initialized randomly". It should be parameters or weights that are initialized.
[1] Wei Ding, Siyang Jiang, Hsi-Wen Chen, and Ming-Syan Chen. Incremental reinforcement learning with dual-adaptive ε-greedy exploration. In AAAI, volume 37, pp. 7387–7395, 2023.
Questions
1). Can you explain why the performance of some algorithms (FT, IND) in Figure 4 degrades with action expansion? In the action expansion setting,
2). Was the agent trained during the 0-3M period in Figure 4b? Or do you load the task 1 checkpoint to perform the test? Why is the purple curve so different? Is this a difference between training and testing?
3). The same applies to Figure 4c: only 6-9M was trained, so what happened during 0-6M? FT?
4). The authors propose a relatively complex algorithm to address the issue of aligning policy with action space changes. However, there are simpler approaches, such as directly adding or masking the output layer neurons of the actor and only training the last layer of the actor when transferring across tasks. I did not find a formal definition of methods like Fine-Tuning (FT) and Mask in the text. Are these methods referring to what I described?
We greatly appreciate the time you took to study our work, the in-depth discussion of the related work, and the encouraging comments. Below are detailed responses to your concerns.
W1: Novelty Clarification
Thank you for pointing out the overlap with [1]. We acknowledge that there are similarities between our work and [1], particularly in addressing changes in the action space. However, our approach differs in several key aspects.
- While we did not consider state space changes, our work considers more general changes of the action space. We place greater emphasis on the situation where the action space contracts, and we consider more situations in Section 3.2 and Appendix C.2. In contrast, [1] primarily focuses on the expansion of the action space.
- The concept of "incremental reinforcement learning" in [1] is distinct from "continual reinforcement learning". Incremental reinforcement learning mainly focuses on continuous adaptation to new tasks without considering the potential forgetting of previously learned tasks, which is a critical aspect of continual learning. Therefore, the goal of incremental reinforcement learning is not consistent with that of CRL.
While our method may not be the first to investigate changes in the action space, ours is the first study to consider changes in the action space more comprehensively within the context of continual learning. We have clarified this distinction in the revised version of our paper to avoid any misunderstanding.
W2: Potential of Expanded Action Space
We appreciate your feedback.
- It was not our intention to suggest that using an expanded action space inherently offers a performance advantage over using the union of all action spaces. In many tasks, changes in the action space may be unknown to the agent, meaning that the agent does not have prior knowledge of all possible action spaces.
- Our experiments under contraction situations demonstrate that a policy learned using the union of all action spaces may not be optimal for a reduced action space. For instance, in Figure 5, the FT method, which initially trains on all action spaces in Task 1, performs well in testing on Task 1 (Figure 5(a): 3M) but poorly in Task 3 (Figure 5(c): 3M).
- In some environments, an expanded action space may indeed potentially improve performance by discovering more solutions, which is more evident in Figure 15. Methods like CLEAR and ARC-RL show improved performance with action space expansion.
We will ensure that these points are more clearly communicated in the revised manuscript.
W3: Minor Issue
Thank you for pointing out the error in our terminology. We have corrected the expression in the revised version of our paper to accurately state that parameters are initialized randomly.
Q1: Performance Degradation
As mentioned in the last paragraph of page seven, while most methods may improve performance with action space expansion, some methods experience performance degradation to varying degrees. This highlights the challenge of policy generalization in the SE-CRL problem.
- In Figure 4, the performance drop of IND is due to its independent training approach, where it starts relearning from a random policy when switching to a new task (note that the curves have been smoothed).
- For FT, the performance decline occurs mainly when testing on previous tasks after training on subsequent tasks. This is because the parameter updates from later tasks affect the policies of earlier tasks. For instance, if the highest probability action was originally "up" with "left" and "right" having lower probabilities, the introduction of a new diagonal-right action could result in it having the highest probability, reducing the probability differences among the original actions and leading to a performance drop.
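A tiny numeric illustration of this probability-shift effect (the logit values below are made up, not taken from the paper):

```python
import torch

old_logits = torch.tensor([2.0, 1.0, 1.0])          # up, left, right
print(torch.softmax(old_logits, dim=0))             # ~[0.58, 0.21, 0.21]: "up" clearly dominates

new_logits = torch.tensor([2.0, 1.0, 1.0, 2.3])     # a new diagonal action gets a large logit
print(torch.softmax(new_logits, dim=0))             # ~[0.32, 0.12, 0.12, 0.44]: the new action
                                                    # now has the highest probability and the
                                                    # gaps among the original actions shrink
```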
Q2: Figures Description
In Figure 4(b), 0-3M refers to the period during which the agent is trained on Task 1 and tested on Task 2. This is a common experimental setup in CRL. The purple curve represents the performance of the Mask method. Its differing behavior may be attributed to its modulating masking mechanism, which is relatively complex. Additionally, this method was not specifically designed for handling changes in the action space. While it has strong anti-forgetting capabilities, its learning speed is slower (this can be observed by comparing Figures 2-4 in [5] with Figure 6 in our paper).
Q3: Figures Description
We apologize for any confusion caused. In Figure 4(c), 0-6M refers to the training phase where the agent is trained on other tasks (0-3M on Task 1 and 3-6M on Task 2) and tested on Task 3, while 6-9M represents the training and testing of the agent on Task 3. We added more detailed explanations of the experimental setup in the revised version to aid understanding.
Q4: Methods Description
There are some differences between the FT and Mask methods and your description. More explanations of all compared methods are provided in Appendix B.2.
- The FT method involves fine-tuning all the network parameters on subsequent tasks after training on the previous task. To accommodate different action spaces, the network's output layer is initialized to the size of the union of all action spaces, and adaptation is achieved by masking the output layer neurons (see the sketch after this list).
- The Mask method applies modulating masks to the network structure and leverages previously learned knowledge on new tasks through a linear combination of learned masks. To handle changes in the action space, we also implement this by initializing the network to the size of the union of all action spaces. The specific implementation details are available in the supplementary material, which includes the code for all comparison methods.
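A minimal sketch of the masking idea described above for the FT baseline, assuming logits over the union of all action spaces; the function and variable names are illustrative, not the authors' code.

```python
import torch

def masked_action_distribution(logits, available):
    """Exclude unavailable actions by masking their logits (illustrative sketch).

    logits:    tensor of shape [batch, |union of all action spaces|]
    available: boolean tensor of the same width, True where the action
               exists in the current task's action space.
    """
    masked = logits.masked_fill(~available, float("-inf"))
    return torch.distributions.Categorical(logits=masked)

# Example: a union of 5 actions, of which only the first 3 are currently available.
logits = torch.randn(1, 5)
available = torch.tensor([True, True, True, False, False])
dist = masked_action_distribution(logits, available)
action = dist.sample()     # never selects a masked action
```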
Reference:
[1] Wei Ding, Siyang Jiang, Hsi-Wen Chen, and Ming-Syan Chen. Incremental reinforcement learning with dual-adaptive ε-greedy exploration. AAAI. 2023.
[2] Wang, Liyuan, et al. A comprehensive survey of continual learning: theory, method and application. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2024.
[3] Mundt, Martin, et al. A wholistic view of continual learning with deep neural networks: Forgotten lessons and the bridge to active and open world learning. Neural Networks. 2023.
[4] Powers, Sam, et al. CORA: Benchmarks, baselines, and metrics as a platform for continual reinforcement learning agents. CoLLAs. 2022.
[5] Ben-Iwhiwhu, Eseoghene, et al. Lifelong Reinforcement Learning with Modulating Masks. Transactions on Machine Learning Research. 2023.
[6] Tomilin, Tristan, et al. COOM: a game benchmark for continual reinforcement learning. NeurIPS. 2024.
The authors introduce a novel framework for continual reinforcement learning (CRL) where the agent's action space evolves over time. The authors propose ARC-RL, which leverages self-supervised learning to build an action representation space, allowing the agent's policy to adapt to changes in the action space without catastrophic forgetting. The paper claims to make significant contributions to the field of CRL by addressing the challenge of policy generalization across different action spaces.
Strengths
- The setting of continual learning in tasks with varying action spaces is interesting and valuable for research.
- The writing is very clear, making it easy for readers to understand main context.
Weaknesses
- The experimental validation is limited, with only a few scenarios tested. The authors only evaluated contraction and expansion in Minigrid and contraction in the Procgen Fish environment. It would be beneficial to construct 4-5 discrete action space scenarios, including other environments in Minigrid and Procgen benchmarks, as well as 2-3 continuous action space scenarios, such as Ant with varying leg counts (4, 6, or 8 legs).
- The proposed regularization method shows minimal impact, as evidenced by Table 1.
- While I appreciate the simplicity of the overall method, it does not demonstrate significant performance improvements over the baseline.
- Lack of discussion on related work in continual learning and continual reinforcement learning.
Questions
In Section 2.2, the paper states that the only difference between tasks is the action space. However, the transition function is defined as $P: \mathcal{S} \times \mathcal{A} \rightarrow \Delta(\mathcal{S})$, and thus, when the action space changes, the transition function also changes. It seems that overlooking the potential changes in the transition function between tasks has led to the limited improvements of the current method.
Thank you for your thorough review and insightful comments. We appreciate the opportunity to address your concerns and clarify aspects of our work. Below are our responses to each of your points.
W1: Experimental Validation
- Thank you for your valuable suggestions regarding the experimental validation. Due to space constraints, we included only three types of experiments in the main text. However, more experiments can be found in Appendix C.2, C.3, and C.6. To strengthen the validation, we have added another class of environments from the MiniGrid and Procgen benchmarks in the revised version.
- While continuous action spaces are indeed important for RL, this paper focuses solely on discrete action spaces. We plan to extend our work to continuous action spaces in future research, and we have added this as future work in Appendix E.2.
W2: Regularization Method
- We have conducted some analysis of the impact of regularization in Section 5.3, focusing on its role in enhancing the stability of the policy. Although the effect may not be evident in the forgetting and forward transfer, the continual return shows a noticeable improvement. Specifically, in the contraction situation, the performance evaluation of the agent improves by 0.07 with the inclusion of regularization.
- Additionally, we provide experiments and analysis on the regularization coefficient in Appendix C.4. Figures 13(b) and 13(d) illustrate how different regularization coefficients impact the agent's performance in the contraction situation.
W3: Performance Improvements over Baseline
Thank you for appreciating the simplicity of our method. We acknowledge that the performance improvement over the baseline is somewhat dependent on the complexity of the environment. As shown in Table 1, in the simpler Minigrid environment, our method achieves an average improvement in continual return of 0.04 (expansion situation) and 0.1 (contraction situation) over the best alternative methods. In the more complex bigfish environment, this gap becomes more pronounced. As presented in Table 2, for the contraction scenario, most other methods achieve continual returns between 1-2, while our method attains an average return of over 10, improving by approximately 7 over the best alternative method.
W4: Related Works
Thank you for your critique. We have added a section in the revised version that introduces works most relevant to ours.
Q1: Transition Function
You are correct that changes in the action space can lead to changes in the transition function for each task. However, this change is implicit rather than direct. In most CRL research, changes in the transition function do not typically involve changes in the action space, which is a key distinction between our approach and other CRL works. We have revised the relevant wording in Section 3.2 to clarify this.
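To make this distinction explicit in generic MDP notation (which may differ from the symbols used in the paper's Section 3.2): each task shares the same underlying transition kernel, and only its domain is restricted by the currently available action set.

```latex
% Generic notation; the paper's symbols in Section 3.2 may differ.
\mathcal{M}_k = \bigl(\mathcal{S},\; \mathcal{A}_k,\; P\big|_{\mathcal{S}\times\mathcal{A}_k},\; R,\; \gamma\bigr),
\qquad \mathcal{A}_k \subseteq \mathcal{A},
\qquad P(s' \mid s, a) \text{ unchanged for } a \in \mathcal{A}_k .
```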
Thank you for your response, but regarding the current paper, I choose to maintain my score. My view is that if the authors believe their method is viable, they need to support it with extensive experiments, rather than just the limited-scale experiments presented in the paper. Alternatively, the authors should have identified some new problems with varying action spaces and developed methods to solve them. However, the paper fails to meet either of these criteria.
Thank you for your follow-up response and for sharing your perspective. We understand the importance of robust experimental validation and the need to identify new problems. We will consider these suggestions in our future works.
The paper proposes Action Representation Continual Reinforcement Learning (ARC-RL), a method for adapting the agent when its action space changes from one task to another. The idea is to train (fine-tune) an encoder-decoder pair at the beginning of each task using a self-supervised learning objective in order to learn the action representation space. Then, the decoder maps this space onto the action set that is currently available to the agent. The authors perform experiments on the empty grid environment in the MiniGrid framework and the Bigfish environment in the Procgen benchmark to demonstrate the usefulness of the proposed approach.
Strengths
- The problem that the paper considers is very important in continual reinforcement learning, especially in cases when skills are built and added to the agent’s base action set in a continual manner (although this is not explored in the paper).
- Albeit not novel, the idea of learning action representation space using self-supervised learning objectives is a very simple idea and yields many useful benefits for training deep neural networks.
- The regularized fine-tuning trick, which combines the ideas from elastic weight consolidation and the self-supervised learning objective is new. It is also useful in continual RL to retain past learnings in order to adapt faster when the past is reencountered.
Weaknesses
- The first major weakness of the paper is in the claims it is making, especially in the novelty contribution:
- The proposed approach for learning the action representation using the self-supervised learning objective is not new. There are many works in this literature including the one that the paper cites that learns the action representation space [1]. The application of this technique to continual reinforcement learning is not novel either. [2], which is by the same author as [1], proposes a very close approach to the one introduced in this paper. The algorithm is well-studied theoretically in their case. I encourage the authors to check [2] and the papers that cite it in order to better place the proposed approach within the existing literature. It is also crucial to use the method introduced in [2] as a baseline to better understand the benefits of the proposed approach.
- The problem of changing action spaces in continual reinforcement learning is not new either, contrary to the paper's claim. [2] and [3] have discussed this as a subproblem within RL, and the proposed SE-CRL's description fits well within those.
- The experiments are performed on very basic environments and the conclusions drawn from them do not match the plots presented in Fig 4 and Fig 5:
- In Fig 4, for ARC-RL, the agent reaches optimal performance on tasks 2 and 3 when it is trained on task 1. No further boost in performance is observed when new actions are added to the action set. This raises the question of whether the agent is making use of new available actions in order to improve its policy.
- In Fig 5, on task 2, the performance of ARC-RL decreases when the agent is trained on it, perhaps because the action space is reduced from seven to five. But the performance doesn’t drop when the agent’s action space is further reduced to three; it stays the same. This is a very strange behaviour.
- The environments used are too simple and are not well-designed to support the method. Besides, in Sec 2.2, four categories are introduced when the action space changes, but only two of them are used to demonstrate the applicability of the proposed approach experimentally.
- There is no description of the baselines used or how the hyperparameters in them are tuned. Besides, some of the baselines don’t offer much value in terms of understanding the results of ARC-RL. I suggest the authors include: (a) a baseline that uses all the actions; and (b) the method proposed in [2].
Questions
Decision:
The paper is not discussed well within the existing literature. The experiments are simple and the conclusions from them don’t match the performance curves. Therefore, I recommend a clear rejection.
Areas of improvement:
- In the introduction, the paper says "[...] assumes that the agent's capabilities remain static [...]". This is incorrect! It is the opposite: in CRL, the desideratum for the agent is continuous adaptation, as discussed in some foundational papers in CRL, e.g., [4] and [5].
- The paper should provide a summary of the task description in the main paper. It is unclear how the actions are changing from one task to another.
Questions:
- How does the zero-shot generalization metric (forward transfer) relate to the jump-start objective that is commonly used in CRL papers (e.g., [4] and [6])?
- What does the double union symbol denote in Sec 3.2 (just before Eq. 4)?
- In the caption of Fig 2, the paper says “After the action space changes, the number of actions changes, while the probability distribution is relatively stable.” What does this sentence mean?
References:
[1] Chandak, Yash, et al. "Learning action representations for reinforcement learning." International conference on machine learning. PMLR, 2019.
[2] Chandak, Yash, et al. "Lifelong learning with a changing action set." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. No. 04. 2020.
[3] Khetarpal, Khimya, et al. "Towards continual reinforcement learning: A review and perspectives." Journal of Artificial Intelligence Research 75 (2022): 1401-1476.
[4] Anand, Nishanth, and Doina Precup. "Prediction and control in continual reinforcement learning." Advances in Neural Information Processing Systems 36 (2024).
[5] Abel, David, et al. "A definition of continual reinforcement learning." Advances in Neural Information Processing Systems 36 (2024).
[6] Taylor, Matthew E., and Peter Stone. "Transfer learning for reinforcement learning domains: A survey." Journal of Machine Learning Research 10.7 (2009).
W2.4: Baselines
- Due to space limitations, we only briefly described the methods compared in the main text. We have provided more detailed descriptions in Appendix B.2. Regarding hyperparameters, we have included more information in B.4 and C.4.
- Our selection of comparison methods includes three types of CRL approaches. This selection not only covers a broad spectrum of existing methodologies but also demonstrates their performance on the newly introduced SE-CRL problem, highlighting the distinctions between SE-CRL and existing CRL challenges.
- We appreciate the reviewer's suggestion regarding baselines and have incorporated a baseline using all actions in the revised version. This baseline does not differentiate between tasks and has access to the largest action space throughout all training phases, thus serving as an upper bound of final performance.
A1: Agent's Capabilities
Thank you for your critique. In our paper, the term "capabilities" refers to the agent's more "physical" abilities. Specifically, the sentence "existing CRL research typically assumes that the agent's capabilities remain static" means that the action space of the agent is static in most CRL research. Indeed, in continual learning, the agent should possess the ability to adapt continuously. To avoid confusion, we have revised the introduction to clarify this distinction and prevent misunderstandings.
A2: Task Description
- We appreciate your suggestion. We have added a summary of the task descriptions in the revised version, with more details about the environments and tasks available in Appendix B.1.
- When the agent switches from one task to another, the set of available actions may either increase or decrease. The agent can only access the currently available action set. For example, in Figure 4, when the agent is on task 1, it can perform only three actions: turn left, turn right, and move forward. However, when the agent is on task 2, two new actions are added: move left and move right.
Q1: Relationship between Forward Transfer and Jump-Start Objective
- The terms "jumpstart" and "jumpstart objective" were first defined in [6]. The former refers to the initial performance of an agent in a target task improved by transfer from a source task, while the latter is a learning objective aimed at selecting a policy that maximizes the expected value function from the initial position.
- Forward transfer is a common metric in continual learning. In recent years, it has become one of the most commonly used additional evaluation metrics in CRL alongside forgetting. Forward transfer quantifies the impact of previous tasks on the performance of subsequent tasks. A value greater than zero indicates zero-shot generalization, while a value less than zero indicates that previous tasks hinder the agent's performance on subsequent tasks.
- Both the jumpstart objective and forward transfer are related to an agent's zero-shot learning ability, but the former is used more as an objective function to optimize the policy, while the latter serves as an evaluation metric for continual learning.
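For reference, a minimal sketch of one common way to compute a forward-transfer score; this is an illustrative formulation, not necessarily the exact metric used in the paper, and the function name and normalization are assumptions.

```python
import numpy as np

def forward_transfer(seq_curve, ref_curve):
    """One common formulation of forward transfer (illustrative, not the paper's exact metric).

    seq_curve: normalized test returns on task i while training proceeds through the sequence.
    ref_curve: normalized returns of a reference agent trained on task i from scratch.
    Positive values indicate earlier tasks helped; negative values indicate they hurt.
    """
    auc_seq = float(np.mean(seq_curve))   # average normalized performance in the sequence
    auc_ref = float(np.mean(ref_curve))   # average normalized performance from scratch
    return (auc_seq - auc_ref) / (1.0 - auc_ref)   # assumes returns normalized to [0, 1]

# Example with returns normalized to [0, 1]:
print(forward_transfer([0.2, 0.6, 0.9], [0.1, 0.4, 0.8]))   # positive: transfer helped
```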
Q2: Meaning of The Symbol
Thank you for pointing this out. In our paper, the double union symbol is intended to denote the union of sets. We acknowledge that this symbol can be misleading, so we have changed it to the standard union symbol to ensure clarity.
Q3: Meaning of Probability Distribution
We apologize for any confusion caused. The phrase "the probability distribution is relatively stable" refers specifically to the distribution of actions within the agent's learned policy function being relatively stable. In the top right corner of Figure 2, you can see that after new actions are added to the action space, the probability distribution of choosing each action from the initial position (represented by the red dashed line) is similar to that in the left figure.
References
[1] Chandak, Yash, et al. Learning action representations for reinforcement learning. ICML. 2019.
[2] Chandak, Yash, et al. Lifelong learning with a changing action set. AAAI. 2020.
[3] Khetarpal, Khimya, et al. Towards continual reinforcement learning: A review and perspectives. JAIR. 2022
[4] Tomilin, Tristan, et al. COOM: a game benchmark for continual reinforcement learning. NeurIPS. 2024.
[5] Gaya, Jean-Baptiste, et al. Building a Subspace of Policies for Scalable Continual Learning. ICLR. 2023.
[6] Anand, Nishanth, and Doina Precup. Prediction and control in continual reinforcement learning. NeurIPS. 2024.
[7] Wang, Liyuan, et al. A comprehensive survey of continual learning: theory, method and application. IEEE TPAMI. 2024.
[8] Parisi, German I., et al. Continual lifelong learning with neural networks: A review. Neural networks. 2019.
[9] Wolczyk, Maciej, et al. Disentangling transfer in continual reinforcement learning. NeurIPS. 2022.
We appreciate your thoughtful comments and insightful critiques, and we thank you for your extensive questions. We acknowledge that our work builds extensively on existing works of reinforcement learning, and hope that our point-by-point response can clarify the differences between our work and others.
W1: Novelty Contribution
W1.1: Self-Supervised Learning
Thank you for your suggestion.
- Recently, self-supervised learning (SSL) has emerged in RL to address the issues of sample efficiency and generalizability by learning a compact and informative representation of the environment. This is a common strategy, but its performance and mechanisms in CRL are still worth exploring. Moreover, we explore the policy generalization of action representations learned by SSL in the context of CRL with continually changing action spaces.
- We have added a section introducing related works on SSL in RL and CRL, including [2]. Additionally, we clarify the differences between our work and similar works in the introduction. In fact, the problem in [2] is different from our proposed problem. Although we attempted to use the method in [2] as a baseline for comparison, we encountered significant discrepancies between the experimental results obtained using their provided code and the results reported in their paper. We are currently in contact with the authors to seek assistance in resolving this issue.
W1.2: Similar Problems
The survey [3] provides a formal definition of the general form of the CRL problem, and although it involves non-stationary action spaces, it does not delve deeper into this issue, nor does it explain how changing action spaces differ from other forms of CRL. We acknowledge that our work involves a subproblem of the general form discussed in that survey, but we believe this subproblem warrants separate investigation. The problem studied in [2] can be considered a subproblem of ours. The "lifelong learning" in [2] is different from "continual reinforcement learning" as described in many CRL works.
- In CRL, the agent not only needs to continuously adapt to changing environments (new tasks) but must also avoid catastrophic forgetting and continue to perform well on previous tasks. Therefore, our experiments evaluate continual return, forgetting, and forward transfer metrics. In contrast, [2] only considers the agent's adaptation to new tasks.
- [2] only investigates expanding action spaces and does not explore more complex changes in action spaces. Our work considers more types of action space changes, including experiments on contracting action spaces (Figure 5, 6, and Appendix C.2). As described at the end of the first paragraph of the introduction, the contraction situation is also common, such as damage to the hardware of a robot.
W2: Experiments and Results
W2.1: Performance Improvement in the Expansion Situation
Since the tasks in Figure 4 are relatively simple, the ARC-RL agent can transfer to tasks 2 and 3 without additional training after being trained on task 1, which is why no further performance improvement is observed with the addition of new actions. In addition, whether additional actions can further improve performance is also related to the impact of these actions on the policy. In the Bigfish environment (Figure 15), it is evident that as the action space expands, the agent's expected return further increases. We have added more descriptions in the revised version to enhance understanding.
W2.2: Performance Decrease in the Contraction Situation
- In Figure 5, the agent's performance decreases initially when trained on task 2, possibly due to the reduction of the action space from seven to five. However, later in the training on this task, the agent gradually adapts to the environment, resulting in some performance improvement.
- When the agent is trained on task 2, its performance on task 3 remains stable, which might be because the knowledge learned from tasks 1 and 2 can be more easily transferred to task 3. Additionally, the similarity between tasks can affect knowledge transfer and forgetting across different tasks. We provide more detailed experimental results and preliminary analysis in C.1.
W2.3: Experimental Validation
- Thank you for your critique. We have added two additional experiments from Minigrid and Procgen in Appendix C.7 of the revised version.
- In Section 3, we acknowledge that ARC-RL is a further step in addressing the problem of SE-CRL. Building on the existing research on the first type of problem, we further explore the second type of problem and the combination of the first two types in the context of CRL (Appendix C.2). In addition, we discuss the possibility of extending our method to other situations in Appendix E.1.
I thank the authors for responding to my review. The changes made in the paper are welcome, and the rebuttal cleared some of my questions. However, I believe the current draft is not ready for publication: experiments (old and new) still don't provide conclusive evidence and raise more questions, and the contributions, along with the discussion of relevant literature, are incomplete/improper. So, I maintain my initial score and recommendation.
Here are some items that could help the authors improve their draft for future submission:
- The discussion of relevant literature can be broken down into two components: the problem and the solution. [1] and [2] are highly relevant, and the proposed problem SE-CRL should be discussed using the ideas and frameworks proposed in those works. Neither [1] nor [2] assumes that the action set is only expanding, although other changes to the action set may not have been considered for experimentation in those works. For the solution, the authors should discuss [2], [3], and several other works suggested by other reviewers in detail. This should be complemented with a discussion of SSL-based methods for CRL. Then, the authors can briefly mention some of the latest fundamental papers, like [4] and [5], to highlight the active interest in CRL research, followed by a brief discussion of seminal papers in the field, like [6].
- I believe the authors’ approach to combining the SSL-based technique with EWC is novel. The paper should focus on the discussion about novelty within this space to avoid any misunderstanding.
- Because the paper is empirical, I strongly recommend performing rigorous experiments to provide enough evidence to support the claims. This translates to breaking the experiments down into three groups for comprehensive analysis: small (e.g., MiniGrid), medium (e.g., Procgen), and large-scale experiments (e.g., robotics or Crafter-like). In each group, purposeful baselines should be selected in order to make meaningful comparisons. Must-have baselines include [2], the union of action sets (which the authors have now included), and EWC and its variants (already in the paper), in addition to the ones that are already in the paper. The results of these experiments should be conclusive and avoid any confusion, such as the points raised in the initial review. For example, W2.1 in the rebuttal still doesn't answer why new and better actions don't yield superior performance when they are made available in tasks 2 and 3. Or, why is there performance degradation in one of the future tasks and not both (W2.2)? Besides, some of the baselines match and outperform the proposed method consistently (Fig. 15). So it is unclear from the current set of experiments whether the proposed approach is useful.
References:
[1] Khetarpal, Khimya, et al. "Towards continual reinforcement learning: A review and perspectives." Journal of Artificial Intelligence Research 75 (2022): 1401-1476.
[2] Chandak, Yash, et al. "Lifelong learning with a changing action set." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. No. 04. 2020.
[3] Chandak, Yash, et al. "Learning action representations for reinforcement learning." International conference on machine learning. PMLR, 2019.
[4] Anand, Nishanth, and Doina Precup. "Prediction and control in continual reinforcement learning." Advances in Neural Information Processing Systems 36 (2024).
[5] Abel, David, et al. "A definition of continual reinforcement learning." Advances in Neural Information Processing Systems 36 (2024).
[6] Kirkpatrick, James, et al. "Overcoming catastrophic forgetting in neural networks." Proceedings of the national academy of sciences 114.13 (2017): 3521-3526.
We sincerely appreciate the reviewer’s thorough examination of our rebuttal and the valuable suggestions provided to improve our paper. We are grateful for the acknowledgment of our novelty.
- Literature Discussion: We recognize the need for a more comprehensive discussion of relevant literature, particularly concerning the problem of our work.
- Experiments: We agree that rigorous and comprehensive experiments are crucial to substantiate our claims. Your suggestion to categorize experiments into small, medium, and large-scale groups is highly valuable. We will meticulously plan future experiments to include more baselines.
Given the limited time for revisions, we have decided to take your constructive feedback into account and prepare more thoroughly for a future submission. We are confident that with the enhancements suggested, our work will be significantly strengthened and better positioned for acceptance. Thank you once again for your insightful feedback and guidance.
We are very grateful to the reviewers for carefully reviewing our paper and providing constructive comments and suggestions. We have revised the paper carefully according to the comments and suggestions and have marked the changed parts in red. Our response to individual reviewers can be found in the personal replies, but we would also like to make a brief summary of revisions for your convenience.
- Novelty and Contribution: We have revised our clarification on the novelty and contribution. Our work explores the issue of changing action spaces within the context of continual reinforcement learning by proposing a more generalized problem formulation and an initial solution. While existing works are limited to the expansion of the action space, our research defines four situations of action space changes. Moreover, existing methods predominantly aim to enhance the agent's adaptability to future tasks, often neglecting performance on previously learned tasks, which is not entirely aligned with the objectives of continual learning. Our method not only focuses on performance in future tasks but also emphasizes maintaining performance on past tasks, effectively balancing the trade-off between avoiding catastrophic forgetting and facilitating knowledge transfer. This method better fulfills the requirements of continual reinforcement learning.
- Related Works: A new section has been added to discuss related works, particularly those involving self-supervised learning and continual reinforcement learning. The comparison with existing studies highlights the significance and innovation of our research on the problem of changing action spaces in continual reinforcement learning and our proposed method.
- Experimental Validation: We have expanded our experimental validation by including additional environments from the MiniGrid and Procgen benchmarks. The results are largely consistent with the previous ones. Due to the tight time limit, we conducted only partial experiments and will supplement them further in the final version.
- Experimental Settings and Baseline: We have incorporated more detailed descriptions of the tasks and experimental settings, enhancing the readability of our methodology. Moreover, we have included a baseline using the union of all action spaces for comparison.
Additionally, it seems that some reviewers have misunderstood the meaning of our experimental results, and we apologize for any confusion this may have caused. To aid reviewers in understanding, we provide a detailed explanation using Figure 4 as an example: The agent is trained sequentially on Task 1, Task 2, and Task 3, with each task involving 3 million training steps. During training on a specific task, the agent is also tested on all tasks. Subplots (a), (b), and (c) in Figure 4 illustrate the expected returns of the agents' testing on Tasks 1, 2, and 3 over timesteps, respectively. The blue-shaded rectangular areas indicate the agent's training phase on the corresponding task, and we have enhanced its visibility in the modified version.
We hope our response has addressed your concerns, but please let us know if we missed anything.
Dear ICLR 2025 Committee,
We have decided to withdraw our submission following the comments and feedback from the reviewers. While we highly value the constructive insights and rigorous standards set by the reviewers, time constraints prevent us from completing the recommended experiments during the current discussion phase. We plan to revise our work in light of the suggestions provided and resubmit it to another venue in the future.
We sincerely thank the reviewers and the area chair for their time, effort, and feedback.
Best regards, The authors