PaperHub
Average rating: 4.8/10 · Rejected · 4 reviewers (lowest 3, highest 8, std. dev. 2.0)
Individual ratings: 3, 8, 5, 3
Confidence: 3.5 · Correctness: 2.3 · Contribution: 2.0 · Presentation: 1.8
ICLR 2025

Solving Continual Offline RL through Selective Weights Activation on Aligned Spaces

OpenReview · PDF
Submitted: 2024-09-26 · Updated: 2025-02-05
TL;DR

We propose to solve continual offline RL with various state and action spaces by adopting selective weights activation on the quantized alignment spaces.

Abstract

Keywords
continual offline reinforcement learning

Reviews and Discussion

Official Review
Rating: 3

The paper proposes a composite method to perform continual offline reinforcement learning by means of denoising diffusion probabilistic models (DDPMs). Each CORL task can have different observation and action spaces, which are aligned through a mechanism termed quantized space alignment (QSA). Continual learning performance is ensured by using a masking strategy (selective weights activation, SWA) that assigns distinct weights in the DDPM to distinct tasks. The paper presents an experimental evaluation on various established continual RL benchmarks showing that the proposed method performs better than SOTA, as well as an ablation and parameter-importance study to identify relevant parameters.

Strengths

The paper presents an interesting approach to continual RL using DDPMs, a rarely used technique that has great merit. In general, modeling of policies by generative models is a very attractive idea. Experimental verification uses established benchmarks and compares to established SOTA methods, producing good results. The ablation study and the parameter importance investigation are interesting as well. The idea of aligning perceptual and action spaces in a common representation is worthy of attention.

Weaknesses

It is really hard to understand the big picture of the article. Figure 1 is not really helping since it does not indicate data flow through the model. It is only by studying the algorithm in the appendix that I could understand what the various components actually do, in which order. That algorithm should be moved to the main text, or else Fig. 1 should be improved and simplified.

This is not really continual learning, as you are processing all data in one go in Alg.1 during the QSA pretraining step. In a CL setting, you would have access to data from task i only when training on task i, but not before. So, from a CL perspective, adapting QSA in one go prior to real RL training is cheating because you are looking into the future.

In the same vein: if you have access to all collected data, why use CL at all and not train everything at once?

It is not clear how this method could scale to more tasks when the mask is filled. It would be very helpful to comment on this, and why you do not simply train a separate DDPM for each task.

Important details are simply missing: 1) how do you use a diffusion model and the inverse dynamics model to generate actions? What role does $\psi$ play? What are the dimensions of obs and action spaces for the different tasks? How do you apply DDPMs to data that are not images (e.g., actions)?

It is unclear how the SOTA methods are implemented, I can find no information at all except the references. What parameter settings? What libraries? Without this information, comparing to SOTA is meaningless since nothing can be reproduced.

It is not made clear what the benefit is of having a single RL controller learn policies for tasks with different state/action spaces. The paper would benefit from examples here.

The language of the paper is often imprecise or slightly incorrect, please be sure to fix this.

More comments:

  • 3.2: the notion of "state sequences" is unclear, and this paragraph is hard to understand. Maybe explain this in more detail. What is $\tau_s^{0:K}$?
  • 3.2: Same goes for the notion of "inverse dynamics model": what is this for? It is never mentioned again...
  • 3.2: "tailored state sequences to facilitate training", what does this mean? Unclear...
  • General: very few details given for DDPM usage: input space, action space, ...
  • General: Continual offline reinforcement learning requires a more in-depth introduction and justification, as offline and reinforcement learning seem to be contradictory concepts. What is the advantage here? What can ORL do that RL cannot, and vice versa?

Questions

  • For evaluation: is it necessary to know what task you are currently performing evaluation for?
  • If no: how do you select the right mask at test time?
Comment

We appreciate your valuable review and respond to your concerns below.

[I]. Explanation of weaknesses

[1/7] W1: It is really hard to understand the big picture of the article. Figure 1 is not really helping since it does not indicate data flow through the model. It is only by studying the algorithm in the appendix that I could understand what the various components actually do, in which order. That algorithm should be moved to the main text, or else Fig. 1 should be improved and simplified.

We have revised Figure 1 to make the training and testing process of the model more specific and accurate. Regarding Figure 1, we should first focus on the QSA module in the bottom right corner. For the dataset of a specific task 1, the QSA module is used to map the state and action spaces into a unified representation. Then, the top left section of Figure 1 illustrates the data flow during training and inference. Finally, the bottom left section of Figure 1 depicts the flow of data within the diffusion model.

[2/7] W2: This is not really continual learning, as you are processing all data in one go in Alg.1 during the QSA pretraining step. In a CL setting, you would have access to data from task i only when training on task i, but not before. So, from a CL perspective, adapting QSA in one go prior to real RL training is cheating because you are looking into the future. In the same vein: if you have access to all collected data, why use CL at all and not train everything at once?

Our method strictly follows the CL training paradigm. At any given training time, the network only has access to the dataset of a single task. Similarly, in the code, our method's training process also adheres to the CL training paradigm.

In the original pseudocode, we aimed to clearly express the hierarchical structure of the QSA module and the SWA module during the training of each task. At the same time, because the QSA module and the SWA module share the same "for each task i" loop, the pseudocode was written in its original form. However, this has evidently caused some misunderstanding. We have revised the pseudocode to describe the entire training process more accurately. Below is a comparison of the original and revised pseudocode, followed by a minimal sketch of the resulting per-task training loop.

The QSA module does not share parameters; all task-specific parameters, including the VQE, VQD, and task-related parts of the codebook, are independent. Therefore, during training, we can train the VQE, VQD, and codebook for each task sequentially. From the algorithm, we can also observe that the outermost loops for both the QSA module and the SWA module are "for each task" (lines 4 and 15). This makes it entirely feasible to train the QSA module sequentially for each task. To better illustrate how the QSA module aligns with the continual learning (CL) training setup, we have provided an additional Figure 15 to visualize the training process of the QSA module.

Original pseudocode: For each task i (line 4):

  • Train QSA module
  • Save QSA module

For each task i (line 15):

  • Train SWA module
  • Save SWA module

------------------------reformulate------------------------>

New pseudocode: For each task i:

  • Train QSA module
  • Save QSA module
  • Train SWA module
  • Save SWA module
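
To make the revised loop concrete, here is a minimal, hypothetical sketch of the per-task training order in Python; `train_qsa` and `train_swa` are placeholder callables standing in for the QSA and SWA training routines, not the released code:

```python
def continual_training(task_datasets, train_qsa, train_swa):
    """Sequential CL loop: only the current task's dataset is visited at any time.

    train_qsa(dataset) -> task-specific QSA module (VQE, VQD, codebook)
    train_swa(qsa, dataset, task_id) -> (masked diffusion policy weights, task mask)
    """
    checkpoints = []
    for i, dataset in enumerate(task_datasets):
        qsa = train_qsa(dataset)                            # align spaces for task i only
        policy, mask = train_swa(qsa, dataset, task_id=i)   # masked diffusion training
        checkpoints.append({"qsa": qsa, "policy": policy, "mask": mask})
    return checkpoints
```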

[3/7] W3: It is not clear how this method could scale to more tasks when the mask is filled. It would be very helpful to comment on this, and why you do not simply train a separate DDPM for each task.

In continual learning, the parameter isolation (i.e., structure-based) approach we adopt [1-4] is widely used to reduce catastrophic forgetting. Previous methods [5-8] aim to use a single model to gradually master solutions for multiple tasks following the CL training paradigm. In these approaches, the number of CL tasks is predefined, and they do not consider how to learn new tasks beyond the predefined CL task sequence. To address this challenge, we propose two solutions.

  • Pruning existing masks, suitable for a small number of new tasks (refer to Figure 12 in our paper; a minimal sketch follows this list): Instead of increasing the number of masks, we prune the existing masks. Specifically, for each task, we identify less important weights, such as those with small absolute values, and release the corresponding masks for training on new tasks.
  • Introducing new masks, suitable for a large number of new tasks: In this approach, we keep the learned model parameters and masks unchanged and use additional parameters and masks to learn the new tasks.
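
As a rough illustration of the first option, a magnitude-based pruning step might look like the following sketch (a hypothetical helper, assuming boolean per-task masks over a shared weight tensor; not the paper's implementation):

```python
import torch

def release_mask_capacity(weight, task_masks, prune_ratio=0.1):
    """Free a fraction of each task's mask by releasing its smallest-magnitude weights."""
    released = torch.zeros_like(weight, dtype=torch.bool)
    for mask in task_masks:                      # one boolean mask per learned task
        owned = weight[mask].abs()               # magnitudes of the weights owned by this task
        k = int(owned.numel() * prune_ratio)
        if k == 0:
            continue
        threshold = owned.kthvalue(k).values     # k-th smallest magnitude
        to_release = mask & (weight.abs() <= threshold)
        released |= to_release                   # these positions become free for new tasks
        mask &= ~to_release                      # shrink the task's own mask in place
    return released
```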
Comment

[4/7] W4: Important details are simply missing: 1) how do you use a diffusion model and the inverse dynamics model to generate actions? What role does $\psi$ play? What are the dimensions of obs and action spaces for the different tasks? How do you apply DDPMs to data that are not images (e.g., actions)?

To help you better understand how the diffusion model is utilized during the training and inference phases, we provide a detailed explanation starting from data organization. For the aligned state and action features, we process the original state and action vectors with the QSA module during training and inference.

In the training phase:

    1. For each trajectory in the dataset, assuming the trajectory length is $N$, we calculate the cumulative discounted return $\sum_{t'=t}^{N}\gamma^{t'-t} r_{t'}$ from each timestep $t$ in the trajectory using a discount factor $\gamma$, obtaining $\bar{R}_{1:N}$.
    2. After the computation in step 1, we normalize the $N$ cumulative discounted returns to the range [0, 1] and associate the normalized returns $R_{1:N}$ with the state at each timestep, $\{s_t, R_t\}$.
    3. Sample a batch from the dataset containing sequences of length $T_e$ consisting of states, actions, and their corresponding returns, i.e., $\{s_{t:t+T_e}, a_{t:t+T_e}, R_{t:t+T_e}\}$.
    4. Use $\tau^0_s = s_{t:t+T_e}$ as the input to the diffusion model and $R_{t:t+T_e}$ as its condition. Compute the diffusion model's loss and update the diffusion model parameters.
    5. Use $s_{t:t+T_e-1}$ and $s_{t+1:t+T_e}$ as inputs to the inverse dynamics model, and use $a_{t:t+T_e-1}$ as the output targets to compute the inverse dynamics model's loss and update its parameters.
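
A minimal PyTorch-style sketch of steps 3-5 on aligned features follows; `diffusion.loss` and `inv_dyn` are illustrative placeholders for the diffusion model's training loss and the inverse dynamics network, not the released code:

```python
import torch.nn.functional as F

def training_step(diffusion, inv_dyn, optimizer, batch):
    """One gradient step on a sampled batch {s_{t:t+Te}, a_{t:t+Te}, R_{t:t+Te}}."""
    s, a, R = batch["states"], batch["actions"], batch["returns"]  # s, a: [B, Te, dim]; R: [B, Te]

    # Step 4: denoise the state sequence tau^0_s = s conditioned on the returns R.
    loss_diffusion = diffusion.loss(x0=s, condition=R)

    # Step 5: predict a_t from (s_t, s_{t+1}) with the inverse dynamics model.
    a_pred = inv_dyn(s[:, :-1], s[:, 1:])
    loss_inverse = F.mse_loss(a_pred, a[:, :-1])

    loss = loss_diffusion + loss_inverse
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```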

In the inference phase:

    1. Receive the current state $s_t$ from the environment.
    2. Sample a state sequence $\hat{s}_{t:t+T_e}$ of length $T_e$ from a standard Gaussian distribution and replace the first state with $s_t$.
    3. Obtain the diffusion model's input $\tau^K_s = [s_t, \hat{s}_{t+1:t+T_e}]$.
    4. Following previous studies [9], we set the return condition $R = 0.8$. Then, we feed $\tau^K_s$ and $R$ into the diffusion model.
    5. At each generation step, we use the diffusion model to calculate the noise $\epsilon_\theta(\tau^k_s, k, R)$.
    6. Obtain $\tau^{k-1}_s$ by performing the generation step with the diffusion model. After each generation step, we replace the first state of the output with $s_t$.
    7. At the end of the generation process, we obtain $\tau^0_s = [s_t, \bar{s}_{t+1:t+T_e}]$.
    8. Finally, we use $s_t$ and $\bar{s}_{t+1}$ as the inverse dynamics model's input to obtain the action $\bar{a}_t$.
    9. Interact with the environment using $\bar{a}_t$.
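
The inference loop above can be summarized with the short sketch below (a simplified DDPM-style illustration; `predict_noise`, `denoise_step`, and `inv_dyn` are hypothetical helpers, and the return condition R = 0.8 follows [9]):

```python
import torch

@torch.no_grad()
def plan_action(diffusion, inv_dyn, s_t, horizon, n_steps, R=0.8):
    """Generate an action by denoising a return-conditioned state sequence."""
    # Steps 2-3: start from Gaussian noise and pin the first state to the observed s_t.
    tau = torch.randn(1, horizon, s_t.shape[-1])
    tau[:, 0] = s_t

    # Steps 5-6: iterative denoising, re-imposing the current state after every step.
    for k in reversed(range(n_steps)):
        eps = diffusion.predict_noise(tau, k, R)    # epsilon_theta(tau^k_s, k, R)
        tau = diffusion.denoise_step(tau, eps, k)   # tau^k_s -> tau^{k-1}_s
        tau[:, 0] = s_t

    # Steps 7-9: inverse dynamics on the first generated transition yields the action.
    a_t = inv_dyn(tau[:, 0], tau[:, 1])
    return a_t
```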

For all experiments, the quantized state and action representations have dimensions 20 and 10, respectively.

[5/7] W5: It is unclear how the SOTA methods are implemented, I can find no information at all except the references. What parameter settings? What libraries? Without this information, comparing to SOTA is meaningless since nothing can be reproduced.

All the comparison methods used in this paper utilize their official codebases. Based on these, we use different datasets for training. For example:

[6/7] W6: It is not made clear what the benefit is of having a single RL controller learn policies for tasks with different state/action spaces. The paper would benefit from examples here.

Learning in the aligned space offers several advantages:

  • It significantly expands the applicability of the model, eliminating the limitations imposed by task-specific state and action spaces.
  • It facilitates learning multiple tasks sequentially with a single model, removing the need to train a separate model for each task.
  • It is more efficient to master multiple tasks and make decisions with a single model.
Comment

[7/7] W7: More comments: 3.2 the notion of "state sequences" is unclear, and this paragraph is hard to understand. Maybe explain this in more detail. What is $\tau_s^{0:K}$? 3.2 Same goes for the notion of "inverse dynamics model": what is this for? It is never mentioned again... 3.2 "tailored state sequences to facilitate training", what does this mean? Unclear... General: very few details given for DDPM usage: input space, action space, ... General: Continual offline reinforcement learning requires a more in-depth introduction and justification, as offline and reinforcement learning seem to be contradictory concepts. What is the advantage here? What can ORL do that RL cannot, and vice versa?

We have revised Section 3.2, where the state sequence is a series of state vectors composed of consecutive time steps, e.g., $\tau_s = \{s_t, s_{t+1}, ..., s_{t+T_e}\}$. $\tau_s^0$ represents the state sequence whose state vectors come from the environment. $\tau_s^k = \sqrt{\bar{\alpha}_k}\,\tau_s^0 + \sqrt{1-\bar{\alpha}_k}\,\epsilon$ is the noised state sequence, which is used to train the diffusion model.
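
For reference, the noising operation above reduces to a short sketch, assuming a precomputed cumulative schedule $\bar{\alpha}_{1:K}$ (illustrative only, not the released code):

```python
import torch

def noise_state_sequence(tau_0, alpha_bar, k):
    """tau^k_s = sqrt(alpha_bar_k) * tau^0_s + sqrt(1 - alpha_bar_k) * eps, with eps ~ N(0, I)."""
    eps = torch.randn_like(tau_0)
    return alpha_bar[k].sqrt() * tau_0 + (1.0 - alpha_bar[k]).sqrt() * eps
```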

Offline reinforcement learning (offline RL), also known as batch RL, is a well-established subfield of RL [9]. It focuses on learning effective policies from pre-collected datasets without requiring interaction with the environment during training, which is particularly useful in scenarios where real-time data collection is impractical, expensive, or risky, such as healthcare, robotics, and autonomous driving [10-11]. While traditional online RL involves active exploration and interaction with the environment to learn policies, offline RL operates on the principle of leveraging fixed datasets to approximate the optimal policy [12]. Far from being contradictory, offline RL extends the applicability of RL to settings where interaction is limited or unavailable. Recent advances in offline RL, including techniques like behavior regularization and conservative policy optimization [9, 15], demonstrate its theoretical soundness and practical significance.

Thus, offline RL is not contradictory to reinforcement learning but rather an essential component of its broader framework, enabling learning in real-world, constrained, or high-stakes applications.

Continual offline reinforcement learning (CORL) is highly significant in advancing reinforcement learning research and its practical applications [13-14]. CORL faces two critical challenges in RL: catastrophic forgetting and distribution shift; the former comes from continual learning, and the latter comes from offline RL [15].

  • Many real-world applications, such as robotic control and autonomous driving, involve sequential tasks with evolving objectives or constraints [11, 16]. In such settings, interacting with the environment to retrain models (as in traditional RL) is often impractical due to safety, resource, or ethical considerations. CORL allows the model to adapt to new tasks using offline data while retaining knowledge from prior tasks, making it ideal for these scenarios.

  • CORL pushes the boundaries of RL research by requiring novel techniques to align state and action spaces across tasks, mitigate distributional shifts, and efficiently reuse learned knowledge [17-18]. It also necessitates robust mechanisms for handling task boundaries and building generalizable representations, which are foundational problems in RL.

  • Through training on existing datasets for continual adaptation, CORL reduces the need for costly and time-consuming data collection processes [19-20]. This makes it highly practical for industries aiming to deploy RL solutions without continuous environment interaction or retraining.

[II]. Explanation of questions

[1/1] Q1: For evaluation: is it necessary to know what task you are currently performing evaluation for? If no: how do you select the right mask at test time?

The research scope of this paper is task-aware continual learning, where the boundaries between tasks are known. During evaluation, the model is aware of which specific task is being evaluated.

Comment

[1] A Definition of Continual Reinforcement Learning

[2] Continual world: A robotic benchmark for continual reinforcement learning

[3] Towards Continual Reinforcement Learning: A Review and Perspectives

[4] Three scenarios for continual learning

[5] One Size Fits All for Semantic Shifts: Adaptive Prompt Tuning for Continual Learning

[6] Continual Diffusion with STAMINA: STack-And-Mask INcremental Adapters

[7] Split & Merge: Unlocking the Potential of Visual Adapters via Sparse Training

[8] Continual Task Allocation in Meta-Policy Network via Sparse Prompting

[9] Conservative Q-Learning for Offline Reinforcement Learning

[10] A Minimalist Approach to Offline Reinforcement Learning

[11] Offline Reinforcement Learning with Implicit Q-Learning

[12] Offline reinforcement learning: Tutorial, review, and perspectives on open problems

[13] Offline reinforcement learning as one big sequence modeling problem

[14] A Comprehensive Survey of Forgetting in Deep Learning Beyond Continual Learning

[15] Behavior regularized offline reinforcement learning

[16] OER: Offline Experience Replay for Continual Offline Reinforcement Learning

[17] Solving Continual Offline Reinforcement Learning with Decision Transformer

[18] Continual Offline Reinforcement Learning via Diffusion-based Dual Generative Replay

[19] Awac: Accelerating online reinforcement learning with offline datasets

[20] Learning to influence human behavior with offline reinforcement learning

Comment

Dear authors, thank you for your clarifications and modifications. I will respond to your comments first:

  • W1: ok, accepted.
  • W2: point taken
  • W3: the question remains: why not train a separate model for each task? It would be way easier and clearer, and probably more powerful, too.
  • W4: point taken
  • W5: just using official code is not enough. You need to document what parameter settings you used and where you adapted the code, if applicable. If you run the author's code with the authors' parameter settings, then results may be suboptimal because you did not adapt parameters to your problem. That is something that you need to do, too, otherwise the comparison is unfair.
  • W6: I strongly disagree, a single controller should be less efficient, whereas one controller per task would be the optimal case since this controller does not have to incorporate knowledge about other tasks. All in all, the necessity of having a single controller for problems that "live" in different state/action spaces seems very contrived to me, and the arguments you provide are not convincing to my mind.
  • W7: again I disagree, the concept of offline reinforcement learning must be motivated since the essence of RL is in exploration. Some of the points you provide could make a case for offline RL, but the notion of offline continual RL is even more problematic: why do you need to learn tasks one after the other, if you could also learn them all at once in an offline fashion? As long as you cannot provide a convincing use case for this, I remain unconvinced, and I can hardly imagine what such a use case might be.
  • Q1: Even if you argued that training can be done offline, at least testing must be done on-line. And again I find it quite hard to imagine a realistic scenario where someone would provide a task label at test time.

My concerns regarding clarity have been addressed. Some of my misconceptions have been rectified. However, the fundamental problems (W3,W5,W6,W7,Q1) of the paper remain in my eyes, therefore I retain my previous rating.

Comment

I would like to express my sincere gratitude for taking the time to provide such detailed comments on our manuscript. Your thorough feedback is valuable and greatly appreciated. Accordingly, we included comprehensive theoretical analyses and experimental results to support our claims. These analyses demonstrate the robustness and validity of our proposed method. Furthermore, we kindly request that you reconsider the novelty and contributions of our work. Despite the existing research in this field, our approach introduces significant advancements and unique insights that address critical challenges and open new directions for future research. Thank you once again for your constructive feedback. We look forward to the opportunity to further discuss our work and address any remaining concerns you may have.

Comment

W7: again I disagree, the concept of offline reinforcement learning must be motivated since the essence of RL is in exploration. Some of the points you provide could make a case for offline RL, but the notion of offline continual RL is even more problematic: why do you need to learn tasks one after the other, if you could also learn them all at once in an offline fashion? As long as you cannot provide a convincing use case for this, I remain unconvinced, and I can hardly imagine what such a use case might be.

  • Beyond "the essence of RL is in exploration" you claim, RL has so many other challenges, such as robustness, uncertainty, diversity, generalization, collaboration, long-term dependencies, sample efficiency, adversarial attack, credit assignment, ...
  • No one can predict the kinds of tasks we may face in the future. So continual learning is proposed to continually master new skills to solve new tasks.
  • In many scenarios, such as robotic control and autonomous driving, the commonly adopted approach is to train models on offline data. Additionally, with the rapid development of the generative model community, highly realistic synthetic data is becoming increasingly available. Training models on offline datasets will undoubtedly be a critical research direction.

Q1: Even if you argued that training can be done offline, at least testing must be done on-line. And again I find it quite hard to imagine a realistic scenario where someone would provide a task label at test time.

I will only say that your imagination is truly limited, to the point where I genuinely question whether you actually understand reinforcement learning. Take autonomous driving as an example: autonomous driving agents are trained using offline data and evaluated either in real-world environments or high-fidelity virtual environments. Additionally, robotics is another common example, where many studies are trained with offline data and then tested in online environments [6-8].

[1] A Definition of Continual Reinforcement Learning, NeurIPS

[2] Towards Continual Reinforcement Learning: A Review and Perspectives, JAIR

[3] Stable Continual Reinforcement Learning via Diffusion-based Trajectory Replay, ICLR workshop

[4] Diffusion-based Curriculum Reinforcement Learning, NeurIPS

[5] https://www.youtube.com/watch?v=NvfK1TkXmOQ

[6] D4RL: Datasets for Deep Data-driven Reinforcement Learning

[7] Continual World: A Robotic Benchmark For Continual Reinforcement Learning, NeurIPS

[8] Trafficgen: Learning to generate diverse and realistic traffic scenarios, TPAMI

Comment

Concerning some of your replies:

  • you have not at all addressed my concern that a task label needs to be supplied at evaluation for your method to work. I know some authors in CL do this, but it is by no means a dominant approach in CL. And: it is absolutely incompatible with live operation, something you allegedly aim for.
  • concerning the use of baselines: if your setup is exactly the same as in the cited papers, then you need not redo the experiments but can use the tabulated results. So, logically, there must be differences. If so, you must explain, what is different and why the original hyper-parameters are still a good choice. You do not even acknowledge the problem. Besides, I know these papers you cite very well, and they are evaluated in completely different setups, most blatantly for DGR, EWC. So a discussion of differences and parameter settings is very necessary. For example, did you tune EWC parameters by cross-validation, guesswork, ...? All of that is unclear.
  • your replies do not address at all the question I posed: why do CL at all if all data is available offline anyway? As any CL expert will tell you, CL is done precisely because you assume that all data is not available at the same time.
Comment

I must say I deplore your use of borderline offensive language. If you cannot take criticism, maybe science is not the right field for you. I consider reporting this behavior to the conference organizers.

Comment

We appreciate all reviewers for spending time on our submission, and we have tried our best to address each of your concerns. However, you still question the rationality of continual learning while assigning a high confidence score, which is hard to comprehend. A qualified reviewer should focus on the technical aspects of the submission, and all constructive suggestions will be considered to improve the manuscript. However, we follow many previous works in the field of continual learning to make improvements, yet you consider the setting "problematic". This prevents us from receiving constructive criticism from you.

Comment

W3: why not train a separate model for each task? It would be way easier and clearer, and probably more powerful, too.

Continual learning is proposed to address the potential and unpredictable demands of future tasks, making it highly relevant to real-world needs. Moreover, extensive research [1-5] has been devoted to tackling the challenge of catastrophic forgetting in continual learning.

  • We must emphasize the advantages of training a single model for tasks with different state/action spaces: (1) It significantly expands the applicability of the model, eliminating the limitations imposed by task-specific state and action spaces. (2) It facilitates learning multiple tasks sequentially with a single model, removing the need to train a separate model for each task. (3) It achieves higher computational efficiency and lower storage costs when mastering multiple tasks and making decisions with a single model.
  • Let me give you an example: if you train one LLM as an agent and, after training, you want it to continuously acquire additional skills, according to your reasoning, we would need to retrain an entirely new LLM for every new skill. However, it is evident that this would result in enormous time and financial costs.

W5: just using official code is not enough. You need to document what parameter settings you used and where you adapted the code, if applicable. If you run the author's code with the authors' parameter settings, then results may be suboptimal because you did not adapt parameters to your problem. That is something that you need to do, too, otherwise the comparison is unfair.

Using official code and the default hyperparameter settings (precisely as reported in the original paper) is a consensus within the community; therefore, it is sufficient to demonstrate the effectiveness of our method.

  • First, using the official code ensures the highest level of reliability in the results.
  • Second, the default hyperparameter settings in the official code have already been meticulously tuned, making them the most representative of the baseline's performance.
  • Third, reporting the default hyperparameter settings of baselines is unnecessary because we can already find the setting in their original papers.

W6: I strongly disagree, a single controller should be less efficient, whereas one controller per task would be the optimal case since this controller does not have to incorporate knowledge about other tasks. All in all, the necessity of having a single controller for problems that "live" in different state/action spaces seems very contrived to me, and the arguments you provide are not convincing to my mind.

Since you claim that "a single controller should be less efficient," why should we even pursue AGI in one agent? The reason we use a single model is to achieve higher computational efficiency and lower storage costs. Another example: current research on LLMs aims to use a single model to continuously learn and tackle most and even all tasks.

Official Review
Rating: 8

This work proposes a novel method for continual offline reinforcement learning called VQ-CD that combines vector quantization with diffusion for continual learning. The vector quantization part enables training of a unified state and action space such that VQ-CD can be trained jointly on environments that provide different state/action spaces. The policy is based on a diffusion model and weight masking that relies on task information. The weight masks are constructed in a manner that they can be readily merged into a single model after training. VQ-CD exhibits strong performance on Ant-dir, CW10, and D4RL environments with different state/action spaces compared to baselines.

Strengths

The idea of leveraging vector quantization to learn a unified state/action space is strong and enables joint training on spaces of different dimensionality.

Empirical evidence seems strong and the method is compared against plenty of baselines.

Ablation studies highlight the importance of combining VQ with CD.

Weaknesses

Significance of results:

While I appreciate the number of experiments conducted and baselines the authors trained, I am not entirely certain the compared methods are state of the art. One particular baseline that comes to mind is L2M [1], which constructs unified state and action spaces (similar to what QSA does) and incorporates a task matching mechanism to train separate weights for each task (similar to SWA). Therefore, L2M should be added as a baseline. Moreover, the unified state/action space of L2M can be used as an additional ablation study to the VQ approach, as it is simpler and does not require training.

Limitations:

The authors do not elaborate on limitations, however there are notable ones that should be mentioned:

  • Training needs to be conducted on each task separately, resulting in separate model checkpoints with different masked weights, this scales linearly with the number of tasks
  • VQ-CD relies on task information, other methods [1] can be used in a task-agnostic manner

Presentation:

First, the title overclaims the contribution of the paper as it does not "solve" the field, please consider renaming. Please also rephrase "any CL task sequence settings" to something like "CL with different state/action spaces", as "any CL task" is ill-defined. Figure 1 only shows the VQ-CD training pipeline, but line 465 talks about "return-based action generation", which is never explained in the methodology. In fact, it is never explained whether the diffusion part constitutes the policy and how the inverse dynamics model is actually used. Can the authors elaborate in more detail about what the policy is and how the inverse dynamics model is used? Figure 2 is never referred to in the text. Can the authors adjust Figure 4 so that the task boundaries are visible? As of now it is not clear at which steps the task is switched, as there are no clear differences in the learning curves. Line 423 sounds like state and action padding is being used for VQ-CD, but I assume it is only used for baselines, is this correct? Line 526 sounds like one action vector is being decomposed into several latent vectors, however VQ only quantizes into a single latent. It is not clear what the labels in Figure 7 indicate, can the authors elaborate and make it more explicit?

References:

[1] Learning to Modulate pre-trained Models in RL, Schmied et al., NeurIPS 2023

Questions

  • $w$ in line 193 has not been defined, what is it?
  • Is there any intuition about why the constraint in eq. (2) is important?
  • The constraint in eq. (2) could easily be incorporated in the loss function, why is clipping used?
  • What are "nonsignificant masks" (line 352)
  • What is the source of the data for CW10 and MuJoCo Ant-dir?
  • How is it possible that VQ-CD is better than Multitask on two of the D4RL tasks?
  • Why would the gaussian constraint for VAE hurt space alignment?
  • How does reconstruction look for VAE-CD in Table 2? Only VQ-CD and AE-CD are shown there.
  • Why does multitask baseline performance decrease with VQ state/action spaces (Figure 11)?
Comment

We are particularly encouraged that the reviewer finds our method novel and effective. We appreciate the valuable feedback and respond to your concerns below.

[I]. Explanation of weaknesses

[1/3] W1: Discussion of L2M and the experimental comparison of L2M and VQ-CD.

We discuss the differences between our method and the L2M method as follows:

  • State Space Alignment: The L2M method aligns the state space using padding, where all environmental state vectors are padded to a length of 204 dimensions. In contrast, our method uses vector quantization to align the state space, resulting in an aligned dimension of 20, which is significantly smaller than the 204 dimensions used by L2M.
  • Action Space Alignment: L2M treats each dimension of the action vector as a token and discretizes the action values with range [−1,1] into 64 bins. Using this approach, L2M can ignore the task-specific action dimensionality. For a task with an action dimension of m, L2M simply outputs m tokens using a for-loop to generate the m-dimensional action vector. In contrast, our method aligns the action space using vector quantization, where the action dimension is aligned to 10. We use an MLP network for this alignment, whereas L2M employs a transformer to output 64 dimensions.
  • Comparison of Experimental Results: We conduct experiments on the D4RL dataset, where both the state and action spaces differ across tasks. We use the official implementation of L2M to conduct experiments [1]. The experimental results show that our method significantly outperforms L2M. We attribute this improvement to the following reasons: 1) Regarding L2M's state padding approach, we have already included state padding baselines in our original submission (Figure 4), and the results demonstrate that our method outperforms the state padding baselines. 2) L2M's action discretization approach divides the action range into bins, resulting in a minimum resolution of 2/64. In contrast, our method is not constrained by such limitations.

In addition to the comparisons above, we have provided further experimental comparisons in the table below. More experimental results can be found in the revised Figure 4 of the main body.

Table 1: The comparison of L2M and our method (VQ-CD).

| Method | CL task setting | GPU memory consumption (GB) | Parameters of neural network (M) | Approx. physical training time (h) | Performance |
| --- | --- | --- | --- | --- | --- |
| VQ-CD | [Hopper-m, Walker2d-m, Halfcheetah-m] | 4.583 | 89.08 | 55 | 48.0 |
| L2M | [Hopper-m, Walker2d-m, Halfcheetah-m] | 6.687 | 57.80 | 92 | 13.4 |

[2/3] W2: Training needs to be conducted on each task separately, resulting in separate model checkpoints with different masked weights, this scales linearly with the number of tasks.

Although the Weights Assembling introduced in our paper is performed after the training is fully completed, it is evident that after finishing the training of each task $i$, we can obtain the corresponding mask $M_i$. As long as we have $M_i$, we can progressively save the network parameters without the need to store checkpoints from different time periods. The detailed process is as follows:

    1. Construct an initialized model specifically for saving parameters, with the same parameter structure $W$ as the model being trained.
    2. Train on the first task. Once training is completed, we obtain the checkpoint $W[i*\omega]$ and the mask $M_1$.
    3. Use $M_1$ to extract the relevant parameters for the first task from $W[i*\omega]$ and save them to the corresponding positions in $W$. At this point, $W[i*\omega]$ can be deleted, thus incurring no additional storage overhead.
    4. Repeat steps 2 and 3 until training on all tasks is completed.
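
A minimal sketch of this progressive saving scheme, assuming boolean masks over a flat parameter tensor (variable names are illustrative, not the paper's code):

```python
import torch

def consolidate(stored_W, trained_W, mask_i):
    """Copy the weights selected by task i's mask into the persistent parameter store.

    stored_W:  persistent tensor holding the already-consolidated task weights
    trained_W: checkpoint obtained right after training task i
    mask_i:    boolean mask marking the weights assigned to task i
    """
    stored_W[mask_i] = trained_W[mask_i]
    return stored_W  # trained_W can now be discarded; only stored_W and the masks are kept
```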

[3/3] W3: VQ-CD relies on task information, other methods [1] can be used in a task-agnostic manner.

Continual learning scenarios with explicit task boundaries and those without are two distinct continual learning settings, both of them are widely applicable in various domains [2-6]. Detecting task boundaries requires additional modules to compare task similarities. Admittedly, our method is not yet suitable for task-agnostic continual learning settings where task boundaries are unclear. However, we have successfully implemented continual learning across tasks with different state and action spaces in the explicit task boundary continual learning setting. In future research, we aim to explore the VQ-CD method further in task-agnostic continual learning settings.

Comment

[II]. Explanation of questions

[1/13] Q1: The title overclaims the contribution of the paper as it does not "solve" the field, please consider renaming. Please also rephrase "any CL tas sequence settings", to something like "CL with different state/action spaces", as "any CL task" is ill-defined.

Thank you for your valuable suggestion. Do you think it would be suitable to choose "Extending to Continual Offline RL with Different State and Action Spaces through Selective Weights Activation on Aligned Spaces" as the new title of this paper?

We would like to finalize the revised title of the paper through further discussion with you. Therefore, we have not revised the title in this current response.

[2/13] Q2: Figure 1 only shows the VQ-CD training pipeline, but line 465 talks about "return-based action generation" which is never explained in the methodology. In fact, it is never explained whether the diffusion part constitutes the policy and how the inverse dynamics model is actually used. Can the authors elaborate in more detail about what is the policy and how the inverse dynamics model is used?

In response to your concern, we have revised Figure 1 in the paper, and we provide a detailed explanation here of how the inverse dynamics model is utilized in our method. For the aligned state and action features, we process the original state and action vectors with the QSA module during training and inference.

During the training phase:

    1. For each trajectory in the dataset, assuming the trajectory length is $N$, we calculate the cumulative discounted return $\sum_{t'=t}^{N}\gamma^{t'-t} r_{t'}$ from each timestep $t$ in the trajectory using a discount factor $\gamma$, obtaining $\bar{R}_{1:N}$.
    2. After the computation in step 1, we normalize the $N$ cumulative discounted returns to the range [0, 1] and associate the normalized returns $R_{1:N}$ with the state at each timestep, $\{s_t, R_t\}$.
    3. Sample a batch from the dataset containing sequences of length $T_e$ consisting of states, actions, and their corresponding returns, i.e., $\{s_{t:t+T_e}, a_{t:t+T_e}, R_{t:t+T_e}\}$.
    4. Use $\tau^0_s = s_{t:t+T_e}$ as the input to the diffusion model and $R_{t:t+T_e}$ as its condition. Compute the diffusion model's loss and update the diffusion model parameters.
    5. Use $s_{t:t+T_e-1}$ and $s_{t+1:t+T_e}$ as inputs to the inverse dynamics model, and use $a_{t:t+T_e-1}$ as the output targets to compute the inverse dynamics model's loss and update its parameters.

During the inference phase:

    1. Receive the current state $s_t$ from the environment.
    2. Sample a state sequence $\hat{s}_{t:t+T_e}$ of length $T_e$ from a standard Gaussian distribution and replace the first state with $s_t$.
    3. Obtain the diffusion model's input $\tau^K_s = [s_t, \hat{s}_{t+1:t+T_e}]$.
    4. Following previous studies [7], we set the return condition $R = 0.8$. Then, we feed $\tau^K_s$ and $R$ into the diffusion model.
    5. At each generation step, we use the diffusion model to calculate the noise $\epsilon_\theta(\tau^k_s, k, R)$.
    6. Obtain $\tau^{k-1}_s$ by performing the generation step with the diffusion model. After each generation step, we replace the first state of the output with $s_t$.
    7. At the end of the generation process, we obtain $\tau^0_s = [s_t, \bar{s}_{t+1:t+T_e}]$.
    8. Finally, we use $s_t$ and $\bar{s}_{t+1}$ as the inverse dynamics model's input to obtain the action $\bar{a}_t$.
    9. Interact with the environment using $\bar{a}_t$.
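
For completeness, a minimal inverse dynamics model consistent with the aligned dimensions mentioned above (state 20, action 10) could be a small MLP of the following form; this is an illustrative assumption, not necessarily the exact architecture used in the paper:

```python
import torch
import torch.nn as nn

class InverseDynamics(nn.Module):
    """Predicts the aligned action a_t from consecutive aligned states (s_t, s_{t+1})."""

    def __init__(self, state_dim=20, action_dim=10, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, s_t, s_next):
        return self.net(torch.cat([s_t, s_next], dim=-1))
```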
Comment

[3/13] Q3: Figure 2 is never referred to in the text. Can the authors adjust Figure 4 so that the task boundaries are visible? As of now it is not clear at which steps the task is switched as there are now clear differences in the learning curves.

We have added a reference to Figure 2 in the main text. Additionally, in Figure 4, the number of training steps for each task is 500k, meaning a new task starts every 500k steps. We will also add vertical lines in the figure to indicate task transitions.

[4/13] Q4: Line 423 sounds like state and action padding is being used for VQ-CD, but I assume it is only used for baselines, is this correct? Line 526 sounds like one action vector is being decomposed into several latent vectors, however VQ only quantizes into a single latent. It is not clear what the labels in Figure 7 indicate, can the authors elaborate and make it more explicit?

Yes, for baselines, we report two types of space alignment methods: 1) state and action padding and 2) vector quantization. The first alignment method corresponds to the experiments in Figure 4, and the second alignment method corresponds to the experiments in Figure 11.

In vector quantization, suppose the input vector is $m$-dimensional, the codebook contains $n$ latent vectors, each of which is $p$-dimensional, and the number of latents is $q$. The process is as follows:

  • The input vector $v_{\text{input}}$ is first mapped to a representation $v_r$ of size $p \times q$.
  • $v_r$ is then split into $q$ vectors, each of size $p$, forming a collection $\{v_1, ..., v_q\}$.
  • For each $p$-dimensional vector $v_i$ in $\{v_1, ..., v_q\}$, the closest quantized vector $z_i$ is found in the codebook, resulting in $\{z_1, ..., z_q\}$.
  • The $q$ quantized vectors $\{z_1, ..., z_q\}$ are concatenated to obtain the final quantized representation $z$.

Figure 7 shows the performance of the model under different numbers of latent vectors. "action_3_2" indicates that 3 latent vectors are used, each with a dimension of 2. This means that each quantized vector in the codebook has a dimension of 2. If the task's action vector has a dimension of $m$, the action will be represented by a $3 \times 2 = 6$-dimensional quantized vector.
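
A minimal sketch of this quantization procedure (illustrative only, not the released code; for the "action_3_2" setting above, q = 3 and p = 2):

```python
import torch

def quantize(v_input, encoder, codebook, q):
    """Map an input vector to q nearest codebook vectors and concatenate them.

    encoder:  maps v_input to a (q * p)-dimensional representation v_r
    codebook: [n, p] matrix of latent vectors
    """
    p = codebook.shape[1]
    v_r = encoder(v_input).reshape(q, p)    # split v_r into q sub-vectors of size p
    dists = torch.cdist(v_r, codebook)      # [q, n] pairwise distances
    idx = dists.argmin(dim=1)               # nearest codebook entry for each sub-vector
    z = codebook[idx].reshape(-1)           # concatenate {z_1, ..., z_q}
    return z
```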

[5/13] Q5: $\omega$ in line 193 has not been defined, what is it?

$\omega$ is the guidance scale. It controls the strength of the classifier guidance. Following previous studies [7-9], we usually set $\omega = 1.2$ in experiments.

[6/13] Q6: Is there any intuition about why the constraint in eq. (2) is important?

Adding a constraint in Equation 2 encourages a more concentrated distribution of the quantized representation vectors, which benefits the diffusion model in learning the data distribution.

[7/13] Q7: The constraint in eq. (2) could easily be incorporated in the loss function, why is clipping used?

Our goal is to ensure that the magnitude of the quantized representation vectors does not exceed a certain value $\rho$, rather than making the norm of $z_q$ as small as possible.
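
To illustrate the difference (hypothetical code, not the paper's implementation): clipping enforces a hard cap $\lVert z_q \rVert \le \rho$ and leaves vectors inside the ball untouched, whereas a loss penalty would keep shrinking every norm:

```python
import torch

def clip_quantized(z_q, rho):
    """Rescale z_q only when its norm exceeds rho; smaller vectors are left unchanged."""
    norm = z_q.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    return torch.where(norm > rho, z_q * (rho / norm), z_q)

# A loss-based alternative, e.g. loss += lam * z_q.norm(dim=-1).mean(),
# would penalize every vector and push norms toward zero.
```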

[8/13] Q8: What are "nonsignificant masks" (line 352)

For example, we can use the magnitude of the weights as a measure to determine the significance of the mask. We consider weights with smaller absolute values to correspond to nonsignificant masks. Therefore, these masks can be released and reused to learn new tasks.

[9/13] Q9: What is the source of the data for CW10 and MuJoCo Ant-dir?

In order to collect the offline datasets of CW10, we trained Soft Actor-Critic (SAC) [10] on each task. Then, we use the well-trained SAC model to collect the datasets. For MuJoCo Ant-dir, we obtain the dataset from previous studies [11-12].

Comment

[10/13] Q10: How is it possible that VQ-CD is better than Multitask on two of the D4RL tasks?

In Figures 4 (a) and (b), the trajectories in the dataset encompass the entire training process of the policy, from a random policy to a well-trained policy. Our method leverages cumulative discounted returns to guide the generation of state sequences, encouraging the generation of higher-return state sequences. Consequently, the actions generated by the inverse dynamics model also yield higher returns. In contrast, the multi-task model does not currently incorporate returns, resulting in lower performance. In Figures 4(c) and 4(d), the variance of trajectory returns in the dataset is smaller, allowing the multi-task model to achieve better learning outcomes.

[11/13] Q11: Why would the gaussian constraint for VAE hurt space alignment?

In robotic control tasks, the distributions of state vectors and action vectors are more complex and are unlikely to follow Gaussian distributions. Existing research also indicates that assuming Gaussian-distributed actions can impair performance [13-14]. Consequently, the Gaussian prior imposed by VAE can interfere to some extent with the effectiveness of aligning state and action spaces.

[12/13] Q12: How does reconstruction look for VAE-CD in Table 2? Only VQ-CD and AE-CD are shown there.

From the experimental results in Figure 5, it can be observed that the performance of the space alignment methods is as follows: VQ > AE > VAE. Since the performance of VAE is significantly worse, we mainly focus on exploring the reasons why VQ outperforms AE and decide not to further investigate why VQ outperforms VAE.

[13/13] Q13: Why does multitask baseline performance decrease with VQ state/action spaces (Figure 11)?

The constraint in Equation 2 encourages the quantized representation vectors to become more concentrated, which benefits the diffusion model in modeling the data distribution. However, this may not necessarily benefit other models that do not focus on modeling distributions. More concentrated quantized vectors can make originally dissimilar state and action vectors from different tasks appear more similar, making them harder to distinguish. This, to some extent, negatively impacts the training of the multitask method.

[1] https://github.com/ml-jku/L2M

[2] Continual Offline Reinforcement Learning via Diffusion-based Dual Generative Replay

[3] Prediction and Control in Continual Reinforcement Learning

[4] Continual learning of diffusion models with generative distillation

[5] Solving Continual Offline Reinforcement Learning with Decision Transformer

[6] Continual diffusion: Continual customization of text-to-image diffusion with c-lora

[7] Is Conditional Generative Modeling all you need for Decision-Making?

[8] Classifier-free diffusion guidance

[9] Improved denoising diffusion probabilistic models

[10] Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor

[11] Prompting decision transformer for few-shot policy generalization

[12] Efficient off-policy meta-reinforcement learning via probabilistic context variables

[13] Learning Multimodal Behaviors from Scratch with Diffusion Policy Gradient

[14] Diffusion policies as an expressive policy class for offline reinforcement learning

Comment

Thank you for the detailed and clarifying response as well as the addition of the L2M baseline. One thing that I noticed from the added table is that the number of parameters for L2M is significantly smaller than the ones used for VQ-CD, which might explain the lower performance. Would it be possible to match the number of parameters to make the comparison more fair?

I greatly appreciate the clarification on the overall methodology. I believe it would be good to have one short algorithm in the main body that explains the information flow, as other reviewers also pointed out as an issue. To accommodate the space, I would suggest moving Figure 3 to the appendix and maybe making Figure 5 a wrapfigure and a bit smaller.

It would also be very beneficial for the community to have a discussion section that highlights the interplay between VQ and CD, especially things such as concentrated representations benefit CD, but not the regular multitask model, etc. This gives a bit better intuition why certain changes have been made.

Regarding the title, I am just picky on the word "solving", I think simply replacing "Solving" with "Tackling" would be fine.

As the authors have addressed most of my concerns I am raising my score. If the authors can incorporate the remaining suggestions, I am willing to increase my score even further.

Comment

Thank you for your further suggestions. We have made the following efforts to address your concerns.

[I]. Explanation of questions

[1/4] Q1: One thing that I noticed from the added table is that the number of parameters for L2M is significantly smaller than the ones used for VQ-CD, which might explain the lower performance. Would it be possible to match the number of parameters to make the comparison more fair?

Thank you for your suggestion. We re-ran the L2M algorithm with a larger network this time, incorporating more attention heads and parameters (we denote this variant L2M-large), even exceeding the number of parameters of our method. The experimental results are shown in Table 2. The results demonstrate that our method still outperforms the L2M-large algorithm. Additionally, it can be observed that our method requires significantly less physical training time (almost half that of L2M-large).

Table 2: The comparison of L2M-large and our method (VQ-CD).

| Method | CL task setting | GPU memory consumption (GB) | Parameters of neural network (M) | Approx. physical training time (h) | Performance |
| --- | --- | --- | --- | --- | --- |
| VQ-CD | [Hopper-m, Walker2d-m, Halfcheetah-m] | 4.583 | 89.08 | 55 | 48.0 |
| L2M-large | [Hopper-m, Walker2d-m, Halfcheetah-m] | 8.751 | 95.94 | 94 | 15.98 |

[2/4] Q2: I greatly appreciate the clarification on the overall methodology. I believe it would be good to have one short algorithm in the main body that explains the information flow, as other reviewers also pointed out as an issue. To accommodate the space, I would suggest moving Figure 3 to the appendix and maybe making Figure 5 a wrapfigure and a bit smaller.

As you suggested, we have added pseudocode to describe how actions are generated during the inference phase. Additionally, we have moved the original Figure 3 to the appendix and reduced the size of Figure 5 to save space for discussion in the paper.

[3/4] Q3: It would also be very beneficial for the community to have a discussion section that highlights the interplay between VQ and CD, especially things such as concentrated representations benefit CD, but not the regular multitask model, etc. This gives a bit better intuition why certain changes have been made.

Thank you for your suggestion. We have added the section "Discussion" in Section 6 to discuss the interplay between VQ and CD, the intuition of the constraint in the QSA module, and further discussions of the experiments.

[4/4] Q4: Regarding the title, I am just picky on the word "solving", I think simply replacing "Solving" with "Tackling" would be fine.

We have revised the title according to your suggestion.

Comment

Thank you for the additional changes!

The authors have addressed all my concerns and in my opinion substantially improved the paper. Therefore I am raising my score, as I believe this is good work and should be accepted.

Comment

Dear Reviewer jUqW,

We appreciate your positive feedback and worthy suggestions for our paper, which have significantly helped us improve the paper's quality. We really thank Reviewer jUqW for increasing the score!

Kind regards,

Paper5556 Authors

Official Review
Rating: 5

This paper introduces a method for the problem of continual offline reinforcement learning. The proposed method combines two modules: the Quantized Spaces Alignment (QSA) module, which standardizes different state and action spaces by using vector quantization to map them into a unified representation, and the Selective Weights Activation (SWA) module, which preserves knowledge from prior tasks by activating only relevant model weights for each new task through a masking mechanism. Extensive experiments show that the proposed method consistently outperforms existing methods, validating the effectiveness of this proposed method.

Strengths

  • Overall this paper contains a lot of new ideas. Using a VQ-VAE model to encode different tasks seems to be a reasonable way to enable the generalization of diffusion-based RL methods.

  • The idea of selecting activation weights in a U-Net diffusion model effectively addresses catastrophic forgetting by isolating task-specific weights.

  • The authors provided a pretty comprehensive evaluation to demonstrate the effectiveness of this work both qualitatively and quantitatively.

Weaknesses

  • The proposed method is a little bit complicated and might require significant computational resources and engineering efforts.

  • This method is designed for task-aware settings with explicit task boundaries, which could reduce its applicability in task-free continual learning scenarios where such boundaries are not predefined.

  • It seems that the proposed method, though effective, is a combination of existing learning methods, e.g., VQ-VAE, diffusion, task masks.

Questions

  • Can the learned masks be visualized? Visualizing the masks could reveal task similarities and therefore can provide important insight for this proposed method.
  • Can this method detect non-periodic task changes?

Details of Ethics Concerns

N/A

Comment

[2/3] W2: This method is designed for task-aware settings with explicit task boundaries, which could reduce its applicability in task-free continual learning scenarios where such boundaries are not predefined.

Task-aware continual learning is quite prevalent in many applications, such as CW10 and CW20 continual learning settings [3], as explicit task boundaries can often be conveniently determined using various methods, such as leveraging large models or human feedback on existing datasets or constructing new datasets with explicit task boundaries based on classical RL algorithms [4].

Compared with task-aware continual learning, task boundary-agnostic continual learning requires an additional mechanism to detect whether a task change has occurred. Clearly, this demands extra task similarity measurement mechanisms for detection. Admittedly, this is currently a limitation of our approach. However, it is also one of the challenges we are actively addressing. We are confident that further progress will be reflected in our future research.

[3/3] W3: It seems that the proposed method, though effective, is a combination of existing learning methods, e.g., VQ-VAE, diffusion, task masks.

We must clarify that the main innovation of this paper lies in proposing a continual offline reinforcement learning method that can be applied to arbitrary state and action spaces by space alignment, which significantly broadens the scenarios where our method can be applied. Additionally, we systematically validate the effectiveness of our approach through extensive experiments.

Vector quantization is merely one way to achieve space alignment. Clearly, other methods, such as MLPs or padding, could also be used for this purpose. However, based on experimental comparisons, we found that vector quantization yielded the best results, which is why we adopted this approach for space alignment.

The application of diffusion models in reinforcement learning is already widespread, achieving highly competitive results on various datasets due to the models' strong expressiveness. This is the reason we chose diffusion models as the core structure for the continual learning component.

The masking technique is one way to realize parameter isolation in continual learning. An alternative is to attach separate parameters for each new task, as LoRA does. However, compared with LoRA, task-related weight masking is more suitable here because, in our continual learning setting, it is difficult to define a pre-trained model to which LoRA adapters could be attached. A minimal sketch of the masking idea is given below.
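The sketch below illustrates the masking idea in its simplest form (hypothetical names; the actual SWA masks operate on U-Net channels rather than a single linear layer). Disjoint binary masks keep each task's active weights separated, so training on a new task cannot overwrite the weights reserved for previous tasks.

```python
# Minimal sketch of task-related weight masking for parameter isolation
# (hypothetical, simplified version; not the paper's SWA implementation).
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, num_tasks: int):
        super().__init__()
        self.layer = nn.Linear(in_dim, out_dim)
        # One fixed binary mask per task over the output channels; the masks
        # are disjoint, so each task activates a different slice of weights.
        masks = torch.zeros(num_tasks, out_dim)
        chunk = out_dim // num_tasks
        for k in range(num_tasks):
            masks[k, k * chunk:(k + 1) * chunk] = 1.0
        self.register_buffer("masks", masks)

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        # Only the channels assigned to `task_id` contribute to the output,
        # so gradients for the other channels' weights are zeroed.
        return self.layer(x) * self.masks[task_id]
```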

It is important to emphasize that vector quantization and diffusion models with task-related masks are techniques we use to realize our idea of performing continual learning on aligned spaces, rather than a simple combination of existing techniques.

[II]. Explanation of questions

[1/2] Q1: Can the learned masks be visualized? Visualizing the masks could reveal task similarities and therefore can provide important insight for this proposed method.

Yes, thank you for your suggestion. We select [Hopper-m,Walker2d-m,Halfcheetah-m] to visualize the weight masks in Figure 13, Appendix B.6 of the paper. To make the mapping between masks and weights easy to see, we draw the network structure and the mask matrices, where we only report the first 100 channels of each mask matrix.

Relying on the mask matrices alone may not intuitively demonstrate the effectiveness of our method, so we also visualize the aligned state features in Figure 14, Appendix B.7 of the paper. From the experimental results, we can see that the state features learned by the AE method are not well separated but are instead scattered across multiple overlapping areas. In contrast, the features obtained by our method are partitioned into distinct regions, which makes it easier for the model to capture the data distribution.

[2/2] Q2: Can this method detect non-periodic task changes?

Detecting task boundaries demands an extra task-similarity measurement mechanism. Admittedly, this is currently a limitation of our approach, and it is one of the challenges we are actively addressing. We are confident that further progress will be reflected in our future research.

[1] Denoising diffusion implicit models

[2] Learning to Modulate pre-trained Models in RL

[3] Continual World: A Robotic Benchmark For Continual Reinforcement Learning

[4] Continual learning of large language models: A comprehensive survey

评论

We are particularly encouraged that the reviewer finds our method novel and effective. We appreciate the valuable feedback and respond to the questions below.

[I]. Explanation of weaknesses

[1/3] W1: The proposed method is a little bit complicated and might require significant computational resources and engineering efforts.

Our method integrates the advantages of vector quantization with the strengths of diffusion models, enabling continual learning across tasks with different state and action spaces and achieving superior experimental performance. To ensure the sampling efficiency of the diffusion model, we employ acceleration techniques such as DDIM [1], reducing the number of generation steps from 200 to 10 and yielding a 19.76x speed-up during testing, as shown in Table 1. Furthermore, in Table 2, we compare our method with the L2M method [2] in terms of model parameters and GPU memory consumption. The results demonstrate that the computational overhead of our method is entirely acceptable.

Table 1: The comparison of generation speed with different generation steps under the CL setting of Ant-dir task-4-18-26-34-42-49. We conduct the experiment with NVIDIA GeForce RTX 3090 GPUs and Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz. In the main body of our manuscript, we use the 10 diffusion steps setting for all experiments.

| Diffusion steps | 200 (original) | 100 | 50 | 25 | 20 | 10 |
| --- | --- | --- | --- | --- | --- | --- |
| Sampling speed-up stride | 1 (original) | 2 | 4 | 8 | 10 | 20 |
| Time consumption per generation (s) | 5.73±0.29 | 2.88±0.21 | 1.41±0.16 | 0.71±0.18 | 0.58±0.17 | 0.29±0.15 |
| Speed-up ratio | 1x | 1.99x | 4.06x | 8.07x | 9.88x | 19.76x |
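For reference, the speed-ups above come from visiting only a strided subset of the original diffusion timesteps at generation time, in the spirit of DDIM [1]. The snippet below is a minimal, simplified sketch under hypothetical names: `model(x, k, cond)` is assumed to predict the noise at step k, and `alphas_cumprod` holds the cumulative products of the noise schedule.

```python
# Minimal sketch of strided (DDIM-style) deterministic sampling:
# only K/stride of the training timesteps are visited at generation time.
import torch

def ddim_sample(model, x, alphas_cumprod, K=200, stride=20, cond=None):
    timesteps = list(range(K - 1, -1, -stride))          # e.g. 10 steps for stride=20
    for i, k in enumerate(timesteps):
        k_prev = timesteps[i + 1] if i + 1 < len(timesteps) else -1
        a_k = alphas_cumprod[k]
        a_prev = alphas_cumprod[k_prev] if k_prev >= 0 else torch.tensor(1.0)
        eps = model(x, k, cond)                           # predicted noise at step k
        x0 = (x - (1 - a_k).sqrt() * eps) / a_k.sqrt()    # predicted clean sequence
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps  # deterministic update
    return x
```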

Table 2: The comparison of GPU memory consumption. We conduct the experiment with NVIDIA GeForce RTX 3090 GPUs and Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz.

| Domain | CL task setting | Method | GPU memory consumption (GB) | Parameters of neural network (M) |
| --- | --- | --- | --- | --- |
| D4RL | [Hopper-m, Walker2d-m, Halfcheetah-m] | VQ-CD | 4.58 | 389.08 |
| D4RL | [Hopper-m, Walker2d-m, Halfcheetah-m] | L2M | 6.68 | 757.80 |

Regarding the engineering complexity, we made two simplifications to ensure the algorithm runs efficiently and to minimize the physical time required for training the model:

  • We chose a dense optimizer instead of a sparse optimizer, primarily because dense optimizers update parameters more efficiently and require less wall-clock time than sparse optimizers. We also compared the physical training time of the two optimizer types and report the results in Table 3, which confirm the efficiency of training with a dense optimizer.
  • In the paper, after the final training is completed, we apply Weights Assembling (introduced in Section 4.2) to extract the parameters corresponding to each task and combine them into a complete model. Rather than laboriously tagging which parameters need to be saved, Weights Assembling provides a more straightforward and efficient solution (a minimal sketch is given after Table 3 below).

Table 3: The comparison of time consumption per update between sparse and dense (normal) optimizers. We compare these two types of optimizers on the CL settings and find that when we first use a normal optimizer, such as Adam, to train the model and then apply weights assembling to obtain the final model, the total physical time consumption is significantly smaller than with a sparse optimizer (e.g., sparse Adam).

| Domain | CL task setting | Optimizer type | Time consumption per update (s) |
| --- | --- | --- | --- |
| D4RL | [Hopper-fr, Walker2d-fr, Halfcheetah-fr] | dense optimizer | 0.089±0.219 |
| D4RL | [Hopper-fr, Walker2d-fr, Halfcheetah-fr] | sparse optimizer | 0.198±0.224 |
| D4RL | [Hopper-mr, Walker2d-mr, Halfcheetah-mr] | dense optimizer | 0.096±0.223 |
| D4RL | [Hopper-mr, Walker2d-mr, Halfcheetah-mr] | sparse optimizer | 0.197±0.223 |
| D4RL | [Hopper-m, Walker2d-m, Halfcheetah-m] | dense optimizer | 0.089±0.211 |
| D4RL | [Hopper-m, Walker2d-m, Halfcheetah-m] | sparse optimizer | 0.195±0.224 |
| D4RL | [Hopper-me, Walker2d-me, Halfcheetah-me] | dense optimizer | 0.090±0.223 |
| D4RL | [Hopper-me, Walker2d-me, Halfcheetah-me] | sparse optimizer | 0.206±0.225 |
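As a concrete illustration of the Weights Assembling step mentioned above, the sketch below assumes that a model checkpoint is stored at the end of each task and that the per-task binary masks are disjoint, so the final model is obtained by summing each task's masked parameters. The names are hypothetical and simplified from the actual implementation.

```python
# Minimal sketch of Weights Assembling (hypothetical names; assumes disjoint
# per-task masks and one saved checkpoint per task).
import torch

def assemble_weights(per_task_state_dicts, per_task_masks):
    assembled = {}
    for name in per_task_state_dicts[0]:
        # For each parameter tensor, keep only the entries marked active by
        # each task's mask, then sum the disjoint contributions.
        assembled[name] = sum(
            sd[name] * mask[name]
            for sd, mask in zip(per_task_state_dicts, per_task_masks)
        )
    return assembled
```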
审稿意见
3

This paper proposes the Vector-Quantized Continual Diffuser (VQ-CD) for offline continual reinforcement learning, including tasks with different observation and action spaces. Specifically, the quantized spaces alignment (QSA) module aligns different state and action spaces (even with different dimensions) so that continual learning is performed within the same space. The selective weights activation (SWA) diffuser module preserves previous knowledge by separating task-related parameters with task-related masking. VQ-CD achieves SOTA performance on 15 continual tasks compared with 16 baselines.

优点

  1. Traditional continual reinforcement learning approaches are significantly limited as they only consider tasks with consistent state and action spaces. This paper successfully addresses the challenge of tasks with diverse state and action spaces through the innovative use of a quantized spaces alignment module.
  2. The baselines compared in the paper are thorough and comprehensive, and the experimental environments considered are sufficiently diverse. These aspects robustly demonstrate the superiority of VQ-CD.

缺点

The paper appears to lack a significant amount of crucial details, which hinders my understanding of the algorithms presented.

  1. The paper does not clarify how the 'return' is utilized. Based on Section 3.2, I guess that the algorithm employs the 'return' as a condition. So, is it the 'return' or the Q-function that is being used? Additionally, how is the target return or target function determined during the inference phase?
  2. Within the QSA module, which parameters are shared, and which need to be retrained for different tasks?

问题

  1. What is the "diffusion-based lifelong learning systems" in Abstract?
  2. Figure 1 appears somewhat cluttered and hard to understand. Labeling the sequence within the figure would enhance its clarity.
评论

We appreciate your valuable review and respond to your concerns below.

[I]. Explanation of weaknesses

[1/2] W1: The paper does not clarify how the 'return' is utilized. Based on Section 3.2, I guess that the algorithm employs the 'return' as a condition. So, is it the 'return' or the Q-function that is being used? Additionally, how is the target return or target function determined during the inference phase?

We will now elaborate on how the return is utilized during the training and inference phases, starting from the data processing stage. In both phases, the original state and action vectors are first processed by the QSA module to obtain aligned state and action features. In the training phase:

    1. For each trajectory in the dataset, assuming the trajectory length is $N$, we calculate the cumulative discounted return $\bar{R}_t = \sum_{t'=t}^{N}\gamma^{t'-t} r_{t'}$ from each timestep $t$ using a discount factor $\gamma$, obtaining $\bar{R}_{1:N}$.
    2. After the computation in step 1, we normalize the $N$ cumulative discounted returns to the range $[0, 1]$ and associate the normalized returns $R_{1:N}$ with the state at each timestep, i.e., $\{s_t, R_t\}$ (a minimal sketch of steps 1 and 2 is given after this list).
    3. Sample a batch from the dataset containing sequences of length $T_e$ consisting of states, actions, and their corresponding returns, i.e., $\{s_{t:t+T_e}, a_{t:t+T_e}, R_{t:t+T_e}\}$.
    4. Use $\tau^0_s = s_{t:t+T_e}$ as the input to the diffusion model and $R_{t:t+T_e}$ as its condition input, then compute the diffusion model's loss and update its parameters.
    5. Use $s_{t:t+T_e-1}$ and $s_{t+1:t+T_e}$ as inputs to the inverse dynamics model, with $a_{t:t+T_e-1}$ as the output targets, to compute the inverse dynamics model's loss and update its parameters.
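The snippet below is a minimal sketch of steps 1 and 2 (per-timestep discounted returns and their normalization). The variable names are hypothetical, and the per-trajectory min-max normalization shown is one plausible way to map returns into [0, 1]; the paper may normalize differently, e.g., using dataset-level return bounds.

```python
# Minimal sketch of steps 1-2: discounted returns and [0, 1] normalization
# (hypothetical names; the normalization scheme is an assumption).
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    # R_t = sum_{t'=t}^{N} gamma^{t'-t} * r_{t'}, computed backwards in one pass.
    R = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        R[t] = running
    return R

def normalize_returns(R, eps=1e-8):
    # Map the trajectory's returns into [0, 1] so they can serve as conditions.
    return (R - R.min()) / (R.max() - R.min() + eps)
```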

In the inference phase:

    1. Receive the current state $s_t$ from the environment.
    2. Sample a state sequence $\hat{s}_{t:t+T_e}$ of length $T_e$ from a standard Gaussian distribution and replace its first state with $s_t$.
    3. Form the diffusion model's input $\tau^K_s = [s_t, \hat{s}_{t+1:t+T_e}]$.
    4. Following previous studies [1], we set the return condition to $R = 0.8$ and feed $\tau^K_s$ and $R$ into the diffusion model.
    5. At each generation step, the diffusion model predicts the noise $\epsilon_\theta(\tau^k_s, k, R)$.
    6. Obtain $\tau^{k-1}_s$ by performing one step of the reverse (generation) process with the diffusion model; after each step, we replace the first state of the output with $s_t$.
    7. At the end of the generation process, we obtain $\tau^0_s = [s_t, \bar{s}_{t+1:t+T_e}]$.
    8. Finally, we feed $s_t$ and $\bar{s}_{t+1}$ into the inverse dynamics model to obtain the action $\bar{a}_t$.
    9. Interact with the environment using $\bar{a}_t$ (a minimal sketch of this loop is given after the list).
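Putting these inference steps together, the sketch below shows the overall loop under hypothetical names: `generate` stands for the conditional reverse-diffusion process (assumed to re-pin the first state to $s_t$ after every step, as in step 6), and `inverse_dynamics(s, s_next)` returns an action.

```python
# Minimal sketch of the inference loop described above (hypothetical names;
# the reverse-diffusion process is abstracted behind `generate`).
import torch

def plan_action(s_t, generate, inverse_dynamics, horizon, R=0.8):
    # Start from Gaussian noise over the aligned state features and pin the
    # first state of the sequence to the observed state s_t.
    tau_K = torch.randn(horizon, s_t.shape[-1])
    tau_K[0] = s_t
    # Conditional generation with return condition R; `generate` is assumed to
    # re-pin the first state to s_t after each reverse step.
    tau_0 = generate(tau_K, cond=R)
    # The observed state and the first generated next state give the action.
    return inverse_dynamics(s_t, tau_0[1])
```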

[2/2] W2: Within the QSA module, which parameters are shared, and which need to be retrained for different tasks?

In the QSA module, there are no shared parameters. The primary purpose of the QSA module is to align the state and action spaces across different environments. Consequently, the internal components of the QSA module (the vector quantized encoder (VQE), the vector quantized decoder (VQD), and the codebook) are task-specific, and none of their parameters are shared. Thanks to the alignment provided by the QSA module, the inverse dynamics model in the SWA module can be shared, because the state and action spaces of different environments are mapped into an aligned space with the same value range.

[II]. Explanation of questions

[1/2] Q1: What is the "diffusion-based lifelong learning systems" in Abstract?

Diffusion-based lifelong learning systems refer to frameworks that utilize diffusion models for continual learning. Relevant research has been widely explored in the fields of computer vision (CV) and reinforcement learning (RL). Recent related algorithms include [2-6].

[2/2] Q2: Figure 1 appears somewhat cluttered and hard to understand. Labeling the sequence within the figure would enhance its clarity.

Thank you for your suggestions. We have labeled the state sequence in Figure 1 of the main body. Additionally, we have added the data flow for the loss computation of the diffusion model and the inverse dynamics model to better illustrate how the model is trained. If you have any further questions, we would be more than happy to address your concerns.

[1] Is Conditional Generative Modeling all you need for Decision-Making?

[2] Continual diffusion: Continual customization of text-to-image diffusion with c-lora

[3] Continual learning of diffusion models with generative distillation

[4] Continual Offline Reinforcement Learning via Diffusion-based Dual Generative Replay

[5] A Definition of Continual Reinforcement Learning

[6] Addressing Loss of Plasticity and Catastrophic Forgetting in Continual Learning

评论

Thanks for the authors' response. Upon re-reading the revised version of the paper, I find that the aforementioned issues persist. In the rebuttal, the authors have clearly clarified how the 'return' is utilized. However, this information is still missing from the main body of the paper. Additionally, Figure 1 remains unclear, making comprehension relatively challenging.

Although the algorithm presented in the paper is indeed effective, considering the current state of presentation, I will maintain my current score.

评论

Thank you for your further review. We have made the following efforts to address your concerns.

For each of the concerns you mentioned, we have clearly outlined the improvements made in the revised paper in Table 1 below. Additionally, we have incorporated all these changes into the revised paper. If you have any further questions, we would be more than happy to receive your suggestions and make corresponding modifications to the paper.

Table 1: The correspondence between Reviewer ch1d's concerns and the revisions in the paper.

| Concerns | Description | Revision | Exact position in revised paper |
| --- | --- | --- | --- |
| W1 | How 'return' is used in training | Blue sentences in Algorithm 2 | lines 825-826 and lines 842-854 |
| W1 | How 'return' is used in inference | Blue sentences in Algorithm 1 | lines 199-213 |
| W2 | In QSA, which parameters are shared and which need to be retrained for different tasks | Section "Network Details", A.7 in Appendix | lines 1069-1079 |
| Q1 | Diffusion-based lifelong learning systems | First sentence in Abstract | lines 11-13 |
| Q2 | Labeling the sequences within Figure 1 | In the top right corner of Figure 1, we use "action sequence", "state sequence", and "return sequence" to indicate the sequences | lines 163-169 |
AC 元评审

summary

This paper introduces Vector-Quantized Continual Diffuser (VQ-CD), a novel method for continual offline RL, capable of handling tasks with differing observation and action spaces. VQ-CD integrates two core modules: Quantized Spaces Alignment (QSA), which uses vector quantization to unify diverse state and action spaces into a shared representation, and Selective Weights Activation (SWA), which employs task-specific masking to preserve prior knowledge by isolating task-related parameters. Experiments on benchmarks like Ant-dir, CW10, D4RL, and others demonstrate that VQ-CD achieves state-of-the-art performance across various continual tasks, outperforming existing baselines.


strengths

  • Novel approaches: (1) interesting idea of leveraging vector quantization to learn a unified state/action space and (2) leveraging generative models for policy learning

  • Comprehensive Validation: Demonstrates state-of-the-art performance across diverse benchmarks with extensive experiments, including ablation studies and parameter analysis.


weaknesses

  • [w1] Clarity of Writing: Some parts of the paper are unclear, making it difficult to fully understand the methodology.
  • [w2] Baselines: The selection or implementation of baselines could be improved or better justified.
  • [w3] Problem Setup Justification: The justification for the problem setup is unconvincing and could be elaborated further.

decision

All reviewers agreed that the idea is interesting and highlighted the diverse experimental environments as a strength. However, they also raised concerns about the baselines and the problem setup. During the discussion phase, some comments were partially addressed, but the paper requires revision to fully resolve these issues.

审稿人讨论附加意见

During the discussion period, the authors effectively addressed several weaknesses pointed out by the reviewers (e.g., complexity and resource intensity, writing clarity). However, some reviewers remained skeptical about the motivation for addressing CL and the justification of the baselines, even after the authors added the L2M baseline. As a result, rejection is recommended.

最终决定

Reject