Improving Deep Reinforcement Learning by Reducing the Chain Effect of Value and Policy Churn
Improving deep reinforcement learning by reducing undesired changes to out of batch data.
Abstract
Reviews and Discussion
This work studies a phenomenon called churn, which refers to the outputs of a network changing in unexpected ways, after an update, for input data not included in the training batch. Specifically, it studies the value churn, the policy churn, and the interplay between them in the deep reinforcement learning (DRL) setting. A simple regularization method is then proposed to reduce the value and policy churn. Empirical results demonstrate the effectiveness of the proposed solution across many DRL settings, such as value-based methods, policy gradient methods, and offline RL.
Strengths
- Aside from some minor issues, the paper is generally well-written and easy to follow.
- The proposed method is simple and effective, verified in various RL settings.
- Besides improving the final performance, direct evidence is also provided showing that the proposed method can indeed reduce churn.
Weaknesses
Originality
Although this work studies a relatively new concept (churn), the investigation of the root causes of this phenomenon, the proposed method, and the idea behind it are not novel.
- The churn phenomenon is strongly related to interference and generalization of neural networks, and using NTK to measure interference and generalization has already been proposed by previous works in DRL, such as [1-4]. These works should be included in the related work section and discussed. Outside of DRL, similar phenomena have also been studied in the supervised learning setting, such as [5-7].
- I like the simplicity and effectiveness of the proposed algorithm (CHAIN), but it is really similar to MeDQN(R) [8], although the latter work focuses on reducing forgetting instead. Unfortunately, this considerably weakens the novelty of this work.
Quality
The empirical and theoretical results of this work can be further improved.
- Equation 1 is inappropriate as the definition of churn. An absolute or square operation should be applied; otherwise, positive and negative values may cancel out. The issue exists for Equation 4 as well. This is a serious issue, since changing Equation 1 would also result in changes to the later derivations in the paper. For example, Equation 5 may no longer hold.
- In Equation 3, although using single samples helps with notation clarity, it may lead to misleading derivations. For example, the cancel-out effect is no longer revealed (see the previous point about Equation 1). Given this, I would suggest considering the general case of using more samples in the batches.
- All experiments use only 6 random seeds, which is not enough. I would suggest at least 10 seeds.
- The theoretical results are interesting and useful to help understand and support the intuitions behind the problem. However, I don't see how they could help with the proposed algorithm. In other words, the connection between theoretical derivations and the proposed method is too weak.
Clarity
There are some minor issues.
- What is the difference between $\pi_\phi$ and $\tilde{\pi}_\phi$ ($Q_\theta$ and $\tilde{Q}_\theta$)? I understand that $Q_\theta$ is an approximation of $Q^\pi$ (Section 2.2), where $\theta$ are the parameters of the Q-network. But what is $\tilde{Q}_\theta$ in Section 3.1 and Figure 1 exactly? Isn't it also an approximation of $Q^\pi$? In fact, I would suggest removing $\tilde{Q}_\theta$ and $\tilde{\pi}_\phi$ to simplify the notation, as they seem to be redundant (but I might be wrong).
- Missing one right parenthesis in the definition in Line 110.
- Typo in Line 322: policy churn reduction (VCR) --> value churn reduction (VCR).
[1] Liu, Vincent, et al. "Towards a Practical Measure of Interference for Reinforcement Learning." arXiv preprint arXiv:2007.03807 (2020).
[2] Liu, Vincent, et al. "Measuring and Mitigating Interference in Reinforcement Learning." Conference on Lifelong Learning Agents. PMLR, 2023.
[3] Achiam, Joshua, Ethan Knight, and Pieter Abbeel. "Towards Characterizing Divergence in Deep Q-Learning." arXiv preprint arXiv:1903.08894 (2019).
[4] Bengio, Emmanuel, Joelle Pineau, and Doina Precup. "Interference and Generalization in Temporal Difference Learning." arXiv preprint arXiv:2003.06350 (2020).
[5] Fort, Stanislav, et al. "Stiffness: A New Perspective on Generalization in Neural Networks." arXiv preprint arXiv:1901.09491 (2019).
[6] He, Hangfeng, and Weijie Su. "The Local Elasticity of Neural Networks." International Conference on Learning Representations.
[7] Lan, Qingfeng, and A. Rupam Mahmood. "Elephant Neural Networks: Born to Be a Continual Learner." arXiv preprint arXiv:2310.01365 (2023).
[8] Lan, Qingfeng, et al. "Memory-efficient Reinforcement Learning with Value-based Knowledge Consolidation." Transactions on Machine Learning Research.
Questions
How is policy/value churn related to catastrophic forgetting? For example, Figure 2 in the MeDQN paper [8] shows that a change of greedy action could result in forgetting. Will reducing forgetting reduce churn? Or vice versa?
Limitations
Yes, the limitations are discussed in the work.
We appreciate the reviewer’s careful review and the recognition of our experimental work. Our additional results in the one-page pdf further demonstrate the effectiveness of CHAIN in ten more DMC tasks and in improving learning when larger networks are used.
The main concerns focus on the relationship between our work and the previous works mentioned by the reviewer on catastrophic forgetting and interference. Our response aims to systematically clarify these aspects.
Q1: CHAIN vs. MeDQN(R) [8]
CHAIN and MeDQN(R) are different on two key points.
The first point is that MeDQN(R) was designed to improve the DQN algorithm specifically, relying on an available target network that serves as a fixed point for reducing forgetting. CHAIN does not need such a fixed point, and it applies to both value-based and policy-based RL settings in a plug-in manner.
The second point is that CHAIN does not aim to reduce forgetting or to stop all types of churn. Some churn is beneficial, and we do not want to over-constrain the algorithm; otherwise, we may lose the plasticity of the network. We show the importance of this point in our additional experiments in Table 9 of the one-page pdf, where the results show how CHAIN improves the training of larger networks.
Moreover, in the next question, we discuss the distinction between CHAIN and MeDQN(R) regarding their effects in reducing churn and forgetting.
Q2: “How is policy/value churn related to catastrophic forgetting? For example, Figure 2 in the MeDQN paper [8] shows that a change of greedy action could result in forgetting. Will reducing forgetting reduce churn? Or vice versa?”
Churn is a natural behavior of neural networks: it accompanies every training update and occurs instantly. To some extent, we consider that catastrophic forgetting can be viewed as a consequence of the accumulation of churn.
Therefore, reducing churn also helps to reduce forgetting as it suppresses the accumulation. However, reducing forgetting does not necessarily reduce churn and could even increase churn. It depends on the method considered.
Concretely, for example, MeDQN [8] reduces forgetting by preventing DQN’s value prediction from deviating from a previous copy (in practice, the target network). Note that there is a time gap between the current Q-network and the target network. Therefore, MeDQN does not reduce the churn (which occurs instantly) and could incur extra churn when rolling back the network outputs to the target network.
Q3: The relationship between churn and interference
For the earliest NTK-related paper in DRL [3] mentioned by the reviewer, we cited it in Line 159 of our submission (i.e., the first time we mentioned NTK in our paper), and our analysis with NTK expressions is inspired by it. Also following [3], we found that [1,2,4] studied interference as the change in value approximation error in different forms (e.g., squared TD/TD($\lambda$) error), and they mainly focused on value-based RL (e.g., DQN).
To establish a possible view of the relationship between churn and interference: churn is the change in the network output, while interference is the change in the objective function of the network. Thus, we agree that the studies on churn and interference are essentially closely related.
We emphasize that we also study the churn in the policy network (thus going beyond value-based RL). Moreover, we present the interplay between the value churn and the policy churn and the dynamics under the chain effect. This is not studied in previous work to the best of our knowledge.
We appreciate that the reviewer also noted [5-7]. We have included them in our revision to strengthen the related work section.
Q4: The definition of churn in Equation 1 (an absolute operation or square operation should be applied) and amendment of the derivations
We appreciate the reviewer for pointing out this discrepancy in our derivations. After carefully re-checking the derivation, we found this issue is amendable with several modifications (as shown below), and the insights given by our derivations do not change.
The amendment has been done as follows (please refer to the official comment attached for a complete description with math):
- First, we add the (signed) definition of churn on a single data point.
- Next, we re-write Eq. 1 as the mean of the (element-wise) absolute churn over the data points; an illustrative form is given after this list.
- We then apply the same logic and amend the definition in Equation 4 by defining the (signed) deviations on single data points and then using them to re-write the batch-level quantities with absolute operations.
- By using the absolute operation, there is no cancel-out issue any longer. This also makes it rational to use the single-data-point form for simplification in our NTK discussions; we replace the corresponding terms in Equations 3 and 5 accordingly.
- This logic also applies to the analysis in Equation 6 and further to the analysis of the upper bound.
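For concreteness, one illustrative form of the amended batch-level definition (with illustrative notation, where $D_{\text{ref}}$ denotes a reference batch and $\theta_t$, $\theta_{t+1}$ the parameters before and after an update; this is not necessarily the exact notation of the paper) is

$$ c_Q(D_{\text{ref}}) \;=\; \frac{1}{|D_{\text{ref}}|} \sum_{(s,a) \in D_{\text{ref}}} \big|\, Q_{\theta_{t+1}}(s,a) - Q_{\theta_t}(s,a) \,\big|, $$

i.e., the mean absolute value of the signed per-point churn $Q_{\theta_{t+1}}(s,a) - Q_{\theta_t}(s,a)$, so that positive and negative changes can no longer cancel out.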
Q5: The connection between theoretical derivations and the proposed method
The main message of our formal analysis is the chain effect of churn in Figure 2. It directly motivates us to reduce the value or policy churn to suppress the chain effect. Then the concrete consequences presented in Section 4.1 guide us to the specific choices in different problem settings.
Q6: The difference between $\pi_\phi$ and $\tilde{\pi}_\phi$ ($Q_\theta$ and $\tilde{Q}_\theta$) in Section 3.1 and Figure 1
Please kindly refer to Additional Response in the official comment attached.
Q7: Number of random seeds
Please kindly refer to Additional Response in the official comment attached.
After carefully re-checking the derivation, we found this issue is amendable with several modifications (as shown below), and the insights given by our derivations do not change.
Since we use a general distance metric (which should be non-negative) at the beginning of Section 3.2 for the demonstration, we meant to use non-negative distance metrics in our specific cases as well, but we missed doing so.
The amendment has been done as follows:
- First, we add the (signed) definition of churn on a single data point.
- Next, we re-write Eq. 1 as the mean of the (element-wise) absolute churn of each data point.
- We then apply the same logic and amend the definition in Equation 4 by defining the (signed) deviations on single data points and then using them to re-write the batch-level quantities with absolute operations.
- By using the absolute operation, there is no cancel-out issue any longer. This also makes it rational to use the single-data-point form for simplification in our NTK discussions; we replace the corresponding terms in Equations 3 and 5 accordingly.
- This logic also applies to our analysis of the parameter update bias in Equation 6 and Appendix A.2 from the view of a single data point. With the absolute operation, this further applies to the analysis of the upper bound of the parameter update bias when we consider a batch of data points.
Q6: The difference between $\pi_\phi$ and $\tilde{\pi}_\phi$ in Section 3.1 and Figure 1
Every training step of a function approximator is accompanied by churn, i.e., the network is changed by both the (active) explicit training and the (passive) churn. Conventionally, the two components are viewed as a whole. In Section 3.1, we use a two-step view of the two components: e.g., for the value learning, step 1 is that the value network is updated by the explicit training on a batch of data for the policy evaluation, and step 2 is that the value churn occurs on top of that update.
In practice, the two steps happen at the same time. Thus, $\tilde{Q}_\theta$ and $\tilde{\pi}_\phi$ in Figure 1 are the virtual intermediate states in the learning process, used for the purpose of a disentangled illustration.
We have clarified it and amended the notations to eliminate the confusion in the revision.
Q7: Number of random seeds
In the one-page pdf, we provide the results across 12 seeds for CHAIN PPO in Figure 17 and additional results for ten DeepMind Control Suite tasks with 12 seeds in Figure 19. The results demonstrate the effectiveness of CHAIN in improving the learning performance of PPO.
We found that 6 seeds are sufficient to present a clear evaluation for MinAtar. We will run more seeds for TD3/SAC as suggested.
Dear Reviewer,
We hope that you've had a chance to read our responses and clarification. As the end of the discussion period is approaching, we would greatly appreciate it if you could confirm that our updates have addressed your concerns.
As the end of the discussion period is approaching, we would greatly appreciate it if the reviewer could confirm that our response has addressed your concerns.
Concretely, we provide detailed discussions and clarifications in our response to address the main concerns raised by the reviewer on the difference between CHAIN and MeDQN(R) (please refer to Q1), the relationship between churn, catastrophic forgetting (Q2) and interference (Q3) and the amendment of the formal study (Q4). We also provided additional experimental results in the one-page pdf material with ten new tasks, more seeds, and extra learning settings.
We believe that these additions and modifications help to address all of the concerns raised in the review, but please let us know if there are any further issues to address.
Dear reviewer oLah,
This submission received divergent ratings. You are the most critical about the novelty and quality of this paper. The authors have prepared a detailed response to the review. I would appreciate your engagement in the discussion with the authors to clarify which concerns remain.
Thanks, AC
Sorry for a late reply. And thank you for your detailed response.
It is good to see that the theory issue is resolved. I've updated my score accordingly.
A few more comments.
Churn is the change in the network output, while interference is the change in the objective function of the network.
This claim is wrong. Interference refers to the change of the network output as well. After all, how could it be possible to change the objective function of the network without changing the network output? Specifically, I believe churn is just an outcome of interference.
policy/value churn vs. catastrophic forgetting
Thank you for the discussion about these two topics. It would be beneficial to include the discussion (policy/value churn v.s. catastrophic forgetting, MeDQN v.s. CHAIN) in the updated draft.
catastrophic forgetting can be viewed as a consequence of the accumulation of churn
I don't agree. Forgetting can also happen after one single update, as demonstrated in [8]. In fact, I tend to think that both catastrophic forgetting and churn are outcomes of interference. That is, interference is the root cause of both phenomena.
“6 seeds for MinAtar is enough.” I am not sure about that. The shaded areas of the two algorithms in Space Invaders still overlap with each other. More seeds are needed.
Overall, the biggest value of this work is neither pointing out a new phenomenon (i.e., churn) nor proposing a new algorithm, since churn is just an outcome of interference and the key idea behind CHAIN is largely similar to MeDQN. The most valuable insight in this work is identifying the interplay between the value churn and the policy churn and the dynamics under the chain effect.
We greatly appreciate the time and effort the reviewer devoted to reviewing our work and actively participating in the discussion. The reviewer's valuable comments helped us to amend our formal analysis, and established insightful connections to related research topics like catastrophic forgetting and interference.
Here are some more clarifications and discussions regarding our claims on churn and interference mentioned by the reviewer.
We agree with the reviewer that interference is usually the more general word in the RL community; as claimed in [Liu et al., 2023], it “classically refers to an update negatively impacting the agent’s previous learning—eroding the agent’s knowledge stored in the value function”. In contrast, the word "generalization" often (but not always) refers to an update positively impacting the agent’s previous learning. Churn is a more neutral word that describes the phenomenon or behavior of the neural network. However, they have very similar meanings in essence, and their specific meanings often depend on the context.
In the context of our work and the relevant works mentioned by the reviewer, i.e., [Liu et al., 2023] and [Liu et al., 2020], interference is formally defined as several functions regarding TD error in Section 3 of [Liu et al., 2023] or as the difference in Mean-Squared Value Error in Section 4.1 of [Liu et al., 2020] (they are what we meant by “the objective function of the network” in our claim); while churn is defined in Equation 1 in our paper to be the change of network output.
Therefore, we made the claim “Churn is the change in the network output, while interference is the change in the objective function of the network” according to their specific formal definitions in these works. After all, this is a slight difference that is specific to the formal definitions used in different works.
As in the works mentioned by the reviewer, interference has a specific meaning to denote the change (or deviation) in concrete objectives. That is why interference is often considered to be a “negative” word while generalization is often positive (e.g., the change of network leads to a better prediction). In this sense, churn is neutral since it only describes the change of network output, without taking into consideration an optimal objective. In this context, this is also why we provided the possible understanding: interference is the negative outcome of churn and generalization is the positive outcome of churn. However, we agree that the meanings of churn, interference, and generalization are arguable in the RL community, and we would not make these claims in our paper.
We have added the relevant works mentioned by the reviewer, along with these insightful discussions, to our paper. We are also running 6 more seeds for MinAtar and will make sure to update the results in our revision.
The authors focus on improving the current state of deep reinforcement learning by addressing the churn effect in deep neural network training. The churn effect in deep reinforcement learning is the phenomenon where output predictions for data outside the training batch can change during training updates. This can lead to instability, suboptimality, and even collapse. Previous works addressing churn are limited in scope and do not provide a method to counter the churn effect. In this work, the authors perform a detailed exploration of the churn effect in RL and propose an algorithm to counter this issue. They characterize churn in the view of Generalized Policy Iteration with function approximation and discover a chain effect of churn. This chain effect leads to a cycle that compounds and biases the learning dynamics throughout the iteration. The authors study the effect of churn across different setups and propose the CHAIN algorithm, which can be easily plugged into most existing DRL algorithms. Experiments show that their method is effective in both reducing churn and improving learning performance across online and offline, value-based and policy-based RL settings.
Strengths
The problem is well motivated; the authors provide detailed explanations of what the problem is, why it is important, and what has been done previously, along with a detailed explanation of their methodology.
Weaknesses
- In the introduction, please provide a short intuition behind the CHAIN algorithm and how it achieves explicit control of churn. Perhaps mention that CHAIN achieves this by modifying the original loss function.
- Expand on the figure captions. This applies to all figures 1-6.
- For figure 1, what is meant by the policy churn and value churn arrows? Are there any updates going on in between? In the description, the authors mentioned “explicit training”, please elaborate what is meant by this.
- For figure 2, expand the caption for 2b. In figure 2a there is an arrow without a direction; please double-check whether this is accurate, and if so, mention in the caption why that is the case.
- For figure 3-6, mention the key takeaways from the graphs in figure caption
- In table 1, CHAIN seems to perform worse for AM-large-diverse-v2 dataset, any explanation/intuition as to why that may be the case?
- Figure 3 appears after figure 4, please correct this.
Questions
Please see weakness section
Limitations
Yes, the authors mention limitations in the paper.
We appreciate the reviewer’s careful review and constructive feedback, and the reviewer’s recognition of the motivation of this work and the importance of studying the churn problem in deep RL. Our additional results in the one-page pdf further demonstrate the effectiveness of CHAIN equipped with our method for auto-adjustment of the regularization coefficient, with ten new DMC tasks and an additional investigation on learning with larger networks.
The main concerns focus on several expression details and the reviewer also noted valuable writing suggestions. Our response aims to clarify these aspects in detail.
Q1: “For figure 1, what is meant by the policy churn and value churn arrows? Are there any updates going on in between? In the description, the authors mentioned “explicit training”, please elaborate what is meant by this.”
There are no updates along the two red wavy arrows, i.e., the policy churn and value churn arrows. The only “explicit training” is the policy evaluation and the policy improvement, which are denoted by the black arrows.
Every training step of a function approximator is accompanied by churn, i.e., the network is changed by both the (active) explicit training and the (passive) churn. Different from the conventional view that takes the two components as a whole, in Figure 1 we use a two-step view for both the value and policy learning: e.g., for the value learning case, step 1 is that the value network is updated by the explicit training on a batch of data for the policy evaluation, and step 2 is that the value churn occurs on top of that update.
In practice, the two steps happen at the same time. Thus, $\tilde{Q}_\theta$ and $\tilde{\pi}_\phi$ in Figure 1 are the virtual intermediate states in the learning process, used for the purpose of a disentangled illustration.
We appreciate the reviewer for pointing out this. We have clarified it and amended the notations to eliminate the confusion in the revision.
Q2: “In table 1, CHAIN seems to perform worse for AM-large-diverse-v2 dataset, any explanation/intuition as to why that may be the case?”
After increasing the number of seeds to 12 and trying smaller regularization coefficients, the scores of CHAIN IQL on AM-large-diverse-v2 are: 26.67 ± 3.96 for the coefficient reported in Table 1, and 28.33 ± 4.05, 35.0 ± 4.48, and 32.5 ± 2.08 for three smaller coefficients.
Thus, we hypothesize that the coefficient used in Table 1 slightly over-regularizes IQL, while the smaller coefficients improve IQL. This indicates that there is still room to achieve a better score with a smarter strategy for choosing the coefficient.
Q3: On the writing suggestions
We appreciate the reviewer’s valuable suggestions on improving the expressions in the introduction, the figure captions, etc.
- [Introduction, “please provide a short intuition behind the CHAIN algorithm, how is it achieving the explicit control of churn. Perhaps mention that by modifying the original loss function, CHAIN can achieve this”] We added the details “The main idea of CHAIN is to reduce the undesirable changes to the policy and value networks for states (and actions) that are outside of the current batch of data. The motivation is similar to the monotonic improvement principle of PPO and TRPO, which in their cases is achieved by improving just the action distribution for the samples in the current batch. Concretely, CHAIN minimizes one additional churn reduction regularization term computed with a separate data batch along with the optimization of the original policy or value learning objectives” to describe the intuition, right after the first sentence in Line 50. An illustrative sketch of this loss structure is given after this list.
- [Figure 2a, “there is an arrow without direction”] It should be an arrow pointing to the right, and we have amended it.
- [Figure 3-6] We have added takeaways to the captions as suggested.
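As referenced in the first bullet above, here is a minimal illustrative sketch of the loss structure for a value-based agent: the original TD loss on the training batch plus a churn-reduction term on a separate reference batch. The network sizes, hyperparameters, and the choice of anchoring the reference-batch outputs to a periodically refreshed snapshot of the network are assumptions made for this sketch, not the exact implementation in the paper.

```python
import copy
import torch
import torch.nn as nn

# Illustrative sketch only (assumed names and details, not the paper's exact code).
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 3))  # 4-dim state, 3 actions
q_anchor = copy.deepcopy(q_net)          # snapshot of the network, used as the churn anchor
optimizer = torch.optim.Adam(q_net.parameters(), lr=3e-4)
lam, gamma = 0.1, 0.99                   # regularization coefficient and discount factor

def update(step, batch, ref_batch, anchor_period=8):
    s, a, r, s2, done = batch            # training batch
    s_ref, a_ref = ref_batch             # separate reference batch (outside the training batch)

    # Original objective: a standard TD loss on the training batch.
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * q_net(s2).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    td_loss = ((q_sa - target) ** 2).mean()

    # Churn-reduction term: keep the Q-values on the reference batch close to
    # what the recent snapshot of the network predicts for them.
    with torch.no_grad():
        q_ref_old = q_anchor(s_ref).gather(1, a_ref.unsqueeze(1)).squeeze(1)
    q_ref_new = q_net(s_ref).gather(1, a_ref.unsqueeze(1)).squeeze(1)
    churn_reg = ((q_ref_new - q_ref_old) ** 2).mean()

    loss = td_loss + lam * churn_reg     # joint optimization of both terms
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % anchor_period == 0:        # the refresh schedule is an assumed detail
        q_anchor.load_state_dict(q_net.state_dict())
    return td_loss.item(), churn_reg.item()
```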
Dear Reviewer,
We hope that you've had a chance to read our responses and clarification. As the end of the discussion period is approaching, we would greatly appreciate it if you could confirm that our updates have addressed your concerns.
As the end of the discussion period is approaching, we would greatly appreciate it if the reviewer could provide further feedback to our rebuttal, and let us know whether our response has addressed your concerns.
In the response, we provided a detailed clarification for Figure 1 (Q1) and an additional discussion on the results in AM-large-diverse-v2 with more results (Q2). In Q3, we detailed how we took the writing suggestions and modified the text in the revision.
Moreover, to further strengthen our experimental evaluation and demonstrate the significance of CHAIN, we provided additional experimental results in the one-page pdf material with ten new tasks from DeepMind Control Suite, more seeds, and an extra learning setting on DRL scaling.
We believe that these additions and modifications help to address all of the concerns raised in the review, but please let us know if there are any further issues to address.
Deep RL optimisation exhibits many instabilities and performance collapses. Schaul et al. [2022] discuss a pathology termed the "policy churn", where even a single update to the value network frequently changes the optimal action for a huge fraction of all states (most of which were not present in the training batch). Following in these steps, the current paper characterises churn through the lens of generalised policy iteration, and proposes a simple regularisation loss that can be intuitively described as restricting the actor and critic predictions from changing "too much" on other states.
Strengths
The authors argue convincingly that churn effects in actor and critic optimisation can interact and amplify each other, in what they call the "chain effect", and propose a surprisingly simple regularisation loss to weaken this effect. As a result, the churn measurably goes down, which appears to correlate with better learning performance. As there are few papers on mitigating churn, I think these findings are useful and can help inform further research.
Weaknesses
Writing
I have some concerns about the writing. The topic is important, and relatively new, so good communication would be valuable.
The paper is quite well written until page 4, but later, it gets hard to follow.
For example, the language starts to get more vague, e.g. line 176 -
One thing to note is the two types of deviation are derived in an exchange manner between the value and policy churn,
I found Section 3.3 particularly hard to follow and had to reread it many times. It introduces a lot of new symbols in one go, uses them in 1/3rd of the page, and then moves on to the next topic. The section's conclusion is that churn and parameter update bias amplify each other as the optimization goes on (something which the reader is already informally aware of since early on in the paper).
Afterwards, the main method (CHAIN) is introduced as late as page 7, following which the paper appears quite rushed.
I think the paper could benefit from:
- moving the NTK discussion (which is never used afterwards) and everything theoretical after that to the appendix. The math may be saying the right things, but I did not find it useful in this context.
- focusing on describing CHAIN and its experiments in more detail.
326: You mention Fig 5, but I think you mean Fig 6
328: I'm not sure what the following bolded part means. Does it mean that the target critic network slows down credit assignment / weight updates, and therefore reduces churn? It would be helpful to be more verbose in the text right here.
We hypothesize that this is because policy interacts with the environment directly and the target critic network also helps to cap the value churn.
Experiments
The churn reduction looks quite noticeable and the MinAtar evaluations also show a notable performance boost. How much of this observed churn reduction can be explained by simply "slowing down" the parameters' changes?
I think this is quite important to test. One could also simply train both actor and critic with:
- slightly reduced learning rates
- different target network momentums

... and observe whether the churn also reduces due to those things (and whether the agent performance also correspondingly improves). I believe such an experiment would give clearer insights, and possibly strengthen the paper.
Line 624:
Between TD3 and SAC, CHAIN-PCR works better for TD3 rather than SAC. We hypothesize that this is because KL-based PCR used for SAC poses more challenges in code implementation.
That is concerning, because it sounds like a bug in your code. Not affecting the entire paper, but it does make it hard to believe the SAC results.
Questions
IQL: Line 338 says you apply CHAIN to IQL --- could you explain how? as far as I can tell, IQL does not need to train an actor network in tandem with a critic. It's likely that I did not fully understand this part.
Limitations
The authors do mention the limitations of the work, but the paper could be greatly improved by reducing the role of the mathematical exposition that takes up the majority of the paper, and giving more space to the experiments.
This is an important area of work that can benefit from new empirical findings, and I am quite willing to improve my rating if my concerns are addressed.
We appreciate the reviewer’s valuable review and the recognition of the value of the chain effect and the method proposed in our work. Our additional results in the one-page pdf further demonstrate the effectiveness of CHAIN in ten DMC tasks and in improving learning when larger networks are used. We also believe that our study on the chain effect of churn can inspire more future work on understanding and addressing the learning issues of DRL.
The main concerns focus on our writing and several experimental details. Our response aims to systematically clarify these aspects.
Q1: Churn reduction vs. slowing down the parameters’ changes (with reduced learning rates and different target network momentums)
We appreciate the reviewer for pointing out this insightful investigation. We ran additional experiments using different learning rates and target network replacement rates. The results are summarized in Table 10 of the one-page pdf.
In terms of learning performance (i.e., final episode return), we can observe that reducing either the learning rate or the target network replacement rate often leads to worse performance, especially for TD3. To some degree, this matches common knowledge in the RL community.
As to churn reduction, in principle, using smaller learning rates or target network replacement rates should lead to less churn. This is because churn is positively related to the amount of parameter update (as shown by the NTK in Equation 3), and a slower target network further delays how much the churn that occurs at each training step is reflected in the target values that the critic network fits. We also observed empirical evidence for this (omitted due to the one-page space limitation).
This indicates that the issue of churn cannot be addressed by reducing learning rate or target network replacement rate (which just slows down the learning process). Churn is a “by-product” of the training of DRL agents and should be addressed separately. Actually, we observed that applying CHAIN when using smaller learning rates also improves the learning performance in some tasks, which is omitted due to one-page space limitation.
Q2: “Line 338 says you apply CHAIN to IQL --- could you explain how? as far as I can tell, IQL does not need to train an actor network in tandem with a critic. It's likely that I did not fully understand this part”
According to Section 4.3, Equation 7, and Algorithm 1 in the original IQL paper (“Offline Reinforcement Learning with Implicit Q-Learning”), a policy network is explicitly trained with advantage-weighted regression based on the Q and V networks. Concretely, a Gaussian policy is generated by the policy network of IQL for action selection, as the D4RL tasks in our experiments have continuous action spaces.
Therefore, we apply CHAIN to IQL by optimizing the policy churn reduction regularization together with the advantage-weighted regression policy objective, as in Equation 10 of our paper. Please let us know if there are any further questions about this point.
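For concreteness, a minimal sketch of what this could look like for an AWR-style policy update (the actor architecture, the L2 form of the policy churn reduction term, and the snapshot-refresh schedule are assumptions for illustration; IQL itself uses an advantage-weighted Gaussian log-likelihood):

```python
import copy
import torch
import torch.nn as nn

# Illustrative sketch only: AWR-style policy loss plus a policy-churn-reduction (PCR)
# term on a separate reference batch of states. Names and details are assumed.
actor = nn.Sequential(nn.Linear(17, 256), nn.ReLU(), nn.Linear(256, 6))  # state -> action mean
actor_anchor = copy.deepcopy(actor)          # recent snapshot used as the churn anchor
opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
lam, beta = 0.05, 3.0                        # PCR coefficient and AWR temperature

def policy_update(step, s, a, advantages, s_ref, anchor_period=8):
    # Advantage-weighted regression: imitate dataset actions, weighted by exp(beta * A).
    # (Simplified here to a weighted squared error on the action mean.)
    weights = torch.clamp(torch.exp(beta * advantages), max=100.0).detach()
    awr_loss = (weights * ((actor(s) - a) ** 2).sum(dim=1)).mean()

    # PCR: discourage the policy output on reference states from drifting away
    # from what the recent snapshot predicts for them.
    with torch.no_grad():
        mean_old = actor_anchor(s_ref)
    pcr = ((actor(s_ref) - mean_old) ** 2).sum(dim=1).mean()

    loss = awr_loss + lam * pcr              # critic training is left untouched
    opt.zero_grad(); loss.backward(); opt.step()

    if step % anchor_period == 0:            # the refresh schedule is an assumed detail
        actor_anchor.load_state_dict(actor.state_dict())
    return loss.item()
```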
Q3: On the writing suggestions
We appreciate the reviewer for pointing out these issues in our writing and providing suggestions, which are very valuable to the further polish of our paper.
- [Line 176, “exchange manner”] We amended this improper expression by re-writing the sentence to “the two types of deviation show the interplay between the value and policy churn, as the value churn causes the deviation in policy (i.e., the action gradient) and the policy churn causes the deviation in value (i.e., the policy value)”.
- [Section 3.3 and the organization of NTK discussions and experiments] In Section 3, we aim to present: (1) how the parameter update causes churn, with the NTK in Equation 3; (2) the interplay between the policy and value churn, with the NTK in Equation 5; and (3) how churn introduces bias in the parameter update, with Equation 6. The three components form the cycle: parameter update —> churn —> parameter update —> …, which causes instability. Therefore, we are afraid that completely moving the NTK material to the appendix could make some readers lose track of these connections. However, we agree with the reviewer that we can simplify and shorten the content of Section 3 to make the thread clearer and more prominent, and then make more room to strengthen our experiment section with more analysis and our additional results (as provided in the one-page pdf). We have re-organized the content in our revision and included additional experimental analysis on scaling networks, as well as an updated version of CHAIN with the method for adaptive regularization coefficients to reduce hyperparameter tuning.
- [Line 326] We amended it to be Figure 6 as pointed out.
- [Line 328, “the target critic network also helps to cap the value churn”] It means that the value churn accompanying each training step of the critic network is not immediately (or fully) reflected in the target critic, due to the delayed synchronization of the target network. Thus, the target critic should reduce the value churn by a factor associated with the exponential moving average coefficient (i.e., the momentum hyperparameter). We added these explanations to make it clearer.
- [Line 624, “KL-based PCR for SAC”] We meant to express that the variation (regarding the scale and range) is higher when optimizing the MaxEnt objective and the KL-based PCR term together for SAC than when optimizing the Q objective and the L2-based PCR term for TD3. Another hypothesis is that the MaxEnt nature of SAC prefers the encouragement of more stochasticity in the policy. We re-wrote the sentence to eliminate the expression issue.
Sorry for the wrong line number in the last response. We mentioned that we apply CHAIN to IQL by implementing the policy churn reduction in Line 337-340: “To apply CHAIN on IQL and AWAC, we implement the regularization for the policy churn reduction (Eq. 10) by adding a couple of lines of code without any other modification”. We appreciate the reviewer for pointing out the confusion and we have clarified this in our draft.
For Lines 346-348, we will rephrase the discussion, focusing on the effect of policy churn reduction and incorporating the potential additional results suggested by the reviewer.
We appreciate the reviewer for making this point clearer and for the valuable suggestions.
Yes. The training of the actor network does not influence the dynamics of the value network training, although the actor and critic are trained iteratively in practice. Therefore, to make a claim like “IQL suffers from the chain effect” would be improper. We will clarify this point in our draft.
[“You are still optimising L_QC with the modified IQL, correct?”] No, we only applied PCR to IQL and AWAC without touching the training of the critic network (as mentioned in Line 343-348).
Therefore, we can now credit the performance improvement achieved by CHAIN to its effect in reducing the policy churn only, and the suggested ablation is not applicable. One thing to note is that, despite the lack of the chain effect, as long as an explicit actor network is trained, policy churn exists (as in the PPO case), and the actor network directly interacts with the environment for the final evaluation.
For the extra experiment suggestion, we will try to present the results for only applying VCR to IQL to investigate whether "only L_QC should be contributing to the improved performance", before the discussion stage ends with our best effort.
Thanks for the swift response!
Yes. The training of the actor network does not influence the dynamics of the value network training, although the actor and critic are trained iteratively in practice. Therefore, to make a claim like “IQL suffers from the chain effect” would be improper. We will clarify this point in our draft.
Great!
No, we only applied PCR to IQL and AWAC without touching the training of the critic network (as mentioned in Line 343-348).
Hmm, I checked the mentioned lines and here are some issues
- You mention "[...] we use for both CHAIN IQL and CHAIN AWAC [...]", but you do not explicitly state there that only the policy churn reduction term is applied. So you need to be clearer and more verbose there.
- Further, in Lines 346-348: "[...] CHAIN suppresses the dual bias of policy value and helps address the extrapolation error of offline learning." I think "the dual bias of policy value" is vague and needs to be phrased better. It also gives the impression that the "chain effect" is treated in these methods, but it's not (as you said, you're not addressing value churn here at all!)
Therefore, we can now credit the performance improvement achieved by CHAIN to its effect in reducing the policy churn only
Okay, but this is really very surprising and needs to be probed further. As we discussed, training an actor is optional in IQL. What happens if you train IQL to the very end (without training the actor), then do a fixed budget of actor-only training against the final critic (without L_PC), and compare the results?
- If those actor-trained-against-final-frozen-critic results are better than the IQL baseline (which has interleaved actor and critic updates), it could suggest there is something harmful about training an actor against a time-varying critic, and can support your theory further.
- It would be also interesting to see how it compares against your current IQL + CHAIN runs, and I'd be curious to see your interpretation of the results.
For the extra experiment suggestion, we will try to present the results for only applying VCR to IQL to investigate whether "only L_QC should be contributing to the improved performance", before the discussion stage ends with our best effort.
Thank you, looking forward to it!
Thanks for the rebuttal. That addresses some of my concerns.
Therefore, we apply CHAIN to IQL by optimizing together with as Equation 10 in our paper. Please let us know if there are any further questions about this point.
If my understanding is correct, IQL trains only the critic with in-sample learning; it does not rely on an actor. In the end, once the Q network is trained, an actor can be trained to exploit this frozen Q network with AWR. Sure, in practical implementations, one might be constantly training an actor against an also-being-trained Q network, but the actor has no effect on the Q network's training dynamics.
**Therefore, IQL cannot, by design, exhibit the "chain effect", because training the actor network is optional.** You should acknowledge this in the relevant section.
IQL is a very nice guinea pig for your theory because it permits training without the funky actor-critic two-network dynamics. So in principle, you only have critic-churn to deal with.
The fact that you still get better performance in IQL despite that is quite interesting, and merits a discussion (and you need more space in the paper for that...). I am very curious to hear your thoughts on why this happens. You are still optimising L_QC with the modified IQL, correct? An ablation would be interesting; due to the lack of the "chain effect", I'd hypothesise that only L_QC should be contributing to the improved performance, not L_PC. You could try ablating L_QC and L_PC by setting their coefficients to zero in IQL training: i.e., do the VCR / PCR / DCR experiments. I would be curious to hear what happens, in your response (fine as a markdown table in an openreview comment, if you can't update the PDF).
Because you're using CORL, these ablations should be doable quite quickly.
I would be willing to raise my score upon receiving a thorough update!
We provide the additional results for IQL (sequential) over 12 seeds, i.e., actor-trained-against-final-frozen-critic, and CHAIN IQL (VCR) over 6 seeds below, along with the results for IQL and CHAIN IQL (PCR) which were reported in our submission.
In the following, we provide the implementation details and discuss the results. We are willing to hear more valuable comments from the reviewer and address any further questions.
| Task | IQL | IQL (sequential) | CHAIN IQL (PCR) | CHAIN IQL (VCR) |
|---|---|---|---|---|
| AM-umaze-v2 | 77.00 ± 5.52 | 60.00 ± 3.91 | 86.66 ± 4.11 | 88.33 ± 3.66 |
| AM-umaze-diverse-v2 | 54.25 ± 5.54 | 55.00 ± 5.46 | 63.33 ± 9.42 | 71.67 ± 7.23 |
| AM-medium-play-v2 | 65.75 ± 11.71 | 52.50 ± 3.36 | 83.33 ± 9.33 | 70.00 ± 5.27 |
| AM-medium-diverse-v2 | 73.75 ± 5.45 | 53.33 ± 5.93 | 80.00 ± 12.90 | 70.00 ± 6.67 |
| AM-large-play-v2 | 42.00 ± 4.53 | 17.5 ± 4.10 | 50.00 ± 5.77 | 41.67 ± 7.61 |
| AM-large-diverse-v2 | 30.25 ± 3.63 | 5.83 ± 2.19 | 26.67 ± 12.50 | 38.33 ± 5.97 |
“What happens if you train IQL to the very end (without training the actor), then do a fixed budget of actor-only training against the final critic (without L_PC), and compare the results?” (the results for actor-trained-against-final-frozen-critic)
As suggested by the reviewer, we slightly modified the training process of the CORL implementation of IQL to realize “actor-trained-against-final-frozen-critic”:
- First train the value network and Q network for 1M steps.
- Then train the policy network for 1M steps with the value network and Q network frozen.
- We do not modify any other implementation detail and use the same hyperparameters.
- We check the learning curves of the policy network and the final scores.
We call this variation IQL (sequential). The total number of gradient steps is the same as the default IQL implementation where the critic and actor are trained iteratively.
We report the means and standard errors of the final scores achieved by IQL (sequential) over 12 random seeds.
The results in terms of final score show that IQL (sequential) performs worse than IQL in 5 of 6 tasks. For the learning curves (which we are not able to present at this stage), we found the policy performance increases efficiently and does not change much after 3e4 - 1e5 steps (depending on different tasks) of the policy training, without a large variance or a collapse through training.
The difference between IQL (sequential) and IQL can be fully attributed to the difference in the training dynamics, mainly on the policy network (as the training of the policy network does not influence the value network training in IQL). We think it is somewhat tricky to explain the difference between IQL (sequential) and IQL. The difference in the training dynamics here is beyond the scope of the chain effect of churn studied in our work.
To provide some possible explanation, we guess that the difference in the training dynamics of the policy network of IQL results in a further difference in the policy output on out-of-sample states.
The policy outputs on these out-of-sample states are fully determined by generalization. Compared with IQL (sequential), the distribution of the advantages encountered during the default IQL training should cover a wider range of values, thus providing more diverse gradient directions for the training of the policy network. This could lead to better robustness on out-of-sample states.
Finally, we believe that a more systematic study for offline RL is needed by focusing on how the training on in-sample data affects the network output on out-of-sample data under different training schemes, which is beyond the scope of our work.
Discussion on “(only) applying VCR to IQL”
Technically, the value training in IQL includes the training of a Q network and the training of a V network. To apply VCR to IQL, we modify IQL by optimizing one additional Q-value churn reduction regularization term jointly with the original Q loss (without changing the training of the V network). We call this variation CHAIN IQL (VCR).
The results for CHAIN IQL (VCR) are reported in the table above.
We can observe that CHAIN IQL (VCR) outperforms IQL in 4 of 6 tasks, and outperforms CHAIN IQL (PCR) in 3 of 6 tasks. This demonstrates the effectiveness of reducing the value churn in IQL. This also indicates that even though the chain effect does not exist in IQL, the value churn and the policy churn in the value training or the policy training can still negatively affect the learning performance.
Thanks for running these experiments.
I certainly did not expect IQL (sequential) to perform worse. The new results are quite surprising and will make people think.
I would urge you to add these findings to the paper, they are quite valuable. Appendix or main text, up to you!
I’m raising my score. Great work!
We greatly appreciate the time and effort the reviewer devoted to reviewing our work and actively participating in the discussion.
The reviewer's valuable comments led us to improve our study with more careful discussions and better interpretations of our method. The discussion with the reviewer also inspired us to think about new ideas for future research beyond this work.
We will add these new findings to our paper with a better organization as suggested.
Thank you!
This paper focuses on the training instability arising from the non-stationary nature of DRL, whose symptoms are unexpected shifts in policy and value network outputs (policy churn and value churn). To mitigate policy and value churn, the proposed CHAIN algorithm adds penalty terms to the policy and value updates.
Strengths
CHAIN is a versatile solution that can be integrated into different backbone algorithms.
Weaknesses
- Extra introduced hyperparameters and non-trivial hyperparameter settings.
- No statistically significant improvements in many tasks, even equipped with good hyperparameters.
Questions
- CHAIN introduces two extra coefficients, for the value and policy churn reduction terms, to control the degree of regularization. I find that they are set differently for different benchmark suites, tasks, or backbone algorithms. So I have the following questions: What is the insight behind the chosen hyperparameters for different settings? If potential readers want to apply CHAIN to other settings, how can they set the two hyperparameters? Actually, I find some explanations on this in Appendix C.2, but it requires making [policy loss]/[regularization loss] within 0.1 to 0.01; thus one needs to first try, then observe the ratio, and then tune, which may need several runs whose cost cannot be ignored. Introducing extra hyperparameters for which a single setting does not suffice to obtain satisfying results across varying settings is not encouraged, as it may actually produce worse performance than the backbone algorithm if the hyperparameters are set wrongly.
- How many seeds do you use for the experiments in the main paper? For example, Figure 6. CHAIN SAC in Figure 6 does not show a statistically significant improvement over SAC in Hopper-v4, Walker2d-v4, and HalfCheetah-v4, which may raise some doubts about the effectiveness of CHAIN, especially considering that an unsuitable hyperparameter setting may produce even worse results for CHAIN SAC.
- As the paper focuses on mitigating the training instability of DRL and its non-stationary properties, some other methods may need to be compared against or discussed, such as AlgaeDICE [1], etc.
- Further, in online RL, over-conservatism may hinder performance, so the authors need to discuss whether CHAIN would negatively influence exploration in online RL and then hinder the asymptotic performance.
- Another style of work, for example [2], actually encourages network neurons to awaken in online RL, thus preventing convergence to local optima. Could you provide some explanation of whether CHAIN would worsen the local optimality issue?
[1] Nachum, O., Dai, B., Kostrikov, I., Chow, Y., Li, L., & Schuurmans, D. (2019). Algaedice: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074.
[2] Xu, Guowei, et al. "Drm: Mastering visual reinforcement learning through dormant ratio minimization." arXiv preprint arXiv:2310.19668 (2023).
Limitations
The authors have acknowledged the limitations.
We appreciate the reviewer’s valuable review and the recognition of the versatility of the proposed method. Our method is simple but versatile to improve the performance of many learning algorithms. We show that the method works across continuous and discrete control tasks, online and offline settings. Moreover, in our additional results (Table 9 in the one-page pdf) we show that the method can greatly improve the scaling of RL algorithms. This scaling is also improved by an adaptive version of our method that reduces hyperparameter tuning.
The main concerns focus on the hyperparameter choice for different settings and CHAIN’s effect on exploration and optimality. Our response aims to clarify these aspects in detail.
Q1: The insight behind the chosen hyperparameters for different settings
How to pick or adjust the regularization coefficient is an unavoidable and tricky problem for most regularization-based methods. In the context of our work, the difference in the scale of quantities like the Q value and the policy objective across different settings leads to the use of different coefficients. This is also a long-standing problem in DRL.
To relieve the pain of manually picking coefficients for different tasks and settings, we additionally propose a simple but effective method for automatically adjusting the regularization coefficient given a target relative loss scale. It is realized by maintaining running means of the absolute policy (or Q) loss and of the PCR (or VCR) term, and dynamically computing the regularization coefficient from their ratio. This method adjusts the coefficient dynamically to keep a consistent relative loss scale throughout learning.
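A minimal sketch of this adjustment mechanism (the variable names, the EMA decay, and the exact update rule are assumptions for illustration, not necessarily the implementation used in the one-page pdf):

```python
# Keep the churn-reduction term at a fixed fraction (target_scale) of the main loss
# by tracking running means of their magnitudes. Pure-Python illustration.
class AdaptiveCoef:
    def __init__(self, target_scale=0.01, decay=0.99, eps=1e-8):
        self.target_scale = target_scale
        self.decay = decay
        self.eps = eps
        self.loss_avg = None   # running mean of |main loss| (policy or Q loss)
        self.reg_avg = None    # running mean of |churn-reduction term| (PCR or VCR)

    def update(self, main_loss, reg_term):
        l, r = abs(float(main_loss)), abs(float(reg_term))
        self.loss_avg = l if self.loss_avg is None else self.decay * self.loss_avg + (1 - self.decay) * l
        self.reg_avg = r if self.reg_avg is None else self.decay * self.reg_avg + (1 - self.decay) * r
        # Coefficient such that coef * reg_avg is roughly target_scale * loss_avg.
        return self.target_scale * self.loss_avg / (self.reg_avg + self.eps)

# Hypothetical usage inside a training step:
#   coef = adaptive.update(td_loss.item(), churn_reg.item())
#   loss = td_loss + coef * churn_reg
```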
Our additional experiments in the one-page pdf show that this method matches or surpasses the results achieved by the manual coefficient choices for DDQN in five MinAtar tasks (Figure 16), for PPO in five MuJoCo tasks (Figure 17) and ten additional DMC tasks (Figure 19), and for TD3 in four MuJoCo tasks (Figure 17).
Q2: Whether the CHAIN would negatively influence the exploration in online RL and then hinder the asymptotic performance
In principle, reducing churn suppresses the correlation of network outputs among different data points. This encourages the agent to keep the actions for different states independent. Thus, we hypothesize that it could help to keep the diversity of actions (in other words, prevent the collapse of action modes, as often seen in cases with severe learning instability for both RL [2] and LLMs [3]) and positively influence exploration.
In our experiments, we can see that CHAIN improves the asymptotic performance in the tasks that require effective exploration even though the value or policy churn is significantly reduced. For MinAtar, the original MinAtar paper mentioned “Seaquest and Freeway present a greater exploration challenge than the other games” on page 5, and our results in Figure 4 show CHAIN improves the final performance of DDQN by clear margins in these two environments.
Moreover, we additionally provide the comparison between CHAIN PPO and PPO in two sparse-reward tasks of DeepMind Control Suite (DMC): ball_in_cup-catch-v0 and cartpole-swingup_sparse-v0. The results are shown in the first two subplots in Figure 19. We can see that CHAIN PPO achieves a higher asymptotic performance in ball_in_cup-catch-v0 and achieves almost the same performance in cartpole-swingup_sparse-v0.
Q3: DrM [2] encourages the network neurons to awaken, thus preventing convergence to local optima. Whether CHAIN would worsen the local optimality issue
DrM prevents the neurons from being dormant with an adaptive perturbation reset method. This addresses the loss of plasticity/learnability and thus alleviates the local optimality issue.
In principle, CHAIN has a positive effect in preventing the loss of plasticity. As in Equation 3, reducing churn encourages the NTK terms between different data points to approach 0. This prevents the empirical NTK matrix from being low-rank, which is shown to be a consistent indicator of plasticity loss by [1] (they also claimed the empirical NTK is a more reliable indicator than the dormant neuron ratio).
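As a rough illustration of the quantity involved (not code from the paper), the empirical NTK entries between a training batch and a reference batch can be estimated from per-sample gradient inner products; large cross entries mean that a gradient step on the training batch also moves the outputs on the reference data, i.e., causes churn:

```python
import torch
import torch.nn as nn

def flat_grad(model, scalar_out):
    # Gradient of a scalar output w.r.t. all model parameters, flattened to one vector.
    grads = torch.autograd.grad(scalar_out, model.parameters())
    return torch.cat([g.reshape(-1) for g in grads])

def cross_ntk(model, x_train, x_ref):
    # k(x, x') = grad_theta f(x) . grad_theta f(x'), for x in the training batch
    # and x' in the reference batch.
    g_train = torch.stack([flat_grad(model, model(x).sum()) for x in x_train])
    g_ref = torch.stack([flat_grad(model, model(x).sum()) for x in x_ref])
    return g_train @ g_ref.T

net = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 1))
x_train, x_ref = torch.randn(8, 4), torch.randn(8, 4)
K = cross_ntk(net, x_train, x_ref)
print(K.shape, K.abs().mean())   # mean |k(x, x')| as a rough proxy for churn potential
```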
In our experiments, we can see that CHAIN improves the final scores in most cases across MinAtar, MuJoCo (especially with CHAIN PPO) and additional ten DMC tasks (in Figure 18 of the one-page pdf). Additionally, we provide more results in the one-page pdf to show that CHAIN improves the final performance both when running longer (Figure 17 Ant and Humanoid for 10e6 steps) or with larger networks (Table 9).
Q4: The discussion on AlgaeDice
We appreciate the reviewer for reminding us of this. AlgaeDICE aims to address the limitation of on-policy samples by re-expressing the on-policy max-return objective by an expectation over any arbitrary off-policy data distribution, which is followed by a solution in the form of dual optimization.
AlgaeDICE does not have a direct relation to churn or generalization of DRL agents, however, we believe it is orthogonal to CHAIN in mitigating training instability. We have integrated AlgaeDICE and more in our related work discussion.
Q5: The number of random seeds
In the one-page pdf, we provide the results across 12 seeds for CHAIN PPO in Figure 17 and additional results for ten DeepMind Control Suite tasks with 12 seeds in Figure 19. The results demonstrate the effectiveness of CHAIN in improving the learning performance of PPO.
We found that 6 seeds are sufficient to present a clear evaluation for DDQN in MinAtar. We will run more seeds for TD3/SAC as suggested.
Reference:
[1] Disentangling the causes of plasticity loss in neural networks. 2024.
[2] Overcoming Policy Collapse in Deep Reinforcement Learning. 2023.
[3] Controlling Large Language Model Agents with Entropic Activation Steering. 2024.
In your response, I am puzzled by the statement: "In our experiments, we can see that CHAIN improves the asymptotic performance in the tasks that require effective exploration even though the value or policy churn is significantly reduced." Could you elaborate on why CHAIN enhances exploration capabilities without becoming overly conservative? Specifically, what attributes of CHAIN contribute to its improved performance on tasks with sparse rewards? CHAIN applies regularization to policies and values but does not explicitly propose exploration-related terms. How, then, does it effectively address challenging exploration problems?
Thank you for your rebuttal.
I have further questions regarding the factors influencing performance. I am particularly interested in understanding which factors contribute most significantly to achieving high performance.
It is evident that hyperparameters play a crucial role; manual selection often yields suboptimal performance across various tasks. On the other hand, employing a smart mechanism typically enhances performance significantly. In light of this, I am curious why the smart mechanism is not included in the primary implementation. Would such an approach impact the theoretical analysis differently?
Additionally, Figure 17 highlights different churn reduction options, where the efficacy of DCR and PCR varies across tasks. How should one decide between these options, and what strategies can mitigate the need for task-specific tuning?
We appreciate the reviewer's interest in discussing and investigating the potential effect of CHAIN on exploration, which also inspires further research on this point in the future.
The exploration ability/behavior is determined by two factors: (1) the extrinsic factor, i.e., the exploration mechanism used, and (2) the intrinsic factor, i.e., the independence or diversity of the policy (or Q-value) network output.
For the extrinsic factor, CHAIN does not suppress the extrinsic exploration mechanisms used in DRL algorithms, e.g., epsilon-greedy in DQN, Gaussian noise in TD3, the state-independent variance parameter vector in PPO, and the state-dependent variance output in SAC.
For the intrinsic factor, in our rebuttal text we discussed how CHAIN helps to prevent action correlation (and, in the worst case, collapse) and could encourage action independence. Indeed, in the extreme case where the churn reduction regularization term dominates the joint learning objective, the agent would be prevented from learning any effective behavior. Our newly proposed method for keeping a consistent relative loss scale should help to avoid this kind of extreme case.
However, the best that CHAIN can do is to prevent the over-correlation of actions and the loss of independence. It does not introduce any extra heuristics or principles to encourage exploration (e.g., novelty, curiosity, uncertainty), which we think are necessary for learning effective exploration behaviors in challenging exploration problems.
Therefore, we do not think CHAIN has the ability to address challenging exploration problems by itself. It could be interesting to study the effect of CHAIN when applied together with existing exploration methods to DRL agents in the future.
Finally, we would like to emphasize that we do not propose CHAIN as an exploration method. It is a method to reduce churn and address the learning issues (as presented in Section 4.1) for better performance of DRL algorithms.
Dear Reviewer,
We hope that you've had a chance to read our responses and clarification. As the end of the discussion period is approaching, we would greatly appreciate it if you could confirm that our updates have addressed your concerns.
"It is evident that hyperparameters play a crucial role; manual selection often yields suboptimal performance across various tasks. On the other hand, employing a smart mechanism typically enhances performance significantly. In light of this, I am curious why the smart mechanism is not included in the primary implementation. Would such an approach impact the theoretical analysis differently?"
Thanks to the reviewer's valuable comments, the newly proposed method for automatic adjustment of the regularization coefficient was developed during the rebuttal stage. We will add this method/mechanism to the main body of the revised paper.
Across all our experiments, we found that manually selecting a regularization coefficient that keeps the relative loss scale below 0.01 rarely harms the learning performance of the baseline algorithms. Suboptimal performance due to over-regularization is almost only observed when the churn regularization term matches or surpasses the scale of the original learning objective.
This method adjusts the regularization coefficient dynamically during learning to keep a consistent relative loss scale. It follows the same churn-reduction insight presented in our analysis of the chain effect of churn, and we consider that it does not affect our formal analysis.
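For concreteness, here is a minimal sketch of the idea (not our exact implementation; the EMA smoothing, the multiplicative update rule, and the default values of `target_scale` and `adapt_rate` below are illustrative simplifications):

```python
# Sketch: adapt the churn-regularization coefficient so that
#   coeff * churn_loss  stays close to  target_scale * main_loss.
# The smoothing and update rule here are illustrative assumptions.
class ChurnCoeffAdapter:
    def __init__(self, target_scale: float = 1e-2, init_coeff: float = 1.0,
                 adapt_rate: float = 0.01, ema: float = 0.99):
        self.target_scale = target_scale
        self.coeff = init_coeff
        self.adapt_rate = adapt_rate
        self.ema = ema
        self.main_avg = None
        self.churn_avg = None

    def update(self, main_loss: float, churn_loss: float) -> float:
        # Track smoothed magnitudes of both loss terms.
        self.main_avg = abs(main_loss) if self.main_avg is None else \
            self.ema * self.main_avg + (1 - self.ema) * abs(main_loss)
        self.churn_avg = abs(churn_loss) if self.churn_avg is None else \
            self.ema * self.churn_avg + (1 - self.ema) * abs(churn_loss)
        # Relative scale of the weighted churn term w.r.t. the main loss.
        rel_scale = self.coeff * self.churn_avg / (self.main_avg + 1e-8)
        # Multiplicatively nudge the coefficient toward the target relative scale.
        if rel_scale > self.target_scale:
            self.coeff *= 1.0 - self.adapt_rate
        else:
            self.coeff *= 1.0 + self.adapt_rate
        return self.coeff

# Illustrative usage inside a training step:
#   coeff = adapter.update(policy_loss.item(), churn_loss.item())
#   total_loss = policy_loss + coeff * churn_loss
```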
"Figure 17 highlights different churn reduction options, where the efficacy of DCR and PCR varies across tasks. How should one decide between these options, and what strategies can mitigate the need for task-specific tuning?"
For applying CHAIN to deep AC methods, we recommend practitioners use PCR, because the policy interacts with the environment directly and, in practice, the target critic network already helps to alleviate value churn in the learning of the critic network. Empirically, we also found PCR often contributes more than VCR.
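As a concrete illustration of a PCR-regularized actor update in a TD3-style agent, here is a minimal sketch (not our exact implementation; the choice of a one-update-lagged snapshot `actor_prior` as the churn reference, the coefficient `pcr_coeff`, and the batch field names are illustrative assumptions):

```python
# Sketch: TD3-style actor update with a PCR term that discourages the actor's
# outputs on a held-out reference batch from drifting away from a recent
# prior snapshot of the actor. The snapshot handling below is one simple choice.
import copy
import torch

def actor_update_with_pcr(actor, actor_prior, critic, optimizer,
                          batch, ref_batch, pcr_coeff=0.1):
    # Snapshot the actor before this update; it becomes the reference next time.
    pre_update_state = copy.deepcopy(actor.state_dict())

    states, ref_states = batch["states"], ref_batch["states"]

    # Standard deterministic policy-gradient objective.
    policy_loss = -critic(states, actor(states)).mean()

    # PCR term on the reference batch (actor_prior lags one update behind).
    with torch.no_grad():
        ref_actions_prior = actor_prior(ref_states)
    pcr_loss = ((actor(ref_states) - ref_actions_prior) ** 2).mean()

    loss = policy_loss + pcr_coeff * pcr_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Keep the prior snapshot lagging the live actor by one update.
    actor_prior.load_state_dict(pre_update_state)
    return policy_loss.item(), pcr_loss.item()

# At setup (illustrative): actor_prior = copy.deepcopy(actor)
```

Using a snapshot that lags by exactly one update matches the per-update notion of churn; a more slowly updated reference (e.g., a polyak-averaged copy) would instead constrain drift over a longer horizon.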
As shown by our additional results in the one-page pdf, the need for task-specific selection of the regularization coefficient is addressed by our auto-adjustment method. Effective adjustment is achieved with a common target relative loss scale for all the tasks within the same domain.
More broadly, we have three pieces of advice for practitioners who want to adapt CHAIN to different problems in practice:
- Use our “target relative loss scale” insight to do automatic adjustment of the regularization coefficient.
- Start from a small target relative loss scale (e.g., 1e-5). It should be safe in the sense that the performance will not be harmed.
- Use techniques to normalize the scale of quantities (e.g., reward, advantage, Q). This is also one of the recipes used in DreamerV3 [Hafner et al., 2023].
Thank you for providing these details, which might be a guide for practical use.
Yet, I suggest the authors clearly state that the need for HP tuning and DCR/PCR selection is a limitation. I think it is not good to propose several mechanisms and solve different tasks using different HPs and mechanisms, unless all of the proposed mechanisms and HPs outperform the baseline.
Further, one of my concerns has not been answered well, "I am particularly interested in understanding which factors contribute most significantly to achieving high performance."
I appreciate your efforts in providing a detailed response. I will adjust my score accordingly.
We sincerely appreciate the reviewer's valuable comments and further feedback during the discussion stage. As the end of the discussion period is approaching, we would greatly appreciate it if the reviewer could confirm that our response has addressed your concerns.
For a brief summary, we provided detailed discussions about how we leverage the insight of the relative loss scale to further develop a simple but effective method for adapting the regularization coefficient throughout training. With our additional results in the one-page pdf, we showed that this method greatly relieves the need for manual selection of the regularization coefficient in different tasks and domains, while matching or even surpassing the performance of manually selected coefficients. To apply our churn reduction method to a different problem, we provided our thoughts and several pieces of practical advice.
We also discussed the effect of CHAIN on exploration and plasticity in detail. To provide more empirical evidence for this, we added two more sparse-reward DMC tasks (in addition to the two exploration tasks in MinAtar) to show the effectiveness of CHAIN in these tasks, although our work focuses on how churn occurs and affects DRL and does not aim to address challenging exploration problems.
Moreover, to further strengthen our experimental evaluation and demonstrate the significance of CHAIN, we provided additional experimental results in the one-page pdf material with ten new tasks, more seeds, and an extra learning setting on DRL scaling.
We believe that these additions and modifications help to address all of the concerns raised in the review, but please let us know if there are any further issues to address.
We greatly appreciate the time and effort the reviewer devoted to reviewing our work and actively participating in the discussion. The reviewer's valuable comments drove us to propose a method for dynamic adjustment of the regularization coefficient, and inspired more thinking on the effect of churn reduction in exploration, plasticity, etc.
"I am particularly interested in understanding which factors contribute most significantly to achieving high performance."
Regarding the different learning settings considered in Section 4.1, we would like to note that Value Churn Reduction (VCR) and Policy Churn Reduction (PCR) are not always used together in our experiments.
To be specific, only VCR is used for CHAIN DDQN (mentioned in Line 281) because there is no policy/actor network other than the Q-network (the policy is implicitly derived by taking greedy actions based on the Q-network). For the PPO case, only PCR is used for CHAIN PPO (mentioned in Line 302) because the main role of the V-network is as a baseline to subtract (rather than a real critic, according to the view in Chapter 13.5 of [Sutton and Barto, 2018]). Therefore, in these cases, only one churn reduction term is added to the baseline algorithm and the difference in learning performance can be fully credited to the corresponding factor.
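To make the VCR-only case concrete, here is a minimal sketch of a DDQN loss with a VCR term (not our exact implementation; the one-update-lagged snapshot `q_prior` used as the churn reference, `vcr_coeff`, and the batch field names are illustrative assumptions):

```python
# Sketch: Double DQN TD loss plus a VCR term that keeps Q-values on a
# reference batch close to those of a recent prior snapshot of the Q-network.
import torch
import torch.nn.functional as F

def ddqn_loss_with_vcr(q_net, q_target, q_prior, batch, ref_batch,
                       gamma=0.99, vcr_coeff=0.1):
    s, a, r, s2, done = batch["s"], batch["a"], batch["r"], batch["s2"], batch["done"]

    # Double DQN target: online net selects the next action, target net evaluates it.
    with torch.no_grad():
        next_a = q_net(s2).argmax(dim=1, keepdim=True)
        target = r + gamma * (1.0 - done) * q_target(s2).gather(1, next_a).squeeze(1)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    td_loss = F.mse_loss(q_sa, target)

    # VCR term on the reference batch (q_prior lags one update behind q_net).
    with torch.no_grad():
        q_ref_prior = q_prior(ref_batch["s"])
    vcr_loss = F.mse_loss(q_net(ref_batch["s"]), q_ref_prior)

    return td_loss + vcr_coeff * vcr_loss
```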
For deep AC methods, we agree that a better way to select DCR/VCR/PCR is needed. In this work, we provided the results of DCR/VCR/PCR for TD3 and SAC. We have clearly stated this point and updated the limitation section to include the main messages of our discussions as suggested.
Finally, please let us know if we misunderstood the meaning of the “factors” in your comments.
We appreciate all the reviewers’ careful review and valuable comments. Here we summarize the main points of our response to each review and the content of our additional results enclosed in the one-page pdf.
The summary of the responses to individual reviews:
- [Reviewer DNMp] We appreciate the reviewer’s recognition of the versatility of the method proposed in our work. The main concerns focus on the insight behind hyperparameter choice for different settings, and CHAIN’s effect on exploration and optimality. To this end, we propose a simple but effective method for automatic adjustment of the regularization coefficients by developing the insight of keeping a consistent relative scale between the regular policy or value loss and the churn reduction term. Our additional results (Figures 16-19 in the one-page pdf) show that this method can match or surpass the results achieved by manually picked coefficients across different tasks in MinAtar, MuJoCo and additional tasks in the DeepMind Control (DMC) suite (Figure 19). We believe that this provides a useful reference for practitioners to use CHAIN in a different problem. In addition, we discuss how CHAIN can have positive effects in encouraging exploration and preventing plasticity loss in principle, and show comparable or better asymptotic performance in four exploration tasks (including 2 in MinAtar and 2 in DMC).
- [Reviewer Qjzk] We appreciate the reviewer’s recognition of the value of the chain effect and the method proposed in our work. The main concerns focus on an insightful discussion of “the effect of reducing the learning rate or the target network replacement rate in reducing churn” and several expression details. In the individual response, we show that reducing either the learning rate or the target network replacement rate slows down learning and leads to worse performance (with the additional results in Table 10 of the one-page pdf). We also note that using smaller learning rates or target network replacement rates does not address the issue of churn, although they induce less churn in principle (as explained by our NTK expressions in Equation 3).
- [Reviewer c7UJ] We appreciate the reviewer’s recognition of the motivation of this work and the importance of our study. The main concerns focus on several expression details and the reviewer also noted valuable writing suggestions. In the individual response, we provide detailed explanations and describe how we took the writing suggestions to polish our paper concretely.
- [Reviewer oLah] We appreciate the reviewer’s recognition of our experimental work. The main concerns focus on the relationship between our work and the previous works mentioned by the reviewer on catastrophic forgetting (especially MeDQN) and interference. Our response explains how reducing churn also helps to reduce forgetting, while reducing forgetting does not necessarily reduce churn and could even increase it. We also point out the essential connection between churn and interference, and the clear distinction of our work in studying policy churn, the value-policy interplay, and the dynamics under the chain effect.
In the one-page pdf, we provide additional results:
- [A simple method for automatic adjustment of the regularization coefficient (refer to Q1 of Reviewer DNMp for the details)] Figures 16-18 show the results of our auto-adjustment method for CHAIN DDQN/PPO/TD3. The results show that this method can match or surpass the learning performance achieved by manually picked regularization coefficients across different tasks and settings. We believe that this method can help to relieve the pain of coefficient choice in practice, and inspire the development of smarter methods.
- [Additional evaluation in ten DMC tasks] Figure 19 shows additional results of CHAIN PPO for 10 tasks in DeepMind Control Suite, with two sparse-reward tasks (the first two).
- [CHAIN helps the scaling of PPO] Table 9 shows our additional investigation of the effect of CHAIN when scaling DRL agents up. We take PPO as the exemplary setting and widen both the policy and value networks by 4x and 8x. The results demonstrate that CHAIN helps to scale up PPO and achieves a clear improvement in terms of episode return. Moreover, we found that larger networks exhibit more churn. We hypothesize that this is a possible reason for the notorious scaling issue of DRL agents, and that CHAIN helps the scaling by reducing churn effectively.
- [(To Reviewer Qjzk) Churn phenomenon and learning performance when using smaller learning rates and target network replacement rates] Table 10 shows smaller learning rates and target network replacement rates often slow down learning and lead to worse performance.
With these additional discussions and experimental results, we emphasize that our formal study on the churn in both value and policy (including the interplay, the chain effect, and the concrete issues in popular DRL settings), the simple and general churn reduction method (including CHAIN and the additional auto-adjustment method), and the empirical discoveries (including what the churn is like in popular DRL agents, and how they can benefit from churn reduction) are novel to the best of our knowledge.
We believe that our paper can inspire more future work on understanding and addressing the learning issues of DRL from angles like churn, interference, and generalization, as mentioned by the reviewers. Moreover, the method proposed in this paper can be implemented very easily and adopted in broader scenarios, e.g., PPO-based alignment of LLMs.
Finally, we sincerely hope that our response can address the questions and concerns. We also hope that the reviewers can re-evaluate the value of our work based on the responses and the additional results. We are also more than willing to address any further questions or concerns during the discussion stage.
This paper studies a problem recently observed in the training dynamics of deep networks in reinforcement learning, known as churn. It refers to the phenomenon where output predictions for data outside the training batch can change during an update. This paper provides an in-depth study of the churn in training the policy and value networks and the interplay between these two dynamics. It proposes a simple regularization loss to suppress the churn and observes in experiments that policy performance improves accordingly.
Pros:
- Most reviewers consider that the paper provides a convincing motivation. The study of the churn concept in deep RL is relatively new but important, and this submission is likely to inform future research along this line.
- The main contribution is the study on the interplay of the policy and value churn, and the solution to suppress it effectively.
- The proposed method is simple but effective and versatile enough to be applied to different deep RL backbone methods, either used for single-network training or combined as "CHAIN".
- The experiments not only show that the performance improves but also provide direct evidence that the proposed method indeed reduces churn.
Cons:
- Two reviewers (DNMp, c7UJ) were concerned about the challenge of choosing the two additional hyperparameters. The authors proposed a new method to automatically choose the coefficients during the rebuttal and provided new experiments to justify its efficacy, which partially addressed the concern.
- Multiple reviewers suggested improvements to the writing, which the authors have promised to include in the revision.
- Reviewer oLah considers the "churn" phenomenon to be closely related to the interference of neural networks, and the novelty of studying the churn problem and the proposed regularization loss to be not as high as claimed. The authors' response did not fully resolve that concern.
Despite some minor concerns that are not fully addressed, all reviewers vote to accept this submission. I would encourage the authors to take the reviewers' comments seriously, incorporate the recommended changes in writing, explain the new automatic coefficient tuning method in the main paper with the new experiment results, and include the discussion of the missing related work as well as its relationship with interference in a prominent place in the text.
Additional comments from Reviewer Qjzk on the IQL discussion that may help the authors to improve their final version:
I started the IQL discussion because that part of the paper was written poorly / ambiguously. When the authors' IQL experiment details became clearer to me, it was immediately apparent that they weren't training IQL + CHAIN the way they were training the other actor-critic methods. They were training it in one specific way (only penalizing the policy churn using their additional loss term, L_PC), while showing improved performance over the baseline. Because IQL is a special case, in that it permits decoupled training of the actor and critic networks, it allows us to inspect the effects of each CHAIN loss term in isolation (which I asked the authors to try out). The most interesting finding of these experiments was the following:
IQL + PCR > IQL (baseline) > IQL (sequential)
IQL + PCR > IQL (baseline), as shared in the paper, was slightly surprising to me because IQL doesn't really have a "chain effect", but rather only a one-way influence of the ever-changing Q network on the actor's training. But since Q learning itself exhibits churn, one could imagine that this churn would "leak" into the policy training dynamics too, so it is plausible that IQL + PCR would also be helpful, and so this result is not too surprising.
I proposed IQL (sequential) to also break that one-way influence. Waiting to train the actor, until after the Q network is fully trained and frozen, should completely cancel any influence of the Q churn on the actor training, right? Apparently, the authors find that this performs even worse than IQL (baseline). So apparently exposing the actor network to some Q churn --- like in IQL (baseline) --- is, strangely, helping! I don't think this finding necessarily contradicts anything discussed in the paper, but I believe it does open up some new questions about churn, and is surprising in a good and exciting way, which makes me want to see this paper published.