The Ladder in Chaos: Improving Policy Learning by Harnessing the Parameter Evolving Path in A Low-dimensional Space
This paper reveals the parameter evolving path of the policy network in a low-dimensional space and proposes a general method that leverages the path for better learning performance.
Abstract
Reviews and Discussion
This paper first introduces some common phenomena observed for TD3 and RAD agents, and then introduces a novel deep reinforcement learning algorithm, Policy Path Trimming and Boosting (PPTB), which performs a novel temporal SVD along the policy learning path. This algorithm offers an angle on how the policy evolves in a lower-dimensional space and how to utilize this in reinforcement learning. The paper is well written and novel.
Strengths
1. The paper is well-written.
2. The author methodically analyzes the experimental results using specific indicators (Accumulated Parameter Change, Final Parameter Change, Parameter Update Detour Ratio), making the paper very comprehensible.
Weaknesses
1. Although most of the figures in the paper are very clear, some are unclear and hard to understand, particularly the last two figures in Figure 2(b).
2. Some of the reported improvements in AUC appear to be incorrect. For instance, (189 ± 2 (103.33%)) and (148 ± 15 (81.48%)) seem to contain errors.
Questions
1. Could the authors provide the learning curves of TD3 and TD3-PPTB? From my understanding, TD3-PPTB should only accelerate training; I don't think TD3-PPTB would have better performance than TD3, and I believe they should eventually achieve the same converged result.
2. The author only provided results for 100k and 500k steps in the RAD results shown in Table 2. What about the results for 1000k or 2000k steps?
3. Is the parameter P_b fixed in each environment? Can you talk about how to determine P_b in your experiment? If it is not set properly, will it cause excessive oscillation during the training process? Additionally, in the experiment details, a 2000- or 1e5-dimensional neural network is too large for MuJoCo training. Most papers indicate that commonly used model dimensions are 64 or 128.
Limitations
- This paper does not explain from a theoretical perspective why this approach would accelerate convergence.
Q1: The learning curves of TD3 and TD3-PPTB
We provide the missing learning curves in Figure 9 (for RAD and RAD-PPTB) and Figure 10 (for TD3 and TD3-PPTB) of the one-page pdf uploaded.
We hypothesize that the higher convergence performance in Ant and Hopper can be explained by PPTB preventing, to some degree, the plasticity loss in the minor temporal SVD directions.
In fact, the policy path trimming (PPT) method proposed in our paper can be viewed as a special way to do network resetting to alleviate the primacy bias as in [Nikishin et al., 2022].
Concretely, our method differs from the vanilla network resetting method in [Nikishin et al., 2022] on two points: (1) we do reset in a transformed subspace (with temporal SVD) rather than the original space; (2) we only reset the parameter update in the minor directions and maintain the learned information in the major directions.
From the perspective of plasticity, our method periodically rolls back the plasticity of the network in the directions that are orthogonal to the ones that represent the effective knowledge learned so far.
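To make the mechanism concrete, here is a minimal sketch of the trimming step (a simplified illustration only; the window size, the number of kept directions, and the variable names below are placeholders rather than our exact implementation):

```python
import numpy as np

def policy_path_trim(param_history, k_keep):
    """Illustrative sketch of the trimming step (not the exact implementation).

    param_history: (T, D) matrix of T flattened policy snapshots collected
                   along the recent update path (a sliding window).
    k_keep:        number of major temporal-SVD directions to keep.
    """
    # Temporal SVD: rows index time, columns index policy parameters.
    U, S, Vt = np.linalg.svd(param_history, full_matrices=False)

    # Drop the minor directions, i.e., reset them to zero in the
    # transformed space, and keep the k_keep major directions.
    S_trimmed = S.copy()
    S_trimmed[k_keep:] = 0.0

    # Reconstruct the path in the low-dimensional subspace and return the
    # most recent snapshot as the trimmed policy parameters.
    trimmed_path = (U * S_trimmed) @ Vt
    return trimmed_path[-1]

# Hypothetical usage: 200 snapshots of a ~1e5-parameter policy, keep 8 directions.
history = np.random.randn(200, 100_000).astype(np.float32)
theta_trimmed = policy_path_trim(history, k_keep=8)
```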
Q2: “The author only provided results for 100k and 500k steps in the RAD results shown in Table 2. What about the results for 1000k or 2000k steps?”
We follow the evaluation setting of the original RAD paper (see Table 1 in that paper).
Q3: Some of the reported improvements in AUC appear to be incorrect. For instance, (189 ± 2 (103.33%)) and (148 ± 15 (81.48%)) seem to contain errors.
As described in Lines 277-279, for each task, we report the comparative improvement by normalizing with a random-agent baseline as 0 and the DRL base algorithm (i.e., TD3 or RAD) as 1.
This resembles the convention used for Atari tasks, where the random agent is taken as 0 and the human performance is taken as 1.
Q4: “in the experiment details, a 2000 or 1e5-dimensional neural network is too large for Mujoco training. Most papers indicate that commonly used model dimensions are 64 or 128”
For the dimension of each hidden layer, we follow the convention and use 256 for TD3 and RAD. Thus 1e5 is the scale of the total number of network parameters (and 2000 is the window size of the historical policies maintained), not the dimension of each hidden layer.
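As a quick sanity check of this scale, assuming HalfCheetah-v4's dimensions (17-dimensional observations and 6-dimensional actions) and the standard TD3 actor with two hidden layers of width 256:

```python
# Rough parameter count for the TD3 actor on HalfCheetah-v4
# (obs_dim=17 and act_dim=6 are task-specific assumptions; hidden width 256 follows TD3).
obs_dim, act_dim, hidden = 17, 6, 256
n_params = (obs_dim * hidden + hidden) \
         + (hidden * hidden + hidden) \
         + (hidden * act_dim + act_dim)
print(n_params)  # 71942 -- on the order of 1e5 parameters in total
```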
Q5: Can you talk about how to determine the P_b in your experiment? If is not set properly, will it cause excessive oscillation during the training process?
In our experiments, we found that choosing different values of P_b may lead to differences in the improvement achieved over the baseline algorithm, but it is quite safe to choose P_b within a relatively wide range without incurring a clear performance drop. Choosing a larger P_b should be safer, as in principle it has less impact on the parameter update according to the temporal SVD.
This is also supported by our empirical results for the SVD parameter reconstruction shown in Table 3 to Table 8 in the appendix.
Dear Reviewer,
We hope that you've had a chance to read our responses and clarification. As the end of the discussion period is approaching, we would greatly appreciate it if you could confirm that our updates have addressed your concerns.
Thank you for your response and the additional experiments; they make sense. I’m happy to improve my score.
Thank you for your valuable suggestions; we will make sure to incorporate them to improve our paper.
We sincerely appreciate the time and effort you devoted to reviewing our work! Please let us know if there are any further questions or discussions.
This paper investigates the evolving path of policy network parameters in deep reinforcement learning (DRL). The author conducts experiments on multiple tasks in MuJoCo using TD3 and on multiple tasks in DMC using RAD. The findings reveal significant discrepancies in the amount of change among policy parameters and severe detours in policy parameter updates. To address this, the author employs Temporal SVD to decompose the evolving path of policy parameters. Despite the large number of parameters, the learning dynamics of the policy network are found to be concentrated in a few primary directions, forming a low-dimensional space.
Based on these insights, the author proposes a new method called Policy Path Trimming and Boosting (PPTB). This method improves the performance of DRL algorithms by canceling updates in secondary parameter directions and boosting progress in primary directions. Experiments conducted on TD3 and RAD agents in MuJoCo and DMC environments demonstrate that the PPTB method significantly outperforms the original methods in terms of score and AUC evaluation metrics, thereby substantially enhancing the performance of the DRL algorithms.
Strengths
The author investigates the dynamics of policy network parameters from a novel perspective of Temporal SVD, and proposes an innovative method PPTB to enhance the performance of the DRL algorithm. The writing style of this paper is clear and easy to understand, with rigorous logic, and it is of great significance to algorithm optimization in the field of DRL.
Weaknesses
The article does not clearly explain why only the TD3 and RAD algorithms were selected. Wouldn't other algorithms exhibit similar phenomena when updating their policies?
The PPTB experiment was only conducted on TD3 and RAD, and does not demonstrate the generality of PPTB on other DRL algorithms.
Questions
- Wouldn't other DRL algorithms have similar phenomena when updating their policies?
- Does PPTB work with DRL methods other than TD3 and RAD?
- Can the performance of PPTB be tested with a deeper network?
- Can the performance of the PPTB algorithm be tested in more experimental environments or on different tasks? Are there more evaluation metrics besides score and AUC?
Limitations
The author should enrich the experimental details and content to demonstrate the versatility of the PPTB algorithm, or test the performance of the algorithm in more practical tasks. The article should also discuss the limitations of the proposed algorithm or provide some prospects for future improvements.
Q1: Why TD3 and RAD are selected?
The motivation for this work starts from the investigation of the dynamics of policy network parameters. Thus, we choose TD3 and RAD for the following reasons:
- They are popular deep AC methods (note RAD is SAC-based), where an explicit policy network is trained. This ruled out the value-based RL methods.
- TD3 and SAC are off-policy methods. They are usually more sample-efficient while less stable than on-policy algorithms like PPO.
- RAD takes visual observations as input. Thus, our choices cover both proprioceptive inputs and image-based inputs.
Q2: Wouldn't other algorithms have similar phenomena when updating strategies?
As suggested by Reviewer zpVk, in Figure 11 of the one-page pdf, we can observe similar phenomena for Behavior Cloning with Adam or SGD optimizers, but with different concrete patterns.
Besides, we observed similar phenomena for DDPG (omitted due to space limitation).
Q3: Other suggestions on the experiment
We appreciate the reviewer’s valuable suggestions on the experiment.
We will present the corresponding results if we are able to finish them before the discussion stage ends.
Dear Reviewer,
We hope that you've had a chance to read our responses and clarification. As the end of the discussion period is approaching, we would greatly appreciate it if you could confirm that our updates have addressed your concerns.
Thank you for your response. I've read the other reviews and the rebuttal. I’m keeping my initial score.
Off-policy actor-critic Deep RL has unstable and seemingly oscillatory learning dynamics, which are poorly understood. This paper looks closely at the trajectories taken by policy networks during training. An SVD analysis, performed over sequences of policy parameter snapshots, reveals near-monotonic parameter evolution along directions corresponding to the few dominant singular values, and---indeed---oscillatory behaviour along the minor directions. The authors propose an intuitive and mathematically sound way to remedy this pathology by only permitting the parameters to move in the subspace spanned by a small number of the top directions, while retaining the policy performance. Additionally, by taking larger steps in the first two major directions, they are able to train more performant policies.
Strengths
Significance
The paper addresses an important problem. Deep RL researchers have long observed poor training dynamics in off-policy learning. These include divergence, performance collapses, and oscillatory behaviour (i.e. policies forgetting what was learnt, then recovering). The reasons behind these are poorly understood. While the paper does not quite advance our understanding of the problem, it does potentially identify some mechanistic signatures of the problem (oscillatory weight dynamics) and proposes a treatment for this symptom (suppressing movement in oscillation-afflicted subspaces). I think the proposed PPTB fix is unlikely to be a complete solution, mainly because the critic's behaviour is not yet studied, but it can be a stepping stone towards better behaved learning dynamics.
Clarity
The paper is very clearly written, and I enjoyed reading it. Policy churn and oscillatory dynamics potentially stem from the combined actor-critic training dynamics, but the authors focus exclusively on the policy here. This is good, as it has kept the study focused and revealed interesting phenomena.
Originality
I am unaware of prior work which empirically inspects the policy learning dynamics by actually plotting out the weights' evolution, as the authors have done.
Quality
The idea of using temporal SVD upon the policy weights is quite sensible, and a treatment for the problem immediately pops out of the same tool; this fix also appears to be easily implementable in code, and seems to grant substantial performance boosts.
Weaknesses
- Limited analysis. The analysis is focused on TD3 and RAD (based on SAC here, which itself is quite close to TD3). I think this is insufficient. Seasoned RL practitioners will note that DDPG exhibits markedly more policy oscillations/collapses/recoveries than TD3: the introduction of the Clipped Double Q trick already substantially mitigates those effects. In this empirical study, I worry that by not including DDPG --- the simplest possible baseline --- the authors are not observing the problematic dynamics in their full glory, and perhaps not testing the full potential of PPTB. Right now, there is a risk that your observed phenomena are quirks of TD3-lineage methods.
- Related work. The observation that gradient descent moves the parameters along a few dominant directions, chiefly in a low-dimensional subspace, is not new. There is a body of deep learning literature around this phenomenon, which the authors don't cite right now. Here is one such paper, and there are more:
  - "Gradient descent happens in a tiny subspace" G Gur-Ari, DA Roberts, E Dyer
- Potentially limited significance. Following from the above point: the existence of dominant directions is unsurprising. The existence of oscillatory detours in the minor directions could be a novel finding regarding actor-critic RL, but we don't know that yet. To strengthen the paper, here's a test: do you also observe the same harmonics in basic supervised learning? (e.g. when training the actor network with an MSE behaviour cloning loss). I think this is a particularly important thing to investigate, because I'm concerned that it could change the story of your paper.
- Lack of training curve comparisons. For a method that expressly tries to curb oscillations, it is important that we see how the actual training curves look, instead of just a table with scores.
- Limited evaluations. 6 is not enough; it's common to report a minimum of 10 independent trials. Also: you have Humanoid-v4 experiments in the paper, but don't report scores with PPTB on that env.
Questions
- Line 434: you say that you choose the boosting coefficient from a set of values. Do you mean that you search from this set for a good hyperparameter value, or do you randomly sample from this set at each training iteration?
Limitations
The limitations section is well written, but one thing that isn't mentioned is that the method can struggle to scale up to larger network sizes (beyond the smallish networks used in this paper).
Overall, I find this paper exciting, and I'm willing to substantially improve my rating if the authors address all the listed weaknesses (most crucially, number 3).
Q1: The additional investigation for Behavior Cloning policy
We appreciate the reviewer for pointing out this insightful and inspiring point.
We provide additional results in Figure 11 of the one-page pdf to present additional investigation for Behavior Cloning in D4RL halfcheetah-medium-replay with Adam and SGD optimizers.
As noted by the reviewer, we found that BC (Adam), BC (SGD) and the RL cases reported in our paper show different patterns in terms of temporal SVD evolvement. The key observation is that BC policies do not turn out to show the harmonics in the major temporal SVD directions (the leftmost plot in Figure 11b and 11d). In contrast, we can observe that the amplitude of the oscillations seems to decrease throughout training.
We deem that this observation is very interesting which could lead to a deeper understanding of the learning dynamics of BC and RL agents.
For the parameter update analysis, compared with the RL cases reported in our paper, the parameter update amount is more Gaussian-distribution-like for BC (Adam). This indicates a difference in policy parameter update dynamics between BC (i.e., supervised learning) and RL.
Somewhat surprisingly, the asymmetry of the parameter update amount turns out to be more severe for BC (SGD); that is, Adam exhibits less asymmetry than SGD, which is a bit counterintuitive.
We are more than willing to discuss this point with the reviewer in the discussion stage.
Q2: The analysis for DDPG
We appreciate the reviewer for pointing out this.
Actually, we also did the parameter update analysis and the temporal SVD analysis for DDPG. We found the empirical results are quite similar and the patterns in parameter update and temporal SVD directions are almost the same as those of TD3. Therefore, the plots are omitted and not included in the one-page pdf.
This indicates that the phenomena revealed in this work are not closely related to overestimation bias (which is the focus of Clipped Double Q-learning in TD3, SAC, and RAD).
Q3: Lack of training curves
We provide the missing learning curves in Figure 9 (for RAD and RAD-PPTB) and Figure 10 (for TD3 and TD3-PPTB) of the one-page pdf uploaded.
Q4: “Line 434: you say that you choose the boosting coefficient from a set of values. Do you mean that you search from this set for a good hyperparameter value, or do you randomly sample from this set at each training iteration?”
The complete process is that we first pre-designate a set of candidate values for the boosting coefficient according to our empirical results for the temporal SVD parameter reconstruction (as shown by Tables 3-8 in the appendix). Then we narrow the set and search within it for a good hyperparameter value.
Moreover, in our experiments we found that different choices may lead to differences in the improvement achieved over the baseline algorithm, but it is quite safe to choose within a relatively wide range without incurring a clear performance drop. Choosing a larger value should be safer, as in principle it has less impact on the parameter update according to the temporal SVD. This is also supported by our empirical results for the SVD parameter reconstruction.
Q5: The related work in deep learning literature
We appreciate the reviewer for pointing out the reference paper. We have added it to our draft and will add more in our related work section to strengthen the discussion.
Q6: Other suggestions on the experiment
We appreciate the reviewer’s valuable suggestions on the experiment.
We will present the corresponding results if we are able to finish them before the discussion stage ends.
Thank you for the updated results and rebuttal!
We provide the missing learning curves in Figure 9 (for RAD and RAD-PPTB) and Figure 10 (for TD3 and TD3-PPTB) of the one-page pdf uploaded.
The PPTB versions certainly look better. I encourage you to add them to the appendix in a future version of the paper.
As noted by the reviewer, we found that BC (Adam), BC (SGD) and the RL cases reported in our paper show different patterns in terms of temporal SVD evolvement. The key observation is that BC policies do not turn out to show the harmonics in the major temporal SVD directions (the leftmost plot in Figure 11b and 11d). In contrast, we can observe that the amplitude of the oscillations seems to decrease throughout training.
(^ emphasis mine)
Upon looking at the newly provided plots (Fig 11b + 11d in the 1-page PDF), I don't see how one can infer that. Here's what I see:
- The BC-training dynamics do exhibit oscillations just like the RL-training dynamics in the main paper (e.g. Fig 7).
- The oscillation amplitudes for the major components do decrease over BC training, just as they also decrease during RL training.
Unless I'm very mistaken in my interpretation (happy to be proven wrong!) your experiments reveal that the oscillatory patterns in the top SVD directions are not specific to RL, but occur in vanilla supervised deep learning too (irrespective of Adam or plain SGD). And these experiments should go into the paper. This also suggests that maybe PPTB need not be an RL-specific heuristic but could also help supervised deep learning (but that will require more experiments, so the paper raises more questions than it answers...)
I think that these findings necessitate substantial edits to the paper. E.g.
- Emphasising early on that the empirical observations are a broader deep learning phenomenon, not RL-specific
- Adjusting/removing claims that make it sound like the oscillations stem from RL pathologies (e.g. noisy policy gradients)
- Linking the paper more strongly with related work on low-dim subspace training
For this key reason, I'm maintaining my score at the current level.
I still believe your approach is interesting and original, and I'd encourage you to try and publish a future version with an updated story!
A side comment:
If you look closely at Fig 7 column 2 in the paper (the major-component u[k] curves for all 4 MuJoCo envs) you'll find that the oscillatory curves are nearly identical for Hopper-v4, Walker2d-v4, and Ant-v4 but not HalfCheetah-v4. I don't have any intuition for this, but it seems surprising.
As this is an empirical investigation, the authors should think about what that means, how/why that might happen, and discuss that in the paper.
About the expression “The key observation is that BC policies do not turn out to show the harmonics in the major temporal SVD directions”
We agree that BC training dynamics do exhibit oscillations (it is apparent).
What we meant to express is:
- In RL cases (Figure 7, 8), the amplitude of the oscillation decreases less (relatively slightly or slowly). Thus, the pattern is more like “harmonics”.
- In BC cases (Figure 11), the amplitude of the oscillation decreases more (relatively faster), especially at the beginning of learning. Thus, the pattern is more like “wavelets”. This is why we said “do not turn out to show the harmonics”.
Sorry for using the confusing expression and not making this point clear in our rebuttal.
Moreover, noticing the parameter analysis in Figure 11a, we can observe that BC (Adam) does not have a significant proportion of parameters with a minor update amount. This is different from the severe asymmetry observed for DRL (Adam) in the first column of Figure 4 and Figure 5.
The observations also differ between BC (Adam) and BC (SGD), as can be seen by comparing Figure 11a vs. 11c and Figure 11b vs. 11d, especially the first column.
Therefore, these results indicate that the pattern of the oscillations is different in different learning problems.
“your experiments reveal that the oscillatory patterns in the top SVD directions are not specific to RL, but occur in vanilla supervised deep learning too (irrespective of Adam or plain SGD)”
We agree that such oscillations in the view of temporal SVD are likely to exist in more learning problems beyond RL.
What we would like to mention is that the pattern of the oscillations differs across learning problems. Intuitively, the pattern captures information about the learning dynamics of the model, which is in principle determined by factors like the learning paradigm (e.g., RL vs. SL) and the optimization (e.g., Adam vs. SGD).
We think it would not be easy to find a unified explanation or theory that interprets all the patterns in different problems.
In this work, we start from and focus on online RL, and we leave a comprehensive explanation/study to future work. We will include our new results in our story to provide more insights as suggested.
If you look closely at Fig 7 column 2 in the paper (the major-component u[k] curves for all 4 MuJoCo envs) you'll find that the oscillatory curves are nearly identical for Hopper-v4, Walker2d-v4, and Ant-v4 but not HalfCheetah-v4. I don't have any intuition for this, but it seems surprising.
We appreciate the reviewer for pointing out this. We hypothesize that this similarity among the 4 MuJoCo tasks (concretely, the "phases" of different oscillation curves corresponding to different SVD direction indices) stems from the impact of initial network parameters, as we use the same initialization method of network parameters across different tasks (this is also a convention in practice).
To verify this, we subtract the initial network parameters from each policy network we collected during the learning process (note that in each task, for each seed, the policies along the update path share the same initial parameters) and then do the same Temporal SVD analysis for these policy networks. By this means, we get rid of the impact of the initial network parameters on our Temporal SVD analysis.
Now we aim to see whether the oscillation curves for different tasks look clearly different or still look "nearly identical" to verify our hypothetical reason of the impact of the initial network parameters.
As expected by our hypothesis, we found that, the "phases" of different oscillation curves are no longer similar, or in other words, the oscillation curves differ clearly among different tasks. Meanwhile, we observed the same phenomenon (i.e., Phenomenon 2.2) that we found in our previously reported results: the major Temporal SVD directions oscillate less and the minor directions oscillate more.
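The preprocessing for this check is straightforward; a minimal sketch follows (the variable names are placeholders for the collected snapshots and the shared initial parameters of a run):

```python
import numpy as np

def temporal_svd_without_init(snapshots, theta_init):
    """Temporal SVD after removing the shared initial parameters (a sketch).

    snapshots:  (T, D) matrix of flattened policy parameters along one run.
    theta_init: (D,) flattened initial policy parameters of the same run.
    """
    deltas = snapshots - theta_init[None, :]               # remove the common offset
    U, S, Vt = np.linalg.svd(deltas, full_matrices=False)  # then the same temporal SVD
    return U, S, Vt                                        # inspect the U[:, k] curves per direction
```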
As we are not allowed to upload these plots in this stage (or even with an anonymous link), we will add these additional results and discussion in our revision.
We believe that these results and discussions address the concerns raised in the review, but please let us know if there are any further issues to address.
We sincerely appreciate the reviewer's valuable comments and further feedback during the discussion stage.
As the end of the discussion period is approaching, we would greatly appreciate it if the reviewer could provide more feedback on our responses to the "Following-up Questions".
We believe that these additional clarifications and results help to address the concerns raised by the reviewer, but please let us know if there are any further issues to address.
Dear authors,
Here's my combined response to the last comments.
Thus, the pattern is more like “wavelets”.
This is why we said “do not turn out to show the harmonics”.
It would be good to use more precise wording; I think you can see why using "wavelets" and "harmonics" --- terms which have well established meanings --- loosely in this manner is not really helpful: it already led to some misunderstandings in our discussion.
What we would like to mention is that, the pattern of the oscillations is different in different learning problems. Intuitively, the pattern captures the information/features of the learning dynamics of the model, which is in principle determined by the factors like learning paradigm (e.g., RL v.s., SL), and optimization (e.g., Adam v.s., SGD).
Yes, intuitively, it does make sense that those factors would influence the learning dynamics. However, the RL and SL training dynamics, empirically, did not show distinguishably different oscillatory patterns.
We think that it could not be that easy to find a unified explanation or theory to interpret all the patterns in different problems.
That is fine; empirical justifications are OK and sufficient, but they should be framed in a suitable context. Let me explain why that's not happening here.
As you just discovered during the rebuttal, supervised DL also exhibits the same oscillatory patterns that PPTB addresses. But Actor-Critic RL is not the right setting in which to explore these deep learning phenomena for the first time, because it already has two known sources of oscillatory parameter dynamics:
- TD learning with function approximation, as used in Deep RL, is not really gradient descent, and can in fact exhibit oscillations of the learned parameters strongly resembling the ones you show in the paper. The parameters will not neatly asymptote to the optimum, but will rather spiral towards it. See Section 9.7 "Oscillation Due to Asymmetric Iteration Matrix" of http://incompleteideas.net/papers/MahmoodPhD.pdf, which contains a straightforward explanation and examples of this phenomenon.
- Actor-Critic training can produce its own oscillatory effects, like in GANs.
Given these pre-existing sources of oscillations, it seems you've found a new, third source of oscillations in parameters, that arises neither from TD learning, nor from actor-critic dynamics, but perhaps just from training a neural network.
I think that's very interesting (and, to me, bigger news than PPTB itself), but it means you need to rework the story of the paper to an extent that we cannot accommodate in this review cycle, given we only have a 1-page PDF to work with. This was precisely my worry in my first review comment (weakness 3).
Now we aim to see whether the oscillation curves for different tasks look clearly different or still look "nearly identical" to verify our hypothetical reason of the impact of the initial network parameters.
As expected by our hypothesis, we found that, the "phases" of different oscillation curves are no longer similar, or in other words, the oscillation curves differ clearly among different tasks.
This is a really good experiment, thanks for trying this!
I think the paper is very promising. The authors, to their credit, wrote it well and shared some very interesting empirical findings. However, for the reasons stated above, I am unable to recommend acceptance, and am keeping my score because the discussion and related work in the paper need substantial redoing. Here's an extended list of suggestions for a future version of the paper:
Important fixes to make the paper stronger:
- Linking the paper more strongly with related work on low-dim subspace training
- Emphasizing early on that your empirical observations regarding the oscillations are a broader deep learning phenomenon, not RL-specific
- Adjusting/removing claims that make it sound like the oscillations you depicted here stem from RL pathologies (e.g. noisy policy gradients)
- Discussing sources of oscillations in Deep RL (actor-critic dynamics, semi-gradient TD, potentially your newly found third source, and anything else that I missed)
- Discussing which of those things PPTB addresses (potentially all three?). Also: discussing if/why PPTB is especially suitable for RL compared to supervised learning.
We greatly appreciate the time and effort devoted by the reviewer to reviewing our work and participating in the discussion.
The comments and discussions provided by the reviewer are really valuable and inspiring, and we believe they will raise the quality of our work.
We will carefully take the suggestions and improve our organization and presentation to include the additional results done in the rebuttal and discussion stages, i.e., the parameter and temporal SVD analysis for DDPG, BC (Adam and SGD), as well as the additional evaluation of PPTB for DoubleDQN in MinAtar (as suggested by Reviewer dZpR).
Concretely, we plan to start by presenting the learning process of policy networks from a relatively general DL angle to cover aspects like optimizers (e.g., Adam, SGD) and learning paradigms (i.e., RL, SL). Then we plan to discuss these aspects separately by presenting the empirical observations regarding the network parameter analysis and the temporal SVD analysis, with a focus on online RL policy learning as presented in our draft. Afterward, we discuss the differences in the empirical observations in different settings, by establishing the connection between the distinct features of different learning settings and the difference in the patterns of the distribution of network parameter update amount or the oscillations of temporal SVD directions.
Here are some results and conclusions we’ve already observed:
- by comparing the results of TD3 and DDPG (as mentioned in Q2 of our rebuttal), we can find that the overestimation bias (i.e., the major difference between TD3 and DDPG) does not seem to be a factor that causes the oscillations.
- by comparing TD3 and BC (Adam), we can find that RL and SL lead to different distributions of the network parameter update amount (Figure 4a vs. Figure 11a) and different “harmonic vs. wavelet” oscillations (Figure 7a vs. Figure 11b).
- by comparing BC (Adam) and BC (SGD), as shown in Figure 11a,b vs. Figure 11c,d, we can find that the momentum mechanism used in Adam leads to a relatively more evenly distributed network parameter update amount, while SGD has a large number of minor updates (also reflected by the flat curve in the first column of Figure 11d) and a long-tailed distribution.
These observations are the empirical support for our previous response that “the pattern of the oscillations is different in different learning problems”. Moreover, given the second observation outlined above, we might not be able to fully agree with the reviewer's comment that “the RL and SL training dynamics, empirically, did not show distinguishably different oscillatory patterns”. We are running more BC experiments to investigate this point further.
This paper investigates the evolution of parameters over time during policy optimization with TD3 and RAD. By analysing the SVD of a matrix containing parameters over time, the authors find that there are a few directions in which the parameters move consistently and many with more oscillations. Using this insight, an algorithm is proposed which boosts updates in the top estimated directions and reduces updates in the uninmportant ones. Evaluations on standard deep RL benchmark tasks demonstrate the utility of the approach.
Strengths
This paper uncovers some nice insights into the training process of deep RL agents. As far as I know, I have not seen the SVD used to study parameter evolution over time, and I find it is an interesting application of that tool. The proposed algorithm is conceptually simple and easy to implement, making it easy to add to a variety of existing algorithms. In terms of impact, the identified phenomena may extend past RL and could be true of deep neural network training more generally, potentially giving widespread impact.
The paper is well-organized, with the different sections flowing nicely into each other. Generally, the paper was easy to follow and the experiments were chosen appropriately to make the intended arguments.
Weaknesses
My main concern is about the evaluation of the algorithms: for example, the max return is reported when evaluating the algorithm, and the standard deviation across runs is used, whereas the standard error or bootstrapped confidence intervals would be more appropriate. (see Questions)
Also, the following design choice is confusing to me:
- Fig. 1 caption. "Only upper 80% values according to are taken to plot...for meaningful analysis" Could you elaborate why this decision was made? Does it have to do with many of the values being close to zero?
The improvement in the performance is generally fairly modest. I think this is fine given that the main contribution to me is identifying behaviours in the parameter evolution. I think expanding a bit more on the analysis could be interesting.
In terms of clarity, some details could be expanded upon more in the main text. See Questions section.
Questions
Clarification questions
- Line 157: What is this quantity? Is it Singular Value Information Amount?
- Fig 2 b) 4th and 5th figure from left. Interesting findings. So basically, the detour ratio is smaller for the major directions. Again, I wonder if that has more to do with noise or curvature. Perhaps there is less noise in these directions.
- In the 3rd fig from the left, it's hard to see much since the black curve masks everything else. Consider using different colors or some transparency. It would be nice to see how the paths get increasingly noisy.
- Policy Path Trimming: To clarify the algorithm, does it project the current parameters into the space identified by the top singular vectors? The description of the algorithm could be improved in Sec. 4.1, where it is introduced. While the intuition is described, the exact mechanism that is implemented is not explained.
- Similarly, in Section 4.2, the Policy Path Boosting could be described a bit more clearly. Eq. 2 in particular is a bit confusing since one quantity is updated but then only the others are actually used. The phrase in line 241 "PPB modifies ... by increasing ... along the temporal direction..." sounds overly complicated. Perhaps rephrasing it to something like "PPB moves the parameter further in the direction of previous updates along the first two main directions" could be simpler.
Suggestions and broader questions
- By constraining the parameter evolution path to focus on the previous main directions, would the effect of "primacy bias" or related phenomena be even stronger? Could we be losing out on performance due to prematurely committing to certain update directions?
- Line 270. The evaluation metric "SCORE" should no longer be used since the maximum over runs introduces overestimation bias and leads to less reliable estimates. See [1] for better evaluation practices and [2] for arguments against using the max.
- I wonder if the detours are mainly due to noise or curvature of the objective. One way to test this would be to increase or decrease the minibatch size, which can control the variance of the updates. Then, by inspecting the effect on the detour ratio or cumulative parameter movement, we could guess the relative impact of noise and curvature.
- Fig. 1 is a CDF plot. I think a histogram or a box plot might be easier to interpret than CDF plots since you need to look at differences in a CDF plot to identify where most of the probability mass is.
- An ablation study for the two components of the algorithm would be a valuable addition. It is not clear if both pieces are necessary right now or how important they are.
- I would also be curious to know what would happen if we took the estimated space
- Here is a paper that could be interesting to read [3], where the authors show that, even if you constrain neural network parameters to a random subspace, as long as the dimension of that subspace is not too small, you can recover the same performance as the original network. The idea of Policy Path Trimming could be interpreted as a more intelligent approach which estimates the constrained subspace instead of choosing a random one.
- Another line of research (e.g. see [4]) has observed that the Hessian contains only a few large eigenvalues in neural network training, which may be related to the ideas discussed in the paper since there would be some interplay between the curvature of the loss surface and directions of updates over time.
- In the matrix of parameters over time, consider using a dedicated index for the rows so it's a little easier for the reader to remember which dimension corresponds to what. Alternatively, if that symbol is reserved for environment timesteps, another one could be used as a substitute.
[1] "Deep Reinforcement Learning at the Edge of the Statistical Precipice" Agarwal et al.
[2] "Deep Reinforcement Learning that Matters" Henderson et al.
[3] "Measuring the Intrinsic Dimension of Objective Landscapes" Li et al.
[4] "An Investigation into Neural Net Optimization via Hessian Eigenvalue Density" Ghorbani et al.
Limitations
These are discussed.
Q1: the evaluation of the algorithms. For example, reporting the max return in the evaluation of the algorithm or using the standard deviation across runs, whereas the standard error or bootstrapped confidence intervals would be more appropriate
Technically, we are not reporting the “max return” in our evaluation. We report the “maximum of the mean return”. To be concrete, this is obtained by first (1) computing the curve of mean return evaluation across multiple runs, and then (2) taking the maximum of the mean evaluation curve.
We follow the evaluation scheme presented in TD3 paper (see “Max Average Return over 10 trials” in the caption of Table 1 in TD3 paper).
The rationale for this evaluation scheme is that, in online learning, it is possible to store checkpoints throughout training; in the opposite case of offline RL, the convention is to report the mean final score.
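In code, the reported number per task is roughly the following (a sketch; `returns` is assumed to be an array with one evaluation curve per run):

```python
import numpy as np

def max_of_mean_return(returns):
    """Max-of-mean evaluation, following the TD3 paper's scheme.

    returns: array of shape (num_seeds, num_eval_points),
             one evaluation curve per independent run.
    """
    mean_curve = returns.mean(axis=0)  # (1) average across runs at each checkpoint
    return mean_curve.max()            # (2) take the maximum of the mean curve
```

Note that this is not the mean of each run's own maximum, which would be optimistically biased by per-run noise.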
In Line 279, we wrote, “We report the means and standard deviation errors across six independent trials”. We meant to say “standard errors” rather than “standard deviation errors”. This is a mistake in writing. We have amended it in our revision.
We appreciate the reviewer for the suggestion of using other more reliable evaluation metrics. We will take the suggestion in our revision.
Q2: “I think expanding a bit more on the analysis could be interesting”
As suggested by Reviewer zpVk, we provide additional results in Figure 11 of the one-page pdf to present additional investigation for Behavior Cloning in D4RL halfcheetah-medium-replay with Adam and SGD optimizers.
Compared with the RL cases reported in our paper, the parameter update amount is more Gaussian-distribution-like for BC (Adam). This indicates a difference in policy parameter update dynamics between BC (i.e., supervised learning) and RL. Somewhat surprisingly, the asymmetry of the parameter update amount turns out to be more severe for BC (SGD); that is, Adam exhibits less asymmetry than SGD, which is a bit counterintuitive.
Moreover, we found that BC (Adam), BC (SGD) and the RL cases reported in our paper show different patterns in terms of temporal SVD evolvement.
We believe that these results are interesting to potential audiences of this paper and worth a more in-depth discussion.
Q3: “By constraining the parameter evolution path to focus on the previous main directions, would the effect of primacy bias or related phenomena be even stronger? Could we be losing out on performance due to prematurely committing to certain update directions?”
In fact, the policy path trimming (PPT) method proposed in our paper can be viewed as a special way to do network resetting to alleviate the primacy bias as in [Nikishin et al., 2022].
Concretely, our method differs from the vanilla network resetting method in [Nikishin et al., 2022] on two points: (1) we do reset in a transformed subspace (with temporal SVD) rather than the original space; (2) we only reset the parameter update in the minor directions and maintain the learned information in the major directions.
From the perspective of plasticity, our method periodically rolls back the plasticity of the network in the directions that are orthogonal to the ones that represent the effective knowledge learned so far.
We appreciate the reviewer’s insightful comments and we believe that more study can be done in the future by taking into consideration both policy update subspace and the plasticity loss problem.
Q4: "Only upper 80% values according to are taken to plot...for meaningful analysis" Could you elaborate why this decision was made? Does it have to do with many of the values being close to zero?
This is because many parameters have a very minor value (i.e., very close to zero) of accumulated parameter change (as revealed by the parameter update asymmetry phenomenon).
We have clarified this in our draft to eliminate the confusion.
Q5: The clarification questions
- [Line 157: What is this quantity? Is it Singular Value Information Amount?] Yes, it is the Singular Value Information Amount for a given dimensionality number, which is defined in Line 155.
- [Fig 2 b) 4th and 5th figure from left. Interesting findings. So basically, the detour ratio is smaller for the major directions. Again, I wonder if that has more to do with noise or curvature. Perhaps there is less noise in these directions] Yes, the detour ratio is smaller for the major directions. We consider that it is closely related to the curvature of the landscape of the policy objective function.
- [In the 3rd fig from the left, it's hard to see much since the black curve masks everything else. Consider using different colors or some transparency. It would be nice to see how the paths get increasingly noisy.] We appreciate the reviewer's suggestion. We will replace the plot with the frequency information obtained by Fourier transformation for a better presentation.
- [More description for Policy Path Trimming and Policy Path Boosting] The conversion from the original parameter space to the subspace is done by performing temporal SVD. The corresponding left singular vectors are taken as the new coordinates in the subspace. This is described in Lines 145-152. We have taken the writing suggestions provided by the reviewer and added more detailed descriptions in Sections 4.1 and 4.2 to eliminate the confusion.
Q6: Other suggestions
We sincerely appreciate the reviewer for providing insightful comments along with very useful reference papers.
Due to the time limit of the rebuttal stage, we may not be able to respond to each point. However, we are more than willing to discuss more in the discussion stage.
Dear Reviewer,
We hope that you've had a chance to read our responses and clarification. As the end of the discussion period is approaching, we would greatly appreciate it if you could confirm that our updates have addressed your concerns.
Thank you for the clarifications. A quick follow-up question: How is PPTB similar to resetting among "minor" update directions? If I understand correctly, it only restricts updates in those directions but does not involve any resetting of weights.
We appreciate the reviewer for further feedback!
How is PPTB similar to resetting among "minor" update directions? If I understand correctly, it only restricts updates in those directions but does not involve any resetting of weights.
PPTB's resetting effect in the minor update directions is achieved by (1) obtaining the transformed parameter space by performing temporal SVD on the historical policies collected in a sliding window, and then (2) (re-)setting (or dropping) the left singular vector values in the minor directions to zero.
More concretely, this differs from the resetting method in [Nikishin et al., 2022] in two respects:
- [Resetting to random initialization vs. resetting to zero in the transformed parameter space] The resetting method in [Nikishin et al., 2022] resets the network parameters to a set of randomly (re-)initialized parameters. In contrast, PPTB resets the left singular vector values to zero for the minor temporal SVD directions, i.e., it resets the parameters to zero in the transformed parameter space. One thing to note is that resetting the left singular vector values to zero does not mean that the network parameters (in the original parameter space) are necessarily zero, because the basis of the space is different. A more in-depth analysis of the correlation between them is worth further study in the future.
- [Global resetting vs. local resetting] The resetting method in [Nikishin et al., 2022] is global, as the network parameters are reset to a set of randomly (re-)initialized parameters. In contrast, we reset according to the temporal-SVD parameter space based on a sliding window that includes recent historical policies (and is thus local). In this sense, this is similar to Shrink-and-Perturb [Ash & Adams, 2020], which can be viewed as a soft version of the resetting method that shrinks the network parameters and adds random parameters with a coefficient.
Indeed, we agree with the reviewer that this effect can be understood as a restriction in these directions. We appreciate the reviewer's inspiring comments. We will add these discussions in our revision for a more comprehensive understanding.
Please let us know if there are any remaining questions or concerns that we can address to improve your assessment. We are willing to discuss more with the reviewer to improve our work further.
The authors examine the trajectories of policy learning in continuous control reinforcement learning tasks.
They begin by measuring how directly parameters go to their destination and observe large detours and differing update behaviour for different layers. They then examine the singular value decomposition of the different training checkpoints and observe some strong common update trends.
They propose Policy Path Trimming and Boosting (PPTB), where boosting amplifies the gradients towards the strongest singular values, and trimming removes the updates from the smaller singular values.
Strengths
- The paper describes their method and analysis well. I generally found it easy to read and understand what points they were trying to convey.
- The proposed method does seem to improve performance on the tasks presented.
Weaknesses
- The environments and methods investigated, given the authors' results are purely empirical, are not diverse enough to draw any real conclusions. The authors focus on continuous control environments without considering discrete environments such as Atari, or more diverse network architectures such as recurrent networks or transformers, or different methods such as PPO or DQN. These results hold only for MLPs or CNNs trained on continuous control tasks, which just isn't a convincing enough setting to warrant acceptance.
- The proposed method does not seem particularly practical. Computing the SVD is both time intensive and requires storing a wide range of previous parameters. It therefore requires a lot more compute, which is why such methods are typically not used. Compare this with approximations such as momentum-based optimisation [1], which is also aimed at achieving more uniform convergence, this time among eigenvalues of the data matrix. Although their method improves performance, they have not convincingly demonstrated that the compute couldn't be better used elsewhere, for example by training a bigger network or by sweeping hyperparameters more effectively.
[1] Goh, "Why Momentum Really Works", Distill, 2017. http://doi.org/10.23915/distill.00006
Questions
- Can you comment more on the relationship between your work and other, less compute intensive methods, that aim to allow more uniform convergence among the different data eigenvalues such as momentum? Is there an explicit connection here? Have you thought about that?
Limitations
- The authors, to their credit, provide an extensive discussion of the limitations of their work in Appendix A. I agree largely with the points in that section and enjoyed their contextualisation of their work there.
Q1: “Computing SVD is both time intensive and requires storing a wide range of previous parameters. It therefore requires a lot more compute, which is why such methods are typically not used”
We apologize for causing the reviewer to misunderstand the memory consumption and computational overhead of our method.
We would like to further clarify this point in the following:
[Computation cost] We would like to clarify that (1) each SVD operation in our experiments takes roughly 1 second using torch.linalg.svd, and (2) the SVD operation is not performed at every time step but rather at sparse intervals. The practical computation cost in wall-clock time is less than 5% of the total training time, which is worthwhile given the benefits SVD brings. (See the illustrative sketch after this response.)
Moreover, the temporal SVD operations remain scalable as matrix size increases, thanks to the use of efficient linear algebra libraries. This scalability has been validated by numerous successful cases, such as the widely adopted large model fine-tuning technique, LoRA (Low-Rank Adaptation) [1]. The effective application of LoRA clearly demonstrates that SVD is not a significant issue in terms of memory overhead and computational cost.
[Memory cost] As described in Appendix B, PPTB does not require storing all historical policies but instead stores policies at sparse intervals. In our implementation, the stored policy parameter matrix is of a modest size, making the memory overhead entirely acceptable.
We have clarified this point in our draft.
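As a rough illustration of the cost, one can time an economy-size SVD on a matrix of roughly the scale involved here (the shape below is an illustrative assumption, not our exact stored matrix):

```python
import time
import torch

# Illustrative scale only: a few hundred stored snapshots of a ~1e5-parameter policy.
snapshots = torch.randn(200, 100_000)

start = time.time()
U, S, Vh = torch.linalg.svd(snapshots, full_matrices=False)  # economy-size SVD
print(f"temporal SVD took {time.time() - start:.2f}s")       # roughly a second in our runs
```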
Q2: Can you comment more on the relationship between your work and other, less compute intensive methods, that aim to allow more uniform convergence among the different data eigenvalues such as momentum? Is there an explicit connection here? Have you thought about that?
In Q1, we provided the response to the misunderstanding of the “compute-intensive” point.
We are more than willing to discuss the relationship between our work and any concrete related work provided by the reviewer in the discussion stage.
Reference:
[1] LoRA: Low-Rank Adaptation of Large Language Models. 2021
Dear Reviewer,
We hope that you've had a chance to read our responses and clarification. As the end of the discussion period is approaching, we would greatly appreciate it if you could confirm that our updates have addressed your concerns.
Our response clarified the computational cost of our method. In practice, our method adds only minor additional computational cost to the baseline algorithms considered in our experiments. In addition, we provided our understanding and thoughts on the scaling of our work.
We believe that our response addressed the concern about the computational cost raised in the review. We would greatly appreciate it if the reviewer could confirm that our response has addressed your concern on this point. Please let us know if there are any further issues.
Thank you very much to the authors for their response to my rebuttal. I apologise for my misunderstanding about the required runtime of their algorithm, on which a portion of my criticism was based. I trust the authors have updated this point in the draft, and I think it would be helpful to make this idiot-proof in the final version of the paper! I have raised my score to reflect this. However, I would still like to see more environments and settings investigated in this work, for example by including experiments in Atari.
Additionally, my first question was about the connection between this method and straightforward optimizers with momentum. These try to balance convergence speeds for different matrix eigenvectors, as discussed in the work I cited. Although this may be too loosely related to be relevant, I would be interested in any thoughts the authors have!
We greatly appreciate the time and effort devoted by the reviewer to reviewing our work and participating in the discussion.
We also appreciate the reviewer's valuable suggestions and insightful comments. Here we provided some more results and discussions to address the concerns further.
"I would still like to see more environments and settings investigated in this work, for example by including experiments in Atari."
We provided additional results for applying PPTB with Double DQN in MinAtar [1]. MinAtar is a lightweight version of Atari, which strives to enable more reproducible and thorough experiments (and also a bit faster).
We implemented Double DQN-PPTB based on the official code of MinAtar by adding several lines of code to collect the historical network parameters of Double DQN and perform PPTB at intervals throughout learning. We made no change to the default hyperparameters of Double DQN.
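Schematically, the integration point looks like the following (a sketch with placeholder interval values and a placeholder `pptb_update` helper, not the exact MinAtar code):

```python
import torch

SNAPSHOT_INTERVAL = 1_000   # placeholder: how often to store a flattened snapshot
PPTB_INTERVAL = 50_000      # placeholder: how often to apply trimming/boosting
param_history = []          # sliding window of flattened Q-network parameters

def flatten_params(net):
    return torch.cat([p.detach().reshape(-1) for p in net.parameters()])

def load_flat_params(net, flat):
    offset = 0
    for p in net.parameters():
        n = p.numel()
        p.data.copy_(flat[offset:offset + n].view_as(p))
        offset += n

# Inside the existing Double DQN training loop (sketch):
#   if step % SNAPSHOT_INTERVAL == 0:
#       param_history.append(flatten_params(q_network))
#   if step % PPTB_INTERVAL == 0 and len(param_history) > 2:
#       new_flat = pptb_update(torch.stack(param_history))  # trim + boost via temporal SVD
#       load_flat_params(q_network, new_flat)
```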
We run 3M steps for SpaceInvader and Breakout, and 5M steps for Seaquest. We reported the means and standard errors of the final episode returns across 6 seeds for DoubleDQN and 3 seeds for DoubleDQN-PPTB (3 more seeds are still running) in the table below.
| Task | Double DQN | Double DQN-PPTB |
|---|---|---|
| SpaceInvader | 77.63 ± 6.85 | 93.03 ± 6.70 |
| Seaquest | 11.00 ± 1.70 | 25.43 ± 4.38 |
| Breakout | 28.85 ± 4.01 | 31.43 ± 2.55 |
The results show that PPTB also noticeably improves the learning performance of Double DQN in MinAtar tasks with discrete actions. This also inspires us to further investigate this point in the future.
We appreciate the reviewer for suggesting this. We hope these results will help to make our experimental evaluation of PPTB more convincing and address the reviewer's concern on this point. We will include these results in our paper as suggested.
the connection between this method and the relevant work on straightforward optimizers with momentum mentioned by the reviewer
We appreciate the reviewer for the inspiring comments. The idea of momentum adopted in modern optimizers is essentially to make use of the information of historical gradients. In this sense, we can establish a connection between momentum and PPTB proposed in our work as the key idea of PPTB is to make use of the latent structure of the historical policies along the policy update path.
Momentum-based optimizers use different kinds of aggregated quantities of the historical gradients. In our context, PPTB differs in two respects: (1) not all the temporal SVD directions are used, and directions with large oscillations are trimmed (note that both SVD and eigendecomposition are techniques that capture features of curvature); (2) policy path boosting can be viewed as a type of momentum conducted in the major temporal SVD directions. Another difference to note is that both our analysis and experimental evaluation are done with Adam (except for the additional results for SGD in the one-page pdf), which already leverages momentum. Thus our findings are based on practical optimization with a momentum-based optimizer.
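To make the contrast concrete, the following is a schematic comparison (not the exact Eq. 2 of our paper): heavy-ball momentum aggregates all past gradients, whereas PPB extrapolates only along the estimated major temporal-SVD directions of the recent parameter path.

```latex
\[
  m_t = \beta\, m_{t-1} + g_t, \qquad \theta_{t+1} = \theta_t - \alpha\, m_t
  \quad\text{(heavy-ball momentum: aggregate all past gradients)}
\]
\[
  \theta' = \theta + \sum_{k \in \{1,2\}} \eta_k \,\big\langle \theta - \theta_{\mathrm{old}},\, v_k \big\rangle\, v_k
  \quad\text{(PPB, schematic: extrapolate along the major temporal-SVD directions } v_1, v_2\text{)}
\]
```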
We will add a discussion of this point to our paper, and we believe that a more formal study of this connection is worthwhile in the future.
Reference:
[1] MinAtar: An Atari-Inspired Testbed for Thorough and Reproducible Reinforcement Learning Experiments. arXiv: 1903.03176
The authors study how parameters evolve in Deep RL. They perform SVD on updates and find that parameters advance along a small number of directions. They then propose a method to trim the policy learning path by focusing the updates on these major directions. They show that their methods improve performance in MuJoCo and DMC.
Strengths
The work is a well-written exploration of an interesting and novel perspective on RL parameter updates. The authors are extremely thorough and clear with their investigation and show strong results on common benchmarks. Furthermore, their method seems very simple to implement, which is valuable to the community.
Weaknesses
Concerns with the paper:
- The paper only investigates extremely dense reward settings. Intuitively, the conclusions of this paper should not apply to sparse(r) reward settings, which are arguably much more interesting in RL. (See Q1 below.)
- The results of the paper do not seem like they would be at all specific to RL. I'm not convinced that the observed phenomenon is not just a simple byproduct of using an optimizer with momentum. It would be good to have results that show that this phenomenon does not occur or help in supervised learning tasks (where I would imagine there is significantly more literature on this topic), and/or that this phenomenon still occurs when studying agents trained with plain SGD.
- The significance of the results is unclear (see below).
Possible improvements:
- It could be neat to replicate the empirical investigations when using your new method PPTB. Does PPTB actually address the issues presented and dampen parameter updates?
- The results are not easy to read. It is hard to tell which results are statistically significant (I would recommend using standard error!), and I would also recommend plotting standard RL training curves with the appropriate error regions. As-is, it's very hard to tell whether this method actually helps or not. In particular, it seems as though the error regions often overlap in Table 2, for example.
- The writing is often vague. In the abstract, the authors write: "we study how the policy networks of typical DRL agents evolve during the learning process by empirically investigating several kinds of temporal change for each policy parameter". This vague sentence conveys little information. The authors also use the word "asymmetry" in the abstract without explaining what they mean (asymmetric with respect to what?).
Questions
- Do you think this would work equally well in sparse(r) reward settings? My intuition is that it would not, since the early parameter updates likely do not contain significant information about the reward. The continuous control tasks evaluated are particularly dense. Also, doesn't pruning the noisier directions harm exploration?
- Related to the above, why do you believe this method does not exacerbate the issues of primacy bias mentioned in the paper? The related work section's first paragraph does not really compare and contrast with the prior works; it just mentions them.
- Do you think these results apply beyond just RL? I see no reason why this phenomenon is RL-specific.
- Related to Q3: Doesn't the fact that these methods use momentum (e.g. Adam) make this phenomenon obviously true? Does this phenomenon persist when using plain SGD? What about when you observe the gradient updates as opposed to the parameter updates?
- Can you include the plots mentioned above? It would really help with my understanding of the paper.
Limitations
The authors address the limitations.
Q1: “Do you think these results apply beyond just RL? I see no reason why this phenomenon is RL-specific. Doesn't the fact that these methods use momentum (e.g. Adam) make this phenomenon obviously true? Does this phenomenon persist when using plain SGD? What about when you observe the gradient updates as opposed to the parameter updates?”
We appreciate the reviewer for pointing this out. We agree that it would be interesting and valuable to extend our experimental investigation to more general deep learning settings.
To this end, we provide additional results in Figure 11 of the one-page pdf, which presents an additional investigation of Behavior Cloning (BC) in D4RL halfcheetah-medium-replay with the Adam and SGD optimizers.
Compared with the RL cases reported in our paper, the distribution of parameter update amounts is closer to a Gaussian for BC (Adam), indicating a difference in policy parameter update dynamics between BC (i.e., supervised learning) and RL. Somewhat surprisingly, the asymmetry of the parameter update amounts turns out to be more severe for BC (SGD). This is a bit counterintuitive, since one might expect the momentum in Adam, rather than plain SGD, to be the source of such asymmetry.
Moreover, we found that BC (Adam), BC (SGD), and the RL cases reported in our paper show different patterns in terms of temporal SVD evolution.
In this paper we start from and focus on online RL, and we defer a study across different learning paradigms to future work.
Q2: “why do you believe this method does not exacerbate issues of primacy bias mentioned in the paper?”
In fact, the policy path trimming (PPT) method proposed in our paper can be viewed as a special way to do network resetting to alleviate the primacy bias as in [Nikishin et al., 2022].
Concretely, our method differs from the vanilla network resetting method in [Nikishin et al., 2022] on two points: (1) we do reset in a transformed subspace (with temporal SVD) rather than the original space; (2) we only reset the parameter update in the minor directions and maintain the learned information in the major directions.
From the perspective of plasticity, our method periodically rolls back the plasticity of the network in the directions that are orthogonal to the ones that represent the effective knowledge learned so far.
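To make this contrast concrete, here is a minimal sketch (hypothetical function names, numpy assumed; the vanilla reset follows the spirit of [Nikishin et al., 2022] rather than their exact recipe):

```python
import numpy as np
import torch.nn as nn

def vanilla_reset(model: nn.Module, init_fn):
    # Vanilla resetting: re-initialize (part of) the network, discarding what
    # has been learned there so far.
    model.apply(init_fn)

def subspace_reset(theta_now, theta_start, major_directions):
    # PPT-style reset (simplified): keep only the update components that lie in
    # the major temporal SVD directions; discard those in the minor directions.
    # `major_directions`: (k, d) array of orthonormal temporal SVD directions.
    delta = theta_now - theta_start                        # accumulated update
    kept = (major_directions @ delta) @ major_directions   # projection onto the major subspace
    return theta_start + kept
```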
We appreciate the reviewer's insightful comments, and we believe that more study can be done in the future by jointly considering the policy update subspace and the plasticity loss problem.
Q3: “Do you think this would work equally well in sparse(r) reward settings? My intuition is that it would not, since the early parameter updates likely do not contain significant information about the reward. The continuous control tasks evaluated are particularly dense. Also, doesn't pruning the noisier directions harm exploration?”
We thank the reviewer for raising this inspiring point. First, we expect the asymmetry in parameter update amounts and the concentration of temporal SVD information to be more severe in sparse-reward settings, as the self-distillation mechanism can dominate when the reward signal contains little information [1].
Second, our method does not aim to address the exploration problem and thus it is less likely to improve the learning performance especially when the baseline algorithm fails with very sparse rewards.
However, we do not think that our method necessarily harms exploration, because the trimming happens at a very sparse interval and, moreover, the effect of the parameter updates in the minor directions on exploration is not yet clear. One possible interpretation is that the periodic trimming rolls back the plasticity (as discussed in our response to Q2) and could encourage learning new behaviors with the parameters in these minor directions.
We are also running additional experiments in several sparse-reward DMC tasks, and we will present the results if these runs finish before the discussion stage ends.
Reference:
[1] DR3: Value-Based Deep Reinforcement Learning Requires Explicit Regularization. 2021
Dear Reviewer,
We hope that you've had a chance to read our responses and clarifications. As the end of the discussion period is approaching, we would greatly appreciate it if you could let us know whether they have addressed your concerns.
We appreciate the reviewer's further feedback!
"It is hard to tell which results are statistically significant (I would recommend using standard error!) and also plotting standard RL training curves with the appropriate error regions"
Sorry for missing the response to this point in our rebuttal. We actually responded to the point about "standard error" in Q1 of our rebuttal to Reviewer iaGj, but we missed it here.
In Line 279, we wrote, "We report the means and standard deviation errors across six independent trials". We meant to say "standard errors" rather than "standard deviation errors"; this was a writing mistake, and we have amended it in our revision.
In our one-page pdf material, we provided the learning curves with means and standard errors (Figure 9 and Figure 10), which should help to show a more complete comparison than the scores we reported in the tables in our draft.
Please let us know if there are any remaining questions or concerns that we can address to improve your assessment. We are willing to discuss more with the reviewer to improve our work further.
If the error regions refer to standard error, then aren't a lot of your results in Table 2 insignificant? The regions sometimes overlap heavily. In my head, I was converting the standard deviation to standard error (which means I divide the std by the square root of the number of seeds).
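(Concretely, with 6 seeds a reported standard deviation $s$ corresponds to a standard error of $s/\sqrt{6} \approx 0.41\,s$, i.e., the error regions shrink by roughly a factor of 2.4 under this conversion.)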
That being said, I think my key concerns have been addressed, and this point is a rather minor one (most RL benchmarks have long since saturated). It's honestly hard to tell if this is a significant improvement (looking also at Figures 9 and 10 in the author's rebuttal), but this is not the fault of the authors as much as it is the field. Because of this, I am raising my score.
We sincerely appreciate the time and effort you devoted to reviewing our work.
In our draft, we followed the evaluation scheme of previous works, but we agree with the reviewer that more random seeds would further improve the quality of our experimental evaluation. We plan to increase the number of seeds from 6 to 12 in these tasks.
Thank you for your valuable suggestions; we will make sure to incorporate them to improve our paper.
Thanks for the response! I had a quick question on your thoughts on my comments on statistical significance (the second point on improvements in the weaknesses). I don't see it in the response, but may have missed it. In general, the difference in performance seems extremely marginal, though I do understand that most RL benchmarks have long since saturated.
I do believe most of my concerns have been addressed, though would hope to get a quick answer to the above.
We appreciate all the reviewers’ careful review and valuable comments. Please refer to the individual rebuttals for our responses.
In the one-page pdf uploaded, we provide additional results:
- [(Suggested by Reviewer zpVk) Additional empirical investigation of Behavior Cloning in D4RL halfcheetah-medium-replay with the Adam and SGD optimizers] Figure 11 presents the results of an extended investigation covering supervised learning (BC) and RL, with a momentum-based optimizer (Adam) and a vanilla optimizer (SGD).
- [(To Reviewer B4U6) The missing learning curves] Please refer to Figure 9 and Figure 10.
With these additional discussions and experimental results, we would like to emphasize that our empirical study and the method for improving DRL agents in the policy subspace constructed with temporal SVD have not been studied before to the best of our knowledge.
Finally, we sincerely hope that our response can address the questions and concerns raised by the reviewers. We also hope that the reviewers can re-evaluate the value of our work based on the responses and the additional results provided in the one-page pdf. We are also more than willing to address any further questions or concerns during the discussion stage.
This paper finds that policy networks in some deep reinforcement learning (DRL) agents evolve in a simplified, low-dimensional space, with significant progress in a few key directions while showing oscillations in others. To improve learning efficiency, the authors propose a method called Policy Path Trimming and Boosting (PPTB), which focuses on the key directions and reduces unnecessary updates, leading to better performance in DRL tasks.
This is an insightful and valuable contribution to this venue. Although the reviewers do not have any major concerns and generally appreciate the novelty and significance of the work, there are several ways the paper should be improved, as the reviewers suggested. Specifically, the connection with momentum-based optimizers is a natural question that should be addressed. Moreover, there is a concern regarding statistical significance, which the authors are encouraged to take into account given the current state of the art in the RL literature. Computational expense is another obvious concern that will naturally arise, given that the work is based on an SVD approach, which is typically understood to be computationally expensive. The authors clarified during the rebuttal how the infrequent use of SVD keeps the overall computation cheap, although it might still momentarily increase peak computation. The authors should address this concern early in the paper.