Efficient Reinforcement Learning by Discovering Neural Pathways
To improve energy efficiency and reduce the carbon footprint, we propose Neural Pathways to use the network parameter space efficiently for reinforcement learning.
Abstract
Reviews and Discussion
The paper proposes a heuristic approach that learns masks so that only parts of the neural network are activated, saving energy in RL problems. The authors design an algorithm by applying the masking method to SAC and provide experimental results.
Strengths
The paper considers an important problem of saving energy in RL tasks. The paper is well-written and easy to follow. The algorithm and experimental results are clearly presented.
Weaknesses
- The algorithm is designed specifically for SAC. What about other RL policies? I guess similar separate masks can also be learned for algorithms such as DDPG, PPO, etc. It is doubtful if the proposed method generalizes to a broader class of RL policies, and this discussion is not covered in the paper even though it is important.
- It would be worth visualizing how different parts of the neural network are activated, to show that the concept of neural pathways really makes sense. In fact, depending on different tasks, how do the pathways compare with each other? Does task similarity also imply pathway similarity?
Questions
- I am wondering if the FLOP counts in Fig 5 include learning the masks. Is there a trade-off between the sparsity of the neural networks and the FLOP counts since it is harder to make a very sparse neural network converge?
- Why doesn't Fig 2.b (Performance under different sparsity) have variance plots or error bars? How many random tests were used to get Fig 2.b?
- How does the proposed mask learning specifically for RL differ from existing pruning literature?
Limitations
The authors have indicated in the conclusion section that more experiments would provide valuable insights.
Author's Response to Reviewer ddsT:
Thank you for taking the time to review our paper and for the valuable feedback. We are thrilled that you recognize the importance of our work and found our paper well-written, easy to follow, and our experimental presentation clear. In response to your suggestion to demonstrate whether DAPD generalizes to a broader range of RL policies, we provide experimental results using PPO in continuous control. We also address the clarifications you requested. We hope we have addressed all the raised concerns and we are happy to answer any further questions you may have.
The algorithm is designed specifically for SAC. What about other RL policies? ..such as DDPG, PPO, etc. It is doubtful if the proposed method generalizes to a broader class of RL policies
- To assess the general applicability of our method across various RL policies, we tested it on PPO for continuous control, comparing our method's performance against a dense network in single-task settings.
- Our results demonstrate that our approach enhances performance for on-policy actor-critic methods like PPO. We provide the learning curve over 3 seeds below:
- HalfCheetah-v2: https://imgur.com/XKlqRVJ.png
- Ant-v2: https://imgur.com/PyFQiiq.png
depending on different tasks, how do the pathways compare with each other? Does task similarity also imply pathway similarity?
- In Appendix D.4 and Figure 20, we discuss our findings. We do not impose any condition of task similarity; hence, pathway overlap emerges organically, and we do not find any concrete relation between task similarity and pathway overlap. However, imposing task similarity is an interesting avenue for future research and is outside the scope of the current work.
I am wondering if the FLOP counts in Fig 5 include learning the masks.
- The FLOP count is calculated during inference, where we do not need to consider learning the mask.
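For intuition, here is a rough sketch of how inference FLOPs scale with the mask (an illustrative helper, not the exact accounting used for the reported numbers): a dense linear layer costs about two FLOPs per weight, so a binary pathway mask reduces the count to roughly two FLOPs per surviving weight.

```python
def linear_layer_flops(in_features, out_features, active_weights=None):
    """Approximate inference FLOPs of y = Wx + b: one multiply and one add per weight."""
    if active_weights is None:                 # dense layer
        return 2 * in_features * out_features
    return 2 * active_weights                  # only weights kept by the pathway mask contribute

# e.g. a 256x256 layer: 131072 dense FLOPs vs. roughly 6.5k FLOPs at 5% sparsity
```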
Is there a trade-off between the sparsity of the neural networks and the FLOP counts since it is harder to make a very sparse neural network converge?
- Finding the optimal subnetwork poses a challenge in sparse training, which is why we propose data-adaptive sparse training (DAPD). In Table 2, we demonstrate that SAC-DAPD not only reduces FLOP counts but also substantially enhances performance compared to the dense network (SAC-dense) due to less parameter interference.
Why doesn't Fig 2.b (Performance under different sparsity) have variance plots or error bars? How many random tests were used to get Fig 2.b?
- Experiments were run on 5 seeds. Below we show the variance plot of Fig 2.b, which will be included in the revision.
- https://imgur.com/DGr88gY.png
How does the proposed mask learning specifically for RL differ from existing pruning literature?
- The pruning literature focuses on a static data distribution in a supervised learning setting, while we provide a learning algorithm under the continual distribution shift of RL.
Thank you for answering my questions. I will keep my score.
We, the authors of the paper, would like to follow up on our response to the review. We have addressed the points raised in the review and hope that our revisions effectively resolve them. If any further clarification or discussion is needed, please let us know.
Motivated by the large energy consumption needed to train modern machine learning methods, the authors propose a novel training method to find energy-efficient architectures. To demonstrate this, the authors test their method in single and multi-task reinforcement learning scenarios, showing that the architectures can be reduced to 5% of their original number of parameters while maintaining performance, improving data efficiency, and naturally reducing energy consumption compared to other pruning algorithms. In addition, the authors perform further analysis to understand the impact of hyperparameters and the time evolution of their proposed method.
Strengths
- The problem addressed in this paper is extremely relevant considering the current state of the field, heavily leaning towards training and using large models with many parameters, which could have a high computational cost.
- The paper is very clear and well-explained. It also helps that the method is simple and effective.
- The experiments are thoroughly studied and analyzed, justifying most of the method's design choices, particularly the Warm-up and Freeze trick. The authors compare the proposed method with other models not only in terms of performance but also in terms of their temporal evolution during training and how they function under different data availability regimes.
- Due to the method's apparent generality, it has the potential to be applied to various function approximation tasks involving neural network training. This could have a significant impact on the machine learning community, especially for LLMs.
Weaknesses
- While some experimental details are omitted from the main text, the authors prioritize including the most relevant aspects. The appendix provides all the necessary details for a thorough understanding.
Questions
1.- I wonder if it is possible to apply this method to a very simple supervised setting. That would be a good starting point to generalize the technique to other applications.
2.- To further explore the relationship between pathways and task features (as acknowledged in the appendix limitations), applying this method to a set of controlled, parameterized tasks with varying similarity could be beneficial. Analyzing how pathways emerge in such a setting might elucidate their dependence on the task itself.
3.- In line 196, you mentioned that “...offline RL operates with a fixed training dataset for each task. Consequently, adaptive pathway updates are unnecessary…”. Why is this the case? Is this because of the effect of the evolving policy in the online case?
4.- Where is the orange line in Fig 2c?
5.- Is the energy efficiency advantage mostly after training during inference? This is not necessarily a bad thing, considering how expensive running inference can be for large models.
6.- Typo in line 255: extra “(” or missing “)”.
Limitations
These are properly discussed in the appendix.
Author's Response to Reviewer wTeM:
Thank you for taking the time for such a thorough review of our paper and for the valuable feedback. We appreciate your recognition of the paper's relevance, clarity, and thorough analysis of our experiments. We are elated to hear that the simplicity and effectiveness of our method, as well as its potential impact, were well-received. We will address the corrections identified in the revised version. Below, we address the questions raised in the feedback. We are happy to answer any further questions.
I wonder if it is possible to apply this method to a very simple supervised setting.
- Experiments conducted in an offline RL setting are similar to those in supervised learning. Therefore, the principles can generally be applied to supervised learning as well.
In line 196, you mentioned that “...offline RL operates with a fixed training dataset for each task. Consequently, adaptive pathway updates are unnecessary…”. Why is this case?
- Thank you for pointing this out. We meant to say that in offline RL, the training dataset is static, and therefore determining the mask in a single step is still effective. Adaptive mask updates are more crucial in online RL due to the changing dataset distribution as the policy evolves. However, in practice, adaptive pathways might also help in the offline RL setting and therefore we will change the wording of this sentence.
Where is the orange line in Fig 2c?
- We apologize for the mistake; here is the complete Fig 2c, and we will correct it in the revised version.
Is the energy efficiency advantage mostly after training during inference?
- We can maintain energy efficiency throughout training just as effectively as during inference, at the cost of extra memory. To do this, we save the unmasked weights in memory and reload them whenever the mask is updated. Once we achieve the desired performance threshold, we can discard the unmasked weights for the remainder of the training process.
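A minimal sketch of this bookkeeping, assuming PyTorch and binary per-parameter masks (the class and method names below are illustrative placeholders):

```python
import copy

import torch


class PathwayBookkeeper:
    """Keep an unmasked copy of the weights so the pathway can change during training,
    then drop the copy once the performance threshold is reached (illustrative sketch;
    for brevity it does not merge the latest updates of the active weights back into
    the backup before switching masks)."""

    def __init__(self, model):
        self.dense_backup = copy.deepcopy(model.state_dict())   # full, unmasked weights

    @torch.no_grad()
    def apply_new_mask(self, model, masks):
        # Reload the saved unmasked weights, then zero out everything outside the new pathway.
        model.load_state_dict(self.dense_backup)
        for name, p in model.named_parameters():
            if name in masks:
                p.mul_(masks[name])

    def discard_backup(self):
        # Once the desired return threshold is reached, the dense copy can be freed.
        self.dense_backup = None
```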
Thanks to the authors for responding to some of my questions. Based on the authors' answers and other reviewers' comments, I will keep my scores. Great work!
This paper presents an approach to network pruning in the context of deep reinforcement learning. DRL poses the interesting challenge of non-stationarity in the data distribution as the policy improves and samples states/rewards differently. The idea is to learn a bitmask that selects specific parameters in a larger network for inclusion in a neural pathway. The top K parameters to include in the mask are chosen based on their impact on the loss function, as measured using gradients. To handle non-stationarity, only the most recent experiences in the replay buffer are used when computing gradients, a moving average of gradients is used, and there is a warm-up/freeze approach with the masks to allow them to settle and the target network to complete learning.
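To make the masking step concrete, the following is a minimal PyTorch-style sketch of gradient-based top-K mask selection with a moving average of scores, in the spirit of the description above; the |weight × gradient| score, the function and variable names, and the hyperparameters are illustrative and may differ from the paper's exact formulation.

```python
import torch


def update_pathway_mask(model, loss_fn, recent_batch, scores_ema, keep_ratio=0.05, beta=0.9):
    """Recompute binary masks from gradient-based parameter scores.

    scores_ema maps parameter name -> running average of |w * grad|, computed on the
    most recent transitions so the scores track the shifting data distribution."""
    model.zero_grad()
    loss_fn(model, recent_batch).backward()

    masks = {}
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        score = (p.detach() * p.grad.detach()).abs()            # sensitivity of the loss to each weight
        prev = scores_ema.get(name, torch.zeros_like(score))
        scores_ema[name] = beta * prev + (1.0 - beta) * score   # moving average smooths non-stationarity
        flat = scores_ema[name].flatten()
        k = max(1, int(keep_ratio * flat.numel()))
        threshold = torch.topk(flat, k).values.min()            # keep only the top-K scored weights
        masks[name] = (scores_ema[name] >= threshold).float()
    return masks
```

During the warm-up phase such a mask would be recomputed periodically; once the return crosses the chosen threshold, the mask is frozen and only the selected weights continue to train.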
Strengths
The paper is in general easy to read and understand. It would benefit from a careful editing pass.
The proposed idea is simple, grounded in past work that is well-cited, and effective. The experiments show, in the single domain case, that DAPD (the proposed approach) actually leads to better domain performance than either an unpruned network or two reasonable baselines.
The experiments reported in the paper explore interesting aspects of the approach, such as the role of warmup, sample efficiency, and online/offline multi-task RL. The figures are informative.
Weaknesses
There is a disconnect between the title and the content of the paper. The title leads one to believe that power consumption will be a critical component of the paper. There are some connections made between FLOPs and energy, but they feel like an afterthought rather than something that drove the work. The results are interesting in their own right without putting energy in the title. I suggest either changing the title or making the connection between the approach and energy consumption stronger in the paper.
How does one choose K (the number of parameters to keep) and the threshold on reward/return used to freeze the masks? In real domains (i.e., those that are less well-understood than, say, MuJoCo) I suppose they would be chosen empirically and treated as hyperparameters to search over.
It would help to see performance in domains other than MuJoCo as they are all relatively similar. I understand that Ant is a hard one, but they are all locomotion related. How general is the approach?
I would be more clear in section 3.2 on what is novel and what the true contributions are.
Questions
How do you ensure that a mask spans a continuous path from inputs to outputs? There must be a mechanism in place for that but it was not clear from the paper.
Limitations
There is no meaningful discussion of limitations.
Author's Response to Reviewer HQmW:
Thank you for taking the time to review our paper and for the valuable feedback. We are delighted that you found our approach well-grounded and effective, the paper easy to read, and the experiments informative. In response to your feedback, we include additional experimental results in Atari games to demonstrate domain generalization and clarify the questions raised. We hope that you will find our explanations below helpful and we are happy to answer any further questions.
How does one choose K (the number of parameters to keep) and the threshold on reward/return used to freeze the masks? In real domains (i.e., those that are less well-understood than, say, MuJoCo)
- Thanks for your questions. As you point out, K and the threshold on return are hyperparameters that need to be tuned.
- Following your suggestion, we explored scenarios without any assumption about the expected return and considered updating the mask periodically instead. We conducted experiments using DQN on three Atari games, updating the mask every L gradient steps. We report the final performance after 10 million gradient steps, averaged over 3 seeds, with L = 1 million; a minimal sketch of this periodic update follows the table below.
| Env | DQN-dense | DQN-DAPD |
|---|---|---|
| DemonAttack-v4 | 17670.33 ± 2829.91 | 20803.33 ± 3273.07 |
| BreakoutNoFrameskip-v4 | 346.66 ± 12.21 | 384.0 ± 15.80 |
| PongNoFrameskip-v4 | 20.36 ± 0.58 | 19.09 ± 0.77 |
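A minimal sketch of this threshold-free training loop (`train_step` and `recompute_mask` are placeholders for the standard DQN update and the mask-scoring step; the function is illustrative, not the exact implementation):

```python
def train_with_periodic_mask_updates(q_network, replay_buffer, train_step, recompute_mask,
                                     total_steps=10_000_000, mask_update_interval=1_000_000):
    """Recompute the pathway mask every L gradient steps instead of freezing it
    after a return threshold (illustrative sketch of the Atari variant)."""
    for step in range(1, total_steps + 1):
        train_step(q_network, replay_buffer.sample())     # standard DQN update on the masked weights
        if step % mask_update_interval == 0:
            recompute_mask(q_network, replay_buffer)      # e.g. gradient-based top-K scoring
```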
It would help to see performance in domains other than MuJoCo .. How general is the approach?
- In addition to locomotion tasks, we also present performance on MetaWorld robot-arm manipulation tasks in Table 2 and Table 3 in the paper.
- To demonstrate the generality of the approach and to check performance in other domains, we provide the performance of DAPD in three pixel-based Atari environments.
I would be more clear in section 3.2 on what is novel and what the true contributions are
- In section 3.2, we use the scoring function from [1] as cited in the paper, while the rest of the method represents our original contribution. We will further clarify this in the revised version.
How do you ensure that a mask spans continuous path from inputs to outputs? There must be a mechanism in place for that but it was not clear from the paper.
- We do not explicitly ensure a continuous path from input to output. This issue is not observed at 5% sparsity; however, in a sparser network (refer to Appendix: Table 10), it could potentially explain the reduced performance evident in our experiments, and addressing it is an interesting direction for future research.
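As an illustrative diagnostic (not part of the method in the paper), one could count hidden units whose masked incoming or outgoing weights are all zero; such units cannot lie on any input-to-output path, and their prevalence at high sparsity would be consistent with the drop reported in Appendix Table 10.

```python
import torch


def count_dead_units(in_mask: torch.Tensor, out_mask: torch.Tensor) -> int:
    """in_mask:  (hidden, inputs) binary mask of the layer feeding a hidden layer.
    out_mask: (outputs, hidden) binary mask of the layer it feeds into.
    A hidden unit is "dead" if it has no active incoming or no active outgoing weight,
    so no input-to-output path can pass through it."""
    no_incoming = in_mask.sum(dim=1) == 0     # the unit receives nothing
    no_outgoing = out_mask.sum(dim=0) == 0    # the unit sends nothing
    return int((no_incoming | no_outgoing).sum())
```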
There is a disconnect between the title and the content of the paper...The results are interesting in their own right without putting energy into the title. I suggest either changing the title or making the connection between the approach and energy consumption stronger in the paper.
- We will rewrite the title such that it does not mislead the reader.
Reference:
We, the authors of the paper, would like to follow up on our response to the review. We have addressed the points raised in the review and hope that our revisions effectively resolve them. If any further clarification or discussion is needed, please let us know.
The reviewers broadly agreed that the proposed method is interesting and potentially useful, as it contains simple ideas that appear to be important for effectively sparsifying RL agents. The main weakness of the paper is that the empirical evidence is preliminary, since the tasks are quite simple and do not require large networks. Nevertheless, since the subject is in its infancy, the paper provides useful signal for future exploration in this area.