Episodic Control-Based Adversarial Policy Learning in Two-player Competitive Games
Abstract
Reviews and Discussion
This paper focuses on improving the efficiency of training an adversarial policy in two-player competitive games. Specifically, the paper proposes a framework which leverages historical states and their rewards from previous episodes to compute a score using a recurrent model. This score is then used to augment the rewards during the training of the adversarial policy. Experimental results on MuJoCo environments showed the improved efficiency of the proposed method over existing baselines.
Strengths
The paper is generally well-written and easy to follow, with the exception of certain parts of the results section, which seemed a little too brief. The idea of using past experiences to augment the rewards and improve training is not necessarily novel; however, I believe it is novel in the context of training adversarial policies in two-player games. The methodology is also well-organized, and the experimental results support the paper's claims. The use of group-based episodic feedback is an interesting technique, which adds to the quality of the findings, and I appreciate the additional experiments in the Appendix which demonstrate the generalizability of the proposed method to other baselines. The findings have potential implications for improving the robustness of agents in two-player competitive games.
Weaknesses
Some sections lack clarity; specifically, the section on the ablation study and terms like "non-loss rate" should be defined for completeness, which would help readers understand these concepts better. The related works section would also benefit from rephrasing, as certain paragraphs sound a little misrepresented to me. The idea of using past experiences to improve efficiency is also similar to works like Prioritized Experience Replay and Hindsight Experience Replay, which could be cited in the related works section to provide a more comprehensive overview. In terms of the soundness of the results, the latter parts of the results depend heavily on anecdotal evidence, and including more examples of such observations would make the findings more convincing. Additionally, discussing the limitations of the proposed method in terms of additional complexity would also help readers better weigh the pros and cons of the method. My specific concerns are listed below.
Questions
-
In the related works section, the authors wrote: "However, the above attacks are argued to be unrealistic since the real-world environment can not be manipulated". While the authors referred to this statement by referencing other works, I believe this statement is slightly misrepresented. In reality, adversarial attacks on the environment via perturbing the observation/action channels are definitely realistic. Furthermore, the proposed approach of two player games is also a form of indirect environmental manipulation, which ultimately affects the observation space. Hence, this statement seems rather contradictory to me given the topic of the paper.
-
As mentioned above, the idea of using good trajectories from the past to augment the training process and improve efficiency is not new. I would suggest the authors cite more related papers to create a more comprehensive overview. For example, how does the idea proposed in this paper differ from those of hindsight experience replay, prioritized experience replay, etc.?
-
"The reward R′α obtained by the adversarial agent only includes the evaluations from a single game and does not contain evaluations from the historical experiences.". Does this imply that existing works on adversarial policy thus far are all on-policy?
-
In Equation 3, explaining what the "otherwise" scenario entails would make the paper clearer.
-
In Section 3.2.2, the authors used a moving average of recent rewards to compute the feedback. What happens if the training destabilizes (which is common for on-policy methods) and the moving average decreases? Would the feedback still be helpful, and is the formulation designed to handle such scenarios, or is the assumption that the moving average reward always increases needed?
-
"(In practical, we use sliding window)" --> Grammatical mistake
-
From what I understand, computing and updating the episodic memory every episode considerably increases the computational complexity of this method. Given this increase in complexity, a comparison of wall-clock running time would be helpful. Also, would this method work if the memory updates are performed less frequently?
-
The conclusion that the agent is able to learn multiple winning strategies and learn adversarial and non-adversarial strategies are all based on a single anecdotal observation. The conclusions would be more convincing if the authors could provide more anecdotal examples
-
Terms like "non-loss rate" should be explained briefly for better clarity and completeness
-
In general, I find sections 4.3.1 and 4.3.2 a bit too brief to truly understand what the authors are comparing. For example, in 4.3.1, what is the difference between using the historical scores vs 3-step patterns? Could the authors elaborate more?
Details of Ethics Concerns
This paper proposes a method for improving the efficiency of training adversarial policies, which could be used to harm other RL policies; the ethical implications of this method should be highlighted by the authors.
Weakness:
8.The conclusion that the agent is able to learn multiple winning strategies and learn adversarial and non-adversarial strategies are all based on a single anecdotal observation. The conclusions would be more convincing if the authors could provide more anecdotal examples
Thanks for pointing this out. This is a subjective conclusion we reached after watching the videos of the policies in YouShallNotPassHumans. We observed that our policy in YouShallNotPassHumans combines lying down on the ground (adversarial actions) with obstructing the victim (non-adversarial actions) to make the victim fail. Although this kind of policy is only observed in YouShallNotPassHumans, we think it still indicates that our method may have the ability to enable the adversarial agent to learn multiple winning strategies.
We will further clarify this in the paper, for example, we will change
“Fourth, our work demonstrates that by identifying and highlighting the winning states with historical experiences, adversarial agents can achieve higher winning rates against fixed victim agents and possess the capability to integrate multiple winning strategies to defeat the victims.”
in the introduction to
“Fourth, our work demonstrates that by identifying and highlighting the winning states with historical experiences, adversarial agents can achieve higher winning rates against fixed victim agents and may have the potential to integrate multiple winning strategies to defeat the victim, which is demonstrated in one MuJoCo game environment in our experiments.”
9.Terms like "non-loss rate" should be explained briefly for better clarity and completeness
The “non-loss rate” is the rate of games that the adversarial agent does not lose, i.e., non-loss rate = winning rate + tie rate.
10.In general, I find sections 4.3.1 and 4.3.2 a bit too brief to truly understand what the authors are comparing. For example, in 4.3.1, what is the difference between using the historical scores vs 3-step patterns? Could the authors elaborate more?
Section 4.3.1 presents the ablation study on states and patterns, which are introduced in Section 3.2.1. Section 4.3.2 presents the ablation study on the historical score and the episodic feedback, as the episodic feedback serves as an optimization of the historical score. In Section 4.3.1, we use the wording “we calculate the episodic feedback with historical scores of both states and 3-step patterns and then revise the rewards with two episodic feedbacks.” So the comparison is between using the historical scores of states and using the historical scores of 3-step patterns, not between historical scores and 3-step patterns.
We hope our responses and clarifications have addressed your key questions and concerns. If so, we kindly request that you consider revising your score to better reflect this. If there are any remaining questions, please let us know, and we will do our best to resolve them.
I acknowledge that I have read the authors' responses and I am more or less satisfied with the responses.
While I would be willing to raise the score to a 7 if the option were available, unfortunately I don't believe the existing results qualify for a score of 8, hence I am maintaining my score.
Nevertheless, I hope the authors would consider including their responses back into the draft to further improve the clarity. All the best!
Thanks for your reply. We understand there are limitations and clarity issues in our paper, which may not justify a score of 8. However, we are pleased to see that our rebuttals have addressed your main concerns, and we sincerely appreciate your willingness to raise the score to 7. We will incorporate our responses into the paper and work on improving its clarity based on your valuable feedback.
Weakness:
4.In Equation 3, explaining what the "otherwise" scenario entails would make the paper clearer.
The adversarial agent can end the game with one of three outcomes: win, tie, or lose, and there are also three types of episodic feedback: greater than 0, less than 0, or equal to 0, giving 9 possible scenarios in total. "Otherwise" refers to the remaining 7 scenarios, i.e., all scenarios except "the adversarial agent wins and the episodic feedback is less than 0" and "the adversarial agent loses and the episodic feedback is greater than 0". Since these conditions are too long to be written into the equation, we use "otherwise" in their place.
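For concreteness, here is a minimal sketch of this case split (the function and variable names are ours for illustration; it only identifies which branch of Equation 3 applies, without reproducing the exact revision performed in each branch):

```python
def eq3_branch(outcome: str, episodic_feedback: float) -> str:
    """Classify which branch of Equation 3 a game falls into.
    outcome is 'win', 'tie', or 'lose' for the adversarial agent."""
    if outcome == 'win' and episodic_feedback < 0:
        return 'win with negative feedback'    # first explicitly written case
    if outcome == 'lose' and episodic_feedback > 0:
        return 'lose with positive feedback'   # second explicitly written case
    return 'otherwise'  # the remaining 7 of the 9 (outcome, feedback-sign) combinations
```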
5.In Section 3.2.2, the authors used a moving average of recent rewards to compute the feedback. What happens if the training destabilizes (which is common for on-policy methods) and the moving average decreases? Would the feedback still be helpful, and is the formulation designed to handle such scenarios, or is the assumption that the moving average reward always increases needed?
Since the reward revision in our method is cumulative, a decrease in the moving average (in two-player competitive games, this corresponds to an increase in the number of losing episodes) may not immediately harm the training process. On the other hand, the states in winning episodes from a group with a low average may be strongly encouraged compared with those in winning episodes from a group with a high average.
However, we are not sure whether the feedback will still be helpful when training destabilizes. We did not encounter such cases during our experiments, so we cannot draw a conclusion.
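As a rough illustration of the computation under discussion, following the description of the feedback as the episodic memory's prediction minus the recent average reward (the window size and names below are our assumptions, and the group-based computation in the paper may differ in detail):

```python
from collections import deque

recent_rewards = deque(maxlen=100)  # sliding window over recent episodes' cumulative rewards

def episodic_feedback(memory_score: float) -> float:
    """Feedback = predicted historical score minus the moving-average baseline."""
    if not recent_rewards:
        return 0.0
    baseline = sum(recent_rewards) / len(recent_rewards)
    return memory_score - baseline

# If training destabilizes and the baseline drops, states from winning episodes
# (which tend to have high memory scores) receive larger positive feedback
# relative to the current average, which is the effect described above.
```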
7.From what I understand, computing and updating the episodic memory every episode considerably increases the computational complexity of this method. Given this increase in complexity, a comparison of wall-clock running time would be helpful. Also, would this method work if the memory updates are performed less frequently?
We agree that computing and updating the episodic memory every episode increases the computational complexity and time cost. We considered this aspect, so we use a relatively simple architecture (LSTM+MLP) to build the episodic memory. The experiments also show that such an episodic memory is sufficient for MuJoCo two-player competitive games. We will also add the wall-clock time difference between our method and PPO to the appendix, as follows:
YouShallNotPassHumans: 10 hours -> 12 hours
KickandDefend: 22 hours -> 24 hours
SumoAnts: 30 hours -> 33 hours
SumoHumans: 24 hours -> 26 hours
Thanks for your invaluable and enlightening review.
Weakness:
1.In the related works section, the authors wrote: "However, the above attacks are argued to be unrealistic since the real-world environment can not be manipulated". While the authors referred to this statement by referencing other works, I believe this statement is slightly misrepresented. In reality, adversarial attacks on the environment via perturbing the observation/action channels are definitely realistic. Furthermore, the proposed approach of two player games is also a form of indirect environmental manipulation, which ultimately affects the observation space. Hence, this statement seems rather contradictory to me given the topic of the paper.
RL has been applied in settings like autonomous driving, negotiation, and automated trading; in such domains, an attacker usually cannot access the victim policy's input. However, adversarial policy training treats the victim policy as a black box and uses natural actions to induce the victim's failure. Such actions achieve the same effect as other adversarial methods (e.g., observation manipulation) but do not require any direct access to or manipulation of the victim policy, and thus may still work in real-world situations where the adversary cannot access the victim policy.
2.As mentioned above, the idea of using good trajectories from the past to augment the training process and improve efficiency is not new. I would suggest the authors cite more related papers to create a more comprehensive overview. For example, how does the idea proposed in this paper differ from those of hindsight experience replay, prioritized experience replay, etc.?
Prioritized Experience Replay assigns a priority to each transition in the replay buffer, ensuring that more significant or impactful transitions are sampled more frequently. Different from this method, our method first identifies good states from the historical evaluations and then directly modifies the rewards given by the environment, improving training by assigning higher rewards to the good states so that they are more likely to be learned and selected.
Hindsight Experience Replay (HER) and our method both modify rewards, but in different ways and for different goals. HER sets an alternative goal for the state transitions of an episode to create a learning signal from failed attempts, while we identify good states from historical evaluations and encourage the training process to learn and select them. HER modifies the rewards based on whether the alternative goal is reached (for example, it may assign a reward of 0 if the goal is achieved), whereas we revise the rewards conditionally with the historical evaluations, which evaluate the quality of the states and may contain more information.
As we stated in Section 2.2, we implement our method using episodic control. This kind of method maintains an extra episodic memory to save and analyze historical experiences, which aligns with our insight. Different from previous episodic control methods, we focus on the win-lose information from past episodes and incorporate such information into the rewards from the environment via conditional reward revision. We further implemented the SOTA episodic control method (Li et al.) [1] with PPO and compared its performance to our method in the YouShallNotPassHumans and KickandDefend environments. The experimental results are shared at the same link where we show the videos of our experiments in the appendix. In YouShallNotPassHumans, our method outperforms [1]. In KickandDefend, [1] even harms the training. This indicates that existing episodic control methods may not be directly applicable to two-player competitive games.
3."The reward R′α obtained by the adversarial agent only includes the evaluations from a single game and does not contain evaluations from the historical experiences.". Does this imply that existing works on adversarial policy thus far are all on-policy?
"The reward R′α obtained by the adversarial agent only includes the evaluations from a single game and does not contain evaluations from the historical experiences." does not imply that existing works on adversarial policy thus far are all on-policy. As stated in Section 3.1, the “reward R′α” is the environmental rewards the adversarial agent received from the environment, which can be also stored and used in off-policy methods. This reward only contains the information for evaluating a single state transition in one game and what we want to do is to directly incorporate historical evaluations into such rewards for the adversarial agent to learn.
[1] Zhuo Li, Derui Zhu, Yujing Hu, Xiaofei Xie, Lei Ma, Yan Zheng, Yan Song, Yingfeng Chen, and Jianjun Zhao. Neural episodic control with state abstraction. arXiv preprint arXiv:2301.11490, 2023.
This paper introduces an episodic control-based adversarial policy training approach for two-player competitive games. The authors propose a method that enhances adversarial policy training by utilizing an episodic memory to store historical evaluations of game states and modify the corresponding rewards accordingly, helping adversarial agents prioritize states that lead to winning outcomes. The technique is tested in MuJoCo simulation environments and is claimed to show superior attack performance and to produce adversarial agents that are difficult to defend against, compared with baseline adversarial policy training methods.
Strengths
The proposed method appears effective within MuJoCo games and may serve as a reward-shaping technique to enhance existing approaches.
Weaknesses
Although the proposed method achieved some empirical success in the selected MuJoCo games, I do have some concerns regarding the current method.
(1) The proposed method lacks a theoretical guarantee and relies primarily on empirical results. The approach essentially applies reward adjustments by assigning intermediate rewards to specific states, which resembles a form of reward manipulation. It would be better if the authors could give some bounds analysis regarding their proposed method.
(2) Clarification is needed on how patterns are selected during an episode. Are multiple patterns chosen, or is only a single pattern used per episode? Could you elaborate more on this?
(3) It is unclear whether this method would succeed in more complex environments, such as StarCraft II or other sophisticated games. In these games, identifying successful patterns would likely be more challenging than in simpler environments like MuJoCo.
Questions
Based on the weaknesses noted, I believe the paper does not meet the quality standards required for acceptance at the ICLR conference.
Details of Ethics Concerns
No
Thanks for your invaluable and enlightening review.
Weakness:
1.The proposed method lacks a theoretical guarantee and relies primarily on empirical results. The approach essentially applies reward adjustments by assigning intermediate rewards to specific states, which resembles a form of reward manipulation. It would be better if the authors could give some bounds analysis regarding their proposed method.
Our work does come from a relatively empirical insight: utilizing historical evaluations based on win-lose information to differentiate between good and bad states, and encouraging the policy to prefer good states through reward revision. As we discuss in our paper, we implement this approach using episodic control, which we believe effectively embodies our insight. To the best of our knowledge, existing works on episodic control [1,2,3,4,5] do not provide clear theoretical insights but mostly rely on empirical insights and results, which may indicate that finding a theoretical justification for methods like ours could be a hard research question.
2.Clarification is needed on how patterns are selected during an episode. Are multiple patterns chosen, or is only a single pattern used per episode? Could you elaborate more on this?
We apologize for the unclear explanation. In Section 3.3.3, we state ‘In practice, we use sliding window’, which means we extract all patterns in an episode with a sliding window for the episodic memory. To better clarify this, we will change ‘The episodic memory M is then trained to map patterns in P to the new cumulative reward R’ in Section 3.3.1 to ‘The episodic memory M is then trained to map all the patterns in P to the same new cumulative reward R’.
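A minimal sketch of this sliding-window extraction (the window length n=3 and the names are illustrative assumptions, not the exact implementation):

```python
def extract_patterns(episode_states, n=3):
    """Return every length-n pattern in an episode via a sliding window."""
    return [episode_states[i:i + n] for i in range(len(episode_states) - n + 1)]

def build_training_pairs(episode_states, cumulative_reward, n=3):
    """Pair every pattern from one episode with that episode's new cumulative
    reward R, which the episodic memory M is trained to predict."""
    return [(pattern, cumulative_reward) for pattern in extract_patterns(episode_states, n)]
```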
3.It is unclear whether this method would succeed in more complex environments, such as StarCraft II or other sophisticated games. In these games, identifying successful patterns would likely be more challenging than in simpler environments like MuJoCo.
We implemented our method on PPO and conducted experiments in the StarCraft II environment. We use two identical agents to play against each other and train one of them as the adversarial agent. To the best of our knowledge, only the method from Guo et al. [6] has been evaluated on StarCraft II, so we compare our method with PPO and with Guo et al. The results (“StarCraft II.png”) are available at the same link where we show the videos of our experiments in the appendix. From the results, we can see that our method enhances PPO's performance in adversarial training for StarCraft II. However, unlike its performance in MuJoCo two-player competitive games, our method does not surpass the performance of the method proposed by Guo et al., which includes specific designs tailored for environments like StarCraft II that may not be strictly zero-sum. Since our method is scalable and can be integrated with various DRL algorithms, we applied it to the approach proposed by Guo et al. The results indicate that our method slightly improves the sample efficiency of their approach in the StarCraft II environment.
We hope our responses and clarifications have addressed your key questions and concerns. If so, we kindly request that you consider revising your score to better reflect this. If there are any remaining questions, please let us know, and we will do our best to resolve them.
[1] Máté Lengyel and Peter Dayan. Hippocampal contributions to control: the third way. Advances in neural information processing systems, 20, 2007.
[2] Charles Blundell, Benigno Uria, Alexander Pritzel, Yazhe Li, Avraham Ruderman, Joel Z Leibo, Jack Rae, Daan Wierstra, and Demis Hassabis. Model-free episodic control. arXiv preprint arXiv:1606.04460, 2016.
[3] Steven Hansen, Alexander Pritzel, Pablo Sprechmann, André Barreto, and Charles Blundell. Fast deep reinforcement learning using online adjustments from the past. Advances in Neural Information Processing Systems, 31, 2018.
[4] Alexander Pritzel, Benigno Uria, Sriram Srinivasan, Adria Puigdomenech Badia, Oriol Vinyals, Demis Hassabis, Daan Wierstra, and Charles Blundell. Neural episodic control. In International Conference on Machine Learning, pages 2827–2836. PMLR, 2017.
[5] Hung Le, Thommen Karimpanal George, Majid Abdolshah, Truyen Tran, and Svetha Venkatesh. Model-based episodic memory induces dynamic hybrid controls. Advances in Neural Information Processing Systems, 34:30313–30325, 2021.
[6] Wenbo Guo, Xian Wu, Sui Huang, and Xinyu Xing. Adversarial policy learning in two-player competitive games. In International Conference on Machine Learning, pp. 3910–3919. PMLR, 2021.
Thank you for your response. I will maintain my score, as I don't believe this paper offers significant scientific value to the field of adversarial reinforcement learning. I am sorry for that.
Thank you for your reply. Based on your reviews, we understand that your main concerns are:
-
Whether our method provides theoretical insights.
-
Whether our method remains effective in more challenging environments like StarCraft II.
For Concern 2, we addressed this through experiments, which show that our method continues to work well in such environments. We believe the results we provided effectively address this concern.
For Concern 1, we acknowledge that our method is based on empirical insights rather than theoretical ones. Proving the effectiveness of methods like ours theoretically is challenging, as many similar approaches also rely on empirical insights and experiments. For example, episodic control methods (mentioned above) and Hindsight Experience Replay [1] (referenced by Reviewer tuqK) are well-known works in the DRL field that follow a similar empirical approach. We believe the significant improvements demonstrated in our experimental results on Mujoco games can provide some evidence supporting the validity of our method.
In conclusion, while our paper may not offer significant scientific value meriting a score of 8, we believe our method makes a meaningful and valuable contribution to the field, which fits a score of 6 or 7. We sincerely hope that you may reconsider your score based on our answers and contributions.
[1] Marcin Andrychowicz, Filip Wolski, Alex Ray, et al. Hindsight experience replay. Advances in Neural Information Processing Systems, 30, 2017.
Our basic insight is grounded in one of the fundamental principles of RL: assigning higher rewards to good states and lower rewards to bad states. The core challenge lies in distinguishing between good and bad states and accurately evaluating their quality. In the context of two-player competitive games, we hypothesize that this differentiation can be achieved by leveraging win-loss outcomes from past games.
To the best of our knowledge, it is challenging to theoretically guarantee whether a state is good or bad in complex RL environments without relying on empirical results, as the quality of most states depends on long-term interactions and may not directly correlate with the outcome of a single episode. This may also explain why most previous methods similar to ours (we have already mentioned above) primarily rely on empirical experiments to demonstrate their effectiveness. Additionally, as our method depends on historical experiences collected during training, the quality of a state cannot be definitively determined until sufficient episodes have been run to generate the necessary historical data. Therefore, using empirical experiments and results should be an appropriate and practical way to validate our method.
We value the time and effort you've dedicated to reviewing our work. As it is close to the end of the discussion, we would like to address any remaining questions or concerns you might have. If you have further questions, please feel free to share them. Otherwise, we kindly request you to reconsider your score based on our rebuttals and responses.
The paper assumed that the adversarial agent's historical experiences can be used to distinguish winning states from losing states. The paper selected episodic control as the distinguishing model. By distinguishing winning states from losing states and adjusting rewards, the paper designed a method to improve the effectiveness of the adversarial agent. For adjusting rewards, the paper used the prediction of the episodic control minus the recent average reward as the new reward. The results show improvements on some of the MuJoCo two-player environments.
Strengths
1 Using history to predict the rewards makes sense.
2 Results seem better than the baseline.
Weaknesses
1 Using the history in robust RL is not new. With more information, it is not surprising that the results are better. "Robust Deep Reinforcement Learning against Adversarial Perturbations on State Observations"
2 Need more justification for the design selection. For example, why not train a reward function that takes some history features as input? Or why not use a transformer or a CNN instead of an LSTM?
3 It is well known that the agents are not robust to adversarial attacks. It is more meaningful if it is attacking a robust agent. Or maybe at least test on more agents. For example, attacking agents that also have histories as inputs.
Questions
see weakness.
Weakness:
3.It is well known that the agents are not robust to adversarial attacks. It is more meaningful if it is attacking a robust agent. Or maybe at least test on more agents. For example, attacking agents that also have histories as inputs.
We apologize that we do not have enough time to train a robust agent for the MuJoCo two-player competitive games, as the training method from "Robust Deep Reinforcement Learning against Adversarial Perturbations on State Observations" is for single-player games, and no well-trained robust agents for two-player competitive games are publicly available. However, we believe this does not represent a significant weakness in our work, as our experimental setup aligns with the common practice in the field [1, 2, 3, 4] for adversarial training in two-player competitive games. Additionally, we conducted extra experiments in a more complex game environment, StarCraft II, and include the results at the same link where we provide videos of our experiments. These results demonstrate that our method remains effective in more challenging scenarios.
We hope our responses and clarifications have addressed your key questions and concerns. If so, we kindly request that you consider revising your score to better reflect this. If there are any remaining questions, please let us know, and we will do our best to resolve them.
[1] Adam Gleave, Michael Dennis, Cody Wild, Neel Kant, Sergey Levine, and Stuart Russell. Adversarial policies: Attacking deep reinforcement learning. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=HJgEMpVFwB.
[2] Wenbo Guo, Xian Wu, Sui Huang, and Xinyu Xing. Adversarial policy learning in two-player competitive games. In International Conference on Machine Learning, pp. 3910–3919. PMLR, 2021.
[3] Xian Wu, Wenbo Guo, Hua Wei, and Xinyu Xing. Adversarial policy training against deep reinforcement learning. In 30th USENIX Security Symposium (USENIX Security 21), pp. 1883–1900, 2021.
[4] The Viet Bui, Tien Mai, and Thanh H Nguyen. Imitating opponent to win: Adversarial policy imitation learning in two-player competitive games. arXiv preprint arXiv:2210.16915, 2022.
Thanks for your invaluable and enlightening review.
Weakness:
1.Using the history in robust RL is not new. With more information, it is not surprising that the results are better. "Robust Deep Reinforcement Learning against Adversarial Perturbations on State Observations"
We agree that using history features is not new in RL. As stated in Section 2.2, we implement our method using episodic control, which has been shown to effectively enhance sample efficiency by leveraging historical information in single-player games. Since adversarial training for two-player competitive games reduces the environment to a single-player setting, existing episodic control methods can be adapted for this task. To evaluate this, we implemented the SOTA episodic control method (Li et al.) [1] with PPO and compared its performance to our method in the YouShallNotPassHumans and KickandDefend environments. We share the experimental results at the same link where we show the videos of our experiments in the appendix. The results demonstrate that our method outperforms [1] in YouShallNotPassHumans. Moreover, [1] appears to hinder adversarial training in KickandDefend, suggesting that existing methods may not be directly applicable to two-player competitive games. In contrast, our proposed method effectively utilizes historical information to improve adversarial training in two-player competitive game settings.
2.Need more justification for the design selection. For example, why not train a reward function that takes some history features as input? Or why not use a transformer or a CNN instead of an LSTM?
2.1.Why not train a reward function that takes some history features as input?
We believe this approach effectively utilizes history features. In fact, the value functions of basic algorithms like PPO also leverage history features. However, as mentioned in Section 3.2, we do not directly use our reward signals (episodic feedback) as replacements for the actual rewards. Instead, we add these signals to the real rewards cumulatively. This may better differentiate between good states and bad states over time, which can more effectively highlight the good ones.
2.2. Why not use a transformer or a CNN instead of a LSTM?
The reason we use an LSTM is to encode a pattern into an abstract vector that can be processed by an MLP. While more complex models, such as transformers, could also fulfill this role and might perform better, our method requires training an additional neural network alongside the primary one, which introduces challenges related to computational complexity and time cost. Implementing our method with simple neural networks like an LSTM and an MLP reduces this burden, and our experiments show that an LSTM is sufficient for relatively good performance in MuJoCo two-player competitive games.
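A minimal sketch of such an LSTM+MLP episodic memory (the dimensions, hidden sizes, and single scalar output below are illustrative assumptions, not the exact configuration):

```python
import torch
import torch.nn as nn

class EpisodicMemory(nn.Module):
    """Encode a pattern (a short sequence of states) with an LSTM,
    then map the final hidden state to a scalar historical score with an MLP."""
    def __init__(self, state_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.encoder = nn.LSTM(state_dim, hidden_dim, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, patterns: torch.Tensor) -> torch.Tensor:
        # patterns: (batch, pattern_length, state_dim)
        _, (h_n, _) = self.encoder(patterns)      # h_n: (num_layers, batch, hidden_dim)
        return self.head(h_n[-1]).squeeze(-1)     # (batch,) predicted scores

# Example: score a batch of 32 three-step patterns of 17-dimensional states.
memory = EpisodicMemory(state_dim=17)
scores = memory(torch.randn(32, 3, 17))
```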
[1] Zhuo Li, Derui Zhu, Yujing Hu, Xiaofei Xie, Lei Ma, Yan Zheng, Yan Song, Yingfeng Chen, and Jianjun Zhao. Neural episodic control with state abstraction. arXiv preprint arXiv:2301.11490, 2023.
We value the time and effort you've dedicated to reviewing our work. As it is close to the end of the discussion, we kindly request your feedback on our response to ascertain whether it effectively addresses your concerns.
This paper proposes a reward engineering approach for training adversarial agents in two-player games. The reviewers raised the following main concerns:
-
The proposed method is essentially a single-agent approach (as the victim policy is fixed) and falls within the category of techniques that modify previously collected data in the buffer, such as Prioritized Experience Replay and Hindsight Experience Replay. Despite this, the paper is framed as a multi-agent method, yet no comparisons are made to those related single-agent approaches.
-
The method appears highly ad hoc. In single-player settings, it is straightforward to construct scenarios where the proposed method biases the agent away from learning the optimal policy. In two-player settings, it can fail to learn the best response.
-
It is well-known that adversarial attacks can easily succeed in multi-agent RL. However, it is unclear how significant it is to achieve a higher score in attacking non-robust RL agents in a few toy tasks, especially given the ad hoc nature of the proposed method.
Additional Comments on Reviewer Discussion
Reviewers raised significant concerns that the method is essentially single-agent (as the victim policy is fixed), yet it is framed as multi-agent without relevant comparisons. The approach appears highly ad hoc, potentially biasing agents away from optimal policies in single-player settings and failing to learn the best response in two-player scenarios. Furthermore, while adversarial attacks are known to easily succeed in multi-agent RL, the significance of achieving higher scores against non-robust RL agents in a few toy tasks remains unclear, particularly given the ad hoc nature of the method. I feel these issues were not adequately addressed during the rebuttal phase, which is the main reason I lean toward rejection.
Reject