PaperHub

Overall rating: 4.8 / 10 — Rejected (4 reviewers)
Individual ratings: 3, 6, 5, 5 (min 3, max 6, std 1.1)
Confidence: 3.3 · Correctness: 2.5 · Contribution: 2.3 · Presentation: 3.0
ICLR 2025

Knowing What Not to Do: Leverage Language Model Insights for Action Space Pruning in Multi-agent Reinforcement Learning

OpenReview · PDF
Submitted: 2024-09-26 · Updated: 2025-02-05

Abstract

Keywords

Multi-agent reinforcement learning · Action space pruning · Exploration · Coding large language model

Reviews and Discussion

Review (Rating: 3)

The paper presents eSpark, a framework that leverages large language models (LLMs) to enhance multi-agent reinforcement learning (MARL) by pruning unnecessary actions. eSpark addresses the challenge of combinatorial action space growth in MARL, generating exploration functions in a zero-shot manner without manual intervention. Through iterative cycles of policy feedback and evolutionary search, it optimizes agent behavior. Evaluated on inventory management and traffic control tasks across 15 scenarios, eSpark outperforms baseline methods, showing a 34.4% performance gain and improved scalability with up to 500 agents. The framework demonstrates the effectiveness of LLM-driven action pruning in some environments.

Strengths

  1. The paper's primary strength lies in its innovative integration of large language models (LLMs) into multi-agent reinforcement learning (MARL) for action space pruning.
  2. The extensive experiments and ablation studies provide robust evidence of eSpark’s effectiveness and generalizability.

Weaknesses

  • While the paper aims to address the dimensional explosion in complex environments with many agents, the experimental settings, though varied, may not fully represent truly complex real-world environments. This raises doubts about whether the experiments sufficiently support the paper’s stated motivation.
  • The framework lacks a theoretical guarantee on how pruning actions with LLMs affects the optimality of learned policies, leaving open the possibility that some optimal actions may be discarded during exploration.
  • The computational efficiency comparisons may not be entirely fair, as different algorithms could have varying levels of computational complexity, especially with the use of LLMs, which are resource-intensive.
  • The paper does not include comparisons with straightforward, rule-based action space pruning techniques, which could serve as useful baselines and provide clearer insights into the added value of the LLM-driven approach.
  • While the framework performs well with homogeneous agents, it is unclear how well it would generalize to heterogeneous agents or to tasks with sparse rewards. The lack of experiments in such scenarios limits the generalizability of the proposed method.
  • The effectiveness of the eSpark framework depends heavily on the quality of the LLM outputs. Errors in exploration function generation or feedback handling could negatively impact performance, yet the paper provides limited discussion on handling these risks.

Questions

  • How would eSpark perform in settings with heterogeneous agents, where each agent may require distinct exploration functions? Is there a plan to extend the framework to such scenarios?
  • How does the framework manage incorrect or suboptimal outputs from the LLMs, especially during iterative exploration function generation? Are there fallback mechanisms to prevent performance degradation from such errors?
Comment

We thank the reviewer for the sincere comments. We hope we can address your concerns below.

W1: The experimental settings, though varied, may not fully represent truly complex real-world environments.

We disagree with the reviewer’s point. In our experiments, we aimed to evaluate our approach using simulators of real-world applications. Among our selected environments, MABIM is a multi-level inventory management simulator built on a real-world inventory dataset [1]. Our experimental scenarios tested up to 1500 agents, demonstrating high scalability. SUMO is a microscopic traffic simulation environment based on real-world road networks and traffic flow data, widely used in MARL research [2-4]. Additionally, in the attached PDF, we have included experiments conducted in the active voltage control environment [5], which is also built on real-world data.

Although gaps exist between simulated and real-world environments, we believe that the results obtained from these simulators still demonstrate the potential of our approach in real-world applications.

W2: The framework lacks a theoretical guarantee on how pruning actions with LLMs affects the optimality of learned policies, leaving open the possibility that some optimal actions may be discarded during exploration.

In Proposition 1, we prove the existence of an exploration function that can improve the performance of any given policy. The theoretical proof is detailed in Appendix A. However, when using an LLM to generate exploration functions—as with most LLM-integrated approaches—it's challenging to provide strict theoretical guarantees due to the unpredictable nature of LLM outputs. Nonetheless, our experimental results in Tables 1 and 2 demonstrate that our framework consistently outperforms all baselines in most scenarios. This empirically validates the LLM's effectiveness in pruning the action space within our framework.

W3: The computational efficiency comparisons may not be entirely fair, as different algorithms could have varying levels of computational complexity, especially with the use of LLMs, which are resource-intensive.

In Appendix E.1, we provide detailed comparisons of the training time, GPU utilization, and LLM resource consumption of eSpark versus baseline algorithms. As mentioned in Section 4.3, eSpark only invokes the LLM to generate or edit exploration functions based on feedback. Beyond these sparse invocations (limited to 10 times in our current settings), the exploration function interacts directly with the MARL policy.

Furthermore, in homogeneous MARL environments, a single exploration function is shared across all agents. This design ensures that the LLM invocation cost remains constant, regardless of the number of agents or training steps.
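For illustration, a minimal sketch of one such iteration in a homogeneous setting follows; the function names and the toy environment are placeholders rather than our actual implementation. It shows how a single LLM call per iteration produces a masking function that every agent then reuses at every environment step, so the LLM cost does not grow with the number of agents or training steps.

```python
# Illustrative sketch (placeholders, not the actual eSpark code) of one iteration
# in a homogeneous setting: one LLM call yields a mask function shared by all agents.
import numpy as np

def llm_generate_mask_fn(feedback=None):
    """Stand-in for the LLM call; returns a per-agent exploration (masking) function."""
    def mask_fn(obs, n_actions):
        # Hypothetical pruning rule: if the first observation feature (e.g. an
        # inventory level) is already high, forbid the larger half of the actions.
        mask = np.ones(n_actions, dtype=bool)
        if obs[0] > 0.5:
            mask[n_actions // 2:] = False
        return mask
    return mask_fn

def run_iteration(env_step, n_agents=4, n_actions=10, steps=100, feedback=None):
    mask_fn = llm_generate_mask_fn(feedback)   # the only LLM call in this iteration
    obs = np.random.rand(n_agents, 3)          # placeholder observations
    total_reward = 0.0
    for _ in range(steps):
        masks = [mask_fn(obs[i], n_actions) for i in range(n_agents)]
        actions = [int(np.random.choice(np.flatnonzero(m))) for m in masks]
        obs, reward = env_step(actions)        # masked random policy, for illustration only
        total_reward += reward
    return total_reward

def toy_env_step(actions):
    """Toy environment used only to make the sketch executable."""
    return np.random.rand(len(actions), 3), float(sum(actions))

print(run_iteration(toy_env_step))
```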

W4: The paper does not include comparisons with straightforward, rule-based action space pruning techniques, which could serve as useful baselines and provide clearer insights into the added value of the LLM-driven approach.

There may be a misunderstanding. We have indeed included comparisons with rule-based action space pruning methods in Section 5.3 of the paper. In addition to random pruning, we introduced Upbound pruning and (S,s) pruning for the MABIM task, and pruning methods based on MaxPressure for the SUMO task. Detailed descriptions of these methods are provided in Appendix D.3, and their performance comparisons with eSpark are presented in Tables 3 and 4.

W5: While the framework performs well with homogeneous agents, it is unclear how well it would generalize to heterogeneous agents or to tasks with sparse rewards. The lack of experiments in such scenarios limits the generalizability of the proposed method.

We acknowledge that these are limitations of our current work, which we have already stated on line 528. However, we have also proposed some potential solutions in Section 6, which we reiterate here.

For heterogeneous agent scenarios, we proposed to extend eSpark by grouping agents based on their distinct action spaces and reward functions, then providing unique exploration functions for each group. This approach would enable eSpark to adapt to heterogeneous agent environments with minimal modifications.

For sparse reward settings, we could instruct the LLM to reference human-derived heuristics for such tasks. Using these heuristics, we could design intrinsic rewards to provide feedback for the LLM's pruning mechanism, as in [6] for instance.

Expanding eSpark to handle heterogeneous agents and sparse rewards presents an exciting avenue, and we will leave these questions for future research.

W6: The effectiveness of the eSpark framework depends heavily on the quality of the LLM outputs. Errors in exploration function generation or feedback handling could negatively impact performance, yet the paper provides limited discussion on handling these risks.

Please see the answer to Q2.

Comment

Q1: How would eSpark perform in settings with heterogeneous agents, where each agent may require distinct exploration functions? Is there a plan to extend the framework to such scenarios?

As we discussed in Section 6, eSpark can be extended to heterogeneous MARL settings by grouping agents based on their distinct action spaces and reward functions, then assigning separate exploration functions to each group. This approach requires minimal modifications to adapt eSpark to heterogeneous environments. We plan to explore this extension in our future work.

Q2: How does the framework manage incorrect or suboptimal outputs from the LLMs, especially during iterative exploration function generation? Are there fallback mechanisms to prevent performance degradation from such errors?

We have implemented two methods to address errors or suboptimality in the generated exploration functions. These issues primarily stem from two sources:

  1. Logical errors: The generated exploration function may contain flaws such as failing to account for task transitions or misusing variables.
  2. Design flaws and hyperparameter selection: The function may have inadequately designed mask structures or suboptimal hyperparameters.

To tackle logical errors, we employ a dedicated LLM checker that verifies each generated exploration function for logical inconsistencies, as detailed in Section 4.1. This role-assignment approach has proven effective in previous studies [7,8].

For design flaws and suboptimal hyperparameters, we utilize evolutionary search to iteratively identify and select the best-performing exploration functions, as described in Section 4.2. This process naturally eliminates functions with inadequate designs or poor performance, excluding them from subsequent iterations.

By combining these two mechanisms, we effectively mitigate the risk of performance degradation by either rectifying or excluding errors and suboptimal outputs from the LLM.
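As a rough sketch of how these two mechanisms compose (this is our own schematic of the control flow, not the paper's implementation; `llm_checker` and `train_and_evaluate` are placeholders), candidate exploration functions are first screened, then trained and scored, and only the best performer seeds the next round of editing:

```python
# Schematic selection step: screen candidates with an LLM checker, train and score
# the survivors, keep only the best one as the template for further LLM editing.
def select_best(candidates, llm_checker, train_and_evaluate):
    """candidates: source strings of generated exploration functions."""
    screened = [c for c in candidates if llm_checker(c)]   # drop logically flawed ones
    scored = []
    for code in screened:
        try:
            reward = train_and_evaluate(code)              # MARL training with this mask
        except Exception:
            continue                                       # programmatic error: discard
        scored.append((reward, code))
    return max(scored)[1] if scored else None              # best-performing function

# Toy usage with stub checker/trainer, just to show the interface:
best = select_best(["prune_high_inventory", "noop"],
                   llm_checker=lambda c: True,
                   train_and_evaluate=lambda c: float(len(c)))
```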

[1] Yang, Xianliang, et al. "A versatile multi-agent reinforcement learning benchmark for inventory management." arXiv preprint arXiv:2306.07542 (2023).

[2] Lu, Jiaming, et al. "Dualight: Enhancing traffic signal control by leveraging scenario-specific and scenario-shared knowledge." arXiv preprint arXiv:2312.14532 (2023).

[3] Wei, Hua, et al. "Colight: Learning network-level cooperation for traffic signal control." Proceedings of the 28th ACM international conference on information and knowledge management. 2019.

[4] Jiang, Haoyuan, et al. "A general scenario-agnostic reinforcement learning for traffic signal control." IEEE Transactions on Intelligent Transportation Systems (2024).

[5] Wang, J., Xu, W., Gu, Y., Song, W., & Green, T. C. (2021). Multi-agent reinforcement learning for active voltage control on power distribution networks. Advances in Neural Information Processing Systems, 34, 3271-3284.

[6] Zhang, Alex, Ananya Parashar, and Dwaipayan Saha. "A simple framework for intrinsic reward-shaping for rl using llm feedback." (2024).

[7] Chen, Weize, et al. "Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors in agents." arXiv preprint arXiv:2308.10848 (2023).

[8] Zhang, Bin, et al. "Controlling large language model-based agents for large-scale decision-making: An actor-critic approach." arXiv preprint arXiv:2311.13884 (2023).

Comment

Thanks for replying, but some of my concerns are not addressed. I believe the main issue with this paper is that action masking is a commonly used technique in the multi-agent domain. My primary concern is whether eSpark can outperform the action mask rules crafted by human experts (as highlighted in Weakness 4).

Comment

We appreciate the reviewer's prompt response. We further clarify the rule-based baselines and heuristic baselines of this paper:

  1. The rule-based policies we examined—base stock, (s, S) in inventory management and FTC, MaxPressure in SUMO—were all crafted by domain experts.
  2. The heuristic pruning baselines we examined—(S,s) pruning in inventory management and MaxPressure pruning in SUMO—utilized expert-designed algorithms to provide heuristic action masks for reinforcement learning.

One of the primary reasons for studying the problems in our paper is that action mask rules are challenging to craft, even for human experts, due to the complex interactions among numerous agents. This challenge is also one of the fundamental assumptions underlying all MARL algorithms. Moreover, we have demonstrated that our approaches, which employ an LLM-guided dynamically adaptive exploration policy, outperform existing MARL baselines, expert rule-based baselines and expert rule-based heuristic pruning methods.

To the best of our knowledge, we are unaware of other rule-based action masking policies for the benchmarks we're considering. Since the mask needs to match the dimensions of the action space precisely, expert-designed action masks signify manually incorporating knowledge of a specific benchmark into the RL or search algorithms, which shows poor generalizability. We would welcome the opportunity to compare our approach with any other action masking policies the reviewer might suggest.

Review (Rating: 6)

In this paper the authors propose an action pruning method called eSpark for multi-agent reinforcement learning using LLMs. This approach utilizes LLMs to improve MARL training via optimized exploration functions, which are used to prune the action space. eSpark begins by using LLMs to generate exploration functions from task descriptions and environmental rules in a zero-shot fashion. It then applies evolutionary search within MARL to pinpoint the best-performing policy. The authors overcome the limitations of existing action pruning methods, which are either computationally expensive, hard to scale, or require knowledge of the underlying domain structure. The authors show that their proposed method is able to prune the action space for a large number of agents, with a 29.7% improvement in scalability.

Strengths

  1. The paper is clear and well-written.

  2. eSpark requires no complex prompt engineering and can be easily combined with MARL algorithms.

  3. The paper effectively shows that eSpark can handle scenarios with a large number of agents, which is a significant step in overcoming limitations of existing MARL methods.

  4. Good ablation studies provided for the proposed method.

Weaknesses

See the questions below.

Questions

  1. Did the authors test the performance with base MARL algorithms other than IPPO?

  2. While the paper states that no complex prompt engineering is needed, did the authors experiment with different prompts, and how did that influence the exploration function quality?

  3. It seems like the authors only compare their method against the random pruning and heuristic pruning methods. There are other works that the authors have mentioned in the related work section for pruning. Have authors considered comparing with those baselines?

  4. Do flawed exploration functions (e.g., variable misuse, misaligned task logic) ever occur despite the inclusion of the LLM checker? How can this be handled?

Comment

Thank you for your detailed feedback. We hope our responses can address your concerns.

Q1: Did the authors test the performance with base MARL algorithms other than IPPO?

We have included the performance of eSpark combined with MAPPO in the 100 SKUs scenarios of MABIM in Figure 22 of the supplementary materials. The results demonstrate that, whether combined with IPPO or MAPPO, our proposed framework consistently outperforms all MARL baselines. It achieves stable improvement and the best performance across all scenarios, showcasing the strong generalization capability of our method.

Q2: While the paper states that no complex prompt engineering is needed, did the authors experiment with different prompts, and how did that influence the exploration function quality?

Prompt engineering is crucial for tasks involving LLMs. During eSpark's development, we experimented with various prompt types (e.g., excluding the LLM checker, modifying feedback instructions, adding or removing task-related components). These modifications influenced the LLM's generation, comprehension, and refinement abilities, thereby impacting the quality of exploration functions.

When finalizing the prompt design, we adhered to two key principles:

  1. Minimal manual input: Users need only provide essential environment-specific information. The framework automatically handles exploration function generation, evolutionary search, and function refinement through fixed prompts, ensuring quick adaptability across different environments.
  2. High-quality generation: The LLM must understand task requirements, conditions, and feedback to produce exploration functions that are executable, task-compliant, and effective in guiding MARL to explore redundant action spaces.

In our final design, users input only a few RL formulations (mostly derivable from the environment) and a task description. Our framework then automatically generates exploration functions, trains models, and builds feedback loops. Experimental results show that our method consistently outperforms baselines across most scenarios.
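For illustration only, the "minimal manual input" described above might look like the following; the field names below are hypothetical placeholders, and the actual inputs are environment-specific.

```python
# Hypothetical task specification of the kind a user would supply; everything else
# (generation, evolutionary search, feedback) is driven by fixed prompt templates.
task_spec = {
    "task_description": "Multi-echelon inventory management with shared capacity.",
    "observation": ["inventory_level", "in_transit", "recent_demand"],
    "action_space": "discrete replenishment quantities per SKU",
    "reward_components": ["sales_profit", "holding_cost", "backlog_cost"],
}
```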

Q3: It seems like the authors only compare their method against the random pruning and heuristic pruning methods. There are other works that the authors have mentioned in the related work section for pruning. Have authors considered comparing with those baselines?

We mention several approaches in the Related Work section, but these methods face significant challenges when directly applied to our task. The method in [1] requires an elimination signal from the environment to learn the action elimination module, which is typically unavailable. The approach in [2] requires a high-quality offline dataset for pretraining. The method in [3] uses LLM knowledge to filter actions in text-based environments, but such pruning strategies are difficult to transfer directly to data-driven MARL environments. The approaches in [4-6] rely on manually designed data structures based on prior knowledge to filter actions, which lack generalizability. To the best of our knowledge, we have not found any pruning methods sufficiently generalizable for our task.

Comment

Q4: Do flawed exploration functions (e.g., variable misuse, misaligned task logic) ever occur despite the inclusion of the LLM checker? How can this be handled?

The LLM checker aims to detect flawed exploration functions, increasing the pool of viable candidates for evolutionary search. However, even with this checker in place, it cannot entirely eliminate all errors. We address errors based on their nature:

  1. Programmatic errors: These include issues like using undefined variables or mismatched input-output variable structures. In these cases, we exclude the exploration functions from the training process and do not use them as templates for subsequent editing.
  2. Logical errors: These occur when exploration functions run successfully but contain issues such as variable misuse or misaligned task logic. During the evolutionary search process, exploration functions with logical errors typically perform poorly and are naturally eliminated. Consequently, we do not select them for further iterations or use them as templates for refinement.

We will clarify this process further in the revised version of the paper.
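A minimal sketch of the validation implied by point 1, under our own assumptions (this helper is hypothetical and not from the paper): exploration functions that raise exceptions or return masks of the wrong shape or type are excluded from training and never reused as editing templates.

```python
# Hypothetical pre-training check for a generated exploration function.
import numpy as np

def is_valid_exploration_fn(mask_fn, sample_obs, n_actions):
    try:
        mask = np.asarray(mask_fn(sample_obs, n_actions))
    except Exception:
        return False                        # undefined variables, bad signature, ...
    return (mask.shape == (n_actions,)      # output must match the action space
            and mask.dtype == bool
            and bool(mask.any()))           # at least one action must remain available
```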

[1] Zahavy, Tom, et al. "Learn what not to learn: Action elimination with deep reinforcement learning." Advances in neural information processing systems 31 (2018).

[2] Shirali, Ali, Alexander Schubert, and Ahmed Alaa. "Pruning the Way to Reliable Policies: A Multi-Objective Deep Q-Learning Approach to Critical Care." IEEE Journal of Biomedical and Health Informatics (2024).

[3] Ammanabrolu, Prithviraj, and Mark O. Riedl. "Playing text-adventure games with graph-based deep reinforcement learning." arXiv preprint arXiv:1812.01628 (2018).

[4] Dulac-Arnold, Gabriel, et al. "Deep reinforcement learning in large discrete action spaces." arXiv preprint arXiv:1512.07679 (2015).

[5] Padullaparthi, Venkata Ramakrishna, et al. "FALCON-FArm Level CONtrol for wind turbines using multi-agent deep reinforcement learning." Renewable Energy 181 (2022): 445-456.

[6] Nagarathinam, Srinarayana, et al. "Marco-multi-agent reinforcement learning based control of building hvac systems." Proceedings of the eleventh ACM international conference on future energy systems. 2020.

Comment

We have incorporated all changes in the supplementary material. Here are our responses to your concerns:

  1. For question 1: We tested eSpark with MAPPO (Figure 22, supplementary materials), and it also consistently outperformed all MARL baselines, demonstrating eSpark's strong generalization with different MARL algorithms.
  2. For question 2: We illustrated our principles of prompt engineering and finalized a design requiring minimal input while ensuring effective exploration function generation and adaptability.
  3. For question 3: We clarified that existing pruning methods face challenges in our MARL tasks due to their reliance on specific signals, datasets, or manual designs, making them unsuitable for direct comparison.
  4. For question 4: We illustrated various causes of flawed exploration functions and discussed strategies for handling them effectively.

We appreciate your thorough review and look forward to your feedback.

Comment

Thank you for your detailed response. All my concerns are answered. Therefore, I will maintain my score of 6.

Review (Rating: 5)

The paper introduces eSpark, a novel framework designed to enhance Multi-Agent Reinforcement Learning (MARL) by leveraging Large Language Models (LLMs). Specifically, it addresses the combinatorial explosion of action spaces in MARL by utilizing LLMs to prune irrelevant and redundant actions, thereby improving the efficiency of exploration. eSpark generates exploration functions in a zero-shot manner, using only a basic task description, and refines the exploration process iteratively based on policy feedback. The framework is evaluated across tasks in inventory management and traffic light control, showing significant improvements in performance relative to existing MARL methods.

Strengths

  • Innovative Use of LLMs in MARL: The paper presents a novel application of LLMs for action space pruning in MARL. Leveraging LLMs to generate exploration functions in a zero-shot manner is a unique and promising approach that could pave the way for more efficient MARL systems.
  • Scalability and Generalization: The proposed eSpark framework demonstrates strong scalability, as shown in scenarios involving over 500 agents. The method also generalizes well across different domains, including traffic control and inventory management, indicating its wide applicability.
  • Performance Gains: eSpark achieves notable improvements over baseline MARL methods, including an average performance gain of 34.4% in inventory management tasks and 9.9% in traffic control tasks, showcasing its effectiveness in complex environments.

Weaknesses

I believe the biggest limitation of this paper is the lack of in-depth analysis and discussion of the proposed method. The baselines selected in the paper are mainly MARL methods and heuristic approaches, lacking comparisons and discussions with existing LLM-based methods [1-4]. Additionally, the environments chosen in the paper do not include classical MARL benchmarks, such as SMAC. Lastly, the paper does not provide detailed analysis or case studies of the action masks generated by the LLM; it mainly focuses on overall performance.


  1. Zhuang, Yuchen, et al. "Toolchain*: Efficient Action Space Navigation in Large Language Models with A* Search." ICLR 2024.

  2. Zhang, Shenao, et al. "How Can LLM Guide RL? A Value-Based Approach." arXiv preprint arXiv:2402.16181 (2024).

  3. Wang, Xingyao, et al. "Executable Code Actions Elicit Better LLM Agents." ICML 2024.

  4. Yan, Xue, et al. "Efficient Reinforcement Learning with Large Language Model Priors." arXiv preprint arXiv:2410.07927 (2024).

Questions

  1. How can eSpark be extended to continuous action spaces? Does this limit its usage?
  2. In the current eSpark framework, each agent independently generates its action mask without considering collaboration between agents. Could this limit eSpark's utility in tasks requiring strong cooperation?
  3. Generating an action mask at each time step is somewhat equivalent to generating a reward function at each time step, a topic that has already been explored in single-agent RL. The authors should analyze and discuss this.
  4. The benchmarks selected by the authors primarily come from the operations research field. What was the rationale behind this choice?
Comment

We thank the reviewer for the insightful comments! We hope we can address your concerns below.

W1: Only compare with MARL and heuristic methods, lack comparisons with existing LLM-based methods

We agree with the reviewer’s perspective on the importance of comparing eSpark with existing LLM-based methods. However, we find it challenging to adapt existing LLM-based methods to our task. Most existing LLM-based systems focus on text-related tasks, and our task presents the following difficulties:

  1. No Abstract Actions: The environments used in prior studies [1,2,4] generally include well-designed textual action interfaces or directly use text as actions. These abstract actions typically consist of a series of low-level actions with explicit meaning, which reduces the cost of LLM calls and makes them easy to evaluate. In contrast, most data-driven RL environments provide only a single-step action interface, making it difficult for an LLM to evaluate the consequences of each individual action effectively and increasing the calling cost.
  2. Multi-Agent Environments: The environments considered in [1-4] are primarily single-agent settings. In such scenarios, LLM-based methods only need to evaluate actions and observations for a single agent, avoiding the complexity of joint action spaces. However, in MARL tasks, the action space and the number of actions to output grow significantly as the number of agents increases. Directly using LLM-based methods to assist in action selection for MARL tasks would result in prohibitive computational costs.

To the best of our knowledge, we have not found any existing algorithms that can be directly applied to our task. We would greatly appreciate it if the reviewers could point us toward any existing methods that might address these challenges. In addition, we will summarize the relevant literature and these challenges in the revised version of the paper.

W2: The environments chosen in the paper do not include classical MARL benchmarks, such as SMAC.

Please refer to the answer to Q4.

W3: The paper does not provide detailed analysis or case studies of the action masks generated by the LLM; it mainly focuses on overall performance

In Section 5.3, Figure 2 illustrates the policy action distribution, which effectively analyzes the action masks. These masks, edited based on policy feedback and integrated with specific policies, cannot be evaluated in isolation. However, they significantly influence policy exploration. Consequently, we assess the effectiveness of the action masks by examining the behavior of the final learned policies. Additionally, Appendix J presents a case study that demonstrates how eSpark interprets policy feedback and refines exploration functions accordingly.

Q1: How can eSpark be extended to continuous action spaces? Does this limit its usage?

Extending eSpark to continuous action spaces is feasible. We propose modifying the exploration function in eSpark to define upper and lower bounds for the clipped action space. Agents can then restrict their action outputs to this clipped space, similar to existing continuous action space clipping methods. Two potential approaches are:

  1. Clipping actions: Define upper and lower bounds for the actions, and clip any out-of-range actions to these boundary values.
  2. Projection: Normalize the sampled action to [0, 1] and project it back to the specified upper and lower bounds.

We plan to implement and experiment with these approaches in future work.
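For concreteness, here is a minimal sketch of the two options above, under the assumption that the exploration function outputs per-dimension bounds (low, high); this is our own illustration, not an implemented part of eSpark.

```python
# Two ways to restrict continuous actions to LLM-proposed bounds (low, high).
import numpy as np

def clip_action(raw_action, low, high):
    """Out-of-range components are clipped to the boundary values."""
    return np.clip(raw_action, low, high)

def project_action(raw_action, low, high):
    """raw_action assumed normalized to [0, 1]; rescale it into [low, high]."""
    return low + raw_action * (high - low)

low, high = np.array([0.0, -1.0]), np.array([5.0, 1.0])
print(clip_action(np.array([7.0, -2.0]), low, high))    # clipped to [5., -1.]
print(project_action(np.array([0.5, 0.5]), low, high))  # projected to [2.5, 0.]
```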

Q2: In the current eSpark framework, each agent independently generates its action mask without considering collaboration between agents. Could this limit eSpark's utility in tasks requiring strong cooperation?

Our current design for generating action masks in eSpark is based on two key considerations:

  1. Task-specific collaboration complexity: Complex collaboration between agents often involves task-specific topological structures. These structures can be challenging to describe linguistically (e.g., in the MABIM environment, the cooperative relationships between different SKU types are unclear) and may change dynamically over time. This complexity adds significant difficulty to prompt design and reduces eSpark's generalizability. Rather than aiming for LLMs to solve all MARL challenges, eSpark focuses on pruning the joint action space, while leaving the learning of complex inter-agent cooperation to the MARL algorithm and its interactions with the environment.
  2. Iterative improvement via feedback: Although the exploration function generates action masks based solely on individual agents' observations, each iteration refines the exploration function using feedback from the overall policy reward. This approach allows the exploration function to maximize the total reward while abstracting away the underlying complexity of inter-agent structures.
Comment

Q3: Generating an action mask at each time step is somewhat equivalent to generating a reward function at each time step, a topic that has already been explored in single-agent RL. The authors should analyze and discuss this.

We concur with the reviewer to some extent. Fundamentally, RL aims to learn a value function (the Q*-function). However, in our work, we focus on optimizing exploration policies to accelerate the training procedure, which is particularly time-consuming in MARL. While both action masking and dynamic reward functions guide agent behavior, their underlying mechanisms differ significantly:

  1. Exploration control mechanism
  • Dynamic reward functions: These indirectly influence action selection by modifying incentives for each action while keeping the action space intact. For multi-agent tasks with large action spaces, dynamic reward functions can guide learning but may struggle to prevent unproductive exploration. Even with properly rewarded actions, agents still face numerous low-reward options at each step.
  • Action masking: Exploration functions generate action masks that directly prune the action space, imposing explicit constraints that prevent invalid exploration. Compared with negative reward guidance, masking empirically scales well as the space of invalid actions grows larger [5] (a minimal sketch of this logit-masking mechanism appears after this list).
  2. Support for iterative improvement
  • Dynamic reward functions: Modifying the original reward signal can lead to inaccurate evaluations of previously sampled experiences. This essentially changes the underlying MDPs, making iterative learning challenging, if not impossible.
  • Action masking: This approach doesn't alter the reward signal or the underlying MDPs. Instead, it restricts exploration to a subset of the action space. Changes in the action mask don't invalidate prior transition estimates, making it more suitable for iterative refinement during training based on policy feedback.
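As referenced above, the sketch below shows the standard logit-masking mechanism in the spirit of [5] (our illustration, not the paper's code): pruned actions are assigned a large negative logit, so they receive essentially zero probability while the reward signal stays untouched.

```python
# Invalid-action masking for a categorical policy: mask logits, then softmax.
import numpy as np

def masked_softmax(logits, mask):
    logits = np.where(mask, logits, -1e9)   # pruned actions -> effectively -inf
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

logits = np.array([1.0, 2.0, 0.5, 3.0])
mask = np.array([True, False, True, False])  # exploration function pruned actions 1 and 3
print(masked_softmax(logits, mask))          # probability mass only on actions 0 and 2
```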

Q4: The benchmarks selected by the authors primarily come from the operations research field. What was the rationale behind this choice?

Our goal was to focus on realistic applications that naturally lend themselves to MARL problems involving numerous agents. To be honest, we weren't initially aware that these environments had roots in the operations research field. In reality, these two environments are commonly employed as benchmarks within the reinforcement learning community, as demonstrated by [6] and [7]. In addition, we have added an active voltage control benchmark [8], where the training curve can be seen in the attached PDF.

[1] Zhuang, Yuchen, et al. "Toolchain*: Efficient Action Space Navigation in Large Language Models with A* Search." ICLR (2024).

[2] Zhang, Shenao, et al. "How Can LLM Guide RL? A Value-Based Approach." arXiv preprint arXiv:2402.16181 (2024).

[3] Wang, Xingyao, et al. "Executable Code Actions Elicit Better LLM Agents." ICML 2024.

[4] Yan, Xue, et al. "Efficient Reinforcement Learning with Large Language Model Priors." arXiv preprint arXiv:2410.07927 (2024).

[5] Huang, Shengyi, and Santiago Ontañón. "A closer look at invalid action masking in policy gradient algorithms." arXiv preprint arXiv:2006.14171 (2020).

[6] Hwang, HyeongJoo, et al. "Information-theoretic state space model for multi-view reinforcement learning." ICML (2023).

[7] Zhang, Chuheng, et al. "Whittle Index with Multiple Actions and State Constraint for Inventory Management.” ICLR (2024).

[8] Wang, J., Xu, W., Gu, Y., Song, W., & Green, T. C. (2021). Multi-agent reinforcement learning for active voltage control on power distribution networks. Advances in Neural Information Processing Systems, 34, 3271-3284.

Comment

We have incorporated all changes in the supplementary material. We have made the following answers for your concern:

  1. For weakness 1: We analyzed challenges in adapting existing methods to our MARL tasks.
  2. For weakness 3: The analysis of the policy in our Figure 2 is actually an analysis of the action mask. The case study in Appendix J demonstrates eSpark's iterative refinement based on policy feedback.
  3. For question 1: We proposed feasible modifications to extend eSpark to continuous action space.
  4. For question 2: We clarified that eSpark focuses on joint action space pruning while leaving complex inter-agent cooperation to the MARL algorithm, with iterative refinement ensuring overall collaboration.
  5. For question 3: We compared action masking with dynamic reward functions, highlighting the scalability and suitability of action masks for iterative refinement without altering the MDPs.
  6. For question 4: We explained the rationale behind selecting our benchmarks, emphasizing their realistic applications and common use in the RL community.

We hope our answers can address your concerns. We look forward to your feedback.

Comment

Thank you for your detailed response and for addressing my concerns. I appreciate the additional explanations and the updates provided in the supplementary material. That said, I have decided to keep my score. While I understand the challenges involved in adapting existing methods to your MARL tasks, conducting comparisons with LLM-based baselines and more classical MARL benchmarks is crucial for evaluating the completeness and broader impact of the work. I encourage the authors to further explore these comparisons to strengthen the overall contribution of the paper.

Review (Rating: 5)

The paper introduces the eSpark framework, which integrates large language models (LLMs) to address the challenges of exploration in environments with an increasing number of agents by pruning unnecessary actions. This approach is tested across various settings, including inventory management and traffic signal control, and demonstrates significant performance improvements.

Strengths

  • Innovative Use of LLMs: The paper effectively harnesses the capabilities of LLMs to generate exploration functions, providing a novel approach to action space pruning in MARL.
  • Clear Presentation: The paper is well-structured and clearly presents the methodology, experiments, and results. The use of figures and tables is effective in illustrating the improvements made by eSpark.

Weaknesses

High Training Costs: The eSpark framework necessitates multiple iterations, each generating k exploration functions and evaluating all state-action pairs within the action space. This approach substantially increases both the financial and computational complexity of training, requiring greater GPU memory and prolonging the overall training duration.

Lack of Theoretical Guarantees: The manuscript lacks a comprehensive assessment of the quality of exploration functions produced by the large language model (LLM). Consequently, it is challenging to ascertain whether the utilization of LLMs for pruning adversely affects the pursuit of optimal solutions.

Limited Experimental Environments: The experiments are confined to two specific tasks—logistics and traffic management—raising questions about the algorithm's generalizability and effectiveness in more widely encountered task environments. It remains uncertain whether the proposed algorithm can be effectively generalized across a broader spectrum of task scenarios.

Questions

  1. The example provided in Section 3.2 bears limited relevance to the proposed method in this paper. Could a more compelling example be introduced to illustrate the advantages of using LLMs for generating exploration functions? The current example merely demonstrates the reasoning abilities of GPT, which is a widely accepted understanding, and does little to support the argument that LLMs can effectively generate exploration functions.

  2. The paper only presents the final results in a tabular format without accompanying training curves. Specific details regarding the number of iterations required by eSpark and how the iteration count affects its performance remain unaddressed. Would it be feasible to include these additional visualizations?

  3. Relying solely on rewards as feedback may not adequately capture the current state of the policy. Could the authors consider providing more informative feedback, such as the individual components of the rewards?

Comment

Thank you for your detailed feedback and review! Below we give the response to your concerns.

W1: High Training Costs

Finding (near)-optimal policies in MARL incurs high training costs, primarily due to its inherent complexity, especially when involving numerous agents. We demonstrated in our paper that simply increasing resource consumption doesn't significantly improve performance, indicating the need for a smarter exploration strategy to accelerate training. As shown in Tables 5 and 6 in Section 5.4, our ablation study of eSpark without LLM reveals that merely increasing computational resources without action space pruning offers limited—and potentially detrimental—performance benefits. Conversely, with identical training steps and GPU usage, eSpark's unique action space pruning approach effectively leverages increased computational resources, outperforming all baselines in most scenarios.

W2: Lack of Theoretical Guarantees

In Proposition 1, we demonstrate that for any MARL policy, there exists an exploration function that can enhance its performance. However, the inherently stochastic nature of LLM outputs presents a challenge in providing theoretical guarantees on the resulting exploitation policies and their impact on optimality. This is particularly true in the MARL domain, where the optimal policy is often unknown.

Despite the lack of theoretical guarantees, we can evaluate the effectiveness of exploration functions by examining policy behaviors and experimental results. Our case study (Figure 2) comparing eSpark with baselines demonstrates that eSpark learns more interpretable and superior policies. Moreover, the results presented in Sections 5.2 and 5.3 show that eSpark consistently outperforms all baselines. These findings collectively validate the effectiveness of LLM-based pruning in MARL without compromising the pursuit of optimality.

W3: Limited Experimental Environments

To address this concern, we have evaluated all approaches on another environment: active voltage control environment [1]. We benchmarked eSpark against other MARL baselines across three provided scenarios. To evaluate performance, we adopted two indicators proposed in the paper: Control Rate (CR) and Power Loss (PL). Figure 23 of the attached PDF showcases the results. Our proposed eSpark outperforms the baselines in all experimental scenarios within this environment, demonstrating its strong generalizability across various tasks. We plan to continue exploring other MARL benchmarks in the future and would be happy to evaluate on additional environments that interest the reviewer.

Q1: Could a more compelling example be introduced to illustrate the advantages of using LLMs for generating exploration functions?

The primary advantage of using LLMs to generate exploration functions stems from their human-level reasoning ability and robust coding capabilities [2-5]. This reasoning allows LLMs to comprehend the structure and variables involved in a task and make performance-enhancing pruning decisions—the core of LLM-based pruning. Their coding capability enables the implementation of this reasoning in code form, facilitating seamless interaction with MARL policies. To address the reviewer’s concern, we have designed a more challenging sequential decision-making example in which agents collaborate at each time step to allocate resources subject to a constraint on the total allocation. In the attached PDF, we provide the complete problem definition and the LLM’s response. Leveraging its powerful reasoning and coding capabilities, the LLM successfully understood the problem structure and variables and designed a reasonable exploration function.
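As a toy illustration of the kind of exploration function this example calls for (our own simplified version, not the LLM output included in the PDF), any discrete allocation level that would exceed the remaining shared budget is masked out for the agent:

```python
# Toy exploration function for a budget-constrained allocation task.
import numpy as np

def allocation_mask(remaining_budget, allocation_levels):
    """Boolean mask over one agent's discrete allocation actions."""
    levels = np.asarray(allocation_levels)
    mask = levels <= remaining_budget
    mask[0] = True                # always allow allocating nothing
    return mask

print(allocation_mask(remaining_budget=3, allocation_levels=[0, 1, 2, 4, 8]))
# only the levels 0, 1, 2 remain available
```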

Q2: Specific details regarding the number of iterations required by eSpark and how the iteration count affects its performance remain unaddressed. Would it be feasible to include these additional visualizations?

We appreciate the reviewer's suggestion. In response, we have included the training curves for eSpark and the MARL baselines in Figures 22 and 23 of the attached PDF. These curves demonstrate that eSpark's performance steadily improves as feedback is collected after each iteration, coupled with evolutionary search and the regeneration of exploration functions. eSpark's performance stabilizes after approximately 10 iterations. Compared to the baselines, eSpark exhibits smoother improvement and consistently outperforms all MARL baselines across all test scenarios.

Comment

Q3: Relying solely on rewards as feedback may not adequately capture the current state of the policy. Could the authors consider providing more informative feedback, such as the individual components of the rewards?

We appreciate the reviewer's suggestion. This is precisely what we have implemented in our paper. Recognizing that relying solely on total rewards limits the LLM's ability to analyze the underlying reasons for policy performance, we provide reward components as additional information. This approach enables a multifaceted evaluation, as detailed in Section 4.3. In Section 5.5, we conducted an ablation study where we removed individual reward components, leaving only the total rewards for evaluation. Appendix H illustrates the differences in LLM outputs under these two conditions. The results demonstrate that providing reward components is crucial for LLMs to effectively refine and improve exploration strategies.
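Purely as an illustration of what reward-decomposed feedback might look like (the component names below are hypothetical; the actual components are task-specific, see Section 4.3):

```python
# Render total reward plus its components into a textual feedback block for the LLM.
def build_feedback(total_reward, components):
    lines = [f"Episode return: {total_reward:.2f}"]
    lines += [f"  {name}: {value:.2f}" for name, value in components.items()]
    return "\n".join(lines)

print(build_feedback(-12.5, {"holding_cost": -30.0,
                             "backlog_cost": -2.5,
                             "sales_profit": 20.0}))
```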

[1] Wang, Jianhong, et al. "Multi-agent reinforcement learning for active voltage control on power distribution networks." Advances in Neural Information Processing Systems 34 (2021): 3271-3284.

[2] Roziere, Baptiste, et al. "Code llama: Open foundation models for code." arXiv preprint arXiv:2308.12950 (2023).

[3] Nejjar, Mohamed, et al. "Llms for science: Usage for code generation and data analysis." Journal of Software: Evolution and Process (2023): e2723.

[4] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In ICLR, pp. 1–33, 2023.

[5] Zhu, Xizhou, et al. "Ghost in the minecraft: Generally capable agents for open-world environments via large language models with text-based knowledge and memory." arXiv preprint arXiv:2305.17144 (2023).

Comment

We have incorporated all changes in the supplementary material. In response to your concerns, we gave the following answers:

  1. For weakness 1: While simply increasing resource consumption did not significantly improve performance, we demonstrated that eSpark could benefit from the increased computational resources (Tables 5 and 6).
  2. For weakness 2: While theoretical guarantees are difficult due to LLM stochasticity, we provided experimental evidence (Sections 5.2 and 5.3, Figure 2) showing eSpark learns interpretable and superior policies, validating its effectiveness.
  3. For weakness 3: We added experiments on an active voltage control environment, demonstrating strong generalizability and consistent superiority of eSpark (Figure 23).
  4. For question 1: We designed a more compelling example to illustrate the advantages of using LLMs for generating exploration functions.
  5. For question 2: We include training curves showing the performance of eSpark with respect of the iterations (Figures 22 and 23).
  6. For question 3: We clarified that reward components are already provided as feedback in addition to the total reward (Sections 4.3 and 5.5).

We hope our answers can address your concerns, and we look forward to your feedback.

Comment

Dear AC and Reviewers,

We sincerely appreciate the insightful questions and valuable feedback provided by all reviewers. We're delighted that our reviewers recognize the motivation behind eSpark's innovative use of LLMs for action space pruning. We've addressed each individual concern raised by the reviewers in separate, detailed responses. To enhance the paper's clarity and readability, we have made the following revisions, which will be incorporated into the next version:

  1. We clarified several reviewer misunderstandings about the method and experiments.
  2. We added a more challenging example to illustrate the advantages of LLM-generated exploration functions.
  3. We summarized more recent LLM-based approaches in the related work.
  4. We added experimental results of our framework combined with the MAPPO algorithm.
  5. We added some visualizations of training curves comparing eSpark and baselines in the MABIM environment.
  6. We added experimental results of eSpark and baselines in a novel active voltage control environment.

The attached PDF of the additional experimental results can be found in the supplementary material. We appreciate your feedback and look forward to your further comments.

Thank you,

The Authors

AC Meta-Review

This paper considers tasks within the MARL framework, pointing out that the joint action space grows rapidly with the number of agents, and proposes an LLM-based method that prunes redundant or irrelevant state-action pairs from the joint action space. Experiments comparing the method to several other methods show that it leads to substantial performance gains.

Strengths

Reviewers mentioned the innovative use of LLMs for this task, compelling scalability, generalization, and performance gains, clear writing, the absence of complex prompt engineering, and good ablation studies.

Weaknesses

Reviewers mentioned the high training costs, lack of theoretical guarantees, limited number and complexity of experimental environments, lack of in-depth analysis and discussion, missing comparisons to existing LLM-based methods, unclear effectiveness with heterogeneous agents, and heavy dependence on LLM output quality with no clear strategy for mitigating LLM errors.

Overall, several reviewers remained unconvinced after the author response period. I think the remaining unaddressed concerns include missing comparisons with LLM-based baselines (despite the challenges in doing so), no results with heterogeneous agents, and the limited number of experimental environments (I agree with the authors' response that the complexity of those included is sufficient). While the topic is of interest to some of the ICLR community, in its current state, this paper is not an obvious case for acceptance due to these issues.

Additional Comments from Reviewer Discussion

The reviewers meaningfully engaged with the author responses and drew the conclusions mentioned above. I described how their final conclusions were weighed in my final decision above.

Final Decision

Reject