Towards Complete Expressiveness Capacity of Mixed Multi-Agent Q Value Function
Reviews and Discussion
This paper aims to tackle the issues of value decomposition. Initially, it explores the expressive capacity of linear mixing functions and then develops a decomposition structure to achieve complete representational capabilities. Additionally, it identifies and offers a solution to the problem of optimal representation interference. Experimental results substantiate the efficacy of the proposed method.
Strengths
The paper presents a substantial and meticulously detailed theoretical derivation and proof. The proposed method for constructing complete representational structures is interesting and novel.
Weaknesses
The paper suffers from some unclear expressions and inconsistencies with prior research. The experimental evaluation is lacking. See the questions below for details.
Questions
- The relationship between this paper and [1] requires further elucidation. What is the connection between "the LMF in an indecomposable MMDP has an unbiased TD target in sarsa" and "the on-policy LMF with fitted Q-iteration has at least one fixed-point Q value that derives the optimal policy"? It appears that the results in [1] might be stronger.
- The paper mentions that addressing single-step matrix games is sufficient to solve for the optimal policy, but only for sarsa. Does this imply that the theoretical results can only be upheld through on-policy training, or is this just a conclusion unrelated to the rest of the paper?
- Why is Eq.6 used to define complete representational capacity in this paper? The meaning of complete representational capacity under IGM remains unclear. Is this consistent with prior works? QTRAN is known to be necessary and sufficient for IGM, but it has incomplete expressiveness in the context of this paper. Does this imply that methods with incomplete expressiveness can also theoretically guarantee optimal outcomes?
- Confusion about "monotonic mixing" and "strictly monotonic mixing". I think the original "monotonic" in previous work means "strictly monotonic" in this work, rendering the "monotonic" in this work redundant. To the best of the reviewer's knowledge, none of the existing methods under IGM are non-monotonic. In fact, non-monotonic mixing might not satisfy IGM.
- The paper introduces multiple mixers to cover a wider range of mixing functions. This needs further clarification. Is this theoretically necessary, or is it merely a technique for improved performance? It would be valuable to conduct an ablation study on different numbers of mixers in SMAC.
- The meaning of Eq.11 requires further explanation. Why is the optimal representation ratio defined as such, and what does a low w* signify in terms of ORI?
- The final algorithm, particularly the expression of Q_tot, is unclear. How are Eq.12 and Eq.13 applied in Eq.10?
- Does the final algorithm possess complete expressiveness, or is it necessary and sufficient for IGM?
- Why is the single-step matrix game trained in an on-policy manner? Can the proposed method converge to the optimum through off-policy training? Additionally, can QTRAN achieve convergence to the optimum in this game?
[1] Wang et al. Towards understanding linear value decomposition in cooperative multi-agent q-learning. 2020.
1. Relationship between this paper and a previous work
This paper and a previous work [1] investigate the representational limitation of LMF from different perspectives and reach different conclusions.
[1] focuses on the joint Q value function of LMF, exploring the properties of LMF under the representational limitation. Our paper focuses on the action-value function of MARL tasks, exploring which task properties predict the representational limitation.
The conclusion "the on-policy LMF with fitted-q-iteration has at least one fixed-point Q-value that derives the optimal policy" refers to that multi-agent Q-learning with LMF has a convergent region, in which all included value function induces optimal actions.
The conclusion "the LMF in indecomposable MMDP has unbaised TD target in sarsa" refers to that the representational limitation does not change the TD target of sarsa value iteration, which indicates the effect of representational limitation is limited within current time-step. In this case, the task can be decomposed by time steps into multiple one-step games.
The differences between the two conclusions lie in:
- the former conclusion is about Q-learning while the latter conclusion is about sarsa;
- the former conclusion describes the convergence property of LMF, while the latter describes the possibility of decomposing the task by time steps.
We will add a discussion of the differences between [1] and our work to the related work section.
2. Conclusion that "addressing single-step matrix games is enough to solve the optimal policy for sarsa"
This is a conclusion unrelated to the rest of the paper.
3.1 Eq.6 as the definition of complete representational capacity
As defined in Eq.5, we use the value range of a function to represent its "representational capacity". A function with complete representational capacity should be able to represent any value that the action-value function can take. Notice that the optimal action-value is the largest. Therefore, for a mixing function with complete representational capacity, we require that its value range covers the optimal action-value; this is the requirement formalized in Eq.6.
More details about Eq.6:
Note that Eq.6 cannot be satisfied by SMMFs. A detailed proof is available in Section 4.1. Here we provide an intuitive example (a small numerical check follows the list):
- Problem setting: agent number = 2, local action space = $\{1, 2\}$, one state (omitted).
- Assume the greedy joint action is $(1, 1)$. We then have $Q_1(1) \ge Q_1(2)$ and $Q_2(1) \ge Q_2(2)$.
- According to the definition of SMMF (Eq.4), the mixing function $f$ is increasing in each local Q value, so $f(Q_1(1), Q_2(2)) \ge f(Q_1(2), Q_2(2))$ and $f(Q_1(2), Q_2(1)) \ge f(Q_1(2), Q_2(2))$,
- which equals: $Q_{tot}(1, 2) \ge Q_{tot}(2, 2)$ and $Q_{tot}(2, 1) \ge Q_{tot}(2, 2)$,
- which violates Eq.6, because the joint Q values of actions sharing a greedy local action can never be represented as lower than that of the fully non-greedy action.
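To make this concrete, here is a minimal numerical check of the ordering constraints implied by any monotonic mixing once the greedy local actions are fixed. The helper name `violates_smmf_ordering` and the example payoff matrix are illustrative additions of ours, not taken from the paper.

```python
import numpy as np

def violates_smmf_ordering(payoff, greedy=(0, 0)):
    """Check a necessary ordering condition implied by any monotonic mixing.

    If the greedy local actions are (a1*, a2*), then Q_1(a1*) >= Q_1(a1) and
    Q_2(a2*) >= Q_2(a2), so a mixing function that is (strictly) increasing in
    each local Q value must output
        Q_tot(a1*, a2) >= Q_tot(a1, a2)  and  Q_tot(a1, a2*) >= Q_tot(a1, a2).
    A target payoff matrix breaking these inequalities cannot be represented
    exactly by an SMMF (hypothetical helper, illustrative only).
    """
    a1s, a2s = greedy
    n1, n2 = payoff.shape
    for a1 in range(n1):
        for a2 in range(n2):
            if a1 != a1s and payoff[a1s, a2] < payoff[a1, a2]:
                return True
            if a2 != a2s and payoff[a1, a2s] < payoff[a1, a2]:
                return True
    return False

# Classic non-monotonic payoff: the optimum is (0, 0), but deviating alone is
# heavily punished, so payoff[0, 1] < payoff[1, 1], breaking the ordering above.
payoff = np.array([[8.0, -12.0],
                   [-12.0, 0.0]])
print(violates_smmf_ordering(payoff))  # True
```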
3.2 Discussion of QTRAN: optimal outcomes without complete expressiveness
Although QTRAN can theoretically learn the optimal policy without complete expressiveness, it is still less effective than methods with complete expressiveness.
The mixing function defines the mapping from the local Q value functions $Q_i$ to the joint Q value function $Q_{tot}$. In QTRAN, the decomposed joint value is the sum of the local Q values, i.e., $Q_{tot} = \sum_i Q_i$. Therefore, QTRAN is an application of LMF.
As we discussed in the related works (Appendix G.2), QTRAN applies the LMF in a special way: it introduces two auxiliary networks to train the joint Q value function. Firstly, QTRAN introduces a central joint Q value function $Q_{jt}$ (with complete expressiveness) to approximate the action-value function. Secondly, QTRAN introduces a central state value function $V_{jt}$ to represent the difference between $Q_{jt}$ and the linearly mixed $\sum_i Q_i$. The linearly mixed value is trained with the target $Q_{jt} - V_{jt}$ and is not updated for non-greedy actions once the corresponding QTRAN constraint is already satisfied.
To sum up, QTRAN does not apply a mixing function with complete expressiveness. As a result, to overcome the non-optimality caused by the representational limitation, QTRAN requires auxiliary networks, which increase the error of value representation. Specifically, QTRAN needs to train the auxiliary networks $Q_{jt}$ and $V_{jt}$ first, and then train the decomposed joint Q value function $\sum_i Q_i$ with the target $Q_{jt} - V_{jt}$; the training errors of $Q_{jt}$ and $V_{jt}$ accumulate into the training of $\sum_i Q_i$. By contrast, a joint Q value function with complete expressiveness can be trained end-to-end directly by value iteration, which is more efficient.
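For reference, the QTRAN conditions paraphrased above are restated below in the standard notation of Son et al. (2019); this block is our addition for clarity and is not text from the paper.

```latex
% QTRAN (Son et al., 2019): the linear decomposition \sum_i Q_i satisfies the IGM
% with respect to the central joint value Q_jt if
\sum_{i=1}^{N} Q_i(\tau_i, u_i) - Q_{jt}(\boldsymbol{\tau}, \boldsymbol{u})
  + V_{jt}(\boldsymbol{\tau})
\begin{cases}
  = 0,   & \boldsymbol{u} = \bar{\boldsymbol{u}} \ \text{(greedy joint action)}, \\
  \ge 0, & \text{otherwise},
\end{cases}
\quad\text{where}\quad
V_{jt}(\boldsymbol{\tau}) = \max_{\boldsymbol{u}} Q_{jt}(\boldsymbol{\tau}, \boldsymbol{u})
  - \sum_{i=1}^{N} Q_i(\tau_i, \bar{u}_i).
```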
We will add the discussion above to the related work section.
Reference:
[1] Wang et al. Towards understanding linear value decomposition in cooperative multi-agent q-learning. 2020.
4.1 Confusion about "monotonic" and "strictly monotonic"
Only the concept of "monotonic" exists in previous works, introduced by QMIX; let us denote it by "original monotonic" (OM for brevity). The definition of OM does not match how the term is actually used, which is why we separate the concept into two parts, i.e., "monotonic" (M for brevity) and "strictly monotonic" (SM for brevity).
The OM in previous works equals M in definition but actually refers to SM. In most cases, the OM in previous works refers to the QMIX mixing function. In QMIX, the OM is defined as $\partial Q_{tot} / \partial Q_i \ge 0$, which is the same as the definition of M (Eq.3) in our paper. But according to our definition of SM (Eq.4), the mixing function of QMIX (which corresponds to the SMMF in Table 2) is actually SM.
We will clarify the intention of separating OM into M and SM in Section 2.2.
4.2 All OM functions satisfy the IGM
As we discussed in Section 2.2, "strictly monotonic mixing functions naturally satisfy the IGM (the proof is available in Appendix A)". Since the OM in previous works actually refers to SM, as the reviewer mentioned, all OM functions satisfy the IGM from the viewpoint of previous works.
4.3 Redundant definition of "monotonic"
We need to find M but non-SM functions because:
- M functions have better convergence properties than non-M functions (as shown in Fig.10).
- SM functions suffer from the representational limitation due to the bounded difference.
Such functions cannot be described without the definition of M. A simple illustrative example is given below.
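To make the distinction concrete, here is a minimal example of ours (assuming Eq.3 and Eq.4 are the usual non-negative versus strictly positive partial-derivative conditions, and not necessarily the construction used in the paper): the element-wise maximum is M but not SM, since the output is insensitive to every non-maximal local Q value.

```latex
% Assumed forms of the two conditions:
%   M  (Eq.3):  \partial Q_{tot} / \partial Q_i \ge 0 \ \text{for all } i
%   SM (Eq.4):  \partial Q_{tot} / \partial Q_i > 0   \ \text{for all } i
% An M but non-SM mixing function:
Q_{tot}(s, \boldsymbol{u}) = \max_{i} Q_i(\tau_i, u_i),
\qquad
\frac{\partial Q_{tot}}{\partial Q_j} = 0 \ \ \text{whenever } Q_j < \max_i Q_i .
```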
5. Intention of introducing multiple mixers
We introduce multiple mixers for two reasons:
- Multi-channel mixing (i.e., mixing by multiple mixers) enables the MUD structure to cover a broader range of mixing functions; for example, QPLEX applies n-channel mixing (a generic sketch follows this list).
- Multi-channel mixing sometimes performs better than single-channel mixing. We will supplement the results of empirical studies on this question in the Appendix.
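As an illustration of what we mean by multi-channel mixing, here is a generic sketch (hypothetical code, not our exact MUD or the QPLEX implementation): several mixers are applied in parallel to the same local Q values, and their outputs are combined by state-dependent weights.

```python
import torch
import torch.nn as nn

class MultiChannelMixer(nn.Module):
    """Generic K-channel mixing: K parallel mixers over the same local Q values,
    combined by state-dependent weights. Illustrative sketch only; the actual
    MUD/QPLEX mixers differ in how each channel is constrained."""

    def __init__(self, n_agents, state_dim, n_channels=4, hidden=32):
        super().__init__()
        # Each channel is a small MLP mapping the n local Q values to a scalar.
        self.channels = nn.ModuleList([
            nn.Sequential(nn.Linear(n_agents, hidden), nn.ReLU(), nn.Linear(hidden, 1))
            for _ in range(n_channels)
        ])
        # State-conditioned combination weights over the channels.
        self.weight_net = nn.Linear(state_dim, n_channels)

    def forward(self, local_qs, state):
        # local_qs: (batch, n_agents), state: (batch, state_dim)
        outs = torch.cat([c(local_qs) for c in self.channels], dim=-1)  # (batch, K)
        w = torch.softmax(self.weight_net(state), dim=-1)               # (batch, K)
        return (w * outs).sum(dim=-1, keepdim=True)                     # (batch, 1)
```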
6. Further explanation of Eq.11 (optimal representation ratio)
We define the optimal representational ratio as the relative representational weight of the optimal action-value.
According to the illustration of representational cross interference in Fig.4, the optimal action-value and the other action-values shape each other through their gradients on the common (shared) terms. Therefore, the gradient on a common term can be viewed as the representational weight of the corresponding action-value. Besides, the representational weight of an action-value also depends on the probability with which it is sampled.
In Eq.11, w* denotes the relative representational weight of the optimal action-value on a shared term, where:
- the expectation of the gradient measures the representational weight of the optimal Q value;
- the denominator is used for normalization, and the average relative representational weight is computed across all shared terms.
A larger optimal representation ratio indicates a larger relative representational weight of the optimal action-value, in which case the representation of the optimal action-value is less affected by the others. Conversely, a low w* indicates that the representation of the optimal action-value is strongly interfered with by the others, i.e., severe ORI. We will supplement more details of w* in our paper.
7. Unclear final algorithm
Taking Eq.12 as an example, it expands into a form whose components can be directly implemented by an MLP; this form is equivalent to the structure of Eq.10, so Eq.12 (and, analogously, Eq.13) directly instantiates Q_tot in Eq.10. Strictly monotonic mixing functions can be compounded in the same way.
We will supplement the implementation details of our proposed mixing functions in the Appendix.
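For concreteness, here is a hedged sketch of the kind of MLP-based strictly monotonic mixing referred to above, with positive weights produced by hypernetworks in the spirit of QMIX. This is an illustrative sketch under our own assumptions, not the exact code of Eq.12 or Eq.13.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StrictlyMonotonicMixer(nn.Module):
    """A mixer whose output is strictly increasing in every local Q value:
    all mixing weights are forced to be strictly positive and the activation
    (ELU) has a strictly positive derivative. Illustrative sketch only."""

    def __init__(self, n_agents, state_dim, embed=32, eps=1e-6):
        super().__init__()
        self.n_agents, self.embed, self.eps = n_agents, embed, eps
        # Hypernetworks: state-dependent weights and biases.
        self.w1 = nn.Linear(state_dim, n_agents * embed)
        self.b1 = nn.Linear(state_dim, embed)
        self.w2 = nn.Linear(state_dim, embed)
        self.b2 = nn.Linear(state_dim, 1)

    def forward(self, local_qs, state):
        # local_qs: (batch, n_agents), state: (batch, state_dim)
        w1 = torch.abs(self.w1(state)).view(-1, self.n_agents, self.embed) + self.eps
        b1 = self.b1(state).unsqueeze(1)                            # (batch, 1, embed)
        h = F.elu(torch.bmm(local_qs.unsqueeze(1), w1) + b1)        # (batch, 1, embed)
        w2 = (torch.abs(self.w2(state)) + self.eps).unsqueeze(-1)   # (batch, embed, 1)
        q_tot = torch.bmm(h, w2) + self.b2(state).unsqueeze(-1)     # (batch, 1, 1)
        return q_tot.squeeze(-1)                                    # (batch, 1)
```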
8.1 Whether the final algorithm possesses complete expressiveness
The final algorithms have complete representational capability because they apply the MUD structure.
8.2 Whether the final algorithm is necessary and sufficient for IGM
The final algorithms are sufficient but not necessary for IGM. As shown in Table 1 (Appendix G.2), the final algorithms have several good properties, such as monotonicity, satisfying the IGM, complete representational capacity, and addressing ORI.
9.1 On-policy training of the matrix game
The condition of an on-policy data distribution is only necessary for Theorem 3.3. In other words, on-policy training is not necessary for the proposed methods, which are also able to converge to the optimum under off-policy training.
9.2 Performance of QTRAN
In our tests, QTRAN is able to tackle this task. As discussed in our response to Question 3.2, QTRAN is an application of LMF. We do not present the evaluation results of QTRAN on this task because we only evaluate the performance of different mixing functions here, and the mixing function of QTRAN (LMF) is already included.
We will supplement the evaluation results of QTRAN and the corresponding discussion.
This paper focuses on addressing the important problem of representational limitation in value decomposition methods for Multi-Agent Reinforcement Learning (MARL) problems. The contributions of the paper are as follows:
- The paper defines the circumstances under which the linear representational limitation occurs, specifically for the Linear Mixing Function (LMF).
- It introduces a two-stage mixing framework called Mixing for Unbounded Difference (MUD) that addresses the representational limitation by ensuring complete representational capacity under the Individual-Global-Max (IGM) constraint.
Strengths
This paper exhibits several notable strengths across multiple dimensions.
Originality
- The introduction of the Mixing for Unbounded Difference (MUD) framework, as far as the reviewer is concerned, represents a new solution to the representational limitation issue (although it shares some similarities with other methods; see the weaknesses section).
Quality
- The proposed MUD framework is supported by mathematical reasoning, enhancing its quality as a potential solution.
Weaknesses
- The first contribution about LMF and the second contribution on SMMF seem to be disconnected.
- The findings about LMF are not very surprising, as previous work indicates its limitations. What could be interesting is why LMF can work well (empirically) on many tasks, especially when used with gradient-based RL methods.
- Why do the experiments demonstrate an advantage of the proposed method on SMAC, while the performance is similar to the baselines on the relatively easy Predator-prey task?
- (*) The proposed MUD shares similarities with the QPLEX framework, so the empirical comparison is very important. The reviewer is curious why the performance of QPLEX on SMAC is significantly different from what was reported in the original paper.
4.1 Please discuss the difference from QPLEX in detail.
- (*) The representational interference problem is not unique to the MUD framework and has been discussed by previous work [1].
(4 and 5 are the main reasons for the overall negative score.)
[1] Ye, J., Li, C., Wang, J. and Zhang, C., 2022. Towards global optimality in cooperative marl with sequential transformation. arXiv preprint arXiv:2207.11143.
Questions
Please see the weakness section.
1. Separated contributions
The two contributions are logically coherent in the flow of the paper. Our paper focuses on the representational limitation of the Strictly Monotonic Mixing Function (SMMF), for which the answers to two key questions remain unclear to the community. The first question is: is the representational limitation a frequent problem for SMMF? If so, it is necessary to solve it, which leads to the second question: how does the limitation arise and how can it be solved?
Our first contribution answers the first question, from which we conclude that the representational limitation is a frequent and notable problem for LMF. Our second contribution answers the second question: we prove that the representational limitation of SMMF arises from the bounded difference and propose an unbounded mixing framework.
2.1 Not quite surprising findings about LMF
Although previous works indicate the limitation, we are the first to give a necessary and sufficient condition for the occurrence of the limitation. The novelties of our findings lie in:
- We provide a perspective that the representational capacity of a value function depends not only on the function expression but also on the task settings, and we propose an approach to investigate whether the task settings match the value function in reinforcement learning.
- Since we have proved a necessary condition for the limitation, given the task settings, we can directly judge whether the limitation occurs from the definition of the decomposable MMDP.
- We also reach some interesting conclusions, for example: the representational limitation does not change the TD target of sarsa value iteration, which indicates that the effect of the representational limitation is confined to the current time step. In this case, the task can be decomposed by time steps into multiple one-step games.
2.2 Why LMF works well empirically
LMF works well empirically on many tasks because of two good properties, which have been discussed in our paper.
- LMF satisfies the IGM, which ensures the decentralized executions (according to the largest local Q values) always bring the best estimated group interests (i.e., the largest joint Q values).
- LMF is monotonic. As shown in Fig.10 (Appendix H), monotonic mixing functions have better convergence properties than non-monotonic ones.
3. Advantages of proposed methods on SMAC but similar performance on predator-prey
This is because the two benchmarks evaluate different aspects. Experiments on both benchmarks demonstrate the advantages of our algorithms from different perspectives.
SMAC is a complex task involving a variety of strategies, such as focusing fire, adjusting formation, and utilizing terrain and unit features. These rich strategies widen the performance tiers and gaps between algorithms.
By contrast, predator-prey is a task with a single strategy, i.e., capturing simultaneously. As a result, algorithm performance is two-tiered: algorithms that master this strategy achieve similar performance with the largest return, while algorithms that fail to master it receive a return of 0.
Our algorithms are the only ones that succeed under a punishment of -5, although they perform similarly to some of the others under a punishment of -2.
4.1 Performance difference of QPLEX from the original paper
We use the QPLEX code from https://github.com/oxwhirl/pymarl, which is a widely used MARL library. QPLEX is trained and tested under the default settings. As reported by the authors of the library in the WQMIX paper [2], QPLEX does not perform as well as in its original paper.
The poor performance of QPLEX could arise from:
- We do not have the training seeds of the original curves. As can be seen from the shaded region of the QPLEX training curve in Fig.13, the performance of QPLEX shows a large variance on SMAC.
- Although QPLEX is a significant work that first achieves complete expressiveness under the IGM constraint, it has some problems, as discussed in our response to Question 4.2.
4.2 The difference between the proposed MUD and QPLEX
Both QPLEX and our methods apply the MUD structure. The differences between QPLEX and our methods lie in:
- QPLEX suffers from low sample efficiency. The representation of action-values with poor performance contributes little to the optimal policy and should be assigned small weights. However, QPLEX treats all action-values equally, leading to low sample efficiency. By contrast, our methods are more efficient since they put larger weights on the action-values with better performance (a generic reweighting sketch follows this list).
- QPLEX does not address the problem of optimal representational interference. Our proposed MUDs address the problem by reweighting the representation of action-values.
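As a generic illustration of the kind of reweighting meant above (a hypothetical sketch, similar in spirit to the weighting used in WQMIX [2], and not the exact rule of our methods): samples whose target falls below the current greedy estimate are regressed with a smaller weight.

```python
import torch

def weighted_td_loss(q_tot, td_target, q_tot_greedy, alpha=0.5):
    """Down-weight the regression of samples whose target is below the current
    greedy estimate. Hypothetical illustration of reweighting action-values,
    not the exact scheme of the proposed methods."""
    with torch.no_grad():
        better = (td_target >= q_tot_greedy).float()
        w = better + alpha * (1.0 - better)  # weight 1 for "good" samples, alpha otherwise
    return (w * (q_tot - td_target) ** 2).mean()
```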
We will add a detailed discussion of these differences to the related work section.
5. Representational interference has been proposed by previous works
The problem proposed and addressed by TAD [1] (the paper mentioned by the reviewer) is different from the Representational Cross Interference (RCI) proposed in our paper.
In a nutshell, TAD reveals a convergence problem of decentralized execution under gradient descent. TAD finds that the policy can be trapped in local optima under a small learning rate when initialized in the following case: better coordination requires updating the local policies of more than one agent, while updating each agent's local policy in a decentralized way could result in worse coordination.
By contrast, RCI describes the cross interference between the representations of different action-values: according to Fig.4, the optimal action-value and the other action-values shape each other through their gradients on the common (shared) terms.
To sum up, the differences between the Sub-Optimal Convergence (SOC) problem proposed by TAD and the RCI proposed in our paper lie in:
- SOC depends mainly on initialization and learning rate, while RCI depends mainly on sample distribution and the structure of mixing function.
- SOC occurs when updating the policies involving more than one agent, while RCI occurs under samples with shared local actions.
We will add the discussion of TAD to the related work section.
References:
[1] Ye, J., Li, C., Wang, J. and Zhang, C., 2022. Towards global optimality in cooperative marl with sequential transformation. arXiv preprint arXiv:2207.11143.
[2] Tabish Rashid, Gregory Farquhar, Bei Peng, and Shimon Whiteson. Weighted qmix: Expanding monotonic value function factorisation for deep multi-agent reinforcement learning. Advances in neural information processing systems, 33:10199–10210, 2020a.
The paper studies the problem of representational limitation in value decomposition methods for fully cooperative multi-agent reinforcement learning. The authors propose the novel view that the representational limitation arises from two different sources: the task property and the mixing function.
The authors theoretically prove that LMF is free from the representational limitation only in a rare class of MARL problems, and that SMMF suffers from the representational limitation due to the bounded difference between the outputs of the greedy and current actions.
To address these issues, the authors also propose a new mixing-for-unbounded-difference framework and test it with experiments on several different environments.
Strengths
This paper studied a very basic problem in value-based multi-agent reinforcement learning, and theoretically proved the limitations of both the problem itself and existing methods.
Weaknesses
The writing quality could be further improved. It's a little bit hard to follow the paper currently.
Questions
- There are some other combination types that are stronger than IGM but not perfect. Please discuss this part.
Discussion of other mixing types
We have discussed recent value decomposition methods with monotonic mixing functions. As concluded in Table 1, all these methods satisfy the IGM. The IGM ensures that the decentralized executions (according to the largest local Q values) always bring the best estimated group interests (i.e., the largest joint Q values).
Although IGM has become a general property of most recent value decomposition methods, there are also works that decompose the joint Q value function in special ways without the IGM constraint.
For example, Q Path Decomposition (QPD) [1] decomposes the joint Q value function with the integrated-gradient attribution technique, which directly decomposes the joint Q values along trajectory paths to assign credit to agents. Different from other value decomposition works with an explicit mixing function, QPD does not restrict the representational relation between the joint Q value function and the local Q value functions; therefore, QPD does not satisfy the IGM. QPD provides a unique perspective for value decomposition, learning the decomposition instead of designing the mixing function manually.
Model-Based Value Decomposition (MBVD) [2] proposes an implicit model-based MARL method built on value decomposition. With MBVD, agents can interact with a learned virtual environment and evaluate the current state value according to imagined future states in the latent space. The imagined future states are predicted from the joint historical trajectories, which include all agents' current actions. As a result, the local Q value function depends on the joint action, which could violate the IGM. MBVD introduces an environment model into value decomposition, which empowers agents with the capability of foresight.
We will add these discussions to the related work section of our paper.
References
[1] Yang, Y., Hao, J., Chen, G., Tang, H., Chen, Y., Hu, Y., Fan, C., and Wei, Z. Q-value path decomposition for deep multiagent reinforcement learning. In International Conference on Machine Learning, pp. 10706–10715. PMLR, 2020a.
[2] Xu, Z., Zhang, B., Zhan, Y., Baiia, Y., & Fan, G. (2022). Mingling Foresight with Imagination: Model-Based Cooperative Multi-Agent Reinforcement Learning. Advances in Neural Information Processing Systems, 35, 11327-11340.
This paper considers the value decomposition approach for fully cooperative MARL, and in particular focuses on the Strictly Monotonic Mixing Function (SMMF). The paper first theoretically shows that the Linear Mixing Function (LMF), a special case of SMMF, has limited expressiveness. Then, the paper proposes a new two-stage mixing framework to improve the expressiveness of SMMF and empirically demonstrates the effectiveness of the proposed methods. After reading the paper, the reviews, and the author responses, I believe several major issues remain inherent to this paper, and thus I recommend rejection:
1. The theoretical result is rather disconnected from the second part of the algorithm. The theoretical result only shows that LMF (a special case of SMMF) has limited representation power; this does not imply that SMMF also has limited power, and thus it does not justify the need to improve the representation power of SMMF, which is the second part.
2. As reviewers pointed out, there is a significant inconsistency between the performance of an existing algorithm shown in this paper and the performance reported in the prior paper, especially regarding QPLEX. Instead of speculating that the reason is random seeds and high variance, the authors are encouraged to reach out to the authors of QPLEX to investigate the reason for the difference and make sure there is no error in the implementation or hyperparameter choice.
3. The writing of this paper should be significantly improved.
Why not a higher score
Theory and empirical results are disconnected. The empirical results are inconsistent with prior papers. The writing is poor.
Why not a lower score
N/A
Reject