PaperHub · ICLR 2024

Overall rating: 5.5/10 — Rejected (4 reviewers)
Individual ratings: 8, 3, 5, 6 (min 3, max 8, std 1.8)
Confidence: 2.8

Best Possible Q-Learning

OpenReview · PDF
Submitted: 2023-09-22 · Updated: 2024-02-11

Abstract

Keywords
multi-agent reinforcement learning

Reviews and Discussion

Review (Rating: 8)

The authors propose a strategy called best possible Q-learning for agents learning the optimal joint policy in fully decentralized learning. Under this strategy, when learning the Q function, each agent assumes that after it chooses its action, the transition of the $N$-agent environment will be the "best case", so that the expected return is maximized. The authors prove the convergence of such an elegant strategy to the global optimal joint policy (under some assumptions). The authors also provide a simplified version of this strategy that is more computationally attractive.

Strengths

The writing is clear, which makes it enjoyable to read. Even under the restrictions of deterministic policies and a unique optimal joint policy, it is still impressive that the global optimal policy can be computed with such an elegant best possible Q-learning scheme.

Weaknesses

Lemma 2 is a crucial lemma for justifying the best possible operator. However, the second equality appears to be incorrect. Could the authors explain how it holds?

Questions

  • Is it a common setting that the reward function is only dependent on the current state and the next state (without the dependency on the action)? To me this is not commonly seen.

  • I think (3) (where $Q(s,\boldsymbol{a})$ is equipped with some arbitrary $\boldsymbol{a}$) is not the expected return of the optimal joint policy.

  • In the proof of lemma 1, I notice that there is an underlying assumption that the actions taken by other agents do not restrict the allowable actions of each agent. If this restriction is considered, does the best possible learning strategy still apply?

  • For the 3rd line under problem (6), could the authors indicate exactly where the reference is in (Puterman, 1994)?

Comment

Lemma 2 is a crucial lemma for justifying the best possible operator. However, the second equality appears to be incorrect. Could the authors explain how it holds?

We made a correction in the revision (the colored line): the $\max_{a_i}$ cannot be factored out as an equality. After the correction, the proof still holds because $a^{'*}_i$ maximizes the second term but does not maximize the first term.
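To spell out the generic fact behind this step (a general inequality, not a line quoted from the paper): for any two functions $f$ and $g$ of $a_i'$,

$$\max_{a_i'}\big(f(a_i') - g(a_i')\big) \;\ge\; \max_{a_i'} f(a_i') - \max_{a_i'} g(a_i'),$$

and equality need not hold when the two maxima are attained at different actions, so a single $\max_{a_i'}$ over the difference and the difference of the two maxima can only be related by an inequality.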

Is it a common setting that the reward function is only dependent on the current state and the next state (without the dependency on the action)? To me this is not commonly seen.

The setting is still general because if two joint actions lead to the same transition dynamics in the environment, the two actions should have the same effects, and we do not need to discriminate between the actions. The effects of actions are reflected in the distribution of $s'$.

Imagine that we are playing a computer game. There is a certain state where the left key and the right key can lead to the same next state or the same distribution of the next state. Do we need to give the right key a higher reward to encourage the players to always select the right key?

I think (3) (where $Q(s, \boldsymbol{a})$ is equipped with some arbitrary $\boldsymbol{a}$) is not the expected return of the optimal joint policy.

$Q(s, \boldsymbol{a})$ means the expected return if you start in state $s$, take an arbitrary joint action $\boldsymbol{a}$, and then forever after act according to the optimal policy in the environment. (https://spinningup.openai.com/en/latest/spinningup/rl_intro.html#value-functions)

In the proof of lemma 1, I notice that there is an underlying assumption that the actions taken by other agents do not restrict the allowable actions of each agent. If this restriction is considered, does the best possible learning strategy still apply?

Yes, fully decentralized learning requires the assumption that the agents can independently make decisions. If we consider the restriction on the allowable actions of each agent, we can train the agents in a modified MDP: if the selected joint action breaks the restriction, the episode ends and the agents receive a very large negative reward. Then the optimal joint policy in the modified MDP will be equal to the optimal joint policy in the original MDP, and the agents can independently make decisions in the modified MDP.
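A minimal sketch of the modified-MDP construction described above; the wrapper class, the `is_allowed` predicate, and the penalty value are illustrative assumptions, not part of the paper:

```python
class RestrictedJointActionEnv:
    """Wraps a multi-agent environment so that any joint action violating a
    restriction ends the episode with a large negative reward."""

    def __init__(self, env, is_allowed, penalty=-1e6):
        self.env = env                # underlying multi-agent environment (gym-style)
        self.is_allowed = is_allowed  # joint_action -> bool, encodes the restriction
        self.penalty = penalty        # large negative reward on violation

    def reset(self):
        return self.env.reset()

    def step(self, joint_action):
        if not self.is_allowed(joint_action):
            # Violation: end the episode with the penalty; the caller resets
            # the environment afterwards, as after any terminal step.
            return None, self.penalty, True, {}
        return self.env.step(joint_action)
```

With a large enough penalty, the optimal joint policy of the wrapped environment never selects a forbidden joint action, matching the argument above.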

For the 3rd line under problem (6), could the authors indicate exactly where the reference is in (Puterman, 1994)?

Theorem 7.1.9 of (Puterman, 1994) shows that there is at least one deterministic optimal joint policy. Therefore, when there is only one optimal joint policy, the optimal joint policy must be deterministic.

Comment

Thank you for addressing my comments in detail! I will raise my score.

Review (Rating: 3)

This paper studies decentralized multi-agent reinforcement learning (RL). It proposes a new operator, called the best possible operator, for updating the Q-values, along with a computationally efficient variation called the simplified best possible operator.

Analytically, both operators are shown to converge to the optimal joint policy.

Empirically, Q-learning based on such an operator is shown to achieve good performance across a number of tasks.

Strengths

  • Good empirical performance on a variety of tasks

  • A systematic method for updating Q-values in a decentralized multi-agent setting

Weaknesses

[Disclaimer: While reviewing this paper, I focused on the reinforcement learning aspects and the correctness of the theoretical analysis. It's worth noting that my exposure to multi-agent settings is limited. Therefore, I would lean toward the inputs from other reviewers to assess the novelty as to how this work fits within the existing body of multi-agent literature.]

  1. The proofs have inaccuracies which impede a clear understanding of the underlying logic.

a. Lemma 2:

this line appears to be incorrect:

$= \gamma P_i^*(s'|s, a_i) \max_{a_i'} \left(\max_{a_{-i}'} Q(s', a_i', a_{-i}') - Q^{k-1}(s', a_i')\right)$

The $\max_{a_i'}$ cannot be factored out here, since the maximizing actions can be different for the two Q-value terms inside the parentheses and there is a minus sign. Apart from this, the remainder of the lemma appears to be correct.

b. Lemma 3:

In the proof of Lemma 3, the second inequality appears to be valid. However, it seems to omit intermediate steps. The absence of these steps makes it challenging to follow the reasoning. Elaborating on these steps would enhance the clarity of the proof.

c. Lemma 4:

"Similar to the proof of Lemma 2, we can easily prove $\max_{a_{-i}} Q(s, a_i, a_{-i}) \leq Q^k_i(s, a_i)$."

While the result seems valid, it may not be as straightforward as implied, especially considering the nuances in this context. In particular, the simplified operator incorporates an additional max operator (8), when compared to the operator (6) used in Lemma 2. A more detailed elaboration of the steps involved in this proof would be beneficial.

  2. As I will detail in the following, the conditions in the convergence proof for both operators seem too strong to hold in practice. There seems to be a lack of sufficient attention to how these conditions might be met, thus leading to doubts about the practical relevance of the operators.

Both operators implicitly assume that the expectation w.r.t. the transition probabilities in (6, 7) can be computed exactly, at every step of the value update (i.e., $\forall k$).

Although the simplified operator is "simpler" due to not requiring the computation of expectations for every $\tilde{P_i}$, it still poses practical challenges. The need to compute the expectation for even one $\tilde{P_i}$ at every update step seems impractical.

In fact, an approximation error in computing the expectations, when combined with the max operation, could cause overestimation of the Q values. This issue, briefly mentioned in the text following Eq. (10), lacks a clear explanation in my opinion.

The approximation error would also lead to the violation of the Q value upper bound, undermining the convergence guarantees for both operators.

Suggestions for improvements:

Please correct me if I’m wrong in the above. If my understanding is correct, the authors might consider the following two ways to mitigate the concerns:

  1. revise the convergence analysis to account for the approximation error in computing the expectation
  2. modify the BQL algorithm such that the Q value estimates are both increasing and bounded above.

Minor:

  • Many plots use “reward” as the y-axis label. It should be “return” instead.

Questions

Can the authors please comment on my concerns listed in the weaknesses?

Comment

1.a. Lemma 2

Sorry for this mistake: the $\max_{a_i}$ cannot be factored out as an equality. We have corrected it in the revision (the colored line). After the correction, the proof still holds because $a^{'*}_i$ maximizes the second term but does not maximize the first term.

1.b. Lemma 3 and 1.c. Lemma 4:

For the second inequality of Lemma 3, we replace the expectation over transition probabilities with the infinity norm (which is larger). We will elaborate on the proofs of these two lemmas in the final version with additional pages.

  2. As I will detail in the following, the conditions in the convergence proof for both operators seem too strong to hold in practice. There seems to be a lack of sufficient attention to how these conditions might be met, thus leading to doubts about the practical relevance of the operators.

There are four questions in this part. For the first two questions: in the theoretical analysis, it is reasonable to assume that the expected values over transition probabilities are computed exactly at every step; even Q-learning makes such assumptions. In the implementation, as in DQN, we use the TD error with samples from a non-stationary replay buffer to fit the expected values under $\tilde{P}_i$. This approximation technique is commonly adopted, so we believe your main concern is the next two questions about the approximation error.

For the last two questions, the main concern is the overestimation caused by the approximation error. We use Eq. 10, the weighted max update, to overcome this problem: we can choose a large $\lambda$ to mitigate overestimation. This technique was proposed in Hysteretic IQL to solve the overestimation in Distributed IQL (as discussed in Appendix B), and has been verified to be practical in decentralized learning. This solution, similar to quantile regression, has been used in many RL methods, e.g., implicit Q-learning. Therefore, we follow this practical solution, and the experimental results show that it can avoid severe overestimation.
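For context, a minimal sketch of the hysteretic-style asymmetric update referenced here; this follows Hysteretic IQL's two-learning-rate rule rather than the exact form of the paper's Eq. 10, and the variable names and rates are illustrative:

```python
import numpy as np

def hysteretic_q_update(Q, s, a, r, s_next, alpha=0.1, beta=0.01, gamma=0.99):
    """One tabular update of agent i's Q-table Q[state, action]."""
    td = r + gamma * np.max(Q[s_next]) - Q[s, a]
    # Asymmetric step sizes: learn quickly from increases (optimism toward the
    # best observed outcomes) and slowly from decreases; the ratio beta/alpha
    # controls the degree of optimism, analogous to the role of lambda above.
    step = alpha if td >= 0 else beta
    Q[s, a] += step * td
    return Q
```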

Comment

Moreover, as the discussion is ending, we hope you will consider that the problem we solve, fully decentralized MARL, is much more challenging than the CTDE setting, which means that the complexity of our operator is necessarily high; we therefore have to use some practical techniques to trade off efficiency against theory (the approximation error you pointed out). We propose the first operator that theoretically converges to the optimal joint policy, and we make it more efficient than the practical baselines, which ignore convergence and optimality. From the perspectives of both theory and algorithm, we believe that BQL is a real contribution to this novel field: fully decentralized MARL. We hope you will consider this point when evaluating our work. Thank you very much!

Bests,

Authors

Comment

I thank the authors for their further clarifications.

To ensure clarity in my review, I'd like to recap my perspective on the paper:

  1. My initial review acknowledged the empirical performance as a strength, and my perspective on this aspect remains unchanged.
  2. However, I do have major concerns regarding the theoretical results and their presentation. Given that the theoretical analysis accounts for a significant portion of the paper and the convergence has been claimed as a main contribution, these concerns substantially affect my evaluation.

In my opinion, the presentation of the theoretical results tends to obscure the actual contribution.

The proposed operators in Sec. 2.2 and 2.3 essentially perform dynamic programming. The simplified operator in Eq. (7) can be interpreted as a form of (action-)value iteration, known to converge at a geometric rate in single-agent RL [5], extended to the multi-agent setting. Yet no reference to dynamic programming was made in the paper.
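For reference, the standard single-agent fact alluded to here, stated generically (not quoted from [5]): the Bellman optimality operator $\mathcal{T}$ is a $\gamma$-contraction in the sup norm, so value iteration $Q^{k}=\mathcal{T}Q^{k-1}$ satisfies

$$\|Q^{k}-Q^{*}\|_{\infty} \;=\; \|\mathcal{T}Q^{k-1}-\mathcal{T}Q^{*}\|_{\infty} \;\le\; \gamma\,\|Q^{k-1}-Q^{*}\|_{\infty} \;\le\; \gamma^{k}\,\|Q^{0}-Q^{*}\|_{\infty},$$

which is the geometric rate referred to.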

Regarding the algorithm BQL described in Sec 2.4, it uses sampled interaction data to approximate the expectation, which seems to have close connections to the fitted Q iteration literature. Yet again, no reference was given to that regard.

Furthermore, at the end of Sec. 2.4, it is claimed that "If $D_i$ sufficiently goes through all possible transition probabilities, $Q_i(s, a_i)$ converges to …". This is imprecise. From the fitted Q iteration literature, it is well known that such convergence does not hold in general (convergence has only been established under specific regression methods) and counter-examples have been shown. Chapter 3.3 in [5] has a nice summary of the relevant literature.

As to the main contribution of the work, it appears that the convergence only applies to the proposed operators, not the algorithm BQL itself. That renders the following claim incorrect: “BQL is the first decentralized algorithm that guarantees the convergence to the global optimum in stochastic environments”, from the last paragraph of the introduction section. Similarly, the claim at the end of related works section is also incorrect: “Our BQL is the first fully decentralized algorithm that converges to the optimal joint policy …”.

The paper currently lacks a convergence proof for the BQL algorithm itself, which seems feasible to provide for certain regressors, based on the similarity to fitted Q iteration.

In the current form of the paper, substantial revisions to the presentation and the claimed contributions seem necessary.

Until these issues are addressed, I have to respectfully maintain my initial score.

References

[5] Szepesvári, C. (2022). Algorithms for reinforcement learning. Springer Nature.

Comment

Thank you to the authors for their response. My apologies for the delayed reply; I hope there is still adequate time for further discussion.

I find the clarification regarding the proofs for Lemma 2-4 satisfactory, confirming my initial understanding. The issues identified appear to be presentation-related, which I believe can be rectified in subsequent revisions.

However, upon further review, I remain concerned about the theoretical framework.

Specifically, the assumption that the algorithm can exactly compute expectations with respect to transition probabilities is, in my opinion, overly restrictive and a significant limitation of the proposed method.

For instance, both the best possible operator (Eq. 6) and the simplified operator (Eq. 7) require exact computation of expectations at each Q-function update. And their convergence proofs rely on such computations.

The authors claimed that “it is reasonable to assume that the expected values over transition probabilities are computed exactly at every step”, and drew parallels with Q-learning, stating “Q-learning makes such assumptions”. I must respectfully disagree. The update rule of Q-learning does not entail such an expectation calculation. Moreover, the majority of convergence proofs in the Q-learning literature [1-4] utilize results from stochastic approximation without presuming the agent’s access to true transition probabilities.

This assumption raises a fundamental question: If the algorithm indeed has access to true transition probabilities, as implied in Equations 6 and 7, the necessity of a Q-learning-like method is questionable. In such a scenario, direct application of dynamic programming would seem more straightforward.

In light of these concerns on the theoretical results, my evaluation remains unchanged. However, I am open to revising my score should there be further clarification or evidence that contradicts my current understanding.

References:

[1] Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine Learning, 8, 279-292.

[2] Jaakkola, T., Jordan, M. I., Singh, S. P. (1994). On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6:1185–1201.

[3] Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185–202.

[4] Borkar, V. S., & Meyn, S. P. (2000). The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2), 447-469.

Comment

Thanks for your reply! We use the true transition probabilities in the proofs and use samples to approximate the values under the true transition probabilities in practice. We know your main concern is the approximation error. First, with more samples, the error will be small. Second, we use the weighted max update (Eq. 10) to avoid overestimation and meet the assumption. These approximation techniques are practical and work well in the experiments.

On the other hand, you have pointed out the essence of our method! BQL is only Q-learning-like; it is essentially dynamic programming. However, in practice, the agent has no access to the true transition probabilities, so we have to use samples to approximate the expected values, like other methods. So, in the proofs, we use the true transition probabilities to show convergence and optimality, while in the algorithm we use the above techniques to approximate the expected values and reduce the theoretical gap. We can see that although the complexity of the BQL operator is much higher than that of the IQL operator, the results show that the BQL algorithm is more efficient than the IQL algorithm.

In summary, our work is divided into two parts:

  • proposing operators with convergence and optimality
  • using practical techniques to extend the operators into an algorithm

We hope that dividing our work into two parts can address your questions.

Comment

Thanks for your reply. We would like to defend our method. It seems you have two concerns in the reply. Before "Furthermore, at the end of Sec. 2.4", the concern is the lack of references. However, the equation and proof are basic common knowledge in this field, and "using sampled interaction data to approximate the expectation" is the most common technique in this field. Both you and other reviewers can easily understand the method and theorems, so we believe the presentation is clear for researchers in this field.

After "Furthermore, at the end of Sec. 2.4", there are two questions. For the first one, when we mention convergence, we mean the ideal case where we can obtain the true transition probability from DiD_i. We know there is a estimation error but we have claimed the trade-off between efficiency and theory. For the second one, note the BQL has two versions. The Q-table version (Algorithm 1) strictly follows the operators, so we can make such a claim. The results in Figure 1(a) can demonstrate it. And for NN version (Algorithm 2), we know that even single-agent DQN with NN cannot guarantee the convergence.

Moreover, we want to ask a question. We propose the first operator with convergence and optimality and extend it into an algorithm, but other baselines cannot theoretically guarantee convergence and optimality even in the ideal case. Do you think our work is the best among these methods in terms of theory and practicality? If the answer is no, could you tell us which baseline is better than BQL and why? If the answer is yes, we would like to know why these baselines met the standard of a top-tier conference.

Comment

Let's take Eq. (2) in your paper as an example: it is the Bellman optimality equation for action values and not an update rule. However, the preceding text reads "the convergence of independent Q-learning", leading to confusion. It is unclear as to whether Eq. (2) is being described as part of independent Q-learning, or as the optimal Q value that should be converged to.

There is a whole sentence before and after Eq. (2): "the convergence of independent Q-learning is not guaranteed". Eq. (2) is exactly the value iteration of independent Q-learning, and it is not guaranteed to converge, which we have analyzed multiple times in this paper. We do not think this sentence is misleading.

And the algorithms that make use of them were referred to as value iteration, rather than Q-learning

The equations in our paper are exactly value iteration. We do not call them Q-learning. "Best Possible Q-learning" is just the name of our paper, to establish the relationship with and difference from existing methods: independent Q-learning, Hysteretic independent Q-learning, and Distributed independent Q-learning.

Concerning the comment that “Both you and other reviewers can easily understand the method and theorems”: this is somewhat subjective.

Yes, it is subjective. But the concern of presentation is a subjective topic, which is evaluated by whether the readers can easily understand the method and theorems.

one would need to have an infinite amount of data in each epoch

This is the trade-off between efficiency and theory which we have discussed in the replies above.

Indeed, proposing “the first operator with convergence and optimality” and developing a well-performing practical algorithm is commendable. However, as I have highlighted, that was not exactly how it was claimed in the paper.

This is exactly what we claimed in this paper. In the abstract and introduction, we claimed that "we propose best possible operator, a novel decentralized operator, and prove that the policies of agents will converge to the optimal joint policy" and "By instantiating the simplified operator, the derived fully decentralized algorithm, best possible Q-learning (BQL), does not suffer from non-stationarity." The structure of the method section follows the same logic: Sec. 2.2 and 2.3 present the operators, and Sec. 2.4 presents the algorithm. The structure is really clear.

I believe it is premature to draw comparisons with existing works, before addressing the aforementioned concerns.

The theorems have been supported by proofs, and the algorithm has been supported by experimental results. Therefore, it is important to compare with existing methods to defend our contribution.

Comment

Dear Reviewer jF9W,

We are sorry for bothering you repeatedly, but the remaining time is very limited.

Although you are not familiar with MARL, you have been able to thoroughly understand our paper, so we think the presentation is not a severe issue. We will revise the paper according to your suggestions in the final version.

Bests,

Authors

Comment

We are sorry for bothering you repeatedly, but the remaining time is very limited.

I understand the time-sensitive nature of this discussion. Sorry for not being able to engage in the discussion earlier.

Comment

It is less so about trade-off, but rather an overly strong assumption. In fact, it is not common in RL theory that an algorithm starts with collecting an infinite amount of data from the very first epoch/iteration.

Yes, collecting an infinite amount of data is a strong assumption. In practice, we use a replay buffer with a pre-defined size to estimate the values. So there is a trade-off (you can also call it a gap) between theory and practice, as we have discussed, and it was verified by experiments. In particular, the results in Figure 1(a) show that, without an infinite amount of data, BQL can converge to the optimal policy. In fact, we can still obtain the optimal policy under estimation error: if the difference between the optimal value and the second-best value is $\epsilon$, we can tolerate $\|Q-\hat{Q}\|_{\infty}<\epsilon/2$.
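Spelling out the last claim (a standard argument, included here for clarity): if at a state $s$ the optimal action $a^{*}$ satisfies $Q(s,a^{*}) - Q(s,a) \ge \epsilon$ for every $a \ne a^{*}$, and $\|Q-\hat{Q}\|_{\infty} < \epsilon/2$, then for any $a \ne a^{*}$

$$\hat{Q}(s,a^{*}) \;>\; Q(s,a^{*}) - \tfrac{\epsilon}{2} \;\ge\; Q(s,a) + \tfrac{\epsilon}{2} \;>\; \hat{Q}(s,a),$$

so the greedy action under $\hat{Q}$ is still the optimal action despite the estimation error.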

Thanks for your feedback! We will clarify the presentation in the revision to reduce the misleading parts, and we hope you can reconsider the score after the discussion.

Comment

Eq. (2) is exactly the value iteration of independent Q-learning

Well, that's certainly not how it was presented in the paper. And that has been my point.

The equations in our paper are exactly value iteration. We do not call them Q-learning.

My point was that the placement of Equation (2) immediately following the phrase "independent Q-learning" in the middle of a sentence, may lead to confusion.

the concern of presentation is a subjective topic, which is evaluated by whether the readers can easily understand the method and theorems.

I have been pointing out specific instances where the presentation is lacking.

This is the trade-off between efficiency and theory which we have discussed in the replies above.

It is less so about trade-off, but rather an overly strong assumption. In fact, it is not common in RL theory that an algorithm starts with collecting an infinite amount of data from the very first epoch/iteration.

This is exactly what we claimed in this paper.

In my previous response, I highlighted certain claims in the paper that I found to be incorrect. I believe that acknowledging and addressing these issues is crucial for strengthening the overall impression and credibility of your work.

The theorems have been supported by proofs

The issues have not been about the proofs.

Comment

As the discussion is ending, we would like to add one final point. As you said, the proofs of the operators are correct and the empirical performance is good. The remaining concern is the estimation error in the algorithm. Naturally, the influence of the estimation error on the algorithm should be verified using experiments, which is the main purpose of our experiments. Our algorithm achieves good performance, which demonstrates that it can overcome the estimation error. Our paper is not just a theory paper: we focus on a practical algorithm derived from theory.

Comment

Thanks for your prompt response. I appreciate the further discussions.

Regarding the comment that "the equation and proof are basic common knowledge in this field": I understand that this may refer to the equations (rather than the proofs) being well-known. However, I would like to respectfully point out that common knowledge should not preclude precise and accurate descriptions in a paper.

Let's take Eq. (2) in your paper as an example: it is the Bellman optimality equation for action values and not an update rule. However, the preceding text reads "the convergence of independent Q-learning", leading to confusion. It is unclear as to whether Eq. (2) is being described as part of independent Q-learning, or as the optimal Q value that should be converged to.

In comparison, similar equations in single-agent settings, as seen in the DQN papers [6,7], are explicitly referred to as Bellman equations. And the algorithms that make use of them were referred to as value iteration, rather than Q-learning. The Q-learning update rule does not involve expectations, but instead only uses samples, which was also clearly delineated in the DQN papers. While such distinctions might be considered common knowledge, their precise description remains crucial for clarity, especially in establishing the relationship and differences between proposed methods and existing ones.

Concerning the comment that "Both you and other reviewers can easily understand the method and theorems": this is somewhat subjective.

Regarding your clarifications on the statement in Sec. 2.4.

  • "when we mention convergence, we mean the ideal case where we can obtain the true transition probability from $D_i$"

  • “The Q-table version (Algorithm 1) strictly follows the operators, so we can make such a claim.”

To be able to follow these operators given access to $D_i$, one would need to have an infinite amount of data in each epoch $D_i^j$, for all $j$. I find this to be overly restrictive. That's partially why I suggested the connection to fitted Q iteration, which more naturally aligns with algorithms that have access to data instead of the model itself.

"We propose the first operator with convergence and optimality and extend it into an algorithm, but other baselines cannot theoretically guarantee convergence and optimality even in the ideal case. Do you think our work is the best among these methods in terms of theory and practicality?"

My exposure to multi-agent settings (and so the baselines) is limited as I acknowledged initially. But I can offer my perspective based on the content of your paper.

Indeed, proposing “the first operator with convergence and optimality” and developing a well-performing practical algorithm is commendable. However, as I have highlighted, that was not exactly how it was claimed in the paper. The presentation tends to be misleading, incorrect at times, and lacks sufficient discussions on assumptions and limitations.

I believe it is premature to draw comparisons with existing works, before addressing the aforementioned concerns.

References:

[6] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

[7] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.

Review (Rating: 5)

This work proposes Best Possible Q-Learning. They introduce the best possible operator, which is like the standard Bellman operator, but, because the actions of the other players are unknown, the operator also includes a maximization over all marginal transition probabilities under the other players' policies. The computation of the best possible operator is heavy, since it involves searching over all possible transition probabilities. Therefore, the work further proposes the simplified best possible operator and uses it to design the algorithm. In the algorithm, every player will, with some probability, execute a random policy in order to explore possible transition probabilities.

Strengths

  • The proposed algorithm looks promising in achieving decentralization in cooperative reinforcement learning.
  • The experimental part looks extensive and has covered many different environments. Though due to my unfamiliarity with previous algorithms, I cannot confidently say whether the comparison is fair or not.

Weaknesses

  • For the explanation of the algorithm and the theoretical justification, the writing is unclear in general. There are several arguments I cannot verify, and I suspect that there are flaws in the proof of convergence. Please see Questions below. This also largely hinders my understanding of the algorithm.
  • The figures for the experiments are too small.

Questions

  • In the equation in Lemma 1 (the definition of $Q_i(s,a_i)$), should the $r$ be $r_i(s,a_i)$? If it should, should we also include $\max_{r_i(s,a_i)}$ in the definition of $Q_i(s,a_i)$? It's a bit strange if the reward does not depend on the current state and the action chosen by the player.
  • Again, in the definition of $Q_i(s,a_i)$, what is the search space of $P_i(\cdot|s,a_i)$? Do you search over "all possible transition probabilities" from $(s,a_i)$? If so, why wouldn't the solution simply be $Q_i(s,a_i)=r + \max_{s', a_i'} Q_i(s', a_i')$? That is, the optimal choice of $P_i(\cdot|s,a_i)$ should put all probability on the state $s'$ that has the highest value of $\max_{a_i'}Q_i(s',a_i')$.
  • I have difficulty understanding the first inequality in the equation series in Lemma 2, i.e., the step that replaces $P_{env}$ with $P_i^\star$ with a $\geq$ inequality. Can you explain that inequality?
  • I don't understand the following sentence on page 5: "As $P_i$ depends on $\pi_{-i}$ and agents act deterministic policies, $D^m_i$ contains one $P_i$ under a deterministic $\pi_{-i}$." I thought $D^m_i$ only contains tuples of the form $(s,a,r,s')$; why would it contain $P_i$?
  • On page 5, it is said that "When M is sufficiently large, given any $(s,a')$ pair, any $P_i(s,a_i)$ can be found in a replay buffer." I thought the set of $P_i(s,a_i)$ is infinitely large since these are continuous values; how is it possible that a finite set can contain all possible values of $P_i(s,a_i)$? Again, it's unclear what you mean by $P_i(s,a_i)$ being found in a replay buffer, since the replay buffer only contains $(s,a,r,s')$ tuples.
  • I don't understand the following sentence on page 5: "Simplified best possible operator ... does not care about the relation between transition probabilities of different state-action pairs in the same buffer."
  • On page 5, it is said that "BQL ideally needs only $|A|$ buffers.. which is very efficient". Suppose that every player has $K$ actions; is $|A|=K^N$? I'm not sure one can call this very efficient, given that this is exponentially large, and running a centralized Q-learning (so the number of joint actions is $K^N$) should also just incur the same amount of computation.
Comment

We don't need to further search for all possible $P_{env}$, because there is only one $P_{env}$. $P_{env}$ is a property of the environment; it is given and will not change.

Let us give an example. In the following tabular case, $p_{env}$ is defined as:

| | $a_{-i}^1$ | $a_{-i}^2$ |
| --- | --- | --- |
| $a_i^1$ | $p_{env}(s_1\mid s_0,a_i^1,a_{-i}^1) = 0.2$, $p_{env}(s_2\mid s_0,a_i^1,a_{-i}^1) = 0.8$ | $p_{env}(s_1\mid s_0,a_i^1,a_{-i}^2) = 0.8$, $p_{env}(s_2\mid s_0,a_i^1,a_{-i}^2) = 0.2$ |
| $a_i^2$ | $p_{env}(s_1\mid s_0,a_i^2,a_{-i}^1) = 0.1$, $p_{env}(s_2\mid s_0,a_i^2,a_{-i}^1) = 0.9$ | $p_{env}(s_1\mid s_0,a_i^2,a_{-i}^2) = 0.9$, $p_{env}(s_2\mid s_0,a_i^2,a_{-i}^2) = 0.1$ |

Agent $i$ can only perceive $P_i$, which depends on $\pi_{-i}$. For example, if $\pi_{-i}(a_{-i}^1\mid s_0) = 1$, then $P_i(s_1\mid s_0,a_i^1) = 0.2$. However, in this environment, $P_i(s_1\mid s_0,a_i^1)$ can never be 1.
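A small sketch that makes the marginalization concrete for the table above, assuming (as used elsewhere in this discussion) that $P_i(s'\mid s_0,a_i)=\sum_{a_{-i}}\pi_{-i}(a_{-i}\mid s_0)\,p_{env}(s'\mid s_0,a_i,a_{-i})$; the array layout is illustrative:

```python
import numpy as np

# p_env[a_i, a_{-i}, s'] for the tabular example above, with s' in {s1, s2}
p_env = np.array([[[0.2, 0.8], [0.8, 0.2]],
                  [[0.1, 0.9], [0.9, 0.1]]])

def perceived_transition(pi_minus_i):
    """P_i(s' | s0, a_i): the transition agent i perceives when the other
    agents choose a_{-i} with probabilities pi_minus_i."""
    return np.einsum('k,jks->js', pi_minus_i, p_env)

# with pi_{-i} = a_{-i}^1: rows [0.2, 0.8] and [0.1, 0.9], matching the first column above
print(perceived_transition(np.array([1.0, 0.0])))
# any mixture keeps every entry strictly inside (0, 1)
print(perceived_transition(np.array([0.5, 0.5])))
```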

Comment

In the equation in Lemma 1 (the definition of $Q_i(s, a_i)$), should the $r$ be $r_i(s, a_i)$?

As defined in Section 2.1, the reward is $r(s,s')$, which does not depend on the joint action. The setting is still general because if two joint actions lead to the same transition dynamics in the environment, the two actions should have the same effects, and we do not need to discriminate between the actions. The effects of actions are reflected in the distribution of $s'$.

Imagine that we are playing a computer game. There is a certain state where the left key and the right key can lead to the same next state or the same distribution of the next state. Do we need to give the right key a higher reward to encourage the players to always select the right key?

Again, in the definition of $Q_i(s, a_i)$, what is the search space of $P_i(\cdot \mid s, a_i)$?

Yes, the search space is "all possible transition probabilities", but the situation you propose might be impossible in the environment. Look at the definition of $P_i$ (Eq. 1): if $P_{env}$ is stochastic, then no matter what $\pi_{-i}$ is, we cannot put all probability on the state that has the highest value.

I have difficulty understanding the first inequality in the equation series in Lemma 2, i.e., the step that replaces $P_{env}$ with $P_i^{\star}$ with a $\geq$ inequality. Can you explain that inequality?

Look at the first "=" line in Lemma 2; there are two terms. $\max_{a_{-i}}P_{env}$ achieves the highest value of the first term, and $P_i^*$ achieves the highest value of the second term. If we replace $\max_{a_{-i}}P_{env}$ with $P_i^*$ in the first term, the first term decreases and we get a $\geq$.
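Schematically, with $f$ and $g$ standing for the two terms as functions of the transition probability (not the paper's exact notation): if $P^{\max}$ maximizes the first term, then for any other argument $P^{*}$,

$$f(P^{\max}) + g(P^{*}) \;\ge\; f(P^{*}) + g(P^{*}),$$

since swapping the maximizer of the first term for $P^{*}$ can only decrease that term; this is where the $\geq$ comes from.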

I thought $D_i^m$ only contains tuples of the form $(s, a, r, s')$; why would it contain $P_i$?

Yes, $D_i^m$ contains tuples of the form $(s, a_i, r, s')$, and the transition probabilities $P_i(s'|s,a_i)$ estimated from these $(s, a_i, r, s')$ tuples are what we call the contained $P_i$.

I thought the set of $P_i$ is infinitely large since these are continuous values; how is it possible that a finite set can contain all possible values of $P_i$?

Look at the definition of $P_i$ (Eq. 1): the number of possible $P_i$ depends on the number of possible $\pi_{-i}$. In the lines after Eq. 6, we have claimed that we only consider deterministic policies, because when there is only one optimal joint policy, the optimal joint policy must be deterministic. With finite state and action spaces, the deterministic $\pi_{-i}$ are countable (indeed finite), and thus so are the possible $P_i$.

I don't understand the following sentence on page 5: "Simplified best possible operator ... does not care about the relation between transition probabilities of different state-action pairs in the same buffer."

In the simplified best possible operator, each transition independently contributes to the update of the operator. Even though the transitions are in the same buffer, they are not related during the update of the operator.

On page 5, it is said that "BQL ideally needs only $|A|$ buffers.. which is very efficient". Suppose that every player has $K$ actions; is $|A|=K^N$? I'm not sure one can call this very efficient, given that this is exponentially large, and running a centralized Q-learning (so the number of joint actions is $K^N$) should also just incur the same amount of computation.

Yes, $|A|=K^N$. The reason it may not seem efficient is that the problem itself is complex. Decentralized learning is more complex than centralized learning, and centralized Q-learning is an ideal solution for decentralized learning. If our method can achieve the same amount of computation as centralized Q-learning, we can say that our method is efficient.

We know that you expect a decentralized learning method that is much more efficient than centralized Q-learning and can guarantee convergence to the optimal policy. However, this seems to be impossible. If such a decentralized method really existed, then for any single-agent Q-learning problem, we could factorize its action space, convert it to a decentralized MARL problem, and solve it efficiently using this method. That would be a huge revolution for Q-learning.

Comment

Thanks for the response. I want to clarify one thing: are you assuming $P_{env}$ is known by every player (so we don't need to further search for all possible $P_{env}$)?

If not, then when finding the $P_i(\cdot|s,a_i)$ that maximizes Eq. (6), an inner search over the space of $P_{env}$ is required, i.e.,

$$\max_{P_i(\cdot|s,a_i)} \left[\cdots\right] = \max_{a_{-i}}\max_{P_{env}(\cdot|s,a_i, a_{-i})} \sum_{a_{-i}} P_{env}(s'|s,a_i,a_{-i})\,\pi_{-i}(a_{-i}|s)\,[\cdots]$$

In this case, $P_{env}^\star(s'|s,a_i, a_{-i})$ (where $P_{env}^\star$ is the maximizer in the expression above) being deterministic, with all weight on the maximizing $s'$, is definitely possible.

Comment

In the theory of the operators (Section 2.2), the agents can of course know $P_{env}$. But knowing $P_{env}$ is only a means of knowing $P_i$, so in theory they can directly know all possible $P_i$.

In the algorithm, the agents do not need to know $P_{env}$, but they can still search over all possible $P_i$. We will explain why. Once the other agents $-i$ take a policy $\pi_{-i}$, agent $i$ will observe one possible $P_i$ by collecting experiences in the environment. Do you agree? If the other agents $-i$ go through all possible policies $\pi_{-i}$, agent $i$ goes through all possible $P_i$. So in this process, agent $i$ searches over all possible $P_i$ without knowing $P_{env}$.
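A minimal sketch of the estimation step described above, assuming agent $i$ logs only its own tuples $(s, a_i, s')$ while the other agents hold one fixed $\pi_{-i}$; the function and variable names are illustrative:

```python
from collections import Counter, defaultdict

def estimate_P_i(transitions):
    """Empirical P_i(s' | s, a_i) from tuples (s, a_i, s') collected under a
    single fixed pi_{-i}."""
    counts = defaultdict(Counter)
    for s, a_i, s_next in transitions:
        counts[(s, a_i)][s_next] += 1
    return {
        (s, a_i): {s_next: n / sum(c.values()) for s_next, n in c.items()}
        for (s, a_i), c in counts.items()
    }

# e.g., experience gathered while pi_{-i} always plays a_{-i}^1 in the table above
data = [("s0", "a1", "s2"), ("s0", "a1", "s1"), ("s0", "a1", "s2"), ("s0", "a1", "s2")]
print(estimate_P_i(data))  # {('s0', 'a1'): {'s2': 0.75, 's1': 0.25}}
```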

Take an example: I do not know the review mechanism of ICLR ($P_{env}$), but if I submit enough papers (collecting experiences in the environment), I can come to know all possible rating situations ($P_i$).

Note that there is only one $P_{env}$; $P_i$ changes only when $\pi_{-i}$ changes. A given $\pi_{-i}$ completely determines a $P_i$.

Comment

Thanks for the response. Based on your explanation, to my understanding, you're using empirical data to fit an estimated transition $\hat{P}_{env}(\cdot | s, a_i, a_{-i})$, and then using this estimated transition $\hat{P}_{env}$ to form a search space for $P_i$. But then, in the first inequality in the proof of Lemma 2, the $P^*_i$ on the right-hand side is calculated based on the estimated transition $\hat{P}_{env}$, while the $P_{env}$ on the left-hand side is the true transition. This mismatch is not handled in the derivation. I think this has a significant impact on the convergence proof later, because you make the Q function monotonically increasing: if in the earlier phase the estimated transition $\hat{P}_{env}$ is inaccurate, then it's possible that $Q_i(s, a_i)$ overestimates its true maximum value (and will never be reduced again). Then convergence may not hold. The only case where such error is not present is when the environment is deterministic.

About the difference between "containing $(s, a_i, r, s')$" and "containing $P_i$": they are not really the same concept --- different $(s, a_i, r, s')$ can be generated from the same $P_i$, and the same $(s, a_i, r, s')$ can be generated from different $P_i$. I can understand now that you mean the $(s, a_i, r, s')$ are diverse enough. But the imprecise language has to be avoided.

I probably have a better understanding of the algorithm now. But the explanation of the algorithm has to be improved to avoid ambiguity for first-time readers, and the notation system (e.g., the distinction between $\hat{P}_{env}$ and $P_{env}$) also needs some refinement. You may also need to more clearly define the search space of $P_i$ (it's based on $\hat{P}_{env}$, not the true $P_{env}$). The convergence proof may also need some fixes (addressing the issue above).

Comment

In theory we use the real transition probabilities, and in the algorithm we collect experiences to estimate the values under the transition probabilities via the TD error. This is the standard paradigm in reinforcement learning. The proofs are based on the real transition probabilities, so the proofs are correct. In the algorithm, we address the overestimation using a practical technique (Eq. 10), which is commonly used. Experimental results show that this technique can avoid overestimation.

Comment

"We don't need to further search for all possible PenvP_{env}, because there is only one PenvP_{env}. " --- This only applies when you know PenvP_{env}.

So can you confirm that the technique developed in Section 2.2 only applies to the case where $P_{env}$ is known to every player? That's because the search space of $P_i(\cdot|s,a_i)$ depends on $P_{env}$, so in order for every player to perform the search, knowledge of $P_{env}$ is required.

Review (Rating: 6)

This paper proposes an alternate update strategy for Multi-Agent Q-Learning in the "fully decentralized" setting, i.e., where individual agents only have access to their own actions. This update uses the "best possible operator" to update individual agents' Q-values based on an optimistic estimate of the other agents' Q-values using a 2-step update. On the theoretical front, the paper justifies this choice of update strategy by showing its optimality in ideal conditions (i.e., in the asymptotic limit and with access to $\boldsymbol{\pi}^*_{-i}$). The paper evaluates the performance of BQL on 4 domains and shows improved performance in each.

Strengths

  • The experiments are extensive: The paper compares BQL against reasonable baselines on four reasonable "domains". I am not well-versed in the MARL literature, so it's possible that these are cherry-picked, but there is enough evidence to convince me that BQL is doing something useful, esp., that BQL uses the batch data $\mathcal{D}$ in a more useful way than baseline methods in the fully decentralized setting. Minor nitpicks:
    1. The experiments are often not run till convergence, e.g., 3(c), 4(a,b), 5(b,d).
    2. It would be nice to have upper bounds like joint Q-learning in the tabular case and a CTDE baseline in the NN case.
  • The paper is reasonably well-written: I think the descriptions of the methods and experiments are good. It's not clear to me (as someone who doesn't actively work in RL) how novel the theory is, but it is quite clean and has some interesting ideas. Minor nitpicks:
    1. It seems like the theory is mostly a reduction of the "fully decentralized setting" to the "joint Q-learning"/CTDE setting. However, the actual connection is quite obfuscated; could you clarify how these two are related (and what is different)? It also seems like BQL can be seen as an extension of I2Q to the stochastic setting. Is this a fair comparison?
    2. I did not follow the last couple of steps in the proof of Lemma 4. Specifically, how are you combining the various expressions to get the second last equation, and how are you unrolling it to get the last one?

Weaknesses

  • BQL is not "fully decentralized"? In the tabular setting, Algorithm 1 assumes that you can ensure that agents independently explore, but it's not clear that's a realistic assumption in this "fully decentralized setting". BQL- is more realistic and still seems to outperform baselines, but it seems like the key ingredient to getting BQL- to work is ensuring enough exploration. It would be useful to compare how well BQL does based on how much exploration is allowed (e.g., by changing the distance of the initialization to an equilibrium). It would also be useful to talk about whether sufficient exploration is a realistic assumption in practice.
  • No intuition for the 2-step update when the theoretical assumptions are broken: The paper leans heavily on asymptotic intuitions, but a lot of the wins in 4.3 and 4.4 seem to come from sample efficiency. Is there any intuition for this? More generally, there seems to be a gap between theory and practice, esp., the improved performance of BQL when there is only a single replay buffer $\mathcal{D}$. Are there any intuitions for why BQL works in this case?

Questions

The things that it would be most useful to clarify are:

  1. Are the theoretical proofs a reduction of the "fully decentralized setting" to the CTDE setting? If so, what are the assumptions required? If not, what is the gap?
  2. How does the performance of BQL change as you modify the amount of exploration that is performed by the agents?
Comment

It seems like the theory is mostly a reduction of the "fully decentralized setting" to the "joint Q-learning"/CTDE setting. However, the actual connection is quite obfuscated; could you clarify how these two are related (and what is different)? It also seems like BQL can be seen as an extension of I2Q to the stochastic setting. Is this a fair comparison?

Decentralized learning is very different from CTDE. As analyzed in Section 2.1, due to the lack of other agents' actions, the transition probabilities become non-stationary, which breaks the assumptions behind the convergence and optimality of existing RL methods. Therefore, the main challenge of decentralized learning is to deal with non-stationary transition probabilities. However, in CTDE and joint learning, the transition probabilities are stationary, so these settings do not need to consider this problem at all. Therefore, we can say that the theory of BQL has no connection with CTDE and joint Q-learning. BQL is not an extension of I2Q: I2Q tries to directly obtain the optimal transition probabilities, while BQL obtains the value under the optimal transition probabilities via new operators. BQL is an extension of Hysteretic IQL to the stochastic setting, as discussed in Appendix B. Our method and the baselines adopt the same settings and hyper-parameters, so we believe the comparison is fair.
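To make the non-stationarity explicit, using the marginal transition probability as it appears elsewhere in this discussion (cf. Eq. 1 of the paper): from agent $i$'s perspective,

$$P_i(s'\mid s, a_i) \;=\; \sum_{a_{-i}} \pi_{-i}(a_{-i}\mid s)\, P_{env}(s'\mid s, a_i, a_{-i}),$$

so whenever the other agents change $\pi_{-i}$ during learning, the transition kernel seen by agent $i$ changes, even though $P_{env}$ itself is fixed.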

I did not follow the last couple of steps in the proof of Lemma 4. Specifically, how are you combining the various expressions to get the second last equation, and how are you unrolling it to get the last one?

For Lemma 4, look at the RHS of the first inequality (the one after "According to the proof of Lemma 3, we have"). Note the definition of $s^*$ and $a_i^*$, which achieve the maximum of the RHS. For the LHS of this inequality, if we replace the $\arg\max_{s,a_i}$ (the maximizer for the LHS) with $s^*$ and $a_i^*$ (not the maximizer for the LHS), the inequality holds. Then, since $Q_i^{k-1}+\epsilon \geq Q_i^e$, if we replace $Q_i^e$ with $Q_i^{k-1}+\epsilon$, the inequality still holds.

For the last inequality of Lemma 4, after merging the same terms, note the definition of $s^*$ and $a_i^*$, which exactly correspond to the infinity norm.

BQL is not "fully decentralized"?

We only need a pre-defined exploration schedule for each agent. In training, the agents independently follow the exploration schedule without centralized control, so the decentralized assumption still holds. The key ingredient in getting BQL- to work is not the exploration but the operator, because BQL- adopts the same exploration method as IQL, H-IQL, MA2QL-, and I2Q.

There seems to be a gap between theory and practice, esp., the improved performance of BQL when there is only a single replay buffer. Are there any intuitions for why BQL works in this case?

On the one hand, as analyzed in the last paragraph of Section 2.4, if the non-stationary single replay buffer sufficiently goes through all possible transition probabilities, the theory still holds. On the other hand, the intuition behind the operators is optimistic value estimation, which has been verified to be useful in decentralized learning, e.g., in H-IQL. However, since the optimistic target in H-IQL is the highest state value, its overestimation problem is severe in a stochastic setting. The optimistic target in BQL is the expected value over possible transition probabilities, so BQL can avoid overestimation and achieve better performance.

Comment

We thank all the reviewers for their efforts in reviewing our paper and for the insightful suggestions! We noticed that the main concern is the proofs, and we have carefully clarified the confusion and corrected the second equality in Lemma 2 in the revision (the colored line). We confirm that the proofs are correct. If you think your questions have been addressed, we hope you will re-evaluate our paper. If you still have some confusion, please let us know. We look forward to further discussion!

AC Meta-Review

Two of the reviewers complain about the quality of the presentation and I agree. Before this paper can be accepted, it needs to be rewritten to be a lot clearer about the assumptions being made on what is known and unknown, what decentralization means, and the gap between the theory developed and the algorithms provided. I have attempted to read the paper and was confused on many of the same points raised by the reviewers. The authors should revise the paper to be clearer so that it can be evaluated more completely.

Why not a higher score

Paper writing is unclear.

Why not a lower score

N/A

Final Decision

Reject