Mediator Interpretation and Faster Learning Algorithms for Linear Correlated Equilibria in General Sequential Games
摘要
评审与讨论
This paper introduces an extension of communication deviation, termed UTC deviations, and theoretically proves that UTC deviations are equivalent to linear deviations in the context of online convex optimization. Given that UTC deviations can be represented using DAG decision problems, regret minimization algorithms like CFR, when applied to these decision problems, can also be utilized to minimize linear swap regrets in online convex optimization scenarios. Building upon prior findings, employing CFR as a -regret minimizer not only promises the same upper bound but also offers superior computational efficiency by avoiding the need for an expensive projection. Empirical evaluations further demonstrate that CFR achieves reduced average regrets and requires significantly less runtime.
优点
- The paper bridges decision problems and the online learning problem by proving the equivalence of UTC deviations and linear deviations. This enables the use of algorithms, such as CFR, in online optimization problems to minimize linear swap regrets.
- The paper provides an intuitive example of a UTC deviation and demonstrates that UTC deviations do not necessarily have to be communication deviations
- The empirical analysis is very persuasive to me. The experimental settings and results are explained clearly.
缺点
- There is no improvement in regrets when using CFR based on UTC dynamics. Although Figure 2 shows that the regret of CFR outperforms that of Farina & Pipis (2023), the paper does not provide an explanation for this.
- The paper provides only an informal proof of why UTC dynamics have better computational efficiency than Farina & Pipis (2023) in Section 1. A more formal discussion is required and should be placed in Section 6, rather than in the introduction.
- Some notations, such as , are used before being defined.
- I cannot locate a formal definition of linear deviations. Given its frequent use in this paper, at the very least, an explanation is necessary.
问题
- Can you provide some explanations for why CFR achieves a better regret than the algorithm in Farina & Pipis (2023)? Figure 2 is presented without accompanying discussion.
- Since both CFR and Farina & Pipis (2023) offer an upper bound of , is it possible to demonstrate that this is also the lower bound?
- Could you elaborate on how to employ the algorithm from Cohen et al. (2021) to achieve complexity? What is the linear program that must be solved in formal? This addition would make the paper easier to read.
Can you provide some explanations for why CFR achieves a better regret than the algorithm in Farina & Pipis (2023)? Figure 2 is presented without accompanying discussion.
PCFR+ performs better than projected gradient descent, both in terms of per-iteration complexity and in terms of empirical performance. However, PCFR+ can only be applied to certain convex sets (unlike projected gradient descent, which applies to any convex set). One of the key contributions of our paper is indeed in showing that, through a different characterization of (specifically, the constraints that define ), algorithms such as PCFR+ can be employed. The practical merits of CFR-family algorithms have been documented by many researchers before us, which established that family as a natural default choice for tree- and DAG-form decision spaces. Indeed, the fact that our construction, unlike that of Farina & Pipis, allows the application of CFR-family algorithms---thus enabling this practical improvement---is, in our minds, one of the main merits of our approach.
Since both CFR and Farina & Pipis (2023) offer an upper bound of , is it possible to demonstrate that this is also the lower bound?
We do not believe that this bound is tight, in fact. For example, in games, there are ways to achieve regret for EFCE, as we discuss in the conclusion section. Also, CFR usually doesn't get the optimal theoretical guarantees: for example, for external regret on tree-form decision problems, KOMWU [3] improves upon it by a quadratic factor. Both of these ideas seem somewhat tricky to extend to linear deviations; we leave these as open questions for future work.
[3] G Farina, CW Lee, H Luo, C Kroer (ICML 2022). "Kernelized Multiplicative Weights for 0/1-Polyhedral Games: Bridging the Gap Between Learning in Extensive-Form and Normal-Form Games"
Could you elaborate on how to employ the algorithm from Cohen et al. (2021) to achieve complexity? What is the linear program that must be solved in formal? This addition would make the paper easier to read.
Cohen et al. (2021) show that linear programs with variables/constraints can be solved in time . The fixed-point problem is just an LP feasibility problem of the form "find such that and ", where is the matrix representing the linear deviation and the constraint can be represented with constraints via the sequence form (first equation, page 3). This LP is also stated in the paper at the end of Section 2. In the updated version, we have included a forward pointer from the introduction to Section 6, where the fixed point problem is discussed in more detail.
This paper makes a conceptual step towards the design of swap regret minimizing algorithms. In particular, inspired by the -regret framework they introduce a set of deviation mappings called untimed communication (UTC) deviations that are shown to be equivalent to the set of linear deviations. These linear deviations have been recently shown to be useful in obtaining sublinear regret in polynomial time for sequential games. Following that, the authors show that UTC deviations are expressible in terms of scaled extensions, which allows them to apply CFR to learn linear correlated equilibrium much faster than the indicative previous work of Farina and Pipis. Finally, several experiments are shown that corroborate the theory and show fast convergence to linear correlated equilibria.
优点
The paper is well written and the structure is clear. Moreover, showing a connection between linear deviations and communication deviations draws a novel parallel with standard sequential deviation sets. This opens up a wealth of potential untimed deviation mappings (along with accompanying equilibrium concepts) to be studied in future. The example presented also gives an intuitive understanding of the deviation and how it can be useful. Finally, the empirical performance of the CFR algorithm instantiated with regret matching+ greatly speeds up convergence to the linear correlated equilibria.
缺点
While showing a connection between linear and communication deviations is a very neat result, the paper does not seem to be very substantial in otherwise exploring the full potential of this new class of deviations. The ability to obtain faster convergence to linear correlated equilibria is welcomed, but relies on an existing algorithm. In that sense, the major contribution of this work is introducing UTC as a part of the framework for which fast convergence to equilibria can be obtained. Other than that, none of the other results shown seem technically significant. To alleviate this, more discussion or results about the nature of untimed communication equilibria would be helpful, should space allow (perhaps a section about connections to standard communication equilibria, which has been written in the appendix already).
问题
- If a generic definition of untimed communication equilibrium cannot be derived, are there relaxations thereof that can be derived in your framework?
- What is the next reasonable step beyond linear deviations in the search for swap regret minimization?
more discussion or results about the nature of untimed communication equilibria would be helpful, should space allow (perhaps a section about connections to standard communication equilibria, which has been written in the appendix already).
Appendix A.6 defines the untimed private communication equilibrium, which partially addresses the reviewer's question. We wanted to place it in the body; unfortunately, space did not permit this, but we agree with the reviewer that the appendix contains interesting content. Defining the untimed (non-private) communication equilibrium seems quite tricky, as we pointed out in a footnote. Since the reviewer seems interested in this, we will now elaborate a bit more.
The untimed private communication equilibrium is easily defineable. We define it in Appendix A.6, and also prove the relevant revelation principle. As we state, the word private here is included to emphasize the fact that untimed private communication equilibria are correlated strategy profiles, and that this corresponds to a sort of privacy constraint, where the mediator cannot pass information between players. This is unlike the usual communication equilibrium definition: communication equilibria may not be strategy profiles at all, because the mediator can effectively "pass information" between players.
Untimed communication equilibria (without the privacy constraint) are far more tricky to define, especially in a way that doesn't quickly collapse to the regular notion of communication equilibrium. The following is not formal but should give some intuition. In a 3-player game, the mediator is always guaranteed that two of the players have not deviated, and those two players will have messages synchronized with the game clock. Therefore, under reasonable assumptions on how often each player makes moves, the mediator will immediately know if the deviating player is sending out-of-order messages, and this concept would reduce immediately to the regular communication equilibrium.
Perhaps in two-player games there is some interesting way to define untimed communication equilibrium (since the above issue does not arise---the mediator wouldn't know which player is out of order and which player is not), but this remains quite subtle and mostly outside the scope of the paper; we leave it to future work.
What is the next reasonable step beyond linear deviations in the search for swap regret minimization?
A very timely question! Two simultaneous very recent preprints [1, 2], both appearing after the ICLR submission deadline, have recently demonstrated that there exist efficient swap-regret minimizers for extensive-form games with regret , implying that it is possible to compute -NFCE in time . (Note that this is in stark contrast with linear-swap regret, which can be minimized in time instead of paying exponential cost in ). A very natural open question remains as to whether it is possible to compute NFCE in time .
[1] B Peng and A Rubinstein (arXiv 2023). "Fast swap regret minimization and applications to approximate correlated equilibria"
[2] Y Dagan, C Daskalakis, M Fishelson, and N Golowich (arXiv 2023). "From External to Swap Regret 2.0: An Efficient Reduction and Oblivious Adversary for Large Action Spaces"
Thank you very much for the comments and additional explanation, the significance of the untimed private communication equilibrium formulation is clearer to me now. I have indeed seen the recent papers cited in your response, and overall I feel that your work is well placed in the current literature. With the changes proposed to the manuscript, I would be happy to recommend acceptance for the paper.
This paper studies the computation of a relaxed version of correlated equilibria, known as linear correlated equilibria. The authors take a regret minimization perspective and design their algorithm for computing linear correlated equilibria by designing algorithms that minimize linear-swap regret. They demonstrate an equivalence between linear deviation and a generalization of communication deviations, where the deviator can send untimed queries to the mediator. Hence, minimizing linear swap-regret can be done via minimizing regret defined under such deviations with untimed communications. Invoking a previous result on the latter problem then gives a linear-swap regret minimizer. Experiments based on several types of games are further conducted, which demonstrate significant improvement of the proposed algorithm compared with existing ones.
优点
The paper is overall clean and rigorously written. The problem studied is interesting and results presented look sound technically. The empirical results also provide significant improvement in comparison to existing approaches.
缺点
Some parts of the paper seems to lack necessary details, making them quite hard to follow. In particular, the communication deviations and the UTC decision problem in Section 3 are introduced in a very abstract language. For readers not familiar with previous work on communication equilibrium, it would be quite hard to understand why these concepts are defined in described ways. For example, when communication deviations are introduced, it is said that the deviator will maintain a state with the mediator, but there is no explanation about why there needs to be a state. The introduction of the interaction between the mediator and the deviator also lacks intuitions. Overall, while I find the previous sections clear and informative, Section 3 seems to appear a bit abruptly. It might be helpful to provide some more details about comminication equilibrium, or introduce the concepts in this section along with the example in Section 5 (I think I understand them better after reading the example).
Some typos:
-
In Section 5, the sentence "must immediately its first mediator query" reads problematic.
-
In the paragraph above Section 7, "this significant(ly) improves"
问题
-
Could you provide some intuition why deviations in EFGs need to be defined as interactions betweem a mediate and a deviator?
-
There seem to be some typos in Theorem 2.3. On the second line, "where x^{(t)} \in X ... " Do you mean x^{(t)} \in \Phi instead of X? The utility vector u^{(t)} in this sentence also seems to come from nowhere. And when you say "deterministic" external regret, the notion of deterministic seems undefined.
The main purpose of the example in Section 5 is precisely to clarify the relevant concepts, and we are glad that the reviewer found it helpful in understanding. We have added a forward pointer to the example in Section 3, as well as a schematic/table of the different solution concepts (Table 1).
Thank you for pointing out the typos; we have fixed them.
Could you provide some intuition why deviations in EFGs need to be defined as interactions between a mediator and a deviator?
It is not a requirement for a set of deviations to be defined as an interaction between mediator and deviator. Indeed, for example Farina and Pipis did not give this interpretation (and instead left the question open). From a game-theoretic modeling point of view, however, the question is important, as correlated equilibria and its variants---in fact all of them except linear-correlated equilibria before our paper---are defined in this way. This sort of mediator-deviator interaction is a very convenient framework for thinking about sets of deviations , and one important contribution of this paper is to place linear deviations also within the framework. The second important contribution of this paper is in showing that this improved understanding based in the language of mediators enables significant computational speedups. So, it is up to a reader and their background whether they are motivated more by the game-theoretic modeling or by the computational results; this paper contributes to both. (See also answer to Reviewer xFFq.)
There seem to be some typos in Theorem 2.3. On the second line, "where x^{(t)} \in X ... " Do you mean x^{(t)} \in \Phi instead of X? The utility vector u^{(t)} in this sentence also seems to come from nowhere.
The where clause is misplaced and should simply be removed. We have done this in the updated version.
And when you say "deterministic" external regret, the notion of deterministic seems undefined.
A deterministic regret minimizer is simply one that does not use randomness internally to compute its strategies. All regret minimizers we study in this paper are deterministic. We have clarified this in the updated version.
Thank you for your clarifications!
This paper studies regret minimization for extensive-form games. In detail, the authors established a relationship between the linear deviation and the proposed UTC deviation. Therefore, any algorithms which are used to minimize the UTC regret is equal to minimize the linear swap regret. The authors showed sublinear regret bound and experiment results to suggest that the proposed algorithms are the state of the art.
优点
This paper is technically sound, and the experiment results are also comprehensive.
缺点
The presentation of this paper can be further improved. In order to further demonstrate the contribution of this work, the authors can have a table which records all existing algorithms for linear-swap regret, as well as their computational complexity. Right now it is a little bit hard for the readers to even identify which of the algorithms (UTC-based CFR or algorithms from Farina and Pipis) are better. Another suggestion is the presentation of Section 5. I understand that the authors want to use the example to demonstrate that UTC is more expressive than communication deviation. However, I spend several hours and still can not get an intuitive understanding about why such a claim hold. For instance, why A and B are 'irrelevant' according to footnote 6? I suggest the authors rewrite Section 5 by reorganizing the Figure 1 into several smaller figures with shorter captions. That would make the demonstration much clearer.
Following the above comment, I can not tell what is the key contribution of this work. The main theorem (Theorem 4.1) suggests that the deviation sets and are identical, and later the authors suggested CFR in Zhang et al. can be used to solve UTC, which can be further used to solve linear deviation. How the exactly can CFR be applied to linear deviation? A reduction process is expected, similar to what has been done in Theorem 2.3 (an existing result). Meanwhile, why people are interested in forming the equivalence between UTC and linear deviation but not some other types of game and linear deviation? Last, the authors suggested that the per-iteration complexity of CFR depends on the complexity of computing a fixed point iteration problem. How should we compare such a complexity with previous approaches, like Farina & Pipis (2023)?
问题
See weaknesses section.
In order to further demonstrate the contribution of this work, the authors can have a table which records all existing algorithms for linear-swap regret, as well as their computational complexity. Right now it is a little bit hard for the readers to even identify which of the algorithms (UTC-based CFR or algorithms from Farina and Pipis) are better.
We know of only two algorithms for this problem---ours, and that of Farina and Pipis. As clearly stated at various points in the paper including the abstract, ours has significantly better per-iteration complexity in theory, and all-around better performance in practice.
If the reviewer is asking about comparisons between our algorithms and other -regret algorithms, we have included a table with such a comparison in Appendix B.
Why A and B are 'irrelevant' according to footnote 6?
That is because, at decision points A and B in the game, player is guaranteed utility exactly 0 regardless of what it does. Since a deviating player only cares to maximize its utility, what it does at these decision points is irrelevant.
The key point in the example is that the untimed nature of the deviations allows the deviating player to delay its query to the mediator until it reaches either infoset D or infoset E, and then decide which query to make based on which infoset was reached (A if D is reached, and B if E is reached). In a timed communication deviation, this behavior is impossible, because the player must make its first query (infoset A, B, or C) before reaching D or E, and thus that query cannot be conditioned on whether D or E will be reached.
We've updated the submission to add some more detail about this.
Following the above comment, I can not tell what is the key contribution of this work. [...] How the exactly can CFR be applied to linear deviation?
You are correct that the main theorem asserts that the set of deviations (studied directly by Farina and Pipis) is equal to the set that we introduce. This an important statement, for two different reasons:
- Mathematically, the structure of the constraints that define expose a combinatorial structure that is significantly more amenable to computational techniques, namely, that of a DAG decision problem. The DAG structure of is what enables the use of (a variant of) the CFR algorithm, which is extremely efficient---the cited paper Zhang et al (2023) gives more detail on the application of CFR to arbitrary DAGs. CFR could not be applied by Farina and Pipis, who instead are forced to use an algorithm whose stated (polynomial) complexity scales as the tenth power of the number of actions in the extensive-form game (see also answer below).
- Conceptually, gives an interpretation of linear deviations in the language of correlation devices (aka mediators), which is the standard game-theoretic model underpinning correlated equilibrium and its variants. Farina and Pipis has left the question open, and instead studied the question from a learning-in-games angle.
A reduction process is expected, similar to what has been done in Theorem 2.3 (an existing result).
We use Theorem 2.3 (implicitly) as part of the proof of Theorem 6.1 (Theorem 2.3 is what allows us to say that an external regret minimizer on , which is what we have constructed, implies a -regret minimizer). We have clarified this in the updated verison.
Last, the authors suggested that the per-iteration complexity of CFR depends on the complexity of computing a fixed point iteration problem. How should we compare such a complexity with previous approaches, like Farina & Pipis (2023)?
Farina & Pipis's algorithm requires, on every iteration, (1) a fixed-point operation and (2) an projection onto the polytope. The latter step, though polynomial, is very expensive: in theory, it requires solving a convex quadratic program, and Farina & Pipis (Theorem D.1) bound the per-iteration complexity as . By virtue of the identification of and our , which presents a much more favorable structure of constraints, our paper shows that the projection can be removed altogether, and that fast approaches such as CFR can be used instead.
The complexity of computing the fixed point (needed by both Farina and Pipis, and by us) can in theory and in practice be computed far faster than this: theoretically, computing a fixed point can be done via a linear program in time , where . (see also response to Reviewer YUk3 for further discussion on the fixed point computation)
Practically, power iteration is usually sufficient to compute a fixed point and can be much faster than linear programming as well; that is what we use in the experiments (we have clarified this in the updated version)
This paper proposes a connection between linear deviations and a generalization of communication deviations to develop a no-regret algorithm for computing linear correlated equilibrium, and illustrated the efficiency and effectiveness of the paper via theoretical analysis and empirical evaluations.
Pros: Interesting topic, solid theoretical analysis, and comprehensive experimental results.
Cons: the main take home message is unclear. Reviewers also raised various presentational issues.
为何不给更高分
None of the reviewer seems to be extremely excited due to the Cons and as detailed in their reviews. This is perhaps because of the presentational issues.
为何不给更低分
The connection is interesting and the theoretical and experimental results are solid.
Accept (poster)