PaperHub

Average rating: 6.0 / 10
Decision: Rejected (4 reviewers)
Individual ratings: 8, 5, 6, 5 (min 5, max 8, std. dev. 1.2)
Average confidence: 3.3

Keywords

no-regret learning, extensive-form games, optimal equilibria, mechanism design, information design, payments

Reviews and Discussion

Official Review
Rating: 8

The paper considers a setting where there is a mediator whose goal is to direct (i.e., steer) the learning agents to desirable equilibria by paying the agents, thus affecting the learning agents' payoffs. The authors show that if the total payment is upper bounded by a constant then it is not possible to steer the agents. However, it is possible when the per-iteration payments are o(1). In particular, in the setting where the agents use an algorithm that has sqrt{T} regret, they show that in the full-information or normal-form setting the per-iteration payment is O(T^{-1/4}), and in the bandit setting for extensive-form games the per-iteration payment is O(T^{-1/8}). The authors study both the full-information setting and the bandit setting.

To prove their results, one idea is to pay the agents so that playing a Nash equilibrium is a dominant strategy. However, the technical difficulty here is that we want to ensure that the per-iteration payment goes to 0 as t tends to infinity. The authors show how to do this by designing a payment function with essentially three components. The first component is to subsidize the agents when the other agents are not playing the equilibrium. An important aspect of this is that when the agents are in equilibrium, this component should vanish. The second is a reward which incentivizes the agents to play the equilibrium. And the last component is just to ensure the payments are non-negative.
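
For intuition, here is a minimal sketch of such a three-component per-round payment in a normal-form game. It is purely illustrative and not the construction from the paper; all names (`payment_to_player`, `utility`, `alpha_t`) are hypothetical.

```python
# Minimal illustrative sketch (not the paper's actual payment rule): a per-round
# payment to player i built from the three components described above.
# `target` is the target pure equilibrium profile, `play` is the realized action
# profile, `utility(i, profile)` returns player i's game payoff, and `alpha_t`
# is a small reward that vanishes as t grows (e.g., proportional to t**-0.25).

def payment_to_player(i, play, target, utility, alpha_t):
    others_on_target = all(
        play[j] == target[j] for j in range(len(play)) if j != i
    )

    # (1) Subsidy: compensate player i when the *other* players deviate from the
    #     target, so that sticking to target[i] is never costly; this component
    #     is zero once everyone is playing the target profile.
    subsidy = 0.0 if others_on_target else utility(i, target) - utility(i, play)

    # (2) Vanishing reward for playing one's own target action, to break ties
    #     when the target equilibrium is not strict.
    reward = alpha_t if play[i] == target[i] else 0.0

    # (3) Clip to keep the payment nonnegative.
    return max(0.0, subsidy + reward)
```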

Strengths

The problem is quite original and is very interesting. The results are interesting and I believe will be of interest to researchers working in the intersection of AGT and learning. In particular, there is a large body of work on understanding dynamics of games and this paper should fit right in.

In addition, the writing is clear and the techniques seem quite sophisticated. I enjoyed reading the paper and appreciate that it is a novel problem.

Weaknesses

No weaknesses to discuss from my end.

Questions

p. 1: Maybe clarify what is meant by bounded regret. Is it constant regret per iteration? Vanishing regret?

Algorithm 5.1: Do the sandboxing payments have the same property in the normal-form case, where having those payments ensures that d_i is a dominant strategy?

Comment

We thank the reviewer for the helpful feedback.

Bounded regret (page 1)

We mean that the regret should be bounded by a sublinear function of T, as is the usual assumption. We've changed it to "sublinear regret".
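
Concretely, the standard external-regret notion we have in mind is (in generic notation; the paper's own formalization may differ in minor details):

\[
\mathrm{Reg}_i(T) \;=\; \max_{x \in X_i} \sum_{t=1}^{T} u_i\big(x, x_{-i}^{(t)}\big) \;-\; \sum_{t=1}^{T} u_i\big(x_i^{(t)}, x_{-i}^{(t)}\big),
\qquad \text{and ``sublinear regret'' means } \mathrm{Reg}_i(T) = o(T).
\]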

Algorithm 5.1: Do the sandboxing payments have the same property in the normal form case where having those payments ensures that d_i is a dominant strategy?

Yes--the ability to set these sandboxing payments is precisely what distinguishes the full-feedback case from the bandit case. As we discuss, in the bandit case, it is impossible in general to make d_i dominant.

Comment

Thanks for answering my questions. It looks good to me.

Official Review
Rating: 5

The paper studies the problem of guiding no-regret-learning agents toward desirable equilibria using nonnegative payments. The authors first show that achieving vanishing average payments allows for steering in scenarios where complete player strategies are observable. However, steering is impossible with a finite budget across all iterations. In cases where only game tree trajectories are observable, the feasibility of steering varies, being possible in normal-form games or with growing per-iteration payments, but generally impossible in extensive-form games with constant per-iteration payments. The paper supplements its theoretical findings with experimental validation, demonstrating the efficacy of steering in large games.

Strengths

On the positive side, I found the paper to be well-written, and if you take the model as given, the paper gives a fairly satisfying and complete first investigation – the questions asked are exactly the ones I would hope are answered first.

Weaknesses

I have a major concern regarding the motivation behind this paper, particularly with respect to the assumption that agents willingly accept being directed. A key question arises when these agents are required to continue their interactions for extended rounds after being steered towards equilibrium – should compensation be provided once they reach this equilibrium? While the paper mentions that a small constant, P (P <= 8), is sufficient to guide them to the exact equilibrium within a finite number of iterations, it overlooks the potential utility loss (because of being steered into a different equilibrium) experienced by agents after this steering process. Intuitively, it seems reasonable that they should also receive compensation to offset this particular loss. Moreover, an alternative approach could be to simplify the model into a single step, instructing agents to directly adopt the ideal equilibrium. In this case, the mediator could provide one-time compensation equal to the total utility difference experienced under an alternative equilibrium.

Specifically, I am concerned about the assumption that followers will consistently play a no-regret strategy in response to the leader's queries, especially when they are aware of the steering process towards equilibrium. Steering mechanisms have the potential to influence agents' decisions and behaviors, raising questions about the potential introduction of bias or preferential treatment for certain groups of agents. For instance, in the design of payment schemes, concerns about fairness may emerge. It is crucial for the paper to elaborate on the validity of this behavior model and provide justification for the application domains where such assumptions hold.

From a technical perspective, I found it challenging to understand how the algorithms can be effectively applied to mixed-strategy equilibria through a mediator-augmented game. Specifically, when dealing with pure strategies, solving equation (2) seems feasible. However, when considering mixed strategies, X_i transforms into an infinite set, and solving equation (2) would necessitate the optimization of a non-convex problem.

Questions

The concept of a mediator-augmented game reminds me of a paper by Deng, Yuan, Jon Schneider, and Balasubramanian Sivan titled "Strategizing against no-regret learners." This paper, which addresses similar problems involving strategies against no-regret learners, was cited in your work but not extensively discussed. In a mediator-augmented game, the mediator essentially takes on the role of the leader who strategizes against the no-regret learner. I am curious whether you can provide a detailed explanation of the difference between your models and results.

Details of Ethics Concerns

Steering mechanisms have the potential to influence agents' decisions and behaviors, raising questions about the potential introduction of bias or preferential treatment for certain groups of agents. For instance, in the design of payment schemes, concerns about fairness may emerge.

Comment

We thank the reviewer for the helpful feedback. Below we address the reviewer's questions and concerns.

Assumption that agents willingly accept being directed

We're a bit confused about what the reviewer means by this. The players do not need to willingly accept anything: the only assumption we make about their behavior is that they are no-regret. In the non-pure-equilibrium setting, the players are free to ignore or disobey the recommendations of the mediator if they wish.

Should compensation be provided once they reach this equilibrium?

In our model, if the players are playing the equilibrium, they will continue to receive a small (and vanishing with time) amount of payment, just to ensure that they continue to play the equilibrium. Intuitively, this payment exists to address the case where the equilibrium is not strict.

Moreover, an alternative approach could be to simplify the model into a single step, instructing agents to directly adopt the ideal equilibrium. [...] Specifically, regarding the assumption that followers will consistently play a no-regret strategy in response to the leader's queries, especially when they are aware of the steering process towards equilibrium. [...] It is crucial for the paper to elaborate on the validity of this behavior model

The alternative model proposed by the reviewer, in which the agents would directly switch to the equilibrium upon being instructed to do so, is far stronger (and harder to justify) than the one we work with. As the reviewer also mentions, working in that model would essentially render the problem trivial, because steering to an equilibrium would then amount to nothing but telling the players what to play.

We agree that, if the players were aware of the steering process and could collude against our mediator, they could manage to "break" the algorithm by, for example, extracting high payments from the mediator. However, this would result in a "prisoner's-dilemma-like" situation where at least one player would end up having high regret.

Many "intuitive" algorithms that could be adopted by learning agents in practice, such as (projected) gradient descent, multiplicative weights, or regret matching, are no-regret dynamics. Further, a player who incurs high regret has in some sense "failed to learn", because, in hindsight, it would have done better by acting in some other way. For these reasons, we believe that the assumption of no regret is a fairly weak and reasonable assumption in our eyes (and indeed that is the reason we work with it).

We have included this discussion in the revised version.

From a technical perspective, I found it challenging to understand how the algorithms can be effectively applied to mixed-strategy equilibria through a mediator-augmented game. Specifically, when dealing with pure strategies, solving equation (2) seems feasible. However, when considering mixed strategies, X_i transforms into an infinite set, and solving equation (2) would necessitate the optimization of a non-convex problem.

Considering the infinite game in which each player selects a mixed strategy is just one way to address the issue of mixed strategies. Indeed, Monderer and Tennenholtz (2004) discuss this approach in Section 3 of their paper. It is not the approach our paper takes. Instead, our paper addresses the mixed equilibrium case by empowering the mediator to give action recommendations to the player. This augmented game is still finite (as long as the underlying game is finite, of course): the mediator has finitely many action recommendations to select from, and the player, upon seeing an action recommendation, picks among its finitely many choices of action. This augmented game is also extensive form---as such, the players' strategy spaces are still polytopes, and the optimization remains convex and efficient.

We continue our response below.

Comment

The concept of a mediator-augmented game reminds me of a paper by Deng, Yuan, Jon Schneider, and Balasubramanian Sivan titled "Strategizing against no-regret learners." This paper, which addresses similar problems involving strategies against no-regret learners, was cited in your work but not extensively discussed. In a mediator-augmented game, the mediator essentially takes on the role of the leader who strategizes against the no-regret learner. I am curious whether you can provide a detailed explanation of the difference between your models and results.

The setting of that paper is quite different from the one that we study, in several important ways.

  • Their setting is a normal-form setting with one leader and one follower, in which a critical assumption is that the follower has no weakly dominated pure strategies. We make none of these assumptions: our setting is an extensive-form setting with, in general, multiple followers, and we make no assumptions about domination. These changes make the problem significantly harder: in fact, in their setting, payments are not even required to guide the learner to the "equilibrium" (best response); it suffices for the leader to be clever about picking its mixed strategy. In our setting, however (as we make clear in the paper), payments are most certainly required for steering, even in the one-player case.
  • Our mediator is more restricted in terms of what it can do to affect actual gameplay: our mediator can only talk to and pay the players; it cannot directly affect the outcome of the game in the way that a Stackelberg leader can, since the players are free to ignore or disobey the mediator's actions. If the mediator wants something to happen, it must actually incentivize the players to do it.

We have added this discussion in the new version.

Comment

With the discussion period coming to a close, could you let us know if we have addressed your concerns? If not, could you let us know what concerns remain so that we may have an opportunity to address them?

Comment

I thank the authors for the clarification. I will take it into consideration during the discussion phase with AC.

To clarify what I mean by "Assumption that agents willingly accept being directed". Initially, I presumed that players were aware of the steering process but still chose to follow directions willingly. However, the author's response seems to suggest that the assumption is that players are unaware of the steering process and are myopically playing no-regret algorithms. While this setting may be suitable for theoretical studies, it might lack practical applications. Additionally, the result of vanishing payment is not surprising, considering the behavior assumption that players incur vanishing average regret. Therefore, I maintain my borderline decision.

Comment

To clarify about "awareness": It does not matter whether or not players are "aware" of the steering process (it is not even clear to us what that formally means in our setting). They don't even need to be following traditional no-regret dynamics (e.g., gradient descent/multiplicative weights/CFR) per se. The only thing that we assume is that players have no regret in hindsight. We believe that this is a fairly reasonable assumption even in practice: if a player has high regret, it is reasonable to say that they have acted suboptimally. We also emphasize that many papers before us---including the one cited by the reviewer in their original review and several cited in the introduction and related work sections of our paper---use the no-regret assumption, so we are far from alone.

We did mention in our earlier response that players who are willing to incur high regret may be able to exploit the mediator, for example, by colluding among themselves to extract large payments---but, again, such players would inevitably violate our no-regret assumption. Whether this would happen in practice is a difficult matter likely dependent on the desired application, and is beyond the scope of this primarily theoretical work.

See also the discussion with Reviewer VPEt above for more discussion surrounding the no-regret assumption, specifically regarding possible assumptions that are stronger than no-regret :)

The assumption of sublinear budget (i.e., vanishing average payments) is indeed not surprising given that the alternatives that we discuss both have trivial resolutions (namely, linear budget trivially allows steering to anything; constant budget precludes steering even in simple settings). But, we believe that the analysis that comes of it is indeed quite surprising and interesting: for example, we certainly would not have guessed that, in extensive-form games in the bandit setting, bounded per-iteration payments do not suffice (Th 5.4) but unbounded per-iteration payments do (Th 5.6)!

Official Review
Rating: 6

The paper targets the following problem: can we steer no-regret dynamics towards optimal equilibria? The paper considers both normal-form and extensive-form games, as well as full feedback versus bandit feedback. The results can be roughly organized into two categories. Firstly, there are results targeting pure Nash equilibria. In this case the paper provides a number of results showing that such targeting is possible given a sublinear total of payments. They also provide an example where a finite amount of payments does not suffice for a specific sequence of no-regret play in a 2x2 game. Secondly, there are results targeting mixed NE and different notions of CE. There is a significant departure in what the paper interprets as steering equilibria. Instead of considering standard no-regret algorithms, e.g. Follow-the-Regularized-Leader, playing the original game, they expand the setting to add a new "action" to each algorithm. The new action is not an action in the original game but effectively implements the policy "I will play what my mediator tells me to do", where the mediator can e.g. solve for an optimal correlated equilibrium. This technique effectively allows the paper to leverage recent computational reductions by Zhang et al. (2023) and reduce the mixed-equilibrium problem to a pure-equilibrium problem.

Strengths

The paper targets a rather interesting problem of how to steer no-regret learning algorithms to implement equilibria. It connects interesting areas in game theory, online learning and mechanism design. It presents a plethora of theoretical results in a wide range of different settings and solution concepts. The mathematical parts of the paper are generally well articulated. The authors show a good technical grasp of the area and expertly leverage recent results. The paper also shows some experiments to complement the theory.

Weaknesses

I have two different criticisms for the paper. One for the pure Nash results and one for the rest of the mixed solution concepts.

  1. For the case of non-pure Nash equilibria, if I am not mistaken, the paper does not solve the problem that is advertised in the title/abstract/intro. Specifically, from the first paragraph of the intro:

Any student of game theory learns that games can have multiple equilibria of different quality—for example, in terms of social welfare (Figure 1). How can a mediator—a benevolent third party—steer players toward an optimal one? In this paper, we consider the problem of using a mediator, who can dispense nonnegative payments to players, to guide players to a better collective outcome.

That is not what the paper is doing. The mediator in the case of mixed eq is a much more powerful entity that can actually broadcast advice to all players. Furthermore, the players are not playing any recognizable no-regret algorithm but learning with advice algorithms, which have a special option (ignore the actual algorithm and follow centralized advice according to a pre-configured scheme that allows for global instantaneous broadcast of signals that are unambiguously interpreted by all users and acted upon). This is not steering no-regret algorithms but a different class of algorithms altogether where agents learn in the presence of a broadcasting device. The current paper mentions in passing in the introduction some of these papers by Balcan et al (see also [1] for a more recent paper) but they fail to mention that these settings work with a more realistic broadcasting device, which only reaches a small random fraction of all agents (e.g. 1%) instead of all. More importantly, I would have expected that words such as public service advertising, learning with centralized advice, global information, etc. would play a central role in the description of the approach. This is a critical part of the model and an important difference between the case of pure and mixed eq. To be clear, what is being done is still interesting but it is significantly easier to achieve than merely using payoffs to steer the dynamics.

  2. For the case of pure Nash equilibria, it is not clear to me that the problem is hard to begin with. The paper provides a lower bound where a simple game along with a specific type of very unnatural no-regret behavior can be hard to steer. Following results from e.g. [2-4], pure/strict Nash equilibria (e.g. like in the Stag Hunt games explored in the paper) are locally asymptotically stable for effectively all no-regret algorithms, and in fact arguably for all reasonable game dynamics. So here is how to informally stabilize all of them with a finite amount of money. Pad the target strategies with enough money at an initial phase of the algorithm. With a finite excess payoff, any reasonable aggregate-payoff-based dynamics (not even necessarily no-regret) will, after some finite amount of investment, be in the attractor. This is a stronger convergence result since it actually even guarantees last-iterate convergence (and it does so with a finite amount of money).

My question is the following: Why should we care enough for totally non-meaningful no-regret behaviors that do not capture any realistic algorithm either from an optimization or a behavioral point of view to be willing to accept paying an infinite amount of money when finite amount should suffice for all reasonable dynamics?

You also have an experiment with Exp3 where clearly a finite amount of money suffices to select the optimal equilibrium. Can you create a negative example where you would need unbounded money for Exp3? If not, then doesn't this show that the T^{\alpha} unbounded payoffs are too loose in comparison to real world lower bounds?

[1] Balcan et al. Near optimality in covering games by exposing global information. ACM Transactions on Economics and Computation, Volume 2, Issue 4, October 2014.
[2] Vlatakis-Gkaragkounis et al. "No-regret learning and mixed Nash equilibria: They do not mix." Advances in Neural Information Processing Systems 33 (2020): 1380-1391.
[3] Giannou et al. "Survival of the strictest: Stable and unstable equilibria under regularized learning with partial information." Conference on Learning Theory. PMLR, 2021.
[4] Giannou et al. "On the rate of convergence of regularized learning in games: From bandits and uncertainty to optimism and beyond." Advances in Neural Information Processing Systems 34 (2021): 22655-22666.

Questions

Please see my questions above.

Comment

We thank the reviewer for the valuable feedback. Below we address the reviewer's concerns.

For the case of non-pure Nash equilibria, if I am not mistaken, the paper does not solve the problem that is advertised in the title/abstract/intro

For mixed equilibria, we do not believe that allowing the mediator to also give recommendations to players is too strong or far-fetched an assumption, especially given that the mediator is already interacting with the players to provide payments. Indeed, for normal-form games, Monderer and Tennenholtz (2004, Section 6) use the same principle to address mixed equilibria in their paper.

Further, for mixed equilibria, without advice, there is no hope of performing steering. This is best illustrated by example. Consider the normal-form two-player coordination game where the payoff for both players is the identity, and suppose that the mediator wishes to steer the players toward the mixed (uniform) equilibrium. Since the target equilibrium is fully mixed, every action profile (terminal history) is reached with positive probability in equilibrium; therefore, the mediator cannot give positive limiting payments to any action profile because such payments would actually appear in equilibrium. We have made this formal in Appendix D of the new version.
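
To spell out the counting step behind this example in generic notation (this is only a paraphrase; the formal statement is the one in Appendix D): if play converges to the uniform equilibrium of the 2x2 identity-payoff coordination game, each action profile a is played with limiting frequency 1/4, so a payment rule satisfying \liminf_t p_t(a) \ge c > 0 on some profile a would incur average payments

\[
\liminf_{T \to \infty} \; \frac{1}{T} \sum_{t=1}^{T} p_t\big(a^{(t)}\big) \;\ge\; \frac{c}{4} \;>\; 0,
\]

contradicting the vanishing-average-payment requirement; and payments that do vanish in the limit cannot change which strategies are best responses at the fully mixed equilibrium.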

The above shows that without advice, steering to a mixed-Nash equilibrium is in general impossible, already in normal-form games. In extensive-form games challenges related to steering only become more acute. All in all, the discussion above shows that the definition of steering that the reviewer suggests is impossible to achieve.

The current paper mentions in passing in the introduction some of these papers by Balcan et al (see also [1] for a more recent paper) but they fail to mention that these settings work with a more realistic broadcasting device, which only reaches a small random fraction of all agents (e.g. 1%)

There are many crucial differences between our paper and the paper of Balcan et al. [1]. Namely, 1) their results apply only to a limited class of games, while our results apply to general extensive-form games; 2) using the broadcasting device of Balcan et al., which reaches only a small random fraction of all agents, does not suffice to yield optimal equilibria, but rather only yields an approximation of the optimal welfare; and 3) our results apply under a weaker behavioral assumption, namely the no-regret property.

To elaborate more on point 3) above, in Section 3 of the paper of Balcan et al. it is assumed that the dynamics consist of two phases. In the first phase a subset of the players are assumed to follow the advertising strategy, while the rest of the players are assumed to be best responding. These assumptions are quite brittle compared to our behavioral assumption of assuming the no-regret property. Similar limitations apply to the learning protocol described in Section 4 of the paper of Balcan et al.

We have included a discussion on those points (and a comparison to [1]) in the related work section of the revised version.

Why should we care enough for totally non-meaningful no-regret behaviors that do not capture any realistic algorithm either from an optimization or a behavioral point of view to be willing to accept paying an infinite amount of money when finite amount should suffice for all reasonable dynamics?

So here is how to informally stabilize all of them with a finite amount of money. Pad the target strategies with enough money at an initial phase of the algorithm. With a finite excess payoff any reasonable aggregate payoff based dynamics (not even no-regret) after some finite amount of investment will be in the attractor. This is a stronger convergence result since it is actually even guarantees last-iterate convergence (and it does so with a finite amount of money).

First of all, we do not believe that there is a definite answer as to what constitutes "reasonable dynamics." For example, the papers cited by the reviewer do not (to the best of our understanding) capture regret matching, one of the most fundamental and natural learning dynamics. Making stronger assumptions regarding the class of update rules followed by the players would limit the scope of our results, and so instead we chose to treat the problem under the much broader condition of incurring sublinear regret. As we point out in the last paragraph of our related work section, there is a recent literature in mechanism design that operates under this exact assumption.

We continue our response below.

Comment

We next make several remarks about the reviewer's argument on how to stabilize the dynamics with finite payments:

  • Even if we assumed that players follow, say, online gradient descent, the reviewer's argument only applies when the learning rate is independent of the time horizon; however, online gradient descent could incur linear regret (in the adversarial regime) when considering a constant learning rate (see the standard bound sketched after this list). So it seems unrealistic to assume that a player will adopt a parameterization that in general could incur linear regret unless that player happens to know that other players will also use the same dynamics. The same applies more generally to algorithms deriving from either the mirror descent or the FTRL framework: using the parameterization required to guarantee the no-regret property is incompatible (to the best of our knowledge) with a finite budget. We do not believe that there is a good justification for assuming that a player will follow an algorithm that does not even guarantee the no-regret property.
  • The reviewer's proposed method would work in the full-feedback or normal-form case with constant step size, but it may fail in the extensive-form setting with bandit feedback. In the bandit extensive-form setting, it is possible for there to be infosets that are unreached in the desired equilibrium. Let's say, for the sake of concreteness, that P1 has an infoset I that is unreached in equilibrium (i.e., the other players avoid I). Then, in the reviewer's scheme, P1 would eventually stop receiving any reward signal at I and thus its strategy will stop changing---possibly well before P1's strategy at I comes close to the desired strategy. Thus, when the payments are subsequently removed, since P1's strategy at I is different from the desired strategy, it may fail to induce the correct incentives for the other players, and therefore other players may start avoiding the equilibrium. In some sense, this issue of the existence of unreached infosets in the extensive-form bandit setting is precisely the reason why that setting is more involved than the other settings we study!
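
As a reference point for the first bullet above (a standard textbook bound, not a result from our paper): for convex losses with gradient norm at most G over a feasible set of diameter D, online gradient descent with a fixed step size \eta guarantees only

\[
\mathrm{Reg}(T) \;\le\; \frac{D^2}{2\eta} \;+\; \frac{\eta G^2 T}{2},
\]

which is linear in T for any constant \eta chosen independently of the horizon; a simple alternating-gradient adversary in one dimension indeed forces \Omega(\eta T) regret in that regime, whereas the horizon-dependent choice \eta \propto 1/\sqrt{T} (or a decreasing schedule \eta_t \propto 1/\sqrt{t}) recovers O(\sqrt{T}) regret.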

All that said, we agree with the reviewer that characterizing the class of learning dynamics under which the steering problem can be solved with a smaller, perhaps finite, budget is an interesting direction for future work. We have included this question in the concluding section of the revised paper, where we also refer to the papers mentioned by the reviewer pertaining to strict Nash equilibria.

You also have an experiment with Exp3 where clearly a finite amount of money suffices to select the optimal equilibrium. Can you create a negative example where you would need unbounded money for Exp3? If not, then doesn't this show that the T^{\alpha} unbounded payoffs are too loose in comparison to real world lower bounds?

As we explained above, the answer depends on how we set the learning rate. If we use the usual vanishing learning rate, in general it shouldn't be possible to steer with a finite budget.

Comment

I thank the authors for their response.

I particularly appreciate the bold part of their response: "without advice, steering to a mixed-Nash equilibrium is in general impossible, already in normal-form games."

My main question to them is shouldn't this be unambiguously stated (ideally with a similarly bold font) in the introduction?

I.e., we want to solve the problem of steering no-regret algs with pure payoff signals to equilibria.

  1. We solve the problem for the case of pure Nash.
  2. One could naively hope that something similar would be true for mixed Nash/CE etc but due to reasons a, b, c, this is impossible.
  3. Hence we now expand the abilities of the mediator to do the following on top of providing payoffs (Here is how this is related/an improvement upon previous approaches)
  4. Here our results for the case of mixed Nash and other correlated solution concepts.

Again, and this is a direct question from above that you have not answered: regardless of whether it is reasonable or not (I do not disagree here with this point), where in the introduction do you discuss that the mediator has the ability to give recommendations?

Copying from your current introduction:

Summary of Our Results Our formulation enables the mediator to reward players with nonnegative payments. The goal of the mediator is to reach an equilibrium, ....

I find this to be a very important point to disambiguate and if you would be willing to rewrite the introduction to reflect this line of reasoning, that you seem to agree with me in your response above, I would raise my score accordingly to a weak accept.

For the issue of what constitutes reasonable dynamics, I agree that this is in the eye of the beholder and I am willing to concede this point. Not all lower bounds have to reflect realistic cases. But given that the issue of achieving stability with a finite budget is left as an interesting direction for future work for many practical cases, I feel that the contribution does not reach the level of a clear accept.

Comment

We thank the reviewer for the helpful feedback. We have revised our submission to address the reviewer's points. To summarize the key points of the revision: 1) we mention throughout the introduction that the mediator also has the ability to offer advice to the players; 2) we clarify that offering advice is not necessary for steering to pure Nash equilibria (Sections 4 and 5), but is necessary even for mixed Nash equilibria in normal-form games (Appendix D), thereby justifying our assumption in Section 6; and 3) we point out that using advice is in line with prior work, and we provide an extensive overview in Appendix A.

Let us know if our revision has addressed your points.

Comment

Thank you for the quick update. I have increased my score accordingly.

Official Review
Rating: 5

This paper explores the problem of steering no-regret-learning agents to play desirable equilibria in games, using nonnegative payments. The authors present a framework for achieving this goal, and provide theoretical and experimental evidence to support its effectiveness. They also discuss the relationship between this framework and other areas of game theory, such as mechanism design and information design.

Strengths

The authors study an interesting question that might have interesting potential applications in mechanism design (and in my opinion) in federated learning.

Weaknesses

I find the paper to be inaccessible and lacking precision at times (for instance, a central notion of the paper such as optimal equilibrium is not well-defined—is it one that is Pareto-optimal or one that maximizes some notion of welfare?). The experiments also seem unrelated to the theoretical results (pure Nash vs. correlated). The definition of the steering problem in terms of pure Nash equilibria is also odd, since pure Nash equilibria are not guaranteed to exist.

Questions

Comments and questions:

Page 2: full feedback is not defined

Is this problem solved?

Section 3, regarding the regret definition and repeated play: are the time iterations those of some online learning algorithm, or are they the steps of an episode? I think, as is standard in the literature, the authors are considering repeated-play settings, but although this can be understood clearly in the learning literature, I do not think it is clear in this setting. A description of the learning setting would be appreciated.

What is the meaning of the directness gap? I cannot parse the math from the description given; an explanation would be appreciated.

The steering problem is defined for pure-strategy Nash equilibria; when are pure-strategy Nash equilibria guaranteed to exist in extensive-form and normal-form games? How much of the theory provided by the authors applies to mixed Nash equilibria?

Comment

We thank the reviewer for the helpful feedback. Below we address the reviewer's questions and concerns.

The definition of the steering problem in terms of pure Nash equilibria is also odd, since pure Nash equilibria are not guaranteed to exist.

The steering problem is defined pure strategy Nash equilibria, when are pure strategy Nash equilibria guaranteed to exist in extensive form and normal-form games? How much of the theory provided by the authors apply to mixed Nash equilibria?

As we formalize in Section 6, our results directly apply to mixed-strategy Nash equilibria as well as correlated and communication equilibria. While pure-strategy Nash equilibria are not guaranteed to exist in general, as the reviewer correctly points out, our results apply to any extensive-form game through the notion of a mediator-augmented game (Definition 6.1).

The experiments seem also unrelated to the theoretical results (Pure Nash vs. Correlated).

There seems to be a misunderstanding here. As we explained above, our results also directly apply to correlated equilibria (see Section 6 of the paper). As a result, our experiments are directly under the umbrella of our theoretical results.

A central notion to the paper such as optimal equilibrium is not well-defined—is it one that is Pareto-optimal or that maximizes some notion of welfare?

By optimal we mean an equilibrium that, among the set of all equilibria, maximizes the objective of the mediator, as we already introduce in Definition 6.1 of the paper. This objective could be, for example, the social welfare or the revenue of a mechanism, but our results apply for any mediator objective function. Note that Sections 4 and 5 concern steering to a pure Nash equilibrium (without any underlying notion of optimality), which is why our notion of optimality is formally introduced later in Section 6 where we apply our earlier results to steer to optimal equilibria.

What is the meaning of the directness gap?

The directness gap quantifies how close players' strategies are from the direct (i.e., obedient) strategies---as prescribed by the target equilibrium.

Page 2: full feedback is not defined

The full feedback setting is introduced in the beginning of Section 5, and it means that the mediator can observe the players' strategies. We have added a reference to Section 5 in the revised version to help the reader.

Are the time iterations, the time iterations of some online learning algorithm or is this the steps of an episode? I think as is standard in the literature, the author are considering repeated-play settings...

We are indeed operating in the standard setting from the repeated games literature, as we specify in Definition 3.1.

Comment

I thank the authors for their response, I have no further questions at this point.

Comment

Thank you. If we have addressed all of your concerns, would you consider raising your score? If not, could you let us know what concerns remain so that we may have an opportunity to address them?

Comment

My current concern is the presentation of the paper, which I believe requires a thorough re-reading and re-writing. Throughout the paper, there are either expositional inconsistencies or confusing statements. Often, the paper mentions in a sentence that the results apply for some other notion. For instance, the paper states the results directly apply to correlated equilibria too, and experiments are run with correlated equilibria. If that is so, then the mathematical model that is presented should include the definition of a steering equilibrium for all notions of equilibrium that the results apply to. Or for instance, take Section 6:

"We will assume the revelation principle, which allows us to fix a target pure strategy profile that we want to make the equilibrium profile for the non-mediator players."

What is the revelation principle? It is traditionally defined in mechanism-design contexts; why can you just assume it?

The results seem interesting (though I cannot verify any of them intuitively, which might be my lack of knowledge), but the exposition can be made much more rigorous. As a result, I maintain my score.

Comment

The revelation principle is applicable beyond mechanism design as well. Here are some examples, starting with the usual mechanism design version.

  • For mechanism design, the revelation principle states that, WLOG, messages are information reports and all players are honest in equilibrium. The direct strategy d_i for each player is the strategy that always sends honest information.
  • For correlated equilibrium (and other versions thereof, e.g., EFCE), the revelation principle states that, WLOG, the signals sent to the players are action recommendations and players in equilibrium should always play the recommended actions. The strategy d_i for each player i in this case is the strategy that always plays the recommended actions.
  • Similarly, for communication equilibrium, the revelation principle states that players should both send honest information and play action recommendations in equilibrium.

These notions of revelation principle are ubiquitous when dealing with their respective equilibrium concepts, even if often they are not referred to as revelation principles (e.g., the correlated equilibrium revelation principle is often simply taken as part of the definition of correlated equilibrium). We refer the interested reader to any of the cited papers, especially to Appendix A of the cited paper Zhang et al (2023) which Section 6 uses, for more detail about augmented games and the revelation principle. We also commit to adding more detail about the revelation principle, including the above clarifications, in the final version.

Section 6 does define a notion of equilibrium and notion of steering formally, via Definitions 6.1 and 6.2. It is the revelation principle that allows the construction of the augmented game (which only includes honest messages and action recommendations) and the definition of direct strategies to be without loss of generality (i.e., without excluding any equilibria).

Please let us know what else, if anything, you believe can be improved in the presentation. Due to the lack of time remaining, we may not be able to revise during the discussion period, but we promise to take your suggestions into consideration in the final version.

AC Meta-Review

This paper considers the problem of a mediator providing payments to the players of a game with the goal of steering them to an "optimal" equilibrium. The authors consider several different settings and frameworks (both normal-form and extensive-form games, as well as different types of equilibria), and they derive conditions under which the players' time-averaged frequency of play is steered to the desired equilibrium.

This paper generated a rich and extensive discussion at all stages of the reviewing process. On the positive side, the more positively-inclined reviewers appreciated the outcome and generality of the "steering" process. At the same time however, several concerns were raised, both regarding the paper's presentation and readability as well as the actual implications of the authors' results:

  1. First, as several reviewers pointed out, the paper is "often inaccessible and lacks precision at times" – for instance, in the role of the mediator, the payments to the players, the question of pure v. mixed v. correlated equilibria, etc. All in all, a consensus was reached that the paper is trying to do too much too fast, and some of the missing / skipped details can lead to confusing or ambiguous statements.

  2. Regarding the paper's positioning, the authors are not the first to consider the question of learning "optimal" equilibria. Reviewer VPEt provided a lot of relevant input, but the current version is still missing certain classical branches of the literature, such as the work of Pradelski and Young ("Learning efficient Nash equilibria in distributed systems," Games and Economic Behavior, 2012) who showed that a simple modification of trial-and-error learning (Young, 2009) can also lead to optimal equilibria, with no external payments required (see also the literature expanding on these results, e.g., by J. Marden and co-authors).

  3. The authors' desiderata require more justification and motivation: the payment desideratum (S1) actually allows the mediator to "buy off" the players' regret (a fairly strong assumption), while the "steering desideratum" (S2) does not, in general, imply convergence to equilibrium. It was not clear if these desiderata are the best that one can hope for - e.g., if the convergence of the players' actual strategies is impossible (or too difficult, or too costly for the mediator) - but it is important to make their implications clear to the reader.

Overall, the committee felt that the paper contains interesting ideas that are worth pursuing, but the current treatment does not quite clear the bar. Perhaps the biggest obstacle to acceptance is the page limit of ICLR, which seems to hinder the presentation of the authors' setting at the required depth and breadth. For this reason, a decision was reached to make a "reject" recommendation at this stage while encouraging the authors to revise their paper from the ground up and resubmit - possibly to a top journal like JMLR instead of a conference with a similarly restrictive page limit.

Why Not a Higher Score

The paper is trying to fit too much material in too little space, leading to several presentation issues.

Why Not a Lower Score

N/A

Final Decision

Reject