PaperHub
Average Rating: 4.2 / 10 — Rejected (5 reviewers; min 3, max 5, std 1.0)
Individual ratings: 3, 5, 5, 5, 3
Confidence: 3.2 · Correctness: 2.8 · Contribution: 2.0 · Presentation: 3.0
ICLR 2025

Markov Persuasion Processes: Learning to Persuade From Scratch

OpenReview | PDF
Submitted: 2024-09-27 · Updated: 2025-02-05

Abstract

Keywords
Bayesian Persuasion · Online Learning · Markov Persuasion Process

Reviews and Discussion

Review (Rating: 3)

In Bayesian persuasion, a sender sends information to a receiver with the goal of persuading them to take particular actions, while the receiver is aware of the sender’s incentive and has their own goals, and thus may deviate from the recommended action. Typically one designs a “persuasive” communication scheme, meaning it is always in the receiver’s best interest to follow the recommendation. Usually, this occurs in a single-round setting.

The present submission studies Markov Persuasion Processes (MPPs), which essentially combine MDPs with Bayesian persuasion. Specifically, the sender instead interacts with T different receivers sequentially, and actions on previous time steps affect the current state. This paper assumes the sender starts with no information and must learn everything on the fly (transition function, sender reward function, receiver reward function). The goal is then regret which is sublinear in T. This paper also requires the cumulative violation of the persuasiveness constraint to be sublinear in T.

The authors provide algorithms which achieve both of those goals (regret sublinear in T and persuasiveness violation sublinear in T), in both a “full feedback” setting and a “partial feedback” setting. They also provide a matching lower bound.

Strengths

I think this is a nice paper. It’s well-written, the model seems mostly reasonable, the results are satisfying, and there are some interesting technical ideas. Without considering prior work, I would probably lean towards acceptance.

Weaknesses

Unfortunately, I have some serious concerns about the lack of novelty when compared to [1], which previously studied no-regret learning in MPPs. From my reading, the authors articulate four differences between the present submission and [1]:

A. [1] assumed that the receiver’s reward function is known to the sender, while the present submission does not. This is the primary difference emphasized by the authors.

B. [1] assumed rewards are deterministic, while the present submission allows rewards to be stochastic.

C. [1] assumed the reward function is fixed, while the present submission allows the reward function to vary across episodes, but assumes that the reward function for each episode is sampled i.i.d.

D. [1] assumed that receivers know everything about the environment in order to compute a best-response, while the present submission does not.

I commend the authors for providing a clear discussion of the technical differences between their work and [1]; this made my job as a reviewer much easier. Unfortunately, none of (A) - (D) seem especially significant to me. I discuss each in turn.

(A) and (B). I argue that typically in RL theory, most of the technical difficulty comes from learning the transition function, and knowing the reward function doesn’t help much. The distinction between stochastic and deterministic rewards is also typically not significant. To semi-formalize this, I argue below that for {0,1} rewards, any no-regret algorithm for MDPs with known deterministic rewards can also solve MDPs with unknown stochastic rewards.

This reduction breaks down for more general rewards than {0,1}, but I think the conceptual takeaway still holds: that most of the difficulty comes from learning the transition function. Also note that {0,1} rewards are sufficient for most RL lower bounds [2].

Beyond the technical connections, [1] also provide a conceptual justification for known rewards: “The receiver’s utility is known to the sender because the pricing rules are usually transparent, some are even set by the platform. For example, a rider-sharing platform usually sets per hour or mile payment rules for the drivers.” I think this is a reasonable argument, although it certainly doesn’t apply to all applications.

(C). I would guess that since the reward function for each episode is iid, their average quickly converges to the mean, such that this does not add much difficulty. However, I may be wrong, and I could imagine this component adding some novelty to the paper.

(D). If I understand correctly, the present submission instead assumes that the receiver always plays the action recommended by the sender and does not play a best response at all. The authors then show that the cumulative violation of the persuasiveness constraint is sublinear in T. I think this approach is reasonable: as the authors note, it is impossible to be immediately persuasive when the sender starts out with no information about the receiver. However, I don’t think this is a strength over [1]: this model seems easier in some ways and harder in some ways than [1]’s assumption of perfect best response.

Beyond the comparison with [1], I also have some concerns about the model. I’m unsure whether there exist natural applications which both involve Bayesian persuasion and MDP-style dynamics. The authors cite movie recommender systems as a motivating example, but what do the states represent? Receivers would be users, right? If I understand correctly, the sender interacts with a new receiver on every time step, so how does one user affect the state of a different user? It also seems like users’ incentives are pretty aligned with the recommendation system (they want to watch movies that they like), so it seems unlikely for users to engage in reasoning like “the system is recommending me movie A which isn’t what I want to watch but it does give me information about which movie I should actually watch”.

Even if the authors agree with everything above (and I am open to disagreement), I don’t think the paper has zero novelty, but I think the novelty is too low for ICLR.

References

  1. Wu et al. (2022), https://arxiv.org/pdf/2202.10678
  2. Osband and Van Roy (2016), https://arxiv.org/pdf/1608.02732

Reduction sketch: Consider an MDP M with unknown rewards in {0,1} and suppose we have a no-regret algorithm for MDPs with known rewards. Define a new MDP M’ with the original state space plus 2|A||S| additional states, each labeled (s,a,r): one new state for each possible state-action-reward combination. If the agent takes action a in an original state s, we add an intermediate time step where the agent transitions to some state (s,a,r), where r = r(s,a) is chosen stochastically. Regardless of the action taken in state (s,a,r), the agent receives reward r and transitions according to P(s,a) (i.e., how they would have transitioned in M). Basically, every time step in M becomes two time steps in M’, during which the agent obtains the same expected reward. Then E[agent reward in M after T steps] = E[agent reward in M’ after 2T steps]. This means that to solve M, we can simply run our no-regret algorithm on M’ and ignore the actions it takes on even time steps (that is, in (s,a,r) states).
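The reduction lends itself to a small simulator wrapper. The following is a minimal sketch under the assumptions stated above (tabular M, Bernoulli {0,1} rewards); the class and variable names are illustrative, not taken from the paper or from any library. A no-regret algorithm for known-reward MDPs can then be run directly on this wrapper, ignoring the actions it selects in the intermediate (s, a, r) states.

```python
import numpy as np

class AugmentedMDP:
    """Simulates M': each step of M becomes two steps, and the reward of M' is a
    known, deterministic function of the augmented state label (s, a, r)."""

    def __init__(self, transition, reward_mean, seed=0):
        self.P = np.asarray(transition)    # shape (S, A, S): transition of the original MDP M
        self.r = np.asarray(reward_mean)   # shape (S, A): unknown-to-the-learner Bernoulli means
        self.nS = self.P.shape[0]
        self.rng = np.random.default_rng(seed)
        self.state = 0                     # current original state of M
        self.aux = None                    # (s, a, r) when in an intermediate state

    def step(self, action):
        if self.aux is None:
            # First half-step: in an original state s, taking action a samples the
            # realized reward r and moves to the intermediate state (s, a, r).
            s = self.state
            r = int(self.rng.random() < self.r[s, action])
            self.aux = (s, action, r)
            return ("aux", self.aux), 0.0  # no reward paid yet
        # Second half-step: in (s, a, r) the action is ignored, reward r is paid
        # (it is encoded in the state label, hence "known"), and the next original
        # state is drawn from P(s, a), exactly as it would have been in M.
        s, a, r = self.aux
        self.state = int(self.rng.choice(self.nS, p=self.P[s, a]))
        self.aux = None
        return ("orig", self.state), float(r)
```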

Questions

Do you agree with my comparison with [1]? For (A) and (B), do you agree with my reduction? For (C), could you elaborate on whether you believe this difference is significant? Is my understanding of (D) correct?

I would also be curious to hear if the authors have any response to my concerns about the model.

Comment

We agree with the Reviewer's comparison between our work and [1]. Here, we provide a detailed answer to each point.

(A) and (B). I argue that typically in RL theory, most of the technical difficulty comes from learning the transition function, and knowing the reward function doesn’t help much. The distinction between stochastic and deterministic rewards is also typically not significant. To semi-formalize this, I argue below that for {0,1} rewards, any no-regret algorithm for MDPs with known deterministic rewards can also solve MDPs with unknown stochastic rewards. This reduction breaks down for more general rewards than {0,1}, but I think the conceptual takeaway still holds: that most of the difficulty comes from learning the transition function. Also note that {0,1} rewards are sufficient for most RL lower bounds [2].

We agree with the Reviewer’s arguments. However, we would like to point out that the main challenges tackled in our paper do not stem from rewards being stochastic, but rather from the fact that both the transitions and the receiver’s utility functions are unknown, as the Reviewer noticed, and from the particular type of feedback received by the learner in our setting. Notice that knowledge of the receiver’s utility is crucial in our framework to achieve the desired incentive compatibility. In this work, we decided to work with stochastic rewards rather than simplifying to deterministic ones since the notation was not excessively cluttered.

(C). I would guess that since the reward function for each episode is iid, their average quickly converges to the mean, such that this does not add much difficulty. However, I may be wrong, and I could imagine this component adding some novelty to the paper.

The Reviewer is right: since the reward functions are i.i.d. across episodes, their averages converge quickly, so it is easy to achieve an $O(T^{2/3})$ regret/violation bound. The main challenge in our setting is how to guarantee $O(T^{\alpha})$ regret and $O(T^{1 - \alpha/2})$ violation for each $\alpha \in [1/2, 1]$. Indeed, this was still an open problem in similar settings. Notice that Bernasconi et al. (2022) achieve such guarantees only when $\alpha \in [1/2, 2/3]$. Furthermore, the incentive compatibility constraints introduce partial observability in the feedback that the sender receives regarding the constraints at each round. This requires different approaches compared to those used in traditional constrained Markov decision processes.
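For concreteness, instantiating the exponents quoted above at a few values of $\alpha$ (simple arithmetic on the stated bounds, not additional results from the paper), with $R_T$ denoting regret and $V_T$ violation:

$$
\alpha = \tfrac{1}{2}:\ R_T = O(\sqrt{T}),\ V_T = O(T^{3/4});\qquad
\alpha = \tfrac{2}{3}:\ R_T = V_T = O(T^{2/3});\qquad
\alpha = 1:\ R_T = O(T),\ V_T = O(\sqrt{T}).
$$

Bernasconi et al. (2022) cover only the range $\alpha \in [1/2, 2/3]$, i.e., up to the point where the two guarantees coincide at $O(T^{2/3})$.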

(D). If I understand correctly, the present submission instead assumes that the receiver always plays the action recommended by the sender and does not play a best response at all. The authors then show that the cumulative violation of the persuasiveness constraint is sublinear in T. I think this approach is reasonable: as the authors note, it is impossible to be immediately persuasive when the sender starts out with no information about the receiver. However, I don’t think this is a strength over [1]: this model seems easier in some ways and harder in some ways than [1]’s assumption of perfect best response.

The Reviewer is right: the two are different models and present different challenges. While we relax some of the assumptions in [1], such as the assumption that the receivers’ utility functions are known to the sender, at the same time we assume that the receivers follow the recommended actions.

Beyond the comparison with [1], I also have some concerns about the model. I’m unsure whether there exist natural applications which both involve Bayesian persuasion and MDP-style dynamics. The authors cite movie recommender systems as a motivating example, but what do the states represent? Receivers would be users, right? If I understand correctly, the sender interacts with a new receiver on every time step, so how does one user affect the state of a different user? It also seems like users’ incentives are pretty aligned with the recommendation system (they want to watch movies that they like), so it seems unlikely for users to engage in reasoning like “the system is recommending me movie A which isn’t what I want to watch but it does give me information about which movie I should actually watch”.

The main real-world applications of these models are presented in the papers that first introduced them, see, e.g., Section 1 in (Wu et al. 2022). We did not report these examples in our work due to lack of space.

Comment

Thank you for the response, it has helped me understand the paper better. I think the comparison with Wu et al in the current version of the paper is okay, but I think including an even more explicit and direct comparison with Wu et al would make it easier for future reviewers to judge the paper.

I can understand the reasoning behind omitting the discussion of real-world applications due to space constraints, but I think that is pretty important to include, actually more important than some of the technical details. I also understand that the MPP model was defined prior to your work, but I think it is still worth justifying why this is an important problem to work on.

I will maintain my score.

Review (Rating: 5)

This paper studies the Markov persuasion process (MPP) in which the sender does not know the environment, i.e., the transition, prior, and reward functions. They consider two types of feedback: full feedback and partial feedback. Full feedback means that the sender can observe all agents' rewards for the realized state and outcome in every round, but partial feedback means that the sender can only observe agents' rewards for the realized state, outcome, and action in every round.

The authors design algorithms for both kinds of feedback that achieve sublinear regret and violation of persuasiveness. Furthermore, the algorithm for the partial-feedback setting has a matching lower bound.

Strengths

  1. The paper removes the assumption in the previous related paper that the sender knows everything about the environment. Given their result that it is impossible to achieve sublinear regret while being persuasive at every episode with high probability, it's reasonable to turn to the goal that the violation of persuasiveness is sublinear in T. Overall, the question studied by this paper is well motivated and their desiderata are reasonable.

  2. The paper provides a solid theoretical analysis. The authors achieve tight results for the partial feedback setting.

Weaknesses

  1. The setting of this paper is similar to the setting of [1]. To be specific, in this paper, the authors consider MDP with persuasiveness as the stochastic long-term constraint because the receiver's reward function is drawn from some distribution. In [1], the authors consider MDP with constraint matrices, which can be stochastic or adversarial. As a consequence, these two papers share a similar upper-bound technique that maintains estimators and confidence bounds of the environment and uses them to update the occupancy measure. The similarity makes me cast doubt on the novelty and technical contribution of this paper. I think the authors should clarify the difference between these two papers.

  2. While the result is tight in the partial feedback setting, it is not shown in the paper whether the result is tight in the full feedback setting. I think the lower bound in the full feedback setting should be included for the completeness of this paper.

  3. For the proposed algorithm: (1) the computational complexity can be another issue that may prevent MPPs from being used in practice, because a large linear program needs to be solved at every round; (2) the intuition behind the algorithms should be explained more for better understanding.

  4. There is a typo on page 8: it should be "every triple is", not "si".

[1]. Jacopo Germano, Francesco Emanuele Stradi, Gianmarco Genalti, Matteo Castiglioni, Alberto Marchesi, and Nicola Gatti. A best-of-both-worlds algorithm for constrained MDPs with long-term constraints.

Questions

Is the regret bound of the algorithm in the full feedback setting tight?

Comment

The setting of this paper is similar to the setting of [1]. To be specific, in this paper, the authors consider MDP with persuasiveness as the stochastic long-term constraint because the receiver's reward function is drawn from some distribution. In [1], the authors consider MDP with constraint matrices, which can be stochastic or adversarial. As a consequence, these two papers share a similar upper-bound technique that maintains estimators and confidence bounds of the environment and uses them to update the occupancy measure. The similarity makes me cast doubt on the novelty and technical contribution of this paper. I think the authors should clarify the difference between these two papers.

We agree with the Reviewer that both papers use the occupancy measure tool. However, aside from this similarity, the two papers are very different. We outline the main differences below.

  • The algorithm in [1] requires full feedback and does not work under partial feedback, while our main contribution addresses the partial feedback case. Furthermore, even the full-feedback case considered in [1] is much stronger than the one considered in this paper. Specifically, [1] assumes observing the loss of all the possible triplets $(x, \omega, a)$ at the end of the episode, whereas we assume observing only the triplets visited during the episode.

  • From a technical point of view, the two are also unrelated. Indeed, the IC constraints cannot be handled using the technique from [1]. This is because, in our case, the constraints depend on the difference between the action recommended by the sender, $a$, and another action, $a'$, which is the best response (see the schematic constraint below). However, the sender receives feedback only on the receiver's utility for action $a$ and does not observe anything regarding the utility in $a'$. This introduces partial observability in the IC constraints, which cannot be addressed with the techniques from [1], as mentioned by the Reviewer.
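For intuition, a schematic single-state persuasiveness (IC) constraint has the following form; the notation is assumed here for illustration ($\mu$ prior over outcomes, $\phi$ signaling scheme, $u$ receiver utility) and is not necessarily the paper's exact MPP formulation:

$$
\sum_{\omega \in \Omega} \mu(\omega \mid x)\, \phi(a \mid x, \omega)\, \big[\, u(x, \omega, a) - u(x, \omega, a') \,\big] \;\ge\; 0 \qquad \forall\, x, a, a'.
$$

Under our feedback model the sender only observes samples of $u(x, \omega, a)$ for the recommended action $a$, so the counterfactual term $u(x, \omega, a')$ is never observed directly; this is the source of the partial observability in the constraints.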

For the proposed algorithm: (1) the computational complexity can be another issue that may prevent MPPs from being used in practice, because a large linear program needs to be solved at every round; (2) the intuition behind the algorithms should be explained more for better understanding.

(1) Linear programs can generally be solved efficiently. However, designing more efficient algorithms represents an interesting future direction. (2) We did our best to provide all the intuition we could about the algorithms presented in our work. If there is a specific point in the algorithms that is not clear, please let us know.

Is the regret bound of the algorithm in the full feedback setting tight?

We refer the Reviewer to the updated version of the paper, where we present the lower bound for the full-feedback case (see Theorem 6), showing that our result is tight.

Review (Rating: 5)

This paper studies Markov Persuasion Processes (MPPs), a repeated version of Bayesian persuasion, relaxing the common assumption that the sender knows the receivers’ rewards. Unlike previous works, the authors consider a setting in which the sender has no knowledge of the transitions, the prior distributions over outcomes, the sender’s stochastic rewards, or the receivers’ ones. Removing this requirement allows for capturing real-world problems, e.g., online streaming platforms recommending movies to their users. The authors propose a learning algorithm, Optimistic Persuasive Policy Search (OPPS), for the sender with partial feedback and prove that its regret with respect to an optimal information-disclosure policy grows sublinearly in the number of episodes.

Strengths

The paper considers a natural extension of the existing MPP studies by relaxing the assumption of full knowledge about the environment and the players. The problem formulation as well as the objectives of the learning problem are clearly described. Mathematical proofs are also provided that highlight the effectiveness of the proposed algorithm.

Weaknesses

I think the major weakness of this paper is the missing case study. While the authors highlight “practical/real-world application” as the key motivation behind studying this variation of MPP, they fail to provide an example (or a simple toy case). I think the paper would benefit from a case study as it would also help readers to understand the problem better.

Questions

  • I am not an expert in this topic, but from what I gathered, the setup in the paper resembles a general-sum game, is that correct? If so, could it be also extended to model a zero-sum game in which it may not be in the receivers' best interest to follow the sender?

  • In line 383:

...the sender does not observe sufficient feedback about the persuasiveness of $\phi_t$.

  • Could the authors comment on what feedback is necessary to infer persuasiveness, and why the information about only the agents' rewards for the visited triplets is not enough?

Comment

I think the major weakness of this paper is the missing case study. While the authors highlight “practical/real-world application” as the key motivation behind studying this variation of MPP, they fail to provide an example (or a simple toy case). I think the paper would benefit from a case study as it would also help readers to understand the problem better.

We agree with the Reviewer; we simply did not add a real-world example due to space constraints. However, the main real-world applications of these models are presented in the papers that first introduced them, see, e.g., Section 1 in (Wu et al. 2022).

I am not an expert in this topic, but from what I gathered, the setup in the paper resembles a general-sum game, is that correct? If so, could it be also extended to model a zero-sum game in which it may not be in the receivers' best interest to follow the sender?

We believe that our model is quite different from general-sum games, and that the game-theoretic model closest to ours is represented by Stackelberg games. Indeed, while in general-sum games the two agents may play simultaneously, our framework includes a commitment stage, which is not present in general-sum or zero-sum games.

In line 383: ...the sender does not observe sufficient feedback about the persuasiveness of $\phi_t$. Could the authors comment on what feedback is necessary to infer persuasiveness, and why the information about only the agents' rewards for the visited triplets is not enough?

To be perfectly persuasive, the sender should know both the prior distribution and the receiver's utility function exactly, whereas in our paper the sender only observes samples from the probability distributions that define these quantities.

Comment

Thank you for the response. To clarify, my concern wasn't so much the absence of a "real-world" example, but rather the lack of a toy case to introduce the problem and illustrate the optimal policy. This would help readers get a better perspective of what's actually going on. In that regard, I would agree with one of the other reviewers who pointed out that it is sometimes more important to discuss the applications than the technical details, and would add that working through a simple example would be more illuminating.

Review (Rating: 5)

This paper studies the online learning version of the Markov Persuasion Process (MPP). An MPP involves a sender and a receiver interacting for $L$ time steps, where at each time step, with state $x_k$ commonly observed by both players, the sender sees an additional outcome $\omega_k \sim \mu(x_k)$ (only observed by the sender) and sends a signal (action recommendation) to the receiver according to some signaling scheme; then the receiver takes an action that affects the rewards of both players and the state transition. The authors assume that the sender does not know any information about the environment (the prior $\mu$, the transition matrix, the receiver's reward, and their own reward). Repeating this MPP $T$ times, the sender aims to learn the environment and the optimal signaling scheme over time. The main results include:

  • With full feedback (which includes the realized rewards of all actions at each round), the authors design a learning algorithm that achieves $O(L^2 |X| \sqrt{T |A| |\Omega| \ln(1/\delta)})$ regret for the sender, and $O(L^2 |X| \sqrt{T |A| |\Omega| \ln(1/\delta)})$ violation of the persuasiveness constraint (i.e., the regret for the receiver if the receiver follows the action recommendation).
  • With partial feedback (which includes the reward of the taken action only), the authors design a learning algorithm with $O(T^{\alpha})$ regret for the sender and $O(T^{1-\alpha/2})$ violation for the receiver, where $\alpha \in [1/2, 1]$.
  • A lower bound showing that no learning algorithm can achieve regret $o(T^{\alpha})$ and violation $o(T^{1-\alpha/2})$.

Strengths

(S1) This paper provides a relatively complete picture for the online MPP problem, providing upper and lower bounds on the sender's regret and the receiver's violation that are tight in terms of $T$.

(S2) Compared to previous works in online MPP that all assume that the sender has some type of knowledge of the environment, this work makes no such assumption. This is a significant conceptual contribution.

Weaknesses

(W1) Although this paper provides a relatively complete picture for the online MPP problem, most of the techniques are standard and come directly from previous works. The upper bound techniques come from the constrained Markov Decision Process (MDP) literature, e.g., Germano et al., 2023, as the authors mentioned. The MPP problem is basically a constrained MDP with a linear objective and a linear constraint (maximizing the sender's expected reward subject to a persuasiveness constraint), for which the constrained MDP literature already provides satisfactory solutions.

The lower bound is basically the same as in the previous paper Bernasconi et al. (2022), except that the previous bound only works for $\alpha \in [1/2, 2/3]$ while this paper improves it to $\alpha \in [1/2, 1]$.

(W2) Although the regret bounds in this work are tight in $T$, there is still a large gap between the upper and lower bounds in terms of the other parameters $L$, $|X|$, $|\Omega|$, $|A|$. The lower bound does not contain those parameters, while the upper bound has polynomial dependency on them. In the full information setting, for example, an upper bound of the form $O(\text{parameter} \cdot \sqrt{T})$ is typical in the online learning literature (in particular, it can be easily derived from previous constrained MDP techniques), and a lower bound like $\Omega(\sqrt{T})$ is also expected. Since this work aims to give a complete picture for the online MPP problem, I think it is important to characterize the tight dependency of the regret on parameters other than $T$ as well. Moreover, given that there are already several previous works on online Bayesian persuasion and MPP (like Wu et al. (2022)), only characterizing the regret dependency on $T$ does not seem to be a large enough contribution to the existing literature.

Questions

Question:

(Q1) How is your work different from the constrained MDP works you cited? What's the additional contribution?

(Q2) Can you provide a lower bound that includes the parameters of the environment?

Suggestions:

  • Typo: Line 428: "si" -> "is".

Comment

Although this paper provides a relatively complete picture for the online MPP problem, most of the techniques are standard and come directly from previous works. The upper bound techniques come from the constrained Markov Decision Process (MDP) literature, e.g., Germano et al., 2023, as the authors mentioned. The MPP problem is basically a constrained MDP with a linear objective and a linear constraint (maximizing the sender's expected reward subject to a persuasiveness constraint), for which the constrained MDP literature already provides satisfactory solutions.

The results in Germano et al. (2023) are developed for a feedback that is much stronger than what we refer to as full feedback in our work. Precisely, in Germano et al. (2023), the learner observes the reward and constraint samples for every path. Thus, their techniques cannot be employed in our work. Furthermore, the IC constraints cannot be handled using the technique from Germano et al. (2023), or any other CMDP algorithm. Indeed, in our setting, the constraints depend on the difference between the action recommended by the sender, $a$, and another action, $a'$, which is the best response. However, the sender receives feedback only on the receiver's utility for action $a$ and does not observe anything regarding the utility in $a'$. This introduces partial observability in the IC constraints, which cannot be addressed with the techniques from Germano et al. (2023), or any other CMDP algorithm, as mentioned by the Reviewer.

Although the regret bounds in this work are tight in $T$, there is still a large gap between the upper and lower bounds in terms of the other parameters $L$, $|X|$, $|\Omega|$, $|A|$. The lower bound does not contain those parameters, while the upper bound has polynomial dependency on them. In the full information setting, for example, an upper bound of the form $O(\text{parameter} \cdot \sqrt{T})$ is typical in the online learning literature (in particular, it can be easily derived from previous constrained MDP techniques), and a lower bound like $\Omega(\sqrt{T})$ is also expected. Since this work aims to give a complete picture for the online MPP problem, I think it is important to characterize the tight dependency of the regret on parameters other than $T$ as well. Moreover, given that there are already several previous works on online Bayesian persuasion and MPP (like Wu et al. (2022)), only characterizing the regret dependency on $T$ does not seem to be a large enough contribution to the existing literature.

The main goal of our work was to both relax some of the limiting assumptions made by Wu et al. (2022), such as the assumption that the sender knows the receivers’ utility function, and design an algorithm that guarantees $O(T^{\alpha})$ regret and $O(T^{1 - \alpha/2})$ violation for each $\alpha \in [1/2, 1]$. Both of these were open problems, which we address in our paper. Understanding the tightness of our algorithm with respect to all its parameters is an interesting future direction, but we believe it was not the primary point to be addressed.

How is your work different from the constrained MDP works you cited? What's the additional contribution?

Please refer to previous questions.

Can you provide a lower bound that includes the parameters of the environment?

It is an interesting direction that we want to explore in the future. At the moment, we don’t actually know if these constants are tight or not.

Comment

Thank the authors for the response!

Now I agree that this work is different from the CMDP literature. It might be worthwhile to compare this work with the CMDP literature more explicitly in Appendix A. However, I still have other concerns:

  • Regarding the tightness of the lower bound, the authors say that "it was not the primary point to be addressed", and that the main goal of this work was to "both relax the limiting assumptions made by Wu et al. (2022), such as the assumption that the sender knows the receivers’ utility function, and design an algorithm that guarantees $O(T^{\alpha})$ regret and $O(T^{1-\alpha/2})$ violation for each $\alpha \in [1/2, 1]$". However, this main goal seems a bit incremental. As reviewer AiNi mentioned, a more explicit comparison with Wu et al. (2022) might be needed for readers to see the novelty of this work. And the trade-off between $O(T^{\alpha})$ regret and $O(T^{1-\alpha/2})$ violation has already been noticed by Bernasconi et al. (2022) [Sequential information design: Learning to persuade in the dark]. Although the present work improves on previous work by extending the analysis from $\alpha \in [1/2, 2/3]$ to $\alpha \in [1/2, 1]$, this contribution still seems too incremental.

I read other reviews and agree with two other concerns about this work:

  • One is the lack of real-world motivation for MPP (raised by both w5p2 and AiNi).

  • The other is AiNi's comment (D) "The present submission assumes that the receiver always plays the action recommended by the sender and does not play a best response at all. The authors then show that the cumulative violation of the persuasiveness constraint is sublinear in T. I think this approach is reasonable: as the authors note, it is impossible to be immediately persuasive when the sender starts out with no information about the receiver. However, I don’t think this is a strength over [1]: this model seems easier in some ways and harder in some ways than [1]’s assumption of perfect best response." This seems to be a limitation of this work, because there is no guarantee that the receiver will actually follow the action recommendation if the recommendation is only approximately persuasive.

Review (Rating: 3)

This paper considers learning in a Markov persuasion process (MPP) where the learner cannot take actions directly and must persuade a receiver to take the intended action. The focus is on the setting where the learner is not aware of the receiver's reward function and must guarantee that both the regret in total reward and the violation of the persuasion constraints are sublinear in $T$. The authors consider two settings -- (a) a full feedback setting where the learner observes the rewards for all the actions at a time-step, and (b) a partial feedback setting where the learner observes rewards only for the chosen action.

The authors propose an optimistic-planning-based algorithm which finds a persuasive policy by being optimistic according to the sender's rewards and the receiver's persuasion constraints. For the full feedback setting the reward regret and violation of constraints are $O(\sqrt{T})$, whereas for the partial feedback setting there is a trade-off between regret in reward and violation of constraints.

Strengths

The authors consider an interesting extension of MPP in which the sender is unaware of the receiver's rewards and must learn them over time. The algorithms and the main ideas are explained clearly in the paper. It's also interesting to see that the authors provide a trade-off between the reward regret and the violation of persuasion constraints.

Weaknesses

  1. The authors claim that it is impossible to satisfy persuasive constraints at all rounds. However, the negative result is provided only for the partial feedback setting, and I wonder if one can avoid the negative result for the full feedback setting. If so, it makes more sense to study a version of MPP where the constraints are satisfied with high probability. In general, I favour the guarantee of high-probability persuasion over providing only an upper bound on the total amount of violation. Since we cannot make any assumption about the receiver's preferences, it is not clear whether the receivers will still follow the sender's recommendation.

  2. The proposed algorithms and the proof techniques are very similar to prior work (in particular Bernasconi et al. 2022 or Gan et al. 2024). All of these works propose explore-then-exploit (optimistic-planning-based) algorithms for persuasion problems. So the authors need to clarify what the technical difficulties in the analysis in this paper are.

  3. It is not clear that the optimistic planning problem can be solved efficiently, i.e., in $\text{poly}(|X|, |A|, L)$ time.

  4. Finally, the authors consider only tabular MDPs. The original paper on MPP did consider more general MDPs and without such extensions, it is not clear whether the result generalizes.

Questions

  1. What is the trade-off between reward regret and constraint violation for the full feedback setting?

  2. What is the time-complexity of Opt-Opt problem (2)?

  3. Can you generalize your results beyond tabular MDP?

Comment

The authors claim that it is impossible to satisfy persuasive constraints at all rounds. However, the negative result is provided only for the partial feedback setting, and I wonder if one can avoid the negative result for the full feedback setting. If so, it makes more sense to study a version of MPP where the constraints are satisfied with high probability. In general, I favour the guarantee of high-probability persuasion over providing only an upper bound on the total amount of violation. Since we cannot make any assumption about the receiver's preferences, it is not clear whether the receivers will still follow the sender's recommendation.

We thank the Reviewer for the opportunity to properly tackle this aspect. In our setting, it is not possible to guarantee zero violations with high probability if the goal is to design a no-regret algorithm, even when full feedback is available. We updated the paper, proving this lower bound for the full-feedback setting (see Theorem 6). Please refer to the updated version of the paper for the formal proof.

The proposed algorithms and the proof techniques are very similar to prior work (in particular Bernasconi et al. 2022 or Gan et al. 2024). All of these works propose explore-then-exploit (optimistic-planning-based) algorithms for persuasion problems. So the authors need to clarify what the technical difficulties in the analysis in this paper are.

Approaches such as explore-then-commit are well known in the literature. In our setting, the main challenge is how to guarantee $O(T^{\alpha})$ regret and $O(T^{1 - \alpha/2})$ violation for every $\alpha \in [1/2, 1]$. Indeed, Bernasconi et al. (2022) achieve such guarantees only when $\alpha \in [1/2, 2/3]$, while Gan et al. (2024) do so only when $\alpha = 2/3$. On the contrary, our work is the only one to completely close the gap between the upper and lower bounds. This result is attainable by employing a different kind of exploration phase w.r.t. Bernasconi et al. (2022) and Gan et al. (2024): we continuously shrink the decision space before the beginning of the exploitation phase. We believe this is a remarkable contribution of our approach.

Furthermore, while we agree with the Reviewer that our algorithmic approach is similar to existing ones, we emphasize that our technical analysis significantly differs from those of the works pointed out by the Reviewer. Specifically:

  • Bernasconi et al. (2022) do not need to address the uncertainty associated with the transition model (note that they do not work in a Markovian setting) and the receiver's utility function.

  • Gan et al. (2024) employ an exploration procedure adopted from Jin et al. (2020). However, this exploration procedure does not guarantee a tight $O(T^{\alpha})$ regret and $O(T^{1 - \alpha/2})$ violation for every $\alpha \in [1/2, 1]$. To do so, we need to maximize the probability of visiting each triplet $(x, \omega, a)$ $N$ times during the initial phase, in order to continue shrinking the decision space during the commitment phase.

What is the trade-off between reward regret and constraint violation for the full feedback setting?

As specified in the answer above, the $O(\sqrt{T})$ regret/violation bound is tight even in the full feedback case. Please refer to the updated version of the paper.

What is the time-complexity of Opt-Opt problem (2)?

Problem (2) (Opt-Opt) is a linear program, so it can be solved efficiently, in $\text{poly}(|X|, |A|, L)$ time.
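For readers unfamiliar with occupancy-measure formulations, a generic program of this kind (for a loop-free episodic MDP, with persuasiveness written as additional linear constraints) looks as follows; this is a hedged sketch of the general technique, not the paper's exact Problem (2), whose optimistic confidence sets and outcome structure add further linear constraints:

$$
\begin{aligned}
\max_{q \ge 0} \quad & \sum_{x, a} q(x, a)\, r(x, a) \\
\text{s.t.} \quad & \sum_{a} q(x, a) \;=\; \sum_{x', a'} P(x \mid x', a')\, q(x', a') \qquad \forall x \ \text{(flow conservation)} \\
& \sum_{x \in X_1,\, a} q(x, a) \;=\; 1 \qquad \text{(initial layer)} \\
& \text{linear persuasiveness constraints on } q.
\end{aligned}
$$

The number of variables is $O(|X||A|)$ (or $O(|X||\Omega||A|)$ once outcomes are included) and the number of constraints is polynomial in $|X|$, $|\Omega|$, $|A|$, and $L$, so standard LP solvers run in polynomial time.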

Can you generalize your results beyond tabular MDP?

Extending the results to general MDPs is an interesting future direction. We conjecture that, with some work, our approach can be generalized to achieve an $O(T^{2/3})$ regret and violation bound in general MDPs. However, achieving $O(T^{\alpha})$ regret and $O(T^{1 - \alpha/2})$ violation, for $\alpha \in [1/2, 1]$, as done in our paper for the tabular case, would require new ideas and a different approach.

[Chi Jin, Akshay Krishnamurthy, Max Simchowitz, and Tiancheng Yu. Reward-free exploration for reinforcement learning. In International Conference on Machine Learning, pages 4870–4879. PMLR, 2020.]

AC Meta-Review

The reviewers acknowledged that the paper investigates an interesting problem setting in sequential Bayesian persuasion where a sender has limited information about the receiver and provides a thorough theoretical analysis for this setting. However, the reviewers pointed out several weaknesses and shared common concerns related to the limited novelty w.r.t. the prior work. We want to thank the authors for their detailed responses. Based on the raised concerns and follow-up discussions, unfortunately, the final decision is a rejection. Nevertheless, the reviewers have provided detailed and constructive feedback. We hope the authors can incorporate this feedback when preparing future revisions of the paper.

Additional Comments from Reviewer Discussion

The reviewers pointed out several weaknesses and shared common concerns related to the limited novelty w.r.t. the prior work. There was a consensus among the reviewers that the paper is not ready for acceptance.

Final Decision

Reject