PaperHub
Overall rating: 4.0/10 (Rejected)
3 reviewers, ratings 2, 4, 1 (min 1, max 4, std 1.2)
ICML 2025

Achieve Performatively Optimal Policy for Performative Reinforcement Learning

OpenReview | PDF
Submitted: 2025-01-24 | Updated: 2025-06-18
TL;DR

This work proposes the first algorithm that converges to the desired performatively optimal policy with polynomial computation complexity for performative reinforcement learning.

Abstract

Performative reinforcement learning is an emerging dynamic decision-making framework, which extends reinforcement learning to common applications where the agent's policy can change the environmental dynamics. Existing works on performative reinforcement learning only aim at a performatively stable (PS) policy that maximizes an approximate value function. However, there can be a positive constant gap between the PS policy and the desired performatively optimal (PO) policy that maximizes the original value function. In contrast, this work proposes a zeroth-order performative policy gradient (0-PPG) algorithm that **for the first time converges to the desired PO policy with polynomial computation complexity under mild conditions**. For the convergence analysis, we prove two important properties of the nonconvex value function. First, when the policy regularizer dominates the environmental shift, the value function satisfies a certain gradient dominance property, so that any stationary point of the value function is a desired PO policy. Second, although the value function has an unbounded gradient, we prove that all sufficiently stationary points lie in a convex and compact policy subspace $\Pi_{\Delta}$, where every policy probability has a constant lower bound $\Delta>0$ and thus the gradient becomes bounded and Lipschitz continuous.
Keywords
performative reinforcement learning, performatively optimal

Reviews and Discussion

Review (Rating: 2)
  • Proposed Algorithm: This work introduces a zeroth-order performative policy gradient (0-PPG) algorithm that converges to the PO policy with polynomial computational complexity under mild conditions.
  • Key Theoretical Properties:
    • When the policy regularizer dominates the environmental shift, the value function exhibits a gradient dominance property, meaning any stationary point is a PO.
    • Although the value function may have unbounded gradients, all sufficiently stationary points lie within a convex and compact policy subspace $\Pi_\Delta$, where the policy value is bounded below by $\Delta > 0$, ensuring the gradient is both bounded and Lipschitz continuous.

Questions For Authors

My current score is approximately 2.5 (I cannot choose between 2 and 3, so I have set it at 2 for now), primarily due to the following questions, especially 2 and 4. I am willing to increase my score if the following issues are addressed:

  1. Clarification on Existing Work: Could the authors specify why previous research in performative RL has focused solely on the PS policy?

  2. Interpretation of Regularized Value Function: In both the abstract and the remark following Theorem 1, the analysis suggests that when the policy regularizer dominates the environmental shift, the value function exhibits a gradient dominance property, which is intuitively appealing. However, I am concerned about the practical significance of the optimal policy derived from this regularized value function. Since the policy regularizer may impede convergence to the true optimal policy (thereby affecting generalization), if it dominates the environmental shift, does this imply that the optimal policy is biased towards a more uniform distribution? If so, this might render the primary contribution somewhat trivial.

  3. Insights from Theorem 3: Could the authors elaborate on the key takeaways of Theorem 3? The Lipschitz continuity property appears to be a direct consequence of Assumptions 1 through 3. It would be helpful to understand how the upper bound is affected by the parameters $L$ and $\ell$.

  4. Experimental Validation: Has the proposed approach been tested empirically? Given that the paper introduces several convergence theorems, including experimental results—perhaps on a simple environment like a grid world—would strengthen the manuscript by demonstrating the convergence behavior of the PO policy.

Claims And Evidence

Evidence well supports the claims

Methods And Evaluation Criteria

Proposed methods make sense, and there are no experiments.

Theoretical Claims

I have not taken a close look at all proofs, but the takeaways, intuitions, and remarks after the theorems all make sense (at least to me).

Experimental Designs Or Analyses

There are no experiments.

Supplementary Material

No.

Relation To Broader Scientific Literature

This paper looks promising since it has done a theoretical analysis on PO convergence. However, the second concern that I have written down on [Questions For Authors] may challenge the novelty of this paper.

Essential References Not Discussed

No.

Other Strengths And Weaknesses

Strengths:

  • I like how the authors provide a key takeaway for each theorem (like the remark after Theorem 1, the implications after Theorem 2, and the remark after Proposition 1). This makes the reading more comfortable.

Weaknesses:

  • Please refer to the [Questions For Authors]

Other Comments Or Suggestions

Please refer to the [Questions For Authors]

Author Response

Clarification on Existing Work: Could you specify why previous research in performative RL has focused solely on the PS policy?

A: Great question. There are two reasons. First, the method to obtain a performatively stable (PS) policy is more straightforward to think of than a method to obtain a performatively optimal (PO) policy. To elaborate, since a PS policy $\pi_{PS}$ is only required to be optimal in its corresponding fixed environment $(p_{\pi_{PS}}, r_{\pi_{PS}})$, we can obtain a PS policy by repeated training, i.e., by applying traditional policy optimization methods to a fixed environment. In contrast, a PO policy $\pi_{PO}$ is required to have a larger value in the environment $(p_{\pi_{PO}}, r_{\pi_{PO}})$ than the value of any policy $\pi$ in its own environment $(p_{\pi}, r_{\pi})$, so we cannot use traditional policy optimization methods. Second, the distance between a PS policy and a PO policy is $\mathcal{O}(\epsilon_p+\epsilon_r)$, where $\epsilon_p$ and $\epsilon_r$ are the Lipschitz constants of the Lipschitz continuous $p_{\pi}$ (transition kernel) and $r_{\pi}$ (reward), so a PS policy approximates a PO policy well in a slowly changing environment with small $\epsilon_p$ and $\epsilon_r$ (Mandal et al., 2023).
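To make the repeated-training idea concrete, here is a minimal illustrative sketch (not the authors' code); `env_response` and `solve_mdp` are hypothetical placeholders for the policy-dependent environment map and a standard MDP solver:

```python
def repeated_training(pi0, env_response, solve_mdp, num_rounds=50):
    """Illustrative repeated-training loop: at each round, freeze the
    environment induced by the current policy and solve that fixed MDP
    with a standard policy-optimization method. Its fixed point is a
    performatively stable (PS) policy, which is generally not the
    performatively optimal (PO) policy."""
    pi = pi0
    for _ in range(num_rounds):
        p, r = env_response(pi)   # environment shifts in response to pi
        pi = solve_mdp(p, r)      # ordinary RL on the frozen environment (p, r)
    return pi
```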

Interpretation of Regularized Value Function: I am concerned about the practical significance of the optimal policy from this regularized value function. Since the policy regularizer may impede convergence to the true optimal policy (thereby affecting generalization), if it dominates the environmental shift, does this imply that the optimal policy is biased towards a more uniform distribution? If so, this might render the primary contribution somewhat trivial.

A: Great question. The answer is partially yes. The optimal policy for the regularized objective is closer to the uniform policy than the optimal policy for the unregularized setting. However, we do not view this as a bias, because the optimal policy of the entropy-regularized setting is itself an important target. To elaborate, entropy regularization has been demonstrated to make the policy robust against perturbations to the environment (transition kernel and reward), thereby improving generalization [1], and to encourage the agent to explore unknown environments, yielding a better exploration-exploitation trade-off (Mnih et al., 2016; Mankowitz et al., 2019; Cen et al., 2022; Chen and Huang, 2024). As our algorithm converges to this important target policy, the regularized optimal solution, we do not consider it a bias.

[1] Eysenbach, Benjamin, and Sergey Levine. "Maximum Entropy RL (Provably) Solves Some Robust RL Problems." International Conference on Learning Representations (2022).

Insights from Theorem 3: Could you elaborate on the key takeaways of Theorem 3? The Lipschitz continuity property appears to be a direct consequence of Assumptions 1 through 3. It would be helpful to understand how the upper bound is affected by the parameters $L$ and $\ell$.

A: The key takeaway of Theorem 3 is that the objective function $V_{\lambda,\pi}^{\pi}$ is Lipschitz continuous and Lipschitz smooth in the domain $\pi\in\Pi_{\Delta}=\{\pi\in\Pi:\pi(a|s)\ge\Delta\}$. You may have misunderstood Theorem 3. First, the Lipschitz property follows from Assumptions 1-2, but the proof is not very straightforward. Second, the upper bounds for Lipschitz continuity and Lipschitz smoothness are proportional to $L_{\lambda}$ and $\ell_{\lambda}$ (not $L$ and $\ell$), defined by Eqs. (22) and (23) respectively. $L_{\lambda}$ and $\ell_{\lambda}$ depend on problem-related constants such as $|\mathcal{S}|$, $|\mathcal{A}|$, $\gamma$, $\lambda$, $\epsilon_p$, $\epsilon_r$, not on tunable parameters.

Experimental Validation: Has the proposed approach been tested empirically?

A: Good question. We compared our Algorithm 1 with the existing repeated training algorithm in a simulation environment with 5 states, 4 actions, discount factor $\gamma=0.95$, entropy regularizer coefficient $\lambda=0.5$, transition kernel $p_{\pi}(s'|s,a)=\frac{\pi(a|s)+\pi(a|s')+1}{\sum_{s''}[\pi(a|s)+\pi(a|s'')+1]}$, and reward $r_{\pi}(s,a)=\pi(a|s)$. We implement our Algorithm 1 for 400 iterations with $N=1000$, $\beta=0.01$, $\Delta=10^{-3}$, $\delta=10^{-4}$, and value functions evaluated by value iteration. The repeated training algorithm obtains the next policy $\pi_{t+1}$ by applying the natural policy gradient algorithm [1] with 100 steps and stepsize 0.01 to the entropy-regularized reinforcement learning problem with transition kernel $p_{\pi_t}$ and reward $r_{\pi_t}$. Both algorithms start from the uniform policy (i.e., $\pi(a|s)\equiv 1/4$). Our experimental results at the anonymous website https://docs.google.com/document/d/1bH3eEoGhfDwq1NBNW7_zjCSLvvmcUyDusaINivK5bdo/edit?tab=t.0 show that the existing repeated training algorithm gets stuck at the initial policy, which is performatively stable but not performatively optimal, while our Algorithm 1 converges to a much larger objective function value.
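For readers who want to reproduce this toy setup, a minimal sketch of the described environment in NumPy follows (our own illustrative code; the indexing convention `pi[s, a] = pi(a|s)` and the function name are assumptions, not the authors' implementation):

```python
import numpy as np

S, A = 5, 4              # 5 states, 4 actions
gamma, lam = 0.95, 0.5   # discount factor, entropy-regularizer coefficient

def environment(pi):
    """Policy-dependent environment from the rebuttal:
    p_pi(s'|s,a) = (pi(a|s) + pi(a|s') + 1) / sum_{s''}(pi(a|s) + pi(a|s'') + 1)
    and r_pi(s,a) = pi(a|s), where pi[s, a] = pi(a|s) and each row sums to 1."""
    p = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            weights = pi[s, a] + pi[:, a] + 1.0   # one weight per next state s'
            p[s, a] = weights / weights.sum()
    r = pi.copy()                                  # r_pi(s, a) = pi(a|s)
    return p, r

pi_uniform = np.full((S, A), 1.0 / A)              # both algorithms start here
p0, r0 = environment(pi_uniform)
```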

Review (Rating: 4)

The paper studies the problem of performative reinforcement learning, where the choice of policy actively influences the dynamics in the environments (transitions) as well as the rewards.

The authors introduce the first algorithm that provably converges to the performatively optimal (not merely stable) policy under standard regularity conditions.

Questions For Authors

NA

Claims And Evidence

Yes, all the claims are well supported.

Methods And Evaluation Criteria

The main contributions of the paper are theoretical. Their analysis makes sense.

Theoretical Claims

I did not.

Experimental Designs Or Analyses

NA

Supplementary Material

No

Relation To Broader Scientific Literature

The paper makes an excellent contribution to the growing area of performative prediction and performative reinforcement learning. To date, there was no known algorithm that one could show convergence to the performatively optimal solution.

Their results mirror a similar story developed in the classical performative prediction literature over the last few years where initially people only knew of algorithms that would converge to stable points. Then, in 2021, Miller et al introduced the first set of conditions under which the performative risk was convex, and designed algorithms which converged to the performatively optimal solution.

This result completes a similar arc for the performative reinforcement learning setting which is substantially more complicated than that initially considered by Perdomo et al in their paper on performative prediction. This is a very nice result that will be of interest to the community. Here, gradient dominance is somehow the analogous structural condition to convexity in the standard setup.

Essential References Not Discussed

The relevant literature is appropriately cited. It might be nice to tell a bit of this story above around how their results contribute to the broader literature on performative prediction, but this is really up to the authors.

Other Strengths And Weaknesses

Convergence to optimality, not stability, is a real strength of the paper. The analysis is substantial and involved but the authors do a good job of providing intuition. I think the paper would be even better if they give a broader overview of performativity and spend a bit more time delving into the intuition for their proofs. For instance, readers may not be familiar with these kinds of gradient dominance conditions and a gentler review of why these conditions are useful and where they have been previously studied in the literature (e.g. LQR) could be very nice.

Other Comments Or Suggestions

NA

Author Response

Essential References Not Discussed: The relevant literature is appropriately cited. It might be nice to tell a bit of this story above around how their results contribute to the broader literature on performative prediction, but this is really up to the authors.

A: Thank you very much for telling the story showing our contribution to the broader literature on performative prediction. We are glad to add this story to our revision.

Other Strengths And Weaknesses: I think the paper would be even better if they give a broader overview of performativity and spend a bit more time delving into the intuition for their proofs. For instance, readers may not be familiar with these kinds of gradient dominance conditions and a gentler review of why these conditions are useful and where they have been previously studied in the literature (e.g. LQR) could be very nice.

A: Thanks for your suggestion. We will add a discussion of related works, including those on performative prediction. We will stress that the central idea of both performative prediction and performative reinforcement learning is performativity, meaning that the data distribution can be affected by the decision, as observed in many applications.

We will elaborate more on gradient dominance right after our Theorem 1, as you suggested. Specifically, when $\mu\ge 0$, our Theorem 1 implies the following gradient dominance result widely used in reinforcement learning [1,2]:
$$f(\pi^*)-f(\pi)\le C_1\max_{\pi'\in\Pi}\big\langle \nabla f(\pi),\pi'-\pi\big\rangle,\quad{\rm(G1)}$$
where $f(\pi)=V_{\lambda,\pi}^{\pi}$ is the objective function, $\pi^*\in{\arg\max}_{\pi}f(\pi)$ is the performatively optimal solution, and $C_1=D^{-1}>0$. This further implies the following weaker gradient dominance result widely used in optimization [3,4] and the linear quadratic regulator (LQR) [5,6]:
$$f(\pi^*)-f(\pi)\le C_2\|\nabla f(\pi)\|^{\alpha},$$
where $C_2=2D^{-1}>0$ (since $\|\pi'-\pi\|\le 2$ in Eq. (G1) above) and the power $\alpha=1$.

Both gradient dominance conditions above are useful for global convergence to the optimal solution $\pi^*$, since under either condition, $\|\nabla f(\pi_t)\|\to 0$ implies $f(\pi_t)\to f(\pi^*)$.
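For concreteness, the step from (G1) to the weaker condition is essentially the Cauchy-Schwarz inequality combined with the stated bound $\|\pi'-\pi\|\le 2$; a sketch of the chain of inequalities:

```latex
\begin{align*}
f(\pi^*) - f(\pi)
  &\le C_1 \max_{\pi'\in\Pi} \big\langle \nabla f(\pi),\, \pi'-\pi \big\rangle
      && \text{by (G1)} \\
  &\le C_1 \max_{\pi'\in\Pi} \|\nabla f(\pi)\|\,\|\pi'-\pi\|
      && \text{Cauchy--Schwarz} \\
  &\le 2\,C_1\,\|\nabla f(\pi)\|
     = C_2\,\|\nabla f(\pi)\|^{\alpha}
      && \text{with } C_2 = 2D^{-1},\ \alpha = 1.
\end{align*}
```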

[1] Agarwal, A., Kakade, S. M., Lee, J. D., & Mahajan, G. (2021). On the theory of policy gradient methods: Optimality, approximation, and distribution shift. Journal of Machine Learning Research, 22(98), 1-76.

[2] Chen, Z., Wen, Y., Hu, Z., & Huang, H. (2024). Robust Reinforcement Learning with General Utility. Advances in Neural Information Processing Systems, 37, 11290-11344.

[3] Masiha, S., Salehkaleybar, S., He, N., Kiyavash, N., & Thiran, P. (2022). Stochastic second-order methods improve best-known sample complexity of SGD for gradient-dominated functions. Advances in Neural Information Processing Systems, 35, 10862-10875.

[4] Nesterov, Y., & Polyak, B. T. (2006). Cubic regularization of Newton method and its global performance. Mathematical programming, 108(1), 177-205.

[5] Mohammadi, H., Zare, A., Soltanolkotabi, M., & Jovanović, M. R. (2021). Convergence and sample complexity of gradient methods for the model-free linear–quadratic regulator problem. IEEE Transactions on Automatic Control, 67(5), 2435-2450.

[6] Ye, L., Mitra, A., & Gupta, V. (2024, December). On the Convergence of Policy Gradient for Designing a Linear Quadratic Regulator by Leveraging a Proxy System. In 2024 IEEE 63rd Conference on Decision and Control (CDC) (pp. 6016-6021). IEEE.

Review (Rating: 1)

This paper proposes an algorithm to compute performatively optimal policies, i.e. policies maximizing the expected sum of rewards in an MDP-like environment where the transition and reward functions are dependent on the policy that is executed. The algorithm consists in iteratively building an ascent direction from samples in the decision process and using this direction in the Frank-Wolfe algorithm to update the policy. Convergence is guaranteed, as the ascent direction is "valid" and the objective function is gradient dominated.

Questions For Authors

Does convergence require the batch size $N$ to grow unbounded?

Claims And Evidence

The claim that it is possible to find the performatively optimal policy with the proposed algorithm is theoretically supported.

There are several other claims that are incorrect or insufficiently detailed:

  1. Authors claim that there is no analytical form of the performative policy gradient [line 320, right column]. To my understanding this has not been shown, and, intuitively, it is unclear to me why there would not be an analytical form of the gradient.
  2. Authors claim in the abstract (and throughout the paper) that it is a "zeroth-order policy gradient method". This is insufficient to understand how the policy is effectively optimized, and it is misleading with respect to my understanding of what a zeroth-order method, a first-order method, and a policy gradient method are. On the one hand, a zeroth-order method optimizes a function without computing gradients, solely by estimating the function. A first-order method, on the other hand, uses gradients. Policy-gradient methods fall into the second type, as the point is to estimate the gradient of the return (by computing the gradient of the policy) to perform stochastic gradient ascent steps. If one were to use finite differences to compute an ascent direction to optimize the return, I am not sure it can still be considered a policy gradient method. The abstract should be clearer about how the ascent direction is computed and used to update the policy.
  3. Authors highlight that the performatively optimal policy cannot be computed with previous algorithms from the literature. It nevertheless seems that the problem at hand is a particular case of a stochastic game where the objective is to compute policies against adversarial opponents, e.g., [1, 2, 3]. Does this part of the literature provide algorithms that would compute an optimal performative policy?

[1] Sessa, P. G., Bogunovic, I., Kamgarpour, M., & Krause, A. (2020). Learning to play sequential games versus unknown opponents. Advances in neural information processing systems, 33, 8971-8981.

[2] Ramponi, G., Metelli, A. M., Concetti, A., & Restelli, M. (2021). Learning in non-cooperative configurable markov decision processes. Advances in Neural Information Processing Systems, 34, 22808-22821.

[3] Jackson, M. T., Jiang, M., Parker-Holder, J., Vuorio, R., Lu, C., Farquhar, G., ... & Foerster, J. (2023). Discovering general reinforcement learning algorithms with adversarial environment design. Advances in Neural Information Processing Systems, 36, 79980-79998.

Methods And Evaluation Criteria

There is no evaluation of the final algorithm.

Theoretical Claims

Theoretical claims seem correct, but I haven't checked proofs in appendices.

Experimental Designs Or Analyses

There is no empirical evaluation, which is to me problematic. The paper should include experiments to validate the final algorithm, and compare to algorithms from the literature dealing with non-stationary or adversarial settings.

Supplementary Material

No.

Relation To Broader Scientific Literature

The contribution should be related to the literature dealing with non-stationary or adversarial settings. Do there exist algorithms that could be applied to compute performatively optimal policy?

Essential References Not Discussed

See previous remarks on non-stationary or adversarial RL.

Other Strengths And Weaknesses

Authors should formally define the n-step transition distribution in equation (2).

In Section 2.1, when defining $\mathcal{P}$, the sum should be over $s'$ and not $s$, I believe.

I think equation (5) might be wrong: is it $r_{d'}$ or $r_d$? In (Mandal et al., 2023), they use the measure $d$ in their equation (3).

Authors should be mathematically clear about what a "valid approximation" is in Proposition 1.

Other Comments Or Suggestions

I would have clearly stated that distributions are represented by vectors at the beginning of section 2.1. In other words, sentence line 90 should come earlier for clarity.

Author Response

Claims And Evidence (1): Why there would not be an analytical form to the gradient?

A: Good question. I later found that this gradient can be computed by the chain rule, but it involves the unknown $\nabla_{\pi}p_{\pi}(s'|s,a)$ and $\nabla_{\pi}r_{\pi}(s,a)$. We will revise this claim.

Claims And Evidence (2): The abstract should be clearer about the how the ascent direction is computed and used to update the policy. Can we use the name "zeroth-order policy gradient method"?

A: Thanks for your suggestions. Our algorithm uses a Frank-Wolfe update to find the ascent direction, where the policy gradient is approximated by its zeroth-order estimate (this will be added to the revised abstract), so the name "zeroth-order policy gradient method" is valid, as it has also been used in [1,2]. We may also use "zeroth-order Frank-Wolfe algorithm" to reveal more optimization details.
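For intuition, here is a hypothetical sketch of a two-point zeroth-order gradient estimate plugged into a Frank-Wolfe step over a shrunk policy simplex; the function names, the direction sampling, and the vertex rule are our illustration under stated assumptions, not the paper's Algorithm 1:

```python
import numpy as np

def zeroth_order_gradient(f, pi, delta=1e-4, num_dirs=100, rng=None):
    """Two-point zeroth-order estimate of grad f at policy pi (shape (S, A)).
    In the paper, directions are chosen so that pi +/- delta*u remain valid
    policies (the 'valid approximation' condition of Proposition 1)."""
    rng = np.random.default_rng(rng)
    g = np.zeros_like(pi)
    for _ in range(num_dirs):
        u = rng.standard_normal(pi.shape)
        u /= np.linalg.norm(u)
        g += (f(pi + delta * u) - f(pi - delta * u)) / (2.0 * delta) * u
    return g * pi.size / num_dirs

def frank_wolfe_step(pi, grad, beta=0.01, Delta=1e-3):
    """One Frank-Wolfe update: per state, move toward the vertex of the
    Delta-shrunk simplex that maximizes the linearized objective."""
    S, A = pi.shape
    target = np.full_like(pi, Delta)
    target[np.arange(S), grad.argmax(axis=1)] += 1.0 - A * Delta
    return pi + beta * (target - pi)
```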

[1] Wang, Z., et al. Policy evaluation in distributional LQR. In Learning for Dynamics and Control Conference 2023.

[2] Han, Y., Razaviyayn, M., & Xu, R. Policy gradient finds global optimum of nearly linear-quadratic control systems. NeurIPS 2022 Workshop.

Claims And Evidence (3): It seems that the problem at hand is a particular case of some stochastic game where the objective to compute policies against adversarial opponents, e.g. [1-3]. Do their algorithms compute an optimal performative policy?

A: No, these adversarial settings are very different from our performative reinforcement learning problem, which does not involve an adversarial environment.

Experimental Designs Or Analyses: The paper should include experiments.

A: Thanks for your suggestion. Due to limited space, please see the experimental details in our final response to Reviewer uoEY. Our results at https://docs.google.com/document/d/1bH3eEoGhfDwq1NBNW7_zjCSLvvmcUyDusaINivK5bdo/edit?tab=t.0 show that our Algorithm 1 outperforms the existing repeated training algorithm.

Relation To Broader Scientific Literature: The contribution should be related to the literature dealing with non-stationary or adversarial settings. Do there exist algorithms that could be applied to compute performatively optimal policy?

A: No, since to our knowledge, performative reinforcement learning is not a special case of any other problem.

We will add a discussion of related works, including those on non-stationary MDPs (e.g., [1,2]) that are weakly related to our work. To elaborate, during the training of performative reinforcement learning, the policy $\pi$ and thus the environment $(p_{\pi}, r_{\pi})$ change across iterations. In a non-stationary MDP, the environment $(p_t, r_t)$ changes with the MDP time step $t$, not with the training iteration.

[1] Chandak, Yash, et al. Optimizing for the future in non-stationary MDPs. ICML 2020.

[2] Chandak, Yash, et al. Towards safe policy improvement for non-stationary MDPs. Neurips 2020.

Other Strengths And Weaknesses (1): Authors should formally define the n-step transition distribution in Eq. (2).

A: Thanks for your suggestion. Since $s_{t+1}\sim p_{\pi}(\cdot|s_t,a_t)$, $a_t\sim\pi(\cdot|s_t)$, and $s_0\sim\rho$, the n-step transition distribution can be computed as below, which will be added to the revision:
$$\mathbb{P}_{\pi,p,\rho}(s_n=s,a_n=a)=\sum_{s_0,\ldots,s_{n-1}\in\mathcal{S}}\ \sum_{a_0,\ldots,a_{n-1}\in\mathcal{A}}\rho(s_0)\,\pi(a|s)\,p_{\pi}(s|s_{n-1},a_{n-1})\,\pi(a_{n-1}|s_{n-1})\prod_{t=0}^{n-2}\big[\pi(a_t|s_t)\,p_{\pi}(s_{t+1}|s_t,a_t)\big].$$
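Equivalently, this distribution can be computed by propagating the state distribution forward $n$ steps rather than summing over all length-$n$ trajectories; a small illustrative NumPy sketch (shapes and names are our own assumptions):

```python
import numpy as np

def n_step_state_action_dist(pi, p_pi, rho, n):
    """P_{pi,p,rho}(s_n = s, a_n = a) for pi of shape (S, A), transition
    kernel p_pi of shape (S, A, S), and initial distribution rho of shape (S,)."""
    d = rho.copy()                                 # distribution of s_0
    for _ in range(n):
        # d(s') = sum_{s,a} d(s) pi(a|s) p_pi(s'|s,a)
        d = np.einsum('s,sa,sat->t', d, pi, p_pi)
    return d[:, None] * pi                         # P(s_n = s, a_n = a) = d(s) pi(a|s)
```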

Other Strengths And Weaknesses (2): In Section 2.1, when defining P\mathcal{P}, the sum should be over ss' and not ss I believe.

A: Thanks. We have corrected that.

Other Strengths And Weaknesses (3): I think equation (5) might be wrong: is it $r_{d'}$ or $r_d$? In (Mandal et al., 2023), they use the measure $d$ in their equation (3).

A: We use $r_{d'}$ for two reasons. First, since the performatively stable policy is defined as $\pi_S\in{\arg\max}_{\pi'}V_{\pi_S}^{\pi'}(\rho)$, their Eq. (3), which defines the corresponding performatively stable occupancy measure $d_S$, should have used $r_{d_S}$, corresponding to our Eq. (5) with $d'=d_S$. Second, their Eq. (5) about their repeated training algorithm corresponds to our Eq. (5) with $d'=d_t$ at iteration $t$.

Other Strengths And Weaknesses (4): Authors should be mathematically clear about what a "valid approximation" is in Proposition 1.

A: Thanks for your suggestion. "Valid approximation" means that $\pi+\delta u_i$ and $\pi-\delta u_i$ in Eq. (26) are valid policies, i.e., $\pi'(a|s)\ge 0$ and $\sum_a\pi'(a|s)=1$ for $\pi'\in\{\pi\pm\delta u_i\}$. We will add this explanation to the revision.
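As a concrete check, a hypothetical helper implementing this validity test (the indexing convention `pi[s, a] = pi(a|s)` is our assumption):

```python
import numpy as np

def is_valid_policy(pi, tol=1e-8):
    """True iff every entry is nonnegative and each row pi[s, :] sums to 1,
    e.g. is_valid_policy(pi + delta * u) and is_valid_policy(pi - delta * u)."""
    return bool(np.all(pi >= -tol) and np.allclose(pi.sum(axis=1), 1.0, atol=tol))
```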

Other Comments Or Suggestions: Sentence line 90 should be moved to the beginning of section 2.1 for clarity.

A: Thanks. We have done that.

Questions For Authors: Does convergence require the batch size to grow unbounded?

A: No. Usually we fix $\epsilon,\eta$, so the batch size $N=O[\epsilon^{-2}\log(\eta^{-1}\epsilon^{-1})]$ is also fixed.

Reviewer Comment

Thank you for your response. I would advise updating the paper so that these elements are made clear.

I think that the paper is still incomplete without an experimental validation, which I cannot review solely based on the elements you provided in the response to reviewers.

Author Comment

Did you see our experimental results? They are available at https://docs.google.com/document/d/1bH3eEoGhfDwq1NBNW7_zjCSLvvmcUyDusaINivK5bdo/edit?tab=t.0

The experimental details are in our final response to the Reviewer uoEY. We have added the above experimental details and results to our paper, but ICML 2025 does not allow us to upload the updated paper.

In addition, what do you think about our responses to your other concerns?

Thanks. Authors

Final Decision

The submission received three reviews, with all three reviewers acknowledging the rebuttal. There was some divergence in the opinions of the reviewers; however, after the discussion, the panel feels that the paper is not yet ready for publication. While the main strength is a credible proof of convergence in polynomial time to a performatively optimal policy for entropy-regularised objective value functions, the two main limitations that prevent a more positive recommendation at the moment are 1) convergence holds only for large values of the regularizer and 2) there is a lack of proper experimental evaluation. In more detail:

  • The convergence holds when the regulariser is large enough, which essentially cancels out the performative effect, rendering the result somewhat standard (convergence towards an approximately uniform distribution when regularisation is high). It is also indicative that the main proof leverages/adapts standard convergence theory to show this result. The paper does not accurately present that limitation, resulting in the impression that it (slightly) overclaims its contributions. A proper account of this would benefit the paper.
  • Lack of proper experimental evaluation. While this is a theoretical paper, and could be appreciated solely as such, the introduction of the entropy regulariser in the objective function and the ensuing uncertainty (see previous comment) would benefit from experimental study to assess how much this can depart from the original (unregularised) objective function and affect actual optimisation in different settings. The experiments offered in the rebuttal are welcome but are still insufficient to provide an adequate experimental evaluation.

The relation to the literature was lifted as a concern and some other comments were satisfactorily discussed in the rebuttal leading to improvements in a potentially revised submission. In sum, the main result remains important, but the limitations mentioned above prevent a stronger positive recommendation at the moment.