PaperHub

Score: 6.4/10 · Poster · 4 reviewers
Ratings: 4, 3, 5, 4 (min 3, max 5, std 0.7)
Confidence: 3.5
Novelty: 2.8 · Quality: 2.8 · Clarity: 2.0 · Significance: 2.0

NeurIPS 2025

Multi-Objective Reinforcement Learning with Max-Min Criterion: A Game-Theoretic Approach

OpenReview · PDF
Submitted: 2025-05-10 · Updated: 2025-10-29

Abstract

Keywords
Reinforcement learning · Multi-objective reinforcement learning

Reviews and Discussion

Review
Rating: 4

The paper addresses the max-min multi-objective reinforcement learning problem and reformulates it as a two-player zero-sum regularized continuous game. In addition, the paper proposes the "entropy-regularized adversary for max-min MORL" algorithm for the policy update. The authors prove last-iterate convergence under both exact and approximate policy evaluation. They further extend the method to the "adaptively regularized adversary for max-min MORL" algorithm. Finally, they test the proposed algorithms in a tabular setting and on traffic signal control.

Strengths and Weaknesses

Strengths: The paper transforms multi-objective reinforcement learning into a study from the perspective of game theory.

Weaknesses:

  1. The work is highly similar to paper [27], and the innovation is not significant.
  2. The experimental part is not comprehensive enough; the main issue lies in the comparison with [27].
  3. The writing is not rigorous enough and contains many minor mistakes.

Questions

  1. In the introduction, the authors introduce the background of the min-max criterion, which is inconsistent with the max-min problem studied in the paper.
  2. The paper demonstrates last-iterate convergence in tabular MDPs. However, in practical applications, function approximation (e.g., neural networks) is more commonly employed. It would be beneficial to further investigate whether convergence can still be maintained when the policy parameterization extends beyond softmax, particularly in cases involving linear function approximation, which represents an intermediate level of complexity.
  3. The theoretical justification for the regularization parameter τ_w appears somewhat tenuous. The paper sets a condition that τ_w must be greater than a certain complex expression (Theorem 4.1), yet this threshold is fixed at 0.05 in the experiments. A deeper analysis of the impact of τ_w on the convergence speed would be valuable.
  4. The description of error propagation mechanisms in the case of approximate policy evaluation is insufficiently clear. Theorem 4.3 provides an error term involving δ, but does not explain how to control the accumulation of errors during the iteration process. This is particularly relevant in deep reinforcement learning, where value function estimation may be biased. It would be helpful to further discuss the cascading effects of updates to the weight vector w.
  5. Providing additional baseline comparisons from works such as Welfare and Fairness in Multi-Objective Reinforcement Learning [12] and Multi-Objective Reinforcement Learning: Convexity, Stationarity, and Pareto Optimality [23] would strengthen the argument and enhance the persuasiveness of the paper.
  6. In the related work section, the authors claim that the paper is a generalization of GGF-PPO, but in the Memory and Computational cost section, it only compared with [27].
  7. Minor: (1) The order of the paper “checklist” and “appendix” is wrong. (2) Some first-occurrence abbreviations are undefined, such as MORL on line 12 and GGF on line 251. (3) The indefinite article is misused with “a MOMDP” on line 57, and “a NE” throughout the paper. (4) There's a typo in the subtitle of line 131: “Two-plaer zero-sum regularized continuous game formulation”.

Limitations

Yes

Justification for Final Rating

The paper now makes a compelling case for its algorithmic and theoretical advancements in MORL. We look forward to seeing the new version!

Formatting Issues

The order of the paper “checklist” and “appendix” is wrong.

Author Response

Novelty and Relation to Park et al. (2024) (Weaknesses 1)

We acknowledge that Park et al. (2024) serves as an important starting point for our work, and we do address the same overarching problem. However, our approach is fundamentally different: we propose a novel game-theoretic formulation that leads to a distinct algorithm from Park et al. (2024). Our method requires significantly fewer model parameters and substantially reduces training time, resulting in improved memory and computational efficiency.

Moreover, real-world problems often involve correlated objectives. To address this, we introduce a new mechanism that explicitly accounts for inter-objective correlations during learning, which in turn improves the max-min performance.

[Ref.] Park et al., "The max-min formulation of multi-objective reinforcement learning: From theory to a model-free algorithm," ICML, Jul. 2024

More on Complexity Comparison (Weaknesses 2, Question 6)

Thank you for pointing this out. In terms of memory usage, ERAM and ARAM use the same PPO backbone as GGF-PPO, and the only additional parameters come from storing the weight vector $w$, which adds just 4 parameters in Base-4 and Asym-4, and 16 in Asym-16.

Regarding computational cost, we presented the training wall-clock time in the table. Across all three environments, ERAM, ARAM, and GGF-PPO exhibited comparable computational cost. We will clarify these comparisons and include GGF-PPO in the revised version of Section 6.3.

| Environments | ERAM | ARAM | GGF-PPO |
| --- | --- | --- | --- |
| Base-4 | 111±2.6 | 120±3.9 | 122±4.0 |
| Asym-4 | 87.2±2.4 | 87.4±2.4 | 84.8±2.2 |
| Asym-16 | 356±27 | 365±20 | 394±5.5 |

Regarding Writings (Weaknesses 3, Question 7)

We appreciate your careful attention to detail. We will correct all the noted issues in the revised version, including the section order, missing abbreviation definitions, article usage, and the typo in the subtitle.

Clarification on Max-Min Objective (Question 1)

We suspect the confusion may have come from the phrase “min-max (or max-min) criterion” on line 34, and we apologize for the lack of clarity. To clarify, our objective is a max-min problem. We mentioned min-max to note that a min-max cost problem is equivalent to a max-min reward problem, assuming the reward is defined as the negative of the cost. We hope this clears up the misunderstanding.

Regarding Policy Parameterization Beyond Softmax (Question 2)

We agree that analyzing convergence under more advanced policy parameterizations—such as linear function approximation—is an important and interesting direction. However, this lies beyond the scope of the current work. Convergence analysis even for linear function approximation is not easy, as shown in Baird's example. We consider this a promising avenue for future research.

Deeper Theoretical Analysis on $\tau_w$ (Question 3)

The lower bound on $\tau_w$ given in Theorem 4.1 is a sufficient condition for convergence. In tabular cases, where the environments are relatively small, we fixed a relatively small $\tau_w = 0.05$ for simplicity.

Regarding a deeper theoretical analysis of the impact of $\tau_w$, we would like to clarify the trade-off between convergence speed and the suboptimality value gap induced by regularization.

[Impact of $\tau_w$ on Convergence Speed]

From Theorem 4.1, the speed of convergence is determined by the base $\rho$, which is guaranteed to satisfy $0<\rho\le 1-\frac{\epsilon^2}{2}$, where $\epsilon$ is a parameter connecting the scales of the learning rates $\eta$ and $\lambda$. From the parameterization used in our proof (Appendix F.3), $\epsilon = O(\sqrt{\lambda\tau_w})$. This says that the speed is determined not by $\tau_w$ alone, but by its product with the learning rate $\lambda$. Note that with a fixed learning rate $\lambda$, a larger $\tau_w$ results in a smaller $\rho$ and thus faster convergence.

[Impact of $\tau_w$ on the Suboptimality Value Gap Induced by Regularization]

Let the optimal solutions of the regularized game be $w_r^*,\pi_r^*$, and let the optimal solutions of the unregularized game be $w^*,\pi^*$.

We derived a bound on the gap between the two unregularized values, which are the metrics used in our experiments (e.g., traffic). Please see the end of this response for the full proof of the bound: $|V_{w_r^*}^{\pi_r^*}-V_{w^*}^{\pi^*}|\le \frac{\tau\log|A|}{1-\gamma}+\tau_w\log K$. Note that the suboptimality value gap induced by regularization is linear in $\tau$ and $\tau_w$.
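For a concrete sense of scale, the following evaluates this bound with illustrative numbers; only $\tau_w=0.05$ is the value used in our tabular experiments, while $\tau=0.01$, $\gamma=0.99$, $|A|=4$, and $K=4$ are assumed here purely for illustration:

$$\frac{\tau\log|A|}{1-\gamma}+\tau_w\log K = \frac{0.01\cdot\log 4}{1-0.99} + 0.05\cdot\log 4 \approx 1.386 + 0.069 \approx 1.46.$$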

[Trade-off Between Convergence Speed and Suboptimality Gap Induced by $\tau_w$]

To sum up, for a fixed learning rate $\lambda$, the regularization coefficient $\tau_w$ involves a trade-off between convergence speed and the suboptimality gap: a larger $\tau_w$ leads to faster convergence but also increases the suboptimality gap in value.

We hope this analysis addresses your concern regarding a deeper analysis of the impact of $\tau_w$ on convergence and the suboptimality gap.

Control of Estimation Error (Question 4)

The key contribution of Theorem 4.3 is that ERAM provably converges in tabular MOMDPs without requiring explicit control of error accumulation. While Theorem 4.1 with exact policy evaluation shows that the suboptimality bound decays as $\rho^t$, Theorem 4.3 with approximate policy evaluation provides a bound of the form $\hat{\rho}^t + O(\delta)$, where $\delta$ denotes the value estimation error. In other words, the result explicitly shows that the impact of error accumulation grows linearly with $\delta$. This suggests that, in deep RL, as long as function approximation is sufficiently accurate, error accumulation does not explode, supporting the practical stability of our approach.

Investigating how to correct for error propagation when value estimates are considerably inaccurate in deep RL remains an interesting direction for future work.

Additional Baselines (Question 5)

Thank you for the helpful suggestions. Regarding [12], we implemented Welfare Q-Learning in [12] using the egalitarian objective ($W = \min$). For the Four-Room environment, we directly used the official tabular implementation provided by the authors. Since their work does not include a deep RL version, we extended it ourselves by adapting it to a DQN-based implementation for the traffic environments.

| Environments | ARAM | ERAM | WQL [12] |
| --- | --- | --- | --- |
| Base-4 | -1160 | -1387 | -1688 |
| Asym-4 | -2696 | -2732 | -5412 |
| Asym-16 | -15043 | -17334 | -15639 |
| Four Room | 1.80 | 1.56 | 0.89 |

As for [23], the focus of that work is on recovering the Pareto front using linear preferences. This differs from our setting, which aims to find the max-min solution. Even if the full Pareto front were recovered, identifying the specific linear preference corresponding to the max-min point would be nontrivial. Hence, while related, their method does not directly address our objective.

[Appendix. Proof for Suboptimality Value Gap by Regularization]

Let the optimal solutions of the regularized game be $w_r^*(\pi) := \arg\min_w \langle w,V^\pi\rangle+\tau\tilde{H}(\pi)-\tau_w H(w)$ and $\pi_r^*:=\arg\max_\pi \min_w \langle w,V^\pi\rangle+\tau\tilde{H}(\pi)-\tau_w H(w)$. Let $w_r^* := w_r^*(\pi_r^*)$.

Similarly, let the optimal solutions of the unregularized game be $w^*(\pi) := \arg\min_w \langle w,V^\pi\rangle$ and $\pi^*:=\arg\max_\pi \min_w \langle w,V^\pi\rangle$. Let $w^* := w^*(\pi^*)$.

1) Lower bound

$$V_{w_r^*}^{\pi_r^*} + \tau\tilde{H}(\pi_r^*)-\tau_w H(w_r^*)$$

$$= \max_\pi\min_w \langle w,V^\pi\rangle+\tau\tilde{H}(\pi)-\tau_w H(w)$$

$$\ge \min_w \langle w,V^{\pi^*}\rangle+\tau\tilde{H}(\pi^*)-\tau_w H(w)$$

$$=\langle w_r^*(\pi^*),V^{\pi^*}\rangle+\tau\tilde{H}(\pi^*)-\tau_w H(w_r^*(\pi^*))$$

$$=\langle w^*,V^{\pi^*}\rangle + \langle w_r^*(\pi^*)-w^*, V^{\pi^*}\rangle + \tau\tilde{H}(\pi^*)-\tau_w H(w_r^*(\pi^*)).$$

Since $0\le \tilde{H}(\pi)\le \frac{\log|A|}{1-\gamma}$ and $0\le H(w)\le\log K$ for any $\pi$ and $w\in\Delta^K$, we obtain

$$V_{w_r^*}^{\pi_r^*} - V_{w^*}^{\pi^*} \ge \langle w_r^*(\pi^*)-w^*, V^{\pi^*}\rangle - \frac{\tau\log|A|}{1-\gamma}-\tau_w\log K.$$

We now show that $\langle w_r^*(\pi^*)-w^*, V^{\pi^*}\rangle\ge 0$.

For simplicity, we denote $V^{\pi^*}_k$ by $V_k$. Without loss of generality, assume that $0<V_1\le \cdots \le V_K$. Let $x_k = \frac{e^{-V_k/\tau_w}}{S}$, where $S=\sum_i e^{-V_i/\tau_w}$. Note that this $x$ is the solution of $\arg\min_w \langle w,V^{\pi^*}\rangle-\tau_w H(w)$, hence $x=w_r^*(\pi^*)$, and it satisfies $x_K\le\cdots\le x_1\le 1$. By definition, $w^* = w^*(\pi^*) = (1,0,\ldots,0)$. Then,

$$\langle w_r^*(\pi^*)-w^*, V^{\pi^*}\rangle$$

$$= (x_1-1)V_1+x_2V_2+\cdots+x_KV_K$$

$$=-\tau_w(x_1-1)\log (Sx_1)-\tau_w x_2\log (Sx_2)-\cdots-\tau_w x_K\log (Sx_K)$$

$$=-\tau_w(x_1-1)\log x_1-\tau_w x_2\log x_2-\cdots-\tau_w x_K\log x_K \quad (\because \textstyle\sum_k x_k=1) \qquad (1)$$

$$=\tau_w \left(x_2\log\tfrac{x_1}{x_2}+\cdots+x_K\log\tfrac{x_1}{x_K}\right) \quad (\because \textstyle\sum_k x_k=1)$$

$$\ge \tau_w\cdot 0 = 0,$$

since $x_1\ge x_k>0$; equality holds when $x_k=\frac{1}{K}$ for all $k$.

Therefore, we conclude that the suboptimality gap has the lower bound $-\frac{\tau\log|A|}{1-\gamma}-\tau_w\log K$.

2) Upper bound

$$V_{w^*}^{\pi^*}$$

$$= \max_\pi\min_w \langle w,V^\pi\rangle$$

$$\ge \min_w \langle w,V^{\pi_r^*}\rangle$$

$$=\langle w^*(\pi_r^*),V^{\pi_r^*}\rangle$$

$$=\langle w_r^*,V^{\pi_r^*}\rangle - \langle w_r^* - w^*(\pi_r^*), V^{\pi_r^*}\rangle.$$

Thus, $V_{w_r^*}^{\pi_r^*} - V^{\pi^*}_{w^*} \le \langle w_r^* - w^*(\pi_r^*), V^{\pi_r^*}\rangle$.

Following the same logic as in 1), equation (1) gives

$$\langle w_r^* - w^*(\pi_r^*), V^{\pi_r^*}\rangle$$

$$=-\tau_w(x_1-1)\log x_1-\tau_w x_2\log x_2-\cdots-\tau_w x_K\log x_K$$

$$=\tau_w\left(\log x_1 -\sum_k x_k\log x_k\right)$$

$$< \tau_w (0 + \log K) \quad (\because \log x_1\le 0,\ H(x)\le\log K).$$

Therefore, we obtain the upper bound $\tau_w \log K$.

Combining 1) and 2), we obtain the suboptimality value gap bound $|V_{w_r^*}^{\pi_r^*}-V_{w^*}^{\pi^*}|\le \frac{\tau\log|A|}{1-\gamma}+\tau_w\log K$.

Comment

Thank you for your thoughtful rebuttal. Your detailed responses strengthened our understanding of the work's contributions. We appreciate your commitment to revisions—restructuring the paper (e.g., moving related work, ARAM subsection) and clarifying Park et al./GGF-PPO comparisons will further enhance accessibility. While open questions remain (e.g., non-softmax convergence), these are well-scoped for future work.

Comment

Thank you for the positive feedback; your comments were very helpful. Additionally, we apologize for the delayed reply.

Review
Rating: 3

This work gives an algorithm for multi-objective reinforcement learning (MORL), specifically entropy-regularized max-min MORL. The algorithm formulates the MORL problem as a two-player zero-sum game in which the learner updates its policy using natural policy gradient updates and the adversary uses mirror descent to select a strategy from the simplex over objectives. In the case of tabular MDPs, last-iterate convergence guarantees are given. The algorithm is empirically evaluated in a tabular MDP as well as a traffic signal control environment.

Strengths and Weaknesses

Strengths: The paper studies the interesting and important problem of multi-objective reinforcement learning, provides a novel algorithm, and supports the usefulness of the algorithm both theoretically and empirically.

Weaknesses: The paper omits a great deal of relevant literature, so it is very unclear what the novel contribution is. I've included some related work on constrained and fair RL that I think the authors should compare to. Notably, the work of Cousins et al. [2] solves a more general version of the min-max problem in tabular settings, and so should be discussed in detail.

The paper also claims that the min-max objective presents a unique challenge, but (as observed in [5]) the min-max objective can be handled as a constrained optimization problem in which the constraint values are also optimized.

The presentation of the theoretical guarantees is also a bit hard to interpret. How is $\epsilon$ chosen in Theorem 4.1? Also, how does the regularized Q value compare to the unregularized one? It would be good to have more discussion of the consequences of regularization, beyond its role in simplifying the analysis of mirror descent.

Larger experiments for the tabular setting would also strengthen the empirical evaluation.

Missing citations:

[1] M. Calvo-Fullana, S. Paternain, L. F. Chamon, and A. Ribeiro. State augmented constrained reinforcement learning: Overcoming the limitations of learning with rewards.

[2] C. Cousins, K. Asadi, E. Lobo, and M. L. Littman. On welfare-centric fair reinforcement learning.

[3] A. Muller, P. Alatur, V. Cevher, G. Ramponi, and N. He. Truly no-regret learning in constrained mdps.

[4] S. Miryoosefi, K. Brantley, H. Daume III, M. Dudik, and R. E. Schapire. Reinforcement learning with convex constraints.

[5] E. Eaton, M. Hussing, M. Kearns, A. Roth, S. Sengupta, J. Sorrell. Intersectional Fairness in Reinforcement Learning with Large State and Constraint Spaces.

Questions

  • What is the novel theoretical contribution over the work of Cousins et al?
  • How does this work compare to the existing MORL literature that uses primal dual algorithms?

Limitations

See comments above

Justification for Final Rating

I appreciate the detailed and thoughtful rebuttal from the authors, as well as the commitment to include the relevant related work in an updated version. While the authors' explanations did help clarify how the proposed algorithm differs from similar existing work, it seems from the rebuttal that the contribution is more one of practical improvements in some settings than a novel theoretical contribution. In light of this, I think the paper would be significantly strengthened by a more thorough and diverse experimental comparison to existing model-based and primal-dual MORL algorithms. I have increased my score based on the comments in the rebuttal, but think the work would be much improved by being rewritten with a deeper awareness of the related work in mind.

Formatting Issues

No

Author Response

We thank the reviewer for the valuable comments. Below are our responses to the reviewer's comments.

Novelty and Comparison with the Work of Cousins et al. (Question 1)

Thank you for introducing the relevant existing works; we will include them in the revision. Here, we would like to explain our contribution and how it differs from the existing works. We agree that the max-min, or more general utility, MORL problem can be formulated using the occupancy measure $d(s,a)$:

$$\max_{d} \min_{1 \leq k \leq K} \sum_{s,a} d(s,a)\, r^{(k)}(s,a)$$

$$\text{s.t.}\quad \sum_{a'} d(s',a') = \mu_0(s') + \gamma \sum_{s,a} P(s' | s,a)\, d(s,a) \quad \forall s'$$

$$d(s,a) \geq 0 \quad \forall (s,a),$$

where $r^{(k)}(s,a)$ is the $k$-th objective reward function.
This was noted in Eq. (9) of Park et al. (2024) and Eq. (3) of Cousins et al. (2024) [2]. The method of Cousins et al. is model-based: it learns the model $P$ and $r^{(k)}$, then solves the convex optimization problem using a solver. This restricts the algorithm to finite state-action spaces. Algorithm 1 of Cousins et al. consists of an exploration phase to learn the model and an exploitation phase to solve the optimization. In finite settings, the model can eventually be learned via visitation-count methods by the law of large numbers, and the algorithm will converge. However, this is not practical in realistic environments. For example, in the Four-Room MORL environment with $|S|=5248$, $|A|=4$, and two objectives, the number of samples required by Cousins et al. is $m_{knw} \approx 131{,}001{,}956{,}377{,}564$ for $\epsilon=\delta=0.1$. This is clearly impractical, and accordingly the paper of Cousins et al. does not include any numerical example.

Our whole idea and contribution is to circumvent model learning and develop a model-free algorithm applicable even to continuous state-action spaces. As mentioned by the reviewer, the above optimization can be written using a slack variable $c$ as

$$\max_{d(s,a),\, c}\ c$$

$$\text{s.t.}\quad \sum_{s,a} r^{(k)}(s,a)\, d(s,a) \geq c, \quad 1 \leq k \leq K$$

$$\sum_{a'} d(s',a') = \mu_0(s') + \gamma \sum_{s,a} P(s' | s,a)\, d(s,a) \quad \forall s'$$

$$d(s,a) \geq 0 \quad \forall (s,a).$$

Again, we do not solve this problem (which requires the model) directly. Instead, we consider its dual problem, given by

$$\min_{w \in \Delta^K,\, v} \sum_s \mu_0(s)\, v(s)$$

$$\text{s.t.}\quad v(s) \geq \sum_{k=1}^K w_k r^{(k)}(s,a) + \gamma \sum_{s'}P(s'|s,a)\, v(s') \quad \forall (s,a),$$

where $\mu_0$ is the initial state distribution and $\Delta^K :=\{ w=[w_1,\cdots,w_K]^T\in \mathbb{R}^K \mid \sum_{k=1}^K w_k = 1;\ w_k \geq 0\ \forall k\}$ is the $(K-1)$-simplex. Note that the Lagrange dual variable $w$ now lies on the probability simplex, not in the nonnegative orthant as in the references the reviewer suggested. The above problem involves inequalities that are difficult to handle. One can further show that the above dual problem is equivalent to (Park et al. 2024)

$$\min_{w \in \Delta^K}\sum_s \mu_0(s)\, v_{w}^*(s) \qquad (R1)$$

where $v_w^*$ is the unique fixed point of the following operator:

$$T_w^* v(s) := \max_a \left[ \sum_{k=1}^K w_k r^{(k)}(s,a)+ \gamma \sum_{s'} P(s'|s,a)\, v(s') \right].$$

Thus, $v_w^*(s)$ is basically the optimal value function for the weighted reward $\sum_k w_k r^{(k)}$. The optimal value function can be learned in a model-free manner using various methods, eliminating the need to learn the model $P$ and $r^{(k)}$. A similar derivation is possible for the entropy-regularized case. Our algorithm can be applied to a real-world problem of 16-lane traffic control. We hope the reviewer now recognizes our contribution compared to the work of Cousins et al. We will mention that model-based approaches such as Cousins et al. are available for small problems.
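As a concrete illustration of this model-free route, below is a minimal tabular sketch of Q-learning on the weighted reward $\sum_k w_k r^{(k)}$ for a fixed $w$; the environment interface (`env.reset`, `env.step` returning a vector reward, `env.sample_action`) and all hyperparameter values are assumptions for illustration, not our actual implementation.

```python
import numpy as np

def scalarized_q_learning(env, w, num_states, num_actions,
                          gamma=0.99, alpha=0.1, eps=0.1, episodes=5000):
    """Model-free estimate of v_w^*: tabular Q-learning on the scalarized reward <w, r(s,a)>.

    Assumes env.step(a) returns (next_state, r_vec, done) with a K-dimensional reward vector.
    """
    Q = np.zeros((num_states, num_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behavior policy
            a = env.sample_action() if np.random.rand() < eps else int(Q[s].argmax())
            s_next, r_vec, done = env.step(a)
            r = float(np.dot(w, r_vec))            # scalarize with the current weight w
            target = r + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])  # standard TD update
            s = s_next
    return Q.max(axis=1)                           # estimate of v_w^*(s)
```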

[Ref.] Park et al., "The max-min formulation of multi-objective reinforcement learning: From theory to a model-free algorithm," ICML, Jul. 2024

Novelty Compared to Other Primal-Dual Algorithms (Question 2)

Now, consider the start-state $s_0$ problem, where $\mu_0$ is a delta at $s_0$. Eq. (R1) becomes $\min_{w \in \Delta^K} v_w^*(s_0)$, which is equivalent to our initial problem $\max_{\pi} \min_k V_k^\pi(s_0)=\max_{\pi} \min_{w \in \Delta^K} \langle w,V^\pi(s_0)\rangle$. Note that this max-min/min-max equivalence is different from the conventional max-min/min-max equivalence considered in constrained MDPs (Paternain et al. 2019, Efroni et al. 2020, Muller et al. 2024), where the following problem was considered:

$$\max_\pi V_r^\pi \quad \text{s.t.} \quad V_{u^i}^\pi \ge c_i\ \ \forall i. \qquad (R2)$$

The strong duality (Paternain et al. 2019) of this problem says

$$\max_\pi \min_{\lambda \in \mathbb{R}_{\ge 0}^K} L(\pi,\lambda) = \min_{\lambda \in \mathbb{R}_{\ge 0}^K} \max_\pi L(\pi,\lambda), \qquad (R3)$$

where $L(\pi,\lambda) = V_r^\pi + \sum_i \lambda_i (V_{u^i}^\pi -c_i)$.

Note that due to $\lambda \in \mathbb{R}_{\ge 0}^K$ (i.e., the nonnegative orthant), the left-hand side of Eq. (R3) simply restates Eq. (R2). If $V_{u^i}^\pi -c_i \ge 0$, then $\lambda_i=0$ and $\max_\pi \min_{\lambda \in \mathbb{R}_{\ge 0}^K} L(\pi,\lambda) = \max_\pi V_r^\pi$. If $V_{u^i}^\pi -c_i < 0$, then $\lambda_i=\infty$ and the problem becomes infeasible. The max-min/min-max equivalence in our paper does not enjoy such simplification because the dual variable $w$ lies in the probability simplex. So, the max-min/min-max equivalence and the proof of the NE in our paper are new results.

In fact, because $w \in \Delta^K$, we considered the entropy regularization $H(w)$ instead of the $\|w\|^2$ used by Muller et al. 2024. Thus, we have a closed-form solution for the $w$ update, not requiring the projected gradient descent used in Efroni et al. 2020 and Muller et al. 2024. The role of $H(w)$ will be explained further below.

Now, consider the work of Eaton et al. [5] and the approach suggested by the reviewer. Eaton et al. basically considered the constrained MDP: $\max_\pi V_r^\pi\ \ \text{s.t.}\ \ V_{u^i}^\pi \ge c_i\ \ \forall i$.

Consider the max-min approach, stated in the second line after Eq. (2) of Eaton et al.'s paper and suggested by the reviewer:

$$\max_{\pi,c}\ c \quad \text{s.t.} \quad V_{u^i}^\pi \ge c\ \ \forall i. \qquad (R4)$$

Its Lagrangian is given by $L(\pi,c,\lambda) = c + \sum_i \lambda_i (V_{u^i}^\pi-c)$.

Eaton et al.'s method is a primal-dual algorithm with a primal step solving for $\pi, c$ given $\lambda$, and a dual step solving for $\lambda$ given $\pi, c$. Their dual step (Lin-OPT and OPT) yields a scaled one-hot vector $C e_k$ for the worst-value index $k$. While they did not consider deep implementations, combining PPO (as the RL solver) with their dual update reduces to the GGF-PPO of Siddique et al., which we already include. Their Algorithm 4 (FairFictRL) accumulates $\lambda$ over time. We evaluated this variant (FairFictRL with PPO) during the revision, and the results are shown below:

| Environments | ARAM | ERAM | GGF-PPO | FairFictRL with PPO |
| --- | --- | --- | --- | --- |
| Base-4 | -1160 | -1387 | -1731 | -3477 |
| Asym-4 | -2696 | -2732 | -3501 | -4359 |
| Asym-16 | -15043 | -17334 | -21663 | -28970 |

As seen, our algorithms (ARAM and ERAM) significantly outperform GGF-PPO and GGF-PPO with time-averaged $\lambda$ (FairFictRL).

As mentioned in our paper, we adopted $H(w)$ regularization, which spreads the weighting factor across multiple objective dimensions. With this, we achieved last-iterate convergence and a significant performance improvement. Furthermore, in ARAM we went one step further and devised a more intelligent regularization; please see the last part of Section 5 of our paper.

In summary, our algorithm is model-free and falls into the category of primal-dual algorithms. Unlike primal-dual algorithms for conventional constrained MDPs, our algorithm exploits the special properties of max-min multi-objective RL, e.g., that the dual variable lies in the probability simplex. With our entropy regularization on the dual variable, we obtain closed-form updates for the dual variable, go beyond straightforward primal-dual algorithms such as GGF-PPO or Eaton et al.'s, and achieve significant performance improvements over existing methods (at least a 30% reduction in the longest waiting time for asymmetric 16-lane traffic junction control; please see the paper for simulation details).

Considering these clear contributions, noted by other reviewers, we would like to kindly ask the reviewer to reconsider the evaluation. We will include missing references with proper explanation in our revision.

Other Comments and Questions

[How is epsilon chosen in Theorem 4.1?]

Any $\epsilon\in (0,\epsilon_0)$ with $\epsilon_0 = \frac{2(1-\gamma)}{4+5\gamma-6\gamma^2}$ can be chosen in Theorem 4.1. Note that $\frac{2(1-\gamma)}{4+5\gamma-6\gamma^2} \ge \frac{48(1-\gamma)}{121}>0$ for $\gamma<1$. Please see Remark F.3 for the detailed derivation.
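For completeness, this inequality follows from maximizing the denominator over $\gamma\in[0,1]$:

$$\max_{\gamma\in[0,1]}\big(4+5\gamma-6\gamma^2\big)=4+5\cdot\tfrac{5}{12}-6\cdot\tfrac{25}{144}=\tfrac{121}{24}\ \ \text{at }\gamma=\tfrac{5}{12}, \quad\text{hence}\quad \frac{2(1-\gamma)}{4+5\gamma-6\gamma^2}\ \ge\ \frac{2(1-\gamma)}{121/24}=\frac{48(1-\gamma)}{121}.$$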

[How does the regularized Q value compare to the unregularized one?]

For an optimal solution $\pi_r^*, w_r^*$ of the regularized game $\max_\pi \min_{w \in \Delta^K} \langle w, V^\pi \rangle + \tau \tilde{H}(\pi)-\tau_w H(w)$ and an optimal solution $\pi^*, w^*$ of the unregularized game $\max_\pi \min_{w \in \Delta^K} \langle w, V^\pi \rangle$, we derived a bound on the gap in unregularized values: $|V_{w_r^*}^{\pi_r^*}-V_{w^*}^{\pi^*}|\le \frac{\tau\log|A|}{1-\gamma}+\tau_w\log K$.

This is because the original problem is defined without regularization, and thus the unregularized value remains the primary performance metric of interest. For the proof, please see the end of our response to Reviewer 5Y2u.

[Larger tabular experiments]

Please note that the 16-lane traffic control problem has continuous states and is a real-world problem. The paper already includes a tabular experiment with a large state space in Appendix M, the Four-Room environment, which has a state space of size 5,248. Our algorithm outperforms the baselines in terms of max-min performance and achieves a minimum return of 1.8, which is close to the ground-truth optimal return of 2.

Review
Rating: 5

This paper reformulates entropy-regularized max-min multi-objective reinforcement learning as a two-player zero-sum game and introduces ERAM/ARAM, single-loop algorithms that combine natural policy gradients for the agent with a closed-form mirror-descent adversary. The authors establish global last-iterate convergence and finite-sample guarantees, while empirical studies show faster convergence and lower memory use than prior double-loop approaches across standard control benchmarks.

Strengths and Weaknesses

Strengths:

  1. ERAM/ARAM eliminate the costly inner loop and large Q-function ensembles required by prior work via an analytic adversary update, which reduces parameters by $\approx 95\%$ while retaining direct optimisation of the true max-min objective.
  2. The paper provides global last-iterate convergence, explicit $\tilde{O}(\epsilon^{-6})$ sample complexity, and iteration bounds under approximate evaluation. This result strengthens theoretical correctness beyond average-iterate results.
  3. ARAM generalizes ERAM with data-driven KL regularization on the adversary and shows adaptive entropy can sharpen worst-case returns without altering algorithmic complexity.

Weaknesses:

  1. Convergence proofs assume fixed $\tau, \tau_w, \lambda$. However, empirical performance varies markedly with these coefficients, and no principled tuning rule or robustness bound is provided.
  2. The algorithm optimizes a surrogate that differs from the exact max-min objective whenever $\tau$ or $\tau_w>0$, yet the paper gives no bound on the resulting sub-optimality.

Questions

  1. How does the bias introduced by non-zero $\tau$ and $\tau_w$ trade off against convergence speed, and can the authors derive bounds quantifying the gap to the unregularized max-min optimum?
  2. The mirror-descent step computes a soft-max over the full $K$-dimensional value vector each iteration. When $K$ is large, what is the update's time/space complexity, and can a sparsity-promoting or low-rank variant preserve the same convergence guarantees?
  3. The natural policy-gradient update assumes access to the exact inverse Fisher matrix. In practice this is approximated (e.g., diagonal or K-FAC). What accuracy is required for last-iterate convergence to hold, and could an adaptive scheme bound the error introduced by these approximations?
  4. In ARAM, the KL-regularization term uses a baseline distribution $c$ that is updated over time to highlight poorly performing objectives. Because $c$ now varies with the iteration, the effective optimization objective changes. Does this time-varying regularizer invalidate the convergence analysis that assumes a fixed objective, and what constraints on the update rule for $c$ (e.g., step size, update frequency, contraction conditions) would preserve the theoretical guarantees while retaining the empirical gains?
  5. An open question: the paper would be strengthened by positioning its max-min objective within the broader distributionally-robust reinforcement learning (DR-RL) literature [1][2][3]. Can you add a further discussion of this field?

Reference:

[1] Smirnova, Elena, Elvis Dohmatob, and Jérémie Mary. "Distributionally robust reinforcement learning." arXiv preprint arXiv:1902.08708 (2019).

[2] Derman, Esther, and Shie Mannor. "Distributional robustness and regularization in reinforcement learning." arXiv preprint arXiv:2003.02894 (2020).

[3] Liu, Zhishuai, Weixin Wang, and Pan Xu. "Upper and lower bounds for distributionally robust off-dynamics reinforcement learning." arXiv preprint arXiv:2409.20521 (2024).

Limitations

There is no potential negative societal impact in this work. The limitation of methodology is mentioned in Weaknesses.

Justification for Final Rating

I’m satisfied with the author's response and my questions have been fully addressed. After reading the other review and response, I believe this is a solid work and have decided to raise my rating.

Formatting Issues

None

Author Response

Choice of $\tau$, $\tau_w$, and $\lambda$ (Weakness 1)

Thank you for pointing out this important issue. Regarding $\tau$, we followed the default entropy coefficient used in PPO as provided in the stable-baselines3 library. For the newly introduced coefficients $\tau_w$ and $\lambda$ in our algorithm, we propose the following heuristic to guide their selection. Recall the ERAM update rule: $w_{t+1}=\mathrm{softmax}\!\left(-\frac{1-\beta}{\tau_w}V_t+\beta\log w_t\right)$.

The key idea for coefficient selection is to balance the scale of the two terms inside the softmax. If the value term dominates, the weight update collapses toward minimum-objective selection (similar to GGF-PPO). Conversely, if the $\log w$ term dominates, the update becomes overly conservative and largely ignores the current value estimates.

To capture this balance, we compare the average magnitudes of the two terms at convergence. Let $\frac{1-\beta}{\tau_w}\frac{1}{K} \sum_k V_k^{last}$ and $\beta H(w^{last})$ denote the respective contributions of the value and regularization terms, where $H(w^{last})$ is the entropy of the final $w$. In our experiments, we found that the regularization term was approximately 0.18% of the value term in Base-4, 1.5% in Asym-4, and 0.1% in Asym-16.

Similarly, we examined the ratio between the average regularization term, $\beta H(w^{last}) + (1 - \beta) H(c^{last})$, and the value term, $\frac{1 - \beta}{\tau_w} \frac{1}{K} \sum_k V_k^{last}$, in ARAM, and observed the following results: 1.69% in Base-4, 6.8% in Asym-4, and 25.4% in Asym-16.

This suggests a practical heuristic: prior to the main training run, conduct a short preliminary run to estimate the average $V$, $H(w)$, and $H(c)$, then set $\lambda,\tau_w$ such that the regularization term lies in the range of 0.1%-2% of the value term for ERAM. Averaging across tasks, we recommend targeting a regularization-to-value ratio of around 10% as a practical starting point for ARAM. In practice, this approach may help inform the initial choice or tuning range of the coefficients.
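A minimal NumPy sketch of this update and of the ratio diagnostic used by the heuristic is given below; the function and variable names are illustrative assumptions, not the exact implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def eram_weight_update(w, V, tau_w, beta):
    """One ERAM mirror-descent step: w_{t+1} = softmax(-(1-beta)/tau_w * V_t + beta*log(w_t)).

    Assumes w has strictly positive entries (e.g., initialized uniformly).
    """
    return softmax(-(1.0 - beta) / tau_w * V + beta * np.log(w))

def regularization_to_value_ratio(w, V, tau_w, beta):
    """Diagnostic for the tuning heuristic: |beta*H(w)| relative to the average value term."""
    value_term = (1.0 - beta) / tau_w * np.mean(V)
    reg_term = beta * (-(w * np.log(w)).sum())   # beta * H(w)
    return abs(reg_term / value_term)
```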

We will clarify this heuristic in the revised version and consider analyzing its robustness more formally in future work.

Convergence Speed, Suboptimality Bound, and Trade-off Induced by $\tau,\tau_w$ (Weakness 2, Question 1)

Thank you very much for drawing our attention to these important points—they are highly valuable for strengthening our theoretical analysis.

[Convergence Speed]

From Theorem 4.1, the speed of convergence is determined by the base $\rho$, which is guaranteed to satisfy $0<\rho\le 1-\frac{\epsilon^2}{2}$, where $\epsilon$ is a parameter connecting the scales of the learning rates $\eta$ and $\lambda$. From the parameterization used in our proof (Appendix F.3), $\epsilon = O(\eta\tau) = O(\sqrt{\lambda\tau_w})$. This says that the speed is determined not by $\tau,\tau_w$ alone, but by their products with the learning rates. Note that with fixed learning rates $\eta,\lambda$, larger $\tau,\tau_w$ result in a smaller $\rho$ and thus faster convergence.

[Suboptimality Gap of Values by Regularization]

Let the optimal solutions of the regularized game be $w_r^*,\pi_r^*$. Similarly, let the optimal solutions of the unregularized game be $w^*,\pi^*$.

We derived a bound on the gap between the "unregularized" values, which are the metrics used in our experiments (e.g., traffic). Please see the end of the response to Reviewer 5Y2u for the full proof of the following result: $|V_{w_r^*}^{\pi_r^*}-V_{w^*}^{\pi^*}|\le\frac{\tau\log|A|}{1-\gamma}+\tau_w\log K$. Note that the suboptimality gap of the value induced by regularization is linear in $\tau$ and $\tau_w$.

[Trade-off Between Convergence Speed and Suboptimality Gap]

To sum up, for fixed learning rates $\eta$ and $\lambda$, the regularization coefficients $\tau$ and $\tau_w$ involve a trade-off: larger $\tau$ and $\tau_w$ lead to faster convergence but also increase the suboptimality gap in value.

We appreciate the question and will include this analysis in our revision.

Complexity of Softmax and Convergence of Sparsity-Promoting Variants (Question 2)

[Complexity of Softmax]

The built-in softmax in PyTorch operates with $O(K)$ time and space complexity. We also compare against GGF-PPO, which updates only the minimal objective without using a softmax in the weight update. The training wall-clock time of ERAM and ARAM is comparable to that of GGF-PPO.

| Environments | ERAM | ARAM | GGF-PPO |
| --- | --- | --- | --- |
| Base-4 | 111±2.6 | 120±3.9 | 122±4.0 |
| Asym-4 | 87.2±2.4 | 87.4±2.4 | 84.8±2.2 |
| Asym-16 | 356±27 | 365±20 | 394±5.5 |

[Convergence in Sparsity-Promoting Variants]

Regarding sparsity-promoting or low-rank variants, we believe this is a compelling direction. We implemented two heuristics based on ARAM, sketched below. The first, topM-c, updates only the top-$M$ objectives most correlated with the current minimum: we compute the ARAM weight, select the top-$M$ by correlation, mask the others to 0, and renormalize. The second, topM-w, selects the top-$M$ elements of the ARAM weight vector and renormalizes them. We used the best-performing ARAM hyperparameters with $M=5$ in the Asym-16 setting, which has a relatively large number of objectives.
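A minimal sketch of the shared masking step is shown here; the selection scores (the ARAM weights themselves for topM-w, the correlation-with-minimum vector for topM-c) are assumed to be computed elsewhere, and the names are illustrative.

```python
import numpy as np

def top_m_renormalize(w, scores, m):
    """Keep the m entries of the ARAM weight w with the largest selection scores,
    zero out the rest, and renormalize back onto the simplex.

    For topM-w pass scores=w; for topM-c pass the correlation-with-minimum vector.
    """
    keep = np.argsort(scores)[-m:]       # indices of the top-m scores
    masked = np.zeros_like(w)
    masked[keep] = w[keep]
    return masked / masked.sum()
```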

| Environment | ARAM | ERAM | top5-c | top5-w |
| --- | --- | --- | --- | --- |
| Asym-16 | -15043 | -17334 | -15913 | -16551 |

As shown in the table, even simple heuristics perform fairly well and exhibit stable convergence. This suggests that more intelligent sparsity-promotion methods could be beneficial when $K$ becomes large. While we do not provide a rigorous convergence guarantee for the sparsity-promoting variants, we offer the following intuitive perspective.

Intuitively, if the selected subsets vary with little overlap across iterations, the behavior may resemble the jumping dynamics of GGF-PPO. If they consistently overlap, the method may inherit ARAM's conservatism. In practice, the degree of overlap would depend on the structure of the objectives and may vary over time.

Designing a variant that maintains sufficient overlap could strike a balance between sparsity and stability, and we see this as a promising direction for future work.

Fisher Matrix Approximation and Required Accuracy (Question 3)

In our analysis of NPG with softmax policy parameterization, the inverse Fisher matrix cancels out in the update expression (please see eq. (16) in Cen et al. (2022)). As a result, there is no need to approximate or bound this term within the proof.

In our analysis, the required accuracy primarily concerns the value estimates, rather than the Fisher matrix itself.

For ERAM, as shown in Theorem 4.3, we prove that if the value estimates can be approximated within an error bound $\delta$, then last-iterate convergence is achieved up to an $O(\delta)$ residual error. This ensures robustness to approximation errors in practice.

In the case of ARAM, we do not provide a rigorous bound for error, but empirically observe improved stability and performance. We agree that developing an adaptive scheme to bound approximation errors—especially under practical approximations of the Fisher matrix—is a promising direction for future theoretical investigation.

[Ref.] Cen, Shicong, et al. "Fast global convergence of natural policy gradient methods with entropy regularization." Operations Research 70.4 (2022): 2563-2578.

Validity of the Adaptive Scheme (Question 4)

We acknowledge that ARAM was introduced to improve practical performance, motivated by real-world scenarios where objectives may be correlated. While we do not provide a formal convergence guarantee, we offer the following insight into how ARAM works and how $\tau_w$ influences the update, aiming to illustrate the qualitative behavior and the role of $\tau_w$ in stabilizing the updates.

Recall the ARAM update: $w_{t+1} = \mathrm{softmax}\!\left(-\frac{1-\beta}{\tau_w}V^t + \beta\log w_t + (1-\beta)\log c_t\right)$, where $c_t = \mathrm{softmax}\!\left(E_{(s,a)\sim d^t}[r_k(s,a)\,r_{i_t'}(s,a)]\right)$ is the empirical correlation with the minimal objective $i_t'$. Using the fact that $V_k^t = \frac{1}{1-\gamma} E_{d^t}[r_k]$, this update can be rewritten as $w_{t+1} = \mathrm{softmax}\!\left(-\frac{1-\beta}{\tau_w}\big(V^t[r_k(1-\tau_w(1-\gamma)r_{i_t'})]\big) + \beta\log w_t(k)\right)_{k=1}^K$, where $V^t[r]$ is the value of $\pi_t$ under a given scalar reward function $r$. This shows that ARAM is equivalent to ERAM with an effective (time-varying) reward function that depends on the correlation with the current minimal objective.
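For illustration, a minimal NumPy sketch of this update is given below; the batch of sampled per-objective rewards and the names used are assumptions made for exposition.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def aram_weight_update(w, V, reward_batch, tau_w, beta):
    """One ARAM step: w_{t+1} = softmax(-(1-beta)/tau_w * V_t + beta*log(w_t) + (1-beta)*log(c_t)).

    reward_batch: array of shape (N, K) holding sampled per-objective rewards r_k(s,a)
    drawn under the current occupancy measure d^t.
    """
    i_min = int(np.argmin(V))                                 # current minimal objective i_t'
    corr = (reward_batch * reward_batch[:, [i_min]]).mean(0)  # E_{d^t}[r_k * r_{i_t'}] for each k
    c = softmax(corr)                                         # adaptive baseline c_t
    return softmax(-(1.0 - beta) / tau_w * V
                   + beta * np.log(w)
                   + (1.0 - beta) * np.log(c))
```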

This modification has an intuitive interpretation: when $r_k$ and $r_{i_t'}$ are positively correlated and $r_{i_t'}$ is large enough that $1-\tau_w(1-\gamma)r_{i_t'}<0$, the effective reward for objective $k$ changes sign, encouraging an increase of the weight on $k$, as objective $k$ is helpful for the weakest objective. Note that in ERAM, larger rewards always lead to lower weights in order to maximize the minimum, due to the negative coefficient $-\frac{1-\beta}{\tau_w}$. The parameter $\tau_w$ controls the sensitivity of this adjustment: a smaller $\tau_w$ imposes a stricter criterion for considering an objective as helpful to the minimal one (and thus behaves like ERAM), while a larger $\tau_w$ makes this criterion more lenient.

Although this is not a formal guarantee—due to the time-varying nature of the effective reward and its dependence on the evolving policy—we hope this analysis provides helpful intuition about the adaptive mechanism in ARAM.

Discussion on DR-RL (Question 5)

We appreciate the suggestion and the references on DR-RL. DR-RL typically tackles transition uncertainty. Similarly, in reward-uncertain MDPs, one can define an uncertainty set in reward space and apply max-min for robustness. Our setting corresponds to the finite uncertainty set $\{r_k\}$, with regularization capturing this internalized reward uncertainty, analogous to how DR-RL treats transition uncertainty. For infinite sets, DR-RL seems applicable for analysis in terms of distributions over the reward space. We will add this discussion to the related work in the revision. Thank you.

Comment

Thank you for your detailed response and I’m satisfied with it. After reading the other review and response, I believe this is a solid work and have decided to raise my rating.

Apologies for the late reply. I took some extra time to reread the paper and verify a few details in your response.

Comment

Dear Reviewer xubR

Thank you very much for your positive feedback and raising the score further. We appreciate your time and effort for review and discussion.

Best regards,

Authors

Review
Rating: 4

This paper introduces a game-theoretic framework for multi-objective reinforcement learning (MORL) with a max-min criterion. The authors reformulate the problem as a two-player zero-sum game and propose ERAM, an efficient single-loop algorithm with provable last-iterate convergence in tabular settings with softmax parameterization. They also present ARAM, a variant with adaptive regularization for improved performance. The method achieves gains in computational and memory efficiency over prior work, such as Park et al. [2024], and demonstrates better results in traffic signal control.

Strengths and Weaknesses

Strengths:

  1. The minimax game formulation proposed is novel in the context of MORL.
  2. The paper provides the first last-iterate convergence guarantee for softmax parameterization in tabular MORL.
  3. The proposed algorithm is clean and insightful, and a practical variant is also presented.
  4. The experimental results demonstrate improvements over prior work (Park et al.), with better computational and space efficiency.

Weaknesses:

  1. The minimax theorem result feels standard. It seems that by representing the policy in the occupancy measure space—where everything becomes convex-concave and the policy space is convex and compact—the result could follow directly from von Neumann’s minimax theorem.
  2. The theoretical analysis for the tabular setting largely follows standard NPG analysis. The sample complexity bound $(1/\epsilon^6 \epsilon_{\mathrm{acc}}^2)$ is not particularly strong.
  3. The use of entropy regularization in ERAM for achieving last-iterate convergence is fairly common.
  4. The writing could be improved. The theoretical sections, such as Section 3.1, would benefit from clearer structure, with formal statements of lemmas and proofs; the current presentation is difficult to read. The overall organization could be significantly improved: for example, I suggest moving the related work section earlier in the paper. Since ARAM is the central algorithm of this work, it deserves more emphasis—ideally its own subsection rather than being described briefly in the related work. For Section 6.3, the comparison with Park et al. would be stronger if accompanied by a more detailed description of their algorithm in the appendix.

Minor issues:

  1. It is unnecessary to repeatedly cite basic results from convex optimization or information theory whenever entropy/optimization is used.

Questions

  1. Could you explain in more detail why Park et al. only provide average-iterate convergence? I did not see this explicitly mentioned in their paper.

  2. Could you discuss the relationship between this work and existing last-iterate convergence results for zero-sum Markov games? For example, would min-max MORL be a special case of those settings? How does your result compare to theirs? See Faster Last-Iterate Convergence of Policy Optimization in Zero-Sum Markov Games, for instance.

  3. Could you elaborate on the key theoretical contributions of this paper?

Limitations

Yes, the authors have addressed the limitations and potential negative societal impacts of their work. Another limitation is that the theoretical analysis is restricted to tabular MOMDPs with softmax parameterization, and the sample complexity guarantees are not perfect.

Justification for Final Rating

The authors have addressed all of my concerns, and the paper provides insightful theory along with interesting experiments, making a valuable contribution to the field.

Formatting Issues

No.

Author Response

Importance of Minimax Theorem 3.1 (Weakness 1)

We agree that, when formulated over the space of occupancy measures $d$, the problem becomes a convex-concave optimization and the minimax theorem can be applied more directly. While this formulation is theoretically elegant, we instead choose to express the problem in terms of $(w, \pi)$, where optimization is carried out in the policy space. This choice allows us to leverage advanced policy optimization methods, such as PPO, and serves as a foundation for applying minimax-based reasoning in deep RL settings.

In this formulation, the minimax result is non-trivial because the value function is not concave in $\pi$. Our Theorem 3.1 shows not only that the max-min and min-max values coincide, but also that they share the same optimal solution. This insight provides a theoretical justification for solving the max-min MORL problem via direct policy optimization, bridging minimax theory with practical deep RL algorithms.

Regarding Sample Complexity (Weakness 2)

We would like to clarify that our goal is not to claim an improvement in sample complexity over standard NPG analysis. Rather, our intention is to demonstrate that it is possible to obtain a solution to the max-min MORL problem under a finite sample complexity regime via our algorithm. This is included to ensure theoretical completeness and to support the validity of our overall framework.

Novelty of Regularization Technique (Weakness 3)

We agree that entropy regularization is a fairly common tool for achieving last-iterate convergence. Our work takes one step further by introducing an adaptive regularization scheme (ARAM) that captures correlations across objectives. This design is motivated by the practical challenges of real-world multi-objective problems and, to the best of our knowledge, is novel in this context.

Regarding Writings (Weakness 4)

Thank you for the helpful suggestions. We will revise the theoretical sections to present lemmas and proofs more formally for clarity. We also agree that moving the related work section earlier and presenting ARAM in its own subsection would improve the overall organization, and we will revise the paper accordingly. Lastly, we will include a more detailed description of Park et al.'s algorithm in the appendix to strengthen the comparison.

Convergence in Park et al. (Question 1)

The reason Park et al. only provide average-iterate convergence is that their weight update is based on projected gradient descent, applied not to the original objective but to a Gaussian smoothing of the objective. Theorem 6 in Nesterov (2017) establishes only average-iterate convergence guarantees for this optimization method (referred to as $RS_\mu$ in their paper).

[Ref.] Nesterov, Yurii, and Vladimir Spokoiny. "Random gradient-free minimization of convex functions." Foundations of Computational Mathematics 17.2 (2017): 527-566.

Relation to Zero-Sum Markov Games (Question 2)

Thank you for the thoughtful question. The key difference lies in the fact that, in zero-sum Markov games (MGs), both players operate over the same strategy space, typically the policy space $\Pi = \{\pi \mid \pi:S\to\Delta(A)\}$. In contrast, our setting involves asymmetric strategy spaces: the learner's strategy space is the policy space $\Pi$, while the adversary operates over the weights, i.e., $w \in \Delta^K$.

To investigate whether our setting could be framed as a special case of an MG, consider defining a new action space for the adversary as $[K] = \{1, \ldots, K\}$, allowing $w \in \Delta^K$ to be interpreted as a policy over $[K]$. However, in our formulation, the same $w$ must be applied across all states. That is, we may view the adversary's strategy as a state-independent policy $\pi_w^{MG}: s \mapsto w \in \Delta([K])$ for all $s \in S$.

This corresponds to a constrained subset of the full MG policy space $\Delta([K])^S$, and it restricts the adversary to state-independent strategies. As such, our setting becomes a constrained optimization problem over a restricted policy class, which cannot, in general, be addressed by techniques designed for unconstrained MGs. Furthermore, in MGs there is no guarantee that the optimal policy for the adversary is state-independent. Hence, our problem is not a special case of existing MG formulations, and last-iterate convergence results for MGs are not directly applicable here.

Key Theoretical Contributions (Question 3)

As discussed in our response to Weakness 1, one of the central theoretical contributions of our work is the reformulation of the max-min MORL problem as a two-player zero-sum game in terms of the pair $(w,\pi)$, rather than over the occupancy-measure space. This reformulation enables the use of policy optimization methods, which are more scalable and practical. Despite the fact that the value function is non-concave in $\pi$, starting from the result that the max-min and min-max values are equal, we prove that our game formulation indeed solves the original max-min MORL objective by finding a Nash equilibrium.

To the best of our knowledge, our work is the first to provide a provable last-iterate convergence guarantee for the max-min MORL setting. In our proof, we build upon prior results on the monotonic improvement of value functions under NPG with a softmax policy and entropy regularization (Cen et al. 2022). However, unlike standard regularized NPG analysis, the presence of an adversary in the max-min setup breaks the monotonic improvement of $V_{w_t}^{\pi_t}$. As a result, we derive a lower bound on the change in the value function (which can be negative due to the adversary), and this bound plays a critical role in establishing convergence. This treatment of the non-monotonicity introduced by the adversary is a key technical distinction of our proof.

[Ref.] Cen, Shicong, et al. "Fast global convergence of natural policy gradient methods with entropy regularization." Operations Research 70.4 (2022): 2563-2578.

Comment

Thank you for your response.

Regarding Weakness 1, I believe one can always use the occupancy measure as the policy representation, prove the existence of a saddle point in that space, and then recover the corresponding policy from the occupancy measure?

Regarding Weakness 2, I suggest the authors review the literature on zero-sum Markov games and clearly discuss the differences in both the problem settings and the techniques.

Comment

Thank you very much for replying to our response and additional comments.

Saddle Point in Policy Space (Weakness 1)

We agree that the max-min MORL problem can be formulated using the occupancy measure $d(s,a)$ and that a policy can be recovered from the occupancy measure. The occupancy-measure max-min formulation is given by [Eq. (9) of Park et al. (2024) and Eq. (3) of Cousins et al. (2024)]

$$\max_{d} \min_{1 \leq k \leq K} \sum_{s,a} d(s,a)\, r^{(k)}(s,a)$$

$$\text{s.t.}\quad \sum_{a'} d(s',a') = \mu_0(s') + \gamma \sum_{s,a} P(s' | s,a)\, d(s,a) \quad \forall s'$$

$$d(s,a) \geq 0 \quad \forall (s,a),$$

where $r^{(k)}$ is the $k$-th objective reward function.

The issue with this formulation is that we need to learn $P$ and $r^{(k)}$, which restricts its applicability to small finite problems.

Cousins et al. (2024) considered this model-based approach, estimating the model via visitation counts, which requires roughly 131,000,000,000,000 samples even for the simple Four-Room MORL environment with $|S|=5248$, $|A|=4$, and 2 objectives. This is impractical.

Our formulation in terms of $\pi$ rather than $d$ circumvents the need for model knowledge and enables model-free policy optimization, which is effective in deep RL settings.

More on Two-Player Zero-Sum Markov Games (Question 2)

We realize that our previous explanation was not sufficiently clear. Your insightful question has helped us gain a deeper understanding of how our work relates to research on two-player zero-sum Markov games (2p0s MGs).

[Problem Settings]

Our target problem, entropy-regularized max-min MORL, can be viewed as a two-player zero-sum Markov game:

  • the min-player (the adversary) has a state-independent policy $w$, and

  • the transition is determined solely by the max-player (the learner), since $P(s'|s,a)$ is independent of $w$.

Note that we also aim to find an $\epsilon_{acc}$-QRE, as in several prior works.

[Algorithmic Settings]

The following table compares the algorithmic settings of our method with existing references on 2p0s MGs. Here, symm denotes symmetric, and asym denotes asymmetric.

In 2p0s MGs, symmetric regularization is natural because both players use policies as their strategies, with cumulative entropy $\tilde{H}(\pi)$ on the order of $\frac{\log|A|}{1-\gamma}$. In contrast, ERAM's adversary uses a weight $w$ as its strategy, with non-cumulative entropy $H(w)$ on the order of $\log K$, justifying an asymmetric regularization coefficient.

| References for 2p0s MGs | Entropy Reg. / Coef. | Base Method | Policy Param. | Learning Rates |
| --- | --- | --- | --- | --- |
| Daskalakis et al. (2020) | none | PG | direct | asym |
| Zeng et al. (2022) | symm | PG | softmax | asym |
| Wei et al. (2021) | none | OGDA | direct | symm |
| Cen et al. (2023) | symm | OMWU | direct | symm |
| ERAM (Ours) | asym | NPG, MD | softmax | asym |

[Techniques for Proof]

We have carefully reviewed the proof of Theorem 1 in Cen et al. (2023) and found it applicable to the MORL setting under symmetric learning rates and regularization ($\eta=\lambda$, $\tau=\tau_w$) and by using $\frac{1}{1-\gamma}H(w)$ (or absorbing $\frac{1}{1-\gamma}$ into $\tau_w$, yielding asymmetric coefficients). Nevertheless, a fully direct application is still challenging because our setting lacks these symmetric conditions.

Note that our proof requires asymmetric learning rates for last-iterate convergence (please see Section 4) and asymmetric regularization (please see [Algorithmic Settings]).

The result of Cen et al. (2023) applies to MORL in the symmetric case because its key step, Eq. (14), has a state-dependent RHS that is ultimately bounded by $\|\cdot\|_{\Gamma(\rho)}$, removing all state dependence. Hence, both the state-dependent policy in MGs and the state-independent $w$ in MORL satisfy the same recursive bounds in Lemmas 2–4 of Cen et al. (2023).

Instead, in our work, since $w$ is already state-independent and there is only one RL learner, we find that adapting the proof in Cen et al. (2022) is sufficient for our analysis.

Regarding convergence speed, Cen et al. (2023) provide a convergence error bound of $O((1-(1-\gamma)\eta\tau/4)^t)$, while ours achieves at most $O((1-\epsilon^2/2)^t)\le O((1-\lambda\tau_w/4)^t)$ (the choices of $\eta,\lambda$ follow our proof in Appendix F). If we take $\epsilon\ge(1-\gamma)^2$, we have $1-\lambda\tau_w/4 \le 1-(1-\gamma)\eta\tau/4$. Hence, our convergence speed is comparable to that of the prior work and does not conflict with their results. (Note that $\epsilon\ge(1-\gamma)^2$ does not violate $\epsilon\in(0,\epsilon_0=48(1-\gamma)/121)$ (Remark F.3) for $\gamma$ sufficiently close to 1.)

Please let us know if any further discussion would be helpful. Thank you.

Comment

Thank you for the response!

Regarding Weakness 1, what I meant is that if your goal is to prove the minimax property and Theorem 3.1 itself, the choice of policy representation doesn't fundamentally matter. You could always prove it using the occupancy measure and still design the algorithm in the original policy space. I bring this up because in this sense the minimax result is somewhat standard. I suggest the authors revise the writing to make this point clearer.

Thank you for the clarification to Question 2. I recommend that the authors add a discussion of this literature into the Related Work section.

I will keep my positive rating for the final assessment.

Comment

The references for our response are listed below.

[References]

Cen et al. "Fast global convergence of natural policy gradient methods with entropy regularization." Operations Research (2022)

Cen et al. "Faster last-iterate convergence of policy optimization in zero-sum Markov games." ICLR (2023).

Cousins et al. "On welfare-centric fair reinforcement learning." Reinforcement Learning Conference. (2024).

Daskalakis et al. "Independent policy gradient methods for competitive reinforcement learning." NeurIPS (2020).

Park et al., "The max-min formulation of multi-objective reinforcement learning: From theory to a model-free algorithm," ICML (2024).

Wei et al. "Last-iterate convergence of decentralized optimistic gradient descent/ascent in infinite-horizon competitive Markov games." COLT (2021).

Zeng et al. "Regularized gradient descent ascent for two-player zero-sum Markov games." NeurIPS (2022).

Comment

Thank you for the positive assessment and for engaging in the discussion; your feedback has been greatly valuable.

We will revise the writing around Theorem 3.1 in the revised version to clarify our intent and avoid any potential misleading points you mentioned. We will also add a comparison and discussion with prior work on zero-sum Markov games.

Comment

Dear Reviewers,

As you may know, the reviewer-author discussion period for NeurIPS 2025 is currently underway (July 31 – August 6, 11:59pm AoE). The authors have put in significant time and effort to the rebuttal. Your timely engagement is crucial—not only to acknowledge their responses, but also to help clarify any outstanding issues, ensure a fair and thorough evaluation process, and improve the overall quality of the review cycle.

I kindly ask you to:

  • Read the author rebuttal and other reviews (if you have not already).
  • Post your response to the author rebuttal as soon as possible.
  • Engage in discussion with the authors and other reviewers where appropriate.

Thank you again for your time and contributions to the review process.

Best regards,

Area Chair, NeurIPS 2025

Final Decision

This paper proposes a game-theoretic framework for max-min multi-objective reinforcement learning with efficient single-loop algorithms and provable last-iterate convergence. Reviewers appreciated the theoretical insights, the adaptive regularization scheme, and the strong empirical improvements over prior work. Some concerns remain regarding clarity, missing related work, and limited novelty beyond practical contributions, but the authors’ rebuttal addressed most issues satisfactorily. Overall, I find the paper technically sound and recommend its acceptance as a poster.