/10

Poster4 位审稿人

最低3最高4标准差0.4

ICML 2025

Navigating the Social Welfare Frontier: Portfolios for Multi-objective Reinforcement Learning

Cheol Woo Kim,Jai Moondra,Shresth Verma,Madeleine Pollack,Lingkai Kong,Milind Tambe,Swati Gupta

提交: 2025-01-22更新: 2025-08-16

摘要

In many real-world applications of Reinforcement Learning (RL), deployed policies have varied impacts on different stakeholders, creating challenges in reaching consensus on how to effectively aggregate their preferences. Generalized $p$-means form a widely used class of social welfare functions for this purpose, with broad applications in fair resource allocation, AI alignment, and decision-making. This class includes well-known welfare functions such as Egalitarian, Nash, and Utilitarian welfare. However, selecting the appropriate social welfare function is challenging for decision-makers, as the structure and outcomes of optimal policies can be highly sensitive to the choice of $p$. To address this challenge, we study the concept of an $\alpha$-approximate portfolio in RL, a set of policies that are approximately optimal across the family of generalized $p$-means for all $p \in [-\infty, 1]$. We propose algorithms to compute such portfolios and provide theoretical guarantees on the trade-offs among approximation factor, portfolio size, and computational efficiency. Experimental results on synthetic and real-world datasets demonstrate the effectiveness of our approach in summarizing the policy space induced by varying $p$ values, empowering decision-makers to navigate this landscape more effectively.

关键词

Reinforcement learningmulti-objective optimizationp-means aggregation

评审与讨论

审稿意见

评分: 42025-03-10

This paper proposes a method to create a portfolio of multi-objective reinforcement learning policies that aim to preserve the social welfare among agents; in other words, a portfolio that ensures that different stakeholders fulfill their individual preferences. The presented method (p-MeanPortfolio) revolves around Generalized p-means, a family of social welfare functions that balance fairness given a parameter $p \in [-\infty, 1]$ . This work proposes an algorithm that generates a portfolio of policies that approximates the generalized p-means objectives up to a level defined by $\alpha \in (0,1)$ . The paper provides theoretical bounds to the portfolio size and the number of Markov Decision Processes (MDPs) optimally solved. Additionally, this paper proposes a heuristic algorithm (BudgetConstrainedPortfolio) that reduces the computational cost of the p-MeanPortfolio algorithm while preserving the quality of the portfolio. The paper presents experiments in synthetical data and real-world problems.

给作者的问题

How does your overestimation of Theorem 4.1 impact your current results?
How diverse are the final p values selected in your experiments? I understand that the final portfolio finds the $\alpha$ -approximate portfolio that ultimately contains pairs policy-value of p. Is my understanding correct?

论据与证据

Theoretical analysis and experiments on synthetic data and real-world scenarios support the paper's claims.

方法与评估标准

The proposed methods fulfill the objective of the paper: selecting a subset of policies from a larger set to constitute a portfolio that preserves generalized p-means welfare along the spectrum of possible values of p, tackling the limitations of picking a specific value that could be suboptimal. The paper introduces two algorithms that manage to do this: p-MeanPortfolio and BudgetConstrainedPortfolio and shows their validity empirically in cases of limited size.

The characteristics of the testing scenario match the theoretical guarantees the paper provides, as it assumes bounded rewards for the Markov Decision Processes (MDPs), and they respect the bound for the portfolio size detailed in the theory part.

理论论述

I checked all the theoretical claims:

Assumption 3.1 to bound the reward
Lemma 3.2 that talks about the monotonicity of the generalized p-mean
Theorem 4.1 about the bounds to portfolio size and Oracle calls:

I think the final bound has a miscalculation (Line 858), leading to an overestimation of the oracle calls upper-bound:

O\left(ln\left(\frac{\kappa ln N}{ln(1/\alpha)}\right)\right) = O\left(\frac{ln\kappa + ln ln N}{ln ln(1/\alpha)}\right)

Proposition 4.2 talks about the closeness of optimal values for two different p (enabling the proposition of warming up the search and increasing computational efficiency.

实验设计与分析

The presented experiments ran over three scenarios: resource allocation after a natural disaster (synthetic), taxi (synthetic), and healthcare intervention (real-world). There was a set of pre-trained policies to choose from in all cases. The test consisted of evaluating the quality of the portfolios based on how close they were on average with respect to the best possible performance (known beforehand).

One experiment compared the p-MeanPortfolio with picking the policies randomly and picking the value of p randomly to create the portfolio regarding the approximation quality. I couldn't see any mention of the selected value of p for each policy in the final portfolio, and it's relevant to better understand the experiments.

The other experiment compared both methods (p-MeanPortfolio and BudgetConstrainedPortfolio) regarding approximation quality., portfolio size, and oracle calls.

I couldn't find an explanation in the paper for the cases in which BudgetConstrainedPortfolio beat the p-MeanPortfolio in terms of quality, considering that the former uses heuristics versus the latter using as many calls to the oracle as possible.

补充材料

The supplementary material is not runnable; it contains a single file with two indicative functions, rendering it hard to check their experiment's reproducibility.

与现有文献的关系

The work in this paper could be comparable to optimizing portfolios using combinatorial methods (which the paper referenced), with the approach of searching over the space of possible values of $p$ in the context of generalized p-mean welfare instead of solving a mathematical problem, as long as the assumptions hold.

遗漏的重要参考文献

I couldn't find any essential reference that was not discussed.

其他优缺点

Strengths

The paper is well-organized and easy to follow.
The theoretical claims are clearly developed and verifiable
The theoretical bounds seem reasonable in providing optimality guarantees (given an original set of policies).

Weaknesses

Supplementary materials don't add to the submission
The results don't discuss the selection of values for $p$ when picking a policy to be part of the portfolio; considering that the most important contribution of this work is the search over the range of possible values for $p$ , I believe the pairs policy-value of p should be discussed.
It seems that this method is applicable for reduced-size (small-sized portfolios) problems.

其他意见或建议

Check the suggestion over the final expression of the bound in Theorem 4.1 for oracle calls and discuss if the overestimations change in any sense your results
I think there's a typo in Lemma A.4 (Line 653): You switched to $v^{\mathcal{(l)}}$ , but you were developing all for $v^{\mathcal{(1)}}$ .
Why do you think the BudgetConstrinedPortfolio performs better for portfolio sizes 1 and 2?

作者回复

2025-04-01

We thank you for your incredibly detailed and insightful comments! Additional results are available in the anonymous link here

1. On the overstimation of oracle complexity in Theorem 4.1, and how this impacts our current results.

We thank you for pointing out such a detailed point! Indeed, both expressions are valid upper bounds. The version stated in the paper,( $O\left(\frac{\ln(\kappa)^2 \ln\ln(N)}{\ln(1/\alpha)\ln\ln(1/\alpha)}\right)$ is intentionally presented in a slightly simplified, cleaner form. The more precise, but somewhat more complicated expression, would be $O\left(\frac{(\ln \kappa)(\ln \kappa + \ln\ln(N))}{\ln(1/\alpha)\ln\ln(1/\alpha)}\right)$ ); we will include this bound in the revised version. Both formulations correctly bound the oracle complexity, with our stated version slightly overestimating it for simplicity.

2. Discussions on selected $p$ values in the experiments.

We thank you for the great suggestion. Yes, your understanding is correct: our approach outputs pairs of $p-$ value and the corresponding optimal policies. We have included the entire $p$ values chosen by our main algorithm $p$ -MeanPortfolio in the anonymous link, Table 2. Furthermore, we have provided illustrative results demonstrating how the outcomes of the optimal policies for different $p$ values in the portfolio lead to varying impacts for the stakeholders (see Figures 4,5). We observe that the $p$ values are quite diverse, which corroborates the algorithm's objective to cover the entire $p$ range only with these selected values.

For the random $p$ baseline, however, it is more challenging to present the entire $p$ values with meaningful explanations. Due to the inherent randomness of this baseline, we generate 10 independent random $p$ portfolios and report the average performance across these runs (see Section 5.1 for details). Since the $p$ values are sampled uniformly at random from an interval, listing them would result in too many random values without providing additional useful insight.

3. Why the heuristic can sometimes outperform the p-meansPortfolio

$p$ -MeanPortfolio selects policies solely to meet the input approximation factor $\alpha$ . However, the algorithm has no incentive or mechanism to surpass this approximation quality.

In contrast, the heuristic fully utilizes the input budget $K$ , and strategically selects boundary points first (a small initial $p$ followed by $p=1$ ). When $K$ is very small (1 or 2), this initial heuristic strategy can provide more effective coverage across $p \in [-\infty, 1]$ , explaining its occasional superior performance.

4. Code Reproducibility in the Supplementary materials.

We have included the code to replicate our results in the anonymous link, and have verified that everything runs seamlessly. Thank you for pointing this out.

5. It seems that this method is applicable for reduced-size (small-sized portfolios) problems.

In our experiments, the algorithms indeed resulted in small-sized portfolios while maintaining high approximation guarantees. However, we respectfully argue that our method is not inherently limited to small-sized portfolios. The theoretical results we provided explicitly characterize how portfolio size depends on the underlying problem parameters, allowing the method to scale accordingly. Our experiments involve a significantly larger number of reward functions (stakeholders) compared to prior MORL research, demonstrating the effectiveness of our method beyond simpler settings.

6. Typo in Lemma A.4.

We thank the reviewer for noting this detail! We will promptly correct the typo.

审稿人评论

2025-04-09

I appreciate the efforts of the authors to address my comments and suggestions in their rebuttal. Adding better supplementary material in the anonymized repo helped check their results. Thanks for clarifying the upper bound to the Oracle calls; including your decision to overestimate and the reasons for doing so would contribute to understanding your method better. Thanks for the discussion about $p$ values and your explanation about the performance of heuristics vs. the line search version of your algorithm.

I feel that the authors answered my concerns sufficiently, and I am willing to update my score.

审稿意见

评分: 32025-03-12

The paper proposes algorithms for computing a portfolio of policies for multi-objective MDPs. The individual reward functions are scalarized by an aggregation function called p-means, which is dependent on a parameter $p \in [-\infty,1]$ . The authors propose an algorithm, which generates an $\alpha$ -approximate cover of all $p$ values, i.e., for an p-value there is an alpha approximation in the portfolio. The algorithm starts with a lower bound and then, exploits the monotonicity of the p-means functions to find the next p-value for a policy that should be included in the portfolio. Finding this upper bound is done by a binary search algorithm which requires logarithmic calls of the oracle function. The authors also provide an upper bound for the portfolio size and the number of MDPs that have to be solved. As an alternative, they propose a second algorithm, which has a bounded number of oracle calls, but cannot guarantee to hold a specific $\alpha$ -value. In the experiments, the algorithms are compared to baselines generating policies for a randomly chosen number of values for $p$ .

给作者的问题

How did you select the $\alpha$ values for the experiments?

Can you explain why the approximate algorithm can outperform the p-meansPortfolio?

What is the set of feasible set of policies $\Pi$ for in the input of your algorithms? You seem to compute the policies on the fly using the Oracle?

论据与证据

I would generally say that the claims are supported well enough.

方法与评估标准

The empirical evaluation uses three environments looking appropriate for this setting.

理论论述

As far as I could follow the proofs in the appendix, I could not find any flaw.

实验设计与分析

The selection of baselines seems somewhat unfair as the baselines have exactly $K$ MDPs to solve whereas the proposed p-mean portfolio might have much more Oracle calls. Also, it is unclear how p-means were tuned to exactly generate a given number of policies. The approximate method sometimes beat the non-approximate ones and there was no comment on how this happened.

补充材料

I checked the supplementary. In particular the proofs and the description of the used environment.

与现有文献的关系

The related work section of the paper was insightful though I lack the familiarity with the area to claim that I could check for completeness here.

遗漏的重要参考文献

I am not aware of any major work that should have been mentioned.

其他优缺点

The general task of the paper is hard to grasp as the use of varying the parameters of p-means is not motivated very well. Also, the assumption that all reward functions have the same range is somewhat unmotivated. Thus, the problem setting seems to be rather specific and have to ask myself if the paper is of broader impact to the machine learning community.

其他意见或建议

The example in section 1.2 did not help me much to understand the paper. Instead, a paragraph to better explain the idea of p-means would be insightful. Maybe a figure on which reward vectors are optimal under varying values of p, would help to grasp what varying the p-value actually means.

As your work does not rely on reinforcement learning, assuming MDPs where the optimal solution can be computed by dynamic programming or linear programming, might allow you exactly determine optimal policies for a particular value of p.

The section on warm-starting the oracle is also not very insightful. The general idea is known from methods for computing the pareto front and the proposition gives a cryptic bound for the difference in the value. But this is not very insightful on the rate this accelerates the convergence of the method used to solve the MDP.

作者回复

2025-04-01

We thank you for the insightful comments! Additional results are available in the anonymous link here

1. The selection of baselines seems somewhat unfair ... much more Oracle calls.

Thank you for your comment. However,if we match the number of oracle calls for the random $p$ baseline, the resulting portfolio becomes impractically large. For instance, in the taxi environment, our portfolio achieves an approximation factor of 0.99 with 10 policies (using 144 oracle calls), while the random baseline under the same oracle calls will produce 144 policies. Thus, although both approaches may yield similar approximation quality under the same oracle budget, our method does so with significantly fewer policies.

2. It is unclear how p-means were tuned ... given number of policies.

We do not manually tune the $p$ values to explicitly control the number of policies in the portfolio. Instead, the portfolio size is implicitly controlled through the desired approximation factor $\alpha$ . Given a specific $\alpha$ as the input, our algorithm systematically searches through the entire range of $p \in [-\infty, 1]$ , automatically selecting a minimal set of representative $p$ -values to build a $\alpha-$ approximate portfolio.

3. The general task of the paper is hard to grasp ... of broader impact to the machine learning community.

We thank you for this important comment. Generalized $p$ -means represent a widely used and theoretically justified class of social welfare functions in MORL. However, an important practical challenge that was previously overlooked is that the right fairness criterion (the choice of $p$ ) can be unclear in advance. Selecting a policy optimized for an arbitrary $p$ can lead to poor outcomes under a different $p$ value (see the approximation qualities of random $p$ baselines in our experiments).

We propose computing a small yet representative set of policies that covers the entire spectrum of fairness criteria represented by $p \in [-\infty, 1]$ . The decision-maker can confidently select from this portfolio, without worrying about suboptimality under other potential choices of $p$ .

Finally, we respectfully argue that the assumption on reward ranges is not overly restrictive. If each reward function $i \in [N]$ has a different range $L_i \leq R_i(\cdot) \leq U_i$ , we can simply define a universal reward range as $L := \min_{i \in [N]} L_i$ and $U := \max_{i \in [N]} U_i$ without loss of generality.

4. The example in section 1.2 did not help me ... varying the p-value actually means.

We thank you for this suggestion. To clarify the meaning behind $p$ -means, we constructed an illustrative example in Figures 1, 2, and 3. In this example, three vectors $A$ , $B$ , and $C$ represent different policies, where each vector's $i$ -th entry is the reward for the $i$ -th stakeholder. Here, $A$ is a balanced policy, $B$ is unbalanced with a high sum, and $C$ lies in between. Figure 3 plots the $p$ -mean values of these vectors, showing that varying $p$ leads to different optimal policy.

5. The section on warm-starting the oracle ... convergence of the method used to solve the MDP.

We thank you for this comment. Indeed, the provided bound does not directly translate into an explicit convergence rate, as this rate depends heavily on the specific algorithm used to solve a given MDP. What we aimed for is a theoretical foundation that quantifies the benefits of warm-starting regardless of the algorithm, using the distance between an initial value function (obtained by warm-starting) and the optimal value function, which is standard metric in the literature.

6. How did you select the $\alpha$ values?

For a given portfolio size $K$ , we chose the smallest $\alpha \in \{0.10, 0.15, \ldots, 0.90, 0.95\} \cup \{0.99\}$ among which the resulting portfolio has size $K$ . We will make sure to clarify this point in the revision.

7. Why the approximate algorithm can outperform the p-meansPortfolio

$p$ -MeanPortfolio selects policies solely to meet the input approximation factor $\alpha$ . However, the algorithm has no incentive or mechanism to surpass this approximation quality.

In contrast, the heuristic fully utilizes the input budget $K$ , and strategically selects boundary points first (a small initial $p$ followed by $p=1$ ). When $K$ is very small (1 or 2), this initial heuristic strategy can provide more effective coverage across $p \in [-\infty, 1]$ .

8. What is the set of feasible set of policies ... using the Oracle?

The set of feasible policies $\Pi$ is assumed to be given as an input. Note that $\Pi$ may include all possible policies within an MDP and need not be a small, restricted set. The Oracle we refer to computes the welfare-maximizing policy within $\Pi$ , given a fixed value of $p$ (i.e., a fixed social welfare function).

审稿人评论

2025-04-02

Thank you, for your detailed answers. for point 2, Understood. I was just wondering how you aligned the portfolio size for table 2 when you compare both of you algorithms.

I am still confused about where $\ Pi$ in your algorithm comes from. In the pseudo-code, it looks like you select a subset of $\Pi$ . But this would mean that you must compute the all policies before calling the algorithm. From the rest of the paper, I understood that you call the oracle on the fly during the algorithm to generate policies for particular scalarizations. Where is my mistake?

作者评论

2025-04-03

The short answer is no. We do not need pre-computed policies. $\Pi$ can be implicitly specified as the set of all possible policies in MDP, though our method can easily adapt to explicitly given, finite sets. We elaborate this below.

Suppose $\Pi$ is defined as the set of all possible policies in an MDP. Clearly, this set is uncountably infinite (including stochastic policies) and cannot be explicitly enumerated or pre-computed in advance. However, when we call the oracle to solve ${\max}_{\pi \in \Pi} v^{(\ell)} (\pi, p)$ for a fixed $p$ during the algorithm, we are treating this as an optimization problem over policy space. This is exactly what RL algorithms are designed to do: they search this infinite policy space implicitly to find an optimal policy for a given scalarization. Hence, at no point do we need to pre-compute any policy in $\Pi$ in advance to compute its subset (i.e, a portfolio). This is the setting we consider in the taxi environment in our experiments.
On the other hand, there are practical scenarios that more closely match your interpretation. In some applications, a decision-maker may not want to consider all possible policies, but rather choose from a finite set of explicitly given, pre-computed ones. In this case, the Oracle still solves ${\max}_{\pi \in \Pi} v^{(\ell)} (\pi, p)$ , for some $p$ , but does so by simply evaluating all policies in the given finite set $\Pi$ . This does require pre-computing the policies, as you suggested. We adopt this approach in the healthcare intervention and natural disaster environments, where we construct a finite, realistic set of policies in advance, and use this set as $\Pi$ . Details on how these policies are generated can be found in Appendix B. To summarize, even if $\Pi$ serves as an “input” to our algorithm and our goal is to find its subset, we do not need to pre-compute every policy in $\Pi$ explicitly in advance. When $\Pi$ represents all possible policies in MDP, the Oracle uses RL to implicitly search the policy space. We hope this clarifies your concern as well as the two distinct settings considered in our experiments.

If our response addresses your concerns, we kindly request that the reviewer update the score accordingly.

审稿意见

评分: 32025-03-14

The paper investigates an interesting problem of learning fair policies in multi-objective reinforcement learning (MORL) by proposing an $\alpha$ -approximate portfolio which is a finite set of policies that are approximately optimal across the family of generalized p-means social welfare functions (e.g., Nash, utilitarian, egalitarian). Instead of optimizing a single welfare function, the authors develop algorithms (p-MeanPortfolio and Budget-ConstrainedPortfolio) to construct portfolios for generalized p-means with theoretical guarantees on approximation quality and computational efficiency. Experiments on synthetic and real-world domains (e.g., healthcare interventions) demonstrate the approach’s ability to balance fairness and efficiency while reducing oracle calls.

给作者的问题

See above weaknesses and my concerns in the Relation To Broader Scientific Literature.

A few more questions:

How does the heuristic algorithm perform in worst-case scenarios? Can you bound its approximation factor $\alpha$ as a function of the budget K?
Which RL algorithms (e.g., policy gradient, value iteration) were used to compute optimal policies for fixed p-means? How were hyperparameters tuned?
Do the approximation guarantees differ between SER and ESR scalarizations? Is it possible to include separate results for each criterion?
How does the portfolio size/quality trade-off compare to Pareto front approximations in terms of interpretability and computational cost?
How does this work differ from methods like Envelope [1], GPI [2], or PCN [3], which also learn multiple policies for diverse objectives?

[1] Yang, Runzhe, Xingyuan Sun, and Karthik Narasimhan. "A generalized algorithm for multi-objective reinforcement learning and policy adaptation." Advances in neural information processing systems 32 (2019).

[2] Alegre, Lucas N., et al. "Sample-efficient multi-objective learning via generalized policy improvement prioritization." arXiv preprint arXiv:2301.07784 (2023).

[3] Reymond, Mathieu, Eugenio Bargiacchi, and Ann Nowé. "Pareto conditioned networks." arXiv preprint arXiv:2204.05036 (2022).

论据与证据

Most of the paper claims are supported by theoretical and experimental results. However, the paper lacks baseline comparisons and more realistic environments to show if the proposed method can scale with environments with large number of objectives.

方法与评估标准

The benchmark environments are appropriate. However, including one or more realistic environment with higher number of objectives would further strengthen the paper's claim.

理论论述

The theoretical results in the main paper seem to be correct. However, I did not verify the detailed proofs in the appendix.

实验设计与分析

Relevant baselines and more realistic empirical validations are missing.

补充材料

I briefly checked appendix B for environment descriptions.

与现有文献的关系

The contributions of this paper to fair RL are somewhat unclear. Previous methods, such as the one cited in the paper (Yu et al., 2024), already optimize the generalized Gini welfare function for preferential user treatment. This approach is quite similar to the one proposed here. The Gini welfare function is also highly general where setting $w_1=1$ and all others to 0 results in an egalitarian outcome, while equal weights lead to a utilitarian welfare function. Additionally, preferential weights can be used to sample a policy for specific user weights. A key advantage of the cited work is that it learns a single policy which eliminates the need for multiple oracle calls. This raises the question of what specific benefits this paper offers over existing methods?

遗漏的重要参考文献

The paper adequately discusses related work.

其他优缺点

Strengths:

The paper investigates an interesting problem of fairness in MORL where the goal is to achieve policy outcomes to the choice of social welfare functions. This problem in its setting is not only interesting but also novel the way it is being solved.
The idea of an $\alpha$ -approximate portfolio is promising, seems original and provides a structured way to learn a set of Pareto optimal policies that trade-offs across p-values. Moreover, the unification of scalarized expected returns (SER) and expected scalarized returns (ESR) under one framework is a notable contribution.
The analysis of portfolio size (logarithmic in the reward condition number κ) and oracle complexity (Theorem 4.1) is thorough, and make sense.
The Budget-ConstrainedPortfolio heuristic achieves near-optimal $\alpha$ with significantly fewer oracle calls than p-MeanPortfolio* (Table 2), making it viable option for real-world applications.
The paper is well-written, and the related work section comprehensively covers MORL and fairness literature.

Weaknesses: Despite it strengths, the paper has several weaknesses. Below I enlist the weaknesses, my concerns, questions and suggestions for improving the paper.

While theoretically sound, p-MeanPortfolio* requires many oracle calls (e.g., 69 calls for α=1.0 in healthcare experiments) which is very computationally expensive and also limit the scalability to large MDPs.
The paper does not specify the underlying RL method used to solve Problem (3) (e.g., policy gradient, Q-learning). It's important to explain how (3) is solved both for practical and reproducibility reasons.
The paper also lacks comparisons to MORL methods that approximate Pareto front such as Envelope[1], GPI[2] or PCN[3]. Although random policy and $p$ baselines are included, comparisons to multi-policy MORL methods is essential to see if they can outperformed by the proposed methods both in terms of fairness and policy diversity.
The experiments do not clarify whether results correspond to SER, ESR, or both. It's important to clarify as results holding for SER may not be hold true for ESR.

其他意见或建议

See above.

作者回复

2025-04-01

We thank you for the insightful comments! Additional results are available in the anonymous link here

1. Including one or more realistic environment with higher number of objectives.

We respectfully note that our experiments already consider a significantly higher number of objectives compared to most previous MORL studies. For example, while the referenced papers explore up to 16 objectives (Yu et al., 2024), 6 in [1], 2 in [2], and 9 in [3], our setups include a natural disaster scenario with 16 objectives and a healthcare intervention scenario with 59 objectives. We will make sure to emphasize this point in our manuscript.

2. The relation to the paper (Yu et al., 2024) and the benefits of our work over existing methods

Yu et al. (2024) optimize a generalized Gini welfare assuming a fixed weight vector. We address a fundamentally different scenario, where decision-makers are uncertain about the appropriate fairness criterion (e.g., the choice of $p$ in $p$ -means or the selection of weights in Gini welfare) in advance. Optimizing a single policy for one fairness parameter can lead to poor outcomes under different criteria (see the approximation qualities of random $p$ baselines in our experiments). Our method computes a small, representative set of policies covering the entire spectrum of fairness criteria. This allows decision-makers to select confidently from this set, without worrying about suboptimality under other potential choices of $p$

3. While theoretically sound ... limit the scalability to large MDPs.

The referenced scenario ( $\alpha = 1.0$ ) means that the computed portfolio achieves perfect optimality across the entire range of $p$ . Our algorithm provides a flexible trade-off: selecting smaller $\alpha$ will reduce the oracle calls, thereby enhancing scalability to large MDPs while still maintaining approximation guarantees. For example, in the same experiment, the portfolio constructed with 7 oracle calls achieves 0.982 optimality.

Furthermore, we'd like to respectfully highlight that to the best of our knowledge, our work is the first to explicitly consider the number of oracle calls required for constructing portfolios in MORL. Hence, we believe explicitly counting and reporting oracle complexity (along with a theoretical bound) represents a strength of our work rather than a weakness.

4. The RL algorithm used, hyperparameters, and the experimental settings (ESR vs SER).

We thank you for this comment. We respectfully note that the experimental details pointed out can be found in Section 5.2. For example, we used Welfare Q-Learning (Fan et al., 2022) for the taxi environment. For the other two environments, the feasible policy set $\Pi$ consists of finite pre-computed policies, which are selected according to the procedure detailed in Appendix B.1. Consequently, for these environments, Problem (3) is solved by simply enumerating the policies in $\Pi$ without needing additional RL algorithms. The finite $\Pi$ setting is discussed in the beginning of Section 4. We have also made the code to replicate our results available in the anonymous link (including hyperparameters) .

5. Comparisons to other MORL methods. How does this work differ from methods like Envelope [1], GPI [2], or PCN [3]

[1] focuses on weighted linear objectives and learns a single policy, whereas we focus on $p$ -means and compute a portfolio (multiple policies). [2] builds portfolios focusing on weighted linear objectives and does not offer guarantees on portfolio size or oracle complexity. [3] trains a single policy to approximate the Pareto front, which does not directly correspond to any social welfare function. To the best of our knowledge, our work is the first to (1) propose a multi-policy approach in $p$ -means and (2) provide theoretical guarantees on portfolio size, approximation quality, and oracle complexity simultaneously.

We conducted preliminary experiments comparing our method to a version of GPI[2]. The results indicate that our approach achieves significantly better approximation quality. For example, the approximation quality of the GPI portfolio can be as low as 35% of that achieved by our portfolio with the same size. Please refer to Table 1 in the anonymous link. We have also provided a theoretical explanation in Lemma 1, which shows why the weighted linear combinations that GPI focuses on lead to suboptimal results from the $p-$ means perspective.

6. Bound on the heuristic algorithm's approximation factor

The best bound we have on the heuristic algorithm's approximation factor is $\kappa$ (See Lemma 2 in the link). However, this is a generic bound that holds for all portfolios.

7. Do the approximation guarantees differ between SER and ESR?

Our theoretical guarantees hold uniformly for both ESR ( $v^{(1)}$ ) and SER ( $v^{(2)}$ ).

审稿意见

评分: 32025-03-16

In this paper, the authors proposed a novel setting in optimizing social welfare while considering different preferences. The authors proposed an algorithm which can trade off among approximation factor, portfolio size, and computation efficiency. The proposed p-MEANPORTFOLIO algorithm chooses several p-value, and train corresponding policies.

给作者的问题

In the Introduction section, authors mentioned that the major challenge is p-value can change. But in previous literature, if the model is perfectly known, and the social welfare maximization problem is totally tractable, is there still a need to use MORL?

In the Introduction, the authors mentioned "Small changes in p can sometimes lead to dramatically different policies, further complicating the selection process". What does selection process refer to? Can we directly model or optimize the policies given different p?

A related but also practical setting to this work is that each user may have its own utility or welfare definition, and they are making decisions independently. In this way, a multi-agent setting shall be considered with different objectives. Can the authors comment on this setting?

论据与证据

In real-world applications, decision-makers should understand how policies change as the value p varies, and make informed decisions about which policy to deploy. The authors reflected this observation in their algorithm design.

方法与评估标准

Yes, the methods are properly evaluated in the simulation section with several environments.

理论论述

Yes, I checked the major proofs of Proposition and Theorems.

实验设计与分析

Yes, I checked the simulation results.

补充材料

Yes,

与现有文献的关系

The proposed method in this paper is an interesting addition to multi-objective Reinforcement Learning and practical settings of social welfare setups.

遗漏的重要参考文献

其他优缺点

Strengths:

-The paper comes up with a novel setting for connecting MORL with social welfare maximization problem.

-Simulation on several interesting social domains validate the effectiveness of the proposed algorithm.

-Theoretical results support the usage and effectiveness of line search and warm start.

Weaknesses

It is unclear why there is a need to find \alpha-approximation for p-value. Why not directly applying some previous MORL algorithms?

Other social welfare functions are not discussed, and it is also unclear how the algorithm performs for other social welfare functions.

To select the policies, the authors assumed the utility can always be evaluated. However, in many practical settings, it is not possible to query the environment and get all the possible return under different p-value.

There lacks comparisons to neither model-based optimization approaches nor other MORL algorithms.

其他意见或建议

Please see above.

作者回复

2025-04-01

We thank you for the insightful comments! Additional results are available in the anonymous link here

1. Why there is a need to find $\alpha$ -approximation for p-value. Why not use previous MORL algorithms?

We thank you for this important comment. Our work addresses the following limitations of previous MORL methods.

Current work on approximating the Pareto frontier scales exponentially with the number of reward functions. Furthermore, to the best of our knowledge, there are no known theoretical guarantees about the portfolio size, approximation factors or oracle complexity.
Approaches that explicitly incorporate a fairness perspective via social welfare functions (instead of Pareto front) assume that the right social welfare function is known and fixed. This overlooks the challenge that the ideal choice of the fairness parameter (e.g., $p$ value) is often unclear in advance. Selecting a policy optimized for an arbitrary $p$ can lead to poor outcomes under a different $p$ value.
Moreover, the connections between the Pareto frontier (as considered in the RL literature) and how this maps to social welfare functions is largely unclear. Please see (Hayes et al., 2022) cited in our paper for this discussion.

Our work addresses each of these limitations as follows:

We provide theoretical guarantees that polynomially bound both portfolio size and the number of oracle calls, and we empirically demonstrate this.
We ensure strong fairness guarantees with respect to $p$ -means, a widely used class of social welfare functions.
Our approach computes a small yet representative set of policies that covers the entire spectrum of fairness criteria represented by $p \in [-\infty, 1]$ . The decision-maker can confidently select from this concise portfolio, without worrying about suboptimality under other potential choices of $p$ .

2. Comparison to other optimization or MORL algorithms.

We conducted preliminary experiments comparing our method to a version of a well-known MORL algorithm, Generalized Policy Improvement (GPI), which builds portfolios by considering weighted linear combinations of objectives for multiple weights. The results indicate that our approach achieves significantly better approximation quality. For example, the approximation quality of the GPI portfolio can be as low as 35% of that achieved by our portfolio with the same size. Please refer to Table 1 in the anonymous link. We have also provided a theoretical explanation in Lemma 1 in the anonymous link, which shows why the weighted linear combinations that GPI focuses on lead to suboptimal results from the $p-$ means perspective.

3. If the model is perfectly known ... is there still a need to use MORL?

Indeed, if the exact choice of the social welfare function (i.e., the exact value of $p$ ) were already known and fixed, computing a single optimal policy would suffice, and there would be no need for constructing a portfolio. As noted above, we address a fundamentally different scenario, where decision-makers are uncertain about the appropriate $p$ in advance.

4. The authors mentioned ``Small changes in p can sometimes ... Can we directly model or optimize the policies given different p?

The selection process refers to the challenge a decision-maker faces when trying to choose an appropriate social welfare function (parameterized by $p$ ) and its corresponding optimal policy in advance.

As suggested, we could incorporate $p$ as an additional input to the policy. While this is an appealing direction, this policy again assumes that the decision-maker knows the desired $p$ at deployment, making it challenging to apply in our setting where the right choice of $p$ is unclear.

5. Extension to a multi-agent setting.

We thank you for this interesting idea. Our work explicitly assumes a centralized decision-maker who selects a policy to balance the outcomes across multiple stakeholders. Thus, addressing multi-agent scenarios falls beyond our current scope. Nevertheless, we agree that extending our ideas to such settings is an exciting future direction.

6. Other social welfare functions are not discussed.

We thank you for this comment. Although our approach addresses $p$ -means only, we emphasize respectfully that this represents a rich and theoretically justified class of social welfare functions (see page 1, right column, lines 23–33 for our discussion).

7. To select the policies, the authors assumed the utility can ... under different p-value.

In our setting, each stakeholder $i$ 's utility is defined as the sum of rewards along a trajectory, $\sum_{h=1}^{H} R_i(s_h, a_h)$ , which is a standard assumption in MORL. For a fixed $p$ , Problem (3) can be solved using existing RL methods to find a welfare-maximizing policy (see also Section 3.2, lines 190–195).

审稿人评论

2025-04-05

Thanks the authors for addressing my review comments.

最终决定Accept (poster)

2025-05-01

This paper addresses the problem of constructing policy portfolios for multi-objective reinforcement learning to approximate generalized p-means social welfare objectives. The reviewers generally agreed that the problem is clearly motivated and relevant, especially in scenarios where decision-makers face uncertainty in stakeholder preferences.

Reviewers appreciated the clear theoretical framework, rigorous proofs, and practical algorithms (p-MeanPortfolio and BudgetConstrainedPortfolio) proposed. The experiments effectively demonstrate the trade-offs in portfolio size, computational cost, and approximation quality in synthetic and realistic scenarios. The clarity of writing and presentation were also positively noted. The primary concerns revolved around somewhat limited baseline comparisons, especially with respect to Pareto front approximation methods. Reviewers suggested additional discussion on how the method would scale to larger or more complex scenarios. Initially, there was also some confusion regarding algorithmic details. But these concerns were clarified in the authors' rebuttal.

Overall, while the paper could benefit from expanded comparisons and clearer discussions of scalability, the contributions are solid, and the results convincing. It represents a sound contribution to the multi-objective reinforcement learning literature. I recommend acceptance, provided the authors incorporate the reviewers' feedback in their final revision.