PaperHub
Score: 6.6/10
Poster, 4 reviewers (ratings: 3, 4, 4, 3; min 3, max 4, std 0.5)
ICML 2025

Preference Controllable Reinforcement Learning with Advanced Multi-Objective Optimization

Submitted: 2025-01-24 | Updated: 2025-07-24
TL;DR

A novel Multi-Objective Reinforcement Learning (MORL) method that discovers more Pareto-optimal solutions than most previous MORL methods.

Abstract

Keywords
Multi-Objective Optimization, Multi-Objective Reinforcement Learning

Reviews and Discussion

Review
Rating: 3

The paper introduces a novel framework, Preference Controllable Reinforcement Learning (PCRL), which trains a single, preference-conditioned policy capable of generating Pareto optimal solutions according to user-specified trade-offs. The approach leverages advanced multi-objective optimization (MOO) techniques and proposes a new update method, PreCo, that incorporates a similarity function to align the policy’s performance with a given preference. The paper supports its claims with comprehensive theoretical analysis—providing convergence guarantees—and demonstrates the method’s effectiveness through experiments in environments such as Fruit-Tree and MO-Reacher, showing improved hypervolume and cosine similarity metrics compared to traditional linear scalarization (LS) and several existing MOO methods.

Questions for the Authors

  1. Can the authors provide ablation studies on the similarity weight $\lambda$ to illustrate its impact on both convergence speed and final control performance?

  2. For continuous action spaces, does the PreCo update method require any modifications? Could the authors elaborate on potential challenges and necessary adjustments for continuous control environments?

  3. Although the paper suggests that the method could be applied to real robotic tasks, are there any plans for experimental validation on actual robotic platforms? What practical issues might arise during such a transition?

  4. The current experiments involve up to six objectives. How does the proposed method scale when the number of objectives increases further, and are there anticipated bottlenecks in terms of computational complexity or performance?

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes

Theoretical Claims

The theoretical analysis, while rigorous, makes some idealized assumptions that may not fully capture the noise and uncertainties present in real-world scenarios, thereby leaving questions about the practical applicability of the method.

Experimental Design and Analysis

There is a lack of ablation studies or sensitivity analysis regarding key hyperparameters—such as the similarity weight $\lambda$ in the PreCo update—which raises concerns about the robustness and practical tuning of the approach.

Supplementary Material

Yes.

Relation to Prior Work

The proposed method represents an important contribution to the field of multi-objective reinforcement learning.

Missing Essential References

No.

Other Strengths and Weaknesses

Strengths:

  1. The PCRL framework addresses the well-known limitation of LS methods by enabling more extensive exploration of the Pareto front and precise control over policy outputs in non-strictly convex settings.
  2. The proposed PreCo update method is innovative; by integrating a similarity gradient to balance conflicting multi-objective gradients, it provides a clear theoretical advancement, as evidenced by detailed convergence proofs (e.g., Theorems 4.1–4.4).
  3. The experimental evaluation is extensive, covering both discrete action environments (Fruit-Tree) and robotic control tasks (MO-Reacher), with state coverage visualizations that effectively illustrate how the method produces diverse, preference-specific policies.
  4. The authors also consider computational efficiency by solving the min-norm problem at the policy level rather than at the parameter level, which is beneficial for scalability in large models.

Weaknesses:

  1. The set of baseline methods used in the experiments is somewhat limited; some of the compared methods are relatively dated, which narrows the scope of the comparative analysis.

  2. Experiments are primarily conducted in discrete action spaces. Although the authors mention potential applications to continuous control and real robots, there is insufficient experimental evidence to support the method’s effectiveness in these more challenging settings.

Other Comments or Suggestions

No

Author Response

Dear reviewer,

We sincerely thank you for your valuable review and constructive feedback.

We address your concerns and answer your questions below:


Q1. Can the authors provide ablation studies on the similarity weight $\lambda$ to illustrate its impact on both convergence speed and final control performance?

A1: The table below demonstrates the robustness of our method to different values of $\lambda$:

| $\lambda$           | 5          | 10         | 50         | 100        | 300        | 550        |
|---------------------|------------|------------|------------|------------|------------|------------|
| Hypervolume (×1e3)  | 13.83±1.74 | 13.98±2.11 | 14.74±1.36 | 15.02±1.7  | 15.21±0.79 | 15.61±0.75 |
| Cosine Similarity   | 0.76±0.03  | 0.76±0.04  | 0.77±0.02  | 0.77±0.03  | 0.79±0.02  | 0.78±0.03  |

This table illustrates how different upper bounds of $\lambda$ affect PreCo's performance in the FruitTree environment. While smaller values of $\lambda$ may slightly change the performance, PreCo remains consistently superior to existing MORL methods. Notably, the best-performing baseline, PDMORL, achieves a hypervolume of only 9.3±0.08 ×1e3, demonstrating PreCo's promising advantage.


Q2. For continuous action spaces, does the PreCo update method require any modifications? Could the authors elaborate on potential challenges and necessary adjustments for continuous control environments?

A2. This has been discussed in Appendix E, with experiments in continuous action space presented in Appendix C.

In continuous action spaces, the "policy-level" gradient corresponds to the gradients of (1) action samples (Eq. 26) or (2) the parameters of the Gaussian action distribution (Eq. 24), rather than the gradients of action probabilities. As illustrated in Figure 9 and Algorithm 2, our approach first determines a common descent direction for all objectives by solving the min-norm problem (6) using their gradients (Eq. 24 or 26). The policy model is then updated to fit the new actions along this direction.
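To make the policy-level min-norm step concrete, here is a minimal NumPy sketch of how a common ascent direction could be obtained from a stack of policy-level gradients. This is our own illustration of the general idea behind problem (6), not the authors' implementation; the projected-gradient solver, the function names, and the toy gradients are all assumptions.

```python
import numpy as np

def project_to_simplex(w):
    """Euclidean projection onto the probability simplex (sorting-based)."""
    u = np.sort(w)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(w) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(w - theta, 0.0)

def min_norm_direction(G, n_iters=200, lr=0.1):
    """G has shape (m, d): one policy-level gradient per objective.
    Minimize ||G^T w||^2 over the simplex and return (w, combined direction)."""
    m = G.shape[0]
    w = np.full(m, 1.0 / m)
    gram = G @ G.T                     # (m, m); the inner solve never touches d again
    for _ in range(n_iters):
        w = project_to_simplex(w - lr * 2.0 * gram @ w)
    return w, G.T @ w

# Toy usage: two partially conflicting gradients in a 3-dimensional policy space.
G = np.array([[1.0, 0.0, 0.5],
              [-0.8, 1.0, 0.5]])
w, d = min_norm_direction(G)
print(w, G @ d)  # all entries of G @ d >= 0: d improves every objective
```

Note that the gradients here are policy-level quantities (per state-action outputs), so their dimension is tied to the batch rather than to the number of model parameters, which is the memory point elaborated in A4 below.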

The experiments in Appendix C focus on high-dimensional continuous control tasks with simple, strictly convex objectives. While the performance advantage there is smaller than in tasks with non-strictly convex or numerous objectives, our method still outperforms previous MORL approaches.


Q3. Although the paper suggests that the method could be applied to real robotic tasks, are there any plans for experimental validation on actual robotic platforms? What practical issues might arise during such a transition?

A3: Real-world applications, such as robotics, are indeed an exciting direction. In fact, we have already been applying our method to large language models and have achieved promising results. One of our future plans is to leverage preference-controllable language models for high-level robotic planning and human-robot interaction.

For low-level robotic control, the challenges include common sim-to-real issues such as sample efficiency, dynamics mismatch, and safety concerns. One critical requirement is ensuring that the robot does not exhibit unpredictable behavior when conditioned on previously unseen preferences.


Q4. The current experiments involve up to six objectives. How does the proposed method scale when the number of objectives increases further, and are there anticipated bottlenecks in terms of computational complexity or performance?

A4: MOO algorithms with conflict-avoidance ability, including our proposed PreCo, require solving the min-norm problem (6) to determine a weight vector $w$, whose dimension corresponds to the number of objectives. When the number of objectives becomes very large, the computational cost of solving (6) can increase significantly.

To mitigate this, instead of explicitly solving (6) at every training step (Algorithm 2), we can update $w$ using the gradient of (6) once per training step (Algorithm 1). Our theoretical results justify the convergence of this novel approach, making it a more scalable alternative in practice.
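To illustrate this cheaper alternative, the following hedged sketch (our own, with hypothetical names; it reuses project_to_simplex and a gradient matrix G as in the sketch under A2) refines $w$ with a single projected-gradient step per training step rather than solving (6) to convergence.

```python
# Reuses project_to_simplex from the earlier sketch; G stacks the current
# per-objective policy-level gradients, shape (m, d).
def amortized_weight_update(w, G, lr=0.05):
    """One projected-gradient step on ||G^T w||^2 (problem (6)) per training
    step; w is carried over and refined across steps instead of re-solved."""
    grad_w = 2.0 * (G @ (G.T @ w))   # matrix-vector products only, O(m * d)
    return project_to_simplex(w - lr * grad_w)
```

The per-step cost is a couple of matrix-vector products, so it scales linearly with the number of objectives.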


We hope these clarifications address your concerns. Let us know if you have any further questions.

Reviewer Comment

Thanks for the response that addresses my concerns

Author Comment

We sincerely appreciate your thoughtful feedback and are glad that our response addressed your concerns.

If our clarifications have resolved your doubts, we would be grateful if you could reconsider your rating. Your comments helped us to strengthen our empirical results, and your support for our work could further drive advancements in this research direction.

Thank you for your time and support!

Review
Rating: 4

The paper proposes a novel approach to learning the ϵ-Pareto efficient frontier in multi-objective optimization using standard reinforcement learning algorithms, specifically TD3 and PPO. The authors introduce a method where preferences are sampled uniformly, and a similarity function between the preference and value function is used as feedback to guide the RL algorithm. Theoretical guarantees are provided regarding convergence to ϵ-optimality. Empirical results are presented across multiple environments to demonstrate the efficacy of the proposed approach. The paper also discusses the potential application of the method in real-world scenarios, particularly in multi-agent reinforcement learning settings.

Questions for the Authors

Justification for Similarity Function (Eq. 8): What justifies the use of this specific similarity function over alternatives? How sensitive are the results to the choice of similarity function?

Non-Uniform Preference Sampling: How would non-uniform sampling of preferences affect the learned policy? Can the method be adapted to handle such cases?

Non-Transient Preferences: Can the method extend to non-transient preferences? Is consistency in preferences a requirement for convergence?

Economic Benchmarks: Have the authors considered benchmarking the method on economic settings, such as maximizing profit versus customer satisfaction? This would provide a more nuanced evaluation of the method's applicability.

Claims and Evidence

The claims made in the paper are generally supported by empirical evidence and theoretical analysis. The authors provide convergence guarantees for their method, which are backed by theoretical proofs. The empirical results are extensive and demonstrate the effectiveness of the proposed approach across various environments. However, some claims could benefit from further clarification or additional evidence:

The claim that the method achieves ϵ-Pareto efficiency is supported by theoretical analysis, but the empirical results (e.g., Figure 4b and 4d) do not clearly show that the solutions lie on the Pareto front. This discrepancy should be addressed.

The justification for the choice of similarity function in Eq. (8) is not thoroughly explained. A more detailed discussion or comparison with alternative similarity functions would strengthen this claim.

Methods and Evaluation Criteria

The proposed methods are appropriate for the problem of learning the ϵ-Pareto efficient frontier in multi-objective optimization. The use of TD3 and PPO as base algorithms is reasonable, given their widespread success in RL. The evaluation criteria, such as hypervolume optimization, are standard in multi-objective optimization literature and align well with the goals of the paper. However, the following points could be improved:

The paper could benefit from a more detailed discussion of why hypervolume was chosen as the primary metric and how it relates to the ϵ-Pareto efficiency goal.

The uniform sampling of preferences is a simplifying assumption. The authors should discuss how non-uniform preference sampling might affect the results and whether the method can be adapted to handle such cases.

Theoretical Claims

The theoretical claims regarding convergence to ϵ-optimality are a key contribution of the paper. The proofs appear to be correct, but the theoretical analysis could be deepened. For instance:

The learning rates for the RL algorithms are mentioned, but a more detailed discussion of their impact on convergence would be valuable.

The paper could explore the theoretical implications of non-transient preferences or non-linear preference transformations, as these scenarios are common in real-world applications.

Experimental Design and Analysis

The experimental design is sound, with results presented across multiple environments to demonstrate the robustness of the proposed method. However, there are a few areas for improvement:

In Figure 4b and 4d, the solutions do not clearly lie on the Pareto front, which raises questions about the empirical validity of the ϵ-Pareto efficiency claim. The authors should address this discrepancy.

The experiments focus on synthetic environments. Including real-world benchmarks, such as economic settings (e.g., profit vs. customer satisfaction), would strengthen the practical relevance of the paper.

Supplementary Material

The supplementary material was reviewed and provides additional details on the theoretical proofs and experimental setups. However, it could be expanded to include additional empirical results, particularly for non-uniform preference sampling or non-transient preferences.

Relation to Prior Work

The paper builds on prior work in multi-objective optimization and reinforcement learning. The key contribution—learning the ϵ-Pareto efficient frontier using RL—is novel and addresses an important gap in the literature. However, the paper could better situate itself within the broader context by discussing how the proposed method compares to other approaches for Pareto front learning, such as evolutionary algorithms or gradient-based methods.

Missing Essential References

The paper could benefit from citing and discussing the following works:

Recent advances in non-linear preference transformations for multi-objective optimization.

Methods for handling non-uniform preference sampling in RL.

Applications of Pareto optimization in economic settings, which would provide a more nuanced benchmark for the proposed method.

Other Strengths and Weaknesses

Strengths:

The paper is well-written and addresses an important problem in multi-agent reinforcement learning.

The empirical results are extensive and demonstrate the potential of the proposed method in applied settings.

The theoretical guarantees provide a solid foundation for the proposed approach.

Weaknesses:

The delta from previous RL algorithms is not clearly articulated. The reliance on TD3 and PPO without significant modification raises questions about the novelty of the method.

The empirical results do not consistently demonstrate ϵ-Pareto efficiency, particularly in Figure 4b and 4d.

The paper could benefit from a more thorough discussion of the broader implications and limitations of the proposed method.

Other Comments or Suggestions

Move the discussion of related works to the introduction for better flow.

Clarify whether the objectives are truly "conflicting" or simply involve trade-offs (Line 138).

Move Algorithm 2 to the main paper, as it represents the actual algorithm used, while Algorithm 1 is more illustrative.

Author Response

Dear reviewer,

We sincerely thank you for your valuable review and constructive feedback.

We address your concerns and answer your questions below:


C1: About Figure 4b and 4d: The empirical results do not consistently ... in Figure 4b and 4d.

R1: Figures 4b and 4d are 2D projections of the 3D results presented in Figure 4a. While some points may appear dominated in the 2D projections, all points in the figures are actually non-dominated when all three objectives are taken into account.

We hope this clarification helps address your concern regarding the empirical results.


C2: The delta from previous RL algorithms is not clearly ... the novelty of the method.

R2: TD3 and PPO are representative single-objective RL algorithms used as lower-level backbones in our approach. More details on how they are integrated can be found in Appendix E.

The key novelties of our work are as follows:

  1. Conceptual: We propose a preference control framework (PCRL) that overcomes the limitations of traditional linear scalarization methods, which can only discover a limited set of Pareto-efficient solutions.

  2. Methodological: We integrate modern MOO algorithms into MORL, enabling better handling of conflicting and stochastic gradients—challenges that previous MORL methods largely overlooked. Additionally, we design a novel PreCo algorithm to leverage these strengths while ensuring preference alignment.

  3. Technical: Beyond theoretical analyses, we propose a memory-efficient approach for policy-level computation (see Figure 9, Algorithm 2, and Appendix E), making our method scalable for large models used in practical applications.


Q1: Justification for Similarity Function (Eq. 8): ... choice of similarity function?

A1: PreCo's theoretical properties require the similarity function to:

  1. Be Lipschitz smooth.
  2. Have a gradient $g_s$ that is a positive linear combination of the objective gradients $g_{1:m}$ ($m$ is the number of objectives).

The similarity defined in Definition 4.1 satisfies these conditions. Using a different similarity function could disrupt the convergence guarantees of our algorithm. In practice, conventional cosine similarity could work in some cases, but it lacks theoretical guarantees.


Q2: Non-Uniform Preference Sampling: ... handle such cases?

A2: Non-uniform preference sampling may cause certain preferences to be learned better than others or lead to overfitting. The impact depends on the model's generalizability, so training with a uniform distribution is preferable without prior knowledge of test preferences.

However, as noted in Appendix A.3, a progressive curriculum for preference distributions could accelerate learning. For instance, the agent could start with a small, diverse set of preferences and gradually expand to neighboring ones, progressively uncovering the full Pareto front. Further exploration of this is an interesting direction for future work.


Q3: Non-Transient Preferences: ... convergence?

A3: If we understand correctly, non-transient preferences refer to preferences that are consistently sampled during training. While this could lead to overfitting to specific preferences, a well-designed curriculum that gradually introduces new preference distributions can mitigate this issue and promote exploration of the full Pareto front (as discussed in Appendix A2).

In fact, the training scheme in Algorithm 2 (Appendix E.1) can be viewed as a form of meta-learning, where the outer loop samples different preferences and the inner loop optimizes for a specific sampled preference. While convergence is always guaranteed within the inner loop, non-stationary preference distributions may affect the overall meta-learning process, potentially leading to overfitting. However, a carefully designed curriculum with progressively shifting preference distributions, even if non-stationary, can accelerate learning and improve the agent’s performance.

Theoretical analysis of such a curriculum would be an interesting direction for future work.


Q4: Economic Benchmarks: ... the method's applicability.

A4: Economic benchmarks are indeed an interesting direction for future work. We will add more discussion on economic applications in the related work section.

A key technical contribution of our method is to solve the min-norm problem (6) at the policy level (details in Figure 9 and Appendix E), removing the need to store multiple parameter gradients for different objectives. This results in minimal additional memory cost, making it highly suitable for training large-scale practical models. In fact, we are already applying our method to multi-objective preference learning for large language models with promising results, suggesting its potential applicability to practical economic settings as well.


Thanks again, and please let us know if there are further questions.

Review
Rating: 4

This paper proposes PCRL for preference control of multi-objective trade-offs, incorporating recent MOO algorithms into MORL. Convergence analysis is provided to show that the approach can learn preference-specific Pareto optimal solutions and handles stochastic gradients as well.

Questions for the Authors

Theorem 4.2's assumption is that g' is a convex combination of g_i. Is this practical for realistic MORL settings?

What exactly does a Multi-Objective Optimization (MOO) algorithm refer to here? Are there any choices other than the MOO methods mentioned in this work?

"However, the gradients of cosine similarity and the original objectives can conflict, as it does not leverage conflict-avoidance techniques from MOO algorithms." Are there any evidence showing this claim, rather than just looking at the final performance?

Can the authors comment on the computational complexity of implementing the Jacobian matrix and similarity gradient, especially compared to plain policy gradient algorithms?

Update after rebuttal

Thanks for the authors' efforts in rebuttal. My evaluation has been updated.

Claims and Evidence

Yes, the paper is showing evidence in both theoretical proofs and simulation results.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

Yes, the theorems in the main paper make sense, though I have not checked the full proofs.

Experimental Design and Analysis

Yes.

Supplementary Material

Yes, the simulation setup and part of the proof.

Relation to Prior Work

This paper proposes a new technique to scale up reinforcement learning algorithms to multi-objective setting.

Missing Essential References

No.

Other Strengths and Weaknesses

The paper proposes a novel approach based on multi-objective optimization. The method is clearly described, and the simulation results especially on higher number of objectives and conflicting objectives validate the model design.

However, the computational complexity is unclear, and the method seems to require more iterations and to keep a large volume of gradient information. It is also unclear how the stepsize choice for each preference would affect performance. Moreover, the method seems hard to transfer to a new preference vector, as the algorithm needs to rerun the whole optimization. In addition, more explanation of MGDA is required, as it is not explained in this paper.

Other Comments or Suggestions

Please see below for questions.

Author Response

Dear reviewer,

We sincerely thank you for your time and valuable review.


Response to your questions

Below are our answers to your questions

Q1. About Theorem 4.2's assumption

A1: Theorem 4.2 considers the case where $\lambda$ increases indefinitely. In practice, Theorem 4.1 already guarantees PreCo's convergence as long as $\lambda$ has an upper bound, as noted in Remark 4.1.

In Appendix F.2, we provide some hints for constructing such a similarity function $\Psi'(p,\cdot)$ as follows:

  • When $v$ gets close to the preference $p$: its similarity gradient $\nabla_v\Psi'(p,v)$ should approach the convex preference weight $p$;
  • When $v$ is NOT close to the preference $p$: $\nabla_v\Psi'(p,v)$ should be close to the convex coefficient $\nabla_v\Psi(p,v)/\|\nabla_v\Psi(p,v)\|_1$, where $\Psi$ is our original similarity defined in Definition 4.1.

Q2. About MOO and choices other than the MOO

A2: MOO refers to algorithms specifically designed for deep learning with multiple objectives, such as MGDA [1], CAGrad [2], PCGrad [3], and SDMGrad [4]. They are known to find a conflict-avoidance direction that yields common improvement across all objectives.

Additionally, evolutionary algorithms[5] have been developed for multi-objective problems, but they tend to be less scalable and less sample-efficient compared to gradient-based MOO approaches.

Q3. Evidence for gradient conflict

A3: Consider a two-objective case where the values are $[v_1,v_2]=[10.0,1.5]$ and the preference is $p=[0.9,0.1]$, and let $g_1,g_2$ denote the gradients of objectives 1 and 2, respectively.

The gradient of cosine similarity is $0.1483\,g_1 - 0.9889\,g_2$, which is nearly opposite to $g_2$, creating a direct conflict with objective 2's gradient.

To resolve this, we need a conflict-avoidance mechanism that optimizes similarity without negatively impacting objective 2. Intuitively, the goal is to find a direction that improves both $v_1$ and $v_2$ while prioritizing $v_1$, so that the value vector gets closer to the preference $p$. This is precisely what our proposed PreCo algorithm achieves.
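As a quick numerical sanity check on those coefficients (our own verification script, not part of the paper), the normalized gradient of the cosine similarity with respect to $v$ can be computed directly; its components are the coefficients that multiply $g_1$ and $g_2$ via the chain rule.

```python
import numpy as np

p = np.array([0.9, 0.1])
v = np.array([10.0, 1.5])

# d/dv of cos(p, v) = p / (|p||v|) - (p.v) v / (|p| |v|^3)
cos = p @ v / (np.linalg.norm(p) * np.linalg.norm(v))
grad_v = p / (np.linalg.norm(p) * np.linalg.norm(v)) - cos * v / (v @ v)
print(grad_v / np.linalg.norm(grad_v))  # ≈ [ 0.148, -0.989], matching the text
```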

Q4. Computation complexity of implementing Jacobian matrix and similarity gradient

A4: Solving the min-norm problem (6) introduces additional computation to determine a common ascent direction, improving sample efficiency at the cost of extra computation.

However, memory consumption remains comparable to plain policy gradient algorithms:

  • When solving problem (6), the objective gradients $\nabla_\pi \mathbf{v}^{\pi}$ and the similarity gradient $\nabla_\pi \Psi(\mathbf{p},\mathbf{v}^{\pi})$ are stored, but their size is at most batch size $\times$ number of objectives, which is typically much smaller than the parameter size. Once (6) is solved, we obtain a batch-sized update direction $d^*$ for $\pi$, allowing $\nabla_\pi \mathbf{v}^{\pi}$ and $\nabla_\pi \Psi(\mathbf{p},\mathbf{v}^{\pi})$ to be released.

  • After solving (6), we compute $\nabla_\theta^T\pi\, d^*$ to update the model parameters. The Jacobian $\nabla_\theta^T\pi$ is not explicitly computed or stored. Instead, we take the gradient of ${d^*}^T \pi$ to obtain the product. In a formulation similar to the policy gradient, it corresponds to $E_{s,a}[\nabla_\theta \log\pi(a|s)\, d^*(s,a)\,\pi_{old}(a|s)]$, as detailed in Appendix E.

Therefore, memory usage at any given time does not exceed that of linear scalarization methods, which compute the gradient $\nabla_\theta^T\pi\, \nabla_\pi \mathbf{p}^T\mathbf{v}^{\pi}$. In policy gradient form, this is expressed as $E_{s,a}[\nabla_\theta \log\pi(a|s)\, \mathbf{p}^T\mathbf{q}^{\pi}(s,a)]$.
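The Jacobian-free computation described above is essentially a vector-Jacobian product, which autograd frameworks provide directly by backpropagating through a scalar. A minimal PyTorch sketch of the idea (illustrative names and shapes; not the authors' code):

```python
import torch

def parameter_update_direction(policy_out, d_star, params):
    """policy_out: policy outputs for a batch, differentiable w.r.t. params;
    d_star: the batch-sized direction from problem (6), treated as a constant.
    Returns J^T d* (J = d policy_out / d params) without materializing J."""
    scalar = (d_star.detach() * policy_out).sum()
    return torch.autograd.grad(scalar, params)   # tuple of per-parameter grads
```

Only the scalar and one backward pass are kept in memory, which matches the footprint of a single scalarized policy-gradient update.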


Addressing concerns about weaknesses

However, it is unclear about the computation complexity... large volume of gradient information.

As mentioned in A4, keeping a large volume of gradient information is unnecessary, and the memory consumption remains comparable to traditional methods.

It is unclear ... to transfer to new preference vector, as the algorithm needs to re-implement the whole optimization again.

As mentioned in lines 143–144 of Section 3 and Algorithm 2 in Appendix E, preferences for training are uniformly sampled, and the stepsize for each preference sample remains constant. This allows a single agent to learn in a "meta-learning" fashion and to generalize to unseen preferences, as demonstrated in our experiments (Section 5 and Appendix D), where we test the agent on preferences it has not encountered during training.
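For a concrete picture of this "meta-learning" loop, the only preference-specific ingredient is the uniform sampling step; everything else is the usual RL update conditioned on the sampled preference. A tiny self-contained sketch of that sampling (our illustration, not Algorithm 2):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_preference(n_objectives):
    """Uniform sample from the probability simplex (Dirichlet with all-ones)."""
    return rng.dirichlet(np.ones(n_objectives))

# Outer loop: draw a fresh preference each iteration; the inner RL/PreCo update
# (PPO or TD3 conditioned on p) is omitted here.
for step in range(3):
    p = sample_preference(6)            # e.g., a 6-objective setting like FruitTree
    print(step, p.round(3), p.sum())    # each p is nonnegative and sums to 1
```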


We hope these clarifications address your concerns. Please let us know if you have any further questions.

References

[1] Désidéri, MGDA, 2009

[2] Liu et al. Conflict-averse gradient descent for multi-task learning, 2021

[3] Yu et al. Gradient Surgery for Multi-Task Learning, 2020

[4] Xiao et al. Direction-oriented multi-objective learning: Simple and provable stochastic algorithms, 2023

[5] Tan et al. Evolutionary Algorithms for Solving Multi-Objective Problems, 2007

Reviewer Comment

The reviewer appreciates the authors' further explanation.

Review
Rating: 3

This paper addresses the limited controllability and coverage of Pareto-optimal solutions in Multi-Objective Reinforcement Learning (MORL), where existing methods based on linear scalarization might struggle to align with user-defined trade-offs and fail to explore the full Pareto front. To overcome these limitations, the authors propose Preference Controllable Reinforcement Learning (PCRL), a framework that trains a preference-conditioned meta-policy to generate solutions based on user-specified preferences. They further introduce PREference COntrol (PreCo), a novel algorithm tailored for PCRL, providing theoretical guarantees on convergence and preference-controlled Pareto-stationary solutions. By integrating advanced Multi-Objective Optimization (MOO) methods, PCRL enhances controllability in MORL and proves particularly effective in environments with highly conflicting objectives, improving the diversity and utility of Pareto-optimal solutions.

Questions for the Authors

  1. Based on the theoretical results in Theorems 4.1-4.4, it appears that PreCo can only achieve Pareto stationarity and near-stationary points of the similarity function. Accordingly, the results do not fully match the claims in the introduction. Can the authors comment on this?

  2. How is the conditional policy utilized in PreCo (Algorithm 1)? Is there any knowledge sharing across preferences in PreCo? How are the different preferences handled during training?

  3. Building on 1., if PreCo needs to handle each preference separately, how does PreCo handle the critical sample-efficiency issue in MORL?

  4. How many environment steps are used by each algorithm in the experiment?

  5. How is the similarity function chosen in PreCo? While I can understand that it is maximized when $p$ and $v$ are fully aligned, it would be good to explain the nice properties of the chosen function in more detail.

Claims and Evidence

Regarding the claims made in the paper, below are the positive and the negative parts:

Strengths

  • This paper addresses the inherent controllability issue in many existing MORL methods that are meant to only find a Pareto-stationary point for some arbitrary / uncontrollable preference. By using PreCo, one can choose to find the corresponding Pareto-stationary point for a specific preference of interest.
  • The proposed approach can be viewed as a generalized version of linear scalarization (i.e., linear scalarization is a special case with $\Psi(p,v)=p^\top v$). This is an interesting and reasonable extension.
  • Motivated by the MOO literature, this paper extends the min-norm problem in the classic MGDA to the case of similarity function for preference control.

Weaknesses

  • One main concern is that the proposed algorithm does not really solve the target PCRL problem considered by this paper. Specifically, on page 2, right column, it is mentioned that the goal is to learn a preference-conditional policy to achieve a Pareto-optimal value that maximizes the similarity for any preference $p$. However, based on Theorems 4.1-4.4, it appears that PreCo can only achieve Pareto stationarity (Theorems 4.1-4.2) and near-stationary points of the similarity function (Theorem 4.3-4.4). Hence, the theoretical results do not fully match the claims in the introduction.

  • Another concern is that I do not see why the conditional policy is needed in PreCo. In Algorithm 1, the (projected) gradient updates (Lines 4-5) are done only for one given preference $p$ (Line 1), and there seems nothing to do with varying preferences. My guess is that there shall be some meta-level method that determines how to switch between different preferences during training, like many other single-policy-network MORL methods. Otherwise, one would need to have a full run of PreCo for each individual preference, and this can be extremely sample-inefficient (this issue is also relevant to the issues of the experiments described below). However, this part appears not described at all in the paper.

Methods and Evaluation Criteria

  • Regarding the methods, one major concern is that technically this paper is more like an MOO paper instead of an MORL work. Based on the formulation and the PreCo algorithm, it appears that the vector-valued objective function $\mathbf{v}$ can be replaced by any objective function and does not necessarily need to be the total expected reward. Indeed, with the unbiasedness assumption (Assumption 4.2) and regularity conditions (Assumptions 4.1 and 4.3), the analysis in this paper completely gets rid of the inherent properties of MORL and thereby can simply follow the standard analysis of the MOO literature. With that said, in my view, this makes the analysis in this paper not that interesting. Moreover, it can even degrade the performance as the inherent properties of MORL are completely ignored.

  • The choice of the similarity function (Definition 4.1) needs to be further justified. I can understand that $\Psi(p,v)$ is maximized when $p$ and $v$ are parallel. However, it is not clear how the similarity function would contribute to the overall policy update when $p$ and $v$ are far from being parallel, especially if the number of objectives is large.

Theoretical Claims

I have checked the proof in Appendix I. I can see that the proofs can go through based on the existing MOO literature, but there are some issues to be fixed. Specifically, all the expectations in (51)-(54) need to be conditional. Otherwise, $\pi_{p,t}$ is a random variable and (51)-(54) would not hold. Similarly, the authors need to check the correctness of Eqs. (56)-(61) and (68)-(70).

Experimental Design and Analysis

  • Evaluation domains: The domains used in this paper appear quite standard in the MORL literature. Both discrete and continuous control tasks in the MO-Gymnasium are considered in the evaluation.
  • Metrics: Hypervolume (HV) is a fairly standard MORL metric. As this paper focuses on the similarity, the authors also take the cosine similarity (CS) into account.
  • Performance: PreCo appears to achieve the best HV and CS in both Fruit Tree and the MO MuJoCo tasks.

That being said, I do have several concerns:

  • About the sample efficiency: As mentioned in the “Claims And Evidence,” it remains unclear to me how PreCo handles different preferences during training (either each preference is treated separately or there is indeed some knowledge sharing across preferences in the implementation). Therefore, it is not clear why PreCo can have much better performance than the MORL benchmark methods (like PDMORL and CAPQL).

  • About the number of environment steps used by each algorithm: Based on the above, one possibility is that different algorithms actually use different numbers of environment steps. If this is the case, then the comparison is actually not fair. With that said, I checked the experimental details in the appendix, but I did not find anything specifically mentioned about the environment steps. Please correct me if I missed anything.

  • About the missing baselines: There are several missing recent baselines that are known to be strong in finding the Pareto front in the context of MORL, such as Envelope Q-learning (and Envelope DQN) [1], Conditioned Network [2], Q-Pensieve [3], and PCN [4].

[1] Yang et al., “A generalized algorithm for multi-objective reinforcement learning and policy adaptation,” NeurIPS 2019.

[2] Abels et al., “Dynamic weights in multi-objective deep reinforcement learning,” ICML 2019.

[3] Hung et al., “Q-Pensieve: Boosting sample efficiency of multi-objective RL through memory sharing of Q-snapshots,” ICLR 2023.

[4] Reymond et al., “Pareto Conditioned Networks,” AAMAS 2022.

Supplementary Material

I have checked the experimental details and the proofs in the appendix.

Relation to Prior Work

The paper contributes to MORL by addressing the challenge of preference-controllable policy learning, which is not well studied in the existing MORL literature. Prior works, such as LS-based MORL methods, primarily optimize a scalarized objective but fail to capture a diverse set of Pareto-optimal solutions, limiting their ability to align with user preferences. PreCo incorporates preference-awareness into the learning process by conditioning policy updates on user-specified trade-offs. The lack of controllability in existing methods limits their practical applicability, making the research direction of PCRL necessary.

Missing Essential References

N/A

Other Strengths and Weaknesses

Strengths

  • The main strength is that the paper points out the inherent controllability issue in many existing MORL methods that are meant to only find a Pareto-stationary point for some arbitrary / uncontrollable preference. By using the proposed PreCo, one can choose to find the corresponding Pareto-stationary point for a specific preference of interest. Moreover, PreCo can be viewed as a generalized version of linear scalarization (i.e., linear scalarization is a special case with $\Psi(p,v)=p^\top v$), and this is a nice extension of the LS-based methods.

Weaknesses

  • As mentioned above, the proposed algorithm does not seem to fully solve the PCRL problem. Notably, the goal is to learn a preference-conditional policy to achieve a Pareto-optimal value that maximizes the similarity for any preference $p$. However, based on Theorems 4.1-4.4, it appears that PreCo can only achieve Pareto stationarity (Theorems 4.1-4.2) and near-stationary points of the similarity function (Theorem 4.3-4.4). Hence, the theoretical results do not fully match the claims in the introduction.
  • The clarity of the proposed method can be improved. Specifically, the use of the conditional policy in PreCo, how the different preferences are handled, and the chosen similarity function need to be further explained and justified.

Other Comments or Suggestions

  • Eq. (53)-(54): The notation $C_2$ is overloaded.
  • Eq. (56): The right bracket at the end shall be moved to the left.
Author Response

Dear reviewer,

We sincerely thank you for your time and detailed review. We hope the responses below address your concerns.


Answers To Your Questions


A1: In practical deep RL/MORL, first-order gradient-based algorithms are the most widely used, and stationarity is the strongest guarantee [1] they can achieve in practice. Hence, prominent MOO algorithms [2,3,4] establish only Pareto stationarity. Compared to existing works, ours is one of the few having provable convergence under noisy gradients.


A2: We only need to train a single policy to handle different preferences as input conditions. During training, preferences are sampled uniformly without using prior knowledge. These are explained in lines 143–145 of Section 3 and Algorithm 2 in Appendix E. This standard approach in conditional MORL [5,6] ensures shared parameters and knowledge across all preferences.


A3: As noted in A2, preferences are uniformly sampled during training in a meta-learning fashion, applied consistently across all methods.

Our high-level objective design makes it a generally applicable framework for all RL backbones (on-policy and off-policy). Its contribution is orthogonal to lower-level improvements in sample efficiency. As discussed in Appendix A.1, when adapted to off-policy methods, it can integrate techniques like Hindsight Experience Replay (HER) to enhance sample efficiency.


A4: In the MuJoCo environments (Ant, Hopper, Reacher), all methods use 3e6 environment steps.

For the FruitTree environment, all methods except PDMORL use 3.6e5 steps, while PDMORL follows its original implementation with 1e6 steps.


A5: PreCo's theoretical properties require the similarity function to:

  1. be Lipschitz smooth;
  2. have a gradient $g_s$ that is a positive linear combination of the objective gradients $g_{1:m}$.

The similarity in Definition 4.1 meets these criteria, and its magnitude does not affect PreCo's properties. Intuitively, it encourages certain objectives to improve more, so the achieved trade-offs get closer to the pre-defined preference. See Appendix F.1 and Figure 10 for further insight.


Concerns About Methods

To address your concerns, we summarize a few key points:

Generalizability: We focus on an MORL framework design that is generally applicable to all on- and off-policy RL methods. In contrast, techniques like HER that exploit inherent RL properties are limited to off-policy methods.

Novelty & Contributions:

  1. Conceptual: We formulate PCRL for any-preference alignment, overcoming the limitations of prior MORL methods using Linear Scalarization (LS), which has no alignment guarantee.

  2. Methodological: We integrate modern MOO algorithms into MORL to handle conflicting and stochastic gradients—key aspects previously overlooked. Specifically for PCRL, we design PreCo to inherit these strengths while promoting preference alignment.

  3. Technical: PreCo’s similarity design requires independent proofs (App. I.1) and significant adaptations (App. I.2-3), as no prior method incorporates such similarity. Moreover, as noted in lines 1670-1672, our analysis is more rigorous than previous literature.

Practicality: Our assumptions align with standard RL/MORL settings: most RL methods (e.g. policy gradients, DQN loss) are unbiased (Ass.2), and their values typically change with a certain level of smoothness (Ass.1).


About Theoretical Correctness

As noted in Assumption 2, the expectation here is over the gradient noise $\xi$, and Assumptions 1-3 ensure unbiased, bounded gradients for all $\pi_{p,t}$ samples. Thus, the bounds in (51)-(54), (56)-(61), and (68)-(70) apply to all $\pi_{p,t}$ samples and remain valid under expectation over $\pi_{p,t}$.

We appreciate the detailed feedback, but treating (51)–(54) as conditional expectations does not impact the proof's correctness.


About Experiments

A2-A4 have answered questions about handling different preferences and environmental steps. Here, we address the concerns about the baselines.

Envelope Q-learning (EQL), Conditioned Network (CN), and Q-Pensieve (QP) all optimize a linear scalarization (LS) objective and inherit its limitations. The best-performing QP achieves a 6859.94 hypervolume for 6D FruitTree, identical to the best cases of our implemented general LS baseline. This is expected, as QP is simply a more sample-efficient version of LS.

Since we already include an LS baseline and CAPQL, which achieves a higher upper bound than LS methods, comparing against EQL, CN, or QP is unnecessary.

PCN is a heuristic method relying on model generalizability. Its comparison to other methods is limited. It only accepts a unique input condition, making it less relevant to our study.


Please let us know if you have further questions.

[1] Kushner 1978 Stochastic

[2] Sener 2018 MGDA

[3] Liu 2021 CAGrad

[4] Xiao 2023 SDMGrad

[5] Liu 2023 CAPQL

[6] Yang 2019 Envelope Q

Reviewer Comment

Thank the authors for the response. Some of my questions have been nicely addressed in A2-A5.

A1: Thanks for the clarification. I understand that stationarity is the main convergence property shown for gradient-based methods in general optimization (and that’s why MOO algorithms establish only Pareto stationarity). However, that does not necessarily mean that stationarity is the only thing that one can look for in RL. See [1-2] and references therein.

The response appears to also echo the original review comment "..., one major concern is that technically this paper is more like an MOO paper instead of an MORL work. Based on the formulation and the PreCo algorithm, it appears that the vector-valued objective function $\mathbf{v}^{\pi}$ can be replaced by any objective function and does not necessarily need to be the total expected reward. Indeed, with the unbiasedness assumption (Assumption 4.2) and regularity conditions (Assumptions 4.1 and 4.3), the analysis in this paper completely gets rid of the inherent properties of MORL and thereby can simply follow the standard analysis of the MOO literature." Please let me know if I missed anything.

[1] Bai et al., “Joint Optimization of Concave Scalarized Multi-Objective Reinforcement Learning with Policy Gradient Based Algorithm,” JAIR 2022.

[2] Zhou et al., “Anchor-Changing Regularized Natural Policy Gradient for Multi-Objective Reinforcement Learning,” NeurIPS 2022.

Sample efficiency: Thanks for the clarification on how PreCo handles different preferences. However, my concern on the sample efficiency of PreCo still remains (also stated in the original comments in Experimental Designs Or Analyses). Given that PreCo does not introduce any specific design for improving sample efficiency and also uses the conventional uniform preference sampling, it is not totally clear why PreCo can achieve a higher HV than the benchmark MORL methods (like PD-MORL, CAPQL, and GPI-LS) on both Fruit-Tree and MuJoCo tasks, given that PD-MORL, CAPQL, and GPI-LS are known to be quite strong in terms of sample efficiency.

Experiments: Thanks for providing the additional results. However, Q-Pensieve (QP) was originally designed for continuous control tasks, but FruitTree is by default a discrete control task. Then, how is QP adapted to FruitTree (e.g., by discretizing the actions)? Accordingly, comparing PreCo with QP directly on continuous control tasks (e.g., MO-Hopper, MO-Ant) would be more fair.

----- Edit ----- Thank the authors for the follow-up response. Most of my concerns have been alleviated to some extent. I have updated my score accordingly. I would encourage the authors to include these additional discussions to improve the clarity of the paper.

Author Comment

Thanks for your time and the further questions. We try to address your remaining concerns below:


Regarding A1

We appreciate your constructive comments and the references ([1][2]).

Inspired by your feedback, we now understand your emphasis on more RL-specific properties. Proposition 3 in [3] shows that in MORL, the Pareto front is convex (but not necessarily strictly convex), ensuring that Pareto stationary points are always Pareto optimal. We will formalize this in the revisions to strengthen the RL-specific theoretical guarantees.

Furthermore, [1] and [2] are not first-order methods but natural policy gradient (quasi-second-order) approaches that:

  • Are more computationally intensive
  • Require stricter assumptions, including unbiased Fisher information matrix estimation, concavity of the scalarization function, Lipschitz smoothness, and bounded gradients

Although, as mentioned, stationary solutions are optimal for MORL, whether this optimality is achieved does not diminish our contribution, since most existing MORL methods, such as EQL, QP, and GPI-LS, lack global optimality guarantees when applied to continuous spaces using DDPG or SAC.

In 'Concerns About Methods,' we tried to clarify the motivation and contribution by formulating MORL as a general MOO problem and leveraging recent MOO advancements.


Regarding concerns about sample efficiency

Validity of empirical results

Sample efficiency is not the sole factor influencing performance. Our main argument is that the LS objective has inherent limitations, regardless of the sample efficiency of LS methods. As shown in Figure 1(b) or Figure 4.9 of [7], LS can only discover a limited set of optimal policies when the Pareto front is not strictly convex. This explains why the best LS method, QP, achieves a similar hypervolume to our LS baseline (without modifications for better sample efficiency), as both are constrained by the LS objective.
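To make the LS limitation concrete, here is a tiny self-contained illustration of the well-known fact referenced above (our own toy numbers, not from the paper): when the Pareto front has a concave (non-convex) region, a weighted sum never selects the middle Pareto-optimal point, no matter which preference weights are used.

```python
import numpy as np

# Three Pareto-optimal value vectors; the middle one sits in a concave region
# of the front: neither endpoint dominates it, but it lies below the segment
# joining the two endpoints.
front = np.array([[1.0, 0.0],
                  [0.4, 0.4],
                  [0.0, 1.0]])

for p1 in np.linspace(0.0, 1.0, 101):
    p = np.array([p1, 1.0 - p1])
    assert np.argmax(front @ p) != 1    # LS never picks [0.4, 0.4]

print("Linear scalarization never recovers the concave-region solution.")
```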

Our results are reasonable. In the FruitTree environment, with non-strictly convex objectives and low dimensionality, sample efficiency is not the performance bottleneck, leading to a noticeable performance gap between PreCo and LS methods. In contrast, in MuJoCo tasks, where objectives are strictly convex and sample efficiency is more crucial, the performance gap is smaller.

Sample efficiency of proposed approach

MOO algorithms benefit from finding conflict-avoidant directions, which also improves sample efficiency. The toy example in Figure 11 and Appendix G illustrates how MOO algorithms like PreCo, MGDA, and SDMGrad converge in all cases, while GD (Figure 11(e)), which optimizes the weighted sum of objectives (the same as LS), converges more slowly in some cases with more conflicting objective gradients.

This advantage is general and applies to both on- and off-policy RL backbones with our proposed PCRL. However, sample-efficiency techniques like HER [5] (for EQL, PD-MORL, QP) and Q-snapshots [6] are exclusive to off-policy Q-value-based methods.


Regarding QP Experiments

For discrete control tasks, Q-Pensieve (QP) is implemented by adding Q-snapshots to the Q-envelope method from the MORL benchmark [4]. The greedy action for preference $p$ is given by $\arg\max_a \sup_{p',Q'} p^T Q'(s,\cdot;p')$, where $Q'$ is the Q-value sampled from the Q-snapshots and $p'$ is the sampled preference used in the Q-envelope operation. Despite these enhancements, QP remains an LS method that aims to maximize the weighted sum of values according to $p$.

As discussed, the LS objective inherently constrains performance, regardless of an algorithm's optimization efficiency. Our results align with the best LS results from the benchmark [4].

Environments with non-strictly convex Pareto fronts better highlight the unique advantages of our method, which is to overcome the limitations of existing LS methods. FruitTree, for example, illustrates the fundamental limitations of LS methods, as analyzed in Section 5.1 and shown in Figure 4.


Once again, we thank you for your time and engagement. The discussion has been constructive. Please let us know if you have any remaining concerns.


References

[1] Bai et al., “Joint Optimization of Concave Scalarized Multi-Objective Reinforcement Learning with Policy Gradient Based Algorithm,” JAIR 2022

[2] Zhou et al., “Anchor-Changing Regularized Natural Policy Gradient for Multi-Objective Reinforcement Learning,” NeurIPS 2022

[3] Lu et al, "Multi-Objective Reinforcement Learning: Convexity, Stationarity and Pareto Optimality" ICLR 2023

[4] Felten et al. "A Toolkit for Reliable Benchmarking and Research in Multi-Objective Reinforcement Learning" NeurIPS 2023

[5] Andrychowicz et al, "Hindsight Experience Replay" 2017

[6] Hung et al. "Q-Pensieve: Boosting Sample Efficiency of Multi-Objective RL Through Memory Sharing of Q-Snapshots" ICLR 2023

[7] Boyd et al. "Convex Optimization" 2004

Final Decision

This paper introduces a novel framework called Preference Controllable Reinforcement Learning (PCRL), designed to train a single, preference-conditioned policy capable of generating Pareto-optimal solutions according to user-specified trade-offs. In particular, it uses standard reinforcement learning algorithms (TD3 and PPO) to identify an ϵ-Pareto efficient frontier in multi-objective optimization (MOO). The paper also introduces PREference COntrol (PreCo), an algorithm that provides theoretical guarantees on convergence to preference-controlled Pareto-stationary solutions. The theoretical results also support the authors' claim that their method can identify such solutions while being robust to stochastic gradients. Empirical results were reported across multiple environments.

All reviewers agree that this paper offers a valuable contribution to the MOO and MORL communities. They noted that the key claims made in this work are supported by both empirical evidence and theoretical analyses, and highlighted that some of the theoretical guarantees introduced by the authors—particularly those relating to convergence to ϵ-optimality—are a key contribution of this paper. All reviewers agreed that the empirical results are extensive, properly designed, and convincing.

A few concerns were raised. One reviewer questioned whether the proposed method truly solves the PCRL problem as defined, and whether the intended contributions were aimed at the MOO community or at multi-objective RL. The authors provided a detailed response, and the reviewer confirmed that most of their concerns had been adequately addressed. Another reviewer raised questions about the computational complexity of the method, along with several clarification questions. The authors addressed those points in their rebuttal, and the reviewer confirmed that the rebuttal had properly addressed their questions. A third reviewer had two main concerns: first, that some of the empirical results (particularly Figures 4b and 4d) appeared inconsistent with the authors' claim of achieving ϵ-Pareto efficiency. The authors explained that the figures were 2D projections of 3D results and clarified that while some points may appear dominated, they are in fact non-dominated when all objectives are considered. Secondly, the same reviewer felt that the paper did not clearly articulate the method's technical contributions and improvements over existing approaches. The authors prepared a detailed response, and the reviewer confirmed that their main concerns had been addressed. Lastly, a fourth reviewer argued that additional ablation studies and sensitivity analyses were needed to support some of the authors' main claims. In response, the authors ran a new set of experiments and included the results, which confirmed and supported their initial findings and claims. The reviewer confirmed that their concerns had been resolved.

Overall, reviewers agree that this paper offers valuable insights that will benefit the ICML community. They noted that the concerns raised during the discussion phase were either addressed or were not critical and could be resolved in a revised version. The reviewers encouraged the authors to incorporate the feedback from the reviews and discussion and argued that doing so would strengthen the overall quality and impact of the work.