PaperHub
Rating: 6.1/10 · Spotlight · 4 reviewers
Scores: 3, 3, 3, 4 (min 3, max 4, std 0.4)
ICML 2025

Efficient First-Order Optimization on the Pareto Set for Multi-Objective Learning under Preference Guidance

OpenReview · PDF
Submitted: 2025-01-19 · Updated: 2025-07-24
TL;DR

We cast the preference-guided multi-objective learning problem as optimization on the Pareto set, and propose a first-order penalty approach to solve it.

Abstract

Keywords

multi-objective optimization · optimization on the Pareto set · semivectorial bilevel optimization

Reviews & Discussion

Official Review
Rating: 3

This paper considers the problem of preference-guided multi-objective optimization. The authors first formulate it as a semivectorial bilevel optimization problem, which optimizes the upper-level preference objective, subject to the constraint that the model parameters are weakly Pareto optimal or Pareto stationary for the lower-level multi-objective optimization problem. The authors further propose an algorithm to solve this semivectorial bilevel optimization problem, which first converts the lower-level constraint to a single-objective constraint through a merit function, then solves the transformed problem by a penalty-based reformulation. Theoretically, the authors analyze the relations of solutions for different formulations and the convergence of the proposed method. Empirical results on various synthetic and real-world problems demonstrate the effectiveness of the proposed method.

Questions For Authors

Please see the points of weakness mentioned in the Claims And Evidence and Methods And Evaluation Criteria parts, as well as the possible violation of the paper format template.

Claims And Evidence

There are two key claims in this work:

  • The preference-guided multi-objective optimization problem can be solved through a semivectorial bilevel optimization problem
  • The proposed method FOOPS can effectively solve the obtained semivectorial bilevel optimization problem

While the second claim is supported by sound theoretical analysis and empirical results, the first claim does not seem sufficiently supported from my perspective. I am a bit confused about how the original formulation for preference-guided multi-objective optimization:

\min_x F(x), \quad \text{s.t.} \quad G(x) \le 0, \; H(x) = 0

can be re-formulated as

\min_x f_0(x), \quad \text{s.t.} \quad x \in \arg\min F(x)

I suppose F(x) remains the same, but how are G(x) and H(x) related to f_0(x)? Some more explanation is necessary here. Without such explanation, it is even unclear how existing preference-guided multi-objective optimization methods are implemented and whether they are solving the same problem.

Methods And Evaluation Criteria

While the proposed method is easy to understand and the evaluation is sufficient, I am curious whether the proposed method can be combined with other bi-level optimization methods, as is also mentioned in Appendix A.1. Specifically, after converting the lower-level constraint to a single-objective constraint, I suppose the converted problem can also be solved by bi-level optimization methods other than the penalty-based formulation? Some empirical comparisons could be useful here.

Theoretical Claims

The theoretical analysis in this work is sound and interesting. I have checked the proofs in Appendices B-E and found them clearly organized and easy to understand, without significant errors.

Experimental Designs Or Analyses

Generally, the experiments are sufficient, with sound analysis.

Supplementary Material

This paper does not have supplementary material.

Relation To Broader Scientific Literature

This paper proposes a novel formulation of preference-guided multi-objective optimization as well as a novel method (based on smoothed merit functions and existing gradient-based bi-level optimization methods) to solve the optimization problem under this formulation.

Essential References Not Discussed

The references are generally complete; I do not have any works to strongly recommend including.

Other Strengths And Weaknesses

All mentioned in previous parts

Other Comments Or Suggestions

The authors seem to have modified margins in some places, e.g., lines 110-116 and lines 432-439. This should be strictly forbidden, and some reasonable explanation is necessary here.

Author Response

Thanks for acknowledging that we propose a novel formulation and an easy-to-understand novel method for preference-guided multi-objective optimization, that our proofs are sound and interesting, and that the experiments are sufficient with sound analysis.

Below we address your concerns point by point. The link to additional results is https://anonymous.4open.science/r/FOOPS-F746/ICML3045_rebuttal.pdf

Claims And Evidence: "The preference-guided multi-objective optimization problem can be solved through a semivectorial bilevel optimization problem" is unclear... How are G(x) and H(x) related to f_0(x)? Are they solving the same problem?

Yes, F(x) remains the same. The relation of G, H to f_0 is that f_0 can be chosen as f_0(x) = ||H(x)||^2 + ||[G(x)]_+||^2, which is minimized exactly when H(x) = 0 and G(x) \leq 0. In our experiments, we show the example of equality-constrained problems without G and with f_0(x) = ||H(x)||^2. See Section 6, and Eq. (16) for the speech experiment.
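For concreteness, here is a minimal sketch of this construction in PyTorch. The function name and the use of torch are our own illustration, not the paper's code; H and G are assumed to be differentiable callables returning the constraint-residual tensors.

```python
import torch

def preference_objective(H, G, x):
    """Sketch of f_0(x) = ||H(x)||^2 + ||[G(x)]_+||^2 (names illustrative).

    The first term vanishes iff the equality preferences H(x) = 0 hold;
    the hinge [G(x)]_+ vanishes iff the inequalities G(x) <= 0 hold, so
    f_0 is minimized exactly at preference-satisfying points.
    """
    eq = H(x)                            # equality residuals
    ineq = torch.clamp(G(x), min=0.0)    # [G(x)]_+ hinge on inequalities
    return eq.pow(2).sum() + ineq.pow(2).sum()
```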

The two formulations are not mathematically equivalent, but both can be applied to preference-guided multi-objective learning, with different emphasis on either satisfying the preference or achieving weak Pareto optimality. As discussed in Sections 3.3 and 6 of our paper, the constrained formulation in, e.g., PMTL and FERERO, with preference modeled by constraints G and H, puts more emphasis on satisfying the preference, while the FOOPS formulation, with preference modeled by f_0, puts more emphasis on achieving weak Pareto optimality.

Methods And Evaluation Criteria: While the proposed method is easy to understand and the evaluation is sufficient, I am curious whether the proposed method can be combined with other bi-level optimization methods, as is also mentioned in Appendix A.1. Specifically, after converting the lower-level constraint to a single-objective constraint, I suppose the converted problem can also be solved by bi-level optimization methods other than the penalty-based formulation? Some empirical comparisons could be useful here.

  • Other bilevel methods. Thanks for the suggestion. Indeed, the penalty method used in this paper is not the only way to solve a bilevel problem. However, the existing methods listed in Appendix A.1, Table 5 require different assumptions that our converted problem cannot satisfy, so the methods in Table 5 cannot be directly applied. We do, however, provide a discussion of how some other methods could be applied under certain additional assumptions. For example, the recent concurrent AGILS method (Bai et al., 2024) can be applied when the merit function v_{l,\tau} satisfies an HEB with \eta \geq 1.

  • Empirical comparisons. Since other existing bilevel methods require different assumptions, they cannot be directly applied to our converted problem. Further investigation is needed to check whether these methods still work under weaker assumptions. The concurrent AGILS method (Bai et al., 2024) may be applied to specific problems satisfying the assumptions therein, which we leave for future work. We provide a comparison to other OPS methods such as PNG in Figure A and Table B in the link.

Other Comments Or Suggestions: The authors seem to have modified margins in some places, e.g., lines 110-116 and lines 432-439. This should be strictly forbidden and some reasonable explanation is necessary here.

Thanks for spotting this. We would like to clarify that we did not intentionally or explicitly alter the margins. Instead, we used the {\small } environment in LaTeX to reduce the size of Equations (5) and (16) so they would better fit the space. This inadvertently caused the line spacing for lines 110-116 and 432-439, immediately before the equations, to appear smaller, despite our attempt to limit the {\small } environment to the equations themselves. In response to the reviewer’s feedback, we will modify the paper to remove this unintended spacing issue.
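As an aside on the mechanism: TeX applies the \baselineskip in force when a paragraph is broken into lines, so a {\small ...} group opened mid-paragraph can shrink the spacing of the lines before a display. A usual remedy is to end the paragraph before opening the group and again before closing it; a sketch (the equation body is a placeholder, not the paper's actual Eq. (5)):

```latex
... end of the preceding paragraph.\par % end the paragraph at normal size
{\small
\begin{equation}
  f_0(x) = \|H(x)\|^2   % placeholder body; the real Eq. (5) goes here
\end{equation}
\par} % end the small-size paragraph inside the group
Text after the equation then continues at normal size.
```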


We hope our rebuttal resolves the reviewer's concerns and the reviewer can reconsider the rating of our paper. Thanks!

Reviewer Comment

I would like to first thank the authors for their detailed reply. Most of my previous concerns are addressed, and the authors are encouraged to add the additional clarifications in the revised version, as well as to fix the formatting issue. I have also increased my score.

Author Comment

Thank you very much for acknowledging our response, engaging in the discussion, and updating your score. Yes, we will incorporate the promised revisions and fix the formatting issue.

Sincerely, authors.

Official Review
Rating: 3

This paper studies multi-objective optimization with user-specified preferences. The authors formulate the problem as a bilevel optimization problem, where the upper level is a preference function and the lower-level problem is the minimization of a smoothed version of a merit function. The merit function serves as an objective whose minimizers form a set of weak Pareto optimal points. The authors provide a comprehensive analysis of the proposed problem. In particular, they first establish some properties of the smoothed merit function under the assumption that each f_m is quasar-convex. Then, they consider a penalized reformulation of the original bilevel problem and characterize the equivalence between the two problems in terms of global or local solutions, based on assumptions of global subanalyticity, a Hölderian error bound, and the KL inequality. Experiments validate the effectiveness of the proposed method.

Questions For Authors

See the strengths and weaknesses.

Claims And Evidence

Yes.

Methods And Evaluation Criteria

A running time comparison and confidence intervals could be provided.

Theoretical Claims

I checked some proofs and did not find major problems.

Experimental Designs Or Analyses

A running time comparison and confidence intervals could be provided, along with more details.

Supplementary Material

Yes, mainly the theoretical proofs. But I did not go through all details.

Relation To Broader Scientific Literature

The idea of using bilevel optimization for preference-based multi-objective optimization is new to the literature.

Essential References Not Discussed

Yes, it includes the most important works on preference-based MOO. However, more related works on differentiable MOO in ML could be discussed, for example PCGrad, CAGrad, NashMTL, MoCo, and SDMGrad; CAGrad and SDMGrad also introduce preference-based regularization or constraints.

Other Strengths And Weaknesses

Strengths:

  1. The studied problem is very important in multi-objective optimization, because we often want to find a preferred point on the Pareto front. The bilevel optimization perspective seems to be new.

  2. The authors analyze the smoothed merit function and the equivalence between the bilevel problem and the penalized problem under some assumptions. Experiments seem to support that the proposed method can achieve higher accuracy.

Weakness:

  1. The paper is not well written. The analysis makes multiple assumptions: some theorems require convexity-like assumptions, some need the HEB assumption, and later a KL inequality is needed to guarantee the penalty function is less than \epsilon. I suggest explicitly stating all assumptions before the theorems.

  2. The motivation for smoothing the merit function is not very clear to me. I understand that the original merit function could be non-differentiable, but this may not be an issue in real-world problems. Can the authors validate this in the experiments to see if smoothing is necessary?

  3. Another issue with this smoothing is that it introduces two more hyperparameters, \tau and l. Can the authors explain how they select these hyperparameters in practice? Ablation studies should be provided.

  4. The assumptions could be quite strong. The point strong quasar-convexity assumption is hard to satisfy in practical problems. Although the authors mention some examples such as linear models with leaky ReLU, for general setups it may be hard to validate this assumption. In addition, the subanalyticity and KL inequality assumptions further make the analysis less applicable to practical cases. It would be great if the authors could provide some justification that the problems in the experiments satisfy these assumptions (it is OK if only partially).

  5. In Lemma 3.4, what does it mean for X_v^* \cap X_C to be globally subanalytic? Assumption 2 is made on the function rather than the set. Perhaps a definition of subanalyticity should be provided.

  6. In Theorem 3.5, what is the definition of an (\epsilon,\delta)-global solution for (CP)?

  7. The algorithm is complex, containing multiple hyperparameters like \tau, l, \gamma_t, K_t, \alpha_t, \beta_t, as well as two projections in steps 5 and 8. Compared to previous methods like EPO and FERERO, this is less appealing.

  8. The experimental results are not convincing. In Table 4, the proposed method is very close to FERERO-FT, so results over multiple seeds with confidence intervals should be provided. A running time comparison could also be provided to further justify the efficiency compared to FERERO.

其他意见或建议

See the strengths and weaknesses.

Author Response

Thanks for acknowledging that the problem is important and the bilevel perspective is new. We would like to emphasize that the bilevel problem in this paper, with a non-convex vector-valued lower-level objective, is much more challenging and nontrivial, as pointed out by Reviewer XkmK.

Below we address your concerns. The link to additional results is https://anonymous.4open.science/r/FOOPS-F746/ICML3045_rebuttal.pdf

W1. Writing & assumptions not clear... Explicitly point out all assumptions before the theorems.

It might be a misunderstanding that the HEB and KL conditions are assumed. Instead of directly assuming them, we prove them based on the properties of F. In our theorems, we list all the required assumptions at the beginning and then discuss the other conditions that can be proved. We will clarify this.

W2. Motivation of smoothing the merit function... Validate this in the experiments?

Smoothing is commonly used for nonsmooth optimization problems, and its motivation is well studied in prior works, e.g., [R1] and references therein. [R1] shows that smoothing is necessary for finite-time convergence on nonconvex nonsmooth problems for any algorithm. Therefore, we use LSE to smooth the nonsmooth max-min merit function \bar{u} to ensure finite-time convergence, which is widely used in prior works, e.g., [R2]. In our experiments, we find that without smoothing, the method can return NaN (not a number) errors and the algorithm can diverge (a minimal illustration follows the references below).

[R1] Deterministic Nonsmooth Nonconvex Optimization. M.I. Jordan, et. al. COLT 2023.

[R2] An alternative softmax operator for reinforcement learning. Kavosh Asadi, et. al. ICML 2017.
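To make the LSE idea concrete, here is a generic sketch of log-sum-exp smoothing of a max. This is our illustration of the general technique, not the paper's exact merit function v_{l,\tau}:

```python
import torch

def smooth_max(values, tau):
    """Generic LSE smoothing of max: tau * log(sum_m exp(v_m / tau)).

    As tau -> 0 this approaches max(values) from above, with a gap of at
    most tau * log(M) for M entries (the tau * ln M term appearing in the
    paper's bounds). torch.logsumexp is numerically stable; a naive
    exp-then-sum overflows for large v_m / tau, producing the NaN
    failures mentioned above.
    """
    return tau * torch.logsumexp(values / tau, dim=-1)

v = torch.tensor([1.0, 2.0, 3.0])
print(smooth_max(v, tau=0.1))  # ~3.0000, within 0.1 * log(3) of max(v) = 3
```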

W3. Smoothing introduces hyperparameters \tau and l. How to select them? Provide ablation studies.

We choose \tau and l to be small while ensuring differentiability, and use grid search to tune them. See more details in Appendix F. We provide an ablation study in Table C in the link.

W4. The assumptions could be strong... The subanalyticity & KL assumptions make the analysis less applicable... Justification that problems in the experiments satisfy these assumptions (OK if partially)?

Note that compared to existing works on OPS (Table 2) and bilevel optimization (Table 5), which usually require convexity or PL assumptions, our assumptions are actually much weaker. The subanalyticity and KL conditions are weaker than PL; see more discussion in Appendix C, with examples justifying the assumptions.

W5. In Lemma 3.4, meaning of global subanalyticity?... A definition should be provided.

We have mentioned in the main paper (lines 204-216) that the definition of (global) subanalyticity, covering both subanalytic functions and sets, is provided in Appendix C, Definitions C.1 and C.2.

W6. In Theorem 3.5, what is the definition of an (\epsilon,\delta)-global solution for (CP)?

An (\epsilon,\delta)-global solution for (CP) is an x \in {\cal X}_{\delta} = \{x \in {\cal X} \mid v_{l,\tau}(x) + \tau \ln M \leq \delta\} such that f_0(x) - \min_{x' \in {\cal X}_{\delta}} f_0(x') \leq \epsilon. This is a widely used definition in constrained optimization.

W7. The algorithm is complex, containing multiple hyperparameters like \tau, l, \gamma_t, K_t, \alpha_t, \beta_t, as well as two projections in steps 5 and 8. Compared to previous methods like EPO and FERERO, this is less appealing.

This might be a misunderstanding. The algorithm is general: it can be applied both when {\cal X} is compact and when {\cal X} = R^q. When {\cal X} = R^q, the two projections are not needed, which is the case in our experiments. K_t can be chosen to be small, like 1 or 2. As a comparison, FERERO also requires choosing hyperparameters such as \alpha_t, \gamma_t, K, c_g, c_h, so the number of hyperparameters is similar. They do not make FOOPS more complex or inefficient.

We do provide running times in Appendix F, Table 10 in the paper, and in Table A in the link, showing that FOOPS is comparable to or more efficient than FERERO.

W8 & Methods & Evaluation & Experiments: Experiments are not convincing. In Table 4, it is very close to FERERO-FT... confidence interval... A running time comparison could be provided to further justify the efficiency compared to FERERO.

We respectfully disagree. We show in Table 3 that FOOPS achieves much better hypervolume compared to FERERO and other baselines. In Table 4, FOOPS achieves average performance comparable to FERERO-FT. So we believe our results demonstrate that FOOPS is effective, as acknowledged by all other reviewers. This experiment takes a much longer time; we will run additional seeds and add confidence intervals in the revision.

A running time comparison is given in Appendix F, Table 10 in the paper, and in Table A in the linked PDF. The results show FOOPS can be faster than FERERO.

References: More works on differentiable MOO could be discussed.

Thanks, we will include a discussion in the revision.


We hope we have addressed your concerns and you can reconsider the rating of our paper. Thanks!

Reviewer Comment

Apologies for the late response. I thank the authors for the detailed response. My concerns have been resolved, and I increase my score accordingly. I highly suggest the authors add the related works and the discussion I mentioned in the final revision.

Best, Reviewer

Author Comment

Thank you very much for acknowledging our response, engaging in the discussion, and updating your score. Yes, we will incorporate the promised revisions and other related works.

Sincerely, authors.

Official Review
Rating: 3

In this work, the authors frame preference-guided multi-objective learning as an optimization problem on the Pareto set and propose a first-order penalty method to address it, where the penalty function is a polynomial of a smoothed merit function. They begin by establishing key properties of the merit function, including its connection to weak Pareto optimality and the Hölderian error bound. Next, they examine the relationship between solutions of the penalty reformulation and those of the original problem. Finally, they present algorithms for solving the penalty problem and analyze their convergence guarantees.

Questions For Authors

See Theoretical Claims, Experimental Designs Or Analyses, and Other Strengths And Weaknesses.

Claims And Evidence

I find it unclear how the authors incorporate Preference Guidance in this paper. If Preference Guidance were replaced with another objective function, the overall formulation would remain largely unchanged, raising questions about its specific role and impact.

Methods And Evaluation Criteria

I believe there is still a missing component in the problem reformulation. The authors only establish the connection between (CP) and (PP); it would be helpful if they could also provide insight into the relationship between (1) and (CP).

Theoretical Claims

In Proposition 3.2, where \epsilon = \tau \ln M, \epsilon can become quite large as M increases. Would it be better to choose \tau as a function of M rather than as a constant, ensuring that \epsilon remains sufficiently small?

Experimental Designs Or Analyses

The baselines used in the experimental section are incomplete. I suggest that the authors compare their algorithms with those listed in Table 2 for a more comprehensive evaluation.

Supplementary Material

I have reviewed Appendices D and E.

Relation To Broader Scientific Literature

The experimental results appear promising. After incorporating comparisons with additional baselines, it would be beneficial to scale up the approach and explore its potential applications in larger models.

Essential References Not Discussed

I did not find any essential references that are missing.

Other Strengths And Weaknesses

Strengths:

  1. This paper improves the theoretical results of optimization on the Pareto set by removing the strong convexity assumption in (Roy, 2023) and provides stronger convergence guarantees compared to (Ye, 2022).
  2. The applications of this type of optimization problem are broad. The paper presents an algorithm to solve such problems, which has the potential to be scaled up to some large-scale applications.

Weaknesses:

  1. How should \tau and l be chosen, both theoretically and in practice, for the merit function?
  2. Theorems 3.6, 3.7, and 3.9 appear to be direct adaptations from (Shen, 2023) with only minor modifications.
  3. The convergence analysis is relatively trivial, as it primarily leverages results from well-known optimization algorithms.

References:

[1] Roy, A., So, G., and Ma, Y.-A. Optimization on Pareto sets: On a theory of multi-objective optimization. arXiv preprint arXiv:2308.02145, 2023.

[2] Ye, M. and Liu, Q. Pareto navigation gradient descent: a first-order algorithm for optimization in Pareto set. In Uncertainty in Artificial Intelligence, pp. 2246–2255, 2022.

[3] Shen, H., Xiao, Q., and Chen, T. On penalty-based bilevel gradient descent method. arXiv preprint arXiv:2302.05185, 2023.

Other Comments Or Suggestions

No

Author Response

Thanks for acknowledging that we consider BLO with a non-convex vector-valued lower-level objective and stronger guarantees, which differs from prior works, and that the experimental results are promising.

We would like to emphasize that the non-convex vector-valued lower-level objective makes the problem much more challenging and nontrivial, as also pointed out by Reviewer XkmK.

Below we address all your concerns. The link to additional results is https://anonymous.4open.science/r/FOOPS-F746/ICML3045_rebuttal.pdf

Claims: How to incorporate Preference Guidance...

This might be a misunderstanding. In our paper, the preference guidance is not replaced by another objective from F, but is enforced via a new upper-level preference objective f_0.

Our experiments include preference-vector-guided problems with f_0(x) = ||H(x)||^2, where H(x) = 0 describes the preference vector; see Section 6. See also the response to Reviewer SqKr (Claims). We will clarify this.

Broader Literature: ...scale up to applications in larger models.

We have experiments with larger models for speech recognition; see Section 6, Table 4. The model size is around 64.5M parameters, with more details provided in Appendix F.

Methods&Evaluation: Relationship between (1) and CP.

In our paper, below Eq. (CP), we discuss that as l, \tau \downarrow 0, (1) and (CP) are equivalent. For \tau \downarrow 0 with l > 0, (1) and (CP) are equivalent under the conditions in Proposition 3.2-2-b). Moreover, by Proposition 3.2, approximate solutions to (CP) with v_{l,\tau}(x) + \tau \ln M \leq \epsilon are \epsilon'-weakly Pareto optimal. Therefore, an \epsilon-global/local solution x to (CP) is globally/locally \epsilon'-weakly Pareto optimal and satisfies f_0(x) \leq f_0(x^*), where x^* is the global/local solution to (1). We will add this discussion in the revised paper.

Theoretical Claims: In Proposition 3.2, \epsilon = \tau \ln M, ... choose \tau as a function of M...?

Yes. In our experiments, M is fixed for a given problem, and \tau is chosen according to M, so we can ensure \epsilon is sufficiently small. For example, choosing \tau = \epsilon_0 / \ln M yields \tau \ln M = \epsilon_0 for any target \epsilon_0.

Experimental Designs/Analyses: ...Compare with methods listed in Table 2.

It is hard to compare with these methods since they do not provide open-source code. Furthermore, these methods require different assumptions that cannot be satisfied in our case. For example, for the lower-level objective, PMM and TAWT require strong convexity or an invertible Hessian, which does not hold in our problems, so these methods cannot be applied.

As per the reviewer's request, we implemented PB-PDO and PNG. Figure A and Table B in the linked PDF summarize the results, which show that they sometimes fail to converge to the Pareto front and achieve worse hypervolumes than FOOPS.

W1. How should \tau and l be chosen theoretically and in practice?

As discussed in Section 3.1, theoretically, a smaller \tau approximates \bar{u} better but increases the smoothness constant. We need to choose l \geq \ell_{f,1} to ensure \nabla v_{l,\tau} exists and can be computed. In practice, we choose relatively small \tau and l, but not so small that v_{l,\tau} becomes non-differentiable. Detailed choices are provided in Appendix F.

W2. Theorems 3.6, 3.7, and 3.9 appear to be direct adaptations from (Shen, 2023)...

We respectfully disagree. The differences from (Shen, 2023) are not minor. In (Shen, 2023), the focus is on a scalar lower-level objective satisfying the QG condition (a special case of HEB with \eta = 2). Directly applying the results from (Shen, 2023) is impossible, as we can only make assumptions on the objective F, not on v_{l,\tau}, and v_{l,\tau} does not satisfy this condition even if F does. In comparison, our theories require much weaker assumptions, as discussed in Appendix A.1, lines 749-756. In addition, (Shen, 2023) assumes QG holds on the entire domain. However, the exponential function in our case is not globally subanalytic on the whole domain (Remark C.9). Instead, we prove that v_{l,\tau} is globally subanalytic on any bounded subanalytic set given that the objective F is subanalytic. Thus, the HEB holds on the bounded subanalytic set, which is the basis of the proofs for Theorems 3.6, 3.7, and 3.9.

W3. Convergence analysis is relatively trivial...

We respectfully disagree. Our main contribution is to convert the challenging semivectorial bilevel optimization problem into a simpler penalty problem, with guarantees on the relations between the two and easy-to-evaluate gradients for the penalty problem. Building upon this, the convergence of the penalty problem can be analyzed using existing tools.

By converting a challenging problem to a simpler one, we obtain a simple convergence analysis. This is a strength rather than a weakness of our method.


We hope the concerns are addressed and the reviewer can reconsider the rating of our paper. Thanks!

Reviewer Comment

Thanks to the authors for the rebuttal. I have a follow-up question regarding the "Relationship between (1) and (CP)". I am wondering why you do not have a definition of an \epsilon-stationary solution for (CP). Why is the gradient-based formula (10) needed here? It seems like (10) is only needed for the definition of an \epsilon-stationary solution. The relation between (CP) and (10) is also not clear to me.

Since the authors have addressed most of my concerns, I’ve updated my score accordingly.

Author Comment

Thank you very much for engaging in the discussion and for the follow-up questions. Indeed, a proper definition of a stationarity condition for (CP) is a delicate issue in our studied problem and one of our contributions. We answer your questions point by point as follows.

1. Why not use an \epsilon-stationary solution for (CP). A commonly used stationarity condition is the (\epsilon,\epsilon')-KKT condition for the constrained problem (CP), defined as v_{l,\tau}(x) \leq \epsilon and ||\nabla f_0(x) + \nabla v_{l,\tau}(x) w|| \leq \epsilon', with w a bounded scalar. However, it has been proved in prior works [Liu et al. 2022, Section 4.1; Xiao et al. 2023, Example 1] that for the bilevel problem, the KKT condition of (CP) is not a necessary optimality condition when the KL exponent of v_{l,\tau} is in a certain range, e.g., exponent \alpha_v = 2, which reduces to the PL condition. Therefore, it is not suitable to directly use the KKT condition of (CP) as a stationarity measure.

2. The need for formula (10). The above discussion shows the KKT condition for (CP) cannot be used. Nevertheless, when the lower-level objective satisfies certain conditions such as the KL condition, the KKT condition of the reformulated problem (10) is a necessary optimality condition. This is proved in Appendix D.4, Lemma D.7 of our paper, which shows that the calmness constraint qualification in Definition D.6 holds under the KL condition, justifying that the KKT condition of (10) is a necessary optimality condition for (10). This type of reformulation exists and has been justified in prior works [Liu et al. 2022, Eq. (3); Xiao et al. 2023, Eq. (3)], but only for PL lower-level objectives. And (10) is equivalent to (CP) under the KL condition, as detailed in the next point.

3. Relation between (CP) and (10). (10) is an equivalent reformulation of (CP) when the lower-level objective v_{l,\tau}(x), or p(x), satisfies the KL condition. This is because under the KL condition, \nabla p(x) = 0 is equivalent to p(x) = 0, and thus to v_{l,\tau}(x) + \tau \ln M = 0. Similar discussions exist in prior works [Liu et al. 2022, Section 4.1; Xiao et al. 2023, Section 2.1], but only under the PL condition (i.e., exponent \alpha_p = 2).
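For intuition, here is the one-line argument in the PL special case (\alpha_p = 2), written in our paraphrase; \mu > 0 is a generic PL constant, and we use that p \geq 0 with \min_x p(x) = 0, with the last equivalence following from the identification of p with v_{l,\tau} + \tau \ln M stated above:

```latex
% PL condition (KL with exponent alpha_p = 2) for p >= 0 with min p = 0:
\tfrac{1}{2}\,\|\nabla p(x)\|^2 \;\ge\; \mu\, p(x) \quad \forall x,
% so nabla p(x) = 0 forces p(x) <= 0, hence p(x) = 0; conversely,
% p(x) = 0 makes x a global minimizer, hence nabla p(x) = 0:
\nabla p(x) = 0 \;\iff\; p(x) = 0 \;\iff\; v_{l,\tau}(x) + \tau \ln M = 0 .
```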

Due to limited space, a brief discussion is provided below Theorem 3.9, and a detailed discussion is provided in Appendix D.4 in the paper. We will further clarify these questions in the revision. We hope our answers address your questions and that you will reconsider the rating of our paper. We are willing to address any follow-up questions the reviewer may have.


Thank you very much for acknowledging our response, engaging in the discussion, and updating your score. We will incorporate the promised revisions to improve our paper.

Sincerely, authors

References

Liu, B., Ye, M., Wright, S., Stone, P., and Liu, Q. "BOME! Bilevel optimization made easy: A simple first-order approach." NeurIPS, 2022.

Xiao, Q., Lu, S., and Chen, T. "An alternating optimization method for bilevel problems under the Polyak Łojasiewicz condition." NeurIPS, 2023.

Official Review
Rating: 4

This paper proposes a new method for solving the semivector bilevel optimization problem. The authors first reformulate the multi-objective subproblem as a single objective constraint and then use a penalty-based method to solve the reformulated optimization problem. The results demonstrate the effectiveness of the proposed method in finding preference-guided optimal solutions to the multi-objective problem.

Questions For Authors

No other questions.

Claims And Evidence

I think all the claims have been well supported by clear and convincing evidence.

Methods And Evaluation Criteria

I think the proposed methods are well-evaluated.

Theoretical Claims

I have checked the proof.

Experimental Designs Or Analyses

I have checked the experimental designs.

Supplementary Material

I have reviewed Appendix A.

Relation To Broader Scientific Literature

The key contribution of this paper is proposing a new method for solving the challenging semivector bilevel optimization problem.

Essential References Not Discussed

I think the authors should discuss some related work (see Comment 2).

Other Strengths And Weaknesses

Strengths:

  1. I think this paper is clearly written and easy to understand.
  2. The semivector bilevel optimization problem being solved is very challenging. This paper proposes an efficient method to solve it with strong theoretical guarantees.
  3. The experimental results are intuitive and demonstrate the authors' claims.

Other Comments Or Suggestions

  1. Lacking analysis of computational cost. The authors should provide the order of computational cost and memory cost per iteration, and a table of real running times for each method.
  2. In Appendix A, the authors discuss several works on bi-level optimization (BLO). However, I suggest they also provide a discussion on multi-objective bi-level optimization (MOBLO) problems, as explored in [1-5]. The key difference between semivector bi-level optimization and multi-objective bi-level optimization is that the former involves solving multiple objectives at the lower level, making it significantly more challenging.
  3. Typo: The author should pay attention to the consistency of the punctuation at the end of the formulas; some formulas end with a period, while others do not.
  4. The experiment on the multi-patch image classification problem (using Multi-Fashion+MNIST) is relatively simple and may not fully demonstrate the capabilities of the proposed method. Could the authors conduct experiments on more challenging and widely used datasets, such as Office-31 or Office-Home?
  5. It seems the number of objectives in the LL subproblem is small in the experiments. Could the authors conduct experiments in more challenging settings, e.g., M > 20?

[1] Gu et al. Min-max multi-objective bilevel optimization with applications in robust machine learning. ICLR, 2023.

[2] Ye et al. A First-Order Multi-Gradient Algorithm for Multi-Objective Bi-Level Optimization, ECAI 2024.

[3] Ye et al. Multi-Objective Meta-Learning, AIJ 2024.

[4] Yang et al. Gradient-based algorithms for multi-objective bi-level optimization, Science China Mathematics, 2024

[5] Fernando et al. Mitigating gradient bias in multi-objective learning: A provably convergent approach. ICLR, 2023.

Author Response

Thanks for supporting our work and acknowledging that we propose a new, efficient method to solve a very challenging semivectorial bilevel optimization problem, with strong theoretical guarantees and intuitive experimental results demonstrating our claims.

We address your other comments below. The link to the additional results is https://anonymous.4open.science/r/FOOPS-F746/ICML3045_rebuttal.pdf

1. Lacking analysis of computational cost. Provide the order of computational cost and memory cost per iteration, and the real running time for each method.

Thanks for the suggestion. In Appendix F, Table 10, we have provided the real running time for each method. We add the per-iteration memory cost in Table A in the link.

2. In Appendix A, the authors discuss several works on bi-level optimization (BLO). I suggest they also provide a discussion on multi-objective bi-level optimization (MOBLO) problems, as explored in [1-5]...

Thanks for the suggestion. We will incorporate these works in our paper.

3. Typo: consistency of the punctuation at the end of the formulas.

Thanks for the suggestion. We will check to ensure the consistency of the punctuation.

4. The experiment on the multi-patch image classification problem (using Multi-Fashion+MNIST) is relatively simple and may not fully demonstrate the capabilities of the proposed method. Could the authors conduct experiments on more challenging and widely used datasets, such as Office-31 or Office-Home?

We have included an experiment with a larger model in our paper for speech recognition; see, e.g., Table 4 in Section 6. The model size is around 64.5M parameters, much larger than in the image classification problem. More details are provided in Appendix F. The benchmarks Office-31 and Office-Home suggested by the reviewer are designed for multi-task learning, not for the preference-guided multi-objective learning studied in this paper, and are thus not very suitable for our problem. To conduct such experiments, we would need to first define a preference objective f_0 and then run our method. We will include the results in the revision.

5. It seems the number of objectives in the LL subproblem is small in the experiments. Could the authors conduct experiments in more challenging settings, M > 20?

Theoretically, our method can be used for larger M without significantly increasing complexity. We test this on a toy problem with M = 25 and report the running time until convergence in the last row of Table A in the linked PDF, which shows the running time can be much shorter than FERERO's. The problem is defined as f_m(x) = (x - m)^2 for m \in [M], H(x) = f_1(x) - f_2(x), and f_0(x) = ||H(x)||^2, with x initialized to 100. We will find other real-world problems with more objectives and include the experiments in the revised paper.
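For readers who want to reproduce the toy setup, here is a self-contained sketch. The penalty-weight schedule, step size, and the closed-form stand-in for the smoothed merit function are our illustrative assumptions, not the FOOPS algorithm itself:

```python
import torch

# Toy problem from the response: f_m(x) = (x - m)^2 for m = 1..M,
# H(x) = f_1(x) - f_2(x), f_0(x) = ||H(x)||^2, x initialized at 100.
# The weak Pareto set of the f_m is the interval [1, M]; on it, f_0 is
# minimized at x = 1.5 (where f_1(x) = f_2(x)).
M = 25

def f(x, m):
    return (x - m) ** 2

def f0(x):                          # preference objective
    return (f(x, 1) - f(x, 2)) ** 2

def merit(x):
    # Illustrative stand-in for the smoothed merit function: for these
    # 1-D quadratics, the min-norm convex combination of the gradients
    # 2(x - m) vanishes exactly on [1, M], giving the closed form below.
    # The actual FOOPS merit v_{l,tau} differs.
    d = torch.clamp(1 - x, min=0) + torch.clamp(x - M, min=0)
    return (2 * d) ** 2

x = torch.tensor(100.0, requires_grad=True)
opt = torch.optim.SGD([x], lr=0.05)
for t in range(2000):
    gamma = 1.0 + 0.01 * t          # growing penalty weight (assumed schedule)
    opt.zero_grad()
    loss = f0(x) + gamma * merit(x)
    loss.backward()
    opt.step()
print(x.item())                     # approaches 1.5, the preferred Pareto point
```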


We hope the concerns are addressed and the reviewer can continue to support our paper. Thanks!

Reviewer Comment

Thanks for the authors' responses. My concerns have been resolved, and I will keep my positive score.

Here are some other suggestions.

In some BLO papers, such as "A Generic Descent Aggregation Framework for Gradient-Based Bi-Level Optimization," this kind of BLO, with only one variable, is called simple BLO. I wonder whether we can call this problem (Problem 1) semivector bilevel optimization, because naturally we would have UL variables and LL variables, but the solved problem has only one variable.

The authors should write the convergence metric in a separate section and include the discussion of theoretical analysis challenges in the paper.

Author Comment

Thank you for the follow-up reply and the consistent support!

Thank you for the suggestions.

  1. Yes, it would be more accurate to call (Problem 1) a semivector simple bilevel optimization problem. We will revise the paper accordingly.

  2. We have highlighted some theoretical analysis challenges in the introduction and remarks. Following your suggestion, we will write the convergence metric in a separate section and include a more detailed discussion of theoretical analysis challenges in the main paper.

Since you are an expert in the field, it would be great if you could champion our paper! Much appreciated!

Sincerely, Authors

Final Decision

This paper explores the problem of multi-objective learning under preference guidance. It formulates the problem as a semivectorial bilevel optimization task and proposes a first-order penalty approach to solve it. The paper further establishes the theoretical foundations of the proposed method and demonstrates its practical effectiveness on several machine learning tasks.

The reviewers agree that this paper addresses a challenging and important problem in multi-objective optimization, that the proposed method is novel, and that the theoretical foundation is solid. Given its important contributions to the field, I recommend accepting this paper to ICML.