🏆 COPA: Comparing the Incomparable to Explore the Pareto Front
We make objectives comparable via their CDFs, approximated by their relative rankings, to meaningfully navigate the space of trade-off solutions.
Summary
Reviews and Discussion
The paper proposes COPA (Comparing the Incomparable to Explore the Pareto Front), a novel approach for comparing and aggregating multiple objectives in machine learning model selection. The authors address the challenge of meaningfully comparing objectives with different scales and semantics (e.g., accuracy vs CO2 emissions) by transforming them through their CDFs, approximated by relative rankings. This allows for principled navigation of the Pareto front while respecting user preferences. The method is demonstrated on several important applications including LLM selection, domain generalization, and AutoML benchmarking.
Questions to Authors
- How does the method scale with very large numbers of models/objectives?
- Are there cases where the CDF transformation could be misleading?
Claims and Evidence
yes
Methods and Evaluation Criteria
yes
Theoretical Claims
yes
Experimental Design and Analysis
yes
Supplementary Material
yes
Relation to Prior Literature
Related to some extent.
Missing Important References
None
Other Strengths and Weaknesses
Strengths:
- The approach is theoretically well-grounded with clear analysis of properties
- The problem being solved (comparing incomparable objectives) is highly relevant to modern ML
- Extensive empirical validation across multiple important domains
- Clear practical impact for model selection and benchmarking
- The implementation is relatively simple yet effective
Weaknesses:
- The computational overhead of computing rankings for large model populations could be discussed more
- Some discussion of failure cases or limitations would be valuable
- Additional ablation studies on the choice of p parameter could help inform practical usage
- The connection to existing work in multi-criteria decision making could be expanded
Other Comments or Suggestions
None
We thank the reviewer for the constructive and positive feedback. We are especially grateful for the kind words towards our work, acknowledging its importance and potential impact on modern ML. It is also encouraging to see the reviewer confirming the validity of our approach and derivations, as well as of the empirical validation of COPA.
The computational overhead of computing rankings for large model populations could be discussed more.
How does the method scale with very large numbers of models/objectives?
We agree, and we will add a paragraph discussing the overhead of COPA in detail for the camera-ready. Glossing over minor details, an implementation of COPA consists of sorting K arrays, each of them containing the performance of N models (see the implementation in the notebooks in the supplementary material). As a result, the overall time complexity of COPA is $O(K \cdot N \log N)$, which is comparable to many lightweight preprocessing steps commonly used in ML pipelines.
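To make the cost concrete, the ranking-based normalization can be sketched in a few lines; the sort of each of the K objective columns dominates the runtime. This is a minimal illustration (the function name `rank_cdf` and the "lower is better" convention are our assumptions, not the paper's exact implementation, and ties are glossed over):

```python
import numpy as np

def rank_cdf(scores: np.ndarray) -> np.ndarray:
    """Approximate each objective's CDF by relative ranks.

    scores: (N, K) array -- N models, K objectives (lower is better, no ties).
    Returns an (N, K) array with values in (0, 1]; the dominant cost is
    sorting each of the K columns, i.e. O(K * N log N) overall.
    """
    N, K = scores.shape
    u = np.empty_like(scores, dtype=float)
    for k in range(K):
        order = np.argsort(scores[:, k])   # one O(N log N) sort per objective
        ranks = np.empty(N, dtype=float)
        ranks[order] = np.arange(1, N + 1)  # rank 1 = best (smallest) value
        u[:, k] = ranks / N                 # empirical CDF estimate
    return u
```

For typical leaderboard sizes (thousands of models, a handful of objectives) this is negligible next to any model evaluation step.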
Some discussion of failure cases or limitations would be valuable.
Are there cases where the CDF transformation could be misleading?
We provide COPA, and thus the CDF, as a complement (not a replacement) of the original objectives as there is no need to discard them (i.e., the marginals). In fact, in figures 1-3, 5 and table 2, we plot the Pareto-front exploration in the original objective space to enable decision makers to perform intra-objective comparisons. Otherwise, we would lose vital information regarding the marginal information such as sudden phase-changes, as rightfully pointed out by reviewer w3Vj.
Regarding the limitations of the CDF, we can think of one main case where the transformation can be misleading: If the objectives turn out to be discrete rather than continuous (as assumed in line 55), then the resulting variable will no longer resemble a standard uniform one (as claimed in lines 180-184). We will include a paragraph stressing the need to meet our assumptions in the camera-ready version.
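This failure mode is easy to demonstrate empirically: the probability integral transform of a continuous variable is (approximately) standard uniform, while a discrete variable collapses onto a few atoms. A hypothetical illustration (the helper `ecdf` and the example distributions are ours):

```python
import numpy as np

def ecdf(y):
    # Empirical CDF evaluated at each sample point: F(t) = P(Y <= t).
    return np.searchsorted(np.sort(y), y, side="right") / len(y)

rng = np.random.default_rng(0)
N = 10_000

# Continuous objective: u = F(y) is (approximately) standard uniform.
u_cont = ecdf(rng.normal(size=N))

# Discrete objective (e.g., an integer rating in {0,...,5}): the transform
# piles up on a handful of atoms, so u is far from Uniform(0, 1).
u_disc = ecdf(rng.integers(0, 6, size=N).astype(float))

print(len(np.unique(u_cont)))   # ~N distinct values, spread over (0, 1]
print(len(np.unique(u_disc)))   # at most 6 distinct values
```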
Additional ablation studies on the choice of p parameter could help inform practical usage.
While we already try to provide some guidance and intuition on the choice of $p$ to practitioners, especially in the last paragraph before Section 4 and the experiment in Figure 3, we acknowledge that additional experimental results could help in choosing $p$. Using the simple notebooks provided in the supplementary material, we will add extra results for the existing experiments using different values of $p$.
The connection to existing work in multi-criteria decision making could be expanded.
We will expand the existing discussion of related works. Besides MOO ML (i.e. estimation of the Pareto front) and multi-objective Bayesian Optimization, we will discuss existing works in multi-criteria decision making. We invite the reviewer to share any specific work they could have in mind and that we might have missed in the first version.
We appreciate the reviewer's time and questions. We hope to have sorted out all existing questions and, if that is the case, we kindly ask the reviewer to revisit their review if it feels appropriate. If there are further questions, we are happy to address them in the next phase of the rebuttal.
This paper proposes "COPA: Comparing the Incomparable to Explore the Pareto Front". The authors claim that it is often unclear how one should compare, aggregate and, ultimately, trade off multiple objectives, as they might be measured in different units or scales. The authors propose to make incomparable objectives comparable via their CDFs, approximated by their relative rankings.
Questions to Authors
-
For the LLM models as evaluated, do you use public data, or do you train those models and evaluate them yourselves?
-
My main concern is that both the proposed MOO methods and the evaluated models were proposed by other papers. For example, the Tchebycheff or the p-norm aggregation function has been proposed for years. Up to now, it seems that this paper only merges these two directions. If that is true, the contribution of this paper seems limited to me. Is there a new technical contribution in this paper?
-
(Cont. from 2) For example, line 292 right. It seems that the authors just gather the results from some existing models. Therefore, what is the contribution of this work?
-
Line 320 right, CelebA is not a LLM benchmark. What is the purpose of using CelebA as an introduction here?
Claims and Evidence
The authors propose three interesting case studies to show the effectiveness of their method.
Methods and Evaluation Criteria
The authors use the Open LLM Leaderboard (Fourrier et al., 2024), which is pretty new and appropriate.
Theoretical Claims
Theoretical claims are largely based on previous literature, e.g., (Miettinen, 1999, Thm. 3.4.1).
Experimental Design and Analysis
Experiments are conducted on three case studies.
Supplementary Material
I have roughly gone through the supplementary material and the results seem correct.
Relation to Prior Literature
This paper is highly related to LLM evaluation and content moderation.
Missing Important References
-
DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models. This paper offers MOO criteria.
-
Panacea: Pareto Alignment via Preference Adaptation for LLMs. This paper combines post-training of LLMs and multi-objective optimization.
Other Strengths and Weaknesses
Strength:
- This paper combines MOO and modern LLMs.
Other Comments or Suggestions
- The notation y1 ∼ U(0.02, 0.2) is not proper. Consider changing it to y_1 \in [0.02, 0.2].
Ethics Review Issues
NA.
We thank the reviewer for their work, and we are happy to hear that the cases for which we apply COPA are interesting, and that our evaluation is new and appropriate. We hope the following helps the reviewer better understand our work.
This paper combines MOO and modern LLMs.
This paper is highly related to LLM evaluation and content moderation.
First, we want to stress that the scope of our work is multi-criteria evaluation in modern ML in general, with LLM selection being just one of our use cases in Section 5, which include LLM selection, domain generalization, fair ML, and AutoML benchmarking.
It seems that the author just gather the results from some existing models. Therefore, what is the contribution of this work?
Indeed, we gather publicly available evaluation data for all experiments except the synthetic and FairGrad ones (as stated in the appendix). This does not diminish the value of our work (in the end, these are our “datasets”) and only corroborates how broad and common a task multi-criteria evaluation is, and thus the potential impact of COPA.
Our main contribution is a simple, yet general, approach to evaluate, compare and select ML models in terms of several (often) non-comparable objectives, by casting the problem to a probabilistic MOO problem (see lines 75-100 for further details). The simplicity of COPA is not a weakness, but a strength, and can potentially have significant impact in many areas of ML, as acknowledged by reviewer UMh1.
[…] Tchebycheff or the p-norm aggregation function has been proposed for years. [...] Is there new technical contribution in this paper?
While we hope to have cleared up the contributions of our work, let us remark that the proposed weighted $p$-norm in Eq. 12 is a novel technical contribution. In lines 234-236 we discuss the differences between this norm and the usual weighted $p$-norm, which does not serve to intuitively map user preferences. To make this point clearer, we have reproduced Figure 1 of the main paper using the usual weighted $p$-norm (see the new figure here). Furthermore, matching the Tchebycheff problem with $p = \infty$ is just one property of our norm that, however, does not hold for the regular weighted $p$-norm (as can be seen here).
CelebA is not a LLM benchmark. What is the purpose of using CelebA as an introduction here?
CelebA is the dataset of one of the 5 total experiments we show in the main manuscript, covering many areas in ML (LLMs, fair ML, MTL, domain generalization, and AutoML). In our experiments CelebA is used to show how COPA enables practitioners to choose ML models that achieve a sensible fairness-accuracy trade-off.
DecodingTrust and Panacea references.
We appreciate the shared references, and we will add them to the camera-ready revision as related work.
Regarding DecodingTrust, we found it really interesting as it serves as yet-another-use-case for which the adoption of COPA can have a real impact, as the authors provide a set of objectives to evaluate LLMs and attempt to make them comparable (Appendix I.1), ultimately taking their average as score.
Given the similar format of the DecodingTrust and Open LLM Leaderboards, we have applied COPA to the former leaderboard too, sorting the considered LLMs with different user-given preferences (we provide tables for three values of $p$). We will include the results as an additional experiment in the appendix. Among the most interesting takeaways, with COPA we can see that GPT-4 ranks among the least robust models in terms of DecodingTrust objectives (as it is the least fair LLM), while it is the 6th best model using the provided Overall score, as shown in their online leaderboard.
We hypothesize that the reviewer flagged our work for an ethics review by mistake. Otherwise, we would appreciate some explanation in this regard. We hope to have addressed all the concerns from the reviewer. If so, we would appreciate it if the reviewer could revisit their review to reflect these changes. We are happy to clarify any further questions in the next phase of the rebuttal period.
The goal of the paper is to address the challenge of multi-objective machine learning evaluation where objectives are often incomparable due to differing semantics, units, and scales (e.g., comparing model performance and CO2 emissions). It proposes a novel method, COPA (Cumulative-based Optimization of the Pareto front), to make such objectives comparable.
The COPA algorithm consists of the following main steps:
-
Problem Setup
- Define the multi-objective optimization problem as $\min_{h \in \mathcal{H}} \mathbf{y}(h)$, where $\mathbf{y}(h)$ is a vector of $K$ objectives for model $h$.
-
CDF Normalization
- Normalize the objectives using their CDFs, $u_k(h) = F_k(y_k(h))$, where $F_k$ is the cumulative distribution function of the $k$-th objective. When the CDFs are unknown, approximate them using relative rankings, $\hat{F}_k(y_k(h)) \approx \mathrm{rank}_k(h)/N$.
-
Preference Integration
- Define a criterion function $C$ to aggregate the normalized objectives. For example, use a weighted $p$-norm with objective-importance weights $\omega_k$, $\sum_k \omega_k = 1$, where $p$ determines the aggregation method (e.g., $p = \infty$ for robust, worst-case optimization).
-
Optimization
- Solve the optimization problem $\min_{h \in \mathcal{H}} C(\mathbf{u}(h))$, where $\mathbf{u}(h)$ is the vector of normalized objectives for model $h$.
-
Model Selection
- Select the model(s) with the smallest value of $C(\mathbf{u}(h))$, reflecting the desired trade-off.
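The steps above can be assembled into a short end-to-end sketch. This is a hedged illustration: the criterion below is a generic weighted p-norm stand-in, not the paper's exact Eq. 12, and all function names are ours:

```python
import numpy as np

def ecdf_normalize(Y):
    """CDF normalization: map each objective column to (0, 1] via its empirical CDF."""
    U = np.empty_like(Y, dtype=float)
    for k in range(Y.shape[1]):
        col = Y[:, k]
        U[:, k] = np.searchsorted(np.sort(col), col, side="right") / len(col)
    return U

def criterion(U, weights, p=2.0):
    """Preference integration via a weighted p-norm (generic stand-in, not Eq. 12)."""
    w = np.asarray(weights, dtype=float)
    if np.isinf(p):
        return np.max(w * U, axis=1)            # Tchebycheff-like worst case
    return np.sum((w * U) ** p, axis=1) ** (1.0 / p)

def select_model(Y, weights, p=2.0):
    """Optimization + model selection: index of the model with the smallest score."""
    return int(np.argmin(criterion(ecdf_normalize(Y), weights, p)))
```

For example, with `Y` holding (error, CO2) pairs per model (lower is better) and `weights = [0.5, 0.5]`, `select_model` returns the row index of the model achieving the preferred trade-off.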
Questions to Authors
See Weaknesses.
Claims and Evidence
Claims are clear.
Methods and Evaluation Criteria
I don't think it makes sense. Users could have complex preferences which are hard to score with weights. It is not proper to assume there could be a weight between different objectives. For example, users might like to maximize A+2B when C<0.5 whereas maximize A+B+C when C>=0.5. This phenomenon is significant when the number of objectives is larger than 2. Thus, we cannot assume the criterion function C must be differentiable and easy to optimize, as in Section 3.3, Incorporating Preferences into the Optimization.
Theoretical Claims
No theoretical claims in the paper
Experimental Design and Analysis
See Methods And Evaluation Criteria
Supplementary Material
Code read.
Relation to Prior Literature
LLM, multi-objective optimization.
Missing Important References
Missing related work on Pareto front estimation:
- Pareto Merging: Multi-Objective Optimization for Preference-Aware Model Merging
- MAP: low-compute model merging with amortized pareto fronts via quadratic approximation etc.
Other Strengths and Weaknesses
Strengths: COPA uses CDFs to normalize objectives, ensuring all objectives, regardless of their semantics or scale, are comparable. It is objective-agnostic and preserves Pareto-optimality.
Weaknesses:
-
I am skeptical about the project's incentives, particularly based on Figure 1. The statement, "This is reflected in the retrieved LLMs where, for α = 1/2, COPA finds a top-18% model for both objectives, while all other approaches select either a high-performing but CO₂-intensive model or a low-performing but ‘CO₂-free’ model," suggests that the authors assume a compromise between performance and carbon footprint is inherently preferable. However, this assessment should depend on user preferences. For instance, if a user has no carbon credits available for emissions, a completely "CO₂-free" model should be the better choice. These works should focus more on getting the correct Pareto front rather than on helping users pick the optimal choice. As long as we have a perfect Pareto front, practitioners can immediately find their own optimal choice according to their preferences.
Based on the summary and Weakness 1, the main contribution of the paper appears to be in the second step: CDF estimation (solution to "How can we make objectives comparable?"). However, I do not think this trick is a sufficient contribution for a full paper.
Taking a step back, the CDF estimation itself also has some problems:
- The approximation depends on the sample size N. With very small N, the accuracy of the estimated CDFs may degrade, potentially impacting the results.
- Rankings discard precise information about the relative distances between objective values, which might be important in some cases. Sometimes, certain thresholds are critical for some metrics, e.g., to pass a course, one needs to get 60; some dynamics undergo a phase change when a certain value is larger (or smaller) than the threshold. Ranking-based CDF estimation would cause trouble in this situation.
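This point can be made concrete: rank-based normalization is invariant to any monotone transformation of the objective, so it cannot distinguish gap structures or threshold crossings. A hypothetical illustration (the helper `rank_normalize` and the two score vectors are made up for this example):

```python
import numpy as np

def rank_normalize(y):
    # Empirical-CDF values via ranks (assumes no ties).
    return (np.argsort(np.argsort(y)) + 1) / len(y)

# Identical orderings, very different gap structure: in `b`, the last
# model clears a hypothetical critical threshold (say, 60) by a wide margin.
a = np.array([10.0, 11.0, 12.0, 13.0])
b = np.array([10.0, 11.0, 12.0, 95.0])

print(rank_normalize(a))   # [0.25 0.5  0.75 1.  ]
print(rank_normalize(b))   # identical: ranks are blind to the gap
```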
Other Comments or Suggestions
NA
We thank the reviewer for their valuable feedback. There seems to be a misunderstanding with our work, which we believe we address below.
As long as we have a perfect Pareto front, practitioners can immediately find their own optimal choice
We politely disagree. As stated in the intro, Pareto fronts in high-dimensional spaces are extremely difficult to visualize and navigate. In fact, for our use case in Fig. 1 we had to summarize 6 performance scores by their average in order to depict the 487 Pareto-optimal models (out of 2148). Also, as we discuss in depth (e.g., in lines 50-54, 63-69, or 126-154), traversing the front over incomparable objectives is challenging: e.g., in Fig. 1, “Delta” maps half of the preference values to a tiny region of the front (note the log scale), where LLM performance is very low. The other baselines behave in the inverse way, making it hard for practitioners to map their preferences to models on the front. In contrast, COPA maps the hypothetical (as we make clear in lines 96-100) practitioner's preference of an equal performance-cost compromise to $\alpha = 1/2$. Of course, practitioners may require better-performing models, which they can easily find by decreasing $\alpha$, as shown in Fig. 1-COPA, where there are another 486 Pareto-optimal models with different performance-cost trade-offs.
It is not proper to assume each there could be a weight between different objectives
We stress that weight-based preferences are still the de facto approach in many MOO works [1-3]. While it is true that weights may not always be easy to interpret (see Section 3.1.3 of [1]), we also remark that COPA overcomes this by making all objectives comparable first, for which we provide an interpretation of the weights in lines 237-251. Paraphrasing [1]: “Only normalizing the objective functions can one control the method to produce solutions of a desirable nature [...] Otherwise the role of the weighting coefficients may be greatly misleading” (as is the case with the baselines in Fig. 1).
We acknowledge that there are other ways of expressing preferences—e.g., interactive methods, for which [1] devotes 80 pages—and we would love to explore them in future works.
[1] Nonlinear multiobjective optimization (1999)
[2] Smooth Tchebycheff scalarization for multi-objective optimization (2024)
[3] Revisiting scalarization in multi-task learning: A theoretical perspective (2023)
Users might like to maximize A+2B when C<0.5 whereas maximize A+B+C when C>=0.5 ...
We agree that users can often have complex preferences, and believe that COPA can readily handle many of them, including your example, where one can use COPA with the piece-wise criterion function described by the reviewer. Fig. 5 aims to illustrate how one can combine COPA with user constraints over original objectives.
Inspired by the reviewer, we have slightly adapted Fig. 1 to accommodate for such a case, see here.
While we agree with the reviewer that users can have complex preferences that COPA may not be able to handle, we believe that this is interesting but challenging future work that does not diminish the contributions of COPA.
Thus, we cannot assume the criterion function C must be differentiable and easy-to-be-optimized
There might be a misunderstanding: we do NOT require differentiability or easy optimization of the criterion function, as we only need to evaluate it, for each model in Eq. 11, on all the objectives. COPA is a multi-objective evaluation method, as stated in the intro, and we assume a given population of (already trained) models (lines 104-107).
With very small N, the accuracy of the estimated CDFs may degrade...
Our theoretical results on its variance in Prop. 3.1, the ablation study in App. A.1.1, and our experiments show that COPA is already well behaved for moderate values of N (Case 3). We will clarify that, as for any statistical estimator, COPA may suffer with extremely low values of N.
Rankings discard precise information about the relative distances between objective values
We provide COPA as a complement (not a replacement) of the original objectives as there is no need to discard them (i.e., the marginals). In fact, in figures 1-3, 5 and table 2, we plot the Pareto-front exploration in the original objective space to enable decision makers to perform intra-objective comparisons. We will stress this aspect in the camera-ready.
I do not think this trick is a sufficient contribution for a full paper.
We believe that, as highlighted by reviewer UMh1 in their review, the simplicity of our approach is a strength, which can be applied to many problems such as LLM selection, domain generalization, and AutoML benchmarking (see Section 5).
We hope to have clarified any concerns from the reviewer and, in that case, that they could reconsider their score. We are happy to answer further questions in the next round of the rebuttal.
Hi,
Thanks for the rebuttal. Your answer still does not convince me. For Q1:
"Pareto fronts in high dimensional spaces are extremely difficult to visualize and navigate" is your point. I definitely understand it is hard to visualize, but I don't think it is hard to navigate. Since you agree that COPA cannot handle complex preferences, let's take a simple weighting preference as an example.
If your Pareto frontier is a set of points (which is most often the case in high-dimensional Pareto frontier estimation):
You can just traverse through the Pareto set and calculate the weighted sum $\sum_k \omega_k y_k$. Sort the results and take the maximum. You can also vectorize it to accelerate.
If the Pareto frontier is known in analytical form or defined by constraints (a continuous frontier):
Maximize $\sum_k \omega_k y_k$ s.t. the frontier constraints $g(\mathbf{y}) = 0$ or $g(\mathbf{y}) \le 0$, or, if the Pareto frontier itself has a parametric representation $\mathbf{y} = f(\boldsymbol{\theta})$, use the constraint $\mathbf{y} = f(\boldsymbol{\theta})$. Solve the optimization problem. (If the problem is non-convex, the result will be a sub-optimum.)
If the Pareto frontier is known as a generative network:
Gradient ascent in latent space (most common and often fastest):
Maximize the weighted objective s.t. $z \sim N(0, I)$ (or something similar); the problem should be non-convex, so the result will be a sub-optimum. Other methods could improve on this: Bayesian optimization, EAs (e.g., CMA-ES).
Sampling-based methods:
Latin Hypercube Sampling (LHS), quasi-random sampling (e.g., Sobol, Halton sequences).
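For the discrete-set case above, the traversal is indeed a few vectorized lines; here is a minimal sketch with synthetic data (the 1000 points, 3 objectives, and the weight vector are arbitrary assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
pareto_set = rng.random((1000, 3))   # 1000 Pareto points, 3 objectives
w = np.array([0.2, 0.5, 0.3])        # a simple weighting preference

scores = pareto_set @ w              # all weighted sums at once (vectorized)
best = int(np.argmax(scores))        # maximize, per the stated preference
```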
Conclusion
Again, as I mentioned: these works should focus more on getting the correct Pareto front rather than on helping users pick the optimal choice. If a user has no carbon credits available for emissions, a completely "CO₂-free" model should be the better choice. Otherwise, maybe the user should only care about performance.
Dear reviewer,
We appreciate the engagement, but we firmly believe that the reviewer is misunderstanding our scope and contribution. We politely invite them to re-read the paper and our initial rebuttal carefully.
First, we remark again that the setting we consider (as clearly stated in lines 104-107) starts with a given set of models. This model selection scenario, where the Pareto front is a set of points, is ubiquitous in ML and AI, as every practitioner knows. In fact, all COPA use cases in Section 5 are taken from the ML literature and publicly available repositories, demonstrating its applicability for model selection in a wide range of ML sub-fields ranging from LLMs and fair ML (Section 5.2) to MTL and domain generalization (Section 5.3), as well as for AutoML benchmarking (Section 5.4). Therefore, we disagree with the idea suggested by the reviewer that we should be “getting the correct Pareto front rather than help user to pick the optimal choice”, as it is an interesting but different problem out of the scope of our paper.
We politely invite the reviewer to evaluate our work on the basis of what it is, what it tries to solve, and what it accomplishes, and not on what it has never tried to be nor solve.
You can just traverse through the Pareto set and calculate the weighted sum. Sort the results and take the maximum. You can also vectorize it to accelerate.
This answer leads us to think that the reviewer has unfortunately only partially and superficially read the paper and our rebuttal. While one could enumerate all data points by hand, the major issue is comparing the many objectives in a rigorous and systematic way. To see why, consider that in high-dimensional spaces, when objectives are not comparable, adopting the naive traversal and comparison that the reviewer suggests will yield exactly a solution that does not map to the preferences. In fact, just summing incomparable metrics such as CO2 consumption and performance yields the 'Naive' approach depicted in Figure 1, top left (or Figure 2), where half of the preference values are mapped to a small region of the Pareto front (see our previous rebuttal).
Since you agree that COPA cannot handle complex preference
We invite the reviewer to re-read our rebuttal once again. At no point did we agree that “COPA cannot handle complex preference”. In order to find a middle ground, and in an act of courtesy from our side, we agreed that “users can have complex preferences that COPA may not be able to handle” since COPA, just like any other method, cannot perfectly solve every single conceivable query. But at the same time, we precisely showed that COPA can solve the constrained example that the reviewer proposed, showing its flexibility.
If a user has no carbon credits available for emissions, a completely "CO₂-free" model should be the better choice. Else, maybe the user should only care about performance.
We totally agree, and highlight that COPA allows users to express preferences where one or more dimensions receive 0 weight. We do not understand, however, why the reviewer's argument should imply that “allowing users to express their preference” is a useless thing. It is very likely that a practitioner might want the weights of the CO2 consumption or (all 6) performance objective(s) to be non-zero, and COPA would allow them to retrieve the optimal solution in a rigorous and systematic way: an algorithm that can be effortlessly applied to other scenarios (AutoML, fairness, etc.) without the need to manually compare dimensions, avoiding the pitfalls of comparing incomparable objectives.
This paper sets out to develop a principled method to address the evaluation of models based on multi-objective criteria that might have incomparable units and semantics (e.g., comparing and trading off model accuracy and CO2 emissions). The method proposed in the paper, Cumulative-based Optimization of the Pareto front (COPA), is based on the idea of defining an overall objective obtained by aggregating the CDFs of the individual objectives. This translation from multiple objectives to an aggregated one has the desirable properties of satisfying crucial theoretical guarantees, such as being objective-agnostic and order-preserving, so that it maintains Pareto-optimality of the models.
The paper then validates the method in a synthetic evaluation benchmark, and showcases its applicability on a number of different use cases, including selecting models by trading off performance and CO2 emission, and trading off fairness and accuracy.
While reviewers have praised the motivation and theoretical grounding of the approach, as well as the practical relevance to applications like ranking LLMs, reviewers have also raised concerns because of the lack of proper reference to and contextualization within the relevant literature. In particular, reviewers have pointed out missing credit to very related previous papers such as Wang et al. "DecodingTrust" (2024), which also deals with aggregating multiple objectives. Related to this work, the current paper also fails to mention and compare to other papers following up on DecodingTrust and proposing similar risk-aware metrics, such as Nitsure et al., "Risk Aware Benchmarking of LLMs", which follows a methodology based on stochastic dominance motivating a "portfolio" approach similar to COPA's aggregation method, and which should be compared to and benchmarked against in terms of the quality of the resulting aggregated metrics. In addition, the paper fails to properly discuss the relation to papers that use related multi-objective metrics as post-training or model-merging objectives, such as Zhong et al. "Panacea" (2024) and Chen & Kwok "Pareto Merging" (2024). Despite its clear merits, without a thorough discussion and, when appropriate, direct comparison with at least the most directly relevant previous contributions in the literature to help gauge the practical advantages of the proposed method over existing baselines, it is premature to recommend acceptance for this work.