COPA: Comparing the incomparable in multi-objective model evaluation
We investigate how objectives can be automatically normalized and aggregated to systematically navigate their Pareto front.
Abstract
Reviews and Discussion
This paper proposes COPA to assist decision makers (DMs) in multi-objective model selection and other related tasks. The innovation in COPA is a novel objective-space normalization method based on the CDF, which resolves semantic incomparability between different objectives by transforming them into a unified interpretive framework. The core implementation uses relative ranking as a finite-sample approximation of the CDF transformation.
Strengths and Weaknesses
This paper is well-written and easy to follow. On one hand, this work is highly accessible to audiences unfamiliar with multi-objective optimization, providing clear explanations of relevant concepts. On the other hand, as part of the target audience myself, I still find some of the statements insightful. I believe the proposed method COPA is practically significant, with each component thoughtfully designed. Experiments are conducted across diverse tasks, and practical suggestions are provided.
My concerns are as follows:
- This paper presents a comprehensive method, supported by many experiments and a good presentation. However, from a research perspective, its contribution is limited. Except for the normalization method, all other components of COPA are based on established techniques.
- I think that the proposed normalization method removes too much semantic information from the objective values, and that considering proper Pareto optimality [1,2] is more appropriate. In other words, certain solutions are unsuitable and should be excluded; after removing these solutions, applying a norm may be enough.
- What is the computational complexity of the normalization method?
Other comments are as follows:
- When a new model is included in the population, the relative rankings may change. The paper does not discuss how to efficiently update the CDF approximation without recalculating all rankings. I think this might be interesting.
- Using "\eqref" instead of "\ref" may be better. For example, "Eq. 1" → "Eq. (1)".
- Some typos: hyperparemeters → hyperparameters; leadearboard → leaderboard.
[1] Miettinen, K. (1999). Nonlinear multiobjective optimization (Vol. 12). Springer Science & Business Media.
[2] Zheng, R., & Wang, Z. (2024). Boundary decomposition for nadir objective vector estimation. Advances in Neural Information Processing Systems, 37, 14349-14385.
Questions
NA
Limitations
NA
Final Justification
Thank you for your response. Based on my current understanding, the method proposed in this paper is designed to facilitate search over the Pareto set. The Pareto set is typically large in such tasks. Although the paper does not introduce groundbreaking advancements or significant technical innovations (as noted by Reviewer N88d), it is characterized by high clarity and a logical progression. Moreover, the method is practical, and the experimental evaluation is comprehensive. Therefore, I have decided to raise my score to offer a little support for the acceptance of this work.
Finally, I sincerely recommend that the author(s) consider the suggestions made by Reviewer N88d in the future.
Formatting Issues
NA
We appreciate the feedback and comments by the reviewer, and are happy to hear that our work is well-written and easy-to-follow for people unfamiliar with multi-objective optimization, which is ultimately one of our goals. Let us then attempt to address those concerns raised by the reviewer:
This paper presents a comprehensive method, supported by many experiments and a good presentation. However, from a research perspective, its contribution is limited.
We can see how our work may seem limited from a technical perspective (although we do have technical contributions), but we disagree that our research contributions are limited. Not all works need to propose new methods or complex derivations. Instead, our work highlights an existing, prevalent, and underexplored problem in ML model evaluation by framing it as a multi-objective problem, and it provides a simple and intuitive approach to address this problem. COPA has been carefully designed and its simplicity makes it likely to have a real and significant impact, as the reviewer acknowledged in their review.
Moreover, we show how to adapt our assumed setting of model selection to more unconventional and surprising scenarios. For example, we show how objective incomparability can bias the conclusions drawn in research works in areas such as multitask learning and domain generalization (section 5.3). While this is clear once it is explained, it is neither direct nor obvious, as otherwise existing works would have attempted to address it (although recent works have started to partially tackle the issue with rank averages, as we properly acknowledge in lines 269-273).
I think that the proposed normalization method removes excessive semantic information from the objective values. I think that considering proper Pareto optimality [1,2] is more appropriate.
We fail to see the concern of the reviewer, as rankings are order-preserving and therefore preserve Pareto optimal points (D4, line 153). Moreover, note that we do not ask the user to drop the semantic information of each objective, as it is valuable to make intra-objective comparisons (or constraints, as in Figure 5). What we ask the user is to use relative rankings to perform inter-objective comparisons, as we can make more-informed comparisons across objectives that way. This is also why almost all figures but Figure 4 in the manuscript are plotted using the original objectives, and not their transformed values.
In other words, certain solutions are unsuitable and should be excluded. After removing these solutions, applying norm may be enough.
We similarly fail to see this point. Without applying any transformation, the objective vector of each model does not depend on the population and, since Eq. 11 is guaranteed to find Pareto-optimal solutions (lines 242-248), the solution found should be the same regardless of whether we restrict our space of solutions to Pareto-optimal models or not. We have extensively shown in our paper that this approach (which we term "naive") does not work well in practice (e.g. in Figure 1).
What is the computational complexity of the normalization method?
The proposed normalization has the same complexity as sorting an array and dividing by its length, that is, O(n log n).
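For concreteness, here is a minimal sketch of such a rank-based normalization for a single objective (our own illustration assuming NumPy, not the authors' implementation):

```python
# Sketch of the rank-based CDF approximation for one objective (illustrative only).
import numpy as np

def rank_normalize(y: np.ndarray) -> np.ndarray:
    """Map objective values to [0, 1) via their relative ranks; O(n log n) from the sort."""
    order = np.argsort(y)                 # indices that would sort the array
    ranks = np.empty(len(y), dtype=float)
    ranks[order] = np.arange(len(y))      # rank of each model within the population
    return ranks / len(y)                 # divide by the population size

print(rank_normalize(np.array([3.2, 0.1, 150.0, 7.5])))   # [0.25 0.   0.75 0.5 ]
```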
When a new model is included in the population, the relative rankings may change. The paper does not discuss how to efficiently update the CDF approximation without recalculating all rankings. I think this might be interesting.
We appreciate the feedback and agree that it can be of interest if the number of models is excessively large. In said scenario, one can update the ranking by finding the position of the new point in a sorted array in O(log n), and then adding a constant 1 to all entries after this position (worst case complexity of O(n), yet it can be practically constant with vectorized operations).
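A minimal sketch of this update (illustrative and assuming NumPy; not code from the paper):

```python
# Incremental rank update: binary search for the new value, then a vectorized shift.
import numpy as np

def insert_model(sorted_vals: np.ndarray, ranks: np.ndarray, new_val: float):
    """`sorted_vals`: objective values in ascending order; `ranks[i]`: integer rank of model i."""
    pos = np.searchsorted(sorted_vals, new_val)   # O(log n)
    ranks = ranks + (ranks >= pos)                # shift ranks at or after `pos` (O(n), vectorized)
    sorted_vals = np.insert(sorted_vals, pos, new_val)
    return sorted_vals, np.append(ranks, pos)     # the new model receives rank `pos`
```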
We hope to have addressed all concerns and remain at the reviewer's disposal should they have more questions. We also appreciate the suggestions and spotted typos, which we will implement in the next revision of the manuscript, and we hope the reviewer can reconsider their evaluation of our work.
For the parts you did not understand, I provide a simple example for illustration. Consider a task (a minimization problem) with only three Pareto-optimal objective vectors. We have:
- For the first vector, the pairwise comparisons against the other two yield a trade-off of 1.
- For the second vector, likewise a trade-off of 1.
- For the third vector, a trade-off of 1999.
The meaning of the trade-off can be found in Definition 2.9.1 of [1] from my initial comments, or Definition 10 of [2]. In simple terms, the trade-off represents the amount of increment in the value of one objective function in exchange for a one-unit decrement in another objective function.
The third vector is likely to be undesirable to the decision-maker because it significantly sacrifices the first objective to improve the second. However, ranking eliminates such information.
We thank the reviewer for the clarification. We initially missed the "proper" in "proper Pareto optimality" in the review, so our answers were articulated with the usual notion of Pareto optimality in mind. We really appreciate the provided example and understand the reviewer's concerns.
With respect to the use of proper Pareto optimality, we understand its appeal, but we do not believe it solves the underlying problem: objective incomparability. To see why, let us take the reviewer's example and give each dimension a physical unit, say, grams and Celsius degrees. Following the example, a European user would conclude that the third vector is likely undesirable given its large trade-off. However, an American user would likely first convert/measure the same quantities in their own units, say, (avoirdupois) pounds and Fahrenheit degrees, so that each vector is transformed accordingly. Doing the same computations with these different measurement units, we get:
- For the first vector, pairwise trade-offs of 816.466 and 0.816; thus, a trade-off of 816.466.
- For the second vector, pairwise trade-offs of 0.0012 and 0.408; thus, a trade-off of 0.408.
- For the third vector, pairwise trade-offs of 1.224 and 2.448; thus, a trade-off of 2.4483.
So, using the trade-off values, the American user would disagree with the European one and identify a different point as clearly the most undesirable.
We do believe that the concerns raised by the reviewer are sensible. However, note that this was not an issue in any of the experiments conducted in the manuscript with real-world data. If it happened, it would become apparent once we plot the Pareto front in the original space, as we do in our figures. Moreover, part of our initial response still applies here: we do not discard the marginal information. After selecting a point with COPA, the user can check (and it is advisable to do so) the marginal distribution of the original objectives, diagnose cases like the ones described above, and see whether the selected point is a desirable solution. Otherwise, they can keep exploring the Pareto front with COPA while refining their search. Crucially, what COPA allows is to perform such a search in a meaningful manner, as we can now perform sensible comparisons across objectives and give reliable semantics to the parameters of the criterion function. We will, however, make sure to make this point clear in the camera-ready.
Thank you once again for the reply! We hope that our answer helps the reviewer share our perspective.
Thank you for your response. Based on my current understanding, the method proposed in this paper is designed to facilitate search over the Pareto set. The Pareto set is typically large in such tasks. Although the paper does not introduce groundbreaking advancements or significant technical innovations (as noted by Reviewer N88d), it is characterized by high clarity and a logical progression. Moreover, the method is practical, and the experimental evaluation is comprehensive. Therefore, I have decided to raise my score to offer a little support for the acceptance of this work.
Finally, I sincerely recommend that the author(s) consider the suggestions made by Reviewer N88d in the future.
This paper introduces COPA (cumulative-based optimization of the Pareto front), a novel approach designed to address the challenges of comparing, aggregating, and trading off multiple, often incomparable, objectives in machine learning (ML) model evaluation. COPA makes incomparable objectives comparable by transforming them using their cumulative distribution functions (CDFs), which are approximated by their relative rankings. This transformation ensures that all objectives are equally (uniformly) distributed in the unit interval, making them semantically comparable and objective-agnostic while preserving Pareto-optimality. It utilizes a criterion function, specifically a parameterized norm (∥u∥p,ω), to aggregate the normalized objectives and match user-specific preferences.
Strengths and Weaknesses
- The claims made in the abstract and introduction, particularly the need for objective normalization and a meaningful criterion function for user preferences, are appropriately reflected and demonstrated through all experiments and discussions in the manuscript.
- The paper provides full proofs for its main theoretical claims regarding the properties of the rank estimator.
- While the paper does not assume stochasticity via datasets, it clearly describes the properties of the ranking estimator in Proposition 3.1 to address variability.
- The paper meticulously outlines the two crucial limitations (L1: semantic incomparability and L2: overwhelming trade-offs) it aims to address, providing relatable analogies like comparing "meters and grams" to explain the incomparability problem.
- The methodology section is logically organized, breaking down the design of the normalization and criterion functions, and the practical approximation of CDFs using rankings.

Weaknesses:
- While the authors state that assumptions and limitations are clear throughout the paper due to space constraints, a dedicated "Limitations" section is absent. The primary limitation identified by the authors is the assumption of a population of models (H) with objectives that are continuous random variables; COPA's direct applicability is limited if these conditions are not met.
- The paper does not provide sufficient information on the computer resources (type of compute workers, memory, execution time) required for reproducing experiments, stating only that they are "extremely lightweight". This falls short of common reproducibility guidelines.
Questions
- We recommend creating a dedicated "Limitations" section, even if concise, to consolidate these discussions.
- please elaborate on: How does COPA perform if the population of models (N) is very small and the CDF approximation via ranking is less reliable? What is considered a "sufficiently large" N for COPA to be effective in practice? What are the specific implications if objectives are discrete rather than continuous? How does the approximation of the CDF via ranking handle this, particularly regarding the "uniform distribution" property and the potential for many ties?
Limitations
Yes.
Final Justification
I am happy with the authors' clarifications to my questions; I would like to bump the quality and clarity scores to 3.5 each.
Formatting Issues
None
We thank the reviewer for their feedback and positive comments. We are especially happy to hear that the paper was clear, well organized and meticulous, and that we successfully demonstrated all our claims throughout the discussion and experiments conducted in the manuscript. Let us then address the concerns shared by the reviewer.
We recommend creating a dedicated "Limitations" section, even if concise, to consolidate these discussions.
We do agree with the reviewer, and we will add a limitations section in the revised version of the manuscript, gathering and summarizing the parts critical to make COPA work in practice (e.g. the number of samples, or the need for continuous random variables).
The paper does not provide sufficient information on the computer resources.
We apologize that this is the case, and will provide more details in the next revision. To be more specific, as COPA does not need to train any model, but only to evaluate already-trained ones, all it needs is to store a matrix of N models by K metrics, sort each column, and compute a norm, which can be done in O(N log N).
In practice, all experiments were run on a single core of a mid-range laptop. This also includes the synthetic experiments and the training of the fairness use case, which we had to explicitly reproduce from the original paper.
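To make the claimed footprint concrete, here is a rough end-to-end sketch (our own illustration, not the authors' code) of the computation: store the N × K metric matrix, rank-normalize each column, aggregate, and pick a model. The plain weighted p-norm below is a stand-in for illustration only and is not the paper's exact criterion of Eq. (11).

```python
import numpy as np

def copa_like_selection(Y: np.ndarray, weights: np.ndarray, p: float = 1.0) -> int:
    """Y: (N, K) matrix of metrics, lower is better. Returns the index of the selected model."""
    N = Y.shape[0]
    order = np.argsort(Y, axis=0)                          # sort each column: O(N K log N)
    U = np.empty_like(Y, dtype=float)
    np.put_along_axis(U, order, np.arange(N)[:, None] / N, axis=0)   # column-wise ranks in [0, 1)
    scores = (weights * U**p).sum(axis=1) ** (1.0 / p)     # illustrative weighted p-norm
    return int(np.argmin(scores))

# Toy usage with two hypothetical metrics (error, energy) and equal weights:
Y = np.array([[0.10, 50.0], [0.08, 300.0], [0.30, 5.0], [0.12, 60.0]])
print(copa_like_selection(Y, weights=np.array([0.5, 0.5])))   # 0: the balanced model
```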
How does COPA perform if the population of models (N) is very small and the CDF approximation via ranking is less reliable?
In App. A.1. we ablated COPA as we vary N, and found it to be quite reliable. Naturally, the estimation will worsen as we reduce N (12 is the lowest we tested), but we encourage the reviewer to reproduce Figure 9 several times by themselves with the provided supplementary material (synthetic-case.ipynb) to gain a hands-on feeling.
What is considered a "sufficiently large" N for COPA to be effective in practice?
Sufficiently large is rather subjective, but we note in the inset figure after Prop. 3.1 that, already for a small sample size, the maximum variance of the estimator is reduced from 0.25 to approximately 0.025, which is quite significant. The smallest population we have experimented with appears in section 5.4, and it works qualitatively well, as the plots lead to conclusions similar to those of the original figures with scaled performance.
What are the specific implications if objectives are discrete rather than continuous? How does the approximation of the CDF via ranking handle this, particularly regarding the "uniform distribution" property and the potential for many ties?
This is unfortunately a difficult question to answer, as the probability integral transform only holds for continuous random variables and the current version of COPA heavily relies on this result. Speculatively, COPA should be adapted by finding a transformation for discrete random variables which turns them into the same common language, as we argue in lines 178-182 for the continuous case. A good starting point could be the work by Gery Geenens in discrete copulas [1], but we defer this direction for future work.
[1] Geenens, G. (2020). Copula modeling for discrete random vectors. Dependence Modeling, 8(1), 417-440.
We hope to have successfully addressed all of the reviewer's concerns and, if that is the case, we hope they will re-evaluate and champion our work during the rebuttal period. If there are further questions, we will be happy to answer them.
Thanks for the detailed answers. I am happy to bump my quality and clarity scores. Good Luck!
This paper starts by assuming that a given set of models h in H have been trained and mapped to K objectives, and that the number of such models is sufficiently large that it would be prohibitive for a decision maker (DM) to traverse the set of often incomparable objectives. These are things like accuracy, energy cost, fairness, and bias metrics. Their goal is to develop a transformation that makes initially incomparable objectives comparable. The name COPA is defined on page 4: "Cumulative-based optimization of the Pareto front". They propose a strategy where the objectives are mapped to a corresponding uk that is equally and uniformly distributed, via the "marginal cumulative distribution function of the kth objective". Focusing on a single objective and dropping the "k", they then discuss the distribution of "yi", where "i" now references the kth objective value corresponding to the ith model in H. Then they can create an ordering based on R(i), which is the normalized count of how many times a value is greater than the others in this list. I believe they assume the samples are i.i.d., and hence they can show that their new definition is an unbiased estimator of the CDF under that assumption. This map then makes "uideal" effectively zero, and they can then focus on minimizing the transformed objectives in place of the original ones, equation 10 on page 5. The norm is the p-norm scaled by a weight vector that sums to 1 and effectively weights the individual objectives; the weights can be seen as a way to implicitly weight the importance of the individual objectives, as described on page 6. They provide a number of numerical results, synthetic and real, showing how COPA provides a way to explore and understand the Pareto front as defined by the COPA strategy.
Strengths and Weaknesses
The paper provides a way to order objectives that is independent of scaling. This is because their metric is defined not in terms of the actual values of the objectives themselves, which may have wildly different scales, but in terms of how the k objectives compare to each other, ordered by the ranking defined in equation 9, page 5. The ability to map the ideal point to 0 under this transformation creates a mechanism for ordering multi-objective Pareto points given a choice of weights.
They have not accomplished their objective of making the life of the DM easier. In general, they are defining a new desirability function. However, it is well known that multi-objective optimization is not reducible, and any attempt to do so will create an implicitly biased order that effectively hides points from the DM. They mention this fact as a weakness in the related work section; however, their approach is fundamentally limited by the same issues.
(1) Different weights will of course give different orderings. The objectives are not comparable, but this implicitly requires the DM to define a ranking, possibly missing a good solution that is filtered out/ranked out of visibility under the new metrics. This is hinted at in the discussion on lines 227-241, where the DM must decide that "the first objective is in the top-75% for the second objective." It is not clear to me that the DM will know how to do this, and they now have a "multi-objective" problem where they must sample different weights to see how this affects the ordering created. It is still a multi-objective problem for the user wrt ω and p.
(2) They are assuming that the ordering of the objectives induces comparability. They do this, I think, by assuming that the objective values are, in a sense, nearly uniformly spread throughout the min/max values for a given objective. Essentially, they are using the "median" statistic in a sense (as they mention). For example, the objectives could be in tight clusters for the first objective, where the clusters have dramatically different values. This means small changes, relatively speaking, to the first objective might translate into big gains in the second objective. But because the ranking is based simply on order, y1i can have an arbitrarily higher score than y1j even if |y1i - y1j| is very small.
(3) They are, in a sense, swapping the problem of incomparable objectives in a "value sense" for a new problem where the internal ranking along one objective may not be comparable to the ranking of another objective, as their distributions may be different. I know they are assuming i.i.d., uniform, etc., but I do not think this is the rule but rather the exception. Further, the DM might assume that this prism for viewing the Pareto front is uniquely better than others when in reality it is one of many ways to investigate multi-objective fronts.
So ultimately the paper provides false hope and a specific biased view on problems that will sometimes be applicable and sometimes not with no warning to the DM for when the assumptions of the approach are violated.
Questions
I would define the COPA acronym earlier. I believe it is first defined in section 3, which in the paper comes after Figure 2, which discusses the synthetic front plot from COPA.
On lines 99 to 103 they talk about the challenges of the DM (needing to select a single model) but describe the need to specify a total ordering. In general, the DM's internal "total ordering" function is not knowable. In general, the user might use many metrics, including the ones in this paper, to "surface" and reduce the number of "choices", then pick one based on internal intuition and for the sake of time. The premise of the paper is to define this function for them. If this were possible, then MO would be reducible to a single objective, which it is clearly not. This is like trying to make a sheet too small for a bed cover all 4 corners. No matter which corners you have covered, you will inadvertently expose at least one. I believe this is a theorem, and this should be understood and conveyed to the user. This paper is a way to order objectives according to a certain criterion that works in "these cases".
They should provide examples where their system fails to highlight its weaknesses, along with its strengths.
The method should be extended to handle clustering of values where the ranking may not be meaningful within the cluster of values.
On line 194 they give an expression that likely renumbers the indices; this renumbering is assumed but not described, which was confusing. E.g., one point may be strictly less than only one other point in the set while another might be the largest, in which case the expression would imply it is not the largest, which is a contradiction.
Some guidance should be given on how the DM might choose the weights when there are more than a few objectives.
Limitations
Yes.
Final Justification
I believe the algorithm and approach are interesting. However, I still feel that the algorithm resurfaces the MO problem by either forcing the user to heavily play with the weights and normalization strategies (to account for bias in the ordering metric) or risk obtaining a skewed view of the results. I will not change my scores based on their feedback, but I appreciate it. That is, it may become easier to make a choice because we have a biased perspective of the results, under which a choice might be easier to make, but not necessarily for the correct reason.
Formatting Issues
None that I can see.
We appreciate the extensive feedback by the reviewer, and we are looking forward to clarifying any concern during the rebuttal period that can encourage the reviewer to re-evaluate our work.
Goal of our work:
First and foremost, we believe there is a misunderstanding regarding the premise of our work based on the following snippets from the review:
The premise of the paper is to define this function for them. If this were possible, then MO would be reducible to a single objective, which it is clearly not.
However, it is well known that multi-objective optimization is not reducible and any attempt to do so will create an implicit biased order that effectively hides points from the DM.
In general, the DM's internal "total ordering" function is not knowable.
It is still a multi-objective problem for the user wrt ω and p.
This paper is a way to order objectives according to a certain criterion that works in "these cases".
which led to the following final statement by the reviewer, which we respectfully but strongly disagree with, and which also saddened us to read regarding our work:
So ultimately the paper provides false hope and a specific biased view on problems that will sometimes be applicable and sometimes not with no warning to the DM for when the assumptions of the approach are violated.
The goal of our approach is not to reduce the MO problem to a scalar problem, but rather to assist the user in making informed and meaningful decisions about the model they select in the MO framework. This is to say that we attempt to normalize all objectives under a common framework (Eq. 5) and a tunable criterion function with interpretable parameters (Eq. 11), so that the user's intentions can be clearly reflected and mapped onto the Pareto front. Of course, the parameters p and ω are an imperfect approximation of what the DM would want; all models are approximations. But we think COPA is a step forward in making such an approximation (even one as simple as COPA's) actionable in a principled way.
We did try to make this clear throughout the paper, e.g., in lines 72-74 where we say: "COPA is a novel approach to allow practitioners to meaningfully navigate the Pareto front, and thus compare and select models that reflect their preferences", but we will further clarify the scope of our work in the camera-ready.
Questions and comments:
Let us now attempt to clarify other parts that may not be clear from our work based on the provided review:
but this is implicitly requiring the DM to define a ranking, possibly missing a good solution that is filtered out/ranked out of visibility under the new metrics.
This is prevented thanks to desideratum D4, "order-preserving" (line 153). Ranks preserve the ordering of each objective, and therefore the Pareto-optimal points wrt the original objectives. Stressing the point above, we do not require the user to fix p and ω (and thus the ordering); they are free to explore as many values as they want.
They are assuming that the ordering of the objectives induces comparability.
We do not assume such a thing. It is well known that a 1D continuous random variable can always be transformed into a standard uniform by applying its cumulative distribution function (lines 171-172) [1, Theorem 2.1.10], which we approximate by taking rankings in Eq. 9. This fact is also known as the universality of the uniform, and we provide an interpretation of why objectives are indeed comparable after this transformation in lines 178-182, as all values now refer to performance relative to the same population.
[1] George Casella and Roger L Berger. Statistical inference. Cengage Learning, 2021.
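For reference, the result being invoked above (the probability integral transform, Theorem 2.1.10 in [1]) can be stated as:

```latex
\[
  \text{If } Y \text{ is a continuous random variable with CDF } F_Y, \text{ then }
  U = F_Y(Y) \sim \mathrm{Uniform}(0,1),
  \quad\text{i.e.}\quad \Pr\big(F_Y(Y) \le u\big) = u \ \text{ for all } u \in [0,1].
\]
```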
They mention this fact [other methods attempt to reduce MO to a scalar problem] as a weakness in related work sections.
We fail to see where we mention what the reviewer refers to. What we say in the related work is that prior approaches fail to address objective incomparability (line 265), and that others instead define ad-hoc approaches to address it in their specific problem instance (line 268). We are happy to further clarify this in the next revision.
They do this, I think, by assuming that the objective values are, in a sense, nearly uniformly spread throughout the min/max values for a given objective.
We do not make this assumption anywhere. Again, transforming a continuous random variable using its own cumulative distribution function will turn the variable into a standard uniform.
Essentially, they are using the “median” statistic in a sense (as they mention).
No, what we say in lines 181-182 is that, after transforming the objectives using Eq. 8, a value of 0.5 semantically corresponds to the median value for all objectives. In other words, a given transformed value refers to the same quantity for all objectives, and objectives are therefore comparable across one another.
the objectives could be in tight clusters for the first objective, where the clusters have dramatically different values for a given cluster. [...] The method should be extended to handle clustering of values where the ranking may not be meaningful within the cluster of values.
This could happen, and we appreciate the feedback and suggestions, which we will consider in future work. However, please note that we do not ask the user to discard nor disregard the original objective values: indeed, they are useful to perform intra-objective comparisons. If the original objective distribution were multimodal, the user should be able to compare the original metrics and see a large discrepancy between values, which can then be further analyzed. This is also why almost all figures but Figure 4 in the manuscript are plotted using the original objectives, and not their transformed values.
They should provide examples where their system fails to highlight its weaknesses, along with its strengths.
We agree with the reviewer, and will gather and extend our assumptions, insights and limitations into a common “Limitations” section in the next revision.
This is likely renumbering the indices, which is assumed but not described, which was confusing.
We are sorry for the confusion; indeed, the expression is a typo, which we wrote in an attempt to clarify what order statistics are. This will be easily fixed for the camera-ready, and we thank the reviewer for noticing it.
Here, the quantities in question are order statistics, and the reviewer can think of them simply as rankings. In standard notation one could write the usual order-statistic expression but, since our notation refers to the inverse mapping, there is no clean way of writing such an expression. We will fix this line by explicitly stating what the notation refers to: the new position of the i-th sample after reordering the array of objectives.
Some guidance should be given how the DM might decide when there are more than a few objectives.
We agree and will add a paragraph to better guide the DM. For example, in section 5.2 we followed a strategy that first groups objectives by what quality they measure (e.g., performance vs. cost), and then assigned a dedicated weight to carbon footprint, splitting the remaining weight among the rest of the metrics. This hierarchical strategy can easily be applied to other scenarios.
We hope to have clarified all the concerns of the reviewer, especially those concerning the scope and overall goal of our work. If this is the case, we would truly appreciate a re-evaluation of our work with this newly-acquired perspective.
Dear Reviewer fEWz, given the authors' response, please raise any remaining questions and/or concerns in a timely fashion so the authors have a chance to reply. Thanks!
This paper introduces COPA, a new framework for multi-objective model evaluation. COPA makes incomparable objectives comparable by normalizing them via their empirical CDF, using rank-based scaling to map all objectives to [0, 1]. The normalized objectives are then aggregated using a weighted p-norm, letting users adjust p to balance robustness versus sparsity and ω to set objective importance. Case studies show COPA enables more principled, intuitive trade-off navigation than existing methods.
Strengths and Weaknesses
pro: The proposed solution is elegant, simple to implement, and grounded in well-established statistical principles (the probability integral transform). The use of desiderata (D1-D4) to frame the problem and solution is a strong organizational choice.
con:
- if you add more models, the whole ranking would change and you would need to recompute everything.
- the results are "relative", not an absolute score like Elo scores which hugely depends on the small circle of models.
- when the metrics are discretized, it is hard to rank them or quantify them. Many models would be on-par. E.g. sense of humor: Yes/No.
Questions
The rank-based normalization is relative to the provided population of models H. How sensitive are the final model selections to the composition of H? For example, if a new set of SOTA models is added to the Open LLM Leaderboard, the ranks of all existing models would shift. Does this volatility undermine the goal of finding a "globally" good trade-off, and are there ways to mitigate this, perhaps by using historical data or defining a more stable reference population?
For a large number of objectives K (you know, we have a lot of benchmarks today and each of them claim they are unique), setting the weight vector ω is non-trivial. Could you elaborate on practical strategies for users in such scenarios?
Limitations
yes
Final Justification
The authors cannot justify the advantage of COPA over absolute scores and rankings given the scenarios that I put forward. It is also not always easy to find a set of weights to weight the scores, which therefore lowers the significance of this work. Thus, I decide to maintain my score.
Formatting Issues
NA
We thank the reviewer for their comments. We are particularly pleased to hear from the reviewer that COPA is an elegant, grounded and simple solution, and that the paper is well organized. We now address the raised concerns:
if you add more models, the whole rank would change and need to recompute everything.
This is true, but note that computing the rank is a cheap computation of O(n log n). If this were a concern, one could still update the rankings by finding the position of the new model in a sorted list in O(log n), and then adding a 1 to all the values after that entry in O(n) at worst. Note that in practice one can speed up this linear factor further via vectorization.
the results are "relative", not an absolute score like Elo scores which hugely depends on the small circle of models.
The results are indeed relative to the population in play, all of them. That is to say, their rank does not depend on a subset of “similar” models, which is what we understand when the reviewer refers to “the small circle of models”.
when the metrics are discretized, it is hard to rank them or quantify them. Many models would be on-par. E.g. sense of humor: Yes/No.
True, but we assume continuous metrics in our problem statement (lines 87-88), so this concern is already addressed. In the case of discrete variables, and hence in the presence of ties, Eq. 9 gives tied values the same (minimum) rank. For example, an array of metrics with ties, [0.2, 0.5, 0.5, 4], would get ranks [0, 1, 1, 3]. This is not an issue for our methodology.
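As a small illustration of this minimum-rank behavior (a sketch assuming SciPy is available, not the paper's code), `scipy.stats.rankdata` with `method='min'` reproduces the example above once shifted to start at zero:

```python
import numpy as np
from scipy.stats import rankdata

y = np.array([0.2, 0.5, 0.5, 4.0])
ranks = rankdata(y, method='min').astype(int) - 1   # tied values share the minimum rank; 0-based
print(ranks)                                        # [0 1 1 3]
```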
How sensitive are the final model selections to the composition of H?
We provide in Prop. 3.1 key statistical properties of the rank estimator (and therefore of the final model selection) with respect to the model population H. Moreover, given i.i.d. samples, the empirical estimator converges uniformly to the true distribution (see the Dvoretzky–Kiefer–Wolfowitz inequality). Finally, we explicitly evaluate the sensitivity of COPA with respect to the model population in App. A.1 of the manuscript, where we find it robust even as we re-sample (and decrease) the number of samples in H.
Does this volatility undermine the goal of finding a "globally" good trade-off, and are there ways to mitigate this, perhaps by using historical data or defining a more stable reference population?
This is an interesting question that deserves a follow-up! Extending COPA to cases with historical data or to online settings is an exciting (but non-trivial) avenue for future work. We first need to establish COPA for the global, non-online setting (as we do in this paper).
For a large number of objectives K [...] setting the weight vector ω is non-trivial. Could you elaborate on practical strategies for users in such scenarios?
We provide in lines 227-235 an interpretation of the weights under our framework with the aim of assisting the decision maker in choosing ω. In a scenario such as the one described by the reviewer, one strategy to simplify the choice of ω can be to group objectives by their overall goal (e.g. some might measure different types of robustness, while others could measure performance) and assign weights to the groups directly. This is what we actually do in, e.g., section 5.2, where we assign a dedicated weight to carbon footprint and split the remaining weight among the rest of the metrics.
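To illustrate this grouping strategy, here is a small sketch with hypothetical groups and budgets (the concrete numbers below are our own example, not the weights used in section 5.2):

```python
# Hierarchical weighting: split the total budget across groups of objectives,
# then split each group's budget uniformly among its members (weights sum to 1).
groups = {
    "cost":        ["carbon_footprint"],
    "performance": ["accuracy", "robustness", "calibration"],
}
group_budget = {"cost": 0.5, "performance": 0.5}   # assumed budgets, for illustration only

weights = {
    metric: group_budget[group] / len(metrics)
    for group, metrics in groups.items()
    for metric in metrics
}
print(weights)   # carbon_footprint: 0.5; each performance metric: ~0.1667
```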
We hope to have addressed the concerns shared by the reviewer, and would appreciate it if the reviewer could update their score accordingly as well as share any remaining comments they may have.
Thank you for your response.
I’d like to further clarify the concern I raised, "the results are "relative", not an absolute score like Elo scores which hugely depends on the small circle of models.".
To illustrate this, consider the recent release of LLaMA-3 and its inclusion in the rankings. As the community begins developing numerous variants of LLaMA-3, the definition of the "comparison pool", or what I referred to as the "small circle of models", becomes increasingly ambiguous. Since the score is derived from rankings within this evolving set, the relative positions of existing models can shift significantly, even if the models themselves have not changed a lot.
This volatility is precisely why researchers often prefer absolute scores when they want stable, direct comparisons, and turn to rankings when they are interested in relative performance within a fixed cohort. In this context, I'm not fully convinced that moving to COPA provides a meaningful improvement, especially if it introduces the same kind of dependency on an evolving model pool.
Another point that I would like to clarify is that it is not very easy for practitioners to figure out a linear weight as preference weighting. For example, if one model is small, I would like to know how fast it can be as long as the performance is above a certain level. Otherwise, if it is a large model, I probably won't care about the speed at all, because it is large and I would have to buy API access for it if I want to use it.
In conclusion, based on the paper and rebuttal, I choose to maintain the original score.
Thank you for the clarification.
We would like, however, to remind the reviewer that the scope of this work is not to derive a score for an evolving set of models, but to perform model selection within a fixed set of trained models in a meaningful manner. To this end, rankings are a great tool, as the reviewer just acknowledged. Our use cases cover many standard cases in ML evaluation and selection, showing that we can satisfactorily systematize what are often ad hoc analyses.
We agree with the reviewer that multi-objective model evaluation and selection is a challenging and time-demanding task. This is precisely why we believe that COPA can significantly ease such a task, even if, as with any other approach, it still requires the involvement of a human to ultimately make a decision. With COPA, the values of p and ω now enjoy a coherent meaning that can guide the decision maker.
As per the examples provided by the reviewer, say there are two objectives, in order, performance and speed:
if one model is small, I would like to know how fast it can be as long as the performance is above a certain level.
This corresponds to the case obtained after constraining the valid models with a threshold (in terms of performance and model size), as we do in Figure 5 for the fairness use case.
Otherwise, if it is a large model, I probably won't care the speed at all because it is large and I have to buy API for it if I want to use it.
This corresponds to the settings shown in Figures 1, 4 and 5. We, however, fail to see how inference speed becomes less significant if the model runs on someone else's computer.
We hope this helps the reviewer better understand our work.
Dear reviewer mDa1, did the authors' additional comment help resolve any outstanding questions? I am also curious about your concern regarding the inference speed (which if done via API may just translate to higher monetary cost).
Dear Area Chair,
I wanted to highlight that human preferences are often complex and interdependent, and therefore cannot always be accurately represented by a small set of weights or a single score.
For example, I might prefer a model that is small and fast, suitable for deployment on a mobile device, even if it performs poorly on reasoning tasks. In that case, I may only need it for simple tasks like information retrieval. Alternatively, I might favor a high-performing model even if it is expensive via API; my company might reimburse the cost as long as it gets the job done easily. A model that lies somewhere in between might not meet my needs at all.
Here are a couple more examples:
- If a model excels at math or reasoning, I may accept weaker multilingual capabilities.
- If a model is strong in multilingual understanding, I might prefer it for document summarization, where math skills are less relevant.
In short, a single aggregated score is unlikely to capture such nuanced preferences. From a product development perspective, a model that is moderately good at everything but excels at nothing may ultimately lack a compelling use case. A Pareto front, by contrast, would offer a more informative and flexible view. Whether preferences are interdependent or lexicographically ordered, presenting a Pareto front allows practitioners to make more meaningful trade-offs based on their specific needs.
Best regards,
Dear reviewer,
If we may, we would like to clarify a couple of points from your last response:
a single aggregated score is unlikely to capture such nuanced preferences. [...] A Pareto front would offer a more informative and flexible view [...] and allow practitioners to make more meaningful trade-offs.
We completely agree with the reviewer, which is why we would like to remark that this is precisely what we do in our submission. We do not advocate using a single aggregated score, but to indeed meaningfully navigate the space of trade-offs provided by the Pareto front (as the title of our work suggests). What COPA provides is a more consistent meaning to the Pareto front, by making it independent of the way each objective was measured, and instead talk about performance relative to a fixed population (lines 178-182). Then, our criterion function eases the task of the decision maker of selecting a point based on these semantics (which is necessary, e.g., when the Pareto front cannot be easily visualized).
For example, I might prefer...
As for the examples provided by the reviewer, please do not underestimate the "equally weighted" solution when done right, which, as we show in our use cases (e.g. Figure 1), can lead to well-performing yet efficient solutions. In the same way the reviewer leaned toward extreme solutions, implying that middle-ground solutions may not suit one's needs at all, one could create a similar argument by adding to the population extremely quick but poorly-performing models (e.g. a constant or linear predictor) and arguing that a middle-ground solution is preferable.
Regarding the specific examples, COPA offers a way to talk about "what is acceptable by the user" in a natural way. For example, they could stipulate that maths and reasoning benchmarks are the most important ones (with a weight of 0.8), and then that they are ok if the model is not as good in multilingual benchmarks (say, a weight of 0.2, meaning that there are 20% of models that still perform worse). Again, if the user is not happy with their choice, COPA encourages them to meaningfully tune these parameters until they find a model that works for them but, in all cases, their choices of parameters will be independent of the units each objective was measured in.
We sincerely hope that this response helps the reviewer better understand the scope and goal that we aim to achieve with our work. Otherwise, we are happy to provide further answers until the end of the rebuttal period.
This paper considers comparing models with respect to multiple objectives on different scales. The paper proposes a normalization technique to scale each metric by using the approximate CDF of the metric. This makes the objectives comparable so that they can be aggregated according to user preferences via a weighted scalarization. The paper showcases the method on a variety of model evaluation and comparison use cases.
Strengths and Weaknesses
Strengths
- The method is simple and intuitive.
- The interpretability of the parameters in the scalarization seems valuable for users.
Weaknesses
- Much of the approach is akin to "How to Make Multi-Objective Evolutionary Algorithms Invariant to Monotonically Increasing Transformation of Objective Functions" Yamada et al 2024, which also applies empirical CDFs individually to each objective and takes a weighted scalarization, in the context of multi-objective optimization. In light of this, the contribution of this work is largely applying this technique to model comparison, insight of population selection in evolutionary algorithms
- The baselines are limited. Are there other normalization techniques to consider?
Questions
- How should a user consider choosing p?
- L294: "finding a proper ω could prove challenging". Can you elaborate on that?
Limitations
yes
Final Justification
Thanks for the response. I've read the other reviews as well, and I opt to keep my score as is. The contribution of applying the CDF transform to model comparison, rather than to individual comparison within a population in the context of evolutionary optimization, is very small. I don't think this warrants a publication. Furthermore, I agree with the concerns raised by other reviewers regarding the difficulty for decision makers to specify preferences in scenarios with many objectives.
Formatting Issues
none
We thank the reviewer for their feedback and comments. Let us next attempt to resolve all the concerns raised and, if we correctly address them, we would appreciate it if the reviewer could reconsider their evaluation of our work.
Much of the approach is akin to "How to Make Multi-Objective Evolutionary Algorithms Invariant to Monotonically Increasing Transformation of Objective Functions" Yamada et al 2024.
We appreciate the reference provided by the reviewer, which we were unaware of and will add to the next revision of the manuscript. However, while we see the similarities, we respectfully disagree that the two approaches are so similar as to disregard our methodological contributions; i.e., the paper should not be rejected on these terms.
First, we compute the empirical CDF of the marginal (i.e. wrt. all samples), while the work by Yamada et al. computes that of the non-dominated solutions. This leads to two different estimators, one being computed on a subset of the samples of the other. Whether it is preferable to define Eq. (1) wrt. all or only Pareto-optimal solutions is an interesting question beyond the scope of this work. Second, our criterion function in Eq. (11) subsumes those considered by Yamada et al., as they only consider a special case (in their notation). Note that their criterion corresponds to the usual weighted p-norm and not to Eq. (11), which leads to significantly different results; see Figure 10 in our appendix.
The contribution of this work is largely applying this technique to model comparison, insight of population selection in evolutionary algorithms.
Disregarding the technical differences mentioned above, we fail to see where the insights mentioned above can be found in the reference provided and would appreciate pointers. Moreover, we would like to remind the reviewer that not all contributions of a work need to be technical. Our work highlights an existing, prevalent, and underexplored problem in multi-objective ML model evaluation and provides a simple and intuitive approach to address it, as acknowledged by the reviewer.
This is, on its own, an important contribution but, more remarkably, we show how to adapt our assumed setting to more unconventional scenarios where the adoption of COPA can have a real and significant impact, as acknowledged by reviewer Wm5j. For example, we show how objective incomparability can bias the conclusions drawn in research works in areas such as multitask learning and domain generalization (section 5.3).
The baselines are limited. Are there other normalization techniques to consider?
As mentioned above, this is an underexplored problem. There are other simple but general normalization functions that we are aware of (e.g. inter-quantile normalization), and we will be happy to add and evaluate them, as well as any specific normalization technique the reviewer could point to, if the reviewer can provide a concrete recommendation.
More often, we find that authors come up with ad-hoc normalization techniques to try and avoid issues arising from the incomparability of metrics, as we point out in the paper, see lines 267-271 or App. A.6 where we explicitly give all the normalization functions employed in DecodingTrust.
How should a user consider choosing p?
We try to provide advice and intuition on how to choose p in lines 236-241, 248-252 as well as in section 5.1. In short, larger values of p focus on maximizing individual metric performance (and thus robustness in the min-max sense), while lower ones focus on aggregated performance (e.g. a value of one corresponds to a weighted average performance). We recommend initially choosing p based on the user expected robustness, and explore other solutions by nudging this initial value until satisfied, since evaluating COPA after normalizing once is linear in the number of models.
L294: "finding a proper ω could prove challenging". Can you elaborate on that?
Of course, let us clarify this statement. With p = ∞, we gain finer control through ω on the solution found, as this formulation can find any Pareto-optimal solution (lines 246-247). A good analogy is to think of the solutions found in this case as those obtained by projecting a ray from the origin in the direction determined by ω. Therefore, if we have many objectives (K is large) and we are interested in one particular solution, tuning ω to find said solution can be time-consuming, as we would need to carefully tune each of its K components.
We hope to have clarified all the concerns of the reviewer, especially those concerning the contributions of our work. If this is the case, we would truly appreciate it if the reviewer could update their score accordingly. Thanks again for the feedback, and we remain at the reviewer's disposal in case there are further questions.
Dear Reviewer N88d, given the authors' response, please raise any remaining questions and/or concerns in a timely fashion so the authors have a chance to reply. Thanks!
Thanks for the response. I've read the other reviews as well, and I opt to keep my score as is.
Disregarding the technical differences mentioned above, we fail to see where the insights mentioned above can be found in the reference provided and would appreciate pointers.
Apologies, that was a typo: "insight" should be "instead". My point was that the contribution of applying the CDF transform to model comparison, rather than to individual comparison within a population in the context of evolutionary optimization, is very small. I don't think this warrants a publication.
Furthermore, I agree with the concerns raised by other reviewers regarding difficulty for decision makers to specify preferences in scenarios with many objectives.
Whether it is preferable to define Eq. (1) wrt. all or only Pareto-optimal solutions is an interesting question beyond the scope of this work.
This does not seem beyond the scope of this work, as this work explicitly argues that the empirical CDF is a good normalization function. The function itself depends on the data being used to approximate the CDF. The difference between using all data and only data that are Pareto-optimal can be significant, so comparing these two options thoroughly seems important for the paper, in my opinion.
We thank the reviewer for their response.
At this point, we would like to remind the following statement from the NeurIPS Reviewer Guidelines (which we cannot directly link in our response):
originality does not necessarily require introducing an entirely new method. Rather, a work that provides novel insights by evaluating existing methods, or demonstrates improved efficiency, fairness, etc. is also equally valuable.
Our work highlights an important oversight that systematically occurs in ML model evaluation when there are multiple objectives, and provides a simple yet principled approach that overcomes objective incomparability. This is extensively demonstrated in all our use cases, showing that such problems appear for final deployers (case 1), researchers (case 2) and analysts (case 3). Our work is thus of practical significance to the field, as acknowledged by reviewer Wm5J and, for example, we demonstrate that conclusions drawn in MTL and domain generalization (case 2) are biased by objective incomparability, which will ultimately have an impact on future research based on said works.
Furthermore, I agree with the concerns raised by other reviewers regarding difficulty for decision makers to specify preferences in scenarios with many objectives.
These are valid concerns, but they are fundamentally inherent to the multi-objective setting itself. As we have reiterated during the rebuttal, what COPA does is provide a consistent meaning to the parameters of the criterion function (p and ω) to ease (not solve) the decision maker's task of specifying their personal preferences. With COPA, DMs can interpret the transformed values as quantiles and later come back to the original objective values to verify their selection. Also, as described in lines 227-249, the values of ω represent ratio trade-offs of comparable quantities (quantiles, lines 178-182), and p specifies the level of robustness desired (going from aggregated average performance at p = 1 to individual min-max performance as p grows).
The difference between using all data and only data that are Pareto optimal can be significant, so comparing these two options thoroughly seems important for the paper in my opinion.
We appreciate the reviewer's input and will add, at their request, a new section where all figures in the main paper are reproduced using the empirical CDF on only Pareto-optimal points, as well as a discussion of the differences between both choices. COPA can still be applied with this alternative empirical CDF and, indeed, we obtain similarly meaningful Pareto fronts as in Figure 1 when using only Pareto-optimal points. However, using only Pareto-optimal points brings two major issues: 1) the number of samples is substantially reduced; and 2) as all points are Pareto-optimal, for each objective their transformed values will be evenly spaced, at increments of 1/n (with n the number of Pareto-optimal points). As a result of the latter, the Pareto front often resembles a hyperplane with a slope of -1 (y = -x in 2D) and may not yield the desired results. For example, for some values of p the extreme points will always be optimal for any value of ω, which is not desirable in general. Alongside the new figures, we will adapt the submitted notebooks to let readers easily check these differences themselves.
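As a quick, self-contained illustration of the evenly-spaced effect described above (our own synthetic sketch, not the promised new figures):

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.random((200, 2))                 # 200 synthetic models, 2 objectives (minimization)

def pareto_mask(Y: np.ndarray) -> np.ndarray:
    """Boolean mask of non-dominated rows for a minimization problem."""
    mask = np.ones(len(Y), dtype=bool)
    for i, y in enumerate(Y):
        if mask[i]:
            dominated = np.all(Y >= y, axis=1) & np.any(Y > y, axis=1)
            mask &= ~dominated
    return mask

def rank_normalize(cols: np.ndarray) -> np.ndarray:
    order = np.argsort(cols, axis=0)
    U = np.empty_like(cols, dtype=float)
    np.put_along_axis(U, order, np.arange(len(cols))[:, None] / len(cols), axis=0)
    return U

P = Y[pareto_mask(Y)]
u0 = np.sort(rank_normalize(P)[:, 0])
print(np.allclose(np.diff(u0), 1.0 / len(P)))   # True: values evenly spaced at 1/n_pareto
```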
We hope to have addressed the reviewer’s concerns, and we remain at their disposal otherwise.
This paper introduces COPA (Cumulative-based optimization of the Pareto front), a method for multi-objective model evaluation that addresses the challenge of comparing models across objectives measured in incompatible units and scales. The authors argue that by normalizing objectives using their empirical cumulative distribution functions (CDFs) approximated through relative rankings, practitioners can make meaningful comparisons across disparate metrics. The method transforms objectives to a uniform distribution on [0,1] and uses a weighted p-norm criterion function to aggregate normalized objectives while preserving Pareto-optimality. The authors illustrate their approach across five use cases spanning fair ML, domain generalization, AutoML, and foundation models.
Strengths:
- Motivation and relevance (this is an important practical problem)
- Clarity of discussion and presentation
- Method is simple and computationally efficient (provided objectives have been collected)
- Case studies on ML model evaluation are illustrative and helpful
Weaknesses:
- Methodological contribution is very limited
- Limited discussion of alternative normalization approaches and related work
- Selecting parameters for the proposed method remains challenging for decision makers
- Empirical evaluation lacks additional baselines
Reviewers generally agreed that the problem studied is important and well-motivated, and that the presentation of the work is clear. The primary concerns raised by reviewers were the limited methodological novelty (relative to Yamada et al. 2024) as well as the insufficient guidance for practitioners on selecting the parameters p and ω in high-dimensional settings. During the rebuttal, the authors engaged thoroughly with these points. While they were able to clarify some questions and technical aspects of the work, the authors were unable to resolve reviewers' concerns about fundamental limitations regarding methodological novelty and preference specification challenges.
There was some confusion, with reviewer mDa1 appearing to have a misconception about the scope of providing absolute vs. relative rankings. The ranking-based normalization can easily be recomputed as new models are added to the set, so I don't think this is a fair critique. Now, the question of providing absolute scores that do not require this is valid and interesting, but does seem outside the scope of this work. I agree with the concern of reviewer fEWz that simply applying the ranking and optimizing based on that could cause issues in settings where some objectives are very similar and differences between them are more or less negligible. Then using the empirical CDF could end up putting high weight on differences that are immaterial. I don't think this is a fundamental issue for the method (it could e.g. be addressed by filtering the objectives to those where differences are meaningful), but it is something that should be discussed in the paper.
Overall, I am somewhat on the fence about this paper. I fully second the concerns about it being challenging for decision makers to specify complex preferences in many-objective settings via the p and ω parameters, a clear weakness of the approach. However, I am also sympathetic to the authors attempting to provide a (however flawed) solution to this challenge, rather than throwing their hands up in the air and regarding this as an unsolvable task, given that this challenge is universal and not specific to their approach.
That said, I would have expected a more thorough comparison with alternative approaches to deal with this challenge of specifying preferences - such as some of the other works referenced by reviewers and additional work in the space of preference learning (which has not come up during the reviews), in which complex preferences are inferred, e.g. from simpler pair-wise comparison feedback that is generally easier to provide for the decision maker. The paper does a nice job explaining the challenges of navigating and specifying preferences in multi-objective settings in general, but a more thorough discussion of existing approaches appears needed.
While section 5 provides a helpful illustration of how the proposed approach can be utilized in practice, it doesn't really answer the question of how to select the parameters p and ω. But I found it helpful in that it provides some intuition for how this would work in practice and how a decision maker could iterate and try out different choices to home in on trade-offs that are to their liking, rather than being overwhelmed by looking at the raw metrics for all models or options under consideration. I did have some issues with clarity here, though (e.g., it's not clear what the "density task" is in Sec 5.3).
Considering only the above, I would have considered this a borderline submission. But I also agree with the additional concerns raised (primarily by reviewer N88d) about limited methodological novelty and the lack of discussion of previous work on using empirical CDFs to normalize different objectives for better comparability. In light of that, I don’t think the contributions of the paper are sufficiently strong for publication at NeurIPS.
However, I do commend the authors on their attempt to tackle an inherently challenging problem and encourage them to incorporate the constructive feedback from the reviewers to improve their work, in particular by extending the discussion of how the proposed approach relates to existing works surfaced during the discussion, as well as providing improved guidance for how decision makers would leverage their approach in practice.