PaperHub
Rating: 6.3/10 · Poster · 3 reviewers (scores 3, 3, 4; min 3, max 4, std 0.5)
ICML 2025

Bipartite Ranking From Multiple Labels: On Loss Versus Label Aggregation

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-07-24

Abstract

Keywords

learning to rank · area under the curve · multi-objective optimization

Reviews and Discussion

Official Review

Rating: 3

The paper studies the bipartite ranking problem in a multi-label setting, comparing two approaches: one based on loss aggregation and the other on label aggregation. It shows that loss aggregation can result in label dictatorship.

Questions for Authors

In lines 251-260, the authors state: "Practically, this suggests that the choice of the mixing weights $a_k$ determines the extent to which the optimal scorer favors one label over the others. More precisely, the optimal scorer favors labels that are skewed ($\pi^{(k)}$ is away from 0.5) over labels that are balanced ($\pi^{(k)} \approx 0.5$). As we shall see in §6.1, this can lead to undesirable behaviour for the loss aggregation objective, unless one commits to a very specific choice of mixing weights $a_k$. This is unexpected: one might naturally assume that a uniform weighting ($a_k = 1$) would induce scorers that treat all labels equitably."

I am not sure I agree with the authors' assessment here. The term $\pi^{(k)}(1-\pi^{(k)})$ is the variance of the label distribution for category $k$. This means that the Bayes-optimal classifier aggregates individual Bayes-optimal classifiers by weighting them inversely to their variances. Since the framework is Bayesian, I would expect the prior probabilities to carry meaningful information. If $\pi^{(k)} \approx 0.5$, then the label is almost entirely noise and contains little to no useful signal. In such a case, is it not reasonable for the Bayes-optimal scorer to put less weight on $Y^{(k)}$?

Given this, I would like the authors to clarify why the implication of Proposition 6.1 is considered problematic. In the scenario where class probabilities are either $0$ or $1$ and $\alpha^{(1)} > \alpha^{(2)}$, the emergence of a dictatorial label seems natural. Do you have a practical example where such a scenario occurs, but a dictatorial label is undesirable? Perhaps you could construct an example within your running framework of information retrieval, demonstrating a case that mathematically aligns with Proposition 6.1 (with deterministic labels) where, ideally, we would not want a dictatorial label to emerge, yet it does.

I am having difficulty understanding Proposition 5.3, Proposition 5.4, and the surrounding discussion. What does it mean to aggregate labels by summing them, i.e., $\sum_k Y^{(k)}$? Are the individual labels $Y^{(k)}$ binary? If so, their sum may no longer belong to $\{0,1\}$. On the other hand, if the labels are multiclass, how does summation apply in this case? For example, what does adding the labels "cat" and "dog" yield?

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes

Theoretical Claims

I checked proofs of Propositions 6.1, 5.2, and 5.4.

Experimental Design and Analysis

Yes. I did not find any issues in the experimental designs.

Supplementary Material

Proofs of Propositions 6.1, 5.2, and 5.4.

Relation to Prior Work

The paper advances a line of work on learning with multiple labels.

Missing Important References

I am not aware of any.

Other Strengths and Weaknesses

N/A

Other Comments or Suggestions

N/A

Author Response

Thank you for your positive feedback and insightful questions. We clarify below:

1. Regarding Interpretation of Prop 6.1 / Loss Aggregation Weighting & Practical Example:

Re: Prop 6.1 & loss weighting. Why problematic if $\pi^{(k)} \approx 0.5$? Isn't less weight on noisy labels reasonable? Clarify problematic nature & provide practical example where dictatorship is undesirable.

We clarify that $\pi^{(k)}$ represents the overall balance or prevalence of label $k$ across the dataset, not the conditional uncertainty or noise $P(Y^{(k)}=1 \mid x)$ for a specific instance $x$. A label can be perfectly balanced ($\pi^{(k)}=0.5$) while being conditionally deterministic (zero noise). For example, with instances $x_1, x_2$ where $\eta^{(1)}(x_1)=1, \eta^{(1)}(x_2)=0$, we have $\pi^{(1)}=0.5$ (balanced marginal) but zero conditional label noise.

Therefore, the loss aggregation weighting $1/[\pi^{(k)}(1-\pi^{(k)})]$ favors marginal imbalance (skewness), not necessarily conditional signal quality. Our concern is that this skew-based weighting may conflict with practical goals.

Illustrative IR example: Assume $Y^{(1)}$ = 'relevance' is balanced ($\pi^{(1)} \approx 0.5$) and $Y^{(2)}$ = 'is recent' is highly skewed ($\pi^{(2)} \approx 0$). Even if both relevance and recency signals are perfectly clean (conditionally deterministic, $\eta^{(k)}(x) \in \{0, 1\}$), loss aggregation with uniform $a_k=1$ yields $\alpha^{(2)} \gg \alpha^{(1)}$, making 'is recent' dominate the ranking (Prop 6.1). If the user cares primarily about relevance, but the system heavily prioritizes recency regardless of relevance due to skew, this "dictatorship" is undesirable. The system optimizes for marginal imbalance, ignoring the balanced relevance signal.
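To make the skew effect concrete, here is a minimal numerical sketch of the effective weights $\alpha^{(k)} = a_k / [\pi^{(k)}(1-\pi^{(k)})]$ from Prop 5.2 under uniform $a_k = 1$; the specific priors below are hypothetical values chosen to mirror the relevance/recency example, not numbers from the paper:

```python
# Effective weights alpha^(k) = a_k / [pi^(k) (1 - pi^(k))] in the
# loss-aggregation Bayes-optimal scorer (Prop 5.2), with uniform a_k = 1.
# The priors below are hypothetical, chosen to mimic the IR example.

def effective_weight(a_k: float, pi_k: float) -> float:
    """Weight placed on eta^(k)(x) by the loss-aggregation optimal scorer."""
    return a_k / (pi_k * (1.0 - pi_k))

pi_relevance = 0.5   # balanced label: 'relevance'
pi_recency = 0.01    # highly skewed label: 'is recent'

w_rel = effective_weight(1.0, pi_relevance)  # 1 / 0.25 = 4.0
w_rec = effective_weight(1.0, pi_recency)    # 1 / 0.0099, roughly 101

# Despite uniform mixing weights, the skewed 'is recent' label receives
# roughly 25x the weight of the balanced 'relevance' label.
print(w_rel, w_rec)
```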

This issue persists even with non-deterministic labels (Prop 5.2), where the optimal scorer remains sensitive to the marginal priors $\pi^{(k)}$. Our experiments (Section 7) on data with varying skewness provide empirical support.

Action: We will revise Section 5.1 and Section 6.1 to clearly distinguish marginal imbalance ($\pi^{(k)}$) from conditional label distributions ($\eta^{(k)}(x)$). We will incorporate an illustrative example and refine the IR explanation to clarify why favoring labels based on marginal skew $\pi^{(k)}$ can be practically undesirable.

2. Regarding Understanding Label Aggregation by Summation (Prop 5.3/5.4):

Re: Prop 5.3/5.4. Difficulty understanding label summation $\sum_k Y^{(k)}$. Are $Y^{(k)}$ binary? The sum is not in $\{0,1\}$. How does this compare to multi-class label addition (e.g., "cat" + "dog")?

Thank you for requesting clarification.

  • Setup: As introduced in Section 3.1, our setting involves $K$ distinct binary labels $Y^{(1)}, \ldots, Y^{(K)}$ for input $x$, where $Y^{(k)} \in \{0, 1\}$. These represent different facets or signals (e.g., clicks vs. ratings mentioned in Sec 1, 3.3). We aim to learn a single scorer $f : \mathcal{X} \rightarrow \mathbb{R}$.
  • Summation: In Section 4.2 (specifically for Prop 5.3 and 5.4), we use the aggregation $\psi(Y^{(1)}, \ldots, Y^{(K)}) = \sum_{k=1}^K Y^{(k)}$. Since each $Y^{(k)}$ is binary, the sum $\bar{Y}$ is an integer in $\{0, 1, \ldots, K\}$, representing the count of positive labels.
  • Resulting Label: You are correct; this sum $\bar{Y}$ is generally not binary (it is ordinal).
  • Handling the Sum: This is handled by treating the task as multipartite ranking with respect to this ordinal label $\bar{Y}$. We optimize the multipartite AUC defined in Definition 2.3 / Eq. 5 using $\bar{Y}$.
  • Role of Costs (Crucial): Prop 5.4's key insight relies on using specific costs $c_{\bar{y}\bar{y}'} = \mathbf{1}(\bar{y} > \bar{y}') \cdot |\bar{y} - \bar{y}'|$ within the multipartite AUC objective (Eq. 5). Under this specific cost structure, the Bayes-optimal scorer simplifies remarkably to $f^*(x) \propto \sum_{k=1}^K \eta^{(k)}(x)$ (Eq. 7). As shown in the proof, this occurs because $E[\bar{Y} \mid X=x] = \sum_k E[Y^{(k)} \mid X=x] = \sum_k \eta^{(k)}(x)$, and these specific costs make $E[\bar{Y} \mid X=x]$ the optimal scorer (per Uematsu & Lee, 2015). This simplification does not generally hold for other costs (like $c_{\bar{y}\bar{y}'} = 1$ used for Prop 5.3).
  • Distinction from Multi-Class: This is fundamentally different from having mutually exclusive multi-class labels (like "cat", "dog"). We are summing binary indicators related to the same instance, not arithmetically combining distinct semantic categories.
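As a concrete sketch of the setup described above (a toy illustration; the array shapes, random labels, and stand-in probabilities are ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: n instances, each with K = 3 binary labels.
n, K = 5, 3
Y = rng.integers(0, 2, size=(n, K))  # Y[i, k] in {0, 1}

# Label aggregation: the ordinal target is the count of positive labels.
Y_bar = Y.sum(axis=1)                # integer in {0, 1, ..., K}

# Under the costs of Prop 5.4, the Bayes-optimal scorer is proportional
# to sum_k eta^(k)(x); here we use stand-in class-probability values.
eta = rng.uniform(size=(n, K))       # stand-in for eta^(k)(x)
f_star = eta.sum(axis=1)             # optimal score per instance
ranking = np.argsort(-f_star)        # instances ranked by descending score
```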

Action: We will add text clarifying that $Y^{(k)}$ are binary, that $\bar{Y}=\sum_k Y^{(k)}$ serves as an intermediate ordinal target for multipartite ranking, and emphasize the critical role of the specific cost function $c_{\bar{y}\bar{y}'} = |\bar{y} - \bar{y}'|$ in deriving the clean result of Prop 5.4.

Please let us know if any further clarification is needed.

Reviewer Comment

I thank the authors for their detailed responses to my questions. Both of the proposed action items sound reasonable to me, and I have no concerns regarding the correctness of the work.

That said, I will maintain my original assessment of the paper as a weak acceptance.

Official Review

Rating: 3

The paper formulates a new problem, where each instance is associated with multiple labels and the goal is to find a ranking from these labels. To address this problem, the paper derives the Bayes-optimal scorers for two commonly used objectives, loss aggregation and label aggregation, and proves Pareto-optimality of these two methods. The effectiveness of the proposed methods is validated by empirical studies.

Questions for Authors

How can the theoretical results guide the design of the method?

Claims and Evidence

Most claims in the paper are supported by the theoretical proof or empirical results.

Methods and Evaluation Criteria

The paper focuses on the AUC metric, which is commonly used in ranking problems. From both theoretical and empirical perspectives, the paper shows positive results for the proposed method.

Theoretical Claims

This is not my area of focus, so it is difficult for me to check the correctness of the proofs.

Experimental Design and Analysis

The paper mainly performs experiments on a toy dataset and two realistic datasets. Although it focuses on the theoretical properties of two loss functions, it is suggested to perform experiments on many more datasets from diverse domains.

Supplementary Material

There is no supplementary material.

Relation to Prior Work

The paper focuses on a general problem, ranking, which is relevant to multiple research communities.

Missing Important References

The relevant works are fully discussed in the paper.

Other Strengths and Weaknesses

Strengths:

The paper provides rigorous theoretical results for their claims.

Weakness:

The paper does not provide a detailed discussion of the application scenarios of the method, nor does it explain how the theoretical results guide the design of the method.

Although this is a rather theoretical work, the results presented in the paper are still insufficient. The paper should conduct experiments on more datasets.

Other Comments or Suggestions

The font in the figures should be adjusted to match the text in the main body.

Author Response

Thank you for your positive feedback and insightful questions. We address the specific points you raised below.

1. Regarding Theory Guiding Design:

"...nor does it explain how the theoretical results guide the design of the method." "How can the theoretical results guide the design of the method?"

Thank you for this important question on the practical implications of our theory. Rather than using theory to inform the design of new methods, our paper follows a different structure: we start with an established goal, Pareto-optimality, describe two methods for achieving that goal (label aggregation and linear scalarization), and use the Bayes-optimality framework to analyze whether they achieve it. The theoretical results thus do not serve to construct methods; rather, they allow us to analyze the proposed methods.

That being said, our analysis does lead to practical conclusions. First, we introduce a novel criterion that is not captured by Pareto optimality, the "dictatorship issue", which a practitioner may want to keep in mind. Second, as demonstrated by our experiments on both synthetic and real-world data, one should be careful when choosing linear scalarization, as it suffers from the dictatorship issue, contrary to label aggregation techniques.

2. Regarding Application Scenarios:

"The paper does not provide a detailed discussion of the application scenarios of the method..."

We appreciate this point and agree that a more elaborate discussion of application scenarios would strengthen the paper. The applications of multi-objective learning to rank are widespread. We briefly mentioned potential areas like information retrieval and medical diagnosis in the Introduction and used a motivating example from information retrieval (relevance vs. engagement trade-off) in Section 3.3. To further elaborate, in the revised manuscript, we will add a dedicated subsection to discuss potential real-world applications in greater detail. This will include more concrete examples, such as:

  • Multi-faceted Information Retrieval: Ranking documents based on relevance to different query interpretations or aspects (e.g., topicality, freshness, geographical relevance). [R1]
  • Recommendation Systems: Recommending items (e.g., products, movies, news articles) by balancing multiple objectives like predicted user click/purchase probability, relevance to long-term interests, promotion of diversity, or fairness considerations across item groups. [R2]
  • Computational Advertising: Ranking ads based on predicted click-through rate and predicted conversion rate. [R3]

For each scenario, we will briefly discuss why synthesizing multiple labels is necessary and how the choice between loss aggregation and label aggregation (informed by our analysis of Pareto optimality and the "dictatorship" issue) might impact the final ranking outcome. We believe this expanded discussion will better illustrate the practical relevance and potential impact of our work.

[R1] Perkio, Jukka, et al. "Multi-faceted information retrieval system for large scale email archives." The 2005 IEEE/WIC/ACM International Conference on Web Intelligence (WI'05). IEEE, 2005.

[R2] Zheng, Yong, and David Xuejun Wang. "A survey of recommender systems with multi-objective optimization." Neurocomputing 474 (2022): 141-153.

[R3] Wang, Xuewei, et al. "Towards the Better Ranking Consistency: A Multi-task Learning Framework for Early Stage Ads Ranking." AdKDD (2023).

Official Review

Rating: 4

This paper investigates the problem of bipartite ranking with multiple binary labels, comparing two approaches: loss aggregation and label aggregation. The authors provide a theoretical analysis of the Bayes-optimal solutions for both methods and empirically validate their findings. Extensive experiments have been conducted.

Questions for Authors

NA

Claims and Evidence

NA

Methods and Evaluation Criteria

NA

Theoretical Claims

Key theoretical results (Theorems 1–3 in the main text and Appendices A–B) are provided. Theorem 1 (Bayes-optimality of loss aggregation) assumes deterministic labels (Equation 3). However, most labels in the real world are noisy. Theorem 3 (label aggregation avoids dictatorship) is compelling but relies on the Pareto-optimality definition.

Experimental Design and Analysis

The impact of label correlations (e.g., overlapping vs. orthogonal labels) on aggregation methods is not studied.

Supplementary Material

Proofs of the theorems and additional experiments are provided. The proofs of the theorems are logically sound.

Relation to Prior Work

A discussion of the multi-objective optimization literature might be beneficial.

Missing Important References

NA

Other Strengths and Weaknesses

Strengths:

  1. The motivation is clear.
  2. Extensive theoretical analysis has been conducted.

Weaknesses:

  1. Could you test label aggregation on a dataset with anti-correlated labels (e.g., one label's positives are another's negatives) to assess robustness?
  2. How sensitive are the results to the choice of mixing weights (e.g., uniform vs. task-specific) in label aggregation?

Other Comments or Suggestions

NA

Author Response

Thank you for your positive feedback and insightful questions. We address the specific points you raised below.

1. Anti-correlated Labels:

"test label aggregation … with ‌anti-correlated labels‌"

Thank you for this insightful question. We show below that the optimal scorer does depend on the correlation between the labels; for the special case of anti-correlated labels, we get no useful signals, resulting in a constant scorer.

Specifically, Proposition 5.4 shows the optimal scorer for label aggregation (cost $c_{\bar{y}\bar{y}'} = |\bar{y} - \bar{y}'|$) ranks by $f^*(x) \propto \sum_k \eta^{(k)}(x)$. For $K=2$, this is equivalent to ranking by $\delta(x) = p(Y_1=1, Y_2=1 \mid x) - p(Y_1=0, Y_2=0 \mid x)$. The derivation below shows $f^*(x) \propto \delta(x)$.

(Derivation: From Prop 5.4, the scorer is $\eta_1(x)+\eta_2(x)$. We have $\eta_1(x) = p(Y_1=1, Y_2=0 \mid x) + p(Y_1=1, Y_2=1 \mid x)$ and $\eta_2(x) = p(Y_1=0, Y_2=1 \mid x) + p(Y_1=1, Y_2=1 \mid x)$. Summing them gives $\eta_1(x) + \eta_2(x) = p(Y_1=1, Y_2=0 \mid x) + p(Y_1=0, Y_2=1 \mid x) + 2\,p(Y_1=1, Y_2=1 \mid x)$. Since $p(Y_1=1, Y_2=0 \mid x) + p(Y_1=0, Y_2=1 \mid x) + p(Y_1=1, Y_2=1 \mid x) + p(Y_1=0, Y_2=0 \mid x) = 1$, the sum simplifies to $1 - p(Y_1=0, Y_2=0 \mid x) + p(Y_1=1, Y_2=1 \mid x) = 1 + \delta(x)$.)

Thus the optimal scorer depends on label correlations via $\delta(x)$. Let's consider the implications:

  • Overlapping ($Y_1=Y_2$): $\delta(x) \propto \eta_1(x)$, so the ranking uses $\eta_1(x)$, which is sensible.
  • Anti-correlated ($Y_1 = 1-Y_2$): $\delta(x) = 0$. The scorer $f^*(x)$ is constant, yielding no ranking. This is unsurprising, as the aggregated label $Y_1+Y_2=1$ is constant, making the AUC objective ill-defined.
  • Mild anti-correlation: Here, both $p(Y_1=1, Y_2=1 \mid x)$ and $p(Y_1=0, Y_2=0 \mid x)$ would be small, and the ranking depends on their difference via $\delta(x)$. For example, when $\delta(x) > 0$, it is more likely that both labels are 1 than 0, resulting in $x$ being ranked higher.
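The identity $\eta_1(x) + \eta_2(x) = 1 + \delta(x)$ and the anti-correlated case can be checked numerically; the joint probabilities below are arbitrary illustrative values, not from the paper:

```python
# Check that eta_1(x) + eta_2(x) = 1 + delta(x) at a fixed x, where
# delta(x) = p(Y1=1, Y2=1 | x) - p(Y1=0, Y2=0 | x).
# Joint conditional probabilities p(Y1=a, Y2=b | x); values are illustrative.
p11, p10, p01, p00 = 0.4, 0.3, 0.2, 0.1  # sums to 1

eta1 = p10 + p11   # p(Y1 = 1 | x)
eta2 = p01 + p11   # p(Y2 = 1 | x)
delta = p11 - p00
assert abs((eta1 + eta2) - (1.0 + delta)) < 1e-12

# Anti-correlated case Y1 = 1 - Y2: all mass on (1,0) and (0,1),
# so delta = 0 at every x and the optimal scorer is constant.
q11, q10, q01, q00 = 0.0, 0.6, 0.4, 0.0
assert q11 - q00 == 0.0
```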

2. Sensitivity of Label Aggregation to Weights:

"How‌ sensitive are the results to … ‌mixing weights‌ in label aggregation?"

This is a great question! Below, we generalize label aggregation with non-uniform weights $\alpha_1, \alpha_2 > 0$, i.e., $\bar{Y} = \alpha_1 Y_1 + \alpha_2 Y_2$, again for the $K=2$ case for simplicity.

From Proposition 5.4, the Bayes-optimal scorer $\bar{\eta}(x) = E[\bar{Y} \mid X=x]$ becomes: $\bar{\eta}(x) = 0 \cdot p(Y_1=0, Y_2=0 \mid x) + \alpha_1 \cdot p(Y_1=1, Y_2=0 \mid x) + \alpha_2 \cdot p(Y_1=0, Y_2=1 \mid x) + (\alpha_1+\alpha_2) \cdot p(Y_1=1, Y_2=1 \mid x) = \alpha_1 [p(Y_1=1, Y_2=0 \mid x) + p(Y_1=1, Y_2=1 \mid x)] + \alpha_2 [p(Y_1=0, Y_2=1 \mid x) + p(Y_1=1, Y_2=1 \mid x)] = \alpha_1 \eta_1(x) + \alpha_2 \eta_2(x)$

This shows that the optimal scorer is a direct linear combination of the class-probability functions $\eta_k(x)$, weighted by the explicitly chosen aggregation weights $\alpha_k$.
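The identity $E[\bar{Y} \mid X=x] = \alpha_1 \eta_1(x) + \alpha_2 \eta_2(x)$ can be checked numerically for $K=2$; the joint probabilities and weights below are illustrative values of ours:

```python
# Check E[Ybar | x] = alpha_1 * eta_1(x) + alpha_2 * eta_2(x)
# for the weighted aggregation Ybar = alpha_1 * Y1 + alpha_2 * Y2.
p11, p10, p01, p00 = 0.25, 0.35, 0.15, 0.25  # joint conditional distribution
a1, a2 = 2.0, 0.5                            # non-uniform aggregation weights

# Expectation of Ybar, enumerating the four joint outcomes.
lhs = 0.0 * p00 + a1 * p10 + a2 * p01 + (a1 + a2) * p11

# Weighted combination of the marginal class probabilities.
eta1, eta2 = p10 + p11, p01 + p11
rhs = a1 * eta1 + a2 * eta2

assert abs(lhs - rhs) < 1e-12
```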

This contrasts sharply with the optimal scorer for loss aggregation (Proposition 5.2), which is $\sum_k \frac{a_k}{\pi^{(k)}(1-\pi^{(k)})} \eta^{(k)}(x)$. Here, the effective weight on $\eta^{(k)}(x)$ depends not only on the chosen weight $a_k$ but also implicitly on the label prior $\pi^{(k)}$.

Therefore, regarding sensitivity:

  • The label aggregation optimal scorer is sensitive to the choice of weights $\alpha_k$ in a direct and predictable way: the final scorer is exactly the $\alpha_k$-weighted sum of the $\eta_k(x)$. It is notably insensitive to the class priors $\pi^{(k)}$.
  • The loss aggregation optimal scorer is sensitive to both the chosen weights $a_k$ and the class priors $\pi^{(k)}$ through the $\pi^{(k)}(1-\pi^{(k)})$ term.

3. Deterministic Label Assumption:

"Theorem 1‌ assumes deterministic labels .However, most labels in real-world are noisy."

Thank you for pointing this out. We searched our manuscript but could not find "Theorem 1". We kindly ask for clarification on which result is being referred to.

Assuming the concern relates to our use of deterministic labels in Section 6 (where $\eta^{(k)}(x) \in \{0, 1\}$): this was purely for illustrative purposes, to provide a clear and simple setting to demonstrate the "label dictatorship" phenomenon associated with loss aggregation (Proposition 6.1, Figure 1).

Our main theoretical results on Bayes-optimal scorers in Prop 5.2 (Loss Agg.) and Prop 5.4 (Label Agg.) do not require deterministic labels: they hold for general $\eta^{(k)}(x)$.

Furthermore, our Section 7 experiments use both a synthetic dataset (generated from a non-deterministic distribution) and real-world datasets (Banking, MSLR), confirming relevance in non-deterministic settings.

Final Decision

The authors study the problem of bipartite rankings with multiple labels, focusing on two well-known approaches to this problem: loss aggregation and label aggregation. Specifically, they formalize this problem, show that the Bayes-optimal solutions to AUC maximization are Pareto optimal, and that loss aggregation can lead to label dictatorships (an undesirable outcome). They empirically validate these findings on synthetic data (generated from a transformation of a uniform distribution of instances in $\mathbb{R}^2$) and real-world datasets (UCI Banking [mortgage and personal loan] and MSLR Web30k [click and relevance]).

Strengths of this work identified by reviewers include:

  • While learning bipartite rankings with multiple labels has been defined and applied in practice, this is the first work that characterizes the properties of existing formulations and establishes theoretical justification for choosing one formulation over another.
  • The stated claims are rigorously (analytically) supported and reflected in empirical results.

Limitations of this work identified by the reviewers include:

  • Reviewer PfnK initiated some discussion regarding relationships between the labels. While this is relatively straightforward for people working in this space, the discussion in the rebuttal is potentially elucidating to potential users of this method.
  • The primary concern is limited experiments. The synthetic experiments behave as expected and the two 'real-world' datasets only have two output variables. I believe this is sufficient, but secondary experiments (e.g., sensitivity analyses) would strengthen the work. Minimally, I would recommend discussing specifics regarding the range of utility of this setting (as addressed in rebuttal for reviewer aFZ3).

In any case, this is a general formulation for an important problem with useful results. The paper is well-motivated, well-constructed, well-supported, and well-written. The experiments could be improved to better demonstrate the importance of the problem, but this is a very clean paper providing potentially important results.