PaperHub
5.5 / 10
Poster · 4 reviewers
Min 2 · Max 4 · Std 0.7
Ratings: 3, 4, 3, 2
ICML 2025

Divide and Conquer: Learning Label Distribution with Subtasks

OpenReview · PDF
Submitted: 2025-01-22 · Updated: 2025-07-24

Abstract

Keywords
label distribution learning, subtask, label polysemy

Reviews and Discussion

Official Review
Rating: 3

The paper proposes a new plug-in method for label distribution learning, based on auxiliary tasks derived from the label distributions of the original dataset. The subtasks are defined by label masks optimized before the main training procedure, encouraging diversity among the subtasks and alignment with the original label distribution.

UPDATE AFTER REBUTTAL: I keep my recommendation of acceptance.

Questions for Authors

NA

Claims and Evidence

Claim 1: The optimization procedure (Eq. 1) can find subtasks which are informative and diverse.

This claim is verified using the metrics defined in Definitions 4.2 and 4.3, and the empirical evidence is provided on several datasets, comparing against a random baseline. One question: why do the authors define the diversity as in Definition 4.3 instead of the cosine similarity used in Equation 1?

Claim 2: NSUM is the only normalization function that allows the original label distribution to be reconstructed.

This claim is proved in Theorem 4.4, with ablations comparing it to a min-max normalization.

Claim 3: S-LDL leads to better performance.

This claim is supported by empirical evidence in Section 6, with a sufficient number of baselines and metrics.

Methods and Evaluation Criteria

The datasets and evaluation metrics are borrowed from the label distribution learning literature.

Theoretical Claims

Some steps in Theorem 4.4 could be made clearer. For example, the proof relies on the assumption that $q$ is a normalization constant, but does not discuss what happens if this is not the case. Moreover, the steps following Equation 8 are not clear: what does it mean that $\sum [p(d)]_{j}$ is "given"? Furthermore, there is implicitly a linearity assumption made to deduce that $p(v) = v$, hence more details on the derivation would be helpful.

Experimental Design and Analyses

Yes, the experimental section is quite comprehensive and covers a wide range of baselines.

Supplementary Material

I read the appendix.

Relation to Existing Literature

This paper proposes a plug-in that can be used on top of certain LDL methods.

Missing Important References

I am not familiar enough with the literature to discuss this.

Other Strengths and Weaknesses

The writing of the paper could be improved; for example, the notation is not always properly introduced (cf. Algorithm 1, where $j$ appears without a proper definition, making it hard to understand the algorithm).

Furthermore, there is no clear intuition or theoretical explanation for why subtasks should help with the LDL primary task, and why diversity and information should be maximized when designing good subtasks.

Other Comments or Suggestions

NA

Author Response

Many thanks for your valuable comments! We have provided point-by-point responses to your questions below.

Comment 1: Why do the authors define the diversity as in Definition 4.3, instead of the cosine similarity used in Equation 1?

Response: In Definition 4.3, we use $\bar{\boldsymbol{d}}$ to assign weights to each subtask pair, so that two subtask label spaces with different high-frequency labels are considered more differentiated. This helps evaluate how diverse the subtask label distributions are. In Equation (1), we use cosine similarity instead for optimization purposes, since the element-wise XOR operator is a discrete, non-continuous operation. We will clarify this distinction in the manuscript to improve readability.
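For concreteness, the minimal sketch below is our own illustration, not code from the paper: it assumes subtasks are encoded as binary label masks, that the Definition 4.3 diversity is a $\bar{\boldsymbol{d}}$-weighted count of element-wise XOR disagreements, and that Equation (1) uses plain cosine similarity between the masks; the paper's exact formulas may differ.

```python
import numpy as np

def xor_diversity(mask_a, mask_b, d_bar):
    """Weighted XOR disagreement between two binary subtask label masks.

    d_bar weights each label, so disagreeing on high-frequency labels counts more
    (our reading of Definition 4.3; the paper's exact formula may differ).
    """
    xor = np.logical_xor(mask_a, mask_b).astype(float)
    return float(np.dot(d_bar, xor))

def cosine_similarity(a, b, eps=1e-12):
    """Continuous surrogate used for optimization, as in Equation (1)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

# Hypothetical example with L = 6 labels.
d_bar = np.array([0.30, 0.25, 0.15, 0.12, 0.10, 0.08])  # mean description degrees
mask_1 = np.array([1, 1, 0, 0, 1, 0])
mask_2 = np.array([0, 1, 1, 1, 0, 0])

print(xor_diversity(mask_1, mask_2, d_bar))  # higher => more diverse subtask pair
print(cosine_similarity(mask_1, mask_2))     # lower => more diverse (differentiable proxy)
```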

Comment 2: Some steps in Theorem 4.4 could be made clearer. For example, the proof relies on the assumption that $q$ is a normalization constant, but does not discuss what happens if this is not the case. Moreover, the steps following Equation 8 are not clear: what does it mean that $\sum [p(d)]_{j}$ is "given"? Furthermore, there is implicitly a linearity assumption made to deduce that $p(v) = v$, hence more details on the derivation would be helpful.

Response: Below, we provide a detailed explanation of the points raised:

  • If $[q(\cdot)]_j$ is not constant for all $j$, each element has its own scaling factor, which is atypical and makes it difficult to output a probability simplex.
  • What we mean is that the term $\sum_{j=1}^{L} [p(\boldsymbol{d})]_{j}$ must be known a priori when solving for $d_k$ in Equation (8).
  • The choice $[p(\boldsymbol{v})]_j = v_j$ for all $j$ is the simplest nontrivial solution consistent with the problem constraints. We will explicitly call out the linearity assumption and justify why it is necessary (to avoid cross-dependencies and preserve normalization).

Comment 3: The writing of the paper could be improved; for example, the notation is not always properly introduced (cf. Algorithm 1, where $j$ appears without a proper definition, making it hard to understand the algorithm).

Response: In this context, $j$ denotes the index over the label space. While we initially used a more compact pseudocode style for brevity, we acknowledge that this could lead to ambiguity. We will revise the algorithm to provide a more detailed and unambiguous formulation in the updated version of the paper.

Specifically, Line 4 in Algorithm 1 can be expanded into:

$\mathcal{Y}^{(t)} \leftarrow \varnothing$;

for $j = 1$ to $L$ do

    if $M_{tj} = 1$ then

        $\mathcal{Y}^{(t)} \leftarrow \mathcal{Y}^{(t)} \cup \lbrace y_j \rbrace$;

    end if

end for

Line 9 in Algorithm 1 can be expanded into:

$\boldsymbol{D}^{(t)} \leftarrow [\,]$;

for $j = 1$ to $L$ do

    if $y_j \in \mathcal{Y}^{(t)}$ then

        $\boldsymbol{D}^{(t)} \leftarrow [\, \boldsymbol{D}^{(t)} \mid \text{clip}(\boldsymbol{d}_{\bullet j}, \varepsilon, 1) \,]$;

    end if

end for
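For readers who prefer executable code, here is a minimal Python sketch of the two expansions. The reading of the notation is ours: `M` is the subtask mask matrix, `D` is the label distribution matrix with $\boldsymbol{d}_{\bullet j}$ as its $j$-th column, and `eps` plays the role of $\varepsilon$; none of these names come from the paper's released code.

```python
import numpy as np

def subtask_label_space(M, t, labels):
    """Expansion of Line 4: collect the labels selected by the t-th row of mask M."""
    return {labels[j] for j in range(len(labels)) if M[t, j] == 1}

def subtask_distribution(D, Y_t, labels, eps=1e-6):
    """Expansion of Line 9: stack the clipped columns of D whose labels are in Y_t."""
    cols = [np.clip(D[:, j], eps, 1.0) for j in range(len(labels)) if labels[j] in Y_t]
    return np.stack(cols, axis=1) if cols else np.empty((D.shape[0], 0))

# Hypothetical example: 2 subtasks over L = 4 labels, 3 instances.
labels = ["y1", "y2", "y3", "y4"]
M = np.array([[1, 0, 1, 0],
              [0, 1, 1, 1]])
D = np.array([[0.40, 0.30, 0.20, 0.10],
              [0.10, 0.20, 0.30, 0.40],
              [0.25, 0.25, 0.25, 0.25]])

Y_0 = subtask_label_space(M, 0, labels)      # {'y1', 'y3'}
print(subtask_distribution(D, Y_0, labels))  # clipped columns of D for y1 and y3
```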

Comment 4: Furthermore, there is no clear intuition or theoretical explanation for why subtasks should help with the LDL primary task, and why diversity and information should be maximized when designing good subtasks.

Response: First, the reason why subtasks help with the primary task can be explained from the perspective of shared representation learning, which forces the model to learn more general and discriminative feature representations. These features may capture patterns that are beneficial to the primary task. Moreover, empirically speaking, simple techniques that are similar to our method (e.g., label powerset, dummy variable, label smoothing, etc.) have been demonstrated to achieve performance improvements. Second, the reason why diversity and information should be maximized is to ensure that each subtask contributes unique knowledge, avoiding trivial or overlapping learning signals. Also, diverse and informative subtasks can act as implicit regularization, preventing the model from relying too heavily on spurious label correlations in the primary task.

Official Review
Rating: 4

The authors introduce S-LDL, a novel label distribution learning (LDL) method that generates/utilizes label distribution subtasks. This method can seamlessly integrate with existing LDL methods without any prior/expert knowledge, and is suitable for some derived tasks. The paper also conducts analyses and experiments to demonstrate that S-LDL is effective and efficient.

update after rebuttal

I have carefully read the rebuttal. The rebuttal provided a detailed explanation of the role of downstream tasks and clarified the three assumed conditions in Theorem 4.4, which addressed some of my concerns. Therefore, I maintain my vote for "accept".

Questions for Authors

  1. Regarding Section 4.2, what benefits does the reconstructability of the subtask label distributions bring?
  2. Regarding Theorem 4.4, why are there three conditions? More explanation would be helpful.

Claims and Evidence

Yes. The analyses and experiments in this paper show that the additional beneficial data claimed by the authors can indeed be mined from subtasks and improve the performance of the LDL method.

Methods and Evaluation Criteria

Yes. The proposed methods and/or evaluation criteria make sense for the problem at hand.

Theoretical Claims

Yes. I checked the proofs of the theoretical claims, and they appear to be correct without any significant issues.

Experimental Design and Analyses

Yes. The experimental designs and analyses are sound and valid, although I have some doubts about the analysis; see the questions below for details.

Supplementary Material

Yes. I reviewed the supplementary material, which includes the TensorFlow code for the methods. The code is basically consistent with the content discussed in the paper.

Relation to Existing Literature

The idea of using subtasks in the field of LDL is novel, and the proposed S-LDL is compatible with existing LDL methods. This paper provides new insights into how subtasks can be leveraged to refine label distribution predictions, contributing to the ongoing evolution of LDL methodologies.

Missing Important References

No. This paper adequately discusses related work.

Other Strengths and Weaknesses

Strengths:

  1. This paper proposes an LDL method that generates/utilizes label distribution subtasks, and this idea seems novel.
  2. There are some good properties of the proposed method: independence from expert/prior knowledge, compatibility with existing methods, and applicability to derived tasks, etc.
  3. There are creative analyses of the proposed method, illustrating the effectiveness of the subtask label spaces/distributions.
  4. There are sufficient experiments to demonstrate the performance of the proposed method, and the results seem promising.

Weaknesses:

  1. There are limited performance improvements in some cases, particularly with the Yeast_ series datasets.
  2. There are some non-intuitive parts of the analysis. See the questions below for details.

Other Comments or Suggestions

The notation $[\cdot]$ is used to denote the set of natural numbers, but $[\cdot]_i$ is used to represent the element at the $i$-th position of a vector, which could be confusing for readers sometimes.

Author Response

Many thanks for your valuable comments! Responses to your concerns are as follows.

Comment 1: There are limited performance improvements in some cases, particularly with the Yeast_ series datasets.

Response: For LDL, even small changes in the metrics can indicate significant performance improvements. This is because the values involved in the calculation are constrained to the probability simplex, which limits the range of results and leads to small divergences. For example, on the $\mathtt{Yeast\_diau}$ dataset, the K-L divergence between the ground truth and a uniform label distribution matrix (all values equal to $\frac{1}{L}$, where $L$ is the number of labels) is 0.0158. This can be considered the worst-case performance. DF-LDL, a strong baseline, achieves 0.0131. To ensure the reliability of the experimental results, we repeat the experiments, report the standard deviation of the results, and perform significance tests (pairwise t-tests at the 0.05 significance level) to confirm that the observed changes are statistically significant.
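To illustrate the scale argument, here is a minimal sketch (our own, with made-up near-uniform numbers rather than the actual $\mathtt{Yeast\_diau}$ values) computing the K-L divergence between a hypothetical ground-truth label distribution and the uniform reference.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """K-L divergence D(p || q) between two discrete distributions."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

# Hypothetical near-uniform ground-truth label distribution over L = 7 labels.
d_true = np.array([0.160, 0.150, 0.150, 0.145, 0.140, 0.130, 0.125])
d_uniform = np.full_like(d_true, 1.0 / len(d_true))  # the "worst-case" reference

# Prints a value on the order of 1e-3, illustrating how tight the metric range is
# when the ground truth is close to uniform.
print(kl_divergence(d_true, d_uniform))
```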

Comment 2: There are some non-intuitive parts of the analysis. The notation $[\cdot]$ is used to denote the set of natural numbers, but $[\cdot]_i$ is used to represent the element at the $i$-th position of a vector, which could be confusing for readers sometimes.

Response: We appreciate this observation. To avoid ambiguity, we will revise the notation by using parentheses for indexing (i.e., $(\cdot)_i$ for the $i$-th element of a vector) and reserve square brackets exclusively for set definitions.

Comment 3: Regarding Section 4.2, what benefits does the reconstructability of subtask label distribution bring?

Response: The purpose of Section 4.2 is to demonstrate that, under certain conditions, the subtask label distributions can indeed reconstruct the primary task label distribution. In Section 5, we concatenate the subtask label distribution with the representation to serve as input to a label distribution estimator, simulating this reconstruction process. In essence, reconstructability guarantees that subtasks are not just arbitrary auxiliary objectives but are structurally aligned with the primary LDL problem. Therefore, it is crucial for the overall validity of the $\mathcal{S}$-LDL framework to prove that the subtask label distributions can yield the original distribution.
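As a rough illustration of the usage described above (an assumption about the interface, not the paper's actual implementation), the sketch below concatenates predicted subtask label distributions with a feature representation and feeds the result to a placeholder label distribution estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
n, feat_dim, n_subtask_labels, L = 4, 16, 5, 7  # hypothetical sizes

features = rng.normal(size=(n, feat_dim))                   # learned representation
subtask_pred = rng.dirichlet(np.ones(n_subtask_labels), n)  # predicted subtask distributions

# Concatenate subtask predictions with the representation, as described above,
# and feed the result to a placeholder (untrained, random) linear estimator + softmax.
estimator_input = np.concatenate([features, subtask_pred], axis=1)
W = rng.normal(size=(estimator_input.shape[1], L))
logits = estimator_input @ W
pred = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # rows on the simplex

print(pred.shape, pred.sum(axis=1))  # (4, 7), each row sums to 1
```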

Comment 4: Regarding Theorem 4.4, why are there three conditions? More explanation would be helpful.

Response: Below, we provide an explanation of each condition:

  • $\mathcal{G}$ is connected: This ensures that the proportions between all description degrees of the primary label distribution are known. If $\mathcal{G}$ is not connected, the proportions of description degrees corresponding to labels on different connected components would remain unknown.
  • $\mathcal{G}$ covers all labels in the label space: The union of the subtask label spaces must encompass the entire primary label space. If any labels are not included in any subtask label space, the description degrees for those labels would be unknown. Hence, this is a necessary condition.
  • The description degrees corresponding to all cut vertices of $\mathcal{G}$ are non-zero: When there are cut vertices in $\mathcal{G}$, non-zero description degrees at all the cut vertices are necessary to keep the proportions known. (A small programmatic check is sketched after this list.)
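The following sketch shows how these conditions could be checked programmatically. It assumes, as our reading of the discussion, that $\mathcal{G}$ has one vertex per label and an edge between two labels whenever they co-occur in some subtask label space; the paper's exact construction of $\mathcal{G}$ may differ.

```python
import itertools
import networkx as nx

def check_theorem_conditions(subtask_label_spaces, num_labels, description_degrees):
    """Check the three conditions discussed above on a label co-occurrence graph.

    subtask_label_spaces: list of iterables of label indices, one per subtask.
    description_degrees: dict mapping label index -> description degree.
    """
    covered = set().union(*(set(s) for s in subtask_label_spaces))
    covers_all = covered == set(range(num_labels))

    # Build G: one vertex per label, an edge when two labels share a subtask.
    G = nx.Graph()
    G.add_nodes_from(range(num_labels))
    for labels in subtask_label_spaces:
        for u, v in itertools.combinations(sorted(set(labels)), 2):
            G.add_edge(u, v)

    connected = nx.is_connected(G)
    cut_ok = all(description_degrees[v] != 0 for v in nx.articulation_points(G))
    return connected, covers_all, cut_ok

# Hypothetical example: 5 labels, two overlapping subtask label spaces.
spaces = [[0, 1, 2], [2, 3, 4]]
degrees = {0: 0.10, 1: 0.20, 2: 0.30, 3: 0.25, 4: 0.15}
print(check_theorem_conditions(spaces, 5, degrees))  # (True, True, True)
```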
Reviewer Comment

I have carefully read the rebuttal, and the responses resolved most of my concerns. Regarding the third response, I still have a small question: is there a relationship between the subtasks and the main task? Are they in an independent and identically distributed (i.i.d.) relationship? Aside from this question, the other responses have addressed my concerns.

Author Comment

Thank you for your feedback! The subtasks and the primary task are not independent but share structural correlations. However, they do not strictly follow an i.i.d. relationship, as the subtask construction process introduces dependencies.

Official Review
Rating: 3

This paper studies the Label Distribution Learning (LDL) problem. It first discusses the disadvantages of existing works, pointing out the contradiction between auxiliary tasks and generalizability. To mitigate this issue, the authors propose a new method, S-LDL, which generates subtask label distributions to help LDL. The proposed S-LDL consists of two simple key components: 1) generating subtasks without prior knowledge; 2) solving subtasks using off-the-shelf LDL algorithms. The experiments demonstrate the effectiveness of S-LDL on multiple datasets.

Questions for Authors

Please see above.

Claims and Evidence

Strengths:

  • The proposed idea is derived from some multi-label learning (MLL) problems, but is also suitable for the LDL problem. This is a minimalist and interesting motivation for LDL.

Weaknesses:

  • It is better to provide some evidence that existing methods cannot deal with the contradiction. Otherwise the motivation will be a bit unconvincing.
  • The claim of "furnish additional supervised data" seems not rigorous and might cause misunderstanding, since you have not involved more external data. Maybe it is better to use "supervision" or "supervised information".

Methods and Evaluation Criteria

Strengths:

  • The proposed method is minimalist and scalable. The authors have properly provided the algorithm procedure.
  • The evaluation metrics are introduced in Section 6, and the detailed formulations are provided in the appendix.

Theoretical Claims

Weaknesses:

  • Some related works in MLL with partitioning of the label space provided theoretical justification. However, there is no theoretical support for the proposed method.

Experimental Design and Analyses

Strengths:

  • The experiments demonstrate the effectiveness of the proposed method on multiple datasets.
  • The authors have provided proper ablation study and parameter sensitivity analysis.

Weaknesses:

  • The performance gains on some datasets are marginal (less than 0.01). Is there any explanation?

Supplementary Material

Strengths:

  • The authors have provided the source code for reproduction.

Relation to Existing Literature

Strengths:

  • This paper is partially related to multi-label learning. The authors have provided proper explanations and discussions.

Missing Important References

Weaknesses:

  • It seems that the references are a bit out-of-date. Most references are from or earlier than 2023. Is there more recent related work?

Other Strengths and Weaknesses

I have no extra concerns. Please see the above comments.

Other Comments or Suggestions

Please see above.

Author Response

Many thanks for your valuable comments! We have provided point-by-point responses to your questions below.

Comment 1: It is better to provide some evidence that existing methods cannot deal with the contradiction.

Response: In Sections 1 and 2, we have provided such evidence. On the one hand, we highlight how existing methods relying on prior/expert knowledge (e.g., [Chen et al., 2020; Wu et al., 2019; Yang et al., 2017a]) suffer from poor generalization, as they cannot adapt to new domains. On the other hand, we show that conventional LDL techniques (e.g., [Jia et al., 2019; 2023; Ren et al., 2019; Wen et al., 2023]) use solely the primary task distribution as supervised data. These comparisons directly show why existing solutions cannot resolve the contradiction; the essential reason lies in the design logic of these methods.

Comment 2: The claim of "furnish additional supervised data" seems not rigorous and might cause misunderstanding, since you have not involved more external data. Maybe it is better to use "supervision" or "supervised information".

Response: The claim is indeed not meant to imply external data but rather newly constructed data for supervision. We will revise the wording to clarify this in the manuscript.

Comment 3: Some related works in MLL with partitioning of the label space provided theoretical justification. However, there is no theoretical support for the proposed method.

Response: We sincerely appreciate the reviewer's insightful comment regarding theoretical justification. While prior MLL works with label space partitioning do offer theoretical analyses, our method provides distinctive theoretical support through the following aspects. In Section 4.1, we have discussed the validity of subtask construction, ensuring the subtasks remain meaningful for the primary LDL objective. In Section 4.2, we prove that the primary label distribution can be theoretically reconstructed from subtask distributions, demonstrating information preservation. Additionally, in Section 4.3, we have provided a detailed analysis of the time complexity of subtask construction, establishing its scalability and efficiency. While our theoretical framework differs from related works in MLL with partitioning of the label space, these three pillars collectively justify the rationality of our design.

Comment 4: The performance gains on some datasets are marginal (less than 0.01). Is there any explanation?

Response: For LDL, even small changes in the metrics can indicate significant performance improvements. This is because the values involved in the calculation are constrained to the probability simplex, which limits the range of results and leads to small divergences. For example, on the $\mathtt{Yeast\_diau}$ dataset, the K-L divergence between the ground truth and a uniform label distribution matrix (all values equal to $\frac{1}{L}$, where $L$ is the number of labels) is 0.0158. This can be considered the worst-case performance. DF-LDL, a strong baseline, achieves 0.0131. To ensure the reliability of the experimental results, we repeat the experiments, report the standard deviation of the results, and perform significance tests (pairwise t-tests at the 0.05 significance level) to confirm that the observed changes are statistically significant.

Comment 5: It seems that the references are a bit out-of-date. Most references are from or earlier than 2023. Is there more recent related work?

Response: We appreciate the reviewer's suggestion. While there are indeed some recent works in the field of LDL [1, 2], they currently lack official open-source implementations, so their effectiveness and reproducibility remain unverified. For reliability and fairness in comparison, we chose to focus on well-established methods with publicly available implementations.

[1] Label Distribution Learning Based on Horizontal and Vertical Mining of Label Correlations. TBD, 2024.

[2] Exploiting Multi-Label Correlation in Label Distribution Learning. IJCAI, 2024.

Official Review
Rating: 2

This paper investigates the problem setting of Label Distribution Learning (LDL). In particular, the authors propose the concept of leveraging subtasks to facilitate learning.

update after rebuttal

I reviewed the rebuttal and further responses. I am still not fully convinced. I did not have more time to check the motivations of the LDL setting, but it seems that other reviewers approve of this setting. I still have reservations, but the AC may wish to ignore my comments because I did not read the paper in full.

Questions for Authors

See strengths and weaknesses.

Claims and Evidence

See strengths and weaknesses.

Methods and Evaluation Criteria

See strengths and weaknesses.

Theoretical Claims

See strengths and weaknesses.

Experimental Design and Analyses

See strengths and weaknesses.

Supplementary Material

See strengths and weaknesses.

Relation to Existing Literature

See strengths and weaknesses.

Missing Important References

See strengths and weaknesses.

Other Strengths and Weaknesses

The concept of Label Distribution Learning (LDL) feels quite unfamiliar to me. How does it differ from a typical classification network? What I mean is, even the most common classification networks can use softmax predictions as the label distribution output in this context.

Moreover, what form does the training set take? The authors only describe the samples as $\mathcal{X}$ without mentioning the labels. Does each sample come with a soft label? Even if that's the case, it still strikes me as odd. As we know, the ERM principle essentially guarantees that, given a sufficient sample size, a classification network trained with cross-entropy loss will eventually produce softmax predictions that converge to the true posterior distribution $P(Y|X)$. In simpler terms, for the same sample point $X$, we would effectively be sampling multiple labels $Y$, which is indeed equivalent to training with soft labels.

Please first address the points mentioned above, after which I will proceed with my review.

Other Comments or Suggestions

See strengths and weaknesses.

Author Response

Many thanks for your comments! Responses to your questions are as follows.

Comment 1: How does it (LDL model) differ from a typical classification network?

Response: Although typical classification networks can formally produce outputs similar to label distribution learning (LDL) through Softmax, they differ significantly in essence, specifically manifested in the following aspects:

  • Learning Objectives: Typical classification networks with Softmax outputs aim to learn a single-peaked probability distribution, emphasizing the certainty that a sample belongs to a specific category (i.e., the "one true label"). Their limitation lies in the inability to express the varying degrees of association between a sample and multiple labels. In contrast, LDL seeks to learn a general (possibly single-peaked or multi-peaked) label distribution describing the degree of association between a sample and all labels. It is primarily suited for scenarios where semantic overlap or granularity differences exist among labels. The core challenge of LDL never lies in ensuring that outputs lie on the probability simplex, but rather in precisely fitting the description degrees. (A small numerical sketch follows this list.)

  • Semantic Interpretation: The Softmax output ensures that probabilities sum to 1, but only the highest-probability label is meaningful (winner-takes-all). The label distribution output, however, assigns probability mass according to the relative descriptive strength of each label, so all labels retain interpretability simultaneously.

  • Underlying Philosophy: The classification model is decision-oriented, prioritizing clear classification boundaries. The LDL model is description-oriented, pursuing fine-grained quantification of label associations.
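As a small numerical illustration (ours, not code from the paper), the sketch below contrasts a one-hot classification target with a label distribution target under the K-L objective commonly used in LDL; with a one-hot target, the K-L loss reduces to standard cross-entropy.

```python
import numpy as np

def kl_loss(target, pred, eps=1e-12):
    """K-L divergence D(target || pred), a common LDL training objective."""
    target = np.clip(target, eps, 1.0)
    pred = np.clip(pred, eps, 1.0)
    return float(np.sum(target * np.log(target / pred)))

pred = np.array([0.50, 0.30, 0.20])        # softmax output of some model

one_hot = np.array([1.0, 0.0, 0.0])        # classification: one true label
label_dist = np.array([0.60, 0.30, 0.10])  # LDL: description degrees for all labels

print(kl_loss(one_hot, pred))     # equals cross-entropy -log(0.50), about 0.693
print(kl_loss(label_dist, pred))  # penalizes mismatch on every label's degree
```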

Comment 2: What form does the training set take? Does each sample come with a soft label?

Response: In short, each sample $\boldsymbol{x}$ comes with a label distribution $\boldsymbol{d}$, which is not a soft label. Soft labels are often introduced to improve the generalization of classification tasks, and they are still only related to a specific dominant class. Label distributions model ambiguity, where diversity is a property of the data itself. Therefore, label distributions may inherently follow a continuous or multi-peaked distribution (e.g., subjective human annotations). We give rigorous statements about LDL in the problem definition in Section 3.1; these do not conflict with your observations based on ERM.

To clarify, our work follows a pure LDL paradigm, which is fundamentally distinct from traditional classification tasks. We believe this clarification adequately addresses the reviewer's concerns. We sincerely hope these issues will not negatively impact the overall rating of our contribution.

Reviewer Comment

Thanks for the rebuttal. My point is: when the sample size is sufficiently large and the model capacity is appropriate, the probability output by a model trained with the commonly used cross-entropy loss will converge to the true posterior probability. This convergence is independent of whether the true posterior distribution is assumed to be unimodal or multimodal.

From my perspective, Label Distribution Learning (LDL) appears to be merely a rephrasing of the same concept. Simply put, if LDL assumes that the annotation of a sample is a label distribution (e.g., for a given sample $x$, its label is represented as $y = [0.3, 0.6, 0.1]$), then in the traditional setting, when the training dataset is sufficiently large, even with single-label supervision, we would expect multiple sampled labels $y$ for the same $x$. These sampled labels will naturally form a distribution similar to $[0.3, 0.6, 0.1]$.

Author Comment

Thank you for your thoughtful discussion. While the theoretical point holds universally (any ML paradigm can converge to the true distribution given infinite data and ideal conditions), these conditions are rarely met in practice.

Here we take an example: your mention of single-label supervision corresponds exactly to problem-transformation (PT) methods that convert label distributions into single-label approximations (e.g., through repeated sampling). Geng (2016) demonstrates that these methods perform well only on artificial datasets but fail in practical applications. This occurs because PT methods fundamentally destroy the structure of the original label distribution: if the training data is single-labeled, the model will tend to ignore the underlying multimodality.

This example not only highlights that label distributions are not merely an alternative representation of traditional single-label sampling statistics, but also demonstrates the persistent gap between theory and empirical practice.

We share your enthusiasm for these fundamental questions. Therefore, in this paper, we deconstruct richer supervision information using label distribution subtasks to address empirically observed performance challenges.

Final Decision

The paper proposes $\mathcal{S}$-LDL, a novel framework for Label Distribution Learning (LDL) that reconciles the trade-off between generalizability and auxiliary task reliance by generating subtask label distributions without prior knowledge. S-LDL introduces two key components: (1) an algorithm to autonomously construct subtasks, and (2) a plug-and-play framework compatible with existing LDL methods and derivative tasks. Theoretical analysis validates subtask validity and reconstructability, while experiments demonstrate consistent improvements across multiple datasets and tasks.

The paper makes a timely and practical contribution to LDL by introducing a generalizable subtask-learning framework that does not rely on expert knowledge. There are some strengths mentioned by the reviewers: (1) Novelty: Reviewer BGkH noted S-LDL is the first to address LDL via subtasks, with a "novel" and "plug-and-play" design. (2) Empirical Validation: All reviewers acknowledged the comprehensive experiments (10+ datasets, 4 tasks). (3) Theoretical Rigor: Reviewer yDqw initially found Theorem 4.4's proof unclear, but the authors clarified the linearity assumption and normalization uniqueness, satisfying BGkH's request.

While minor issues remain, the rebuttal and revisions have adequately addressed the concerns. Reviewer sNTH initially questioned the distinction between LDL and standard classification, and the authors provided detailed responses from different aspects. Reviewer sNTH expressed that he still has reservations about the LDL setting, but does not have time to thoroughly check the details and does not object if the paper is accepted. Thus, I recommend acceptance.