PaperHub
Score: 6.6 / 10
Poster · 4 reviewers
Ratings: 4, 4, 3, 3 (min 3, max 4, std 0.5)
ICML 2025

Approximately Correct Label Distribution Learning

OpenReview · PDF
Submitted: 2025-01-22 · Updated: 2025-07-24

Abstract

Keywords

label distribution learning · approximate correctness · label polysemy

Reviews and Discussion

Official Review (Rating: 4)

To address some deep-rooted problems of LDL, such as existing LDL metrics losing their discriminability and existing LDL objectives being at risk of overfitting, this paper proposes DeltaLDL, the percentage of predictions that are approximately correct within the context of LDL. Based on DeltaLDL, a novel evaluation metric (the µ metric) and a novel learning objective are proposed. Finally, the authors encapsulate these into a new LDL algorithm, named δ-LDL. The theoretical analysis and empirical results validate the effectiveness of the proposed δ-LDL.

update after rebuttal

After reading the authors' responses, I have decided to keep my original score.

Questions for Authors

  1. Some necessary justification for the claimed deep-rooted problems of LDL is needed. The Introduction states: "For years, there are some deep-rooted problems in the field of LDL." How is this conclusion drawn? More detailed elaboration, combined with the related work (Section 6), is needed.
  2. In the robustness testing shown in Figure 4, why were only five other methods compared? In Table 2, the authors compare the proposed δ-LDL with eleven other existing methods.
  3. In Table 2, which presents detailed experimental results on different datasets, there is no need to rank the eleven competitors; doing so reduces the readability of the paper.
  4. Although the current experiments are good enough, it would be more convincing if the authors explained why these seven datasets were chosen.

Claims and Evidence

YES

Methods and Evaluation Criteria

YES

Theoretical Claims

YES

Experimental Design and Analysis

YES

Supplementary Material

YES. Code and Datasets.

Relation to Broader Literature

This paper proposes DeltaLDL, the percentage of predictions that are approximately correct within the context of LDL, as a solution to some deep-rooted problems of LDL. DeltaLDL can serve as both a novel evaluation metric and a novel learning objective.

Essential References Not Discussed

NO

Other Strengths and Weaknesses

Strengths:

  1. The authors conduct a theoretical analysis of the KLD to demonstrate its unsuitability as an evaluation metric/learning objective for LDL.
  2. The authors propose DeltaLDL, which can serve as a solution to some existing deep-rooted problems of LDL.
  3. DeltaLDL can serve as a novel evaluation metric and a novel learning objective.
  4. Finally, the authors encapsulate DeltaLDL to propose a new LDL algorithm, named δ-LDL. The empirical results validate the effectiveness of the proposed δ-LDL.

Weaknesses:

  1. Some necessary justification for the claimed deep-rooted problems of LDL is needed. The Introduction states: "For years, there are some deep-rooted problems in the field of LDL." How is this conclusion drawn? More detailed elaboration, combined with the related work (Section 6), is needed.
  2. In the robustness testing shown in Figure 4, why were only five other methods compared? In Table 2, the authors compare the proposed δ-LDL with eleven other existing methods.
  3. In Table 2, which presents detailed experimental results on different datasets, there is no need to rank the eleven competitors; doing so reduces the readability of the paper.
  4. Although the current experiments are good enough, it would be more convincing if the authors explained why these seven datasets were chosen.

Other Comments or Suggestions

  1. Some necessary justification for the claimed deep-rooted problems of LDL is needed. The Introduction states: "For years, there are some deep-rooted problems in the field of LDL." How is this conclusion drawn? More detailed elaboration, combined with the related work (Section 6), is needed.
  2. In the robustness testing shown in Figure 4, why were only five other methods compared? In Table 2, the authors compare the proposed δ-LDL with eleven other existing methods.
  3. In Table 2, which presents detailed experimental results on different datasets, there is no need to rank the eleven competitors; doing so reduces the readability of the paper.
  4. Although the current experiments are good enough, it would be more convincing if the authors explained why these seven datasets were chosen.
  5. In the last paragraph of the Introduction, the authors highlight the contributions of the paper. It reads more like a mix of contributions and organizational structure.

Author Response

Many thanks for your valuable comments! We have provided point-by-point responses to your questions below.

Comment 1: Some necessary justification for the claimed deep-rooted problems of LDL is needed. The Introduction states: "For years, there are some deep-rooted problems in the field of LDL." How is this conclusion drawn? More detailed elaboration, combined with the related work (Section 6), is needed.

Response: The "deep-rooted problems" refer to the following two issues:

  • Poor discriminability of traditional metrics: Early work (Geng, 2016) demonstrated that Clark and Canberra metrics suffer from oversensitivity to small values. Subsequent studies (Xu & Zhou, 2017) revealed KLD's unreliability with sparse predictions, prompting calls for alternative measures.

  • Overfitting from minimizing average measurement: For example, AA-BP (Geng, 2016), a simple 3-layer network, minimizes MSE but underperforms due to overfitting. Current methods implicitly address this through ad-hoc regularization (Ren et al., 2019a; b; Jia et al., 2023a; b), indicating the field of LDL lacks principled solutions.

We will expand on these points in Section 6 to clarify the motivation for our work.

Comment 2: In the robustness testing shown in Figure 4, why were only five other methods compared? In Table 2, the authors compare the proposed δ-LDL with eleven other existing methods.

Response: We selected representative methods from different performance tiers to maintain clarity in the visualization. Including all 11 methods would make the plot overcrowded and hard to interpret. Tables 2 & 3 provide the complete comparison for readers interested in detailed results.

Comment 3: In Table 2, which presents detailed experimental results on different datasets, there is no need to rank the eleven competitors; doing so reduces the readability of the paper.

Response: The ranking was intentionally included to demonstrate that while μ is derived from KLD, it produces meaningfully different evaluation outcomes than KLD alone. As discussed in Section 5.2, this highlights how μ addresses known limitations of traditional metrics. We will add a clearer explanatory note in the table caption to improve readability while maintaining this important comparative information.

Comment 4: Although the current experiments are good enough, it would be more convincing if the authors explained why these seven datasets were chosen.

Response: The seven datasets were carefully selected to represent diverse real-world LDL applications across multiple domains: aesthetic perception (M²B & fbp5500), facial emotion recognition (RAF_ML & SBU_3DFE), multi-class image classification (Natural_Scene), and artistic emotion perception (Painting & Music). Due to the complexity of human subjective perception, these tasks particularly exhibit label polysemy, a crucial challenge LDL aims to address.

Comment 5: In the last paragraph of the Introduction, the authors highlight the contributions of the paper. It reads more like a mix of contributions and organizational structure.

Response: Yes. We will revise the subsection title to better reflect its dual purpose.

Reviewer Comment

I have read the authors' responses to all comments, and thus I will keep my score.

Official Review (Rating: 4)

This paper theoretically reveals the deficiency of the KL divergence in learning and evaluating LDL mappings. To address the mentioned shortcomings, this paper proposes a new LDL paradigm, DeltaLDL, which focuses on how many label distributions are approximately correctly predicted. Based on DeltaLDL, this paper proposes a novel evaluation metric (which is theoretically proved that it possesses superior discriminative power) and a novel learning objective (which achieves highly competitive performance). Finally, this paper conducts extensive experiments to demonstrate the effectiveness of the proposal.

update after rebuttal

I read the rebuttal and the other reviews. My major concerns were how a specific distance or similarity value represents the closeness between the predicted and ground-truth label distributions, and some writing issues. The rebuttal answers the distance question well, and the writing issues can be resolved in the camera-ready version. Therefore, I raise my score.

Questions for Authors

Why should δ₀ be defined as in Equation (12), and what advantages does this approach have compared to directly allowing users to pre-set a threshold δ₀?

Claims and Evidence

Yes, the claims are supported by clear and convincing evidence.

Methods and Evaluation Criteria

Yes, both the methods and evaluation criteria make sense for the problem at hand.

Theoretical Claims

Yes, I have checked the proof of Proposition 2.2.

Experimental Design and Analysis

Yes, I have checked the experimental designs and analyses. The evaluation metrics and datasets are sufficient, and the comparison methods are state-of-the-art. Additionally, the effectiveness of the proposal is demonstrated by the ablation experiments.

Supplementary Material

Yes, I have reviewed the appendix, including the proofs of the theorems and the additional experimental analyses.

Relation to Broader Literature

In the literature related to support vector regression, the algorithms aim to learn a strip with the minimum width such that as many samples as possible are within the strip. This paper adapts this idea to the label distribution learning paradigm and proposes the concept of "approximately correct prediction," aiming to approximately correctly predict more samples.

Essential References Not Discussed

The paper does not neglect the essential works.

Other Strengths and Weaknesses

The evaluation metrics proposed in this paper possess strong practicality and offer a new perspective for subsequent research in label distribution learning. Traditional evaluation methods typically measure prediction quality by calculating the distance or similarity between the predicted label distribution and the ground-truth label distribution.

However:

  1. Non-professional users may not intuitively understand how a specific distance or similarity value represents the closeness between the predicted and ground-truth label distributions.
  2. The evaluation approach proposed in this paper calculates the number of samples with a distance below a certain threshold, which is easy for non-professional users to understand.
  3. The paper provides theoretical support for the proposed evaluation method.
  4. The paper also has some imperfections, such as the arbitrary naming of methods and metrics (e.g., μ and DeltaLDL), which does not reflect their characteristics.

Other Comments or Suggestions

In Equation (13) and Equation (19), it should be clarified whether the symbol c is also included within the logarithmic operation. Besides, it is recommended to assign a name to the evaluation metric to more accurately reflect its characteristics, rather than using the Greek letter (μ) directly.

Author Response

Many thanks for your valuable comments! We have provided point-by-point responses to your questions below.

Comment 1: Non-professional users may not intuitively understand how a specific distance or similarity value represents the closeness between the predicted and ground-truth label distributions.

Response: Traditional metrics reflect the closeness from different perspectives. For example, Euclidean distance measures geometric separation, while K-L divergence quantifies distributional differences. Our metric, i.e., Eq. (11), built upon these foundations, remains intuitive. It directly corresponds to an area ratio. Let us explain this better with Fig. 2 (b):

  • Numerator of Eq. (11): The integral corresponds to the area under the curve, bounded by the gray line and axes.

  • Denominator of Eq. (11): δ₀ represents the area of the gray rectangle, reflecting the ideal model's performance.

Comment 2: The paper also has some imperfections, such as the arbitrary naming of methods and metrics (e.g., μ and DeltaLDL), which does not reflect their characteristics. ... It is recommended to assign a name to the evaluation metric to more accurately reflect its characteristics, rather than using the Greek letter (μ) directly.

Response: We appreciate this thoughtful suggestion. While we originally adopted the Greek-letter notation μ for consistency with established metric conventions like Spearman's ρ and Kendall's τ, we agree that more descriptive names would better reflect their characteristics. We will provide the following aliases:

  • μ: improvement ratio (clearly indicating performance gain).

  • DeltaLDL: AC-LDL (Approximately Correct LDL).

These new names better capture the methods' essential features while maintaining readability. Thank you for helping improve our paper's clarity.

Comment 3: In Equation (13) and Equation (19), it should be clarified whether the symbol c is also included within the logarithmic operation.

Response: In both Equations (13) and (19), the parameter c should indeed be included within the logarithmic operation. We will revise these equations to show this explicitly by adding parentheses.

Comment 4: Why should δ₀ be defined as in Equation (12), and what advantages does this approach have compared to directly allowing users to pre-set a threshold?

Response: Our theoretical analysis shows that δ₀ reflects the worst-case divergence. Therefore, values larger than δ₀ (for distance metrics) would imply tolerating worse-than-random errors, which does not make much sense. While smaller δ could be explored, such choices require strong assumptions about training difficulty (e.g., label noise), i.e., we cannot decide how small δ should be. Without such prior knowledge, δ₀ provides a neutral starting point.

Allowing non-professional users to set thresholds does not provide any advantages. We provide results of 𝔇(KL, δ; f) w.r.t. δ on SBU_3DFE & Natural_Scene, where f always outputs a uniform label distribution matrix.

SBU_3DFE:

| δ | 0.02 | 0.04 | 0.06 | 0.0851 (δ₀ of SBU_3DFE) | 0.1 | 0.12 | 0.14 | 1.172 (δ₀ of Natural_Scene) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 𝔇 | .156 | .314 | .450 | .594 | .665 | .732 | .787 | 1.000 |

Natural_Scene:

| δ | 0.0851 (δ₀ of SBU_3DFE) | 0.6 | 0.8 | 1.0 | 1.172 (δ₀ of Natural_Scene) | 1.4 | 1.6 | 1.8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 𝔇 | .000 | .238 | .240 | .481 | .650 | .662 | .777 | .797 |

Key findings: 1) 𝔇(KL, δ; f₀) is highly sensitive to changes in δ (non-linear response); 2) optimal δ ranges vary significantly across datasets, so threshold settings cannot be transferred between them. This sensitivity highlights the dangers of ad-hoc parameter choices, justifying our pursuit of a parameter-free solution.
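To make the quantity concrete: 𝔇(KL, δ; f) is the fraction of samples whose KL divergence from the prediction is at most δ. A minimal sketch for a uniform predictor on synthetic label distributions (all names and data below are our own illustrative assumptions, not the paper's code) shows the same qualitative behavior — the fraction rises steeply within a narrow δ band:

```python
import math
import random

def kld(d, p):
    # KL divergence KL(d || p) between a ground-truth distribution d and a prediction p
    return sum(di * math.log(di / pi) for di, pi in zip(d, p) if di > 0)

def frac_within(truths, preds, delta):
    # D(KL, delta; f): fraction of samples whose KLD to the prediction is <= delta
    return sum(1 for d, p in zip(truths, preds) if kld(d, p) <= delta) / len(truths)

random.seed(0)
c = 6  # number of labels (made up)

def rand_dist():
    w = [random.random() for _ in range(c)]
    s = sum(w)
    return [x / s for x in w]

truths = [rand_dist() for _ in range(1000)]
uniform = [[1.0 / c] * c for _ in range(1000)]  # f always outputs the uniform distribution

for delta in (0.05, 0.1, 0.2, 0.4):
    print(delta, round(frac_within(truths, uniform, delta), 3))
```

The non-linear jump of the printed fractions as δ grows mirrors the sensitivity reported in the tables above.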

Reviewer Comment

Thank you to the authors for their response, which addressed several of my questions and included an additional experiment to explain the rationale behind setting the parameter δ₀. For the first question, I suggest adding a visualization to more intuitively illustrate how a specific distance or similarity value represents the closeness between the predicted and ground-truth label distributions. For example, multiple figures could be plotted, each showing the true label distribution and the predicted label distribution, along with the values of distances/similarities between them under different measurement methods. This would enhance the intuitive understanding of the closeness.

Author Comment

Thank you for your suggestion. We will incorporate visualizations that combine the true and predicted distributions along with relevant metrics to enhance intuitive understanding.

Official Review (Rating: 3)

This paper focuses on label distribution learning (LDL) and addresses the limitations of existing evaluation metrics and learning objectives. Existing LDL evaluation metrics based on distance/similarity measures, such as Kullback-Leibler divergence (KLD), have poor discriminability due to the constraints of label distributions. Also, existing LDL learning objectives often overfit by emphasizing a small subset of samples, leading to sub-optimal performance. Hence, it proposes DeltaLDL, which can be used as both a novel evaluation metric and a learning objective, aiming to improve performance and discriminability in LDL.

Questions for Authors

1. In Remark 2.3, please explain why a closed-form solution of Eq. (8) does not exist.

Claims and Evidence

Yes.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

Yes.

Experimental Design and Analysis

Yes.

Supplementary Material

Yes. This submission provides the code as supplementary material. I specifically examined the consistency between the logic of the code and the metrics and algorithms proposed in the article. The Appendix section contains proofs of the paper's theories and some additional experimental results, both of which I have checked.

Relation to Broader Literature

The key contributions of the paper are closely related to the field of label distribution learning. It proposes a new metric to measure the distance between the ground-truth and estimated label distributions, and transforms the metric into a loss function so that it can be taken into account when optimizing the model.

Essential References Not Discussed

No.

Other Strengths and Weaknesses

Strengths:

  1. Novel Evaluation Metric: This paper proposes a parameter-free metric, μ, which quantifies the percentage of samples predicted as approximately correct by integrating the Kullback-Leibler divergence (KLD) over a threshold range. This metric avoids the sensitivity issues of traditional KLD and provides a more discriminative measure of performance improvements.
  2. Novel Learning Objective: This paper formulates a learning objective using a smoothed indicator function and adaptive Simpson's rule for numerical integration. This approach encourages most samples to be approximately correct while mitigating overfitting to extreme predictions.
  3. Experiments across multiple datasets (e.g., M2B, fbp5500, SBU_3DFE) show that δ-LDL outperforms baseline LDL methods in terms of both accuracy and robustness, which validates the effectiveness of the proposed method.

Weaknesses:

  1. My main concern is whether the new metric proposed in the article can truly distinguish between superior and inferior models. In Eq. (11), there is a coefficient of 1/δ₀ before the integral. What does this coefficient mean? Why does the area under the curve (AUC) need to be multiplied by such a coefficient? Moreover, the smaller this coefficient is, the larger μ becomes. This makes one wonder whether it is the integral, the coefficient, or the interaction between the two that drives the measurement. This point needs further explanation.
  2. Setting δ₀ as the expected KL divergence between the label distribution and the vector v is somewhat heuristic. It is recommended to give more explanation or conduct more exploratory experiments.
  3. In Section 4, it is mentioned that "the objective should be to sacrifice a small number of samples that are difficult to learn and ensure that most samples can be predicted as approximately correct." In the initial stage of training, when the model's parameters are random, how can we determine which samples are difficult to learn? Perhaps a warm-up process, or gradually changing the hyperparameter δ, would help.

Other Comments or Suggestions

None.

Author Response

Many thanks for your valuable comments! We have provided point-by-point responses to your questions below.

Comment 1: My main concern is whether the new metric proposed in the article can truly distinguish between superior and inferior models. In Eq. (11), there is a coefficient of 1/δ₀ before the integral. What does this coefficient mean? Why does the area under the curve (AUC) need to be multiplied by such a coefficient? Moreover, the smaller this coefficient is, the larger μ becomes. This makes one wonder whether it is the integral, the coefficient, or the interaction between the two that drives the measurement. This point needs further explanation.

Response: The metric μ is not a raw AUC but a ratio of areas. Let us explain this better with Fig. 2 (b):

  • Numerator: The integral in Eq. (11) corresponds to the area under the curve, bounded by the gray line and axes.

  • Denominator: δ₀ represents the area of the gray rectangle, reflecting the ideal model's performance.

Thus, 1/δ₀ normalizes the μ metric to [0, 1], ensuring fair comparison (no arbitrary scaling) between metrics. We will clarify this critical point in the manuscript.
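As a toy illustration of this ratio (the per-sample divergences and δ₀ below are made up, and a simple midpoint rule stands in for the paper's adaptive Simpson integration):

```python
# Hypothetical sketch: mu = (1/delta0) * integral over [0, delta0] of D(delta),
# where D(delta) is the fraction of samples whose divergence is <= delta.
divs = [0.02, 0.05, 0.05, 0.11, 0.30, 0.95]  # per-sample KLD values (made up)
delta0 = 0.6                                  # worst-case divergence (assumed)

def D(delta):
    return sum(1 for v in divs if v <= delta) / len(divs)

n = 10000  # midpoint rule with many points, instead of adaptive Simpson
mu = sum(D((i + 0.5) * delta0 / n) for i in range(n)) / n
print(round(mu, 3))  # → 0.686
```

A perfect model (all divergences equal to zero) would give an area equal to the full δ₀ rectangle, hence μ = 1; the 1/δ₀ factor is exactly this normalization.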

Comment 2: Setting δ₀ as the expected KL divergence between the label distribution and the vector v is somewhat heuristic. It is recommended to give more explanation or conduct more exploratory experiments.

Response: Our theoretical analysis shows that δ₀ reflects the worst-case divergence. Therefore, values larger than δ₀ (for distance metrics) would imply tolerating worse-than-random errors, which does not make much sense. While smaller δ could be explored, such choices require strong assumptions about training difficulty (e.g., label noise), i.e., we cannot decide how small δ should be. Without such prior knowledge, δ₀ provides a neutral starting point. We provide results of 𝔇(KL, δ; f) w.r.t. δ on SBU_3DFE & Natural_Scene, where f always outputs a uniform label distribution matrix.

SBU_3DFE:

| δ | 0.02 | 0.04 | 0.06 | 0.0851 (δ₀ of SBU_3DFE) | 0.1 | 0.12 | 0.14 | 1.172 (δ₀ of Natural_Scene) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 𝔇 | .156 | .314 | .450 | .594 | .665 | .732 | .787 | 1.000 |

Natural_Scene:

| δ | 0.0851 (δ₀ of SBU_3DFE) | 0.6 | 0.8 | 1.0 | 1.172 (δ₀ of Natural_Scene) | 1.4 | 1.6 | 1.8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 𝔇 | .000 | .238 | .240 | .481 | .650 | .662 | .777 | .797 |

Key findings: 1) 𝔇(KL, δ; f₀) is highly sensitive to changes in δ (non-linear response); 2) optimal δ ranges vary significantly across datasets, so threshold settings cannot be transferred between them. This sensitivity highlights the dangers of ad-hoc parameter choices, justifying our pursuit of a parameter-free solution.

Comment 3: How can we determine which samples are difficult to learn? Perhaps a warm-up process, or gradually changing the hyperparameter δ, would help.

Response: The identification of difficult-to-learn samples is gradually facilitated by the "ReLU + margin" mechanism during training, since initializing δ to δ₀ is sufficiently inclusive, allowing the model to first learn from easier patterns. While we prioritize simplicity and avoid introducing additional hyperparameters, we acknowledge that a warm-up strategy or adaptive scheduling for δ could be explored in future work.

Comment 4: In Remark 2.3, please explain why a closed-form solution of Eq. (8) does not exist.

Response: The partial derivative of Equation (8) is given by:

$$\frac{\partial \ell_{\text{MLE}}}{\partial \alpha_k} = -m \left( \Phi \Big( \sum_{j=1}^{c} \alpha_j \Big) - \Phi (\alpha_k) \right) - \sum_{i=1}^{m} \ln d_{\boldsymbol{x}_i}^{d_k}\text{,}$$

where $\Phi(x) = \frac{\mathrm{d} \ln \Gamma(x)}{\mathrm{d} x}$ is the digamma function. The equation exhibits highly nonlinear behavior and implicitly forms a globally coupled nonlinear system, making a closed-form solution of Equation (8) intractable.
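A minimal numeric sketch of this gradient (with the digamma function approximated by a central difference of `math.lgamma`; the Dirichlet parameters and data below are made up) makes the coupling visible:

```python
import math

def digamma(x, h=1e-6):
    # Phi(x) = d ln Gamma(x) / dx, approximated by a central difference of lgamma
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def grad(alpha, data):
    # The rebuttal's gradient: -m (Phi(sum_j alpha_j) - Phi(alpha_k)) - sum_i ln d_i[k]
    m, s = len(data), sum(alpha)
    return [-m * (digamma(s) - digamma(a)) - sum(math.log(d[k]) for d in data)
            for k, a in enumerate(alpha)]

alpha = [1.0, 2.0, 3.0]                    # Dirichlet parameters (made up)
data = [[0.2, 0.3, 0.5], [0.1, 0.4, 0.5]]  # observed label distributions (made up)
print(grad(alpha, data))
```

Every component of the gradient depends on Σⱼ αⱼ through Φ, so setting the gradient to zero yields a coupled nonlinear system in which no αₖ can be isolated; in practice one resorts to iterative schemes (e.g., Newton or fixed-point updates) rather than a closed form.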

Official Review (Rating: 3)

This paper addresses the issues in Label Distribution Learning (LDL), notably the limitations of Kullback–Leibler Divergence (KLD) as both an evaluation metric and learning objective. The authors propose DeltaLDL, a novel framework that introduces the concept of "approximately correct" label distributions, aiming to measure and optimize the percentage of samples predicted within a reasonable distance from the ground-truth distributions. This is operationalized as both a new evaluation metric (µ) and a differentiable learning objective. The authors provide theoretical analysis to support their approach, including a critique of KLD's properties via Dirichlet distribution modeling, and propose an algorithm δ-LDL that integrates these ideas. The method is evaluated on a variety of standard LDL datasets, showing improved performance across multiple metrics, and experiments include ablations and robustness tests to validate the effectiveness of their contributions.

update after rebuttal

Thank the authors for the rebuttal, which has addressed my concerns and I agree to raise my score.

Questions for Authors

  • Training Time and Complexity: Could you provide quantitative comparisons of training time between δ-LDL and standard KLD-based LDL methods, especially for larger datasets?

  • Robustness to Distributional Assumptions: How sensitive is the method to deviations from the Dirichlet assumption for the ground-truth label distributions?

  • Parameter Sensitivity: While the method is termed δ-parameter-free, there remain parameters (e.g., tolerance in ASR). How do these affect performance, and are there guidelines for setting them?

  • Generalization: Have you explored the applicability of the µ metric beyond the LDL framework, such as in multi-label learning or other prediction problems with distributional outputs?

Claims and Evidence

The claim that "DeltaLDL, based on approximately correct predictions, offers a more discriminative and robust framework." is problematic:

The definition of "approximately correct" relies on an empirically-derived threshold (δ) whose semantic meaning across domains is underexplored, making the generalizability of these claims less certain without additional justification or sensitivity analysis.

Methods and Evaluation Criteria

The proposed methods and evaluation criteria make sense within the LDL field, focusing on nuanced distributional accuracy rather than strict one-hot labeling. The replacement of KLD with DeltaLDL as both a metric and loss function is well motivated. The use of adaptive Simpson's rule for optimizing the objective is technically creative and well-integrated.

However, while the datasets used (e.g., M2B, SBU 3DFE, fbp5500) are standard in LDL, the lack of evaluation on more challenging or modern datasets (e.g., higher-dimensional, more skewed distributions, language models, or image classification) is a limitation. Given the paper's ambitious claims about general improvements to LDL, it would strengthen the work to see tests beyond traditional benchmarks, such as larger-scale datasets in domains like NLP or vision where distributional representations are crucial.

Runtime and complexity are not evaluated: while the method is linear in key dimensions, the use of adaptive Simpson's rule introduces overhead. A quantitative runtime analysis would help assess practical viability.

Theoretical Claims

The theoretical critique of KLD via Dirichlet expectations is sound and well-developed, including the calculation of expected KLD values under stochastic models. The properties of DeltaLDL (e.g., monotonicity, normalization) are clearly stated and appear mathematically correct. However, the threshold δ that underlies DeltaLDL is only heuristically defined, and its theoretical grounding (e.g., optimality properties, sensitivity) is not fully explored. While the integral-based µ metric attempts to mitigate the need for precise δ selection, further theoretical discussion on this aspect would be helpful. Moreover, the assumptions (e.g., Dirichlet distribution for ground truth) could be limiting in real-world scenarios.

Experimental Design and Analysis

The experimental design is comprehensive within LDL:

  • Comparison against strong baselines.
  • Multiple datasets and repeated runs.
  • Robustness evaluations against noise.

Nevertheless, there are some missing experimental angles:

  1. Impact of δ and sensitivity analysis: Since δ plays a crucial role, a more detailed empirical analysis of how δ affects outcomes across datasets would be valuable.
  2. Lack of real-world downstream tasks: It would be valuable to see if improvements in label distribution predictions translate into improvements in downstream tasks like multi-label classification, especially in ambiguous domains like NLP or vision where label semantics are inherently overlapping.

Supplementary Material

The core paper is detailed, and appendices (e.g., for theoretical proofs and additional experiments) seem well-structured. However, review of the supplementary material in detail has not been completed at this stage.

Relation to Broader Literature

The paper is well-situated within the LDL literature, offering clear advancement over previous KLD-based methods. However, connections to broader multi-label and distributional learning paradigms are missing. Particularly: Relating the proposed method to multi-label classification [1, 2], distribution-based labels (soft targets) [3] could highlight broader applicability.

[1] Zhu, Feng, et al. "Learning spatial regularization with image-level supervisions for multi-label image classification." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
[2] Bi, Wei, and James Kwok. "Efficient multi-label classification with many labels." International Conference on Machine Learning. PMLR, 2013.
[3] Zhang, Chang-Bin, et al. "Delving deep into label smoothing." IEEE Transactions on Image Processing 30 (2021): 5984-5996.

Essential References Not Discussed

No.

Other Strengths and Weaknesses

Strengths:

  • Introducing an “approximately correct” metric and corresponding loss function is a novel contribution.

  • The theoretical analysis is thorough and offers insight into why KLD may be problematic in LDL contexts.

  • The experiments cover several datasets and compare many baselines, demonstrating consistent trends across different conditions.

Weaknesses:

  • The additional computational overhead due to ASR and the smoothing function is not well quantified.
  • The assumptions (e.g., Dirichlet-distributed ground truths) might limit the method’s applicability in real-world settings where distributions deviate from these assumptions.

Other Comments or Suggestions

Broader Applicability: Discussion on how the proposed ideas might extend to other related tasks (e.g., multi-label learning) would be beneficial.

Author Response

Many thanks for your comprehensive comments! Responses to your concerns are as follows.

Response about δ: Our theoretical analysis shows that δ₀ reflects the worst-case divergence. Therefore, values larger than δ₀ (for distance metrics) would imply tolerating worse-than-random errors, which does not make much sense. While smaller δ could be explored, such choices require strong assumptions about training difficulty (e.g., label noise), i.e., we cannot decide how small δ should be. Without such prior knowledge, δ₀ provides a neutral starting point. We provide results of 𝔇(KL, δ; f) w.r.t. δ on SBU_3DFE & Natural_Scene, where f always outputs a uniform label distribution matrix.

SBU3DFE\mathtt{SBU\\_3DFE}:

| δ | 0.02 | 0.04 | 0.06 | 0.0851 (δ_0 of SBU_3DFE) | 0.1 | 0.12 | 0.14 | 1.172 (δ_0 of Natural_Scene) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| D(KL, δ; f_0) | .156 | .314 | .450 | .594 | .665 | .732 | .787 | 1.000 |

Natural_Scene:

| δ | 0.0851 (δ_0 of SBU_3DFE) | 0.6 | 0.8 | 1.0 | 1.172 (δ_0 of Natural_Scene) | 1.4 | 1.6 | 1.8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| D(KL, δ; f_0) | .000 | .238 | .240 | .481 | .650 | .662 | .777 | .797 |

Key findings: 1) D(KL, δ; f_0) is highly sensitive to changes in δ (a non-linear response); 2) reasonable ranges of δ vary significantly across datasets, so a value tuned on one dataset does not transfer to another. This sensitivity highlights the danger of ad-hoc parameter choices, justifying our pursuit of a parameter-free solution.
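To make the quantity concrete: D(KL, δ; f) is the fraction of test instances whose KL divergence to the ground truth falls within δ, i.e., an empirical CDF evaluated at threshold δ, and δ_0 is the average KL divergence to a uniform distribution. The following is a minimal sketch of our own (function names and the KL direction are our assumptions, not the paper's implementation):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # Row-wise KL(p || q) between label distribution matrices.
    p, q = np.clip(p, eps, None), np.clip(q, eps, None)
    return np.sum(p * np.log(p / q), axis=-1)

def delta_ldl(y_true, y_pred, delta):
    # Fraction of predictions within tolerance delta: the empirical CDF
    # of per-instance divergences, evaluated at threshold delta.
    return float(np.mean(kl_divergence(y_true, y_pred) <= delta))

def delta_0(y_true):
    # Reference threshold: mean KL divergence from the ground-truth
    # distributions to the uniform label distribution.
    uniform = np.full_like(y_true, 1.0 / y_true.shape[1])
    return float(np.mean(kl_divergence(y_true, uniform)))

# Toy data: ground truths sampled from a Dirichlet, plus the trivial
# predictor f_0 that always outputs the uniform distribution.
rng = np.random.default_rng(0)
y_true = rng.dirichlet(np.ones(3), size=50)
y_uniform = np.full_like(y_true, 1.0 / 3.0)

d0 = delta_0(y_true)
print(delta_ldl(y_true, y_uniform, d0))      # coverage of f_0 at delta_0
print(delta_ldl(y_true, y_uniform, d0 / 2))  # tighter tolerance, lower coverage
```

Sweeping delta and re-evaluating delta_ldl reproduces the kind of sensitivity curve tabulated above: the value is non-linear in δ, and the meaningful range of δ is tied to the δ_0 of the specific dataset.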

Response about ε: The tolerance parameter ε and the maximum recursion depth ξ both ultimately control iteration limits. However, ε is conventionally set to a fixed small value (1e-7), following numerical-analysis practice. We instead focus on the more adjustable ξ, analyzed in Fig. 2(a).

Response about downstream tasks: We appreciate this constructive suggestion, but directly applying LDL to downstream tasks risks objective mismatch (W. & G., '19; '21b). However, we can conduct additional vision experiments in the context of pure LDL.

Setup:

  • Dataset: JAFFE (256 × 256 facial images).
  • Arch.: ResNet-50 backbone.
  • Training: 10-fold CV, Adam optimizer (lr=1e-4, bs=32, 100 epochs).
  • Baseline: AA-BP (G., '16) with the same configuration.

Results (bold = better):

| | Cheby. | Clark | Can. | KLD | Cosine | Int. | Spear. | μ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AA-BP | .0469 | **.1651** | **.3368** | .0139 | .9858 | .9421 | .7949 | 85.10% |
| δ-LDL | **.0454** | .1654 | .3369 | **.0122** | **.9880** | **.9431** | .7949 | **85.83%** |

δ-LDL achieves better (↑0.73% in μ) or comparable results.

Response about the complexity: We provide runtime comparisons (δ-LDL vs. baselines on SBU_3DFE). Results:

| Method | δ-LDL | DF-LDL | LDLF | SCL | LRR | DPA | LDLLC | LCLR | SA-BFGS | LDLSF | AA-kNN | PT-Bayes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Relative runtime | 1x | .16x | 1.11x | 1.02x | .07x | .06x | .06x | .21x | .01x | 3.95x | – | .001x |

Runtimes are normalized to δ-LDL's execution time (1x) for direct comparison. δ-LDL shows training times similar to the SCL/LDLF methods. Note that such comparisons may not be entirely fair, since PT-Bayes, AA-kNN, LDLSF, SA-BFGS & LCLR do not employ deep learning architectures in their implementations.

Response about the assumption: While we use Dirichlet distributions in the theoretical analysis for their mathematical tractability and natural fit to the probability simplex, these assumptions are relaxed in the implementation. Precisely because real-world scenarios may deviate from them, our method does not strictly require this assumption in practice; it is empirically robust to distribution-free label structures, as demonstrated in the experiments.

Response about the applicability: We appreciate the suggestion to explore broader applicability. However, the continuous nature of label distributions differs fundamentally from the discrete logic of multi-label annotations, making DeltaLDL and its derivatives (μ & δ-LDL) better suited to pure LDL tasks. Let's discuss some examples:

  • Some MLL evaluation metrics, such as subset accuracy, inherently involve discretization in their computation. These metrics must be calculated on mini-batch test sets and are unsuitable as optimization objectives, making them irrelevant to the core problems we aim to address.
  • However, metrics like Hamming/Jaccard could potentially be applicable. When approximating δ_0 for these cases, one would need to substitute the uniform vectors with a random binary matrix and modify the margin mechanism for discrete outputs.

Multi-label metrics often prioritize accuracy over distributional fidelity, conflicting with LDL’s paradigm. This presents an interesting and meaningful direction for future work.
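As a purely speculative illustration of the Hamming-based adaptation described above (all names and design choices are our assumptions, not part of the paper): the uniform reference vectors would be replaced by a random binary matrix, and the per-instance divergence by a normalized Hamming distance.

```python
import numpy as np

def hamming_rows(y, z):
    # Per-instance normalized Hamming distance between binary label matrices.
    return np.mean(y != z, axis=-1)

def delta_0_hamming(y_true, seed=0):
    # Hypothetical MLL analogue of delta_0: mean Hamming distance from the
    # ground truth to a random binary matrix (replacing the uniform vectors).
    rng = np.random.default_rng(seed)
    random_pred = rng.integers(0, 2, size=y_true.shape)
    return float(np.mean(hamming_rows(y_true, random_pred)))

def delta_mll(y_true, y_pred, delta):
    # Fraction of instances whose Hamming distance is within delta.
    return float(np.mean(hamming_rows(y_true, y_pred) <= delta))
```

Discrete outputs would still require the modified margin mechanism mentioned above for use as a training objective, which this sketch does not attempt.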

Final Decision

The paper studies the problem of metric design for label-distribution learning problems. The motivation is to rectify certain issues with the use of the KL divergence for this problem, notably its limited discriminative ability in the case of ground-truth distributions being drawn from a Dirichlet distribution. The key idea is to propose a normalized version of the KL divergence (Equation 9), analogous to the empirical CDF at a fixed threshold. It is proposed to further consider a suitable average over a range of thresholds (Equation 11) as the final evaluation metric. A suitable convex relaxation (Equation 14, 17) then provides the metric for training.
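One plausible reading of this construction (our sketch under assumed conventions, not the authors' code): Equation 9 acts as the empirical CDF of per-instance KL divergences at a given threshold, and Equation 11 averages that CDF over thresholds up to δ_0, yielding a normalized area that equals 1 for a perfect predictor.

```python
import numpy as np

def kl_rows(p, q, eps=1e-12):
    # Row-wise KL(p || q) between label distribution matrices.
    p, q = np.clip(p, eps, None), np.clip(q, eps, None)
    return np.sum(p * np.log(p / q), axis=-1)

def mu_metric(y_true, y_pred, n_thresholds=200):
    # Average the empirical CDF of per-instance divergences over thresholds
    # in (0, delta_0], where delta_0 is the mean KL from the ground truths
    # to the uniform distribution.
    uniform = np.full_like(y_true, 1.0 / y_true.shape[1])
    delta_0 = np.mean(kl_rows(y_true, uniform))
    div = kl_rows(y_true, y_pred)
    thresholds = np.linspace(0.0, delta_0, n_thresholds + 1)[1:]
    return float(np.mean([np.mean(div <= t) for t in thresholds]))
```

A perfect predictor has zero divergence on every instance, so the CDF equals 1 at every positive threshold and μ = 1; the trivial uniform predictor scores strictly below 1.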

Reviewers were generally supportive of the paper, finding it to provide an interesting new perspective and contribution to the LDL problem, with both theoretical and empirical grounding. The main concerns were:

(1) the precise choice of threshold -- i.e., δ_0, the average KL divergence to a uniform vector, used to normalize the proposed metric. The author response clarified how this is used to express the ratio of the proposed area metric against that of a perfect predictor, thus representing a normalized area with 1 denoting a perfect predictor.

(2) the empirical results not including larger-scale settings, wherein the computational costs of the method could be an issue. In the response, the authors added an additional vision dataset and runtime comparisons. The authors also argued that further experiments in downstream tasks might introduce objective mismatch.

We believe the concerns on the latter point are valid, and may somewhat limit the scope of the work in its present state. Nonetheless, in the context of the existing LDL literature, the work has provided an interesting new perspective with both conceptual and empirical contributions. We thus believe it likely to be of interest and value to the community, and recommend its acceptance. The authors are encouraged to incorporate points from the reviewer discussion in preparing their final manuscript.