Entropy-Calibrated Label Distribution Learning
Abstract
Reviews and Discussion
This paper points out that existing label distribution learning (LDL) algorithms underperform on low-entropy samples and that existing LDL evaluation metrics cannot adequately capture prediction performance across samples with different entropy levels. To address these two problems, the paper theoretically analyzes the underlying reasons for the underperformance on low-entropy samples, designs a regularization term according to the theoretical results, and proposes a novel evaluation method for label distribution learning.
Strengths and Weaknesses
Strengths: 1) This paper presents the problem of underperformance on low-entropy samples and the bias of evaluation metrics in existing LDL studies; both problems are widespread and seriously affect downstream decision-making tasks because misjudging low-entropy samples is usually costly. 2) This paper analyzes the causes of the underperformance of existing LDL algorithms on low-entropy samples from both theoretical and empirical viewpoints. The theoretical and visualization results reliably reveal the relationship between the hypothesis space and the low-entropy underperformance of the widely adopted softmax model, which is instructive for the design of subsequent LDL algorithms. 3) This paper proposes a regularization term based on the theoretical analysis, gives several mathematical properties of the regularization term, and verifies its validity through extensive experiments. 4) This paper proposes a new evaluation method that can effectively capture the prediction performance of samples with different levels of entropy.
Weaknesses: 1) The experimental configurations are not adequately described, such as the specific configuration of the optimizer and the iteration stopping conditions. 2) The writing could be further improved: the left and right parentheses in line 196 are mismatched; the sentence in line 197 lacks a period; in the subfigure captions of Fig. 5, Figs. 5(b) and 5(c) report both the entropy mean and the entropy variance of the dataset while Fig. 5(a) reports only the mean; the paragraph subheadings before line 267 end with a period while those after line 267 do not, which is inconsistent; and the dataset name shown inside Figure 1 does not match the dataset name in its caption.
Questions
- In order to make the anchors as dispersed as possible, why not just minimize the inner product of the anchors instead of minimizing the cosine similarity of the two anchors? After all, cosine similarity imposes a greater computational burden than the inner product.
- Are there some empirical suggestions for hyperparameter selection?
Limitations
yes
Final Justification
The authors have addressed my concerns and thus I maintain my previous rating.
Formatting Concerns
NA
Dear Reviewer Z4dW,
Thank you for your positive assessment of our work. We sincerely appreciate the time and effort that you have dedicated to evaluating our work. We have carefully considered all the comments and have revised the paper accordingly. Below, we provide a point-by-point response to your suggestions.
Weakness 1: The experimental configurations of this paper are not adequately shown, such as the specific configuration of the optimizer used and the iteration stopping conditions.
We have added more necessary information about the experimental configurations. The optimizer used in our paper is the L-BFGS algorithm implemented in PyTorch. The code is shown as follows:
```python
import torch

# L-BFGS with strong-Wolfe line search; `...` stands for the parameters to
# optimize (kept elided here, as in the original snippet).
optimizer = torch.optim.LBFGS(
    ...,
    lr=1e-3,
    max_iter=1000,
    max_eval=None,
    tolerance_grad=1e-5,
    tolerance_change=1.4901161193847656e-8,
    history_size=5,
    line_search_fn='strong_wolfe',
)
```
The above code shows that the optimization process stops when the change in the loss function is smaller than 1.4901161193847656e-8 or the gradient norm is smaller than 1e-5.
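For reference, a minimal usage sketch follows (the `model`, `features`, and `target_dist` names are illustrative placeholders, not from our code); PyTorch's L-BFGS re-evaluates the loss through a closure passed to `step`:

```python
import torch
import torch.nn.functional as F

# Illustrative placeholders: `model` maps features to logits, `features` is
# the input batch, and `target_dist` holds the ground-truth label distributions.
def run_lbfgs_step(model, optimizer, features, target_dist):
    def closure():
        optimizer.zero_grad()
        log_pred = F.log_softmax(model(features), dim=1)
        # KL divergence between predicted and true label distributions
        loss = F.kl_div(log_pred, target_dist, reduction='batchmean')
        loss.backward()
        return loss
    return optimizer.step(closure)
```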
Weakness 2: The writing of the paper could be further improved.
We have improved the writing of our paper. For example, we have added a left parenthesis before "Art Painting"; we have added a period at the end of the sentence in line 197; we have added the entropy variance to the caption of Fig. 5(a); and we have unified the use of periods in the paragraph subheadings.
Question 1: In order to make the anchors as dispersed as possible, why not just minimize the inner product of the anchors instead of minimizing the cosine similarity of the two anchors?
The inner product of anchor vectors also depends on their norms, which undermines its reliability in preventing over-uniformity. Mathematically, the inner product between two anchor vectors equals their cosine similarity multiplied by the product of their norms. Consequently, minimizing the inner product could result merely from reduced vector norms rather than improved angular dispersion (i.e., decreased cohesion among anchor vectors).
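A tiny numerical sketch of this point (illustrative values, not from the paper):

```python
import torch
import torch.nn.functional as F

a = torch.tensor([1.0, 1.0])
b = torch.tensor([1.0, 0.9])

# Shrinking the norms drives the inner product toward zero
# while the angle (cosine similarity) stays unchanged.
for scale in (1.0, 0.1, 0.01):
    u, v = scale * a, scale * b
    inner = torch.dot(u, v).item()
    cosine = F.cosine_similarity(u, v, dim=0).item()
    print(f"scale={scale}: inner={inner:.4f}, cosine={cosine:.4f}")
```

The inner product shrinks with the norms even though the angular dispersion never improves, so a norm-invariant quantity such as cosine similarity is the more reliable regularization target.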
Question 2: Are there some empirical suggestions for hyperparameter selection?
As analyzed in lines 249-254 of our paper, we recommend the default value identified there for most datasets, as it approaches optimal performance across diverse scenarios. For further performance optimization, we suggest the following empirical guidelines:
- When the training set contains predominantly high-entropy label distributions, consider a smaller value.
- For datasets with predominantly low-entropy label distributions, larger values are preferable.
This paper addresses the problem of entropy bias in Label Distribution Learning (LDL), where models tend to perform poorly on low-entropy samples due to overly cohesive anchor vectors. To mitigate this, the authors propose an Inter-anchor Angular Regularization (IAR) term that penalizes anchor vectors with small angular separation, thereby enhancing the model’s ability to represent low-entropy samples. Additionally, they introduce an Entropy-Calibrated Aggregation (ECA) strategy for fairer model evaluation by separately assessing performance on low- and high-entropy subsets. Experimental results demonstrate the effectiveness of the proposed methods.
Strengths and Weaknesses
Strengths
- The paper tackles the entropy bias in LDL, which is interesting.
- The proposed method is built upon both empirical and theoretical analyses, and is well motivated.
- This work also proposes a new evaluation mechanism termed ECA for fairer model evaluation, which sounds reasonable.
Weaknesses
- In line 42, "the samples with high-entropy" should be "the samples with low-entropy".
- For better clarity, it would be beneficial to mention the term "ECA" in Section 4.
- In the main text, it would be beneficial to reference the Appendices explicitly. For example, after presenting Theorem 3.1, a note should be added indicating that its proof is provided in Appendix A.
Questions
- What is the form of the prediction function? Does it use only anchor vectors as classifier weights?
- Are the anchor vectors treated as learnable parameters? It would be helpful to clarify the background of LDL.
Limitations
The limitations are included in Section 6.
Final Justification
All of my concerns have been addressed, and I am inclined to maintain a positive score for this work.
Formatting Concerns
No.
Dear Reviewer zUMv,
Thank you for your positive assessment of our work. We sincerely appreciate the time and effort that you have dedicated to evaluating our work. We have carefully considered all the comments and have revised the paper accordingly. Below, we provide a point-by-point response to your suggestions.
Weakness 1: In line 42, "the samples with high-entropy" should be "the samples with low-entropy".
As suggested, we have corrected this typo in the revised version.
Weakness 2: For better clarity, it would be beneficial to mention the term "ECA" in Section 4.
As suggested, we have added an introductory paragraph about ECA at the beginning of Section 4 to elucidate its objectives and main concepts. The details are as follows. In this section, we illustrate the proposed ECA (Entropy-Calibrated Aggregation) strategy to address the entropy bias in conventional model performance evaluation methods. Following the divide-and-conquer principle, the main idea underlying ECA is to evaluate model performance separately on the low-entropy and high-entropy subsets of the test set.
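As an illustration, a schematic sketch of this evaluation for a single fixed threshold (the function names are ours, and equally weighting the two subset means is a simplification; the full strategy aggregates over threshold values, as discussed later in this thread):

```python
import numpy as np

def eca_score(metric_vals, entropies, threshold):
    """Schematic ECA: average a per-sample metric separately over the
    low-entropy and high-entropy subsets of the test set, then combine
    the two subset means with equal weight."""
    metric_vals = np.asarray(metric_vals)
    entropies = np.asarray(entropies)
    low = metric_vals[entropies <= threshold]
    high = metric_vals[entropies > threshold]
    if len(low) == 0 or len(high) == 0:  # degenerate split: plain mean
        return float(metric_vals.mean())
    return 0.5 * (float(low.mean()) + float(high.mean()))
```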
Weakness 3: In the main text, it would be beneficial to reference the Appendices explicitly. For example, after presenting Theorem 3.1, a note should be added indicating that its proof is provided in Appendix A.
As suggested, we have added the necessary references to the Appendices in the main text. For example, the proof of Theorem 3.1 is provided in Appendix A; the introduction of the datasets is provided in Appendix B.1; more experimental analyses are provided in Appendix B.2.
Question 1: What is the form of the prediction function? Does it use only anchor vectors as classifier weights?
The prediction function in the paper is a multivariate function that maps a feature vector to a label distribution. Specifically, it applies a softmax normalization over the dot products between the feature vector $x$ and the anchor vectors $w_1,\dots,w_c$. The $j$-th output is $p_j(x)=\frac{\exp(w_j^\top x)}{\sum_{k=1}^{c}\exp(w_k^\top x)}$, where $w_j^\top x$ is the dot product between the anchor vector $w_j$ and the feature vector $x$, and the softmax normalization ensures the output is a valid probability distribution.
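For concreteness, a minimal sketch of this prediction function (variable names are ours, not from the paper):

```python
import torch

def predict_label_distribution(x, anchors):
    """x: feature vector of shape (d,); anchors: matrix of shape (c, d)
    whose rows are the anchor vectors w_1, ..., w_c.
    Returns a length-c label distribution via softmax over dot products."""
    logits = anchors @ x  # logits[j] = w_j^T x
    return torch.softmax(logits, dim=0)
```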
The prediction function does not use only anchor vectors as classifier weights. It is theoretically compatible with various representation learning techniques, serving as a classifier when combined with architectures like convolutional neural networks for image classification or graph convolutional networks for node classification. Specifically, in typical label distribution learning (LDL) models, the output can be formulated as $\hat{p}=\mathrm{softmax}(Wz)$, where $z$ represents the feature vector of a sample and the rows of $W$ are the anchor vectors. In deep learning scenarios, $z$ is usually derived by passing the raw feature vector through a feature extraction network, whereas in non-deep learning settings, $z$ is often directly set to the raw feature vector. However, in our experiments, we deliberately abstain from incorporating representation learning techniques and instead employ anchor vectors exclusively as classifier weights. This choice is motivated by two key considerations: (1) most available label distribution datasets are tabular in nature, rendering complex neural network architectures unnecessary; (2) to ensure a fair comparison with state-of-the-art baseline methods, which predominantly do not adopt representation learning techniques, we also avoid them. Thus, our practical implementation in the experiments relies solely on anchor vectors as classifier weights.
Question 2: Are the anchor vectors treated as learnable parameters? It would be helpful to clarify the background of LDL.
Yes, the anchor vectors are indeed treated as learnable parameters that play a fundamental role in the model's architecture. These vectors are optimized by minimizing the Kullback-Leibler (KL) divergence between the predicted label distributions and the ground-truth distributions, as formalized in Equation (3) of the paper. The learning process is further refined through the proposed Inter-Anchor Angular Regularization (IAR) term, which explicitly penalizes excessive cohesion between anchor vectors by minimizing their pairwise cosine similarities.
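For illustration, a simplified sketch of this objective (the penalty below is the mean pairwise cosine similarity; the paper's exact IAR term may differ, e.g., by penalizing only over-small angles, and all names here are ours):

```python
import torch
import torch.nn.functional as F

def iar_penalty(anchors):
    """Mean pairwise cosine similarity among the rows of `anchors` (c, d).
    Simplified stand-in for the IAR term: smaller values mean more
    angularly dispersed anchor vectors."""
    w = F.normalize(anchors, dim=1)   # unit-norm rows
    cos = w @ w.t()                   # (c, c) pairwise cosines
    c = anchors.shape[0]
    return cos[~torch.eye(c, dtype=torch.bool)].mean()

def training_loss(log_pred, target_dist, anchors, lam):
    """KL fit term plus the angular regularizer, weighted by `lam`."""
    kl = F.kl_div(log_pred, target_dist, reduction='batchmean')
    return kl + lam * iar_penalty(anchors)
```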
LDL (Label Distribution Learning) is an approach to accurately estimating the entire conditional distribution of labels from a set of feature variables. This task has received increasing attention in both statistics and machine learning, as information about the entire distribution is crucial in scenarios that are sensitive to risk, extremes, or uncertainty, such as drug efficacy prediction or emotion recognition. Various techniques, such as model calibration or mixture density networks, can be utilized to estimate the entire conditional distribution from training samples labeled only with the mean or mode of the underlying true conditional distribution, which is beneficial in tasks where true label distributions are unavailable. However, there remain many real-world scenarios where the true distributions are readily available. To address such scenarios, LDL learns a multivariate regressor that maps a set of feature variables to the entire conditional distribution of labels from a training set in which each instance is labeled with a label distribution. Compared to the cases without true label distributions, LDL can predict the entire conditional distribution of labels more accurately, as it is directly supervised by the true label distributions.
Thank you for your rebuttal. All of my concerns have been addressed, and I am inclined to maintain a positive score for this work.
This paper identifies and addresses the "entropy bias" in Label Distribution Learning (LDL), where existing algorithms perform poorly on low-entropy samples, which are crucial for decision-making. The authors propose two key innovations:
-
Inter-anchor Angular Regularization (IAR): A regularization term that penalizes overly cohesive anchor vectors to better capture low-entropy distributions.
-
Entropy-Calibrated Aggregation (ECA): An evaluation strategy that balances performance assessment across low- and high-entropy test samples.
The paper offers both theoretical analysis (including an entropy lower bound theorem) and extensive experiments on eight datasets, demonstrating consistent performance gains, especially on low-entropy samples.
Strengths and Weaknesses
Strengths:
- Novel identification of entropy bias in LDL, a previously overlooked but practically important issue.
- Strong theoretical analysis supporting the IAR formulation, with well-structured theorems and proofs.
- Comprehensive experiments, including multiple baselines, evaluation metrics, ablations, and significance testing.
Weaknesses:
- The ECA evaluation section is mathematically overloaded and may benefit from simplification or partial relocation to the appendix.
- There is limited discussion on applicability to non-anchor-based architectures such as Transformer-based LDL models.
Questions
- Can IAR be extended to architectures without explicit anchor vectors, such as attention-based or Transformer models?
- How sensitive is the performance to the choice of entropy threshold in ECA? Would a learned threshold work better?
Limitations
yes
Final Justification
All my concerns have been resolved, and I am now inclined to recommend acceptance of the paper.
Formatting Concerns
The paper formatting appears to meet conference standards with no major concerns.
Dear Reviewer 21xv,
Thank you for your positive assessment of our work. We sincerely appreciate the time and effort that you have dedicated to evaluating our work. We have carefully considered all the comments and have revised the paper accordingly. Below, we provide a point-by-point response to your suggestions.
Question 1: Can IAR be extended to architectures without explicit anchor vectors, such as attention-based or Transformer models?
Our current theoretical framework and methodology are specifically designed for models with explicit anchor vectors, as the core mechanism relies on regularizing the angular separation between these vectors. Therefore, our method is not applicable to models that do not involve anchor vectors. As elaborated in Section 6 of the main text, the proposed inter-anchor angular regularization (IAR) term is not directly compatible with tree-based LDL algorithms, since their learning process does not incorporate anchor vectors. On the other hand, our approach can in fact be integrated with Transformer or attention-based architectures, as the IAR term solely constrains the weight matrix of the network's output layer.
Question 2: How sensitive is the performance to the choice of entropy threshold in ECA? Would a learned threshold work better?
Actually, the entropy threshold does not participate in the model learning process; rather, it is a parameter of the model evaluation phase. The choice of this threshold can indeed significantly impact the reported performance. For instance, if the threshold is set too small, the majority of test samples are treated as high-entropy samples, thereby biasing the evaluation results toward high-entropy samples. To address this problem, our paper proposes treating the threshold as a distribution and computing the mean performance across all possible threshold values. In practice, a more reasonable approach is to determine the threshold based on specific task requirements, that is, defining what entropy level qualifies a sample as high- or low-entropy according to the application's needs. As for whether a learned threshold could work better, we argue that an adaptive threshold would not be particularly beneficial, since the threshold is purely an evaluation parameter and does not participate in model training.
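As an illustration of this threshold-as-distribution idea, a sketch that averages the split-based score over a uniform grid of thresholds (the uniform choice and all names are our assumptions, not the paper's exact formulation):

```python
import numpy as np

def mean_score_over_thresholds(metric_vals, entropies, n_grid=50):
    """Average the low/high-split score over a grid of thresholds spanning
    the observed entropy range; a uniform threshold distribution is
    assumed purely for illustration."""
    metric_vals = np.asarray(metric_vals)
    entropies = np.asarray(entropies)
    scores = []
    for t in np.linspace(entropies.min(), entropies.max(), n_grid)[1:-1]:
        low = metric_vals[entropies <= t]
        high = metric_vals[entropies > t]
        if len(low) and len(high):
            scores.append(0.5 * (low.mean() + high.mean()))
    return float(np.mean(scores)) if scores else float(metric_vals.mean())
```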
Thank you for your response. My concerns have been addressed, and I will raise my score accordingly.
The authors demonstrate that many Label Distribution Learning (LDL) methods tend to make significantly larger errors on low-entropy instances compared to high-entropy ones, a disparity that standard aggregate metrics often obscure. To address this, they present Theorem 3.1, which links the cohesion—measured by small pairwise angles—of anchor vectors in softmax models to a lower bound on the entropy of the predicted label distributions. This theoretical insight helps explain why low-entropy samples are particularly challenging to model accurately. To mitigate this issue, the authors introduce an Inter-Anchor Angular Regularizer (IAR) that penalizes overly small angles between anchor vectors, encouraging more dispersed and expressive representations. Additionally, they propose a new evaluation metric, Entropy-Calibrated Aggregation (ECA), which separately averages performance on low- and high-entropy test subsets to provide a more nuanced assessment. Experimental results across eight benchmark datasets using four recent LDL baselines plus ridge regression, along with ablation studies (“ER”, “EW”) and a hyperparameter analysis, show that their approach yields substantial improvements on low-entropy samples without degrading performance on high-entropy ones. Overall, their method achieves the best ECA scores on six out of eight datasets.
Strengths and Weaknesses
Strengths:
- Connecting anchor geometry to entropy bias is simple yet convincing; the angle-entropy bound is intuitive and mathematically clean.
- Clear assumptions and full proofs.
Weaknesses:
- All baselines are non-deep or shallow deep models. Demonstrating IAR on a modern backbone (e.g., ViT-LDL) would strengthen the claim of universality.
Questions
See W1
Limitations
yes
Formatting Concerns
N/A
Dear Reviewer r9iZ,
Thank you for your positive assessment of our work. We sincerely appreciate the time and effort that you have dedicated to evaluating our work. We have carefully considered all the comments and have revised the paper accordingly. Below, we provide a point-by-point response to your suggestions.
Question: All baselines are non-deep or shallow deep models. Demonstrating IAR on a modern backbone (e.g., ViT-LDL) would strengthen the claim of universality.
We selected two datasets whose instance features are encoded from raw image information to demonstrate the effectiveness of the proposed IAR term in deep learning models. We apply the proposed IAR term to the ViT-LDL algorithm while maintaining strict consistency with the experimental procedure used in the main text. The next two tables show the performance of the ViT-LDL model with and without the IAR term; the proposed IAR term improves the performance of ViT-LDL in most cases.
| Jaffe | ViT-LDL with IAR | ViT-LDL without IAR |
|---|---|---|
| KL (LEA) | 0.024±0.0159 | 0.0329±0.0167 |
| KL (HEA) | 0.0153±0.003 | 0.0154±0.003 |
| KL (ECA) | 0.02±0.0105 | 0.0244±0.0117 |
| Cosine (LEA) | 0.9784±0.0148 | 0.9699±0.0171 |
| Cosine (HEA) | 0.9851±0.0029 | 0.985±0.0032 |
| Cosine (ECA) | 0.9817±0.0095 | 0.9774±0.0113 |
| Cheb (LEA) | 0.0719±0.0213 | 0.09±0.0251 |
| Cheb (HEA) | 0.0474±0.0039 | 0.0481±0.0047 |
| Cheb (ECA) | 0.0597±0.015 | 0.0693±0.0184 |
| Intersec (LEA) | 0.921±0.0219 | 0.9019±0.0233 |
| Intersec (HEA) | 0.935±0.0048 | 0.9354±0.0056 |
| Intersec (ECA) | 0.9281±0.0154 | 0.9187±0.0175 |
| Emotion6 | ViT-LDL with IAR | ViT-LDL without IAR |
|---|---|---|
| KL (LEA) | 0.1094±0.032 | 0.1091±0.0267 |
| KL (HEA) | 0.1461±0.0533 | 0.1327±0.0307 |
| KL (ECA) | 0.1372±0.0379 | 0.129±0.0234 |
| Cosine (LEA) | 0.9712±0.0038 | 0.9652±0.0068 |
| Cosine (HEA) | 0.9724±0.0056 | 0.9721±0.0048 |
| Cosine (ECA) | 0.9677±0.0065 | 0.966±0.0048 |
| Cheb (LEA) | 0.1118±0.014 | 0.1239±0.0163 |
| Cheb (HEA) | 0.119±0.0197 | 0.1202±0.0174 |
| Cheb (ECA) | 0.1235±0.0156 | 0.1284±0.0118 |
| Intersec (LEA) | 0.8771±0.011 | 0.8652±0.0137 |
| Intersec (HEA) | 0.8675±0.0166 | 0.8665±0.0151 |
| Intersec (ECA) | 0.8635±0.0132 | 0.8596±0.0098 |
I appreciate the authors for providing the experimental results on ViT backbones, which appear very promising. As I am not an expert in the LDL domain, I will refrain from changing my rating. Nonetheless, based on my current understanding and the presented empirical results, I find this paper to be compelling.
This paper finds that excessive cohesion between anchor vectors contributes significantly to the observed entropy bias phenomenon in LDL algorithms. It accordingly proposes an inter-anchor angular regularization term that mitigates cohesion among anchor vectors by penalizing over-small angles. To alleviate the numerical imbalance of high-entropy samples in the test set, it proposes an entropy-calibrated aggregation strategy that obtains the overall model performance by evaluating performance on the low-entropy and high-entropy subsets of the overall test set separately. Experiments demonstrate the effectiveness of the proposed method. The main contributions of this paper are:
-
It analyzes the generation mechanism of entropy bias from both empirical and theoretical perspectives, and consequently proposes an assumption that the underperformance of LDL models on low-entropy samples is significantly driven by the cohesion of anchor vectors.
-
It proposes IAR (i.e., an Inter-anchor Angular Regularization term) to penalize the anchor vectors with over-small angles.
-
It proposes ECA (i.e., an Entropy-Calibrated Aggregation strategy) to calculate the overall model performance.
Strengths and Weaknesses
Strengths
- The problem studied in this paper is interesting and valuable.
- The paper is well-organized, which is easy to follow.
- The theoretical work improves the value of the paper.
Weaknesses
- The paper primarily uses KL divergence and cosine similarity as evaluation metrics. Although these two metrics measure the differences between the model outputs and the true label distribution from different perspectives, they may not fully reflect the model's performance in real-world applications. For instance, in tasks that require accurate prediction confidence or uncertainty estimation, relying solely on KL divergence and cosine similarity may be insufficient to thoroughly assess the model's reliability and effectiveness.
- Although the paper demonstrates the stability of the results through multiple randomized experiments and reports the mean and standard deviation, the analysis of the statistical significance of performance differences across datasets and algorithms is not sufficiently thorough. For example, on some high-entropy datasets, the performance gap between IAR and other algorithms is not significant, while on low-entropy datasets, IAR shows a noticeable performance gap compared to some SOTA algorithms. These issues are not adequately explored or explained. The authors should further analyze the underlying reasons for these phenomena.
- The impact of the hyperparameter on model performance varies across datasets. The authors should further analyze the underlying reasons for this variation and provide practical recommendations for its application.
Questions
- How about the performances of different algorithms on other LDL metrics (e.g., Chebyshev distance, Intersection similarity, etc.)?
- On some high-entropy datasets, the performance gap between IAR and other algorithms is not significant, while on low-entropy datasets, IAR shows a noticeable performance gap compared to some SOTA algorithms. The authors should further analyze the underlying reasons for these phenomena.
- The impact of the hyperparameter on model performance varies across datasets. The authors should further analyze the underlying reasons why this impact varies across datasets, and provide practical recommendations for its application.
Limitations
The authors could further discuss the potential societal impact of the proposed method.
Final Justification
After reading the reviews' comments and the authors' responses, I would like to give a positive rating for this paper.
Formatting Concerns
N/A
Dear Reviewer 3R2w,
Thank you for your constructive comments on our paper. We sincerely appreciate the time and effort that you have dedicated to evaluating our work. We have carefully considered all the comments and have revised the paper accordingly. Below, we provide a point-by-point response to your suggestions.
Question 1: How about the performances of different algorithms on other LDL metrics (e.g., Chebyshev distance, Intersection similarity, etc.)?
Due to space limitations, the paper reports performance only under the KL divergence and cosine similarity measures. We therefore present the performance under Chebyshev distance and intersection similarity in the table below. Due to the character limit for the rebuttal, we only present results under the ECA aggregation strategy for these two metrics, omitting detailed comparisons between low- and high-entropy subsets.
| Algorithm | ECA (Chebyshev Distance) | ECA (Intersection Similarity) | Dataset |
|---|---|---|---|
| IAR | (1) 0.082±0.005 | (1) 0.894±0.005 | Jaffe |
| LDM | (4) 0.099±0.012 | (4) 0.881±0.012 | Jaffe |
| DPA | (6) 0.115±0.012 | (6) 0.852±0.013 | Jaffe |
| FCC | (3) 0.092±0.007 | (3) 0.888±0.006 | Jaffe |
| LRR | (2) 0.092±0.007 | (2) 0.889±0.006 | Jaffe |
| Ridge | (5) 0.115±0.014 | (5) 0.853±0.015 | Jaffe |
| IAR | (1) 0.108±0.003 | (1) 0.871±0.003 | BU-3DFE |
| LDM | (6) 0.119±0.007 | (6) 0.863±0.007 | BU-3DFE |
| DPA | (2) 0.109±0.003 | (2) 0.870±0.003 | BU-3DFE |
| FCC | (5) 0.116±0.003 | (5) 0.865±0.003 | BU-3DFE |
| LRR | (4) 0.115±0.005 | (4) 0.866±0.004 | BU-3DFE |
| Ridge | (3) 0.109±0.003 | (3) 0.870±0.003 | BU-3DFE |
| IAR | (1) 0.313±0.006 | (1) 0.577±0.006 | Natural Scene |
| LDM | (6) 0.362±0.012 | (6) 0.538±0.010 | Natural Scene |
| DPA | (3) 0.325±0.010 | (3) 0.571±0.009 | Natural Scene |
| FCC | (5) 0.358±0.009 | (5) 0.542±0.009 | Natural Scene |
| LRR | (4) 0.336±0.010 | (4) 0.563±0.010 | Natural Scene |
| Ridge | (2) 0.322±0.008 | (2) 0.574±0.008 | Natural Scene |
| IAR | (1) 0.357±0.016 | (1) 0.555±0.014 | Emotion6 |
| LDM | (6) 0.371±0.014 | (6) 0.544±0.012 | Emotion6 |
| DPA | (3) 0.360±0.016 | (2) 0.552±0.013 | Emotion6 |
| FCC | (5) 0.365±0.013 | (4) 0.548±0.012 | Emotion6 |
| LRR | (4) 0.365±0.013 | (5) 0.548±0.012 | Emotion6 |
| Ridge | (2) 0.360±0.017 | (3) 0.552±0.015 | Emotion6 |
| IAR | (1) 0.295±0.049 | (1) 0.581±0.036 | Art Painting |
| LDM | (4) 0.320±0.040 | (4) 0.557±0.027 | Art Painting |
| DPA | (5) 0.344±0.047 | (5) 0.529±0.041 | Art Painting |
| FCC | (3) 0.316±0.043 | (3) 0.562±0.032 | Art Painting |
| LRR | (2) 0.316±0.043 | (2) 0.563±0.032 | Art Painting |
| Ridge | (6) 0.345±0.046 | (6) 0.528±0.039 | Art Painting |
| IAR | (4) 0.078±0.003 | (4) 0.802±0.008 | Music Mood |
| LDM | (3) 0.077±0.004 | (3) 0.803±0.008 | Music Mood |
| DPA | (5) 0.085±0.003 | (5) 0.789±0.010 | Music Mood |
| FCC | (1) 0.076±0.003 | (1) 0.803±0.007 | Music Mood |
| LRR | (2) 0.076±0.003 | (2) 0.803±0.007 | Music Mood |
| Ridge | (6) 0.086±0.004 | (6) 0.789±0.009 | Music Mood |
| IAR | (1) 0.409±0.011 | (1) 0.584±0.010 | M2B |
| LDM | (3) 0.421±0.020 | (3) 0.571±0.021 | M2B |
| DPA | (5) 0.430±0.020 | (5) 0.563±0.019 | M2B |
| FCC | (2) 0.419±0.016 | (2) 0.574±0.016 | M2B |
| LRR | (4) 0.423±0.018 | (4) 0.571±0.018 | M2B |
| Ridge | (6) 0.431±0.020 | (6) 0.562±0.020 | M2B |
| IAR | (1) 0.210±0.024 | (1) 0.749±0.022 | Movie |
| LDM | (4) 0.213±0.023 | (4) 0.747±0.021 | Movie |
| DPA | (6) 0.213±0.025 | (6) 0.746±0.023 | Movie |
| FCC | (3) 0.212±0.024 | (3) 0.748±0.022 | Movie |
| LRR | (2) 0.212±0.024 | (2) 0.748±0.022 | Movie |
| Ridge | (5) 0.213±0.024 | (5) 0.746±0.023 | Movie |
According to the results in the above table, under the Chebyshev distance and intersection similarity metrics our algorithm achieves state-of-the-art performance on the Jaffe, BU-3DFE, Natural Scene, Emotion6, Art Painting, M2B, and Movie datasets. Notably, while it does not attain the best performance on the Music Mood dataset, it remains highly competitive, with performance very close to the top.
Question 2: On some high-entropy datasets, the performance gap between IAR and other algorithms is not significant, while on low-entropy datasets, IAR shows a noticeable performance gap compared to some SOTA algorithms. The authors should further analyze the underlying reasons for these phenomena.
The performance gap is more pronounced on low-entropy datasets than on high-entropy datasets because prediction errors exhibit greater variance on low-entropy datasets than on high-entropy ones. We demonstrate this claim through a data simulation with the following procedure:
- First, we uniformly generate 10,000 label distributions with varying entropy levels.
- For each generated label distribution, we randomly generate 10,000 predicted label distributions and compute their KL divergence, cosine similarity, Chebyshev distance, and intersection similarity with respect to it.
- We then partition the generated label distributions into nine groups based on entropy intervals: [0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1], (1, 1.2], (1.2, 1.4], (1.4, 1.6], (1.6, ∞).
- Finally, we compute the variance of all prediction performance metrics within each group (a condensed sketch of the procedure is given below).
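A condensed sketch of this procedure (for brevity we use a Dirichlet(1, ..., 1) sampler, an arbitrary label count, one random prediction per true distribution, and only the Chebyshev distance; the full procedure draws 10,000 predictions per distribution and computes all four metrics):

```python
import numpy as np

rng = np.random.default_rng(0)
c = 6  # label count, arbitrary for this sketch

# Dirichlet(1, ..., 1) samples uniformly from the probability simplex.
truths = rng.dirichlet(np.ones(c), size=10_000)
preds = rng.dirichlet(np.ones(c), size=10_000)  # one random prediction each
entropy = -np.sum(truths * np.log(truths + 1e-12), axis=1)
cheb = np.abs(truths - preds).max(axis=1)       # Chebyshev distance

bins = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, np.inf]
for lo, hi in zip(bins[:-1], bins[1:]):
    if lo == 0.0:
        mask = (entropy >= lo) & (entropy <= hi)
    else:
        mask = (entropy > lo) & (entropy <= hi)
    if mask.any():
        print(f"({lo}, {hi}]: Chebyshev variance = {cheb[mask].var():.4f}")
```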
The results (shown in the table below) reveal that label distributions with higher entropy exhibit smaller variance in prediction errors. This explains why the performance gap is more noticeable in low-entropy datasets than in high-entropy ones.
| Entropy Range | KL Variance | Cosine Variance | Chebyshev Variance | Intersection Variance |
|---|---|---|---|---|
| [0, 0.2] | 0.8555 | 0.0453 | 0.0129 | 0.0158 |
| (0.2, 0.4] | 0.6871 | 0.0436 | 0.0128 | 0.0154 |
| (0.4, 0.6] | 0.6011 | 0.0411 | 0.0127 | 0.0152 |
| (0.6, 0.8] | 0.4107 | 0.0376 | 0.0102 | 0.0149 |
| (0.8, 1] | 0.2854 | 0.0321 | 0.0101 | 0.0134 |
| (1, 1.2] | 0.232 | 0.0285 | 0.0098 | 0.0128 |
| (1.2, 1.4] | 0.1486 | 0.0215 | 0.0069 | 0.0126 |
| (1.4, 1.6] | 0.0941 | 0.0115 | 0.0054 | 0.0099 |
| (1.6, ∞) | 0.0459 | 0.0044 | 0.0042 | 0.0081 |
Question 3: The impact of the hyperparameter on model performance varies across different datasets. The authors should further analyze the underlying reasons for this variation and provide practical recommendations for its application.
We analyzed the underlying reasons why the impact of the hyperparameter on model performance varies across datasets, and provide practical recommendations for its application below.
Why the impact of the hyperparameter varies across datasets. The hyperparameter balances the relative importance of the training error (the KL divergence between the model-output label distribution and the true label distribution) and the IAR regularization. Since the converged training error is influenced by the number of labels, the data distribution, and the noise level of the dataset, the optimal weight for the IAR term is dataset-dependent. Furthermore, variations in data distribution across datasets lead to different degrees of over-uniformity once the training error has converged, consequently requiring different levels of IAR regularization during training. This further necessitates dataset-specific hyperparameter values.
Recommended settings. As analyzed in lines 249-254 of our paper, we recommend the default value identified there for most datasets, as it approaches optimal performance across diverse scenarios. For further performance optimization, we suggest the following empirical guidelines:
- When the training set contains predominantly high-entropy label distributions, consider a smaller value.
- For datasets with predominantly low-entropy label distributions, larger values are preferable.
Thanks for the responses. I appreciate the further results on more evaluation metrics and the analysis of the impact of the hyperparameter. However, for Q2, although the authors give a simulation-based explanation of how prediction error variance differs across entropy levels, I believe the paper would benefit from an analysis of actual data or a theoretical perspective to further study the phenomenon.
We sincerely appreciate the reviewer's insightful comments regarding the need for further validation on real-world data and theoretical analysis. In response, we have conducted additional experiments and analyses as follows:
Real-world Data Validation: We have verified this phenomenon on real-world datasets by replacing the artificially generated label distributions in our simulation experiments with the actual label distributions of each dataset. Because different datasets have label distributions of different dimensionality, raw entropy values are not comparable across datasets, so the following table reports normalized entropy values instead. The experimental results on real-world data consistently demonstrate that high-entropy datasets (Jaffe, BU-3DFE, Movie, Music Mood) exhibit significantly lower variance in prediction performance than low-entropy datasets (Natural Scene, Emotion6, Art Painting, M2B).
| Cheb Variance | KL Variance | Cosine Variance | Intersec Variance | Dataset | Entropy Range | Entropy Distribution | # Labels |
|---|---|---|---|---|---|---|---|
| 0.00503105 | 0.0226465 | 0.0110323 | 0.00882119 | Jaffe | [0.862, 0.998] | 0.959±0.026 | 6 |
| 0.00634856 | 0.0238445 | 0.013278 | 0.00938306 | BU-3DFE | [0.841, 1.000] | 0.953±0.037 | 6 |
| 0.00812684 | 0.130914 | 0.0201881 | 0.0142667 | Movie | [0.422, 0.999] | 0.878±0.061 | 5 |
| 0.0015705 | 0.0346131 | 0.0092632 | 0.00743669 | Music Mood | [0.816, 0.997] | 0.944±0.034 | 9 |
| 0.0732037 | 1.69887 | 0.0385863 | 0.0349031 | Natural Scene | [0.006, 0.941] | 0.466±0.272 | 9 |
| 0.0224852 | 12.7154 | 0.0319004 | 0.0193312 | Emotion6 | [0.126, 0.976] | 0.640±0.156 | 7 |
| 0.0126449 | 10.8347 | 0.0255178 | 0.0161054 | Art Painting | [0.124, 0.965] | 0.716±0.126 | 8 |
| 0.0263162 | 0.431754 | 0.041684 | 0.0187049 | M2B | [0.066, 0.692] | 0.407±0.118 | 5 |
We also note an interesting observation within the low-entropy dataset group: while Emotion6 and Art Painting show similar entropy levels to other low-entropy datasets, their KL divergence variance is substantially higher. We believe that this occurs because the KL metric is particularly sensitive to zero elements, and these two datasets contain abundant zero elements in their label distributions.
Theoretical Validation: Given a true label distribution $d=(d_1,\dots,d_c)$, where $d_1\ge d_2\ge\cdots\ge d_c$ can be assumed without loss of generality, the squared Euclidean distance between any predicted label distribution $\hat{d}$ and the true label distribution, $E(\hat{d},d)=\sum_{j=1}^{c}(\hat{d}_j-d_j)^2$, can be utilized to quantify the performance of the label distribution predictor. The minimum value of $E$ is zero, attained when $\hat{d}_j=d_j$ holds for every $j$. The maximum value of $E$ is $(1-d_c)^2+\sum_{j<c}d_j^2$, attained when $\hat{d}_c=1$ and $\hat{d}_j=0$ for $j<c$. Then the range of the performance is $[0,\,(1-d_c)^2+\sum_{j<c}d_j^2]$. Now, let $d^{(1)}$ and $d^{(2)}$ denote two true label distributions, where the Gini coefficient of $d^{(1)}$ is smaller than the Gini coefficient of $d^{(2)}$, i.e., $G(d^{(1)})<G(d^{(2)})$ with $G(d)=1-\sum_{j}d_j^2$. It should be noted that we use the Gini coefficient here to quantify the uncertainty of a label distribution, which is similar to entropy yet easier to calculate than entropy. According to $(1-d_c)^2+\sum_{j<c}d_j^2=2-G(d)-2d_c$, we have:

$$\bigl[2-G(d^{(1)})-2d^{(1)}_c\bigr]-\bigl[2-G(d^{(2)})-2d^{(2)}_c\bigr]=G(d^{(2)})-G(d^{(1)})+2\bigl(d^{(2)}_c-d^{(1)}_c\bigr)>0.$$

Therefore, the performance range on a true label distribution with lower uncertainty is broader than the performance range on a true label distribution with higher uncertainty if $d^{(1)}_c\le d^{(2)}_c$ (this requirement can be satisfied in most practical cases, about 90% of the cases in the real-world datasets in our paper).
We believe these additional experiments and theoretical analyses substantially strengthen our original findings and provide more comprehensive evidence for the observed phenomenon. Please don't hesitate to let us know if any remaining points require further clarification.
Thanks for the responses. I have no further questions.
This paper identifies and analyzes entropy bias in Label Distribution Learning, where models struggle with low-entropy samples. The authors propose an inter-anchor angular regularization term and an entropy-calibrated aggregation strategy, supported by both theory and extensive experiments. Reviewers found the theoretical analysis clear and the empirical validation convincing, with rebuttal additions (new metrics, ViT results, real-data variance analysis) further strengthening the work. While some issues remain around hyperparameter sensitivity and broader applicability, the paper makes a solid contribution.