PaperHub
Score: 6.8/10
Poster · 4 reviewers
Ratings: 5, 4, 4, 4 (min 4, max 5, std 0.4)
Average confidence: 3.5
Novelty: 3.0 · Quality: 3.3 · Clarity: 3.3 · Significance: 2.8
NeurIPS 2025

Neural Collapse in Cumulative Link Models for Ordinal Regression: An Analysis with Unconstrained Feature Model

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29

Keywords

neural collapse, unconstrained feature model, ordinal regression, cumulative link model, deep learning theory

Reviews and Discussion

Review

Rating: 5

The paper investigates whether the Neural Collapse (NC) phenomenon arises in deep ordinal regression (OR) tasks by combining the cumulative link model (CLM) for OR with the unconstrained feature model (UFM). The authors extend NC theory to CLM-based OR and identify several properties, which they demonstrate analytically and empirically across benchmark ordinal datasets.

Strengths and Weaknesses

Strengths:

  • The authors investigate whether a phenomenon analogous to Neural Collapse (NC) arises in ordinal regression (OR) tasks. They introduce the concept of Ordinal Neural Collapse (ONC) and support its existence through analytical derivations and empirical analysis.
  • The work contributes to the theoretical understanding of ordinal regression by extending the theory of Neural Collapse to the ordinal domain, offering new insights into the behavior of deep models trained on ordinal tasks.

Weaknesses:

  • As acknowledged in the limitations, the theoretical analysis assumes fixed thresholds $\mathbf{b}$. The ONC phenomenon under learnable thresholds is not theoretically addressed. Although the empirical analysis includes both fixed and learnable thresholds, further investigation is needed to characterize the behavior of ONC3 in the learnable case. In particular, the experimental results suggest that the ONC3 metric converges to a non-zero value when thresholds are learned, indicating that Eq. (21) may not hold in that setting.
  • The purpose and interpretation of Figure 3, which depicts the evolution of latent representations and their PCA projections, is unclear. It is not sufficiently connected to the analysis of ONC properties shown in Figure 2 (which shows the evolution of ONC-related metrics). A more explicit description of Figure 3 is needed to better clarify the contribution of this result.
  • While hyperparameter settings and additional experiments are reported in the Appendix, the main paper would benefit from presenting at least one additional experiment on a different dataset to reinforce the empirical evidence for ONC and improve the generalizability of the findings.
  • As discussed by the authors, the ONC properties were confirmed on standard benchmark ordinal datasets. However, to better assess the robustness of these findings, it would be important to test the ONC phenomenon under more challenging datasets and conditions, such as ordinal imbalance or distributional shift, which are more representative of real-world scenarios.
  • The evaluation of convergence is currently limited to a few metrics. It is unclear why standard ordinal metrics such as Quadratic Weighted Kappa (QWK), Minimum Sensitivity, or Accuracy within 1-off were not included in the analysis. Their inclusion could offer complementary insights.
  • The authors do not explore alternative ordinal loss functions, such as those based on QWK, which could influence convergence behavior and ONC dynamics.
  • Finally, the choice of link function is limited. The authors do not consider alternatives like the complementary log-log link, which might better model skewed ordinal class distributions.

Questions

  • Can you clarify the intended purpose of Figure 3 (evolution of latent and PCA-projected features)? How does it support or relate to the ONC properties?
  • Have you considered evaluating ONC properties on additional (more challenging) ordinal datasets?
  • Why did you not consider other standard ordinal evaluation metrics like Quadratic Weighted Kappa (QWK), Minimum Sensitivity, and Accuracy within 1-off? Including them could provide complementary evidence for ONC convergence.
  • Have you considered evaluating your approach with alternative ordinal loss functions, such as those based on QWK? These could affect ONC convergence properties.
  • Have you explored other link functions, such as the complementary log-log link, which may be more suitable for skewed ordinal class distributions?

Limitations

The authors acknowledge relevant limitations in a specific discussion subsection. The additional limitations outlined in the Weaknesses and Questions sections should also be carefully considered and addressed.

Final Justification

The authors answered my comments. As a result, I upgraded my score to accept. I also encourage the authors to include these explanations and additional experiments in the final version of the paper.

Formatting Issues

No major formatting issues were identified.

Author Response

We appreciate your careful reading of our manuscript and your insightful comments.

[W1] Theoretical limitation regarding learnable thresholds

[A1] We recognize this as a limitation of our results. In the case of learnable thresholds, the outcome likely depends on how the thresholds are optimized. We conjecture that if the learning of thresholds proceeds sufficiently slowly compared to the learning of $\boldsymbol{w}$ and $\boldsymbol{h}$, ONC3 may still hold. However, this setup deviates from standard DNN training practices, and we have not conducted experiments under such conditions. As this consideration clarifies, the emergence of ONC3 would depend on the details of the learning dynamics, and hence it would be nontrivial to resolve this issue. Accordingly, we would like to leave this point as a direction for future work.

[W2, Q1] Clarification on the purpose and interpretation of Figure 3

[A2] We apologize that the explanation of Figure 3 was insufficient and confusing. The purpose of Figure 3 is to visualize intuitively the occurrence of ONC1, ONC2, and ONC3. Figure 3 is divided into upper and lower parts, corresponding to the fixed and learnable thresholds, respectively. Each threshold setting is further divided into upper and lower panels, with the upper panels representing the latent space and the lower panels the feature space.

Feature Space: This refers to the space where $\boldsymbol{h}$ resides (shown via PCA dimensionality reduction). Different colors represent different classes, and pentagrams represent the class means. Black arrows represent the classifier weights $\boldsymbol{w}$. At epoch 0, features are scattered in the feature space; as training progresses, features collapse to their class means, reflecting ONC1. The class means then collapse to a one-dimensional subspace aligned with $\boldsymbol{w}$, reflecting ONC2.

Latent Space: This refers to the one-dimensional space where $\boldsymbol{z}$ and $\boldsymbol{b}$ reside. Points of different colors represent the $z$ values of different classes. The thresholds $\boldsymbol{b}$ are represented by red dashed lines. At epoch 5000, the class means of $z$ collapse in class order and are positioned between the appropriate thresholds, which directly reflects the occurrence of ONC3. In the fixed-threshold case, the thresholds are (approximately) located at the midpoints of adjacent class means of $z$. In the learnable case, since the boundary thresholds are set to negative and positive infinity, the class means of $z$ at both ends do not show an apparent collapse. This explains why the ONC3 metric curves converge to different values under the two threshold strategies.
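The geometry described in [A2] can be illustrated with a toy numerical sketch (ours, not the authors' code; the logistic link and the midpoint placement of the collapsed means follow the paper's setting, while the specific values of $\boldsymbol{b}$ and the class means are assumed for illustration):

```python
import math

def logistic_cdf(t):
    """Standard logistic CDF, the link used in the CLM."""
    return 1.0 / (1.0 + math.exp(-t))

def clm_probs(z, b):
    """CLM class probabilities: P(y <= q | z) = F(b_q - z), with b_Q = +inf."""
    cum = [logistic_cdf(bq - z) for bq in b] + [1.0]
    return [cum[0]] + [cum[q] - cum[q - 1] for q in range(1, len(cum))]

# Hypothetical fixed thresholds for Q = 4 classes, with collapsed latent class
# means placed at the midpoints of adjacent thresholds (interior classes), as
# ONC3 predicts in the weak-regularization limit.
b = [-3.0, 0.0, 3.0]
z_means = [-4.5, -1.5, 1.5, 4.5]
for q, z in enumerate(z_means):
    p = clm_probs(z, b)
    assert max(range(len(p)), key=p.__getitem__) == q  # each mean predicts its own class
```

Each collapsed mean $z_q$ lands in the interval between its adjacent thresholds, so the predicted class matches the true class, which is what the latent-space panels of Figure 3 depict.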

We will improve the main text to clarify the above points in the revision.

[W3] Insufficient Dataset Diversity in Main Text

[A3] Thank you for this excellent suggestion. In response to [W4, Q2], we have added experiments on UTKFace, a classic facial age estimation dataset in ordinal regression tasks, with three additional backbones (details in [A4]). If the paper is accepted, we plan to include partial results from the UTKFace experiments in the main text, which would be more convincing than simply moving results of one dataset from the current appendix to the main text.

[W4, Q2] Evaluation of robustness of the theoretical findings under more challenging and realistic conditions

[A4] Thank you for your constructive suggestion. Reviewers zTjz and 7zzE also emphasized the importance of validating the ONC phenomenon on more challenging tasks. Therefore, we have examined three different backbone networks (ResNet-50, ResNet-101, and DenseNet-201) on the UTKFace age-estimation dataset binned into 20 ordinal classes (5-year intervals, 0–4 yrs up to 95–116 yrs); the maximum class-imbalance ratio is approximately 75:1. Facial age estimation on UTKFace is one of the most renowned tasks in the ordinal regression field, and confirming the ONC phenomenon in this task provides more robust validation of our theoretical findings.
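As a concrete note on the preprocessing, the 5-year binning described above can be written as follows (a sketch under our assumptions; the exact implementation is not given in the response):

```python
def age_to_class(age, width=5, n_classes=20):
    """Map an age in years to one of 20 ordinal classes: 0-4 -> 0, ..., 95-116 -> 19."""
    return min(age // width, n_classes - 1)

assert age_to_class(0) == 0     # 0-4 yrs bin
assert age_to_class(37) == 7    # 35-39 yrs bin
assert age_to_class(116) == 19  # top bin absorbs 95-116 yrs
```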

The experimental results are summarized in Tables 1 and 2 in our response to Reviewer 7zzE (due to character limitations in this response, it was not possible to show it here). All four ONC metrics show substantial decreases across architectures, indicating successful collapse during training for both training and validation datasets in all cases.

If the paper is accepted, all these additional experimental results will be added to the paper.

[W5, Q3] Use of standard ordinal evaluation metrics

[A5] We thank you for your suggestion. Reviewer 7zzE also provided a similar suggestion. Following these comments, we have computed five additional metrics, including QWK, minimum sensitivity, and accuracy within 1-off.

Using the five tasks presented in the original manuscript, we have computed these metrics and found that the mean values of QWK, minimum sensitivity, and accuracy within 1-off almost converge to 1 with variance close to 0 at the final epoch. The same observation holds for the supplementary experiments mentioned in [A4]. Table 1 in our response to Reviewer 7zzE reports this. Notably, the learnable threshold method consistently achieves 0 minimum sensitivity on UTKFace, while the fixed threshold method overcomes this issue, implying that the fixed threshold strategy actually mitigates the minority-class problem.
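For reference, minimal implementations of two of these metrics, minimum sensitivity and accuracy within 1-off, might look like this (our illustrative versions, not the code used to produce the reported numbers):

```python
def min_sensitivity(y_true, y_pred, n_classes):
    """Minimum per-class recall; 0 if some observed class is never predicted correctly."""
    sens = []
    for c in range(n_classes):
        idx = [i for i, y in enumerate(y_true) if y == c]
        if idx:  # only classes that actually appear
            sens.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return min(sens)

def within_1_acc(y_true, y_pred):
    """Fraction of predictions within one class of the true ordinal label."""
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2]
# per-class recalls: class 0 -> 1/2, class 1 -> 1, class 2 -> 2/3
assert min_sensitivity(y_true, y_pred, 3) == 0.5
assert within_1_acc(y_true, y_pred) == 1.0
```

A minimum sensitivity of exactly 0, as observed with learnable thresholds, means at least one minority class is never predicted correctly.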

In the revision, we will add plots of all these new metrics to help readers better judge the training and validation situations.

[W6, Q4] Exploration of alternative ordinal loss functions beyond standard CLM framework

[A6] Indeed, changing the loss function would naturally affect the dynamics of ONC. However, our analysis does not concern the dynamics; rather, it focuses on the stationary properties of solutions obtained by minimizing the loss function. Moreover, our framework is fundamentally grounded in a statistical perspective based on maximum likelihood estimation. While it is interesting to explore the implications of using QWK, particularly its impact on learning dynamics, we consider such investigations to be beyond the scope of the present study and therefore leave them to future work. We would appreciate your understanding.

[W7, Q5] Limited exploration of alternative link functions suitable for skewed class distributions

[A7] As the complementary log-log (cloglog) link is associated with a log‑concave probability density function, it satisfies our theoretical assumptions and thus falls within the scope of our theory. We will stress this point in the revision.

To validate its empirical effect, we have conducted additional experiments using the cloglog link function, with the same architecture and datasets as in the original manuscript. The result is summarized in Table 1. A noteworthy property of the cloglog function is that it is not symmetric, in contrast to the logistic and normal CDF links, and hence Eq. (21) does not hold. Instead, the optimal $z_q^*$ in the weak-regularization limit should be computed by numerically solving Eq. (18), and accordingly the metric for ONC3 (Eq. (25)) should be modified. The result using the modified ONC3 metric (written as onc3_cloglog) is reported in Table 1.

From the result, we can see that all the three characteristics of ONC actually occur in the cloglog function as well. We will add this result in the appendix in the revision.
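To make the asymmetry point concrete, the following sketch (ours, for illustration) contrasts the cloglog link with the logistic link; this asymmetry is why Eq. (21), which relies on a symmetric link, no longer applies:

```python
import math

def cloglog_cdf(t):
    """Complementary log-log link: F(t) = 1 - exp(-exp(t))."""
    return 1.0 - math.exp(-math.exp(t))

def logistic_cdf(t):
    return 1.0 / (1.0 + math.exp(-t))

# The logistic link is symmetric, F(-t) = 1 - F(t); the cloglog link is not,
# so the optimal latent values are no longer threshold midpoints.
t = 1.3
assert abs(logistic_cdf(-t) - (1.0 - logistic_cdf(t))) < 1e-12
assert abs(cloglog_cdf(-t) - (1.0 - cloglog_cdf(t))) > 0.1
```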

Table 1: ONC Metrics Evolution for LE Dataset (Initialization → Post-training)

| Threshold | Data Split | onc1 ↓ | onc2_1 ↓ | onc2_2 ↓ | onc3_cloglog ↓ |
|---|---|---|---|---|---|
| fixed | train | 8.20×10⁻¹ → 1.15×10⁻² | 3.85×10⁻¹ → 2.18×10⁻¹⁴ | 9.43×10⁻¹ → 2.80×10⁻⁶ | 1.34×10⁰ → 2.27×10⁻² |
| fixed | test | 8.48×10⁻¹ → 1.36×10⁻² | 3.95×10⁻¹ → 7.92×10⁻¹⁵ | 9.37×10⁻¹ → 2.86×10⁻⁶ | 1.34×10⁰ → 2.35×10⁻² |
| learnable | train | 8.30×10⁻¹ → 1.57×10⁻² | 3.10×10⁻¹ → 2.77×10⁻¹⁴ | 8.66×10⁻¹ → 2.41×10⁻⁴ | 1.99×10⁰ → 5.57×10⁻¹ |
| learnable | test | 8.44×10⁻¹ → 2.42×10⁻² | 2.74×10⁻¹ → 1.22×10⁻¹⁴ | 8.73×10⁻¹ → 2.41×10⁻⁴ | 2.00×10⁰ → 5.53×10⁻¹ |
Comment

The authors answered my comments. I encourage the authors to include these explanations and additional experiments in the final version of the paper.

Comment

We would like to thank Reviewer wf6P for his/her feedback. As previously stated in our response, we will include these explanations and the additional experiments in the final version if the paper is accepted.

Review

Rating: 4

This work extends Neural Collapse (NC) to ordinal regression (OR) tasks, which differ subtly from both classification and regression. Building on the Unconstrained Feature Model (UFM) and the Cumulative Link Model (CLM), the authors provide a theoretical characterization of a collapse phenomenon under $\ell_2$ regularization applied to both classifier weights and "free" hidden features. The theoretical results fall into two sets: the first, ONC1 and ONC2, mirrors classical NC behavior, showing within-class feature collapse and alignment of class means with the classifier direction; the second is more task-specific, where the authors demonstrate that with vanishing regularization strength, the collapsed latent values respect the class order, and under zero regularization, they match the threshold midpoints. Empirical results validate these findings and suggest that the collapse structure persists even when thresholds are learned rather than fixed.

Strengths and Weaknesses

Strengths:

  • The paper is overall well-written, with clear technical exposition and smooth narrative flow. The related work section on NC is comprehensive and positions the paper well within the broader literature.
  • As far as I can tell, the extension of the NC framework to OR tasks appears to be novel and well-motivated.
  • The theoretical analysis and empirical validation are tightly aligned. In particular, the behavior under vanishing regularization is quite interesting and elegant.

Weaknesses:

  • The paper falls somewhat short on practical insights. The authors state that fixed threshold values "yield faster and more stable convergence compared with learnable thresholds" and that "These make ONC a potentially effective concept for improving practical performance in generic OR tasks," but do not pursue this direction more clearly. For example, Figure 2 shows similar MAE for fixed and learnable thresholds, but notably better $\ell_{NLL}$. Could this translate to improved generalization or calibration in more challenging settings?
  • In Figure 2 and its associated discussion, the authors mention that both training and validation MAE approach zero. However, the curves appear to be nearly identical throughout training, which raises the question of whether the chosen tasks are too easy. Would it be possible to test on more comprehensive tasks to broaden the applicability of the results?

Questions

Please see the Weaknesses section.

Limitations

Yes

Final Justification

After reading the reviews and responses from both the authors and other reviewers, I noticed that a common concern was the simplicity of the experiments presented in the main paper. The authors have addressed this point in their response, including additional results, which is great.

I will maintain my initial recommendation of borderline accept, as I find the task to be novel and the paper overall interesting, but the practical implications are not yet entirely clear.

Formatting Issues

NA

Author Response

Thank you for your valuable and constructive feedback.

[W1] Limited Practical Insights

[A1] The purpose of this paper is to investigate the existence of an NC-like phenomenon in OR tasks and to provide its characterization. Hence, providing practical utility is indeed not the primary purpose of this paper. We hope for your understanding in this regard. However, our results do offer practical insights, such as suggesting appropriate ranges for regularization parameters, indicating the potential for designing new loss functions based on UFM analysis, and highlighting the usefulness of fixed thresholds. While we have already addressed these points in the Discussion section, we will revise the text to make these contributions more explicit.

[W2] Concerns About Task Simplicity

[A2] Similar comments were also raised by Reviewers 7zzE and wf6P, and we agree with this point. To address this weakness, we have conducted additional experiments using the UTKFace age-estimation dataset with 20 ordinal classes (5-year intervals, 0–4 yrs up to 95–116 yrs), which exhibits strong class imbalance. To handle this more challenging task, we used more powerful backbone architectures (ResNet-50, ResNet-101, and DenseNet-201). All architectures use ImageNet-pretrained weights and are fine-tuned with the Adam optimizer (lr=0.001, weight_decay=0.001). We employ the logit link function with both fixed thresholds (uniformly spaced from -40 to 40) and learnable thresholds. Tables 1 and 2 summarize the results. Across all architectures, the four ONC metrics demonstrate significant reductions, confirming that ONC occurs. This consistent emergence of collapse indicates that the ONC phenomenon is a fundamental characteristic of over-parameterized neural networks. As a noteworthy outcome, Table 1 reveals that learnable thresholds cause the model to neglect minority classes, whereas fixed thresholds mitigate this problem (as evidenced by the "min sensitivity" metric). This was expected and already argued in the Discussion section of the original manuscript, and the expectation has now been experimentally confirmed, providing one reasonable response to your comment regarding the practical insights of our paper ([W1]). We will report this finding in the revised manuscript.

Table 1: OR Metrics After Training

| Backbone | Threshold | Data Split | loss ↓ | acc ↑ | mae_order ↓ | mae_age ↓ | qwk ↑ | min_sensitivity ↑ | within_1_acc ↑ |
|---|---|---|---|---|---|---|---|---|---|
| ResNet101 | fixed | train | 0.3451 | 0.9908 | 0.0261 | 0.1304 | 0.9940 | 0.9333 | 0.9969 |
| ResNet101 | fixed | val | 0.8358 | 0.8683 | 0.2035 | 1.0174 | 0.9834 | 0.7500 | 0.9599 |
| ResNet101 | learnable | train | 1.0741 | 0.9523 | 0.0250 | 0.1249 | 0.9988 | 0.0000 | 0.9980 |
| ResNet101 | learnable | val | 1.3540 | 0.5996 | 0.2066 | 1.0332 | 0.9869 | 0.0000 | 0.9573 |
| ResNet50 | fixed | train | 0.3312 | 0.9909 | 0.0264 | 0.1318 | 0.9931 | 0.9545 | 0.9972 |
| ResNet50 | fixed | val | 0.8450 | 0.8722 | 0.2061 | 1.0306 | 0.9819 | 0.7500 | 0.9526 |
| ResNet50 | learnable | train | 0.5326 | 0.9879 | 0.0177 | 0.0887 | 0.9987 | 0.0000 | 0.9983 |
| ResNet50 | learnable | val | 0.9335 | 0.7837 | 0.1924 | 0.9620 | 0.9883 | 0.0000 | 0.9573 |
| DenseNet201 | fixed | train | 0.4172 | 0.9454 | 0.0177 | 0.0885 | 0.9964 | 0.9500 | 0.9982 |
| DenseNet201 | fixed | val | 0.7893 | 0.8776 | 0.1835 | 0.9174 | 0.9877 | 0.7866 | 0.9598 |
| DenseNet201 | learnable | train | 0.7825 | 0.9884 | 0.0141 | 0.0706 | 0.9993 | 0.0000 | 0.9984 |
| DenseNet201 | learnable | val | 1.0224 | 0.8042 | 0.1908 | 0.9538 | 0.9866 | 0.0000 | 0.9538 |

Table 2: ONC Metrics Evolution (Initialization → Post-training)

| Backbone | Threshold | Data Split | onc1 ↓ | onc2_1 ↓ | onc2_2 ↓ | onc3 ↓ |
|---|---|---|---|---|---|---|
| ResNet101 | fixed | train | 8.91×10⁻¹ → 6.27×10⁻² | 4.82×10⁻¹ → 1.21×10⁻⁸ | 9.65×10⁻¹ → 2.62×10⁻⁴ | 4.71×10⁰ → 9.24×10⁻² |
| ResNet101 | fixed | val | 8.98×10⁻¹ → 1.54×10⁻¹ | 5.11×10⁻¹ → 1.23×10⁻⁸ | 9.80×10⁻¹ → 2.62×10⁻⁴ | 4.72×10⁰ → 2.40×10⁻¹ |
| ResNet101 | learnable | train | 8.92×10⁻¹ → 5.80×10⁻² | 4.84×10⁻¹ → 7.33×10⁻⁵ | 9.65×10⁻¹ → 5.01×10⁻³ | 6.28×10⁰ → 2.53×10⁻¹ |
| ResNet101 | learnable | val | 8.99×10⁻¹ → 1.49×10⁻¹ | 5.12×10⁻¹ → 7.21×10⁻⁵ | 9.81×10⁻¹ → 5.03×10⁻³ | 6.29×10⁰ → 3.16×10⁻¹ |
| ResNet50 | fixed | train | 8.90×10⁻¹ → 7.65×10⁻² | 4.82×10⁻¹ → 3.12×10⁻⁷ | 9.64×10⁻¹ → 5.52×10⁻⁴ | 4.71×10⁰ → 8.27×10⁻² |
| ResNet50 | fixed | val | 8.97×10⁻¹ → 1.56×10⁻¹ | 5.10×10⁻¹ → 2.84×10⁻⁷ | 9.80×10⁻¹ → 5.51×10⁻⁴ | 4.72×10⁰ → 2.23×10⁻¹ |
| ResNet50 | learnable | train | 8.92×10⁻¹ → 6.72×10⁻² | 4.83×10⁻¹ → 5.91×10⁻⁴ | 9.65×10⁻¹ → 4.56×10⁻² | 6.28×10⁰ → 3.27×10⁻¹ |
| ResNet50 | learnable | val | 8.98×10⁻¹ → 1.57×10⁻¹ | 5.12×10⁻¹ → 5.65×10⁻⁴ | 9.81×10⁻¹ → 4.55×10⁻² | 6.29×10⁰ → 4.27×10⁻¹ |
| DenseNet201 | fixed | train | 8.93×10⁻¹ → 5.64×10⁻² | 4.84×10⁻¹ → 1.08×10⁻⁵ | 9.66×10⁻¹ → 2.66×10⁻³ | 4.72×10⁰ → 8.04×10⁻² |
| DenseNet201 | fixed | val | 9.00×10⁻¹ → 1.27×10⁻¹ | 5.13×10⁻¹ → 9.81×10⁻⁶ | 9.82×10⁻¹ → 2.67×10⁻³ | 4.73×10⁰ → 1.50×10⁻¹ |
| DenseNet201 | learnable | train | 8.94×10⁻¹ → 4.94×10⁻² | 4.85×10⁻¹ → 7.14×10⁻⁴ | 9.66×10⁻¹ → 1.39×10⁻¹ | 6.30×10⁰ → 4.48×10⁻¹ |
| DenseNet201 | learnable | val | 9.01×10⁻¹ → 1.45×10⁻¹ | 5.14×10⁻¹ → 6.94×10⁻⁴ | 9.82×10⁻¹ → 1.39×10⁻¹ | 6.30×10⁰ → 5.36×10⁻¹ |
Comment

I thank the authors for their response, particularly for conducting the additional experiments. I would encourage the authors to include these results in the revised manuscript.

Comment

We would like to thank Reviewer zTjz for his/her feedback. As we have already written in our response, we will certainly include the additional experimental results in the revised manuscript if the paper is accepted.

Review

Rating: 4

This paper explores the phenomenon of Neural Collapse in the context of ordinal regression, while previous studies primarily focused on classification and regression tasks. The authors theoretically demonstrate that regularized empirical risk minimization (ERM) in unconstrained feature models leads to Neural Collapse even when the learning objective is ordinal regression. Specifically, they show that the class-wise mean features collapse onto a one-dimensional real line, and the ordering of these projected means corresponds to the ordinal structure of the classes. The authors validate their findings through experiments on several benchmark datasets, providing evidence that overparameterized models trained for ordinal regression display the prescribed Neural Collapse properties.

Strengths and Weaknesses

This paper makes a valuable theoretical contribution by extending neural collapse analysis to ordinal regression, though its experimental validation has some limitations that reduce its potential impact. While I lean toward accepting the paper, some concerns regarding the experimental results should be addressed in the rebuttal.

The paper is well organized and clearly written. The main theoretical results are well presented with rigorous mathematical notation. The objective of the empirical evaluations is clearly highlighted as bridging the gap between theoretical findings and practical applications. The clarity of the paper is high, except for the presentation of the experimental results.

The theoretical results are supported by rigorous mathematical proofs. While I did not check the proofs in detail, I cannot find any major issues. The quality of the theoretical contents appears to be at a publishable standard.

The authors incorporate existing techniques for analyzing neural collapse (i.e., unconstrained feature models) to analyze neural collapse in ordinal regression. While the basic idea of the theoretical analyses combines existing techniques, this combination itself represents a novel contribution. Therefore, the present paper exhibits originality, though not at the highest level.

The theoretical analyses reveal three properties regarding neural collapse in ordinal regression. ONC1 and ONC2 are analogous to existing results in classification tasks. ONC3 and Theorem 4.3 are non-trivial findings specific to ordinal regression. Particularly significant is the suggestion regarding choices of regularization parameters that avoid trivial, meaningless solutions.

I have concerns about the experimental validation being conducted on a single non-standard architecture designed by the authors. The generalizability of these results to widely used architectures such as CNNs, ResNets, and Transformers remains unclear. Since these common architectures are the basis of many current successes in deep learning, the inability to demonstrate that the theoretical findings apply to them diminishes the paper's potential impact in explaining real-world deep learning successes.

The quality of the experimental design raises some questions. The authors use a non-standard accuracy metric (mean absolute error between the values of true and predicted labels), despite their own cited reference (Gutierrez et al., 2016) recommending classification accuracy and mean absolute error between the orders of predicted and true labels as appropriate metrics for ordinal regression tasks. This raises a concern that the empirical confirmations of the ONCs were done for inaccurate models, reducing the practical relevance of the theoretical results.

With respect to the clarity of experimental results, the presentation in Figure 2 could be improved. The authors inconsistently use different scales for the vertical axis across the evaluated metrics: linear scale for MAE, ONC2-1, and ONC2-2, but log scale for $\ell_{NLL}$, ONC1, and ONC3. Using a consistent log scale for all metrics would facilitate better interpretation of the results.

Questions

  1. How do the theoretical findings translate to practical benefits for ordinal regression tasks? Could you please demonstrate improved performance on benchmark datasets when applying insights from your neural collapse analysis?

  2. Could the authors please justify how their experimental design addresses the concerns raised in the review?

Limitations

yes

Final Justification

I lean toward accepting this paper, as it makes a valuable theoretical contribution by extending neural collapse analysis to ordinal regression. However, I lower my score from the initial evaluation because there remains a concern that the empirical confirmations of the ONCs were conducted on inaccurate models, reducing the practical relevance of the theoretical results.

Formatting Issues

I have no concerns about the formatting of the paper.

Author Response

Thank you for the thoughtful suggestions and detailed critique, which helped us improve the paper.

[W1] Generalization to standard neural network architectures

[A1] We sincerely thank you for this valuable feedback. To address this concern, we have conducted additional experiments. We examined three different backbone networks (ResNet-50, ResNet-101, and DenseNet-201) on the UTKFace age-estimation dataset binned into 20 ordinal classes (5-year intervals, 0–4 yrs up to 95–116 yrs). All models are pretrained on ImageNet, and the Adam optimizer (lr=0.001, weight_decay=0.001) is used for fine-tuning. The logit link function is used with both fixed thresholds (uniformly spaced from -40 to 40) and learnable thresholds. The result is summarized in Table 2. All four ONC metrics show substantial decreases across architectures, indicating successful collapse during training for both training and validation datasets. The consistent emergence of collapse patterns across different architectures suggests that the ONC phenomenon is a general characteristic of over-parameterized neural networks.

If the paper is accepted, all the additional experimental results will be added to the paper's appendix. Besides the above three additional backbone networks, we also plan to add experimental results for transformer-based backbones; those experiments are currently underway.

[W2] Clarification on evaluation metrics

[A2] We thank you for pointing out the incompleteness of our evaluation metrics. We have additionally computed five metrics: classification accuracy, MAE of the order of predicted and true labels, QWK, minimum sensitivity, and accuracy within 1-off (mentioned by Reviewer wf6P). We will add plots of all these new evaluation metrics to help readers better judge the training and validation situations if the paper is accepted.

[W3] Improving clarity of Figure 2

[A3] Thank you for this suggestion. In the revision, we will consistently use a log scale in all the plots in Fig. 2.

[Q1] Practical implications of our theoretical findings

[A4] This is an interesting question, and it may be possible to address it through the appropriate design of thresholds or the development of new loss functions based on the UFM framework. While we have already mentioned the issue of degraded accuracy for minority classes in the Discussion section of the original manuscript, additional experiments presented in Table 1 show that with learnable thresholds, the model indeed tends to ignore the minority classes. In contrast, this issue is alleviated under fixed thresholds (see the “min sensitivity” metric). We plan to discuss this point in greater depth in the revised version.

[Q2] Justification of experimental design

[A5] In the previous experimental design, it was indeed unclear how relevant our findings were to more realistic scenarios. In response to your suggestion, we conducted additional experiments, which we believe have significantly strengthened our claims. We appreciate your valuable feedback.

Table 1: OR Metrics After Training

| Backbone | Threshold | Data Split | loss ↓ | acc ↑ | mae_order ↓ | mae_age ↓ | qwk ↑ | min_sensitivity ↑ | within_1_acc ↑ |
|---|---|---|---|---|---|---|---|---|---|
| ResNet101 | fixed | train | 0.3451 | 0.9908 | 0.0261 | 0.1304 | 0.9940 | 0.9333 | 0.9969 |
| ResNet101 | fixed | val | 0.8358 | 0.8683 | 0.2035 | 1.0174 | 0.9834 | 0.7500 | 0.9599 |
| ResNet101 | learnable | train | 1.0741 | 0.9523 | 0.0250 | 0.1249 | 0.9988 | 0.0000 | 0.9980 |
| ResNet101 | learnable | val | 1.3540 | 0.5996 | 0.2066 | 1.0332 | 0.9869 | 0.0000 | 0.9573 |
| ResNet50 | fixed | train | 0.3312 | 0.9909 | 0.0264 | 0.1318 | 0.9931 | 0.9545 | 0.9972 |
| ResNet50 | fixed | val | 0.8450 | 0.8722 | 0.2061 | 1.0306 | 0.9819 | 0.7500 | 0.9526 |
| ResNet50 | learnable | train | 0.5326 | 0.9879 | 0.0177 | 0.0887 | 0.9987 | 0.0000 | 0.9983 |
| ResNet50 | learnable | val | 0.9335 | 0.7837 | 0.1924 | 0.9620 | 0.9883 | 0.0000 | 0.9573 |
| DenseNet201 | fixed | train | 0.4172 | 0.9454 | 0.0177 | 0.0885 | 0.9964 | 0.9500 | 0.9982 |
| DenseNet201 | fixed | val | 0.7893 | 0.8776 | 0.1835 | 0.9174 | 0.9877 | 0.7866 | 0.9598 |
| DenseNet201 | learnable | train | 0.7825 | 0.9884 | 0.0141 | 0.0706 | 0.9993 | 0.0000 | 0.9984 |
| DenseNet201 | learnable | val | 1.0224 | 0.8042 | 0.1908 | 0.9538 | 0.9866 | 0.0000 | 0.9538 |

Table 2: ONC Metrics Evolution (Initialization → Post-training)

| Backbone | Threshold | Data Split | onc1 ↓ | onc2_1 ↓ | onc2_2 ↓ | onc3 ↓ |
|---|---|---|---|---|---|---|
| ResNet101 | fixed | train | 8.91×10⁻¹ → 6.27×10⁻² | 4.82×10⁻¹ → 1.21×10⁻⁸ | 9.65×10⁻¹ → 2.62×10⁻⁴ | 4.71×10⁰ → 9.24×10⁻² |
| ResNet101 | fixed | val | 8.98×10⁻¹ → 1.54×10⁻¹ | 5.11×10⁻¹ → 1.23×10⁻⁸ | 9.80×10⁻¹ → 2.62×10⁻⁴ | 4.72×10⁰ → 2.40×10⁻¹ |
| ResNet101 | learnable | train | 8.92×10⁻¹ → 5.80×10⁻² | 4.84×10⁻¹ → 7.33×10⁻⁵ | 9.65×10⁻¹ → 5.01×10⁻³ | 6.28×10⁰ → 2.53×10⁻¹ |
| ResNet101 | learnable | val | 8.99×10⁻¹ → 1.49×10⁻¹ | 5.12×10⁻¹ → 7.21×10⁻⁵ | 9.81×10⁻¹ → 5.03×10⁻³ | 6.29×10⁰ → 3.16×10⁻¹ |
| ResNet50 | fixed | train | 8.90×10⁻¹ → 7.65×10⁻² | 4.82×10⁻¹ → 3.12×10⁻⁷ | 9.64×10⁻¹ → 5.52×10⁻⁴ | 4.71×10⁰ → 8.27×10⁻² |
| ResNet50 | fixed | val | 8.97×10⁻¹ → 1.56×10⁻¹ | 5.10×10⁻¹ → 2.84×10⁻⁷ | 9.80×10⁻¹ → 5.51×10⁻⁴ | 4.72×10⁰ → 2.23×10⁻¹ |
| ResNet50 | learnable | train | 8.92×10⁻¹ → 6.72×10⁻² | 4.83×10⁻¹ → 5.91×10⁻⁴ | 9.65×10⁻¹ → 4.56×10⁻² | 6.28×10⁰ → 3.27×10⁻¹ |
| ResNet50 | learnable | val | 8.98×10⁻¹ → 1.57×10⁻¹ | 5.12×10⁻¹ → 5.65×10⁻⁴ | 9.81×10⁻¹ → 4.55×10⁻² | 6.29×10⁰ → 4.27×10⁻¹ |
| DenseNet201 | fixed | train | 8.93×10⁻¹ → 5.64×10⁻² | 4.84×10⁻¹ → 1.08×10⁻⁵ | 9.66×10⁻¹ → 2.66×10⁻³ | 4.72×10⁰ → 8.04×10⁻² |
| DenseNet201 | fixed | val | 9.00×10⁻¹ → 1.27×10⁻¹ | 5.13×10⁻¹ → 9.81×10⁻⁶ | 9.82×10⁻¹ → 2.67×10⁻³ | 4.73×10⁰ → 1.50×10⁻¹ |
| DenseNet201 | learnable | train | 8.94×10⁻¹ → 4.94×10⁻² | 4.85×10⁻¹ → 7.14×10⁻⁴ | 9.66×10⁻¹ → 1.39×10⁻¹ | 6.30×10⁰ → 4.48×10⁻¹ |
| DenseNet201 | learnable | val | 9.01×10⁻¹ → 1.45×10⁻¹ | 5.14×10⁻¹ → 6.94×10⁻⁴ | 9.82×10⁻¹ → 1.39×10⁻¹ | 6.30×10⁰ → 5.36×10⁻¹ |
Comment

Thank you to the authors for their rebuttal and for providing additional experimental results. Could the authors clarify whether Table 1 truly demonstrates that the models achieve state-of-the-art performance? As previously mentioned, empirical validation of the ONCs should be conducted using highly accurate models to ensure practical relevance. However, it is unclear whether the reported values, such as acc and mae_order, indicate high accuracy.

Comment

We would like to thank Reviewer 7zzE for raising this point. However, this paper does not aim to propose a state-of-the-art (SoTA) method, but rather to theoretically investigate whether neural-collapse-like phenomena also occur in ordinal regression. As a result, we have verified, both theoretically and experimentally, that collapse phenomena characterized by what we define as ONC do indeed arise within the standard framework based on cumulative link models.

Table 1 is not intended to demonstrate SoTA accuracy, but to show that ONC emerges robustly without the need for extensive tuning or architecture-specific tricks in more practical settings. We will make this point explicit and revise the manuscript to avoid giving the impression that we are claiming SoTA performance.

If there are any further concerns, please feel free to comment.

Review
Rating: 4

This paper studies the phenomenon of neural collapse in ordinal regression models. The authors propose a set of conditions for ordinal neural collapse, and through the unconstrained features model show that the global minima satisfy these conditions. The authors also empirically validate their findings on ordinal regression datasets.

Strengths and Weaknesses

Strengths: This paper has both analytical and empirical characterizations of neural collapse in a new setting. The mathematical results seem correct (I have not thoroughly checked the derivation of ONC3), and the empirical validation in small label settings seems valid.

Weaknesses:

  1. The mathematical results do not seem to provide prescriptions for training methods, and only analyze the landscape of the unconstrained features model. This is the case for all UFM analyses however, not a particular problem with this paper.

  2. The empirical validation is performed on relatively simple datasets with a small number of classes. Ordinal regression provides an advantage over regular classification in situations with extremely large label spaces. How does NC emerge in that case?

Questions

  1. What is the role of weight decay in the analysis of the UFM model? Is weight decay necessary?

  2. Comparing the emergence of NC in OR vs standard classification, is there a reason to prefer one setup over another? Can the UFM analysis provide an answer for this?

  3. The analysis seems to exclude the case where the thresholds are learned. Can the UFM analysis prescribe a proper choice of thresholds?

Limitations

yes

Justification of Final Rating

This paper is a solid contribution to the literature on Neural collapse - indicating that NC is a more fundamental phenomenon in training neural networks across a range of tasks and loss functions. While I have read the rebuttal and appreciate the addition of experiments on UTKFace, my main concerns - regarding limited practical insights, and not having demonstrations with a large number of classes - still remain. I will keep my score which indicates that on balance I believe the paper should be accepted.

Formatting Issues

None

Author Response

We are grateful for your constructive criticism, which provides an opportunity to clarify our contribution.

[W1] Limited prescriptions for training methods

[A1] Indeed, our theoretical analysis does not directly prescribe training methods. However, UFM-based analyses can inspire practical training methods through their analytical results. Several prior UFM studies have proposed such practical prescriptions, e.g., using a fixed ETF classifier to counter class imbalance or introducing new loss functions to accelerate convergence, and the associated experiments demonstrated that these prescriptions are effective. Our paper likewise leads to operational prescriptions: one example is the use of fixed thresholds to preserve ONC3 and stabilize training, which is moreover likely to alleviate the common issue of degraded classification performance for minority classes under class imbalance, as argued in the Discussion section of the manuscript. At submission time, this last benefit in the imbalanced case was only a conjecture, but inspired by the comments of reviewers including you, we have conducted additional experiments on a more challenging task (the UTKFace age-estimation dataset, which has a strong class imbalance) and found that fixed thresholds indeed improve minority-class accuracy. This is, in our view, a practical outcome naturally deduced from our theoretical arguments.

Admittedly, UFM introduces a strong simplifying assumption at the beginning. But thanks to that simplification, the derived results are often transparent. They thus can reveal what the model is doing and where to intervene, which potentially yields concrete design principles for training DNNs. This is precisely one of the key advantages of the UFM framework and is something difficult to achieve with other approaches.

[W2] Limited label scale of experimental validation

[A2] We admit that our experiments only treated the cases with a small number of classes. To address this point, as also explained in [A1], we have conducted additional experiments using the UTKFace age-estimation dataset binned into 20 classes; three additional model architectures have also been treated in the experiments to assess the robustness of the results. For more details of the experimental setup, please see our response [A1] for Reviewer 7zzE. The result clearly shows ONC in such a case with a large number of classes, supporting the practicality of our finding.

[Q1] The necessity of weight decay in UFM

[A3] Definitely necessary. In ONC2, the feature-vector components orthogonal to the classifier weight collapse to the zero vector thanks to the regularization on the feature h. The same holds for ONC1: without regularization, the strong convexity of the total loss function cannot be established, and hence the uniqueness of the solution underlying ONC1 cannot be shown. As for the regularization on the weight vector w, it is necessary because w otherwise diverges, as observed in the right panel of Figure 1. These regularizations are therefore necessary to ensure that the problem is mathematically well-defined.
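To make the role of the two regularizers concrete, here is a minimal numpy sketch of a regularized UFM objective for a CLM with a logistic link. The names `lam_h`, `lam_w`, the averaging, and the finite edge thresholds are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ufm_clm_loss(H, w, y, b, lam_h=1e-2, lam_w=1e-2):
    """Regularized UFM objective for a logistic-link CLM.
    H: (N, d) free features, w: (d,) classifier weight,
    y: (N,) labels in 0..Q-1, b: (Q+1,) fixed thresholds with
    finite edge values standing in for -inf / +inf."""
    z = H @ w                                    # 1D latent scores
    cum = sigmoid(b[None, :] - z[:, None])       # cumulative probabilities
    p = cum[:, 1:] - cum[:, :-1]                 # per-class probabilities
    nll = -np.mean(np.log(p[np.arange(len(y)), y] + 1e-12))
    # weight decay on both H and w keeps the problem well-posed
    reg = lam_h * np.mean(np.sum(H ** 2, axis=1)) + lam_w * np.sum(w ** 2)
    return nll + reg
```

With `lam_h = lam_w = 0`, scaling w up can keep lowering the data-fit term for correctly ordered features, which is exactly the divergence of w noted above; the penalties remove that incentive.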

[Q2] OR vs standard classification comparison from NC perspective

[A4] We had not considered this perspective so far; thank you for your insightful comment. Indeed, since OR tasks can also be approached as classification problems, a comparison of the two approaches is feasible. We examined this point from a theoretical viewpoint, but could not draw any conclusion regarding which approach is better within the short rebuttal period. A detailed investigation and comparative experiments would be necessary, so we would like to leave this as a direction for future work.

[Q3] UFM analysis insights for choice of thresholds

[A5] We believe that this is feasible. For example, in the class-balanced case it is natural to assign equal probabilities to each class. Once the inverse link function g and the number of classes Q are fixed, the appropriate placement of the thresholds b is uniquely determined. However, in such cases we have b_0 = -∞ and b_Q = +∞, which implies that, in the context of ONC, the optimal latent values z_1^* and z_Q^* diverge in the weak-regularization limit. This divergence could make deep-learning-based experiments less aligned with the theoretical predictions; therefore, in the experiments we fix those edge thresholds at sufficiently large finite values, making the ignored tail probabilities small enough while keeping the alignment with the theory. We believe this is one of the best practices in the class-balanced setting, in that the result is theoretically well controlled while the classification performance remains optimal.

In the case of class imbalance, one can also assign probability values to each class based on some criterion (e.g., keeping (probability) × (imbalance ratio) constant), and then determine the corresponding appropriate values for b. The corresponding ideal latent values z^* can be computed using the UFM framework. While we already discussed this point in the Discussion section of the current manuscript, we will make an effort to clarify it further in the revised version.
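For the class-balanced case with a logistic link, the equal-probability thresholds described above are simply logits of k/Q. A sketch, where the finite `clip` replacing the infinite edge thresholds mirrors the practice described and its value 8.0 is an illustrative choice, not the paper's:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def equal_prob_thresholds(Q, clip=8.0):
    """Thresholds b_0..b_Q for a logistic-link CLM that assign
    probability ~1/Q to each of the Q classes.  Interior thresholds
    are logit(k/Q); the edge thresholds b_0 = -inf and b_Q = +inf
    are replaced by finite -clip / +clip."""
    k = np.arange(1, Q)
    interior = np.log(k / (Q - k))     # logit(k / Q)
    return np.concatenate([[-clip], interior, [clip]])
```

For Q = 5 this places interior thresholds at ±log 4 and ±log(3/2), and the resulting class probabilities sigmoid(b_k) − sigmoid(b_{k−1}) are all ≈ 1/5, with the edge classes off only by the tiny ignored tail mass sigmoid(−clip).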

Final Decision

This paper extends the theory of Neural Collapse (NC), a phenomenon observed in standard classification, to the domain of Ordinal Regression (OR). The authors introduce and formally characterize "Ordinal Neural Collapse" (ONC) by combining the Cumulative Link Model (CLM) with the Unconstrained Feature Model (UFM). Their key contribution is a theoretical proof demonstrating that optimal solutions exhibit three properties: within-class feature collapse (ONC1), alignment of class means onto a 1D classifier subspace (ONC2), and an ordered alignment of latent variables that relates simply to the model's thresholds (ONC3). Reviewers were in agreement that the work is novel, technically sound, well-motivated, and clearly written, representing a valuable extension of NC theory to a new and important task.

The primary concerns raised by reviewers centered on the empirical validation, which was initially perceived as being conducted on relatively simple datasets with a non-standard architecture, thus limiting the generalizability of the findings and practical insights. However, the authors provided a thorough and convincing rebuttal. They conducted substantial new experiments on the more challenging and imbalanced UTKFace dataset using standard architectures like ResNet and DenseNet. These new results suggested that ONC robustly emerges in more complex and realistic settings and also uncovered a valuable practical insight: the use of fixed thresholds, as suggested by their theory, can improve performance for minority classes. The authors also addressed other reviewer concerns regarding evaluation metrics and technical details, which led one reviewer to raise their score to "Accept".

While some reviewers maintain borderline scores, citing remaining questions about the full practical implications, I believe the paper's contributions are significant enough to warrant acceptance. The primary contribution is a novel and rigorous theoretical extension of NC, and the authors have successfully demonstrated this phenomenon both analytically and empirically. The additional experiments conducted during the rebuttal have substantially strengthened the paper's claims and addressed the most critical initial weaknesses. Therefore, I recommend this paper be accepted. I strongly encourage the authors to integrate the new experimental results and the reviewers' feedback into the final manuscript to maximize its impact.