The Prevalence of Neural Collapse in Neural Multivariate Regression
We provide a theoretical and experimental analysis of Neural Regression Collapse, which is prevalent across multivariate regression tasks.
Abstract
Reviews and Discussion
The paper introduces a novel extension of neural collapse to neural regression collapse, demonstrating that a similar phenomenon exists in multivariate regression. It treats the last layer feature vectors as free variables when minimizing the loss function and derives results similar to those of traditional neural collapse.
Strengths
The paper is well-written and has a good structure.
The authors provide a detailed explanation of NRC following the framework of NC. They validate NRC through experiments on multiple datasets and discuss the role of regularizers.
The paper also includes comprehensive proofs, demonstrating NRC's performance under the UFM setting.
Weaknesses
- It seems that Neural Multivariate Regression simply turns the final layer's classification into a regression loss. The two differ only in form; ultimately both can be written as an MSE loss, making little difference when analyzing the NC phenomenon.
- Since NC is a model-agnostic framework, after changing the task, how do the conclusions of NRC differ from those of traditional NC? I think this should be explained more clearly.
Questions
- If the final layer regression directly uses the features learned by the preceding neural network through its weights, and these features already have good properties, can it be understood that the regression model has no actual significance?
- Does the regression here only refer to linear regression? Can nonlinear forms, such as GP regression, also be considered?
- In lines 222-225, why are the norms finite in the UFM regression case?
- In Figure 5, it seems that as the value of the regularizers decreases, the effectiveness of NRC decreases while the MSE performance improves. What is the reason for this? Additionally, could the experimental results from Section 4.4 (removing regularization) be added to the figure for comparison?
- In the experiments validating NRC, the authors used six datasets. However, when testing the effect of the regularizer, did you only select one of these datasets? How does it perform on the remaining datasets?
Limitations
N/A
We would like to thank you for the detailed review and helpful comments as well as for recognizing that our work “derives results similar to those of traditional neural collapse”.
We acknowledge that the analysis for NRC bears some similarities with traditional NC for classification tasks with MSE loss. In our proofs, we leveraged some existing results from UFM theory for MSE loss. However, the regression problem is substantially different from the MSE classification problem, resulting in a very different definition of neural collapse. In classification with balanced data, neural collapse refers to the convergence of the last-layer features and weight matrix to an ETF geometric structure. In contrast, in multivariate regression, the feature embeddings converge to an n-dimensional subspace spanned by the principal components of the features, which converge to the subspace spanned by the row vectors of the weight matrix W. Moreover, in regression tasks, the weight matrix satisfies distinct properties that depend on the covariance matrix of the target variables and its eigenvalues. Therefore, the qualitative nature of the final results is substantially different from that for classification. In addition, the NRC properties provide valuable insights for model training in regression. As demonstrated below, we also present training with a fixed weight matrix W based on the neural collapse solution, which improves training efficiency without compromising performance. We anticipate that further exploration will reveal additional implications, which we leave for future work.
Questions:
- Indeed, the weights and features in the final regression layer have desirable properties, as highlighted by Theorem 4.1. Consequently, instead of learning the final regression layer, we can design the weight matrix W as a random matrix with the structure suggested by Theorem 4.1. By training the model with these weights kept frozen, we can significantly reduce the number of parameters and the computational complexity of training, while still achieving comparable performance in terms of both training/testing MSE and the NRC metrics, as illustrated in Figure D.
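A minimal sketch of this frozen-last-layer idea, assuming the target Gram matrix `G` prescribed by Theorem 4.1 is given (its exact form is not reproduced here, so `G` below is a stand-in; `random_W_with_gram` is our illustrative helper, not from the paper):

```python
import numpy as np

def random_W_with_gram(G, d, rng):
    """Draw a random n x d matrix W with W @ W.T == G (G symmetric PSD, n x n).
    W = G^{1/2} Q.T for any Q with orthonormal columns gives W W^T = G."""
    n = G.shape[0]
    evals, evecs = np.linalg.eigh(G)
    G_half = evecs @ np.diag(np.sqrt(np.maximum(evals, 0.0))) @ evecs.T
    A = rng.standard_normal((d, n))
    Q, _ = np.linalg.qr(A)               # d x n, orthonormal columns
    return G_half @ Q.T                  # n x d

rng = np.random.default_rng(0)
G = np.array([[2.0, 0.5], [0.5, 1.0]])   # stand-in for the Gram matrix of Thm 4.1
W = random_W_with_gram(G, d=64, rng=rng)
print(np.allclose(W @ W.T, G))           # True
```

In a deep-learning framework, `W` would then be registered as a non-trainable parameter (e.g. `requires_grad=False` in PyTorch), so only the backbone is optimized.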
- This paper focuses on deep regression with a final linear layer, where the network is highly overparameterized and capable of learning a wide range of representations. In lines 217-221, we discussed the resemblance of the UFM to standard (multivariate) linear regression. At a high level, one can regard the features h_i, i = 1, ..., N, as the inputs to linear regression. The objective then becomes the minimization of the squared loss between the predicted outputs Wh_i + b and the targets y_i. In standard regression, the h_i's are fixed inputs, whereas in the UFM we optimize over the weights W, the biases b, and, most importantly, over the "inputs" h_i. As for GP regression, this is indeed an interesting research direction, but one that is certainly very rich and beyond the scope of the current paper.
- The norms of W and H have different training dynamics under classification and regression tasks. In classification tasks with CE loss and no regularization, once the features can be perfectly separated (during the TPT), the training objective always decreases if we fix the directions of H and W and only increase their norms (similar to temperature scaling), with the optimum attainable only at infinity. With MSE loss and no regularization, however, the loss function increases towards infinity as the norm of the predictor Wh_i approaches infinity. The MSE loss therefore forces the norms of the predictors to be finite on its own, without the need for regularization. The actual optimal solutions for W and H are characterized in Theorem 4.3 (the no-regularization case), which shows that there are in fact an infinite number of finite-norm solutions.
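The finiteness argument can be made explicit with a one-line scaling computation (our paraphrase, not the paper's notation): fix unit directions and scale the predictor jointly by s > 0.

```latex
% For the MSE loss, each sample's term is a convex quadratic in the scale s:
\| s\,\bar{W}\bar{h}_i - y_i \|^2
   = s^2 \,\|\bar{W}\bar{h}_i\|^2
   - 2 s \,\langle \bar{W}\bar{h}_i,\, y_i \rangle
   + \|y_i\|^2
   \;\longrightarrow\; \infty \qquad (s \to \infty,\ \bar{W}\bar{h}_i \neq 0),
% so the optimal scale is finite:
s^{\ast} = \frac{\langle \bar{W}\bar{h}_i,\, y_i \rangle}{\|\bar{W}\bar{h}_i\|^2}.
% By contrast, for separable CE the loss decreases monotonically in s,
% with the infimum approached only as s -> infinity.
```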
- In Figure 5, the train MSE decreases as the regularization constants decrease, mirroring the decrease of the train MSE in Figure 4 as the weight decay decreases. This observation aligns with modern machine learning practice, where training MSE tends to be lower with weaker or no regularization. Moreover, Figure 5 exhibits a phase change in NRC, with NRC being more pronounced for larger regularization constants. Following your suggestion, we have added the experimental results corresponding to no regularization to Figure 5; see Figure B in the rebuttal PDF for these additional results.
- Figure 4/A has been updated to include experiments on all six datasets, providing a comprehensive validation of the impact of regularization. We have also updated Figure B to include the Reacher, Swimmer, Hopper, and CARLA2D datasets. Due to space limitations, CARLA1D and UTKFace are not included in Figure B; they will be added in a future revision.
Thanks for your detailed explanation. I will raise my score to 6.
Thank you for your insightful review, your comment, and for raising your score.
This paper investigates Neural Regression Collapse (NRC), a new form of Neural Collapse observed in multivariate regression tasks. NRC is characterized by three phenomena: (NRC1) last-layer feature vectors collapsing to the subspace spanned by the principal components of feature vectors, (NRC2) feature vectors collapsing to the subspace spanned by weight vectors, and (NRC3) the Gram matrix for weight vectors converging to a form dependent on the targets' covariance matrix. Empirical evidence from various datasets and architectures supports NRC's prevalence. The Unconstrained Feature Model (UFM) explains these phenomena, indicating NRC emerges with positive regularization parameters. This study extends Neural Collapse to regression, suggesting a universal deep learning behavior.
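To make the first two criteria concrete, here is a small sketch of how NRC1 and NRC2 can be measured as subspace-distance ratios (our illustrative implementation; the paper's exact normalizations may differ, and `nrc1`/`nrc2` are hypothetical names):

```python
import numpy as np

def nrc1(H, n):
    """NRC1 (assumed form): fraction of centered feature energy outside the
    top-n principal subspace of the last-layer features H (d x N)."""
    Hc = H - H.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(Hc, full_matrices=False)
    P = U[:, :n] @ U[:, :n].T            # projector onto the top-n PCs
    resid = Hc - P @ Hc
    return np.linalg.norm(resid) ** 2 / np.linalg.norm(Hc) ** 2

def nrc2(H, W):
    """NRC2 (assumed form): feature energy outside the row space of W (n x d)."""
    Q, _ = np.linalg.qr(W.T)             # orthonormal basis of W's row space
    P = Q @ Q.T
    resid = H - P @ H
    return np.linalg.norm(resid) ** 2 / np.linalg.norm(H) ** 2

# toy check: features lying exactly in the row space of W give zero metrics
rng = np.random.default_rng(0)
W = rng.standard_normal((2, 16))         # n = 2 targets, d = 16 features
H = W.T @ rng.standard_normal((2, 500))  # features confined to span(W^T)
print(nrc2(H, W) < 1e-10)                # True: no residual outside row space
```

NRC3 additionally compares the Gram matrix `W @ W.T` against a form depending on the targets' covariance matrix, whose exact expression we do not reproduce here.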
Strengths
The paper addresses the significant issue of Neural Collapse in regression tasks, extending its understanding beyond classification and suggesting a universal behavior in deep learning models.
Weaknesses
In line 601, a factor appears to be missing in the matrix.
Questions
Could the authors explain in detail how Eq. (24) and Eq. (25) are derived in the appendix?
Limitations
See Weaknesses.
We would like to thank you for the detailed review and helpful comments. We also thank you for recognizing NRC as a “new form of Neural Collapse observed in multivariate regression tasks” and that “this study extends Neural Collapse to regression, suggesting a universal deep learning behavior”.
Weaknesses:
Indeed, in line 601, a multiplicative factor was missing in the matrix that you quoted. We have spotted this typo in our manuscript and corrected it.
Questions:
Finally, let us elaborate on the derivation of Eq. (24) and Eq. (25) in the Appendix. The proof of Theorem 4.1 leverages [Zhou et al., 2022a, Lemma B.1]. For clarity, we have restated their lemma in our notation; see Lemma D.1 in Appendix D. In order to derive Eq. (23) in our work, Zhou et al. first showed that, if and denote the -th column of and respectively, then, when and (if this is not the case, then ), we have that
with , for all , where is the -th singular value of and , are the corresponding left and right singular vectors. Rewriting the equation above (see Eq. (22) in [Zhou et al., 2022a, Lemma B.1]) in matrix notation readily yields Eq. (24) and Eq. (25) in the Appendix. We will include this clarification in the Appendix.
I sincerely apologize. While reviewing, I initially wrote the comments in markdown, but it seems that I didn't manage to paste all of them. Overall, the Neural Collapse in regression is very interesting, and this is a work of significant originality and innovation. I appreciate the value of this research.
We thank you for your review and your appreciation of our work!
The authors explore a new notion of neural collapse which has been formulated to accommodate multivariate regression tasks. NC was originally introduced and recognized as an artifact of multi-class classification tasks. While there has been extensive research into the phenomena of neural collapse for classification, this paper is interesting in that it is the first theoretical study of NC for regression problems. The results are novel and the methodology is sound.
The paper is nice because it expands the view of NC in deep learning to cover regression as well as classification. The authors shed further light on this phenomenon, and their results fit nicely into the existing literature. Since this is a new viewpoint (regression analysis) on a well-studied phenomenon (neural collapse), there is some question as to whether their formulation is the right approach. To me, their definition seems compelling.
Essentially, the idea is that instead of looking at the class means or class features and the simplex formed by those vectors in embedding space, you instead look at the first n principal components of the sample features in embedding space, where n is the dimension of the target space (e.g., n = 1 for univariate regression). In the same way that NC for classification says the class feature means collapse to an ETF, in the regression setting the sample features collapse to the span of these principal components in embedding space.
优点
The paper is well-written and well motivated. This is an interesting problem and it's nice to see such a well thought out first attempt at extending an important phenomenon (i.e. neural collapse) to a more broad task setting. The theory is solid and well supported with clear and careful proofs.
缺点
It would be nice to see more comprehensive and consistent experiments (I've outlined suggestions below). The authors are introducing a new concept; or more specifically, they are reframing a well-known concept to a new setting (i.e. neural collapse to neural regression collapse). It is of course impossible to include all possible datasets and architectures but when claiming such a novel and fundamental property of deep learning there is an additional onus on the authors to be as comprehensive as possible.
Overall, the primary weaknesses concern the experiments, which are good but lacking. A few things are missing from the experiments that I would like to see, and that would alleviate some of my confusion around these results:
- None of the plots in Figures 1-6 include error bars. Were multiple seed trials run for these experiments? This seems like a glaring omission.
- Indicate the value of \gamma that is being used for the NRC3 plots in Figure 2.
- In Figure 3, is this unique (?) minimizer for \gamma/\lambda_{max} stable or consistent across multiple trials with differing seeds and initializations?
- The representation of datasets is neither comprehensive nor consistent. I understand not all plots can be contained in the main paper, but it would be nice to see them in the Appendix. For example, Figures 4 and 5 (which examine similar quantities) look at CARLA2D & Swimmer and Swimmer & Reacher, respectively. Figure 3 does not include the CARLA1D or UTKFace datasets. Why?
问题
There is extensive analysis of the neural collapse to the first n principal components. Did you do any work to justify that n is the right number of principal components to assume? The embedding dimension could be quite large in comparison. How would your analysis change if you let the number of components differ from the dimensionality of the targets? For example, what happens to your NC conditions when you vary the number of components for one of your robotic datasets? I realize this doesn't lend itself to as clear an interpretation (if any), but, to play devil's advocate, it would be helpful to see that analysis, particularly in the form of some explained variance ratio. For classification tasks, the structure of the ETF during neural collapse is very natural. For a regression task, particularly when introducing a new definition, an additional justification, or at least an additional analysis, is required.
High-level question out of curiosity: There are standard ways to convert a (univariate) regression task into a multi-class classification task (e.g. via quantile binning). Similarly, there are standard ways to express a classification task as a regression task. How would the standard notion of NC for classification compare with your formulation of NC for regression under these two conversions? It would be interesting to see how analogous or stable this proposed definition is under such a re-framing.
line 71 (clarification): you state that when the regularization parameters are zero or very small, there is no collapse. Does that mean all of NRC1-NRC3 fail, or just some?
line 77: you mention that this framing of NC for regression can lead to more explainable models and potentially more efficient training processes. Can you reasonably justify this claim? How exactly? Do you have any references or reasonable examples (albeit from the classification setting) that you can point to, and how would those compare to a regression task?
In the related work (lines 88-92) you discuss previous work concerning NC for classification tasks when the classes are imbalanced. In the imbalanced classification setting, some of the original NC properties no longer hold and/or must be reformulated to account for the class imbalance. How does one translate the notion of a balanced dataset to the regression task? This aspect seems to have been ignored entirely. Can you add some detail justifying why it has been ignored, or how it relates to your current framing?
line 150: you state the feature dimension is "typically 256 or larger" and that the target dimension is typically much smaller. Typically with respect to what?
line 199-201: you state "This indicates that NC is…a fundamental property of multivariate regression". This feels like an overstatement. This paper examines 6 datasets over two model classes. The results are certainly very promising, but the world of multivariate regression and model architectures is vastly larger.
In Figure 2: NRC3 states that there exists a constant \gamma \in (0, \lambda_min) such that the derived quantity goes to zero. What value of \gamma are you using in the plots for NRC3 here? According to Figure 3, there seems to be a unique (?) optimal value of \gamma which minimizes NRC3.
line 202: you mention that each dataset (excluding CARLA1D and UTKFace) exhibits a unique minimum value of NRC3 over the range of \gamma's explored. Why is that the case? I may have missed this in the exposition and theory.
In Figure 3: Why are CARLA1D and UTKFace not included? Presumably because these have n = 1. One could still evaluate \lambda_{max}. What do the corresponding plots look like for these univariate datasets?
In Figure 4: you only show results for the CARLA2D and Swimmer datasets. Were the other datasets not included in your experiments? It would be nice to see comparable results, at least in the appendix, particularly if you're claiming the ubiquity of these results for multivariate regression tasks.
line 207: you claim the geometric structure you propose in NRC1-NRC3 is due to the regularization and, in fact, is not exhibited without at least some regularization. In particular, why does NRC not occur (by any criterion) when there is no regularization? And do the models correspondingly not converge? Or do they? Can you include the Train MSE (and/or Test MSE) in Figure 4 as you do in Figure 5? This point about the regularization constant seems important but admittedly remains unclear to me. You mention that this is addressed later (e.g. section 4.4), but the subsequent section is focused only on the UFM model. There is an underlying assumption that the UFM model encapsulates the NRC setting similarly to what has been demonstrated for classification NC. As for intuition, I agree. But a rigorous logical connection for the regression case remains unclear to me.
In Figure 5: I assume that when \lambda_H = \lambda_W = 0, we would see the training diverge, or at least the NRC values diverge? Is that true? What happens in this setting, and how does it connect with what we see in Figure 4?
(typos/nits) In Figure 4: I would change the labeling of the y-axis to denote the quantity being measured (e.g. NRC1 - NRC3) and use the plot titles to indicate the dataset. However, I guess this would turn your 2x3 plot into a 3x2 plot and not be the most efficient use of space. Perhaps you can do something similar to how you've displayed Figure 5.
Figures 4 and 5 seem comparable to me (an exploration of the NRC1-NRC3 values for small weight-decay regularization constants). But why is there an inconsistency between the datasets examined? For Figure 4 you look at CARLA2D & Swimmer, and for Figure 5 you look at Swimmer & Reacher. Why? This is even more glaring because the remaining datasets aren't included in the appendix either. Is that because the results aren't as good?
Figure 5: the caption is not very descriptive. I recommend adding more detail here.
Limitations
Yes, the authors have addressed the limitations
We thank you for your very detailed and insightful review, and for all the positive comments you made regarding our work. We also fully agree that the presentation of the experimental results can be improved. During the rebuttal week, we have worked very hard, running additional experiments and reorganizing our figures to respond to your comments. We believe our experiments are now “comprehensive and consistent”. We promise to include all additional results below in the future revision.
We ran all experiments with 3 random seeds and plot the variance across seeds as the shaded area. As shown in all figures, there is little change across seeds, confirming that NRC consistently emerges in regression.
Concerning your question about \gamma in Figure 2: We ran all experiments for long enough to ensure the training enters the TPT, as measured by R² (see Figure A). After training, we extracted the trained weight matrix W and identified the \gamma that minimizes NRC3 for that specific W. This \gamma was then used to compute the NRC3 metric for all W matrices during training, resulting in the NRC3 curves shown in Figure 2. Figure 3 visually shows NRC3 as a function of \gamma for the final trained W. Mathematically, we can show that, under a mild condition, NRC3(\gamma), as given in the definition of NRC3, is convex and has a unique minimum. Empirically, our experiments show that the \gamma minimizing NRC3 is consistent across different seeds.
We very much liked your suggestion to examine the explained variance ratio. The results are striking, as shown in Figure C: for all datasets, there is significant variance for the first n components, while the remaining components have very low or even no variance. We also examine different definitions of NRC1, with the number of principal components varying from 1 to 4, in the table below to justify setting this number equal to the target dimension n.
| | NRC1_pca1 | NRC1_pca2 | NRC1_pca3 | NRC1_pca4 |
|---|---|---|---|---|
| Reacher | 1.01e-2 | 4.02e-4 | 2.55e-6 | 2.27e-12 |
| Swimmer | 5.48e-1 | 3.96e-11 | 1.10e-11 | 5.02e-12 |
| Hopper | 0.617 | 0.558 | 0.499 | 0.129 |
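The explained-variance computation referenced above can be sketched as follows (a generic PCA spectrum computation; `explained_variance_ratio` is our illustrative helper, equivalent to scikit-learn's `PCA.explained_variance_ratio_`):

```python
import numpy as np

def explained_variance_ratio(H):
    """Per-component explained variance of last-layer features H (N x d)."""
    Hc = H - H.mean(axis=0, keepdims=True)
    _, S, _ = np.linalg.svd(Hc, full_matrices=False)
    return S**2 / np.sum(S**2)

# toy example: features concentrated near an n = 2 dimensional subspace
rng = np.random.default_rng(0)
B = rng.standard_normal((2, 32))
H = rng.standard_normal((1000, 2)) @ B + 1e-3 * rng.standard_normal((1000, 32))
evr = explained_variance_ratio(H)
print(evr[:2].sum() > 0.99)              # True: the first two PCs dominate
```

A flat tail in this spectrum after the first n components is what the collapse of features onto an n-dimensional subspace would predict.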
Concerning your question about converting a univariate regression problem to a classification problem, and vice versa: we believe this is an interesting direction for future work. Of course, with fine-grained quantization the conversion becomes difficult, and the resulting classification dataset would likely be highly imbalanced. In classification, as you state, the dataset can be balanced or imbalanced, and there are NC studies for both. If we were to convert a univariate regression dataset to a classification dataset using quantization, the dataset would be balanced if there are the same number of values in each quantile. As this property is likely unrealistic for most datasets, we can conclude that neural regression has a stronger connection to classification with imbalanced datasets.
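The balance-by-quantization point can be illustrated with a small sketch (`quantile_bins` is a hypothetical helper, not from the paper): binning targets at their empirical quantiles produces classes with near-equal counts even for skewed target distributions.

```python
import numpy as np

def quantile_bins(y, n_classes):
    """Convert univariate targets y into class labels by quantile binning."""
    edges = np.quantile(y, np.linspace(0, 1, n_classes + 1)[1:-1])
    return np.searchsorted(edges, y, side="right")

rng = np.random.default_rng(0)
y = rng.exponential(size=10_000)          # heavily skewed targets
labels = quantile_bins(y, n_classes=10)
counts = np.bincount(labels, minlength=10)
print(counts.min(), counts.max())         # each class gets ~1000 of 10000 samples
```

Binning at fixed value thresholds instead of quantiles would give the imbalanced classification dataset discussed above.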
We believe that the UFM theory provided in the paper helps to explain many properties of neural regression. In terms of efficiency, the experiments in Figure D show that we can fix W to a random matrix with the structure suggested by Theorem 4.1 and use this fixed matrix throughout training. This approach significantly reduces the number of weights to be optimized and results in similar performance and NRC metrics, paralleling a well-known approach in NC for classification.
The number of features in the penultimate layer is typically 256 or larger for deep learning. In multivariate regression, the target dimension n is typically small in comparison; for example, this is the case in most MuJoCo environments, which are often used to study imitation and reinforcement learning.
We agree that “NC is…a fundamental property of multivariate regression” is a bit of an overstatement. We will rewrite this to say “often occurs in neural multivariate regression”.
We omitted the plots of \gamma versus NRC3 for univariate regression with n = 1. For n = 1, the covariance matrix of the targets reduces to a scalar, the variance of the 1D targets. This is just a single value for each of these datasets and does not seem to lead to an insightful plot.
Concerning your question about which NRC metrics fail when the regularization parameters are zero or very small: Figures A and B show that when the weight decay approaches zero, NRC1-3 typically become larger than the values obtained with larger weight decay.
In particular, when there is no weight decay, we run more training epochs to verify the asymptotic behavior of NRC1-3 and the test MSE in Figure A. We observe that NRC1-3 have a strong tendency to converge (there is a relatively small amount of collapse, since gradient descent tends to seek solutions with small norms), while the test MSE increases on the small MuJoCo datasets. Theorem 4.3 provides some insight: when there is no regularization, there is an infinite number of non-collapsed optimal solutions under the UFM, whereas Theorem 4.1 shows that when there is regularization, all solutions are collapsed. With regularization, we are seeking a small-norm optimal solution, which leads to NRC1-3.
The main assumption in the UFM model – for both classification and regression – is that the neural network is capable of mapping any set of training inputs to any set of feature vectors. Based on this assumption, the UFM model leads to a new optimization problem for classification and regression. We believe the UFM model for regression makes the same logical connection that is made for classification.
We have included UFM experiments with no regularizer in Figure B. Similarly to Figure A, NRC1-3 do not converge to very low values, as they do when there is regularization. As indicated in Section 4.5, Figure 5/B is intended to validate the results under the UFM framework. To align with the UFM assumption, we remove the ReLU in the penultimate layer so that the learned features can take any real values instead of being restricted to positive ones.
Thank you for your detailed reply.
Just to clarify, do I understand correctly that all of the experiment plots you present in the original paper (Fig. 2 - Fig. 5) have been run with 3 different seeds? It's hard for me to see any shaded error bars on the plots themselves (except for some of the plots in Fig B of the rebuttal). Are you saying there is that little change across seeds for all experiments?
Thanks for the additional experiments you've included in the rebuttal document. In particular, Fig. C with the explained variance ratio seems nice, and I would recommend including it in the final version of the paper.
Thanks for the additional work for the rebuttal. I'm glad to hear you feel the comments have helped to improve the paper. I will keep my score as it is.
Thank you for your comment. The original paper used only one seed. In response to your review, we re-did all the experiments over all 6 datasets with 3 seeds, and the results are shown in the figures of the rebuttal. The shaded regions are generally narrow, showing little change across seeds for most (but not all) of the datasets.
We will include Figure C (and all of the other updated figures in the rebuttal) if accepted. Thanks again for your excellent review.
This paper studies the neural collapse phenomena in neural multivariate regression. The authors rigorously analyze the neural collapse behavior using a simplified model that only includes the last two layers. It was shown that in the multivariate regression task, the last layer of the neural network would collapse to a certain structure that aligns with the principle components of the last layer features. Moreover, the authors highlight the importance of regularization in the prevalence of neural collapse and argue that small regularization may lead to a non-collapsed solution. Experimental results on various real-world datasets are presented to support the theoretical findings.
Strengths
The presented study is novel: it discovers the neural collapse phenomenon in multivariate regression, extends the boundary of neural collapse, and provides a new understanding of neural multivariate regression. The theoretical and experimental results are solid and well-organized.
Weaknesses
My major concern is the significance and potential impact of the results:
- Neural collapse under MSE loss has already been extensively studied in the literature [1]. The analysis technique in this paper is not new, and the only difference is in the distribution of Y.
- The authors don't provide enough evidence to support the potential impact of neural collapse in multivariate regression. For example, does neural collapse imply better generalization or robustness in regression [2]? Does fixing the last layer to the neural collapse solution help the training [3]?
Overall I would encourage the authors to add more discussion about the practical message to increase the impact of the current paper.
[1] Neural Collapse Under MSE Loss: Proximity to and Dynamics on the Central Path
[2] Prevalence of Neural Collapse during the terminal phase of deep learning training
[3] A geometric analysis of neural collapse with unconstrained features
Questions
- Since neural collapse primarily concerns the terminal phase of training, it would be helpful to include R² for each experiment to confirm the network has entered the terminal phase of training.
- Is there any specific reason for not regularizing the bias term in the UFM (Equation 1)?
- In Theorem 4.3, it was shown that without regularization there are an infinite number of global minima that are not collapsed solutions. However, it is well known that linear regression trained with gradient descent is implicitly biased towards the minimum-norm solution. I am therefore curious whether collapse will happen in the optimization result on the UFM.
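The implicit-bias fact invoked here can be checked in a few lines (a generic underdetermined linear least-squares illustration, not the UFM itself): gradient descent started at zero converges to the minimum-norm interpolant, i.e. the pseudoinverse solution.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 100))        # 20 equations, 100 unknowns
y = rng.standard_normal(20)

w = np.zeros(100)
lr = 1.0 / np.linalg.norm(X, ord=2) ** 2  # safe step size (1 / lambda_max)
for _ in range(5000):
    w -= lr * X.T @ (X @ w - y)           # gradient of 0.5 * ||Xw - y||^2

w_min_norm = np.linalg.pinv(X) @ y        # minimum-norm interpolating solution
print(np.allclose(w, w_min_norm, atol=1e-6))   # True
```

The iterates stay in the row space of `X` because the initialization is zero and every update is of the form `X.T @ (...)`, which is what forces the minimum-norm limit.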
- Discussing the difference with [1] would be beneficial: it shows that neural collapse happens in classification tasks without any regularization, whereas the current work shows this is not the case in regression.
[1] An unconstrained layer-peeled perspective on neural collapse
Limitations
The authors have addressed the limitations properly.
We would like to thank the reviewer for the detailed review and helpful comments. We also thank you for recognizing that the “study is novel,” “provides a new understanding of neural multivariate regression,” and that the “theoretical results and experimental results are solid and well-organized”.
We agree, however, that we didn't sufficiently emphasize the significance and impact of the results. Indeed, neural collapse for the classification problem under MSE loss has been extensively studied. We will update our paper to fully describe our contribution with respect to earlier works on MSE loss for classification. Indeed, in our proofs, we leveraged important existing results from the UFM theory for MSE loss. But the regression problem remains substantially different from the MSE classification problem, a difference which is highlighted by a very different definition of neural collapse. Classification with a balanced dataset gives rise to an ETF geometric structure, whereas multivariate regression gives rise to a distinctly different structure defined by our NRC1-3 definitions, which involve subspaces spanned by principal components and the covariance matrix of the targets. This new definition requires an entirely new empirical analysis on entirely different datasets. And although the mathematical UFM derivations for regression mirror those for classification, the qualitative nature of the final results - involving the covariance matrix of the targets and its eigenvalues - is substantially different from that for classification.
In terms of impact, the NRC properties offer valuable insights into model training for regression tasks. For example, we can fix the last layer to a neural collapse solution by setting the weight matrix to be a random matrix which satisfies the structure suggested by Theorem 4.1. By training the model with these weights kept frozen, we can significantly reduce both the number of parameters and the computational burden, while still maintaining comparable performance in terms of training/testing MSE and NRC metrics, as shown in Figure D. We will include this figure and the corresponding discussion in the revised version. Thank you for your suggestion. We believe that the empirical and theoretical results in this paper will guide the future design of multivariate regression problems.
Thank you for your suggestion to investigate R² during the terminal phase of training. In Figure A of the supporting PDF we provide the results, which confirm that the network has entered the terminal phase of training.
We omitted regularization of the bias term for the UFM for two reasons. First, in practice, it rarely improves performance. Second, it allows for a cleaner presentation of the theorem statements and proofs. It could be easily added without changing the main takeaway points.
Thank you for your interesting comment about no regularization. We agree that gradient descent often exhibits an implicit bias towards a minimum-norm solution. We have added the curve corresponding to \lambda_H = \lambda_W = 0 in Figure B, which indeed shows this to be the case, with the NRC metrics typically decreasing as training progresses. This is consistent with the observation in the layer-peeled paper, as you indicated. We will highlight this in the revision.
Thank you for your detailed response and the efforts of additional experiments. I don't have further concerns and would like to raise my score to 6.
Thank you for your insightful review, your comment, and for raising your score.
Since 2020, when [Papyan et al., 2020] published their seminal paper on neural collapse, there has been a flurry of activity in the area, with at least a dozen papers on the topic of neural collapse published in major machine learning venues. To our knowledge, this entire stream of important research is focused on the classification problem.
As discussed in the introduction of our paper, regression is arguably at least as important as classification in modern machine learning. To our knowledge, our paper under review at NeurIPS is the first to examine regression, including multivariate regression, in the context of neural collapse. The paper first proposes a new definition of neural collapse for regression, with a very different geometric description than the ETF definition appropriate for classification. The paper then presents extensive experimental results establishing that multivariate regression indeed exhibits neural collapse as defined by this new definition. In this rebuttal, we supplement our original experimental results with a suite of new experiments, as requested by the reviewers, further confirming the prevalence of neural collapse in regression. Our submission also considers an approximation in which the neural network is assumed to be capable of mapping any set of training inputs to any set of feature vectors, the so-called Unconstrained Feature Model (UFM), which has also been used to analyze neural collapse in classification. Under the UFM, we derive explicit solutions for the optimal features, last-layer linear weights, predictors, and MSE training error, and show that the theoretical results match the empirical results, providing evidence that the UFM can help explain the emergence of neural collapse in regression.
The paper was reviewed by four expert reviewers, and all four of them appeared to be positive about the paper. Reviewer vXK1 writes: “this paper is interesting in that it is the first theoretical study of NC for regression problems. The results are novel and the methodology is sound.” The reviewer goes on to say, “the paper is nice because it expands the view of NC in deep learning to cover regression as well as classification. The authors shed further light on this phenomenon and their results fit nicely in the existing literature…there is some question as to whether their formulation is the right approach. For me, their definition seems compelling.” The same reviewer also writes, “this is an interesting problem and it's nice to see such a well thought out first attempt at extending an important phenomenon (i.e. neural collapse) to a more broad task setting. The theory is solid and well supported with clear and careful proofs”. Reviewer 6Hcy writes, “the presented study is novel, it discovers the neural collapse phenomenon in multivariate regression, extends the boundary of neural collapse, and provides a new understanding of neural multivariate regression. The theoretical results and experimental results are solid and well-organized.” Reviewer T8Jo writes, “the paper addresses the significant issue of Neural Collapse in regression tasks, extending its understanding beyond classification and suggesting a universal behavior in deep learning models”. And reviewer 8FCG writes, “the paper introduces a novel extension of neural collapse to neural regression collapse” and further writes, “the paper is well-written and has a good structure,” and “the paper also includes comprehensive proofs, demonstrating NRC's performance under the UFM setting”.
Although there is a consensus that the paper is novel and interesting, the reviewers raised many important questions, which is perhaps why they originally scored the paper as a “marginal accept” rather than a “strong accept”. Since receiving the reviews, we have worked hard at running complementary experiments (as shown in Figures A, B, C, and D in the attached PDF file) and writing the rebuttals, addressing all of their comments and suggestions. We hope that the reviewers and the area chair will find our rebuttal satisfactory.
This paper presents a study of the neural collapse (NC) phenomenon in a new setting that has not been studied before: that of multivariate regression tasks. While parts of the analysis are similar to previous studies of NC under an MSE loss, the task studied here differs, and thus the structural properties that characterize NC in this case are new. The authors provide these characterizations, supply empirical evidence, and present a theoretical study under the unconstrained feature model. All reviewers believe this is a worthy contribution for the conference, and I agree.