PaperHub
Average rating: 5.8 / 10 (Poster; 4 reviewers; min 5, max 6, std 0.4)
Ratings: 6, 6, 6, 5
Average confidence: 3.8
ICLR 2024

Symmetric Neural-Collapse Representations with Supervised Contrastive Loss: The Impact of ReLU and Batching

Submitted: 2023-09-24 · Updated: 2024-03-16

Abstract

Keywords
Supervised contrastive learning, neural collapse, implicit bias, class imbalance

Reviews and Discussion

Review
Rating: 6

This paper applies a ReLU activation in the last layer of models trained with supervised contrastive learning (SCL). With this, SCL learns representations that converge to the OF geometry irrespective of the level of class imbalance. The authors theoretically and empirically verify the effectiveness of this method. In addition, the paper finds that batch selection matters for the representation geometry, and designs a batch-selection strategy (batch-binding).

Strengths

  1. The theoretical analysis and empirical results complement each other well. Figures 1, 2, and 3 empirically demonstrate the advantages of the additional ReLU function.
  2. The motivation is clear and the paper is well written.
  3. The proposed method is simple and effective.

Weaknesses

  1. Based on the empirical results in the paper (Table 1), the improvements in test accuracy on CIFAR-100 are not significant, even when the imbalance ratio is large. It would be better to show the advantages and disadvantages of the learned representations in different downstream tasks (e.g., does SCL+ReLU perform better in fine-tuning tasks?).
  2. It would be better to conduct some additional experiments on large-scale datasets (e.g., ImageNet100).
  3. It seems that batch size is an important factor when applying sample selection, according to the theoretical analysis. It would be better to add an ablation study on it.
  4. Based on the analysis, it seems that various activation functions can achieve a similar effect as ReLU. Is it possible to add a discussion about that?

Questions

See my comments above.

Comment

Thank you for the constructive feedback and insightful questions. Please find our response below, in addition to the global response:

Batch size: Thanks for highlighting the role of batch size. Indeed, for the contrastive loss and in the analysis of our model, batch size plays an important role in defining the set of global solutions. This is the case because the batching scheme specifies the interactions among samples during training. Our theoretical framework outlines the necessary and sufficient conditions for the batching to achieve a unique minimizer of the contrastive loss. Following your suggestion, we conducted additional experiments for varying batch sizes to empirically explore their impact on the final learned embeddings. The detailed results of these new experiments are presented in Section E.4.3 in the appendix of the revised draft. In these experiments, we examine the effect of batch size in two scenarios, depending on whether ReLU is applied at the final embedding layer or not. We compare the learned features with the OF when ReLU is present at the final layer, and with the ETF (as predicted by Graf et al. (2021)) when ReLU is removed. Our observations indicate that the presence of ReLU reduces the reliance on batch size, leading to consistent convergence to the OF. In contrast, in the absence of ReLU, large batches are necessary to reach the global optimizer, the ETF. We also note that Khosla et al. (2020) argue that increased batch sizes lead to improved performance; in particular, they typically use batch sizes of up to 6144. Our findings suggest that with the addition of ReLU, one can potentially achieve comparable performance with smaller batch sizes, which offers avenues for reducing the computational cost of training.

Activation functions:

Thank you for shedding light on this subtle point. We considered ReLU as the primary focus of our study in terms of the activation function due to its simplicity and frequent use in practice. While it is interesting to study different activation functions, our motivation was to show that a simple and commonly used activation function can significantly alter the implicit geometry of features.

That said, we agree that this is indeed an interesting question. The theoretical analysis only assumes that the post-activation features are non-negative and span the region $[0, \infty)$. It is reasonable to anticipate that activation functions satisfying this requirement lead to the same result. However, the non-linearity of the activation function is likely to play a role in the optimization dynamics and needs further investigation. We plan to conduct specific experiments and report our findings in the paper.

Comment

Dear Reviewer,

Thank you for your time and effort in reviewing our work. We appreciate your useful inputs, and have made sincere efforts to address your comments. Thank you for the constructive suggestions of batch-size ablation studies and discussion on other activation functions. These discussions will be useful additions to our paper. On the test accuracy discussion, we would like to refer you to our global response (including additional results on gains in worst-class accuracy reported in Sec. E.8 in the revised appendix). If you have more questions, we’re happy to provide further information.

We would appreciate it if you would consider re-evaluating your rating.

Review
Rating: 6

This work proposes a variant of the SCL loss that restores the symmetric geometry of the class-mean learned embeddings in the presence of imbalanced class samples, simply by adding a ReLU activation after the projection layer. It theoretically demonstrates that, due to the ReLU, the last-layer feature embeddings of samples with the same label are ultimately aligned, while those of samples with different labels become orthogonal, under full-batch training. For mini-batch training, it also shows that the mini-batch selection strategy significantly influences the learned embedding geometry.

Strengths

  • The proposed approach stands out for its simplicity and effectiveness. It offers valuable insights into the subject matter, shedding light on the restoration of symmetrical geometry in the presence of unbalanced class samples.

  • The theoretical analysis is well-founded, enhancing the credibility of the approach. Experimental results, while limited to relatively simple datasets like MNIST, CIFAR10/CIFAR100, and TinyImageNet, do support the method's efficacy.

Weaknesses

The main concerns still lie in the experimental parts.

  • The experiments primarily utilize CNN-based architectures. It would be beneficial to explore the applicability of this approach to train other architectures, such as Transformers, to gauge its versatility across neural network types.

  • While the proposed approach demonstrates superiority on simpler datasets, questions remain regarding its performance on more complex and larger datasets. The potential impact of the non-negative constraints imposed by the ReLU activation on representation ability in more intricate tasks needs further investigation.

Questions

The minimum of the loss is not zero and varies with the number of samples in different classes.

Comment

Thank you for your appreciation of our messages and analysis, and for your constructive feedback. Please find our response below in addition to the global response:

Additional experiments on transformer architectures: Thank you for the suggestion of extending the experiments to transformer-based architectures. Motivated by this question, we have included additional experiments on vision transformers in the revised version of our paper (see Fig. 22). Based on the observations, our hypothesis of OF geometry holds here as well, thus solidifying the claim of the generality of our results across architectures.

Questions:

The minimum of the loss is not zero and varies with the number of samples in different classes:

That is correct. Due to fixing the norm of the features to a constant value (say 1), the loss does not go to 0. The minimum value is exactly given by the lower bounds in Eq (3) and (4), respectively, for full-batch and mini-batch optimizations. For full batch SCL, we show the loss evolution in a DNN experiment, where the loss is seen to converge exactly to our lower bounds in Theorem 1. This prediction is made possible due to our theoretical analysis.

To demonstrate the effectiveness of our Theorem 1, in Fig 3, we show how closely the empirical loss in a full-batch DNN experiment converges to the predicted lower bound.

Comment

My concerns still revolve around the representation ability of UAM+. Assuming that the feature of the last layer is $n$-dimensional, the representation space of the feature is constrained to only the positive space due to the existence of UAM+, reducing it to $\frac{1}{2^n}$ of the original representation space, shrinking at an exponential rate. Thus, it raises questions about whether the proposed method is sufficient for learning from large and complex datasets. I raised this concern with the authors during the original review, requesting them to conduct experiments on large-scale datasets to address this question, but it appears that the authors have not implemented these experiments.

Comment

Thank you for your response! Please allow a couple of remarks:

  1. First, because of neural collapse (i.e. features of each class collapse to their class means), the features lie on a k-dimensional subspace. In other words, and following the reviewer’s notation for convenience (i.e. n for the feature dimension, and, say, N for the train-set size and k for the number of classes) the rank of the (N x n) feature matrix H is k. Thus, there is no “shrinking of the representation space.”

  2. Second, the above holds irrespective of the inclusion or not of ReLU. In fact, even if ReLU is not present and one considers a simplex ETF geometry for which features are maximally separated, then the cosine of the angle between the features is -1/(k-1). Thus, the cosine actually approaches 0 (same as the angles of OF), as the number of classes k increases.

  3. Third, let's even consider the case of CE optimization on balanced data, i.e., the setting of the original work by [Papyan et al. (2020)] and a long list of follow-up papers. In that setting, again, the features of the last layer usually undergo a ReLU nonlinearity in most common architectures (e.g., ResNet). Hence, your concern would still apply to that setting, which is, however, arguably well established by now in the community. For the above reasons, there are no concerns with regards to the “sufficiency of the method for learning from large and complex datasets.”

With regards to additional experiments: we kindly note that our paper identifies a new phenomenon, which we not only theoretically justify, but also experimentally confirm on four different datasets (MNIST, CIFAR10, CIFAR100, TinyImageNet) with both STEP and LT imbalance, trained on four different architectures (MLP, VGG, ResNet, ViT). Given the range of scales and complexities of these datasets/architectures, we believe this forms sufficient evidence in support of the identified phenomenon. We also kindly note that the breadth of the experiments is consistent with the rest of the literature on neural-collapse phenomena (e.g., Graf et al. (2020), Ji et al. (2022), Sukenik et al. (2023), Thrampoulidis et al. (2022), Yaras et al. (2023), Zhu et al. (2021), Zhou et al. (2022a), Zhou et al. (2022b), etc.). Please note that our experimental evidence is more extensive than that of several prior works. As a matter of fact, and to the best of our knowledge, we are the first to demonstrate neural-collapse geometry convergence for a transformer architecture (ViT). (Thanks again for your suggestion!)

PS: By “UAM”, we believe the reviewer is referring to UFM (Unconstrained Features Model)

Comment

Thank you for your further clarifications. In the era of "large models" and "big data", I still think the experimental results for training ResNet-18/34 and a two-layer ViT on MNIST, CIFAR10, CIFAR100 and Tiny ImageNet are not so convincing. Hence, I will keep the score.

Comment

We sincerely appreciate your thoughtful suggestion to replicate the experiments with larger datasets and architectures. While we recognize that there are settings (particularly in the context of recently studied phenomena with LLMs) where behaviors/phenomena might only emerge at sufficiently large scales of datasets, the setting we consider here does not inherently fall in this category. Hence, from a conceptual standpoint, extension to even larger datasets does not yield additional insights. Instead, we have already consistently observed in the current experiments that when the provided architecture size is sufficiently large and standard training assumptions are met (e.g., training until the terminal phase), the identified phenomenon of invariance in the OF geometry, irrespective of label distribution, emerges. This observation is also supported by theoretical insights grounded in the UFM+. Furthermore, it is consistent with all recent studies on related neural-collapse phenomena. Overall, we have chosen to redirect our efforts towards verifying the cross-situational nature of our results by conducting experiments on architectures with different inductive biases, such as MLPs, ResNets, and ViTs, across various imbalance profiles. Additionally, we have delved into a comprehensive investigation of the role of batching in the observed geometry. To the best of our knowledge, this marks the first result of its kind in the existing literature. Once again, thank you for your time and valuable feedback.

Review
Rating: 6

This work studies the training of deep neural networks with a supervised contrastive objective. Empirically, it is shown that if there are ReLU activations after the last layer, then the learned features per class collapse, and that the features of different classes are orthogonal to each other. This is supported by an analytic argument that characterizes the minimum of the loss function as a function of the features (instead of the network weights). This feature geometry is independent of the class ratios. Moreover, a batching strategy is derived that guarantees the emergence of this feature geometry.

Strengths

  • The finding of this work, that training a deep neural network with a supervised contrastive loss and ReLU activations leads to the same feature geometry irrespective of the class imbalance, is very surprising. Prior work showed that this is not the case for the cross-entropy loss. Moreover, it is shown that this feature geometry is a common minimizer for all batches. This results in a straightforward proof, as there are no 'averaging effects' to account for.
  • The paper is well written
  • Related work (with one exception) is extensively discussed and compared to.

Weaknesses

  • Significance: The work appears to be written on the premise that achieving a symmetric (orthogonal or simplex) feature geometry is beneficial. For example, a batching strategy is devised that guarantees such a geometry. While symmetry seems reasonable for class-balanced data, for imbalanced data this is not so clear. I did not see an argument supporting this premise, and thus also none for why using the proposed batching strategy is a good idea. Moreover, there is little empirical support, as the use of ReLUs does not have any effect on classification accuracy, cf. Table 1.

  • The analytical argument requires a very strong assumption: unconstrained features. Moreover, only the loss minimizer is derived, but optimization effects for this non-convex problem are not accounted for. On the other hand, this strong assumption allows for results that are architecture independent.

  • The empirical part of this work would be stronger if the theoretical findings were verified on a large-scale imbalanced dataset (compared to the mid-scale datasets CIFAR10, CIFAR100 and TinyImagenet).

  • The very related work [Wang & Isola, Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere, ICML 2020] is not cited. This work studies feature geometries when minimizing the (unsupervised) contrastive loss and should be compared to.

Questions

Are there empirical differences when training models with and without ReLU activations after the last layer? If so, can such differences be explained by the targeted feature geometry? Are such differences more pronounced when using the batching strategy?

Comment

Thank you for your supportive comments and useful feedback. We reply to the questions below in addition to the global response:

UFM assumption: We would first like to emphasize that our work addresses the problem for highly overparameterized models and training beyond zero training error. This is the setting considered in the majority of previous works on neural collapse [Mixon et al. (2020); Fang et al. (2021); Graf et al. (2021); Lu & Steinerberger (2020)]. The UFM-inspired theoretical analysis in the literature is also motivated by this overparameterization assumption. In fact, the UFM assumes that overparameterized models are capable of representing any function mapping the input to the embedding space. As pointed out by the reviewer, this is indeed a strong assumption, although one that has proved to be very effective. Our findings, alongside the results in a series of previous related works, confirm this model's efficacy in predicting the properties of the embeddings learned at the final layer for large enough models. In other words, the high expressivity of overparameterized models makes the UFM a useful tool for identifying the global-minimizer feature configuration, so that the impact of the loss function itself can be isolated from the role of the network architecture. Our work shows that the UFM remains a useful mathematical model even with the non-linear ReLU activation.

On the optimization aspect, we believe that studying the optimization dynamics of the UFM is a promising direction for future work. Our experiments provide evidence of consistent convergence towards the global minimizer (OF geometry) despite the non-convexity of the network optimization.

Empirical: We appreciate the reviewer's recommendation regarding experiments on large-scale datasets and will consider adding these experiments to the final version of the paper. Along these lines, we are excited to report new experimental findings in support of our hypothesis of OF geometry on a transformer architecture. We have included additional experiments on vision transformers (see Fig. 22). Based on our current observations, our hypothesis of OF geometry holds here as well, thus solidifying the claim of the generality of our results across architectures.

In summary, we would like to note that our current findings are backed up by experiments spanning various architectures and datasets, including MLP, ResNet, and DenseNet applied to MNIST, CIFAR10, CIFAR100, and TinyImageNet. Across this range of tasks, from simpler and smaller ones like MNIST (10 classes) to more complex and larger tasks like TinyImageNet (200 classes), we have not observed significant changes in the convergence behavior or in our key message (see Fig. 2), as long as the network and the dimension of the output features are large enough. The observations from all of the above experiments provide us with the confidence to believe that the core message of our paper remains largely unchanged. This is the case for both our main findings: convergence to the OF and the efficacy of the new batch-binding scheme.

Missing citation: Thank you for pointing this out. We agree that Wang & Isola (2020) is a relevant work; omitting it was an editing oversight on our part. We have included the reference and commented on the relationship.

Empirical differences: At the moment, the main empirical differences observed are the following: the test accuracies (Tab. 1) and the new analysis of worst-class accuracies with and without ReLU. We discuss this in the global response as well. We are not aware of differences in the batching aspects. In fact, the observation in Fig. 21 suggests that the batch-binding scheme is beneficial for convergence to the final feature geometry even when training without ReLU.

Comment

I have read the authors response. It appears that my first and major concern (missing motivation for promoting a symmetric feature geometry) has not been addressed.

Moreover, I disagree with the authors, that a 'major empirical difference can be noted in the test accuracies (Tab. 1)'. The observed differences are not significant and could be due to randomness in the (presumably 5) training runs. The results in the new table in Section E.8 are convincing and show that adding ReLUs improves classification accuracy on worst classes. However, at the cost of classification accuracy in the best classes, as overall accuracy remains unchanged (as concluded from Tab 1).

I will keep my rating.

Comment

Please note that there was an unfortunate typo that may have led to a misunderstanding. We did not mean to claim "a major difference". Instead, it was meant to read "the major differences at the moment are the accuracy noted in Tab 1 and worst class accuracy". In other words, these are the two empirical differences that we have observed.

Please note that the typo has been corrected.

Comment

Thank you for the clarification

Comment

Thank you for your time and effort in reviewing our paper. On the motivation, we would kindly refer you to our comments in the global response. At the outset, the explicit goal was not to symmetrize the feature geometry, but to uncover and understand it. This is in line with the research goal of the rapidly growing interest in neural collapse and its implications for DNN training with various objective functions. Please note that there is no prior work that characterizes the feature geometry, and thus the behavior of the loss function in inducing it, for SCL under imbalances. Prior to our work, the understanding of the intricate role of mini-batch optimization was limited as well. We have identified a setting where the geometry is predictable, incidentally observing that it is symmetric. Based on the works in the literature on related loss functions under class imbalance, this result remains surprising. Additionally, we show in practical experiments that the convergence quality is excellent compared to experimental results on geometry under imbalance in the literature. Thus, we believe that our discovery, supported by concrete theoretical results, paves the way to important future questions in understanding the loss functions that are used across datasets and model architectures, and identifying characteristic properties of various loss objectives.

Review
Rating: 5

Background summary

This paper focuses on neural collapse (NC), where the hidden features of samples within a class collapse onto one another, and the orthogonal frame (OF), where the mean vectors of the hidden representations of each class are orthogonal to one another. For the theoretical framework, the paper chooses the unconstrained features model (UFM), where the loss is minimized over the classification layer and the last-layer features, unconstrained by the other model parameters.

  • UFM with cross-entropy loss: $\min_{w_c, h_i \in \mathbb{R}^d} \mathcal{L}_{CE}(\{w_c\}, \{h_i\})$
  • UFM with supervised contrastive loss (SCL) over a batch $B$ (for the full batch, $B = [n]$, i.e., the sum runs over $i \in [n]$); a minimal code sketch of this objective appears right after this list:
    $\min_{h_i \in \mathbb{R}^d} \sum_{i \in B} \frac{1}{n_{B, y_i} - 1} \sum_{j \in B,\, y_j = y_i,\, j \neq i} \log\Big(1 + \sum_{\ell \neq i, j} \exp\big(h_i^\top h_\ell - h_i^\top h_j\big)\Big)$
  • Symmetric solutions under balanced classes. Intuitively, when classes are balanced (same number of samples per class), both the cross-entropy and the contrastive loss repel samples from separate classes while attracting samples within the same class. This has the effect of finding a geometry that is symmetric w.r.t. the class identity, resulting in an equiangular tight frame (ETF). This attraction/repulsion can be seen in the logits $\sum_{\ell \neq i} \exp(h_i^\top h_\ell)$.
  • Non-symmetric solutions when classes are not balanced. When the classes are not balanced, the symmetry between classes is broken. For example, if one class completely dominates the training set, both the cross-entropy and the contrastive loss lead to solutions where the majority class mean is almost anti-parallel to the other class means.
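
To make the batch objective above concrete, here is a minimal NumPy sketch of the per-batch SCL loss in the reviewer's notation, with an optional ReLU before feature normalization to mimic the paper's modification discussed below. The function name, the unit-norm normalization step, and the omission of a temperature parameter are simplifying assumptions of ours, not the paper's implementation.

```python
import numpy as np

def scl_batch_loss(H, y, use_relu=False):
    """Per-batch SCL loss in the notation above (a sketch; names are ours).

    H: (m, d) array of last-layer features for one batch B; y: (m,) integer labels.
    If use_relu is True, a ReLU is applied before the (assumed) unit-norm
    normalization, mimicking the last-layer ReLU modification.
    """
    H = np.asarray(H, dtype=float)
    if use_relu:
        H = np.maximum(H, 0.0)
    H = H / (np.linalg.norm(H, axis=1, keepdims=True) + 1e-12)
    G = H @ H.T                                   # pairwise inner products h_i^T h_l
    m, loss = len(y), 0.0
    for i in range(m):
        positives = [j for j in range(m) if j != i and y[j] == y[i]]
        if not positives:                         # n_{B, y_i} = 1: no positive pair
            continue
        for j in positives:                       # weight 1 / (n_{B, y_i} - 1)
            rest = [l for l in range(m) if l not in (i, j)]
            loss += np.log1p(np.exp(G[i, rest] - G[i, j]).sum()) / len(positives)
    return loss

# e.g. scl_batch_loss(np.random.randn(8, 16), np.array([0, 0, 1, 1, 2, 2, 3, 3]), use_relu=True)
```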

Main contribution: symmetric solutions by passing last hidden layer through ReLU

The main contribution of the paper is showing that if the SCL loss is optimized over the space of all-positive hidden features, namely by passing them through a ReLU activation (a model denoted UFM$_+$), the UFM leads to symmetric solutions, i.e., the NC & OF properties hold again.

  • Thm 1 proves that for the full-batch contrastive loss, if we additionally pass the hidden layers through a ReLU, i.e., optimize the last hidden-layer features over elementwise non-negative vectors $h_i \ge 0$, the global minimum is achieved when for all $i, j \in [n]$ we have $h_i^\top h_j = \mathbb{1}(y_i = y_j)$.
  • Thm 2 proves a similar property for the mini-batch contrastive loss, where the collapse and orthogonality properties hold within each batch: for samples within each batch $B \in \mathcal{B}$ and each pair $i, j \in B$, we have $h_i^\top h_j = \mathbb{1}(y_i = y_j)$.
  • Cor 2.1 shows a batching property that, if satisfied, guarantees that the neural-collapse and orthogonal-frame properties hold across the full dataset; it can be summarized as three conditions (a small connectivity-check sketch follows this list):
    1. every batch has a sample from each class;
    2. the batch-connectivity graph is connected (nodes are samples $u \in [n]$; an edge means $u$ and $v$ appear in the same batch $B$);
    3. the sub-graphs corresponding to each class are also connected.
  • The paper validates the theory on various ResNet architectures. The experiments cover two variations of imbalanced datasets (step and long-tailed) and show that the proposed approach (passing the last layer through ReLU) does indeed lead to symmetric solutions. The experiments also include a result (Table 1) showing that the proposed approach leads to higher test accuracy.
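
Below is a small, self-contained sketch (ours, not the paper's code) that checks the three batching conditions above for a given batching scheme, plus one plausible reading of the batch-binding idea: appending a fixed example of each class to every batch, which satisfies all three conditions whenever every sample appears in at least one batch. The function names and the exact binding construction are assumptions for illustration.

```python
import numpy as np

def _connected(nodes, edges):
    """Union-find connectivity check restricted to the given node set."""
    nodes = [int(u) for u in nodes]
    if not nodes:
        return True
    idx = {u: k for k, u in enumerate(nodes)}
    parent = list(range(len(nodes)))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for u, v in edges:
        if u in idx and v in idx:
            parent[find(idx[u])] = find(idx[v])
    return len({find(k) for k in range(len(nodes))}) == 1

def satisfies_batching_conditions(batches, y, num_classes):
    """Check conditions 1-3 above; `batches` is a list of sample-index lists."""
    y = np.asarray(y)
    # 1. every batch contains at least one sample from each class
    if any(len({int(y[i]) for i in B}) < num_classes for B in batches):
        return False
    # an edge (u, v) means u and v appear together in some batch
    edges = [(int(B[a]), int(B[b])) for B in batches
             for a in range(len(B)) for b in range(a + 1, len(B))]
    # 2. the graph over all samples is connected; 3. each class sub-graph is connected
    return (_connected(range(len(y)), edges) and
            all(_connected(np.flatnonzero(y == c), edges) for c in range(num_classes)))

def add_binding_examples(batches, y, num_classes):
    """Hypothetical batch-binding: append one fixed example per class to every batch."""
    y = np.asarray(y)
    anchors = [int(np.flatnonzero(y == c)[0]) for c in range(num_classes)]
    return [list(B) + [a for a in anchors if a not in B] for B in batches]
```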

Strengths

  1. The presentation. The presentation of results, both in terms of writing, mathematical notation and formula, and the figures, is exquisite. The authors take many positive steps in ensuring that the concepts are clear and simple to grasp for the reader.
  2. The selection of the research problem and the ideas presented by the author are (to the best of my knowledge) original. The simplicity of the ideas as well as the presentation of the results make the results intriguing to read.

Weaknesses

Main issue

  • While the stated objective of the paper, i.e., embeddings that satisfy the NC & OF properties, is clear, it is not at all clear that it is a justified goal. For example, consider the simple case of binary classification. The optimal solution without ReLU will map the two classes to some anti-parallel vectors $u$ and $v = -u$. This means that the hidden representations of the two classes have the maximum degree of separation, $\|u - v\| = 2$. Now, if we pass them through ReLU, forcing them to be element-wise positive, the solutions will instead be some $u$ and $v$ such that $u^\top v = 0$, so their class-mean distance will be $\|u - v\| = \sqrt{2}$. Without any formalities, what does not seem to be a benefit in terms of distance does not seem to be a good approach for generalization either. Even if we make the two classes here very imbalanced, the anti-parallel geometry still seems better than the orthogonal one. Since the authors do not make any formal/theoretical comments on this topic, the only relevant evidence they provide in this direction is the test-accuracy results of Table 1. Because of the aforementioned reasons, I consider the results of Table 1 to be particularly surprising. Thus, if the main evidence to support the utility of the approach is empirical, I would like much more substantive experiments. Some elaborate explanations, or at least speculative comments by the authors, would also go a long way toward explaining the benefits of orthogonal class means.
  • My second main issue is about the novelty and significance of the contributions. The theory, while stated cleanly and clearly, is rather thin. The main theorems are not very surprising and are rather straightforward. As elaborated in the questions section, if I'm not mistaken, similar results can be replicated for the cross-entropy loss in a similarly straightforward fashion. On the empirical front, while the results are definitely interesting (notably Table 1), they are not comprehensive enough to solidify that point. For example, if this approach has any benefits for improving test accuracy, there should be much broader experiments covering different model architectures, datasets, and, most importantly, competing methods that deal with imbalanced classes. I am not advocating here for state-of-the-art performance. However, it is important to see the empirical results in the context of prior/similar work, so that the reader may assess the empirical benefits for themselves. That said, I am open to reconsidering my views (both on theory and empirical contributions) upon hearing the authors' responses.

Minor issues:

  • I think there is a slight overuse of abbreviations in this paper, which makes it slightly hard to read. For example, I had to go back several times to recall what ETF is, or other notations. If possible, please keep the shorthands to a minimum.
  • There is a small discrepancy between eq (1) and eq (2): one has a $(1 + \dots)$ inside the logarithm, and the summation in one is over $\ell \neq i, j$ while in the other it is over $\ell \neq i$. Perhaps the authors can comment on this?
  • In Figures 2 & 5 the axes are not clearly defined; e.g., the caption could say x-axis ... y-axis ...

Questions

While the authors present the idea of symmetrizing the contrastive loss under imbalanced classes, it seems to me that the cross-entropy loss would benefit from the same approach. For example, the CE loss for $c$-class classification over the full batch is given by $\mathcal{L} = -\sum_{i=1}^{n} \log\left(\frac{\exp(h_i^\top w_{y_i})}{\sum_{k=1}^{c} \exp(h_i^\top w_k)}\right)$. It is rather trivial to see that if we pass the last hidden layer through ReLU, the elementwise condition $h_i \ge 0$ ensures that the $w_k$'s are also positive, the resulting inner products $h_i^\top w_k$ are always non-negative, and the global optimum is achieved when neural collapse and orthogonal frames are achieved. For example, if the hidden dimension is at least the number of classes, $h, w_k \in \mathbb{R}^d$ with $d \ge c$, then we can set $w_1, \dots, w_c$ to the standard basis, $w_k := e_k$, and collapse the hidden-layer features onto the corresponding vectors, $h_i := e_{y_i}$. Therefore, my questions are:

  • Do the authors agree with my reasoning? Please correct me if my reasoning above is flawed.
  • If yes, why have the authors left out cross-entropy in their analysis, given that it leads to the same result as the contrastive loss?
  • Are there any pros/cons when comparing the symmetrized CE loss and the contrastive loss, both with the hidden layer passed through ReLU?
Comment

Questions:

Question on CE: It turns out that adding ReLU at the final embedding layer does not guarantee that the classifiers $w_k$ learned by CE are positive. To see this, take a simple binary setting. In the binary setting, we can assume that we only search for a single classifier, as at the optimal solution $w_2 = -w_1$. Now, suppose the embeddings $h_i$, $i = 1, \dots, n$, are fixed to an OF: $h_i = e_{y_i}$, and further suppose the norm of the classifier is fixed (since only the direction matters for the final decision rule). Then, the max-margin classifier that optimizes the objective is $w \propto [1, -1]$. In short, the optimal classifier can have negative entries despite the embeddings being constrained to the positive orthant.
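
A quick numeric check of this binary example (our own illustration, not from the paper): fix the two embeddings to $e_1$ and $e_2$, restrict the classifier to the unit circle with $w_2 = -w_1$, and sweep its direction; the CE-optimal direction is approximately $[1, -1]/\sqrt{2}$, i.e., it has a negative entry even though the embeddings are non-negative.

```python
import numpy as np

# Two classes, embeddings fixed to an OF in R^2: h = e_1 for class 1, h = e_2 for class 2.
# Binary CE with w_1 = w, w_2 = -w, and ||w|| = 1 (only the direction of w is free).
thetas = np.linspace(0.0, 2.0 * np.pi, 100000)
ws = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)   # candidate unit-norm classifiers

# CE loss of the two samples: log(1 + exp(-2 w[0])) for (e_1, class 1)
# and log(1 + exp(2 w[1])) for (e_2, class 2)
loss = np.log1p(np.exp(-2.0 * ws[:, 0])) + np.log1p(np.exp(2.0 * ws[:, 1]))

print(ws[np.argmin(loss)])   # ~ [0.707, -0.707]: the optimal classifier has a negative entry
```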

Our empirical observations also support this claim. First, we note that ResNet networks incorporate ReLU in the final embedding layer (so they fit in this setup). In our ResNet experiments, we have compared the embeddings learned by CE and SCL. In Fig. 7, we report the convergence of the embeddings to the OF, and we observe that CE learns embeddings that do not converge to the OF (despite the positivity constraints). In addition, in Fig. 1 and 10, we plot the heatmap of the embeddings learned with CE for different imbalance ratios. These results indicate that merely introducing ReLU at the embedding layer does not ensure the orthogonality of features learned by CE. Moreover, the experiments in Thrampoulidis et al. (2022) on networks that have ReLU at the hidden layer show that the embedding geometry changes with the imbalance ratio, further suggesting that simply constraining the embeddings to be positive does not guarantee orthogonality of the features under the CE loss. In general, solving for the global minimizer feature-classifier configuration with the CE loss is non-trivial; several works have tackled this problem under different assumptions [Zhu et al. (2021), Graf et al. (2021), Fang et al. (2021), Thrampoulidis et al. (2022), etc.]. In light of these observations, we believe exploring the impact of final-layer adjustments on CE deserves a separate study.

We appreciate the interesting questions, and are happy to provide any further clarifications.

Comment

We thank you for your detailed review and questions. We are pleased to see your recognition of our efforts in presentation. Please find our detailed response below, in addition to the comments in the global response.

For example, consider the simple case of binary classification...: Thank you for prompting this discussion. The statement regarding antipodal features for binary SCL without ReLU is exactly correct. In fact, this holds in general for the balanced multi-class setup as well: when the dataset is balanced, Graf et al. (2020) show that the embedding geometry forms an ETF in the absence of ReLU. In an ETF geometry, we have $\frac{\mu_c^\top \mu_{c'}}{\|\mu_c\|\|\mu_{c'}\|} = \frac{-1}{k-1}$ for the mean embeddings of any two classes, as opposed to the OF, where $\frac{\mu_c^\top \mu_{c'}}{\|\mu_c\|\|\mu_{c'}\|} = 0$.

Therefore, for the ETF, $\|\mu_c - \mu_{c'}\|^2 = \frac{2k}{\tau(k-1)}$, and in the case of the OF, $\|\mu_c - \mu_{c'}\|^2 = \frac{2}{\tau}$, where $\tau$ is the temperature of the SCL loss. While the distance in the ETF geometry is larger for a fixed temperature parameter, we note that by simply scaling the temperature parameter $\tau$ by $\frac{k-1}{k}$, we can achieve an equivalent absolute distance in the learned geometry in the presence of ReLU. Thus, in terms of this measure, the addition of ReLU does not compromise the trained model.

On the contrary, as shown in Fig. 1, in the imbalanced setup, without ReLU, the minority classes are drawn closer to each other, yielding a reduced absolute distance between classes. The introduction of ReLU, however, maintains an equal pairwise distance between classes. Hence, in terms of this measure, the addition of ReLU is in fact helpful.

Finally, as noted in Section D.1.2, centering the OF by $\frac{1}{k}\sum_c \mu_c$ results in an ETF. So, in summary, the geometries learned in balanced or binary scenarios, with or without ReLU, can be considered equivalent up to a shift and scaling.
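
A quick numeric sanity check of the two distance formulas above (our own illustration; the convention that class means have squared norm $1/\tau$ is inferred from the stated distances):

```python
import numpy as np

k, tau = 10, 0.1                        # illustrative values; tau is the SCL temperature
mu_of = np.eye(k) / np.sqrt(tau)        # OF: orthogonal class means with norm 1/sqrt(tau)
v = np.eye(k) - np.ones((k, k)) / k     # simplex-ETF directions: centered standard basis
mu_etf = v / np.linalg.norm(v, axis=1, keepdims=True) / np.sqrt(tau)

print(np.sum((mu_of[0] - mu_of[1]) ** 2), 2 / tau)                    # OF:  2/tau
print(np.sum((mu_etf[0] - mu_etf[1]) ** 2), 2 * k / (tau * (k - 1)))  # ETF: 2k/(tau(k-1))
print(2 / (tau * (k - 1) / k))          # OF distance after scaling tau by (k-1)/k matches ETF
```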

Novelty and significance

Thank you for appreciating the originality of our work and ideas. We believe that solving for the global minimizer feature-classifier configuration with the SCL loss under imbalance is non-trivial and novel. We elaborate below:

  1. Symmetric feature geometry: the first surprising observation that we support with theory is the fact that ReLU symmetrizes the feature geometry regardless of imbalances. To the best of our knowledge, such an outcome is unique to SCL and stands out from several prior works considering class imbalance in CE and other losses. For example, Thrampoulidis et al. (2022) and Fang et al. (2021) discuss how the feature and classifier geometry depends critically on the imbalance. We further identify the appropriate theoretical model for the observation, and solve for the exact geometry of the global optimizers and the optimal value of the loss function.
  2. Batching matters: the first study of the global optimizers of SCL, Graf et al. (2020), only considers the setting of balanced data. In their words, their theoretical analysis is based on “a combinatorial argument which hinges on the balanced class assumption”. The analysis therein requires the specific batching case of all combinations of examples in the dataset of a given batch size. In contrast, our mini-batch analysis applies to arbitrary batching schemes. We discover the batching conditions under which a unique minimizer geometry exists; this condition allows for a large class of batching schemes and includes the specific case of Graf et al. (2020) as a special case. We leverage our theoretical results to concretely explain geometry observations made in DNN experiments as a function of the batching scheme.

We discuss the question of possible extension to CE below, within the response to the specific question.

Minor issues

overuse of abbreviations: Thanks for the feedback. We have added a table defining all the important abbreviations used in the paper in one place at the beginning of the appendix.

eq (1) and eq (2): Eq. (1) and Eq. (2) are equivalent in terms of the loss function. The apparent difference is only in presentation: in Eq. (2), the sum over $\ell$ includes $j$, so that when $\ell = j$, the innermost exponential term becomes $\exp(h_i^\top h_\ell - h_i^\top h_j) = \exp(0) = 1$.
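
A short numeric check of this equivalence for a single positive pair $(i, j)$ (our own illustration of the point):

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 4))
H /= np.linalg.norm(H, axis=1, keepdims=True)
G = H @ H.T
i, j = 0, 2                                               # an arbitrary positive pair

ex_ij = [l for l in range(6) if l not in (i, j)]          # sum over l != i, j  (Eq. (1) style)
ex_i = [l for l in range(6) if l != i]                    # sum over l != i     (Eq. (2) style)

lhs = np.log(1 + np.exp(G[i, ex_ij] - G[i, j]).sum())
rhs = np.log(np.exp(G[i, ex_i] - G[i, j]).sum())
print(np.isclose(lhs, rhs))                               # True: the l = j term is exp(0) = 1
```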

******** Continued below *********

Comment

Dear Reviewer,

Thank you for your time and effort in reviewing our work. We have made diligent efforts to address your comments and we hope that our responses have addressed your questions. If you have more questions, we’re happy to provide further information.

We would appreciate it if you would consider re-evaluating your rating.

Comment

We thank the reviewers for their positive and constructive feedback, as well as their detailed reviews. Their recognition of the surprising nature and originality of our findings (PL8u, BEvk, q7qd, ScWC), the solid theoretical analysis supporting our experiments (q7qd, ScWC), and the thorough discussion of related work (PL8u) is encouraging. We also thank them for noting the clarity and good organization of our paper (PL8u, BEvk, ScWC). We are particularly pleased that all reviewers acknowledge the surprising nature of our finding that the simple introduction of a minor architectural modification, such as the incorporation of ReLU at the last layer, results in a consistent geometry despite label imbalance (STEP/LT and imbalance ratio R). In response, we would like to address two specific points: (i) highlighting an additional distinctive feature of our result and (ii) providing a discussion of our interpretation of these findings.

(i) The quality of convergence to the common geometry (OF) is consistently impressive, even for large imbalance ratios (R) and LT distributions. Notably, this convergence quality is further enhanced when implementing the proposed batch-binding scheme, especially evident in more complex datasets. We find the robust convergence quality as surprising as the convergence of geometries to a common structure (OF). To underscore this point, we remark that the convergence quality is far inferior for the cross-entropy (CE) loss under imbalances (e.g. see Thrampoulidis et al. (2022)). The high-quality convergence emphasizes the practical relevance of our neural-collapse geometry characterizations.

(ii) We interpret the discovery of a simple and cross-situationally invariant structural property in complex deep learning models trained on complicated datasets as a step towards unraveling their underlying mechanisms, moving beyond the conventional perception of these models as black boxes. This perspective aligns with the seminal work on neural collapse of cross-entropy loss and balanced data by Papyan et al. (2020), and the subsequent extensive body of literature cited in our submission. At the same time, our approach offers a rather unique perspective in the following manner: acknowledging an inherent challenge in unveiling the black-box nature of the deep representation geometries of SCL (in its standard use), we identify the inclusion of ReLU at the last layer as a minor network modification that significantly eases the task. Importantly, this comes without compromising test accuracy, as revealed by Table 1 and extensive additional evaluations in Table 2 in the appendix. As a matter of fact, in specific instances such as CIFAR100-LT, the inclusion of ReLU even leads to an improvement in balanced accuracy. Our revised manuscript incorporates additional results showcasing more substantial enhancements, particularly in terms of worst-class accuracy. While it may be tempting to attribute these modest accuracy improvements to the fact that the OF geometry avoids the minority collapse that would otherwise occur (see Fig. 1 for evidence of this happening without ReLU), we refrain from making such a claim, given the lack of substantial and formal evidence in the broader literature establishing a causal relationship between the geometry of trained embeddings and accuracy. Instead, we highlight that our finding yields a learning framework that is at least as accurate as the original one while being easier to understand in terms of the structure of its deep representations.

Comment

Finally, we outline the main additions to the revised manuscript below:

  • Section E.4.3: In response to reviewer ScWC's suggestion, we provide additional experiments examining the impact of batch size on learned embeddings during training. Our findings indicate that when ReLU is present in the final layer, the learned embeddings are less sensitive to the choice of batch size.
  • Section E.8: To further compare the performance of the models with and without ReLU, we report the worst-class test errors for different imbalance ratios in each of the two scenarios.
  • Section E.8.2: Addressing reviewer q7qd's recommendation, we present preliminary empirical results on vision transformers to showcase the applicability of our findings on more general architectures.
AC Meta-Review

This research explores the training of deep neural networks using a supervised contrastive objective. The empirical findings indicate that when ReLU activations are present after the final layer, the features learned for each class tend to collapse, and the features of distinct classes exhibit orthogonality. This observation is substantiated by an analytical argument that characterizes the minimum of the loss function in terms of the features themselves, rather than the network weights. Importantly, this feature geometry is not influenced by class ratios. Additionally, the study introduces a batching strategy designed to ensure the emergence of this feature geometry.

Most reviewers find the result interesting, and well-presented, with a comprehensive literature review. Therefore, we recommend acceptance. Please include all reviewers' comments in the camera-ready version, especially more comprehensive experiments on larger datasets and Transformer architectures.

Why not a higher score

The theory is developed under strong assumptions, and the experiments are not comprehensive enough

Why not a lower score

Most reviewers agree that this is an interesting result.

Final Decision

Accept (poster)