PaperHub
Overall rating: 6.3 / 10 · Rejected · 4 reviewers
Individual scores: 8, 6, 5, 6 (min 5, max 8, std 1.1)
Confidence: 3.0 · Correctness: 3.0 · Contribution: 3.0 · Presentation: 2.8
ICLR 2025

Approaching Deep Learning through the Spectral Dynamics of Weights

OpenReview | PDF
Submitted: 2024-09-28 · Updated: 2025-02-05
TL;DR

We show that an optimization bias in the singular values and vectors of weight matrices links together many complex phenomena in deep learning.

Abstract

Keywords

simplicity bias, grokking, lottery tickets, linear mode connectivity

Reviews and Discussion

Review
Rating: 8

This paper proposes to inspect the evolution of singular values & vectors of weights in training to gain mechanistic interpretability in deep learning. Experiments are conducted on various tasks with various models. Results shed light on grokking, low-rank weights, weight decay, and alignment between adjacent layers.

Strengths

  • This paper is well-written. The phrasing is polished. The literature review is mostly comprehensive. The problem is motivated well with an insightful introduction. I enjoyed reading it.
  • The authors use the tool of spectral dynamics to examine a wide variety of tasks and analyze many recent outstanding questions in deep learning.
  • Extensive experiments are carried out to provide information for the analysis.

Weaknesses

  • The authors reviewed some prior theoretical papers and mentioned that they often cannot explain seemingly relevant phenomena in practice, which I appreciate. But in the same spirit, there are probably cases where this paper's tools cannot explain either. Could the authors give a few such instances or some relevant insights? For instance, in line 118, the authors write that any arguments about generalization cannot be capacity-based alone. Then I wonder whether arguments about generalization can be spectral-dynamics-based alone. Are there cases where the generalizing solution has a higher rank than the memorizing solution? If yes, perhaps the conclusion in line 140 should be revised or stated with more caveats.
  • For the architectures studied in this paper, except for MLPs, there is not a very standard definition of alignment off the shelf. As mentioned in the Appendix, the authors tried a few different definitions before settling on the current version. So I am wondering if the results vary much with different ways of evaluating alignment.

Questions

  • In the top row of Figure 2(a), why is there a large blue shaded region? The caption doesn't introduce what the shade represents. My first guess is the standard deviation, but the shade reaches negative values.
  • In the top row of Figure 2, is the model a one-layer transformer? If so, why are there 6 layers in Figure 2(c) and 50 layers for a 12-layer transformer in Figure 3(d)?
  • In line 359, why can't the last layer be pruned?
  • In line 395, the authors state that the absence of alignment is reminiscent of Mulayoff & Michaeli. Mulayoff & Michaeli found that the alignment appears with GD and SGD with reasonable stepsizes and may break for GD with a large stepsize. So I'm wondering if absence of alignment in Figure 5 is due to large stepsizes and can be recovered if a smaller stepsize is used. Also, if I'm not mistaken, the assumption of whitened input is not necessary for alignment in linear networks.
Comment

Thank you again for your review. Below we respond to your specific questions.

Questions

...large blue shaded region...

This is indeed the standard deviation of accuracy as we run all experiments for 3 random seeds and the seeds converge at very different times. We would be happy to revise the caption to reflect this, and clip the boundary at 0.

...one-layer transformer...

Thank you for the careful attention; this is due to choices driven by space constraints. We use the word "Layer" in the plots to denote "matrix parameter." A 1-block Transformer has W_Q, W_K, W_V, W_O, MLP_1, MLP_2 and input-output embeddings, hence the multiple rows. The choice of this word, while certainly not ideal, was a compromise due to space and readability (in Fig. 5), and also to try to make it clear that the y-axis represents depth (which "matrix parameter" lacks). We would be happy to revise the caption of Fig. 3 to include this detail.
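For readers unfamiliar with this convention, here is a minimal PyTorch sketch of what one "Layer" row corresponds to: each 2-D weight tensor in a block is treated as its own matrix parameter whose singular values can be tracked. The module names mirror those listed above, but the dimensions and the use of nn.Linear are purely illustrative assumptions, not the paper's actual model.

```python
import torch
import torch.nn as nn

# Illustrative stand-in for a 1-block Transformer; dimensions are assumptions.
block = nn.ModuleDict({
    "W_Q": nn.Linear(64, 64, bias=False),
    "W_K": nn.Linear(64, 64, bias=False),
    "W_V": nn.Linear(64, 64, bias=False),
    "W_O": nn.Linear(64, 64, bias=False),
    "MLP_1": nn.Linear(64, 256, bias=False),
    "MLP_2": nn.Linear(256, 64, bias=False),
})

# Each 2-D weight tensor is one "matrix parameter" -- one row in the plots.
for name, p in block.named_parameters():
    if p.ndim == 2:
        svals = torch.linalg.svdvals(p.detach())
        print(f"{name}: shape={tuple(p.shape)}, top-3 singular values={svals[:3].tolist()}")
```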

...why can't last layer be pruned?

For image classification and speech recognition, the output dimensions are very small (10 and 29 respectively). Thus the full rank of the last matrix parameter is bounded by these numbers. Pruning would remove a large amount of the last layer capacity, and in our tests significantly harmed performance in an anomalous way that didn't reflect the compressibility of the rest of the network. A similar decision was made for unstructured pruning by [2].

...absence of alignment in Figure 5 is due to large stepsizes...

Thank you for this careful read. It is our understanding that assumption A.2 (whitened data) was required for the theorems in Section 4 of Mulayoff and Michaeli [3]. We would be happy to revise if mistaken. To discuss the difference: the results that quantify balancedness in nonlinear systems through layer gain (Fig. 6 in [3]) are done on small-scale MNIST, where we also show (Fig. 7) that there is top singular vector alignment. It would seem the more important factor is the complexity of larger networks rather than the choice of learning rate. We're also not totally certain what is meant by a "reasonable" step size here: as we wish to analyze practical systems, we choose hyperparameters to match performant settings in the literature, so these are "reasonable" from the perspective of empirical systems, and lower learning rates would be considered small. Such choices of small learning rates in practical systems lead to worse performance, and they also break similar theoretical work on the connection between the NTK and real systems [4]. Please let us know if we have misunderstood at all.

Comment

Thank you for the detailed rebuttal. Overall I don't have major concerns with this paper.

Some follow-up comments about the rebuttal:

"It is difficult to convey this in a bullet point, hence the conciseness in the writing."

I understand the benefit of conciseness, but it's probably better not to trade accuracy for conciseness in conclusions -- presumably you would want readers to take them away as your newly-found truth. So I'd suggest calibrating the strength of the claims in the conclusion so that they are correct as standalone take-home messages.

"It is our understanding that assumption A.2 (whitened data) was required for the theorems in Section 4 of Mulayoff and Michaeli."

Mulayoff and Michaeli indeed assumed whitened data. But I presume that they needed that assumption for other results they were deriving, not for alignment. It is well-established in many other works that linear MLPs have aligned weights when trained from small or balanced initialization with a small learning rate, and no assumption on the data is needed to prove it. See Theorem 2.2 in [1].

The only place in Mulayoff and Michaeli that reports disappearing alignment is Section 7, where they say that alignment "breaks for GD with a large step size". So I interpreted these two sentences as: the reminiscence lies in the choice of the learning rate, which Mulayoff and Michaeli said can make alignment disappear.

"a weak signal of alignment in the top ranks develops and disappears. This trend is somewhat reminiscent of the theoretical result"

I am happy for the authors to correct me if that was not the reminiscence they were referring to. In either case, the reminiscence argument probably needs a correction or clarification.

[1] Du et al. Algorithmic regularization in learning deep homogeneous models: Layers are automatically balanced. NeurIPS 2018.

Other clarifications in the rebuttal are helpful. Perhaps it'd be good to include them in the paper during revision.

Comment

Thank you for the dialogue. We're glad that the rebuttal was able to answer most of your concerns and will certainly incorporate the changes. We address the additional discussion below.

It is difficult to convey this in a bullet point...

Thank you for the feedback. We agree that this claim is likely a bit too unqualified. To rectify the issue, we propose making the following changes to the contribution bullets in Section 1:

Generalizing solutions have a lower rank than memorizing ones; and

->

Random-label training yields high rank, unaligned parameters when compared to true-label training; and

In Section 7 we propose to make the following change:

...promotes this low-rank bias. We also show that random label training and true label training differ strongly in the rank and alignment of solutions found by optimization, echoing the rank and generalization connection during grokking.

It is our understanding that assumption A.2 (whitened data) was required for the theorems in Section 4 of Mulayoff and Michaeli...

Thank you for this clarification and the helpful comments; we completely agree that there are other ways to derive balancedness (and already reference Du et al. 2018 just prior, in line 377).

Setting aside the whitened data, our initial understanding of the Mulayoff & Michaeli work was flawed. In particular, we did not realize that it relies on the same assumption as prior work (small initialization), and thus reaches the same conclusion (balancedness continues throughout training), and therefore does not comment on the dynamics from unbalancedness to balancedness. Your interpretation is very sensible and we will revise (details below).

"a weak signal of alignment in the top ranks develops and disappears...

Thank you for your careful attention here. The sentences in question were quite unclear and indeed need to be revised. Our intention was to say that the "increasing alignment" is reminiscent, not that the "increasing and decreasing alignment" is reminiscent. Given that Mulayoff & Michaeli do not make arguments about the dynamics starting from random initialization (only empirical results noting that it is worse), even this claim does not make sense. As for what the differences may be: the theoretical settings are not close to the standard practical setting, which uses random initialization far from 0. To answer your initial question, the differences between the theoretical work and our setting are many: large initialization, larger step sizes, more complex architectures, more complex tasks, etc. To revise this we would make the following change:

However, a weak signal of alignment in the top ranks develops and disappears. We are very far from the theoretical settings of prior work like \citet{mulayoff, du, arora} which use balanced or near-zero initialization and smaller step sizes, and the strange trends...

Thank you for your careful attention throughout this process.

Comment

Thank you for the further clarification and proposed fixes. I raise my score from marginal acceptance to acceptance. I hope the planned revisions will be well implemented, and I look forward to seeing the revised version.

Comment

Thank you again for your careful feedback throughout the review process. Your comments have improved the quality of the paper greatly and we will incorporate all of the proposed changes. We are glad this discussion was useful in improving your opinion of the paper.

Comment

Thank you for your comments. Lots of care was taken in writing the introduction and establishing appropriate context given the scope of the work, so we are glad to see that was helpful. We are also glad to see that our experiments resonated. We address your concerns below.

Weaknesses

...probably cases where... tools cannot explain either...

Thank you for the nuanced view. It is difficult to convey this in a bullet point, hence the conciseness in the writing. We believe it is a question of what "generalization" vs. "memorization" means. If we're concerned only with validation performance, then it could be the case that certain tasks require memorization (e.g. in language modeling, where many long-tail words only appear a few times), but in these cases we still expect such solutions to be lower-rank than solutions fit to completely unrelated data (e.g. the random labels shown in Figs 7, 8, 9). A more pertinent limitation of our work is that we do not consider the nonlinearities, which are critical for deep learning. It is thus somewhat surprising that we even have alignment for these nonlinear models (lines 426-427). Follow-up work should take this into account, as alignment between layers might become much clearer post-activation on a per-example basis. We propose modifying the conclusion to include some discussion of this:

...common underlying mechanism. One limitation of this work is the focus on linear alignment between neighboring layers. In linear systems, this alignment is a proxy for the network being able to pass signals from input to output, but once nonlinearities are introduced this proxy looks much weaker. We would guess that there still is alignment in larger systems, but it needs to be viewed post-activation for particular examples. This is a promising direction for the future.

...not a very standard definition of alignment off the shelf...

Thank you for this concern. We clarify our choices here. For alignment, we started from the definition of the linear operations in the network. For MLPs it is straightforward, as there is a single input and output. For ConvNets the parameters map inputs of shape (H x W, c_in) to outputs of shape (c_out,) in a linear fashion, each of which is used as input for a succeeding layer with parameters of shape (H' x W', c_out, c_out'). Thus the correct choice is to compute alignment between (H x W x c_in, c_out) and each spatial position in the next layer (h', w', c_out, c_out'), but then there are H' x W' plots to present. To simplify the presentation we manually inspected each choice, and found alignment to be strongest for the central position of the following layer, hence we used that for visualization. A similar line of reasoning led to the choice of the LSTM alignment, where we chose the gate weights as they had the strongest alignment. We focus on the maximum-alignment choice because we did not in general observe much alignment, so we wanted to double-check whether this was simply a poor choice on our part. For Transformers, we have simple linear transformations, so the case is very similar to the MLP one. We emphasize that these are not arbitrary choices, but rather driven by the mathematics (and [1] concludes similarly). Making a different choice may yield different results, but it is unclear what one would actually be measuring. We would be happy to revise the respective Appendices to include this exact discussion per architecture.

[1] https://arxiv.org/abs/2208.06894
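As an illustration of the MLP case described above, the following sketch (our own toy code, not the paper's implementation) computes one common notion of alignment: the absolute overlap between the top left singular vectors of layer l and the top right singular vectors of layer l+1. The per-architecture choices discussed above (central spatial position for ConvNets, gate weights for LSTMs) are omitted for brevity.

```python
import torch

def top_alignment(W_l: torch.Tensor, W_lp1: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Absolute overlap between the top-k left singular vectors of layer l and
    the top-k right singular vectors of layer l+1 (weights stored as (out, in))."""
    U_l, _, _ = torch.linalg.svd(W_l, full_matrices=False)       # columns live in layer l's output space
    _, _, Vh_lp1 = torch.linalg.svd(W_lp1, full_matrices=False)  # rows live in layer l+1's input space
    return (Vh_lp1[:k] @ U_l[:, :k]).abs()  # (k, k); a diagonal-dominant result indicates alignment

# Toy example: layer l maps 128 -> 256, layer l+1 maps 256 -> 64.
W_l = torch.randn(256, 128)
W_lp1 = torch.randn(64, 256)
print(top_alignment(W_l, W_lp1, k=4))
```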

Review
Rating: 6

This paper presents a framework based on spectral dynamics—the evolution of weight singular values and vectors during training—to unify key deep learning behaviors. The authors reveal a consistent optimization bias across diverse tasks, from image classification to language modeling, and demonstrate that weight decay strengthens this bias beyond norm regularization. Spectral dynamics also distinguish between networks that memorize versus generalize, offering insights into effective sparse subnetworks (lottery tickets) and the structure of loss landscapes through linear mode connectivity.

Strengths

The paper introduces several situations in deep learning and explains them via the spectral dynamics of weights. The advantages are:

  1. The paper establishes a link between "grokking" and rank minimization.

  2. The paper provides different examples of deep learning, including CNN, UNet, LSTM, and transformer.

Weaknesses

  1. There is no clear description of the influence of spectral dynamics on deep learning. For example, in line 421, weight decay can produce low-rank behavior, but how much weight decay is needed, and is weight decay related to some performance metric? Figure 6 shows the phenomenon of adding weight decay, but there is no metric showing the level of weight decay. "The exact choice of “too much” varies across architectures and tasks." cannot explain everything.

  2. The theoretical foundation is weak, providing limited intuition and isolated examples rather than a cohesive theory. In Sections 4.1 and 4.2, the authors provide visualizations of training, but there is no explanation or theoretical analysis. The authors should provide more precise analysis rather than only visualization.

  3. The conclusions of the paper seem too simple and not rigorous; they provide some correct but not particularly useful insights. In Section 6, the authors provide additional connections between spectral dynamics and other phenomena. However, these connections seem to have been verified by previous works, and the paper only provides some examples and visualizations of them.

Questions

Please see the weaknesses.

Comment

We are glad to see that our comprehensive experiments and goal of linking phenomena together were appreciated. We respond to your concerns below.

...how much weight decay is needed...

Thank you for your feedback. At a high level, our goal in this work is to make a scientific contribution toward unifying many different phenomena in deep learning. We concede that we do not provide immediate improvements to neural networks based on this work, but we are in very good company. Many empirical works make fundamental contributions to our scientific understanding without immediate practical applications, and we believe this is a worthwhile endeavor [1, 2, 3, 4, 5]. The applications may come further down the line (model-averaging came after understanding LMC [6]).

To be precise about line 421, we mean the following: for tasks like 10-class image classification very low-rank parameters may lead to generalization as the task is compressive. On the other hand, some memorization may be strictly necessary for something like language modeling with many long-tail words. Thus, even if weight decay acts like a rank regularizer, rank regularization is not immediately ideal, depending on the dataset. To make this more precise, we propose revising lines 429-477 to include the following:

"...minimal rank for a given task, for example language modeling may require some form of memorization for long-tail words and thus a low-rank solution may be incapable of this, while simpler tasks like image classification do not have this property. Still, we note the role..."

Note that we provide quantitative measures for the effective rank based on the level of weight decay in Figure 6 and the performance of networks on the task in Figure 18.
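As a point of reference for these effective-rank measurements (and for the compression suggestion below), here is a minimal sketch of our own: it uses the standard entropy-based effective rank, which may differ from the exact metric in the paper, and truncates a weight matrix to its top singular directions.

```python
import torch

def effective_rank(W: torch.Tensor) -> float:
    """Entropy-based effective rank of the singular value spectrum."""
    s = torch.linalg.svdvals(W)
    p = s / s.sum()
    return torch.exp(-(p * torch.log(p + 1e-12)).sum()).item()

def truncate_rank(W: torch.Tensor, r: int) -> torch.Tensor:
    """Keep only the top-r singular directions of W."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    return U[:, :r] @ torch.diag(S[:r]) @ Vh[:r]

W = torch.randn(512, 512)
print(effective_rank(W))        # close to full rank for a random matrix
W_r = truncate_rank(W, r=16)    # for deployment, store U[:, :16], S[:16], Vh[:16] instead of W
print(effective_rank(W_r))
```

The memory saving comes from storing the three truncated factors (roughly 2 x 512 x 16 + 16 numbers here) instead of the full 512 x 512 matrix.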

To make a simple suggestion on the practical implications, we propose adding the following at the end of Section 5:

"One simple suggestion for the practical utility of these results is: with increased weight decay, it is immediately simpler to compress networks as the effective parameter rank is much smaller (see performance gaps between full and truncated decreasing in Fig 18). Such compression leads to memory efficiency when deploying, and no custom training procedure is needed besides this single hyperparameter."

...theoretical foundation...

One significant contribution of this work is to question the theoretical foundations of existing work in deep linear systems and small-scale nonlinear ones, specifically with respect to alignment between layers, which was an assumption in prior work (lines 215-256, 241-242, 375-397). To develop a theory for a new setting, it is necessary to formalize the setting and provide appropriate assumptions. Without an empirical understanding of how larger systems behave, making good assumptions is difficult, hence the mismatch between the theory in small-scale systems and our results. In this work, we provide comprehensive empirical evidence that can lead to such formalization, but even studying 2-layer neural networks theoretically is the subject of thousands of prior papers (156k results for "two-layer neural network theory" on Google Scholar, so 1k is a healthy lower estimate). Developing a comprehensive theory for all of the systems studied in our work would make the scope of the work much too large.

...conclusion of the paper seems to be too simple...

Our goal is to unify many different phenomena in deep learning, which is difficult to do precisely without a firm understanding of how systems behave. This work provides extensive evidence to ground that understanding, refutes assumptions of prior work (balancedness in Section 4), and presents a broad range of connections linking these phenomena (Sections 3, 5, 6). That a simple view like spectral dynamics ties all of these phenomena together, however, is not a simple/trivial result.

...In section 6...connections seem to be verified by previous works...

This is incorrect. The connections in Section 6 are novel and have not been discussed in prior work. If there is somewhere in the writing that gave this impression, we would be happy to revise.

Comment

I would like to thank the authors for their comprehensive rebuttal, which has helped me better understand this paper. Most of my concerns have been addressed, and I vote to accept this paper.

Comment

We are glad that our rebuttal was helpful in clarifying the contributions of our work and answering your concerns. We will incorporate proposed edits at the time of final submission, subject to space constraints. Thank you for your attention in reviewing our paper.

Review
Rating: 5

The authors propose an empirical approach centered on the spectral dynamics of the behavior of weights during optimization and they find that 1) Grokking is intimately linked to rank minimization; 2) Rank minimization is a general phenomenon in more complex tasks; 3) Weight decay acts implicitly as a low-rank regularizer; 4) Generalizing solutions have a lower rank than memorizing ones; and 5) Top singular vectors are preserved when performing magnitude pruning and while linearly interpolating between connected modes.

Those phenomena provide a coherent framework for understanding the behavior of networks across different settings.

Strengths

  1. This paper is clearly written and the idea is easy to follow.
  2. The authors connect simple settings (grokking) to complex ones (LSTMs, transformers), and explore tasks from different domains including image, speech, and language processing. This shows that rank minimization is a common property that can help people understand deep neural networks.

Weaknesses

As this paper is mainly empirical, it would be better to see more results of different tasks in each domain to give more evidence for the observed properties, as one task in each domain might not be enough to say they are general phenomena.

Questions

This paper is clearly written and I do not have other questions.

Comment

Thank you for taking the time to review. We are glad to see that you appreciated the clarity of the writing, and our goal of drawing a line from simple to complex settings when understanding optimization of neural networks. We address your main concern below.

"...better to see more results of different tasks..."

Certainly, without a tight theoretical characterization, which is very difficult for realistic deep learning models (part of the results of our work is exactly to refute such deep linear models), it is reasonable to desire a comprehensive evaluation. We want to point out that the empirical exploration we present already goes far beyond the standard of evidence for similar work. That we vary architecture and task simultaneously and still see similar results contributes to the generality: if the results were only true on image classification, regardless of architecture/dataset, this would be a weaker argument. If the concern is about task scale, we have experiments on a broad spectrum, from small-scale modular arithmetic/MNIST, to speech recognition on LibriSpeech in the middle, to Transformers on Wikitext-103 at the limits of our budget. Appendix D.5 also includes results on scaling language models. See below for some sample settings of comparable prior empirical works:

  • Progress Measures [1]: Single-layer Transformers on modular arithmetic
  • Deep Grokking [2]: MLPs on image classification
  • LMC [3]: ConvNets on image classification
  • CKA [4]: ConvNets on image classification, Transformers on machine translation
  • Simplicity Bias [5]: MLPs on regression, ConvNets on image classification
  • Why do We Need Weight Decay [6]: ConvNets on image classification, Transformers on language modeling
  • Implicit bias in Deep Matrix Factorization [7]: Deep Linear Networks on matrix factorization
  • Ours: Single-layer Transformers on modular arithmetic, MLPs on MNIST, VGG on image classification, UNet on image generation, LSTM on speech recognition, Transformers on language modeling, large scale Transformers on language modeling (Appendix D.5).

In general, prior work uses 2 domains for comparable experiments. We use 4, representing many of the primary tasks of interest in the literature, in addition to specific settings that connect to prior literature (e.g. grokking). All other reviewers (vA12, NaDw, F51L) praise the extensiveness of the experiments. If there is a particular architecture/dataset/domain that you believe would behave differently in specific experiments, we would be happy to discuss, but the current results for Sections 4-6 are quite broad, and the fact that the behavior is common when varying both architecture and task strengthens the generality of our results. If it were only true on image classification, it would make for a substantially weaker argument even with a broader range of architectures/datasets. In addition, Section 3 is comparable to prior work on grokking [1, 2].

Comment

As the discussion period will soon be drawing to a close we wanted to bring our response to your attention again. Please let us know if there is anything unclear, and we look forward to discussion.

Review
Rating: 6

The paper surveys the spectral dynamics of the weights of deep neural networks in several settings, attempting to draw some general conclusions from the commonalities in the observed dynamics. In particular, the authors examine the singular values of weight matrices in the network, the alignment between singular vectors of consecutive layers, and the effect of weight decay on the dynamics of these quantities. They find that neural network training is biased towards low-rank solutions with only a few singular modes dominating information processing after training, in line with previous theoretical and empirical findings. This effect is shown to be more pronounced with higher weight decay. They also find that the alignment between layers is not quite as strong as prior theoretical work has suggested, and requires further study. The paper also reports interesting connections between spectral dynamics and deep learning phenomena such as grokking, generalization, memorization, lottery tickets, and linear mode connectivity.

Strengths

  • Spectral analysis of the weights is an informative approach that has yielded several theoretical insights in linear networks and informed empirical algorithms, particularly in recent approaches for low-rank adaptation. In the past, such analyses have been done in isolation on simple setups. So this paper's attempt to extend the scope of the insights into something more general is novel.
  • The comprehensive experiments touch upon a wide array of observed phenomena in deep learning and generally support the idea that the top singular subspace is more important for task performance.
  • Some of the findings were surprising, and call for further study. For instance, panel d in Figure 2 shows an increasing alignment that moves up the ranks over training, suggesting the evolution of the relevant functional low-rank subspace.
  • I also found the differences between random label learning and true label learning to be interesting. This nicely highlights how the task structure implicitly shapes the network over learning.

Weaknesses

I have some critiques concerning the writing, particularly the lack of any solid insights. I elaborate below.

  • The paper seems to be largely an exercise in plotting the singular value dynamics of several neural networks. While this is interesting in itself, I found that the paper did not extract any conclusive or striking insights from the empirical study, making it difficult to judge the significance of the work. For example, Figure 2 points out that grokking seems to be correlated with drops in the effective rank. This is interesting, but no concrete mechanistic explanation for this effect is attempted. A similar tone pervades the paper, with several disparate empirical reports but no substantive specific hypothesis or overarching message about the phenomena being observed. I understand that the paper is primarily empirical, but the paper would be much more compelling if the reader is given more details as to the significance of each finding or its implications for practitioners.
  • A few concrete insights are indeed pointed out, such as weight decay implicitly promoting a low-rank bias. I appreciated the arguments provided for developing an intuition for this effect (lines 410-420). While interesting, this was not investigated in sufficient depth to clear the bar for acceptance of the paper. Furthermore, the reasoning for looking into this effect (lines 193-197) seems somewhat vague.
  • The lack of a core message is also reflected in the writing. For instance, the abstract mentions a "consistent bias in optimization" (line 13) but doesn't specify what this bias is at all. Similarly, I found the introduction perhaps too long but unspecific. It briefly describes several prior works that could be better placed in the related work section. More importantly, I found it difficult to appreciate what is the main question that is being asked. The closest was the "aim to provide a common language for probing and understanding neural networks" (lines 144-145), but still, such a language had not been specified in detail by the end of the paper. I think the paper could be much improved by addressing just these few aspects and being more specific about the core message of the paper.
  • In general, I also found the treatment of prior work a little lacking in details. While I appreciate the comprehensive literature that was cited, I think the paper could benefit from closer tailoring of the references to a more specific core message.

The concerns I have raised can largely be addressed by being more specific about the significance of this work and some rewriting. I look forward to hearing more from the authors.

Questions

  • What could be the reason for not observing a strong alignment between layers as proposed in the existing theory? Why does the alignment appear and then go away? I found this surprising and would like to hear what the authors think.
  • The spectral dynamics are different between training with random vs true labels. But how could this insight be leveraged to distinguish memorization vs generalization between different networks trained on the same task? Would the idea be that the higher rank network is more memorizing?
Comment

We continue responding to the points raised in the initial review below.

...treatment of prior work lacking in details...

We appreciate this feedback but want to clarify. Reviewer F51L points out that the treatment of related work is "mostly comprehensive." We have tried to be quite specific about the differences from prior work when related to experiments (e.g. lines 162-165, 176-182, 262-269, 274-276, 308-312, 394-397, 416-420). If there are specific places that need our attention in these cases, we very much welcome the help revising them. Further, as our goal is to link many prior results together, there is no single benchmark or set of papers against which we can contrast our work; the goals are simply too different. The prior work we examine looks at each of these facets in an isolated context, deriving simplified models to study either empirically (e.g., Progress Measures) or theoretically (e.g., linear matrix factorization), and it is then unclear how those results extend to larger-scale systems. Work that establishes phenomena like LTH or LMC is interesting but also disconnected from the rest of the literature. We make this point in lines 143-146. To make it more precise, we propose adding the following to the end of the related work:

"Though these prior results are interesting in their own right, they exist in isolation. On the other hand, our contribution is to link these many prior works together through a single viewpoint: spectral dynamics."

...significance of this work..

Thank you for providing a path for addressing your concerns. We've tried to make these clear above, but would very much welcome specific suggestions for areas where the comparison to related work could be made more precise. Just to restate the significance: we provide a unifying view of many different phenomena in deep learning by investigating the evolution of the singular values and vectors of weights. As it turns out, such connections are broad, and we provide extensive empirical evidence for them.

Questions

...reason for not observing strong alignment...

We do see it in simple networks (Fig 2, Fig 7) but not in the more complex ones. In linear systems, this alignment is a proxy for the network's ability to pass signals from input to output, but once nonlinearities are introduced, this proxy looks much weaker. We would guess that there still is alignment in larger systems, but it needs to be viewed post-activation for particular examples. We propose adding a description of this to the conclusion as a direction for follow-up work, for example:

"...common underlying mechanism. One limitation of this work is the focus on linear alignment between neighboring layers. In linear systems, this alignment is a proxy for the network being able to pass signals from input to output, but once nonlinearities are introduced this proxy looks much weaker. We would guess that there still is alignment in larger systems, but it needs to be viewed post-activation for particular examples. This is a promising direction for the future."

...distinguish memorization vs. generalization...

Yes, as preliminary evidence for this, notice the before/after behavior in grokking in Fig 2. The task and labels are the same throughout the run, though certainly this is a very constrained setting. If the task is sufficiently simple (like 10-class image classification in a narrow domain), then it is reasonable to suspect such a simple solution would be performant. On the other hand, for language modeling, even if we have a simpler lower-rank solution, we don't always expect performance to be as strong (e.g., long-tail words may require some form of memorization, and there are lots of them). We propose to add the following to lines 429-477 to flesh this out:

"...minimal rank for a given task, for example language modeling may require some form of memorization for long-tail words and thus a low-rank solution may be incapable of this, while simpler tasks like image classification do not have this property. Still, we note the role..."

Comment

Thanks to the authors for their detailed rebuttal. This addressed most of my concerns by clarifying the contributions. A few final comments follow below.

"Regardless, our work aims to provide a unifying perspective for many disparate phenomena in deep learning by studying the spectral dynamics of weights (defined in lines 74-78)."

"Just to restate the significance: we provide a unifying view of many different phenomena in deep learning by investigating the evolution of the singular values and vectors of weights. As it turns out, such connections are broad, and we provide extensive empirical evidence for them."

I still am unsure about the specifics of the messaging here though. In its current form, the paper provides empirical evidence that spectral dynamics "reflect" several phenomena in deep learning and could be one useful metric to track in general for any setting. This is indeed a valuable finding. However, would the authors agree that there could surely be other interesting metrics that also reflect these phenomena, which may all arise due to yet another fundamental mechanism? We cannot resolve these questions without an actual theoretical explanation. In a similar vein, another reviewer raised the question asking whether there are phenomena that cannot be explained within this framework. In light of all this, I find the claim that it is a "unifying perspective" (lines 12, 52) to be too strong, specifically due to the lack of an overarching theory. In my opinion, moderating this claim and being more specific - for instance something along the lines of "we have used this set of quantities to verify or contradict several theoretical predictions from previous work" - would make the paper stronger.

That said, I am happy to raise my score to acceptance with the changes already proposed.

Comment

Thank you for the discussion. We're glad that our rebuttal was helpful in addressing your concerns, and we will certainly incorporate the changes to the writing discussed thus far. We respond to your lingering concern below.

...still unsure about the specifics of the messaging...

We concede the point that without a completely precise mathematical theory, it is difficult to attribute causality to something as generic as the evolution we see, and there might always be a further underlying mechanism as a result. We appreciate the need to make appropriate claims, and will certainly consider your alternate framing, but we want to point out that the goal of the work is not just to test existing theory, but to propose a viewpoint based on that theory and use it to draw connections between a broad array of phenomena that previously had no such common description. With respect to Reviewer F51L's question on what does not fit in this framework, our answer was to give the limitation that it does not track the effect of nonlinearities (which might yet reveal even more structure), but without a counterexample we are not certain which phenomena in deep learning are completely agnostic to the spectral dynamics and thus fall outside their scope.

For example, previously when posed the question as to why we can do convex model-averaging among finetuned checkpoints even though the optimization process is highly nonconvex, the answer was simply to shrug and marvel. In Section 6 we contribute experiments that tie model-averaging (LMC) to the dynamics in Section 4, which almost seems simple in hindsight. Though it is not completely precise, it is true that the same dynamics yield the connection, and this has additional value in pointing out fruitful areas for future work.

Perhaps the word "unify" is too strong in this case, and maybe it would be more appropriate to use your word "reflect," as in:

...perspective that reflects many different phenomena in a common light...

We will think on how to modulate this appropriately. We appreciate all your feedback thus far as it has certainly improved the paper.

Comment

As the discussion period will soon be drawing to a close we wanted to highlight our response again. Please let us know if there is anything unclear in our response, and we look forward to discussion on these points in the next few days.

Comment

Thank you for taking the time to review our work! We are glad you find our work in bridging gaps between setups novel. We also are glad to see you appreciate the comprehensiveness of our experiments, and that the findings were novel and interesting. We respond point by point to the concerns below.

Weaknesses

...largely an exercise in plotting the singular value dynamics... did not extract any conclusive or striking insights

We are a bit confused when contrasting this with the comment below, which says that there are indeed insights contributed by our work. Regardless, our work aims to provide a unifying perspective for many disparate phenomena in deep learning by studying the spectral dynamics of weights (defined in lines 74-78). We structured the order of sections to present grokking as a lede, given the striking nature of the phenomenon, and then arrive at the mechanism of rank collapse due to weight decay later in Section 5. This is described in lines 108-110, but we would be happy to add the following after lines 269-270 to make it clearer:

"We will show later in Section 5 that large amounts of weight decay have a strong rank-regularizing effect, so one way to understand grokking is that the network first memorizes, but the rank regularization of weight decay eventually pushes the model into generalizing."

The goal of Section 3 and of the comments on memorization in Section 6 is to make the point that generalization in simple settings can be viewed directly in the rank of the network's parameters. Section 4 establishes the nonuniform evolution of singular values, which leads to the understanding of weight decay's secondary effects in Section 5. With this nonuniform evolution, we bridge the gap to lottery tickets and linear mode connectivity (discussed in Section 6), which previously did not have as precise an explanation.
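For concreteness, the kind of measurement behind Sections 4-5 can be reproduced in miniature as follows (a toy sketch of our own, not the paper's training setup): train a small MLP with weight decay and periodically record the leading singular values of a weight matrix, along with how much its top singular vectors overlap with the previous snapshot.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-2)
X, y = torch.randn(256, 32), torch.randint(0, 10, (256,))

prev_U = None
for step in range(501):
    loss = nn.functional.cross_entropy(model(X), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 100 == 0:
        W = model[0].weight.detach()
        U, S, _ = torch.linalg.svd(W, full_matrices=False)
        # Overlap of the current top-5 left singular vectors with the previous snapshot.
        overlap = None if prev_U is None else (U[:, :5].T @ prev_U[:, :5]).abs().diagonal().mean().item()
        print(f"step {step}: top singular values {S[:3].tolist()}, top-vector overlap {overlap}")
        prev_U = U
```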

...a few concrete insights are indeed pointed out... not investigated in sufficient depth

We establish that the top singular vectors are most important for performance in Fig. 4 by truncating the weights. We also establish the rank-regularization effect (previously unseen in large networks) in Fig. 6. Both provide empirical support for the intuition developed in lines 410-420. Prior theoretical work with stronger conclusions relies on very strong structural assumptions like balancedness [1] (which we show is not valid), or neural collapse [2] (which is not true at the beginning of training). For realistic systems the analysis is difficult, but we provide broad experiments as a counterbalance to this. Is there something in particular that the reviewer would like to see in order to feel more firm about these conclusions?

To clarify the motivation for grokking: grokking is a striking phenomenon that presents a puzzlingly sudden transition from memorization to generalization and currently lacks a general explanation. We structure our work to expand from this striking phenomenon to more realistic systems, showing that what occurs in grokking is a rank-minimization bias driven by norm regularization, which can be extended to large-scale networks by simply matching the level of weight decay (Fig. 6). We would be happy to revise lines 193-197 to make this more clear, for example:

"Motivated by experimental results showing the importance of weight decay for grokking\cite{}, and theoretical work connecting low-rank weights, generalization and weight decay\cite{}, we evaluate the potential connection between parameter rank and grokking in neural networks. Low-rank weights would naturally complement other descriptions such as..."

...consistent bias in optimization but doesn't specify...

Given the large scope of our work, we chose to keep the abstract concise and expanded in the introduction. The specifics are explained in lines 74-78. We would be happy to revise the abstract to include this so in its new form:

"...Transformers. In particular this bias comprises three key ingredients: 1) singular values evolve unequally 2) as a result top singular vectors stabilize throughout training 3) though without alignment between neighboring layers required for previous results..."

This is quite general as our study is aimed at a high-level description, but the striking point is the consistency of this behavior regardless of the training data or architecture.

...introduction perhaps too long...

Thank you for this feedback. We believe this is a matter of taste (reviewer F51L found the introduction "insightful"). Given the large scope of our work, we believed it was necessary to build the context in a long introduction so as to describe the skeleton of the paper before diving into the results.


AC Meta-Review

The authors studied the evolution of singular values and vectors of weight matrices during training. In particular, the authors observed several interesting empirical phenomena of low-rank behaviour related to grokking, weight decay, and magnitude pruning.

I must admit the reviews and the scores left this paper very borderline, and the discussions did not seem to make further progress, so this is a very close call. Since a decision must be made, I have to be a bit more opinionated towards this paper.

I have mentioned this in my discussion with reviewers privately, but I am, broadly speaking, a bit skeptical of this approach to empirical research: making a lot of interesting observations, but not drawing useful conclusions. This is obviously subjective, but I did see some reviewers echo the same concern. I am genuinely not sure if these interesting observations can be put together in a useful manner to be built upon.

For example, in Figure 2, there is a lot of information. I'm pretty confused about which part of the plot is helpful for understanding what's happening. More importantly, as a researcher reading this and wanting to build on top of it, what are some useful take-aways here for the next project? Does this figure help us understand grokking or the low-rank behaviour of weights? These are genuine questions of mine, and I hope you don't take the tone as being harsh.

That being said, there are ways this paper can improve over the borderline for me. As reviewer vA12 has already mentioned, a more conclusive story developed from the observations will be helpful. If there is a helpful interpretation beyond "these singular value plots are interesting", I can see this work being built upon, and therefore would see this in a positive light.

However, at this point, given a decision must be made, I will recommend reject based on the discussion above and the borderline score.

Additional Comments on Reviewer Discussion

I think reviewer vA12's thread of discussion was the most helpful here. Once again, since the decision was very borderline, I looked for more opinionated discussions to help me make a decision. In this case, I feel that the authors were not able to address the same concern I had with reviewer vA12, and this was ultimately what helped push me over the decision boundary.

Final Decision

Reject