PaperHub
Rating: 6.3 / 10 (Poster; 4 reviewers; min 5, max 8, std 1.1)
Individual ratings: 5, 8, 6, 6
Confidence: 3.3 · Correctness: 2.8 · Contribution: 3.0 · Presentation: 2.3
ICLR 2025

IDInit: A Universal and Stable Initialization Method for Neural Network Training

Submitted: 2024-09-26 · Updated: 2025-04-02
TL;DR

An initialization method for fast and stable training of deep neural networks based on the identity matrix.

Abstract

Keywords
Initialization, Identity Matrix, Dynamical Isometry

Reviews and Discussion

Review
Rating: 5

This paper proposes a new initialization method for deep neural networks. It aims to initialize residual branches to have output close to zero, but keep an identity-like initialization in linear transformations on the residual except for the final one. They focus on the case where the feature dimension of the residual differs from that of the skip connection (typically wider), covering both convolutional and MLP style residuals. The authors analyze their initialization and experimentally validate its performance across various neural networks.

Recommendation: I lean towards rejection. The core idea is interesting and the results seem promising but I think the paper would benefit from a significant rewrite to improve the presentation clarity as well as better controlled comparison experiments.

Strengths

  • Problem tackled is practical and of interest to the community
  • The core idea is interesting
  • Authors claim decent performance improvements

Weaknesses

  • This paper is very hard to follow. It frequently relies on citing prior works without summarizing or explaining their results sufficiently. The setup is not explained well: it is often hard to tell whether there are non-linearities between the matrices, whether normalization layers are used, etc. The notation is not consistent throughout the paper. Finally, the paper is full of spelling and grammar mistakes as well as poorly phrased sentences.
  • Some of the experiments are not convincing e.g. in Section 4.1 and 4.2. In general the hyperparameter tuning might not provide a fair comparison between methods.

Questions

  • Why change the notation from Figure 1 to Equation 1? What do you mean when applying the activation function to the weights / transformation in Equation 1? This notation is not defined or explained.
  • Many aspects of Figure 2 are unclear. Does this network have activation functions? Why is the learning rate not tuned per method? Why is the rank issue so significant here despite the matrices being almost square? I find it hard to believe that a difference of 40 features out of 280 can cause such a large difference between the methods. Why use Adagrad and not a more standard optimizer like SGD or Adam?
  • Figure 3: I think this would be clearer if you used one symbol for the number of features on the skip path and another for the intermediate dimension. Here D_i is used twice to mean different dimensions.
  • Figure 4: I feel that when comparing momentum to no momentum the learning rate must be taken into account. Using momentum in SGD increases the effective learning rate so it is not surprising that the difference is larger.
  • Lines 314-318: Here you introduce the “loose” condition where you add random noise to your leading matrices. However, this is not mentioned again. Are the later experiments conducted with this change and does it make a difference in practice?
  • Section 4.1: I don’t find this convincing. The granularity of the sweep is too high and the best performance for the Kaiming initialization occurs at the edge of the sweep.
  • Section 4.2: I also don’t find this very convincing. In general the hyperparameters need to be tuned per method for a fair comparison. How do we know that your hyperparameters don’t favor your method?
  • Section 4.3: Here I feel you could explain the exact setup a bit better, e.g. what exactly do you do without IDIC?
Comment

Response to "This paper is very hard to follow...."

We appreciate the reviewer’s feedback and the time taken to evaluate our work. While we understand the reviewer's concerns, we believe the paper is generally well-structured and well-written. This perspective is supported by the feedback from the other three reviewers. However, we recognize the importance of addressing any potential clarity issues to ensure accessibility for a broader audience. Below, we provide specific responses to the points raised:

Why change the notation from Figure 1 to Equation 1? What do you mean when applying the activation function to the weights / transformation in Equation 1? This notation is not defined or explained.

We use the notation $W_1$ in Figure 1 as a convenient way to represent matrices, which helps illustrate the identity-control initialization in Figure 1 and the motivation presented in Figure 2. On the other hand, the notation $\theta^{(i,0)}$ in Equation 1 is central to defining our method and providing the derivations of the theorems and proofs.

Equation 1 serves as a general formulation for the residual network. The activation function $a(\cdot)$ is introduced as a placeholder to suggest its applicability in this context. It is intentionally left unspecified to keep the discussion general and not tied to a particular function, such as ReLU or others.

Many aspects of Figure 2 are unclear. Does this network have activation functions? Why is the learning rate not tuned per method? Why is the rank issue so significant here despite the matrices being almost square? I find it hard to believe that a difference of 40 features out of 280 can cause such a large difference between the methods. Why use Adagrad and not a more standard optimizer like SGD or Adam?

Thank you for your thoughtful comments. We address each point below to clarify and justify our experimental setup and the insights from Figure 2.

  1. Activation Functions and Learning Rate

The experiments in Figure 2 follow the setup described in [1] using their open-source code. As detailed in Appendix C.5 (referenced in the caption of Figure 2), we use the ReLU activation function. The learning rate of 0.01 and the Adagrad optimizer are also adopted from [1]. Adagrad adaptively adjusts the learning rate during training, which mitigates the need for manual tuning of a fixed learning rate. These choices ensure consistency and fairness in the experimental setup.

  2. Rank Constraint and Matrix Design

The focus of Figure 2 is not directly on rank constraints but rather on motivating the design of $W_1$ when $W_2$ is set to $\mathbf{0}$.

  • Figure 2(b): None of the initialization methods encounter rank issues here. However, Identity-1 can surpass Random and Hadamard by maintaining identity. We speculate the worse performance of Random and Hadamard is caused by initializing $W_1$ with random values or with partial values of a Hadamard matrix, as this is the only difference between them and Identity-1. To address this, we consider initializing $W_1$ as $I$ to preserve the inductive bias of Identity-1, which leads to improved results.

  • Figure 2(c): Here, we introduce rectangular $W_1$ and $W_2$ to explore cases where a true identity matrix cannot be applied (it requires square matrices). Among the methods, only "Partial Identity" encounters rank constraints due to the matrix dimensions, leading to slightly worse performance. This degradation, caused by the loss of fewer than 40 features (as noted by the reviewer), is evident when compared to Random and Hadamard. In contrast, the Default initialization performs the worst, as it does not align with the training dynamics. IDInit avoids the rank-constraint problem and effectively preserves the inductive bias of Identity-1, resulting in superior performance (a small sketch contrasting the two padding schemes is given at the end of this response).

  3. Choice of Adagrad

Adagrad, chosen in accordance with [1], dynamically adapts the learning rate based on gradient history, making it particularly suited for this experiment. While SGD and Adam are more common optimizers, the use of Adagrad ensures consistency with the original study and highlights the generality of our method.

  4. Motivation and Future Exploration

Figure 2 effectively motivates the design of IDInit by demonstrating how identity-based initialization maintains structural bias and avoids pitfalls like rank constraints. While the exact mechanism underlying the improvement remains an open question, these results strongly suggest the potential of identity initialization. We plan to include a discussion section in the manuscript to encourage further exploration of this phenomenon and inspire the community to investigate advanced initialization strategies.
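As referenced above, here is a minimal sketch contrasting the two padding schemes for a rectangular $W_1$. It follows our reading of the method (the identity pattern is repeated along the extra rows rather than zero-padded), uses the "Rectangle" shapes from Figure 2, and omits scaling factors such as $\tau$; it is an illustration, not the paper's implementation.

```python
import numpy as np

def zero_padded_identity(rows, cols):
    # "Partial Identity": an identity block on top, the extra rows left at zero.
    W = np.zeros((rows, cols))
    W[:cols, :cols] = np.eye(cols)
    return W

def padded_identity_like(rows, cols):
    # IDInit-style padding (our reading): repeat the identity pattern along
    # the extra rows instead of leaving them at zero.
    W = np.zeros((rows, cols))
    for r in range(rows):
        W[r, r % cols] = 1.0
    return W

d_in, d_hidden = 240, 280              # the "Rectangle" setting of Figure 2
x = np.random.randn(d_in)

y_partial = zero_padded_identity(d_hidden, d_in) @ x
y_idinit = padded_identity_like(d_hidden, d_in) @ x

print((y_partial == 0).sum())          # 40: the extra coordinates carry no signal
print((y_idinit == 0).sum())           # 0: every coordinate receives an input feature
```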

Comment

Response to "Some of the experiments are not convincing...."

Section 4.1: I don’t find this convincing. The granularity of the sweep is too high and the best performance for the Kaiming initialization occurs at the edge of the sweep.

Thank you for your comment. To address your concern, we have extended the hyperparameter sweep range in Appendix G.3 of the uploaded PDF. Specifically, we expanded the learning rate sweep from $10^{-3}$ to $10^{1}$ and the weight decay sweep from $10^{-8}$ to $10^{-1}$. This ensures that the best-performing hyperparameters are not located at the corners or edges of the grid.

As shown in Figure 22 of the uploaded PDF, IDInit consistently delivers strong performance across this expanded range of settings and achieves the best results among all initialization methods tested. This demonstrates the robustness of IDInit and supports its effectiveness under a fair and comprehensive comparison framework.

Section 4.2: I also don’t find this very convincing. In general the hyperparameters need to be tuned per method for a fair comparison. How do we know that your hyperparameters don’t favor your method?

Thank you for raising this concern. We show a grid search for ResNet-110, as detailed in Appendix G.4 of the uploaded PDF. Specifically, we explored the hyperparameters across a grid of learning rates $\{1, 0.2, 0.1\}$ and weight decay values $\{10^{-4}, 5 \times 10^{-4}, 10^{-3}\}$ on the baseline Kaiming initialization and the Fixup method. As shown in Figure 23, both Kaiming and Fixup achieve their best accuracy with a learning rate of 0.2 and weight decay of $5 \times 10^{-4}$. Notably, Fixup fails to train with a learning rate of 1. Based on this analysis, using a learning rate of 0.2 and weight decay of $5 \times 10^{-4}$ as the training hyperparameters in Section 4.2 is justified.
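For concreteness, a hypothetical sketch of such a per-method sweep; `train_and_eval` is a placeholder (not code from the paper) standing in for a full ResNet-110 training run, and it returns a dummy value only so the loop structure can be executed.

```python
import itertools
import random

learning_rates = [1.0, 0.2, 0.1]
weight_decays = [1e-4, 5e-4, 1e-3]
methods = ["kaiming", "fixup", "idinit"]

def train_and_eval(init, lr, wd):
    # Placeholder: substitute a real training run that returns test accuracy.
    return random.random()

# Every initialization is evaluated on the same learning-rate x weight-decay grid.
results = {(init, lr, wd): train_and_eval(init, lr, wd)
           for init in methods
           for lr, wd in itertools.product(learning_rates, weight_decays)}

# Best (lr, wd) cell per method, so no method is tied to another's tuned setting.
best_per_method = {init: max((k for k in results if k[0] == init), key=results.get)
                   for init in methods}
print(best_per_method)
```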

We hope this addresses your concern, and we welcome any further suggestions for improvement.

[1] Bachlechner, Thomas, et al. "Rezero is all you need: Fast convergence at large depth." UAI, 2021.

[2] Zhu, Chen, et al. "Gradinit: Learning to initialize neural networks for stable and efficient training." NeurIPS, 2021.

[3] Zhang, Hongyi, et al. "Fixup initialization: Residual learning without normalization." ICLR, 2019.

Comment

Figure 3: I think this would be clearer if you used one symbol for the number of features on the skip path and another for the intermediate dimension. Here D_i is used twice to mean different dimensions.

Thank you for pointing this out. To clarify the notation, we will use $D_h$ to represent the intermediate dimension (the middle dimension between two matrices) and retain $D_i$ to denote the number of features on the skip path. This adjustment will ensure consistency and eliminate any ambiguity in the figure. We will update Figure 3 accordingly in the revised manuscript.

Figure 4: I feel that when comparing momentum to no momentum the learning rate must be taken into account. Using momentum in SGD increases the effective learning rate so it is not surprising that the difference is larger.

The primary purpose of Section 3.1.1 and Figure 4 is not to highlight how quickly the differences grow, but to demonstrate that SGD effectively resolves the symmetry problem caused by identical initialization. Momentum further amplifies this effect and provides a supplementary benefit, as it is a common enhancement used with SGD.

We also analyze the influence of the learning rate in Appendix G.7 of the uploaded PDF. Specifically, when $x_1, x_2, y_1, y_2 \in \mathbb{R}^d$ are random vectors whose entries are i.i.d. Gaussian random variables with $N(0, \sigma^2)$, the magnitude of asymmetry can be bounded by $(6\eta^2+3)d^2\sigma^2$. This shows that a higher learning rate promotes greater asymmetry, further explaining the observed differences. However, a high learning rate can affect training stability. Therefore, when using a higher learning rate to reduce symmetry, it is crucial to select its magnitude carefully to maintain stability.

Lines 314-318: Here you introduce the “loose” condition where you add random noise to your leading matrices. However, this is not mentioned again. Are the later experiments conducted with this change and does it make a difference in practice?

Thank you for your observation. We have addressed the "Loose" condition by including additional experiments in the newly added RTable 1.

RTable 1. Analysis of components.

| | Setting 1 | Setting 2 | Setting 3 | Setting 4 | Setting 5 | Setting 6 | Setting 7 | Setting 8 |
|---|---|---|---|---|---|---|---|---|
| Loose | | | | $\checkmark$ | | $\checkmark$ | $\checkmark$ | $\checkmark$ |
| IDIC | | | $\checkmark$ | | $\checkmark$ | | $\checkmark$ | $\checkmark$ |
| IDIZ | | $\checkmark$ | | | $\checkmark$ | $\checkmark$ | | $\checkmark$ |
| Accuracy | $86.12_{\pm 0.52}$ | $92.68_{\pm 0.08}$ | $89.47_{\pm 0.24}$ | $87.01_{\pm 0.29}$ | $92.95_{\pm 0.21}$ | $92.9_{\pm 0.18}$ | $90.43_{\pm 0.14}$ | $93.22_{\pm 0.05}$ |

The extended analysis in RTable 1 shows that the "Loose" condition, along with the components 'IDIC' and 'IDIZ,' contributes independently to performance improvements. Furthermore, the combination of these components yields the most significant results. Across all comparison pairs—specifically, settings 1/4, 2/6, 3/7, and 5/8—the "Loose" condition consistently demonstrates performance improvements. This highlights its practical value and its role in enhancing the overall effectiveness of the initialization methods.

Section 4.3: Here I feel you could explain the exact setup a bit better, e.g. what exactly do you do without IDIC?

Thank you for your comment. In Section 4.3, IDIC refers to the use of the patch-maintain scheme (as described in Eq. (9)) instead of the channel-maintain scheme (outlined in Lines 325–327 of the original paper). This adjustment enables IDInit to better increase channel diversity during initialization, leading to a notable improvement of 3.42% in accuracy. We will revise the manuscript to provide a clearer explanation of the setup and explicitly contrast the configurations with and without IDIC to enhance clarity.

We hope these revisions will address the reviewer’s concerns while maintaining the strengths highlighted by the other reviewers. Thank you again for your valuable feedback.

Comment

Dear Reviewer 96yw,

We hope this message finds you well.

We have addressed the comments and concerns raised in the reviews to the best of our ability and are keen to receive any further feedback or clarification. With the decision deadline approaching, we would like to kindly check if there are any remaining questions or points you would like us to address.

We greatly appreciate the time and effort you and the other reviewers have dedicated to this process, and we are happy to provide any additional information if needed.

Thank you for your attention, and we look forward to hearing from you soon.

Best regards,
Authors of Submission #6753

Comment

Thank you for the response and revisions. I still think the presentation of the paper could be significantly clearer (a concern which I believe reviewer jXZf also shares to some extent). The new hyperparameter sweeps help somewhat, although the grid sweeps are still too coarse (increments of 10x in the learning rate and weight decay are too high). I still lean slightly towards reject overall, but will raise my score to 5 as the revisions have partially addressed my concerns. My main concerns remain the clarity of the manuscript and the soundness of the experimental comparison.

Comment

Thank you for your response and for raising the scores. We're glad our responses have addressed most of your concerns. We will make sure to incorporate all the discussion points into the revision to enhance the presentation clarity as suggested.

Review
Rating: 8

This manuscript studies the initialization methods for deep neural networks. Through an elaborated problem identification, the manuscript proposes several treatments, including (1) padding a new identity matrix in adjacency to an identity matrix, (2) using optimizers with momentum to tackle the convergence issue, (3) improving the universality of IDInit and the identity-control framework from the perspectives of higher-order weights and dead neurons. Extensive numerical results over various neural architectures and datasets demonstrate the effectiveness of the proposed strategy.

Strengths

  • The manuscript is well-structured and well-written. Most concepts are carefully explained and form a good story.
  • The numerical evaluations look extensive, considering several baselines and neural architectures on image classification, text classification, and LLM pre-training.

Weaknesses

  1. The manuscript may need to elaborate the discussion on the momentum part, e.g., by providing some theoretical justifications.
  2. Some design/hyper-parameter choices are not justified by some references, e.g., many parts in Sec 4 (line 434-line 439, line 449-line 454, line 464-line 474, line 504-line 509, line 517-line 523)

Questions

NA

Comment

The manuscript may need to elaborate the discussion on the momentum part, e.g., by providing some theoretical justifications.

Thank you for the suggestion. To provide further theoretical justification for the role of momentum, we conducted additional analysis, which is detailed in Appendix G.7 of the uploaded PDF. Specifically, when $x_1, x_2, y_1, y_2 \in \mathbb{R}^d$ are random vectors whose entries are i.i.d. Gaussian random variables with distribution $N(0, \sigma^2)$, the magnitude of asymmetry can be bounded by $(6\eta^2+3)d^2\sigma^2$.

Since momentum is accumulated from gradients, it inherently amplifies the asymmetry present in the gradients. Consequently, when momentum is incorporated into weight updates, it promotes an increasing magnitude of asymmetry over time. This amplification effect provides a theoretical explanation for the enhanced performance observed when momentum is used, as it helps break the symmetry more effectively during training.

We will update the manuscript to include this theoretical insight, offering a clearer understanding of momentum’s role in promoting asymmetry and its implications for improving training dynamics.
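As a rough illustration of this accumulation effect, the toy script below (our own construction, not the paper's derivation) feeds the same stream of mostly symmetric gradients with a small asymmetric component to plain SGD and to heavy-ball momentum, and compares the asymmetric component of the resulting weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d, eta, beta, steps = 50, 0.1, 0.9, 200

def asym(M):
    # Frobenius norm of the antisymmetric part of M.
    return np.linalg.norm(M - M.T)

W_plain = np.zeros((d, d))
W_mom = np.zeros((d, d))
buf = np.zeros((d, d))

for _ in range(steps):
    g = rng.normal(size=(d, d))
    g = 0.5 * (g + g.T) + 0.05 * (g - g.T)   # mostly symmetric gradient with a small asymmetric part
    W_plain -= eta * g                        # vanilla SGD update
    buf = beta * buf + g                      # heavy-ball momentum buffer
    W_mom -= eta * buf

print(asym(W_plain), asym(W_mom))             # the momentum run accumulates a much larger asymmetric component
```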

Some design/hyper-parameter choices are not justified by some references, e.g., many parts in Sec 4 (line 434-line 439, line 449-line 454, line 464-line 474, line 504-line 509, line 517-line 523)

Thank you for your thoughtful feedback. We appreciate your reminder and will carefully review and supplement the missing references in the identified sections of Section 4 in the next version of the manuscript. This will ensure that all design and hyperparameter choices are properly justified and supported by relevant literature.

We hope this addresses your concern, and we welcome any further suggestions for improvement.

Review
Rating: 6

In conclusion, this paper introduces Fully Identical Initialization (IDInit), a novel approach that leverages an identity-like matrix to effectively preserve identity across both main and sub-stem layers of residual networks. This method addresses the rank constraints in non-square weight matrices through a padded identity-like matrix and improves upon traditional identity matrix convergence issues using stochastic gradient descent. By processing higher-order weights and tackling dead neuron problems, IDInit enhances stability and performance, demonstrating improved convergence in various settings, including large-scale datasets and deep models.

Strengths

  • This paper introduces IDInit, a novel initialization method that utilizes an identity-like matrix to preserve identity across both the main and sub-stem layers of residual networks. IDInit accelerates the training process across diverse datasets and neural architectures.
  • An effective initialization method is crucial for the deep learning community.
  • The paper offers a clear presentation, including well-demonstrated experimental results and a robust technical solution.
  • Several claims are theoretically supported. Notably, the paper demonstrates how this method can address the rank constraint problem.

Weaknesses

  • In certain instances, IDInit remains suboptimal. For example, in Table 1, the final accuracy of a 110-layer neural network does not surpass the baselines.
  • The proposed IDInit has not been validated across diverse learning paradigms, such as self-supervised learning methods [1] and diffusion models [2].
  • The paper lacks theoretical analysis regarding the convergence rate of IDInit.

[1] Oquab, Maxime, et al. "Dinov2: Learning robust visual features without supervision." arXiv preprint arXiv:2304.07193 (2023).

[2] Ho, Jonathan, Ajay Jain, and Pieter Abbeel. "Denoising diffusion probabilistic models." Advances in neural information processing systems 33 (2020): 6840-6851.

Questions

  • Could you include additional experimental results involving self-supervised learning methods and diffusion models?
  • Could you provide further theoretical analysis on the convergence rate of IDInit?
Comment

suboptimal: In certain instances, IDInit remains suboptimal. For example, in Table 1, the final accuracy of a 110-layer neural network does not surpass the baselines.

Thank you for your comment. We demonstrate consistently superior performance in most scenarios compared to various baselines, as shown in Table 1 and Table 4. In Table 1, we also highlight the faster convergence speed, which is another advantage of IDInit. Furthermore, as illustrated in Figure 7 and Figure 12, IDInit exhibits robustness across a wider range of hyperparameters (e.g., learning rate and weight decay). Consequently, IDInit significantly outperforms baseline initialization methods.

More experiments on self-supervised learning methods and diffusion models: The proposed IDInit has not been validated across diverse learning paradigms, such as self-supervised learning methods [1] and diffusion models [2]. Could you include additional experimental results involving self-supervised learning methods and diffusion models?

Thank you for your suggestion. We conduct experiments on GPT-Base-MOE in Appendix G.5 and DiT-S/4 in Appendix G.6 of the uploaded PDF. Both experiments demonstrate that IDInit enables faster training compared to the default initialization.

Convergence rate: The paper lacks theoretical analysis regarding the convergence rate of IDInit. Could you provide further theoretical analysis on the convergence rate of IDInit?

We share an interest in this question; however, analyzing the convergence rate theoretically is challenging. To the best of our knowledge, no initialization methods currently provide a theoretical analysis of it. The convergence process of deep neural networks is iterative and influenced by numerous factors beyond the initialization method, including the network architecture, optimization algorithm, learning rate, batch size, and data distribution. Further exploration is needed in this area. We will include a discussion in the paper to encourage the research community to focus more attention on this important problem.

We hope this addresses your concern, and we welcome any further suggestions for improvement.

Comment

Thank you for the response. I still have concerns regarding the generalizability of this approach across diverse applications, particularly those involving state-of-the-art generative and representation models. While the authors have provided additional evidence, certain limitations persist. For instance, Fréchet Inception Distance (FID) is typically the primary metric for evaluating the quality of generative models, rather than test loss. Nonetheless, I am inclined to support the acceptance of this paper and will maintain my score.

Comment

We sincerely appreciate your thoughtful comment and support for the acceptance of our paper. Regarding the generalizability in diverse applications, we have included results for GPT-MOE and DiT in the updated manuscript. For GPT-MOE, perplexity (equivalent to test loss) is typically used to evaluate the quality of generative models. As FID is the standard metric for assessing DiT, we will incorporate FID into our evaluations to align with standard practices in the field. Thank you once again for your positive feedback, and we will continue refining the manuscript based on your valuable suggestions.

Review
Rating: 6

The authors propose an alternative method for neural network initialization that maintains the identity transition. The work uses the block identity-control initialization framework, and as such is composed of two blocks: one identity-preserving transformation (which, in the case of non-square transformations, repeats the identity along the sub-diagonals instead of zero padding), and one zero-preserving transformation. The work also discusses how this concept can be used for convolutional architectures, and how its setting may mitigate the dead-neurons problem in identity control.

Strengths

  • The method is a simple addition to the family of identity-control initializations, which is easy to implement and use. The obtained results are, in most studied cases, competitive with or better than other identity-control schemes.
  • The investigation into the ways and mitigating the dead neuron problem with the use of small perturbations in the IDInit method seems interesting and potentially valuable to the community.
  • While the experiments on various hyperparameters (Section 4.1) are limited in the scope of datasets and models examined, they are potentially insightful, showing that switching to IDInit initialization can lead to a broader range of hyperparameter configurations performing effectively.

Weaknesses

There are places in the paper in which I am not sure what the empirical evaluation serves to demonstrate, or whether the obtained results are significant:

  1. Firstly, I have concerns regarding Figure 2. I understand that it serves as a motivation plot for the later claims and I believe that comparing the impact of the initialization of matrix $W_1$ on the performance of various control-initialization setups could be a valuable study by itself. However, my issue is with the presentation of the results. It’s not clear a) What is the model used in this study? b) What is the dataset and task being evaluated? c) What is the type of loss measured and what does “Rectangle Loss” mean? Do the words “Square” and “Rectangle” in the description of the plots refer to the type of matrices used? Additionally, are the lines in Figures 2a and 2b averaged over multiple runs? If so, what is the standard deviation (this is also an opportunity to say something about the stability of those different methods)? Including this information would help demonstrate the stability of the different methods. Finally, where is “Random” in these plots? Is it obscured by another method? Since Figure 2 seems to be an important plot for establishing the importance of introducing the IDInit, I would wish for this plot to be very clear.

  2. Figures 8 and Table 3 are potentially valuable, as they involve the largest networks tested in the paper and might capture readers' interest. IDInit appears to outperform the default initialization in these cases, but a comparison with other identity-control methods would be useful to determine if the performance gain is due specifically to IDInit or identity-control in general. Although these experiments are resource-intensive, reporting the mean and standard deviation over multiple runs would help evaluate whether the performance differences are statistically significant.

Overall, the paper is well-written, but there are sections with vague claims or unclear intent:

  1. Lines 110–114 state, “By further proposing modifications to CNNs and solutions to dead neuron problems we have significantly improved accuracy by…”. It is unclear what task, network, or dataset this refers to, or what it is being compared to—state-of-the-art methods or alternative identity-control techniques? Since this is part of the main contributions section (end of Section 1), this ambiguity is distracting, particularly as similar wording appears in lines 115–116: “On ImageNet, IDInit can achieve almost all the best performance…”. Again, compared to what? I suggest being specific or moving such claims to the experimental results section, where precise differences in accuracy can be reported.
  2. The definition and understanding of dynamical isometry from the related work seems to be slightly inadequate. I.e. we say that a network attains dynamical isometry if all the singular values of the input-output Jacobian are equal to 1. This is not equivalent with operating at the criticality regime $X=1$. In fact, for certain initialization and activations (e.g. orthogonal with ReLU), it is possible to operate in the criticality regime, yet never obtain the dynamical isometry property (see Section 2.5.2 of Pennington 2017). I know it does not seem to be a huge issue, since the dynamical isometry is not the center of attention of this paper, but it raises doubts whenever dynamical isometry is mentioned in the paper.
  3. In Section 3.1.1's convergence analysis, the statement, “as $\eta$ is usually 1e-1 and both training pairs…” is unclear. It is not accurate to label any learning rate as “standard” or “usual,” which makes the comment about the “magnitude of the asymmetric component” vague. Instead, it would be more informative to discuss the constraints or boundaries for the $\eta$ parameter that ensure broken symmetry.

These issues, while not critical individually, collectively (together with some other issues -- see section “Questions”) make the paper challenging to follow.

Questions

  1. In Appendix A3, what is the upper index in $x_i^{(0)}$? Why in $\Pi_1$ in the derivative it changes to 3? Also, should not Theorem 3.1 also include some constraints on the batch size, since this will influence the rank of the gradients?
  2. In Figure 2, I am not sure if I understand what Identity-1 is? Is it simply initializing W1 to Identity (since the matrices for plot 2b are squared)?
  3. In Equation (3), how is $\tau$ determined for different activation functions? It does not seem to be specified in the paper.
  4. I liked the hyperparameter experiments from Section 4.1 and actually maybe even wished for this part to be more extended. If IDInit improves the stability over other initialization schemes, it would be nice to highlight this more (e.g. repeat Figure 7 on more datasets, by commenting on the standard deviation of the loss during the runs, etc.)

Minor:

  • How do the results in Figure 2 change with varying matrix sizes?
  • A suggestion: remove the word “Some” from the caption of Figure 8, as it diminishes the perceived significance of the experiment.
  • The term “residual stem” was initially unclear. Introducing and defining this term earlier in the paper might help readers understand it more quickly.
  • Lines 134–135: The phrase, “In this paradigm, the input-output Jacobian is calculated as…” is somewhat confusing. Equation 2 simply defines the input-output Jacobian, and its definition does not change based on whether it is viewed through the lens of dynamical isometry. I recommend removing the phrase “In this paradigm” for clarity.

Since currently the weaknesses outweigh the strengths of the paper, I am slightly leaning more towards rejection rather than acceptance.

Comment

How do the results in Figure 2 change with varying matrix sizes?

We conducted an experiment with a larger model size, as detailed in Appendix G.2 of the uploaded PDF. With the increased size, the model converges faster across all initialization methods. Among them, IDInit consistently achieves the fastest convergence and delivers the best performance.

A suggestion: remove the word “Some” from the caption of Figure 8, as it diminishes the perceived significance of the experiment.

We have removed the word "Some".

The term “residual stem” was initially unclear. Introducing and defining this term earlier in the paper might help readers understand it more quickly.

We have defined it as follows: "Consider an $L$-layer residual network, each residual block of which consists of a residual connection and a residual stem that refers to the component excluding the residual connection."

Lines 134–135: The phrase, “In this paradigm, the input-output Jacobian is calculated as…” is somewhat confusing. Equation 2 simply defines the input-output Jacobian, and its definition does not change based on whether it is viewed through the lens of dynamical isometry. I recommend removing the phrase “In this paradigm” for clarity.

Thanks for the valuable suggestion. We have revised this part as "Before introducing the mechanism, we first consider the input-output Jacobian which is defined as...."

We hope this addresses your concern, and we welcome any further suggestions for improvement.

[1] Bachlechner, Thomas, et al. "Rezero is all you need: Fast convergence at large depth." UAI, 2021.

[2] Zhu, Chen, et al. "Gradinit: Learning to initialize neural networks for stable and efficient training." NeurIPS, 2021.

[3] Zhang, Hongyi, et al. "Fixup initialization: Residual learning without normalization." ICLR, 2019.

[4] Zhao, Jiawei, et al. "Zero initialization: Initializing neural networks with only zeros and ones." TMLR, 2022.

[5] Bartlett, Peter, Dave Helmbold, and Philip Long. "Gradient descent with identity initialization efficiently learns positive definite linear transformations by deep residual networks." ICML, 2018.

[6] Hardt, Moritz, and Tengyu Ma. "Identity matters in deep learning." ICLR, 2017.

Comment

detail of Figure 2: Firstly, I have concerns regarding Figure 2...

Thank you for your valuable feedback. We construct the experiment following the methodology outlined in [1], utilizing their open-source code. The classification task is conducted on the Cifar10 dataset. The model employs a residual structure with 64 residual blocks, each comprising two matrices, $W_1$ and $W_2$, along with a classification layer at the top. Each block is represented as $Y = \text{ReLU}(W_2 \text{ReLU}(W_1 X)) + X$.

The term "Square" refers to W1R240×240W_1 \in \mathbb{R}^{240 \times 240} and W2R240×240W_2 \in \mathbb{R}^{240 \times 240}, while "Rectangle" denotes W1R280×240W_1 \in \mathbb{R}^{280 \times 240} and W2R240×280W_2 \in \mathbb{R}^{240 \times 280}, as described in the caption of Figure 2. Unlike "Square" matrices, "Rectangle" matrices cannot be initialized to an identity matrix since identity matrices are defined for square matrices. "Square Loss" and "Rectangle Loss" represent the losses observed during the experiments with "Square" and "Rectangle" matrices, respectively.

The lines in Figures 2(a) and 2(b) are averaged over three runs. To provide additional clarity, we include Figure 20 in Appendix G.1 of the uploaded PDF, which includes standard deviation.

"Random" refers to initializing W1W_1 using Xavier initialization and setting W2W_2 to 0\mathbf{0}, a strategy adopted in Fixup initialization. We use "Random" instead of "Fixup" to align with the "Random" matrix shown in Figure 2(a).

More experiments on ImageNet: Figures 8 and Table 3 are potentially valuable...

Thank you for your suggestion. To further compare with other identity-control methods, we conducted experiments on ResNet-50 using Fixup and ZerO. As shown in RTable 1, IDInit outperforms both Fixup and ZerO, demonstrating its superior performance on large-scale datasets.

RTable 1. Results of ResNet-50 on ImageNet.

| Init. | Epochs to 60% Acc | Accuracy |
|---|---|---|
| Default | 38 | 75.70 |
| Fixup | 24 | 75.83 |
| ZerO | 30 | 75.64 |
| IDInit | 24 | 76.72 |

Regarding multiple runs, we acknowledge that repeated experiments provide a more precise assessment of performance. However, large-scale experiments are computationally intensive, and the extensive amount of data involved minimizes the likelihood of significant fluctuations in the empirical results. As a result, prior studies on ImageNet ([1], [2], [3], and [4]) often rely on single runs for large-scale experiments.

For ImageNet, we performed five experiments, all of which demonstrated the superior performance of IDInit. Specifically, for ResNet-50, IDInit achieved an accuracy of 76.72%, surpassing the 76.2% reported in [2], which confirms the feasibility of our settings.

Therefore, we believe our validation on ImageNet and the tested models is both robust and reliable.

Imprecise claims: Lines 110–114 state...

We have revised the claims in the uploaded PDF to be more precise, e.g., "we have significantly improved accuracy by 3.42% and 5.89% respectively." -> "we have significantly improved accuracy of classifying Cifar10 on ResNet-20 by 3.42% and 5.89%, respectively.". Please check it.

doubts on dynamical isometry: The definition and understanding of dynamical isometry...

Thank you for the reminder—this is a very interesting observation. The networks discussed in Pennington (2017) are not residual networks, and for residual networks, which are the focus of this paper, the situation can differ significantly. Specifically, in residual networks, such as $Y=\text{ReLU}(W_2\text{ReLU}(W_1X))+X$, the non-linearity resides in the sub-stem of the residual block. As explored in [5] and [6], when the weights in the sub-stem are small, residual networks with ReLU can approximate a linear network. This allows the model to follow dynamical isometry, as shown in Figure 17 of the original paper, where IDInit produces most $\chi$ values close to 1.

However, the situation changes if ReLU is placed in the main stem, e.g., $Y=\text{ReLU}(\text{ReLU}(W_2W_1X+X))$. In this case, the network cannot maintain isometry, as noted in Pennington (2017). This suggests that placing ReLU in the main stem is not advisable.

We will include this discussion in a dedicated paragraph in the next version of the paper.
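To make this point concrete, a small numerical check (our own toy construction with Gaussian stem weights scaled by 0.01, not the paper's experiment): with ReLU confined to the residual stem, the input-output Jacobian has singular values close to 1, whereas wrapping the main stem in ReLU collapses roughly half of them.

```python
import torch

torch.manual_seed(0)
d = 64
W1 = 0.01 * torch.randn(d, d)   # small stem weights, as in the discussion above
W2 = 0.01 * torch.randn(d, d)

def stem_relu(x):
    # ReLU only inside the residual stem: Y = ReLU(W2 ReLU(W1 x)) + x
    return torch.relu(W2 @ torch.relu(W1 @ x)) + x

def main_relu(x):
    # ReLU on the main stem: Y = ReLU(ReLU(W2 W1 x + x))
    return torch.relu(torch.relu(W2 @ W1 @ x + x))

x = torch.randn(d)
for f in (stem_relu, main_relu):
    J = torch.autograd.functional.jacobian(f, x)
    s = torch.linalg.svdvals(J)
    print(f.__name__, s.min().item(), s.max().item())
# stem_relu: all singular values ~ 1; main_relu: about half of them drop to ~ 0.
```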

Comment

break symmetry: In Section 3.1.1's convergence analysis...

Thank you for your insightful comment. We extend the analysis of the magnitude of asymmetry in Appendix G.7 of the uploaded PDF. According to Eq. (4), the asymmetry in the gradient arises from:

$ \Omega = -\eta x_1x_1^Tx_2x_2^T + \eta y_1x_1^Tx_2x_2^T - y_2x_2^T. $

Here, we assume $x_1, x_2, y_1, y_2 \in \mathbb{R}^d$ are random vectors with entries that are i.i.d. Gaussian random variables, following $N(0, \sigma^2)$. The magnitude of asymmetry can be calculated as

$ &\mathbb{E}(||\Omega - \Omega^T||_F^2) \\\\ &= \mathbb{E}\{||[-\eta x_1x_1^Tx_2x_2^T+(\eta x_1x_1^Tx_2x_2^T)^T]+ [\eta y_1x_1^Tx_2x_2^T-(\eta y_1x_1^Tx_2x_2^T)^T] + [-y_2x_2^T+(y_2x_2^T)^T]||_F^2\}\\\\ &\text{According to Relaxed Triargle Inequality, there is} \\\\ &\leq 3\{\eta^2\mathbb{E}[||- x_1x_1^Tx_2x_2^T+ (x_1x_1^Tx_2x_2^T)^T||_F^2]+\eta^2\mathbb{E}[|| y_1x_1^Tx_2x_2^T- ( y_1x_1^Tx_2x_2^T)^T||_F^2]+\mathbb{E}[||-y_2x_2^T+(y_2x_2^T)^T||_F^2]\} \\\\ &\leq 3(\eta^2d^2\sigma^2+\eta^2d^2\sigma^2+d^2\sigma^2) \\\\ &= (6\eta^2+3)d^2\sigma^2 $

This shows that a higher learning rate promotes greater asymmetry, further explaining the observed differences. However, a high learning rate can affect training stability. Therefore, when using a higher learning rate to reduce symmetry, it is crucial to select its magnitude carefully to maintain stability.

Detail of Appendix A3: In Appendix A3, what is the upper index in $x_i^{(0)}$...

In Appendix A.3, $x_i^{(0)}$ is similar to $x^{(0)}$ in Eq. (1), representing the activations within the layers. Since the network consists of three layers, the final output is denoted as $x_i^{(3)}$, where $i$ indicates the $i$-th training batch. For the first batch, $x_1^{(3)}$ is used to calculate the gradient.

Regarding the batch size, it is set to $N = D_0$, ensuring that each sample in the batch is linearly independent. This guarantees that the rank of $\Pi_1$ and $\Pi_2$ in the gradient reaches $D_0$, a crucial condition for demonstrating Theorem 3.1.
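A toy numerical illustration of this point (our own construction, not the paper's $\Pi_1$ or $\Pi_2$): the rank of a Gram-like factor built from a batch is capped by the batch size, so at least $D_0$ linearly independent samples are needed to reach rank $D_0$.

```python
import numpy as np

rng = np.random.default_rng(0)
D0 = 32
for N in (8, 16, 32, 64):
    X = rng.normal(size=(D0, N))                 # N i.i.d. Gaussian samples of dimension D0
    print(N, np.linalg.matrix_rank(X @ X.T))     # prints 8, 16, 32, 32
```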

Identity-1: In Figure 2, I am not sure if I understand what Identity-1 is...

Identity-1 serves as a baseline consisting of residual blocks, each of which contains only a square matrix initialized to $\mathbf{0}$. This setup is equivalent to a non-residual layer with a square matrix initialized to $I$. The baseline is designed to demonstrate the properties of a precise identity maintenance scheme.

Specified $\tau$: In Equation (3), how is $\tau$...

We have already specified the setting in Lines 798-800 of the original paper, at the beginning of Appendix B: "In this paper, for ReLU-activated networks, $\tau$ is set to $\sqrt{2}$ for the first layer in a network and 1 for other $\text{IDIC}_{\tau}$ initializing layers, while for tanh-activated networks, all $\text{IDI}_{\tau}$ is set to 1, and $\varepsilon$ is 1e-6 for all $\text{IDIZ}_{\varepsilon}/\text{IDIZC}_{\varepsilon}$ initializing layers."

More extension of Section 4.1: I liked the hyperparameter experiments from Section 4.1...

We extend the search grid of Figure 7 in Appendix G.3 of the uploaded PDF. As shown in Figure 22 of the uploaded PDF, we scanned the learning rate from 1e-3 to 1e1 and weight decay from 1e-8 to 1e-1, ensuring that the best-performing hyperparameters are not at the corners or edges of the grid. As illustrated in Figure 22, IDInit consistently performs well across a wide range of settings and achieves the best performance among all the initialization methods tested.

We also add search of ResNet-110 in Appendix G.4. We perform a detailed hyperparameter analysis for ResNet-110, evaluating the learning rates {1, 2e-1, 1e-1} and weight decays {1e-4, 5e-4, 1e-3} on the standard baseline Kaiming and the more fragile Fixup method. As shown in Figure 23, both Kaiming and Fixup achieve optimal accuracy with a learning rate of 2e-1 and a weight decay of 5e-4. However, Fixup fails to train with a learning rate of 1. Consequently, selecting a learning rate of 2e-1 and a weight decay of 5e-4 as the training hyperparameters in Section 4.2 is justified.

Comment

Thank you for detailed answers. I appreciate the time and effort the authors undertook to address my concerns. However, I still have some concerns regarding the work. I divided them into “Major” (i.e. influencing my decision) and “Minor” (i.e. easy to fix, or just suggestions on the style of writing, does not have an influence on my decision but are more of an advice on how to improve the paper). Given the short amount of time remaining for the rebuttal the authors might like to focus on the “Major” issues.

Major:

Ad. Point 4) [doubts on dynamical isometry]

I still have problems with this point. First of all, yes, residual connections that maintain the identity along the main stem make it possible for the Residual Network to attain dynamical isometry for any type of activations. To the reference provided by the authors I would suggest adding [1], which investigates just this matter, providing theory. My problem is that the authors never properly define what is dynamical isometry, despite using this term. Again, achieving $X=1$ is not the definition of the dynamical isometry. Moreover, the discussion on how the residual connections influence that property (as provided by the authors in the rebuttal) should be included in the paper.

[1] Tarnowski, Wojciech, et al. "Dynamical isometry is achieved in residual networks in a universal way for any activation function." The 22nd International Conference on Artificial Intelligence and Statistics. PMLR, 2019.

Ad Point 5) [break symmetry]

Thank you, but I have now significant issues with the provided argumentations and equations. Basically, you are giving an upper bound on your asymmetry metric, $\mathbb{E}(||\Omega-\Omega^T||_{F}^2)\le (6\eta^2+3)d^2\sigma^2$,

and argue that by increasing $\eta$ the asymmetry will increase as well. But, since you provide only an upper bound, this does not necessarily have to be true. I.e. the upper bound can be arbitrarily large, but the exact value of the asymmetry metric might not change. To make the point, the authors should show a lower bound in terms of $\eta$ - something of the type $f(\eta) \le \mathbb{E}(||\Omega-\Omega^T||_{F}^2)$. Then indeed, with the increase of the learning rate the asymmetry metric will for sure increase as well. Otherwise the entire argumentation is baseless.

Question 3. - I am sorry, maybe I was not specific enough - Yes, I have seen that in the paper you give the values of $\tau$ for different activations (e.g. $\tau_{ReLU}=\sqrt{2}$). My question was not what is the value of $\tau$, but why is the value equal to this specific quantity? I.e. on what basis was that determined or calculated?

Minor:

Ad. Point 1) [Figure 2]. Thank you for addressing this, the description is better, but I still have some minor suggestions:

We construct the experiment following the methodology outlined in [1], utilizing their open-source code. [...]

Thank you, that clarifies the setup for me. I have seen appendix C.5 in which you add some of the information, but please, just add the entire paragraph you wrote to me to that description in the appendix. I.e. it is meaningful that you have used the code of ReZero for reproducibility.

"Random" refers to initializing W1W_1 using Xavier initialization and setting W2W_2 to 0\mathbf{0},

Thank you, maybe I was not precise. My question was not about “what is “Random”” but rather “where is “Random” in the plot? In the plots in Figure 2b and 2c the “Random” approach is supposed to be denoted using the yellow line, but this line is not visible in the plots. Hence my question. From Figure 20 I suspect that the results for Random overlay with Hadamard, but it would be nice if the authors add a comment about this in captions, otherwise it is confusing.

Ad Point 2) [experiments on ImageNet]

Thank you for the additional table with comparison, this looks interesting. If it is possible, I would also like to see those comparisons depicted on a plot as in Figure 8b (i.e. the learning curve). It is clear that IDInit gives the best final accuracy, but up to 60% it goes neck-to-neck with Fixup. So it has to surpass Fixup at some later epoch in the training, and it would be interesting to see what that epoch is.

Regarding multiple runs, [...]

I understand that many papers on ImageNet use a single seed and that ResNet models typically show small standard deviations (e.g., Table 2 of ReZero). While I’d be surprised if IDInit increased the standard deviation—since it aids signal propagation and might improve stability—I believe repeating experiments is good scientific practice. This ensures robustness against unexpected impacts or unusually favorable seeds. Though the authors may lack time for this during the rebuttal, I recommend reporting results from multiple runs if the paper is accepted to enhance reliability.

Comment

doubts on dynamical isometry: I still have problems with this point...

We sincerely thank the reviewer for their insightful comments and valuable suggestions. We acknowledge that achieving $X = 1$ is not the definition of dynamical isometry, and we will explain that it refers to the concentration of all singular values of the input-output Jacobian $J_{io}$ around 1. This clarification will be added to the section on related work.

Additionally, we appreciate the recommendation to include reference [1], which provides a theoretical perspective on this topic. We will incorporate this reference into the manuscript and revise the discussion to integrate the points made in the prior rebuttal about how residual connections influence dynamical isometry. We believe these revisions will significantly improve the clarity and rigor of our work.

Thank you again for your thoughtful feedback and guidance.

break symmetry: Thank you, but I have now significant issues with the provided argumentations and equations...

Thank you for your insightful comment. We appreciate your critique regarding the asymmetry metric and your suggestion to derive a lower bound to substantiate our argument. After revisiting our derivations, we discovered an error in the prior calculation of the upper bound, which has now been corrected. Additionally, we have derived the lower bound as follows:

Setup and Target. Let $x_1, x_2, y_1, y_2 \in \mathbb{R}^d$ be random vectors with entries i.i.d. $N(0,\sigma^2)$. Define $\Omega = -\eta x_1 x_1^T x_2 x_2^T + \eta y_1 x_1^T x_2 x_2^T - y_2 x_2^T$.

Our target is to compute $\mathbb{E}(\|\Omega - \Omega^T\|_F^2)$:

$ &\mathbb{E}(||\Omega - \Omega^T||_F^2) \\\\ &= \mathbb{E}\{||[-\eta x_1x_1^Tx_2x_2^T+(\eta x_1x_1^Tx_2x_2^T)^T]+ [\eta y_1x_1^Tx_2x_2^T-(\eta y_1x_1^Tx_2x_2^T)^T] + [-y_2x_2^T+(y_2x_2^T)^T]||_F^2\}. $

Lower Bound Derivation. Introducing the substitutions $u=y_1-x_1$ and $s=x_1^Tx_2=x_2^Tx_1$, we rewrite:

$ \mathbb{E}(||\Omega - \Omega^T||_F^2) = \mathbb{E}\{||\eta s(ux_2^T-x_2u^T)-(y_2x_2^T-x_2y_2^T)||_F^2\}, $

Let $w=\eta s u-y_2$; then:

$ \mathbb{E}(||\Omega - \Omega^T||_F^2) &= \mathbb{E}\{||wx_2^T-x_2w^T||_F^2\}, \\\\ &= \mathbb{E}\{2(||w||^2 ||x_2||^2 - (w^T x_2)^2)\}, \\\\ &= 2 \left( \mathbb{E}[||w||^2] \mathbb{E}[||x_2||^2] - \mathbb{E}[(w^T x_2)^2] \right), $

Expanding and computing expectations:

$ \mathbb{E}(||\Omega - \Omega^T||_F^2) &\geq 2 ((\eta^2(2d^2-d)\sigma^6+d\sigma^2)d\sigma^2 - \eta^2d^2\sigma^8), \\\\ &= 4\eta^2d^3\sigma^8-4\eta^2d^2\sigma^8+2d^2\sigma^4. $

Corrected Upper Bound Derivation. We derive the upper bound as

$ &\mathbb{E}(||\Omega - \Omega^T||_F^2) \\\\ &= \mathbb{E}\{||[-\eta x_1x_1^Tx_2x_2^T+(\eta x_1x_1^Tx_2x_2^T)^T]+ [\eta y_1x_1^Tx_2x_2^T-(\eta y_1x_1^Tx_2x_2^T)^T] + [-y_2x_2^T+(y_2x_2^T)^T]||_F^2\}\\\\ &\text{According to Relaxed Triangle Inequality, there is} \\\\ &\leq 3\{\eta^2\mathbb{E}[||- x_1x_1^Tx_2x_2^T+ (x_1x_1^Tx_2x_2^T)^T||_F^2]+\eta^2\mathbb{E}[|| y_1x_1^Tx_2x_2^T- ( y_1x_1^Tx_2x_2^T)^T||_F^2]+\mathbb{E}[||-y_2x_2^T+(y_2x_2^T)^T||_F^2]\} \\\\ & \leq 3(\eta^2d^3\sigma^8 + \eta^2d^3\sigma^8+d^2\sigma^4) \\\\ & = 6\eta^2d^3\sigma^8 + 3d^2\sigma^4 $

Empirical Validation. To further clarify, we conducted experiments with $d=100$, $\sigma=1$, and $\eta \in \{10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}\}$ to calculate $\mathbb{E}(\|\Omega - \Omega^T\|_F^2)$. For each $\eta$, we run 1000 trials to estimate the expectation. The results are summarized below:

RTable 2. Results of the magnitude of the asymmetry metric. "Target" means the result of $\mathbb{E}(\|\Omega - \Omega^T\|_F^2)$.

| $\eta$ | $10^{-2}$ | $10^{-1}$ | $10^{0}$ | $10^{1}$ | $10^{2}$ |
|---|---|---|---|---|---|
| Target | 2.04e4 | 6.10e4 | 4.01e6 | 3.98e8 | 3.97e10 |
| Lower Bound | 2.04e4 | 5.96e4 | 3.98e6 | 3.96e8 | 3.96e10 |
| Upper Bound | 3.06e4 | 9.00e4 | 6.03e6 | 6.00e8 | 6.00e10 |

The results show that the lower bound is tight, and since $d \in \mathbb{N}^+$, an increase in the learning rate $\eta$ directly leads to an increase in the asymmetry metric.
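A minimal Monte-Carlo sketch of the check described above (same $d$, $\sigma$, and number of draws; variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma, trials = 100, 1.0, 1000

def omega_asym_sq(eta):
    # One draw of ||Omega - Omega^T||_F^2 for Omega as defined above.
    x1, x2, y1, y2 = (sigma * rng.normal(size=d) for _ in range(4))
    O = (-eta * np.outer(x1, x1) @ np.outer(x2, x2)
         + eta * np.outer(y1, x1) @ np.outer(x2, x2)
         - np.outer(y2, x2))
    return np.linalg.norm(O - O.T, "fro") ** 2

for eta in (1e-2, 1e-1, 1e0, 1e1, 1e2):
    est = np.mean([omega_asym_sq(eta) for _ in range(trials)])
    print(f"eta={eta:g}  estimate={est:.3e}")
```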

We hope this can address the concerns regarding the impact of the learning rate $\eta$ on the magnitude of the asymmetry metric.

Comment

about $\tau$: I am sorry, maybe I was not specific enough - Yes, I have seen that in the paper you give the values of $\tau$ for different activations (e.g. $\tau_{ReLU}=\sqrt{2}$). My question was not what is the value of $\tau$, but why is the value equal to this specific quantity? I.e. on what basis was that determined or calculated?

Thank you for your clarification. We use $\tau$ to ensure that the variance of the input and output of an activation function remains consistent across layers. This prevents the variance from exponentially increasing or decreasing as the depth of the network grows, thereby maintaining stability during training. For saturated activation functions (e.g., sigmoid and tanh), this issue does not arise because these functions inherently constrain the variance, so we set $\tau_{sigmoid}=1$ and $\tau_{tanh}=1$.

For non-saturated activation functions like ReLU, careful adjustment of $\tau$ is required. Specifically, consider ReLU as an example, with input $x$ and output $y$ defined as $y=ReLU(\tau x)$. Using $\sigma^2(\cdot)$ to denote variance, we aim to maintain:

$$\mathbb{E}(\sigma^2(x))=\mathbb{E}(\sigma^2(y)).$$

Substituting the variance expression for $y$:

$ \mathbb{E}(\sigma^2(y))=\mathbb{E}(\sigma^2(ReLU(\tau x)))=\tau^2\mathbb{E}(\sigma^2(ReLU(x))). $

Equating the input and output variances:

$$\tau^2=\frac{\mathbb{E}(\sigma^2(x))}{\mathbb{E}(\sigma^2(ReLU(x)))}$$

For ReLU, it can be shown that $\mathbb{E}(\sigma^2(ReLU(x)))=\frac{1}{2}\mathbb{E}(\sigma^2(x))$, which leads to:

$$\tau=\sqrt{\frac{\mathbb{E}(\sigma^2(x))}{\mathbb{E}(\sigma^2(ReLU(x)))}}=\sqrt{2}$$

However, it is important to note that there is no universal method for determining $\tau$ for all activation functions. The value must be calculated on a case-by-case basis, depending on the activation function's properties. Despite this, the overarching goal remains the same: ensuring consistent variance between the input and output to maintain stability during training. This principle is also supported in [2]. We hope this explanation clarifies how the specific values of $\tau$ are determined.
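A quick numerical sanity check of the ReLU case (our own sketch; here $\sigma^2(\cdot)$ is read as the second moment of a zero-mean pre-activation, as in the Kaiming-style argument):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=1_000_000)                  # zero-mean pre-activation

print(np.mean(x**2))                                      # ~1.0
print(np.mean(np.maximum(x, 0.0)**2))                     # ~0.5: half the second moment survives ReLU
print(np.mean(np.maximum(np.sqrt(2.0) * x, 0.0)**2))      # ~1.0 again with tau = sqrt(2)
```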

We are deeply grateful to the reviewer for their insightful comments and thoughtful suggestions. Your feedback has been invaluable in helping us enhance the quality and clarity of our paper, and we sincerely appreciate the time and effort you dedicated to reviewing our work.

[1] Tarnowski, Wojciech, et al. "Dynamical isometry is achieved in residual networks in a universal way for any activation function." The 22nd International Conference on Artificial Intelligence and Statistics. PMLR, 2019.

[2] He, Kaiming, et al. "Delving deep into rectifiers: Surpassing human-level performance on imagenet classification." ICCV, 2015.

Comment

Thank you for addressing my issues. I ask the authors to add the above discussions to the paper (either in the main body or in the appendix). Especially, when discussing the asymmetry, instead of mentioning exact used learning rate values provide the discussion (or link to such in an Appendix) on how learning rate value influences the asymmetry metric.

In general, I still have issues with the presentation of the paper (even after the changes made by authors in the revised pdf) – i.e. the paper could benefit from another proofreading, and an overall pass to improve the style. But since this seems fixable, I trust the authors to update the paper accordingly to my (minor) style suggestion and with the above discussions, hence I slightly increase my score.

Comment

We sincerely appreciate your time and effort in reviewing our paper and providing valuable feedback, which is crucial for improving our work. We're pleased that our responses have addressed your concerns. We will ensure to incorporate all the discussed points into the revised manuscript and enhance the presentation style as suggested.

Comment

Dear Reviewer jXZf,

We hope this message finds you well. As the review period approaches its end, we would like to kindly follow up on your feedback regarding our rebuttal. We would greatly appreciate it if you could let us know whether your concerns have been adequately addressed. If there are any remaining issues or points requiring clarification, we would be happy to provide further details.

Thank you very much for your time and efforts in reviewing our submission.

Best regards,
Authors of Submission #6753

Comment

Dear Reviewers,

We sincerely appreciate the time and effort you've dedicated to reviewing our work, and we apologize for the delay in our response as we are conducting additional experiments. We understand your schedule is likely very busy, and we are deeply grateful for your valuable feedback. As there is still time before the end of the discussion phase, we would greatly appreciate the opportunity to engage in further dialogue with you. Our goal is to ensure that our responses adequately address your concerns and to determine if there are any additional questions or points you'd like to discuss.

We look forward to the opportunity for further discussion with you. Thank you for your thoughtful consideration.

Best regards,
Authors

AC Meta-Review

(a) Scientific Claims and Findings: The paper introduces IDInit, an initialization method that preserves identity mappings in both main and sub-stem layers of residual networks. By employing a padded identity-like matrix, IDInit addresses rank constraints in non-square weight matrices, aiming to enhance convergence speed, training stability, and overall performance.

(b) Strengths:

  • Improved Training Dynamics: IDInit demonstrates enhanced convergence rates and stability during training, as evidenced by empirical results.
  • Simplicity: The approach is straightforward to implement, making it accessible for practitioners.

(c) Weaknesses:

  • Limited Theoretical Analysis: The paper lacks a comprehensive theoretical explanation of why IDInit outperforms existing initialization methods.
  • Scope of Experimental Validation: While the method shows promise, the performance gains are limited and often insignificant, as is common for methods that focus purely on neural network initialization.

(d) Decision Rationale: The main contribution is that two challenges of identity initialization are addressed, i.e., a rank constraint issue, and, more importantly, a convergence problem, which was pointed out by Bartlett et al. (2019). Interestingly, the solution to the convergence problem entails a change of optimiser (that relies on momentum). The solution results from an asymmetry in the parameter update, which is facilitated by large learning rates in combination with momentum.

Additional Comments on Reviewer Discussion

The main changes and additional results provided during the rebuttal are as follows:

  • The authors added a theoretical justification of the momentum based on a toy example to the appendix.
  • The claimed induced asymmetry and its relation to the learning rate were empirically validated.
  • A comparison with more baselines (Fixup and ZerO) for a ResNet-50 on ImageNet was added.
  • Experiments on a GPT-Base-MOE in Appendix G.5 and DiT-S/4 in Appendix G.6 demonstrate the versatility of the approach.
  • Clarifications on the derivation of $\tau$ (a constant in the initialization) were added.
  • An ablation on the effect of the different components of the method (IDIZ and IDIC) was added.

Some concerns regarding the clarity of the manuscript and the soundness of the experimental comparison remain (see Reviewer 96yw). As I found the work sufficiently clear and understandable for publication at ICLR, I recommend acceptance. Additional experiments demonstrate the merit of the proposed method, and insights were provided into the mechanisms by which the proposed IDInit overcomes the two stated challenges.

Final Decision

Accept (Poster)