PaperHub
Overall rating: 3.0 / 10 (withdrawn; 4 reviewers)
Individual ratings: 3, 3, 3, 3 (min 3, max 3, std. dev. 0.0)
Confidence: 4.0 · Correctness: 2.5 · Contribution: 1.5 · Presentation: 2.3
ICLR 2025

Optimization Insights into Deep Diagonal Linear Networks

Submitted: 2024-09-28 · Updated: 2024-11-28

Abstract

Keywords
Diagonal Linear Network · Overparameterization · Implicit Bias · Mirror Flow

Reviews and Discussion

Review (Rating: 3)

This paper attempts to characterize the implicit regularization of gradient flow (GF) for deep diagonal linear networks. In particular, the authors show that the GF dynamics induce a certain kind of mirror flow dynamics, but without an explicit form of the corresponding entropy function. In addition, they also study the convergence of the GF dynamics and establish how the convergence rate is affected by the initialization.
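
To fix notation (a minimal sketch for context; the exact parameterization and potential are not stated in this review and are assumptions here): gradient flow is run on the layer vectors $w_1, \dots, w_L$ of the diagonal network, and the claim is that the induced end-to-end predictor $\theta(t)$ follows a mirror flow

$$\frac{\mathrm{d}}{\mathrm{d}t}\, \nabla \phi\big(\theta(t)\big) \;=\; -\nabla \mathcal{L}\big(\theta(t)\big),$$

where $\mathcal{L}$ is the training loss and $\phi$ is a convex entropy (mirror) potential that is not given in closed form.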

Strengths

In general, this paper is well organized: the authors clearly state their motivation and contributions, and they carefully develop the notation, definitions, and theorems that support their claims. These efforts make the paper fairly straightforward to understand. In addition, the characterization of certain properties of the learning dynamics of deep diagonal linear networks might also be interesting, e.g., the second point of Proposition 1.

Weaknesses

Unfortunately, both the technical and theoretical contributions of this paper are rather limited, as I discuss below.

  1. The ultimate goal of this paper is to reveal the implicit regularization effect of GF for deep diagonal linear networks. However, the explicit form of the corresponding entropy function for the induced mirror flow dynamics is completely absent. There is not even a suggestion of what properties the entropy function should have.

    In addition, the derivation of the mirror flow form is a direct application of results in Li et al., 2022, and the first point of Proposition 1 can be obtained as a direct application of Euler's theorem for homogeneous functions (recalled after this list). Thus I think the technical contributions of this paper are rather limited.

  2. As a comparison, Yun et al., 2021 already explicitly characterized the implicit bias of deep diagonal linear networks by using the tensor network formulation developed in their paper.

    Specifically, they established the optimization problem, with an explicit form of the entropy function, that the GF dynamics of deep diagonal linear networks aims to solve (note that they did not require the parameterization $u^{\odot L} - v^{\odot L}$). They also established the convergence of the dynamics. The only possible weakness of their result is an additional requirement on the initialization, which the authors of this paper are able to relax, at the cost of losing the explicit form of the entropy function. But I cannot view such a relaxation as a theoretical contribution significant enough for this paper to be published in its current version.
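
For reference, Euler's theorem for homogeneous functions mentioned in the first point above is the standard fact (stated here for context, not taken from the paper under review): if $f$ is differentiable and positively homogeneous of degree $k$, i.e. $f(\lambda x) = \lambda^k f(x)$ for all $\lambda > 0$, then

$$\langle x, \nabla f(x) \rangle = k\, f(x).$$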

Reference

Yun et al., 2021. A Unifying View on Implicit Bias in Training Linear Neural Networks. ICLR 2021.

Questions

  1. What are the technical and theoretical contributions of this paper compared to Yun et al., 2021? For example, are results in this paper more general than those in Yun et al., 2021?
  2. Can you derive the explicit form of the entropy function of the induced mirror flow dynamics?
Review (Rating: 3)

This work examines deep diagonal linear networks and demonstrates that gradient flow induces a mirror flow dynamic within the model. Under a mild initialization assumption, applying gradient flow to the layers generates a parameter trajectory that satisfies a mirror flow. The convergence properties related to the original optimization problem are highlighted.

Strengths

The study brings a relatively fresh perspective on implicit bias.

Weaknesses

Questions

  • For experiments, have you tested networks with different numbers of layers?
  • Line 480: "The initial value of the remaining layers of the first network (in blue) is generated randomly". Any requirements for the random initialization?
Review (Rating: 3)

This paper shows that gradient flow on diagonal linear networks induces a mirror flow on the input-output model, and also shows that gradient flow converges exponentially given certain initializations.

Strengths

The presentation in this paper is good, derivations are clear, and theoretical results are well explained.

Weaknesses

This paper lacks novelty in the following ways:

  1. Theorem 1, showing that GF on $L$-layer diagonal linear networks induces a mirror flow, is, as the authors have acknowledged, an application of Li et al., 2022. So the contribution of this theorem is rather weak.

  2. Theorem 2 is not new. Min et al., 2023 (see their Section 4.2) have already shown exponential convergence of GF under the same condition $(\mathcal{A})$, with a better lower bound on the rate.

References:

Z. Li et al., Implicit Bias of Gradient Descent on Reparametrized Models: On Equivalence to Mirror Descent, NeurIPS 2022.

H. Min et al., On the Convergence of Gradient Flow on Multi-layer Linear Models, ICML 2023.

Questions

See "Weaknesses"

Review (Rating: 3)

The paper investigates the optimization dynamics of deep diagonal linear networks. Diagonal linear networks have garnered significant theoretical interest because they are easier to analyze yet reflect many empirical behaviors induced by various hyper-parameters when training deep networks. The paper studies deep diagonal networks in order to understand the impact of depth.

Strengths

  • The paper introduces a mild technical assumption $\mathcal{A}$ on the initialization, which holds almost surely for a random initialization. Under this assumption, the gradient flow for the parameterization of the deep diagonal network can be rewritten as a mirror flow with a convex potential on the linear predictor $\theta$, which is an interesting technical observation.
  • Under the same assumption, linear convergence is established for any loss function $L$ that satisfies the PL condition in $\theta$ (the standard PL argument is recalled after this list).
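
As background (a standard fact about gradient flow, stated here only for context; the paper's theorem concerns gradient flow on the layer parameters, with a rate depending on the initialization): if the loss satisfies the Polyak-Łojasiewicz (PL) inequality

$$\|\nabla L(\theta)\|^2 \;\ge\; 2\mu \big(L(\theta) - L^\star\big) \quad \text{for some } \mu > 0,$$

then along the gradient flow $\dot{\theta}(t) = -\nabla L(\theta(t))$ one has $\frac{\mathrm{d}}{\mathrm{d}t}\big(L(\theta(t)) - L^\star\big) = -\|\nabla L(\theta(t))\|^2 \le -2\mu \big(L(\theta(t)) - L^\star\big)$, and hence $L(\theta(t)) - L^\star \le e^{-2\mu t}\big(L(\theta(0)) - L^\star\big)$ by Grönwall's inequality.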

Weaknesses

  • The major weakness is that the mirror potential is not explicitly defined. Even without an explicit form, the limiting behavior in the regime of large depth or small initialization could have been discussed or analyzed; the absence of such an analysis is a major drawback.
  • The implicit bias and the optimization benefits/drawbacks of depth are not discussed, which weakens the motivation for studying deep diagonal linear networks.

Questions

  1. Can the authors compare with the convergence analysis and results for general deep linear networks in [1]? Given that convergence has been analyzed for deep linear networks under general initializations, the convergence established for diagonal networks seems insignificant.
  2. Is the condition $\mathcal{A}$ necessary for the existence of a convex potential?

[1] Bah et al., 2020. Learning deep linear neural networks: Riemannian gradient flows and convergence to global minimizers.

Withdrawal Notice

We thank the reviewers for their valuable comments and have chosen to take the time to revise the manuscript before a possible resubmission.