PaperHub
Average rating: 6.2 / 10 · Decision: Rejected · 5 reviewers (lowest 3, highest 8, standard deviation 1.8)
Ratings: 8, 6, 3, 8, 6
Confidence: 3.0 · Correctness: 3.0 · Contribution: 2.6 · Presentation: 3.2
ICLR 2025

Understanding Mode Connectivity via Parameter Space Symmetry

Submitted: 2024-09-28 · Updated: 2025-02-05

Keywords: symmetry, mode connectivity

Reviews and Discussion

Official Review
Rating: 8

This paper introduces a topological framework to understand mode connectivity in general, and linear mode connectivity in particular. Initially, the authors use topological structures to calculate the number of connected components at the 0-level set of the quadratic loss for invertible multi-layer perceptrons. For one-dimensional spaces, they demonstrate that residual connections reduce the number of connected components. Subsequently, they examine the influence of permutations on connectedness and provide a setup where connectedness fails. Finally, conditions are presented for achieving low-loss curves that connect modes.

Strengths

The paper proposes a very powerful framework to analyze properties of the loss surface of deep neural networks. In particular, understanding mode connectivity can shed light on the ways to optimize networks more efficiently. The paper generally provides clear and concise explanations of the proofs for its theorems and propositions.

Weaknesses

However, in its current form, the paper does not fully clarify the connection between topological concepts and the loss landscape of deep neural networks. While it opens with a detailed and precise introduction to topological concepts, it then directly applies these concepts to loss surfaces and networks without discussing necessary assumptions (such as the invertibility of all networks considered). Additionally, there is little exploration of how the topological concept of connected components relates to the depth of the network or which elements correspond to orbits or groups, potentially with examples. Although all this does not detract from the technical accomplishments, it may reduce the paper’s impact by making it difficult for an unprepared reader to link the framework to neural network applications.

In the final section, the concept of curvature is used without clearly defining it in this context. Further, this section could connect more directly to practical applications by demonstrating empirical curves that connect modes and align with derived formulas (e.g., regarding loss growth). A similar issue is in Section 5 concerning symmetries.

Minor Issues:

  • one of the first works on the algorithms for finding connectivity is [1], not [2]

  • the parameter space (Param) is referenced in Section 3.3 before being formally defined

  • please use \citep where appropriate (e.g., the last paragraph of Section 5.2)

[1] Singh, Sidak Pal, and Martin Jaggi. "Model fusion via optimal transport."

[2] Ainsworth, Samuel K., Jonathan Hayase, and Siddhartha Srinivasa. "Git re-basin: Merging models modulo permutation symmetries."

Questions

1 - The networks analyzed are invertible up to the output layer, meaning the output dimension matches the input dimension. How strictly is this condition required? Does switching to a one-dimensional output immediately yield negative results, as suggested in Section 5.2?

2 - According to [3], layer-wise mode connectivity is achievable. Does Proposition 5.3 contradict this result, or is there a possible connection?

3 - Is the proposition about residual connections restricted to 1 dimension?

[3] Adilova, Linara, Asja Fischer, and Martin Jaggi. "Layer-wise linear mode connectivity."

Comment

Response to questions

1 - The networks analyzed are invertible up to the output layer, meaning the output dimension matches the input dimension. How strictly is this condition required? Does switching to a one-dimensional output immediately yield negative results, as suggested in Section 5.2?

The invertibility condition is required when we want to establish a homeomorphism between the minimum and the symmetry group. When there is one, we can easily infer topological properties of the minimum from the symmetry group. When the network is not invertible, as in the example with skip connections, we are still able to analyze the connectedness of the minimum, but this requires more careful handling of multiple orbits. Switching to a one-dimensional output may change the number of connected components of the minimum, although the direction of change may depend on the exact loss function (Proposition A.8). There does not seem to be a connection between this change and the failure cases of linear mode connectivity in Section 5.2, which are primarily caused by the non-compact symmetry group.
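For intuition, here is a minimal two-layer linear sketch of the kind of homeomorphism being discussed (our illustration, not an excerpt from the paper): if the zero-loss condition reduces to $W_2 W_1 = A$ for a fixed invertible matrix $A$, then restricting to invertible factors gives

$$GL_n(\mathbb{R}) \;\xrightarrow{\;\cong\;}\; \{(W_1, W_2) \in GL_n(\mathbb{R})^2 : W_2 W_1 = A\}, \qquad W_1 \mapsto (W_1,\, A W_1^{-1}),$$

so in this toy case the minimum inherits the two connected components of $GL_n(\mathbb{R})$; once invertibility fails, this identification is no longer available.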

2 - According to [3], layer-wise mode connectivity is achievable. Does Proposition 5.3 contradict this result, or is there a possible connection?

[3] Adilova, Linara, Asja Fischer, and Martin Jaggi. "Layer-wise linear mode connectivity."

Proposition 5.3 does not contradict the layer-wise connectivity result in [3]. In the proof, we construct the two minima $W, W'$ by rescaling two layers. As a result, both layers are different between $W$ and $W'$. This is different from the setting in Theorem 4.1 in [3], where only one layer is different between the two sets of parameters. The empirical observation of the connectivity of certain groups of layers in [3] may reflect the implicit bias of SGD, which means it is possible that the minima reachable by SGD are approximately linearly connected, even though the complete set of minima may have a more complex structure. We appreciate the pointer to this relevant work and have added a brief discussion in the updated paper.
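To illustrate the construction described above, here is a minimal numpy sketch (ours, not code from the paper) for a two-layer linear network: rescaling the two layers by $\lambda$ and $1/\lambda$ yields another global minimum, while the loss at the midpoint of the linear interpolation between the two minima grows with $\lambda$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 200
X = rng.standard_normal((d, n))
W1 = rng.standard_normal((n, n))
W2 = rng.standard_normal((n, n))
Y = X @ W1.T @ W2.T  # targets realized exactly by (W1, W2), so it is a global minimum

def loss(A, B):
    # mean squared error of the two-layer linear network x -> B A x
    return np.mean((X @ A.T @ B.T - Y) ** 2)

for lam in [1.0, 10.0, 100.0]:
    W1p, W2p = lam * W1, W2 / lam               # rescaling symmetry: same function, also a minimum
    mid = loss((W1 + W1p) / 2, (W2 + W2p) / 2)  # midpoint of the linear interpolation
    print(f"lambda={lam:6.0f}  endpoints: {loss(W1, W2):.1e}, {loss(W1p, W2p):.1e}  midpoint: {mid:.1e}")
```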

3 - Is the proposition about residual connections restricted to 1 dimension?

Yes, as mentioned in the proposition ($n = 1$) and in the subsection header. When the weight matrices are higher-dimensional invertible matrices, the number of connected components is further reduced to 2. We are working on relaxing the invertibility condition and will include full proofs in the final version of the paper.

Comment

Thank you for the replies and additions. I raise my score to acceptance.

Comment

Thank you for your comments! We are encouraged that you consider our framework powerful for analyzing the loss surface. We address your comments and questions below.

However, in its current form, the paper does not fully clarify the connection between topological concepts and the loss landscape of deep neural networks. While it opens with a detailed and precise introduction to topological concepts, it then directly applies these concepts to loss surfaces and networks without discussing necessary assumptions (such as the invertibility of all networks considered). Additionally, there is little exploration of how the topological concept of connected components relates to the depth of the network or which elements correspond to orbits or groups, potentially with examples. Although all this does not detract from the technical accomplishments, it may reduce the paper’s impact by making it difficult for an unprepared reader to link the framework to neural network applications.

We appreciate your suggestions on clarifying the connection to topological concepts. We have attempted to make the assumptions clear by stating them before each proposition as well as in the subsection headers. Since the topological concepts from the preliminary section are mostly used in proofs, we did not reference them often in the main text. However, we have made an effort to explain the topological intuitions in the proof sketches and explanations of the theorems. We also hope the last two corollaries in Section 3 provide some correspondence, or at least a hint of a connection, between the topological concepts and elements of neural networks. We will expand Section 3 by including more connections to neural networks if space permits.

In the final section, the concept of curvature is used without clearly defining it in this context. Further, this section could connect more directly to practical applications by demonstrating empirical curves that connect modes and align with derived formulas (e.g., regarding loss growth). A similar issue is in Section 5 concerning symmetries.

Thank you for these suggestions. In the final section, we have added a formal definition of curvature, as well as experiments showing that the loss on the curves induced by approximate symmetry is consistently low, as predicted by Proposition 6.1 (Figure 3b,c). For Section 5, we added a visualization showing that the loss barrier on a linear interpolation between two minima in a homogeneous network can become unbounded, as predicted by Proposition 5.3 (Figure 4 in Appendix C).

Minor Issues.

Thank you for pointing these out. We have added the reference, corrected the notation for the input space of $L$ in Section 3.3, and replaced \citet with \citep where appropriate.

Official Review
Rating: 6

This paper introduces a method to determine the number of components achieving zero loss in linear regression by identifying a homeomorphism between the symmetry in parameter space and the set of parameters yielding zero loss. It demonstrates that permutations can link previously isolated components and offers a novel perspective on the effectiveness of residual connections.

Strengths

  1. The concept of employing a homeomorphism to connect the general linear group with the set of parameters yielding zero loss, and quantifying the loss basins by counting the connected components, is both innovative and intriguing.
  2. This paper offers a fresh rationale for the effectiveness of residual connections by examining the connected components within the minima of the loss function.
  3. In Section 5.1, the paper shows that permutations can link otherwise disconnected loss minima.

Weaknesses

  1. Lack of limitations: Since the homeomorphism is specifically tailored to linear regression models, it is necessary to clearly state this limitation in the introduction.

Questions

  1. Is it possible to experimentally validate the results of Sec. 6? Can we confirm that the valleys of the loss lines predicted by Eq. 7 correspond to the valleys in the two-dimensional heatmap of the loss landscape?
  2. Is it possible to extend the analysis to more realistic models such as ResNet, which has a softmax function, using parameter symmetry and homeomorphic mapping?
Comment

Thank you for your comments and positive feedback!

Lack of limitations: Since the homeomorphism is specifically tailored to linear regression models, it is necessary to clearly state this limitation in the introduction.

Thank you for the suggestion. We have made the limitations explicit in the introduction. However, we would like to point out that while our examples of a homeomorphism between the minimum and the symmetry group are limited to linear regression models, our framework can be applied to networks without such a homeomorphism and to different loss functions. For example, when the minimum comprises more than one orbit, we can still obtain the number of components by analyzing the connectedness of each orbit. Our method can also be generalized to loss functions other than the mean square loss.

Response to Questions

  1. Is it possible to experimentally validate the results of Sec. 6? Can we confirm that the valleys of the loss lines predicted by Eq. 7 correspond to the valleys in the two-dimensional heatmap of the loss landscape?

Yes, we have added experiments showing that the inequality in Proposition 6.1 holds empirically (Figure 3a), and the loss on the curves induced by approximate symmetry is consistently low as predicted by Proposition 6.1 (Figure 3b,c). Since these curves live in a high dimensional space, it is not straightforward to produce a two-dimensional heatmap. Nevertheless, we hope it suffices to show that the loss at every point of the curve is low compared to the loss on linear interpolations between two minima.

  2. Is it possible to extend the analysis to more realistic models such as ResNet, which has a softmax function, using parameter symmetry and homeomorphic mapping?

We believe it is possible to extend our results to a network with a softmax function. Softmax is known to have a translational symmetry, which means that points on the minimum can have different network outputs before applying the softmax while giving the same final output after the softmax. For each possible network output, the connectedness of the set of minima corresponding to that output can be analyzed by methods from our paper. The connectedness of the union of these sets, or the entire minimum, can then be obtained by analyzing the connectedness of the set of outputs that map to the same value after softmax. We will include a precise formulation and full proofs in the final version of the paper.
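The translation symmetry of softmax mentioned above is easy to verify numerically; a quick sketch (ours, not code from the paper): adding the same constant to every pre-softmax output leaves the final output unchanged.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # numerically stabilized softmax
    return e / e.sum()

z = np.array([1.0, -0.5, 2.0])
for c in [0.0, 3.7, -10.0]:
    # translating all pre-softmax outputs by c gives the same final output
    print(c, np.allclose(softmax(z), softmax(z + c)))
```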

Official Review
Rating: 3

This paper studies the space of minima of neural networks (mostly in the linear case) from the angle of group symmetries. A number of standard results in math are listed, and then a number of small results about the structure of minima are given, with some emphasis on the role of skip connections and on permutations. Each result highlights the role of group actions, and this is the main originality of the paper. Overall, I do not see any results that are particularly striking or insightful for understanding neural networks in practice. The derivations do not shed light on any phenomenon of interest, and the presence of group actions, when it is not already almost trivial, is not established. As a result, while the idea of looking at things from the angle of group actions is appealing and elegant, it does not bring convincing additional insight to study the hard question of the minimum landscape of nonlinear neural networks. I would encourage the authors to keep looking in this direction, but something striking would be needed for me to be excited.

Strengths

Rigorous, mathematically clean, reasonably clear article.

Weaknesses

The results are not very strong or interesting. Many standard basic results of math are presented like big theorems. Overall, this article does not make things simpler or clearer.

Questions

Is there any reason to believe that there is a natural group action like the ones you explain beyond the cases where it is intuitively clear that there is one?

Can simulations reveal the presence of group actions or the relevance of your derivations beyond the obvious cases?

Comment

Thank you for your feedback and comments. You are right that this paper is held together by group actions - our main message is precisely that we can infer topological properties of the minimum from topological properties of symmetry groups. This connection, while obvious in hindsight, has been overlooked by mode connectivity researchers for years. We hope that our insights will help future research on loss landscapes. We also hope that the idea of inferring properties of an unknown object from a known one could inspire new work beyond this field.

The results are not very strong or interesting. Many standard basic results of math are presented like big theorems. Overall, this article does not make things simpler or clearer.

We appreciate your perspective, although we believe our results are novel and, according to other reviewers, will be of interest to the field. It is not our intention to make our theorems look like big results - we value simplicity over complexity. Our goal is to introduce new intuitions behind why and when mode connectivity holds and would appreciate concrete suggestions on how to make the presentation clearer.

Is there any reason to believe that there is a natural group action like the ones you explain beyond the cases where it is intuitively clear that there is one?

Yes, parameter space symmetry is prevalent in common architectures, and there exist complicated and possibly data-dependent symmetry group actions [1]. The high-dimensional nature of the minimum [2] also suggests possible group actions with nontrivial orbits. The existence and number of symmetries in neural network architectures is an active area of research. Recent work has also found symmetry groups and actions with an automated framework [3].

Can simulations reveal the presence of group actions or the relevance of your derivations beyond the obvious cases?

It is not clear whether simulation could reveal the presence of group actions, but other approaches, such as the learning-based symmetry discovery method in [3], have shown that there exist non-obvious parameter symmetries. Our paper complements these works by providing an application for the discovered symmetries.

References:

[1] Zhao, Ganev, Walters, Yu, Dehmamy. Symmetries, flat minima, and the conserved quantities of gradient flow. arXiv preprint arXiv:2210.17216, 2022.

[2] Cooper. The loss landscape of overparameterized neural networks. arXiv preprint arXiv:1804.10200, 2018.

[3] Zhao, Dehmamy, Walters, Yu. Symmetry Discovery in Neural Network Parameter Spaces. UniReps 2024.

Comment

I have read the rebuttals and other reviews and thank everyone for their time. Unfortunately it doesn't change my view, which is that I'm not learning anything about neural networks from this paper that I find interesting.

Official Review
Rating: 8

The paper studies mode connectivity in neural networks modulo parameter space symmetries from the perspective of topology.

The authors begin by counting connected components: for the Euclidean distance as the loss function, the set of minima of a deep linear network with $l$ hidden layers and invertible weight matrices has $2^{l-1}$ connected components. This is followed by the observation that having skip connections similar to ResNets reduces the number of connected components.

For deep linear networks, the authors show that the number of connected components can be reduced further if permutations are taken into account.

The authors then use layer-wise scale symmetries in deep networks to show that linear mode connectivity does not hold in general; however, if one controls the weight norms of each layer, the error barrier incurred by linear interpolation within the same connected component can be controlled.

Finally, the authors introduce general symmetry-induced curves that parameterize the level set of the loss, and use the curvature of these curves to give a sufficient condition for when approximate linear connectivity holds.

Strengths

  • This is a well written paper, and the authors include sufficient background to make the paper readable for a non-expert in topology like myself, allowing one to follow all the results and make inferences.
  • Authors provide a novel analysis studying mode connectivity in deep linear networks.
  • Authors make a number of contributions studying necessary conditions for mode connectivity and approximate linear connectivity that can be used to understand symmetries in neural networks.

Weaknesses

  • I see that there is no dependence on the width of the network when considering the number of connected components; however, permutation symmetries grow exponentially with the width of the network. This appears to be due to the fact that we are studying a very simplified setting of linear networks. It would be useful for the readers to include a discussion on this.
  • It’ll be improve the paper further to improve an example / figure for section 5.1 where permutations leads to mode connectivity.- It is well known that scale symmetries lead to a failure of linear mode connectivity however it is interesting that controlling the weight norms and control over the curvature leads to approximate linear connectivity.
    • How does this relate to empirical solutions explored by SGD? Specially because it appears that weight decay is necessary for lmc mod permutations.

Questions

It is well known that scale symmetries lead to a failure of linear mode connectivity; however, it is interesting that controlling the weight norms and control over the curvature leads to approximate linear connectivity.

  • How does this relate to empirical solutions explored by SGD? Especially because it appears that weight decay is necessary for LMC modulo permutations.
Comment

Thank you for your comments and positive feedback!

I see that there is no dependence on the width of the network when considering the number of connected components; however, permutation symmetries grow exponentially with the width of the network. This appears to be due to the fact that we are studying a very simplified setting of linear networks. It would be useful for the readers to include a discussion on this.

The lack of dependence of the number of connected components on width is a result of the fact that the set of $n \times n$ invertible matrices ($GL_n(\mathbb{R})$) has two connected components, independent of $n$. Therefore, although wider networks have a larger symmetry group and a larger set of minima, the number of connected components remains unchanged. This is one example where connecting the minimum to symmetry groups brings out simple yet otherwise non-obvious results. We appreciate this question and have added a short discussion in Section 4.
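To make the width-independence concrete, here is a small illustrative count (our sketch, under the simplifying assumption suggested by this discussion that each connected component of the set of invertible factorizations is labeled by the signs of the factors' determinants, with their product fixed by the end-to-end map); note that the width $n$ never enters.

```python
from itertools import product
from math import prod

def count_components(num_factors, target_sign=1):
    """Count sign patterns (sgn det W_1, ..., sgn det W_k) whose product equals the
    fixed sign of det of the end-to-end map; the width n plays no role."""
    return sum(1 for signs in product((1, -1), repeat=num_factors)
               if prod(signs) == target_sign)

for k in range(1, 6):
    print(k, count_components(k))  # 1, 2, 4, 8, 16, i.e. 2^(k-1)
```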

It’ll improve the paper further to improve an example / figure for section 5.1 where permutations leads to mode connectivity.

Thank you for the suggestion. We have added an example in Appendix C.

It is well known that scale symmetries lead to a failure of linear mode connectivity; however, it is interesting that controlling the weight norms and control over the curvature leads to approximate linear connectivity.

How does this relate to empirical solutions explored by SGD? Especially because it appears that weight decay is necessary for LMC modulo permutations.

This is an interesting observation, and we have included a short discussion at the end of Section 5.2. The empirical observation of mode connectivity and linear mode connectivity is likely due to the fact that SGD typically only explores certain parts of the minimum, often referred to as implicit bias. Weight decay may further encourage SGD to favor certain minimizers. The subset of minima that is likely to be reached by SGD can therefore have a very different structure from the entire set of minima.

Official Review
Rating: 6

The authors investigate mode connectivity from a mathematical perspective. Several results are obtained: (1) The number of connected components of the set of minimizers is characterized for linear networks with and without skip connections, where adding skip connections reduces the number of components. (2) In the case of 2 layers, they show how permutations can map points to different components, thus “connecting” them and shedding light on recent empirical observations. (3) Next, the authors also show that linear mode connectivity does not hold in the case of ReLU networks, and that permuting the last two layers does not reduce the barrier either. (4) Finally, the authors characterize non-linear paths connecting such minima and obtain bounds on their curvature, which measures how far away one is from a “linear mode connectivity” regime.

Strengths

  1. The paper is well-organized and states all the mathematical results in the Preliminaries section, making this a largely self-contained work and thus easier to read and understand. Linear mode connectivity is still lacking a proper mathematical understanding to this day, making this submission thus a timely contribution.
  2. The authors manage to gain quite some insight into the problem with rather mathematically elementary tools, relying on topological properties and results from group theory. I appreciate the result showing that skip connections reduce the number of components, which is in line with what people observe in practice in terms of easier optimization.
  3. It is also quite nice that in case of two layers, the authors manage to show that permutations indeed connect the components back. While things in practice might be significantly more complicated than the setting considered in this work, I still believe this is a good first step towards obtaining a better understanding of this intriguing phenomenon.

Weaknesses

  1. What does it mean intuitively and geometrically if two network parameters are in the same connected component? As the authors point out, this does not imply being path-connected, so while it sounds convincing at first, it is actually not clear to me how this notion ties back to the usual geometric understanding of connectivity used in deep learning. I would appreciate if the authors could clarify things. I.e. how pathological are counter-examples of “connected but not path-connected”?
  2. One weakness in the “large barrier” type of results is their global nature, i.e. there is no notion of what types of minima SGD actually finds. The counter-example (as far as I understood) for which a path is constructed for Prop 5.3 and 5.4 starts from a set of parameters and then constructs a new one using the rescaling symmetry. It is not clear to me how “degenerate” these solutions are in the sense that SGD might never choose them due to its implicit bias. I believe there are actually results that show that SGD prefers certain parameters out of the re-scale orbit. In general I think it would be important to better highlight that this work deals with the loss landscape in a global sense, and is not restricted to the minimizers discovered by SGD.
  3. The word “connected” has several meanings in this work and I sometimes was confused which one is currently used in a given part of the text. E.g. when two points are in the same connected component (e.g. when mapping with permutations), this is not the same thing as when two points cannot be connected linearly etc. I feel like the manuscript could do a better job at distinguishing these things.

Questions

See above.

Comment

Thank you for your detailed comments and positive feedback!

  1. What does it mean intuitively and geometrically if two network parameters are in the same connected component? As the authors point out, this does not imply being path-connected, so while it sounds convincing at first, it is actually not clear to me how this notion ties back to the usual geometric understanding of connectivity used in deep learning. I would appreciate if the authors could clarify things. I.e. how pathological are counter-examples of “connected but not path-connected”?

Intuitively, imagine the minimum of a loss function as a manifold or a high dimensional surface. Then two network parameters are in the same connected component if they reside on the same piece of this manifold. Connectedness ensures there is no separation of the space into disjoint non-empty open subsets, while path-connectedness allows one to construct continuous paths between points.

While it is theoretically possible for two points in the same connected component to lack a path between them, such counterexamples are often specifically constructed and unlikely to be encountered in the context of deep learning. A classic example is the topologist’s sine curve $T = T_0 \cup T_+$, where $T_0 = \{(x, y) : x = 0 \text{ and } y \in [-1, 1]\}$ and $T_+ = \{(x, y) : x \in (0, 2/\pi] \text{ and } y = \sin(1/x)\}$. This space is connected but not path-connected, since the infinitely oscillating waves prevent any continuous path from linking $T_+$ to $T_0$.

  2. One weakness in the “large barrier” type of results is their global nature, i.e. there is no notion of what types of minima SGD actually finds. The counter-example (as far as I understood) for which a path is constructed for Prop 5.3 and 5.4 starts from a set of parameters and then constructs a new one using the rescaling symmetry. It is not clear to me how “degenerate” these solutions are in the sense that SGD might never choose them due to its implicit bias. I believe there are actually results that show that SGD prefers certain parameters out of the re-scale orbit. In general I think it would be important to better highlight that this work deals with the loss landscape in a global sense, and is not restricted to the minimizers discovered by SGD.

We agree that our work studies the entire set of minima instead of the ones discovered by SGD, and have added clarifications regarding this aspect in the introduction. We do not believe this is necessarily a weakness though. While SGD is known to explore only a small portion of the minimum, it is less clear whether or to what extent other optimizers behave in similar ways. Additionally, a characterization of the complete set of minima might be useful beyond the context of optimization, such as in studying model complexity. Hence, although our paper does not focus on minimizers discovered by SGD, the results could still be useful in understanding the loss landscape.

  3. The word “connected” has several meanings in this work and I sometimes was confused which one is currently used in a given part of the text. E.g. when two points are in the same connected component (e.g. when mapping with permutations), this is not the same thing as when two points cannot be connected linearly etc. I feel like the manuscript could do a better job at distinguishing these things.

“Connected” indeed has multiple possible meanings. To distinguish different definitions, we have included clarifications in the paper and checked for consistency of the use of this concept. In the first part of the paper (Section 3 and 4), connectedness assumes its mathematical definition given in Section 3. From Section 5 onwards, when discussing mode connectivity, we use the term “mode connectivity” when points can be connected by arbitrary curves and always specify “linear mode connectivity” when only linear interpolation is considered.

Comment

I thank the authors for engaging with my feedback and clarifying my concerns and questions!

Intuition for connectivity: Thank you for clarifying; I think it is maybe worth pointing out in the main text that for most practical purposes, connectivity can be thought of as path connectivity, while still noting that this is not always true in a strict mathematical sense. This allows for better intuition in my opinion.

My concerns are all addressed.

Comment

We agree that path connectedness is useful for developing intuitions for connectedness and have added this to Section 3. Thank you for the suggestion.

AC Meta-Review

This paper studies mode connectivity in neural networks taking a viewpoint from topology, and it provides a number of results on the structure of the loss landscape with particular emphasis on skip connections and permutations, highlighting the role of group actions. Specifically, the number of connected components at the 0-level set of the quadratic loss for invertible multi-layer perceptrons is computed. For one-dimensional spaces, it is shown that residual connections reduce the number of connected components. Next, the impact of permutations on connectedness is examined and an example where connectedness fails is provided. Finally, the authors characterize non-linear paths connecting minima and obtain bounds on their curvature.

All the reviewers agree that the topological perspective on mode connectivity is original, the results are new, and the paper is well written and accessible to a broad audience. These are all strengths of the paper. However, reviewer SRZz has raised an issue concerning the significance of the results. While simple results can in general be impactful, I concur with reviewer SRZz that the usefulness of the topological toolbox developed here towards understanding the loss landscape of neural networks remains unclear. More specifically, the first papers on mode connectivity date back to 2018; since then, there has been a lot of work in this direction, and the advantage of the methodology pursued by the authors w.r.t. this body of research is not evident.

In summary, the paper fails to provide a strong, novel, convincing insight regarding mode connectivity and, for this reason, I recommend rejection.

I still find the perspective taken by the paper interesting and I do recommend that the authors keep working in this direction and provide more convincing evidence of the effectiveness of their approach. One possible direction (mentioned by multiple reviewers) would be algorithmic, in terms of the characterization of the implicit bias of (S)GD.

Additional Comments from Reviewer Discussion

The main weakness raised by reviewer SRZz is rather fundamental and requires re-thinking the approach and the results. As such, it could not be addressed in the short rebuttal period.

Final Decision

Reject