PaperHub
Overall rating: 5.5 / 10
Decision: Rejected · 4 reviewers
Ratings: 5, 3, 8, 6 (lowest 3, highest 8, std. dev. 1.8)
Confidence: 3.8
Correctness: 2.5
Contribution: 2.5
Presentation: 2.3
ICLR 2025

Equivariant Polynomial Functional Networks

OpenReview · PDF
Submitted: 2024-09-26 · Updated: 2025-02-05
TL;DR

We propose a novel monomial matrix group equivariant polynomial neural functional network (MAGEP-NFN) that enhances model expressivity while maintaining low memory consumption and running time.

Abstract

Keywords

neural functional network, equivariant model, polynomial layer, monomial matrix group

Reviews and Discussion

Review (Rating: 5)

This paper follows a recent line of work on neural functional networks (NFNs), i.e. neural networks that are specifically designed to process other neural networks. Specifically, the first NFNs are composed of linear layers (interleaved with non-linearities) that are equivariant to the permutation symmetries of the input NNs, while a recent follow-up called Monomial-NFN follows the same strategy but additionally accounts for the scaling symmetries. However, the latter turned out to suffer from limited expressivity, due to the extensive parameter sharing that resulted from the existing symmetries.

To overcome this limitation, the authors in the present manuscript propose a Polynomial-NFN, i.e. a combination of polynomial layers (potentially interleaved with non-linearities) that are equivariant to permutation + scaling symmetries. The paper mainly focuses on the characterisation of polynomial equivariant layers, using a construction named stable polynomials. Experimentally, the method is tested on a variety of common NFN benchmarks showing improved performance in the majority of the cases.

Strengths

  • Significance and Motivation. As in many equivariant machine learning tasks, designing models that are equivariant to parameter symmetries typically requires dealing with a computational-complexity vs. expressivity dilemma. The authors correctly identify this dilemma, promptly (since NFNs/metanetworks are becoming increasingly popular), and with their approach they attempt to find a better trade-off.
  • Novelty. This is the first work that proposes polynomials equivariant to parameter symmetries and may potentially mark a first step towards improving this paradigm as an alternative to linear layers + non-linearities, as well as GNNs.
  • Empirical performance. In practice, the improvement in expressivity is shown to be reflected in the downstream performance in most of the tested cases.

Weaknesses

  • Presentation and Exposition. My major concern with this paper is that the methodological/technical contributions are not well-presented. In particular,

    • The methodological part (Section 4) is quite hard to follow and the information provided is highly technical. This is not necessarily an issue, but currently, the authors do not provide any intuition/insights or simpler explanations of their thought process (why did they choose this proof technique, i.e. via stable polynomials), derivations (e.g. a proof sketch) and resulting equivariant/invariant layers (why do they look like that).
    • Therefore, I am afraid that this problem might be reflected in reproducibility/extensibility, as it will be hard for the interested reader to adopt/re-implement this method, and even harder to deeply understand the structure of the resulting layers.
    • I would recommend the authors invest more in explanations and intuition about each part of their methodological section, most importantly in explaining the terms involved in the equivariant/invariant layer equations.
  • Claims and Theory. Another major concern is that the authors have made a few claims that are not adequately supported with evidence:

    • First, they argue that their method improves expressivity compared to Monomial-NFN. Intuitively, probably this is indeed the case, but the authors have not provided any theoretical argument to support this claim.
    • Additionally, the authors mention that GNNs have increased memory consumption and runtime. As far as I can see, this claim is also not supported with, at least, experimental evidence. I suggest the authors report the above metrics and compare those of their method to GNNs (metanetworks), NFN, Monomial-NFN, and ScaleGMN (see below), to improve our understanding of the trade-offs. Additionally, a computational complexity table with comparisons would help a lot.
    • Finally, the theoretical contributions need some improvements. For example, it is unclear what is the role of Proposition 4.2. and Theorem 4.3. (why are they important?). Additionally, as far as I understand Theorem 4.4. characterises all G-invariant polynomials (by the way, of what degree?). Is there a similar argument for G-equivariant polynomials?
  • Method. Additionally, examining the method per se, even if it improves expressivity, I believe it inherits several of the limitations of Monomial-NFN:

    • Only certain types of activation functions can be used.
    • It is hard to extend to other activation functions.
    • It cannot operate on diverse architectures. I advise the authors to mention those clearly and discuss potential options to address them.
  • Experiments.

    • Apart from the memory/runtime comparisons that this paper is missing, I think that the authors should include another baseline: “Scale Equivariant Meta Networks”, Kalogeropoulos et al., NeurIPS’24. This is going to be a fairer competitor as opposed to Monomial-NFN and it would be interesting to observe the performance vs runtime tradeoffs.
    • I am wondering why in the editing experiment we don’t observe any performance improvement.

Questions

  • Due to the lack of insights in the methodological section, several questions arise. For instance,
    • What is the degree of the resulting polynomial? Shouldn’t this be chosen as a hyperparameter by the user?
    • I assume that for different activation functions (or more precisely for different symmetries induced by the activation function) the resulting layers should be different. However, this is not clear to me from the Equations 20 and 21. Could the authors explain what are the differences?
    • What are the matrices $\Phi$ and $\Psi$? They are kind of abruptly introduced.

Minor:

  • Related work and background: I have the impression that these two sections (2 and 3) are quite similar to those of the Monomial-NFN paper. In particular, the background section felt too long, which might not be necessary, since the reader can be pointed to Monomial-NFN or ScaleGMN for a detailed explanation of the background.
  • The paper misses an important citation: “Equivariant Polynomials for Graph Neural Networks”, Puny et al., ICML’23. This work discusses equivariant polynomials for permutation symmetries present in graphs.
Comment

Reply [2/2]

W6: Finally, the theoretical contributions need some improvements. For example, it is unclear what is the role of Proposition 4.2. and Theorem 4.3. (why are they important?). Additionally, as far as I understand Theorem 4.4. characterises all G-invariant polynomials (by the way, of what degree?). Is there a similar argument for G-equivariant polynomials? Q1. What is the degree of the resulting polynomial? Shouldn’t this be chosen as a hyperparameter by the user?

Answer to W6+Q1: Following the reviewers' suggestion, we have added more intuition and explanations to explicitly clarify the roles of Proposition 4.2 and Theorem 4.3. Specifically, Proposition 4.2 says that stable polynomial terms can be viewed as a generalization of the weights themselves, while Theorem 4.3 provides the reasoning behind the stability of these polynomials. Additionally, we have included another theorem from the appendix (Theorem 4.4 in the revised version) to further discuss the linear dependence of stable polynomial terms, offering the essential computational steps for calculating the equivariant and invariant layers through the weight-sharing mechanism.

It is important to note that Theorem 4.4 (renumbered as Theorem 4.5 in the revised version) does not characterize all $G$-invariant polynomials. Instead, it characterizes all $G$-invariant polynomials of the specific form presented in Equation (20) (now Equation (12) in the revised version), which is a linear combination of stable polynomial terms. The degree of these polynomials is $L$, the maximal degree of the stable polynomial terms. A similar theorem for $G$-equivariant polynomial layers is provided in Appendix C.5.
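To give readers of this thread a rough picture (schematic shorthand of ours, not the paper's exact notation in Equation (12)): an inter-layer term of type $[W]$ chains adjacent weight matrices between two layer indices $L \ge s > t \ge 0$,

$$[W]_{s,t} = W^{s} W^{s-1} \cdots W^{t+1},$$

and the invariant layer is, roughly, a learned linear combination $\sum_{L \ge s > t \ge 0} \langle \theta_{s,t}, [W]_{s,t} \rangle + \dots$ over such terms (plus bias-containing variants such as $[Wb]$ and $[bW]$), with the hypothetical coefficients $\theta_{s,t}$ tied together by weight sharing so that the whole expression is $G$-invariant. The longest chain has $L$ factors, which is where the maximal degree $L$ comes from.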

W7: Only certain types of activation functions can be used. W8: It is hard to extend to other activation functions W9: It cannot operate on diverse architectures. I advise the authors to mention those clearly and discuss potential options to address them.

Answer: MAGEP-NFN is designed to handle MLP or CNN architectures of any size. In general, our method is applicable to other architectures with different activation functions, provided that the symmetry group of the weight space is known. The idea is to use the weight-sharing mechanism to redetermine the stable polynomials and then calculate the constraints on the learnable parameters. Although the calculation may require adaptation, it remains feasible. These discussions have been added to the limitations section of the paper.

Q2: Apart from the memory/runtime comparisons that this paper is missing, I think that the authors should include another baseline: “Scale Equivariant Meta Networks”, Kalogeropoulos et al., NeurIPS’24. This is going to be a fairer competitor as opposed to Monomial-NFN and it would be interesting to observe the performance vs runtime tradeoffs

Answer: We have compared our model with GNN [2] and ScaleGMN [3] for predicting CNN generalization, using HNP [1] as a reference. GNN shows a performance drop on the separate activation subsets. While ScaleGMN greatly enhances performance on the Tanh subset, its improvements on the ReLU subset are less substantial. In contrast, our model achieves significant overall improvements on both subsets. We refer the reviewer to Table 15 in Appendix G of our revision for the results, and present them in Table 3 below as well for convenience.

Table 3: Performance comparison with Graph-based NFNs

|             | HNP [1] | GNN [2] | ScaleGMN [3] | MAGEP-NFN |
|-------------|---------|---------|--------------|-----------|
| ReLU subset | 0.926   | 0.897   | 0.928        | 0.933     |
| Tanh subset | 0.934   | 0.893   | 0.942        | 0.940     |

References

[1] Zhou et al. Permutation Equivariant Neural Functionals. NeurIPS 2023

[2] Kofinas et al. Graph Neural Networks for Learning Equivariant Representations of Neural Networks. ICLR 2024

[3] Kalogeropoulos et al. Scale Equivariant Graph Metanetworks. NeurIPS 2024

Q3: I am wondering why in the editing experiment we don’t observe any performance improvement.

Answer: In the INR editing task, the activation function used is the trigonometric sine function (sin), which inherently induces a sign-flipping group action $\mathcal{M}^{\pm 1}$. This group action is less diverse than the positive scaling group action $\mathcal{M}_n^{>0}$ induced by the ReLU activation function. As a result, our model has access to less inductive information, which could explain the lack of observable performance improvement over the baselines. Nonetheless, our model achieves performance comparable to all the baselines, consistently matching or outperforming them, and in particular shares the best results with the NP model.
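For context, both group actions can be recalled with standard identities (in schematic notation of ours, not the paper's): for an odd activation such as $\sin$ ($\sin(-x) = -\sin(x)$) and any diagonal sign matrix $D = \mathrm{diag}(\varepsilon_1,\dots,\varepsilon_n)$ with $\varepsilon_i \in \{\pm 1\}$,

$$W^{l+1} D\,\sigma\!\left(D W^{l} x + D b^{l}\right) = W^{l+1}\,\sigma\!\left(W^{l} x + b^{l}\right),$$

since $D^2 = I$, so among diagonal rescalings only sign flips preserve the network function. For ReLU, the analogous identity holds with $D = \mathrm{diag}(\lambda_1,\dots,\lambda_n)$, $\lambda_i > 0$, and $W^{l+1} D^{-1}$ on the outside (because $\mathrm{ReLU}(\lambda x) = \lambda\,\mathrm{ReLU}(x)$ for $\lambda > 0$), which yields the much larger positive-scaling group.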


We hope we have addressed your concerns about our work. We have also revised our manuscript according to your comments, and we would appreciate your further feedback at your earliest convenience.

Comment

Reply [1/2]

Thank you for your thoughtful review and valuable feedback. Below we address your concerns.

W1: The methodological part (Section 4) is quite hard to follow and the information provided is highly technical. [...]

Answer: Following the reviewers' suggestions, we have significantly revised Section 4 to include additional intuitions and insights, along with simpler explanations throughout the process of deriving invariant and equivariant layers. We believe that these updates have made the section more readable and informative.

W2: Therefore, I am afraid that this problem might be reflected in reproducibility/extensibility, as it will be hard for the interested reader to adopt/re-implement this method, and even harder to deeply understand the structure of the resulting layers.

Answer: We have included an Implementation section in Appendix F, featuring pseudocode designed for better clarity. Our MAGEP-NFN architecture incorporates several components expressed using concise and adaptable einsum operators. Moreover, the supplementary materials contain code for implementing MAGEP-NFN layers, along with comprehensive documentation to support accurate reproduction of the results.

W3: I would recommend the authors invest more in explanations and intuition about each part of their methodological section, most importantly in explaining the terms involved in the equivariant/invariant layer equations.

Answer: Following reviewers' suggestions, we have revised Section 4 by adding more explanations and intuition for every part of this section. We also explain the terms involved in the equivariant and invariant equations.

W4: First, they argue that their method improves expressivity compared to Monomial-NFN. Intuitively, probably this is indeed the case, but the authors have not provided any theoretical argument to support this claim.

Answer: We believe that the superior expressiveness of our equivariant polynomial layers compared to equivariant linear layers is a straightforward and self-evident fact. It is a similar situation to when we claim that polynomials are more general than linear functions. Consequently, we deemed it unnecessary to formalize this as a theorem. However, in response to the reviewers' suggestion, we have added a remark at the end of Section 4.2 to explicitly compare the equivariant and invariant polynomial layers proposed in our work with the equivariant and invariant linear layers presented in [4]. This remark emphasizes that our equivariant layers are more general and expressive than those in [4].

References

[4] Tran et al. Monomial Matrix Group Equivariant Neural Functional Networks. NeurIPS 2024.

W5: Additionally, the authors mention that GNNs have increased memory consumption and runtime. As far as I can see, this claim is also not supported with, at least, experimental evidence. I suggest the authors report the above metrics and compare those of their method to GNNs (metanetworks), NFN, Monomial-NFN, and ScaleGMN (see below), to improve our understanding of the trade-offs.

Answer: To compare computational and memory costs, we have included the runtime and memory consumption of our model and previous ones on the CNN generalization prediction task in the tables below. We have also included these results in Tables 16 and 17 in Appendix H of our revision. For graph-based architectures, we compare with two recent works: GNN [2] and ScaleGMN [3]. Our model runs significantly faster and uses much less memory than these graph-based networks and NP/HNP [1]. Introducing additional polynomial terms slightly increases our model's runtime and memory usage compared to Monomial-NFN [4]. However, this trade-off results in considerably enhanced expressivity, which is evident across many tasks such as predicting CNN generalization and INR classification.

Table 1: Runtime of models

|             | NP [1] | HNP [1] | GNN [2]  | ScaleGMN [3] | Monomial-NFN [4] | MAGEP-NFN |
|-------------|--------|---------|----------|--------------|------------------|-----------|
| Tanh subset | 35m34s | 29m37s  | 4h25m17s | 1h20m        | 18m23s           | 28m12s    |
| ReLU subset | 36m40s | 30m06s  | 4h27m29s | 1h20m        | 23m47s           | 28m43s    |

Table 2: Memory consumption

|             | NP [1] | HNP [1] | GNN [2] | ScaleGMN [3] | Monomial-NFN [4] | MAGEP-NFN |
|-------------|--------|---------|---------|--------------|------------------|-----------|
| Tanh subset | 838MB  | 856MB   | 6390MB  | 2918MB       | 582MB            | 584MB     |
| ReLU subset | 838MB  | 856MB   | 6390MB  | 2918MB       | 560MB            | 584MB     |

References

[1] Zhou et al. Permutation Equivariant Neural Functionals. NeurIPS 2023

[2] Kofinas et al. Graph Neural Networks for Learning Equivariant Representations of Neural Networks. ICLR 2024

[3] Kalogeropoulos et al. Scale Equivariant Graph Metanetworks. NeurIPS 2024

[4] Tran et al. Monomial Matrix Group Equivariant Neural Functional Networks. NeurIPS 2024.

Comment

We would like to thank the reviewer again for your thoughtful reviews and valuable feedback.

We would appreciate it if you could let us know if our responses have addressed your concerns and whether you still have any other questions about our rebuttal.

We would be happy to do any follow-up discussion or address any additional comments.

Comment

I thank the authors for their response. I acknowledge their effort to improve the exposition of their methodology. However, I believe there is still room for improvement, by providing deeper intuition and additional explanations, e.g. w.r.t. the reason for choosing stable polynomials and the limitations of this approach (as far as I understand, this construction limits the expressivity of the method, i.e. it does not include all invariant/equivariant polynomials), the degree of the polynomial (as discussed in their rebuttal), etc. Additionally, the issues concerning the background section (similar to Monomial NFN), and the fact that this method inherits Monomial NFN's limitations, persist. Additionally, if I am not mistaken, an important question remains unaddressed (regarding the differences that arise for different activation functions).

I will keep my score unchanged, mainly because I believe the paper would greatly benefit from an extra revision. However, I am not strongly opposed to acceptance.

Comment

Thank you for your response. Below, we address the two remaining concerns and provide clarification regarding the issues related to the background section (its similarity to Monomial NFN [2]).

  • As mentioned, we introduced a specific form of polynomials to derive the constraints on their coefficients needed to ensure the equivariance or invariance property. It is important to note that the proposed form is sufficiently general, making the determination of these constraints a non-trivial task. Identifying all equivariant polynomials is both non-trivial and often unnecessary for practical applications, as the method can only be implemented up to a certain finite degree.

  • Different activation functions induce distinct group actions on the weight spaces. Constructing a neural functional network that achieves equivariance or invariance under these varying conditions is inherently challenging. To the best of our knowledge, recent studies on neural functionals, such as [1] and [2], have primarily focused on individual activation functions rather than addressing all of them simultaneously.

Regarding the similarity in writing style between the introduction, related work, and background sections of our paper and those in [2]: while we have used different wording, we acknowledge that, as our work addresses the same problem as [2], the related works, definitions, terminologies, and formulas in the background sections inevitably overlap. These are fundamental to the problem and therefore remain consistent across both works.

To address your concern and reduce the perceived similarity, we have revised the writing style of these sections, removed some of the repeated content, particularly in the background, and cited [2]. This ensures that our paper remains distinct while still providing all necessary technical details for reproducibility and clarity.

References

[1] Ioannis Kalogeropoulos et al., Scale Equivariant Graph Metanetworks. NeurIPS 2024.

[2] Tran et al., Monomial Matrix Group Equivariant Neural Functional Networks. NeurIPS 2024.


We hope to have addressed your concerns regarding our work. As we have carefully resolved the two issues raised and revised our manuscript, we kindly ask the reviewer to consider revising our score.

Review (Rating: 3)

This work proposes a novel Neural Functional Network (NFN), i.e. a neural network that takes the weights of neural networks as input data. The method, termed MAGEP-NFN, belongs to the family of parameter-sharing NFNs and incorporates both the permutation symmetries of neural networks and the scaling symmetries of common non-linearities, such as ReLU and tanh. The proposed method alleviates the reduced expressivity of previous works by introducing non-linear equivariant layers through polynomials of the input weights and biases. MAGEP-NFN achieves competitive results in INR tasks and CNN generalization tasks, while having far fewer parameters than the compared baselines.

Strengths

The manuscript is fairly clear and easy to follow, despite being math heavy.

NFNs are a very promising research direction, and making them faster and more efficient is key to scaling them up to more real-world applications. Hence, the fact that the proposed method can reach good performance with a reduced number of parameters compared to previous works is very promising, and a significant step forward.

Weaknesses

The numbers reported for NFN in the INR classification experiments are considerably worse than the official ones. Namely, NFN scores $46.6 \pm 0.072$ on CIFAR-10, $92.9 \pm 0.218$ on MNIST, and $75.6 \pm 1.07$ on FashionMNIST. This also means that the performance of MAGEP-NFN is quite poor. According to Appendix E, the authors use the same datasets as Zhou et al. Is there a different setup used in the experiments?

Almost no explanation is given for equations 20 and 21, which comprise the core functionality of the proposed method. These equations are very math-heavy, and it is very hard for the reader to understand the usefulness of all individual components. Further explanation of each module could provide more intuition about the method. Further, both formulas could benefit from some visualizations that show the connections between layers and the flow of information. Finally, it is unclear if all the individual components contribute to the effectiveness of the method. An ablation study on those components could make the method more efficient, without sacrificing accuracy.

Questions

[1] What does the outer sum $\sum_{L\geq s>t\geq 0}$ in equation 20 denote? Is it a double sum over $s$ and $t$?

[2] Related to question 1, in equations 20-21, is it necessary to sum over the whole range of $s$ and $t$? This seems like a very inefficient operation for very deep networks (e.g. ResNet-101). Isn't it possible to sum over a few adjacent layer indices? An ablation study on the number of hops would be useful here.

[3] What is the computational complexity of the method compared to having linear layers?

[4] Can the method work in a combined Small CNN Zoo dataset with both ReLU and tanh activations (and potentially other similar groups)?

Minor questions:

[5] In Eq. 15, shouldn't it be $\pi_{i}(j)$ instead of $\pi_{i}^{-1}(j)$?

Details of Ethics Concerns

The manuscript is, in extensive parts of the first three sections, almost identical to the work of Tran et al. [1]. This is especially true in the first 5 pages (until the end of section 3, line 281). Oftentimes, whole paragraphs bear striking similarity, and only a few words differ between the two paragraphs.

More specifically:

  • In the introduction: lines 36-59.
  • In the related work section: lines 104-127
  • In section 3: Equations 3-10, 14, 15. Lines 216-221. Proposition 3.3.

[1] Tran et al. Monomial Matrix Group Equivariant Neural Functional Networks. NeurIPS 2024.

Comment

Reply [2/2]

Q2: Related to question 1, in equations 20-21, is it necessary to sum over the whole range of $s$ and $t$? This seems like a very inefficient operation for very deep networks (e.g. ResNet-101). Isn't it possible to sum over a few adjacent layer indices? An ablation study on the number of hops would be useful here.

Answer: Yes, it is necessary to sum over $s$ and $t$. It should be noted that many terms in Equation (21), which is Equation (15) in the revised version, vanish after applying the weight-sharing process. Ultimately, the additional polynomial terms in the final equivariant and invariant layers are not excessive. To validate this observation, we have included a comparison of the computational and memory costs of our models with those of linear methods (and graph-based approaches as well) in the response to your next question. Additionally, an ablation study examining the necessity of these components in the equivariant and invariant layers has been provided in the response to your earlier concern.

Q3: What is the computational complexity of the method compared to having linear layers?

Answer: To compare computational and memory costs, we have included the runtime and memory consumption of our model and previous ones on the CNN generalization prediction task in the tables below. We have also included these results in Tables 16 and 17 in Appendix H of our revision. For graph-based architectures, we compare with two recent works: GNN [4] and ScaleGMN [5]. Our model runs significantly faster and uses much less memory than these graph-based networks and NP/HNP [2]. Introducing additional polynomial terms slightly increases our model's runtime and memory usage compared to Monomial-NFN [3]. However, this trade-off results in considerably enhanced expressivity, which is evident across many tasks such as predicting CNN generalization and INR classification.

Table 2: Runtime of models

|             | NP [2] | HNP [2] | GNN [4]  | ScaleGMN [5] | Monomial-NFN [3] | MAGEP-NFN |
|-------------|--------|---------|----------|--------------|------------------|-----------|
| Tanh subset | 35m34s | 29m37s  | 4h25m17s | 1h20m        | 18m23s           | 28m12s    |
| ReLU subset | 36m40s | 30m06s  | 4h27m29s | 1h20m        | 23m47s           | 28m43s    |

Table 3: Memory consumption

|             | NP [2] | HNP [2] | GNN [4] | ScaleGMN [5] | Monomial-NFN [3] | MAGEP-NFN |
|-------------|--------|---------|---------|--------------|------------------|-----------|
| Tanh subset | 838MB  | 856MB   | 6390MB  | 2918MB       | 582MB            | 584MB     |
| ReLU subset | 838MB  | 856MB   | 6390MB  | 2918MB       | 560MB            | 584MB     |

References

[2] Zhou et al. Permutation Equivariant Neural Functionals. NeurIPS 2023.

[3] Tran et al. Monomial Matrix Group Equivariant Neural Functional Networks. NeurIPS 2024.

[4] Kofinas et al. Graph Neural Networks for Learning Equivariant Representations of Neural Networks. ICLR 2024.

[5] Kalogeropoulos et al. Scale Equivariant Graph Metanetworks. NeurIPS 2024.

Q4: Can the method work in a combined Small CNN Zoo dataset with both ReLU and tanh activations (and potentially other similar groups)?

Answer: No, our method is specifically designed to work with architectures that share the same weight space and symmetry group. CNNs with ReLU activations and CNNs with tanh activations have different symmetry groups, making them incompatible within the same framework.

Q5: In Eq. 15, shouldn't it be $\pi_{i}(j)$ instead of $\pi^{-1}_{i}(j)$?

Answer: We believe that it should be $\pi^{-1}_{i}(j)$.
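A brief note on why the inverse typically appears, assuming the standard convention (our illustration, not taken from the paper) that a permutation $\pi_i$ of the neurons in layer $i$ acts on $W^i$ by permuting its rows via the permutation matrix $P_{\pi_i}$ with $P_{\pi_i} e_k = e_{\pi_i(k)}$:

$$\big(P_{\pi_i} W^{i}\big)_{jk} = \sum_{m} (P_{\pi_i})_{jm}\, W^{i}_{mk} = W^{i}_{\pi_i^{-1}(j),\,k},$$

i.e. the entry that lands in row $j$ of the transformed matrix is the one originally stored in row $\pi_i^{-1}(j)$, which is consistent with writing Eq. (15) in terms of $\pi_i^{-1}(j)$.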


We hope we have addressed your concerns about our work. We have also revised our manuscript according to your comments, and we would appreciate your further feedback at your earliest convenience.

Comment

Reply [1/2]

Thank you for your thoughtful review and valuable feedback. Below we address your concerns.


Regarding the Ethics Concerns: Thank you for pointing out the similarity in the writing style between the introduction, related works, and background sections of our paper and those in [3]. While we have used different wording, we acknowledge that, as our work addresses the same problem as [3], the related works, definitions, terminologies, and formulas in the background sections inevitably overlap. These are fundamental to the problem and therefore remain consistent across both works.

To address your concern and reduce the perceived similarity, we have revised the writing style of these sections, removed some of the repeated content, particularly in the background, and cited [3]. This ensures that our paper remains distinct while still providing all necessary technical details for reproducibility and clarity.

References

[3] Tran et al. Monomial Matrix Group Equivariant Neural Functional Networks. NeurIPS 2024.

W1: The numbers reported for NFN in the INR classification experiments are considerably worse than the official ones. Namely, NFN scores $46.6 \pm 0.072$ on CIFAR-10, $92.9 \pm 0.218$ on MNIST, and $75.6 \pm 1.07$ on FashionMNIST. This also means that the performance of MAGEP-NFN is quite poor. According to Appendix E, the authors use the same datasets as Zhou et al. Is there a different setup used in the experiments?

Answer: For the INRs classification task, we follow the same settings as in Monomial-NFN[3], which utilized the dataset from [2]. The key difference between [2] and [3] is that [2] augments the data tenfold, resulting in 450,000 training instances, while [3] does not use augmentation. By not augmenting the dataset, we enable a fairer comparison, highlighting architectures that can inherently handle transformations in weight space without relying on additional data.

References

[2] Zhou et al. Permutation Equivariant Neural Functionals. NeurIPS 2023.

[3] Tran et al. Monomial Matrix Group Equivariant Neural Functional Networks. NeurIPS 2024.

W2.1: Almost no explanation is given for equations 20 and 21, that comprise the core functionality of the proposed method. These equations are very math-heavy, [...]

Answer: Following the reviewers' suggestion, we have revised the entirety of Section 4, incorporating additional intuition and explanations to make the stable polynomial terms, and their use in constructing the equivariant and invariant polynomial layers in Equations (20)-(21) (now Equations (12) and (15) in the revised version), easier to understand.

W2.2: Finally, it is unclear if all the individual components contribute to the effectiveness of the method. An ablation study on those components could make the method more efficient, without sacrificing accuracy

Answer: We have conducted an ablation study to evaluate the significance of the components introduced in our work. Specifically, we categorize the terms as follows:

  • Non-Inter-Layer Terms: These are terms that involve only the mapping of non-inter-layer weights and biases, $(W^l, b^l)_{l=1,\dots,L}$, to the output weight space, consistent with prior works (DWSNet [1], NP [2], HNP [2], Monomial-NFN [3]) on neural functional network layers.

  • Inter-Layer Terms: These are the novel terms introduced in our paper, $[W], [WW], [bW], [Wb]$, designed to capture relationships between weights and biases across multiple layers.

To assess their impact, we perform experiments on the invariant task of predicting CNN generalization on the ReLU subset, using the architecture specified in Equation (21). The results of our experiments are presented in the table below. We have also included these results in Table 18 in Appendix I of our revision.

Table 1: Ablation study to evaluate the significance of each component

| Components                                | Kendall's $\tau$ |
|-------------------------------------------|------------------|
| Only Non-Inter-Layer terms                | 0.929            |
| Only Inter-Layer terms                    | 0.932            |
| Non-Inter-Layer terms + [W]               | 0.930            |
| Non-Inter-Layer terms + [WW]              | 0.930            |
| Non-Inter-Layer terms + [Wb]              | 0.931            |
| Non-Inter-Layer terms + [bW]              | 0.931            |
| Non-Inter-Layer terms + Inter-Layer terms | 0.933            |

We can see that each newly introduced Inter-Layer term provides additional information to the network, and when combined with the Non-Inter-Layer terms, the performance is boosted considerably.

Q1: What does the outer sum $\sum_{L \geq s > t \geq 0}$ in equation 20 denote? Is it a double sum over $s$ and $t$?

Answer: Yes, it is a double sum over $s$ and $t$ with the constraint $L \geq s > t \geq 0$.
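Spelled out, for a network with $L$ weight matrices,

$$\sum_{L \geq s > t \geq 0} a_{s,t} = \sum_{s=1}^{L} \sum_{t=0}^{s-1} a_{s,t},$$

so the outer sum ranges over $L(L+1)/2$ ordered index pairs $(s,t)$, e.g. 10 pairs for $L = 4$; as noted in our other reply, many of the corresponding terms vanish after the weight-sharing process is applied.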

Comment

We would like to thank the reviewer again for your thoughtful reviews and valuable feedback.

We would appreciate it if you could let us know if our responses have addressed your concerns and whether you still have any other questions about our rebuttal.

We would be happy to do any follow-up discussion or address any additional comments.

Comment

I would like to thank the authors for their detailed rebuttal, and I encourage them to include parts of it in the camera-ready version of the manuscript to increase its impact. While they have successfully responded to many of my questions, my two major concerns still exist.

My first major concern is that the ethical concerns for plagiarism are still there. I disagree with the authors when they say that "as our work addresses the same problem as [3], the related works, definitions, terminologies, and formulas in the background sections inevitably overlap", especially when it involves paragraphs of text or math equations that are taken nearly verbatim from previous works. Further, while the revised manuscript has a reduced "perceived similarity", the similarity between whole paragraphs is remarkable, as each sentence in some paragraphs describes the same thing, with slightly different syntax or vocabulary.

My second major concern is with regard to the quantitative results for INR classification (W1 in the rebuttal). First, the augmentations performed by NFN are caused by different INR initializations; I do not see how they connect with transformations in the weight space and whether a method can inherently handle them. Further, it would be useful to know whether the proposed method can make use of the augmentations to achieve better performance, or if it underperforms/saturates given the augmentations. Finally, the results are also much worse than [1, 2] for MNIST and FashionMNIST, and to the best of my understanding, there are no augmentations used in this dataset either.

Thus, while I think that this is a novel and interesting method, and it will be useful for the community, I find that the two aforementioned concerns prohibit it from being ready for an ICLR publication. I maintain my score of 3.

[1] Navon et al. Equivariant Architectures for Learning in Deep Weight Spaces. ICML 2023

[2] Kofinas et al. Graph Neural Networks for Learning Equivariant Representations of Neural Networks. ICLR 2024

Comment

Thank you for your response. We appreciate your suggestions and your acknowledgment that our work is novel and interesting. We would like to take this opportunity to address the first and second concerns you raised in your reply.


(1) Ethical concerns for plagiarism: We thank the reviewer for bringing this to our attention and sincerely apologize for the oversight. We want to clarify that this was an honest and unintentional mistake stemming from our lack of thorough proofreading before submission, and we take full responsibility for it. We kindly ask the reviewer, as well as other reviewers, to evaluate our submission based on its novelty and technical integrity and to provide us with an opportunity to revise our work to address the plagiarism issue.

(2) Quantitative Results for INR Classification: The table below presents the results of our method when trained with augmented data. For the baseline methods, we use the reported values from [1]. Our method demonstrates the best performance on both MNIST and FashionMNIST. These results highlight our method's ability to effectively utilize augmented data to enhance performance.

Table 1: INR Classification for augmented data

|              | HNP [1]      | NP [1]       | DWSNets [2]  | MAGEP-NFN     |
|--------------|--------------|--------------|--------------|---------------|
| MNIST        | 92.5 ± 0.071 | 92.9 ± 0.218 | 74.4 ± 0.143 | 94.8 ± 0.053  |
| FashionMNIST | 72.7 ± 1.53  | 75.6 ± 1.07  | 64.8 ± 0.685 | 75.6 ± 0.95   |
| CIFAR-10     | 44.1 ± 0.471 | 46.6 ± 0.072 | 41.5 ± 0.431 | 44.05 ± 0.411 |

References

[1] Zhou et al. Permutation Equivariant Neural Functionals. NeurIPS 2023

[2] Navon et al. Equivariant Architectures for Learning in Deep Weight Spaces. ICML 2023

Comment

I would like to thank the authors for their response and their additional experiments. I encourage them to include the experiment on INR classification with the augmentations in the revised manuscript, and to include Monomial-NFN as a baseline, since it seems to be missing. I would further recommend including graph-based NFNs as baselines for INR classification, similar to the current Appendix G, to showcase the effectiveness of the proposed method even further.

Unfortunately, ethical concerns for plagiarism still exist. The authors mention that "[...] this was an honest and unintentional mistake stemming from our lack of thorough proofreading before submission [...]". Given the remarkable similarity in the first three sections between Monomial NFN and MAGEP-NFN, it seems highly unlikely that the two manuscripts ended up looking so similar by coincidence.

I believe that academic integrity is just as important as technical integrity. As such, I do not think that this work is yet up to the standards of an ICLR publication; it could use a major revision, which would include rewriting the first three sections to make the manuscript unique, as well as improving the overall clarity of the manuscript.

Comment

We appreciate the reviewer’s helpful suggestion. Regarding the concern about plagiarism, while we acknowledge that certain background sections in the initial pages of our paper share some overlap with [Tran2024], we respectfully disagree that this constitutes plagiarism. Our paper spans 59 pages, introducing novel ideas supported by comprehensive mathematical proofs.

As our work addresses the same problem within a similar context as [Tran2024], a degree of overlap in sections such as the introduction (Section 1), related work (Section 2), and preliminaries (Section 3) is unavoidable. However, these overlaps are minor and form a proportionally insignificant part of the overall content, which is predominantly original.

Since academic integrity is just as important as technical integrity, as you mentioned, it is important to approach allegations of plagiarism with care. We respectfully request that the reviewer focuses on the paper's novel contributions and original ideas and considers the plagiarism concern to be a misunderstanding.

Reference

[Tran2024] Tran et al. Monomial Matrix Group Equivariant Neural Functional Networks. NeurIPS 2024.

Review (Rating: 8)

This paper proposes monomial matrix group equivariant polynomial neural functional networks (MAGEP-NFN), a new class of neural functionals whose input and output are the weights of neural networks. MAGEP-NFN builds on a prior work, Monomial-NFN, which adds scale equivariance and sign equivariance to the original NFN, which is only equivariant to neuron permutations. MAGEP-NFN introduces stable polynomial terms, which are products of series of adjacent weights and biases of a neural network that are also equivariant under different neuron permutations. These stable polynomial terms can be thought of as some "features" of the original weights, similar to how polynomials are used as features in kernels. MAGEP-NFN layers can process these stable polynomials in addition to the original weights. Experimental results demonstrate that MAGEP-NFNs outperform previous NFNs that use weight-sharing schemes in terms of generalization prediction and perform similarly on INR editing tasks.

Strengths

This paper brings both theoretical and empirical contributions to NFNs. The techniques proposed in the paper are well-motivated and rigorously derived. Experiments and mathematical details are quite thorough, especially given how mathematically involved these approaches tend to be (I was able to follow the derivation reasonably well).

Overall, I believe MAGEP-NFN provides a concrete improvement in the line of work on NFNs and points towards a direction to improve NFNs further.

Weaknesses

The paper makes some unsubstantiated claims about its advantages over prior methods and GNN-based methods in terms of runtime and memory consumption. Specifically, while the paper emphasizes that MAGEP-NFN enjoys superior runtime and memory efficiency, there are no comparisons of these quantities for MAGEP-NFN and relevant baselines (For example, it is not clear to me that MAGEP-NFN should be more efficient than vanilla NFN). It would be good to add a runtime and memory efficiency comparison to the paper which in my opinion would make the paper stronger.

Furthermore, while the theoretical contribution is nice and sound, MAGEP-NFN does not seem to bring substantial empirical improvement over relevant baselines.

Minor

  • Use `` and '' for quotations.

Questions

  • How would one apply this approach to methods that support arbitrary computation graphs such as [1]?

Reference

[1] Zhou et al. Universal Neural Functionals.

Comment

Thank you for your thoughtful review and valuable feedback. Below we address your concerns.


W1: [...] While the paper emphasizes that MAGEP-NFN enjoys superior runtime and memory efficiency, there are no comparisons of these quantities for MAGEP-NFN and relevant baselines [...]. It would be good to add a runtime and memory efficiency comparison to the paper which in my opinion would make the paper stronger

Answer: To compare computational and memory costs, we have included the runtime and memory consumption of our model and previous ones on the CNN generalization prediction task in the tables below. We have also included these results in Tables 16 and 17 in Appendix H of our revision. For graph-based architectures, we compare with two recent works: GNN [3] and ScaleGMN [4]. Our model runs significantly faster and uses much less memory than these graph-based networks and NP/HNP [2]. Introducing additional polynomial terms slightly increases our model's runtime and memory usage compared to Monomial-NFN [5]. However, this trade-off results in considerably enhanced expressivity, which is evident across many tasks such as predicting CNN generalization and INR classification.

Table 1: Runtime of models

|             | NP [2] | HNP [2] | GNN [3]  | ScaleGMN [4] | Monomial-NFN [5] | MAGEP-NFN |
|-------------|--------|---------|----------|--------------|------------------|-----------|
| Tanh subset | 35m34s | 29m37s  | 4h25m17s | 1h20m        | 18m23s           | 28m12s    |
| ReLU subset | 36m40s | 30m06s  | 4h27m29s | 1h20m        | 23m47s           | 28m43s    |

Table 2: Memory consumption

|             | NP [2] | HNP [2] | GNN [3] | ScaleGMN [4] | Monomial-NFN [5] | MAGEP-NFN |
|-------------|--------|---------|---------|--------------|------------------|-----------|
| Tanh subset | 838MB  | 856MB   | 6390MB  | 2918MB       | 582MB            | 584MB     |
| ReLU subset | 838MB  | 856MB   | 6390MB  | 2918MB       | 560MB            | 584MB     |

References

[2] Zhou et al. Permutation Equivariant Neural Functionals. NeurIPS 2023

[3] Kofinas et al. Graph Neural Networks for Learning Equivariant Representations of Neural Networks. ICLR 2024

[4] Kalogeropoulos et al. Scale Equivariant Graph Metanetworks. NeurIPS 2024

[5] Tran et al. Monomial Matrix Group Equivariant Neural Functional Networks. NeurIPS 2024.

W2: Furthermore, while the theoretical contribution is nice and sound, MAGEP-NFN does not seem to bring substantial empirical improvement over relevant baselines.

Answer: Our model shows significant improvements on the CNN generalization prediction and INR classification tasks (see Tables 1 and 2). On the ReLU subset of Small CNN Zoo, while Monomial-NFN only observes a performance gain when augmenting the dataset, our method surpasses the best baseline, HNP, by a large margin (0.933 versus 0.926). Moreover, since the most recent graph-based baseline, ScaleGMN - with much more training time and memory consumption - reaches 0.928 on this task (please refer to Q2 in the General Response), this further highlights the performance gain of our method. On the INR classification task, our model surpasses all baselines, with gaps of 7.7% on MNIST and 2.9% on CIFAR-10.

Q1: How would one apply this approach to methods that support arbitrary computation graphs such as [1]?

Answer: MAGEP-NFN is designed to handle MLP or CNN architectures of any size. In general, our method is applicable to other architectures, provided that the symmetry group of the weight space is known. The idea is to use the weight-sharing mechanism to redetermine the stable polynomials and then calculate the constraints on the learnable parameters. We hope this addresses your question. If not, please let us know what we can provide in order to give a more comprehensive response.

References

[1] Zhou et al. Universal neural functionals.


We hope we have addressed your concerns about our work. We have also revised our manuscript according to your comments, and we would appreciate your further feedback at your earliest convenience.

Comment

We would like to thank the reviewer again for your thoughtful reviews and valuable feedback.

We would appreciate it if you could let us know if our responses have addressed your concerns and whether you still have any other questions about our rebuttal.

We would be happy to do any follow-up discussion or address any additional comments.

Comment

Thank you for the added response and additional results. Please make sure they are properly incorporated in the paper. I will keep my current score.

Comment

Thanks for your response, and we appreciate your endorsement. We will include the additional results and clarifications from our discussion in the revised paper.

Review (Rating: 6)

The paper proposes a new method to model neural network weights based on Equivariant Neural Functional Networks. Compared to previous approaches, this method is claimed to be more expressive while still efficient. This is achieved by adding nonlinear (polynomial) equivariant layers acting on weights. The paper presents an extensive theoretical framework and shows improved empirical performance on the INR-based tasks and generalization prediction.

Strengths

  1. The paper tackles a challenging problem of modeling neural network weights and their symmetries.
  2. The proposed nonlinear equivariant layers appear to be original and interesting.
  3. The paper seems theoretically sound, however, I haven't checked the correctness.
  4. The results are better than for the linear polynomials (Monomial-NFN).

Weaknesses

The paper's main weaknesses are experiments and writing/presentation.

The experiments are weak in several aspects:

  1. The paper claims "low memory consumption and efficient running time", however these claims are not supported by empirical evidence. The previous paper (Tran et al., 2024) shows that it can reduce memory from 6GB to 0.5GB, however, the extra (stable) polynomial terms may increase memory/runtime of the method in this paper.
  2. No comparison to graph-based approaches (Lim et al., 2023; Kofinas et al., 2024) in Tables 1-4 (ideally in terms of memory/efficiency vs results), so the significance of the results is hard to estimate.
  3. Some inconsistencies in the evaluation, which are not explained, making comparison to the previous literature difficult. For example, in (Kofinas et al., 2024) for MNIST the INR classification performance was about 90-95% and for FashionMNIST 70-75%, while in this paper it is 77 and 62, respectively. On the Small CNN Zoo dataset (Unterthiner et al., 2020) the method achieves 0.933, while graph-based approaches (Kofinas et al., 2024) achieve better results (0.935).
  4. The shortcoming of the method is that it requires some manual design of the polynomials for each architecture, so it's unclear how to extend it to other architectures such as transformers or even if it's possible to use a trained MAGEP-NFN on the wider networks.
  5. The experiments on INR and generalization prediction for simple small CNNs are not sufficient. Previous works (Lim et al., 2023; Kofinas et al., 2024) added more practical experiments with learning to optimize or diverse transformer architectures. Otherwise, the proposed approach appears to be overengineered for some specific toy cases.

Writing and presentation

  1. "These specialized networks are known as neural functional networks (NFNs)(Zhouetal.,2024b)" is a questionable statement.
  2. "graph-based equivariant NFNs suffer from high memory consumption and long running times" is an overstatement given that Tran et al., 2024 showed the memory consumption of those methods is at maximum 6GB and runtime at max 4.5 hours. So in practice, this does not seem to be a problem for any modern GPUs. Please rewrite or scale the experiments.
  3. The related work section is not informative, because it just lists existing papers without discussing how the current work is connected (improved, different, etc.) to them.
  4. The whole Section 3, if I understand correctly, seems to be background (notation introduction, etc. from Tran 2024), and it's quite lengthy, which makes the amount of the original content in the paper quite small.
  5. In section 4 it's not described how the equations 20-21 are different from the previous work (Tran 2024) or section 3. It would be more clear to directly contrast nonlinear and linear layers in this context to understand the contribution better.
  6. Contribution 2 of the paper (linear independence) is overstated - there is no comprehensive study in the main text. In contrast, the graph-based approaches handle diverse and complex architectures (such as transformers) more easily.
  7. The paper could be more accessible, e.g. by having illustrations of the proposed vs previous method.

Questions

  1. How difficult is the implementation of the proposed nonlinear layers? Can the authors provide pseudo code?

  2. Can the trained MAGEP-NFN be used on wider networks (e.g. to predict generalization for larger CNNs than in the Small CNN Zoo dataset), or would retraining the model be necessary?

Comment

Reply [1/4]

Thank you for your thoughtful review and valuable feedback. Below we address your concerns.


W1: The paper claims "low memory consumption and efficient running time", however these claims are not supported by empirical evidence. The previous paper (Tran et al., 2024) shows that it can reduce memory from 6GB to 0.5GB, however, the extra (stable) polynomial terms may increase memory/runtime of the method in this paper

Answer: To compare computational and memory costs, we have included the runtime and memory consumption of our model and previous ones on the CNN generalization prediction task in the tables below. We have also included these results in Tables 16 and 17 in Appendix H of our revision. For graph-based architectures, we compare with two recent works: GNN [2] and ScaleGMN [3]. Our model runs significantly faster and uses much less memory than these graph-based networks and NP/HNP [1]. Introducing additional polynomial terms slightly increases our model's runtime and memory usage compared to Monomial-NFN [4]. However, this trade-off results in considerably enhanced expressivity, which is evident across many tasks such as predicting CNN generalization and INR classification.

Table 1: Runtime of models

|             | NP [1] | HNP [1] | GNN [2]  | ScaleGMN [3] | Monomial-NFN [4] | MAGEP-NFN |
|-------------|--------|---------|----------|--------------|------------------|-----------|
| Tanh subset | 35m34s | 29m37s  | 4h25m17s | 1h20m        | 18m23s           | 28m12s    |
| ReLU subset | 36m40s | 30m06s  | 4h27m29s | 1h20m        | 23m47s           | 28m43s    |

Table 2: Memory consumption

|             | NP [1] | HNP [1] | GNN [2] | ScaleGMN [3] | Monomial-NFN [4] | MAGEP-NFN |
|-------------|--------|---------|---------|--------------|------------------|-----------|
| Tanh subset | 838MB  | 856MB   | 6390MB  | 2918MB       | 582MB            | 584MB     |
| ReLU subset | 838MB  | 856MB   | 6390MB  | 2918MB       | 560MB            | 584MB     |

References

[1] Zhou et al. Permutation Equivariant Neural Functionals. NeurIPS 2023

[2] Kofinas et al. Graph Neural Networks for Learning Equivariant Representations of Neural Networks. ICLR 2024

[3] Kalogeropoulos et al. Scale Equivariant Graph Metanetworks. NeurIPS 2024

[4] Tran et al. Monomial Matrix Group Equivariant Neural Functional Networks. NeurIPS 2024.

W2: No comparison to graph-based approaches (Lim et al.,2023; Kofinas et al., 2024) in Table 1-4 (ideally in terms of memory/efficiency vs results), so the significance of the results is hard to estimate

Answer: We compare our model with GNN [2] and ScaleGMN [3] for predicting CNN generalization, using HNP [1] as a reference. GNN shows a performance drop on the separate activation subsets. While ScaleGMN greatly enhances performance on the Tanh subset, its improvements on the ReLU subset are less substantial. In contrast, our model achieves significant overall improvements on both subsets. We refer the reviewer to Table 15 in Appendix G of our revision for the results, and present them in Table 3 below as well for convenience.

Table 3: Performance comparison with Graph-based NFNs

|             | HNP [1] | GNN [2] | ScaleGMN [3] | MAGEP-NFN |
|-------------|---------|---------|--------------|-----------|
| ReLU subset | 0.926   | 0.897   | 0.928        | 0.933     |
| Tanh subset | 0.934   | 0.893   | 0.942        | 0.940     |

References

[1] Zhou et al. Permutation Equivariant Neural Functionals. NeurIPS 2023

[2] Kofinas et al. Graph Neural Networks for Learning Equivariant Representations of Neural Networks. ICLR 2024

[3] Kalogeropoulos et al. Scale Equivariant Graph Metanetworks. NeurIPS 2024

W3: Some inconsistencies in evaluation, which is not explained, making comparison to previous literature difficult. [...]

Answer: For the INRs classification task, we follow the same settings as in Monomial-NFN[4], which utilized the dataset from [1]. The key difference between [1] and [4] is that [1] augments the data tenfold, resulting in 450,000 training instances, while [4] does not use augmentation. By not augmenting the dataset, we enable a fairer comparison, highlighting architectures that can inherently handle transformations in weight space without relying on additional data.

In the CNN generalization task, we evaluate all models on two subsets using ReLU and Tanh activations, while [2] uses the entire dataset for evaluation. Importantly, [2] cannot handle scaling symmetries, leading to a performance drop reported in [4]: 0.897 for the ReLU subset and 0.893 for the Tanh subset.

References

[1] Zhou et al. Permutation Equivariant Neural Functionals. NeurIPS 2023

[2] Kofinas et al. Graph Neural Networks for Learning Equivariant Representations of Neural Networks. ICLR 2024

[3] Kalogeropoulos et al. Scale Equivariant Graph Metanetworks. NeurIPS 2024

[4] Tran et al. Monomial Matrix Group Equivariant Neural Functional Networks. NeurIPS 2024.

Comment

Reply [4/4]

W12: The paper could be more accessible, e.g. by having illustrations of the proposed vs. previous method.

Answer: We have added a remark in Section 4.2 to illustrate a comparison of the proposed method vs. the previous method. In particular, our invariant polynomial layer is derived from the parameter-sharing mechanism. In contrast, the invariant layer proposed in [4] is an ad hoc formulation and does not result from a parameter-sharing mechanism. Consequently, there is no direct relationship between our invariant layer and the invariant layer in [4].

However, the equivariant polynomial layer in our MAGEP-NFNs and the equivariant linear layer from [4] are related. Specifically, the equivariant layer in [4] is exactly the linear component of our equivariant polynomial layer. Due to the lengthy formulation and construction process, we have provided the details of the equivariant polynomial layers in Appendix B.

References

[4] Tran et al. Monomial Matrix Group Equivariant Neural Functional Networks. NeurIPS 2024.

Q1: How difficult is the implementation of the proposed nonlinear layers? Can the authors provide pseudo code?

Answer: We have included an Implementation section in Appendix F, featuring pseudocode for better clarity. Our MAGEP-NFN architecture incorporates several components expressed using concise and adaptable einsum operators. Moreover, the supplementary materials contain code for implementing MAGEP-NFN layers, along with comprehensive documentation to support accurate reproduction of the results.
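To give a flavour of what such an einsum-based component can look like, here is a minimal, hypothetical sketch (the function and tensor names are ours, not the ones used in Appendix F or the supplementary code) of assembling one inter-layer product term:

```python
# Hypothetical sketch, not the paper's actual implementation.
# Assumes `weights` maps a layer index l to a tensor of shape
# (batch, channels, n_l, n_{l-1}), i.e. a batched, multi-channel copy of W^l.
import torch

def interlayer_product(weights, s, t):
    """Chain adjacent weight matrices W^s W^{s-1} ... W^{t+1}, assuming s > t."""
    out = weights[t + 1]  # shape: (batch, channels, n_{t+1}, n_t)
    for l in range(t + 2, s + 1):
        # Contract the hidden dimension shared by consecutive layers.
        out = torch.einsum('bcij,bcjk->bcik', weights[l], out)
    return out  # shape: (batch, channels, n_s, n_t)
```

The actual layers additionally involve the bias-containing terms and the weight-shared learnable coefficients; for those, please refer to the pseudocode in Appendix F and the supplementary code.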

Q2: Can the trained MAGEP-NFN be used on wider networks (e.g. to predict generalization for larger CNNs than in the Small CNN Zoo dataset), or would retraining the model be necessary?

Answer: Yes, MAGEP-NFN is designed to handle MLP or CNN architectures of any size. In general, our method is applicable to other architectures, provided that the symmetry group of the weight space is known. The idea is to use the weight-sharing mechanism to redetermine the stable polynomials and then calculate the constraints on the learnable parameters.


We hope we have addressed your concerns about our work. We have also revised our manuscript according to your comments, and we would appreciate your further feedback at your earliest convenience.

Comment

Reply [3/4]

W10: In section 4 it's not described how the equations 20-21 are different from the previous work (Tran 2024) or section 3. It would be more clear to directly contrast nonlinear and linear layers in this context to understand the contribution better.

Answer: Equations (20)-(21), which are Equations (12) and (15) in the revised version, describe the equivariant and invariant polynomial layers derived from the parameter-sharing mechanism of our MAGEP-NFNs. In contrast, the invariant layer proposed in [4] is an ad hoc formulation and does not result from a parameter-sharing mechanism. Consequently, there is no direct relationship between our invariant layer and the invariant layer in [4].

However, the equivariant polynomial layer in our MAGEP-NFNs and the equivariant linear layer from [4] are related. Specifically, the equivariant layer in [4] corresponds to the linear component of our equivariant polynomial layer. Due to the lengthy formulation and construction process, we have provided the details of the equivariant polynomial layers in the appendix. These discussions have been incorporated into the revised version of the paper.

References

[4] Tran et al. Monomial Matrix Group Equivariant Neural Functional Networks. NeurIPS 2024.

W11: Contribution 2 of the paper (linear independence) is overstated - there is no comprehensive study in the main text. In contrast, the graph-based approached handled diverse and complex architectures (such as transformers) more easily.

Answer: Regarding the linear independence of stable polynomials: The detailed analysis of the linear independence of stable polynomial terms, which is both extensive and technically intricate, is distributed between Section 4.1 in the main text and Appendix B. The primary result on linear independence is presented in Theorem B.6. In response to the reviewers' suggestions, we have included a simplified version of Theorem B.6 along with additional results related to stable polynomials in the main text.

Regarding the comparison to graph-based approaches: It is worth noting that, although graph-based approaches can handle diverse and complex architectures with relative ease, they suffer from higher memory consumption and longer runtime. To provide evidence of this, we report the runtime and memory consumption of our model and previous ones on the task of predicting CNN generalization in the tables below. We have also included these results in Tables 16 and 17 in Appendix H of our revision. For graph-based architectures, we compare with two recent works: GNN [2] and ScaleGMN [3]. Our model runs significantly faster and uses much less memory than these graph-based networks and NP/HNP [1]. Introducing additional polynomial terms slightly increases our model's runtime and memory usage compared to Monomial-NFN [4]. However, this trade-off results in considerably enhanced expressivity, which is evident across many tasks, such as predicting CNN generalization and INR classification.

Table 1: Runtime of models

|             | NP [1]  | HNP [1] | GNN [2]  | ScaleGMN [3] | Monomial-NFN [4] | MAGEP-NFN |
|-------------|---------|---------|----------|--------------|------------------|-----------|
| Tanh subset | 35m34s  | 29m37s  | 4h25m17s | 1h20m        | 18m23s           | 28m12s    |
| ReLU subset | 36m40s  | 30m06s  | 4h27m29s | 1h20m        | 23m47s           | 28m43s    |

Table 2: Memory consumption

|             | NP [1] | HNP [1] | GNN [2] | ScaleGMN [3] | Monomial-NFN [4] | MAGEP-NFN |
|-------------|--------|---------|---------|--------------|------------------|-----------|
| Tanh subset | 838MB  | 856MB   | 6390MB  | 2918MB       | 582MB            | 584MB     |
| ReLU subset | 838MB  | 856MB   | 6390MB  | 2918MB       | 560MB            | 584MB     |
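For reference, the sketch below shows a generic way to record wall-clock time and peak GPU memory of the kind reported in the two tables above. It is an illustrative harness with placeholder `model` and `loader`, not our exact measurement script:

```python
import time
import torch

def profile_one_epoch(model, loader, device="cuda"):
    """Return (elapsed seconds, peak GPU memory in MB) for one training epoch."""
    model = model.to(device)
    opt = torch.optim.Adam(model.parameters())
    loss_fn = torch.nn.MSELoss()
    if device == "cuda":
        torch.cuda.reset_peak_memory_stats(device)
        torch.cuda.synchronize(device)
    start = time.time()
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        opt.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        opt.step()
    if device == "cuda":
        torch.cuda.synchronize(device)
        peak_mb = torch.cuda.max_memory_allocated(device) / 2**20
    else:
        peak_mb = float("nan")
    return time.time() - start, peak_mb
```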

References

[1] Zhou et al. Permutation Equivariant Neural Functionals. NeurIPS 2023

[2] Kofinas et al. Graph Neural Networks for Learning Equivariant Representations of Neural Networks. ICLR 2024

[3] Kalogeropoulos et al. Scale Equivariant Graph Metanetworks. NeurIPS 2024

[4] Tran et al. Monomial Matrix Group Equivariant Neural Functional Networks. NeurIPS 2024.

Comment

Reply [2/4]

W4: The shortcoming of the method is that it requires some manual design of the polynomials for each architecture, so it's unclear how to extend it to other architectures such as transformers, or even whether it's possible to use a trained MAGEP-NFN on wider networks.

Answer: Our method is applicable to these architectures, provided that the symmetry group of the weight network is known. The main idea involves leveraging the weight-sharing mechanism to redefine the classes of "stable polynomials" and subsequently determine the constraints on the learnable parameters. While the calculations may need adaptation to the structure of the new architecture, they certainly remain feasible. As our paper focuses on enhancing the expressivity of Monomial-NFN while preserving its efficiency and accuracy within the architectures considered by current NFNs in the literature, these additional architectures fall outside the scope of this work.

W5: The experiments on INR and generalization prediction for simple small CNNs are not sufficient. Previous works (Lim et al.,2023; Kofinas et al., 2024) added more practical experiments with learning to optimize or diverse transformer architectures. Otherwise, the proposed approach appears to be overengineered for some specific toy cases.

Answer: For the learning-to-optimize task, the previous work [2] has not yet provided an official implementation in its GitHub repository (https://github.com/mkofinas/neural-graphs). We will include this benchmark when the code becomes available.

References

[2] Kofinas et al. Graph Neural Networks for Learning Equivariant Representations of Neural Networks. ICLR 2024

W6: "These specialized networks are known as neural functional networks (NFNs)(Zhouetal.,2024b)" is a questionable statement.

Answer: For clarity, we revised the sentence to "Neural functional networks (NFNs) [1] have recently gained prominence as specialized frameworks designed to process key aspects of DNNs, such as their weights, gradients, or sparsity masks, treating these as input data".

References

[1] Zhou et al. Permutation Equivariant Neural Functionals. NeurIPS 2023

W7: "graph-based equivariant NFNs suffer from high memory consumption and long running times" is an overstatement given that Tran et al., 2024 showed the memory consumption of those methods is at maximum 6GB and runtime at max 4.5 hours. So in practice, this does not seem to be a problem for any modern GPUs. Please rewrite or scale the experiments.

Answer: We agree with the reviewer that, in terms of computational cost and runtime, modern GPUs are sufficient to run graph-based NFN experiments. What we intended to convey is that, compared with other parameter-sharing models, graph-based models fall behind in terms of runtime and memory consumption (please refer to Tables 1 and 2 in our manuscript). We will rewrite this statement as: "Compared to graph-based models, parameter-sharing-based NFNs built upon equivariant linear layers exhibit lower memory consumption and faster running time."

W8: Related works is not informative, because it just lists existing papers without discussing how the current work is connected (improved, different, etc.) to them.

Answer: Thank you for your comment. We have updated the related works section in the revised version. The updated related works section is now more informative, with discussions of how MAGEP-NFN connects to, improves upon, and differs from prior works.

W9: The whole Section 3, if I understand correctly, seems to be background (notation introduction, etc. from Tran 2024), and it's quite lengthy, which makes the amount of the original content in the paper quite small.

Answer: Section 3 indeed serves as the background section, providing the necessary notations, definitions, and propositions required for constructing stable polynomials and equivariant/invariant layers in Section 4. Due to the extensive number of required notations, definitions, and propositions, this section spans nearly three pages. Nevertheless, following the reviewers' suggestions, we have shortened this section to less than two pages and relocated some of the material to the appendix.

Also, we respectfully disagree with the reviewer that the amount of the original content in our paper is quite small. Note that our manuscript spans 58 pages, including detailed derivations, extensive theoretical discussions, and thorough empirical validations of our methods.

Comment

We would like to thank the reviewer again for your thoughtful reviews and valuable feedback.

We would appreciate it if you could let us know if our responses have addressed your concerns and whether you still have any other questions about our rebuttal.

We would be happy to do any follow-up discussion or address any additional comments.

Comment

I thank the authors for a detailed response which addressed most of my concerns; therefore, I'm raising the score to 6.

Comment

Thanks for your response, and we appreciate your endorsement.

Comment

Incorporating feedback from reviewers, as well as some further informative empirical studies, we summarize the main changes in the revised version of our paper as follows:

  1. We have rewritten the abstract and introduction for improved clarity, focusing particularly on the description of graph-based models and neural functional networks.
  2. We have revised the writing style and wording of the abstract, related works, and introduction. Additionally, we removed some repeated content in the background section, as in [1], and included an extra citation for that work. These changes ensure our paper remains distinct while providing the necessary technical details for reproducibility and clarity.
  3. We have revised the entirety of Section 4, incorporating additional intuition and explanations to make the stable polynomial terms, and their use in constructing the equivariant and invariant polynomial layers in Equations (20)-(21) (now Equations (12) and (15) in the revised version), easier to understand.
  4. We have added a Memory and Runtime comparison in Appendix H. The results, presented in Tables 16 and 17, highlight the efficiency of our model compared to graph-based models ([2], [3]) and HNP [4].
  5. We have conducted additional baseline experiments in Appendix G, comparing our model with recent graph-based baselines (GNN [2], ScaleGMN [3]) on the task of predicting CNN generalization on the ReLU and Tanh subsets. The results, presented in Table 15 (Appendix G), show that our model achieves considerable improvements across both subsets, slightly falling behind ScaleGMN on the Tanh subset while surpassing it on the ReLU subset.
  6. We have conducted an additional ablation study in Appendix I to assess the contribution of our newly introduced components. The results are presented in Table 18, Appendix I. Each of the newly introduced Inter-Layer terms provides an additional performance gain for the network, and when combined, they boost performance considerably.

References

[1] Tran et al. Monomial Matrix Group Equivariant Neural Functional Networks. NeurIPS 2024

[2] Kofinas et al. Graph Neural Networks for Learning Equivariant Representations of Neural Networks. ICLR 2024

[3] Kalogeropoulos et al. Scale Equivariant Graph Metanetworks. NeurIPS 2024

[4] Zhou et al. Permutation Equivariant Neural Functionals. NeurIPS 2023

Comment

Dear AC and Reviewers,

We greatly appreciate your thoughtful reviews and valuable feedback, which have significantly enhanced the quality of our paper. We are encouraged by your recognition of the following aspects: 1) Our work is acknowledged for contributing both theoretical insights and empirical advancements to neural functional networks (Reviewer C8VP), marking a significant step forward in this area (Reviewers 4jXM, C8VP, zVNR); 2) The paper addresses a novel and challenging problem of designing polynomials that are equivariant to parameter symmetries (Reviewers jcC7, 4jXM) and presents a well-motivated study (Reviewer C8VP). The theory is described as sound (Reviewer jcC7), with rigorous and clear derivations that are easy to follow (Reviewers zVNR, C8VP); 3) The experiments are noted for their thoroughness (Reviewer C8VP) and demonstrate promising outcomes (Reviewers jcC7, zVNR), including improved expressivity (Reviewer 4jXM).

Below, we address some common points raised in the reviews:

1. Regarding runtime and memory consumption: To compare computational and memory costs, we report the runtime and memory consumption of our model and previous ones on the task of predicting CNN generalization in the tables below. We have also included these results in Tables 16 and 17 in Appendix H of our revision. For graph-based architectures, we compare with two recent works: GNN [2] and ScaleGMN [3]. Our model runs significantly faster and uses much less memory than these graph-based networks and NP/HNP [1]. Introducing additional polynomial terms slightly increases our model's runtime and memory usage compared to Monomial-NFN [4]. However, this trade-off results in considerably enhanced expressivity, which is evident across many tasks, such as predicting CNN generalization and INR classification.

Table 1: Runtime of models

|             | NP [1]  | HNP [1] | GNN [2]  | ScaleGMN [3] | Monomial-NFN [4] | MAGEP-NFN |
|-------------|---------|---------|----------|--------------|------------------|-----------|
| Tanh subset | 35m34s  | 29m37s  | 4h25m17s | 1h20m        | 18m23s           | 28m12s    |
| ReLU subset | 36m40s  | 30m06s  | 4h27m29s | 1h20m        | 23m47s           | 28m43s    |

Table 2: Memory consumption

|             | NP [1] | HNP [1] | GNN [2] | ScaleGMN [3] | Monomial-NFN [4] | MAGEP-NFN |
|-------------|--------|---------|---------|--------------|------------------|-----------|
| Tanh subset | 838MB  | 856MB   | 6390MB  | 2918MB       | 582MB            | 584MB     |
| ReLU subset | 838MB  | 856MB   | 6390MB  | 2918MB       | 560MB            | 584MB     |

2. Regarding comparison with more graph-based baselines: We compare our model with GNN [2] and ScaleGMN [3] for predicting CNN generalization, using HNP [1] as a reference. GNN shows a performance drop when trained on the separate activation subsets. While ScaleGMN greatly enhances performance on the Tanh subset, its improvements on the ReLU subset are less substantial. In contrast, our model achieves significant overall improvements on both subsets. We refer the reviewer to Table 15 in Appendix G of our revision for the results, which are also presented in Table 3 below for convenience.

Table 3: Performance comparison with Graph-based NFNs

|             | HNP [1] | GNN [2] | ScaleGMN [3] | MAGEP-NFN |
|-------------|---------|---------|--------------|-----------|
| ReLU subset | 0.926   | 0.897   | 0.928        | 0.933     |
| Tanh subset | 0.934   | 0.893   | 0.942        | 0.940     |

3. Regarding experiments setup for INRs Classification and Predict CNN Generalization: For the INRs classification task, we follow the same settings as in Monomial-NFN[4], which utilized the dataset from [1]. The key difference between [1] and [4] is that [1] augments the data tenfold, resulting in 450,000 training instances, while [4] does not use augmentation. By not augmenting the dataset, we enable a fairer comparison, highlighting architectures that can inherently handle transformations in weight space without relying on additional data.

In the CNN generalization task, we evaluate all models on two subsets using ReLU and Tanh activations, while [2] uses the entire dataset for evaluation. Importantly, [2] cannot handle scaling symmetries, leading to a performance drop reported in [4]: 0.897 for the ReLU subset and 0.893 for the Tanh subset.


References

[1] Zhou et al. Permutation Equivariant Neural Functionals. NeurIPS 2023

[2] Kofinas et al. Graph Neural Networks for Learning Equivariant Representations of Neural Networks. ICLR 2024

[3] Kalogeropoulos et al. Scale Equivariant Graph Metanetworks. NeurIPS 2024

[4] Tran et al. Monomial Matrix Group Equivariant Neural Functional Networks. NeurIPS 2024


We are glad to answer any further questions you have on our submission.

AC Meta-Review

The paper is the first to propose polynomials equivariant to parameter symmetries in neural functional networks, allowing for a good balance in the computational-complexity-versus-expressivity dilemma. There were also concerns regarding comparisons. Questions about the presentation persisted among the reviewers: the exposition was considered too dense, which could make reproducibility challenging; deeper intuition and additional explanations for the choice of stable polynomials were requested; and possible limitations of the approach with respect to expressivity were not sufficiently addressed, namely not including all invariant/equivariant polynomials, the degree of the polynomial, the limitations of monomials, and the differences that arise for different activation functions.

By far the most controversial point was whether the paper committed plagiarism, as raised by one of the reviewers. After checking myself, it appears that even with the edits there is considerable similarity. I do not want to go too much into the reasons for this similarity (for instance, it may be the same group of people co-authoring), but in general plagiarism is a hard issue to address, especially at very large conferences with many parties unknown at the time of reviewing, since more often than not the truth lies somewhere in the middle. What is certain is that, to avoid problems with other papers in the future (and questions about how much something is or is not plagiarism), it is better to be strict so that no doubt remains.

As far as the content is concerned, I think the paper is good. However, considering that there are also some unclarities to address, I suggest the authors do one more round of serious rewriting and resubmit.

Additional Comments from Reviewer Discussion

There were no significant comments or changes during the reviewer discussion other than what I have described in my metareview already.

Final Decision

Reject