PaperHub

Rating: 5.5/10 (Poster · 4 reviewers · min 4, max 7, std 1.1)
Individual ratings: 5, 7, 4, 6
Confidence: 3.5 · Correctness: 2.8 · Contribution: 2.5 · Presentation: 2.8

NeurIPS 2024

MatrixNet: Learning over symmetry groups using learned group representations

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2024-11-06
TL;DR

When learning over symmetry groups, a learned group representation of group elements works well as a feature representation.

Abstract

Keywords

Group Theory · Representation Theory · Feature learning · geometric deep learning · homomorphic

Reviews and Discussion

Review
Rating: 5

In this paper, the authors study the question of which feature representations to use for learning tasks with inputs coming from a symmetry group. They propose MatrixNet, a neural network architecture that learns matrix representations of group element inputs instead of using predefined representations.

The main contributions are as follows:

  1. Formulation of the problem of learning over groups as a sequence learning problem in terms of group generators with invariance to group axioms and relations.
  2. The proposed neural network architecture, MatrixNet, achieves higher sample efficiency and generalization over several more general baselines in prediction tasks over the symmetric group and Artin braid group.
  3. The matrix block method for constraining the network to respect the group axioms and an additional loss term for learning group relations.

Strengths

The strength of the paper is in building links among group representation theory, learning tasks, and neural networks, more specifically:

  1. Formulation of the problem of learning over groups as a sequence learning problem in terms of group generators with invariance to group axioms and relations.
  2. The matrix block method for constraining the network to respect the group axioms and an additional loss term for learning group relations.

Weaknesses

The motivation is not so clear to me. Applications of group theory are everywhere in the real world, so how to choose a good representation of the group associated with a given learning task is the more important question. In this paper, the authors formulate this problem but do not give a clear answer (see my questions below), and instead focus on solving a more abstract group-theoretic question that has already been studied extensively by mathematicians.

I didn't see any insight behind the paper. As for the experiments, I think they only solve a mathematical problem in a simple setting, which is far from what we expect. For example, how does the model work when the group's order becomes larger?

Questions

  1. Is the motivation to solve a mathematical problem or a real-world learning task? I hope the answer can help me understand the paper clearly and improve the writing.

  2. In subsection 4.1, the authors formulate the problem. In line 166, can $f$ be seen as a representation of $G$ or not? If it is a representation, can you write an explicit expression as an example? If it is a representation, the target domain $\mathbb{R}^{c}$ may not be a linear space, which makes it unclear to me.

  3. Can you show more experimental results, such as on a larger symmetric group? Also, I didn't understand the meaning of predicting the order of a group element. I think we should determine it instead of estimating it? Maybe I didn't get your point, but I am open to discussion.

Limitations

See the weaknesses and questions.

Author Response

Response to Reviewer puxG

The motivation is not so clear to me. The applications of group theory are everywhere in the real-world. Thus, how to choose a good representation for the group which is associated with the learning task is more important. In this paper, authors formulate this problem but do not give a clear answer.

In the $S_{10}$ experiment in section 5.1 we compare against precomputed representations, which perform significantly worse than the learnable representations of MatrixNet. This ablation shows MatrixNet can automatically learn group representations that are more useful for a given task. Our results parallel other results in deep learning showing that learned feature representations often outperform expert-engineered features. We discuss our motivation below with your question.

I didn't see any insight behind the paper. For the experiment, I think it only solved a mathematical problem in a simple setting.

The task used for our experiment on the Artin braid group in section 5.2 is an open mathematical research problem. This task is limited to $B_3$ since the multiplicity counts are not known for any larger braid groups. The experiment over $S_{10}$ in section 5.1 was chosen to provide a well-studied group with known representations to compare precomputed representations against the learned representations of MatrixNet.

Questions:

Q1: The motivation is not so clear to me. Is the motivation to solve a mathematical problem or a real-world learning task?

A1: Thank you for the feedback. We will make sure to make the motivation and application more clear in the paper. The second task used in our experiments over the Artin braid group is a current open mathematics research problem. Our goal is to create a model that can help mathematicians build intuition and formulate conjectures by 1) computing additional data points and 2) providing new insights through inspecting the learned representations. We believe this application is relevant to real mathematical research.

Q2: In line 166, can $f$ be seen as a representation of $G$ or not?

In line 166 the function $f$ does not constitute a representation of the group since the output space does not obey the group composition operation. Let $v_i = f(g_i)$ be the vector target for a group element. Then if $g_k = g_i \circ g_j$, it is not the case that $v_k = v_i + v_j$.
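For a concrete example of this, take the order function on $S_3$ as the target (an illustration added here, not an example from the paper):

$$
f\big((1\,2)\big) = f\big((2\,3)\big) = 2,
\qquad
f\big((1\,2) \circ (2\,3)\big) = f\big((1\,2\,3)\big) = 3 \neq 2 + 2.
$$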

Q3.1: Can you show more experimental results, such as on a larger symmetric group?

A3.1: Yes, thank you for the suggestion. Using the same data generation scheme and data splits detailed for the $S_{10}$ experiment, we generated 800,000 samples for the group $S_{12}$. $S_{12}$ is roughly two orders of magnitude larger than $S_{10}$ ($12!$ vs $10!$ elements). MatrixNet achieves an MSE of 3e-4 across all dataset splits with a classification accuracy of over 99%. The only increase to the network size necessary in this case was raising the representation size from 10 to 12.

Q3.2: Also, I didn't understand the meaning of predicting the order of the group element. I think we should determine it instead of estimating it?

A3.2: Sorry for the confusing terminology. We will change “predict” to “determine.” Our model outputs an exact order, not an estimate. We used the term predict in a colloquial way to refer to the model output.

Comment

Thanks for the authors' reply. It helps me understand the paper better.

Review
Rating: 7

The paper describes MatrixNet, a method that uses a neural network to learn group representations optimized for a given task of interest. The network takes in a group element in the form of a sequence of generators that compose to form the element, and produces an intermediate output containing the matrices that serve as the group representations of each generator. These matrices can then be multiplied to form the group representation of the element, which is fed to further task-specific neural layers. The network is trained to (approximately) satisfy the constraints of group representations using architectural constraints and the loss function. Experiments are presented on predicting the order of an element of the symmetric group and Jordan-Hölder multiplicities for the 3-strand braid group.
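To make the pipeline in this summary concrete, here is a minimal sketch of such a forward pass in PyTorch. It is written from the description above, not from the authors' code; the names (`MatrixNetSketch`, `rep_dim`, `head`) are invented, and details such as inverse generators and the relation loss are omitted.

```python
import torch
import torch.nn as nn

class MatrixNetSketch(nn.Module):
    """Generator indices -> learned generator matrices -> product -> task head."""

    def __init__(self, num_generators: int, rep_dim: int, out_dim: int):
        super().__init__()
        # One learnable square matrix per generator; the matrix exponential in
        # forward() guarantees each learned generator representation is invertible.
        self.gen_params = nn.Parameter(0.1 * torch.randn(num_generators, rep_dim, rep_dim))
        # Task-specific layers applied to the flattened matrix feature.
        self.head = nn.Sequential(
            nn.Linear(rep_dim * rep_dim, 128), nn.ReLU(), nn.Linear(128, out_dim)
        )

    def forward(self, word: torch.Tensor) -> torch.Tensor:
        # word: (seq_len,) tensor of generator indices with g = g_1 ∘ ... ∘ g_n.
        mats = torch.linalg.matrix_exp(self.gen_params[word])  # (seq_len, d, d)
        feature = mats[0]
        for m in mats[1:]:           # multiply the generator matrices in order
            feature = feature @ m    # to get the matrix for the whole word
        return self.head(feature.flatten())

# Example: a group with 2 generators, 5x5 learned representations, 10 classes.
model = MatrixNetSketch(num_generators=2, rep_dim=5, out_dim=10)
logits = model(torch.tensor([0, 1, 0]))  # the word g_1 ∘ g_2 ∘ g_1
```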

Strengths

  1. The paper presents new ideas for learning on groups. The design of the architectures is clever and well executed, with good mathematical justification. The architectures provably output group representations.

  2. Experiments show that the proposed architecture learns group representations that lead to better results compared to just using fixed, precomputed representations.

  3. The paper also does a good study of the possible variations of MatrixNet and presents experimental results for them.

Weaknesses

  1. One of the weaknesses is that the parts about the braid group are very difficult to follow and probably need a lot more background for readers and attendees of this conference. This includes myself: I was not able to fully follow the details or why that experiment is important.

  2. The experiments are also a little weak in my opinion. It was not clear to me how the learned representations differ from precomputed ones and why they are better for a given task. Also, what happens when the size of the group representation is larger? Perhaps for the first experiment it would be nice to see results as a function of the size of the group representation. I am not sure the proposed architecture and learning mechanism can learn good representations when the size is large. This is important to address, in my opinion.

Questions

No additional questions.

Limitations

Yes, the authors have listed the limitations of the current submission and suggested directions for future research.

Author Response

Response to Reviewer KZHf

Parts about the braid group are very difficult to follow and probably need a lot more background for readers and attendees for this conference.

Thank you for this feedback. We will make sure to revise this section to add more clarity. Intuitively, the braid group defined in the paper represents all of the possible ways to braid a set of $N$ ropes. The braid group is closely connected to many fields of math, but in particular we are interested in how it acts on mathematical categories through “categorical braid actions”. Mathematicians are interested in studying “filtrations” of the objects of categories, but there isn't a unique filtration for all objects within a category. The categorical braid actions, however, act on all such filtrations in the same way, so the multiplicity counts provide a canonical way to describe object filtrations regardless of the specific filtration. We include a more in-depth discussion of the braid group and categorical braid actions in section A of the appendix.

The experiments are a little weak in my opinion. It was not clear to me how the learned representations were different from precomputed ones and why they were better for a given task.

Our results parallel other results in deep learning showing that learned feature representations often outperform expert-engineered features. The precomputed representations we used are a natural choice of matrix representations for the groups in question. For example, the group $S_{10}$ represents all of the ways to permute 10 objects and can intuitively be represented by 10x10 permutation matrices, which is the precomputed representation we use in the experiment. However, depending on the task, this may not be the most useful representation. For example, the sign representation of a permutation is simply $\rho(\sigma) = \pm 1$ depending on the parity of $\sigma$. This would be a useful feature for determining a group element's order since every permutation of odd order has even parity. Our approach is designed to automatically learn a representation that is useful for the given task.
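The parity claim has a one-line justification: if $\sigma$ has odd order $k$, then

$$
\sigma = \sigma^{k+1} = \left(\sigma^{(k+1)/2}\right)^{2},
$$

so $\sigma$ is the square of a permutation and hence even, i.e. $\mathrm{sgn}(\sigma) = +1$.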

What happens when the size of the group representations is larger? I am not sure if the proposed architecture and learning mechanism can learn good representations when the size is large.

Our method can scale to large representations using MatrixNet-MC. MatrixNet-MC assumes large representations have a block-diagonal structure, which is efficient since the number of trainable parameters grows asymptotically linearly instead of quadratically. For well-chosen block sizes this does not harm expressivity, since for many groups all representations are block diagonal with respect to a good choice of basis, with the maximum block size equal to the dimension of the largest irreducible representation. Even for very large groups these irreducible representations are low dimensional and can be learned by MatrixNet-MC.
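A rough illustration of this parameter-count argument (our sketch, with invented helper names; the actual MatrixNet-MC parameterization may differ in detail): a full $d \times d$ matrix per generator costs $d^2$ parameters, while a block-diagonal one with fixed block size $b$ costs only $b \cdot d$.

```python
import torch

def dense_param_count(d: int) -> int:
    # One full d x d matrix per generator: parameters grow quadratically in d.
    return d * d

def block_diag_param_count(d: int, b: int) -> int:
    # d // b blocks of size b x b: parameters grow linearly in d for fixed b.
    assert d % b == 0
    return (d // b) * b * b  # == d * b

def assemble_block_diag(blocks: torch.Tensor) -> torch.Tensor:
    # blocks: (num_blocks, b, b) -> one (d, d) block-diagonal representation matrix.
    return torch.block_diag(*blocks)

for d in (16, 64, 256):
    print(d, dense_param_count(d), block_diag_param_count(d, b=4))
# 16 256 64
# 64 4096 256
# 256 65536 1024
```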

We omitted results with larger representation sizes as the performance did not change with matrix size. We include additional results for MatrixNet on the braid group task with doubled representation sizes in Table 2.

Model           MSE      Acc.
MatrixNet       0.975    85%
MatrixNet-MC    0.052    96%
MatrixNet-LN    4.5e-4   100%
MatrixNet-tanh  1.1e-3   100%

Comment

Thank you for answering my questions.

I believe the authors have done a good job in addressing most of the concerns that all the reviewers had.

The authors should definitely include all the new results from the rebuttal in the main paper.

I will raise my score to a 7.

Review
Rating: 4

This paper studies feature representations of a group element for supervised learning. It considers a regression task where an input is a group element $g$ of a finite group and a target is some label. First, $g$ is decomposed into a sequence of generators $(g_1, \ldots, g_n)$ such that $g = g_1 \circ \cdots \circ g_n$. Next, each generator is mapped by a trainable matrix embedding $W: \mathrm{Gen}(G) \to \mathbb{R}^{n \times n}$ followed by the matrix exponential, so that we get a group representation $M_i = \exp(W(g_i)) \in GL(n)$. Then their product $M_1 \cdots M_n$ becomes the feature. Similar variants are also proposed. The performance is evaluated on two synthetic tasks: order prediction for the symmetric group $S_{10}$ and action prediction for the braid group $B_3$.
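A short note on why the matrix exponential lands in $GL(n)$: for any square matrix $A$,

$$
\det\big(\exp(A)\big) = e^{\operatorname{tr}(A)} > 0,
\qquad
\exp(A)^{-1} = \exp(-A),
$$

so each $M_i = \exp(W(g_i))$ is invertible by construction.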

Strengths

  1. The paper is well written. Technical details are clear enough.
  2. The idea of constructing the feature representation of a group element as a learnable parameter is interesting, and it is a possibly promising direction.

Weaknesses

  1. Limited applicability to real tasks. To the best of my knowledge, regression (or classification) of group elements is practically important for continuous groups such as SO(3) (e.g., pose estimation). However, the current approach is only applicable to finite groups. Also, I think the tasks conducted in the experiments are not directly bridged to real problems, and I feel uncertain about how practically valuable the proposed method is.
  2. It is unclear which component significantly contributed to the final performance gain. The proposed method consists of (at least) two parts: the decomposition of g into generators and a trainable representation of a generator. The current experiments are not designed to evaluate them separately.
  3. The experiments are not convincing enough—they are relatively small scale, use synthetic tasks only, and have less variety (two tasks).

Questions

  1. Given g, is its decomposition into the generators always uniquely determined?
  2. In the experiment of 5.1, what will happen when we employ the group decomposition while using the fixed representation? I mean, the feature of $g$ is given as $\mathrm{concat}[\rho(g_1), \ldots, \rho(g_n)]$ where $g = g_1 \circ \cdots \circ g_n$ and $\rho(g_i)$ is some representation of $g_i$ (e.g. an irreducible rep).

Typo:

  • One of the $M_{g_{ik}^{-1}}$ should be $M_{g_{ik}}^{-1}$ in the equation below line 203.

Limitations

Limitations are addressed.

Author Response

Response to Reviewer JC7A

Limited applicability to real tasks. The current approach is only applicable to finite groups.

Our method is not limited to finite groups. One of our experiments focuses on the infinite Artin braid group. You are correct that our approach is not formulated for continuous groups, as it is limited to discrete groups.

Also, I think the tasks conducted in the experiments are not directly bridged to real problems, and I feel uncertain about how the proposed method is practically valuable.

The second task used in our experiments over the Artin braid group is a real, current research problem in pure math. Mathematicians have only found a way to compute the answer for the simplest braid group $B_3$ – and they do not have a simple or intuitive formula. Our goal is to create a model that can help mathematicians build intuition and formulate conjectures by 1) computing additional data points and 2) providing new insights through inspecting the learned representations. We believe this application is relevant to real mathematical research.

It is unclear which component significantly contributed to the final performance gain: decomposition of g into generators or a trainable representation of a generator.

We do test the impact of a trainable representation independent of the decomposition in our first experiment in section 5.1 of the paper by comparing against the precomputed representation. This ablation replaces the learnable representations with the permutation representation of $S_{10}$, and we see significantly worse performance compared to the learnable representation.

We also show that just decomposing $g$ into generators, without using a learned or fixed group representation, does not result in improved performance. We compare against two sequential baselines, a transformer and an LSTM, which both take the decomposition of $g$ as input. These models, however, do not learn a group representation, showing that the decomposition of $g$ into a generator sequence alone does not explain the improved performance of MatrixNet.

The experiments are not convincing enough—they are relatively small scale, use synthetic tasks only, and have less variety (two tasks).

We disagree that only synthetic tasks are used. The task used in our Artin braid group experiment is an unsolved math problem from a recent pure mathematics publication [39]. Mathematicians have only found a way to compute the answer for the simplest braid group $B_3$ – and they do not have a simple or intuitive formula. Even so, we believe the two groups used, $S_{10}$ and $B_3$, are sufficiently large, with $|S_{10}| = 10!$ and $B_3$ being an infinite group.

Questions: Q1: Given g, is its decomposition into the generators always uniquely determined?

A1: No, the decomposition of $g$ is not unique. This is a great question and a huge part of the motivation behind our approach. MatrixNet is designed to be invariant to the choice of decomposition.

Q2: In the experiment of 5.1, what will happen when we employ the group decomposition while using the fixed representation? I mean, the feature of $g$ is given as $\mathrm{concat}[\rho(g_1), \ldots, \rho(g_n)]$ where $g = g_1 \circ \cdots \circ g_n$ and $\rho(g_i)$ is some representation of $g_i$ (e.g. an irreducible rep).

A2: You could do this, but concatenating in this way means the feature size grows with the length of the decomposition. Since the decomposition of $g$ is not unique, another drawback is that this method would produce different features for two different but equivalent decompositions of $g$. The length of a decomposition is also unbounded: in the braid group, for example, an element can be decomposed into arbitrarily many generators, which would result in unbounded feature sizes. Our method, by contrast, is more efficient since feature sizes are fixed, and with low relational error it produces a unique representation for each group element.

Typo: One of the $M_{g_{ik}^{-1}}$ should be $M_{g_{ik}}^{-1}$ in the equation below line 203.

Good catch. We will fix it.

Comment

Thank you for your response. My concerns are mostly addressed. Still, I keep my original score because I'm not convinced enough of the method's applicability, and since I'm not a mathematician I cannot judge how significant the group-theoretic problems are.

Review
Rating: 6

The authors use neural networks to learn group representations. A group element is represented by its generators, formatted as a sequence of learned matrix representations. These generators are mapped to a single matrix representation of the group element via the Matrix Block, which enforces the group axioms. The resulting feature is used for downstream tasks on groups. The authors show that the proposed architecture, MatrixNet, successfully predicts group element orders for $S_{10}$ and Jordan-Hölder multiplicities for the braid group $B_3$. Further analysis on word length extrapolation and visualization demonstrates the superiority and usefulness of the approach.

Strengths

  1. In Table 1, the faster convergence of MatrixNet potentially indicates that proper architectural constraints provide a good inductive bias for learning group representations and solving downstream tasks over groups. It's good to see how one can explicitly build these constraints into the problem of learning group representations, and that this works better than a less constrained MLP without domain-specific inductive bias.
  2. The paper is well-written.

Weaknesses

  1. It’s unclear why the naive MatrixNet and MatrixNet-MC cannot extrapolate to longer word lengths.
  2. Empirical evaluations are a bit limited. It might be helpful to evaluate groups with different properties (see questions).

Questions

In the matrix block, is it possible to introduce commutative operations between the different matrix representations of group generators if the input group element comes from an abelian group? It seems possible to introduce further architectural constraints for specific groups.

Limitations

None

Author Response

Response to Reviewer jQuD

It’s unclear why the naive MatrixNet and MatrixNet-MC cannot extrapolate to longer word lengths.

While MatrixNet and MatrixNet-MC underperform compared to our other two variants, it is overstated to say they cannot extrapolate to longer word lengths. Despite their high MSE, both approaches maintain relatively high accuracy compared to our baselines. That said, MatrixNet-LN and MatrixNet-tanh do extrapolate better. The reason for this performance discrepancy is that MatrixNet and MatrixNet-MC have higher relational error, indicating they do not learn group representations as accurately; this error compounds for longer words. To make this difference clearer we computed the relational errors of the different models on the Artin braid group, shown in Table 1, which we will add to the paper.

Model           Rel. Error
MatrixNet       14.78
MatrixNet-MC    5.21
MatrixNet-LN    0.33
MatrixNet-tanh  0.45

Empirical evaluations are a bit limited. It might be helpful to evaluate groups with different properties.

Thank you for the feedback. The task we used for the braid group was a primary motivation for our approach. We are limited to $B_3$ since multiplicity counts for larger braid groups are not known. We chose the symmetric group for our initial experiment in section 5.1 as it is a well-studied group with the notable property that every finite group is isomorphic to a subgroup of some symmetric group. It is also closely connected to the braid group, making it well suited for ablation tests. We believe that these two groups provide a strong foundation for evaluating MatrixNet, but we will perform further evaluations on an abelian group as well.

Questions:

Q1: In the matrix block, is it possible to introduce commutative operations between different matrix representations of group generators if the input group element comes from an abelian group?

A1: Yes, this is possible. For an abelian group, commutativity can be enforced in two ways in MatrixNet. One is with a loss term $L = \|M_1 M_2 - M_2 M_1\|$, and the other is by choosing the learned matrix representation to be diagonal, i.e. a direct sum of one-dimensional representations. Since the irreps of an abelian group are one dimensional, this would not harm the expressivity of the learned representation and would enforce exact commutativity. We did not use diagonal matrices for the non-abelian groups for precisely this reason. More concretely, MatrixNet-MC with scalar channels is an architecture that is constrained to learn commutative representations.
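A minimal sketch of the first option, assuming the learned generator matrices are stacked in a single tensor (the pairwise Frobenius-norm form is our choice of detail, not necessarily the authors' exact loss):

```python
import torch

def commutator_loss(mats: torch.Tensor) -> torch.Tensor:
    # mats: (k, d, d) learned matrix representations of the k generators.
    # Sums ||M_i M_j - M_j M_i||_F over all generator pairs; the loss is zero
    # exactly when all learned generator matrices commute.
    loss = mats.new_zeros(())
    k = mats.shape[0]
    for i in range(k):
        for j in range(i + 1, k):
            comm = mats[i] @ mats[j] - mats[j] @ mats[i]
            loss = loss + torch.linalg.matrix_norm(comm)  # Frobenius norm by default
    return loss
```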

Comment

Thank you for the additional experiments and clarifications on how to incorporate other group constraints. My concerns are addressed. I maintain my score.

Reasons for not raising my score: I am not familiar enough with abstract algebra to see the full implications of this work. As of now, the proposed approach works well on the tasks chosen in the paper. It's hard for me to see whether the specific architectural constraints here generalize to broader classes of groups.

Comment

Thank you for your comment. To test MatrixNet's performance on broader classes of groups, we have generated data for the following groups: $S_{12}$, $S_5 \times S_5 \times S_5 \times S_5$, and $C_{11} \times C_{12} \times C_{13} \times C_{14} \times C_{15}$, and trained MatrixNet-tanh with minimal tuning. We hope that the inclusion of the $S_5$ product group and the latter abelian group helps illustrate the robustness of our method. Results are summarized in the table below:

Group                               Rep Size  Loss     Test Acc
S_12                                12        1.1e-2   98.4%
S_5 × S_5 × S_5 × S_5               20        2.1e-2   98.6%
C_11 × C_12 × C_13 × C_14 × C_15    10        1.01e-5  100%

Author Response

NeurIPS Rebuttal

We thank the reviewers for their feedback and insightful comments. We are glad they found our work well-written (jQuD, JC7A). It is particularly encouraging that many reviewers found our design of architectural constraints for learning group representations novel (KZHf), interesting (JC7A), and theoretically justified (jQuD, KZHf). We respond to specific comments below.

Unclear real-world motivation (JC7A, puxG)

(JC7A) I think the tasks conducted in the experiments are not directly bridged to real problems. (puxG) Is the motivation to solve a mathematical problem or a real-world learning task?

The primary motivation of our approach is to assist with mathematical research. The second task used in our experiments over the Artin braid group is a current open research problem. Mathematicians have only found a way to compute the answer for the simplest braid group $B_3$ – and they do not have a simple or intuitive formula. Our goal is to create a model that can help mathematicians build intuition and formulate conjectures by 1) computing additional data points and 2) providing new insights through inspecting the learned representations. We believe this application is relevant to real mathematical research.

Response to Reviewer jQuD

It’s unclear why MatrixNet and MatrixNet-MC cannot extrapolate to longer word lengths.

While MatrixNet and MatrixNet-MC underperform compared to our other variants, it is overstated to say they cannot extrapolate. Despite their high MSE, both approaches maintain relatively high accuracy compared to our baselines. That said, MatrixNet-LN and MatrixNet-tanh do extrapolate better. The reason for this discrepancy is that MatrixNet and MatrixNet-MC have higher relational error, indicating they do not learn group representations as accurately. To make this difference clearer we computed the relational errors of the models on the braid group, which we will add to the paper.

Model           Rel. Error
MatrixNet       14.78
MatrixNet-MC    5.21
MatrixNet-LN    0.33
MatrixNet-tanh  0.45

Should evaluate on groups with different properties.

We chose the symmetric group for our initial experiment in section 5.1 as it is a well-studied group with the property that every finite group is isomorphic to a subgroup of some symmetric group. It is also closely connected to the braid group, which serves as our motivating problem. We believe that these two groups provide a strong foundation for evaluating MatrixNet, but we will perform further evaluations as you suggested.

Response to Reviewer JC7A

It is unclear which component significantly contributed to the final performance gain: the decomposition of g into generators or a trainable representation of a generator.

We test the impact of a trainable representation independent of the decomposition in our experiment in section 5.1 by comparing against the precomputed representation. This ablation replaces the learnable representation with the permutation representation of $S_{10}$, and we see significantly worse performance compared to the learnable representation. We show that just decomposing $g$ into generators without using a group representation does not result in improved performance by comparing against two sequential baselines, a transformer and an LSTM, which take the decomposition as input but do not use a group representation.

The experiments are not convincing enough—they are relatively small scale, use synthetic tasks only, and have less variety (two tasks).

We disagree that only synthetic tasks are used. The task used in the braid group experiment in section 5.2 is an open math problem from a recent mathematics publication [39]. Mathematicians have only been able to compute the answer for the simplest braid group $B_3$ – and they do not have a simple or intuitive formula. We believe the two groups used, $S_{10}$ and $B_3$, are sufficiently large, with $|S_{10}| = 10!$ and $B_3$ being an infinite group.

Response to Reviewer KZHf

It was not clear to me how the learned representations were different from precomputed ones and why they were better for a given task.

Our results parallel other results in deep learning showing that learned feature representations often outperform expert-engineered features. The precomputed representations we used are a natural choice of matrix representations for the groups in question. We use 10x10 permutation matrices to represent $S_{10}$, but this may not be the most useful representation for every task. Our approach is designed to automatically learn a representation that is useful for the given task.

What happens when the size of the group representations is larger?

Our method can scale to large representations using MatrixNet-MC. MatrixNet-MC assumes a block-diagonal structured representation, which is efficient since the number of trainable parameters grows asymptotically linearly instead of quadratically. For well-chosen block sizes this does not harm expressivity, since many group representations are block diagonal with respect to a good choice of basis. We include additional MatrixNet results on the braid group task with representation sizes roughly doubled.

Model           MSE      Acc.
MatrixNet       0.975    85%
MatrixNet-MC    0.052    96%
MatrixNet-LN    4.5e-4   100%
MatrixNet-tanh  1.1e-3   100%

Response to Reviewer puxG

For the experiment, I think it only solved a mathematical problem at simple setting.

The task used for our braid group experiment in section 5.2 is an open mathematical research problem. This task is limited to $B_3$ since the multiplicity counts are not known for any larger braid groups. The experiment over $S_{10}$ in section 5.1 was chosen to provide a well-studied group to compare precomputed representations against the learned representations of MatrixNet.

Final Decision

The paper proposes to learn representations of symmetry groups that are, in turn, optimal for the downstream task at hand. In particular, MatrixNet is proposed, which takes generators as input and combines them in a Matrix Block that returns an invertible square matrix. For a new group element, its matrix representation is then the product of the matrix representations of the generators needed to generate that element. A parameterized neural network form of the Matrix Block is proposed, as well as variations. An interesting experiment addresses the mathematical problem of predicting Jordan-Hölder multiplicities, which to be frank is out of my domain, but seems relevant and interesting for the community.

The reviews are mixed, with most reviewers appreciating the novelty and the presentation of the paper. The main complaint is whether the experiments represent a real-world setting, and whether a mathematical problem really constitutes a real-world problem. I think it does, or at least it is an interesting new direction to consider as a possible way to validate machine learning models.

Further, while reading the paper, I think some subsections of related work are missing, specifically on 'Symmetry Discovery', e.g.

Yang et al., Latent Space Symmetry Discovery, PMLR, 2024
Gabel et al., Learning Lie Group Symmetry Transformations with Neural Networks, TAGML, 2023
Dehmamy et al., Automatic Symmetry Discovery with Lie Algebra Convolutional Networks, NeurIPS, 2021

and in 'Mathematically Constrained Networks' a subsection on mathematical programming for imbuing constraints into neural networks:

Pervez et al., Differentiable Mathematical Programming for Object-Centric Representation Learning, ICLR 2023
Pervez et al., Mechanistic Neural Networks for Scientific Machine Learning, ICML 2024

I suggest the authors incorporate the reviewers' suggestions and extend their related work accordingly.