Improving Equivariant Model Training via Constraint Relaxation
Improving the optimization of equivariant neural networks by relaxing the equivariant constraint during training
Abstract
Reviews and Discussion
The paper proposes to relax the equivariance constraints in equivariant networks by adding a non-equivariant residual to the network weights. The methodology can be applied to different equivariant architectures, e.g. Vector Neurons, SE(3)-equivariant GNNs, and Equiformers. Besides these strictly equivariant networks, the proposed framework can also extend to approximately equivariant networks by modulating the strength of the unconstrained network weights. Experiments suggest moderate performance increases on a variety of tasks and equivariant architectures when the proposed framework is applied.
Strengths
The paper addresses an important problem: overcoming the optimization challenges for equivariant neural networks. The methodology is clearly motivated and easy to follow. The proposed framework of adding an unconstrained component in the network weights seems general enough to be implemented on various (though not all) equivariant architectures, which the authors have demonstrated with quite a few examples.
Weaknesses
- Formatting: the paper is visually difficult to read because of the incorrect use of parentheses in citations.
- Theoretical contribution: as far as I understand, the motivation for relaxing the equivariance constraints is (i) equivariant networks can be difficult to optimize, or (ii) the data have imperfect symmetry. The paper does not include any theoretical evidence of how the proposed framework of adding unconstrained weights can be helpful in these scenarios.
- Specifically, for (i), the paper did not point out what could be the specific challenges during the optimization of equivariant networks, compared to unconstrained optimization.
- And for (ii), previous works have shown the error bound of approximately equivariant networks on imperfectly symmetric data (Wang et al 2022) and proposed how to find the symmetry-breaking factors (Wang et al 2023). Compared to these works, I feel this paper didn't provide enough analysis e.g. of equivariance error of the proposed network, or how to interpret the learning result and possibly identify the imperfect symmetry in data.
Also, many of the results in the paper are already well-known, e.g. the equation about the Lie derivative. Several works have already proposed to use the Lie derivative as a regularization to encourage equivariance, e.g. Otto et al 2023.
- Significance of experimental results: in the experiments, the proposed method only increases performance slightly. Also, in Figure 2, it seems that some models have not converged after 200 epochs. It would be better to train for more epochs and include the full results. In Table 1, the difference between Equiformer and your method is very small. It's hard to verify the significance of this result without error bars.
Questions
- The scheduling choice of $\theta$ in Section 3.2 seems arbitrary. Have you tried other scheduling choices, e.g. initializing it at zero and constantly increasing it? What is the intuition for the current choice?
- How sensitive is the model to the regularization coefficient $\lambda_{reg}$?
- L193: "During inference, we evaluate only on the equivariant part of the model." It would be interesting to also see the result for the full model (i.e. including W). I wonder how equivariant and how large W would be under the current regularization. I'm asking because approximately equivariant networks have proved to outperform strictly equivariant networks on certain tasks with imperfect symmetries (Wang et al 2022). As you have both the non-equivariant network and the equivariant one, I'm curious about the comparison between them.
- Is it possible to have different weighting coefficients for each network layer in the Lie derivative regularization and projection error regularization? As these errors can accumulate after passing through multiple layers, I think it is intuitively reasonable to have larger weights for the first few layers.
References
- Wang, Rui, Robin Walters, and Rose Yu. "Approximately equivariant networks for imperfectly symmetric dynamics." International Conference on Machine Learning. PMLR, 2022.
- Wang, Rui, Robin Walters, and Tess E. Smidt. "Relaxed Octahedral Group Convolution for Learning Symmetry Breaking in 3D Physical Systems." arXiv preprint arXiv:2310.02299 (2023).
- Otto, Samuel E., et al. "A unified framework to enforce, discover, and promote symmetry in machine learning." arXiv preprint arXiv:2311.00212 (2023).
Limitations
The authors have discussed the limitations.
We thank the reviewer for the positive comments as well as for raising many points for clarification. We address these points below.
Citation Format
We apologize for the incorrect formatting of citations and thank the reviewer for pointing it out. We will correct this in the final version of the paper.
Theoretical contributions
Unlike other works on relaxed equivariance, here we specifically focus on the setting where the training data distribution and the model do not have a mismatch in terms of symmetry. As a result, the performance improvements come from the fact that the proposed relaxation process can ease the optimization of equivariant networks. As we also noted in our responses to other reviewers, among equivariant NN practitioners it is now a somewhat common observation that equivariant NNs can be harder to optimize than their non-equivariant counterparts [1][2][3]. However, the question itself remains unexplored. We take a step towards examining this question in more detail; we believe that identifying processes that provide empirical improvements in the optimization of such networks is a reasonable standalone contribution that can be valuable to the community.
That said, we have been working on theoretical models of an optimization-generalization trade-off, which could provide some rationale for why equivariant models might be harder to optimize. We leave a full exposition to a future paper, since the analysis can become quite involved.
Significance of experimental results
We thank the reviewer for the feedback regarding the result shown in Figure 2 of the paper. Following the suggestion, we trained the models beyond the limit of 200 epochs and show the results in Figure 1 of the attached rebuttal PDF. We see that the additional epochs indeed allow the models to converge, increasing their performance in some cases. It is important to note that while we extended the training epochs, we kept the scheduling of $\theta$ the same, meaning that $\theta$ is equal to zero from epoch 200 onwards. In the original submission we chose to follow the exact setup used by the baseline method (which used 200 epochs) to showcase how our method can provide improvements in performance without the need to re-tune all of the training hyperparameters. However, we appreciate the reviewer's comment and will include the updated figure in the final version of the paper.
Regarding the error bars to quantify the significance of the results for the Equiformer experiment: due to the high computational cost of training a single Equiformer model, there is a significant cost involved in providing detailed error bars. The original paper (Liao and Smidt 2023) did not include error bars in their reported results. However, we are actively training more models and can provide the resulting error bars in the final version of the paper.
Choice of scheduling.
We thank the reviewer for the comment. The main constraint on the scheduling of $\theta$ is that we want it to be equal to zero at the end of training so that the final model is equivariant. As a result, the suggested choice of monotonically increasing $\theta$ does not satisfy that constraint. Additionally, we observed that a warm-up period at the beginning of training with a small $\theta$ provides a significant improvement in the training process. These two observations informed our choice of schedule, where we linearly increase $\theta$ for the first half of training and then linearly decrease it. Following the reviewer's suggestion, we will add these observations to the Appendix of the paper.
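To make the schedule concrete, here is a minimal Python sketch of the triangular schedule described above; the peak value `theta_max` and the per-epoch granularity are assumptions for illustration, not values from the paper:

```python
def theta_schedule(epoch: int, total_epochs: int, theta_max: float = 0.1) -> float:
    """Linearly ramp theta up for the first half of training, then linearly
    anneal it back down so that theta == 0 at the final epoch and the model
    ends up exactly equivariant. theta_max is a hypothetical peak value."""
    half = total_epochs / 2
    if epoch <= half:
        return theta_max * epoch / half                # warm-up / ramp-up phase
    return theta_max * (total_epochs - epoch) / half   # anneal back to zero
```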
Sensitivity to the regularization coefficient.
In Figure 2 of the attached rebuttal PDF we show the performance of the VN-PointNet model for different values of the regularization parameter. This experiment was part of a hyperparameter search using cross-validation on an 80%-20% split of the original training set. As a result, the documented performance corresponds to models trained only on 80% of the training set and evaluated on the remaining 20%.
Additional Details
Regarding evaluation of the unprojected relaxed model: We appreciate the reviewer's suggestion to add a comparison between the relaxed non-equivariant model and the projected equivariant model. Figure 3 (included in the attached rebuttal PDF) shows a comparison between a model trained with our proposed method and a relaxed model before and after we project it onto the equivariant space. For the latter model the parameter $\theta$ is kept constant and the equivariance error is controlled only through the regularization term. We observe that, although the performance of the relaxed model before projection is close to the results achieved by our method, controlling the relaxation through our proposed annealing of $\theta$ benefits performance. We will add this comparison to the final version of the paper.
Regarding the use of different weightings in the Lie derivative regularization: It is true that a more general approach would use different regularization weights for each layer of the network. However, the choice of per-layer weights depends heavily on the individual architecture and would introduce additional complexity to our method. We therefore use the same weight for all layers, which keeps the method simple and easily applicable to different tasks, independent of the specific architecture used.
References used in this response
[1] Yi-Lun Liao and Tess Smidt. Equiformer: Equivariant Graph Attention Transformer for 3D Atomistic Graphs. arXiv:2206.11990.
[2] Rui Wang, Elyssa Hofgard, Han Gao, Robin Walters, and Tess E. Smidt. Discovering Symmetry Breaking in Physical Systems with Relaxed Group Convolution. arXiv:2310.02299.
[3] Risi Kondor, Zhen Lin, and Shubhendu Trivedi. Clebsch-Gordan Nets: a Fully Fourier Space Spherical Convolutional Neural Network. arXiv:1806.09231.
Thank you for your response. Some of my concerns have been addressed. E.g. the additional experiment in Figure 1 indeed shows the benefit of relaxing the equivariance constraint.
For my question 3, I was originally approaching it from the perspective of ``approximate symmetry in data'', where an approximately equivariant model (e.g. with a small $\theta$ in Eq. (2)) would help. So I was thinking about training such a model by scheduling $\theta$ to be close to zero at the end, e.g. by still using your schedule in Section 3.2 but ending a few epochs earlier. However, since you focus on improving the optimization of a (strictly) equivariant model rather than specifying an approximately equivariant model that matches the data symmetry, I believe what I proposed was not the most relevant. Still, I appreciate your effort in providing the additional results in Figure 3.
My main concern, however, is still about the theoretical contributions. I definitely agree that equivariant models pose optimization challenges, and also tackling this problem itself is important. However, I'd expect more theoretical evidence on why your current method would work and what kind of optimization obstacles it could possibly overcome.
As also mentioned by Reviewer GoLr, there are other intuitive approaches to your goal, e.g. simply using a non-equivariant model and including the Lie derivative regularization. The amount of symmetry violation can also be controlled by the regularization coefficient $\lambda_{reg}$, and perhaps you can do a similar scheduling procedure for $\lambda_{reg}$ as for $\theta$. I agree that this would be a less explicit control of the level of relaxation. But since there are fewer components and loss terms in the model, it's hard to say exactly which one would be better. Due to the lack of theoretical analysis, I am unconvinced that the current method is a better way of addressing optimization challenges in equivariant networks, among the many possible approaches.
Thank you for the response. We appreciate your comment and engagement.
Regarding the suggestion of using a non-equivariant model and including a Lie derivative regularization:
Optimizing a non-equivariant model with a Lie derivative regularization, even in the case where scheduling of $\lambda_{reg}$ is performed, doesn't guarantee that the final model will be exactly equivariant, i.e. that the Lie derivative will be zero for all possible inputs. Thus we believe that if we want to learn a model that is exactly equivariant, which is the focus of this work, a projection operation that projects the model onto the equivariant space is required. While there are multiple works on training relaxed equivariant networks, they do not consider this projection step. In our work, we described a projection operation that is simple, so it can be easily incorporated into a typical training process, and we provided experimental evidence showing how it benefits the training of equivariant networks.
Regarding the contribution of this work
We recognize that a theoretical analysis (or motivation) of this phenomenon would be an important contribution, and it is a research direction we are interested in pursuing in the future. In general, we think that a first-principles theoretical approach to deriving a better optimization procedure could benefit the community, since the right language for this problem (in the equivariant setting) is also missing. Nevertheless, we believe that providing a simple training procedure that improves the performance of a large range of equivariant networks is an important standalone contribution to the community. As we discussed in our rebuttal responses, previous works mainly showcase the benefits of learning relaxed equivariant networks and do not consider the case that we focus on, where we want to return to the space of exactly equivariant networks. As a result, we believe that our work provides a novel perspective: even when a practitioner requires an exactly equivariant network, it can still be beneficial to relax the equivariance constraint during training and project back to the space of equivariant networks during inference. The problem of improving the optimization of equivariant networks has received little attention and remains largely unexplored, and our work can be seen as a step towards exploring this space.
As for the other suggestion by the reviewer, we are happy to run experiments with that approach and include them in the appendix if the reviewer thinks it will make the paper more comprehensive.
Thank you for the response. I appreciate the authors' efforts to provide additional experiments and clarifications toward the paper's objective and contribution. The lack of theoretical analysis is still a concern to me, but I agree that the method proposed in this paper is indeed an important step toward understanding the optimization challenges in equivariant networks. I will raise my score to 5.
The paper proposes relaxing the equivariance constraint on an equivariant network during training. This is done by adding free weights to equivariant linear layers but setting the free weights to zero after training. Further, two regularizations are introduced to stabilize the training: a Lie derivative term encouraging the free weights to be close to equivariant and a term encouraging the influence of the free weights to be low compared to the equivariant weights. The approach is evaluated on several equivariant tasks, showing improved performance compared to the baseline of non-relaxed optimization.
Strengths
- The paper provides a solid contribution to the understudied topic of optimizing equivariant networks. As far as I am aware, the idea has not been studied in the literature before.
- The presented experiments show a small but consistent benefit using the proposed approach.
Weaknesses
- There is no theoretical motivation for why the proposed approach should work.
- The optimization will be heavier using the proposed approach since a large number of additional parameters are introduced. See also Question 2 below.
Questions
- The proposed parameterization of equivariant layers during training is $W' = W_{eq} + \theta W$, where $W_{eq}$ is equivariant. $W$ is also encouraged to be close to equivariant through Lie derivative regularization. Would it be possible to remove $W_{eq}$ and to parameterize $W'$ as a single non-equivariant layer that is regularized by the Lie derivative loss? The projection to the equivariant subspace at the end could be group averaging: $\bar{W} = \int_G \rho_{\text{out}}(g)^{-1}\, W\, \rho_{\text{in}}(g)\, d\mu(g)$. Is there something that speaks against such an approach?
- What is the introduced overhead during training? In terms of time and memory costs.
- What is the performance of the obtained trained network without projecting to the equivariant subnetwork? I.e. is the approximately equivariant network better than the equivariant?
Minor:
- Line 243 "table 2" -> "Figure 2"
Limitations
Thank you for the positive assessment of our work. For the questions you raise, we attempt to address them below. Please let us know if we can provide further clarifications.
Theoretical motivation
Within the equivariant NNs community, especially amongst practitioners using such networks in scientific applications, it is a fairly common observation that such networks can be harder to optimize than their non-equivariant counterparts [1][2][3]. We wanted to take a step towards exploring this question and its attendant space, which we believe is a worthwhile contribution to the community. We have started to form some theoretical justifications for how/why such approaches should work or be modified, and we hope to develop this theory in future work. Generally, working out an optimization-generalization trade-off is hard (compared, say, to working out a trade-off between generalization and approximation). Here we would like to work it out for equivariant networks versus non-equivariant ones.
Regarding the additional Optimization Cost
Thank you for the question. As you correctly point out, the cost of optimizing with our method is higher than for a baseline equivariant network due to the additional parameters. Nevertheless, due to the parallel nature of the added unconstrained component, the main overhead of the method is the additional memory required to store the added parameters. The number of additional parameters makes the model similar in size to typical unconstrained non-equivariant models, which is expected since during training we optimize over a space that is larger than the constrained space of equivariant models. We would like to note that, contrary to other methods for approximately equivariant models that require the additional parameters both at training and inference time, our method requires additional computational resources only during training. As a result, after training is completed our proposed method has no effect on inference time. Below we compare the time cost of our proposed training process against the baseline training of different equivariant models. To showcase the memory overhead, we also compare the number of learnable parameters of a model trained with our method against an equivariant model trained with a standard training process:
| Model Type | Number of Parameters (Base Model) | Additional Parameters (Ours) | Time per Epoch (Base Model) | Time per Epoch (Ours) |
|---|---|---|---|---|
| PointNet | 1.9M | 6.4M | 75s | 80s |
| DGCNN | 1.8M | 6.2M | 148s | 154s |
| Equiformer | 3.4M | 10M | 52s | 57s |
We will add a discussion about the optimization cost of our method in the Appendix of the final version of the paper.
Regarding Suggestion in Question 1
Thanks for the interesting question. An important part of our proposed method is the ability to control the level of equivariance relaxation explicitly, in addition to the implicit control that comes from the regularization term. With our current parametrization, we can control the level of relaxation through the value of $\theta$. The importance of this control and of the annealing of $\theta$ can be seen in Figure 2 (of the submitted paper): when only the Lie derivative regularization is applied and the value of $\theta$ is kept constant, the performance of the method decreases.
We think that with the group averaging approach, it could be harder to have explicit control over the level of relaxation of the equivariance constraint during training. As a result, when projecting back to the equivariant model, the performance gap between the relaxed and projected models might be large, even with the Lie derivative regularizer.
Additionally, for continuous groups the operation of group averaging can only be performed approximately, since we need to sample group elements to approximate the averaging integral.
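To illustrate the sampling issue, here is a rough sketch of Monte Carlo group averaging for SO(3) acting on 3D vector features; `random_rotation` is a hypothetical sampler of Haar-distributed rotation matrices, and the result is only approximately equivariant for any finite number of samples:

```python
import torch

def mc_group_average(layer, x: torch.Tensor, random_rotation, num_samples: int = 32):
    """Approximate symmetrization of `layer` over SO(3): average
    rho_out(g)^{-1} layer(rho_in(g) x) over sampled rotations g.  Assumes the
    inputs and outputs of `layer` are rows of 3D (type-1) vectors."""
    out = torch.zeros_like(layer(x))
    for _ in range(num_samples):
        R = random_rotation()            # 3x3 rotation, ideally Haar-distributed
        out = out + layer(x @ R.T) @ R   # rotate the input, un-rotate the output
    return out / num_samples
```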
Performance of the Network without the Projection
In Figure 3 of the rebuttal PDF we added a comparison between the performance of a model trained with our method and the performance of a relaxed equivariant model before and after we project it onto the equivariant space. For the relaxed model that we compare against, the parameter $\theta$ is kept constant and the equivariance error is controlled only by the regularizer. We observe that our method outperforms the relaxed equivariant model with constant $\theta$ even before the latter is projected onto the equivariant space.
Typos
Thank you for pointing this out. We will correct it in the paper.
References used in this response
[1] Yi-Lun Liao and Tess Smidt. Equiformer: Equivariant Graph Attention Transformer for 3D Atomistic Graphs. arXiv:2206.11990.
[2] Rui Wang, Elyssa Hofgard, Han Gao, Robin Walters, and Tess E. Smidt. Discovering Symmetry Breaking in Physical Systems with Relaxed Group Convolution. arXiv:2310.02299.
[3] Risi Kondor, Zhen Lin, and Shubhendu Trivedi. Clebsch-Gordan Nets: a Fully Fourier Space Spherical Convolutional Neural Network. arXiv:1806.09231.
I acknowledge that I have read the reviews and rebuttals. The authors have successfully answered the most important critique points, so I lean towards keeping my score.
Thank you for your comment and for engaging with our response. We are glad that our responses have been clarifying.
Starting from the consideration that equivariant neural networks, though effective for tasks with known data symmetries, are hard to optimize and require careful hyperparameter tuning, this study proposes a framework to improve their optimization process. This is done by temporarily relaxing the equivariance constraint during training. By adding and progressively reducing a non-equivariance term in intermediate layers, the model explores a broader hypothesis space before converging to an equivariant solution.
Strengths
The paper is clearly written, the topic is significant and of interest to the community.
Weaknesses
Despite confirming the theoretical considerations, the experimental results do not appear to include comparisons with competitor methods (the ones reported in the Related Work section, e.g. Mondal et al. (2023), Basu et al. (2023b), etc.) or to describe and analyze the additional computational costs (time/memory) of the proposed optimization procedure (again, compared to such alternative approaches).
Questions
- What are the additional computational/memory costs of the proposed method? Is there a trade-off between the increased optimization stability and the cost of exploiting the proposed procedure?
- Could the authors give some examples where the proposed method is not applicable, e.g. when the symmetry group is not a matrix Lie group or a discrete finite group? And do they believe such settings are common, i.e. is the proposed method general enough to be applied in real-world scenarios?
Limitations
The authors adequately addressed the limitations.
We thank the reviewer for the positive assessment of our paper and for the comments. Below we address all of the points the reviewer has raised.
Comparison with works on Equivariant Adaptation/Fine-tuning of pre-trained models
Thank you for raising this point. While it is true that both our method and the works of Mondal et al. (2023) and Basu et al. (2023b) share a similar motivation, i.e. that the optimization of equivariant models can be more challenging than that of their non-equivariant counterparts, our paper addresses a different research question. Specifically, our paper focuses on improving the optimization of equivariant architectures themselves, while Mondal et al. (2023) and Basu et al. (2023b) focus on techniques that bypass the need for training equivariant models, e.g. by creating equivariant functions from pre-trained non-equivariant models via canonicalization (as in Mondal et al.).
We believe that this difference in focus between our work and that of Mondal et al. (2023) and Basu et al. (2023b) does not permit a straightforward and fair comparison. Nevertheless, we include a comparison on a point cloud classification task (ModelNet40) below.
| Method | PointNet | DGCNN |
|---|---|---|
| Mondal et al. (2023) | 66.3% | 90.4% |
| Basu et al. (2023b) | 74.9% | 89.1% |
| Original VNN | 66.4% | 88.5% |
| Ours | 74.5% | 92.0% |
Here it is important to note that in the case of Basu et al. (2023b), equivariance is achieved by averaging the results over multiple transformed inputs. As a result, during inference the model is required to perform multiple forward passes, which increases the method's inference time.
We will be happy to include this comparison in the final version (in the main paper or the appendix) if the reviewer believes that it will benefit the overall narrative of the paper.
Additional Details on Computational and Memory costs of the proposed method
We provide a table of the computational overhead of our method compared to the baseline models. Additionally, to illustrate the memory overhead we compare the number of learnable parameters of our method against the baseline equivariant model. We would like to note that this overhead applies only during training, since during inference we remove the additional relaxation layers.
| Model Type | Number of Parameters (Base Model) | Additional Parameters (Ours) | Time per Epoch (Base Model) | Time per Epoch (Ours) |
|---|---|---|---|---|
| PointNet | 1.9M | 6.4M | 75s | 80s |
| DGCNN | 1.8M | 6.2M | 148s | 154s |
| Equiformer | 3.4M | 10M | 52s | 57s |
We will add these details in the appendix.
Trade-off for optimization stability
Could the reviewer clarify further? We might be misunderstanding and would be happy to discuss. In general, however, we think that the optimization stability of our method depends on the projection error term, which can induce a trade-off. We would expect perfectly equivariant models, in which we do not perform any relaxation, to be somewhat harder to optimize than a setup with relaxed weights that do not have a high projection error. In cases where the projection error is too high, the quality of the optimization degrades. Note that in our proposed method we can control the projection error by annealing the value of the parameter $\theta$ of Equation 2.
Examples where the proposed method is not applicable
The approach will generally work for a variety of groups of real-world interest. As described, it works for compact Lie groups such as SO(3), as well as their finite subgroups, quotients, and products. The approach can also be made to work for certain non-compact reductive Lie groups for which we can write expressions for the bracket, such as the Lorentz group. It might be challenging for the approach to work for permutation groups.
I acknowledge that I have read the reviews and rebuttals. Given that the authors have answered the most important questions I raised (and also clarified the trade-off between stability and performance), I lean towards keeping my score of acceptance.
Dear Reviewer, thank you for taking the time to examine our response. We are glad that our responses have been able to answer the points you raised.
The work proposes a method for improving generalization by relaxing hard equivariance and minimizing the equivariance error as an additional regularizer during training.
Strengths
Symmetries play an important role in machine learning and deep learning specifically. There has been recent attention to relaxed forms of equivariance, making it a relevant topic. The paper is well-written.
Weaknesses
There is a lack of attribution to related work, which results in a false sense of novelty. Although the paper does cite many papers on (relaxed) equivariance, it does not always give proper attribution to their contributions, especially since several cited papers in the related work section have proposed forms of relaxed equivariance and even minimizing such “relaxation errors” as a regularization objective, either in their main method or as a baseline. Yet, the paper claims that these “solutions don’t directly focus on the optimization process itself”. This gives a false impression that regularizing the amount of equivariance in the loss is novel, while it is not. As such, it is not entirely clear to me what the contribution of this paper is.
Questions
- Projecting back to equivariant models during testing. The first contribution mentions “projecting back to the space of equivariance models during testing”. Where is this method described? To my understanding, the model is not actually projected; rather, the error in the projection is minimized (but not zero). I could have misunderstood this aspect?
- Regularization term for deep neural networks. In line 165, a regularization term is proposed to encourage equivariant solutions. It seems this term is only for a single layer. For multiple layers, how are the relative importances between layers chosen? Uniformly? Also, it is not clear to me how the overall regularization strength $\lambda_{reg}$ should be chosen. Cross-validation? Similarly for $\theta$.
Limitations
The main contribution of the paper seems to be adding a regularization term that penalizes the ‘equivariance error’. Several works have proposed similar losses and minimizing such regularizers in the training objective. The paper lacks a comparison or discussion between different choices of such regularizers, let alone an empirical comparison. To me it is not clear what the main contribution of this paper is.
We thank the reviewer for sharing constructive feedback. We agree that we can improve the attribution to related work to emphasize the differences from our work. It is certainly true that there are many works on relaxed equivariance, some of which also employ regularization terms that penalize relaxation errors. However, these works tend to assume that the data itself has some (possibly) relaxed symmetry. The regularization term can then be seen as trying to match the symmetry encoded in the model to the symmetry of the data itself. In our case, the main difference, although it may seem subtle, is that we are not looking to correct for model misspecification, although we include experiments for approximately equivariant NNs too. We make the case that even if we assume the model is correctly specified, relaxing the equivariance constraint during optimization and then projecting back to the equivariant space can itself help improve performance. Within the equivariant NNs community it is a common observation that equivariant NNs are harder to optimize than their non-equivariant counterparts [1][2][3]. We take a step towards examining this question in more detail, as we believe this particular question hasn't seen specific exploration.
With the above context in mind, we expand on each of your points below.
Paper Contribution compared to related work
As we said, while there are multiple works on different forms of relaxed equivariance, some of which also introduce regularization that minimizes the equivariance error, the main difference from our work is that they always remain in the space of relaxed equivariant models. Furthermore, in those works the regularization compensates for model mis-specification. We believe that the contribution of our work lies in projecting the models back to the equivariant space, which, contrary to previous works, guarantees that the solution has zero (or fixed) equivariance error. Thus, we are not aiming to address model mis-specification. We will improve the attribution to existing work to emphasize this difference, and to highlight the observation that optimizing equivariant networks can be harder than optimizing their non-equivariant counterparts.
Projecting back to equivariant models during testing
Thank you for your comment. We believe there might be a misunderstanding regarding the projection back to the equivariant space during testing. In Section 3 we define the approximately equivariant linear layer as
$$W' = W_{eq} + \theta W,$$
where $W_{eq}$ is an equivariant linear layer and $W$ is an unconstrained term. As we mentioned in lines 142-143, we can project this linear layer to be exactly equivariant by setting $\theta = 0$. In that case, only the equivariant part of the layer is active, which guarantees that the layer, and as a result the overall model, is equivariant and has exactly zero equivariance error. During inference, we set $\theta = 0$, resulting in an exactly equivariant model. To improve clarity, in addition to the sentence in lines 142-143 we have added an explicit note that the operation of setting $\theta = 0$ is what we refer to as projection in the rest of the paper.
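For illustration, a minimal PyTorch-style sketch of such a relaxed layer is given below. The names and the plain `nn.Linear` used for the unconstrained term are assumptions for readability; in the paper the equivariant and unconstrained components are architecture-specific (e.g. Vector Neuron or Equiformer layers):

```python
import torch
import torch.nn as nn

class RelaxedLinear(nn.Module):
    """Relaxed layer W'(x) = W_eq(x) + theta * W(x): an equivariant map plus a
    theta-scaled unconstrained residual.  Setting theta = 0 projects the layer
    back to the equivariant subspace."""

    def __init__(self, equivariant_layer: nn.Module, in_features: int, out_features: int):
        super().__init__()
        self.equivariant = equivariant_layer                    # W_eq, architecture-specific
        self.unconstrained = nn.Linear(in_features, out_features, bias=False)  # W
        self.theta = 0.0                                        # set externally by the schedule

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.equivariant(x)
        if self.theta != 0.0:
            out = out + self.theta * self.unconstrained(x)
        return out

    def project(self) -> None:
        """Projection used at inference time: drop the unconstrained term."""
        self.theta = 0.0
```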
Regularization Term for Multiple Layers
As shown in the overall training objective (Section 3.3, line 189), the regularization terms derived for each individual layer are added to the loss function with the same weight $\lambda_{reg}$. While it is possible to introduce different weights for the different layers, these weights would heavily depend on the specific network architecture and would increase the complexity of the method. In order to keep the method simple and easily applicable to different architectures, we choose to use the same weight for all layers.
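As a sketch of what this uniform weighting looks like in code (the per-layer `regularization()` hook is hypothetical and stands in for the Lie derivative / projection error terms of the paper):

```python
def total_loss(task_loss, relaxed_layers, lambda_reg: float):
    """Add every layer's relaxation penalty to the task loss with a single
    shared coefficient lambda_reg (uniform weighting across layers)."""
    reg = sum(layer.regularization() for layer in relaxed_layers)
    return task_loss + lambda_reg * reg
```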
Choice of hyperparameters
For the choice of the hyperparameter $\lambda_{reg}$ we performed a standard grid search using cross-validation with an 80%-20% split of the training set into training and validation sets. We found this value to be relatively robust across tasks, so we performed the search on the task of point cloud classification and used the resulting value in all other tasks. We include a figure (Figure 2) in the attached rebuttal PDF which shows how the value of $\lambda_{reg}$ affects the performance of the method when using VN-PointNet as the baseline model. We will add these details and the additional figure to the Appendix of the final version of the paper. We can also include results from tuning the hyperparameters individually in the appendix if the reviewer thinks that would be helpful. However, we must note that the accuracy shown in the figure is on the validation set, with the model trained on 80% of the training set. The results shown in the paper are obtained when we train the model on the complete training set (after the hyperparameter $\lambda_{reg}$ has been chosen).
Regarding the parameter $\theta$: since we perform the scheduling described in Section 3.2, we are not required to do a hyperparameter search. The main constraint on the choice of $\theta$ is that it needs to reach zero at the end of training. As shown in the ablation in Section 4.1, if we do not perform annealing we observe a deterioration in performance.
References used in this response
[1] Yi-Lun Liao and Tess Smidt. Equiformer: Equivariant Graph Attention Transformer for 3D Atomistic Graphs. arXiv:2206.11990.
[2] Rui Wang, Elyssa Hofgard, Han Gao, Robin Walters, and Tess E. Smidt. Discovering Symmetry Breaking in Physical Systems with Relaxed Group Convolution. arXiv:2310.02299.
[3] Risi Kondor, Zhen Lin, and Shubhendu Trivedi. Clebsch-Gordan Nets: a Fully Fourier Space Spherical Convolutional Neural Network. arXiv:1806.09231.
Thank you for the response. I appreciate the authors' efforts to provide additional explanations. To me it seems that the main contribution of the work is the projecting-back step. Although most works do not consider such a projection, I remain concerned about the lack of sufficient contribution relative to prior work (for instance, relating the relaxation used here to prior ones). Further, some experiments demonstrate that some training runs result in better test loss, but I am not sure whether the provided experiments give sufficient evidence for the claims made, namely that introducing some relaxation during training but removing it afterwards through a back-projection is beneficial, especially since the paper lacks a theoretical analysis.
Regarding the claimed contributions: the relaxation used here seems exactly equivalent to a residual pathway prior (RPP) [1]. Although this work is mentioned, its relation to the used relaxation is not. As such, I argue this is not a particularly novel contribution on its own, as claimed. Similarly, the authors do cite a range of papers that introduce parameterizations of relaxed equivariance, but do not explain how these relaxations relate to the relaxation used in this paper. It is true that these works often do not explicitly consider projecting back, as also argued in the rebuttal, but it is not clear to me why this would prohibit a comparison between these relaxations and an explanation of their differences and similarities. Moreover, since projecting back seems to be the primary contribution of this work, why not consider projecting back other common relaxations of equivariance?
Given the above, I lean towards keeping my current score.
[1] Finzi, M., Benton, G., and Wilson, A. G. Residual pathway priors for soft equivariance constraints. In Advances in Neural Information Processing Systems, volume 34, 2021.
Thank you for the comments, we appreciate your engagement with our work and response.
Regarding the attributions to prior work
We again agree with the reviewer's suggestion that improving the relation to previous works on relaxed equivariance would be useful. We will add a more detailed discussion along the lines stated in the rebuttal and the responses. In addition, we will add further discussion of the similarities between our proposed method and Residual Pathway Priors (RPP). While we agree with the reviewer that the mechanism used to perform our relaxation is similar to the one used in RPP, we would like to note that the process for updating the level of relaxation during training is significantly different between the two works, since their motivation and focus are different. Specifically, while RPP assumes only partial knowledge of the degree of equivariance for a given task and designs a training process that allows this knowledge to be updated from the data, we assume definitive knowledge of the symmetries of the given task and show that optimizing over a larger space of relaxed equivariant networks and projecting back onto the equivariant space can help optimization. We will add the above discussion to Section 3 of our paper.
Regarding the Contribution of this work
As we discussed in the rebuttal and as was also mentioned by the reviewer, one of the main contributions of this work is the observation that even when we know the exact symmetries of the task, it can still be beneficial to train over a larger space of relaxed equivariant models and project back to the original constrained equivariant space during inference. Since this observation is not discussed in previous works on relaxed equivariant networks, we believe that it constitutes a standalone contribution that can motivate further research in this unexplored area.
To support our claims, we propose a simple training process that allows us to efficiently control the level of relaxation of the equivariance constraint during training and to project back to the equivariant space during inference. In Section 3 of our paper, we discuss the motivation for the specific choices used in our method, including the specific form of the relaxed equivariant linear layer. By evaluating our method on a diverse set of equivariant networks and tasks, we show that our proposed relaxation during training results in increased performance of equivariant networks compared to the performance achieved by standard training of such networks solely in the equivariant space. We believe that the experimental evaluation presented in the paper, along with the additional results we added in the rebuttal following suggestions from the reviewers, provides sufficient empirical evidence of a phenomenon that can be beneficial for improving the optimization of equivariant networks and is not documented in previous literature. Therefore, we think that the projection step (which may seem very similar to works on relaxed equivariance) is only a way to get there and can serve as a baseline; the main contribution can also be seen as exploring how the optimization of equivariant networks with fixed symmetries can be improved. Nevertheless, we are happy to incorporate additional suggested experiments that the reviewer believes would strengthen our claims.
We thank all of the reviewers for their comments and constructive feedback on our paper. Here we would like to provide an overview of some of the main points of our individual responses to the reviewers and also take the opportunity to highlight the main contributions of our work:
In this work, we propose a novel training procedure for equivariant neural networks that relaxes the equivariant constraint during training and projects back to the space of equivariant models during inference. As we discuss in more detail in our responses to reviewers 3CAa, b5uM, the differentiating factor between our paper and previous works on relaxed equivariance is that while we optimize over a larger space of relaxed equivariant models, at the end of optimization we project back to the space of equivariant models. This process allows us to learn models that have increased performance, compared to equivariant models trained with standard training, while at the same time, they have exactly zero equivariant error.
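To summarize the procedure end to end, the following sketch combines the hypothetical pieces sketched in the responses above (the triangular `theta_schedule`, the per-layer regularization hook in `total_loss`, and the `project()` call); it is an illustration of the training recipe, not the authors' implementation:

```python
def train_with_relaxation(model, relaxed_layers, train_loader, criterion,
                          optimizer, total_epochs: int, lambda_reg: float):
    for epoch in range(total_epochs):
        theta = theta_schedule(epoch, total_epochs)   # anneal relaxation strength
        for layer in relaxed_layers:
            layer.theta = theta
        for x, y in train_loader:
            loss = total_loss(criterion(model(x), y), relaxed_layers, lambda_reg)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    for layer in relaxed_layers:
        layer.project()   # theta = 0: exactly equivariant model at inference
    return model
```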
Following the suggestions of the reviewers, in the attached rebuttal PDF we provide additional ablations showing the effect of the individual components of our method on the overall performance (e.g. Lie derivative regularization, scheduling, projection to the equivariant space). Specifically:
- Figure 1 extends the results of the ablation study provided in the submitted paper, showing the performance of different versions of our method when different components are removed.
- Figure 2 shows the sensitivity of our method to the choice of the regularization coefficient $\lambda_{reg}$.
- Finally, in Figure 3 we provide a comparison between a model trained with our proposed method and a relaxed equivariant model trained without $\theta$ scheduling. For the relaxed equivariant model with constant $\theta$, we show that its overall performance is worse than that of our method even before we project it onto the equivariant space.
We believe that improving the training of equivariant neural networks is an important research question that can benefit the community and that our work is a positive step in that direction. We appreciate all the comments raised by the reviewers as they helped us enhance the presentation of this work, and we try to address all of them in our individual responses.
This paper advocates facilitating training of equivariant networks by relaxing the model’s equivariance during training. In particular, it adds and progressively removes the non-equivariant terms in the training process.
Reviewers found this paper clear and the motivation of improving the training of equivariant models a good one. They considered the idea of progressing to an equivariant solution during training mostly new, and appreciated the ``small yet consistent'' improvement in results. Some concerns were raised regarding the similarity to previous works, since both the way the current paper adds non-equivariant terms and the usage of the Lie derivative loss to encourage equivariance appear in previous works. However, the notion of forcing zero equivariance loss by the end of training seems novel. Other concerns were the missing empirical and/or theoretical analysis, and during the rebuttal the authors added more experiments and ablations of different components of their method. The authors also addressed issues of missing comparisons and memory/time cost tables. Overall, it seems most of the reviewers' critical points were addressed. The only remaining negative reviewer makes relevant claims (mostly the contribution concerns mentioned above), but the overall impression is that the paper has merit and passes the bar for publication.
We request the authors to incorporate the promised changes from the rebuttal in the camera ready version of this paper.