Energy Loss Functions for Physical Systems
We use energy differences as loss functions for physics applications
Abstract
Reviews and Discussion
The authors propose to replace standard training objectives like MSE or cross-entropy with energy-based loss functions that embed known physics of a system into the training signal: each datapoint is treated as an equilibrium state, and the loss is simply the (approximate) energy difference between the model's prediction and that state, evaluated under a surrogate potential that respects the system's symmetries. The authors construct losses of this type for atomistic coordinates, diffusion-model training, and discrete spin systems and show that they are architecture-agnostic and symmetry-invariant without costly alignments. Across shape-generation, molecular-generation, and spin-glass tasks, they demonstrate that this physically grounded loss yields higher-quality samples and better data efficiency on standard datasets and models.
Strengths and Weaknesses
Strengths
- The idea is creative and a refreshing rethink of foundational concepts of ML
- Through examples and theory, the authors show that their energy-based loss functions mirror real energy landscapes and can respect rigid motions and permutations. This is a strong property, since e.g. default DDPM optimizer gradients can pull predictions away from valid symmetry-equivalent structures.
- The versatility of the recipe for constructing such loss functions is a strength; it works for several machine learning tasks without requiring architectural changes.
- The losses lead to better gradients and avoid pathologies of, e.g., MSE.
- On the benchmarks shown, the energy-based loss functions lead to tangible improvements.
Weaknesses
- The authors derive the loss function from a Boltzmann distribution; the derivation only makes sense here if training samples are near local minima of the true energy landscape. The authors note this assumption, but it is unclear to me how crucial it is to the framework. For example, if our data comes from MD simulations and contains, e.g., non-equilibrium and transition states, can this result in loss gradients pointing the wrong way in situations where MSE stays agnostic?
- The driving example of the paper uses a quadratic pair potential which results in a loss that scales with the number of particles N squared. While the authors mitigate this issue by selecting an O(N) subset of spring edges from a globally rigid graph, those missing long-range edges no longer penalise collective shears or electrostatic couplings, so it is unclear whether the sparse loss loses some of its informative gradient signal, even if its optima remain unchanged in theory. Table 2 shows that results are comparable to the non-sparse energy for QM9, but these are all small systems. I wonder what the tradeoff will be for systems that are larger and thus where the sparsification will be most significant for runtime.
- Generally, I would have appreciated wall-clock comparisons of the default losses vs. the dense and sparse training losses, preferably as a function of molecule size.
- While the paper's derivations are mostly convincing, I find the experimental section thin where it matters most for adoption. From an application point of view, the molecule generation task with GDM and EDM is the most interesting, and the paper lacks a comparison that clearly indicates whether it can push molecule generation tasks beyond the state of the art. For example, it would have been useful to see numbers from other publications reported in the paper's tables, in particular if the same architecture was used. For example, the JODO paper (Huang et al.) reports a molecular stability number for QM9 of 93.4%, which is higher than the 89.8% reported in the present paper. It is plausible that the authors' energy-based loss objective could outperform if a similar newer architecture was used, but it is unclear whether the loss function improvement is orthogonal to the improvements from newer architectures with more embedded physics. I would also find this very interesting in terms of the molecular stability on the GEOM-Drugs dataset, since this number seems to be either remarkably bad or remarkably good (I believe the latter is the case, but see the next point). It is, however, simply unclear where these numbers place the authors compared to prior work, and the experiments section thus does not lift this paper from theoretically interesting to broadly applicable, which I very well believe it could be.
- Some background is missing. It would be useful to understand exactly how the stability was calculated, since the molecular stability number can be calculated in several ways (See e.g. Table 2 in the JODO paper, Huang et al., 2023). It would also be useful to discuss some of the points made in the discussion of stability calculations in GEOM-Drugs presented in Nikitin et al. (GEOM-Drugs Revisited:.. 2025), although this paper is very new and the authors haven't had a chance to include these results/discussions in their paper. Generally, performing and discussing experiments with newer architectures and clearly reporting numbers that are fully comparable to previous publications could contribute to a better understanding of whether a modern architecture paired with an energy-based loss could achieve state-of-the-art molecular generation. If affirmative, this would very significantly impact my assessment of this paper.
Questions
Your derivation relies on the data being close to local minima of the (unknown) true energy. How sensitive is the loss to violations of that assumption? Can you discuss or demonstrate this?
Could you provide numbers on the wall-clock time and efficiency/accuracy tradeoffs of the different loss types on the more realistic benchmarks? The small reduction in accuracy for the sparse objective on QM9 would look stronger if you could demonstrate how much compute is saved.
Could you please clarify exactly how molecular stability was calculated and where the numbers place your results compared to prior publications? Preferably you could include numbers from other publications in the tables you already have produced.
In my opinion, the biggest improvement would be to show that using one of the more modern architectures (e.g. JODO or EQGAT) trained with your energy-based loss similarly leads to better results on the metrics typically reported. This would very likely push results beyond current state-of-the-art if the improvements are comparable. Even one small-scale experiment (same hyper-parameters, swapped-in loss) would show whether the gains you see are orthogonal to architecture advances, and would strengthen the claim that the loss is generally useful.
Limitations
Yes
Final Justification
I choose to raise my initial score to 4, a borderline accept.
My reasons are that I appreciate the thorough rebuttal and added experiments. The JODO+Energy runs on QM9 demonstrate that the loss provides consistent gains beyond the original architectures, and the wall-clock discussion plus the sparse, globally-rigid construction convinces me runtime isn’t a blocker. Those were my main practical reservations, and they’ve been adequately addressed.
My remaining caveat is breadth. This work proposes a new loss family, so I expect wider evaluation across task types and datasets—not mostly generative settings. The current results show it works where tested, which is meaningful, but to move from a promising direction to something with general impact, I need to see the method tried on a mixed set of tasks (e.g., supervised/conditional regression, inverse problems/dynamics, larger systems) and situated with standardized comparisons. Even if the pending GEOM-Drugs result lands well, that alone doesn’t resolve the central question of generality. I also note, more lightly, that the reverse-KL derivation presumes near-equilibrium data and the diffusion analysis holds best at low noise; neither is fatal, but both strengthen the case for breadth.
Weighting these points: the idea is simple, principled, and architecture-agnostic; the added evidence shows real gains with acceptable cost, and I would lean towards accept. I stop short of an outright accept (i.e., a score of 5) because the experimental scope doesn't yet match the ambition of introducing a general-purpose loss.
Formatting Issues
None
We thank the reviewer for the review and useful feedback. We appreciate that they have found our idea creative and sound, the theory convincing and that experiments show meaningful improvements.
We address their concerns and questions in detail below:
Comparison with more recent architectures
We thank the reviewer for this suggestion, and agree. Our goal is to show that improvements resulting from the energy loss are orthogonal to architectural improvements and can indeed push the state-of-the-art. Following the reviewer’s suggestion, we have therefore run experiments using the recently proposed JODO architecture on QM9.
Given the time constraint, we ran the model for 500k steps rather than the 1.5M reported in the paper. For transparency, we include both settings for JODO. We can update these results later in the discussion period.
Metric-3D
| Model | Atom stable ↑ | Mol stable ↑ | Validity ↑ | Completeness ↑ | FCD ↓ |
|---|---|---|---|---|---|
| JODO (paper) | 99.2% | 93.4% | — | — | 0.885 |
| JODO (500k steps) | 98.96% | 91.57% | 98.04% | 97.93% | 0.839 |
| JODO + Energy Loss (500k steps) | 99.53% | 96.00% | 98.29% | 98.19% | 1.558 |
Metric-Align
| Model | Bond ↓ | Angle ↓ | Dihedral ↓ |
|---|---|---|---|
| JODO (paper) | 0.1475 | 0.0121 | 6.29e-4 |
| JODO (500k steps) | 0.1994 | 0.0189 | 0.001029 |
| JODO + Energy Loss (500k steps) | 0.1464 | 0.0164 | 0.006370 |
Our method outperforms JODO (which uses an equivariant network and Kabsch alignment) on nearly all 3D metrics and produces more accurate bond lengths and angles than JODO at matched training compute. We note that the reported run uses exponential coefficients, a choice based on our results with EDM. It is possible that other forms of coefficients would be preferable when targeting other metrics.
Local minimum assumption
Using the true energy would rely on the assumption that the data is at a minimum: if the data is not at a local minimum, the assumption is violated and the model would no longer be trained to fit the data. This is one of the reasons why we argue against using the true energy. We construct our approximation such that, for any data point (equilibrium or not), the data sits at a minimum of the approximate energy, while the landscape still captures energetic considerations. The answer to the reviewer's question is therefore that the loss used in the paper will not produce gradients pointing in wrong directions. The MSE loss can be seen as the simplest possible energy approximation that yields a minimum at the data, but with an unphysical landscape around this minimum; in practice, what we really aim for is an improvement over MSE. We will augment and clarify the discussion of this point in the paper.
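To make the contrast concrete, here is a minimal PyTorch sketch (illustrative only, not the paper's exact implementation; the function names and the exponential coefficient are hypothetical stand-ins for the paper's distance-dependent coefficients). Both losses are minimized exactly at the data, but only the pairwise energy is invariant to rigid motions of the prediction:

```python
import math
import torch

def mse_loss(pred, target):
    # Quadratic bowl in Cartesian coordinates; not invariant to
    # rotations/translations of the prediction.
    return ((pred - target) ** 2).sum(dim=(-1, -2)).mean()

def pair_energy_loss(pred, target, coeff=lambda d: torch.exp(-d)):
    # Surrogate "spring" energy on pairwise distances: for ANY reference
    # configuration (equilibrium or not), the minimum sits exactly at the
    # reference distances, so gradients cannot point away from
    # symmetry-equivalent structures.
    d_pred = torch.cdist(pred, pred)      # (B, N, N) predicted distances
    d_ref = torch.cdist(target, target)   # (B, N, N) reference distances
    return (coeff(d_ref) * (d_pred - d_ref) ** 2).sum(dim=(-1, -2)).mean()

x = torch.randn(2, 28, 3)  # batch of 28-atom configurations
c, s = math.cos(0.3), math.sin(0.3)
R = torch.tensor([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
print(mse_loss(x @ R.T, x))          # > 0: MSE penalizes a pure rotation
print(pair_energy_loss(x @ R.T, x))  # ~ 0: invariant to the rotation
```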
Wall-clock speed
We have included wall-clock times on a single NVIDIA L40s following your suggestion. Our main objective in including the sparse energy loss was to demonstrate our method can efficiently generalize to systems with many particles where loss calculation may contribute significantly to running time (e.g. very large point clouds). This is not the case for molecules, where the neural network (GNN or Transformer) is typically fully connected and thus scales as N^2. In the results below on QM9 with GDM, the most expensive loss calculation is less than 1% of the total backward and forward time.
| Component | Loss Type | Time (ms) |
|---|---|---|
| Loss computation | MSE | 0.1840 ± 0.0084 |
| | Energy | 0.5100 ± 0.0161 |
| | Sparse Energy | 0.5712 ± 0.0242 |
| | Kabsch Align | 1.1361 ± 0.0300 |
| Forward pass | – | 73.55 ± 16.38 |
| Backward pass | – | 94.47 ± 3.37 |
| Optimizer step | – | 1.43 ± 0.0072 |
To better understand the scale at which this becomes a relevant consideration and the utility of the sparse energy loss, see the following wall-clock times from the shape generation setting:
| # Nodes | Energy (ms) | Sparse Energy (ms) | MSE (ms) | Kabsch Align (ms) |
|---|---|---|---|---|
| 30 | 0.240 ± 0.0021 | 0.255 ± 0.0012 | 0.058 ± 0.0004 | 0.807 ± 0.0027 |
| 300 | 0.245 ± 0.0011 | 0.257 ± 0.0024 | 0.059 ± 0.0004 | 0.804 ± 0.0048 |
| 3000 | 20.120 ± 0.0055 | 0.275 ± 0.0018 | 0.0779 ± 0.0008 | 0.944 ± 0.0028 |
| 30000 | - | 0.293 ± 0.0029 | 0.0740 ± 0.0011 | 3.242 ± 0.1452 |
| 300000 | - | 2.652 ± 0.0050 | 0.131 ± 0.0004 | 24.381 ± 0.0272 |
At 30000+ nodes, the energy loss requires too much memory to compute. Importantly, the sparse energy is cheaper than the Kabsch Align by a factor of 5-10x.
Note that all losses have some constant cost, not scaling with N, that contributes to the wall-clock time. This explains why for QM9 (avg. 28 atoms) the energy loss is marginally faster than the sparse energy, and why in the scaling table wall-clock times start to increase with N only after a certain point.
Interestingly, using an equivariant network with EDM takes 129.71 ± 0.044 ms for the forward pass and 180.07 ± 0.070 ms for the backward pass. Using the energy loss imparts a 0.3% increase on one backward pass through the model, whereas using an equivariant architecture imparts a 94% increase while providing inferior benefits. A crucial conclusion is that the energy loss with a non-equivariant architecture yields more improvement than using an equivariant architecture, at negligible computational cost, which we think is a significant finding. We will discuss it further in the main text.
Sparse loss on larger systems
This is a good suggestion; we plan to include results using the sparse version of the loss for GEOM-Drugs, which contains larger molecules than QM9, during the discussion period. We also want to emphasize that even if some physical information were lost, the sparse loss would still be better than MSE, since it at least includes some physical information while having the same minimizer. That being said, we consider generalizing the proposed sparse energy approximation to explicitly account for long-range effects (as well as other contributions to the energy) a very relevant avenue for future work.
Stability calculation
You are correct that we use the 3D calculation for molecular stability as in EDM and the second part of Table 2 in JODO. A molecule is stable if all atoms have the correct valence. Valence is determined by bonds that are drawn according to a bond distance table. Our new results show this metric can be improved by 3-5% when using energy loss over MSE.
We hope this response addresses your concerns. If any further clarification would help you reconsider your overall assessment, we would be happy to provide it.
I thank the authors for the detailed rebuttal and for running the JODO+Energy experiments on QM9. I appreciate that the authors show that the loss term consistently boosts atom- and molecule-stability over JODO's baseline, even at 500k steps. I also appreciate the discussion on the wall-clock times, and I understand that my question on this may not have been as relevant as I initially thought. However, I thank the authors for conducting the experiments and cementing that this is not an issue.
Naturally, I view the molecular-generation experiments as the most important part of this paper. I think it's a very good idea to run the sparse version of the loss for GEOM-Drugs. In the paragraph that promises this experiment during the discussion phase, the authors claim that "it would still be better than MSE". I am inclined to believe this, given the other results, but empirical validation is important.
To the extent possible in the remaining time, I think it would be great to see the following:
- Authors should report the promised GEOM-Drugs results with the sparse energy loss, ideally alongside the same stability metrics used in Nikitin et al., so we can directly compare large-molecule performance.
- Train JODO+Energy out to 1.5M steps, as the authors also mentioned could be done in the discussion period (matching JODO's original training schedule), to see whether the energy loss can actually break past JODO's published ceiling (particularly on the Align metrics) rather than matching it at an earlier stopping point. Regardless, I do acknowledge that a faster model is also desirable, even if results are the same.
I view the above experiments both as strong validation of the proposed energy loss idea, and think they are required to give a complete picture.
Thank you for constructive feedback and for engaging with our work thoroughly. We believe your suggestions have significantly strengthened our experimental evaluation.
We include the final results on JODO below and will update with the sparse GEOM-Drugs experiment tomorrow. Notably, using the energy loss has allowed us to significantly improve the results in the JODO paper across almost all metrics with minimal tuning, providing evidence that energy loss can indeed push the state-of-the-art.
Metric-Align
| Model | Bond ↓ | Angle ↓ | Dihedral ↓ |
|---|---|---|---|
| JODO (paper) | 0.1475 | 0.0121 | 6.29e-4 |
| JODO (ours) | 0.1218 | 0.0110 | 5.91e-4 |
| JODO + Energy Loss (Inv. Dist) | 0.1125 | 0.0046 | 4.95e-4 |
| JODO + Energy Loss (Exp. Dist) | 0.0928 | 0.0142 | 4.97e-3 |
Metric-3D
| Model | Atom stable ↑ | Mol stable ↑ | Validity ↑ | Completeness ↑ | FCD ↓ |
|---|---|---|---|---|---|
| JODO (paper) | 99.2% | 93.4% | — | — | 0.885 |
| JODO (ours) | 99.21% | 92.76% | 95.63% | 95.50% | 0.8536 |
| JODO + Energy Loss (Inv. Dist) | 99.35% | 94.26% | 97.11% | 97.02% | 0.8921 |
| JODO + Energy Loss (Exp. Dist) | 99.61% | 96.59% | 98.44% | 98.39% | 1.4945 |
For transparency, we have included both the JODO results from our run and those in the paper, along with the energy loss using two different types of coefficients. Note that our results for Inv. Dist outperform JODO on all Align metrics and nearly all 3D metrics, with comparable FCD.
We hope these results will significantly impact your assessment of our work and we are happy to answer any further questions.
I thank the authors for their detailed further empirical evaluation and their clarification on my issues. The energy loss looks like a nice tool, and I have therefore raised my score.
We thank the reviewer for their suggestions. We include the sparse energy results on GEOM-Drugs below. We found that a more gradual distance decay in the coefficients worked better when the edges are sparse and random. These results highlight a potential compute-performance tradeoff for this version of the loss on larger graphs.
| Loss | Mol. stab. (%) | Atom stab. (%) | Valid. (%) | Unique (%) |
|---|---|---|---|---|
| MSE | 0.3 | 84.7 | 93.8 | 100 |
| Energy | 21.1 | 95.8 | 89.6 | 100 |
| Sparse Energy (Inv. Dist) | 7.4 | 91.89 | 92.60 | 100 |
This paper proposes a method to incorporate knowledge of physical systems into the loss function, whereas previous approaches have primarily focused on model architecture. This method is realized through a reverse KL-type loss function, which allows for an analytical expression of the loss. The authors demonstrate that this approach can be applied to diffusion models and enables the incorporation of invariances, with theoretical justification. Finally, they show the effectiveness of the method through several experiments, including molecule generation and spin ground-state prediction.
Strengths and Weaknesses
Strengths
- Overall, the paper is clearly written and easy to follow.
- The idea of circumventing the intractability of the normalizing term by using the reverse KL divergence is simple yet interesting.
- The proposed approach is broadly applicable to various machine learning problems, including generative tasks with diffusion models.
Weaknesses
- The difference between incorporating physical knowledge into model architectures and into loss functions should be more clearly articulated. At present, it remains unclear how important it is to include physical information in the loss.
Questions
- This energy-based physical knowledge can also be incorporated into model architectures. Could you discuss the strengths and weaknesses of incorporating it into the loss function, compared to incorporating it into the architecture?
- In Section 6.2, the results show that the energy-based loss improves stability. Could you provide a more detailed explanation of why the proposed approach leads to such stability?
- Can this approach be applied to other generative models, such as flow matching? Additionally, do the theoretical results discussed in Section 4.2 scale to those cases?
Limitations
Yes.
Final Justification
I have understood the benefit of incorporating constraints into the losses. While I acknowledge that this research is significant and the proposed method is convincing, the technical novelty and theoretical soundness are relatively incremental, which prevents me from assigning a higher score.
Formatting Issues
No.
We thank the reviewer for their review. We are happy that they have found the paper clearly written, the method interesting and broadly applicable and that they only note one weakness.
We address it and their questions below:
Difference between incorporating physical biases in architecture or in loss function
This is a good question. One type of physically motivated inductive bias that can be incorporated in the architecture is equivariance (if the reviewer has something else in mind, please let us know). We argue that the benefits of an invariant loss function are complementary to those of an equivariant architecture. Essentially, equivariance guarantees that for a given input-output pair, the output associated with a transformed input will be the transformed output. However, it does not guarantee that the output will be the correct one; it ensures generalization to samples related by symmetry. The energy loss helps in learning the correct output by appropriately weighting the errors and gradients during training. In the case of symmetries, instead of being forced to regress the original configuration, the network can learn to regress any configuration related by symmetry, which equivariance alone does not allow. This results in a reduction of variance in the gradients, as we show in Proposition 4.5. Thus, using the energy loss results in significant improvements even for equivariant architectures (see results in Appendix C.2.1, Table 7 and new results on the more modern equivariant architecture JODO in the reply to reviewer uwRJ).
A crucial point is that our results show that the energy loss with a non-equivariant architecture results in more improvement than using an equivariant architecture, at a much cheaper cost, which we think is a significant finding. We will discuss it further in the main text.
There is therefore an important difference between loss level and architecture level improvement. We will improve the discussion of this in the paper, following your feedback.
Improvement in stability
The stability metric is determined by valency checks on all of the atoms within a molecule. This valency is determined by the bonds drawn between atoms; for the task of 3D molecule generation, bonds are assigned via a lookup table mapping the interatomic distances predicted by the model to bond orders. Since our loss encourages the model to generate molecules with accurate interatomic distances, this results in greatly improved stability. This result adds evidence to the usefulness of having a loss that is more physically motivated than MSE.
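As a rough illustration of the metric just described (a minimal sketch, not the EDM/JODO implementation: the table values, tolerance, and valence sets below are hypothetical placeholders, and real pipelines use finer per-element, per-bond-order tables):

```python
import math

# Hypothetical lookup values for illustration only (Angstrom).
BOND_LENGTH = {("C", "C"): 1.54, ("C", "H"): 1.09, ("C", "O"): 1.43,
               ("H", "O"): 0.96, ("H", "H"): 0.74}
ALLOWED_VALENCE = {"H": {1}, "C": {4}, "O": {2}}

def molecule_is_stable(elements, coords, tol=0.1):
    # Draw a bond when an interatomic distance is within `tol` of the
    # tabulated length, then require every atom to have an allowed valence.
    valence = [0] * len(elements)
    for i in range(len(elements)):
        for j in range(i + 1, len(elements)):
            ref = BOND_LENGTH.get(tuple(sorted((elements[i], elements[j]))))
            if ref is not None and abs(math.dist(coords[i], coords[j]) - ref) < tol:
                valence[i] += 1
                valence[j] += 1
    return all(v in ALLOWED_VALENCE[e] for e, v in zip(elements, valence))
```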
Extension to flow matching
The energy loss can indeed be applied to Gaussian flow matching [1, 2]. We show hereafter the correspondence for the conditional vector field of [1]. The noisy sample is given by the interpolation

$$x_t = t\,x_1 + (1-t)\,x_0, \qquad x_0 \sim \mathcal{N}(0, I).$$

The flow matching objective aims at regressing the vector field:

$$\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{t,\,x_0,\,x_1}\big[\,\|v_\theta(x_t, t) - (x_1 - x_0)\|^2\,\big].$$

Given a vector field prediction, the corresponding sample prediction is

$$\hat{x}_1 = x_t + (1-t)\,v_\theta(x_t, t).$$

The correspondence between MSE on the vector field prediction and on the sample prediction is therefore:

$$\|v_\theta(x_t, t) - (x_1 - x_0)\|^2 = \frac{1}{(1-t)^2}\,\|\hat{x}_1 - x_1\|^2.$$

Therefore, the associated energy objective is obtained by replacing the regression MSE $\|\hat{x}_1 - x_1\|^2$ with the energy loss, keeping the $1/(1-t)^2$ weighting.

Our theoretical results relating to score estimation properties also transfer to flow matching. This is because Gaussian flow matching also implicitly provides a method for score estimation similar to diffusion models [3]. Given the optimal vector field $v^*(x, t)$, the score is given by

$$\nabla_x \log p_t(x) = \frac{t\,v^*(x, t) - x}{1-t}.$$
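A minimal sketch of one training step under this correspondence (a sketch under the linear-interpolation assumption above; `model` and `energy_fn` are placeholder names, with `energy_fn(pred, target)` standing in for the per-example energy loss):

```python
import torch

def fm_energy_step(model, x1, energy_fn, eps=1e-3):
    # x1: (B, N, 3) clean samples; linear path x_t = t*x1 + (1-t)*x0 assumed.
    x0 = torch.randn_like(x1)                     # Gaussian base sample
    t = torch.rand(x1.shape[0], 1, 1) * (1 - eps) # avoid t -> 1 blow-up
    xt = t * x1 + (1 - t) * x0                    # interpolant
    v = model(xt, t)                              # predicted vector field
    x1_hat = xt + (1 - t) * v                     # implied sample prediction
    w = 1.0 / (1 - t.view(-1)) ** 2               # MSE <-> sample correspondence
    return (w * energy_fn(x1_hat, x1)).mean()     # energy replaces squared error
```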
We will discuss the extension to flow matching in the revised paper.
Additional results
Finally, following the suggestion from reviewer uwRJ, we have significantly improved our experimental evaluation by obtaining results on a more recent, state-of-the-art equivariant architecture for molecular generation (JODO [4]). Replacing the loss function used in this model by the energy loss function resulted in an improvement across nearly all 3D metrics. Our method therefore provides complementary gains to state-of-the-art architectures.
We hope this response addresses your concerns. If any further clarification would help you reconsider your overall assessment, we would be happy to provide it.
[1] Lipman et al. Flow matching for generative modelling, 2023
[2] Liu et al. Flow straight and fast: learning to generate and transfer data with rectified flow, 2023
[3] Albergo et al. Stochastic interpolants: A unifying framework for flows and diffusions, 2023
Thank you for your reply.
I have understood the benefit of incorporating constraints into the losses. After the discussion during the rebuttal period, I am currently inclined to raise my score to 4. While I acknowledge that this research is significant and the proposed method is convincing, the technical novelty and theoretical soundness are relatively incremental, which prevents me from assigning a higher score.
We thank the reviewer for recognizing our work as significant and convincing while recommending it for acceptance.
We agree with the reviewer that our method is simple. We view this as a strength that allows for easier adoption by the community. We respectfully disagree that the theoretical soundness is incremental. The energy loss is well motivated by viewing the loss as a learned energy landscape, following directly from the reverse KL and the Boltzmann distribution. This not only provides us with a practical method*, but also offers a new perspective on a fundamental concept in ML, as highlighted by reviewer uwRJ. We also show our loss is uniquely minimized at the data with appropriate symmetries and prove theoretical results applying generally to invariant loss functions, for example the variance reduction of score estimates.
* Please see our new results in improving JODO, a near-SOTA molecular generation model, just by swapping in our loss, in response to reviewer uwRJ's suggestion
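For concreteness, a schematic one-line version of that motivation, under the Boltzmann form $p(x) \propto e^{-\beta E(x)}$ assumed throughout:

```latex
\mathrm{KL}\!\left(q_\theta \,\|\, p\right)
  = \mathbb{E}_{q_\theta}\!\left[\log q_\theta(x)\right]
  + \beta\,\mathbb{E}_{q_\theta}\!\left[E(x)\right]
  + \log Z ,
```

so the intractable normalizer $\log Z$ is constant in the model parameters, and only energies (hence energy differences) enter the training signal.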
The paper introduces novel energy-based loss functions for physical systems, like molecules and lattice systems. Although the authors mention that the energy of the target system could be utilized directly if available, they propose a different strategy and explain why using the true energy function can be infeasible. For many-particle systems like molecules, they propose a loss based on distances between particles. In this case, the energy is given by the difference between the distance matrices of the prediction and the target. They use this to derive an alternative loss for diffusion models for many-particle systems, evaluated on small molecules, showing improvements over traditional diffusion models. Moreover, they also investigate a spin lattice system, where they use a different energy function based on the difference between the predicted spin and the target spin and its surroundings.
Strengths and Weaknesses
Strengths:
- The paper is well written and clearly motivated.
- The idea of using a surrogate energy function in the loss in the proposed way is, to the best of my knowledge, novel. However, there are some related methods that utilize the actual target energy in the loss, which should be discussed in the paper (see suggestions).
- They evaluate their method on very different systems, i.e., molecules and a lattice spin system, showing in both cases the advantages of their proposed loss.
- As there are many applications of diffusion models for molecular systems, the proposed energy based loss instead of the usual diffusion loss might be useful for many applications.
Weaknesses:
- See suggestions and questions
Questions
Suggestions:
- Many flow models, especially Boltzmann Generators, use the target energy function during training. Although this is different from what is discussed in this paper, I think it should be mentioned in the related work section. Potentially the abstract and introduction should also be changed accordingly, as the phrase "previous approaches mostly focused on incorporating physical insights at the architectural level. In this paper, we propose a framework to leverage physical information directly into the loss function ..." is not correct.
- Line 44: As with the concurrent work of Klein et al., 2023, "Equivariant Flow Matching with Hybrid Probability Transport" by Yuxuan Song et al. should be cited, as they introduce the same training objective.
Questions:
- Could the energy diffusion loss also be directly applied to flow matching?
- Why is the number of valid molecules in Table 1 lower for the energy-based loss? This seems counterintuitive, but maybe there is a good explanation. Moreover, are there error bars for the results reported in this table?
Limitations
Yes
Final Justification
The authors were able to address all my concerns and provided insightful answers to my questions.
Formatting Issues
No concerns
We thank the reviewer for their review. We appreciate that they have found the paper well-written, the idea novel, and the evaluation comprehensive.
We answer their questions hereafter:
Discussion of Boltzmann generators
Discussion of Boltzmann generators is included in the related work (in Appendix B). In sampling, the target energy is queried and therefore plays a direct role. In generative modelling and regression, however, the true energy is generally not available (see Section 3 of the manuscript for discussion). Our work in considering an energy loss is novel in this pure generative modelling context, for which only data is available. We will nuance the statements referred to by the reviewer to address their concern, and will be happy to move some discussion of Boltzmann generators to the main text.
Related papers
Thank you for the suggestion, we will cite Yuxuan Song et al. as another example of an invariant loss achieved using alignment.
Extension to flow matching
The energy loss can indeed be applied to Gaussian flow matching [1, 2]. We show hereafter the correspondence for the conditional vector field of [1]. The noisy sample is given by the interpolation

$$x_t = t\,x_1 + (1-t)\,x_0, \qquad x_0 \sim \mathcal{N}(0, I).$$

The flow matching objective aims at regressing the vector field:

$$\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{t,\,x_0,\,x_1}\big[\,\|v_\theta(x_t, t) - (x_1 - x_0)\|^2\,\big].$$

Given a vector field prediction, the corresponding sample prediction is

$$\hat{x}_1 = x_t + (1-t)\,v_\theta(x_t, t).$$

The correspondence between MSE on the vector field prediction and on the sample prediction is therefore:

$$\|v_\theta(x_t, t) - (x_1 - x_0)\|^2 = \frac{1}{(1-t)^2}\,\|\hat{x}_1 - x_1\|^2.$$

Therefore, the associated energy objective is obtained by replacing the regression MSE $\|\hat{x}_1 - x_1\|^2$ with the energy loss, keeping the $1/(1-t)^2$ weighting.

Our theoretical results relating to score estimation properties also transfer to flow matching. This is because Gaussian flow matching also implicitly provides a method for score estimation similar to diffusion models [3]. Given the optimal vector field $v^*(x, t)$, the score is given by

$$\nabla_x \log p_t(x) = \frac{t\,v^*(x, t) - x}{1-t}.$$
We will discuss the extension to flow matching in the revised paper.
Validity metric
The reason for this decrease is not fully clear: we only observe a decrease for the GEOM-Drugs dataset, and it occurs despite a significant increase in stability, which we attribute to the physical inductive bias of our method. Validity is measured by first converting the 3D molecules into SMILES strings and then using RDKit sanitization, which is a more opaque metric than stability. Due to the significant expense of training on GEOM-Drugs, we did not obtain error bars, following [4,5].
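For reference, a minimal sketch of the validity computation as described (assuming RDKit's default sanitization; the exact pipeline in the paper may differ):

```python
from rdkit import Chem

def validity(smiles_list):
    # A molecule counts as valid if its SMILES survives RDKit parsing and
    # default sanitization (MolFromSmiles returns None on failure).
    return sum(Chem.MolFromSmiles(s) is not None for s in smiles_list) / len(smiles_list)
```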
Additional results
Finally, following the suggestion from reviewer uwRJ, we have significantly improved our experimental evaluation by obtaining results on a more recent, state-of-the-art equivariant architecture for molecular generation (JODO [4]). Replacing the loss function used in this model by the energy loss function resulted in an improvement across nearly all 3D metrics. Our method therefore provides complementary gains to state-of-the-art architectures.
Please let us know if you have any additional concerns or questions we can answer.
[1] Lipman et al. Flow matching for generative modelling, 2023
[2] Liu et al. Flow straight and fast: learning to generate and transfer data with rectified flow, 2023
[3] Albergo et al. Stochastic interpolants: A unifying framework for flows and diffusions, 2023
[4] Huang et al. Learning joint 2D and 3D diffusion models for complete molecule generation, 2023
[5] Hoogeboom et al. Equivariant diffusion for molecule generation in 3D, 2022
I thank the authors for their response, they were able to address my concerns and questions. I have increased my score.
This work establishes a new method for physics-guided machine learning, where embedding physical principles into loss functions significantly improves performance in molecular generation and spin prediction. Its core contribution is shifting loss design from statistical error minimization to physics-driven optimization, ensuring model predictions respect physical constraints such as symmetry. The new approach improves prediction accuracy, efficiency, and reliability.
Strengths and Weaknesses
Strengths:
This paper introduces energy loss functions, a novel framework that integrates thermal equilibrium properties of physical systems into Machine Learning loss design. Unlike traditional losses such as MSE, which optimize statistical prediction errors, this work leverages the Boltzmann distribution and reverse KL divergence to derive physics-based loss functions. This ensures model predictions adhere to physical constraints such as symmetry and energy minimization.
Originality: first systematic integration of physics into loss functions, beyond the architectural level of machine learning models. Quality: comprehensive theory with formal proofs in the Appendix, extensive experiments (molecule generation and spin prediction), and a code release commitment. Clarity: clear comparisons between MSE and energy loss functions, supported by instructive physics figures such as Fig. 1, and comprehensive mathematical derivations in the Appendix.
Weaknesses:
- The selection of energy coefficients lacks discussion. The choice of the Morse and L-J potentials seems arbitrary; adding a few sentences justifying the choice may help. In addition, it may help if the authors give a few examples of energy coefficients in different physical systems, such as protein and crystalline systems.
- Lack of discussion of how to build the sparse version of the energy loss. The main text mentions that the sparse version reduces computation to O(N) operations, but without clear details.
- Temperature remains unexplored. Temperature is treated as a hyperparameter rather than a physical variable. The authors may need to add a few sentences discussing what would change in the construction of the energy loss at different temperatures.
- Incomplete comparison with equivariant architectures. Studies demonstrating the complementarity of the energy loss's benefits may be needed.
Questions
- Energy losses assume thermal equilibrium. How can the framework adapt to far-from-equilibrium systems (e.g., chemical reaction pathways)? Should dynamic energy landscapes be incorporated?
- The main focus is loss-level innovation. Could coupling with equivariant architectures enhance performance? For example: losses enforcing energy constraints + architectures handling symmetries.
- Will energy coefficients, the choice of potential energy equations and noise change under different temperature?
- How to modulate the energy loss functions when it comes to the physical systems that don’t obey Boltzmann distribution?
Limitations
- Atomic system coefficients (exponential decay) rely on experience without much theoretical guidance. A complex system (e.g., proteins) may need much more complicated energy approximations at prohibitive computational cost.
- Theoretical guarantees for score estimation valid only at low noise, high noise requires unaddressed corrections.
- Sparse rigid graphs reduce complexity but lack reported speedups; Fig. 3c only qualitatively claims O(N) scaling.
Final Justification
The authors have provided a thorough and well-reasoned rebuttal addressing all points raised in the review. They have clearly justified their choice of potentials, clarified the general applicability of their method via Taylor approximation, and convincingly demonstrated the practical benefits of the sparse energy loss, particularly in large-scale settings, through detailed wall-clock timing analysis. The comparison between equivariant architectures and the energy loss highlights a key contribution, which shows that energy-based regularization can yield substantial gains even without architectural symmetry, at minimal computational overhead. The discussion on temperature scaling, the complementarity of invariance and energy minimization, and the extension to state-of-the-art models further strengthen the paper’s technical depth and relevance.
Formatting Issues
No formatting issues.
We thank the reviewer for their valuable feedback. We are glad that they have appreciated the originality of our work, the comprehensive theory and the extensive experiments on different domains.
Some of the raised questions are answered in the Appendix of the paper. We will make use of the added camera-ready page (if accepted) to move some material to the main body.
We address their concerns and questions in detail below:
Selection of the energy coefficients
The choice of focusing on the Morse and Lennard-Jones potentials is justified by the fact that they are quite general and are the most widely used pairwise potentials. Lennard-Jones is the archetypal example of simple but realistic intermolecular interactions across physical systems, while the Morse potential is a model for bonding in a diatomic molecule. There is a very large number of potentials with explicit parametric forms that are used by practitioners, so we could not go into the details of all of them. We will however include references in the paper to the relevant literature (such as [1,2]) to guide the reader towards alternatives.
An important feature of our method is that it provides a clear procedure for finding the coefficients given any choice of potential, by second-order Taylor approximation. The Morse and LJ potentials offer important examples (see derivations in Appendix A.3 of the paper). We will add a few sentences to clarify the general procedure in the manuscript following your suggestion.
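As a worked instance of this procedure for the Morse potential (standard Morse form assumed; whether it matches the appendix's exact parametrization is our assumption):

```latex
V(r) = D_e\left(1 - e^{-a (r - r_e)}\right)^{2},
\qquad V(r_e) = V'(r_e) = 0,
\qquad V''(r_e) = 2 D_e a^{2},
```

so expanding to second order about the minimum gives $V(r) \approx \tfrac{1}{2} V''(r_e)(r - r_e)^{2} = D_e a^{2}(r - r_e)^{2}$, i.e. the pair is assigned spring constant $2 D_e a^{2}$ at rest length $r_e$.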
Sparse loss function
The sparse version of the loss function as well as the algorithm to implement it is discussed in Appendix A.5. It is based on random regular graphs (which can optionally be symmetrized to improve the invariance of the loss as we discuss). We will make use of the additional camera-ready page to move some explanations to the main text.
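A condensed sketch of that construction (the degree and the inverse-distance coefficient here are illustrative choices, not the exact algorithm of Appendix A.5):

```python
import networkx as nx
import torch

def sparse_energy_loss(pred, target, degree=4):
    # Restrict the spring energy to the edges of a random d-regular graph,
    # giving O(N) edge terms instead of the O(N^2) dense pairwise loss.
    n = pred.shape[0]
    edges = torch.tensor(list(nx.random_regular_graph(degree, n).edges()))
    i, j = edges[:, 0], edges[:, 1]
    d_pred = (pred[i] - pred[j]).norm(dim=-1)    # distances on kept edges
    d_ref = (target[i] - target[j]).norm(dim=-1)
    coeff = 1.0 / (1.0 + d_ref)                  # illustrative inverse-distance decay
    return (coeff * (d_pred - d_ref) ** 2).sum()
```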
We have also included wall-clock times benchmarking the losses on a single NVIDIA L40s following your suggestion. Our main objective in including the sparse energy loss was to demonstrate our method can efficiently generalize to systems with many particles where loss calculation may contribute significantly to running time (e.g. very large point clouds). This is not the case for molecules, where the neural network (GNN or Transformer) is typically fully connected and thus scales as N^2. In the results below on QM9 with GDM, the most expensive loss calculation is less than 1% of the total backward and forward time.
| Component | Loss Type | Time (ms) |
|---|---|---|
| Loss computation | MSE | 0.1840 ± 0.0084 |
| | Energy | 0.5100 ± 0.0161 |
| | Sparse Energy | 0.5712 ± 0.0242 |
| | Kabsch Align | 1.1361 ± 0.0300 |
| Forward pass | – | 73.55 ± 16.38 |
| Backward pass | – | 94.47 ± 3.37 |
| Optimizer step | – | 1.43 ± 0.0072 |
To better understand the scale at which this becomes a relevant consideration and the utility of the sparse energy loss, see the following wall-clock times from the shape generation setting:
| # Nodes | Energy (ms) | Sparse Energy (ms) | MSE (ms) | Kabsch Align (ms) |
|---|---|---|---|---|
| 30 | 0.240 ± 0.0021 | 0.255 ± 0.0012 | 0.058 ± 0.0004 | 0.807 ± 0.0027 |
| 300 | 0.245 ± 0.0011 | 0.257 ± 0.0024 | 0.059 ± 0.0004 | 0.804 ± 0.0048 |
| 3000 | 20.120 ± 0.0055 | 0.275 ± 0.0018 | 0.0779 ± 0.0008 | 0.944 ± 0.0028 |
| 30000 | - | 0.293 ± 0.0029 | 0.0740 ± 0.0011 | 3.242 ± 0.1452 |
| 300000 | - | 2.652 ± 0.0050 | 0.131 ± 0.0004 | 24.381 ± 0.0272 |
At 30000+ nodes, the energy loss requires too much memory to compute. Importantly, the sparse energy is cheaper than the Kabsch Align by a factor of 5-10x. Note that all losses have some constant cost, not scaling with N, that contributes to the wall-clock time. This explains why for QM9 (avg. 28 atoms) the energy loss is marginally faster than the sparse energy, and why in the scaling table wall-clock times start to increase with N only after a certain point.
Interestingly, using an equivariant network with EDM takes 129.71 ± 0.044 ms for the forward pass and 180.07 ± 0.070 ms for the backward pass. Using the energy loss imparts a 0.3% increase on one backward pass through the model, whereas using an equivariant architecture imparts a 94% increase while providing inferior benefits.
A crucial conclusion is that our results show that the energy loss with a non-equivariant architecture yields more improvement than using an equivariant architecture, at negligible computational cost, which we think is a significant finding. We will discuss it further in the main text.
Effect of temperature
The temperature parameter simply scales the loss (see Equation 5). Optimizers like Adam are invariant to loss scaling. For optimizers that are not invariant (e.g. SGD), the temperature is absorbed in the learning rate.
Coupling with equivariant architectures
We have indeed seen that coupling the energy loss with equivariant architectures leads to improved performance. Our results in Appendix C.2.1 of the paper show that the energy loss results in significant improvements even for equivariant architectures. We argue that the benefits of an invariant loss function are complementary to those of an equivariant architecture. Essentially, equivariance guarantees that for a given input-output pair, the output associated with a transformed input will be the transformed output. However, it does not guarantee that the output will be the correct one; it ensures generalization to samples related by symmetry. The energy loss helps in learning the correct output by appropriately weighting the errors and gradients during training. In the case of symmetries, instead of being forced to regress the original configuration, the network can learn to regress any configuration related by symmetry, which equivariance alone does not allow. This results in a reduction of variance in the gradients, as we show in Proposition 4.5.
We obtained new results that confirm the increase in performance on a more modern equivariant architecture (see below). The improvement offered by the energy loss is therefore orthogonal to architectural improvements and offers another avenue for development.
We also highlight that our results show that the energy loss with a non-equivariant architecture results in more improvement than using an equivariant architecture, at a much cheaper cost, which we think is a significant finding. We will discuss it further in the main text.
Extension to non-equilibrium and non-Boltzmann systems
As stated in the paper, the energy loss is not as well motivated for systems that are not at thermal equilibrium. Our method still applies to a wide range of problems, but we have been careful in limiting its scope. However, a crucial motivation for our work is that the MSE loss is already an approximation to an energy function, simply a very crude and unphysical one. Compared to it, our approximations are much better motivated. Using more elaborate approximations is an exciting avenue for future research, but introduces a tradeoff between complexity and fidelity. What we have introduced in this work sits at an advantageous point with respect to this tradeoff.
Additional results
Finally, following the suggestion from reviewer uwRJ, we have significantly improved our experimental evaluation by obtaining results on a more recent, state-of-the-art equivariant architecture for molecular generation (JODO [3]). Replacing the loss function used in this model by the energy loss function resulted in an improvement across nearly all 3D metrics. Our method therefore provides complementary gains to state-of-the-art architectures.
We hope this response addresses your concerns. If any further clarification would help you reconsider your overall assessment, we would be happy to provide it.
[1] Stone, The theory of intermolecular forces, 2013
[2] Frenkel and Smit, Understanding molecular simulation, 2001
[3] Huang et al. Learning joint 2D and 3D diffusion models for complete molecule generation, 2023
We thank you again for taking the time to review our work. We tried to address all of your concerns and questions. We also point that we have provided final evaluation results showing state-of-the-art results in the reply to reviewer uwRJ. If anything remains, please let us know during this last part of the discussion period.
After useful and fruitful discussions with all reviewers, we think we have satisfied all concerns:
- reviewer uwRJ helped us significantly improve our empirical evaluation by suggesting experiments on the modern JODO architecture and raised questions about wall-clock times for the sparse loss. They have no remaining concerns and recommend acceptance noting our work as a “creative and refreshing rethink of foundational concepts of ML”, "versatile" and leading to “tangible improvements”
- reviewer J6c1 raised points about flow matching and suggestions on improving the writing. They have no remaining concerns and recommend acceptance noting the paper as "well-written" and “clearly motivated” with potentially wide applicability
- reviewer 3SJQ has no remaining concerns and recommends acceptance, noting the work as "clearly written", "simple yet interesting" and “broadly applicable to various machine learning problems”. They raised points about physically motivated networks vs. loss functions and an extension to flow matching
- reviewer f4bE raised questions about equivariant networks vs. invariant loss functions, wall clock times for the sparse loss and the choice of energy coefficients. We have tried to address these concerns and there was no indication during the discussion that any concerns remain. They note our work as containing “comprehensive theory” and “extensive experiments”
We thank all reviewers for their time and effort and note the biggest changes from the discussion period.
Strong new results on state-of-the-art molecule generation
Thanks to a suggestion by uwRJ, we have significantly improved nearly all metrics on a near state-of-the-art molecule generation model JODO, demonstrating our method provides consistent gains at the frontier. This included a broader evaluation suite, validating the method further, and supported the complementary gains of the energy loss to equivariant architectures (as raised by f4bE, 3SJQ).
Extension to flow matching
Thanks to questions from J6c1, 3SJQ we have demonstrated how to extend our framework to flow matching and shown the theoretical results on score estimation transfer to this setting.
Sparse loss and wall-clock times
Thanks to suggestions by f4bE, uwRJ we have provided detailed wall-clock times for all loss functions, suggesting the sparse energy loss to be a feasible alternative to the energy loss on very large graphs, while being cheaper to compute than Kabsch Align.
The paper introduces energy loss functions through the reverse Kullback-Leibler divergence as a loss function for learning on data reflecting physical systems. This idea is quite similar to established methods such as energy-based training in Boltzmann Generators; however, it differs in that the approach can be applied even when a ground-truth potential is not available, e.g. by assuming some other potential. The approach is interesting; however, some reviewers highlighted the incremental nature of the work's technical and conceptual contributions, and while the loss is touted as a general learning paradigm, experiments only cover one learning modality. Nevertheless, the paper is solid and provides an interesting direction which may have impact across a broader range of ML applications.