HollowFlow: Efficient Sample Likelihood Evaluation using Hollow Message Passing
We present HollowFlow, a flow-based generative model based on a GNN that makes sample likelihood computations drastically more efficient.
Abstract
Reviews and Discussion
This paper addresses the problem of expensive divergence computations for likelihood evaluations in continuous normalizing flows. The solution adapts a previously proposed network architecture which can report its Jacobian trace in a single backward pass. The present authors describe the construction of a similar architecture with equivariant GNNs. The resulting network is less powerful than standard GNNs, but the fast likelihood computation enables a high ESS/s on the well-studied LJ-13 and LJ-55 systems.
Strengths and Weaknesses
Strengths
- The paper treats an important and underappreciated problem in the Boltzmann sampling literature, which is the difficulty of scaling exact-likelihood CNFs to high dimensions despite their superior performance to coupling flows. There are few obvious ways of tackling this challenge, resulting in fewer papers published in this area. The authors' present work is among the more creative recent contributions to Boltzmann sampling.
- Although the basic idea of HollowNet is pre-existing, the authors present a highly nontrivial adaptation to GNNs operating equivariantly via careful construction of the message passing graph. The runtime characteristics of this network are carefully considered and formalized.
- The empirical results indeed show that trading off accuracy for fast likelihoods is sometimes worthwhile, supporting the promise of the research direction.
Weaknesses
- It is not acceptable to manipulate the ESS metrics by excluding outliers, a protocol which has no justification in importance sampling. The authors must correct this, or I will lower the score.
- The performance penalty is larger than one would hope, calling into question whether the method would work for more difficult systems.
- The authors should discuss considerations, if any, in adapting the framework to non-equivariant GNNs or transformers, as these are increasingly the architectures of choice for molecular diffusion or flows.
- The kNN graph construction is non-differentiable. Can the authors comment on how this changes the validity of the divergence computation or the CNF?
Questions
See above.
Limitations
Yes
Justification for Final Rating
I maintain my positive assessment of the technical aspects of the work from my original review.
Formatting Issues
No
We thank the reviewer for the valuable, detailed feedback and thoughtful questions. Answers to the reviewer's questions and additional comments can be found below.
It is not acceptable to manipulate the ESS metrics by excluding outliers, a protocol which has no justification in importance sampling. The authors must correct this, or I will lower the score.
We agree that 'manipulating' benchmark statistics is not acceptable. We emphasize, however, that we include the outlier-removed metrics to illustrate the impact of the outliers on the ESS, and how controlling them is a path forward for improving the presented approach.
The performance penalty is larger than one would hope, calling into question whether the method would work for more difficult systems.
We agree that the penalty is large, however, we expect this to be linked to the locality assumption which we use throughout the manuscript. To test this, we will include an additional ablation study in the camera ready version where we use a KNN graph for the baseline to disentangle the contributions of locality and HoMP on the performance penalty.
Additionally, for the camera ready version, we will implement multi-headed HoMP similar to the multi-head attention described in appendix A, where several relatively sparse HoMP networks with different connectivities (e.g. separated length scales) are run in parallel and combined in the end. This ensures that each network does not become disconnected as quickly, while avoiding any locality assumption. Furthermore, we are currently trying HollowFlow on Alanine Dipeptide and will include the results in the camera ready version.
The authors should discuss considerations, if any, in adapting the framework to non-equivariant GNNs or transformers, as these are increasingly the architectures of choice for molecular diffusion or flows.
By non-equivariant, we assume the reviewer means not equivariant with respect to the Euclidean groups, e.g. E(3), as explored in our examples. This can be done by simply adopting a non-equivariant MP/transformer architecture as is, but ensuring the MP/attention is done in a manner that is faithful to the HoMP recipe, as outlined in the paper. Relaxing permutation symmetry can be done by avoiding the use of identical embeddings for the particles/nodes in the graph.
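As a minimal illustration (not part of the paper, and deliberately simplified): a single "hollow" self-attention layer can be obtained by masking the diagonal of the attention matrix, so that each token's output carries no information about its own input. Stacking several such layers naively would leak information back (i → j → i), which is exactly what the non-backtracking construction handles in the GNN case.

```python
import torch
import torch.nn as nn

class HollowSelfAttention(nn.Module):
    """Sketch of a single 'hollow' attention layer: the diagonal of the attention
    matrix is masked out, so the output h_i does not depend on the layer's own
    input x_i. This is an illustration of the HoMP recipe, not the paper's code."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                       # x: (batch, n_particles, dim)
        n = x.shape[1]
        no_self = torch.eye(n, dtype=torch.bool, device=x.device)  # True = not allowed to attend
        h, _ = self.attn(x, x, x, attn_mask=no_self)
        return h                                # h[:, i] is independent of x[:, i]
```

The vector field would then be built as v_i = MLP(x_i, h_i), so that the dependence of v_i on x_i enters only through the per-particle branch, keeping the Jacobian block-diagonal plus block-hollow.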
The kNN graph construction is non-differentiable. Can the authors comment on how this changes the validity of the divergence computation or the CNF?
It is unclear what the reviewer means here. The KNN graph is used to limit the connectivity of the MP graph, and the divergence calculation does not depend on the graph construction once it is set.
I appreciate the authors' response, but they do not sufficiently address my concerns.
- As originally mentioned, the manipulation of ESS metrics is inappropriate and I do not find the authors' defense convincing. I would request that the authors present new results with the unadjusted ESS metrics. If this is not done, I will unfortunately have to recommend against acceptance. I like the idea very much and respect the technical depth of the results, but transparent and honest reporting of results is paramount.
- The divergence is defined only for a differentiable vector field. If the vector field is constructed via a GNN over a kNN graph, the neural network output is not continuous with respect to the inputs. Can the authors comment on how this affects the validity of using the divergence to compute model likelihoods?
- We want to emphasize that we do report the original ESS without any outlier removal for all the experiments in table 2 and table 3. We do, however, acknowledge that the comparison between the baseline and HollowFlow is done using ESS_rem, i.e. the ESS computed with a percentile of outliers removed. For completeness, we provide additional results for LJ13 below, where we recompute the speed-up that HollowFlow provides using the original ESS. The numbers in brackets indicate a 68% confidence interval calculated via bootstrapping. The effective speed-up computed from the original ESS (outliers included) is defined analogously to eq. 22:
Note that some of the numbers differ from the original paper as we retrained the baseline model using best model selection techniques. As removing outliers improves the baseline model more than it improves the HollowFlow models, the effective speed-up with outliers is even larger than the speed-up without outliers. We will include these results, as well as the corresponding results for LJ55, in the camera ready version and clarify the context. We will also rerun the calculations with more samples to decrease errors.
| | ESS | ESS | ESS | ESS | Effective speed-up without outliers | Effective speed-up with outliers |
|---|---|---|---|---|---|---|
| k=2 | | | | | | |
| k=4 | | | | | | |
| k=6 | | | | | | |
| k=8 | | | | | | |
| k=10 | | | | | | |
| k=12 | | | | | | |
| baseline | | | | | | |
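For reference, the ESS numbers above can be reproduced along the following lines. This is a minimal sketch using the standard (Kish) effective sample size; the precise definition of ESS_rem and the percentile of removed outliers are given in eq. 22 of the paper, and the `remove_percent` value below is a hypothetical placeholder.

```python
import numpy as np

def ess_percent(log_w):
    """Kish effective sample size of self-normalized importance weights, in percent."""
    log_w = np.asarray(log_w, dtype=float)
    log_w = log_w - log_w.max()                      # stabilise the exponentials
    w = np.exp(log_w)
    return 100.0 * w.sum() ** 2 / ((w ** 2).sum() * len(w))

def ess_rem_percent(log_w, remove_percent=0.2):
    """ESS with the largest `remove_percent` % of log-weights discarded (hypothetical cutoff)."""
    log_w = np.asarray(log_w, dtype=float)
    cutoff = np.percentile(log_w, 100.0 - remove_percent)
    return ess_percent(log_w[log_w <= cutoff])
```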
- We thank the reviewer for clarifying the question. We agree that if the kNN graph is updated dynamically, non-differentiability occurs on a set of measure-zero hypersurfaces (the decision boundaries of the kNN construction). This can cause theoretical issues, including but not limited to the existence and uniqueness of a global solution to the corresponding ODE. However, empirically, we observe no instability. We expect this to be the case because the vector field remains differentiable almost everywhere, and consequently the ODE and the corresponding continuity equation are well-defined almost everywhere. We acknowledge that this is a somewhat heuristic explanation, but we defer a formal analysis of these cases, to explain our empirical results, to future work, as the specific graph construction is not a central contribution of this work.
To further validate empirically that the non-differentiability does not cause any problems in practice, we tracked the divergence of a trained fully connected and a trained 6 nearest neighbors (NN) HollowFlow model during sampling using a forward Euler scheme with 20 integration steps. We monitor the maximal and average absolute difference of the divergence between consecutive steps over 1000 initial conditions to measure whether there are significant discontinuities in the NN model as opposed to the fully connected one. Results can be found below. While both values are slightly higher for the NN model, there are no significant differences, further supporting our claim that the non-differentiability does not cause issues in practice.
| | max difference | average difference |
|---|---|---|
| fully connected | 72.5 | 5.3 |
| k = 6 | 74.2 | 5.8 |
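For clarity, the check we ran is roughly the following sketch; `vector_field` and `divergence` are placeholders for our trained model's interface, not functions from the released code.

```python
import torch

def divergence_jumps(x0, vector_field, divergence, n_steps=20):
    """Integrate with forward Euler and record how much the exact divergence
    changes between consecutive steps (large jumps would indicate discontinuities)."""
    dt = 1.0 / n_steps
    x, prev_div, diffs = x0, None, []
    for step in range(n_steps):
        t = step * dt
        div = divergence(x, t)                  # per-sample divergence, shape (batch,)
        if prev_div is not None:
            diffs.append((div - prev_div).abs())
        prev_div = div
        with torch.no_grad():
            x = x + dt * vector_field(x, t)     # forward Euler update
    diffs = torch.stack(diffs)
    return diffs.max().item(), diffs.mean().item()
```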
This paper introduces "HollowFlow," a novel continuous normalizing flow (CNF) architecture designed to drastically reduce the computational cost of likelihood evaluation for generative models, particularly in the context of Boltzmann Generators (BGs). The authors' key contribution is a new message passing scheme called Hollow Message Passing (HoMP), which addresses the prohibitive scaling of computing the divergence of the vector field in CNFs; this divergence traditionally requires one backward pass per dimension.
HoMP is implemented using a non-backtracking graph neural network (NoBGNN) that operates on a line graph representation of the system's connectivity. This construction splits the Jacobian of the neural network vector field into a block-diagonal part and a "block-hollow" part, which allows for the exact computation of the Jacobian's trace (the divergence) with a constant number of backward passes (equal to the spatial dimension of the coordinates rather than the number of particles).
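As I understand the construction, once the cross-particle ("block-hollow") contribution is detached, the divergence can be read off with one backward pass per spatial dimension, roughly as in the following sketch (my own illustration, not the authors' code; `vector_field_blockdiag` is a placeholder for a velocity whose hollow part has been detached):

```python
import torch

def hollow_divergence(x, vector_field_blockdiag):
    """Exact divergence with d backward passes, assuming the Jacobian of
    vector_field_blockdiag w.r.t. x is block-diagonal (one d x d block per particle)."""
    x = x.detach().clone().requires_grad_(True)   # x: (n_particles, d)
    v = vector_field_blockdiag(x)                 # (n_particles, d)
    d = x.shape[-1]
    div = x.new_zeros(())
    for alpha in range(d):                        # d backward passes, independent of n_particles
        grad_alpha, = torch.autograd.grad(v[:, alpha].sum(), x, retain_graph=True)
        div = div + grad_alpha[:, alpha].sum()    # sums dv_i^alpha / dx_i^alpha over particles
    return div
```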
The authors provide a theoretical analysis of the computational complexity, showing that HollowFlow can achieve a speed-up over standard equivariant GNN-based flows that grows with system size. They validate their approach by training E(3)-equivariant HollowFlow models on two Lennard-Jones (LJ) particle systems, LJ13 and LJ55.
Strengths and Weaknesses
Strengths:
- The proposed HoMP algorithm is a clever adaptation of the "hollow network" concept to the crucial domain of equivariant GNNs. This offers a direct and effective solution to the divergence calculation bottleneck in CNFs.
- The paper is well-supported by theory and offers step-by-step proofs in the appendix.
- The experimental results are impressive, demonstrating a speed-up of nearly 90x for the LJ55 system.
- The paper provides code and all training details for reproducibility.
- The paper provides an extension to attention architectures offering integration with various SOTA architectures.
Weaknesses:
- A notable weakness is the performance of the baseline model, particularly in terms of the effective sample size (ESS). For LJ13, the baseline achieves an ESS of only 0.001%, and for LJ55, it's 0.167%. These are very low values, suggesting the baseline model is not particularly strong. While the authors acknowledge this, it tempers the impact of the "effective speed-up" metric, as the comparison would be more compelling against a state-of-the-art baseline with a higher ESS.
- The experiments are confined to two relatively small, well-behaved systems (LJ13 and LJ55) that are dominated by short-range interactions. The paper's claims of general applicability to "high-dimensional scientific problems" would be much stronger if tested on more diverse and challenging systems.
- The use of a kNN graph is a necessary compromise to avoid the poor scaling of the line graph construction on a fully connected graph. However, the choice of 'k' is a critical, non-trivial hyperparameter. Figure 3(c) shows that performance is quite sensitive to 'k', and the appendix does not offer a heuristic for selecting it, adding a layer of tuning complexity.
Questions
- The appendix clarifies the update rule for the backtracking array B(t) and its size. Could you comment on the practical computational and memory cost of storing and updating this array in your implementation? Could it become a practical bottleneck for systems with thousands of nodes, and do you see a path to optimizing it?
- Your results show a non-monotonic dependence of model performance on the hyperparameter 'k'. Do you have any intuition or a proposed heuristic for selecting an optimal 'k' for a new system without having to train multiple models? Is there a risk that for a complex landscape, the optimal 'k' would need to be so large that it negates the computational advantage of using a sparse graph?
- Could you elaborate on the training of the baseline model? Given its very low ESS, do you believe it was fully converged? How does its ESS compare to other published flow-based Boltzmann Generators for these LJ systems? A stronger baseline would make the "effective speed-up" claims even more compelling.
- This work is motivated by achieving exact divergence calculation efficiently. How do you see your method comparing to those that use stochastic trace estimators (e.g., Hutchinson's estimator)? While approximate, they are also computationally cheap. In your opinion, what are the key applications or systems where the exactness of the divergence provided by HollowFlow is critical and approximation is insufficient?
Limitations
The authors provide an honest discussion of the limitations in the main paper, which I find accurate. Most of my main concerns align with theirs, and are:
- The current kNN-based implementation assumes locality and is not well-suited for systems with critical long-range interactions (e.g., electrostatics). This significantly limits its immediate applicability to a wide range of important molecular systems.
- The HoMP algorithm relies on updating a backtracking array, B(t), to dynamically remove edges from the line graph at each message passing step. This array grows with the system size. This could become a new computational or memory bottleneck for very large systems (e.g., n > 1000), potentially undermining the overall speed-up. The appendix provides the algorithm but does not analyze the practical cost of this step.
- The core idea of HoMP is to restrict information flow to gain computational efficiency. This is an explicit trade-off. It remains an open question whether this restriction will prevent the model from learning the complex energy landscapes of more challenging systems, leading to a sample quality (ESS) so poor that even a massive speed-up cannot compensate for it.
Justification for Final Rating
The authors address most of my concerns. However, I'm still concerned about the scalability of the method.
Formatting Issues
NA
We thank the reviewer for taking the time to give valuable and very detailed feedback. Answers to the reviewer's questions can be found below. Finally, we also comment on some of the limitations that the reviewer pointed out.
Questions
The appendix clarifies the update rule for the backtracking array B(t) and its size. Could you comment on the practical computational and memory cost of storing and updating this array in your implementation? Could it become a practical bottleneck for systems with thousands of nodes, and do you see a path to optimizing it?
In practice, it is sufficient to store, for each node of the line graph, which nodes of the original graph it has already received information from. If the original graph has n nodes and the line graph has n_L nodes, the size of the array is effectively n_L × n. For a kNN graph the line graph has of the order of k·n nodes, so the memory requirement of the array in our implementation scales roughly as k·n². In practice, only indexing operations, element-wise binary logical functions and PyTorch scatter-add operations are applied to the array. Those do, to the best of our knowledge, not scale more than linearly with the size of the array and can be (and probably are, internally in PyTorch) heavily parallelized. We never experienced any computational bottlenecks during these calculations. If it does become a problem, one possible strategy to circumvent it is to use a message passing graph with a sufficiently large girth (length of the shortest cycle) relative to the number of message passing (MP) steps on the line graph. This ensures that the Jacobian stays hollow even without removing connections, rendering the backtracking array unnecessary.
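To make the cost of this book-keeping concrete, the propagation step amounts to something like the sketch below. This is our simplified reading for illustration, not the exact algorithm of appendix B (which additionally uses the array to remove connections); it only shows that a single indexed scatter-add per step suffices, so the cost is linear in the number of line-graph connections times n.

```python
import torch

def propagate_B(B, src, dst):
    """One propagation step of the backtracking book-keeping (simplified sketch).

    B:   (n_line_nodes, n) tensor with 0/1 entries; B[e, m] = 1 means line-graph
         node e already carries information originating from graph node m.
    src, dst: (n_connections,) long tensors listing the line-graph connections
         src -> dst used in the current message-passing step.
    """
    B_new = B.clone()
    B_new.index_add_(0, dst, B[src])        # accumulate the senders' knowledge at the receivers
    return (B_new > 0).to(B.dtype)          # logical OR realised as "count > 0"
```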
Your results show a non-monotonic dependence of model performance on the hyperparameter 'k'. Do you have any intuition or a proposed heuristic for selecting an optimal 'k' for a new system without having to train multiple models? Is there a risk that for a complex landscape, the optimal 'k' would need to be so large that it negates the computational advantage of using a sparse graph?
Naively, one might expect that the model performance (ESS) would monotonically increase with k, as there are more connections in the graph for larger k. However, this is not the case in practice (see e.g. table 3). The fact that the ESS does not increase further with k beyond some upper bound can heuristically be explained as follows: By construction, some of the connections in the line graph need to be removed after a certain number of MP steps (see appendix B for more details). The larger k, the more connections need to be removed after only one MP step, up to the point where (almost) no connections are left after only one MP step. We thus have two counteracting effects when k increases: the first MP step gains more connections and thus expressiveness, while the second MP step loses connections and expressiveness. Eventually, the latter seems to take over, effectively limiting the expressiveness of the model. We will include a brief summary of this explanation in the camera ready version to improve clarity.
As larger k's are not necessarily better, we do not think that for complex landscapes the optimal k will be so large that it negates the computational advantage. To (heuristically) choose k, one can efficiently work out numerically how many edges of the line graph are left after a given number of message passing steps without training a model. There should be a significant number of edges left after one MP step, and the graph should not be extremely sparse in the beginning. We will include a plot showing the percentage of edges remaining as a function of the number of message passing steps for different k's, together with instructions on how to tune k, in the appendix of the camera ready version.
Could you elaborate on the training of the baseline model? Given its very low ESS, do you believe it was fully converged? How does its ESS compare to other published flow-based Boltzmann Generators for these LJ systems? A stronger baseline would make the "effective speed-up" claims even more compelling.
The baseline model was trained in a similar way as in the Equivariant Flow Matching paper (Klein et al., 2023). However, there are some significant differences. Besides a different learning rate scheduler (see appendix C4), most importantly, we use a different architecture (PaiNN) instead of their architecture, which we will refer to as EGNN. The ESS for e.g. LJ13 that was reported in the Equivariant Flow Matching paper (Klein et al., 2023) is almost 58%, while our ESS is close to zero if we do not remove a small number of outliers and comparable (49%) if we do remove these outliers. As the two architectures are different (different update rules, different embeddings etc.), there might be non-trivial effects that could explain the difference in ESS, similar to the non-linear scaling with k.
While we do believe that our baseline models are fully converged, it might be possible to further increase performance with different training strategies and choice of hyperparameters.
It is important to keep in mind that the ESS is extremely sensitive to outliers in the log importance weight histogram. To illustrate this effect, we included ESS_rem, where these outliers are removed in a controlled way. Future work will be necessary to fully understand how the different architectures used in HollowFlow and in the non-hollow baseline affect the ESS, and especially how different architectures influence the number of outliers that significantly lower the ESS.
This work is motivated by achieving exact divergence calculation efficiently. How do you see your method comparing to those that use stochastic trace estimators (e.g., Hutchinson's estimator)? While approximate, they are also computationally cheap. In your opinion, what are the key applications or systems where the exactness of the divergence provided by HollowFlow is critical and approximation is insufficient?
For Boltzmann generator applications, variance in the likelihood estimator leads to biased weights under self-normalized importance sampling. Thus, exact likelihoods and exact divergences are critical and even an unbiased estimator like the Hutchinson estimator is insufficient. Furthermore, considering how sensitive the ESS is to even a small number of outliers, it is important to get the importance weights exactly right.
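As a toy illustration of this point (synthetic numbers, not our experiments): adding unbiased zero-mean noise to the log-weights, which is effectively what a stochastic trace estimator does along the integration path, already distorts the self-normalized weights and degrades the ESS, because the weights enter through an exponential (Jensen's inequality).

```python
import numpy as np

rng = np.random.default_rng(0)
log_w = rng.normal(size=100_000)                 # stand-in for exact log importance weights
noise = rng.normal(scale=1.0, size=log_w.shape)  # unbiased noise on the log-weight

def ess_frac(w):
    return w.sum() ** 2 / ((w ** 2).sum() * len(w))

print(f"ESS with exact log-weights: {100 * ess_frac(np.exp(log_w)):.1f}%")          # ~37%
print(f"ESS with noisy log-weights: {100 * ess_frac(np.exp(log_w + noise)):.1f}%")  # markedly lower
```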
Limitations:
The current kNN-based implementation assumes locality and is not well-suited for systems with critical long-range interactions (e.g., electrostatics). This significantly limits its immediate applicability to a wide range of important molecular systems.
While we agree with the reviewer's observation that we currently assume locality, we want to emphasize that this can be circumvented by a multi-head strategy, where several relatively sparse HoMP networks with different connectivities (e.g. separated length scales) are run in parallel and combined in the end (see appendix A for a brief discussion; a sketch is given below). This ensures that each network does not become disconnected as quickly, while avoiding any locality assumption. This introduces an additional scaling with the number of heads, while the scaling with the number of particles remains the same for each head. We will include a more detailed discussion of multi-head HollowFlow in the camera ready version.
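Schematically, the multi-head combination amounts to the following sketch, where `HollowHead` is a hypothetical interface for a single sparse HoMP network exposing its vector field and its cheap divergence (not the paper's code):

```python
import torch.nn as nn

class MultiHeadHollowField(nn.Module):
    """Several sparse HoMP heads with different connectivities run in parallel;
    their vector fields are summed. Because the divergence is linear in the
    vector field, the total divergence is the sum of the heads' divergences,
    so the per-particle scaling of each head is unchanged."""

    def __init__(self, heads):
        super().__init__()
        self.heads = nn.ModuleList(heads)

    def forward(self, x, t):
        v = sum(head.vector_field(x, t) for head in self.heads)
        div = sum(head.divergence(x, t) for head in self.heads)
        return v, div
```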
The core idea of HoMP is to restrict information flow to gain computational efficiency. This is an explicit trade-off. It remains an open question whether this restriction will prevent the model from learning the complex energy landscapes of more challenging systems, leading to a sample quality (ESS) so poor that even a massive speed-up cannot compensate for it.
We are currently working on running HollowFlow on Alanine Dipeptide and plan to include the results in the camera ready version.
I thank the authors for their detailed rebuttal. This clears up most of my concerns and I'd be happy to increase my score.
We thank the reviewer for engaging in the discussion phase and for increasing their score. We remain open to discussion in case any doubts remain.
This paper proposes HollowFlow, a framework that combines a non-backtracking GNN with standard neural network components to speed up sampling and likelihood evaluation in previous BG methods. Specifically, HollowFlow's non-backtracking GNNs enforce a block-diagonal Jacobian structure, which effectively reduces the number of backward passes. Experiments conducted on LJ13 and LJ55 further validate the effectiveness of HollowFlow.
Strengths and Weaknesses
Strengths:
- HollowFlow innovatively introduces HoMP to effectively reduce the number of backward passes required for likelihood evaluation. kNN graphs are also incorporated to reduce the graph complexity.
- The paper develops theoretical scaling laws that align well with experimental results.
Weaknesses:
- Although the paper states that:
the performance of our baseline is not in line with state-of-the-art in terms of ESS, however, since our HollowFlows are built directly upon this baseline, we anticipate that any improvement in the baseline will be reflected in the HollowFlow models as well.
Nevertheless, the lack of direct comparison with state-of-the-art methods reduces the persuasiveness of the results regarding HollowFlow's performance. I suggest that the authors supplement their experiments with additional benchmarks against leading approaches.
- While HollowFlow achieves remarkable speed-ups compared to the baseline, the ESS and ESS_rem are noticeably lower. Furthermore, increasing the value of k does not seem to improve HollowFlow's ESS and ESS_rem beyond a certain upper bound, and this upper bound remains significantly lower than that of the baseline. What is the rationale behind this phenomenon, and how should HollowFlow be tuned in practice?
- The introduction of kNN graphs assumes locality, as stated in the limitation section. How does HollowFlow perform with a fully connected graph? Additionally, I recommend including an ablation study to disentangle the contributions of HoMP, kNN, etc.
- Experiments are conducted solely on LJ systems. Thus, the generalizability of the proposed model is not sufficiently validated.
Questions
See above
Limitations
See above
Justification for Final Rating
Most concerns have been resolved. I raise the score.
Formatting Issues
NA
We thank the reviewer for the valuable and very detailed feedback. Answers to the reviewer's questions and comments can be found below.
- Thanks for your suggestion; a direct comparison against a state-of-the-art method might indeed be interesting. However, we want to emphasize that our goal is not to beat the state of the art in terms of ESS, but to demonstrate a significant improvement in terms of ESS per compute during sampling for a given architecture. In our experiments, we decided to use the PaiNN architecture as a base to implement HoMP, even though state-of-the-art results have been achieved with other architectures, such as the one used in the Equivariant Flow Matching paper (Klein et al., 2023), which we will refer to as EGNN in the following. HoMP can also be implemented with EGNN, but in contrast to PaiNN it requires some arbitrary symmetry breaking on the line graph, as all terms in the message passing will cancel out otherwise. While a baseline using EGNN achieves an ESS that is close to the state of the art, it seems to lower the performance of HollowFlow, presumably due to the aforementioned symmetry breaking. Thus, we stuck with PaiNN for our baseline and HollowFlow, even though the PaiNN baseline is not in line with state-of-the-art methods.
- HollowFlow uses a restricted architecture that we expect to have limited expressiveness, resulting in a lower ESS. However, as pointed out above, our goal is to improve the ESS per compute during sampling, which we still achieve due to enormous computational savings. The fact that the ESS does not increase further with k beyond some upper bound can heuristically be explained as follows: By construction, some of the connections in the line graph need to be removed after a certain number of message passing (MP) steps (see appendix B for more details). The larger k, the more connections need to be removed after only one MP step, up to the point where (almost) no connections are left after only one MP step. We thus have two counteracting effects when k increases: the first MP step gains more connections and thus expressiveness, while the second MP step loses connections and expressiveness. Eventually, the latter seems to take over, effectively limiting the expressiveness of the model. We will include a brief summary of this explanation in the camera ready version to improve clarity.
As stated in table 3, we found empirically that HoMP with relatively small k's seems to work about as well as fully connected HoMP for LJ13. To (heuristically) tune k in practice, one can efficiently work out numerically how many edges of the line graph are left after a given number of message passing steps without training a model. There should be a significant number of edges left after one MP step, and the graph should not be extremely sparse in the beginning. We will include a plot showing the percentage of edges remaining as a function of the number of message passing steps for different k's in the appendix of the camera ready version, together with instructions on how to tune k.
In practice, one could also use a multi-head strategy where several relatively sparse HoMP networks with different connectivities (e.g. separated length scales) are run in parallel and combined in the end (see appendix A for a brief discussion). This ensures that each network does not become disconnected as quickly, while avoiding any locality assumption. This introduces an additional scaling with the number of heads, while the scaling with the number of particles remains the same for each head. We will include a more detailed discussion of multi-head HollowFlow in the camera ready version.
- While we indeed assume locality of the graph, it is possible to circumvent this assumption by the multi-head approach described above. In table 3 we show how HollowFlow performs with a fully connected graph on LJ13 (see the corresponding row). This can already be seen as an ablation to disentangle the contribution of the locality assumption. We will add a similar study for LJ55. Running HollowFlow without HoMP is not possible, as HoMP is essential to achieve the hollow structure of the Jacobian unless we only do one message passing step. We will include an additional ablation study in the camera ready version where we run the baseline with a kNN graph.
- It would indeed be desirable to test the method on other systems. We are currently working on running HollowFlow on Alanine Dipeptide and will include the results in the camera ready version.
Thanks for the rebuttal. Most of my concerns have been resolved. I raise my score to 4, good luck.
We thank the reviewer for their engagement and feedback on our paper. If any outstanding questions remain we are happy to engage with them until the end of the rebuttal period.
The authors propose a flow-based generative model based on a GNN that makes sample likelihood computations more efficient, using a novel non-backtracking graph neural network (NoBGNN) enforcing a block-diagonal Jacobian structure. They showed speedups on two Lennard-Jones particle systems. Reviewers found the problem to be important and the experimental results significant. Most questions/concerns were resolved during discussion, after which all reviewers leaned towards acceptance. One major point of discussion was the reporting of ESS metrics including outliers (KaKL), which I remind the authors to incorporate into the paper.