FragFM: Hierarchical Framework for Efficient Molecule Generation via Fragment-Level Discrete Flow Matching
We introduce FragFM, a novel hierarchical framework employing fragment‐level discrete flow matching for efficient molecular graph generation, along with a new molecular generative benchmark focused on natural products.
Abstract
Reviews and Discussion
The paper proposes FragFM, a fragment-based generative model for 2D molecule generation. The framework combines a fine-level autoencoder that defines a fragment-level latent space with a coarse-level flow matching model that generates fragment-level graphs in that latent space. In particular, the paper proposes a novel sampling strategy, the stochastic fragment bag, to encourage better controllability under guidance. Experiments on MOSES and GuacaMol show clear advantages of FragFM on FCD and Scaf over the baselines. The authors further collect a new benchmark to evaluate function-driven metrics on natural products, which also demonstrates the superiority of FragFM.
Strengths and Weaknesses
Strengths:
- It is reasonable to decouple fragment-level generation and atom-level reconstruction into two stages.
- It is quite novel to use bags of fragments as dynamic vocabularies during generation; the concept is exciting.
- The experiments are solid, with extensive in silico benchmarks, including conventional ones and newly proposed ones.
Weaknesses:
- An ablation on the size of the bag (N) should be conducted, since there is a gap between training and inference. The authors also confirm this gap in the paper: only when N is large enough is the estimate of the true distribution unbiased, so I think it is necessary to further explore how this gap affects generation performance.
- It is better to explain how the autoencoder is constructed. Currently many details are missing in the main text. For example, what is the full size of the fragment library? Does each inter-fragment connection include only one bond, or are multiple bonds also allowed? And what about the bond types (single, double, triple, etc.)?
Questions
- Is the number of fragments in the bag (N) a fixed value across the entire training and testing processes, or is it variable? If it is fixed, what happens if we change N at inference time? Will it be out-of-distribution?
- Why does the curve for QED = 0.8 deviate from the curves for other target values in Figure 3? Are there any reasons?
Limitations
Yes
Final Justification
Before the rebuttal I was actually somewhat confused by the unclear presentation of the methods and the experiments, but I still gave my score based on some intuitive assumptions. The authors provided adequate responses that resolved these confusions, and I think my score remains fair.
Formatting Issues
No major formatting issues are found.
[Common response to all reviewers]
We thank the reviewers for their feedback and their helpful suggestions on clarity and presentation. We will revise the last part of the introduction to include the core contributions of the paper. For clarity, we re-emphasize our core contributions below:
- The first general fragment-level flow matching. We embed chemically meaningful substructures into the flow-matching framework for molecular‐graph generation, enabling efficient sampling of complex molecules and stronger property control.
- Similar or superior performance compared to previous SOTA models on general molecular generative benchmarks.
- Efficient sampling at very few steps, maintaining high generation quality with 10 steps (≈95% validity, FCD < 1.0 on MOSES) and outperforming atom-level baselines even when they use 500 steps (Fig. 2, Tab. 8).
- Better conditional generation: with classifier guidance and fragment-bag reweighting, FragFM shows better conditioning ability across both simple and complex properties (Figs. 3-4, App. E.5), while preserving validity under strong conditioning.
- Resolving the key challenges in a fragment-based generative model. To make the approach practical we must (i) avoid the prohibitive cost of enumerating a very large fragment library, (ii) generalize to fragments that never appear in training, and (iii) reconstruct atom-level graphs faithfully.
- Methodological novelty. We tackle these challenges with (i) a stochastic bag strategy that provides an unbiased, low-cost surrogate for the full library, (ii) a GNN-based fragment encoder that extrapolates to unseen fragments, and (iii) a coarse-to-fine auto-encoder that converts fragment graphs back to atom-level structures with high fidelity.
IMPORTANT NOTE: Because of character limits, we list only the citations that were added during the review. New references are numbered consecutively after the final citation of the main manuscript, while any source already cited in the paper retains its original reference number.
We sincerely appreciate the reviewer’s time and careful evaluation of our work. The reviewer’s thoughtful questions and concrete suggestions on bag-size ablations and autoencoder details were invaluable and led to substantial improvements in the manuscript.
[W1,Q1] Detailed analysis regarding the fragment bag size
Before we introduce the ablation study, we briefly summarize the theoretical background of importance sampling and the InfoNCE framework. During training, negative candidates are drawn from the unconditional fragment frequency $p_1(x)$, whereas the positive fragment follows the conditional density $p_{1|t}(x \mid X_t)$. Let

$$\hat{\mu}_N = \frac{\sum_{i=1}^{N} w_i\, h\!\left(x^{(i)}\right)}{\sum_{i=1}^{N} w_i}, \qquad w_i = f_\theta\!\left(x^{(i)}; X_t, t\right) = \frac{p_{1|t}\!\left(x^{(i)} \mid X_t\right)}{p_1\!\left(x^{(i)}\right)}, \qquad x^{(i)} \sim p_1,$$

where $h$ is a test function. $\hat{\mu}_N$ is a self-normalized importance-sampling (SNIS) estimator of $\mu = \mathbb{E}_{x \sim p_{1|t}(\cdot \mid X_t)}[h(x)]$.

Standard results (see Chapter 9 of [84]) give

$$\mathbb{E}[\hat{\mu}_N] - \mu = O(1/N), \qquad \mathrm{Var}[\hat{\mu}_N] = O(1/N).$$

In our setting, $N$ refers to the size of the sampling fragment bag ($N$ at inference, $N_{\mathrm{train}}$ during training). Hence inference performance improves monotonically as the candidate-set size grows, converging to the exact value in the limit (see "Effect of the inference-time bag size" below).
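To make the estimator concrete, here is a minimal numerical sketch of SNIS (our illustration, not the paper's code); the Gaussian densities and the test function are assumptions chosen only to make the $O(1/N)$ convergence visible:

```python
# Toy SNIS demo: estimate E_{x ~ p_cond}[h(x)] from a bag drawn i.i.d. from p_uncond.
# Here p_uncond = N(0,1), p_cond = N(1,1), so the density ratio is exp(x - 0.5)
# and the true value of E_{p_cond}[x] is 1.
import numpy as np

rng = np.random.default_rng(0)

def snis_estimate(bag, density_ratio, h):
    w = density_ratio(bag)                       # plays the role of f_theta
    return np.sum(w * h(bag)) / np.sum(w)        # self-normalized estimator

ratio = lambda x: np.exp(x - 0.5)                # N(1,1) / N(0,1) density ratio
for N in (16, 128, 1024, 8192):
    bag = rng.normal(0.0, 1.0, size=N)           # candidates drawn from p_uncond
    print(N, snis_estimate(bag, ratio, lambda x: x))  # -> 1 as N grows
```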
InfoNCE uses the same bag construction with $N_{\mathrm{train}}$ in the training phase. A larger $N_{\mathrm{train}}$ tightens the mutual-information lower bound, but the optimal network simply estimates the density ratio $f_\theta(x_t; X_t, t) = p_{1|t}(x_t \mid X_t)/p_1(x_t)$, which is independent of $N_{\mathrm{train}}$ [44]. In practice, $N_{\mathrm{train}}$ mainly controls gradient variance.
In the previous experiments, we set $N_{\mathrm{train}} = N = 384$, guided by preliminary sweeps and theory, as a training- and inference-time cost-accuracy trade-off (see Reviewer A3wR Q2). Thanks to the insightful review, we additionally ran extensive experiments varying the training- and inference-time bag sizes to systematically analyze their effects; the results are displayed below. We will add these results and the theoretical analysis to the manuscript.
For the ablation on bag sizes, we ran three experiments:
- varying $N$ with fixed $N_{\mathrm{train}}$ to study inference-time bag-size effects
- varying $N_{\mathrm{train}}$ with fixed $N$ to study training-time bag-size effects
- varying both with $N_{\mathrm{train}} = N$ to study the overall effect of the fragment-bag size (please refer to Reviewer A3wR Q2)
Effect of the inference-time bag size. Returning to the reviewer's original question, we re-ran the MOSES benchmark with the same trained model ($N_{\mathrm{train}} = 384$) while varying the inference bag size $N$, to verify that training and inference bag sizes can differ safely. As predicted, all metrics stabilize once $N \ge N_{\mathrm{train}}$ and degrade gracefully for smaller bags. These results align well with the theoretical analysis.
(To isolate the effect of bag size, these generations exclude the detailed-balance term. $N$ is the bag size during generation; $N_{\mathrm{train}} = 384$ is the bag size during training.)
| $N/N_{\mathrm{train}}$ | Validity | Uniqueness | Novelty | Filters | FCD | SNN | Scaf |
|---|---|---|---|---|---|---|---|
| 0.125 | 93.7 | 99.6 | 97.6 | 95.0 | 2.54 | 0.47 | 14.6 |
| 0.25 | 96.6 | 99.9 | 96.1 | 96.6 | 1.52 | 0.50 | 13.0 |
| 0.5 | 98.6 | 100.0 | 93.3 | 97.9 | 0.86 | 0.52 | 13.2 |
| 1 | 99.5 | 100.0 | 90.6 | 98.5 | 0.58 | 0.54 | 11.8 |
| 2 | 99.6 | 100.0 | 88.6 | 98.6 | 0.60 | 0.55 | 11.0 |
| 4 | 99.6 | 100.0 | 88.8 | 98.8 | 0.58 | 0.54 | 11.7 |
| 8 | 99.7 | 100.0 | 88.9 | 98.6 | 0.58 | 0.54 | 11.4 |
Effect of the training-time bag size. To isolate training-time effects, we additionally trained independent models with different $N_{\mathrm{train}}$ on the MOSES benchmark. During inference, we used the same sufficiently large bag size $N$ for all models. The results are stable across $N_{\mathrm{train}}$: validity > 99.5%, FCD = 0.58-0.65, and Filters ≈ 98.5%, suggesting that all models trained with $N_{\mathrm{train}} \ge 16$ estimate the density ratio robustly, largely independent of $N_{\mathrm{train}}$.
(To isolate the effect of bag size, these generations exclude the detailed-balance term. $N_{\mathrm{train}}$ is the bag size during training.)
| $N_{\mathrm{train}}$ | Validity | Uniqueness | Novelty | Filters | FCD | SNN | Scaf |
|---|---|---|---|---|---|---|---|
| 16 | 99.6 | 100.0 | 93.2 | 98.2 | 0.61 | 0.53 | 16.2 |
| 32 | 99.6 | 100.0 | 92.7 | 98.3 | 0.61 | 0.53 | 12.1 |
| 64 | 99.6 | 99.9 | 91.5 | 98.3 | 0.65 | 0.54 | 14.5 |
| 128 | 99.6 | 100.0 | 91.4 | 98.4 | 0.65 | 0.54 | 12.1 |
| 256 | 99.7 | 100.0 | 89.9 | 98.6 | 0.62 | 0.54 | 12.5 |
| 384 | 99.6 | 100.0 | 88.8 | 98.8 | 0.58 | 0.54 | 11.7 |
| 512 | 99.6 | 100.0 | 90.7 | 98.3 | 0.61 | 0.54 | 11.0 |
[84] Owen, Art B. “Monte Carlo theory, methods and examples.” (2013): 19-22.
[W2] More information regarding coarse-to-fine autoencoder and fragment library
For the implementation details of the coarse-to-fine autoencoder, the full size of the fragment library, and the detailed fragmentation strategy, please refer to our reply to Reviewer A3wR's [W1,Q1]. As many reviewers pointed out the limited explanation of the coarse-to-fine autoencoder in the main text, we will revise the manuscript to include more information about it.
[Q2] Why does the curve for QED = 0.8 deviate from the curves for other target values in Figure 3?
The deviation stems from the dataset prior. On MOSES, the QED distribution is peaked around 0.8 (Figure 9c). Targeting QED = 0.8 therefore requires a much smaller shift from the training data, so classifier guidance attains lower MAE at lower FCD for this target value. Similar trends appear for logP and number-of-rings conditioning (Figures 12 and 13).
I thank the authors for the responses, which helped me better understand the paper! I have no other questions, and I hope the authors can address the concerns of the other reviewers.
Dear reviewer weUv,
Thank you for the thoughtful and encouraging review. We appreciate your recognition of the two-stage design, the stochastic fragment bag idea, and extensive experiments, including the new NPGen benchmark. Your feedback helped us sharpen the presentation further.
We’re happy to hear that our rebuttal effectively addressed your concerns, and we sincerely appreciate your support for our work.
If you have any additional questions or suggestions, please don’t hesitate to let us know.
Respectfully, FragFM authors.
Proposes a two-stage molecular graph generative model named FragFM. The proposed model first generates a fragment-level graph using discrete flow matching and then reconstructs it into an atom-level graph. In addition to standard benchmark datasets MOSES and Guacamol, this work introduces a new dataset made of chemical compounds found in natural products.
Strengths and Weaknesses
Strengths
- The motivation of the paper is clear.
Weaknesses
- Reading this paper is not easy. Finding implementation details of this method is very cumbersome. What is the size of the fragment library? What is the size of fragment bag? How does the bag size affect the quality of generated molecules?
- Recent relevant work on fragment-based molecule generation models such as SAFE-GPT [1] and GenMol [3] has not been mentioned. Sc2Mol [2] is a two-stage generative model that first creates a scaffold and then enriches the scaffold with atom and bond types. I believe these methods should be used as baselines, as they perform better than JT-VAE.
Questions
- Make the paper more readable. Move some of the equations to the supplementary section and include information that will be useful for reproducing the results (e.g., details of the encoder, size of the fragment bag).
- It would be useful if an ablation study with different fragment bag sizes could be included in the main text.
- Include a section on existing fragment-based molecule generation methods and contrast with them. Use relevant methods as baselines in the experiments.
Limitations
yes
Final Justification
Since the authors have addressed all my concerns to a satisfactory level, I would like to increase my review score. The authors have improved the readability of the paper by adding details of the models and training procedures. They have performed additional experiments to compare against relevant related work such as SAFE-GPT and Sc2Mol.
Formatting Issues
None
We would like to thank the reviewer for their thoughtful and constructive review; their perspective significantly strengthened the clarity and impact of our work.
We include a common response to all reviewers at the beginning of our response to Reviewer weUv, re-emphasizing our core contributions.
[W1,Q1] Reading this paper is not easy
We sincerely thank the reviewer for their constructive feedback on improving the clarity and reproducibility of our work. We agree that some implementation details were missing from the main text. To address this, we have made substantial revisions, which we will include in the camera-ready version.
We outline the specific changes below, corresponding directly to the reviewer's points.
1. Implementation Details and Hyperparameters
We apologize for omitting the fragment bag size and fragment library size. We will explicitly state all key hyperparameters.
- Fragment Bag Size ($N$): We will clarify in the text that the fragment bag size for sampling is set to 384.
- Fragment Library Size: We will include the following table of fragment library sizes. Note that full utilization of such extensive training/test fragment libraries has not been explored before; previous fragment-based approaches used only the 100 to 200 most frequently occurring fragments.

| | MOSES | GuacaMol | NPGen |
|---|---|---|---|
| Total | 44,083 | 223,186 | 133,823 |
| Training | 27,760 | 145,971 | 84,731 |
| Valid | 2,248 | 7,473 | 3,709 |
| Test | 2,541 | 22,508 | 11,712 |

- Fragmentation and Bond Types: We will state clearly that our fragmentation strategy is based on BRICS rules [38] and, importantly, that our fragment library and inter-fragment connections exclusively use single bonds. This design choice simplifies the generative process.
2. Methodological Clarity: Coarse-to-Fine Autoencoder
We agree with the reviewer that the original description of our autoencoder was insufficient. We have rewritten this section to be more self-contained and clear. The revised text below now provides detailed information on:
- The model's training strategy.
- The specific fragmentation rules and resulting library.
- The decoding process, explaining how the latent vector, fragment-level graph and the Blossom algorithm are integrated to reconstruct the final atom-level graph.
Revised version of 3.2 coarse-to-fine autoencoder
While a fragment-level graph offers a higher-level abstraction of molecular structures, it also introduces ambiguity in reconstructing atomic connections. Specifically, a single fragment-level connectivity can map to multiple distinct, valid atom-level configurations. To achieve accurate end-to-end molecular generation, it is therefore crucial to preserve atom-level connectivity when forming the fragment-level representation. Drawing on hierarchical generative frameworks [40,41,42], we employ a coarse-to-fine autoencoder. The encoder first decomposes an atom-level graph into a set of predefined fragments using BRICS rules, where we consider only the single-bond-based fragmentation rules. It then compresses the precise inter-fragment atomic connectivity information, which is lost during fragmentation, into a single continuous latent vector $z$. Consequently, the decoder's task is to reconstruct the original molecule by predicting the atom-level edges between fragments, guided by both the fragment-level graph and $z$.
We train our autoencoder using a combined loss function designed for high-fidelity reconstruction and a well-behaved latent space. The primary objective is a reconstruction loss that ensures the predicted atomic links between fragments faithfully restore the original molecular structure. We supplement this with a minimal Kullback-Leibler (KL) divergence penalty on the latent variable. Finally, to translate the decoder's output into a chemically valid molecular graph, we employ the Blossom algorithm [43]. In detail, the decoder predicts a matrix of probabilities for all potential atomic connections between the "junction atoms" of adjacent fragments. We interpret this as a maximum weighted matching problem, where the edge weights are the log-probabilities predicted by our model. The Blossom algorithm then efficiently determines the optimal set of one-to-one atomic connections that maximizes the joint probability, guaranteeing a valid and accurate reconstruction of the final molecule.
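For concreteness, here is a minimal sketch of the matching step (our illustration under assumed shapes, not the actual implementation), using the Blossom-based maximum-weight matching available in `networkx`:

```python
# Sketch: choose one-to-one junction-atom connections that maximize the sum of
# predicted log-probabilities, via maximum-weight matching (Blossom algorithm).
import numpy as np
import networkx as nx

def match_junctions(log_probs: np.ndarray):
    """log_probs[i, j]: decoder's log-probability of bonding junction atom i of
    one fragment to junction atom j of the adjacent fragment (assumed layout)."""
    n, m = log_probs.shape
    g = nx.Graph()
    for i in range(n):
        for j in range(m):
            # Offset column indices so the two junction sets are disjoint nodes.
            g.add_edge(i, n + j, weight=float(log_probs[i, j]))
    matching = nx.max_weight_matching(g, maxcardinality=True)
    # Return (row junction, column junction) index pairs.
    return sorted((min(a, b), max(a, b) - n) for a, b in matching)
```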
These comprehensive updates will be integrated into the final manuscript. We are confident that these revisions directly address the reviewers’ concerns and make our work significantly more accessible and understandable to the NeurIPS readers.
[W2,Q3] Mentioning more related works and new baselines
We thank the reviewer for their constructive feedback and for highlighting these relevant works. A key distinction is that the suggested papers focus on fragment-based language models, whereas our work proposes a fragment-based graph model. This fundamental difference in data modality (sequential strings vs. explicit graphs) guided our initial set of baselines.
In order to conduct a fair additional benchmarking analysis in terms of training data, we re-trained these models on the MOSES benchmark. GenMol was excluded from the comparison because its training code is not provided. We trained Sc2Mol for 25 epochs with the provided default settings and trained SAFE-GPT-20M, a 20-million-parameter model, following the detailed hyperparameter settings in Section 4.1.5 of the original paper [83]. Additionally, the validity-correction module of Sc2Mol, which explicitly fixes the validity of generated SMILES, is not publicly available; therefore, we report Sc2Mol results without validity correction. After generating 25,000 molecules per model with three different seeds, we applied the same post-processing strategy to all molecules.
The results indicate that these methods have limitations in capturing the target data distribution. Except for novelty, for which we provide additional insight in Q1 of the EQJM reviewer's comments, FragFM outperformed all other fragment-based language models.
| Model | Validity | Uniqueness | Novelty | Filters | FCD | SNN | Scaf |
|---|---|---|---|---|---|---|---|
| Sc2Mol (w/o VC) | 59.3 | 100.0 | 98.6 | 92.1 | 6.81 | 0.44 | 9.5 |
| SAFE-GPT-20M | 98.1 | 100.0 | 90.9 | 98.2 | 0.71 | 0.54 | 9.8 |
| FragFM | 99.8 | 100.0 | 87.1 | 99.1 | 0.58 | 0.56 | 10.9 |
Given these results, we believe our method offers a distinct and effective alternative in the landscape of fragment-based generation.
Furthermore, we will add a discussion of these methods to our related work section.
Related Work
Recently, several fragment-based molecular generation models that use sequence-based representations have also been developed. For instance, SAFE-GPT [83] uses a transformer architecture on the SAFE (Sequential Attachment-based Fragment Embedding) representation, which linearizes molecular graphs into sequences of fragments. Similarly, GenMol [84] builds upon the SAFE representation but employs a discrete diffusion model for generation. Another approach, Sc2Mol [85], utilizes a two-stage process where a variational autoencoder first generates a carbon scaffold, which is then decorated with specific atoms and bonds by a transformer model. While these methods have shown success in generating valid molecules, they are conceptually different from our approach, which operates directly on graph representations rather than linear sequences.
[83] Noutahi, Emmanuel, et al. "Gotta be SAFE: a new framework for molecular design." Digital Discovery 3.4 (2024): 796-804.
[84] Lee, Seul, et al. "Genmol: A drug discovery generalist with discrete diffusion." arXiv preprint arXiv:2501.06158 (2025).
[85] Liao, Zhirui, et al. "Sc2Mol: a scaffold-based two-step molecule generator with variational autoencoder and transformer." Bioinformatics 39.1 (2023): btac814.
[Q2] Fragment bag size-performance evaluation
Because the fragment-bag size governs our Monte Carlo approximation of the full fragment library, increasing it lowers the variance (improving generation fidelity) at the expense of compute. We therefore ran an ablation study on MOSES with $N_{\mathrm{train}} = N$ ranging from 16 to 512, and will include it in the manuscript. To isolate the effect of bag size, these generations exclude the detailed-balance term. As $N$ increases, the metrics improve—higher Validity/Filters/SNN and lower FCD—and then converge, as expected. We also provide theoretical analysis along with additional results on varying train/test bag sizes in our reply to Reviewer weUv's [W1,Q1].
| $N$ (training and inference) | Validity | Uniqueness | Novelty | Filters | FCD | SNN | Scaf |
|---|---|---|---|---|---|---|---|
| 16 | 96.9 | 99.7 | 95.8 | 97.1 | 1.27 | 0.51 | 12.9 |
| 32 | 97.9 | 99.8 | 94.5 | 97.9 | 0.81 | 0.53 | 12.1 |
| 64 | 99.1 | 99.9 | 92.6 | 98.3 | 0.66 | 0.53 | 13.1 |
| 128 | 99.3 | 99.9 | 92.7 | 98.1 | 0.65 | 0.53 | 12.5 |
| 256 | 99.6 | 100.0 | 90.4 | 98.5 | 0.61 | 0.54 | 13.0 |
| 384 | 99.5 | 100.0 | 90.6 | 98.5 | 0.58 | 0.54 | 11.8 |
| 512 | 99.5 | 100.0 | 91.1 | 98.4 | 0.61 | 0.54 | 12.1 |
I thank the authors for taking the time to write a comprehensive response addressing all the concerns. I appreciate the extra effort authors have put into comparing with related models. Given that the paper would be updated according to the responses, I would like to increase my review score.
Question: In the table comparing performance of related work, the model Sc2Mol (w/o VC) generated molecules with a novelty score of 98.6, which is higher than SAFE-GPT and FragFM. However, the novelty of 90.9 for SAFE-GPT is made bold. Is there any reason for Sc2Mol to have such a high novelty score? What is the percentage of novel and valid molecules?
Dear Reviewer A3wR,
We thank the reviewer for recognizing our new experiments. Your feedback helped us broaden the related-work discussion and make the paper more persuasive. We also appreciate the initial comments on fragment-bag size, an important aspect of our method; they prompted us to provide additional analysis and greatly improved the manuscript's quality.
We apologize for misrepresenting the result. As we cannot edit the initial rebuttal now, we provide the corrected benchmark table below.
| Model | Validity (%) | Uniqueness (%) | Novelty (%) | Filters | FCD | SNN | Scaf |
|---|---|---|---|---|---|---|---|
| Sc2Mol (w/o VC) | 59.3 | 100.0 | 98.6 | 92.1 | 6.81 | 0.44 | 9.5 |
| SAFE-GPT-20M | 98.1 | 100.0 | 90.9 | 98.2 | 0.71 | 0.54 | 9.8 |
| FragFM | 99.8 | 100.0 | 87.1 | 99.1 | 0.58 | 0.56 | 10.9 |
We attribute the high novelty of Sc2Mol to its two-step approach of generating an initial scaffold with a VAE and then refining it to atom-level resolution. In addition, we carefully note that novelty is often anti-correlated with distributional fidelity, as we mentioned in Reviewer EQJM [Q1].
In the MOSES benchmark [18] (official benchmarking implementation repository; Appendix C.1), novelty is computed as the percentage of novel molecules among the valid and unique ones. The percentage of overall "novel and valid" molecules among all generated molecules can therefore be computed by multiplying the validity, uniqueness, and novelty in the table, resulting in 58.7% (since uniqueness is 100%).
Respectfully, FragFM authors.
Thank you for the correction and the explanation on novelty.
This paper introduces a novel framework for molecular graph generation. It uses discrete flow matching to generate fragment-based graphs instead of at an atomic level, and a coarse-to-fine autoencoder reconstructs the atomic graph from the fragment graph. A Natural Product Generation benchmark is also introduced to evaluate the generative model's performance on generating natural products.
Strengths and Weaknesses
Strengths
- The fragment-based approach allows the model to generate molecules based on existing subparts, reducing the possibility of the model generating implausible or non-synthesizable molecules.
- While many existing benchmarks for drug-like molecules exist, datasets of natural molecules are less explored. The NPGen addresses the unmet needs and opens new venues for evaluating the bio-compatibility of molecular generative models.
Weaknesses
- Even though FragFM outperforms baselines in both Standard Molecular Generation Benchmarks and Natural Product Molecule Generation Benchmarks, it seems the difference is still marginal for many metrics.
Questions
- It seems FragFM lags in novelty in Table 1. Could the authors provide more insights into this result?
- Could the authors provide more details on how the Kullback-Leibler (KL) divergence of the distributions of NP-likeness scores is calculated?
- Could the authors explain why they chose to randomly partition NPGen into training, validation, and test subsets under the assumption of i.i.d. sampling in Appendix D.1, instead of using partitioning methods like similarity splits?
Limitations
Yes.
Final Justification
The authors have addressed all my questions. The quality of the work meets the standard for a NeurIPS acceptance. However, I still believe the performance boost is a bit marginal, and further evaluation could be conducted. Thus, I will retain my score as it is.
Formatting Issues
NA
We are very grateful to the reviewer for the careful and detailed feedback. We appreciate the time invested, and comments have been instrumental in refining our manuscript.
We include a common response to all reviewers at the beginning of our response to Reviewer weUv, re-emphasizing our core contributions.
[W1] FragFM outperforms the baselines, but the gains are marginal
We thank the reviewer for pointing this out. We agree that some metrics on MOSES/GuacaMol appear close. However, we would like to recap that on MOSES (Tab. 1) and GuacaMol (Tab. 4), FragFM attains an FCD of 0.58 on MOSES and an overall FCD score of 85.8 on GuacaMol, substantially outperforming baselines (FCD 1.00-1.95; GuacaMol 0.9-73.8). To our knowledge, it is the first flow- or diffusion-based model to surpass autoregressive models on MOSES, while also achieving state-of-the-art Filters and SNN with nearly perfect validity. However, as the reviewer mentioned, the performance gap is not large, reflecting the saturation of these molecular generative benchmarks, where many methods already perform at near-optimal levels.
This saturation of performance is why we introduced the more challenging molecular generative benchmark, NPGen, which comprises diverse natural product molecules with high complexity. In this benchmark, the performance gap between FragFM and other models is far more significant.
For the evaluation of functional distributions (NPScore and NPClassifier pathway/superclass/class), FragFM yields remarkably lower divergence compared to the previous methods:
- vs. JT-VAE: 14.54-fold (NPScore), 5.38-fold (Pathway), 8.70-fold (Superclass), and 7.18-fold (Class) better results
- vs. DiGress: 5.23-fold (NPScore), 1.17-fold (Pathway), 2.27-fold (Superclass), and 2.89-fold (Class) better results
For FCD, FragFM achieved 3.04-fold and 1.53-fold better results compared to JT-VAE and DiGress, respectively. Beyond the benchmark numbers, we emphasize that FragFM-generated molecules avoid the chemically implausible motifs and the overly simple molecules often observed in baseline outputs (App. F.3, Fig. 20-24).
To strengthen the evaluation, we include ZINC250K, another commonly used molecular generation benchmark. NSPDK (neighborhood subgraph pairwise distance kernel) evaluates the similarity of molecules by comparing pairs of neighborhood subgraphs in their graph structures. With 500 steps, FragFM achieves state-of-the-art performance on all metrics; in particular, NSPDK and FCD show 5.00-fold and 2.31-fold better results.
| Method | Val. w/o corr. | NSPDK | FCD | Step |
|---|---|---|---|---|
| Training Set | - | 0.0001 | 0.062 | – |
| GraphAF | 67.92 | 0.0432 | 16.128 | – |
| GraphDF | 89.72 | 0.1737 | 33.899 | – |
| GDSS | 97.12 | 0.0192 | 14.032 | 1000 |
| GSDM | 92.57 | 0.0168 | 12.435 | 1000 |
| DiGress | 94.98 | 0.0021 | 3.482 | 500 |
| GGFlow | 99.63 | 0.0010 | 1.455 | 500 |
| FragFM | 99.81 | 0.0002 | 0.630 | 500 |
[Q1] Insights into FragFM's lower novelty in benchmark
We appreciate the reviewer’s point: while FragFM attains high validity and strong distributional metrics (e.g., FCD/Filters/SNN), its novelty is lower than some baselines. In small-molecule benchmarks, novelty is often anti-correlated with distributional fidelity [81, 82]. FragFM shows a similar trend: high validity and low FCD, with lower novelty.
Furthermore, we would like to note that higher novelty does not necessarily imply chemically meaningful structures. As illustrated in the first paragraph of App. F.2 and Fig. 20-24, baselines that report higher novelty sometimes produce chemically implausible motifs.
However, we found that it is possible to modulate the sampling algorithm to generate more unseen molecules, fully utilizing the flexibility of the stochastic bag strategy. We added experiments that modulate sampling by introducing temperature factors into the Boltzmann probabilities. Specifically, two temperature factors are used (denoted $T_1$ and $T_2$ in the table below):
- $T_1$: reweighting each fragment's logit within the in-bag transition kernel (Eq. 5)
- $T_2$: reweighting fragments when constructing the fragment bag at each Euler step (Eq. 34)
A higher temperature makes the sampling probabilities more uniform. As shown in the following table, higher temperature factors reliably increase novelty while trading off against FCD and validity (a code sketch of this reweighting follows the table).
| $T_1$ | $T_2$ | Valid | Unique | Novel | Filters | FCD | SNN | Scaf |
|---|---|---|---|---|---|---|---|---|
| 1.0 | 1.0 | 99.8 | 100.0 | 87.1 | 99.1 | 0.58 | 0.56 | 10.9 |
| 1.0 | 1.5 | 99.2 | 100.0 | 94.2 | 98.3 | 0.90 | 0.52 | 13.5 |
| 1.5 | 1.0 | 99.7 | 100.0 | 88.6 | 98.8 | 0.88 | 0.54 | 11.0 |
| 1.5 | 1.5 | 99.2 | 100.0 | 94.5 | 98.3 | 0.91 | 0.51 | 13.1 |
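For reference, a minimal sketch of the temperature reweighting applied to Boltzmann probabilities (illustrative code; the function and variable names are our assumptions, not the trained model's internals):

```python
import numpy as np

def temperature_probs(logits: np.ndarray, T: float) -> np.ndarray:
    """Softmax over fragment logits at temperature T; T > 1 flattens the
    distribution toward uniform, increasing the chance of rarer fragments."""
    z = logits / T
    z = z - z.max()            # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()
```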
[81] Mahmood, Omar, et al. "Masked graph modeling for molecule generation." Nature communications 12.1 (2021): 3156.
[82] Geng, Zijie, et al. "De novo molecular generation via connection-aware motif mining." arXiv preprint arXiv:2302.01129 (2023).
[Q2] KL divergence calculation on NP-likeness scores
The Kullback-Leibler (KL) divergence for the continuous NP-likeness scores is calculated using a standard non-parametric method following previous benchmarks (MOSES, GuacaMol).
The procedure is as follows:
- Score Calculation: First, we compute the NP-likeness score for every molecule in the reference test set and our generated set, resulting in two one-dimensional arrays of continuous values.
- Distribution Estimation with KDE: Since the scores are continuous, we cannot use a simple histogram-based KL divergence. Instead, we model the underlying probability distribution of each set of scores using Gaussian Kernel Density Estimation (KDE). This technique creates a smooth, non-parametric probability density function (PDF) for both the reference distribution and the generated distribution.
- Numerical Approximation: To compute the KL divergence integral numerically, we first define a common evaluation range that spans the minimum and maximum scores observed across both sets. We then evaluate both PDFs at 1,000 equidistant points along this range to obtain two discrete probability vectors. To ensure numerical stability, a small value (1e-10) is added to the probabilities.
- KL Divergence Calculation: Finally, we compute the KL divergence between these two discrete probability vectors using the `entropy` function from `scipy.stats` (see the sketch below).
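For reproducibility, the procedure above can be sketched as follows (a minimal illustration assuming `ref_scores` and `gen_scores` are 1-D NumPy arrays of NP-likeness scores; not our exact evaluation code):

```python
import numpy as np
from scipy.stats import gaussian_kde, entropy

def kl_divergence_kde(ref_scores, gen_scores, n_points=1000, eps=1e-10):
    # Step 2: smooth, non-parametric density estimate for each score set.
    ref_kde = gaussian_kde(ref_scores)
    gen_kde = gaussian_kde(gen_scores)
    # Step 3: common evaluation range spanning both sets, 1,000 equidistant points.
    lo = min(ref_scores.min(), gen_scores.min())
    hi = max(ref_scores.max(), gen_scores.max())
    xs = np.linspace(lo, hi, n_points)
    p = ref_kde(xs) + eps      # small constant for numerical stability
    q = gen_kde(xs) + eps
    # Step 4: scipy.stats.entropy normalizes p and q and returns KL(p || q).
    return entropy(p, q)
```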
[Q3] Dataset split on NPGen benchmark
We thank the reviewer for this important question about data splitting strategy of NPGen benchmark. We chose a random split rather than a similarity-based one because it is essential for NPGen to serve as a benchmark for distribution learning. In other words, the primary purpose of the NPGen benchmark is to evaluate how well a model can learn the complex, underlying distribution of natural products. A fundamental requirement for this is that the training and test sets are i.i.d. (independent and identically distributed). Our evaluation metrics, such as the KL divergence of pathway, superclass, and class distributions, measure how faithfully a model's generated output matches the training data's distribution.
Similarity-based split strategies (e.g., scaffold splits) can violate the i.i.d. assumption. This is particularly problematic when curating natural-product (NP) datasets and evaluating functionality-driven distributions, as a molecule's functionality is often related to its structure. This creates training and test sets with fundamentally different distributions, making the benchmark unsuitable for evaluating distribution learning.
To quantify this, we analyzed the distribution of biosynthetic pathways (as predicted by NPClassifier [56]) after performing both a random split and a scaffold split on NPGen. The results clearly show that a scaffold split introduces a severe distributional bias:
| Pathway | Random Split (Train) | Random Split (Test) | Scaffold Split (Train) | Scaffold Split (Test) | Relative Deviation, Random (%) | Relative Deviation, Scaffold (%) |
|---|---|---|---|---|---|---|
| Alkaloids | 0.219 | 0.219 | 0.212 | 0.250 | +0.2 | -17.9 |
| Amino acids and Peptides | 0.032 | 0.032 | 0.034 | 0.026 | 0.0 | +24.0 |
| Carbohydrates | 0.012 | 0.011 | 0.010 | 0.018 | +3.8 | -77.6 |
| Fatty acids | 0.111 | 0.110 | 0.131 | 0.019 | +1.1 | +85.2 |
| Polyketides | 0.078 | 0.077 | 0.076 | 0.088 | +1.4 | -15.4 |
| Shikimates and Phenylpropanoids | 0.167 | 0.167 | 0.163 | 0.184 | -0.2 | -12.6 |
| Terpenoids | 0.303 | 0.304 | 0.295 | 0.332 | -0.4 | -12.4 |
| Unclassified | 0.078 | 0.080 | 0.078 | 0.082 | -2.2 | -6.1 |
As the table shows, the random split maintains near-perfect consistency between the train and test sets (e.g., < 4% deviation for all pathways). In contrast, the scaffold split creates massive shifts. For example, Fatty acids are over-represented in the training set and severely under-represented in the test set, leading to an 85.2% relative deviation.
Employing a scaffold-split dataset could lead to a misleading assessment of a distribution learning model's performance. A high KL divergence score would be an expected artifact of the biased split itself, rather than a true reflection of the model's generative capabilities. For this reason, we used a random split to establish NPGen as a fair and suitable benchmark for measuring a model's ability to learn the distribution of natural products.
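For transparency, the relative-deviation column above is consistent with the per-class quantity (train − test) / train; a minimal sketch of the computation (illustrative names, not our exact pipeline):

```python
import pandas as pd

def pathway_deviation(train_labels: pd.Series, test_labels: pd.Series) -> pd.Series:
    """Relative deviation (%) of per-pathway frequencies: (train - test) / train."""
    p_train = train_labels.value_counts(normalize=True)
    p_test = (test_labels.value_counts(normalize=True)
              .reindex(p_train.index, fill_value=0.0))
    return 100.0 * (p_train - p_test) / p_train
```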
Dear Reviewer EQJM,
We wanted to gently remind you that the end of the author-reviewer discussion period is approaching.
We appreciate the reviewer’s feedback and their recognition of FragFM as a novel framework for molecular graph generation as well as highlighting NPGen as addressing an unmet need and opening new avenues to evaluate the bio-compatibility of molecular generative models.
We have conducted additional experiments and revised the manuscript to address your concerns:
- [W1] we re-emphasized the original results to clearly show a significant performance improvement over existing atom-based baselines on NPGen, with our performance being at least twice as good as the previous state-of-the-art (SOTA) model on most distribution-learning metrics. Furthermore, we note that on the standard molecular generative benchmarks, which are saturated, FragFM clearly approaches the optimal limit of each metric in FCD, Filters, and SNN.
- [Q1] we provided a deeper explanation of novelty and included temperature-adjusted sampling experiments to steer generation toward novel molecules.
- [Q2]–[Q3] we expanded explanations and analyses for the NPGen benchmark.
Your feedback was very helpful: it prompted us to (1) emphasize FragFM’s larger margins over baselines, (2) add a steering strategy for sampling more novel molecules, and (3) improve the readers' understanding of the NPGen benchmarks. We will integrate these clarifications into the manuscript text.
We also emphasize that we organized our core contributions in Reviewer weUv [Common response to all reviewers].
We hope the added experiments and updates address your original concerns. If you feel these revisions strengthen the paper, we’d be grateful if you could reflect that in your rating; and if anything remains unclear, we’re happy to follow up.
Respectfully, FragFM authors.
Thank you for the rebuttal. The author has addressed all my questions.
We sincerely appreciate your confirmation that our rebuttal has addressed your concerns. If you believe that the discussion has led to meaningful improvements, we would be grateful if you could take this into account when you make your final evaluation. Thank you again for your valuable feedback and time.
This work introduces a hierarchical framework for molecule generation that operates at both the atom and fragment levels. To achieve this, the authors developed two key innovations: a fragment bag strategy to efficiently explore vast fragment libraries and a coarse-to-fine autoencoder that seamlessly maps between atom-level and fragment-level molecular graphs. Furthermore, to better emulate natural chemical compounds for drug discovery applications, a new benchmark dataset, NPGen, was curated, comprising approximately 659,000 compounds. Experimental results demonstrate that the proposed model, FragFM, surpasses other state-of-the-art methods across most metrics on both standard datasets and the newly introduced NPGen benchmark.
Strengths and Weaknesses
Strengths:
- This work uses multiple datasets to validate its superiority in molecular generation, which is persuasive.
- This work investigates generation efficiency and conducts experiments on conditional generation, providing various views to evaluate the proposed method.
- This work develops a new benchmark, NPGen, contributing to the machine learning and drug development communities.

Weaknesses:
- The manuscript's clarity is impeded by a structural flaw: its core contributions are not adequately introduced in the main text. The paper emphasizes two novel approaches—a coarse-to-fine autoencoder and a fragment bag strategy—yet fails to provide them with sufficient exposition. Notably, the autoencoder is described in a cursory 12-line passage, which is insufficient for a central component of the framework. Furthermore, the fragment bag strategy lacks due prominence, as it is not even granted a dedicated section title. This lack of detail and structural emphasis hinders the reader's ability to fully assess the novelty and significance of the proposed techniques.
- A significant gap exists between the paper's proposed contributions and its experimental validation. While the manuscript presents multiple experiments, it fails to provide empirical evidence for the efficacy of its core components. For instance, the fragment bag strategy is introduced specifically to bypass the limitations of fixed fragment libraries. However, the authors omit any experiments in the main text demonstrating this strategy's ability to generate molecules containing rare or novel fragments. This omission makes it difficult to verify one of the central claims of the work.
Questions
- The authors claim in the conclusion that this work achieves lossless reconstruction of the atom-level graph. However, the validity of FragFM fails to reach 100% as JT-VAE does on both the MOSES and NPGen benchmarks. Is there any evidence showing that FragFM can achieve lossless reconstruction?
- On Line 124, the authors claim that the single continuous latent vector z encodes the committed connectivity details. How can this be proven? Is there any other information the latent vector should contain? In my opinion, the latent vector encodes the connectivity implicitly, resulting in less than 100% generation validity, whereas the Junction-Tree VAE encodes the connectivity explicitly and predicts which two atoms should be connected by an edge, achieving 100% validity.
- FragFM introduces the fragment bag strategy to explore the huge fragment library. Did the authors consider the long-tailed issues posed by rare fragments?
- For the choice of loss on Line 156, why do you choose the Info-NCE loss?
- On Line 172, I think modeling each transition of nodes and edges independently is mainly for simplicity. However, different nodes may lead to different ways of connecting (edges). What do you think about this issue?
Limitations
Yes
Final Justification
I appreciate the authors' rebuttal, which has addressed most of my concerns. At this stage, I intend to maintain my original score and would be interested in seeing the feedback from the other reviewers.
Formatting Issues
N/A
We sincerely thank the reviewer for a thorough, thoughtful review regarding the clarity of the manuscript and the experimental design.
We include a common response to all reviewers at the beginning of our response to Reviewer weUv, re-emphasizing our core contributions.
[W1] Manuscript's clarity is impeded by a structural flaw
For the implementation details about hyperparameters and the coarse-to-fine autoencoder, please refer to our reply to Reviewer A3wR's [W1,Q1]. As many reviewers pointed out the limited explanation of the coarse-to-fine autoencoder in the main text, we will revise the manuscript to include more information about it.
Furthermore, we agree that our manuscript should place greater emphasis on the stochastic fragment bag strategy, which is a core contribution to our model's efficiency and performance. To address this issue, we will revise the relevant subsections and paragraphs in Section 3 to give this strategy the prominence it deserves. In particular, as Sections 3.3 and 3.4 mainly introduce the fragment bag strategy for training and generation, we will emphasize the stochastic fragment bag strategy as follows (revisions highlighted in bold):
- In lines 102-105: Starting from the fragment graph notation in Section 3.1, we elaborate on ... the fragment-level graph based on the stochastic fragment bag strategy in Sections 3.3 and 3.4.
- In line 149: Stochastic Fragment Bag Strategy: Parameterization and Info-NCE Loss
- In line 167: 3.4 Generation Process with Stochastic Fragment Bag Strategy
[W2,Q2] A significant gap exists between the paper's proposed contributions and its experimental validation
Reviewer’s comments on weaknesses and questions
The manuscript fails to provide … of the central claims of the work.
FragFM introduces the fragment bag strategy … rare fragments?
Thank you for this thoughtful comment. Our main goal in FragFM is to show that moving from atom-wise diffusion to a fragment-level flow framework brings three practical advantages:
- Scalability to complex molecules
- Much faster sampling while preserving validity
- Stronger and more flexible property control
The experiments included in the paper were chosen to demonstrate these three points side-by-side against atom-wise baselines.
Though, we fully agree that readers will also want direct evidence that the fragment-bag strategy copes with challenges arising in a fragment-based framework, such as the long-tailed distribution of fragment types and the need to generalize to novel fragments that never appear during training. We have therefore run an additional set of analyses and will incorporate these results into the revision.
On rare/long-tail fragments. Handling the long tail (fragments that appear only rarely) is essential, especially when the number of fragments is large (e.g., NPGen has 133K fragments). In theory, our stochastic bag is an unbiased estimator of the full-fragment-vocabulary transition kernel (Eq. 7; see our reply to Reviewer weUv's [W1,Q1]), so rare fragments are not excluded by design. More specifically, to verify that rare fragments appear in the generated molecules at the same rates as in the training data, we conducted a rare-fragment recovery analysis. Defining rare fragments as those that appear only $k$ times in the data, we compare their occurrence shares between the training set and 100k generated molecules, reported per-$k$ bin (i.e., total occurrences of $k$-frequency fragment types divided by total fragment occurrences). The measured ratios closely resemble the training distribution, indicating that our bag strategy effectively generates molecules containing rare fragments.
| $k$ | MOSES Training (%) | MOSES Generated (%) | GuacaMol Training (%) | GuacaMol Generated (%) | NPGen Training (%) | NPGen Generated (%) |
|---|---|---|---|---|---|---|
| 1 | 0.225 | 0.182 | 1.522 | 1.660 | 1.508 | 1.768 |
| 2 | 0.086 | 0.071 | 0.535 | 0.565 | 0.873 | 0.902 |
| 3 | 0.063 | 0.053 | 0.361 | 0.367 | 0.551 | 0.551 |
| 4 | 0.056 | 0.048 | 0.286 | 0.278 | 0.460 | 0.452 |
| 5 | 0.046 | 0.036 | 0.236 | 0.238 | 0.330 | 0.326 |
Novel fragments. Generation with test-set fragment bags is reported in Tables 6 and 7, directly evaluating generalization to fragments unseen during training. For clarity, we also enumerate the counts: in MOSES, the test split contains 2,588 unseen and 7,210 seen fragment types (26% unseen); in GuacaMol, 22,962 unseen and 40,646 seen fragment types (36% unseen). Under these settings, FragFM maintains state-of-the-art validity and distributional metrics, demonstrating robust generalization to novel (unseen) fragments.
[Q1] Validation of autoencoder for its lossless reconstruction ability
Reviewer’s comments on questions
The authors claim that this work achieves lossless reconstruction … FragFM could achieve lossless reconstruction?
On Line 124, the authors claim that the single continuous latent vector encodes the committed connectivity details ... validity.
We sincerely thank the reviewer for the careful reading and the question. For clarity, we distinguish reconstruction accuracy from validity. Accuracy means that, given a fragment graph and its latent $z$, the decoder recovers the exact same molecule (e.g., ABCD from the fragments A*, *BC*, D*, where * indicates a possible junction). Validity only requires chemical feasibility, so DBCA would also count.
Our coarse-to-fine autoencoder uses the Blossom matching algorithm (App. 3.2) to reconstruct atom-atom connectivity from the fragment-level graph and $z$. The gap to 100% generation validity (relative to JT-VAE) mostly arises from invalid combinations of fragments (for example, A*, *BC*), which cannot be reconstructed into any atom-level graph. We further analyzed the role of $z$ and the validity of JT-VAE versus FragFM below.
On the role of the latent $z$ in autoencoder accuracy. To show that the latent $z$ encodes atom-level connectivity information well, we conducted an ablation study on $z$ for reconstruction accuracy. Our autoencoder achieves almost perfect reconstruction on all test sets, whereas decoding with a random $z$ does not. As molecules in the NPGen benchmark are larger, the accuracy discrepancy grows, empirically showing that the latent $z$ encodes nearly all of the atom-level connectivity information.
| Accuracy on Test Set | MOSES | GuacaMol | NPGen |
|---|---|---|---|
| Random | 55.2% | 46.7% | 34.4% |
| Latent vector | 99.9% | 99.4% | 97.4% |
Since reconstruction accuracy does not reach 100.0% on all datasets, we will tone down the phrase "lossless reconstruction" to "nearly lossless reconstruction".
On JT-VAE’s 100% validity compared to FragFM. JT-VAE enforces validity by design: during decoding it masks fragment choices that would violate chemical feasibility and prunes infeasible assemblies, restricting the search to valid molecules. In contrast, diffusion/flow-based models update many nodes and edges simultaneously, making it difficult to hard-code such rules; consequently, achieving literal 100% validity is inherently harder (e.g., DiGress ~85-86%, DeFoG ~92-93%; Tab. 1). Our fragment-level approach reduces this complexity and attains >99% validity on MOSES in a purely data-driven manner (no rule-based masks) while delivering stronger distributional metrics (Filters 99.1, FCD 0.58, SNN 0.56) than JT-VAE (Filters 95.0, FCD 1.00, SNN 0.54).
[Q3] Choice of Info-NCE loss
We adopt Info-NCE because it is the only objective we have found that can model the required density ratio without bias. In our method, the density ratio

$$f_\theta(x_t; X_t, t) = \frac{p_{1|t}(x_t \mid X_t)}{p_1(x_t)}$$

is modeled directly (Eq. 5) and is independent of the bag size [44]. Consequently, although the model is trained with a fixed fragment-bag size $N_{\mathrm{train}}$, it can be applied with a variable bag size during inference.

Computing a full softmax over the entire fragment library is infeasible, whereas Info-NCE restricts computation to the fragments in the fragment bag while remaining an unbiased estimator of the full objective for moderate bag sizes.
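A minimal sketch of the bag-restricted Info-NCE objective (our illustration; the tensor layout and names are assumptions): the positive fragment's logit is contrasted against the in-bag negatives via a cross-entropy over the bag.

```python
import torch
import torch.nn.functional as F

def infonce_loss(bag_logits: torch.Tensor) -> torch.Tensor:
    """bag_logits: (batch, N) unnormalized log density-ratio scores f_theta for
    each fragment in the bag, with the positive fragment placed at index 0."""
    targets = torch.zeros(bag_logits.shape[0], dtype=torch.long,
                          device=bag_logits.device)
    # Softmax over the bag only, instead of over the full fragment library.
    return F.cross_entropy(bag_logits, targets)
```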
[Q4] Node, Edge diffusion
Factoring node- and edge-state updates is standard practice in discrete diffusion/flow models [3,28,32] and has proven empirically sufficient: despite the independence assumption, our method still outperforms atom-wise baselines on MOSES and NPGen (Tab. 1, 2).
Coupling every node flip with all incident edge flips would force a non-linear trajectory in Eq. 2, leading to intractable Euler updates and invalidating the closed-form rates in Eq. 3. Ensuring graph consistency under such coupling would also require additional constraints that are infeasible. While joint node-edge transitions are an interesting direction for the future, to the best of our knowledge no tractable implementation has been demonstrated so far. We therefore retain the independent formulation, which is both computationally feasible and empirically validated.
Our fragment-based approach is inherently better suited to mitigate this dependency issue than atom-level models. In atom-level diffusion, independent node and edge updates frequently violate local valency rules, leading to chemically invalid states, as shown in benchmarks (Tab. 1,2,7). FragFM mitigates this problem by design: since each node is already a chemically coherent fragment, the most critical local dependencies are implicitly preserved within the node itself. This structural constraint is why our model achieves over 99% validity in a purely data-driven manner, directly demonstrating that the information loss from the independence assumption is minimal.
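To illustrate the retained formulation, here is a toy sketch of one factorized Euler update (our illustration; shapes and the rate construction are assumptions): each categorical variable—every node and every edge—is resampled independently from its own transition kernel.

```python
import torch
import torch.nn.functional as F

def euler_update(states: torch.Tensor, rates: torch.Tensor, dt: float) -> torch.Tensor:
    """states: (V,) current categorical states; rates: (V, K) per-variable
    transition rates whose rows sum to zero (negative outflow on the diagonal)."""
    K = rates.shape[-1]
    probs = F.one_hot(states, K).float() + dt * rates  # delta_{x_t} + dt * R_t
    probs = probs.clamp_min(0.0)
    probs = probs / probs.sum(-1, keepdim=True)        # renormalize after clamping
    return torch.multinomial(probs, 1).squeeze(-1)     # independent per variable
```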
Dear reviewer NJHQ,
We wanted to gently remind you that the discussion period is ending soon.
We appreciate the reviewer’s recognition of our experiments across multiple benchmarks, including the conditioning and sampling-efficiency studies. We also thank the reviewer for acknowledging NPGen as a useful benchmark that contributes to both the machine-learning and drug-discovery communities.
Thank you again for the detailed, specific review. We added explanations and analyses that directly address your concerns:
- [W1] clearer explanation of the method
- [W2,Q2] rare/novel fragment coverage (recovery analysis and test-bag generalization)
- [Q1] coarse-to-fine autoencoder reconstruction accuracy
- [Q3] Explanation about the choice of Info-NCE loss
- [Q4] Reasoning about independent node and edge diffusion
In addition to our initial response to [Q3], we have also included additional theory and experiments (see Reviewer weUv [W1,Q1] and Reviewer A3wR [Q2]), with a detailed study of fragment-bag size. We will add these clarifications to the main text for readability.
We also emphasize that we organized our core contributions in Reviewer weUv [Common response to all reviewers].
We acknowledge that the initial submission could have better highlighted each component individually. Since we have now added this material, we hope the revisions sufficiently address your concerns, and we kindly ask you to consider revising your evaluation of the paper.
Respectfully, FragFM authors.
The paper proposes a hierarchical framework that enables fragment-level molecule generation through discrete flow matching. The proposed framework is reasonable and well-motivated. The reviewers also find the newly proposed benchmark valuable for the community. The main shared concerns before the rebuttal centered on the clarity of the paper and insufficient experimental evaluation. The rebuttal provided many additional results and detailed clarifications, which addressed most of the concerns. However, even reviewers with a favorable opinion of the paper expressed only borderline positive views. During the reviewer discussion, Reviewer A3wR stated that despite the clarifications provided in the rebuttal, the final revision needs to go through another round of review. Reviewer EQJM also pointed out the marginal improvement, suggesting further evaluation.
The AC appreciates the authors’ effort in presenting a reasonable hierarchical framework and a new benchmark, but also shares concerns regarding its clarity. To facilitate the clear communication of ideas to the ML community, the paper needs a substantial improvement in its presentation clarity. The paper could also be strengthened to include a more thorough evaluation and/or analysis to better highlight the potential of the model. The results and descriptions from the rebuttal could be a good starting point. It would be beneficial for the revision to go through another round of thorough review before it can be published. As such, the AC recommends rejecting the paper in its current form. The AC has discussed the case with the SAC, and that the decision has been confirmed by the SAC.