PaperHub

Average rating: 4.4/10 · Decision: Rejected · 5 reviewers
Ratings: 6, 3, 5, 3, 5 (min 3, max 6, std 1.2)
Confidence: 2.8 · Correctness: 2.4 · Contribution: 1.8 · Presentation: 2.0

ICLR 2025

Exploring the Recall of Language Models: Case Study on Molecules

OpenReview · PDF
Submitted: 2024-09-28 · Updated: 2025-02-05
TL;DR

We measure the recall of language models trained on molecular datasets

Abstract

Keywords
recall, language models, molecular language models, sampling methods for language models

Reviews and Discussion

Official Review
Rating: 6

This paper identifies the challenges in evaluating the recall of generative models and introduces a recall benchmark in the domain of molecular generation. It also proposes sampling strategies and loss formulations to enhance recall.

Strengths

This paper is well-written and easy to understand, addressing a problem that has not been extensively explored before. Additionally, the paper addresses crucial research directions, such as measuring recall without generation and methods to enhance recall, presenting intriguing experimental results.

Weaknesses

Scalability of Research

The study in this paper is limited to a specific domain, namely molecular generation, and there needs to be a discussion on how this research can be extended to other domains. For example, a crucial aspect of measuring recall, as highlighted in the paper, is identifying the equivalence class of the model’s generated results. As mentioned in lines 60-62, there is a technique for identifying equivalence classes for SELFIES strings. How could this issue be addressed in other domains you mentioned in the introduction, such as “vulnerable code generation”?

Completeness in Method

In my opinion, the sections proposing the sampling strategy and loss to improve the model’s recall are crucial for establishing the novelty of your paper. However, these aspects are not fully developed and lack sufficient explanation. For instance, in the case of the recall-oriented loss function, the approach of changing the aggregation to min or max seems quite extreme to me, with significant potential for refinement. Additionally, the proposed method only showed effectiveness for a very small and underperforming model with 800K parameters. Therefore, improvements in this area are essential. Additionally, the motivation for using beam search in recall-oriented generation and the intuition behind why increasing the beam size leads to improved recall need to be more thoroughly explained.

Evaluation

Most experiments in this paper are validated using a single model and dataset, making it difficult to consider the proposed benchmark method and the approaches to improve recall as thoroughly validated. I believe there should be verification to ensure that the trends in the experimental results hold consistently across at least several models. Additionally, there are confusing aspects regarding the details of the experiments, which should be described and justified more comprehensively (see the questions section for more details).

Questions

  • In my understanding, the process you described in lines 236-237 is aimed at generating the set of every correct generation, $\mathbb{S}$, for evaluation purposes. Is this correct? Additionally, how can you ensure that the generated results accurately represent every correct generation?

  • As shown in Table 2, recall shows a correlation with the complexity of molecules, whereas precision does not. Is there a specific reason for this? I’m curious about which aspects of the recall metric lead to this outcome.

  • What is the input to the model when performing generation with an LLM for recall/precision evaluation?

  • What exactly is the purpose of the validation set mentioned in line 220, and is there a specific reason for using only 10,000 instances?

  • How does the cost (time complexity, memory, etc.) change with the beam size in 4.3?

Comment

Questions:

  • In my understanding, the process you described in lines 236-237 is aimed at generating the set of every correct generation, $\mathbb{S}$, for evaluation purposes. Is this correct? Additionally, how can you ensure that the generated results accurately represent every correct generation?

    Answer:

    Yes, that is correct. Note that in practice, there is no proven method to exhaustively generate all possible valid permutations of a given SELFIES representation. Instead, we approximate this by shuffling the atom positions of each molecule up to 1 million times and retaining the unique, valid string representations obtained from these permutations to get every possible representation of molecules which satisfy the criteria of the molecular subsets.
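    For readers who want to reproduce this kind of enumeration, here is a minimal sketch of the permutation-and-re-encode idea described above. It is not the authors' exact pipeline; it assumes RDKit and the `selfies` package are available, and the function name is illustrative:

    ```python
    import random
    from rdkit import Chem
    import selfies as sf

    def randomized_selfies(smiles: str, n_shuffles: int = 1_000_000) -> set:
        """Collect unique, valid SELFIES strings obtained by shuffling atom order."""
        mol = Chem.MolFromSmiles(smiles)
        atom_order = list(range(mol.GetNumAtoms()))
        unique = set()
        for _ in range(n_shuffles):
            random.shuffle(atom_order)
            shuffled = Chem.RenumberAtoms(mol, atom_order)
            smi = Chem.MolToSmiles(shuffled, canonical=False)  # keep the shuffled atom order
            try:
                unique.add(sf.encoder(smi))
            except sf.EncoderError:
                continue  # skip strings the SELFIES encoder cannot handle
        return unique

    # Example (aspirin, 13 heavy atoms): randomized_selfies("CC(=O)OC1=CC=CC=C1C(=O)O", 10_000)
    ```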

    With respect to completeness of the molecular subsets (irrespective of string representation), leveraging GDB-13 as our full molecular set from which the subsets of interest are derived, we have a guarantee that these subsets are exhaustive enumerations of molecules which satisfy certain criteria. Some of these criteria are inherited from GDB-13 (≤13 heavy atoms, valid valences, etc.) and the remaining criteria correspond to the constraints by which we define the subsets.

  • As shown in Table 2, recall shows a correlation with the complexity of molecules, whereas precision does not. Is there a specific reason for this? I’m curious about which aspects of the recall metric lead to this outcome.

    Answer:

    In Table 2, the recall of model generations for molecular subsets from highest to lowest is “sas”, “asp”, “d>p”, “d=p”. Thus, “sas” is the easiest set to model and “d=p” is the hardest. We see the same order of performance for precision, in Table 3. Thus, both metrics capture the same ordering of difficulty for modeling the respective molecular sets.

  • What is the input to the model when performing generation with an LLM for recall / precision evaluation?

    Answer:

    We do not provide any specific input to the model; the generation is entirely unconditional, initiated only with a start token. Each model is fine-tuned on a specific subset, and after training, we expect it to generate samples predominantly from the learned distribution of that subset.
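    For clarity, a hedged sketch of what such unconditional, BOS-only generation looks like with a Hugging Face causal LM. The model path and sampling hyperparameters are placeholders, not the authors' settings:

    ```python
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    path = "path/to/finetuned-subset-model"        # hypothetical checkpoint path
    tok = AutoTokenizer.from_pretrained(path)
    model = AutoModelForCausalLM.from_pretrained(path)

    bos = torch.tensor([[tok.bos_token_id]])       # the only input: a start token
    samples = model.generate(
        bos,
        do_sample=True,                            # i.i.d. sampling (beam search is Section 4.3)
        max_new_tokens=64,
        num_return_sequences=100,
    )
    molecules = tok.batch_decode(samples, skip_special_tokens=True)
    ```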

  • What exactly is the purpose of the validation set mentioned in line 220, and is there a specific reason for using only 10,000 instances?

    Answer:

    The validation set, mentioned in Line 220, is used to evaluate the model's performance during training, helping us monitor overfitting and adjust hyperparameters as needed. The size of 10,000 instances was chosen as it represents 1% of the 1M training set, providing a reasonable balance between computational efficiency and statistical power. More importantly, we use this small validation set and our proposed method to predict recall on the entire set of molecules which satisfy the given criteria, the results of which are displayed in Figure 3.

  • How does the cost (time complexity, memory, etc.) change with the beam size in 4.3?

    Answer:

    The time complexity of beam search scales with the beam size B, as the model computes the probabilities for each token in the vocabulary V for each beam. At each decoding step, this results in B×V probability calculations. Sorting the B×V candidates adds a log(B×V) term per candidate, and C denotes the cost of computing a single probability. Given a sequence length L, the overall time complexity is O(L · B×V · [C + log(B×V)]).

    When the beam size is increased, for instance from B=1M to B=10M, the dominant term grows roughly as B×log(B), meaning the decoding time scales by approximately a factor of 10×log(10).

    In terms of memory complexity, the space required grows linearly with beam size, i.e., O(B×L), as each beam must store its partial sequence and associated score. For sequences of length L=10 and beyond, we save the beam candidates to disk to avoid excessive memory usage, thereby reducing the in-memory footprint.

We hope we addressed all of the concerns you raised to your satisfaction. If that is the case, we would ask you to adjust the review score accordingly. We are open to more questions and feedback. Thanks again.

Comment

I appreciate the author for providing detailed responses that have addressed most of my concerns and questions.

However, some concerns still remain.

First, if the scope of this paper is focused on benchmarking, I believe more extensive experiments should have been conducted (at least across various models).

Second, based on the experimental results of this paper on the loss function, it is somewhat risky to suggest ‘capacity-recall trade-offs in objective design.’ At the very least, the performance of each loss objective with respect to model parameter size should have been measured more thoroughly (with finer granularity) to establish such a trend.

Third, while the implications drawn from the experimental results—such as ‘additional considerations are necessary to develop loss functions that significantly improve recall of model generations’—are noted, I think that they are too weak to enhance the value of this paper.

Therefore, I will maintain my score.

Comment

All experiments in our paper were done using OPT models. This week we have trained Llama 3.2 1B on the same pretraining set (canonical SELFIES), fine-tuned on canonical versions of the four datasets, generated 1M molecules from each of the models, and computed their precision and recall.

Here are the results:

| Metric | OPT 1.2B Precision | Llama 3.2 1B Precision | OPT 1.2B Recall | Llama 3.2 1B Recall |
|--------|--------------------|------------------------|-----------------|---------------------|
| S_asp  | 75.64              | 76.04                  | 8.61            | 8.65                |
| S_sas  | 80.55              | 80.96                  | 11.25           | 11.3                |
| S_d>p  | 68.31              | 68.59                  | 6.95            | 6.98                |
| S_d=p  | 14.04              | 15.18                  | 1.72            | 1.86                |

Essentially, Llama 3.2 is slightly and consistently better across all metrics. Other than that there are no differences in behavior. We have fine-tuned on randomized versions as well, and the outcome is exactly the same: all scores are a bit better, but no difference in the relative rankings.

We will add this to the manuscript.

The idea of having more model sizes between OPT-800K and OPT-125M to have finer granularity is a good one. There are no "standard" sizes of OPT in between, but we will create new ones. Thanks for this.

Comment

Here are the results of predicting the precision and recall of fine-tuned Llama models (pretrained and fine-tuned on canonical SELFIES)

| Model     | Subset | Precision | Predicted Precision | Difference | Recall | Predicted Recall | Difference |
|-----------|--------|-----------|---------------------|------------|--------|------------------|------------|
| OPT-1.2B  | S_asp  | 75.7%     | 74.0%               | 1.7%       | 8.61%  | 8.43%            | 0.18%      |
| OPT-1.2B  | S_sas  | 80.6%     | 79.9%               | 0.6%       | 11.25% | 11.16%           | 0.08%      |
| OPT-1.2B  | S_d>p  | 68.3%     | 66.5%               | 1.7%       | 6.95%  | 6.78%            | 0.17%      |
| OPT-1.2B  | S_d=p  | 14.1%     | 13.5%               | 0.6%       | 1.73%  | 1.65%            | 0.07%      |
| LLAMA-3.2 | S_asp  | 76.0%     | 74.2%               | 1.8%       | 8.65%  | 8.46%            | 0.19%      |
| LLAMA-3.2 | S_sas  | 81.0%     | 80.0%               | 1.0%       | 11.30% | 11.18%           | 0.13%      |
| LLAMA-3.2 | S_d>p  | 68.6%     | 67.2%               | 1.4%       | 6.98%  | 6.85%            | 0.13%      |
| LLAMA-3.2 | S_d=p  | 15.2%     | 14.3%               | 0.8%       | 1.86%  | 1.75%            | 0.10%      |

As mentioned before, Llama models have slightly better precision and recall. The predicted precision and recall metrics for Llama models are also slightly higher than the predictions for OPT, which implies that the predictor can be reliably used to compare two models.

Comment

Evaluation

Most experiments in this paper are validated using a single model and dataset, making it difficult to consider the proposed benchmark method and the approaches to improve recall as thoroughly validated. I believe there should be verification to ensure that the trends in the experimental results hold consistently across at least several models. Additionally, there are confusing aspects regarding the details of the experiments, which should be described and justified more comprehensively (see the questions section for more details).

Answer:

We chose the OPT model because it is a general GPT-2-like architecture and a predecessor of the LLaMA models. Furthermore, we extensively trained it from scratch, which required significant computational resources and time, making it challenging to replicate the same experiments with other models.

While we believe newer architectures will improve both precision and recall, we do not expect significantly different behavior, e.g. across sampling strategies. To verify this belief, we are currently training a Llama 3.1 1B model. We hope the results will be in before the end of the discussion period.

Regarding datasets, to the best of our knowledge, there are no other comprehensive datasets apart from GDB-13 that satisfy the conditions outlined in our paper. While we are limited to starting from GDB-13, we made sure to derive diverse datasets for training and for evaluating recall. We believe we achieved quite wide diversity, as the recall of the tested approaches varies between 12% and 58% (Table 2). This result highlights the impact of the “complexity” of the “language” being modeled by the LM. On the other hand, Tables 6 and 7 show that the general findings (e.g., the role of pretraining or string representations) translate well across the datasets.

Comment

Completeness in Method

In my opinion, the sections proposing the sampling strategy and loss to improve the model’s recall are crucial for establishing the novelty of your paper. However, these aspects are not fully developed and lack sufficient explanation. For instance, in the case of the recall-oriented loss function, the approach of changing the aggregation to min or max seems quite extreme to me, with significant potential for refinement. Additionally, the proposed method only showed effectiveness for a very small and underperforming model with 800K parameters. Therefore, improvements in this area are essential. Additionally, the motivation for using beam search in recall-oriented generation and the intuition behind why increasing the beam size leads to improved recall need to be more thoroughly explained.

Answer:

We appreciate the feedback regarding recall-oriented modeling and generation strategies described in our work. It is true that these elements of the research would benefit from additional exploration, development and explanation; the latter issue we would be sure to correct in a revision. However, we respectfully disagree that they are insufficiently developed and do not provide novelty. We address the concerns corresponding to those elements below.

Please also note that the scope of this paper is to present a benchmark to facilitate research on recall of LMs. We thank you for confirming that this question is underexplored in literature. Designing significantly novel methods that maximize the recall is beyond the scope of this work. We tried to cover all “low-hanging” methods known to the community to set up the scene with reasonably strong baselines.

Regarding the loss function

The min/max aggregation approach may appear extreme, but it was deliberately chosen to test boundary conditions of the recall-precision trade-off space. While you are correct that the improvement was only observed in the 800K parameter model, this finding is actually quite significant for several reasons. It demonstrates that recall optimization strategies may need to be parameter-count dependent, and suggests there may be fundamental capacity-recall trade-offs in objective design. Additionally, the negative results in larger models are themselves informative, indicating that additional considerations are necessary to develop loss functions that significantly improve the recall of model generations. Furthermore, we articulate in Section 4.4 that designing recall-oriented loss functions belongs to future work, and by providing simple approaches towards this end, we establish important initial baselines and observations upon which subsequent research can develop greater understanding and more performant methods. In a revision, we could include implementations of other loss functions, and additional analyses on the relationship between recall, training objective and model scale if it would strengthen our work.
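To make the aggregation swap concrete, here is a heavily hedged sketch that assumes sequence-level aggregation within a batch; the paper's exact granularity (min/max over tokens vs. over sequences) is not specified here, so treat this as one possible reading rather than the authors' implementation:

```python
import torch
import torch.nn.functional as F

def aggregated_lm_loss(logits, labels, mode="mean"):
    """Per-sequence LM losses aggregated with mean (regular), min, or max.

    logits: (batch, seq_len, vocab); labels: (batch, seq_len) with -100 marking padding.
    """
    per_token = F.cross_entropy(
        logits[:, :-1].transpose(1, 2),   # (batch, vocab, seq_len - 1)
        labels[:, 1:],                    # next-token targets
        ignore_index=-100,
        reduction="none",
    )
    mask = (labels[:, 1:] != -100).float()
    per_seq = (per_token * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    if mode == "mean":      # equivalent to the usual objective
        return per_seq.mean()
    if mode == "min":       # gradient flows only through the lowest-loss sequence
        return per_seq.min()
    return per_seq.max()    # or only through the highest-loss sequence
```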

Regarding the sampling strategy

There are two reasons why the generations go below the “ideal” curve (the blue dashed line on Figure 2): (a) imperfect precision of generations, (b) duplications.

Regular autoregressive generation has both problems. Precision is ~constant at 75% (Table 4), and there are many duplications. “Upper bound (i.i.d)” shows the case when the precision is ideal, but the duplications are there.

Beam search solves the duplication issue, which means that beam search with larger beam size will inevitably produce more distinct molecules, so the recall cannot decrease. Unfortunately, the precision gets gradually worse as one increases the beam size. The reason is that beam search naturally ranks the molecules by their perplexity, and the top ones have higher precision.

Note that surprisingly, the two different issues for these two methods (beam vs. upper bound) produce very similar recall. We will add a paragraph with these clarifications in the manuscript.

Comment

Thank you for the detailed review of the paper.

Weaknesses:

Scalability of Research

The study in this paper is limited to a specific domain, namely molecular generation, and there needs to be a discussion on how this research can be extended to other domains. For example, a crucial aspect of measuring recall, as highlighted in the paper, is identifying the equivalence class of the model’s generated results. As mentioned in lines 60-62, there is a technique for identifying equivalence classes for SELFIES strings. How could this issue be addressed in other domains you mentioned in the introduction, such as “vulnerable code generation”?

Answer:

Although extending this work to other domains is beyond the immediate scope of this paper, we agree that accurately conveying the potential of this kind of analysis for other fields is essential. Initially, as in the molecular domain, the set constraints and equivalence classes will need to be simple to define and validate. To start, let’s simplify the “vulnerable code generation” setting to a “vulnerable function generation” setting. Additionally, since functions with a given behaviour can span an unbounded amount of text, let’s further constrain the program space to Python functions of at most 1000 characters with no side effects.

Let’s say that the domain specific behaviour is that the introduction of the function in a codebase makes it open to a prespecified vulnerability. In this case an equivalence class of functions could be all of those with equivalent IO behaviour (i.e. for every possible function input, same output). In this case, an equivalence class of functions could be large and include many different programming constructs, but would have virtually identical behavior.
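As a hedged illustration of this IO-behaviour equivalence class (a stand-in for discussion, not part of the paper; exact equivalence is undecidable in general, so sampling can only refute equivalence, never prove it):

```python
import random

def probably_io_equivalent(f, g, input_gen, trials=10_000):
    """Approximate check that two functions define the same input/output behaviour.

    input_gen is a caller-supplied generator of random inputs (illustrative here).
    """
    for _ in range(trials):
        x = input_gen()
        try:
            out_f, err_f = f(x), None
        except Exception as exc:          # treat the exception type as the "output"
            out_f, err_f = None, type(exc)
        try:
            out_g, err_g = g(x), None
        except Exception as exc:
            out_g, err_g = None, type(exc)
        if (out_f, err_f) != (out_g, err_g):
            return False
    return True

# Example: two implementations of abs() agree on sampled integers.
same = probably_io_equivalent(abs, lambda v: -v if v < 0 else v,
                              lambda: random.randint(-1000, 1000))
```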

Another alternative would be to define functions which make a codebase vulnerable to any kind of vulnerability, and create function equivalence classes based on the category of vulnerability introduced. Some examples would be: weak random numbers, race conditions, buffer overflow, error swallowing, etc… In this case the equivalence class is larger, but IO behaviour doesn’t have to be verified.

In both cases, an increased recall demonstrates a stronger threat model, in that it can generate a more complete set of threats to a given codebase. A more concrete realization of such problem settings would require a focused effort by researchers working within the cybersecurity domain, but would provide insights into performance of models in generating malicious code.

Official Review
Rating: 3

This paper introduces a new benchmark of molecules for evaluating generative language models with a focus on recall. It aims to investigate the model's ability on tasks requiring distinct output generation, like detecting all vulnerabilities in code. Using an organic molecule dataset, the study shows that model recall can be anticipated via perplexity on a validation set. Moreover, the authors use beam search decoding to reduce duplicates and a recall-aware loss function to improve performance, providing insights into molecular representation and model pretraining effects.

Strengths

This paper presents a meaningful investigation into the recall of model generation, with a well-articulated and compelling motivation.

Weaknesses

  1. From section 3.1 onward, this paper becomes quite difficult to follow, largely due to the use of specialized terminology from fields like chemistry without providing sufficient foundational overviews or introductory explanations. This approach makes it challenging for readers to fully grasp the content and nuances of the work. For instance, important details and statistics regarding the dataset collected by the authors are not included, and terms like SELFIES are mentioned without any straightforward elaboration to help readers understand what SELFIES actually represents. This lack of accessible explanations hinders the reader’s ability to form a clear understanding of the paper’s specifics. I recommend that the authors incorporate diagrams or more detailed descriptions of key terminology to enhance clarity.

  2. In Section 4.2, a new method for estimating recall is proposed. First, the statement "Given that evaluating recall provides a meaningful and interpretable measure of an approach’s ability to model data, estimating recall without needing to perform generations would be useful" lacks a convincing motivation for why recall estimation without actual generation is necessary. There is no clear justification for the need to use an alternative method to evaluate recall. Furthermore, using probability to estimate recall does not align with the standard definition of recall, which traditionally measures the proportion of correctly generated instances rather than a probabilistic expectation. Thus, it is both imprecise and misleading to label this metric as recall. For instance, in earlier sections (Table 2), the authors appear to use a conventional method for calculating recall; however, after introducing this new approach, they apply it in Table 4 but use the same metric name. This inconsistency undermines reliability and creates confusion regarding the validity of the reported recall values.

  3. In Section 4.3, I don’t see a substantial difference between your proposed recall-oriented generation and the standard beam search.

  4. The statement "Mean aggregation is equivalent to the regular loss function" lacks clarity—specifically, it is not defined what the “regular loss function” refers to. Furthermore, the section does not directly present the actual loss function or provide a detailed explanation. Instead, it relies solely on textual descriptions, which makes it difficult to understand the specifics of the proposed loss. Including the explicit mathematical form of the loss function along with a step-by-step explanation would significantly improve clarity and accessibility.

  5. In addition to the presentation issues mentioned above, the paper lacks a coherent structure throughout both the methods and experiments sections. The presentation feels fragmented, and critical details regarding the experimental setup, such as baseline configurations, are insufficiently described. To improve clarity, a major revision is needed to reorganize the paper, providing a more cohesive structure and a thorough explanation of the experimental settings.

Questions

Please refer to the weakness part

Comment

Thank you for the review. Please find the responses to the weaknesses below.

1. On readability due to chemical terminology

Thank you for bringing to our attention the potential difficulties in interpreting our work caused by excessive use of domain-specific language. In our paper, we included citations to the works which formulated the SMILES and SELFIES molecular representations. However, we recognize that this doesn’t necessarily provide sufficient background for a reader with limited exposure to chemistry.

SMILES and SELFIES are both string representations, which are linearized representations of 2D molecular graphs. We attach an image below which gives a visualization of the SMILES and SELFIES strings along with how the substrings map to nodes and edges on the molecular graph. Notably, SELFIES were designed after SMILES with the express goal of creating a similar representation where any sequence of tokens from the SELFIES vocabulary corresponds to a valid molecule. SMILES do not have this property and thus some SMILES strings do not correspond to a valid molecular graph. We are going to add a paragraph on this in the revised manuscript.

With respect to the datasets we investigated in the study, we included statistics on the length of the SELFIES representations, the number of distinct randomized SELFIES per molecule for each subset, as well as a brief explanation of the criteria which define each subset. The larger dataset GDB-13, from which we derive these subsets, is described in some detail in lines 160-164, and we refer readers to the original publication for additional details. A large variety of statistics about GDB-13 is available in that publication, some of which are attached below. We will be happy to follow your suggestions about which of these (or other) statistics would most aid in improving the clarity of the problem setting.

[Attached images: SMILES/SELFIES string-to-graph visualization and GDB-13 dataset statistics]

2. On estimating recall

There are several reasons why it may be useful to accurately estimate the recall a model would achieve on a given subset without performing generations. The primary reason is that our method greatly reduces the computational cost of obtaining this value, which allows the process to be used for model selection. In practice, one can tune i.i.d. generation hyperparameters such as the temperature or the Top-K/nucleus sampling thresholds to maximize recall without actually generating a large number of molecules for each hyperparameter setting. Figure 3 hints that the estimated values can be compared across other hyperparameters as well. We are going to motivate this more explicitly in the revised manuscript.

Additionally, in cases where the entire closed set of desired generations is not known, this method enables the estimation of recall on that set using a much smaller subset.
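For illustration, here is a hedged sketch of what such a predictor could look like, based on the summary above (average per-molecule probability on a held-out validation set regressed against measured recall); the exact feature and regression form used in the paper may differ:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_recall_predictor(avg_val_probs, measured_recalls):
    """Fit recall ~ f(average validation probability) from a few calibration runs.

    avg_val_probs:    mean model probability assigned to held-out molecules of the
                      target subset, one value per calibration configuration.
    measured_recalls: recall actually measured for those configurations via
                      large-scale generation.
    """
    X = np.asarray(avg_val_probs, dtype=float).reshape(-1, 1)
    y = np.asarray(measured_recalls, dtype=float)
    return LinearRegression().fit(X, y)

def predict_recall(predictor, avg_val_prob):
    """Predict recall for a new model/hyperparameter setting without generating."""
    return float(predictor.predict([[avg_val_prob]])[0])
```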

We do not believe that the work is misleading in how it treats the recall metric. The paper does not claim that recall is a new metric, nor does it state that the predicted recall is recall itself. For example, Table 4 shows the actual recall calculated after performing generations. We report the predicted recall only in Figure 3, and the axis labels reflect our explicit distinction between the true recall and the recall predicted by our method.

3. On the novelty of beam search

Our recall-adapted generation strategy is not substantially different from standard beam search, and we do not claim that it is a technical contribution of this work. Rather, the formulation of the recall metric provides new intuition and motivation for comparatively extreme beam search hyperparameters, where the beam size is equal to the generation size. This configuration demonstrably increases recall and performs better than other generation methods to this end.

We agree that the abstract of the paper could be interpreted as hinting towards significant novelty around beam search. We will adjust the abstract accordingly.

Official Review
Rating: 5

This paper introduces a benchmark for evaluating models based on recall rather than just accuracy. The authors tackle two challenges: the lack of complete correct output sets and the presence of multiple similar outputs. Using small organic molecules from the GDB-13 database, they fine-tune models and develop a method to predict recall based on perplexity. They also propose a novel beam search decoding method to maximize recall by avoiding duplicates, alongside a recall-aware loss function. This approach aims to enhance the ability of GLMs to generate all correct outputs, with potential applications in various fields, including security.

Strengths

  • This paper explores the evaluation of recall rates for small language models, which is a meaningful endeavor.
  • The paper investigates various methods to enhance the recall rates of models and has achieved some positive results.

Weaknesses

  • The contributions of this paper are limited. On one hand, in improving recall through sampling methods and loss functions, the authors merely attempt different strategies, which can sometimes harm precision, and no solutions are provided. On the other hand, the improvements through fine-tuning appear to offer no significant contribution, as it is generally expected that fine-tuning would enhance performance on a specific task.
  • The model is too singular, as the experiments in this paper only include the OPT-1.3B model. Therefore, the evaluation results and methods for enhancing recall may not generalize well.

Questions

See weaknesses.

Comment

Thanks for the review!

Weaknesses:

The contributions of this paper are limited. On one hand, in improving recall through sampling methods and loss functions, the authors merely attempt different strategies, which can sometimes harm precision, and no solutions are provided.

On the other hand, the improvements through fine-tuning appear to offer no significant contribution, as it is generally expected that fine-tuning would enhance performance on a specific task.

Answer:

We appreciate the concern about the scope and extent of the contributions of the present work. This criticism does not account for the primary focus and contribution of this paper, which is to enable an exact recall metric, as well as a strong method to predict this recall without expending compute on model inference for all of the required samples. This is a critical contribution, as prior works attempting a recall evaluation rely on approximate metrics like KNN or necessitate the retrieval of specific text from corpora. Our method enables model selection for any i.i.d. sampling method and provides foundational insights into recall in language modelling. The recall-adapted beam generation strategy emphasizes how intuition surrounding this newly developed metric can guide modelling decisions.

The work also connects this evaluation with a specialized domain in which the recall problem is meaningful, formalizes the necessary domain-specific constructs to calculate recall, and provides motivation for potential applications in other domains like security. It’s of note that in molecular generation tasks, precision is not of great interest since repeated proposals of the same molecule are typically redundant. The analysis of loss function aggregation methods revealed an unexpected relationship between model capacity and recall optimization strategies.

Finally, we do not perform experiments comparing models with and without fine-tuning on the subsets of interest. Rather, we separate model training into pretraining and fine-tuning stages in order to analyze the impacts of data representation on recall and precision, namely canonicalization and randomization of the SMILES and SELFIES representations. We also include an experiment comparing models that undergo fine-tuning without pretraining; this experiment demonstrates that the increased representational power gained during the pretraining stage uniformly improves the models’ ability to generate molecules within the subsets of interest, despite being exposed to a far greater number of molecules outside these subsets during pretraining.

Please also note that the scope of this paper is to present a benchmark to facilitate research on recall of LMs. Thanks for confirming the importance of this task in the Strengths section of your review. Designing significantly novel methods that maximize the recall is beyond the scope of this work. We tried to cover all “low-hanging” methods known to the community to set up the scene with reasonably strong baselines.

The model is too singular, as the experiments in this paper only include the OPT-1.3B model. Therefore, the evaluation results and methods for enhancing recall may not generalize well.

Answer:

While we primarily focused on OPT-1.3B, our experiments actually span multiple model scales, including OPT-125M and OPT-800K variants (Section 4.4). These experiments revealed important scaling behaviors - for instance, our recall-oriented loss function showed different effects across model sizes.

We chose the OPT architecture because it is a general GPT-2-like architecture and a predecessor of the LLaMA models, making it a reasonable proxy for a variety of autoregressive decoder-only architectures. Additionally, we extensively trained models of multiple sizes from scratch on different molecular representations, which required substantial computational resources. Given these constraints, we decided to focus our efforts on this single class of models to ensure thorough evaluation and analysis.

While we believe newer architectures will improve both precision and recall, we do not expect significantly different behavior, e.g. across sampling strategies. To verify this belief, we are currently training a Llama 3.1 1B model. We hope the results will be in before the end of the discussion period, and regardless of the outcome we will include them in the manuscript.

We thank you again for your review. We hope these clarifications will enable a positive re-evaluation of our work.

Comment

All experiments in our paper were done using OPT models. This week we have trained Llama 3.2 1B on the same pretraining set (canonical SELFIES), fine-tuned on canonical versions of the four datasets, generated 1M molecules from each of the models, and computed their precision and recall.

Here are the results:

| Metric | OPT 1.2B Precision | Llama 3.2 1B Precision | OPT 1.2B Recall | Llama 3.2 1B Recall |
|--------|--------------------|------------------------|-----------------|---------------------|
| S_asp  | 75.64              | 76.04                  | 8.61            | 8.65                |
| S_sas  | 80.55              | 80.96                  | 11.25           | 11.3                |
| S_d>p  | 68.31              | 68.59                  | 6.95            | 6.98                |
| S_d=p  | 14.04              | 15.18                  | 1.72            | 1.86                |

Essentially, Llama 3.2 is slightly and consistently better across all metrics. Other than that there are no differences in behavior. We have fine-tuned on randomized versions as well, and the outcome is exactly the same: all scores are a bit better, but no difference in the relative rankings.

We will add this to the manuscript. Our initial intuition is confirmed.

Official Review
Rating: 3

This paper presents a benchmark for modelling molecules, based on GDB-13 (an exhaustive set of molecules with at most 13 heavy atoms that satisfy certain conditions). The authors pretrained LMs to generate the molecule sequences and aim to improve recall via 1) better sampling in generation and 2) better training data design. In addition, the authors propose ways to predict the recall value from a small-scale experiment, together with a set of empirical studies on how one should best represent the molecules in LM inputs.

Strengths

  1. Maximizing recall is indeed valuable for a lot of applications, as the authors discussed in the paper, this paper is of empirical importance.
  2. The formulation of the problem is novel, the molecular generation domain provides an excellent testbed due to well-defined equivalence classes and complete reference sets.
  3. The experiments are done with rigor. I like the comprehensive analysis of factors affecting recall (pretraining, molecular representations, etc.)
  4. The dataset and benchmark would make a good contribution to the community.

Weaknesses

My main concern with this paper is around its technical contributions:

  1. The author proposed using random sampling with temperature and beam search (with a large beam size) to improve recall coverage. These two methods are well-known methods in language models' (LM) generation, and I was expecting a novel generation approach such as generating with penalizing the likelihood of already generated sequences.
  2. The method that predicts recall has a lot of similarities with perplexity measure in language modelling, would the authors clarify how is the proposed metric different from the perplexity-based measures?
  3. Removing duplicates and selecting data in each batch are sensible approaches, but they don't appear to be anything novel.

I have some minor questions listed in the below section.

Questions

  1. In figure 2, the authors stated that "The plot indicates that the recall is close to saturation at 10 million generations, implying that this model will not cover 90% of the molecules even with 50 million generations." To me, the coverage function is naturally sub-linear: as you repeatedly take samples from a fixed distribution, the likelihood of getting a new unseen sample gradually goes down, so I am not sure if this (the sublinear trend) is a problem. And if it is, does the authors' proposed approach improve the trend to be somewhat linear? I think that will be an exciting result to see.

  2. SMILES vs. SELFIES. I am not an expert on the molecule modelling topic, but from Table 7, it seems SMILES works better than SELFIES when the data is in Canonical form, so why choose SELFIES as the main representation form?

  3. Writings: [Line 76], (Remove "Finally"?) Finally, LLMs have recently demonstrated strong performance on these tasks [Line 310] I am not sure this expression = "an average probability", looks like a sum of probabilities.

Comment

1. In figure 2, the authors stated that "The plot indicates that the recall is close to saturation at 10 million generations, implying that this model will not cover 90% of the molecules even with 50 million generations." To me, the coverage function is naturally sub-linear, as you repeatedly take samples from a fixed distribution, the likelihood of getting a new unseen sample gradually goes down, so I am not sure if this (the sublinear trend) is a problem. And if it is, does the authors' proposed approach improves the trend to be somewhat linear? I think that will be an exciting result to see.

Answer:

As you correctly state, i.i.d methods are expected to demonstrate sublinear performance in these settings. In general, it would be better to get to higher recall with fewer generations, so “beating” the sublinear trend is a good goal.

Unfortunately, beam search does not really beat that. Here is why. There are two reasons why the generations go below the “ideal” curve (the blue dashed line on Figure 2): (a) imperfect precision of generations, (b) duplications.

Regular autoregressive generation has both problems. Precision is ~constant at 75% (Table 4), and there are many duplications. “Upper bound (i.i.d)” shows the case when the precision is ideal, but the duplications are there. Beam search solves the duplication issue, but the precision gets gradually worse as one increases the beam size. The reason is that beam search naturally ranks the molecules by their perplexity, and the top ones have higher precision. Surprisingly, the two different issues for these two methods (beam vs. upper bound) produce very similar recall.

We have plotted the values of Table 4 to visually show that the trend is sublinear for the beam search as well. We are adding more points to this chart to make it smoother before we put it in the paper (unfortunately it takes too long). We will add a paragraph with these clarifications.

Thanks for bringing this up!

  2. SMILES vs. SELFIES. I am not an expert on the molecule modelling topic, but from Table 7, it seems SMILES works better than SELFIES when the data is in Canonical form, so why choose SELFIES as the main representation form?

Answer:

We used SELFIES as it has fewer issues with generating valid molecules. At some point we found the paper “Invalid SMILES are beneficial rather than detrimental to chemical language models” and decided to compare SMILES as well. The results were mixed: SMILES was better with canonical fine-tuning, and SELFIES was better with randomized fine-tuning.

We didn’t rerun all our experiments with SMILES as we did not have a goal to squeeze the best possible scores. The goal of this subsection is to show the effect of representations. The lesson learned is that future work should not neglect this aspect of the training when maximizing recall in modalities that have multiple representations.

  3. Writings: [Line 76], (Remove "Finally"?) Finally, LLMs have recently demonstrated strong performance on these tasks [Line 310] I am not sure this expression = "an average probability", looks like a sum of probabilities.

Answer: Thank you for identifying these grammatical errors, we would be happy to make revisions to the manuscript based on this feedback.

We hope we addressed all of the concerns you raised to your satisfaction. If that is the case, we would ask you to adjust the review score accordingly. We are open to more questions and feedback. Thanks again.

Comment

Thank you for the detailed response and interesting insights! Please find our responses below.

Weaknesses:

My main concern with this paper is around its technical contributions:

The author proposed using random sampling with temperature and beam search (with a large beam size) to improve recall coverage. These two methods are well-known methods in language models' (LM) generation, and I was expecting a novel generation approach such as generating with penalizing the likelihood of already generated sequences.

Answer:

We acknowledge that the proposed methods, random sampling with temperature and beam search, are well-established in language model generation. Our primary goal with temperature sampling was to conduct ablation studies, demonstrating that while higher temperatures lead to higher entropy and more diverse generations, in the context of molecular generation, this also resulted in more molecules outside the desired subset (Figure 4).

Regarding beam search, we do not claim to introduce an entirely novel decoding scheme. Instead, we adapt it by setting the beam size equal to the generation size—an essential adjustment specifically designed to maximize recall in the context of molecule generation. This modification enables beam search to thoroughly explore a broader set of high-recall candidates (which, initially, we could not determine would belong to our desired subset) and ultimately achieve significantly higher recall compared to other popular decoding methods.
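As a hedged illustration of the beam-size-equals-generation-size configuration (toy values; the model path is a placeholder, and at the multi-million beam sizes used in the experiments a custom beam-search implementation with on-disk beam storage is needed rather than this call):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "path/to/finetuned-subset-model"    # hypothetical checkpoint path
tok = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path)

beam_size = 1000                           # toy value; the paper goes up to millions
outputs = model.generate(
    torch.tensor([[tok.bos_token_id]]),
    num_beams=beam_size,
    num_return_sequences=beam_size,        # keep every beam: generation size == beam size
    do_sample=False,
    max_new_tokens=64,
)
candidates = tok.batch_decode(outputs, skip_special_tokens=True)
```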

In general, the scope of this paper is to present a benchmark to facilitate research on recall of LMs. Thanks for highlighting this in the Strengths section of your review. Designing significantly novel methods that maximize the recall is beyond the scope of this work. We tried to cover all “low-hanging” methods known to the community to set up the scene with reasonably strong baselines.

The method that predicts recall has a lot of similarities with perplexity measure in language modelling, would the authors clarify how is the proposed metric different from the perplexity-based measures?

Answer:

The recall and especially precision predictors have mathematical similarities with standard perplexity metrics, from which we took inspiration. The critical difference is that the recall predicted from perplexity is easy to interpret and directly connected to generative performance. Perplexity measured on a held-out set may be able to predict the perplexity on another, larger corpus with a similar distribution, but it does not provide an interpretable or comparable value regarding the downstream performance of the model. Unlike our proposed method, perplexity measures do not take model generations into account. In practice, the simplicity of our method enables the concomitant calculation of perplexity and predicted recall, which would prove informative for teams working at the intersection of NLP and domains for which recall is meaningful.

Removing duplicates and selecting data in each batch are sensible approaches, but they don't appear to be anything novel.

Answer:

We acknowledge that the loss objectives presented in the work are not novel on a broad scale. The intention of the experiments which implemented these modifications was to provide a more comprehensive characterization of the new problem setting. We would be happy to correct the statement in the manuscript to clarify that the novelty of the work lies in the analysis of a new problem setting, with previously inaccessible motivations for these modelling decisions, rather than in presenting novel methods in a broad sense.

Official Review
Rating: 5

This paper introduces a benchmark for evaluating the recall of language models in the domain of small organic molecules. Specifically, based on the well-known GDB-13 dataset, the authors prepare a new dataset with four subsets; for example, one subset contains molecules that share a certain percentage of substructures with aspirin. Based on the constructed dataset, the molecule generation capability of language models (LMs) in terms of recall is evaluated before and after fine-tuning. A new method for predicting the recall of LMs is also designed: the average probability of generating a desired molecule and the ground-truth recall values are used to build a regression model for recall prediction, and the evaluation demonstrates a correlation of more than 0.99. Finally, a recall-oriented molecule generation method and a loss function are introduced to boost the recall of LMs.

Strengths

  1. An interesting and important problem in analyzing the recall of language models.
  2. Multiple solutions with promising results have been proposed in the same work
  3. The paper is well-written

Weaknesses

  1. Even though the motivation is clear and good, the studied objective does not fit the motivation well, is the recall metric more important in the molecule generation domain?
  2. Many design choices are unclear, e.g., why use Beam search in section 3.4 not others?
  3. Many problems, e.g., capability estimation and new loss design, have been studied, but each of them lacks a comparison with baselines.

Overall, this paper studies an important problem and proposes promising solutions for recall estimation and LMs enhancement. However, there are some concerns that need to be addressed.

Firstly, even though the main point, evaluating whether a model can generate all correct outputs is important for safety-critical problems, it is unclear whether this is the case for the studied objective molecule generation. It is better to give clear motivation for the importance of evaluating recall for this task.

For the subset construction, in Table 1, it is unclear how the threshold is determined, e.g., 0.4 for Sasp and 0.2 ≤ sim(m, d) ≤ 0.2165. Please clarify it.

In Section 4.1, Tables 2 and 3 suggest different solutions as the best; which one should we accept in practice? It is better to add more discussion here.

In Section 4.2, considering the recall estimation, there are many works that have been proposed to evaluate deep learning models in an unsupervised manner [1, 2, 3], it is necessary to at least discuss the difference between the proposed method and these works.

In Section 4.3, it is unclear why Beam search is used here since there are many other options (search methods).

In Section 4.4, first, it is better to add baselines without using the designed loss function in Table 5. Besides, the recall values decreased after comparing the results in Table 5 and Table 4. It is unclear which factors lead to this degradation.

[1] Unsupervised Evaluation of Code LLMs with Round-Trip Correctness. [2] Estimating Model Performance Under Covariate Shift Without Labels. [3] Agreement-on-the-Line: Predicting the Performance of Neural Networks under Distribution Shift

Questions

Please check my comments above.

Comment

We thank the reviewer for the deep review and the questions.

Concerns:

Firstly, even though the main point, evaluating whether a model can generate all correct outputs is important for safety-critical problems, it is unclear whether this is the case for the studied objective molecule generation. It is better to give clear motivation for the importance of evaluating recall for this task.

Answer: The primary focus of the paper is to demonstrate this problem formulation in a domain for which it is useful. We suggest applications in other domains to provide further motivation for our line of research. With respect to molecules, we address the importance of recall evaluations in molecular generation in lines 044-048 of the paper:

In scientific discovery, generating new molecules or materials with given characteristics is a cornerstone problem. For example, in drug discovery, most of the correctly generated molecules may prove useless in subsequent phases of drug development (e.g., in toxicity analysis), so generating a diverse and complete set of initial molecules is useful. Another related problem is the exhaustive generation of all conformations (3D positions) for a given molecule.

To expound upon this, the ability of a model to cover the full set of molecules which satisfy certain criteria is desired for a number of reasons. Firstly, it tests whether a model can generate molecules which often have high reward, and capture the total diversity of the subset in question. It provides a direct signal for systematic biases and failure modes of the generative model, identifying if it misses certain subclasses or chemical subspaces within the chosen set. Current benchmarks in molecular generation rely on arbitrary thresholds for property values to evaluate molecular generation pipelines because the complete set of desired molecules is not specified. By reformulating the problem, we enable an evaluation method which is both interpretable by domain experts ("With model A, we can recover M% of the molecules that bind to Y and have Z property") and fully captures the complexity of the task.

For the subset construction, in Table 1, it is unclear how the threshold is determined, e.g., 0.4 for Sasp and 0.2 ≤ sim(m, d) ≤ 0.2165. Please clarify it.

Answer:

The thresholds were chosen to ensure that the resulting subsets have comparable sizes. We aimed to construct a training set of 1 million molecules for each subset, ensuring that all models were trained on equal amounts of data. Specifically, this design ensures that the upper bounds for recall calculations are based on subsets of similar size, providing a fairer basis for comparison.

In Section 4.1, Table 2 and Table 3 suggest different solutions as the best, which one we should accept in practice. It is better to add more discussion here.

Answer:

Thank you for raising this concern about the seemingly conflicting findings. We attempt to address this in lines 512-526 of our work. To add to this, and to continue the reasoning from our response to the first concern, maximizing recall would typically be of greater interest in practical applications than precision. In this case, randomized pretraining with randomized fine-tuning would be the best configuration based on our findings. This is because, during molecular generation with a fixed compute budget, generating a more diverse set of candidates for subsequent development is more important than generating desired molecules more often, since duplicate generations are redundant.

Comment

In Section 4.2, considering the recall estimation, there are many works that have been proposed to evaluate deep learning models in an unsupervised manner [1, 2, 3], it is necessary to at least discuss the difference between the proposed method and these works.

[1] Unsupervised Evaluation of Code LLMs with Round-Trip Correctness. [2] Estimating Model Performance Under Covariate Shift Without Labels. [3] Agreement-on-the-Line: Predicting the Performance of Neural Networks under Distribution Shift

Answer: The mentioned papers are not about measuring the recall and do not seem to be relevant to this work. Here is a detailed description and comparison with one of the papers mentioned.

Unsupervised Evaluation of Code LLMs with Round-Trip Correctness:

  • This paper adapts the idea of back-translation, where a description is generated from a code snippet, and then the code is regenerated from the description. The initial code and the regenerated code are then compared using similarity metrics such as exact match, CodeBLEU, and unit tests.
  • The key difference between this work and ours is that their method does not measure recall, nor does it attempt to predict recall as a metric upfront.
  • Furthermore, their approach focuses on conditional generation, whereas our method is designed for unconditional generation.

In Section 4.3, it is unclear why Beam search is used here since there are many other options (search methods).

Answer:

It is true that a breadth of search methods exists for generation with large models. However, since this study is primarily focused on a novel evaluation setting, we employed beam search to showcase the effectiveness of a commonly used non-i.i.d. generation method in the setting we propose.

While other search methods could also be explored, our primary goal in this section was to demonstrate the effectiveness of beam search when all generated outputs are kept, all of which are unique by construction.

In Section 4.4, first, it is better to add baselines without using the designed loss function in Table 5. Besides, the recall values decreased after comparing the results in Table 5 and Table 4. It is unclear which factors lead to this degradation.

Answer:

We would like to clarify that the baseline is indeed included in Table 5, specifically the "Aggregation with Mean Loss." Additionally, we demonstrate in the same table that using the proposed Minimum Loss function allows for achieving higher precision and recall compared to the baseline, particularly when applied to a smaller model.

The results in Table 4 and Table 5 correspond to evaluations of different experimental setups. In Table 4, the model is trained on the default setting, which uses a training set comprising 1 million unique molecules (SELFIES), as described in lines 229–231. In contrast, Table 5 reports results from an experiment described in lines 421–426, where the training set is augmented by generating 8 SELFIES representations for each of the 1 million molecules. This augmentation introduces variability that impacts the recall values.

AC Meta-Review

The paper explores the problem of evaluating language models with a focus on recall as opposed to accuracy and introduces a new benchmark for molecules. The methodology primarily involves random sampling with temperature and beam search with a large beam width for decoding, together with a recall-aware loss function. Using a dataset of organic molecules, the paper shows that recall can be predicted using perplexity on a validation set.

The reviewer assessments on this paper were mixed. All reviewers appreciated the research question, formulation, and benchmarking. The negative reviewers complained about the lack of technical novelty and/or more comprehensive experiments. The authors' responses to some of the other questions were mostly satisfactory, even though a couple of reviewers did not respond to the rebuttal.

In my own reading and assessment of the paper, it certainly has some strengths but needs improvement for acceptance. The paper can potentially take two routes to strengthen it:

  • Increase the technical novelty and add additional experiments on more molecule datasets (if the focus is on molecules as a case study).
  • Increase the experiments by adding more use-cases (as alluded in the paper) beyond molecules to drive home the general message for benchmarking and importance of this research.

Therefore, I'm recommending rejecting this paper, and I strongly encourage the authors to improve the paper based on the reviewers' feedback for resubmission.

Additional Comments on Reviewer Discussion

The negative reviewers complained about the lack of technical novelty and/or more comprehensive experiments. The authors' responses to some of the other questions were mostly satisfactory, even though a couple of reviewers did not respond to the rebuttal.

Based on my own reading, the paper needs improvement in methodology and/or more comprehensive experiments.

Final Decision

Reject