5.5

/10

Poster4 位审稿人

最低4最高7标准差1.1

3.5

置信度

COLM 2025

Benchmarking Retrieval-Augmented Generation for Chemistry

Xianrui Zhong,Bowen Jin,Siru Ouyang,Yanzhen Shen,Qiao Jin,Yin Fang,Zhiyong Lu,Jiawei Han

OpenReview PDF

提交: 2025-03-20更新: 2025-08-26

TL;DR

We construct a comprehensive Retrieval-Augmented Generation benchmark for chemistry.

摘要

关键词

Retrieval-Augmented GenerationRAGBenchmarkAI for ScienceLLMLarge Language ModelChemistry

评审与讨论

审稿意见

评分: 7置信度: 42025-04-19

The authors take 4 datasets and 6 corporate in the chemistry domain and see how RAG with various retrievers works. There are ablations and analyses.

接收理由

CHEMRAG-BENCH and CHEMRAG-TOOLKIT are clearly useful contributions on which to evaluate and improve models.

拒绝理由

I don't have major reason to reject. Some questions below.

给作者的问题

Why use SPECTER? SPECTER2 is better and has adapters for various tasks. Also, SPECTER was made to encode titles + abstracts specifically. Many of your datasets are not in that format, so it wouldn't make sense to use SPECTER for those datasets.
For the original 4 datasets (MMLU-Chem etc) -> what was the retrieval intention of the authors? Did these datasets come with corpora? It's unclear how your corpora relate to the original datasets, their construction, etc.
There's no mention of code release or how to use the benchmark/toolkit. Given the framing, I assume you'll release the code. It would be very nice to have examples of how to run it. Is the benchmark easy to run with another LLM? How easy is it to add another retriever? What about another corpus?

评论- Response to Reviewer YkjK

2025-06-03

Thank you very much for your constructive comments and support! Besides the general response, we specifically answer your questions below and the manuscript will be carefully revised accordingly.

Q1: Why use SPECTER? SPECTER2 is better and has adapters for various tasks.

A1: We selected four representative retrieval models to illustrate the performance across different retriever types: BM25 as a sparse retriever; Contriever and e5 as general-domain dense retrievers; and SPECTER as a scientific-domain dense retriever. To show that our conclusion is consistent for both SPECTER and SPECTER2, we will add the results of SPECTER2 in the revised draft.

Q2: What was the retrieval intention of the authors?

A2: Chemistry is a highly specialized and dynamic discipline. Thus, LLMs trained on general corpora often fail to generate grounded and accurate responses for chemistry, instead, they may produce hallucinated or outdated content. Retrieval presents a natural solution to these limitations, allowing models to retrieve and incorporate trusted chemical knowledge during inference. Therefore, it is a powerful way to mitigate hallucination.

Q3: There's no mention of code release or how to use the benchmark/toolkit.

A3: We will release the code soon. Please find our code at https://anonymous.4open.science/r/ChemRAG_anonymous-EB23/. We deleted some model/cache paths for anonymous purposes. It is very easy to run with different LLMs. For open-source LLMs, you can simply download a model from huggingface and provide the location of the model. For proprietary models, you can simply provide the model name. For adding a new retriever, you can simply run the index code that we provide. As for a new corpus, we have provided the code to index the corpus with retrievers.

2025-06-03

Thanks for your response.

I was not clear about my second question:

"For the original 4 datasets (MMLU-Chem etc) -> what was the retrieval intention of the authors? Did these datasets come with corpora? It's unclear how your corpora relate to the original datasets, their construction, etc."

I mean what was the intention of the authors of the 4 original datasets? Were these datasets released with corpora to retrieve from?

2025-06-05

Thank you for your question!

The intention of the 4 original datasets is to systematically evaluate how LLMs and RAG systems perform in chemistry-related tasks. We select tasks from both academic and research settings.

For academic setting, we choose MMLU which is a commonly used dataset for evaluating LLMs performance. MMLU only has multiple-choice questions, in order to evaluate the performance more comprehensively, we select SciBench as another dataset in the academic setting. SciBench is a widely used dataset and consists of open-ended questions in chemistry.

As for the research setting, we collaborate with researchers in chemistry and biochemistry, who expect LLMs to work in their domain-specific tasks in molecule description generation, molecule generation, property and reaction predictions. Based on the expectation of domain experts, we select ChemBench4K and Mol-Instructions, both of which cover a wide range of tasks desired by domain experts. The main difference between ChemBench4K and Mol-Instructions is that ChemBench4K is in a multiple-choice setting while Mol-Instructions is in an open-ended setting. We adopt the two settings because we want to make the evaluation as comprehensive as possible.

Finally, for the last question, these datasets were not released with any corpus. The corpora is purely constructed by us, which is one of the main contributions of this work.

2025-06-05

"The corpora is purely constructed by us, which is one of the main contributions of this work." - Can you help me understand how you have analyzed how appropriate or useful the corpora are for each dataset? To put this question another way: if the datasets were not released with any corpus, what is the motivation to attach a corpus to them post-hoc? Does it make them easier than originally? Harder? Is your corpus the right corpus?

2025-06-07

Thanks for the questions! Here are the answers to your questions.

Q4: What is the motivation to attach a corpus to them post-hoc?

A4: Chemistry is a highly specialized and dynamic discipline. Thus, LLMs trained on general corpora often fail to generate grounded and accurate responses for chemistry, instead, they may produce hallucinated or outdated content. Retrieval from a corpus presents a natural solution to these limitations, allowing models to retrieve and incorporate trusted chemical knowledge during inference. Therefore, it is a powerful way to mitigate hallucination.

Q5: Does it make them easier than originally? Harder?

A5: From our extensive experiments, we believe that most of the time it makes the question easier, as suggested by the consistent improvements shown in Table 3.

Q6: Is your corpus the right corpus?

A6: Most of the data sources used to construct the corpus (PubChem, PubMed, USPTO, and Semantic Scholar) are the ones widely used by the chemistry research community. We have conducted corpus study with GPT-3.5-turbo shown in Table 4, which suggests that each data source has its own strength. Therefore, we believe our corpus is the right corpus.

Thanks again for your questions and support! Please feel free to reach out if you have more questions and comments.

2025-06-07

Appreciate all the answers!

审稿意见

评分: 4置信度: 32025-05-13

The paper explores RAG for domain-specific search, more specifically chemistry-related tasks. Tapping into previously released benchmark datasets the paper brings together a benchmark (ChemRAG-Bench) that combines four existing datasets. In addition to that the authors present a toolkit (ChemRAG-Toolkit) which allows the exploration of RAG performance on the benchmarks by modifying different experimental settings such as the retrieval algorithm and underlying language model. Experimental results based on deploying the toolkit are being reported as well.

Topically this a very good fit for COLM. The paper also provides interesting insights. However, there are a number of concerns that limit the overall impact of the contribution as described below.

接收理由

The paper provides a strong motivation and a domain-specific scenario that helps get a better understanding of how large language models affect different downstream use cases.
The authors make a practical contribution that may be picked up as a reference point by others in the same research field.
The bibliography is comprehensive referencing many papers from key research outlets. There is however a bit of a skew towards papers from the non-peer-reviewed literature.
Good use is made of appendices. To further support reproducibility however it would have been beneficial to also include an anonymized GitHub repository.

拒绝理由

I see a number of shortcomings affecting the overall value of the contribution:

The work appears a bit incremental as the “comprehensive benchmark” collection the authors present seems to be a compilation of benchmarks published previously (as a side note here: please address any possible copyright issues that might arise when re-distributing such resources). I might be mistaken here but that is how I interpreted the paper.
While each dataset looks like a valuable resource it remains unclear how representative the chosen data actually is for a specific practical setting.
Another shortcoming I see is the experimental work which does not make use of suitable baselines. More specifically, the results in Table 3 would be a lot stronger if state-of-the-art baselines were to be deployed to demonstrate what a RAG approach offers over alternatives. Such baselines could be implementations of what has been reported in the literature as top-performing approaches for each of the datasets, possibly also a fine-tuned BERT system or something similar as long as non-contamination is being guaranteed (see also comment further down).
The discussion of related work does not address the specific challenges (and the possible solutions put forward) that generative approaches such as RAG pose. The evaluation paradigm RAGAs is one such approach [1].
No statistical significance testing has been applied to the experimental results (despite the use of terms such as “outperform”). As an example, the results in the last column of Table 4 look more or less on par and this suggests that none of the chosen settings might be any better or worse than any of the others.
Given the datasets have been published previously (one of them several years ago) it is reasonable to assume that LLMs will have seen them during training. This means that they may well be contaminated and the results obtained from testing on what has been used as training data will be potentially flawed.
There is no limitations section which should critically assess how generalisable the contribution is.

Minor points:

Make all captions self-contained (e.g. what metrics are being used in Table 3?)
The bibliographical entries need to be polished (e.g. incorrect capitalisation, missing details etc.).

[1] Es et al. (2024). RAGAs: Automated Evaluation of Retrieval Augmented Generation. EACL 2024.

给作者的问题

“We collect data from six sources” —> How exactly is this being done? Are these all results from each of these sources or just a sample (if so, how was this sampled)?
I am puzzled by the statement that “current retrievers only consider semantic similarity”. I think it is more common to simply use bag-of-words approaches (e.g. BM25) as a first step in RAG rather than actually utilizing semantics. This statement therefore needs to be backed up by suitable references.

评论- Response to Reviewer cNEq [3/3]

2025-06-03

Q4: The discussion of related work does not address the specific challenges

A4: Thanks for your suggestions! We have included the suggested section and the mentioned paper in our revised draft.

Q5: No statistical significance testing has been applied to the experimental results

A5: For benchmark paper, it is very rare to include the statistical significance test. To make sure our result is reproducible, we set the temperature to 0 for all models except Deepseek and o1. We run the experiments three times and the mean is reported. There are 1,932 questions in total, and we believe this is substantial that the improvement from using RAG does not rely on some individual data points.

Q6: Given the datasets have been published previously (one of them several years ago) it is reasonable to assume that LLMs will have seen them during training.

A6: We think this is a common issue for all LLMs. Even LLMs have seen the test data during training, they may not be able to leverage it well (e.g. baseline in Table 3), and using RAG may be helpful for providing related knowledge for the task (also shown in Table 3).

Q7: There is no “limitations” section which should critically assess how generalisable the contribution is.

A7: Thanks for your suggestion! The limitation section is not required by COLM. We will add a limitation section in the appendix in our revised draft.

Reference

[1] Hendrycks, D., et al. Measuring Massive Multitask Language Understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021.

[2] Papers with Code: MMLU Leaderboard

[3] Zhang, D., et al. ChemLLM: A Chemical Large Language Model. arXiv:2402.06852 [cs.AI] (2024). https://doi.org/10.48550/arXiv.2402.06852.

[4] Wang, X., et al. SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models. Proceedings of the 41st International Conference on Machine Learning, PMLR 235:50622-50649, 2024.

[5] Fang, Y., et al. Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models. Proceedings of the International Conference on Learning Representations (ICLR), 2024.

2025-06-07

I appreciate the effort that the authors have put into answering my concerns. While I can see some of these concerns partially answered my overall assessment remains the same.

For future submissions I would highly recommend to include a GitHub link so that some of the arguments can be followed and reproduced (e.g. around representativeness as well as details on corpus construction which does not come across in the paper). I also think that data contamination cannot just be ignored as a model that has been trained on test data gives invalid results.

2025-06-09

Thank you for your response! Please find our code and retrieval cache at https://anonymous.4open.science/r/ChemRAG_anonymous-EB23/. We deleted some model/cache paths for anonymous purposes. The constructed corpus is so large that we cannot find an easy way to upload it anonymously.

2025-06-10

Dear reviewer cNEq,

It would be much appreciated if you could explain more on how we could address the data contamination issue. We think we have addressed all your concerns except the data contamination issue, could you enlighten us what we could do to improve the current rating?

Sincerely, Paper 625 Authors

评论- Additional Response

2025-06-11

Q8: “We collect data from six sources” —> How exactly is this being done?

A8:

PubChem: We downloaded the data from PubChem bulk download. For each chemical compound in PubChem, we collect its names, properties, and description and transform them into JSON format.
Semantic Scholar: We collect 1,849,956 full-text chemistry papers from Semantic Scholar with its API, then we chunk them into chunks of 512 tokens with a 50-token overlap.
USPTO: After collecting the information from USPTO, we transform reactants, reagents, products, and yields information into JSON format.
Open-source textbooks: We parse the 2 PDF chemistry textbooks from openstax.org to texts with Mathpix (https://mathpix.com/). Then we split the parsed textbook into chunks of 512 tokens with a 50-token overlap.
Wikipedia and PubMed are collected in [1]

Q9-1: I am puzzled by the statement that “current retrievers only consider semantic similarity”.

A9-1: We forget to include keyword matching here, which has been added to the revised manuscript. We are trying to express that current retrievers may fail when the query is complex. We have conducted a more detailed failure analysis. The failures are mostly:

Prioritize too much on molecule matching

For instance, the query “Which ingredients are commonly selected for creating Cc1oc(-c2ccccc2)nc1COc1ccc2cc(CC3SC(=O)NC3=O)cnc2c1?” is asking about reagents information for generating the mentioned compound as a product. When there is no such information, retrievers usually give high scores to irrelevant documents that contain the same SMILES. This may introduce noise to the retrieved documents and mislead LLMs. A better retrieval system may identify this situation, then search for the synonyms, and search for similar compounds if synonyms still fail.

Often fail when the document only mention one name

A molecule may have many names, including SMILES, IUPAC, and English names. This makes retrieving the right document more difficult as the question may only contain English names, but there may only be SMILES in the relevant documents. For example,

Query: What is the molecular weight of aspirin?

Document1: The molecular weight of CH3COOC6H4COOH is 180.16 g/mol.

Document2: Aspirin can cause developmental toxicity.

In this example, CH3COOC6H4COOH is the formula of aspirin, but current retrievers don't give Document1 high scores because it doesn’t know CH3COOC6H4COOH is aspirin. By training with synonyms, the texts with CH3COOC6H4COOH and aspirin will be closer in the embedding space. Alternatively, when using an LLM to generate queries, the LLM may first find the formula of aspirin, and then use the formula to search for the molecular weight.

Q9-2: It is more common to simply use bag-of-words approaches (e.g. BM25) as a first step:

A9-2: BM25 is the first tested retriever and is also reported in our paper.

Data Contamination Concern: We would like to emphasize that we are comparing the same LLM in settings with and without RAG. Even if there were data contamination, it would still be a fair comparison. Data contamination issue is common and out of scope of this paper. The oldest question dataset used in this paper is MMLU, but it is still widely used in newly proposed LLMs, including Qwen3 [2], o1 [3], and DeepSeek-v3 [4]. HumanEval and GSM8K, benchmarks proposed in 2021, are also used in newly proposed LLMs, including Llama3 [5], and DeepSeek-v3 [4].

Please feel free to reach out if you have more questions and/or concerns.

Reference

[1] Xiong, G., Jin, Q., Lu, Z., Zhang, A. (2024). Benchmarking Retrieval-Augmented Generation for Medicine. arXiv preprint arXiv:2402.13178.

[2] Yang, A., et al. (2025). Qwen3 Technical Report. arXiv preprint arXiv:2505.09388.

[3] OpenAI. (2024). OpenAI o1 System Card. arXiv preprint arXiv:2412.16720.

[4] DeepSeek-AI. (2024). DeepSeek-V3 Technical Report. arXiv preprint arXiv:2412.19437.

[5] Grattafiori, A., et al. (2024). The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783.

评论- Response to Reviewer cNEq [2/3]

2025-06-03

Q3: Another shortcoming I see is the experimental work which does not make use of suitable baselines A3: We have already considered SOTA models. For MMLU [1], GPT-4o is the strongest model based on [2]. For ChemBench [3], in their original paper, the strongest models are ChemLLM and GPT-4. For SciBench [4], in their original paper, GPT-4 performs the best. For Mol-Instruct [5], the best models are MolT5 and ChemT5, and we have included the results below.

Comparison with MolT5 and Text-Chem T5

LLM	Method	Molecule Design	Forward Reaction	Description Gen.	Reagent Prediction	Retrosynthesis	Avg
MolT5	Baseline	36.38	22.49	14.10	18.15	24.16	23.06
Text-ChemT5	Baseline	33.45	85.71	2.87	18.79	60.10	40.18
LLaMA3.1-8B	Baseline	25.52	34.51	4.49	0.91	43.19	21.72
LLaMA3.1-8B	Ours	43.20	50.14	20.51	32.68	51.86	39.68
LLaMA3.1-70B	Baseline	30.89	42.47	8.04	16.17	35.27	26.57
LLaMA3.1-70B	Ours	48.74	58.24	15.95	42.14	69.06	46.83

Chain-of-thought Experiment

We have also included the chain-of-thought (CoT) experiments demonstrated below.

We notice that CoT does better than baseline and solely using RAG in multiple-choice settings (MMLU and ChemBench4K), but performs worse than baseline and solely using RAG in open QA (SciBench and Mol-Instruct.).

LLM	Method	MMLU	SciBench	ChemBench4K	Mol-Instruct	Avg
LLaMA3.1-8B	Baseline	42.9	3.3	27.25	23.99	24.36
LLaMA3.1-8B	CoT	61.38	1.92	32.75	6.67	25.68
LLaMA3.1-8B	Ours	52.15	3.56	25.88	41.05	30.66
LLaMA3.1-70B	Baseline	62.38	5.99	24.25	28.33	30.24
LLaMA3.1-70B	CoT	67.66	4.36	32.13	25.97	32.53
LLaMA3.1-70B	Ours	61.05	13.63	26.25	49.67	37.65

Comparing with other methods may not be necessary. Because we focus on building a benchmark that makes using and researching RAG for chemistry more convenient and focus on understanding the performance of current RAG systems, rather than claiming current RAG systems can beat many other systems. Please refer to our general response to all reviewers for more information.

评论- Response to Reviewer cNEq [1/3]

2025-06-03

Thank you very much for your constructive comments! Besides the general response, we specifically answer your questions below and the manuscript will be carefully revised accordingly.

Q1: The work appears a bit incremental as the “comprehensive benchmark” collection

A1: Although RAG demonstrated its power in the general domain, its effectiveness in the chemistry domain is not well explored and this is because of lack of benchmark data and chemistry knowledge source. This paper is not aimed at developing a new retriever, but to propose a benchmark on RAG for chemistry and systematically study RAG in the challenging Chemistry domain by constructing corpus, incorporating existing retrieval methods, and selecting representative chemistry-related tasks. It is non-trivial to construct the corpus, incorporate retrieval algorithms, select the representative tasks, and implement the RAG pipeline. We solve these challenges by working with researchers in chemistry and biochemistry. They expect LLMs to work in their domain-specific tasks in molecule description generation, molecule generation, property and reaction predictions. We carefully collect data from sources commonly used by domain experts:

PubChem is the largest database of freely accessible chemical information in the world. For each chemical compound in PubChem, we collect its names, properties, and description and transform them into JSON format.
Semantic Scholar composes a large amount of full-text scientific papers and includes cutting-edge chemistry research information. It also contains chemical reactivity and bioactivity information. We collect 1,849,956 full-text chemistry papers from Semantic Scholar, then we chunk them into chunks of 512 tokens with a 50-token overlap.
USPTO contains a large amount of information about reactions: reactants, reagents, products, and yields. We transform these attributes into JSON format.
Open-source textbooks: We parse the PDF textbooks to texts with Mathpix (https://mathpix.com/). Then we split the parsed textbook into chunks of 512 tokens with a 50-token overlap. The curated corpus is large-scale and comprehensive, containing millions of molecules and 1,849,956 chemistry papers, covering chemical names, properties, reactivity, bioactivity, and reactions. We will add the corpus curation section in the revised draft.

The tasks in our constructed benchmark are carefully selected to represent a wide range of chemistry tasks by collaborating with the aforementioned domain experts. The proposed benchmark makes using RAG for chemistry and research in this domain much more convenient. Please refer to our general message for more information.

As for the copyright issue, all the data we collect from is open-source, and we do our best to make sure it is allowed to re-distribute.

Our Contribution:

To the best of our knowledge, this is the first work of benchmarking RAG for chemistry. It is also the first work that covers a wide range of aspects and systematically studies the effect of current RAG systems on chemistry tasks. Both the dataset and the benchmark will serve as valuable references and contributions to both the RAG and chemistry communities.

Q2: While each dataset looks like a valuable resource it remains unclear how representative the chosen data actually is for a specific practical setting.

A2: The domain experts of researchers in chemistry and biochemistry guide us through selecting the corpora and task datasets. For the corpora, we have PubChem which contains names, properties, and description for each chemical compound; Semantic Scholar which contains full-text chemistry literature and is valuable for analyzing the usage, property, and chemical activity; USPTO which contains critical information for reactions; and PubMed which contains important information for biochemistry. Our domain experts confirm these data sources are quite representative and are the data sources they use in their work.

For the task datasets, our domain experts confirm that the datasets represent all the settings that a LLM may be used for, since they mainly leverage LLMs for understanding the molecules, getting information of molecules, predict reactions, and generate molecules.

审稿意见

评分: 5置信度: 42025-05-22

This paper presents benchmarking LLMs with RAG for chemistry tasks. It covers six task categories, six heterogeneous corpora, five off-the-shelf retrievers, and eight LLMs. Through systematic experiments, the authors demonstrate an average 17.4% relative improvement over direct inference, analyze the impact of corpus selection, retriever choice, and retrieval depth.

接收理由

RAG for scientific domains is an important and understudied direction.
It is a well-organized empirical study on RAG-enabled LLMs on diverse chemistry tasks.

拒绝理由

No new retrieval algorithms or corpora are proposed--both the “benchmark” and “toolkit” are essentially assemblies of existing resources.
Shallow findings and analysis. Specifically, the key observations (e.g., corpus choice impacts task performance; retrieval depth influences results) are expected for any RAG system. The paper does not provide deeper analysis such as why certain retrievers outperform others, or how often retrievers actually surface truly relevant chemistry information, or what the common failure modes are.
It is not chemistry-specific enough, other than the used datasets. It does not explore the unique characteristics and challenges of chemistry-specific tasks, and does not develop any domain-specific methodologies to address the challenges.
The comparative evaluation is not very comprehensive. It does not compare with task SoTA models, as well as existing chemistry agents (such as ChemAgent, ChemCrow, or ChemToolAgent), many of which are similar to RAG in the way that they use tools to obtain domain information and has improved the performance over base LLMs. Without this, it's hard to know how much the RAG system can help. In addition, the experiments omit test-time techniques such as chain-of-thought prompting, which could significantly affect performance.

给作者的问题

Question:

You describe your benchmark as “expert-curated.” Could you clarify who the experts were and what their role was in curating the data?

Suggestion:

Consider including the names of the LLMs and retrievers used directly in the table and figure captions to improve clarity and readability.

评论- Response to Reviewer JvHc [3/3]

2025-06-03

Q5: You describe your benchmark as “expert-curated.” Could you clarify who the experts were and what their role was in curating the data?

A5: The experts are researchers in chemistry and biochemistry. They mainly involve in 1) selecting data when curating the corpora, including the data source, specific information, data processing; and 2) choosing the chemistry-specific tasks. Please refer to our general response for more information.

Q6: Consider including the names of the LLMs and retrievers used directly in the table and figure captions to improve clarity and readability.

A6: Thanks! We have updated our draft accordingly.

Reference

[1] Friel, R., et al. RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems. arXiv:2407.11005 [cs.CL] (2025). https://doi.org/10.48550/arXiv.2407.11005.

[2] Xiong, G., et al. Benchmarking Retrieval-Augmented Generation for Medicine. In Findings of the Association for Computational Linguistics: ACL 2024, pages 6233–6251, Bangkok, Thailand. Association for Computational Linguistics.

[3] Hendrycks, D., et al. Measuring Massive Multitask Language Understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021.

[4] Papers with Code: MMLU Leaderboard

[5] Zhang, D., et al. ChemLLM: A Chemical Large Language Model. arXiv:2402.06852 [cs.AI] (2024). https://doi.org/10.48550/arXiv.2402.06852.

[6] Wang, X., et al. SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models. Proceedings of the 41st International Conference on Machine Learning, PMLR 235:50622-50649, 2024.

[7] Fang, Y., et al. Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models. Proceedings of the International Conference on Learning Representations (ICLR), 2024.

2025-06-08

I thank the authors for their detailed response. While it partially addresses some of my concerns, my overall assessment remains unchanged.

Specifically, I acknowledge the non-trivial effort involved in assembling the corpus and conducting the experiments, and I appreciate the inclusion of new results with CoT prompting. However, several important issues persist.

I was not necessarily expecting new RAG methods in this single paper, but I believe more thorough experimentation (such as incorporating SoTA reasoning models and combining CoT with RAG) and deeper analysis (beyond basic performance metrics) are needed. These additions would better demonstrate the potential and necessity of chemistry-focused RAG and provide more meaningful insights for future work.
Furthermore, comparisons with existing tool-using agents such as ChemToolAgent or ChemCrow would be valuable. Such comparisons could help clarify which technical approaches are more effective in different scenarios. If RAG outperforms these methods in specific cases, it would strengthen the case for pursuing chemistry RAG as a promising direction.

2025-06-10

Thank you for your response! We have conducted a more detailed retrieval failure analysis, and the failure are mostly:

Prioritize too much on molecule matching

Often fail when the document only mention one name

Query: What is the molecular weight of aspirin?

Document1: The molecular weight of CH3COOC6H4COOH is 180.16 g/mol.

Document2: Aspirin can cause developmental toxicity.

2025-06-10

Thanks for the new analysis. These failure modes make sense, and reveal the directions for improvement.

I appreciate the authors' effort in addressing my concerns and improving the manuscript, and thus have improved my rating. The authors may consider adding more experimentation and quantitative analysis in the next version.

2025-06-11

Thank you so much for your response and support!

Given the limited time and resources, we selected ChemCrow to run the experiments since it has significantly more citations and its GitHub repository has received substantially more stars. Please find the results below.

We found the outputs of ChemCrow on Mol-Instructions questions difficult to parse, and due to time constraints, we were unable to resolve this issue. As a result, we chose not to report the score for Mol-Instructions for now, but we would add the result to the revised version.

We also observed several drawbacks in the agent system:

It is more expensive to run, as it relies on various (potentially paid) APIs.
It is slower, since an agent needs to plan and call tools dynamically.
At least for ChemCrow, it sometimes gets stuck in an endless loop, repeatedly calling the same tool for the same information.

We will include ChemToolAgent experiments and ChemCrow's Mol-Instructions results in the revised version.

Experimental Results

Model	Method	MMLU	SciBench	ChemBench4K	Avg.
ChemCrow (GPT-4o)	Baseline	73.26	23.53	56.88	51.22
GPT-4o	Baseline	74.59	4.97	59.50	46.35
GPT-4o	Ours	73.92	8.59	67.25	49.92

评论- Response to Reviewer JvHc [2/3]

2025-06-03

Q4: The comparative evaluation is not very comprehensive.

A4: Our evaluation settings are the widely used ones in the community [1,2]. We have already considered SOTA models. For MMLU [3], GPT-4o is the strongest model based on [4]. For ChemBench [5], in their original paper, the strongest models are ChemLLM and GPT-4. For SciBench [6], in their original paper, GPT-4 performs the best. For Mol-Instruct [7], the best models are MolT5 and Text-ChemT5, and we have included the results below.

LLM	Method	Molecule Design	Forward Reaction	Description Gen.	Reagent Prediction	Retrosynthesis	Avg
MolT5	Baseline	36.38	22.49	14.10	18.15	24.16	23.06
Text-ChemT5	Baseline	33.45	85.71	2.87	18.79	60.10	40.18
LLaMA3.1-8B	Baseline	25.52	34.51	4.49	0.91	43.19	21.72
LLaMA3.1-8B	Ours	43.20	50.14	20.51	32.68	51.86	39.68
LLaMA3.1-70B	Baseline	30.89	42.47	8.04	16.17	35.27	26.57
LLaMA3.1-70B	Ours	48.74	58.24	15.95	42.14	69.06	46.83

Thank you for mentioning using chain-of-thought prompting, we have included the experiments demonstrated below. We notice that CoT does better than baseline and solely using RAG in multiple-choice settings (MMLU and ChemBench4K), but performs worse than baseline and solely using RAG in open QA (SciBench and Mol-Instruct.).

LLM	Method	MMLU	SciBench	ChemBench4K	Mol-Instruct	Avg
LLaMA3.1-8B	Baseline	42.9	3.3	27.25	23.99	24.36
LLaMA3.1-8B	CoT	61.38	1.92	32.75	6.67	25.68
LLaMA3.1-8B	Ours	52.15	3.56	25.88	41.05	30.66
LLaMA3.1-70B	Baseline	62.38	5.99	24.25	28.33	30.24
LLaMA3.1-70B	CoT	67.66	4.36	32.13	25.97	32.53
LLaMA3.1-70B	Ours	61.05	13.63	26.25	49.67	37.65

Comparing with chemistry agents may not be necessary. First, we focus on building a benchmark that makes using and researching RAG for chemistry more convenient and focus on understanding the performance of current RAG systems, rather than claiming current RAG systems can beat many other systems. Second, these chemistry agents are in different settings, for instance, ChemAgent requires a few ground truths for reasoning while we don’t provide any demonstration in our setting.

评论- Response to Reviewer JvHc [1/3]

2025-06-03

Thank you very much for your constructive comments! Besides the general response, we specifically answer your questions below and the manuscript will be carefully revised accordingly.

Q1: No new retrieval algorithms or corpora are proposed

PubChem is the largest database of freely accessible chemical information in the world. For each chemical compound in PubChem, we collect its names, properties, and description and transform them into JSON format.
Semantic Scholar composes a large amount of full-text scientific papers and includes cutting-edge chemistry research information. It also contains chemical reactivity and bioactivity information. We collect 1,849,956 full-text chemistry papers from Semantic Scholar, then we chunk them into chunks of 512 tokens with a 50-token overlap.
USPTO contains a large amount of information about reactions: reactants, reagents, products, and yields. We transform these attributes into JSON format.
Open-source textbooks: We parse the PDF textbooks to texts with Mathpix (https://mathpix.com/). Then we split the parsed textbook into chunks of 512 tokens with a 50-token overlap. The curated corpus is large-scale and comprehensive, containing millions of molecules and 1,849,956 chemistry papers, covering chemical names, properties, reactivity, bioactivity, and reactions. We will add the corpus curation section in the revised draft.

Our Contribution:

Q2: Shallow findings and analysis.

A2: Thanks for the suggestions! We will update more detailed analysis in the revised draft.

Q3: It is not chemistry-specific enough, other than the used datasets.

A3: The corpus constructed by us is created for chemistry and it integrates diverse sources of chemical knowledge, capturing a broad spectrum of information on chemical compounds, reactions, mechanistic insights, and scientific discourse. The retriever and the method of LLMs utilizing the retrieved knowledge are unique in a chemistry setting and should be presented in a separate paper. The goal of our paper is to provide all the components for users and researchers: chemistry-focused corpora (created by us), baseline retrievers, and chemistry tasks. In the future, researchers do not need to construct another database, run baseline retrievers, or find chemistry-related tasks, but simply utilize our proposed benchmark.

审稿意见

评分: 6置信度: 32025-05-25

This paper collect a benchmark by selecting useful data from existing benchmarks, build a RAG toolkit using an ensemble of existing retriever & reranker, and conduct lots of experiments on analyzing the effect of corpus to retrieve, retriever selection, and task difference.

接收理由

Lots of experiments are conducted, and very reasonable analysis on RAG's effect on many meaningful aspects.
RAG's effect indeed need systematic research: this work focus on an interesting and novel research question.

拒绝理由

The contribution seems a bit limited: the constructed benchmark seems is collected from existing benchmarks, and the developed toolkit is an ensemble of existing methods.
It might lack of some insights on how to build a better retriever from the analysis from this paper. Suggestion: maybe build one to verfiy the findings, then it would probably be a (much) more complete submission.

评论- Response to Reviewer vCXR

2025-06-02

Thank you very much for your constructive comments! Besides the general response, we specifically answer your questions below and the manuscript will be carefully revised accordingly.

Q1: The contribution seems a bit limited

A1: Our work is beyond combining previous work. The corpora is curated by us. Curating the corpus for ChemRAG benchmarking is non-trivial. We solve this issue by discussing with researchers in chemistry and biochemistry. They expect LLMs to work in their domain-specific tasks in molecule description generation, molecule generation, property and reaction predictions. We carefully collect data from sources commonly used by domain experts:

PubChem is the largest database of freely accessible chemical information in the world. For each chemical compound in PubChem, we collect its names, properties, and description and transform them into JSON format.
Semantic Scholar composes a large amount of full-text scientific papers and includes cutting-edge chemistry research information. It also contains chemical reactivity and bioactivity information. We collect 1,849,956 full-text chemistry papers from Semantic Scholar, then we chunk them into chunks of 512 tokens with a 50-token overlap.
USPTO contains a large amount of information about reactions: reactants, reagents, products, and yields. We transform these attributes into JSON format.
Open-source textbooks: We parse the PDF textbooks to texts with Mathpix (https://mathpix.com/). Then we split the parsed textbook into chunks of 512 tokens with a 50-token overlap.

The curated corpus is large-scale and comprehensive, containing millions of molecules and 1,849,956 chemistry papers, covering chemical names, properties, reactivity, bioactivity, and reactions.

The tasks in our constructed benchmark are carefully selected to represent a wide range of chemistry tasks by collaborating with the aforementioned domain experts.

The proposed benchmark makes using RAG for chemistry and research in this domain much more convenient. Please refer to our general message for more information.

Q2: It might lack some insights on how to build a better retriever from the analysis from this paper.

A2: Current sparse and dense retrieval methods still fall short in the chemistry domain. We could first construct a synonym dictionary and use the synonyms of a molecule to retrieve positive documents which are used to train a retriever. In this way, texts about the same molecule but with different names are likely to be retrieved which resulting in a better retriever. We could also use an LLM to do the retrieval by using its reasoning capability to write search queries. The focus of this paper is to benchmark RAG systems for chemistry. In addition, building a better retriever itself is a lot of contribution and may be a new paper, which is why we refrain from constructing a new retriever in this paper.

Here is an example of the analysis from this paper:

Query: What is the molecular weight of aspirin?

Document1: The molecular weight of CH3COOC6H4COOH is 180.16 g/mol.

Document2: Aspirin can cause developmental toxicity.

2025-06-09

Thank you for the reponse. With the discussion, I think this paper can make a good contribution to the field, and have adjusted my score to reflect it.

2025-06-08

Dear Reviewer vCXR,

As we approach the end of the discussion period, could you please check the authors' responses and see if they have addressed your concerns? Thank you very much for your efforts.

Best, AC

评论- General Response to All Reviewers

2025-06-02

We appreciate all reviewers for their constructive comments! This paper is to build a benchmark on retrieval-augmented generation (RAG) systems for chemistry and systematically study the effect of the RAG systems on chemistry tasks. The proposed “Toolkit” is for benchmark studies rather than proposing a new solution or a new method.

“Why RAG in Chemistry?”

Chemistry is a highly specialized and dynamic discipline. Thus LLMs trained on general corpora often fail to generate grounded and accurate responses for chemistry, instead they may produce hallucinated or outdated contents. RAG presents a natural solution to these limitations, allowing models to retrieve and incorporate trusted chemical knowledge during inference. Despite the growing interest in applying RAG to the chemistry domain, there remains a lack of standardized benchmarks and well-curated domain-specific resources to support rigorous evaluation and design of RAG systems. By collaborating with researchers in chemistry and biochemistry, we collect resources from PubChem, USPTO, PubMed, Semantic Scholar, chemistry textbooks, and Wikipedia, covering a wide range of information on chemical compounds, reactions, and experimental procedures.

“Our Contribution”:

“Our work is beyond combining the previous work”:

Large-scale and Comprehensive Corpora Construction Curating the corpus for ChemRAG benchmarking is non-trivial. We solve this issue by discussing with researchers in chemistry and biochemistry. They expect LLMs to work in their domain-specific tasks in molecule description generation, molecule generation, property and reaction predictions. We carefully collect data from sources commonly used by domain experts:

PubChem is the largest database of freely accessible chemical information in the world. For each chemical compound in PubChem, we collect its names, properties, and description and transform them into JSON format.
Semantic Scholar composes a large amount of full-text scientific papers and includes cutting-edge chemistry research information. It also contains chemical reactivity and bioactivity information. We collect 1,849,956 full-text chemistry papers from Semantic Scholar, then chunk them into chunks of 512 tokens with a 50-token overlap.
USPTO contains a large amount of information about reactions: reactants, reagents, products, and yields. We transform these attributes into JSON format.
Open-source textbooks: We parse the PDF textbooks to texts with Mathpix (https://mathpix.com/), and then split the parsed textbook into chunks of 512 tokens with a 50-token overlap.

The curated corpus is large-scale and comprehensive, containing millions of molecules and 1,849,956 chemistry papers, covering chemical names, properties, reactivity, bioactivity, and reactions.

Representative Tasks The tasks in our constructed benchmark are carefully selected to represent a wide range of chemistry tasks by collaborating with aforementioned domain experts. Chemists mainly leverage LLMs for understanding molecules, getting information of molecules [1], predicting properties and reactions [1, 2], and generating molecules [1, 3]. The tasks in our paper cover many scenarios that chemists may encounter when using LLMs.
The proposed benchmark makes using RAG for chemistry and researching in this domain much more convenient. For general users, they can use it for chemistry-related tasks, including searching for compound information, predicting reactions, and generating molecules. For researchers, the benchmark makes it easier to evaluate their proposed methods, and serves as a solid baseline for future follow-up works in this research line.

Our code, question datasets, and retrieval results can be found at: https://anonymous.4open.science/r/ChemRAG_anonymous-EB23/.

[1] Zheng, Y., Koh, H.Y., Ju, J. et al. Large language models for scientific discovery in molecular property prediction. Nat Mach Intell 7, 437–447 (2025). https://doi.org/10.1038/s42256-025-00994-z

[2] Yu, B., Baker, F.N., Chen, Z. et al. LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset. Conference on Language Modeling (COLM). 2024.

[3] Ishida, S., Sato, T., Honma, T. et al. Large language models open new way of AI-assisted molecule design for chemists. J Cheminform 17, 36 (2025). https://doi.org/10.1186/s13321-025-00984-8

最终决定Accept

2025-07-08

Summary

This paper presents ChemRAG-Bench, a new benchmark designed to systematically assess RAG for chemistry tasks. It also provides ChemRAG-Toolkit, a modular framework that supports multiple retrievers and LLMs for controlled experimentation. The benchmark integrates various chemistry-specific corpora and evaluates across a variety of tasks. The paper argues that this is the first large-scale benchmark dedicated to RAG in chemistry.

Strengths:

Reviewers in general agree that this is the first comprehensive benchmarking effort for RAG applied to chemistry, a highly specialized and underexplored domain for LLMs. General-purpose LLMs struggle in chemistry without retrieval, and this work helps fill the gap.
ChemRAG-Bench and ChemRAG-Toolkit can be useful to the community (Reviewer YkjK). Lots of experiments are conducted on meaningful aspects, they are well organized, and analysis is reasonable (Reviewer JvHc, vCXR).

Weaknesses:

One weakness mentioned by most reviewers (Reviewer vCXR, JvHc, cNEq) is that the paper mostly combines existing datasets, models, and retrievers. No new retrieval methods are proposed, and the toolkit primarily assembles existing components. Relatedly, the paper does not discuss in depth about chemistry-specific modeling or retrieval challenges, or propose new retrieval or generation methods adapted to the unique challenges of chemistry domain.
Reviewer cNEq mentioned about data contamination and copyright issues (very important). The authors should add some discussions on limitations and ethics to clarify them. For example, it might be OK to acknowledge there is a risk of data contamination, but given the low performance in general, the issue might not be as serious as one would imagine.

During the author response, authors added new experiments (e.g., chain-of-thought, ChemCrow agent comparisons), released code, improved analysis, and clarified the corpora as well as the engagement with domain experts, which addressed many concerns.

I think the creation of a chemistry-specific RAG benchmark and toolkit fills an important gap and is likely to benefit both the AI-for-chemistry and RAG research communities. Therefore, I would recommend to accept the paper, provided that the authors add the new results and more discussions about current methods’ failure modes and possible future work as well as the missed references mentioned during the response period (such as ChemCrow and ChemToolAgent).