ChatQA: Surpassing GPT-4 on Conversational QA and RAG
We introduce ChatQA, a suite of models that outperform GPT-4 on RAG and conversational QA.
Abstract
Reviews and Discussion
This paper proposes a family of fine-tuned models (ChatQA) that surpass GPT-4 on conversational QA and RAG. It introduces a two-stage instruction fine-tuning method to enhance the model's capability to use retrieved context for generation. In addition, it shows that fine-tuning a single-turn retriever (Dragon) on human-annotated data can closely match LLM-based query rewriting, thereby eliminating the associated computational and API costs. Lastly, the paper introduces the ChatRAG Bench for evaluating RAG, table-related QA, and arithmetic calculations. Experiments on the ChatRAG Bench show that the proposed ChatQA models outperform or closely match the strong generalist model GPT-4.
Strengths
- Overall, the writing of the paper is good. The experiments and ablation studies are sound and comprehensive, demonstrating the effectiveness of the two-stage instruction fine-tuning method and the data curation recipe.
- The open-sourced training, the data curation recipes, and the use of the public foundation models (Llama2 and Llama3) could be valuable and beneficial for the community in chasing proprietary models.
Weaknesses
- Training an open foundation model with curated instruction data is not new. The paper could be improved if it demonstrated why the selected mixture of training data is effective and why training on it can match GPT-4.
- While the ablation studies show the effectiveness of the collected data in training ChatQA models, I believe more fine-grained data selection and analysis could potentially further improve the performance [1][2].
[1] How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources, 2023
[2] A Survey on Data Selection for Language Models, 2024
Questions
- L163, in the stage-2 context-enhanced instruction tuning, it seems that you again use all of the SFT datasets from stage-1. Will it lead to overfitting? If you want to maintain the instruction-following ability, why not merge all data from stage-1 and stage-2 and then do multi-task tuning?
Limitations
N/A
Thank you for your constructive comments and feedback. We will address your questions below.
“1. Training an open foundation model with curated instruction data is not new. The paper could be improved if it demonstrates why the selected mixture of training data is effective, and training on them could match GPT-4.”
- Thank you for the suggestion. ChatQA models show compelling results because our collected instruction tuning datasets are designed for conversational QA and RAG.
- We collected a conversational QA dataset where the document context is provided for each sample, enabling our models to learn how to locate relevant information and generate accurate answers. Additionally, the data samples are in a conversational format, allowing our models to answer follow-up questions. Furthermore, we include several single-turn QA tasks with document contexts to enhance information-seeking capabilities. To enable our model's tabular understanding and reasoning abilities, we incorporate TAT-QA, which provides a rich source of tables and QA samples involving mathematical calculations and reasoning.
- In Table 3, we present comprehensive ablation studies demonstrating the effectiveness of these single-turn QA datasets and the conversational QA dataset, as well as a comparison between the synthetic and human-annotated conversational QA datasets.
- We will elaborate further and provide more quantitative results on why the selected datasets are effective in the final version of this paper.
“2. While the ablation studies show the effectiveness of the collected data in training ChatQA models, I believe more fine-grained data selection and analysis could potentially further improve the performance [1][2].
[1] How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources, 2023
[2] A Survey on Data Selection for Language Models, 2024”
- Thanks for your great suggestion. We will cite and discuss the papers you mentioned in related work. We will look into more detailed data selection and analysis in our future work.
“Questions: 1. L163, in the stage-2 context-enhanced instruction tuning, it seems that you again use all of the SFT datasets from stage-1. Will it lead to overfitting? If you want to maintain the instruction-following ability, why not merge all data from stage-1 and stage-2 and then do multi-task tuning?”
- This is a good question. The two-stage approach is akin to curriculum learning: it first equips the model with basic instruction-following capabilities and then enhances its conversational QA and RAG capabilities. In stage-2, we blend all of the SFT datasets from stage-1 back in to ensure the model does not forget its instruction-following capabilities (a minimal sketch of such a blend is shown after these points).
- In Table 3, we conducted this ablation study by training on all datasets merged from stage-1 and stage-2. The result is denoted as "w/o stage-1" because it is essentially the same as performing only stage-2 training, which uses all the datasets. We found that this resulted in an overall drop in performance, demonstrating the effectiveness of the proposed two-stage curriculum training.
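As a concrete illustration of the blending step above, here is a minimal sketch in Python; the source names, sampling weights, and helper function are illustrative assumptions rather than the paper's exact recipe.

```python
import random

def build_stage2_mixture(stage1_sft, conv_qa, single_turn_qa, weights, num_samples, seed=0):
    """Assemble the stage-2 training blend: context-enhanced QA data plus the
    stage-1 SFT data, re-included so instruction-following ability is retained.

    `weights` maps each source name to a sampling weight (hypothetical values).
    Each source is a list of training samples.
    """
    rng = random.Random(seed)
    sources = {
        "stage1_sft": stage1_sft,          # general instruction-following data from stage-1
        "conv_qa": conv_qa,                # conversational QA with grounding documents
        "single_turn_qa": single_turn_qa,  # single-turn QA with document context
    }
    names = list(sources)
    mixture = []
    for _ in range(num_samples):
        name = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
        mixture.append(rng.choice(sources[name]))
    return mixture

# Example with made-up ratios (not the paper's):
# blend = build_stage2_mixture(stage1_sft, conv_qa, single_turn_qa,
#                              weights={"stage1_sft": 0.3, "conv_qa": 0.4, "single_turn_qa": 0.3},
#                              num_samples=192_000)
```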
Thank you once again for your review. We hope our response satisfactorily addresses all your questions.
Thanks authors for the detailed response. It has addressed most of my concerns. I have updated the scores accordingly.
Dear Reviewer,
Many thanks once again for your insightful feedback. We appreciate your acknowledgment of our efforts in your comments.
This paper explores RAG in conversational QA scenarios. It proposes a two-stage instruction tuning method for conversational QA in a RAG manner, accompanied by a comprehensive benchmark. The training of the LM involves two stages: (i) SFT on dialogue and QA datasets and (ii) context-enhanced instruction tuning, which includes SFT on both synthetic and existing datasets. Similar strategies are applied to tune the retriever for multi-turn QA scenarios. The experiments demonstrate that the proposed ChatQA model surpasses various existing large language models in performance.
Strengths
This paper effectively underscores the significance of RAG in chat-based QA scenarios, a critical area that has been overlooked in previous research. It offers a detailed and concise overview of the current research landscape, scenario positioning, and existing challenges, providing valuable insights into these areas. Additionally, the proposed benchmark presented in the paper could be valuable for advancing research in this domain.
Weaknesses
- There is a lack of technical novelty. The paper focuses on fine-tuning a language model using an existing dialogue dataset and self-created data for this scenario. While the approach is clearly outlined, it would be helpful if the authors could further elaborate on the innovative aspects of their methodology to highlight its novelty.
- The paper lacks clear definitions and formalizations of the tasks that need to be addressed, which obscures the distinctions between its concepts and existing works. It took me some time to realize that the "instruction tuning method" pertains solely to the construction of training data and is conducted without retrieval. This leads to questions about the method's relevance to the subsequent RAG framework. The connection between the proposed method, which is tailored for LM in chat-based QA, and the RAG remains unclear and needs better delineation.
- Lack of essential ablation studies. Numerous choices regarding training data selection and the construction of synthetic data are only briefly explained. These choices, while seemingly reasonable, necessitate thorough ablation and comparison with standard processes to validate their effectiveness. Additionally, although several strategies for tuning the retriever for this scenario are proposed, there is a lack of analysis and ablation to justify their usefulness.
Questions
- Is there any difference between the concepts of "multi-turn QA" and "conversational QA" in this work? These terms appear to be used interchangeably.
- Could you provide justification for the proposed retriever tuning strategies?
- Does the conversational QA scenario differ from traditional QA primarily because it includes additional context? If so, it would be intriguing to assess the impact on QA accuracy when this context is removed. Furthermore, comparing the importance of this context against the content retrieved in the RAG may provide insightful observations.
Limitations
NA
Follow-up response to question 3.
Does the conversational QA scenario differ from traditional QA primarily because it includes additional context? If so, it would be intriguing to assess the impact on QA accuracy when this context is removed. Furthermore, comparing the importance of this context against the content retrieved in the RAG may provide insightful observations.
- Many thanks for raising this intriguing question. Conversational QA differs from traditional QA in that it includes dialogue history, where a particular question (e.g., a follow-up question) may reference information from previous dialogue. In our work, all dialogue history and the current question are fed into the LLM, and the retriever also takes them as input instead of relying solely on the current question (a minimal sketch of this input construction follows the table below). We will further clarify this in the final draft. Removing the dialogue history could lead to many unanswerable cases, especially for follow-up questions. Following your suggestions, we assessed the impact on QA accuracy when the conversational context (previous turns of dialogue) is removed and found a significant drop in accuracy.
| Models | ChatRAG |
|---|---|
| ChatQA-1.5-70B | 58.25 |
| - remove dialog history | 46.88 |
| Llama-3-Instruct-70B | 52.52 |
| - remove dialog history | 42.84 |
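As a minimal sketch of the input construction described above (the role tags and separator are our illustrative choices, not necessarily the exact template used in the paper):

```python
def build_conversational_query(dialogue_history, current_question):
    """Concatenate previous turns with the current question so that both the
    retriever and the LLM can resolve references in follow-up questions.

    `dialogue_history` is a list of (role, text) tuples, e.g., [("User", ...), ("Assistant", ...)].
    """
    turns = [f"{role}: {text}" for role, text in dialogue_history]
    turns.append(f"User: {current_question}")
    return "\n".join(turns)

# Example:
# query = build_conversational_query(
#     [("User", "Who founded the company?"), ("Assistant", "It was founded by Jane Doe.")],
#     "When did she found it?")
# passages = retriever.search(query)  # hypothetical retriever interface
```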
- Both conversational QA (multi-turn) and traditional QA (single-turn, e.g., Natural Questions, TriviaQA, and HotpotQA) typically rely on grounding documents related to the questions (referred to as context or documents in the literature). Our ChatQA performs very well on traditional single-turn QA tasks (see Table 5) because these can be viewed as a special case of conversational QA where the number of turns is just one. Removing grounding documents or disabling RAG requires the model to answer questions from its parametric knowledge, which can lead to more hallucinations. We also observed a significant drop in QA accuracy when the document context is disabled.
| Models | ChatRAG |
|---|---|
| ChatQA-1.5-70B | 58.25 |
| - remove all document context | 31.61 |
| Llama-3-Instruct-70B | 52.52 |
| - remove all document context | 28.14 |
We really appreciate your comments and suggestions and will incorporate them into our final draft. We hope our response addresses your concerns. Please let us know if you have any further questions.
Thank you for your detailed comments and feedback. We will address your concerns and questions below.
“There is a lack of technical novelty…it would be helpful if the authors could further elaborate on the innovative aspects of their methodology to highlight its novelty”
- The innovative aspects of the methodology include: i) We propose a two-stage instruction tuning method, with one stage focusing on general instruction-following capabilities and the other focusing on document-grounded and contextualized QA. The effectiveness of this method is validated in the ablation study (see Table 3). ii) We introduce a method to obtain a retriever for conversational QA that can utilize the entire conversation history for accurate retrieval from the documents. It performs as well as the SOTA ChatGPT-based query rewriting model but significantly reduces deployment costs.
- In addition to the innovative aspects of the methodology, we think that technical novelty can also come from experimental results, and this work presents a substantial set of novel results and findings. For example, i) the ChatQA-1.5-70B model is built on a weaker foundation model (Llama3-70B) compared to GPT-4-Turbo, yet it significantly surpasses GPT-4-Turbo (Avg. 58.25 vs. 54.03) with a carefully designed instruction tuning recipe, without utilizing any synthetic data from ChatGPT models. ii) We demonstrate that the proposed multi-turn query retriever can be as good as the ChatGPT-based query rewriting model in the conversational RAG setting. iii) We are one of the first open-source efforts to show compelling results for unanswerable cases (outperforming GPT-3.5, but slightly worse than GPT-4) in a zero-shot QA setting. We believe that novel results and findings are as important as novel ideas for LLM research.
- Last but not least, we open-source the ChatQA models, the retriever model, the training data, and the ChatRAG Bench. We believe this contribution holds significant value for the research community.
“The paper lacks clear definitions and formalizations of the tasks that need to be addressed, which obscures the distinctions between its concepts and existing works. It took me some time to realize that the "instruction tuning method" pertains solely to the construction of training data and is conducted without retrieval. This leads to questions about the method's relevance to the subsequent RAG framework. The connection between the proposed method, which is tailored for LM in chat-based QA, and the RAG remains unclear and needs better delineation.”
- Thanks for raising the question. The goal of the ChatQA work is to address context-rich conversational QA, where the context can involve long documents that require RAG. In the final evaluation, we use five long-document datasets that need retrieval and five short-document datasets whose documents fit into LLM prompts. Detailed information can be found in Section 5.2. We will clarify the tasks further in our final manuscript.
- For the construction of instruction-tuning data, we did try including training data with top-5 retrieved chunks from long documents. We had to put the results in Appendix A.1 due to the space limit. We find that this can slightly improve the RAG results on long-document datasets but degrades the performance on short-document datasets. Note that the context-enhanced instruction tuning (stage-2) empowers the ChatQA model to effectively integrate useful information from the "context" for response generation. This context can be either a user-provided short document or the top-k retrieved chunks from a provided long document (a minimal sketch of this context construction is given below).
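For concreteness, here is a minimal sketch of how the grounding context for a stage-2 sample could be assembled, either from a short document directly or from the top-k retrieved chunks of a long one; the chunking scheme, word thresholds, and retriever interface are our assumptions.

```python
def chunk_document(document, chunk_size=300):
    """Split a long document into fixed-size word chunks (illustrative chunking)."""
    words = document.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def build_grounding_context(document, query=None, retriever=None, k=5, max_words=3000):
    """Return the context fed to the model: the document itself if it is short enough,
    otherwise the top-k chunks retrieved for the (conversational) query."""
    if retriever is None or len(document.split()) <= max_words:
        return document  # short document: use it as-is
    chunks = chunk_document(document)
    ranked = retriever.rank(query, chunks)  # hypothetical: returns chunks sorted by relevance
    return "\n\n".join(ranked[:k])
```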
“Lack of essential ablation studies. Numerous choices regarding training data selection and the construction of synthetic data are only briefly explained. These choices, while seemingly reasonable, necessitate thorough ablation and comparison with standard processes to validate their effectiveness.”
- Thanks for your reminder. In Table 3, we have conducted ablation studies on the key components of our proposed training strategies, including the effectiveness of adding single-turn QA datasets and the multi-turn QA dataset, the contributions of stage-1 and stage-2 training, and the comparison between the synthetic and human-annotated multi-turn QA datasets.
- In Appendix H.2, we perform ablation studies on selecting the number of unanswerable samples in the instruction tuning dataset (Table 11).
- In Appendix A, we provide more ablation studies on utilizing top-k retrieved chunks for instruction-tuning (Table 6). We will highlight the list of ablation studies in the main text for the final version of the paper.
“Lack of analysis and ablation to justify the usefulness of the proposed retriever tuning. Could you provide justification for the proposed retriever tuning strategies?”
- Existing retrievers, such as Dragon and E5, are designed for single-turn queries and struggle to generalize to questions within conversations that reference previous dialogue, such as follow-up questions. To address this limitation, our retriever tuning strategy further fine-tunes the single-turn retriever on pairs of conversational queries and corresponding contexts (passages) from documents, using the same contrastive loss. This enhances the retriever's ability to handle conversational queries effectively (a minimal sketch of this objective is given after these points).
- In addition, we provide ablation studies and analyses on different strategies for tuning the retriever and query rewriting method in Appendix E.2 (Table 9), which further illustrates the effectiveness of our proposed retriever tuning strategy.
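A minimal sketch of the contrastive fine-tuning objective described above, using in-batch negatives; the encoder interfaces, embedding normalization, and temperature value are illustrative assumptions rather than the exact training setup:

```python
import torch
import torch.nn.functional as F

def contrastive_step(query_encoder, ctx_encoder, conv_queries, positive_ctxs, temperature=0.05):
    """One training step for adapting a single-turn dense retriever (a dual encoder
    such as Dragon) to conversational queries, using in-batch negatives.

    `conv_queries` are full conversational inputs (dialogue history + current question);
    `positive_ctxs` are the gold passages; each encoder returns [batch, dim] embeddings.
    """
    q = F.normalize(query_encoder(conv_queries), dim=-1)     # [B, D]
    c = F.normalize(ctx_encoder(positive_ctxs), dim=-1)      # [B, D]
    scores = q @ c.T / temperature                           # [B, B] similarity matrix
    labels = torch.arange(q.size(0), device=scores.device)   # gold passage sits on the diagonal
    return F.cross_entropy(scores, labels)                   # other in-batch passages act as negatives
```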
"Any difference between the concepts of "multi-turn QA" and "conversational QA" in this work?"
- Multi-turn QA and conversational QA are interchangeable terms.
We will respond to your intriguing final question in the follow-up comment.
Dear Reviewer,
Thank you again for your detailed comments and constructive suggestions. We will incorporate all of them into the final version of the paper. We hope our response can help address your concerns. Please let us know if you have any additional questions. We would be happy to discuss them further with you.
Dear Authors,
Thank you for addressing my concerns. I have made the necessary changes to my score based on your response.
Thank you once again for providing such a helpful review and for acknowledging our efforts in your response.
This paper introduces ChatQA, a suite of models that outperform GPT-4 on retrieval-augmented generation (RAG) and conversational question-answering (QA) tasks. The key contributions of the paper include:
- A two-stage instruction tuning method that improves RAG performance.
- A dense retriever optimized for conversational QA that performs comparably to state-of-the-art query rewriting models while reducing deployment costs.
- CHATRAG BENCH, a comprehensive benchmark comprising 10 conversational QA datasets for evaluating RAG, table-related QA, arithmetic calculations, and unanswerable questions.
- ChatQA-1.0-70B, built on Llama2, which slightly outperforms GPT-4-0613 and GPT-4-Turbo on CHATRAG BENCH without relying on synthetic data from OpenAI GPT models.
- Llama3-ChatQA-1.5-70B, which surpasses GPT-4-Turbo-2024-04-09 by a good margin.
- Good performance on single-turn QA and RAG benchmarks, with Llama3-ChatQA-1.5-70B outperforming existing frontier RAG models like Command R+.
Strengths
The paper demonstrates several notable strengths across several dimensions:
- The two-stage instruction tuning method is a novel approach to enhancing RAG performance, combining supervised fine-tuning (SFT) with context-enhanced instruction tuning.
- The development of a dense retriever optimized for conversational QA without relying on query rewriting is an innovative solution to a common challenge in multi-turn QA systems.
- The creation of CHATRAG BENCH as a comprehensive evaluation suite for conversational QA and RAG tasks is a good contribution to the field.
- The empirical results are robust, with the ChatQA models outperforming strong baselines, including GPT-4, across multiple datasets and tasks.
- The ablation studies provide valuable insights into the contributions of different components of the proposed method.
- The study on handling "unanswerable" scenarios addresses an important challenge in QA systems, contributing to the development of more robust models.
Weaknesses
- The study primarily focuses on Llama2 and Llama3 of sizes 7B/8B, 70B as base models. Including a wider range of base models could provide insights into the generalizability of the proposed methods. Also, a detailed analysis of how performance scales with model size could offer insights into the trade-offs between model size and performance.
- The paper doesn't address potential ethical implications or biases that might be present in the developed models or the CHATRAG BENCH. A brief discussion on this would enhance the paper's comprehensiveness.
- While the paper provides extensive quantitative results, it doesn't include a detailed qualitative analysis of the types of errors made by the models. This could offer insights into areas for future improvement.
Questions
- How does the performance of ChatQA models scale with model size? Is there a point of diminishing returns? This analysis could offer insights into the trade-offs between model size and performance.
- Are there specific limitations of the current approach that you think are most important to address apart from expanding to code-related tasks or math reasoning tasks?
- What do you see as the most promising avenues for future work based on your findings?
- Do you have any explanation or intuition as to why incorporating more unanswerable samples beyond 1.5K leads to lower accuracy scores in most of the tasks in Table 11?
- You mention that fine-tuning a single-turn query retriever performs comparably to state-of-the-art query rewriting. Can you elaborate on the trade-offs between these two approaches beyond computational cost?
Limitations
They discuss the limitations at the beginning of the Appendix (Line#908-914).
Thank you so much for your detailed comments and feedback. We will address your questions below.
“The study primarily focuses on Llama2 and Llama3 of sizes 7B/8B, 70B as base models. Including a wider range of base models could provide insights into the generalizability of the proposed methods. Also, a detailed analysis of how performance scales with model size could offer insights into the trade-offs between model size and performance.” and “Question 1. How does the performance of ChatQA models scale with model size? Is there a point of diminishing returns? This analysis could offer insights into the trade-offs between model size and performance.”
- Many thanks for your comment. We actually have results on a wider range of base models, including in-house pretrained GPT-{8B, 22B} (pretrained on 3.5T tokens), Llama2-{7B, 13B, 70B}, and Llama3-{8B, 70B}. The full results of all these models are in Appendix K (Table 14) due to the page limit.
- Thanks for suggesting this analysis. In Table 14 (Appendix K), we have studied model sizes of 7B, 13B, 22B, and 70B for ChatQA-1.0. Note that the 22B model is our in-house GPT-22B, and the remaining models are based on Llama2. The ChatRAG average scores of these ChatQA models are as follows:
| Models | ChatRAG |
|---|---|
| ChatQA-1.0-7B | 46.96 |
| ChatQA-1.0-13B | 50.27 |
| ChatQA-1.0-22B | 53.01 |
| ChatQA-1.0-70B | 53.89 |
- We find that performance improves substantially from ChatQA-1.0-7B to 13B and from 13B to 22B, but the improvement becomes marginal from 22B to 70B. Therefore, a rough estimate suggests that a 22B model might strike a good balance between model size and performance, although we believe more thorough studies are needed to analyze this trade-off in detail.
“The paper doesn't address potential ethical implications or biases that might be present in the developed models or the CHATRAG BENCH. A brief discussion on this would enhance the paper's comprehensiveness.”
- Thanks for this suggestion. We will include this discussion in the paper.
“While the paper provides extensive quantitative results, it doesn't include a detailed qualitative analysis of the types of errors made by the models. This could offer insights into areas for future improvement.”
- We put case studies in Appendix I due to the page limit, where we analyze errors made by ChatQA-1.0-13B, ChatQA-1.0-70B, GPT-3.5-Turbo, and GPT-4. We find that ChatQA models are robust in text-based information-seeking scenarios but sometimes make mistakes on tabular reasoning questions. We also find that the larger ChatQA model (e.g., ChatQA-1.0-70B) is able to correct typos in the question or context, while the smaller one (e.g., ChatQA-1.0-13B) might fail to do so.
“Are there specific limitations of the current approach that you think are most important to address apart from expanding to code-related tasks or math reasoning tasks?”
- Thank you for raising this question. We believe there are several limitations to the current approach and many opportunities to extend its capabilities:
- ChatQA is tested with RAG tasks involving single-step retrieval. It would be interesting to explore and extend its RAG capabilities to handle multiple retrieval steps, requiring joint reasoning to provide accurate answers.
- The current version of ChatQA-1.5 only supports an 8K context window. Extending this context window would be beneficial for long-context summarization tasks.
“What do you see as the most promising avenues for future work based on your findings?”
- We observe very strong results from relatively small ChatQA models. For example, Llama3-ChatQA-1.5-8B achieves results comparable to GPT-4-Turbo. A promising direction would be to explore the effectiveness of even smaller models (e.g., 1B) through model distillation. Smaller models are ideal for deployment on mobile devices, and conversational QA and RAG are important use cases for users.
“Do you have any explanation or intuition as to why incorporating more unanswerable samples beyond 1.5K leads to lower accuracy scores in most of the tasks in Table 11?”
- We conjecture that it might be attributed to the data quality of the unanswerable samples. For HumanAnnotatedData, we asked annotators to identify all locations in the context that are relevant to the user's question, and we construct unanswerable samples by deleting the text at those locations. There is a possibility that the relevant context for a few questions is not entirely removed, leading to incorrect unanswerable training samples. A better data filtering strategy could therefore improve the data quality and potentially the overall accuracy.
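To make this construction concrete, here is a minimal sketch assuming character-offset span annotations and a placeholder refusal string (both are our assumptions, not the paper's exact format):

```python
def make_unanswerable_sample(context, relevant_spans, question,
                             refusal="Sorry, I cannot find the answer in the context."):
    """Build an unanswerable sample by deleting the annotated relevant spans from the context.

    `relevant_spans` is a list of (start, end) character offsets marked by annotators.
    """
    kept, cursor = [], 0
    for start, end in sorted(relevant_spans):
        kept.append(context[cursor:start])   # keep text before the relevant span
        cursor = max(cursor, end)            # skip the span (handles overlaps)
    kept.append(context[cursor:])
    return {"context": "".join(kept), "question": question, "answer": refusal}
```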
“You mention that fine-tuning a single-turn query retriever performs comparably to state-of-the-art query rewriting. Can you elaborate on the trade-offs between these two approaches beyond computational cost?”
- Query rewriting requires making API calls to powerful LLMs to rewrite the query, which incurs additional costs every time a conversational query arrives. Alternatively, we could build a strong instruction-following or query rewriting model ourselves, but this requires additional computational resources and large amounts of training data.
- Fine-tuning a single-turn query retriever, however, requires careful data collection of conversational query and context pairs. We show that collecting around 7K conversations (around 5 conversational query-context pairs per conversation) is sufficient to build a powerful multi-turn query retriever that performs comparably to the GPT-3.5-Turbo query rewriting approach.
“They discuss the limitations at the beginning of the Appendix (Line#908-914).”
- Thanks for pointing this out. We will move this discussion to right after the conclusion to make it easier to find.
The paper introduces ChatQA, a suite of models designed to excel in retrieval-augmented generation (RAG) and conversational question answering (QA). The authors propose a two-stage instruction tuning methodology to bolster generative capabilities and a dense retriever optimized for conversational QA to improve retrieval effectiveness. Notably, the ChatQA models, even when built on foundation models perceived as weaker than GPT-4, demonstrate superior performance on RAG and conversational QA tasks, surpassing GPT-4 in certain benchmarks. The paper further contributes the ChatRAG Bench, a comprehensive benchmark for evaluating RAG and conversational QA models, and releases model weights, training data, the ChatRAG Bench itself, and the retriever to the community.
Soundness:
The technical claims, experimental methodology, and research design in this paper are well-supported and sound. The central claims are convincingly backed by extensive experimental results and comparisons to established baselines, including GPT-4. The ablation studies further validate the efficacy of the proposed two-stage fine-tuning approach and the significance of the curated datasets. The paper demonstrates a meticulous and rigorous approach, ensuring the reliability and reproducibility of its findings.
Presentation:
The paper exhibits a clear and well-organized presentation style. The writing is lucid, and the technical concepts are effectively conveyed. The authors adequately contextualize their work within the landscape of prior research, highlighting the novel contributions of their approach. Overall, the paper is well-written and easy to follow, making it accessible to a broad audience.
Contribution:
The paper makes a significant contribution to the field of conversational QA and RAG. The proposed ChatQA models push the boundaries of performance, even outperforming GPT-4 in certain benchmarks. The open-sourcing of model weights, data, and the ChatRAG Bench fosters transparency and collaboration, promoting further advancements in the research community. The paper's findings challenge prevailing assumptions about the necessity of relying on synthetic data from OpenAI GPT models, opening avenues for innovative training strategies.
Strengths
- The paper introduces a novel two-stage instruction tuning methodology that significantly enhances the context-aware and RAG-based QA capabilities of LLMs. It also proposes an effective dense retriever optimized for conversational QA.
- The research is meticulously conducted, with rigorous experiments and comprehensive evaluations. The ablation studies provide valuable insights into the contributions of various components of the proposed approach.
- The paper is well-written and clearly structured, making it easy to follow the authors' line of reasoning. The technical concepts are presented in an accessible manner.
- The paper's findings are impactful, demonstrating that state-of-the-art performance in conversational QA and RAG can be achieved without reliance on synthetic data from OpenAI models. The open-sourcing of resources further amplifies the significance of this work.
Weaknesses
The paper primarily focuses on evaluating the "unanswerable" scenario using a small set of samples. A more extensive evaluation involving diverse "unanswerable" scenarios would enhance the robustness of the findings.
Questions
How does the proposed two-stage instruction tuning method compare to other state-of-the-art instruction tuning approaches in terms of efficiency?
Limitations
The paper does not identify any potential negative societal impact, which may be worth exploring in future work, particularly given the potential for misuse of powerful conversational QA models.
Many thanks for your detailed comments and feedback. We will address your questions below.
“The paper primarily focuses on evaluating the "unanswerable" scenario using a small set of samples. A more extensive evaluation involving diverse "unanswerable" scenarios would enhance the robustness of the findings.”
- Thank you for your suggestion. We will conduct further studies on “unanswerable” scenarios in future work. We also believe that the open LLM research community needs to invest more in this area. The leading proprietary LLMs also struggle with balancing hallucination and incorrect refusal answers. For example,
- “Previous Claude models often made unnecessary refusals that suggested a lack of contextual understanding.” from https://www.anthropic.com/news/claude-3-family
“How does the proposed two-stage instruction tuning method compare to other state-of-the-art instruction tuning approaches in terms of efficiency?”
- We believe our ChatQA training requires far less compute than state-of-the-art instruction tuning approaches (e.g., Llama3-instruct) that use an enormous SFT dataset and even require iterative online SFT training.
- For all ChatQA models, we perform one epoch of training on a relatively small SFT dataset (128K samples) in stage-1. We use a global batch size of 128 and fine-tune the base model for 1000 iterations. In stage-2, we blend the conversational QA, single-turn QA, and stage-1 SFT datasets with certain ratios and further fine-tune the model for 3000 iterations with a global batch size of 64. In total, our model therefore consumes 128,000 + 64 * 3000 = 320,000 samples, far fewer than Llama-3's alignment training, which uses more than 10M samples.
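As a quick check of the sample count quoted above (the batch sizes and iteration counts are those stated in this response):

```python
stage1_samples = 128 * 1000  # global batch size 128 for 1000 iterations -> 128,000 samples
stage2_samples = 64 * 3000   # global batch size 64 for 3000 iterations  -> 192,000 samples
print(stage1_samples + stage2_samples)  # 320,000 samples in total
```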
“The paper does not identify any potential negative societal impact, which may be worth exploring in future work, particularly given the potential for misuse of powerful conversational QA models.”
- Thanks for pointing this out. We have discussed the potential negative impacts on page 22 (right before the Appendix), including the potential misuse of ChatQA models. We will move this discussion to right after the conclusion to make it easier to find.
We really appreciate your review and hope our response addresses all your questions.
The paper introduces ChatQA, a set of models designed to excel in retrieval-augmented generation (RAG) and conversational question answering (QA), demonstrating a good performance that, in some cases, surpasses GPT-4. The authors present a two-stage instruction tuning method and a dense retriever specifically optimized for conversational QA, alongside a benchmark called ChatRAG Bench. The study is complemented by the release of model weights, training data, and the benchmark to the community, which could potentially foster further research in this domain.
The technical rigor of the paper is generally solid, with well-supported claims and extensive experimentation that effectively validates the proposed methods. The ablation studies reinforce the methodology's efficacy, particularly the two-stage fine-tuning approach. However, some reviewers have raised concerns about the lack of technical novelty, pointing out that the approach relies heavily on existing techniques and does not significantly advance beyond current methods.
The presentation of the paper is clear and well-structured, making the complex concepts accessible to a broad audience. Despite this, there are concerns regarding the lack of in-depth analysis of the types of errors made by the models and the absence of discussions around ethical implications or biases. These omissions limit the paper's comprehensiveness and suggest areas for improvement, particularly in providing more detailed qualitative insights and addressing potential societal impacts.