Can LLMs Solve Longer Math Word Problems Better?
We investigate the impact of long contexts on mathematical reasoning abilities of large language models.
Abstract
Reviews and Discussion
This paper investigates the performance of LLMs on Math Word Problems with extended narratives, introducing the concept of Context Length Generalizability (CoLeG). The authors created a new dataset, Extended Grade-School Math (E-GSM), by iteratively extending problems from GSM8K. They propose two novel metrics, CoLeG-E and CoLeG-R, to evaluate efficacy and robustness respectively. The study reveals that existing LLMs struggle with longer MWPs, showing a consistent performance decline as context length increases. To address this, the authors introduce Condition-Retrieving Instruction (CoRe) for proprietary LLMs and an extension-based fine-tuning approach for open-source LLMs. These methods demonstrate improvements in CoLeG across various LLM types and generalize well to other MWP benchmarks.
Strengths
-
The paper addresses a gap in current research by focusing on LLMs' ability to handle longer MWPs, which is more reflective of real-world mathematical reasoning tasks. The focus on CoLeG provides insights into the limitations of current LLMs and pathways for improvement.
-
The creation of the E-GSM dataset through a systematic extension process is another contribution. By maintaining problem difficulty while increasing context length, the authors have developed a framework for evaluating LLM performance on longer MWPs.
-
The introduction of CoLeG-E and CoLeG-R metrics offers a more comprehensive evaluation framework than traditional accuracy measures. These metrics provide insights into both the consistency and robustness of LLM performance across varying context lengths.
-
The proposed methods, CoRe and extension-based fine-tuning, show consistent improvements across different LLM types and generalize well to other benchmarks.
Weaknesses
-
The paper lacks a detailed exploration of why longer contexts impact LLM performance. While the authors mention potential working memory limitations, a deeper analysis could provide valuable insights. For instance, examining how performance correlates with the models' context window sizes or investigating the behavior of attention patterns in different layers could shed light on where breakdowns occur. Additionally, analyzing how different positional encoding schemes (e.g., rotary position embeddings vs. absolute position embeddings) affect performance on longer MWPs could offer insights into architectural considerations for improving CoLeG.
-
The E-GSM dataset creation process, while systematic, may introduce biases that aren't adequately addressed. Using GPT-4 for extensions could potentially lead to biases in language style, problem structure, or even subtle cues that GPT-4 uses for reasoning. For example, GPT-4 might consistently use certain phrases or sentence structures that inadvertently serve as hints for other GPT models. Additionally, there's a risk of amplifying any biases present in the original GSM8K dataset. The authors should consider analyzing the distribution of problem types, linguistic patterns, and solution strategies in E-GSM compared to the original dataset to identify any systematic biases introduced during extension.
-
The evaluation of open-source LLMs is limited to LLaMA-2 and Mistral-7B families. To provide a more comprehensive assessment, the authors should consider including models specifically designed for mathematical reasoning, such as MathGPT, GPT-f, or MetaMath. Additionally, evaluating performance on models with different architectural choices, like PaLM or BLOOM, could offer insights into how various model designs handle longer MWPs. This broader evaluation would strengthen the claims about the generalizability of the proposed methods.
-
While the paper shows improvements on other MWP benchmarks, it doesn't explore how the proposed methods perform on problems significantly longer than those in E-GSM. This leaves questions about the scalability of the approaches to even more complex, multi-page word problems. The authors could consider creating a small set of extremely long MWPs (e.g., 1000+ tokens) to test the limits of their methods and provide insights into scaling challenges.
-
The use of GPT-3.5-turbo for answer extraction in the evaluation process introduces a potential confounding factor. The paper doesn't adequately address how this might impact results, especially for non-OpenAI models. The authors should consider comparing this extraction method with simpler rule-based approaches or using model-specific output parsing to ensure fair comparison across different LLM families.
Questions
-
How does the performance degradation on longer MWPs correlate with specific architectural features of different LLMs, such as context window size or attention mechanisms?
-
The extension approach shows promise for open-source LLMs. Have you considered how this might be adapted for extremely long MWPs or multi-step reasoning problems that span multiple pages?
Details of Ethics Concerns
N/A
Dear Reviewer cYhv,
Thank you for taking the time to review our work! Your feedback is thoughtful and valuable. We have done our best and can address some of your questions as follows:
The paper lacks a detailed exploration of why longer contexts impact LLM performance. While the authors mention potential working memory limitations, a deeper analysis could provide valuable insights. For instance, examining how performance correlates with the models' context window sizes or investigating the behavior of attention patterns in different layers could shed light on where breakdowns occur. Additionally, analyzing how different positional encoding schemes (e.g., rotary position embeddings vs. absolute position embeddings) affect performance on longer MWPs could offer insights into architectural considerations for improving CoLeG.
Thank you for your insightful comments! We have in fact investigated why longer contexts impact LLM performance and included a fine-grained analysis in Section 4.3, where we use two different metrics to capture semantic understanding and missing steps in mathematical reasoning. We find that both are affected by longer contexts.
Thank you for pointing out alternative angles of analysis: attention patterns, positional encoding schemes, and context window sizes.
-
We believe analyzing attention patterns is interesting, but it is difficult: even the EMNLP best paper [1] only analyzes attention patterns for classification tasks, and the patterns and intrinsic attention mechanism remain open research questions. This is a thoughtful comment and points to interesting future work.
-
About the context windows: as context window sizes are fixed at the pretraining stage, we cannot afford such experiments. Additionally, if the input text is longer than the context window, the LLM cannot “see” the tokens that exceed the maximum input length, whereas our work focuses on the impact of context length when the LLM “sees” the entire problem.
-
About positional encoding schemes: these are likewise fixed at the pretraining stage. Moreover, our experiments already include LLMs with different positional encoding schemes.
The evaluation of open-source LLMs is limited to LLaMA-2 and Mistral-7B families. To provide a more comprehensive assessment, the authors should consider including models specifically designed for mathematical reasoning, such as MathGPT, GPT-f, or MetaMath. Additionally, evaluating performance on models with different architectural choices, like PaLM or BLOOM, could offer insights into how various model designs handle longer MWPs. This broader evaluation would strengthen the claims about the generalizability of the proposed methods.
Thank you for your suggestion! Section 4.4 already includes experiments with MetaMath. As suggested, we have added experiments on several specialized math LLMs in Appendix C.3 of our updated version.
The use of GPT-3.5-turbo for answer extraction in the evaluation process introduces a potential confounding factor. The paper doesn't adequately address how this might impact results, especially for non-OpenAI models. The authors should consider comparing this extraction method with simpler rule-based approaches or using model-specific output parsing to ensure fair comparison across different LLM families.
Thank you for your question! We already discuss this issue in Appendix B.4. Simple rule-based parsing is not accurate: for example, zero-shot-CoT parsing uses the last number in the output as the final answer when it fails to match the pattern “the answer is”. We randomly selected 50 cases and found that the last number was not the answer in 9 of them. Because general-purpose LLMs (unlike specialized math LLMs) have no consistent answer pattern, we use GPT-3.5-turbo to extract answers; using an LLM as the answer extractor is also common practice [2].
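For illustration, below is a minimal sketch of the kind of rule-based parser we compared against (the regex and the fallback rule are illustrative assumptions, not our exact code); the last-number fallback is the step that returned a wrong value in 9 of the 50 cases we inspected.

```python
import re

def rule_based_extract(output: str):
    """Illustrative rule-based parser (not our exact implementation)."""
    # First try the canonical zero-shot-CoT pattern "the answer is <number>".
    m = re.search(r"the answer is\s*\$?(-?[\d,]+(?:\.\d+)?)", output, re.IGNORECASE)
    if m:
        return float(m.group(1).replace(",", ""))
    # Otherwise fall back to the last number appearing in the output.
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", output)
    return float(numbers[-1].replace(",", "")) if numbers else None

# The fallback is brittle: the last number is often an intermediate quantity.
print(rule_based_extract("She needs 36 eggs in total, i.e. 3 full boxes of 12."))  # 12.0, not 36.0
```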
The extension approach shows promise for open-source LLMs. Have you considered how this might be adapted for extremely long MWPs or multi-step reasoning problems that span multiple pages?
Thank you for recognizing our approach. Current MWP benchmarks are not that long, and our SFT approach has shown strong potential for creating synthesized training data from short MWPs that is suitable for extremely long MWPs.
We hope our responses have addressed some of your concerns. We appreciate the opportunity to engage in discussion with a thoughtful reviewer like you. If you have any additional comments or would like to discuss further, please feel free to reach out to us.
Sincerely,
Authors
[1] Label words are anchors: An information flow perspective for understanding in-context learning. arXiv preprint arXiv:2305.14160.
[2] Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255.
Thanks for the clarifications that you have made. I appreciate it. While it does clear up some of the concerns, I believe the paper will significantly benefit from a thorough revision. For the time being, I have decided to keep my scores. Thanks.
We truly appreciate the time and effort you’ve dedicated to reviewing our work.
Based on your review, it appears that there are not many points requiring revision, except the one about "creating a small set of extremely long MWPs (e.g., 1000+ tokens)." We understand the value of investigating this scenario and considered it during the rebuttal. However, we intentionally did not include such an experiment, as we anticipated that Reviewer Dh39 would not find this point convincing either.
We believe our approach has already demonstrated strong potential for tackling extremely long MWPs, as evidenced by the results from the E-GSM evaluation and generalization results in other MWP benchmarks. As highlighted in the manuscript, current MWP benchmarks do not include examples with very long contexts. Therefore, instead of manually constructing such benchmarks, we chose to focus on showcasing how our SFT-based approach can effectively generate synthesized training data from short MWPs while being well-suited for extending to tasks with longer contexts.
May we kindly ask for additional suggestions to further strengthen our work? Your feedback in this regard would be incredibly valuable for us to improve our work.
Thank you again for your insightful suggestions and for helping us further improve our study. We remain open to additional feedback and ways we can further improve.
This paper investigates the ability of LLMs to solve math word problems (MWPs) with longer contexts, introducing the concept of Context Length Generalizability (CoLeG). The key contributions are: (1) Creating Extended Grade-School Math (E-GSM), a dataset of MWPs with extended narratives. (2) Proposing two metrics to evaluate LLMs' efficacy and resilience on E-GSM. (3) Developing tailored prompts for proprietary LLMs to improve CoLeG. (4) Using extension as an auxiliary fine-tuning task for open-source LLMs. (5) Analyzing the impact on semantic understanding vs reasoning efficacy.
Strengths
Strong motivation: rigorous statistical analysis shows that LLMs struggle with longer MWPs (Section 2.1)
Proposes creative solutions (CoRe prompting and extension fine-tuning) to address identified limitations
Well-designed metrics (CoLeG-E and CoLeG-R) that capture both efficacy and robustness of LLMs on long MWPs
Extensive experiments demonstrate the effectiveness of the proposed methods
Weaknesses
The paper focuses on LLMs tackling longer math word problems, rather than genuinely difficult ones. Addressing truly challenging problems would likely yield more impactful and valuable research insights.
A deeper analysis of the types of errors LLMs make on extended MWPs would strengthen the paper. This could shed light on whether mistakes stem from misinterpreting context, losing track of key information, or actual computational errors.
The authors don't explore whether breaking down problems into atomic facts could help solve extended MWPs. It would be worthwhile to compare their methods against a baseline that first extracts crucial information from the lengthy context before attempting a solution. The techniques discussed in https://arxiv.org/abs/2305.14251 could be relevant here.
The table captions should be placed above the tables, not below, to comply with ICLR's official template guidelines.
The "Experimental Setup" section doesn't belong under Methodology. It should be moved to the Experiments section, alongside the results analysis.
Questions
see weaknesses
Dear Reviewer jmci,
Thank you for taking the time to review our work! We address your questions as follows:
The paper focuses on LLMs tackling longer math word problems, rather than genuinely difficult ones. Addressing truly challenging problems would likely yield more impactful and valuable research insights.
Thank you for your concern! Our motivation stems from observing that even for grade school calculation problems, which are "not truly challenging", LLMs exhibit discrepancies in performance when faced with problems that have longer contexts (as discussed in Section 2.1). We find that the difficulty level of grade school MWPs is not a confounder of the impact of context length on math reasoning performance, as demonstrated in Section 2.1. We believe that the challenges posed by long contexts in MWPs represent a significant “challenging point” for current LLMs.
Our study is centered on the influence of context length on LLM performance. To ensure a clear focus, we have deliberately isolated the effect of difficulty by controlling the difficulty level when creating E-GSM. We believe that pinpointing the current limitations of LLMs in solving MWPs could offer valuable insights and have a significant impact on future research and development.
A deeper analysis of the types of errors LLMs make on extended MWPs would strengthen the paper. This could shed light on whether mistakes stem from misinterpreting context, losing track of key information, or actual computational errors.
Thank you for your suggestion! We have already included a deeper analysis of the factors underlying the performance decrease on E-GSM in Section 4.2 of the original manuscript (Section 4.3 in our updated version). As suggested, we have added an error analysis of 50 randomly chosen failure cases from the fourth round of E-GSM. We find that 46% (23/50) of the samples fail due to incorrect extraction of the known conditions, and the remainder fail due to flawed reasoning paths. We have included this analysis in Appendix B.2 of our updated manuscript.
The authors don't explore whether breaking down problems into atomic facts could help solve extended MWPs. It would be worthwhile to compare their methods against a baseline that first extracts crucial information from the lengthy context before attempting a solution. The techniques discussed in https://arxiv.org/abs/2305.14251 could be relevant here.
Thank you for pointing this out! We do not consider this a suitable baseline, for the following reasons:
-
When we break down sentences into atomic facts, the context is still long or even longer.
-
They are different tasks: [1] uses this technique for factual precision evaluation.
-
The idea might be in some sense similar to our CoRe.
We have included this discussion in Section 3.1 of our revised manuscript and cite this paper appropriately:
“[1] proposes a similar approach, suggesting that breaking down information into smaller components can enhance the evaluation of factual precision.”
The table captions should be placed above the tables, not below, to comply with ICLR's official template guidelines. The "Experimental Setup" section doesn't belong under Methodology. It should be moved to the Experiments section, alongside the results analysis.
Thank you for your comments! We have revised these accordingly in our updated manuscript.
We hope our response will address your concerns. If you have any further questions, feel free to discuss with us!
Sincerely,
Authors
[1] Min, S., Krishna, K., Lyu, X., Lewis, M., Yih, W. T., Koh, P. W., ... & Hajishirzi, H. (2023). Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. arXiv preprint arXiv:2305.14251.
In this paper, the authors investigated the performance of LLMs in solving long math problems. They first examined the performance discrepancy of ChatGPT 3.5 in solving two versions (i.e. long form v.s. short form) of the same questions and concluded that LLMs struggle to answer math word problems with longer context.
Then, they propose an automatic approach to extend GSM questions into longer versions (naming the resulting dataset E-GSM), preserving the same computational logic as far as possible. The paper then presents a method called CoRe to help proprietary LLMs better handle these long-form questions. For open-source LLMs, the authors fine-tune them on a dataset comprising 65K CoT examples created by the authors.
The paper introduced E-GSM, containing artificial long math problems, but in real cases there are seldom questions written in the way the authors present, i.e., very verbose questions describing a relatively simple math problem. Therefore, it is unknown whether the work here can help with solving real-world long math problems, where, although the question is quite long, it already describes the problem as succinctly as it can. Better solving those is our goal, rather than solving artificial verbose problems that are unlikely to exist in the real world. Although they share the same characteristic length-wise, the capability to solve the latter is not necessarily helpful for solving the former.
Strengths
The paper explored the impact of question length on LLMs’ performance and proposed a method to extend the length of GSM questions. The paper presented a method called CoRe to help proprietary LLMs better handle these long-form questions. For the open source LLMs, the authors fine-tuned them with a fine-tuning dataset comprising 65K CoT data, created by the authors.
Weaknesses
-
The paper explores artificial long math problems, but in real cases there are seldom questions written in the way the authors present, i.e., very verbose questions describing a relatively simple math problem. Therefore, it is unknown whether the work here can help with solving real-world long math problems, where, although the question is quite long, it already describes the problem as succinctly as it can. Better solving those is our ultimate goal, rather than solving artificial verbose problems that are unlikely to exist in the real world.
-
Besides the above major point, there are more points:
- In Section 2.1, the authors examined the performance discrepancy of ChatGPT 3.5 in solving two versions (i.e. long form v.s. short form) of the same questions and concluded that LLMs struggle to answer math word problems with longer context. However, ChatGPT 3.5 is a relatively weak model now, I would suggest the authors do the same analysis with stronger open-source and proprietary LLMs.
- Still in Section 2.1, the analysis here is based on real math questions, but the long questions in E-GSM are artificial. Therefore, it is not convincing to me that the conclusion in Section 2.1 can provide a solid foundation for the subsequent conduct.
-
Many parts are not clear, see the questions section.
-
The writing needs a thorough improvement:
- “Human evaluation details are provided in Appendix A.4.” has a wrong reference.
- In the first paragraph of Section 3, the subsections should be introduced in order.
- The second sentence of Section 3.1 has redundancy.
- The first two sentences of Section 3.2 are not about open-source LLMs, therefore, they cannot help develop this section. The third sentence is redundant. In the fourth sentence, “their generated reasoning paths” should be referred to the place that telling how it is done. The loss function has a typo, should be “ (q, e, a)”.
- Section 3.3, “To negate the influence of few-shot demonstrations”, should be specific, what is the influence?
- Repeated sentences in the third paragraph of Section 4.1.
Questions
-
According to “Evaluation results shows that 94.5% questions possess accepatable quality”, the total questions from rounds 1 to 4 should be about 5K. But in Table 1, it is only 4.5K.
-
As shown in Table 1, different rounds have different numbers of questions. What is the impact on the defined metrics, namely CoLeG-E and CoLeG-R?
-
In Table 2, were the fine-tuned models evaluated with the CoRe method? can they be tested in the same way as those proprietary models?
-
“Apart from 7,473 annotated examples available in GSM8K training set, we get D0 that incorporate 38,507 valid CoT data points …”, the numbers here confused me. If the authors generated five reasoning paths for each question in the training set, at most, D0 can have 7,473*5 questions, less than 38,507.
-
In Section C.2, “The results suggest scaling up model scales and SFT dataset can further improve CoLeG.”, this conclusion may not be valid. Under CoLeG-R, after the SFT on D0, D1, and D2, the performance is not improved.
Dear Reviewer Dh39,
Thank you for taking the time to review our work! We address your questions as follows:
The paper explores artificial long math problems, but in real cases there are seldom questions written in the way the authors present, i.e., very verbose questions describing a relatively simple math problem. Therefore, it is unknown whether the work here can help with solving real-world long math problems, where, although the question is quite long, it already describes the problem as succinctly as it can. Better solving those is our ultimate goal, rather than solving artificial verbose problems that are unlikely to exist in the real world.
Our research aims to investigate the effect of context length on math reasoning performance, specifically focusing on the inconsistencies observed when solving math word problems (MWPs) with longer contexts. We isolate the effect of intrinsic difficulty to ensure a clear understanding of how context length alone affects performance (as discussed in Section 2.1). By utilizing E-GSM and our devised metrics, we can effectively analyze how extending the context of the same problem impacts the performance of LLMs (refer to our analysis in Section 4.2).
You might have overlooked some crucial aspects of our work that demonstrate the efficacy of our methods in addressing real-world MWPs:
-
Existing MWP benchmarks are not long, as highlighted in the caption of Table 3, which indicates that these benchmarks have fewer than 100 tokens. We are not introducing E-GSM as a new benchmark for long MWPs; rather, we are using it as a test ground to study our research question.
-
The results in Table 3 show that our method also provides benefits for solving real-world MWPs.
-
The analysis in Section 4.4 demonstrates that our method yields better improvements for relatively long questions in the GSM8K dataset.
In Section 2.1, the authors examined the performance discrepancy of ChatGPT 3.5 in solving two versions (i.e. long form v.s. short form) of the same questions and concluded that LLMs struggle to answer math word problems with longer context. However, ChatGPT 3.5 is a relatively weak model now, I would suggest the authors do the same analysis with stronger open-source and proprietary LLMs.
The reason we chose GPT-3.5-turbo is that it is cheap and efficient. Additionally, at the time we conducted our experiments, GPT-4o had not been released. As suggested, we have added one strong model to the same analysis in Appendix F of our revised manuscript.
Still in Section 2.1, the analysis here is based on real math questions, but the long questions in E-GSM are artificial. Therefore, it is not convincing to me that the conclusion in Section 2.1 can provide a solid foundation for the subsequent conduct.
Section 2.1 serves as the motivation for our work, as it highlights our finding that in GSM8K, LLMs struggle to solve relatively long problems effectively. We have specifically isolated the effect of difficulty level, demonstrating that problem context length is associated with degraded performance, which is a significant limitation of current LLMs. Our experiments on E-GSM confirm that when the context of the same problem is lengthened, LLM performance declines, consistent with the findings in Section 2.1. Building on this foundation, we propose different methods for both closed-source and open-source LLMs, demonstrating that our approaches are beneficial not only for E-GSM but also for some real-world MWPs. This underscores why Section 2.1 is a crucial foundation for our research.
The writing needs a thorough improvement: “Human evaluation details are provided in Appendix A.4.” has a wrong reference. In the first paragraph of Section 3, the subsections should be introduced in order. The second sentence of Section 3.1 has redundancy. The first two sentences of Section 3.2 are not about open-source LLMs, therefore, they cannot help develop this section. The third sentence is redundant. In the fourth sentence, “their generated reasoning paths” should be referred to the place that telling how it is done. The loss function has a typo, should be “ (q, e, a)”. Section 3.3, “To negate the influence of few-shot demonstrations”, should be specific, what is the influence? Repeated sentences in the third paragraph of Section 4.1.
Thank you for pointing this out! We have revised these in our updated manuscript.
According to “Evaluation results shows that 94.5% questions possess accepatable quality”, the total questions from rounds 1 to 4 should be about 5K. But in Table 1, it is only 4.5K.
As explained in Lines 173–176, we employ two heuristics to filter out "bad" extended questions. The specifics of these heuristics can be found in Appendix A.3, while the filtering process is detailed in Appendix A.4.
The core idea behind our approach is to use entailment and solvability as metrics to filter out a substantial portion of questions, ensuring that all "bad" questions identified during our human evaluation are eliminated. This screening process explains why the number of questions presented in Table 1 diminishes with each successive round.
As shown in Table 1, different rounds have different numbers of questions. What is the impact on the defined metrics, namely CoLeG-E and CoLeG-R?
Please refer to Section 2.3 for how our metrics are calculated. They are well-defined for the different numbers of questions in each round and are fair across different LLMs. Specifically, CoLeG-E is defined over the questions in the fourth round, and CoLeG-R is determined by the accuracies in round 0 and round 4. Since each round contains more than 1,000 questions, the per-round accuracies are well-defined and statistically reliable.
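To make the robustness to per-round question counts concrete, below is a minimal sketch of the CoLeG-R computation, assuming a ratio form (round-4 accuracy expressed relative to round-0 accuracy), which is one reading of the description above; the formal definitions are in Section 2.3, and this sketch is illustrative rather than our exact code.

```python
# Minimal sketch (not the paper's code), assuming CoLeG-R is round-4 accuracy
# expressed as a percentage of round-0 accuracy. Each round is normalized by
# its own question count, so rounds with different numbers of questions do not
# bias the metric.

def accuracy(results):
    return 100.0 * sum(results) / len(results)

def coleg_r(round0_results, round4_results):
    return 100.0 * accuracy(round4_results) / accuracy(round0_results)

# Toy usage with different question counts per round.
r0 = [True] * 600 + [False] * 400   # 1,000 round-0 questions, 60% correct
r4 = [True] * 420 + [False] * 630   # 1,050 round-4 questions, 40% correct
print(round(coleg_r(r0, r4), 2))    # 66.67
```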
In Table 2, were the fine-tuned models evaluated with the CoRe method? can they be tested in the same way as those proprietary models?
No, they are not evaluated using the CoRe method. As detailed in Appendix B.5, we use the prompt specified in Table 8 for evaluation. They cannot be tested in the same manner as proprietary models because the evaluation prompt needs to be aligned with the training prompt, which is a common practice in the field [1, 2].
“Apart from 7,473 annotated examples available in GSM8K training set, we get D0 that incorporate 38,507 valid CoT data points …”, the numbers here confused me. If the authors generated five reasoning paths for each question in the training set, at most, D0 can have 7,473*5 questions, less than 38,507.
As shown in Section 3.2, we filter out examples whose answers do not align with the ground truth. This process is referred to as RFT [3] and is widely adopted in the field [2, 3, 4].
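For clarity, below is a rough sketch of this rejection-style filtering (not our exact pipeline); `sample_cot` and `final_answer` are hypothetical placeholders for the sampling call and the answer parser. Each question keeps its annotated solution plus only the sampled reasoning paths whose final answers match the ground truth.

```python
# Rough sketch of RFT-style data construction (not our exact pipeline).
# `sample_cot` and `final_answer` are hypothetical placeholders supplied by the
# caller: one draws a chain-of-thought solution, the other parses its answer.

def build_sft_data(train_set, sample_cot, final_answer, k=5):
    data = []
    for question, gold_cot, gold_answer in train_set:
        data.append((question, gold_cot, gold_answer))    # keep the annotated example
        for _ in range(k):
            cot = sample_cot(question)
            if final_answer(cot) == gold_answer:          # reject wrong-answer paths
                data.append((question, cot, gold_answer))
    return data

# With 7,473 questions and k = 5, at most 7,473 * 6 = 44,838 examples remain;
# filtering wrong answers brings the total down to the 38,507 we report.
```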
In Section C.2, “The results suggest scaling up model scales and SFT dataset can further improve CoLeG.”, this conclusion may not be valid. Under CoLeG-R, after the SFT on D0, D1, and D2, the performance is not improved.
CoLeG-R represents just one aspect of our evaluation. Both CoLeG-E and accuracy across all rounds have shown improvement.
We hope our response will address your concerns. If you have any further questions, feel free to discuss with us!
Sincerely,
Authors
[1] Luo, H., Sun, Q., Xu, C., Zhao, P., Lou, J., Tao, C., ... & Zhang, D. (2023). Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583.
[2] Yu, L., Jiang, W., Shi, H., Yu, J., Liu, Z., Zhang, Y., ... & Liu, W. (2023). Metamath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284.
[3] Yuan, Z., Yuan, H., Li, C., Dong, G., Lu, K., Tan, C., ... & Zhou, J. (2023). Scaling relationship on learning mathematical reasoning with large language models. arXiv preprint arXiv:2308.01825.
[4] Tong, Y., Zhang, X., Wang, R., Wu, R., & He, J. (2024). Dart-math: Difficulty-aware rejection tuning for mathematical problem-solving. arXiv preprint arXiv:2407.13690.
Thanks for your response, but unfortunately, the authors did not directly answer or solve most of my questions and concerns.
For example:
- the gap between the generated verbose questions and the real-world long questions;
- why GPT-4 could not be used for the experiments in this submission, given that the ICLR 2025 deadline (Oct 1, 2024) is far after the release of GPT-4;
- ......
I would suggest the authors respond again ASAP.
“Apart from 7,473 annotated examples available in GSM8K training set, we get D0 that incorporate 38,507 valid CoT data points …”, the numbers here confused me. If the authors generated five reasoning paths for each question in the training set, at most, D0 can have 7,473*5 questions, less than 38,507.
As shown in Section 3.2, we filter out examples whose answers do not align with the ground truth. This process is referred to as RFT [3] and is widely adopted in the field [2, 3, 4].
After the filtering, should it be less than 7,473*5?
Thank you for your reply! Let us address this one first.
After the filtering, should it be less than 7,473*5?
As we mentioned in L288-L289: "Apart from 7,473 annotated examples available in GSM8K training set". We also incorporated the original training set, so the total number before filtering is 7473*6.
But 7,473 is already the size of the GSM8K training set, right? What else is "the original training set"?
Thank you for your response! We further explain as follows:
Q: About the GPT4 experiments.
A: As we have already explained, "The reason we chose GPT-3.5-turbo is that it is cheap and efficient." As you suggested, we have added the analysis of GPT-4o in Appendix F of our revised manuscript.
Q: About the gap between the generated verbose questions and the real-world long problems.
A: You might have some misunderstandings! Our E-GSM serves as a test bed of our research focus, and we have returned to the real-world problems in Table 3 and Section 4.4. Our research aims to investigate the effect of context length on math reasoning performance, specifically focusing on the inconsistencies observed when solving math word problems (MWPs) with longer contexts. We isolate the effect of intrinsic difficulty to ensure a clear understanding of how context length alone affects performance (as discussed in Section 2.1). By utilizing E-GSM and our devised metrics, we can effectively analyze how extending the context of the same problem impacts the performance of LLMs (refer to our analysis in Section 4.2).
You might have overlooked some crucial aspects of our work that demonstrate the efficacy of our methods in addressing real-world MWPs:
-
Existing MWP benchmarks are not long, as highlighted in the caption of Table 3, which indicates that these benchmarks have fewer than 100 tokens. We are not introducing E-GSM as a new benchmark for long MWPs; rather, we are using it as a test ground to study our research question.
-
The results in Table 3 show that our method also provides benefits for solving real-world MWPs.
-
The analysis in Section 4.4 demonstrates that our method yields better improvements for relatively long questions in the GSM8K dataset.
If you have any questions, please feel free to reach out to us. Thank you!
In your first response, it was mentioned that "we have added one strong model to do the same analysis in Appendix F in our revised manuscript.", which was GPT-4o-mini. I expected GPT-4o to be used, as we normally think GPT-4o has stronger reasoning capability than GPT-4o-mini. Moreover, I would like to see how GPT o1 performs here. In the new experiment with GPT-4o-mini in Figure 9, compared with Figure 1, the length gap between the False and True groups becomes much smaller.
In Table 3, the average number of tokens of MAWPS, SVAMP, and GSM-IC are 52, 54, 80, respectively. However, in your E-GSM, the average length of Q_1 questions is about 192, and more than 300 from Q_2. I do not think there is a strong rationale to believe it solved my original concerns.
Thank you for your engagement in further discussion!
I expected GPT-4o to be used, as we normally think GPT-4o has stronger reasoning capability than GPT-4o-mini. Moreover, I would like to see how GPT o1 performs here.
Sorry about the confusion. The reasons we used GPT-4o-mini are: (1) it already achieves over 93% accuracy on GSM8K, which is strong; (2) it is cheaper than GPT-4o and in the same model series.
As you suggested, we are now adding GPT-4o and o1. We will let you know when it is done.
In the new experiment with GPT-4o-mini in Figure 9, compared with Figure 1, the length gap between the False and True groups becomes much smaller.
Even for a strong model like GPT-4o-mini (over 93% accuracy on GSM8K), there is still a statistically significant gap between the False and True groups, which indeed reflects a limitation in math reasoning. Interestingly, this phenomenon aligns well with our results in Table 2 and the analysis in Section 4.2: stronger LLMs tend to perform better on E-GSM under our metrics, which also supports the soundness of E-GSM's design principles. We believe the usefulness of E-GSM lies in enlarging performance gaps among different LLMs that may be hidden in the original GSM8K and in ranking LLMs from this perspective.
In Table 3, the average number of tokens of MAWPS, SVAMP, and GSM-IC are 52, 54, 80, respectively. However, in your E-GSM, the average length of Q_1 questions is about 192, and more than 300 from Q_2. I do not think there is a strong rationale to believe it solved my original concerns.
The main logic of our work is as follows: we identify a limitation in Section 2.1, build a test bed to investigate it (E-GSM), propose our methods and show their efficacy on E-GSM, and then return to real-world problems (Table 3 and Section 4.2). Additional evidence is in the second paragraph (Lines 453-463) of Section 4.2 and Figure 5 (right), where our method improves performance on the relatively longer real-world problems in GSM8K (83-203 tokens). Another reason is that current MWP benchmarks are rather short, which is why we developed a new test bed to investigate our research problem.
Our manuscript has been updated to incorporate GPT-4o and o1 in Appendix F.
The results show that context length remains a problem even for these strong models.
Thanks for the update.
My major concern remains, i.e., the discrepancy in characteristics between your artificial long questions and real, natural long questions, and the difference in question length between the real testbeds and your E-GSM. These are extremely important because you are studying "Can LLMs Solve Long Math Word Problems Better?"; I expect the work to involve real and natural long math questions.
Thank you for your reply.
We still believe your major concern is not an issue for our work, for the following reasons:
-
Our title is "Can LLMs Solve Longer Math Word Problems Better" (in the PDF), which highlights our research focus: investigating the effect of longer contexts on solving MWPs. E-GSM serves as a good testbed because it isolates the effect of difficulty level and checks the performance discrepancy of the same problems with increasingly longer contexts.
-
About the discrepancy in characteristics: the main results show a performance drop of LLMs on E-GSM, which aligns well with the real-world GSM8K results (Section 2.1, which also serves as our motivation). Our methods improve performance on E-GSM and also show superior results on many real-world MWP benchmarks. We believe these two points demonstrate the reliability and reasonableness of using E-GSM.
-
About the question length difference: the second paragraph (Lines 453-463) of Section 4.2 and Figure 5 (right) have already shown that our method can improve performance on the relatively longer real-world problems in GSM8K (83-203 tokens). Additionally, current MWP benchmarks do not include examples with very long contexts. Introducing such a benchmark could deserve a paper in a "dataset and benchmark" track. That is why we resort to transforming existing benchmarks (GSM8K) to get a new testbed. We are studying a limitation of current LLMs, not releasing a benchmark. We believe our work is a good first step in this direction. It is also unreasonable to require results on benchmarks with exactly the same token counts as E-GSM, as existing MWP benchmarks are not that long. Additionally, we have shown that the main improvement of our approach comes from improving the performance on relatively longer MWPs, which also aligns well with the word "longer" in our title.
A quick question first. If "It is also unreasonable to require results on benchmarks with exactly the same token counts as E-GSM, as existing MWP benchmarks are not that long.", what is the point of making such long and verbose questions in E-GSM?
Yes, 7,473 is already the size of the GSM8K training set; "the original training set" refers to the GSM8K training set. We then generate 7,473 * 5 reasoning paths and filter out those with wrong answers, so the total number of examples after filtering should be less than 7,473 + 7,473 * 5 (the filtering applies to the generated part). By the way, 38,507 > 37,365 = 7,473 * 5.
To elaborate: 38,507 - 7,473 (the GSM8K training set) = 31,034, which is the number of newly generated samples after filtering. We filtered out 7,473 * 5 - 31,034 = 6,331 examples whose answers were wrong.
We have updated this sentence in our revised manuscript to make it more accurate (L288-289 in our new version). Sorry for the confusion!
Thank you for your question!
We have already mentioned the main focus of E-GSM:
-
"Our research aims to investigate the effect of context length on math reasoning performance, specifically focusing on the inconsistencies observed when solving math word problems (MWPs) with longer contexts. We isolate the effect of intrinsic difficulty to ensure a clear understanding of how context length alone affects performance (as discussed in Section 2.1)."
-
"Introducing such an benchmark could deserve a paper in "dataset and benchmark" track. That is why we resort to transforming existing benchmarks (GSM8K) to get a new testbed. We are studying a limitation of current LLMs, not releasing a benchmark."
Additional reasons:
-
As existing MWP benchmarks are not that long, how else could we test performance on long MWPs? That is why we resort to transforming existing benchmarks (GSM8K) to get a new testbed. We are studying a limitation of current LLMs, not releasing a benchmark.
-
By that reasoning, if a benchmark does not exist, is there no need to study the area? In fact, there are many endeavors [1, 2] that adapt existing benchmarks to study specific research questions. No real math problems occur in the way they do in [1, 2], yet such studies are still worthwhile because we expect LLMs to become stronger and stronger and to handle even unrealistic cases. Another such artificial case is [3]. One reason to conduct these studies is to inspect LLMs' abilities from different angles, and our research falls into this category.
[1] https://huggingface.co/datasets/reasoning-machines/gsm-hard
[2] Large Language Models Can Be Easily Distracted by Irrelevant Context. ICML 2023. https://arxiv.org/abs/2302.00093
GSM-hard: "We construct this dataset by replacing the numbers in the questions of GSM8K with larger numbers that are less common."
So it consists mostly of real math questions, and the question descriptions are natural.
Scaling Relationship on Learning Mathematical Reasoning with Large Language Models
Not a peer-reviewed paper. Moreover, I cannot see its resemblance to your work.
Needle in the Haystack for Memory Based Large Language Models
Not a peer-reviewed paper.
By that reasoning, we should also expect to find approximately the same numbers in real questions, and there is a discrepancy between GSM-hard and real-world scenarios, as no human will encounter such problems in their life.
The second reference is [2] Large Language Models Can Be Easily Distracted by Irrelevant Context. ICML 2023
Needle in the Haystack for Memory Based Large Language Models not a peer-reviewed paper.
This test has been run on GPT-4, Claude, and many other well-known LLMs [1, 2], which demonstrates its usefulness. It is also an artificial case; the point is to test LLMs' capabilities from various facets.
[1] Liu, A., Feng, B., Wang, B., Wang, B., Liu, B., Zhao, C., ... & Xu, Z. (2024). Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434.
[2] Yang, A., Yang, B., Hui, B., Zheng, B., Yu, B., Zhou, C., ... & Fan, Z. (2024). Qwen2 technical report. arXiv preprint arXiv:2407.10671.
The original second reference was "Scaling Relationship on Learning Mathematical Reasoning with Large Language Models". I was not notified of your edit changing this reference to "Large Language Models Can Be Easily Distracted by Irrelevant Context."
Additional reasons:
-
As existing MWP benchmarks are not that long, how else could we test performance on long MWPs? That is why we resort to transforming existing benchmarks (GSM8K) to get a new testbed. We are studying a limitation of current LLMs, not releasing a benchmark.
-
By that reasoning, if a benchmark does not exist, is there no need to study the area? In fact, there are many endeavors [1, 2] that adapt existing benchmarks to study specific research questions. No real math problems occur in the way they do in [1, 2], yet such studies are still worthwhile because we expect LLMs to become stronger and stronger and to handle even unrealistic cases. Another such artificial case is [3]. One reason to conduct these studies is to inspect LLMs' abilities from different angles, and our research falls into this category.
Did you receive a notification this time?
[1] https://huggingface.co/datasets/reasoning-machines/gsm-hard
[2] Large Language Models Can Be Easily Distracted by Irrelevant Context. ICML 2023. https://arxiv.org/abs/2302.00093
By the way, the edit time was before your response time. If you want to complain that there was no editing notification, why not email the PCs about this issue?
This work examines the effect of extended contexts on mathematical reasoning and introduces the Extended Grade-School Math (E-GSM) dataset, featuring math problems with lengthy narratives. Analysis reveals that current LLMs struggle with E-GSM, prompting the authors to propose new methods to address these challenges.
For proprietary LLMs, they introduce a new instructional prompt, while for open-source LLMs, they develop a novel auxiliary fine-tuning task. These approaches aim to enhance model performance in handling extended-context MWPs.
Strengths
-
This paper introduces E-GSM, a dataset with lengthy, distracting sentences that make it considerably more challenging than the original GSM. This dataset offers a valuable tool for evaluating the robustness of LLMs.
-
The approach used to create E-GSM can also be applied to expand existing math training datasets, providing new supervised fine-tuning (SFT) data in the math domain.
Weaknesses
-
The augmented math questions may include contradicting sentences. The augmented math questions may become unsolvable or yield answers that differ from the original ones. Although human evaluations on 200 samples suggest that “94.5% of questions meet acceptable quality,” this accuracy may still be inadequate, particularly given that the labels in the GSM8K test set might contain errors. An alternative could be to release these 200 samples as a verified subset of the E-GSM dataset. Reporting CoLeG-E and CoLeG-R results on the 200 samples, both with and without verification, would also be helpful.
-
In Table 2, the higher results w/ D (compared to w/ D0) may be because the size of D is larger than that of D0.
Questions
- How is E-GSM different from GSM-IC[1]?
[1] Large Language Models Can Be Easily Distracted by Irrelevant Context. ICML 2023. https://arxiv.org/abs/2302.00093
Dear Reviewer eQsN,
Thank you for taking the time to review our work! We address your questions as follows:
The augmented math questions may include contradicting sentences. The augmented math questions may become unsolvable or yield answers that differ from the original ones. Although human evaluations on 200 samples suggest that “94.5% of questions meet acceptable quality,” this accuracy may still be inadequate, particularly given that the labels in the GSM8K test set might contain errors. An alternative could be to release these 200 samples as a verified subset of the E-GSM dataset. Reporting CoLeG-E and CoLeG-R results on the 200 samples, both with and without verification, would also be helpful.
Thank you for your question. The human evaluation criteria are detailed in Appendix A.2. Specifically, any question that includes contradictory sentences or yields a different answer from the original problem is classified as "poor" quality. As explained in Lines 173–176, we employ two heuristics to filter out "bad" extended questions. The specifics of these heuristics can be found in Appendix A.3, while the filtering process is detailed in Appendix A.4. The core idea behind our approach is to use entailment and solvability as metrics to filter out a substantial portion of questions, ensuring that all "bad" questions identified during our human evaluation are eliminated. This screening process explains why the number of questions presented in Table 1 diminishes with each successive round.
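As a rough sketch of this screening loop (the actual criteria and prompts are in Appendices A.3 and A.4; `judge_entailment` and `judge_solvability` are hypothetical placeholders for the two heuristics), an extended question is kept only if it passes both checks:

```python
# Rough sketch of the per-round screening described above (not our exact code).
# `judge_entailment` and `judge_solvability` are hypothetical placeholders for
# the two heuristics detailed in Appendices A.3-A.4.

def screen_round(pairs, judge_entailment, judge_solvability):
    kept = []
    for original, extended in pairs:
        # Keep an extension only if it still entails the original problem's
        # conditions and remains solvable; everything else is discarded, which
        # is why the question counts in Table 1 shrink round by round.
        if judge_entailment(original, extended) and judge_solvability(extended):
            kept.append((original, extended))
    return kept
```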
In Table 2, the higher results w/ D (compared to w/ D0) may be because the size of D is larger than that of D0.
Thank you for pointing this out. We expand D0 to the same size as D by further RFT [1]. The results for Llama-2-7B are given as follows:
| Method | CoLeG-E | CoLeG-R | Acc (R0) | Acc (R1) | Acc (R2) | Acc (R3) | Acc (R4) |
|---|---|---|---|---|---|---|---|
| w/ D0 | 20.22 | 66.64 | 58.45 | 49.62 | 42.96 | 40.94 | 38.95 |
| w/ D0 (expanded) | 20.34 | 66.28 | 58.99 | 50.06 | 43.35 | 41.25 | 39.10 |
| w/ D | 28.09 | 80.97 | 59.44 | 57.57 | 50.92 | 49.44 | 48.13 |
We can see that there is not much improvement, so our claim still holds. We hypothesize that this is because the set of unique questions remains unchanged; simply applying more RFT yields similar solutions, resulting in minimal improvement from SFT. Moreover, adding more short questions does not substantially enhance performance on E-GSM.
How is E-GSM different from GSM-IC?
Thank you for your question! E-GSM differs from GSM-IC in the following ways:
-
E-GSM is more challenging than GSM-IC (though not in terms of the difficulty level of the problems). GSM-IC uses a template-based method to insert one irrelevant sentence into GSM8K problems, which initially reduced the performance of earlier LLMs like text-davinci-003. However, as LLMs have become more sophisticated, GSM-IC no longer poses a significant challenge: the current version of GPT-3.5-turbo achieves 88.35% accuracy on GSM-IC with 0-CoT (as shown in Table 3). In contrast, our E-GSM extends the context of GSM8K problems to create longer scenarios, which are inherently more challenging; the accuracy of GPT-3.5-turbo on the fourth round of E-GSM is only 64.42% with 0-CoT.
-
Different research focus. GSM-IC explores the impact of introducing a single irrelevant sentence on the mathematical reasoning capabilities of LLMs. In contrast, our research with E-GSM is intended to examine the inconsistency of LLMs when solving extended math problems of the same difficulty level, as motivated by our discussion in Section 2.1.
We hope our response will address your concerns. If you have any further questions, feel free to discuss with us!
Sincerely,
Authors
[1] Yuan, Z., Yuan, H., Li, C., Dong, G., Lu, K., Tan, C., ... & Zhou, J. (2023). Scaling relationship on learning mathematical reasoning with large language models. arXiv preprint arXiv:2308.01825.
This paper presents E-GSM, a collection of math word problems that feature lengthy narratives and then propose two novel metrics to evaluate whether current LLMs can handle these problems. They evaluate several proprietary LLMs and some open source LLMs to see how they perform on this collection. They also fine tune the open source models to perform better on these tasks.
Strengths:
- Contribution of a new dataset.
- Analysis of various LLMs on longer MWPs.
- Significant number of experiments.
- New metrics: CoLeG-E and CoLeG-R.
Weaknesses:
- A deeper analysis of why LLMs have issues with such longer MWPs is needed.
- Answer extraction using GPT3.5 (there has been discussion about this).
- There are some writing improvement suggestions.
Additional Comments on Reviewer Discussion
There has been a lot of discussion between the authors and the reviewers for this paper. I decided to ignore the review of Dh39 since I did not find the points very pertinent to a fair evaluation of the paper.
Accept (Poster)