Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models
We identify the documents from a subset of pretraining data that are influential for downstream reasoning and factual questions for two LLMs of different sizes, and find evidence for a generalisable strategy relying on procedural knowledge.
Abstract
Reviews and Discussion
This paper explores the influence of specific pretraining data on the reasoning abilities of large language models (LLMs), focusing on how models rely on different types of documents when responding to reasoning versus factual queries. The paper applies influence functions to identify pretraining documents that impact performance on simple reasoning tasks. Results show that factual questions often depend on a smaller set of documents containing the answer, whereas reasoning questions are more influenced by documents with procedural knowledge.
Strengths
- The proposed method is straightforward, well-explained, and includes sufficient detail, making it easily reproducible.
- This research addresses a crucial question: identifying what training data impacts LLM reasoning abilities, an area closely tied to model generalization and interpretability. It contributes to our understanding of LLMs.
- The paper presents intriguing findings, highlighting distinctions in how LLMs handle factual versus reasoning tasks. For instance, factual questions frequently retrieve specific information, while reasoning tasks benefit from procedural knowledge.
Weaknesses
- The experimental setup is limited, potentially compromising the reliability of the conclusions. Specifically: (1) only 80 queries were used for analysis, (2) the study included only three types of reasoning tasks, which may not be representative of other reasoning tasks, (3) there was no exploration of how different prompt formulations of the same query affect results, and (4) keyword-based methods for determining whether documents contain answers may be insufficiently accurate.
- The analysis may lack granularity, as it considers only each document’s influence on the overall completion without examining its impact on individual reasoning steps. This might affect the conclusions.
- While Appendix A.1 reports that influence scores are higher for certain documents, their similarity to random selections raises questions about whether influence functions reliably indicate actual influence.
Questions
- Why were these two specific LLMs chosen, instead of more widely used and capable models?
- Using both fine-tuned and base models in the same experiment could lead to unreliable results due to differences in parameter initialization, potentially affecting influence calculations.
- Since LLMs rely on embedded representations, even if keyword matching fails to find an answer, does it conclusively mean the document is not similar to the answer?
- Could examples of retrieved documents for reasoning tasks be provided to offer insights into how they influence the model's approach to reasoning?
We thank the reviewer for their review, which says our “method is straightforward, well-explained, and includes sufficient detail, making it easily reproducible”, that it “addresses a crucial question”, that the paper “presents intriguing findings”, and that it “contributes to our understanding of LLMs”. We respond to each weakness mentioned below separately, and answer all questions.
Weakness 1 - (1) and (2): “The experimental setup is limited, potentially compromising the reliability of conclusions.”
We would like to kindly refer the reviewer to the general comment above for a response to this point, as multiple reviewers have raised this. To summarise here, although we agree the setup in terms of tasks and models is limited (which is due to a hard compute constraint), we respectfully disagree that this will compromise the reliability of the conclusions (explained in the general comment). As reviewer 6knH also points out, despite the narrow scope of the experiments, we are careful to qualify any claims to make sure they are well supported. Further, in the general comment we highlight that the scope is very large w.r.t. most research.
Weakness 1 - (3): “there was no exploration of how different prompt formulations of the same query affect results”
This is a valuable suggestion and aligns closely with considerations we have thought about (e.g. how the rankings change for the same reasoning question with different zero-shot prompts). However, we believe it falls outside the scope of the current work, as it would not change the conclusions. To illustrate why we believe this: we might find different results for different prompt formulations (e.g. a retrieval-like strategy for reasoning). This would fit with prior work on the dependence of models on prompt formulation, but would still mean models can in principle learn a generalisable strategy for reasoning with the right prompt. Alternatively, we might not find different results with different prompt formulations, which would be interesting as well. To highlight a snippet from the submission related to this: “[...] we do not claim to say [...] that LLM reasoning is not brittle. All we showed is that in principle it seems to be possible for LLMs to produce reasoning traces using a generalisation strategy that combines information from many abstractly related documents, as opposed to doing a form of retrieval”
Weakness 1 - (4) and question (3): “keyword-based methods for determining whether documents contain answers may be insufficiently accurate.”
These are good points, and we agree with the reviewer. However, we would like to point out that we also use methods independent of keyword overlap. We both manually look over keyword hits, and give all query-doc pairs of the top 500 documents for each query to Command R+ (a 100B model) to identify documents with the answer independently of keyword overlap. We confirmed that this method found all documents we found manually and more that eluded the keyword search. We made this clearer in the revision (we use the colour purple to highlight revisions in response to your review, and this particular revision can be found in Section 5.2, Finding 3, L406-407).
Weakness 2: “The analysis may lack granularity, as it considers only each document’s influence on the overall completion without examining its impact on individual reasoning steps. This might affect the conclusions.”
Calculating influence on the individual reasoning steps is an intriguing suggestion, but it is unclear to us how this would change the conclusions in the paper. The influence scores for the full completion reflect influence on all reasoning steps (which are highly correlated because they are generated linearly taking into account all previous context), and given the correlation observed for influence scores of queries of the same reasoning type, we expect the rankings for the individual reasoning steps to be very similar to the ones we find now. Although this is an interesting suggestion for a more fine-grained analysis, we would be grateful if the reviewer could further clarify how its results could affect our conclusions.
Weakness 3: “While Appendix A.1 reports that influence scores are higher for certain documents, their similarity to random selections raises questions about whether influence functions reliably indicate actual influence.”
Thanks for raising this point, as it allows us to clarify an important nuance in the interpretation of influence functions that was not clear enough in the submission. That influence functions reliably estimate actual influence is a well-documented area of research (e.g. section 5.1 in [1] for architectures similar to the ones we look at). The claims in our paper do not additionally rely on influence functions empirically estimating a causal effect on accuracy, but such an effect helps interpret the results and was previously unknown. To contextualise Appendix A.1: it was a priori unclear that these experiments were sensible, because influence functions estimate the effect of removing a single document. Because accuracy is a discrete metric, it is unclear how many documents one needs to remove from the training set to shift the model parameters just enough to flip the accuracy. We need to remove multiple documents at once, but that might have unexpected interaction effects that influence functions do not account for. Therefore, any empirical experiment to test this is going to be a crude measure, because random removal as a baseline will also affect accuracy. Considering all this, it is an important encouraging signal that accuracy is still significantly more impacted by removing documents selected with influence functions. Conversely, had we found the same effect on accuracy as randomly removing documents, we could not have concluded that influence functions fail to estimate influence on accuracy, for the reasons above. We tried to make the motivation of A.1 clearer in the revision, on L200-202 (colour-coded purple). We also rewrote part of A.1 to make the nuance clearer (L797-799 and L948-956 in the Appendix).
Question 1: “Why were these two specific LLMs chosen, instead of more widely used and capable models?”
For our experiments, we need access to the pretraining distribution. None of the widely used models publish their pretraining data, and further many openly available models that do publish the pretraining distribution (such as Pythia), are not able to generate zero-shot reasoning traces for mathematical tasks such as the ones we investigate.
Question 2: “Using both fine-tuned and base models in the same experiment could lead to unreliable results due to differences in parameter initialization, potentially affecting influence calculations.”
Using different models for calculating the influence scores is a method called SOURCE [1], and effectively we are assuming that the second-order information of the fine-tuning stage is the identity (see the schematic sketch after the reference list below). This means we are ignoring the second-order impact of the fine-tuning stage on the completions. We argue that this is unlikely to impact conclusions, because prior work has shown that SFT serves primarily to enhance existing model capabilities as opposed to endowing them with new ones [2], [3], [4]. Further, the fine-tuning stage consisted of a couple thousand supervised instruction-tuning steps on top of the base model we use, which is negligible compared to the pretraining stage. Nonetheless, we believe an interesting direction for future work would be to apply the same method used here to the fine-tuning stage. We hypothesise that this might surface documents that are similar in formatting to the queries, as opposed to documents that are similar in content. We dedicated a few lines to this question in the new discussion in the revision (L513-518, colour-coded orange), and copy here: “Another limitation is that we do not look at the supervised fine-tuning stage. The reason we only look at the pretraining data is because the fine-tuning stage is targeted at making the models more aligned and ‘instructable’, as opposed to teaching the model any new capabilities. Prior work has shown that SFT serves primarily to enhance existing model capabilities (Jain et al., 2024; Kotha et al., 2024; Prakash et al., 2024). Nonetheless, an interesting direction for future work is applying the same method used here to the fine-tuning data.”
[1] Training Data Attribution via Approximate Unrolled Differentiation; Bae et al., 2024.
[2] Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks; Jain et al., 2024.
[3] Understanding catastrophic forgetting in language models via implicit inference; Kotha et al., 2024.
[4] Fine-tuning enhances existing mechanisms: A case study on entity tracking; Prakash et al., 2024.
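For readers who prefer a formula, the scoring rule under this assumption can be sketched as follows (the notation is ours and purely schematic, not the exact estimator from the paper; $z_q = (p_q, c_q)$ is a query prompt/completion pair and $z_d$ a pretraining document):

$$\mathcal{I}(z_q, z_d) \;\approx\; \nabla_\theta \log p(c_q \mid p_q;\, \theta_{\text{sft}})^{\top} \left(\hat{H}_{\text{EK-FAC}}(\theta_{\text{base}}) + \lambda I\right)^{-1} \nabla_\theta \mathcal{L}(z_d;\, \theta_{\text{base}})$$

The query gradient is taken at the fine-tuned parameters $\theta_{\text{sft}}$, while the document gradient and the damped EK-FAC curvature estimate are taken at the base parameters $\theta_{\text{base}}$; treating the fine-tuning-stage curvature as the identity is what allows the two parameter sets to be combined in a single product.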
Question 4: “Could examples of retrieved documents for reasoning tasks be provided to offer insights into how they influence the model's approach to reasoning?”
Yes! We are working on releasing the top and bottom 20 documents for each query, all documents with answers to questions, and all documents with the procedures to calculate the slope as mentioned in one of the qualitative findings. Together, this will cover thousands of document examples, and until we are ready to release all of them (which requires internal approval), we already uploaded about 80 documents with answers to questions and slope procedures to the supplement to show that we are working on it. We hope to upload all by the end of the discussion period.
We thank the reviewer again for their time and their review. We are happy to discuss remaining weaknesses if they are not addressed by the above, and hope the reviewer would consider raising their score if weaknesses are addressed.
Thank you for your effort in addressing my concerns. I appreciate the clarifications provided, which have resolved some of the issues. However, I still find the experimental settings to be somewhat limited. That said, I understand the inherent challenges in investigating pretraining data. Therefore, I do not oppose its acceptance if other reviewers believe it meets the necessary standards for ICLR. For now, I will maintain my score.
Dear reviewer,
Thanks a lot for the follow-up response and recognising the efforts we have made to address your concerns. We appreciate your acknowledgement that some issues have been resolved.
We would be grateful if you could elaborate on your outstanding concerns and what a satisfactory and reasonable resolution might look like to you, to ensure that, if they are the byproduct of some outstanding misunderstanding, we address them in the paper, and if not, that we address them in follow-on experiments. Either way, we are eager to ensure these points are thoughtfully addressed, regardless of the outcome of this paper’s acceptance.
For context, we'd like to note that our investigation goes beyond prior research in scale. Grosse et al. (2023), who were the first to apply EK-FAC influence functions at a similar scale, investigated 29 queries, whereas we look at 100 queries (which, at a lower bound, took ~424,448 TPU chip-hours). Moreover, we have control sets that highlight that the findings for the reasoning queries are not spurious, and the results are statistically highly significant (for example, the correlation results have very small p-values). It would be very helpful if you could clarify how you believe the experimental setup might still limit the conclusions and in what ways it might affect our findings.
Finally, we would like to kindly point out that all other reviewers indicated the work meets the publishing standards for ICLR (8, 8, 6). Given that your comment suggests you are open to its acceptance if the other reviewers believe it meets the necessary standards, we hope you might consider revisiting your score to reflect this position.
Thank you again for your time and your engagement with our work.
Dear reviewer KXBG,
Given that the discussion period ends soon, we wanted to check in if our provided responses address your concerns, and see if there are any further questions that we can help address.
Thanks again for reviewing!
Dear Reviewer,
As the rebuttal period is drawing to a close, I would appreciate your response to the authors' rebuttal at your earliest convenience.
Best Regards,
Area Chair
This paper applies the EK-FAC influence function to LLMs in an investigation of which documents, from a representative sample of LLM pretraining data, are used by a given model to answer basic mathematical reasoning questions. The EK-FAC influence function is used to produce a score for a given triple (prompt, completion, document) and these scores provide a basis to rank documents as more or less useful in generating the completion for the given prompt. Due to intense computational requirements, this technique is applied on a small sample of 80 prompt/completion pairs, but in great detail, examining several hundred documents at the top of the ranking for each pair. Several key findings emerge, including that models employ documents for reasoning responses in a different manner than for factual responses, and that such mathematical reasoning responses often rely on documents describing verbal procedures or code.
Strengths
- This paper presents a series of interesting and novel investigations into the influence of documents from pretraining in model responses. Most research in model interpretability is done by examining or modulating model parameters and activations, since it is usually computationally intractable to trace model responses back to pretraining samples; this is frontier research, and I was excited to read it.
- The paper presents insights into which documents are used to answer mathematical reasoning questions, and crucially provides comparisons between two models within the same family, and also to a secondary task in factual question answering. The latter comparison was especially useful and cleanly conveyed the points made: specifically, that factual responses often rely on a specific document, but evidence is shown that reasoning responses may draw on a breadth of documents, possibly aggregating heterogeneous information into one response.
- The experiments were extremely narrowly defined, but the authors caveat this early and often throughout the paper. Additionally, even in this narrowly scoped setting approximations must be made in order to be computationally tractable, and the authors honestly qualify discussions with reasonable alternate hypotheses and give sub-experiments to explore what is the most likely hypothesis. This kind of writing is very thoughtful and I appreciated that the authors made reasonable decisions and honestly qualified the claims, which were well supported.
Weaknesses
- As mentioned above, the experiments were very narrowly scoped. Only 80 questions were analyzed in total, and this 80 was further broken down into smaller sub-groups. Moreover, the questions were very simple mathematical problems using small numbers, requiring only short reasoning hops, and not resulting in fractional-valued answers. The experiments were performed only on two models within one model family, and one model is not available publicly. The authors do note all of these things, and some (not all) of these decisions seem to be made due to computational constraints, which is understandable. However, it would have been nice if these experiments were at least reproduced on fully public models such as Llama.
- The description of EK-FAC was brief and not as clearly described as the later experiments and results, which were very clear. It would be nice to have a little more motivation about the individual components in the given formulas, since this methodology underlies all of the later experiments. Further, the discussion section at the end of the paper (sec 5) was very dense and a bit confusing. Maybe this could be restructured? The alternating hypotheses in the paragraph starting on L490 were particularly hard to follow.
- (This is a minor point) Some of the mystique surrounding "reasoning" in LLMs may be because as a field we have conflated many types of problems into one, in the fervor of "AGI". Though this paper often discusses general reasoning, it looks specifically at mathematical reasoning, and it could be made more clear that these studies are distinct from linguistic reasoning, logical reasoning, spatial, etc etc. Analyzing linguistic reasoning provenance would be fascinating using this method, but would require different experiments.
Questions
I have no serious questions for the authors, but if they have time:
- Can this methodology be applied to model false-positives? It would be interesting to explore how pretraining documents may relate to hallucinations in generative responses, given prior research which points to cases of memorization.
We thank the reviewer for a positive review. We were very happy to read that: “this is frontier research, and I was excited to read it”, and that the review recognises evaluating factual question answering “was especially useful and cleanly conveyed the points made”. In the below, we address your raised weaknesses and questions.
Weakness 1: “As mentioned above, the experiments were very narrowly scoped.”
We respond to the first weakness in the general response to all reviewers above. To add to the specific comments in your review: we agree that it would be great to see these experiments reproduced on other model families. However, Llama specifically is not possible because its pretraining data is not published. On the short reasoning hops not resulting in fractional-valued answers: the reason we did this is two-fold. First, it is less likely that answers to reasoning steps are in our sample of 5M documents if they contain fractional values; second, in many cases expecting an LLM to output fractional values is less reasonable if it does not have access to tools.
Weakness 2 - part 1: “The description of EK-FAC was brief and not as clearly described as the later experiments and results”
This is understandable, and it’s useful for us to know that more motivation for using EK-FAC is required. To address this, we added the following line in the main paper: “In the same experiments, we motivate the use of EK-FAC estimations of the Hessian, by showing it significantly improves over a method using only first-order information.” (referring to Appendix A.1, see red-coded revisions L210-211). Given the limited space we have in the revision, and because this is background material, we decided to further address your point in the appendix. To summarise here; EK-FAC estimation of the Hessian is a much better estimate of the counterfactual question that we are interested in (“how do the trained model parameters change if a datapoint is included in the pretraining set and the model is retrained”) than methods using only first-order gradient information. This is especially true in a regime where many gradient steps are taken such as for LLMs, because second order information becomes even more important. Beyond the motivation of using EK-FAC over first-order methods, we expanded section A.2 of the appendix with two subsections that should address this point, and referred to it in the main paper (see L235, colour-coded red). In A.2.1, we ran additional experiments to motivate each approximation we do. To estimate the Hessian from Equation 1 with EK-FAC tractably for LLM-scale models we use a block-diagonal approximation of the Hessian. We estimate the effect this has on influence scores compared to a full implementation by calculating the correlations in an experiment on Wikitext with GPT-2. We find the scores correlate highly with the full implementation scores (Pearson’s R of 0.96). In the second section we added (A.2.2), we further compare our EK-FAC implementation to a publicly available implementation of EK-FAC influence functions (that correlates with our implementation with 0.996 Pearson’s R), and we share the detailed results of this experiment in the supplement. This provides a reference implementation that can further help with understanding the EK-FAC estimations.
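To make the contrast with first-order methods concrete, here is a schematic of the two estimators (our notation; damping details and the EK-FAC eigenvalue corrections are abbreviated):

$$s_{\text{first-order}}(z_q, z_d) = \nabla_\theta \mathcal{L}(z_q)^{\top} \nabla_\theta \mathcal{L}(z_d), \qquad s_{\text{IF}}(z_q, z_d) = \nabla_\theta \mathcal{L}(z_q)^{\top} (H + \lambda I)^{-1} \nabla_\theta \mathcal{L}(z_d)$$

$$H \;\approx\; \mathrm{blockdiag}\big(\hat{H}_1, \dots, \hat{H}_L\big), \qquad \hat{H}_\ell \;\approx\; \mathbb{E}\big[a_\ell a_\ell^{\top}\big] \otimes \mathbb{E}\big[g_\ell g_\ell^{\top}\big]$$

where $a_\ell$ are the inputs to layer $\ell$ and $g_\ell$ the gradients with respect to that layer's pre-activation outputs; EK-FAC additionally refits the eigenvalues of this Kronecker-factored approximation. Only the second score involves the curvature that determines how parameters actually move over many gradient steps, which is the intuition behind preferring it over first-order scores in the LLM regime.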
Weakness 2 - part 2: “the discussion section at the end of the paper (sec 5) was very dense and a bit confusing.”
This is very valuable feedback, thank you. We have restructured the discussion to add detail and improve clarity. Please refer to the second and third paragraph in the discussion in the uploaded revision (L486-512, colour-coded red). To summarise the changes; we separated the two alternative hypotheses, rewrote them to be clearer, and reframed the second half of the paragraph starting on L490 originally (now on L496) in terms of limitations.
Weakness 3: “it could be made more clear that these studies are distinct from linguistic reasoning”
We agree with the point about the field conflating many different forms of “reasoning”, without being too clear about what reasoning is. This is in part why we chose very simple mathematical reasoning, with clearly defined steps that build on each other. We tried to be clear about this by explicitly stating that we look at simple mathematical reasoning tasks in the abstract, and specifying the types in the introduction (right before the summary of the findings). To emphasise again at the end of the paper that this does not mean that our findings would generalise to other forms of reasoning, we added the following line in the discussion: “Finally, in this work we look at mathematical reasoning, which is very different from other types of reasoning, especially if they are inductive. Future work should verify whether similar results hold for more types of reasoning” (colour-coded orange, L525-528).
Question 1: “Can this methodology be applied to model false-positives? It would be interesting to explore how pretraining documents may relate to hallucinations in generative responses, given prior research which points to cases of memorization.”
That’s a very interesting suggestion, and yes this should be possible with this methodology. An experimental setup that comes to mind that is even possible with the results we already have is taking completions for factual questions the model gets right and the ones it gets wrong (which are essentially hallucinations as the model makes up an answer) and try to find patterns in the difference between the rankings. Probably a better setup though would be to look at more interesting forms of hallucinations, where the model more freely hallucinates (or indeed false-positives, e.g. where the model identifies something in text that is not there), as opposed to failures of retrieval in response to a factual question. The most interesting would be to get a broad set of hallucinations in completions that otherwise don’t have much to do with each other, and try to find patterns in the most influential data.
We were very happy to read your review and excellent summary of the paper, and that you believe the claims are honestly qualified and well-supported. We hope the revisions made in response to your review as well as the explanation of the limited scope address your weaknesses and are happy to discuss further where required. We believe that the improvements made following the feedback have considerably strengthened the positioning of our work. Thank you!
Dear reviewer,
Given that the discussion period is coming to a close tomorrow, we were wondering if you have had the time to look at our responses. We believe we have significantly improved the paper in response to your review. Most notably, we added experimental results motivating the EK-FAC estimation of the Hessian, we comment on the scope of the experiments, and we significantly rewrote the discussion in response to your points. We sincerely hope you can find the time to look over our response and let us know your thoughts, and if your points of weakness have been addressed, whether you would consider further strengthening the support for our submission.
Thanks again!
The Authors of Submission 7193
Thank you very much for taking the time to write such a thorough response (even to an already positive review). The changes made in the new revision are clarifying, and themselves quite interesting.
I especially appreciate the additions made in appendix A. It might not be a bad idea to mention in the main text that these parallel experiments in finetuning GPT-2 agreed with the main findings, just for the benefit of readers like me trying to assess reproducibility of the findings across models. I think adding finetuning data for this experiment is a nice answer to the issue of Llama, whose pretraining data is not available.
Overall, I believe this paper would be a strong contribution to ICLR should it be accepted.
The paper investigates the generalization strategies employed by LLMs when performing reasoning tasks compared to factual recall. The authors examine the influence of pretraining data on two LLMs of different sizes (7B and 35B parameters) by using influence functions to rank documents based on their impact on the likelihood of model outputs for reasoning and factual questions. They find that for reasoning tasks, LLMs do not rely heavily on direct retrieval of answers from pretraining data but instead use a broader set of documents that contain procedural knowledge relevant to the task. This suggests that LLMs generalize by learning how to perform reasoning steps rather than memorizing specific solutions. In contrast, for factual questions, the influential documents often directly contain the answers. The authors also note the overrepresentation of code in influential documents for reasoning, indicating its importance in teaching procedural knowledge to the models.
Strengths
- The paper provides an important insight of LLMs, namely how models generalize beyond their training data, which is crucial for advancing reasoning capabilities of LLMs.
- The use of influence functions to study generalization in LLMs offers a good perspective on how models might learn to reason.
- The experiments are well-executed, and the analysis and explanation for drawing the findings are reasonable.
Weaknesses
- The study only looks at a subset of the pretraining data, which might not capture less frequent but highly influential documents.
- Findings are based on two models from the same organization, potentially limiting the generalizability across different architectures or training regimes.
- There's no cross-validation with other methods of understanding model behavior which could corroborate the findings.
Questions
- Could you elaborate more on how you define "procedural knowledge" in the context of your findings? How does this relate to the concept of learning algorithms or routines within the training data?
- Given the high influence of code documents, how might this skew the model's reasoning capabilities, especially in non-coding contexts?
- With these insights, what are the potential adjustments or enhancements in training strategies for LLMs to improve their reasoning generalization?
We thank the reviewer for such a supportive review. We are very excited to read that the reviewer thinks our “paper provides an important insight of LLMs” which is “crucial for advancing reasoning capabilities of LLMs”, the experiments are well-executed, and the analysis and explanation for drawing the findings are reasonable. In the following, we aim to address the weaknesses mentioned and answer any questions.
Weakness 1: “The study only looks at a subset of the pretraining data, which might not capture less frequent but highly influential documents.” and “Findings are based on two models from the same organization, potentially limiting the generalizability across different architectures or training regimes.”
We respond to these points in detail in the general comment to all reviewers above. To summarise here, we agree with the reviewer that our results leave open questions about generalisation to other architectures and training regimes, but we believe this does not undermine our conclusions. Further, we believe 5 million documents that are similarly distributed to the pretraining data are sufficient to make the conclusions we have in the paper.
Weakness 2: “There's no cross-validation with other methods of understanding model behavior which could corroborate the findings.”
Before choosing EK-FAC influence functions to explain model completions, we thought about using other methods (predominantly less expensive ones, such as representational similarity or first-order gradient-based methods such as TracIn). However, we found in preliminary experiments that these do not estimate well the counterfactual we are interested in (i.e. “how the trained model parameters (or any function thereof, such as the likelihood of completions) change if a datapoint is included in the pretraining set and the model is retrained”). We summarised these experiments in Appendix A.1, where we show EK-FAC influence functions estimate the counterfactual better than TracIn (based on first-order gradient information). We did not use representational similarity in these experiments because in preliminary experiments this worked even less well than TracIn, and we believe it has little explanatory power for LLM behaviour. Therefore, we expect other methods of explaining model completions to work less well than EK-FAC influence functions, which best estimate the counterfactual question of why a model produced a completion.
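As a minimal illustration of the difference, the following toy numpy sketch (our own variable names and random gradients, not the authors' implementation) contrasts a TracIn-style dot product with a curvature-preconditioned score; the point is only that preconditioning by an (approximate) inverse Hessian can reorder the document ranking:

```python
# Toy sketch (not the authors' code): first-order vs. curvature-preconditioned
# influence scores computed from randomly generated gradients.
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Stand-in for the damped Gauss-Newton Hessian that EK-FAC would estimate
# block-diagonally per layer in the real setting.
A = rng.normal(size=(dim, dim))
hessian = A @ A.T + 0.1 * np.eye(dim)

query_grad = rng.normal(size=dim)        # gradient of the query completion's loss
doc_grads = rng.normal(size=(10, dim))   # gradients of 10 candidate documents

# TracIn-style (first-order) scores: plain dot products.
first_order_scores = doc_grads @ query_grad

# Influence-function-style scores: precondition the query gradient with H^{-1}.
second_order_scores = doc_grads @ np.linalg.solve(hessian, query_grad)

print("first-order ranking :", np.argsort(-first_order_scores))
print("second-order ranking:", np.argsort(-second_order_scores))
```

In the real setting the Hessian is far too large to form or invert explicitly, which is exactly what the per-layer EK-FAC factorisation addresses; the toy example only shows why the two rankings can disagree.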
Question 1: “Could you elaborate more on how you define "procedural knowledge" in the context of your findings? How does this relate to the concept of learning algorithms or routines within the training data?”
We define procedural knowledge as knowledge that contains information that is applicable to questions which underlie the same tasks (or, procedures). We contrast this to knowledge that is only applicable to specific instantiations of questions or tasks (like we find for the factual questions). We believe learning algorithms or routines described in the pretraining data would fall under this category. Please let us know if this answers the question, and if not we are happy to discuss more.
Question 2: “Given the high influence of code documents, how might this skew the model's reasoning capabilities, especially in non-coding contexts?”.
It seems like evidence from multiple sources is converging on code improving LLM reasoning in non-coding contexts (e.g. [1]), and evidence from our paper adds to this by showing it can also impact reasoning negatively, which hints at the possibility of better code-data filtering for reasoning. An interesting paper recently came out that investigates your question for natural language reasoning [1], and they find that even when trained only on code, models can learn to perform natural language reasoning tasks better than random. Some hypotheses around how this is possible are that there is a lot of natural language in code data as well, in the form of comments, instructions, Jupyter notebooks with text between code, etc. However, how and why exactly the model’s reasoning capabilities in non-coding contexts get skewed by training on code is an interesting open question where there is still a lot to learn.
[1] “To Code, or Not To Code? Exploring Impact of Code in Pre-training”, Viraat Aryabumi et al., 2024
Question 3: “With these insights, what are the potential adjustments or enhancements in training strategies for LLMs to improve their reasoning generalization?”
This is a good question, and we spent some additional space in the revision to discuss this in more detail. We believe the main takeaway is that pretraining data selection methods can focus on high-quality descriptions and applications of procedures, covering diverse reasoning tasks. Further, the finding that code can be both positively and negatively influential for reasoning highlights there is a possibility here to filter out bad code data. The revisions relevant to this question can be found in the following: the last paragraph of the introduction (L141-150, colour-coded orange) as well as the revision near the end of the discussion (colour-coded orange, L520-522).
We believe the revision of the paper constitutes a substantial improvement over the previous one, and we hope the above points can address the weaknesses mentioned in the review. We look forward to discussing further where necessary.
Thank you for the response. I maintain my positive score.
This paper investigates the role of pretraining data in shaping large language models' (LLMs) abilities in reasoning tasks compared to factual question-answering. By analyzing two models of different sizes (7B and 35B parameters) across reasoning and factual queries, the authors aim to understand how LLMs generalize when tackling reasoning tasks and whether they rely on specific retrieval of information or broader procedural knowledge. The study applies influence functions to rank the most impactful pretraining documents for different queries, examining if reasoning draws from procedural patterns rather than specific facts.
Empirically, the study finds that reasoning tasks rely on a more distributed set of documents, often containing procedural content like code snippets or mathematical explanations, while factual questions frequently rely on specific documents containing direct answers. Code-based documents, in particular, emerge as influential for reasoning, likely due to their structured, step-by-step nature. Additionally, reasoning tasks across similar queries show correlated influence scores, suggesting a reliance on shared procedural knowledge. The larger 35B model also shows less variation in influence across documents, hinting at improved data efficiency. Together, these findings imply that LLMs approach reasoning by aggregating procedural knowledge rather than retrieving isolated factual data, shedding light on different generalization strategies in LLMs.
Strengths
- The paper tries to tackle an intellectually significant question: how do LLMs generalize reasoning abilities from pretraining data to solve completion questions? This exploration into the mechanics of LLM reasoning generalization is both timely and meaningful, given the increasing focus on interpretability and robustness in AI.
- The findings provide intuitive insights, showing that LLMs draw on a broad range of abstractly related documents when solving reasoning questions, as opposed to the more targeted document reliance seen in factual questions. This highlights the importance of procedural knowledge and coding data for reasoning tasks, an observation that aligns with broader intuitions about reasoning and learning in LLMs.
- A key technical strength lies in the revision and adaptation of EK-FAC influence functions. The authors refine this method to assess the influence on model accuracy, which is essential for examining how specific documents impact LLM performance in reasoning versus factual tasks.
Weaknesses
- The overall style resembles a blog post, presenting intriguing observations over a cohesive scientific narrative. For example, the conclusion/discussion section takes more than 1 page to explain everything again. The paper could either prioritize the revised EK-FAC function or convert the observations into some actionable strategies to improve LLMs. Additionally, reorganizing the paper to integrate findings more succinctly could create a more cohesive narrative.
- Although the paper acknowledges computational constraints, the scale of data and task complexity could be expanded to strengthen the conclusions. The study’s focus on basic arithmetic and simple mathematical queries limits its generalizability to broader reasoning tasks that are common in real-world applications. Also, the study examines only a subset (5 million documents) of the total pretraining data, which may exclude influential documents crucial to understanding the LLMs’ full generalization strategy.
- The paper predominantly examines positively influential documents, yet negatively influential documents could offer essential insights into reasoning limitations and biases. Understanding negative influences would allow the authors to identify pretraining data that hinders reasoning or introduces procedural noise, shedding light on inherent biases that might restrict generalization. Only focusing on the positively influential documents might bias our judgements towards cherry-picking conclusions.
Questions
- Could you further explain why calculating document gradients with the base model and the query gradients with the fine-tuned model? Could this discrepancy cause any potential problems?
We thank the reviewer for their thoughtful and positive review, stating that we tackle “an intellectually significant question” that is “both timely and meaningful”, that “the findings provide intuitive insight”, and for recognising the technical difficulty of using EK-FAC influence functions. We significantly rewrote the revision in response to weakness 1, and highlight in more detail below where these revisions can be found. We also dedicate a common comment to the revision to highlight the updates for all reviewers. Further, we added additional analyses for the negative portions of the rankings in response to weakness 3. Please find details below.
Weakness 1: “The overall style resembles a blog post, presenting intriguing observations over a cohesive scientific narrative.”
This is very useful feedback, and we believe we have improved the submission in response. Most notably, we changed the title of the submission to “procedural knowledge in pretraining drives LLM reasoning” in order to start building a cohesive narrative early on. Relatedly, we changed Fig 1 to summarise key findings instead of the method. At the end of the introduction, we make recommendations for strategies to improve LLMs based on our findings (colour-coded orange, L141-150). We also rewrote the discussion, which now spends only 1 paragraph on summarising results, and the rest on discussion, limitations, and future work.
Weakness 2: “Although the paper acknowledges computational constraints, the scale of data and task complexity could be expanded to strengthen the conclusions.”
We would like to refer the reviewer to the general comment on scope above. The summary is that these design decisions were made due to hard compute constraints, and indeed our findings have no bearing on other forms of reasoning; it’s an open question whether similar conclusions will hold there. However, the scope is also large compared to prior work, and in our opinion broad enough to substantiate our claims, crucially relying on documents that are similarly distributed as the pretraining data.
Weakness 3: “The paper predominantly examines positively influential documents, yet negatively influential documents could offer essential insights into reasoning limitations and biases.”
Thanks for pointing this out; we agree that the negative influences are equally important, so this is useful feedback. Most of our quantitative analyses already incorporate negative influences (e.g. the correlations are computed using all 5M documents), but we were not clear enough about this in the manuscript, referring often only to “positively influential” sequences. We adjusted the manuscript to reflect more clearly that the quantitative findings all hold similarly for the negative portions of the ranking, which supports the claims made (see especially blue-coded text below finding 2 in section 5.1 in the revision, starting on L349, and Figure 24 and 25 in Appendix A.9.3, around L3367).
For the qualitative analyses, looking at the negative influences is interesting in terms of suggestions for improving LLM reasoning, but it is difficult to make general recommendations based on them. We found few clear qualitative patterns in the negative influences. For example, for factual queries it seems like the topics are often similar to those in the top portions of the rankings, but the documents do not give all the information (e.g. a document discusses Mount Everest but mentions the height of another mountain), which is hard to quantify. Therefore, we believe future work is necessary to make recommendations based on these. We did find an important general pattern which was to the best of our knowledge previously unknown: that code data can be equally positively and negatively influential for reasoning. In response to your review, we adjusted the main text to reflect that the code finding is about both the positive and negative portions of the ranking (see blue colour-coded 4th finding L137-139 in the introduction and Finding 5 L462-463), and we adjusted the discussion to more clearly present this insight as a potential future direction towards better LLM reasoning by filtering out bad code data (see discussion orange-coded text L520-522). Further, we are working on releasing the top and bottom 20 data points per query, which can provide further insights for practitioners.
To summarise, our main finding that LLMs learn to produce reasoning traces from procedural knowledge in pretraining data is supported by the negative influences, and we believe it’s an interesting direction for future work to use this to filter negatively influential pretraining data for better reasoning.
Question 1: “Could you further explain why calculating document gradients with the base model and the query gradients with the fine-tuned model? Could this discrepancy cause any potential problems?”
Using different models for calculating the influence scores is a method called SOURCE [1], and we are assuming here that the second-order information of the fine-tuning stage is the identity (meaning that instead of using second-order information for that stage, we multiply the query gradients with the identity matrix; see Figure 6 around L1114 in the appendix of the revision, previously Figure 1). This means we are ignoring the second-order impact of the fine-tuning stage on the completions. We argue that this is unlikely to impact conclusions, because prior work has shown that SFT serves primarily to enhance existing model capabilities as opposed to endowing them with new ones [2], [3], [4]. Further, the fine-tuning stage consisted of a couple thousand supervised instruction-tuning steps, which is negligible compared to the pretraining stage. Nonetheless, we believe an interesting direction for future work would be to apply the same method used here to the fine-tuning stage. We hypothesise that this might surface documents that are similar in formatting to the queries, as opposed to documents that are similar in content. We dedicated a few lines to this question in the revision (L513-518, colour-coded orange in the discussion), copied here: “Another limitation is that we do not look at the supervised fine-tuning stage. The reason we only look at the pretraining data is because the fine-tuning stage is targeted at making the models more aligned and ‘instructable’, as opposed to teaching the model any new capabilities. Prior work has in fact shown that it does not teach the model new capabilities, but rather enhances existing ones (Jain et al., 2024; Kotha et al., 2024; Prakash et al., 2024). Nonetheless, an interesting direction for future work is applying the same method used here to the fine-tuning data.”
[1] Training Data Attribution via Approximate Unrolled Differentiation; Bae et al., 2024.
[2] Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks; Jain et al., 2024.
[3] Understanding catastrophic forgetting in language models via implicit inference; Kotha et al., 2024.
[4] Fine-tuning enhances existing mechanisms: A case study on entity tracking; Prakash et al., 2024.
We hope these points address the weaknesses and the question raised by the reviewer. We believe the revision presents a more cohesive narrative as a result of incorporating this feedback, and importantly we believe we were able to make stronger recommendations for future work on improved LLM reasoning because of your points 1 and 3. We are looking forward to an engaged discussion. If there are weaknesses still remaining that might prevent you from increasing your score, we would be grateful for the opportunity to discuss these further.
Dear Reviewer,
As the rebuttal period is drawing to a close, I would appreciate your response to the authors' rebuttal at your earliest convenience.
Best Regards,
Area Chair
I thank the authors for the great rebuttal. Glad to see the review and rebuttal improve the paper significantly.
I hold a positive score toward acceptance and increased soundness and presentation scores. Thanks.
Dear reviewer,
We are very glad to read you believe the paper is improved significantly, and that you now think the contribution, soundness, and presentation are all good. We would be grateful if you could update your rating to reflect this, or otherwise let us know what the outstanding concerns are so we can address them carefully.
Thanks again for your time reviewing and your engagement.
Dear reviewer cjVF,
Given that the discussion period ends soon, we wanted to check in on the above, and see if there are any outstanding concerns we can address.
Thanks again for your time!
Dear reviewers,
We believe we have significantly improved our submission in response to your reviews, detailed in a separate comment to each reviewer below, and we want to thank you all for your thoughtful reviews. In this brief comment, we wanted to highlight two changes to the manuscript in response.
The first change is in response to reviewer cjVF’s first weakness saying we should present a more cohesive narrative. To this end, we change the title to “Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models”. With this title we aim to introduce the main finding early. To address the same weakness, we change Figure 1, which now represents a summary of our findings instead of an image of the pipeline, which we moved to the appendix (Figure 6). We also rewrote the discussion to spend less time on summarising results and more on discussion.
The second major change is that we added experimental results based on a group of 20 control queries for each model (which we were able to fit together with the 80 queries for each model in the same loop over the pretraining data). Because it was not feasible to add an interesting amount of additional reasoning tasks (see the other comment to all reviewers right below, about scope), we believed a better use of these 20 extra queries was to test alternative hypotheses about the data. These queries are control queries in that they are similar to the factual and reasoning queries in style and wording, but do not require any factual retrieval or reasoning to be resolved. We believe these additional results help address points raised by reviewers about the experimental scope by confirming that similar quantitative results do not hold for a control group. For the change to the main paper, please refer to the revision at the end of Finding 1 in the quantitative findings section 5.1 (L314-319, orange colour-coded). Tables 10-14 in the Appendix have examples of what the control queries look like, and they can also be found in the supplement.
More generally, we have colour-coded all revisions with colours specific to reviewers:
Orange: relevant to multiple reviewers.
Blue: relevant to reviewer cjVF.
Green: relevant to reviewer RvSn.
Red: relevant to reviewer 6knH.
Purple: relevant to reviewer KXBG.
We hope our revisions detailed below in response to your reviews address all your points and we are happy to discuss further wherever required. Thank you again for your time!
We agree with the reviewers that the scope of tasks and models we look at is narrow. On the other hand, we do 1B LLM-sized gradient dot products (100 queries * 2 models * 5M) for these experiments; in that sense the scope is very large compared to prior interpretability research. We view the task scope as a limitation that was necessary to answer our research question, and not a weakness. We highlight this in the submission: “All we showed is that in principle it seems to be possible for LLMs to produce reasoning traces using a generalisation strategy that combines information from procedurally related documents, as opposed to doing a form of retrieval.” Reviewer 6knH calls this out as a strength: “The experiments were extremely narrowly defined, but the authors caveat this early and often throughout the paper [...] I appreciated that the authors made reasonable decisions and honestly qualified the claims, which were well supported”. We pushed the compute we had to the limit, and made careful design decisions to answer our research question, which we will explain below.
Compute and memory
We used 379,392 TPU v5 chip-hours and 45,056 TPU v4 chip-hours (https://cloud.google.com/tpu/pricing#regional-pricing for reference only), which we parallelised to get it down to about 3 months of consecutive computation. Further, fitting more than 100 35B query gradients on our largest TPU was impossible, and looping over the pretraining sample twice would almost double the compute required. For comparison, the entire Pythia pretraining suite of models required 544,280 A100 hours (see Appendix D in the Pythia paper).
Tasks
We chose mathematical reasoning for two reasons: it has well-defined answers to intermediate steps and we can easily generate questions that underlie the exact same procedure but use different numbers. We wanted to look at at least 2 tasks per model, but could not fit more than about 100 query gradients on the largest TPU we have (any additional queries would require an entire new loop over the pretraining set, which would take another few months to run). Therefore, we used 40 factual and 40 reasoning questions (the remaining 20 queries we used for control questions; see the other general comment for details on these new results). We effectively look at 200 factual, reasoning, and control queries (100 for the 7B and 100 for the 35B, of which 36 share prompts but all have different completions).
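As a purely illustrative sketch of what "the exact same procedure with different numbers" means here (a hypothetical template in the spirit of the slope task discussed in the qualitative findings, not the actual query-generation code):

```python
# Hypothetical sketch (not the actual query-generation code): templated slope
# questions that all require the same two-step procedure but use different,
# small, integer-valued numbers.
import random

def make_slope_question(rng: random.Random) -> tuple[str, int]:
    x1 = rng.randint(-9, 9)
    x2 = rng.randint(-9, 9)
    while x2 == x1:                 # avoid a vertical line
        x2 = rng.randint(-9, 9)
    slope = rng.randint(-5, 5)      # choose an integer slope up front...
    y1 = rng.randint(-9, 9)
    y2 = y1 + slope * (x2 - x1)     # ...so the answer is never fractional
    question = (
        f"What is the slope of the line passing through the points "
        f"({x1}, {y1}) and ({x2}, {y2})?"
    )
    return question, slope

rng = random.Random(0)
for _ in range(3):
    question, answer = make_slope_question(rng)
    print(question, "->", answer)
```

Because every instantiation shares the same underlying procedure while the numbers differ, influence rankings for such queries can be meaningfully compared and correlated across queries of the same reasoning type.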
Pretraining data
This aspect is the bottleneck, and took >80% of the TPU chip-hours. The important point about this subset of tokens is that it is identically distributed to the pretraining data. Our findings take into account that it is a sample, and in the submission we reason about how the conclusions might change if we were able to look at the entirety of the data. Unfortunately, that is not tractable (no research has looked at the entire pretraining data in this way), so we have to draw conclusions based on the fact that we have an identically distributed sample. The qualitatively highly relevant data we find for all queries provides strong support that this sample is large enough to cover highly influential data. E.g., we find answers in the documents to 'niche' questions such as "Who was the prime-minister of the Netherlands in 1970?".
Models
Our results can be seen as evidence that a standard decoder-only transformer can in principle learn a generalisable strategy from pretraining data, for a 7B and a 35B model. Comparing to another model family is an interesting direction for future work, but it is not essential for our conclusion, and it is prohibitive in terms of compute costs. Furthermore, it is not immediately clear what other model we could look at, as our investigations require full access to the pretraining distribution. Llama, for example, is trained on proprietary data.
To summarise, we would like to reframe the scope of our experiments as a necessary limitation given the high cost of the experiments, and not a weakness. We are the first to look at the pretraining data in this way to understand how LLMs generalise when reasoning, and we show that it is possible for LLMs to learn a generalisable strategy from procedural knowledge in pretraining. We agree with the reviewers that our results leave open the question of whether this holds for other models and forms of reasoning, like inductive reasoning. We added a few lines in the revision to highlight this further (L525-528). We are excited about future work in this area. When influence functions become more tractable, the findings can be confirmed on the entire pretraining set (this is an active area, e.g. [1], but that style of influence function is currently less effective at estimating the counterfactual).
[1] "What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions", Sang Keun Choe et al., 2024
Dear Reviewers,
With only a few days remaining in the discussion period, we would greatly appreciate your engagement to ensure a constructive dialogue. In our revision, we’ve worked hard to address your feedback, making significant improvements to the paper:
- The findings present a more coherent message, guided by reviewer cjVF's comments.
- We included additional experimental results and responses to shared reviewer points about the scope of the work.
- Detailed responses to reviewer-specific points are given in each separate comment below.
We are eager to hear your thoughts on these updates and hope you’ll have a chance to review our responses. We value your time and effort in shaping this submission.
Thank you again for your thoughtful reviews and for considering our responses.
Best regards,
The Authors of Submission 7193
The paper investigates how LLMs utilize pre-training data differently when performing reasoning tasks versus factual question-answering. Using influence functions, the authors analyzed two models (7B and 35B parameters) to examine which pre-training documents had the greatest impact on model outputs. The experiments reveal several key insights: reasoning tasks draw from a broader, more distributed set of pre-training documents compared to factual queries; larger models show more uniform influence scores across documents, suggesting improved data efficiency; and documents containing procedural content (especially code snippets and mathematical explanations) are particularly influential for reasoning tasks. These findings advance our understanding of how LLMs may develop reasoning abilities through training, suggesting they acquire procedural knowledge rather than simply memorizing solutions.
The paper makes significant contributions to reasoning research for several reasons: (1) it addresses a fundamental question about how LLMs develop reasoning capabilities from pre-training data, representing frontier research in model interpretability by tracing responses back to training samples; (2) the technical execution features well-adapted EK-FAC influence functions and thoughtful comparative analysis between different model sizes and task types; and (3) the findings provide actionable insights about the importance of procedural knowledge and diverse training data in developing reasoning capabilities.
This paper also has several limitations: (1) the experimental scope is notably narrow, examining few mathematical questions and using just two models from the same family. While computational constraints explain some limitations, validation on public models like OLMo or StarCoder whose pre-training corpus is also tractable would strengthen the findings; (2) the study analyzes only a subset of pre-training data, potentially missing less frequent but influential documents; (3) additionally, while the paper discusses "reasoning" broadly, it specifically examines mathematical reasoning, and distinctions between different types of reasoning (logical, spatial) could be better explained.
Overall, the paper's approach and insights into how LLMs leverage training data for reasoning tasks make it a valuable contribution to the field. I believe it should be accepted.
Additional Comments on Reviewer Discussion
In summary, all reviewers have acknowledged the authors' rebuttal, with three maintaining their original positive assessments and expressing appreciation for the authors' clarifications. While one reviewer's concerns remained unchanged after the rebuttal, their primary criticism regarding the lack of other models should be considered in historical context: when this research began, fully open models were relatively rare (e.g., OLMo). Additionally, the computational intensity of running influence functions makes it impractical to expect more data points. Overall, the discussion supports moving forward with acceptance, as the paper's core contributions and insights outweigh these limitations.
Accept (Poster)