Multilingual Mathematical Autoformalization

ICLR 2024 · Submitted: 2023-09-21 · Updated: 2024-02-11
Decision: Rejected · 4 reviewers
Ratings: 3, 6, 8, 5 (average 5.5/10; lowest 3, highest 8, standard deviation 1.8)
Average confidence: 3.8

Abstract

Autoformalization is the task of translating natural language materials into machine-verifiable formalisations. Progress in autoformalization research is hindered by the lack of a sizeable dataset consisting of informal-formal pairs expressing the same essence. Existing methods tend to circumvent this challenge by manually curating small corpora or using few-shot learning with large language models. But these methods suffer from data scarcity and formal language acquisition difficulty. In this work, we create MMA, a large, flexible, multilingual, and multi-domain dataset of informal-formal pairs, by using a language model to translate in the reverse direction, that is, from formal mathematical statements into corresponding informal ones. Experiments show that language models fine-tuned on MMA produce 16–18% of statements acceptable with minimal corrections on the miniF2F and ProofNet benchmarks, up from 0% with the base model. We demonstrate that fine-tuning on multilingual formal data results in more capable autoformalization models even when deployed on monolingual tasks.
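As a rough illustration of the back-translation setup described above, the sketch below shows how a formal statement could be informalised with a chat-completion API. The prompt wording, the model name, and the `informalise` helper are hypothetical illustrations under stated assumptions, not the authors' released pipeline.

```python
# Minimal sketch of the informalisation (back-translation) step: a formal
# statement goes in, a natural-language description comes out.  The prompt
# text, model name, and this helper are illustrative assumptions, not the
# authors' actual data-generation code.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def informalise(formal_statement: str, language: str = "Isabelle") -> str:
    """Ask the model to translate a formal statement into informal mathematics."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You translate formal mathematical statements into "
                        "faithful natural-language descriptions."},
            {"role": "user",
             "content": f"Informalise the following {language} statement:\n\n"
                        f"{formal_statement}"},
        ],
    )
    return response.choices[0].message.content

# Example usage on a toy Isabelle-style statement.
print(informalise('lemma add_commute: "a + b = (b :: nat) + a"'))
```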
Keywords
Autoformalization · parallel corpus · theorem proving · automated reasoning · language models · dataset

Reviews and Discussion

Review
Rating: 3

This work uses GPT-4 to translate formal problem descriptions into informal natural-language descriptions, and demonstrates on two standard benchmarks that the collected data helps improve autoformalization.

Strengths

Significant improvement in autoformalization performance.

Weaknesses

The scope and novelty are limited. This paper can be viewed as a back-translation version of Fu et al. (ICML 2023; see below for reference details). The proposed method is generic, but only evaluated on the narrow domain of autoformalization.

Missing References

Three lines of related work are almost completely missing. Representative work in each line is listed below. Please conduct a literature search using the papers below as starting points and cite the relevant papers found in their reference lists.

  1. Model distillation
    Fu et al. Specializing Smaller Language Models towards Multi-Step Reasoning. ICML 2023

  2. Mathematical and logical reasoning with LLMs, and related work that involves multiple (natural) languages
    Cobbe et al. Training verifiers to solve math word problems. 2021 arXiv preprint: 2110.14168
    Lewkowycz et al. Solving Quantitative Reasoning Problems with Language Models. NeurIPS 2022
    Shi et al. Language Models Are Multilingual Chain-of-Thought Reasoners. ICLR 2023

  3. Executable program as formalisms of natural language
    In addition to OpenAI's Codex, the following work is also worth checking:
    Yu et al. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. EMNLP 2018
    Austin et al. Program synthesis with large language models. 2021 arXiv preprint: 2108.07732
    Fried et al. InCoder: A Generative Model for Code Infilling and Synthesis. ICLR 2023

Minor Comments and Typos

Term consistency needs a double check: for example, both "formalisation" and "formalization" appear in the paper.

Questions

  • Have you also checked potential data contamination issues? For example, would it be possible that GPT-4 has seen the test data in the evaluation benchmarks?
Comment

We thank the reviewer for helpful suggestions and references. Below we address the specific points raised by the reviewer.

  • Novelty:

    • As we specify in the Contributions subsection, the main contributions of this paper include:

      1. Creating a large dataset
      2. Training multiple language models on it with different settings
      3. Evaluating these language models on held-out benchmarks manually to demonstrate that the dataset helps autoformalization
      4. Conducting ablation experiments to show that multilinguality in formal languages benefits autoformalization.
    • We want to emphasise that this is a dataset paper, not a method one. The paper by Fu et al. proposes “model specialization”, which distils from a large model to a small model. Ours is about utilising the power of model distillation and back-translation to construct the most useful dataset for autoformalization, and validating various scientific points with it, as mentioned in the contributions.

    • The proposed method is generic, but only evaluated on the narrow domain of autoformalization.

      Autoformalization is a new domain that poses unique challenges like the lack of a sizeable dataset, the integration of neural and symbolic methods, and the difficulty of model evaluation. We think the domain is important enough to warrant its own treatment: numerous papers solely on the specialised domain of autoformalization have been published, many of them in top conferences within both the machine learning and the theorem proving communities [1-8].

    • Back-translation and sequence distillation are well-known techniques in the field of NLP and we do not claim to have invented them. This is a dataset paper and we utilise these two techniques to create the dataset. The advantage of using back-translation for autoformalization is that informalising is much easier than formalising, and hence we can obtain a high-quality dataset with it. Only the very first part of the process and experiments described in our paper is similar to Fu et al., and our novelty lies elsewhere.

  • References:

    • We note that autoformalization is quite different from both program synthesis and informal reasoning, with the aim of translating from a natural language to a formal one instead of creating formal programs from scratch. These papers are related to the method of creating the dataset, rather than to the conclusions/contributions of the paper (which are about the dataset and multilinguality). We thank the reviewer again for the references. We have updated the paper to include these references and compared our work to them.
  • Data contamination:

    • This is a valid concern and we appreciate it. Three important pieces of information indicate that data contamination is unlikely to affect our paper:
      • We did not evaluate GPT-4. We only evaluated LLaMA models.
      • The pre-trained model is not contaminated: We dedicate a subsection to the issue of data contamination with the pre-trained LLaMA model in Section 6. Since it cannot even be instructed to write Isabelle or Lean, we think it unlikely that the pre-trained model is contaminated.
      • The fine-tuning datasets are not contaminated: The fine-tuning datasets come from the Isabelle AFP and Lean Mathlib4. The test benchmarks come from miniF2F and ProofNet. These datasets are strictly disjoint. We use GPT-4 to create English descriptions of the AFP and Mathlib4, which are extremely unlikely to include anything from the formal test benchmarks. Since the pre-trained model is unlikely to be contaminated, and the fine-tuning dataset is unlikely to be contaminated, we conclude that the test evaluations are not affected by data contamination, and our conclusions stand.
  • Spelling:

    • We use British English spelling throughout, except for the word “autoformalization”, which has become an established term.
Comment

[1] Wang, Qingxiang, et al. "Exploration of neural machine translation in autoformalization of mathematics in Mizar." Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs. 2020.

[2] Wu, Y., Jiang, A.Q., Li, W., Rabe, M., Staats, C., Jamnik, M. and Szegedy, C., 2022. Autoformalization with large language models. Advances in Neural Information Processing Systems, 35, pp.32353-32368.

[3] Hahn, Christopher, et al. "Formal specifications from natural language." arXiv preprint arXiv:2206.01962 (2022).

[4] Reichel, Tom, et al. "Proof Repair Infrastructure for Supervised Models: Building a Large Proof Repair Dataset." 14th International Conference on Interactive Theorem Proving (ITP 2023). Schloss-Dagstuhl-Leibniz Zentrum für Informatik, 2023.

[5] Jiang, A.Q., Welleck, S., Zhou, J.P., Lacroix, T., Liu, J., Li, W., Jamnik, M., Lample, G. and Wu, Y., 2022, September. Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs. In The Eleventh International Conference on Learning Representations.

[6] Cosler, Matthias, et al. "nl2spec: Interactively Translating Unstructured Natural Language to Temporal Logics with Large Language Models." arXiv preprint arXiv:2303.04864 (2023).

[7] Xin, Huajian, et al. "LEGO-Prover: Neural Theorem Proving with Growing Libraries." arXiv preprint arXiv:2310.00656 (2023).

[8] First, Emily, et al. "Baldur: whole-proof generation and repair with large language models." arXiv preprint arXiv:2303.04910 (2023).

Comment

Dear Reviewer rYoT,

Did our rebuttal sufficiently address your concerns? Is there anything we can present that will convince you to increase your rating? We look forward to hearing from you.

Many thanks, Authors

Review
Rating: 6

The work introduces an automatically generated dataset for mathematical formalization, composed of informal-formal pairings. This dataset was generated by applying a reverse-translation approach to two formal corpora, using a large language model (LLM). Experimental results on two benchmark tests (100 examples) support the improvements achieved through fine-tuning on this synthesized dataset. A notable finding of the study is that fine-tuning on a multilingual (formal language) dataset can also yield benefits on a monolingual (formal language) benchmark.

Strengths

  • This work is the first effort to distil mathematical formalization from an LLM, a contribution, given the notable scarcity of parallel datasets for mathematical formalization.

  • Both the generated dataset and the fine-tuned model will be made publicly available for further research and development.

Weaknesses

  • The evaluation could benefit from further enhancement. Currently, a sample size of only 50 examples from the benchmark is used, which may not provide sufficiently convincing results due to potential statistical limitations.

  • As the authors also mention, the synthesized datasets could be noisy. It would be beneficial to also include GPT-4 in your comparative analysis of formalization quality. This would offer deeper insights into the noise levels within the generated datasets, i.e., how noisy they could be.

  • The term ‘language’ or ‘lingual’ as used throughout the paper, especially in the introduction, could potentially lead to confusion, although it is understandable with careful reading.

Questions

  • Considering that the total number of examples required for complete testing is approximately eight times the number of the sampled 50 examples used in the evaluation, I would appreciate more information about the overall effort required to execute the complete testing process.
Comment

We thank the reviewer for their insightful feedback. We are glad that the reviewer recognises the contribution of the dataset given the scarcity of mathematical autoformalization data. Below we respond to specific points raised by the reviewer:

  • Evaluation:

    • The main conclusions can be supported by 100 datapoints per model/inference configuration
      • We sample evaluation datapoints randomly (not cherry-picked)
      • With this much data, we already demonstrate very strong differences between the models to support our conclusions (16-18% vs 6-11% vs 0% of statements acceptable with trivial corrections).
    • It’s incredibly expensive to annotate the entire benchmarks in such detail
      • We would like to point out that for this paper, we manually examine 50 examples from each benchmark x 2 benchmarks x 6 different model/language inference settings = 600 individual examinations. Evaluating each example requires 2-10 minutes for expert-level formal mathematicians, and in practice it took ~30 expert hours to finish. This is quite expensive. To execute the complete testing process, it requires 8-12x more effort, hence roughly 240-360 expert hours. This is USD 12K - 18K at a rate of $50/hr for formal mathematics experts (>= PhD level).
    • However, we do understand that a sample size of only 50 examples per benchmark (10% of the entire dataset) is smaller than is traditional. Therefore, to address the issue of a small sample size, and since we release the evaluation datapoints, we will open the evaluation up to the Isabelle and Lean communities. We have already identified Isabelle and Lean experts who are willing to contribute to this annotation endeavour. Leveraging the power of open-source communities is the best way to scale the evaluations while maintaining their quality.
  • MMA dataset quality

    • Please see the overall comment for an in-depth analysis of the quality of the MMA dataset: https://openreview.net/forum?id=QqdloE1QH2&noteId=Z8PeCPad8b
    • To summarise the findings of the analysis:
      • Around 75% of the examined 200 problems were informalised completely correctly by GPT-4.
      • The failure cases are mostly due to ambiguity arising from the lack of appropriate context. Ameliorating this issue is an exciting avenue for future research, possibly leveraging retrieval-augmented generation.
  • Confusion about the use of “language”

    • When we use the word “language”, we try to make sure that there is relevant context around it to distinguish between a natural and a formal language.
Comment

Thanks for the explanation. It makes sense to me that the evaluation is very expensive at this moment and that it would be better to ask the community for help.

I’m glad that you took the suggestion from Reviewer 1cNi to use multi-language instead of multilingual. That would definitely improve the readability of the paper.

Comment

Thank you for your reply!

Have our rebuttal and additional analysis adequately addressed your concerns? What do you think it would take at this discussion stage to further improve the quality of the paper?

Comment

Although I accept it would be hard for you to finish a complete evaluation by the end of the discussion period, I still have concerns about the evaluation in the current version. Clarification alone is not enough for me to change my assessment. I'd like to keep my score at the moment.

Review
Rating: 8

This paper describes the creation of a parallel dataset of informal natural-language mathematical statements and their formal counterparts in two proof-assistant languages. The dataset, the largest of its kind, is created using the technique of back-translation. The paper then uses this dataset to fine-tune an LLM on the task of formalizing natural-language statements, which improves its performance substantially (up from 0%).

Strengths

This is an important problem, and the dataset will be helpful for many others working in this area. The dataset is the largest of its kind by far.

In the introduction, the fact that the dataset covers multiple formal languages does not necessarily sound like a selling point to me, until you point out that training a model on multiple formal languages helps. I think this is a strong point that could be emphasized more (if I understand it correctly, see below under Questions).

Weaknesses

I feel that the word "Multilingual" is confusing because it sounds like it works on multiple natural languages. However, I admit that "multi-formal-language" is awkward.

The data is automatically generated by back-translation ("informalization"). This is good because informalization appears to be an easier task than formalization and because it's more important for the target (formal language) side of the data to be high quality. However, it also does mean that the source (natural language) side of the data might not be correct or might not be as realistic as it could be. As the authors note, "the resulting MMA dataset is not perfect: Rather than the ground truth, informalisations in MMA should be treated as noisy approximations of it."

Questions

It's possible that I misunderstood the claim that training on mixed-language data helps. I did not understand the sentence "We emphasise that the jointly fine-tuned model has seen 3/4 Isabelle and 1/4 Lean4 tokens of the monolingual models, and conclude that fine-tuning with multiple formal languages is much more data-efficient than with single-formal-language autoformalization data" (p.6). The jointly fine-tuned model has a greater total amount of data, right? How many steps are in an epoch? By 40000 steps, has the jointly fine-tuned model seen more examples than the Lean4-only and Isabelle-only models?

Comment

We thank the reviewer for their insightful feedback. We are glad that the reviewer recognises the contribution of the paper. Below, we respond to specific questions:

  • The use of the word multilingual:

    • We think it’s an appropriate word, since there are multiple formal languages. However, we do admit that this word can cause confusion. We will make our best attempt to contextualise it by making clear that the languages refer to formal languages whenever we use the word “multilingual”.
  • Dataset quality

    • We thank the reviewer for this comment. Indeed the dataset is noisy since it was generated by GPT-4. Hence it warrants an in-depth analysis, which we perform in the overall comment: https://openreview.net/forum?id=QqdloE1QH2&noteId=Z8PeCPad8b
    • To summarise the gist of the analysis:
      • Around 75% of the examined 200 problems were informalised completely correctly by GPT-4.
      • The failure cases are mostly due to ambiguity arising from the lack of appropriate context. Ameliorating this issue is an exciting avenue for future research, possibly leveraging retrieval-augmented generation.
    • Moreover, in the paper, we demonstrate with comprehensive experiments that the dataset can be used in practice to boost the autoformalization ability of language models; and the ablation experiments highlight the importance of the multilinguality of formal languages.
  • Clarification of the mix-data training

    • For all fine-tuned language models, we perform 40000 steps of training. Each model was trained on 40000 * 8 (local batch size) * 2 (world size) * 512 (sequence length) ≈ 328 million tokens (including pad tokens). Each model has seen an equal number of tokens (a back-of-the-envelope sketch of this token accounting follows the breakdown below).
      • For the model trained on the mixture of two formal languages, one epoch is 332K sequences. It was trained for 3.3 epochs (~246 million Isabelle tokens, 82 million Lean4 tokens).
      • For the model trained on Isabelle only, one epoch is 244K sequences. It was trained for 4.4 epochs (~328 million Isabelle tokens).
      • For the model trained on Lean4 only, one epoch is 88K sequences. It was trained for 13.2 epochs (~328 million Lean4 tokens).
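For ease of verification, here is a minimal back-of-the-envelope sketch of the token accounting above, using only the figures stated in this comment (steps, batch size, world size, sequence length, and the roughly 3:1 Isabelle:Lean4 mixture); it is illustrative and not the actual training code.

```python
# Rough check of the per-model token budget described in the comment above.
steps = 40_000          # optimisation steps per fine-tuning run
local_batch = 8         # sequences per device per step
world_size = 2          # number of devices
seq_len = 512           # tokens per sequence (including pad tokens)

total_tokens = steps * local_batch * world_size * seq_len
print(f"tokens seen per model: ~{total_tokens / 1e6:.0f}M")  # ~328M

# The jointly fine-tuned model sees an Isabelle:Lean4 mixture of roughly 3:1,
# so its fixed token budget splits accordingly.
isabelle_tokens = total_tokens * 3 / 4
lean4_tokens = total_tokens / 4
print(f"Isabelle: ~{isabelle_tokens / 1e6:.0f}M, Lean4: ~{lean4_tokens / 1e6:.0f}M")
```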
Comment

It's not terribly scientific, but it appears that in the world of compilers, "multi-language compiler" is far more common than "multilingual compiler" as measured by Google hits (172M vs 4M).

Comment

That's a very good point! We shall rename it to multi-language when the word "multilingual" can cause confusion. Thank you very much for the comment!

Review
Rating: 5

Summary: This study centers on the autoformalization of natural language into machine-verifiable formalizations, employing a back-translation approach with GPT-4 to convert formal mathematical statements into their informal counterparts. Utilizing this newly constructed dataset, the authors further fine-tune various language models, achieving output that is acceptable with minimal adjustments on 16-18% of statements when benchmarked against miniF2F and ProofNet.

Strengths

The Multilingual Mathematical Autoformalization (MMA) dataset was ingeniously synthesized using GPT-4, resulting in a substantial collection of 332,000 informal-formal pairings across multiple formal languages. With its multilingual and multidomain composition, the dataset notably exceeds the size of the largest existing datasets in the field. The research showcases improved outcomes when compared against established baselines.

Weaknesses

The integrity of the MMA dataset warrants further examination, as it was entirely auto-generated via GPT-4, potentially leading to mismatched cases. The paper would benefit from an expanded discussion on the discrepancies and the inclusion of findings from human evaluation. The paper's innovative contribution seems incremental, largely relying on the automated capabilities of GPT-4 for dataset generation, which suggests a limited technical advancement.

Questions

How do the authors check the quality of the dataset?

Comment

We thank the reviewer for their careful examination of our paper and insightful feedback. We are glad that the reviewer recognises the contribution of the dataset. Below we address the specific points raised by the reviewer.

  • Dataset quality

    • We fully realise the importance of the dataset quality and hence conducted an in-depth examination of the dataset quality, posted in the overall comment: https://openreview.net/forum?id=QqdloE1QH2&noteId=Z8PeCPad8b
    • To summarise the findings of the analysis:
      • Around 75% of the examined 200 problems were informalised completely correctly by GPT-4.
      • The failure cases are mostly due to ambiguity arising from the lack of appropriate context. Ameliorating this issue is an exciting avenue for future research, possibly leveraging retrieval-augmented generation.
  • Contribution of the paper

    • Our contributions are:
      1. Creating a large dataset
      2. Training multiple language models on it with different settings
      3. Evaluating these language models on held-out benchmarks manually to demonstrate that the dataset helps autoformalization
      4. Conducting ablation experiments to show that multilinguality in formal languages benefits autoformalization.
    • This will enable other researchers to conduct meaningful work on top of ours and advance science. As an investigative scientific work, we think that our contributions merit a broader audience.
Comment

Dear Reviewer d1G8,

Did our rebuttal sufficiently address your concerns? Is there anything we can present that will convince you to increase your rating? We look forward to hearing from you.

Many thanks, Authors

Comment
  • We want to thank all the reviewers for requesting a more detailed study of the MMA dataset quality. Since there is no automated metric that checks the correctness of the formal-informal alignment, we only presented 4 examples in the paper. However, we fully grasp the importance of this issue, and have proceeded to conduct a more detailed taxonomy of correct/incorrect MMA datapoints. Note that this is a preliminary taxonomy, since we want to present an analysis during the discussion phase. We will refine it by inviting more formal mathematics experts to work on it for the final version.

  • Before presenting the taxonomy, we would like to point out that the taxonomy is a direct analysis of the MMA dataset quality, while the improvement of the large language model on autoformalization is an indirect measure. With these two measures, we are convinced that the MMA dataset is of high quality and can be of great use to autoformalization research.

  • Taxonomy and analysis:

    • We randomly picked 100 statements from the Lean part of the MMA dataset and 100 statements from the Isabelle part of the MMA dataset. We then manually examined each formal-informal pairing carefully, rating each problem on the following axes (a minimal sketch of an annotation record in this format is given at the end of this comment):

      • Correctness (whether the informalisation is completely correct): True or False
      • Judgment Confidence (the confidence of the assessment): 0 - 5
      • Hallucination (whether the informalisation includes content not intended in the formal statement): True or False
      • Misunderstanding concept (taking one concept in the formal statement for a different one): True or False
      • Incorrectly translating assumption: True or False
      • Incorrectly translating conclusion: True or False
      • Incorrectly translating type: True or False
    • We find that 67% of Lean statements are informalised correctly, and 81% of Isabelle statements are informalised correctly, with a total correctness rate of 74%. Based on this statistic, we estimate the total correctness rate of the MMA dataset to be similar. We note that this is a relatively good correctness ratio in theorem proving: In [1], they found that even when only 25.3% of the autoformalization statements are completely correct, downstream applications were still able to benefit drastically from the parallel dataset.

    • Breakdown of failure reasons: We have a total of 52 statements that were informalised incorrectly. Here we show their failure reasons. Note that one incorrect statement can have multiple failure reasons.

      • Misunderstanding concept: 29
      • Hallucination: 8
      • Incorrectly translating assumption: 11
      • Incorrectly translating conclusion: 8
      • Incorrectly translating type: 12
    • We notice that the main reason informalisation mistakes happen is ambiguity in the formal statement itself when it lacks proper context. This should be improved in future work by incorporating relevant definitions and theorems.

    • Confidence

      • In rating these informalisations, we have an average confidence of 3.64 with a standard deviation of 1.19. Since the confidence is not extremely high, we will open up an avenue for the community to contribute better annotations than ours.
    • We want to thank the reviewers again for raising the need for such an analysis, as it drastically improves the quality of the paper. We will incorporate this analysis in the camera-ready version, with the main conclusions in the datasets section and the detailed analysis in the appendix.
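For concreteness, the following is a minimal sketch of how a single annotation record along the axes listed above could be represented; the field names and the aggregation helper are our own illustration, not a released schema.

```python
# A minimal sketch of one annotation record along the axes listed above.
# Field names are illustrative assumptions; they are not a released schema.
from dataclasses import dataclass

@dataclass
class InformalisationAnnotation:
    formal_statement: str            # the Isabelle or Lean4 statement
    informal_statement: str          # GPT-4's natural-language rendering
    correct: bool                    # informalisation completely correct?
    judgment_confidence: int         # annotator confidence, 0-5
    hallucination: bool              # content not intended in the formal statement
    misunderstood_concept: bool      # a concept taken for a different one
    wrong_assumption: bool           # an assumption translated incorrectly
    wrong_conclusion: bool           # the conclusion translated incorrectly
    wrong_type: bool                 # a type translated incorrectly

def correctness_rate(records: list[InformalisationAnnotation]) -> float:
    """Aggregate correctness over annotations, e.g. 74% over 200 sampled statements."""
    return sum(r.correct for r in records) / len(records)
```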

[1] Wu, Y., Jiang, A.Q., Li, W., Rabe, M., Staats, C., Jamnik, M. and Szegedy, C., 2022. Autoformalization with large language models. Advances in Neural Information Processing Systems, 35, pp.32353-32368.

AC Meta-Review

The paper presents an automatically generated Multilingual Mathematical Autoformalization dataset (=dataset of natural language and machine verifiable formalism pairs).

Why not a higher score

Since the entire dataset is automatically generated via GPT-4, the paper introduces a technique rather than a dataset that is implicitly valuable.

Why not a lower score

n/a

Final Decision

Reject