When Does Metadata Conditioning (NOT) Work for Language Model Pre-Training? A Study with Context-Free Grammars
We reveal why prepending metadata during pre-training alone sometimes helps and sometimes hinders an LM's performance: it depends on whether the context suffices to infer the latent semantics.
Abstract
Reviews and Discussion
This paper investigates when providing metadata at the beginning of text documents during pre-training of large language models improves the results. The authors generate synthetic data using PCFGs for their experiments and conclude that prepending metadata is useful only when the context is long enough.
The goal of the paper is certainly interesting. Being able to understand when a technique will improve results is a worthy goal. However, it is not clear to me whether the setup that the authors use is realistic; relying on artificial data generated by PCFGs can introduce bias.
Furthermore, the paper lacks important details. For example, there is no information about the size of the data generated.
Finally, the section on the interpretation of the results and findings lacks concrete connections to the actual results (and contains no data).
Reasons to Accept
The main reason to accept this paper would be the attempt it makes to investigate the impact of prepending metadata to the training documents during pre-training.
Reasons to Reject
The main reason to reject the paper is the limited setup relying on PCFGs to generate artificial data. The overall setup is not realistic.
Furthermore, the paper lacks important details. For example, there is no information about the size of the data generated.
Finally, the section on the interpretation of the results and findings lacks concrete connections to the actual results (and contains no data).
Questions to Authors
- Could you please clarify how much data is used? What is the effect of the amount of data?
- Is there any data to back up the interpretation provided in Section 4?
Dear Reviewer xHQg,
Thank you for your constructive feedback on our paper. We appreciate you recognizing the interesting goal of our work in investigating the impact of prepending metadata during language model pre-training.
Generalizability of PCFGs: We appreciate your engagement with our rationale for using PCFGs to generate artificial data. For a detailed discussion, we kindly refer you to our Official Comment by Authors where we elaborate on this point. We believe this controlled approach provides valuable foundational insights into the underlying mechanisms.
We also appreciate your comments regarding the need for specific details. Your feedback is helpful in improving the clarity of our presentation.
- Regarding data size: We will ensure that the exact size of the datasets used in our experiments is clearly stated in the revised version of the paper and in our Official Comment by Authors. Regarding the effect of data size, this study focuses on the role of metadata in learning, especially when the underlying structure is hard to capture without it, so varying the data size is beyond our scope. We believe our main conclusion, that learning PCFGs without metadata is difficult, would hold even with more data. In fact, [1] observed similar trends between a 100M-token dataset ("finite setting") and a 4.9B-token dataset ("infinite setting"). Our 1B-token setup, already ten times larger than the former, suggests that further scaling would not significantly change the outcome.
- Regarding Section 4: To clarify, Section 4 is not intended to present new hypotheses, but rather to offer interpretations of our empirical results in Section 3. We believe that our original manuscript did establish a connection from the data, figures, and tables to our explicitly stated "Finding 1" and "Finding 2", which then directly informed the interpretations presented in Section 4. However, we acknowledge that this link may not have been as explicit or clear as it could have been. To make our arguments clearer, the table below outlines how the claims derived in §4 (our implications) correspond to the numerical outcomes (our results). Additionally, we will revise Section 4 to more directly and clearly connect our interpretations and results.
| Claim derived in §4 | Where it is tested | Numerical outcome |
|---|---|---|
| (P 1) Training with or without metadata should give indistinguishable next-token loss as long as inference is also done without metadata. | Figure 3.1, left pane | All four models (DM = 0, 1, 3, 5) converge to the same loss curve; differences are within noise. |
| (P 2) If the model is trained with metadata, its internal predictor stays weak, so metadata-probing accuracy collapses on short prompts and recovers only when the prompt becomes long. | Figure 3.2 & full-layer plots in App. D | For DM = 3,5 the probing accuracy at prompt length 5 is near chance, then rises steeply by length 50–100, exactly as predicted. |
| (P 3) The downstream "grammatical-accuracy" task should suffer on short prompts but surpass the no-metadata baseline on long prompts. | Table 3.1 | At prompt length 5, accuracy for DM = 3,5 plummets (0.072 and 0.000 vs 0.810). By length 25–50, the DM = 3 model overtakes the baseline. |
| (Aux.) If metadata is present at inference, the loss should drop below the no-metadata level. | Figure 3.1, right pane | Loss with metadata heads well below the curve in the left pane. |
| (Aux.) No hidden positional effects in the language-model objective. | Table B.1 | Loss at token positions 0–50 is flat across all DM settings. |
We value your feedback, as it helps us improve the clarity and completeness of our work. We hope these clarifications in our revision and official response will better demonstrate our study's contributions. We would be deeply appreciative if you would consider increasing the score once all issues are resolved.
Thank you again for your review.
Sincerely,
Authors
[1] Zeyuan Allen-Zhu, Yuanzhi Li. Physics of Language Models: Part 1, Learning Hierarchical Language Structures. arXiv Preprint arXiv:2305.13673. 2023.
After reading the author response and the other reviews, I have decided to increase my score.
The paper investigates the effect of metadata conditioning in training language models. The authors conduct a series of experiments using probabilistic context-free grammars (PCFGs) to analyze how prepending metadata to the training data influences the model's ability to learn semantic structures. They find that metadata conditioning improves performance on downstream tasks when the task's prompt is long, but degrades performance when the prompt is short and the latent semantics cannot be inferred.
Reasons to Accept
- The paper provides an in-depth analysis of the effect of metadata conditioning, highlighting the conditions under which it helps the learning of semantics and when it does not. This is a valuable contribution to understanding the role of metadata in training language models.
- The paper presents both clear empirical results with controlled interpretable data and a theoretical analysis, providing novel insight into the underlying mechanisms of metadata conditioning.
Reasons to Reject
The experiments with PCFGs, although well designed, still fall short in experimental rigor. Hyperparameters that could affect the results are not sufficiently explored, making it hard to draw strong conclusions from the results. Some improvements to the experiments could strengthen the conclusions:
- Model size: due to the well-known phenomenon of "emergent abilities" in LLMs, larger models may exhibit different behaviors in learning semantics with or without metadata. The authors could consider testing models of different sizes.
- Large-scale pretraining: the authors should consider the impact of large-scale pretraining on the model's ability to learn semantics with metadata conditioning. Large-scale pretraining typically includes a large amount of code data and structured data, which could significantly affect the model's ability to model the grammar in PCFG-generated data. For example, this can be done by performing continual pretraining using PCFG data on top of Meta's LLaMA model weights.
- Size of training data: the authors mention that "Without metadata, the model needs to learn the hierarchical mixture of grammars directly from data, which is challenging due to the diverse structure of possible grammars" in Section 4. Data size is a critical factor determining whether the model can successfully learn the grammar without metadata. Therefore, the authors should consider training the model with different data sizes to see whether the observation generalizes.
- Longer prompt length in Table 3.1: The results in Table 3.1 do not seem to show a consistent advantage of metadata conditioning even at longer prompt lengths. The authors could consider testing the model with longer prompt lengths, as in Figure 3.2, to check whether the conclusion "for tasks with long prompts, metadata conditioning improves the performance" (Finding 2) holds.
Questions to Authors
- The metadata is provided starting from the top-level rules. Could the result be different if the metadata is provided starting from the bottom-level rules? The bottom-level rules are applied last in the generation process, so the model may be able to infer those rules easier from data.
- In line 223, it is not quite clear why "learning the richer posterior given metadata p(G_L|x, j) is facilitated". The posterior is sharper because of the metadata j, but it does not necessarily mean that the model is learning better.
- This also links to the conclusion in line 296 "When x is rich, ... it can implicitly construct p(G_L|x) more accurately". Even when x is long, predicting p(j|x) may still be imperfect, and this error will propagate to the posterior p(G_L|x, j). The advantage of modeling posterior through metadata j versus directly modeling thus seems not very clear.
Dear Reviewer StBC,
We sincerely thank you for your thorough review and valuable feedback. We have carefully considered all your concerns and we appreciate the opportunity to address them.
- Regarding model size: The effect of adding metadata on learning outcomes in larger model sizes is investigated in [1]. Specifically, it is reported that while adding metadata shows little impact on perplexity in a 1.6B model, performance on downstream tasks varies in models of 600M, 1.6B, 3B, and 8B sizes. We have confirmed that the phenomenon, where perplexity does not significantly change but downstream task performance does, is also reproduced in a smaller 100M-scale model. Accordingly, we focus on elucidating the mechanisms behind the phenomenon in this setting. Consequently, while we recognize that investigating potential differences in behavior resulting from varying model sizes is an interesting research direction, we consider it to be outside the direct scope of this paper.
- Regarding large-scale pretraining: The primary focus of this research is not to examine the effect of metadata in the process of continual learning. Instead, our aim is to verify the effect of metadata in standard pre-training (i.e., pre-training that does not assume any prior learning has been conducted). Therefore, we believe that the investigation into the relationship between continual learning and metadata, which you proposed, falls outside the direct scope of this research. However, we recognize your suggestion as an important consideration for future work.
- Regarding data size: The primary objective of this research is to investigate the impact of the presence or absence of metadata on learning, particularly in situations where it is difficult to capture the full structure of the data without metadata; we are not exploring the limits of learning capability without metadata. Therefore, we consider experiments with different data sizes to be outside the scope of this research. Furthermore, even if the training data size were increased, we believe it is unlikely that our main conclusion regarding the difficulty of grammar learning without metadata would change significantly. [2] reports that similar trends were observed when comparing a 100M-token dataset ("finite setting") with a 4.9B-token dataset ("infinite setting") for PCFG learning. Our experiments used 1B tokens, 10 times the "finite setting", suggesting that further increases in data size are unlikely to significantly change the learning outcomes.
- Regarding Table 3.1: In Table 3.1, while there is a localized exception where performance with metadata=5 is not good at a prompt length of 50, the general trend is that metadata conditioning (metadata=1, 3, 5) performs worse than no metadata (metadata=0) at shorter prompt lengths but improves as prompt length increases. It is also crucial to consider our evaluation methodology: sequences from the evaluation data shorter than the given prompt length are excluded. As a result, increasing the prompt length means the evaluation focuses on longer, and therefore more challenging, data instances. Since increasing task difficulty can obscure metadata's benefits, we believe testing with longer prompts might not be conducive to clearly demonstrating its effect.
- Regarding how to assign metadata: As you pointed out, bottom-level rules could potentially affect how easily the model infers rules from the data. However, since our primary focus was on verifying the effect of metadata's presence itself on learning, we currently do not have clear insights into the impact of different metadata assignments.
- Regarding concerns about lines 223 and 296: Our framework in Section 4 provides a way to interpret the trade-offs observed in our experimental results (Observation, Finding 1, Finding 2 in Section 3), and is not intended to explain the underlying "why" of these learning mechanisms by itself. Therefore, if this framework appears unclear, we believe its clarity becomes evident when considered alongside the experimental results in Section 3. However, since we acknowledge that this connection might not be sufficiently clear, we will revise Section 4 to link the interpretations and results more directly. To help illustrate this connection, we will also post a table showing this correspondence in the following comment.
Thank you once again for your insightful comments. We hope our responses have adequately addressed your concerns. We would be deeply appreciative if you would consider increasing the score once all issues are resolved.
[1] Gao, Tianyu, et al. "Metadata Conditioning Accelerates Language Model Pre-training." arXiv preprint arXiv:2501.01956 (2025).
[2] Allen-Zhu, Zeyuan, and Yuanzhi Li. "Physics of Language Models: Part 1, Learning Hierarchical Language Structures." arXiv preprint arXiv:2305.13673 (2023).
Dear Reviewer StBC,
Further to our previous response, we are providing a table below to illustrate the correspondence between our experimental results in Section 3 and their interpretation within the framework presented in Section 4.
| Claim derived in §4 | Where it is tested | Numerical outcome |
|---|---|---|
| (P 1) Training with or without metadata should give indistinguishable next-token loss as long as inference is also done without metadata. | Figure 3.1, left pane | All four models (DM = 0, 1, 3, 5) converge to the same loss curve; differences are within noise. |
| (P 2) If the model is trained with metadata, its internal predictor stays weak, so metadata-probing accuracy collapses on short prompts and recovers only when the prompt becomes long. | Figure 3.2 & full-layer plots in App. D | For DM = 3,5 the probing accuracy at prompt length 5 is near chance, then rises steeply by length 50–100, exactly as predicted. |
| (P 3) The downstream "grammatical-accuracy" task should suffer on short prompts but surpass the no-metadata baseline on long prompts. | Table 3.1 | At prompt length 5, accuracy for DM = 3,5 plummets (0.072 and 0.000 vs 0.810). By length 25–50, the DM = 3 model overtakes the baseline. |
| (Aux.) If metadata is present at inference, the loss should drop below the no-metadata level. | Figure 3.1, right pane | Loss with metadata heads well below the curve in the left pane. |
| (Aux.) No hidden positional effects in the language-model objective. | Table B.1 | Loss at token positions 0–50 is flat across all DM settings. |
We hope this table helps clarify how our framework provides a way to understand the trade-offs and phenomena observed in our experiments.
Thank you once again for your consideration.
Thank you for the explanation. I agree with some of the authors' arguments, although I still think providing more experiment results could significantly strengthen the conclusion of the paper. I prefer to keep my current score, as it is already on the positive side.
- The paper studies "metadata preconditioning", a phenomenon where prepending metadata to pre-training sequences seems to improve language modeling, specifically downstream performance [1].
- The paper proposes to study this phenomenon via D-level probabilistic context-free grammars (D-PCFG), a framework proposed in earlier work [2] that can serve to generate controlled training sequences for language models.
- Specifically, under D-PCFG the authors propose to use the selected production rules of the CFG as the "metadata" for the training sequence
- The paper then performs a series of experiments in the controlled D-PCFG setup described above, studying:
- how metadata preconditioning affects the test loss (finding: it doesn't affect it)
- whether metadata preconditioning helps models learn representations (hidden states) that contain information about the metadata, by using trained probes (finding: it doesn't help, and in fact makes probes worse); see the probe sketch after this list
- how metadata preconditioning affects downstream performance, as defined by the model's ability to generate grammatically correct sequences (finding: metadata preconditioning hurts if prompt is short, but helps if prompt is long)
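To make the probing setup concrete, here is a minimal sketch of a linear metadata probe trained on frozen hidden states. The dimensions, classifier choice, and random stand-in features are illustrative assumptions, not the paper's actual probing code.

```python
# Minimal sketch of metadata probing: a linear classifier trained on frozen
# hidden states to predict the metadata label of each sequence.
# All shapes and data here are illustrative stand-ins, not the paper's setup.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
hidden_dim, num_metadata_classes, n = 256, 8, 2000

H = rng.normal(size=(n, hidden_dim))               # hidden states at a fixed layer/position
y = rng.integers(0, num_metadata_classes, size=n)  # ground-truth metadata labels

probe = LogisticRegression(max_iter=1000).fit(H[:1500], y[:1500])
print("probing accuracy:", probe.score(H[1500:], y[1500:]))  # near chance on random features
```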
Refs:
[1] Gao, Tianyu, et al. "Metadata Conditioning Accelerates Language Model Pre-training." arXiv preprint arXiv:2501.01956 (2025).
[2] Allen-Zhu, Zeyuan, and Yuanzhi Li. "Physics of Language Models: Part 1, Learning Hierarchical Language Structures." arXiv preprint arXiv:2305.13673 (2023).
Reasons to Accept
- A1. The paper studies a relevant problem, as a direct follow-up of Gao et al. (2025) [1] using the techniques from Allen-Zhu and Li (2023) [2].
- A2. The paper is well-written and well-executed. Experimental setup and takeaways are clear. The formatting of the paper is nice and clean.
- A3. The analysis provided in section 4 provides useful information of the conducted experiments.
- A4. The paper is strong along the "Understanding Depth, Principled Approach" dimension of the reviewing guidelines (https://colmweb.org/ReviewGuide.html)
Reasons to Reject
R1. As with other attempts to reduce real-world natural language modeling to toy tasks, a general weakness of this work is that it is unclear why the results would necessarily transfer to language modeling on natural text.
- Yes, the PCFG + metadata construction is intuitive, but the exact correspondence to real-world data seems unclear
- That is, for the claim in the conclusion ("metadata conditioning enhances performance on downstream tasks with sufficiently long task prompts that allow for the prediction of latent semantics, but it degrades performance on tasks with task prompts too short to predict latent semantics."), the paper justifies why this is the case for D-PCFG and makes an inductive reasoning step that this phenomenon must also extend to natural language. My concern is that this inductive step isn't valid.
R2. (As a result of R1) The paper would benefit from providing more examples and more intuition on the problem.
- While section 4 provides interpretations on the D-PCFG results, I'd be more interested in intuitions that map the provided setup to a realistic, natural language task.
R3. Even within the D-PCFG setup, it seems reductive to equate "Downstream Performance" in natural language modeling to "grammatical accuracy". I.e., the task setups feel contrived and detached from real-world settings.
- For example, one counter-argument is that Gao et al. measure "downstream performance" on multiple tasks, whereas in this paper's setup there is only one possible downstream task
Questions to Authors
My main question would be whether there's anything else that the authors could add to make clearer the correspondence from the toy tasks to the real-world language modeling setting.
Minor:
- L57: "As shown in (A1)" --> perhaps better explicitly refer to the bottom left plot
Dear Reviewer dKwt,
Thank you for your detailed and constructive review of our paper. We appreciate your positive feedback on our paper's relevance, clarity, analysis, and principled approach.
We also appreciate your thoughtful concerns regarding the generalizability of our D-PCFG findings to natural language (R1, R2) and our definition of "downstream performance" (R3).
Regarding Generalizability (R1 & R2): We will provide a more detailed explanation regarding our approach to grammar complexity in our Official Comment by Authors, which we hope will further clarify our rationale.
Regarding Downstream Performance (R3): We also understand your point about equating "downstream performance" primarily with "grammatical accuracy" in our D-PCFG setup.
Grammatical accuracy represents one of the most fundamental autoregressive generation capabilities, as investigated in [1]. Unlike next-token prediction loss, this metric allows us to assess phenomena such as the gradual degradation of sentence structure when generating long sequences. As grammatical correctness indicates an understanding of language structure, a model’s ability to produce well-formed and coherent sequences is foundational for any meaningful linguistic output.
We believe that while natural language encompasses a broader range of downstream tasks, our chosen metric is a relevant and informative proxy for the specific aspects of metadata conditioning we aimed to study.
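As a concrete illustration of what this metric measures (using a made-up toy grammar and hypothetical model outputs, not the grammars or evaluation code from the paper), grammatical accuracy can be computed as the fraction of generated sequences that the known grammar can parse:

```python
# Toy illustration of "grammatical accuracy": the fraction of generated
# sequences that the (known) grammar can parse. The grammar and the two
# "model outputs" below are hypothetical, not taken from the paper.
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> 'det' 'noun' | 'noun'
VP -> 'verb' NP | 'verb'
""")
parser = nltk.ChartParser(grammar)

def is_grammatical(tokens):
    """True iff the token sequence has at least one parse under the grammar."""
    return any(True for _ in parser.parse(tokens))

generations = [["det", "noun", "verb"], ["verb", "det"]]  # hypothetical completions
accuracy = sum(is_grammatical(g) for g in generations) / len(generations)
print(f"grammatical accuracy: {accuracy:.2f}")  # 0.50 on this toy example
```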
Minor Point: Thank you for pointing out L57; we will revise "As shown in (A1)" to refer more explicitly to the specific plot for clarity.
We are encouraged by your positive assessment. We hope that our clarifications and the further details in our official response will address your concerns. We would be sincerely grateful if you could consider raising the score once the concerns have been addressed.
Thank you once again for your time and insightful feedback.
Sincerely,
Authors
[1] Zeyuan Allen-Zhu, Yuanzhi Li. Physics of Language Models: Part 1, Learning Hierarchical Language Structures. arXiv Preprint arXiv:2305.13673. 2023.
I thank the authors for the rebuttal. I have read the authors' responses and the other reviews. My current judgement is that:
- I agree with both the strengths and weaknesses raised by the other reviewers
- In particular, I share the concern of Reviewer xHQg, who argues for rejection, that the paper's setup isn't very realistic.
- Unlike Reviewer xHQg, however, I lean towards accepting the paper (for my listed reasons to accept) and believe that the paper may be of interest to some parts of the community.
I will maintain my current score.
This paper investigates the impact of prepending metadata during language model pre-training using controlled experiments with artificial data generated from context-free grammars. The authors examined why previous research has shown inconsistent benefits from metadata conditioning, where improvements appear in some downstream tasks but not in the average next-token prediction loss.
Specifically, first, the researchers create synthetic data using probabilistic CFGs, where the metadata represents information about which production rules were used to generate each sequence. Second, they train transformer models under different conditions, varying the metadata provided during pre-training. Finally, the authors provide a theoretical framework explaining this phenomenon: models trained with metadata become better at predicting outputs given both context and metadata (p(y|x,j)) but worse at predicting metadata from limited context (p(j|x)). This creates the observed trade-off in performance depending on prompt length.
Reasons to Accept
- Well-designed controlled experiments: The use of synthetic data generated from CFGs allows for precise control over latent semantics and enables systematic investigation of how metadata conditioning affects model behavior.
- Evaluation: The authors use multiple metrics (next-token prediction, metadata-probing accuracy, grammatical accuracy) to evaluate model performance, providing a more complete picture of metadata's effects.
- "Theoretical" analysis framework: The paper develops a solid mathematical framework explaining why the observed effects occur, connecting empirical findings to theoretical understanding.
- Findings are interesting: The research provides actionable insights for practitioners, suggesting that the amount of metadata used during pre-training should be tailored to the expected context length in downstream tasks.
Reasons to Reject
My major concern is the grammar complexity. The experiments use a relatively simple grammatical structure with binary choices at each level. While this simplification is justified for controlled experiments, it raises questions about generalizability to more complex real-world language structures. To that point, most of the data and tasks are synthetic, so I am not sure how the findings will generalize to real-world tasks.
Questions to Authors
See reasons to reject
Thank you very much for your thoughtful review and positive assessment of our paper. We are grateful for your recognition of our controlled experiments, thorough evaluation, theoretical framework, and interesting findings.
We appreciate your insightful comment regarding the grammatical complexity of our synthetic data and its generalizability to real-world language structures and tasks. We will provide a more detailed explanation regarding our approach to grammar complexity in our Official Comment by Authors, which we hope will further clarify our rationale.
Thank you again for your valuable feedback and for recommending our paper for acceptance.
Sincerely,
Authors
I thank the authors for their response; I remain positive.
Dear Reviewers,
We would like to express our sincere gratitude for your thoughtful and constructive comments. In this general response, we address the concerns that were commonly raised by multiple reviewers.
Justification of our synthetic dataset: We believe that our D-level PCFG is a sufficient and appropriate setting for achieving our research objectives. First, it is well established that PCFGs can express a wide variety of languages (see, e.g., [1]). Additionally, previous studies have also utilized data generated using PCFG to investigate the pretraining of LLMs [2, 3, 4, 5, 6].
Furthermore, our experimental setting is designed so that the metadata can be associated with various attributes observed in natural language. While one might intuitively think that, since we use synthetic data generated by a PCFG, the metadata only reflects grammatical differences, this is not actually the case.
For example, the metadata in our synthetic dataset can be considered analogous to the topic of a sentence or the URL from which a sentence originates. In natural language, changes in topic or source URL often lead to shifts in word frequency distributions. Similarly, in our setting, different metadata values correspond to different PCFG trees, resulting in changes in the frequency of terminal symbols.
We also posit that the metadata in our synthetic dataset can correspond to aspects such as the tone (e.g., explanatory vs. conversational) or genre (e.g., news, biography, dictionary/Wikipedia, or fiction) of natural language. In natural language, such changes often lead to grammatical shifts—e.g., more formal and complex structures in encyclopedic genres vs. simpler and more casual constructions in spoken or informal texts. Our PCFG-based setting replicates this: different metadata values lead to different underlying grammars, which are consistent within the data associated with a given metadata value, just as in natural language.
Thus, the metadata in our D-level PCFG setting captures a variety of metadata types conceivable in natural language. We therefore believe our dataset meaningfully reflects certain aspects of natural language metadata and is a valid framework for studying the effect of metadata conditioning. While it is synthetic, it is not an unrealistic abstraction.
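To make this correspondence concrete, the following toy sketch (with made-up grammars and symbols, not our actual data generator or vocabulary) illustrates how different metadata values select different production rules, shift the terminal-symbol distribution, and are optionally prepended to each sequence as a token:

```python
import random

# Toy illustration only: these two grammars are hypothetical stand-ins for the
# grammars selected by different metadata values in a D-level PCFG setup;
# they are NOT the grammars used in the paper.
GRAMMARS = {
    "<meta=0>": {  # e.g., a "news-like" source
        "S": [["NP", "VP"]],
        "NP": [["det", "noun"], ["noun"]],
        "VP": [["verb", "NP"], ["verb"]],
    },
    "<meta=1>": {  # e.g., a "conversational" source with different rule preferences
        "S": [["VP"], ["NP", "VP"]],
        "NP": [["pron"]],
        "VP": [["verb"], ["verb", "adv"]],
    },
}

def expand(symbol, grammar, rng):
    """Recursively expand a nonterminal into a list of terminal symbols."""
    if symbol not in grammar:           # terminal symbol
        return [symbol]
    rule = rng.choice(grammar[symbol])  # pick a production (uniformly, for simplicity)
    out = []
    for s in rule:
        out.extend(expand(s, grammar, rng))
    return out

def sample_document(metadata, rng, prepend_metadata=True):
    """Sample one training sequence; optionally prepend the metadata token."""
    tokens = expand("S", GRAMMARS[metadata], rng)
    return ([metadata] if prepend_metadata else []) + tokens

rng = random.Random(0)
print(sample_document("<meta=0>", rng))                          # with metadata prepended
print(sample_document("<meta=1>", rng, prepend_metadata=False))  # without metadata
```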
Finally, we would like to emphasize that natural language is inherently complex, and capturing all of its properties is beyond the scope of our study. We intentionally used synthetic data with controllable metadata in order to isolate and investigate the effect of metadata in pre-training. Generalizing findings from controlled experiments to natural language is, in itself, an ambitious challenge and not the central aim of our work.
Lack of description on the experimental settings: We sincerely agree with the reviewers who pointed out that our paper lacks sufficient detail regarding the experimental settings. Below, we provide additional information, which we will incorporate into the revised version of the paper.
- The number of data samples used for pre-training is 500,000, and the training is conducted over a single epoch. We note that the difference in results between the models with and without metadata is not due to overfitting.
- For the batch size, optimizer, and learning rate, we followed the setup in [2]. Specifically, we used a batch size of 96, the AdamW optimizer, a weight decay of 0.1, and a learning rate of 0.0003 (a configuration sketch follows this list).
- Details about the model size and D-level PCFG settings are provided in lines 128–137 of the paper.
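For convenience, the settings above are collected in the sketch below; the field names are illustrative rather than taken from our training code, and unstated hyperparameters (e.g., the AdamW betas) are deliberately left at library defaults rather than guessed.

```python
# Hedged summary of the pre-training configuration stated above.
# Field names are illustrative; values come from the text (500k documents,
# ~1B tokens, one epoch, batch size 96, AdamW, weight decay 0.1, lr 3e-4).
from dataclasses import dataclass

@dataclass
class PretrainConfig:
    num_documents: int = 500_000             # training samples, single epoch
    approx_num_tokens: int = 1_000_000_000   # ~1B tokens in total
    epochs: int = 1
    batch_size: int = 96
    optimizer: str = "AdamW"                 # betas not stated; library defaults assumed
    weight_decay: float = 0.1
    learning_rate: float = 3e-4

print(PretrainConfig())
```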
We have also addressed reviewer-specific concerns individually in their respective responses. We would be happy to address any further questions or clarifications you may have.
Sincerely,
Authors
[1] G. K. Pullum and G. Gazdar. Natural languages and context-free languages. Linguistics and Philosophy 4 (1982).
[2] D. Hupkes, V. Dankers, M. Mul, E. Bruni. Compositionality decomposed: how do neural networks generalise? Journal of Artificial Intelligence Research, 67, 757-795 (2020).
[3] S. Bhattamishra, K. Ahuja, N. Goyal. On the Ability and Limitations of Transformers to Recognize Formal Languages. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (2020).
[4] Z. Allen-Zhu and Y. Li. Physics of language models: Part 1, learning hierarchical language structures. arXiv preprint arXiv:2305.13673 (2023).
[5] H. Zhao, A. Panigrahi, R. Ge, S. Arora. Do Transformers Parse while Predicting the Masked Word? In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (2023).
[6] J. Jumelet, W. Zuidema. Transparency at the Source: Evaluating and Interpreting Language Models With Access to the True Distribution. In Findings of the Association for Computational Linguistics: EMNLP 2023 (2023).
This paper studies the impact of metadata conditioning on a synthetic dataset generated from PCFGs, where the latent variables are known and can be added as metadata. The authors vary and ablate the metadata and draw conclusions about the impact on test loss, metadata prediction, and downstream performance as a function of prompt length. While reviewers felt that there could be more connections between the findings on PCFGs and real data, they noted that this was generally a clean and interesting setup for studying this problem. Some reviewers felt that there could be more comprehensive sweeps of different aspects such as data size and model size. Generally, they felt that the paper provided interesting findings for the community, although only on synthetic setups.