Thanks for your insightful response!

We will continue to answer your questions below.

A1: Thank you for your valuable feedback! We will carefully consider your suggestions and make targeted improvements in the subsequent version of our manuscript.

A2: We sincerely apologize for our previous misunderstanding of your question. Please allow us to provide a clarified explanation:

This issue is somewhat similar to the first question. In fact, we presented the results of mixed domain pre-training in the first experiment of our paper (Figure 1(a)). We observed that under this setting, various sequence modeling architectures achieved similar language modeling test performance when reaching comparable pre-training levels. This led us to conclude that under the mixed domain pre-training setting with in-distribution test, it might be difficult to distinguish between the merits of different architectures. Consequently, we proceeded to explore the use of out-of-distribution test settings, which we subsequently adopted for all following experiments.

When implementing the out-of-distribution test settings, we considered two approaches: 1) maintaining the mixed domain pre-training setting while seeking new, unused domain data for test; or 2) adopting a limited domain pre-training setting where certain domains are reserved exclusively for test. As you rightly pointed out, the first approach better aligns with real-world scenarios. However, its main challenges lie in collecting additional new domain data, and these new domains might not be widely familiar, potentially appearing artificially selected. The second approach, while differing from mainstream pre-training methods, is more practical to implement and ensures that out-of-distribution domains are well-known. Moreover, since this setting was designed specifically for architecture analysis, it doesn't conflict with or replace large-scale mixed domain pre-training for practical applications. Based on these considerations, we chose the limited domain pre-training setting to achieve our out-of-distribution test objectives. However, as you rightly pointed out, this approach comes at the cost of losing the opportunity to directly validate the architecture's OOD generalization capability under mixed domain pre-training settings. That said, we believe out-of-distribution generalization is a universal challenge—mixed domain pre-training can hardly cover all possible domains, making it somewhat similar to limited domain scenarios in this regard. We hope to further explore and validate this aspect in future work.

We understand your additional concern regarding our lack of demonstration of the proposed architecture's performance under mixed domain pre-training. This primarily relates to our early decision to adopt the limited domain pre-training setting after the first experiment. However, during the preliminary stages of this work, we did conduct experiments with the Top-1 Element Selection architecture under mixed domain pre-training. We present these results below:

(1) When Model parameters≈110M: Language modeling test performance under mixed domain pre-training

As we cannot insert a figure here, we present some data points where the pre-training levels can be aligned in a tabular format for your reference (analogous to observing the differences in vertical coordinate values while fixing the horizontal coordinate values in Figure 1(a)) :

Model	Pre-Training Loss	Loss on Mixed Domain Test Set (ID)
Transformer++	2.83619	2.30881
Top-1 Element	2.83130	2.30267

Transformer++	2.79629	2.26666
Top-1 Element	2.79794	2.26627

Transformer++	2.77541	2.24591
Top-1 Element	2.77463	2.24285

Transformer++	2.75492	2.22381
Top-1 Element	2.75587	2.22106

(2) When Model parameters≈1.3B: Language modeling test performance under mixed domain pre-training

Similarly, we present these results in tabular form:

Model	Pre-Training Loss	Loss on Mixed Domain Test Set (ID)
Transformer++	2.39902	1.88991
Top-1 Element	2.39882	1.89060

Transformer++	2.35962	1.85101
Top-1 Element	2.36057	1.85379

Transformer++	2.33047	1.82144
Top-1 Element	2.33449	1.82770

Transformer++	2.30021	1.79239
Top-1 Element	2.30367	1.79699

Since we observed that the behavior of the Top-1 Element Selection architecture in the mixed domain pre-training setting remained consistent with our earlier observations (Figure 1(a)) without showing degradation, we conducted only unified limited domain pre-training experiments for the Top-1 Chunk Selection architecture, which shares similar mechanisms. We will incorporate these results into subsequent versions of the paper to further strengthen its persuasiveness.