Retrieval Head Mechanistically Explains Long-Context Factuality
We study the retrieval head, a special type of attention head that mechanistically explains long-context factuality.
Abstract
Reviews and Discussion
The paper investigates how transformer-based models extract relevant information from long context. It identifies a specific type of attention head, named retrieval head, which plays a significant role in the information retrieval process. The authors demonstrate how they detect retrieval heads and describe their characteristics through various experimental settings. Additionally, they conducted experiments that involved pruning the retrieval heads to show that these heads are essential for recalling specific information amidst vast amounts of data. Their findings indicate that retrieval heads are crucial for extractive question-answering and chain-of-thought reasoning.
Strengths
- The paper shows originality by exploring the retrieval capabilities of transformer language models, a topic that has not been extensively studied.
- The paper is well-organized and easy to follow. The paper first defines the special type of attention head that plays a significant role in recalling relevant information during generation. Then, it successfully proves the existence of the retrieval heads within the Needle-in-a-Haystack (NIAH) task and further illustrates several properties of the retrieval heads, which are quite interesting.
- Figures are clear and intuitive to comprehend
- The paper is impactful as it proposes prospective research directions involving retrieval heads.
Weaknesses
- The work is limited to the Needle-in-a-Haystack (NIAH) task. Although NIAH is a good task to prove the existence of the retrieval heads, we do not know if the similar findings and significance would transfer to other tasks where the LM needs to paraphrase or utilize the previous context (not just copy-and-paste), which are more complex and closely related to real-world applications.
Questions
- It would be better to show that similar findings transfer to paraphrasing tasks.
- More appendix figures for evaluation results of the Retrieval Head Detection Algorithm in various settings (line 199).
- Minor fixes: lines 247-250 contain a repeated sentence; the Figure 9 caption has the typo "needels".
Details of Ethics Concerns
N/A
We thank the reviewer for their support and detailed comments! The reviewer is mainly concerned about how the model's behavior would transfer to other tasks such as paraphrasing. Here we note:
Experiments on paraphrasing task
Below we show an example about how masking out retrieval heads breaks the model’s understanding of semantic dependency while masking out random non-retrieval heads does not:
- Input: The Whispering Forest of Lunthera, home to bioluminescent insects, is famed for its murmuring trees and an ancient legend about lost travelers finding their way home.
- Mask 30 retrieval: Lunthira’s Whispering Woods, home to luminescent creatures, is renowned for its lulling trees and a mythical tale about lost adventurers who happened upon the forest by chance.
- Mask 30 random: Lunthera’s Whispering Forest, abounding in luminescent insects, is renowned for its conversing trees and a fable about lost travelers finding their way home.
In this example, if one masks out the retrieval heads, the model outputs "luminescent creatures" without specifying that they are insects; it outputs "lulling trees" while the input says "murmuring trees". The input also says the lost travelers "find their way home", but this information is missed. In contrast, masking out random heads does not cause these problems.
Below is another example showing how masking out retrieval heads make the model hallucinate about information that does not exist in the input:
- Input: The glowing sands of the Duskveil Desert, enriched with rare minerals, shimmer in the dark and are said to hold heat for days after the sun sets.
- Mask 30 retrieval: In the Dusk Veil Desert, void of life and composing largely of glass, the sand is imbued with solar energy and preserves heat under the twilight.
- Mask 30 random: The verdant sands of the Duskveil Desert, composed of rare minerals, linger with a glowing glow at sunset and are said to retain heat for days.
In this example, masking out retrieval heads makes the model hallucinate that the desert is "void of life" and the sand is "composing largely of glass", content that does not appear in the input, which only says the sands are "enriched with rare minerals". Masking out random heads does not have these problems.
In the updated paper’s Appendix Figure 15, we give more examples of how masking out retrieval heads influences the model’s behavior on paraphrasing and question answering, while masking out random non-retrieval heads does not significantly change the model’s behavior.
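For readers who want to reproduce this kind of ablation, the masking operation itself is simple: a head's output is zeroed before being mixed back into the residual stream. Below is a minimal pure-Python sketch; the function name and shapes are our own illustration, not the paper's actual code.

```python
def mask_heads(attn_out, heads_to_mask):
    """Zero the per-head outputs for the selected heads.

    attn_out:      list of per-head output vectors (one vector per head,
                   here a single position for simplicity).
    heads_to_mask: collection of head indices to silence.
    """
    masked = set(heads_to_mask)
    return [
        [0.0] * len(vec) if h in masked else list(vec)
        for h, vec in enumerate(attn_out)
    ]
```

In a real model this would be applied at every layer and position, e.g. via a forward hook on each attention module, before the output projection recombines the heads.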
Other important comments
- We fixed the repeated sentence in Section 3.
- We fixed the typo “needel” in Figure 9.
Thanks for addressing my question. It is nice to see that the properties of the retrieval head also apply to paraphrasing and QA tasks. I think it is a good paper. I will maintain my score of 8.
This paper identifies and analyzes the properties of retrieval heads, a specialized type of attention head primarily responsible for retrieving information. Key properties include:
- Universality – retrieval heads are present across all explored models.
- Sparsity – only a small subset of attention heads serve this retrieval function.
- Intrinsic nature – these heads exist in pretrained models, even those trained on short contexts.
- Dynamic activation – their activation varies depending on specific tokens and contexts.
- Causality – pruning these heads leads to significant performance drops.
By examining the influence of retrieval heads across various tasks, these findings shed light on which internal model components actively seek information from input tokens.
Strengths
- Provides in-depth analysis and extensive experiments on various properties of retrieval heads.
- Well-written, with clear graphics, easy-to-follow explanations, and a well-organized structure.
- Numerous examples and case studies effectively illustrate the properties, making the concepts easy to understand.
Weaknesses
Nothing major needs to be addressed. Please address the discussion points in the Questions section.
Questions
- Are retrieval heads consistent across different architectures? Does the role of retrieval heads vary with different transformer architectures (e.g., decoder-only vs. encoder-decoder models), or are these properties universally applicable?
- How does the model determine when retrieval heads should be dynamically activated? Is there a mechanism or threshold within the model that dictates when these retrieval heads become active, especially in different contexts?
We thank the reviewer for the detailed comments and the support! Regarding questions:
- Are retrieval heads consistent across different architectures? Does the role of retrieval heads vary with different transformer architectures (e.g., decoder-only vs. encoder-decoder models), or are these properties universally applicable?
We tend to believe that retrieval heads are consistent across architectures, and the current experimental results on a variety of transformer variants support this hypothesis. Specifically, we have conducted experiments with:
- Grouped-query attention (e.g., Qwen1.5),
- Mixture of Experts models (e.g., Mixtral), and
- Hybrid models incorporating state-space layers (e.g., Jamba; please refer to the comment above titled [Update on Experiments] for more details on Jamba).
As today's mainstream models are decoder-only, we did not include encoder-decoder architectures. Our belief is that retrieval heads are a property that emerges with attention layers, so as long as the architecture has at least one layer of global attention, we tend to believe there will be retrieval heads within it, whether decoder-only or encoder-decoder.
- How does the model determine when retrieval heads should be dynamically activated? Is there a mechanism or threshold within the model that dictates when these retrieval heads become active, especially in different contexts?
We are also fascinated by how exactly the firing/triggering of retrieval happens within the architecture. Currently we are unable to pinpoint a triggering mechanism -- perhaps it is intrinsically hard to identify, because attention is basically a dot product between two vectors, and whether retrieval happens traces back to the exact values of the key-value vectors, which are outputs of previous layers and thus somewhat hard to determine.
We envision that future research could adopt approaches similar to the ongoing exploration of the "physics of LLMs", starting with smaller models and synthetic datasets to isolate individual model behaviors. For example, keeping all other factors the same, one could control a single aspect of the data such that one model can perform retrieval and another cannot. We would be excited to see this studied in future work.
The paper provides a systematic examination of a specific type of attention head, termed "retrieval heads," which primarily handle information retrieval from input data. It introduces an approach based on the Needle-In-a-Haystack (NIAH) setup to empirically identify retrieval attention heads across various transformer-based architectures. The findings demonstrate that:
- retrieval heads are present across a diverse set of models,
- only a small subset of attention heads function as retrieval heads,
- these retrieval heads exist even in models pretrained with limited context, suggesting they are an intrinsic artifact of pretraining,
- they are dynamically activated rather than continuously active, and
- there is a causal link between retrieval heads and the model’s capability to retrieve relevant information.
Strengths
Overall, the paper is well-structured, the assumptions are clear, and, with a few exceptions listed in the "weaknesses", the methodology is clear. The results and experiments robustly support the authors’ claims.
Weaknesses
The main concern with the paper is the lack of details on the benchmark used to generate the needle-and-haystack pairs. The paper does not clarify how these pairs are created, the diversity of pairs (e.g., in topic, token variety), or the validation methods used. Including these details would provide reviewers with valuable insights into the experimental design, helping them better assess the generalizability and universality of the findings.
Additional minor points:
- The acronym "KV" is not defined anywhere in the paper. Based on the context, it likely stands for "Key-Value," but this should be explicitly stated.
- The caption for Figure 3 could benefit from significant revision. It’s challenging to interpret without detailed reference to the discussion, so adding clarifying information in the caption itself would help.
- Figure 5 is difficult to interpret; it’s unclear what is being visualized. Overlaying the heatmaps for comparison could enhance clarity, or, if this visualization is redundant given the results in Figure 6, consider omitting it.
- The discussion in Section 4, in particular, would benefit from qualitative examples (perhaps as an appendix) to illustrate and substantiate claims related to downstream tasks.
Questions
Please check the list of minor points in "weaknesses" section.
We appreciate the reviewer’s detailed comments and support. Below are our responses:
Diversity of needle-haystack
The main concern with the paper is the lack of details on the benchmark used to generate the needle-and-haystack pairs. The paper does not clarify how these pairs are created, the diversity of pairs (e.g., in topic, token variety)
In our initial submitted version before the rebuttal, we use three sets of needle manually written by our authors. These needles are:
- Needle: A new report from the WMO shows that records were once again broken, and in some cases smashed, for greenhouse gas levels, surface temperatures, ocean heat and acidification. Question: What does a new report from WMO shows?
- Needle: The best thing to do in Beijing is to take a walk in Chaoyang Park and have a cup of Espresso in the evening. Question: What is the best thing to do in Beijing?
- Needle: Mr Green is disliked by everyone because he is a mean person and also he can't ride a horse or dive a car. Question: Why does everyone dislikes Mr Green?
The haystacks are randomly sampled documents from Slimpajama.
Since the reviewer is concerned about the diversity of the needles, we conduct a follow-up experiment to demonstrate that the detection results for retrieval heads do not change when increasing the diversity of needles. Specifically, we use the above three cases as in-context examples for a language model and ask it to generate 100 more examples. By doing this we get a synthetic dataset of 1.2K unique tokens spanning 10 topics (Technology, Transportation, Education, Festivals, Health, etc.). Below are three examples of the synthetic data:
- Needle: The newly constructed SkyBridge connects three cities—Everdale, Pinehurst, and Riverpoint—allowing citizens to travel with ease, admire scenic views, and save significant commute time. Question: What does the newly constructed SkyBridge connect and offer?
- Needle: The upcoming GalaxyFest will feature over 100 sci-fi authors, 50 exclusive book signings, and a virtual reality experience of Mars colonization. Question: What features will the upcoming GalaxyFest have?
- Needle: The Aurora Conservatory is renowned for its collection of rare Arctic flora, cutting-edge climate research, and eco-friendly glass dome architecture. Question: Why is the Aurora Conservatory renowned?
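The generation step described above, using the seed needles as in-context examples for a language model, can be sketched as a simple prompt builder. The function name and instruction wording here are hypothetical stand-ins for the actual prompt we used:

```python
def build_fewshot_prompt(seed_pairs, n_new=100):
    """Assemble a few-shot prompt from (needle, question) seed pairs."""
    lines = ["Here are examples of needle/question pairs:"]
    for needle, question in seed_pairs:
        lines.append(f"Needle: {needle} Question: {question}")
    lines.append(f"Please generate {n_new} new pairs covering diverse topics.")
    return "\n".join(lines)
```

The resulting prompt is then sent to a language model, and the generated pairs are deduplicated before use.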
Repeating the retrieval head detection algorithm on Mistral 7B with our newly generated 100 needles, we get the same set of retrieval heads as before. That is to say, we confirm that our conclusion holds when scaling the number of needles from 3 to 100. We have added the details in Appendix Figure 14.
Other important comments
- We added a clarification noting that "KV" means "key-value" in the last line of the abstract.
- We elaborated the meaning of Figure 3 in the caption by explaining the color and the sparsity of retrieval heads.
- We added an elaboration of Figure 5. Note that Figure 5 is complementary to Figure 6: it not only shows that the chat model and base model share the same set of retrieval heads, but also shows that these heads are mostly within the middle layers.
- We have added examples of masking out retrieval heads vs. random heads for paraphrasing and question answering in the updated Appendix Figure 15. These examples consistently demonstrate the influence of retrieval heads on downstream tasks.
This paper investigates the mechanism with which transformer-based language models "retrieve" information in the long context. It experimented with four model families, six model scales, and three types of post-training variants, and reveals that a special type of attention heads are largely responsible for retrieving information (either copy-paste or paraphrase) from long contexts. Such attention heads are named “retrieval heads”. The authors find that these retrieval heads
- (1) exist in all the explored models,
- (2) are only 5% of the attention heads,
- (3) exist in models large-scale-pretrained with long or only short contexts and remain the same when the models are continually pretrained on longer contexts,
- (4) are dynamically activated given different contexts, and
- (5) will cause degradation in retrieval abilities or chain-of-thought abilities if pruned.
Strengths
- The paper is well-written, with a clear overarching research question (”How do transformer-based language models acquire long-context capabilities?”), the substantial findings of the existence of “retrieval heads” and their properties, and experiments to support each finding.
- The experiments are extensively conducted on the LLaMA, Yi, Qwen, and Mistral model families, at various scales from 6B to 8x7B, on base and chat models, and on both dense and Mixture-of-Experts models.
- To identify which attention head is contributing to the retrieval from contexts, the authors proposed a novel retrieval score to measure the frequency of a head’s copy-paste behavior during autoregressive decoding. This retrieval score is both analyzed in different models and used to guide the experiments that prune or mask retrieval heads to understand the causal importance of retrieval heads.
- The authors also considerately report empirical results that these identified retrieval heads are activated during paraphrasing, QA, CoT reasoning tasks, and not just in copy-paste tasks.
Weaknesses
- It’d facilitate reading to clarify that " is a sentence that is irrelevant to " in L146, instead of, for example, a short phrase or a single word. Can add a reference to Figure 2 so that readers see an example.
- The paper misses dataset details (L195, L355, L427). Are NIAH samples created manually or by prompting large language models? What datasets are used to begin with in Sec 2-4? What additional evaluation tests did you create in Sec 4.1-4.3?
- The paper misses experimental details, such as prompts used, links to existing assets used, etc.
- The questions below need to be addressed.
Questions
- L156: By , do you mean real numbers? If so, perhaps use instead and clarify that refers to unnormalized attention scores.
- Figure 3: It seems that in fact less than 1% of attention heads are activated more than 50% of the time. The 5% in the caption could probably be changed to 1%.
- L194: Does it happen that the model generates a word that is not the next token in the needle that should be generated? If this happens, do you skip the example? Or consider that as a case when all attention heads do not perform a copy-paste operation, even if an attention head actually pays the most attention to the token that should be generated next?
- L203: What do you mean by “stabilize” and “converge”? Please either provide definitions or plots to illustrate.
- Figure 7: Could be nice to include the dataset name in the caption.
We thank the reviewer for the detailed comments. Below we note:
About missing details:
- The paper misses dataset details (L195, L355, L427). Are NIAH samples created manually or by prompting large language models? What datasets are used to begin with in Sec 2-4? What additional evaluation tests did you create in Sec 4.1-4.3?
The paper misses experimental details, such as prompts used, links to existing assets used, etc.
We appreciate the comment on missing details. Please see our [Update on Experimental Details], where we have included comprehensive descriptions of the datasets, methods for creating NIAH samples, and evaluation tests. We would also be happy to provide further follow-up explanations. These updates will also be incorporated into the revised version of our paper. Our full reproducible code and data is open-sourced, and we will reveal the link after the anonymity period.
About :
L156: By , do you mean real numbers? If so, perhaps use instead and clarify that refers to unnormalized attention scores.
Yes, refers to real numbers, and is indeed more precise. We have updated the notation and clarified its meaning in the revised paper.
Question on Retrieval Head detection:
L194: Does it happen that the model generates a word that is not the next token in the needle that should be generated? If this happens, do you skip the example? Or consider that as a case when all attention heads do not perform a copy-paste operation, even if an attention head actually pays the most attention to the token that should be generated next?
- Our retrieval head detection specifically targets the copy-paste behavior. So yes, if the model generates a token that is not in the needle, we do not count it, regardless of whether the attention head is attending to the target token or not.
- That being said, since the needle sentence contains multiple tokens, to identify retrieval heads the model does not necessarily need to generate all of the tokens in the needle. Say a needle "best place to visit in SF ..." contains 20 tokens; as long as the model copy-pastes a fair portion of the tokens, say 14 out of 20, this level of copy-paste is enough for us to identify how strongly/frequently an attention head is doing retrieval.
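The counting rule described above can be sketched as follows. This is a simplified reconstruction from the reply, not the authors' released implementation; the names, shapes, and the exact normalization (here, hits divided by copy-paste steps) are our assumptions. A step counts only when the generated token occurs in the needle, and a head scores a hit when its most-attended position is that needle token:

```python
def retrieval_score(attn, generated_ids, input_ids, needle_positions):
    """Simplified retrieval score for a single attention head.

    attn:             one row of attention weights per generated token,
                      each row covering all input positions.
    generated_ids:    token ids produced during decoding.
    input_ids:        token ids of the full input (haystack + needle).
    needle_positions: set of input positions belonging to the needle.
    """
    hits, copy_steps = 0, 0
    for step, tok in enumerate(generated_ids):
        # Only copy-paste steps count: the generated token must occur in the needle.
        candidates = [p for p in needle_positions if input_ids[p] == tok]
        if not candidates:
            continue  # not counted at all, as described in the reply above
        copy_steps += 1
        top = max(range(len(attn[step])), key=lambda p: attn[step][p])
        if top in candidates:
            hits += 1
    return hits / copy_steps if copy_steps else 0.0
```

Heads whose score stays high across many (needle, haystack) trials are the candidates labeled retrieval heads.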
L203: What do you mean by “stabilize” and “converge”? Please either provide definitions or plots to illustrate.
- Since heads detected with strong retrieval scores in one context may not exhibit strong retrieval scores in another context, we conduct retrieval head detection over multiple (question, answer) pairs and multiple long contexts (haystacks).
- As we increase the number of trials for detecting retrieval heads, "stabilize" and "converge" refer to the ranking of attention heads (based on retrieval scores) becoming consistent across repeated trials of retrieval head detection.
- We have added definitions and included supporting plots in Section 7.2, Figure 13, in the updated Appendix.
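One concrete way to operationalize such a convergence check (our illustration; the paper may use a different statistic) is the Jaccard overlap of the top-k heads between two detection trials, which approaches 1.0 as the ranking stabilizes:

```python
def topk_overlap(scores_a, scores_b, k=10):
    """Jaccard overlap of the top-k heads under two detection trials.

    scores_a, scores_b: dicts mapping head id -> retrieval score.
    """
    top_a = set(sorted(scores_a, key=scores_a.get, reverse=True)[:k])
    top_b = set(sorted(scores_b, key=scores_b.get, reverse=True)[:k])
    return len(top_a & top_b) / len(top_a | top_b)
```

Computing this between successive cumulative trial counts (e.g., 10 vs. 20 needles) gives a curve that flattens once detection has converged.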
Writings
It’d facilitate reading to clarify that " is a sentence that is irrelevant to " in L146, instead of, for example, a short phrase or a single word. Can add a reference to Figure 2 so that readers see an example.
Figure 3: Seems that in fact less that 1% of attention heads are activated more than 50% times. The 5% in the caption could probably be changed to 1%.
Figure 7: Could be nice to include the dataset name in the caption.
Thanks for the suggestions; we have modified the paper accordingly in our new revision.
We thank all reviewers for their valuable feedback regarding the experimental details. In response to these comments, we have updated the paper to include the required details. We also elaborate on this information below:
1. NIAH experiments details:
Before rebuttal:
- We manually construct / write four sets of (question, answer) pairs that are semantically irrelevant to the long document ("haystack"). The long documents for all four sets are randomly sampled from the publicly available SlimPajama.
- We use three sets of (question, answer) pairs for retrieval head detection (i.e., calculating the retrieval scores discussed in Section 2), and reserve one set of (question, answer) pair for testing (generating Figure 1).
- A maximum sequence length of 50K tokens is used for retrieval head detection, while the full 128K token length is used during testing. This ensures that the retrieval heads generalize to lengths longer than those used to detect them.
The needles for retrieval head detection are:
- Needle: A new report from the WMO shows that records were once again broken, and in some cases smashed, for greenhouse gas levels, surface temperatures, ocean heat and acidification. Question: What does a new report from WMO shows?
- Needle: The best thing to do in Beijing is to take a walk in Chaoyang Park and have a cup of Espresso in the evening. Question: What is the best thing to do in Beijing?
- Needle: Mr Green is disliked by everyone because he is a mean person and also he can't ride a horse or dive a car. Question: Why does everyone dislikes Mr Green?
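The NIAH sample construction described above, with a needle inserted into a SlimPajama haystack at a controlled depth up to a maximum length, can be sketched as follows; `build_niah_sample` and its arguments are illustrative names rather than the authors' code:

```python
def build_niah_sample(haystack_tokens, needle_tokens, depth, max_len=50_000):
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end) into a
    haystack truncated so the full sample fits within max_len tokens."""
    hay = haystack_tokens[: max_len - len(needle_tokens)]
    pos = int(depth * len(hay))
    return hay[:pos] + needle_tokens + hay[pos:]
```

Sweeping `depth` over a grid (and `max_len` over context lengths) yields the standard NIAH heatmap of retrieval accuracy by depth and length.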
During rebuttal:
Since reviewer ZPPp is concerned about the diversity of the needles, we use the above three cases as in-context examples for a language model and ask it to generate 100 more examples. By doing this we get a synthetic dataset of 1.2K unique tokens spanning 10 topics (Technology, Transportation, Education, Festivals, Health, etc.). Below are three examples of the synthetic data. The full needle set is uploaded as supplementary material:
- Needle: The newly constructed SkyBridge connects three cities—Everdale, Pinehurst, and Riverpoint—allowing citizens to travel with ease, admire scenic views, and save significant commute time. Question: What does the newly constructed SkyBridge connect and offer?
- Needle: The upcoming GalaxyFest will feature over 100 sci-fi authors, 50 exclusive book signings, and a virtual reality experience of Mars colonization. Question: What features will the upcoming GalaxyFest have?
- Needle: The Aurora Conservatory is renowned for its collection of rare Arctic flora, cutting-edge climate research, and eco-friendly glass dome architecture. Question: Why is the Aurora Conservatory renowned?
Repeating the retrieval head detection algorithm on Mistral 7B with our newly generated 100 needles, we get the same set of retrieval heads as before. That is to say, we confirm that our conclusion holds when scaling the number of needles from 3 to 100. We have added the details in Appendix Figure 14.
We apologize for any confusion caused by the initial omission of these details and have updated the paper to include this information. Our full reproducible code and data is open-sourced, and we will reveal the link after the anonymity period.
2. Other Experiments and Prompts
As our primary focus is the fundamental properties of retrieval heads, we do not emphasize prompt engineering, opting instead for the simplest possible prompts. The datasets used, excluding ExtractiveQA (details are given in Section 4.2, and we will open-source it) and NIAH, are publicly available:
| Dataset | Prompt & Source |
|---|---|
| Musique | Prompt and data from LongBench. The CoT version adds: "please first think step by step." |
| GSM8K and MMLU | Prompt and data from Chain-of-thought-hub. |
We hope these clarifications address the reviewers' concerns.
We are excited to observe that Jamba, a hybrid model combining state-space layers (Mamba), mixture-of-experts (MoE), and a limited number of attention layers (4 layers × 32 heads = 128 total attention heads), also exhibits retrieval heads. Interestingly, these retrieval heads within the attention layers appear to play a key role in Jamba’s retrieval capabilities. Below, we provide a comparison of masking top-k retrieval heads versus random heads in Jamba-v0.1 (12B active parameters, with a total of 52B parameters across all experts) on needle-in-a-haystack experiments using the same settings as Figure 7 in our paper:
| Masking Head number | 0 | 2 | 5 | 10 | 15 | 20 | 30 | 50 | 100 |
|---|---|---|---|---|---|---|---|---|---|
| Masking Random Head | 100 | 99.1 | 98.0 | 94.7 | 90.3 | 85.1 | 70.3 | 44.1 | 9.0 |
| Masking Top Retrieval head | 100 | 98.1 | 38.5 | 61.9 | 33.0 | 15.3 | 10.3 | 12.9 | 3.5 |
These results provide strong evidence supporting our hypothesis that “full attention is crucial for effective long-context information retrieval” (L477) in the section “Relationship to Local and Linear Attention and State-Space Models”. Masking top retrieval heads significantly impacts retrieval performance for Jamba, confirming their essential role in maintaining model capabilities. We will incorporate these updated results into Figure 7, along with comparisons to other models, to further strengthen our findings.
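The sweep behind this table can be sketched as the loop below, where `score_fn` stands in for a full NIAH evaluation of the model with the given heads masked; everything here is illustrative scaffolding rather than the authors' code:

```python
def mask_sweep(score_fn, ranked_heads, all_heads, ks, rng):
    """Compare accuracy when masking the top-k retrieval heads vs. k random heads.

    score_fn(heads) -> accuracy with the given heads masked (a stand-in
    here for a full needle-in-a-haystack evaluation run).
    ranked_heads: head ids sorted by retrieval score, strongest first.
    """
    results = {}
    for k in ks:
        top_k = ranked_heads[:k]
        random_k = rng.sample(all_heads, k)
        results[k] = (score_fn(top_k), score_fn(random_k))
    return results
```

In practice one would average the random-head column over several seeds, since a single random draw can accidentally include a retrieval head.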
Please refer to Section 7.1 in the appendix to see the new figure with Jamba.
The paper investigates how transformer-based models extract relevant information from long context. It identifies a specific type of attention head, named retrieval head, which plays a significant role in the information retrieval process. The authors demonstrate how they detect retrieval heads and describe their characteristics through various experimental settings. Additionally, they conducted experiments that involved pruning the retrieval heads to show that these heads are essential for recalling specific information amidst vast amounts of data. Their findings indicate that retrieval heads are crucial for extractive question-answering and chain-of-thought reasoning.
Additional Comments on Reviewer Discussion
All reviewers agree on accepting this paper based on its soundness and novelty. Concerns have been generally addressed in the rebuttal.
Accept (Oral)