Thank you for your review and valuable feedback. We have uploaded an updated PDF and written a general response with some clarifications that can be useful to all reviewers.

We address the specific weakness and questions next:

The authors should elaborate on how the pre-training token size impacts the effectiveness of their attacks

We do not ablate the size of the pre-training dataset in our work. All models are trained on the exact same dataset containing 100 billion tokens. This dataset is large enough that our experiments approximate industry-scale pre-training runs with reasonable precision, but less than a full training run because we simply don’t have enough resources. If we had infinite compute, we would have liked to answer the question of how training on a greater number of tokens impacts the effectiveness of the attack, but unfortunately this is not feasible.

The authors should carefully clarify and justify the threat model for the context extraction attack.

Many deployed LLM systems (e.g., ChatGPT, Claude, etc.) prepend undisclosed system prompts to user queries. The point of the context extraction attack is to allow a malicious user to steal this undisclosed context.

explores the threat of context extraction attacks in detail.

Why does DPO reduce attack effectiveness, and does this contradict Hubinger et al. (sleeper agents)?

The fact that post-training DPO (and SFT) on clean data reduces attack effectiveness is not surprising in itself: jailbreaking and denial-of-service (producing unsafe/gibberish text) are behaviors that perfect alignment should get rid of. This is exactly why our study focuses on if poisoning at pre-training time can persist through alignment. Our findings do not contradict with Hubinger et al. on whether alignment removes backdoors: Section 4 and 5 in their paper show that SFT and RL fine-tuning does “train away” backdoors to some degree. We have clarified this in the paper.

Do the attacks make it through potential filters and potential defenses?

This is a valid concern. The paragraph "Can data poisoning be filtered out?" in the Discussion covers this in detail. Models such as OLMo use rule/classifier-based checks to filter out non-natural low-quality text, and all attacks other than the denial-of-service attack should bypass the filters. The trigger-free belief manipulation attack produces documents that just resemble web documents, and it seems fundamentally difficult, if not impossible, to prevent. Defense against web-scale data poisoning is a problem out-of-scope for this paper and we leave for future work.