Answer, Assemble, Ace: Understanding How LMs Answer Multiple Choice Questions
Abstract
Reviews and Discussion
The authors study the mechanisms that exist in LLMs that enable them to successfully answer multiple choice questions. They primarily focus on three models from different model families, chosen for their well-above-average accuracy across answer orderings and datasets. They use three datasets in their study: MMLU, Hellaswag, and a synthetic dataset ("Colors"). The mechanism study is done with activation patching and vocabulary projection. Among the authors' key findings are:
- There is a specific layer (or layers) responsible for promoting the letter to be chosen.
- Letter selection is handled by the multi-head self-attention part of the model, and in particular by a small proportion of total heads.
- When unusual symbol groups (e.g., Q/Z/R/X) are used, the model first promotes A/B/C/D and then abruptly changes to promoting the unusual symbol group.
- Models that don't perform well on multiple choice cannot effectively separate the answer letters in vocabulary space.
Strengths
This paper seems quite original. It is well contextualized with respect to prior work, and has novel contributions. The authors' claims are clear and are backed by broad and consistent experimental results. The interpretability findings may be of interest to researchers working in the space of MCQA and LLMs, or in the space of model interpretability in general.
Weaknesses
- I'm surprised that the OLMo 0724 7B Instruct (Figure 9) study used 0-shot accuracy. In general, the authors use 3-shot accuracy in the main body of the paper, which seems prudent as the 0-shot case is somewhat ambiguous (it's unclear whether the correct response is answer label or answer text). It would be interesting to see 3-shot results for the study or to hear a justification for use of 0-shot accuracy.
- The findings in this paper rely on some assumptions inherent in activation patching and vocabulary projection. For example, that it's reasonable to project hidden layers from early in the network to the vocabulary space. I didn't take this potential weakness into account in my rating of the paper, as I'm not intimately acquainted with very recent mechanistic interpretability work. But it is something the authors could address in the paper if desired.
Questions
- As mentioned in "Weaknesses" I'd be interested to hear the reason for using 0-shot accuracy in Figure 9.
Thank you for your positive assessment of our paper!
"...the OLMo 0724 7B Instruct (Figure 9) study used 0-shot accuracy…It would be interesting to see 3-shot results for the study or to hear a justification for use of 0-shot accuracy.”
Our primary motivation for showing the 0-shot learning curve is that it’s a lower bound on 3-shot performance, and we see in Fig. 9 that the model clearly learns the task (without needing in-context examples) early in training. But you raise a good point, and we’ll update Fig. 9 (and additionally, Figs. 21-22) to be 3-shot for consistency with the rest of the paper.
"The findings in this paper rely on some assumptions inherent in activation patching and vocabulary projection…it is something the authors could address in the paper if desired.”
Thanks for pointing this out. This is exactly why we include both methods: they are complementary (lines 238-240). We initially had more discussion of their strengths & weaknesses that got cut for space; we will add this back in the Appendix of the updated PDF (coming shortly).
This research work studies how LLMs solve the MCQA task. The authors show which layers and layer components play the most crucial role in this process, and at which point the OLMo model starts to learn to solve this task. They also examine how this process differs when the options are labeled with other symbols (not A/B/C/D).
Strengths
- The paper analyses an important topic, since MCQA tasks are very common in LLM benchmarks.
- The authors came up with a good idea for their synthetic dataset. That dataset allows them to distinguish a model's understanding of the MCQA format from its knowledge of a particular area, which is important for proper analysis.
- A number of experiments are done on several models: Olmo, LLaMA 3.1, and Qwen, in both chat and base versions. These models look like appropriate subjects for this study.
- In the second half of their paper, the authors concentrate on those models that are good at solving all three MCQA tasks (synthetic, HellaSwag, and MMLU). This is also a good move, because it allows them to ignore "noisy" properties of weak models that could otherwise mislead the researcher and reader.
- The authors use good practices from MCQA research, i.e., they permute the A/B/C/D symbols and consider alternative symbols for answer options.
Weaknesses
- The claim in lines 518-519 ("these results demonstrate that an inability to disentangle answer symbol tokens in vocabulary space is a property of poorly-performing models") is not sufficiently supported. Such a claim cannot be deduced from the analysis of the checkpoints of only one particular Olmo 0724 7B base model (however, in lines 79-81 you make a much more humble version of this claim, with which I agree).
- In lines 76-78 you say: "We discover that the model's hidden states initially assign high logit values to expected answer symbols (here, A/B/C/D) before switching to the symbols given in the prompt (here, the random letters Q/Z/R/X).", but this claim is also not sufficiently supported. You only showed this effect for one non-standard letter set, and, again, I see only Olmo 7B experiments here (please correct me if I'm wrong).
- Some parts of the paper are poorly written and unclear. E.g., I don't understand the caption for Figure 1, especially the sentence "Finally, when we switch to more unusual answer choice symbols, one behavior by which models adjust is to initially operate in the space of more standard tokens, such as A/B/C/D, before aligning probability to the correct symbols at a late layer". See also the "Questions" section for questions about other figures.
- There are the following minor problems:
  - Text in the diagrams is extremely small. Please make it larger so that it is readable.
  - The paper contains typos such as the word "projectioan" on line 066. Please check the grammar of your text.
  - Citations on line 264 should be put inside brackets.
UPDATE: The authors addressed the second major weakness by adding results for other letter sets and other models. Due to this, I raised my main score from 5 to 6 and the "Soundness" score from 2 to 3. They also answered the questions, made their points clearer, and addressed minor problems in the text of the paper. Due to this, I raised the "Presentation" score from 2 to 3.
Questions
- Fig.9: Do you have such a curve for any other model, aside from Olmo 0724 7B base?
- Fig.4: I don't understand the purpose of making this figure. What should it show to us?
- Fig.8: Can you please elaborate on the captions for a), b), c)?
- In most of your experiments, you only consider models in the 1.5-8B range (smaller models turned out to be too weak to make good MCQA predictions). Would it be possible to do at least some experiments with bigger models (at least 13B)? It would be especially interesting to see the effect of switching from A/B/C/D to Q/Z/R/X (and other non-standard letter/symbol sets) on such a model.
Thank you for acknowledging many positive aspects of our work; we can alleviate many of the weaknesses you mention with clarifications presented here and writing updates to our PDF. We've updated text in diagrams to be as large as possible given the page constraint, and made another pass to fix typos and put citations in brackets; thank you for these catches. Updated PDF coming soon!
"The claim in the lines 518-519…cannot be deduced from the analysis of the checkpoints of only one particular Olmo 0724 7B base model (however, at lines 79-81 you make much more humble version of this claim, with which I agree).”
Thank you for pointing this out; we will update this sentence in our new PDF (coming shortly) to instead reflect the more humble version of the claim.
"In the lines 76-78…you only showed this effect for one non-standard letter set, and, again, I see only Olmo 7B experiments here (please, correct me if I'm wrong).”
When we say “the model’s hidden state”, we are referring to the Olmo 7B Instruct model mentioned earlier in the sentence. We are explicit in the paper that this result is for the Olmo model (lines 409 and 416), and thus represents one means by which models produce OOD letters (line 41). We'll further clarify this in the updated PDF though, and you're right that it will be valuable for us to add experiments for Llama and Qwen.
As for other non-standard answer symbols, we did observe this effect consistently in our preliminary experiments, but did not include these results in the paper. This is valuable feedback and we will add results on another randomly-selected letter set to show that our findings are robust.
Your questions:
"I don't understand the caption for the Figure 1, especially the sentence "Finally, when we switch to more unusual answer choice symbols, one behavior by which models adjust is to initially operate in the space of more standard tokens, such as A/B/C/D, before aligning probability to the correct symbols at a late layer".”
This is referring to our finding in lines 76-78 and in Section 5 lines 407-419 under the heading “Some answer symbols are produced in a two-stage process”: that for the Olmo 7B Instruct model, hidden states initially assign high logit values to expected answer symbols (A/B/C/D) before switching to the symbols given in the prompt. We'll try to adjust the phrasing to make this clearer.
"Fig.9: Do you have such curve for any other model, aside of Olmo 0724 7B base?”
No, and unfortunately we are unable to produce one. The Llama and Qwen developers did not release any intermediate model checkpoints, and the Olmo 1B model doesn’t do well enough on the synthetic task at the end of training to warrant constructing one (see Fig. 2).
"Fig.4: I don't understand the purpose of making this figure. What should it show to us?”
The purpose of the figure is to show that trends in promoting answer choices in vocabulary space are largely similar across tasks, despite the fact that the difficulty and subject matter of the tasks vary substantially and model performance differs as a result (i.e., from left to right: 55%, 51%, and 100% accuracy). We elaborate on this in lines 392-394, but if additional clarification or discussion would be valuable, we are happy to add. We’ve also added the accuracies to the figure caption.
"Fig.8: Can you, please, elaborate the captions a), b), c)?”
Plots b) and c): this is the difference in logits on the answer choice tokens produced by a hidden state projected to the vocabulary space (described in Section 4.2, specifically lines 254-255). This is equivalent to the “logit difference” lines in Figures 5a and 5b, except now at the more fine-grained level of individual attention heads instead of hidden state outputs. We plot this to show the extent to which attention heads specialize by letter (by comparing how the two plots differ). Additionally, while the model is ultimately producing positive values (blue) for the instances in both graphs that result in correct predictions, the plots show that not all attention heads contribute positively towards the model predicting the correct answers.
Plot a): this is the sum of logit values for all answer choices (logit of A + logit of B + logit of C + logit of D) using the same vocabulary projection method. This is equivalent to summing the A, B, C, and D lines in Figure 5a, except now at the more fine-grained level of individual attention heads instead of hidden state outputs. We plot this to show the overlap between plots a), b) and c): many attention heads appear to play a role in both generating any valid answer choice symbol and outputting the correct symbol (lines 478-479).
We will include additional context for these plots in the paper.
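To make these metrics more concrete, here is a minimal sketch of the per-head vocabulary projection (the names are illustrative; applying the final layer norm before the unembedding follows the standard logit-lens convention, and the exact form of the logit difference shown is one plausible choice rather than necessarily our precise formulation):

```python
import torch

def head_vocab_logits(head_out, W_O_slice, final_norm, W_U):
    """Project one attention head's output at the final token into vocabulary space.

    head_out:   (d_head,) pre-W_O output of the head at the last token position
    W_O_slice:  (d_head, d_model) this head's slice of the attention output projection
    final_norm: the model's final LayerNorm/RMSNorm module
    W_U:        (d_model, vocab_size) unembedding matrix
    """
    resid_contribution = head_out @ W_O_slice      # what this head writes to the residual stream
    return final_norm(resid_contribution) @ W_U    # logits over the vocabulary

def answer_logit_sum(logits, choice_ids):
    # plot a)-style metric: total logit mass on all answer-choice symbols (A + B + C + D)
    return logits[choice_ids].sum()

def answer_logit_diff(logits, correct_id, other_ids):
    # plots b)/c)-style metric (one plausible form): correct-symbol logit minus
    # the mean logit of the other answer symbols
    return logits[correct_id] - logits[other_ids].mean()
```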
"Would it be possible to do at least some experiments with bigger models (at least 13B)?"
The only other sizes of Llama 3.1 available are 70B and 405B, and Olmo does not have any model sizes apart from those we tested (1B and 7B). We included Qwen 2.5 0.5B and 1.5B sizes in the paper to provide a size contrast to the 7B and 8B models; we feel these 3 model families and 5 sizes are fairly representative.
While we unfortunately did not have the resources to scale to super large (70B+) sizes, we note this is the case in virtually all current academic mechanistic interpretability research. However there are some promising up-and-coming initiatives (such as https://arxiv.org/abs/2407.14561) to allow for analysis at larger scales soon, so we hope we can scale in future work.
Thank you for your response. Could you please confirm if any updates have been made to the paper?
We have updated the PDF with the requested changes (see general response for the details).
The study investigates how LLMs solve multiple-choice question-answering (MCQA) tasks, focusing on intrinsic interpretability. The authors explore how the specific layers and attention heads contribute to selecting the correct answer from a set of choices. By applying methods like vocabulary projection and activation patching, the study reveals that particular middle transformer layers play a critical role in aligning the model’s predictions with the correct answer symbol. These components act selectively to bind the answer symbol (e.g., A, B, C, D) to the correct answer phrase. The paper provides a comprehensive pre-analysis of multiple models, including Olmo, Llama, and Qwen, across various prompt formats and datasets to select the models that understand the MCQA task. Additionally, it introduces a synthetic MCQA task to isolate MCQA capabilities from dataset-specific challenges. Key findings indicate that robust MCQA performance relies on both specific layers and sparse attention heads that drive answer selection, with some layers adapting to unusual symbol choices in later stages. This work offers a deeper understanding of model-specific MCQA mechanisms.
Strengths
- Evaluating across various prompt formats and permutations is really convincing; it is a real highlight of this paper
- Interesting analysis that includes both MHSA and MLP patching
- Good visualization of when the MCQA understanding forms during training
Weaknesses
- The motivation for selecting the patching format is not fully substantiated. By "patching format," I refer to the reordering of the correct answer to a new position. Although this approach is intuitively clear, I believe the motivation could be strengthened by exploring different patching formats (for example, removing the correct answer from the sample).
- The circuits are shown to be dependent on the nodes (parts of the model) chosen for patching [1]. Patching the layers is interesting, but for a full picture it would be good to also examine the attention maps via patching.
[1] Miller, Joseph, Bilal Chughtai, and William Saunders. "Transformer Circuit Evaluation Metrics Are Not Robust." First Conference on Language Modeling. 2024.
Questions
- Could you please specify what values are presented in Fig. 8? Are they taken from the logit lens applied to the MHSA components, per head?
- What conclusion do you draw from the fact that in Fig. 7c (MHSA) changes in answers are visible only at layer 24, while in Fig. 3 they persist further? It would be interesting to describe that mechanism in detail.
- It is a relatively small issue, but did you consider using another variant of the logit lens? The reason is that the logit lens by itself could potentially be less reliable for earlier layers [2]
Small comments:
- The subfigures of Fig. 8 are not aligned
[2] Belrose, Nora, et al. "Eliciting latent predictions from transformers with the tuned lens." (2023).
Thank you for your positive comments about our work and valuable questions!
"The motivation for selecting the patching format is not fully substantiated. By "patching format," I refer to the reordering of the correct answer to a new position. Although this approach is intuitively clear, I believe the motivation could be strengthened by exploring different patching formats (for example, removing correct answer from sample).”
As we mention in lines 231-233, preliminary conditions for obtaining meaningful interpretations from patching are that 1) the model predicts the correct answer for both patching formats, and 2) the answer changes after a patch. Removing the correct answer from the sample does not result in a valid new correct answer, making it difficult to interpret the role of various model components, since the task has fundamentally changed. The two methods we use (reordering the correct answer to a new position and changing the answer choice symbols) ensure the correct answer has changed but is still valid, allowing us to isolate mechanisms for predicting a specific correct answer.
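To illustrate, a minimal pair for the reordering format might look like the following (the question, wording, and template here are purely illustrative and are not our exact prompt format or data):

```python
# Purely illustrative prompts -- not the exact template or data used in the paper.
clean_prompt = (
    "Question: Which of the following is a primary color?\n"
    "A. Green\nB. Red\nC. Purple\nD. Orange\n"
    "Answer:"
)  # correct answer symbol: "B"

# Same question with the correct option reordered to a new position, so the
# correct prediction must change from "B" to "D"; a successful patch should
# flip the model's output accordingly.
reordered_prompt = (
    "Question: Which of the following is a primary color?\n"
    "A. Green\nB. Purple\nC. Orange\nD. Red\n"
    "Answer:"
)  # correct answer symbol: "D"

# Our second format instead changes the answer-choice symbols themselves
# (e.g., A/B/C/D -> Q/Z/R/X) while keeping the content fixed.
```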
"The circuits are shown to be dependent on the nodes (parts of model) for patching [1]. Patching the layers is interesting, but for full picture it would be good to examine the attention maps by patching.”
Good suggestion; we will include a plot for attention head patching in the updated PDF (coming soon). (We assume you mean "attention head" by "attention map" here; but please let us know if this is not what you mean).
Your questions:
"Could you please specify what values are presented by Fig. 8? Is taken from logit lens applied to MSHA components by heads?”
Yes, this is exactly what is plotted, following Yu et al. (2023)'s methodology. We describe the decomposition of the MHSA component into attention heads in Appendix A.2 (lines 835-846) and briefly in lines 456-457, but we'll add more details to the Figure 8 caption so it's clearer.
"What conclusion do you make from the fact that in Fig. 7c (MHSA) change in answers are visible only in 24 layer, while in Fig.3 it maintains further. It will be interesting to describe in detail that mechanism.”
This provides supporting evidence for the two-stage prediction process for the Q/Z/R/X prompt. When the correct answer symbol (A to D) and location (index 0 to index 3) change in Figure 7c, but not the vocabulary space of answer choices (A, B, C, D), layer 24 is strongly causally implicated in the model switching its prediction from A to D; this may be encoded in the residual stream at layer 24 either as positional information saying "the correct answer is at index 3" or as the actual letter, "D", both of which flip the model's prediction to "D". But when only the answer symbol (A to Q) and the vocabulary space (Q, Z, R, X) change, and not the location (index 0 stays correct), we observe layers 27-29 playing a key role. This means that layer 24 is encoding neither 1) that Q represents position 0 and position 0 is correct, nor 2) that Q represents the selected answer. Consistent with Figure 6a, Fig. 3 shows that there is a delayed layerwise effect in assigning non-negligible probability to Q.
"It is relatively small issue, but do you consider using another variant of logit lens? The reason is that logit lens by itself could be potentially worse for earlier layers [2]”
This is the primary reason we also employ causal tracing, as it can discover key mechanisms in early layers that logit lens cannot. We do not use the tuned lens approach [2] because it trains a linear probe instead of using the vocabulary space defined by the model’s unembedding matrix, and thus cannot answer our research question of how predictions form in the model’s vocabulary space (lines 239-241).
Thank you for the detailed clarifications and responses to my comments.
Regarding the tuned lens [2], I appreciate your explanation of why it was not employed. However, I would like to clarify that the tuned lens methodology (as I believe) does indeed use the model’s unembedding matrix but applies an affine transformation beforehand. While I understand your rationale, I still believe that incorporating the tuned lens approach could add depth and potentially strengthen your paper. That said, it is not essential to your study and remains a suggestion for consideration.
I am particularly eager to see the updated results, including the plots for attention head patching, as they promise to provide further insights into the model’s mechanisms. Thank you for your thorough and thoughtful responses, and I look forward to the revised version of your work.
Thank you for your patience. We completed the attention patching results on each attention head for the experimental setup plotted in Figure 7(c) from layer 22 onward (i.e., where we first observe an effect from the MHSA function). The results demonstrate that the promotion of specific answer choices (the spikes in Fig. 7c) are attributed to 1-4 attention heads per layer. To give an example, you can see the result for patching each of the 32 attention heads at layer 24 here, where heads 1 and 19 give a cumulative effect to produce Fig 7c's pattern: https://imgur.com/a/9EfbAUz
We have added the graphs (depicting layers 22 to 31) as an additional figure to the Appendix and referenced this figure in our section 6 paragraph titled "Answer symbol production is driven by a sparse portion of the network" (unfortunately we are no longer able to update the PDF here for your viewing, but hope the above image temporarily suffices as evidence that we have done this). These results are valuable additional evidence to complement our vocabulary projection plots in Fig. 8 and demonstrate causal evidence of the unique roles of individual attention heads; thank you for the suggestion.
Our current PDF version that we uploaded earlier this week includes an updated caption for Figure 8 with more details about the experiments, as you requested; please see our general response above for more information. We are happy to answer any more questions or suggestions you may have.
Thank you for the detailed response and additional experiments. I find the new results on attention patching more convincing and very interesting. I’ve raised my score to reflect this improvement.
Out of curiosity (not affecting the score, I believe that the paper is already strong), how long does it take to run the head-patching experiments for a single layer? Is it even possible to provide such computation for all layers?
Thank you again for your thorough revisions!
Thanks for your detailed engagement with our paper & consideration of our rebuttal.
For each patch, we perform 2 inference runs on GPU in parallel (one on the original prompt and one on the modified prompt) and then patch a hidden state from one run over to the other before completing the inference run, to see how the score changes as a result of the patch. For layerwise analysis, this is 32 paired inference runs over the dataset for a 32-layer model like Olmo 7B. In the naive implementation, patching each attention head adds a number-of-attention-heads multiplicative factor, so for Olmo 7B, you would now run the inference procedure over the dataset 32 heads × 32 layers = 1,024 times. This can be sped up slightly by caching the hidden states up to the layer where you are performing the intervention. There is also some nascent work on speeding patching up by using gradient-based approximation techniques (https://www.neelnanda.io/mechanistic-interpretability/attribution-patching), but we stuck to the purist approach here to avoid approximating anything.
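For reference, a minimal sketch of a single layerwise patch is below (assuming a Llama/Olmo-style Hugging Face model whose decoder blocks are exposed at model.model.layers; the module path, hook details, and the choice to patch only the final token position are illustrative rather than our exact code):

```python
import torch

@torch.no_grad()
def patch_layer_output(model, tokenizer, src_prompt, dst_prompt, layer, answer_ids):
    """Cache one decoder block's output from the src run and splice it into the dst run.

    answer_ids: dict mapping answer symbols (e.g. "A") to their token ids.
    """
    src = tokenizer(src_prompt, return_tensors="pt").input_ids
    dst = tokenizer(dst_prompt, return_tensors="pt").input_ids
    block = model.model.layers[layer]            # architecture-dependent module path
    cache = {}

    def save_hook(module, inputs, output):
        # decoder blocks typically return a tuple; hidden states are element 0
        cache["h"] = output[0][:, -1, :].clone()

    handle = block.register_forward_hook(save_hook)
    model(input_ids=src)                         # run 1: cache the final-position hidden state
    handle.remove()

    def patch_hook(module, inputs, output):
        hidden = output[0]
        hidden[:, -1, :] = cache["h"]            # overwrite with the cached activation
        return (hidden,) + output[1:]

    handle = block.register_forward_hook(patch_hook)
    logits = model(input_ids=dst).logits[0, -1]  # run 2: complete inference with the patch
    handle.remove()

    # score of interest, e.g., the logit on each answer-choice token
    return {sym: logits[tok_id].item() for sym, tok_id in answer_ids.items()}
```

Per-head patching applies the same splice to an individual head's contribution inside the MHSA block, which is where the extra number-of-heads factor comes from.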
The paper is devoted to the analysis of the internal processing of Transformer LLMs when answering the Multiple Choice Question Answering (MCQA) task. First, it is shown that the ability to answer such questions can be attributed to a few layers in the model. Second, this ability is implemented by a sparse set of attention heads. Third, this ability emerges at some point during the model's training. Finally, an interesting observation is that in the case of non-standard label symbols (e.g. QXYZ instead of ABCD), the model first tries to predict the typical characters corresponding to the answer position, which are replaced with the correct character in higher layers.
The analysis is performed on three MCQA datasets: MMLU, HellaSwag, and the synthetic Colors dataset. As for models, three families are considered: Llama, Qwen, and Olmo. For deeper analysis, the models with consistent results (i.e., robust to answer order and label perturbations) on all three datasets are chosen.
Strengths
- The paper is clearly written. All the claims of the paper are supported by experiments, and the results are sound.
- The claimed properties are consistently observed across datasets and model families.
- The observed results are interesting and help to understand the internal mechanics of LLMs.
- Although the methods used in the paper are not novel, the quality of the presented analysis is quite high. I especially appreciate the models' robustness analysis and the distinction drawn between probabilities and logits.
Weaknesses
- The range of considered model sizes is limited (0.5B-7B)
- Although the observations presented in the paper are clear and consistent, they don't provide any real understanding of how the model works. Besides, the analysis is limited to the Multiple Choice Question Answering task, and there is no attempt to extend it to a more general understanding of the inner mechanics of Transformers.
Questions
- In Sec. 3.4, why is the threshold of 20%+ above random (i.e., 45%) chosen? Is there any reason behind this choice?
- When evaluating 3-shot accuracy with alternative symbols (e.g. QXYZ), which symbols are used for the in-context examples?
- Do you have any ideas which of the observed phenomena can be extended to other tasks, beyond MCQA?
Thanks for all of your positive feedback!
"The range of considered model sizes is limited (0.5B-7B)”
The only other sizes of Llama 3.1 available are 70B and 405B, and Olmo does not have any model sizes apart from those we tested (1B and 7B). We included Qwen 2.5 0.5B and 1.5B sizes in the paper to provide a size contrast to the 7B and 8B models; we feel these 3 model families and 5 sizes are fairly representative.
While we unfortunately did not have the resources to scale to super large (70B+) sizes, we note this is the case in virtually all current academic mechanistic interpretability research. However there are some promising up-and-coming initiatives (such as https://arxiv.org/abs/2407.14561) to allow for analysis at larger scales soon, so we hope we can scale in future work.
Your questions:
"why the threshold on 20%+ random (i.e. 45%) is chosen?”
Good question. 20% represented a reasonable trade-off to us between selecting high-performing models and having a sufficiently representative sample of models, but the exact threshold doesn't matter. We'll reword this in the updated PDF (coming soon) to say that we selected the best-performing model from each model family for further analysis, and that they all perform sufficiently above random on all 3 tasks (in this case, >20% above random).
"Evalusting 3-shot accuracy with alternative symbols (e.g. QXYZ), which symbols are used for in-context examples?”
We use the same alternative symbols for the in-context examples. See footnote 4 on line 214.
"Do you have any ideas which of the observed phenomena can be extended to other tasks, beyond MCQA?”
This is an open question! We don’t want to extrapolate too much, but there is some initial evidence that models share circuit components across tasks (such as https://arxiv.org/abs/2310.08744). This is an important direction for future work.
Thank you for engaging with us during the rebuttal period, and for your valuable comments! We have made the following changes to the PDF (marked in red for your easy viewing):
Writing:
- Changed language about 20%-above-random model selection threshold in the final paragraph of Section 3.4 (Reviewer E4U5)
- Fixed typos and format of some citations (Reviewer xW8X)
- Updated the last sentence of the abstract and the last paragraph of Section 7 to soften the claim about poor performing models (Reviewer xW8X)
- Updated the caption of Fig. 1 for clarity (Reviewer xW8X)
- Added model accuracy #s to Fig. 4 (Reviewer xW8X)
- Provided additional details about the metrics in Fig. 8 at the end of Section 6 (Reviewer xW8X), updated caption for clarity and made subfigures aligned (Reviewer FfQL)
- Added some discussion on the strengths and weaknesses of causal tracing vs. vocab projection in Appendix A.3 (Reviewer WzgP)
Experiments/results:
- To further support our claim that one means by which models produce OOD answer choice symbols is by first operating in the space of familiar answer tokens (A/B/C/D), we have done the following (suggested by Reviewer xW8X):
- We added experiments on the Q/Z/R/X prompt for 2 additional models to the Appendix: Figs. 20 and 21 for vocabulary projection, Fig. 15a + 15b for causal tracing. We observe that Qwen 2.5 1.5B has a similar pattern as Olmo 7B Instruct but Llama 3.1 does not, indicating that the effect is not unique to Olmo, but also not general.
- We added experiments on another random set of letters (“OEBP”) for Olmo 7B Instruct in Figure 21 and Figure 15c. Comparing to the first row of Fig. 6, the patterns are largely similar, indicating that the finding is not contingent on a specific random-letter prompt.
- We updated the text in various places (marked in red, particularly the last 2 paragraphs of Section 5) to reflect these results.
- We updated Figs. 9, 21, 22 (now Figs. 9, 26, 27) to be 3-shot instead of 0-shot, for consistency with the rest of the paper (Reviewer WzgP). The observed trends are largely similar to the previous 0-shot version, which may be due to the fact that adding in-context examples barely changes performance for this model (+4% MMLU accuracy, -2% HellaSwag, Colors stays the same).
- Due to some compute constraints, we're still working on adding the plots for attention head patching requested by reviewer FfQL and will update again later.
New results for Qwen 2.5 and LLaMA 3.1 are very interesting. I wonder why LLaMA behaves differently from OLMO and Qwen in the experiments with Q/Z/R/X symbols for answer choices. Maybe Qwen and OLMO were over-tuned on some MCQA tasks with A/B/C/D letters, and this is why they "want" to assign some scores to these "familiar" letters before going to Q/Z/R/X, while LLaMA is less "familiar" with such tasks and thus less inclined toward the A/B/C/D format? I don't know for sure; it's just a hypothesis.
Anyway, I have increased my score from 5 to 6 due to the new results, which make the paper more valuable.
Thanks for your engagement with our paper & rebuttal. It's an interesting hypothesis you put forward, and one we hope to understand better in future work.
This paper investigates how LLMs solve Multiple Choice Question Answering (MCQA) tasks through mechanistic interpretability analysis. The key findings show that specific middle transformer layers and sparse attention heads play critical roles in selecting correct answers, with some models using a two-stage process when handling non-standard answer symbols (e.g., Q/Z/R/X instead of A/B/C/D). The paper's main strengths lie in its comprehensive evaluation across multiple models (Olmo, Llama, Qwen) and datasets (MMLU, HellaSwag, Colors), robust experimental design including permutation tests, and clear visualization of results. While the work is limited to relatively small models (0.5B-7B parameters) and focused specifically on MCQA rather than broader model mechanics, the thorough analysis and consistent findings across different models make this a valuable contribution to understanding LLM behavior. The paper merits acceptance due to its sound methodology, clear presentation, and important insights into how transformer models process structured multiple choice tasks.
Additional Comments from the Reviewer Discussion
The discussion period led to several improvements and clarifications. Reviewer FfQL requested attention head patching experiments to complement the layer-wise analysis, which the authors addressed by adding new figures showing individual attention head contributions. Reviewer xW8X questioned the generalizability of the two-stage prediction process finding, leading the authors to add experiments with additional models (Qwen 2.5, Llama 3.1) and alternative answer symbols (OEBP), strengthening their claims. Reviewer WzgP raised concerns about using 0-shot versus 3-shot examples in certain experiments, which the authors addressed by updating multiple figures for consistency. The reviewers generally responded positively to these changes, with reviewer FfQL and xW8X explicitly increasing their scores after seeing the additional results. Several reviewers maintained high confidence in their assessments throughout the discussion. The addition of attention head patching and generalization experiments reinforces this paper.
Accept (Spotlight)