PaperHub
Overall rating: 7.5/10 (Poster, 5 reviewers; scores 5, 5, 4, 5, 4; min 4, max 5, std 0.5)
Average confidence: 3.6 · Novelty: 3.4 · Quality: 3.0 · Clarity: 2.8 · Significance: 2.6
NeurIPS 2025

Uncovering a Universal Abstract Algorithm for Modular Addition in Neural Networks

OpenReview | PDF
Submitted: 2025-05-12 · Updated: 2025-10-29
TL;DR

We show that deep neural networks across architectures and training conditions all instantiate the same abstract algorithm for modular addition.

Abstract

We propose a testable universality hypothesis, asserting that seemingly disparate neural network solutions observed in the simple task of modular addition actually reflect a common abstract algorithm. While prior work interpreted variations in neuron-level representations as evidence for distinct algorithms, we demonstrate---through multi-level analyses spanning neurons, neuron clusters, and entire networks---that multilayer perceptrons and transformers universally implement the abstract algorithm we call the approximate Chinese Remainder Theorem. Crucially, we introduce approximate cosets and show that neurons activate exclusively on them. Furthermore, our theory works for deep neural networks (DNNs). It predicts that universally learned solutions in DNNs with trainable embeddings or more than one hidden layer require only $\mathcal{O}(\log n)$ features, a result we empirically confirm. This work thus provides the first theory‑backed interpretation of multilayer networks solving modular addition. It advances generalizable interpretability and opens a testable universality hypothesis for group multiplication beyond modular addition.
Keywords
mechanistic interpretability, interpretability, universality, modular addition, group theory

Reviews and Discussion

Review (Rating: 5)

The paper presents a universal algorithm - through the notion of approximate cosets - to explain learned feature representations in models solving modular arithmetic.

Strengths and Weaknesses

Strengths:

  • The abstraction is useful since it unifies previous interpretations of learned features in modular arithmetic datasets and provides a universal, testable hypothesis.
  • The authors perform an extensive set of experiments, across various architectures and a range of moduli, which shows that their claims hold universally.

Weaknesses:

  • It would be useful if Section 4.1, Approximate cosets, can be rewritten to distinguish between precise cosets and approximate cosets, maybe with an example.
  • It wasn't immediately obvious why activation on approximate cosets can help the model solve modular arithmetic.
  • Missing reference to previous work: the order-1 sinusoids in layer 1 were analyzed in [1], Section 2.2.

Questions

  • Definition 4.2 doesn't clarify what $c, d$ are in the particular context. In particular, how does $d$, which is a function of $f, n$, relate to $c$?

  • The authors show that the simple neuron model activates on approximate cosets with ReLU activation. How does this extend to smooth activations like tanh and quadratic?

  • The assumption that training converges to simple neuron models, may not necessarily be true, as Ref. 6 in the paper (Fig 5) shows that the final trained model differs from the "simple neuron model". This is however not in conflict with the accuracy remaining unchanged in Fig. 6 of the paper, since the simple neuron model can give high accuracy but high MSE loss.

  • In Section 4.4 "depth's effect on neural preactivations", what exactly do the authors mean by fitting cos(a+b), cos a + cos b ? What are the fitting parameters in each case?

Limitations

  • While the paper addresses applications to discrete group operations, insights on how to generalize this notion for arbitrary tasks, including continuous groups, are missing. For example, what is the equivalent of the "simple neuron model" for a continuous group like $U(1)$?

  • Apart from that, I think the authors acknowledge other important limitations.

[1] Doshi, D., Das, A., He, T., & Gromov, A. (2024). To grok or not to grok: Disentangling generalization and memorization on corrupted algorithmic datasets. In Proceedings of the International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=UHjE5v5MB7

Overall, I feel the paper does a good job of tackling an important question, but could do with some improvements.

Justification for Final Rating

The paper is significant and proposes a "top-down" approach to unify previous interpretability research in modular arithmetic.

Formatting Issues

N/A

Author Response

Thank you for your thoughtful review. Your questions and comments reflected a clear understanding of our work’s originality, and we’re grateful for the opportunity to clarify several key points. We also appreciate your recognition of the value of our extensive experimentation and our unified theory reconciling competing interpretations in the literature.

Weaknesses

Confident that we made a significant contribution, we included Appendix A to rigorously and intuitively explain the mathematics underpinning our framework. We believe Conjecture 4.9 is of substantial importance, and our goal is to make it accessible to a broad range of researchers. Notably, Appendix A directly addresses all three weaknesses raised in your review.

W1. It would be useful if Section 4.1, Approximate cosets, can be rewritten to distinguish between precise cosets and approximate cosets, maybe with an example.

We already provide worked examples in the main paper and in the appendix. In the main paper:

  • Figure 2 provides 3 visual examples to distinguish neurons that learn precise cosets vs those that learn approximate cosets.

In the appendix:

  • Appendix A.2 "Additional mathematical background" walks the reader through group theory and worked examples.
  • Appendix A.2.1 "Examples: cosets, Cayley graphs, step size d" builds on the intuitions developed in A.2 to ensure readers have the background to understand the main text.

This should provide the relevant background to understand the visual examples in Figure 2. We will add a link to Appendix A.2.1 in the caption of Figure 2 to make this clearer to readers.

Elaboration to ensure your understanding. Note that in Figure 2, both the coset and approximate-coset neurons activate on half of the Cayley graph (panels 1 and 4). The coset neuron (panels 1-3) activates most strongly on the coset a mod 6 = 0, second most strongly on the cosets a mod 6 = 1 and a mod 6 = 5, and is 0 otherwise. Panels 4-6 show a neuron that learned an approximate coset centered on a = 0. The approximate-coset neuron doesn't have discrete level sets like the coset neuron, so it activates on the half of the Cayley graph centered on the group element it best understands. Thus, the approximate-coset definition generalizes cosets to group elements that are close on the Cayley graph.

Figure 2 also shows that approximate cosets have no equivalence classes, i.e. the neuron doesn't activate on nice level sets as in the coset case (panel 2 shows 4 level sets). We show that coset neurons don't have invertible remappings, so all the points collapse onto 6 equivalence classes (giving 4 level sets in panel 3), whereas approximate-coset neurons have invertible remappings (bijective; panel 6). Also note that the 6 equivalence classes of the coset neuron are shown as circles overlaid on the circle graph in panel 1, and everything is color-coded to make clear where the points in each coset (or approximate coset) come from.

W2. It wasn't immediately obvious why activation on approximate cosets can help the model solve modular arithmetic.

Please see Figure 10 in Appendix A.2.1. Now, suppose we have a 1-layer MLP with an embedding matrix. The preactivations of a neuron are the sum of two sinusoids, one for a and one for b. These two sinusoids always have the same frequency (as our R^2 fits support). Suppose that a neuron has the leftmost sinusoid for both a and b (on the x axis). Then the preactivation of this neuron is cos(a) + cos(b). Since cos(0) = 1 is the maximum, when a = b = 0 this neuron will output +2. Now, the weight connecting this neuron to the output logit at 0 will be large and positive. Thus this neuron used the coset information of a and b to activate strongly on the coset of the answer (0 + 0 mod 6 = 0).
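
To make this concrete, here is a minimal numerical sketch of the simple-neuron picture described above (our illustration, not the authors' code), using modulus n = 6 and frequency 1, with the explicit 2*pi*a/n argument rather than the shorthand cos(a):

```python
import numpy as np

n = 6
a = np.arange(n).reshape(-1, 1)   # all possible inputs a
b = np.arange(n).reshape(1, -1)   # all possible inputs b

# Hypothetical frequency-1 "simple neuron" preactivation: one sinusoid per input,
# both with the same frequency, as described in the answer above.
pre = np.cos(2 * np.pi * a / n) + np.cos(2 * np.pi * b / n)
post = np.maximum(pre, 0.0)       # ReLU

# The neuron fires most strongly when a = b = 0, i.e. exactly when the answer
# (a + b) mod 6 = 0, so its activation carries coset information about the answer.
print(post[0, 0])                                    # 2.0, the maximum
print(np.unravel_index(post.argmax(), post.shape))   # (0, 0)
```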

W3. Missing reference to previous work : the order-1 sinusoids in layer 1 was analyzed in [1], Section 2.2.

We actually do cite [1] in Appendix A.1. That said, we will add a citation to [1] in the main text where we introduce the simple neuron model.

Questions

Q1. Definition 4.2 doesn't clarify what c, d are in the particular context. In particular, how does d which is a function of f, n relate to c?

Did you mean Definition 4.3, since c is only introduced after Definition 4.2? c can be thought of as the "center" or starting point: where a neuron activates most strongly. d is the step size, a function of f and n as you point out, and it traces out (approximate) cosets whose elements are close. These are independent of each other: d relates to distances, and c just relates to a starting point. We can make this clearer in the manuscript.

Q2. The authors show that the simple neuron model activates on approximate cosets with ReLU activation. How does this extend to smooth activations like tanh and quadratic?

If the neuron preactivations are still sinusoid-based periodic functions, our theory will transfer; e.g. the theoretical results of Gromov and Morwani et al. are covered by our theory.

An example of our theory working a bit “out of distribution” can be seen in Figure 28 in Appendix G.4. Notably, G.4, G.5, G.6 study fine-tuning neurons that learn sawtooth functions. When neurons learn these functions, they break the interpretations of all prior work. We do not think their existence is practically relevant. This is because they only show up with bad hyperparameters: they would never be found in practice due to corresponding to poor local minima (hyperparameter tuning for good settings avoids them).

Q3. The assumption that training converges to simple neuron models, may not necessarily be true, as Ref. 6 in the paper (Fig 5) shows that the final trained model differs from the "simple neuron model". This is however not in conflict with the accuracy remaining unchanged in Fig. 6 of the paper, since the simple neuron model can give high accuracy but high MSE loss.

Please see the answer to question Q2, particularly Appendices G.4, G.5, G.6. The regions corresponding to low R^2 values are on the edge of where it's possible to train a network, because this is the region where the hyperparameters cause overfitting of the training set. Thus, neurons can learn sawtooth functions and become what we call "fine-tuning" neurons because they are overfitting the training set. When you replace them with a single sinusoid, the network still gets the correct answer because the extra "fine-tuning" they were doing was not critical to the network selecting the correct answer. We put this section in the appendix due to our comments above: we do not believe these are practically relevant since proper training setups will never find them. They can only be found by exhaustively searching with hyperparameters that are bad, but just good enough to learn.

Q4. In Section 4.4 "depth's effect on neural preactivations", what exactly do the authors mean by fitting cos(a+b), cos a + cos b ? What are the fitting parameters in each case?

We are fitting the amplitudes ($A_a$, $A_b$) and phases ($\phi_a$, $\phi_b$) of two order-1 sinusoids, e.g. $g(a,b) = A_a \cos\left(\frac{2\pi f a}{n} + \phi_a\right) + A_b \cos\left(\frac{2\pi f b}{n} + \phi_b\right)$. We assume the frequency is the same for both sinusoids (which it is, or else our $R^2$ would be low and the prediction would not remain correct after replacing the neurons with the fit). We get the frequency $f$ deterministically from the discrete Fourier transform (DFT), i.e. it is not fit.

Similarly, for order-2 sinusoids we fit the amplitude and phase, and take the frequency $f$ from the DFT: $g(a,b) = A \cos\left(\frac{2\pi f (a+b)}{n} + \phi\right)$.
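
For readers who want to see this recipe end to end, here is a small sketch of the fitting procedure (our reconstruction with illustrative names, not the authors' code). It recovers the frequency from the DFT of a one-dimensional slice and fits only amplitude and phase by linear least squares, via $A\cos(\theta + \phi) = A\cos\phi\,\cos\theta - A\sin\phi\,\sin\theta$; the paper's fit applies the same idea to a sum of two such terms, one per input.

```python
import numpy as np

def fit_order1(values, n):
    """Fit A*cos(2*pi*f*x/n + phi) to values[x], x = 0..n-1, with f taken from the DFT."""
    centered = values - values.mean()
    spectrum = np.fft.rfft(centered)
    f = int(np.argmax(np.abs(spectrum[1:])) + 1)       # dominant frequency (not fit)
    x = np.arange(n)
    basis = np.stack([np.cos(2 * np.pi * f * x / n),
                      np.sin(2 * np.pi * f * x / n)], axis=1)
    (alpha, beta), *_ = np.linalg.lstsq(basis, centered, rcond=None)
    A, phi = np.hypot(alpha, beta), np.arctan2(-beta, alpha)
    return f, A, phi

# Example: recover frequency 3, amplitude 1.5, phase 0.4 from a synthetic preactivation.
n = 97
x = np.arange(n)
f, A, phi = fit_order1(1.5 * np.cos(2 * np.pi * 3 * x / n + 0.4), n)
print(f, round(A, 3), round(phi, 3))   # 3 1.5 0.4
```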

Limitations

While the paper addresses applications to discrete group operations, insights on how to generalize this notion for arbitrary tasks, including continuous groups, is missing. For example, what is the equivalent of the "simple neuron model" for a continuous group like U(1)?

​​We believe the generalization for a continuous group is projections of group representations. See Appendix A.2 for where we introduce group representations and Appendix D where we discuss projections of representations as a more general theory.

Final thoughts

We once again thank you for your review. We hope in light of our clarifications you’ll consider increasing your score and we look forward to our continued discussion.

Comment

I thank the authors for their detailed explanation and for pointing out the appropriate sections to look at. However, I wanted to clarify that for

It wasn't immediately obvious why activation on approximate cosets can help the model solve modular arithmetic.

my focus was on why approximate cosets can help solve modular arithmetic, as opposed to the precise coset example provided in the rebuttal.

Comment

Thanks for clarifying your question. We're happy to explain further and truly appreciate our continued discussion. Thanks a lot.

The classical Chinese Remainder Theorem (CRT) solves modular arithmetic by utilizing O(log(n)) types of cosets: one type for each prime factor of the modulus. If every frequency learned by a neural network cleanly divides the modulus, then every neuron in the network has learned exact cosets. In this case, the network has implemented the "sinusoidal Chinese Remainder Theorem" given as Remark 4.5 in Section 4.2. Furthermore, since there is a one-to-one correspondence between cosets and frequencies, and the CRT has O(log(n)) types of cosets, the network must use O(log(n)) frequencies. Thus, such a network has implemented the classical CRT.
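
As a standard worked illustration of this classical picture (ours, not taken from the paper), take $n = 6 = 2 \cdot 3$: the two coset types jointly pin down the answer,

$$c \equiv 1 \pmod 2, \qquad c \equiv 2 \pmod 3 \;\Longrightarrow\; c \equiv 5 \pmod 6,$$

since $5$ is the only residue in $\{0, \dots, 5\}$ lying in both cosets $\{1, 3, 5\}$ and $\{2, 5\}$.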

This does not happen across all random seeds because neurons can learn a frequency that does not divide the modulus. In these situations, approximate cosets don't just help: they are necessary to show that the network still performs operations analogous to the sinusoidal Chinese Remainder Theorem. Indeed, Theorem 4.7 shows that O(log(n)) random frequencies can solve modular arithmetic. This matches our scaling experiments on networks and moduli, and also matches the efficiency of the classical Chinese Remainder Theorem. Since approximate cosets are a generalization of cosets (all cosets are approximate cosets), they are necessary in order to derive Algorithm 4.6.

It follows by construction from Theorem 4.4 (neurons activate only on approximate cosets) and Theorem 4.7 (O(log(n)) random frequencies are needed) that all networks trained on modular addition are implementing Algorithm 4.6. This is empirically validated across a substantial breadth of experimental settings that prior work did not cover. As a result, we discover counter-examples to prior work (fine-tuning neurons, Fig. 28, Appendix G.4) that activate on approximate cosets but break the interpretations of all prior work.

Importantly, we emphasize that the notion of approximate cosets is not ad hoc, but is motivated by and grounded in real structural considerations. As Figure 2 shows, even when the frequency doesn’t divide the modulus, approximate cosets still encode algebraic structure. Indeed, this is why approximate cosets are so powerful and why they allow us to prove theorems of surprising strength. To our knowledge, this is the first rigorous result perfectly matching empirical results in deep networks on a problem that’s not linearly separable (we prove: O(log(n)) frequencies are needed and empirically verify it in scaling experiments with R^2 scores >=0.99 in both MLPs and transformers!).

In light of these clarifications, we sincerely hope that you consider increasing your score to reflect that our framework unifies 5 preceding papers under one concise and all-encompassing theory while simultaneously providing fine-tuning neurons as a counter-example to the models of all prior work. Also, we provide the first theory on modular addition for deep networks; previously, the best theoretical bound addressed 1-layer MLPs (Morwani et al. and Gromov) and is O(n) frequencies.

Comment

Thanks for the detailed explanation. I have increased my score accordingly.

Review (Rating: 5)

The paper introduces a unifying theory to show how models learn to sum two numbers mod $n$.

They introduce the aCRT idea and show that a NN model requires $O(\log n)$ features, which is the upper bound on the number of prime factors dividing $n$. The aCRT theory, in conjunction with the experimental section, also shows why NNs solve modular addition.

Strengths and Weaknesses

S1. The most important contribution is how authors unified the theory behind modular arithmetic disseminated across multiple works into a single theory. Also, although I didn't check all the details of the theorems, it sounds pretty right.

S2. The authors opened a new route for mechanistic interpretability, looking not only at the model weights or PCA-style analyses, but more holistically at the entire model itself.

S3. The experimental section is just enough to show results that are coming from the theory.

W1. I suggest the authors change the "spectacular" wording used in the abstract: I am not sure this understanding cannot be easily extended to other modular tasks, especially when other operations are introduced.

Observation. I probably need more explanation of how the model goes from cosets to the final output representation; it's not super clear from the writing. It's written in Remark 4.5 that "the argmax of ... selects the correct answer", but I am interested in clearly understanding the mechanism that selects the correct answer.

Questions

  • Q1. Do you have an explanation of why a 1-layer model cannot learn this? Is it related to the famous paper from Anthropic about induction heads and the non-learnability of the copy task by a 1-layer model?

  • Q2. Going from $O(\log n)$ to $O(1)$ features is extremely important. Can we find some particular versions of NNs with some form of induction heads to effectively use fewer features?

Limitations

yes

Justification for Final Rating

The authors justified and explained the minor weakness and answered the questions I had.

Formatting Issues

N/A

Author Response

Thank you for your review. We're especially glad you identified our unification of prior work across theory, interpretation, and architectures as the central contribution. That was precisely our goal. Prior interpretations were often cited as evidence against universality, presenting a fragmented picture; we show that they are simply different implementations by different architectures of the same algorithmic strategy, what we call an abstract algorithm. We also appreciate your recognition of our holistic, network-level analysis, and that our theoretical predictions were validated empirically.

Weaknesses

W1. I suggest the authors change the "spectacular" wording used in the abstract: I am not sure this understanding cannot be easily extended to other modular tasks, especially when other operations are introduced.

We thank the reviewer for this feedback. While the phrasing is somewhat ambiguous, we interpret it as questioning the generalizability of our findings to other modular or non-group tasks. If there are specific phrases in the abstract that appear overstated or unclear, we would welcome concrete suggestions. That said, we’d like to clarify that the abstract is intended to reflect both the core contributions of our work and the scientific philosophy behind them.

You identified the unification of prior work as the most important contribution, and we agree. Unifying previously incompatible interpretations was not only novel but necessary: without understanding the algorithmic structure in modular addition, a task once seen as fragmented, there is little hope of generalizing to more complex settings.

This resolution reframes modular addition: what were thought to be incompatible solutions now reveal a single underlying algorithm implemented differently across architectures. This provides a solid footing for future work on algorithmic understanding.

The abstract was designed to communicate how this unification was achieved through hypothesis-driven research. We began with a testable hypothesis: that neural networks learn an approximate Chinese Remainder Theorem (aCRT) on modular addition (lines 95–96). That hypothesis drove the rest of our investigation, which delivered:

  • A unifying algorithmic structure (aCRT) across three architectures and all hyperparameter settings, including depth
  • Multi-level analysis (neurons, clusters, networks) that verified this structure
  • A theoretical prediction (an O(log n) feature bound) validated empirically
  • A testable universality hypothesis for group tasks, supported by our insights and theory derived from this process

This work demonstrates that interpretability can follow the scientific method: make hypotheses, derive consequences, and validate them. That process enables interpretability to yield generalizable, falsifiable insights. On this foundation, we can hopefully move towards more complex tasks. But before this unification, it was certainly not obvious.

We do not claim that our findings trivially extend to non-group tasks. But having a fitting interpretation of how networks solve modular addition is an important stepping stone in this search.

Finally, we note that reviewer EPiJ acknowledged the potential of the work’s broader impact: "the work stands to be highly significant" and "this work will likely inform a lot of subsequent work and provide a useful perspective on the emergent behaviour of neural networks on algorithmic tasks."

We hope this clarifies the tone of the abstract. We are happy to revise its wording for clarity, but we stand by its core message as a faithful reflection of our contributions and approach.

A recent position paper gives context: our work is timely

The ICML 2025 position paper, “We Need an Algorithmic Understanding of Generative AI” [1], argues that the ML community must move beyond fragmented, bottom-up interpretability, which often focuses on individual neurons, components or circuits without a guiding hypothesis. Instead, they propose the prioritization of top-down hypothesis/theory-driven research that explains how models implement full algorithms. The authors write in reference to prior approaches: "current findings are still largely fragmented, and we lack a solid theoretical foundation for understanding how these various components come together to implement algorithms." Furthermore, they argue that such an approach would close the theory-interpretability gap, since both communities are operating rather disjointly. And this would lead to better understood, safer, and more efficient models.

Our work is timely and directly answers this call by providing such a theoretical foundation in the well-studied setting of modular addition, which up to this point had fragmented findings. We offer a clear instantiation of several core elements that [1] proposes as future goals:

[1] proposes → Our work
  • Identify algorithmic primitives → approximate cosets
  • Compose primitives into algorithms → approximate CRT
  • Testable computational hypotheses grounded in theory → 1. Theorem 4.7: predicts O(log(n)) features are required by deep networks on modular addition; 2. Conjecture 4.9: containing many diverse tasks of interest, e.g. sorting lists

Questions

I probably need more explanations on how model goes from coset to the final output representation, it's not super clear from the writing. It's written in Remark 4.5 that "the argmax of ... selects the correct answer", but I am interested to clearly understand the mechanism to select the correct answer.

The network implements a divide-and-conquer algorithm that is distributed across the neurons. Figure 2 in the main paper demonstrates that each neuron that learned a coset or approximate coset only activates on half of the Cayley graph (first and fourth panels). This means that each neuron acts as a question: "Which half of the Cayley graph is the answer in?" Since different cosets and approximate cosets have different frequencies, they correspond to different graphs. Theorem 4.7 shows that only log(n) graphs are needed to divide and conquer in this way and obtain the correct answer, while minimizing the loss sufficiently by Corollary 4.8, which gives that each additional frequency pushes the value on incorrect logits down exponentially. Therefore, the algorithm the network implements requires log(n) different graphs (different approximate cosets) and acquires the answer by asking log(n) questions: which half of the graph is the correct answer in?

Essentially, we’ve so far explained the post-activations, as the ReLU ensures the negative values are 0 on all vertices in the Cayley graph that are farthest away (in graph distance) from the element at the center of the approximate coset. After this non-linearity, the problem has become linearly separable and achieves the most efficient solution possible due to the divide and conquer strategy. Each neuron can now have its activation projected to the half of the logits that are closest to the coset element the neuron best understands.

Q1. Do you have an explanation of why a 1-layer model cannot learn this? Is it related to the famous paper from Anthropic about induction heads and the non-learnability of the copy task by a 1-layer model?

Yes. The reason 1-layer models learn O(n) features instead of O(log(n)) features is that 1-layer models do not implement the divide-and-conquer algorithm explained in the previous answer. Instead they learn something much more in the flavour of a dynamic program: they learn every frequency and use the frequencies as a lookup table to get the right answer. Yes, this directly relates to Anthropic's work on the induction head and builds on it by presenting a novel idea: deep transformers learn implementations of divide-and-conquer algorithms (we showed this in MLPs also).

Q2. Going from $O(\log n)$ to $O(1)$ features is extremely important. Can we find some particular versions of NNs with some form of induction heads to effectively use fewer features?

Can networks be more efficient, i.e. can they do better than O(log(n)) features and learn O(1) features? Theorem 4.7 tells us that it's possible for a network to learn this task with O(1) features, but the network will have very small margins (e.g. 1e-3) between the correct logit and the second-largest, but incorrect, logit. This means the cross-entropy loss will not be minimized, and thus the network will try to learn more features. This is why we performed the scaling experiment in Figure 3, which shows that for all moduli from 3 to 4999, neural networks of all architectures learned on average O(log(n)) features.

Indeed, a satisfying aspect of our work is showing how feature efficiency is exponentially increased from O(n) to O(log(n)) by depth.

Final thoughts

We thank you again for your review. We hope this addresses your concerns clearly. If so, we would greatly appreciate a reconsideration of your score.

References

[1] Eberle et al., "Position: We Need An Algorithmic Understanding of Generative AI", ICML 2025.

Comment

Thank you. On W1, I still believe this contribution is interesting for the modular addition task, and I also agree that this work may influence other subsequent work, but it needs to be shown that the methodology can still be applied to other, completely different tasks.

Would it be possible to expand more on Q1? In particular, which part in your work suggests that "transformers learn implementations of divide and conquer algorithms"?

Q2 is well explained, thanks.

Comment

Thank you for your thoughtful engagement. We’d like to address the final concern about generalization beyond modular addition (W1) and clarify our claim regarding divide-and-conquer algorithms (Q1).

On Generalization (W1)

We acknowledge that our methodology is focused on modular addition, and explicitly address this with Conjecture 4.9, to encourage future extensions, but we do not view this as a weakness because:

  • Our paper unifies five prior works (Nanda et al., Chughtai et al., Zhong et al., Gromov, Morwani et al.) on modular addition, resolving previously conflicting interpretations and other open questions
  • Additional tasks would harm the clarity of this contribution. Since the goal of this work is to reconcile and unify prior interpretations on modular addition, expanding to new tasks would distract readers, especially because prior work was scoped to modular addition
  • Even if our methodology doesn’t transfer to other tasks, our work provides a complete theory of how deep networks implement modular addition, something no prior work has achieved
  • We’re first to present a theoretical bound for deep networks on this task. Prior work addressed 1 layer MLPs

We recognize generalization is a natural next step, which is why we framed Conjecture 4.9 as a key forward-looking result. But within its scope, this work offers a definitive resolution to longstanding contradictions in the literature, especially those that challenged the universality hypothesis, and we believe this stands firmly on its own merit.

On Divide-and-Conquer Algorithms (Q1)

We gladly clarify a key result: "transformers can learn implementations of divide-and-conquer algorithms."

  • Theorem 4.4 proves that neurons activate only on approximate cosets, which correspond to paths on a graph (a data structure).
  • Each such path covers roughly half the vertices. Theorem 4.7 proves that only $O(\log n)$ such paths (i.e., frequencies) are needed to solve $(a + b) \mod n = c$. This is a significant improvement over prior theoretical work that addressed only 1-layer MLPs (not depth or transformers) and proved $O(n)$ frequencies (Gromov, Morwani et al.).
  • Since each neuron activates only on approximate cosets and $O(\log n)$ approximate cosets suffice, all networks have effectively implemented a version of Abstract Algorithm 4.6, which performs modular arithmetic via divide-and-conquer.

Our work thus unifies prior interpretations and is first to rigorously prove an example where deep networks (including transformers) implement divide-and-conquer algorithms through their weights, representing a substantial theoretical advancement.

Simplified Divide and Conquer.

Consider $(a + b) \mod 6 = c$.

There are two coset types:

  • mod 2:

    • $x \equiv 0 \mod 2$: $\{0, 2, 4\}$
    • $x \equiv 1 \mod 2$: $\{1, 3, 5\}$
  • mod 3:

    • $x \equiv 0 \mod 3$: $\{0, 3\}$
    • $x \equiv 1 \mod 3$: $\{1, 4\}$
    • $x \equiv 2 \mod 3$: $\{2, 5\}$

Let neuron-1 have frequency 3 and activate when the answer $c \in \{0, 2, 4\}$. It learns:

$$f_1(a, b) = \cos\left(\frac{6\pi a}{6}\right) + \cos\left(\frac{6\pi b}{6}\right)$$

This neuron activates maximally when both $a, b \in \{0, 2, 4\}$, guaranteeing $c \in \{0, 2, 4\}$.

Let neuron-2 have frequency 2 and activate when $c \in \{0, 3\}$. It learns:

$$f_2(a, b) = \cos\left(\frac{4\pi a}{6}\right) + \cos\left(\frac{4\pi b}{6}\right)$$

This function activates maximally when $a, b \in \{0, 3\}$, ensuring $c \in \{0, 3\}$.

Summing the outputs of these two neurons gives the following logits:

c = 0 → 4  
c = 1 → 0  
c = 2 → 2  
c = 3 → 2  
c = 4 → 2  
c = 5 → 0  

The $\arg\max$ correctly selects $c = 0$. Suppose neuron-3 activates for $c \in \{1, 4\}$ (e.g., by firing when $a \in \{0, 3\}$ and $b \in \{1, 4\}$). Then if neuron-1 and neuron-3 activate simultaneously, the $\arg\max$ would select $c = 4$ due to the higher logit. This logic continues as you add neurons corresponding to every coset.
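
A short script (ours, purely to sanity-check the worked example; the neuron weights and the hard-threshold neuron-3 are illustrative simplifications) reproduces the logit table above and the narrowing effect of adding neuron-3:

```python
import numpy as np

n = 6

def preact(freq, a, b):
    # Coset-neuron preactivation from the example: one cosine per input, shared frequency.
    return np.cos(2 * np.pi * freq * a / n) + np.cos(2 * np.pi * freq * b / n)

def logits_for(a, b, neurons):
    # Each neuron adds its ReLU activation to the output logits of the coset it learned.
    logits = np.zeros(n)
    for freq, coset in neurons:
        logits[coset] += max(preact(freq, a, b), 0.0)
    return logits

two_neurons = [(3, [0, 2, 4]), (2, [0, 3])]          # neuron-1 and neuron-2
print(logits_for(0, 0, two_neurons))                 # [4. 0. 2. 2. 2. 0.], argmax = 0

# Adding an (idealized) neuron-3 that fires when a is in {0, 3} and b is in {1, 4},
# i.e. when the answer c is in {1, 4}:
a, b = 0, 4                                          # true answer c = (0 + 4) % 6 = 4
logits = logits_for(a, b, two_neurons)
if a % 3 == 0 and b % 3 == 1:
    logits[[1, 4]] += 2.0
print(int(np.argmax(logits)))                        # 4, the answer shared by neuron-1 and neuron-3
```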

This demonstrates a divide-and-conquer strategy: each neuron rules out a large fraction of incorrect outputs, analogously to how binary search eliminates half the search space at each step. This example is with mod 6 and exact cosets. Approximate cosets generalize exact cosets by capturing cases where neurons learn frequencies that are not divisors of the modulus. Theorem 4.7 shows $O(\log n)$ approximate cosets are sufficient for networks to attain strong margins, matching the $O(\log n)$ cosets required by the classical Chinese Remainder Theorem.

As $n$ grows, the number of coset types is logarithmic, e.g.

  • mod 6 → 2 coset types
  • mod 3,628,800 → 10 coset types

We believe our work resolves a foundational problem and hope our rebuttal clarified its scope and significance. We find it deeply satisfying that networks converge to a generalization of the classical divide-and-conquer Chinese Remainder Theorem and hope the clarity of this result came through.

Comment

Thanks, this finally clarifies all my missing points. Happy to bump my rating to reflect the meaningful conversation.

Review (Rating: 4)

This paper examines the mechanism by which neural networks solve the modular addition task and advances a testable universality hypothesis. While prior studies have attributed diverse internal representations—such as the clock and pizza algorithms—to fundamentally different learned solutions, the authors argue that these can be unified under a single abstract algorithm, namely the approximate Chinese Remainder Theorem (aCRT). A key theoretical contribution is the derivation that deep neural networks (DNNs) require only $O(\log n)$ features to solve modular addition, matching the empirical results.

Overall Recommendation: The paper presents an interesting attempt to unify multiple mechanisms by which DNNs solve modular addition tasks. However, it lacks a clear articulation of how the proposed aCRT framework differs from and subsumes prior interpretations. Providing a clearer intuitive explanation would substantially enhance the contribution and may warrant a higher score.

Strengths and Weaknesses

Strengths

  • Clear and straightforward setup

  • The paper offers a conceptually compelling attempt to unify disparate neural network mechanisms for modular addition within a single theoretical framework, called the approximate Chinese Remainder Theorem. The motivation and ideas are interesting.

  • In illustrating how many frequencies are needed to instantiate the aCRT, Figure 3 provides strong empirical support for the theoretical result established in Theorem 4.7.

Weakness

  • The formal definitions (Definitions 4.1, 4.2, and 4.3) are difficult to follow and lack sufficient intuitive grounding. To enhance clarity and accessibility, it is recommended that the authors illustrate key concepts—such as step size, frequency normalization, and approximate cosets—through simplified, concrete examples. In addition, the manuscript would benefit from a clearer articulation of the motivation and underlying intuition behind these definitions.
  • From the reader’s perspective, the proposed explanation based on the approximate Chinese Remainder Theorem offers an alternative perspective on how neural networks solve the modular addition task. However, it remains unclear whether this explanation genuinely subsumes prior interpretations—such as the pizza and clock algorithms—or merely stands as another competing account. The manuscript lacks a clear theoretical or intuitive justification for why these previously observed mechanisms can be interpreted as specific instances of the proposed abstract algorithm. To strengthen the unifying claim, the authors are encouraged to provide deeper theoretical insights or illustrative examples demonstrating how and why these distinct algorithmic patterns emerge as manifestations of their framework.
  • The authors draw conclusions based on a narrow set of experimental conditions, while the phenomena they aim to unify were originally observed in diverse settings. The paper would be more convincing if it provided experimental evidence from the original settings used by others and demonstrated that the coset explanation holds consistently across different settings for the same task.

Questions

  • Could the proposed explanation be extended to modular addition tasks with longer length, i.e., $a_1 + \cdots + a_k \pmod{p}$?
  • Could the authors clarify, perhaps through a simple example, how the approximate Chinese Remainder Theorem is applied to modular addition? Traditionally, the Chinese Remainder Theorem addresses systems of congruences rather than modular addition itself.

Limitations

yes

Justification for Final Rating

The authors have addressed some of my concerns in their revisions. As a result, I am raising my score to 4.

Formatting Issues

no

Author Response

Thanks for reviewing our paper! We appreciate the time you took to engage with our ideas. We're glad you found our unification via the approximate Chinese Remainder Theorem (aCRT) conceptually compelling, and that you highlighted the clarity of our setup and the strength of the empirical support for our theoretical claims (e.g., Figure 3, Theorem 4.7). We're encouraged by your openness to increasing the score with further clarification.

Weaknesses

The authors draw conclusions based on a narrow set of experimental conditions, while the phenomena they aim to unify were originally observed in diverse settings. The paper would be more convincing if it provided experimental evidence from the original settings

We respectfully disagree that our conclusions rely on narrow experimental conditions and believe there may be a misunderstanding. Our analyses both replicate prior setups and substantially expand their scopes through a far more rigorous evaluation framework. Specifically, we vary key hyperparameters: depth, width, learning rate, L2 regularization, batch size, and training seed, training over 1 million networks in total. To our knowledge, no prior work has conducted such large-scale empirical verification. These experiments show that our theory holds robustly across diverse training settings.

To demonstrate the breadth of our experiments, and how our theory subsumes prior work, we highlight key empirical findings that go beyond all existing models:

  • Counterexamples to prior theories: We show that small batch sizes prevent the formation of representation matrices (Figure 13), producing clear counterexamples to the interpretation proposed by Chughtai et al. These are detailed in Appendix D.
  • Previously undocumented behavior (sawtooth functions): We identify a new class of representations, sawtooth functions, which arise at the boundary between successful and failed training. These functions are not captured by prior theories but are accounted for by ours, as they activate on approximate cosets (Appendices G.4-G.6).
  • Reanalysis of prior models: We directly evaluate publicly released checkpoints from Zhong et al. (“pizza” and “clock” transformers; see Appendix E.1). Using their methodology, we extend the analysis and resolve their open questions about circular embeddings. Specifically, we show that models do not learn non-circular embeddings, and that their proposed circularity metric is unnecessary (Appendix G.1).
  • In Appendix G we run 26 pages of additional results

These results reveal where previous interpretations fail and demonstrate that our framework is robust. For additional comparisons to prior work, please refer to our response to reviewer 1.

The formal definitions (Definitions 4.1, 4.2, and 4.3) are difficult to follow and lack sufficient intuitive grounding. To enhance clarity and accessibility, it is recommended that the authors illustrate key concepts—such as step size, frequency normalization, and approximate cosets—through simplified, concrete examples. In addition, the manuscript would benefit from a clearer articulation of the motivation and underlying intuition behind these definitions.

In the main text we provide examples of these key definitions: please see Section 3, and see Figure 2 for illustrations of the key definitions. For "simplified, concrete examples", please see Appendix A.2, titled "Additional Mathematical Background". For more examples and "a clearer articulation of the motivation and underlying intuition behind these definitions", see A.2.1, titled "Examples: cosets, Cayley graphs, step size d".

It remains unclear whether the aCRT explanation genuinely subsumes prior interpretations (e.g., pizza and clock algorithms), or merely stands as another competing account.

We believe the evidence provided demonstrates that the aCRT framework does not merely offer an alternative explanation; it subsumes all prior interpretations. Unlike previous work, which presents disjoint, architecture-specific theories (e.g., Nanda et al., Zhong et al., Morwani et al.), our theory offers a single, mathematically grounded explanation that holds across all tested architectures and training regimes.

This work represents a substantial contribution because it unifies previously disjoint theories, and introduces a new line of thinking. Prior to our study, the field had begun to accept that different architectures learn fundamentally different circuits, in part due to conflicting interpretations in the literature. For example, Zhong et al. has received significant attention for highlighting such divergence, leading to broader concerns that mechanistic interpretability may not generalize across architectures.

This concern is echoed in papers citing Zhong et al., including MoSSAIC: AI Safety After Mechanism (Farr et al.), which writes: "Researchers have successfully reverse-engineered these models and discovered that they learn a specific, intricate algorithm based on trigonometric identities and Fourier transforms, implemented via clock-like representations in the attention heads. The specific implementation is fundamentally tied to the architectural properties of the transformer substrate. A different architecture would almost certainly learn a different algorithm, rendering this detailed explanation obsolete."

Our findings directly challenge this conclusion. We show that different architectures do not learn fundamentally different algorithms; rather, they learn distinct implementations of a shared abstract strategy, formalized by the aCRT. In this sense, our framework reveals a common algorithmic structure underlying previously conflicting interpretations.

This insight also aligns with the argument in “Position: We Need an Algorithmic Understanding of Generative AI” (Eberle et al., ICML 2025), which emphasizes the need for theoretical foundations that explain how neural components work together to implement algorithms. We believe our paper is the first to rigorously demonstrate that neural networks can instantiate abstract algorithms in their learned weights. In doing so, we contribute two key insights:

  1. While it is widely hypothesized that neural networks may learn algorithms, we offer the first formal demonstration that they instantiate an abstract algorithm.
  2. Proving such behavior requires a new theoretical framework, one that characterizes the learned solution at an abstract level, as our notion of approximate cosets enables.

The manuscript lacks theoretical or intuitive justification for why these previously observed mechanisms can be interpreted as specific instances of the proposed abstract algorithm. To strengthen the unifying claim, the authors are encouraged to provide deeper theoretical insights or illustrative examples demonstrating how and why these distinct algorithmic patterns emerge as manifestations of their framework.

In Section 4.1 we introduce approximate cosets. These are more general than cosets: every coset is an approximate coset, but approximate cosets also include modular subsystems that do not perfectly divide the modulus. We then provide Fig. 2 as a visual example to help the reader geometrically understand the approximate coset definition, and we introduce Theorem 4.4, which states: a ReLU neuron only activates if it is activating on an approximate coset. Section 4.2 then provides a step-by-step guide on how to build the abstract approximate Chinese Remainder Theorem (CRT) algorithm utilizing Theorem 4.4.

This is done by first guiding the reader through a construction (Remark 4.5) that creates an abstract classical CRT algorithm that neural nets utilize only if they learn frequencies that are divisors of the modulus. The classical CRT (as you ask later in your review) only works for non-prime moduli because it utilizes cosets, which only exist if divisors of the modulus exist, but the approximate CRT works for frequencies that don't divide the modulus. Thus, to build the more general approximate CRT from the classical CRT (Remark 4.5), we only need to relax one thing: we replace cosets with approximate cosets. This gives Abstract Algorithm 4.6: the approximate CRT.

The theoretical justification: By Theorem 4.4, we know for a fact that all neurons activate only on approximate cosets. We know from number theory that a number can have at most O(log(n)) prime factors (e.g. consider 2^n, which has the factor 2 n times). So we must prove the aCRT can use O(log(n)) frequencies. Indeed, this is Theorem 4.7, and we empirically verify that networks have logarithmic growth rates with R^2 scores >= 0.989 in Figure 3 on moduli ranging from 3 to 4999.
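
A quick sanity check of the counting step above (ours, not part of the paper's experiments): every prime factor is at least 2, so a modulus $n$ has at most $\log_2 n$ prime factors counted with multiplicity, and a CRT-style decomposition therefore never needs more than $O(\log n)$ parts.

```python
import math

def prime_factors_with_multiplicity(n):
    """Return the prime factorization of n as a list, e.g. 12 -> [2, 2, 3]."""
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

# Every factor is at least 2, so the factor count is bounded by log2(n).
for n in range(3, 5000):
    assert len(prime_factors_with_multiplicity(n)) <= math.log2(n)
print("checked moduli 3..4999")
```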

The empirical justification: As explained above, the results in the paper show beyond any reasonable doubt in Figures 3-9 that across very diverse training settings our theory always explains the weights networks converge to. No previous paper tested its interpretation under as broad a variety of circumstances or reported this.

Questions

Could the proposed explanation be extended to modular addition of longer length i.e., a_1+a_2+...+a_k (mod p)?

Yes. Simple neurons become a sum of k sinusoids, one for each a_i.
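
Spelled out in the same notation as the two-input simple neuron (our transcription of the statement above; the shared frequency $f$ mirrors the two-input case), such a neuron's preactivation would take the form

$$g(a_1, \dots, a_k) = \sum_{i=1}^{k} A_i \cos\!\left(\frac{2\pi f a_i}{n} + \phi_i\right).$$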

Perhaps through a simple example, how is the approximate Chinese Remainder Theorem applied to modular addition? Traditionally, the Chinese Remainder Theorem addresses systems of congruences rather than modular addition itself.

Please see Appendix A.2 dedicated to examples, particularly A.2.1, which visually illustrates how the Chinese Remainder Theorem (CRT) could be computed by a network that learned sine functions in its neurons. After seeing this, we trust it will be clear how approximate cosets can replace cosets, giving our core result: the approximate CRT.

Final thoughts

We appreciate your suggestions to improve clarity and your openness to increasing the score. We hope this rebuttal addresses your concerns and are happy to provide further clarifications.

Comment

Hi Reviewer,

As the discussion period is wrapping up, we wanted to check in and see if you need any further clarifications from us. You had mentioned that you might consider increasing your score if your questions were addressed, and we’ve done our best to respond to your concerns. Please let us know if there's anything else we can help clarify.

Review (Rating: 5)

This work presents a theory of how ReLU neural networks (MLPs and Transformers) learn to perform modular addition. The algorithm implemented by the network is termed the approximate Chinese Remainder Theorem (aCRT). This is achieved by forming structures called approximate cosets - a generalisation of cosets, which define equivalence classes within the cyclic group defined by the modulo operator. The generalisation to approximate cosets then lets sets of "close" cosets be treated equivalently. Empirical experiments show that neurons in networks trained to perform modular addition can be replaced by "simple neurons" - neurons specialised to specific frequencies - without a significant drop in performance, supporting the notion that the networks are learning the aCRT.

Strengths and Weaknesses

Strengths

Quality

The work does a good job of covering the theoretical background on group theory. Sections 1, 2 and 3 are particularly effective at contextualising the work and introducing the technical details of modular arithmetic. The assumptions necessary for the theory are stated clearly and the design of the experiments is well suited to establishing the validity of the assumptions. Precise definitions are given for the necessary concepts. Overall, the goals of the work are clear and the theory presented is precise, with the necessary empirical support to justify the framework.

Clarity

I reiterate that I think the first few sections (up to Section 4) are very well done and support clarity immensely. Figures, particularly Figure 2, are helpful, with sufficiently detailed captions to aid understanding. I also appreciated the helpful analogies, which broaden the applicable audience of the work by giving more readers something to relate to - for example, the analogy to breadth-first search on Lines 161 to 168. The notation used is intuitive and consistent, which aids clarity.

Significance

The claimed findings of this work unify a number of different perspectives on a topic which has received quite a bit of recent interest. Thus, the work stands to be highly significant. I do have some questions about the precision of the framework, but taking the claims on their own merit, I think this work will likely inform a lot of subsequent work and provide a useful perspective on the emergent behaviour of neural networks on algorithmic tasks. I also find the comparison between MLPs and Transformers here useful, and the general lack of a very clear behavioural difference is interesting, as I usually only associate these sorts of tasks (and mechanistic interpretability broadly) with Transformers.

Originality

Similar to the point on significance, I think that the combination of recent ideas from the literature is original and results in a potentially powerful framework for understanding the sub-circuits responsible for modular addition. The approximation to the CRT is also new, so there is a decent amount of originality overall. The originality is also supported by the detailed background in Section 2, which helps make clear what is new to this work.

Weaknesses

Clarity

Section 4 and the following sections do lose some of the clarity which was fantastic up to that point. For example, the equations at the bottom of page 3 are foundational to the rest of the work (introducing the very concept of a simple neuron) but are introduced quickly in comparison to the previous concepts. I also found the notation $\omega(A, B)$ a bit unintuitive, as it is the dot product of an embedding $U$ with a set of weights not depicted as parameters to the function. Secondly, Theorem 4.7 and Corollary 4.8 are the main theoretical results and yet they are stated quite bluntly, with little attempt to provide intuition or any sort of proof sketch. Some intuition on why the key equation of interest here is $m' - h(k) > \delta m'$ would be very helpful. As a minor point, I also think writing this as $m'(1-\delta) > h(k)$ would be more intuitive and in line with the point of the theorem.

Quality

Some of the results of the theory appear to be true by construction, or at least the conclusions drawn seem not to be wholly justified. I am hoping these concerns are more a consequence of my own misunderstanding. Firstly, Figure 1 shows the preactivations of a neuron where the frequency has been normalised. It is claimed that this figure shows the qualitative equivalence of the neurons, but to me everything about it seems different except for the frequency. The range of values and the phase are different, so it is unclear what qualitative property I should be focusing on here. Secondly, the general approach of Section 4.4 is a bit of a concern, but also the statement on Lines 188 and 189 which says that all architectures trained to perform modular addition are abstracted by approximate cosets. I am not certain what the alternative would be - is this not true by virtue of the network being trained to perform modular addition itself? This alone does not seem to be a tight enough correspondence to draw conclusions about the actual circuitry of the network itself. Relatedly, in Figure 7 for example there are points where the $R^2$ value is as low as 0.2 to 0.3 between the network and the simple-neuron replacement, but there is still a green tick (this occurs in Clock: $1^{st}/4$, for example). The green ticks representing no loss in accuracy are used as the key point demonstrating that simple neurons are a good abstraction, but this just makes me think the green ticks are not a precise enough measure of success when they can admit such a high range of $R^2$ values. Finally, in Theorem 4.7, it appears strange to me that the bound is given in terms of the maximum output logit across the dataset ($m'$) when it is compared to individual output logits ($h(k)$). I appreciate that the point is to show that sufficiently large bounds exist between the maximum logit and all possible incorrect logits, but the fact that this is not done per data point seems to make this quite imprecise. I suspect this question has more to do with my difficulty with the clarity of this portion of the paper.

Questions

  1. What qualitative differences should I be noting in Figure 1?
  2. Why is an $R^2$ score of 0.3 acceptable in Figure 7 as long as the green tick is there?
  3. How does the presence of a green tick in Figures 6 and 7 correspond to the actual behaviour or representations of the network?
  4. Why is the maximum output logit over the entire dataset used in Theorem 4.7, and is there no tighter bound for this?

I would definitely be inclined to raise my score if some of the questions were answered and it turns out I have missed something important.

Limitations

Overall the assumptions of the work are clearly stated and the Discussion section is fair and stated the necessary limitations.

Justification for Final Rating

The authors provided a very detailed response to my review and answered my questions. With these points made clear, I am happy to advocate for acceptance and think the proposed changes will benefit the clarity of the work.

Formatting Issues

None

Author Response

Thank you for your thoughtful and detailed review. We're excited that you recognized the originality and significance of our framework. We appreciate your compliments on the clarity and precision of the early sections. Your summary reflects a strong understanding of our goals and we’re glad that you thought our work stands to be highly significant. We think your concerns stem from areas we can readily clarify and are encouraged that you indicated you could raise your score if we succeed.

Weaknesses

Thanks again for praising the clarity in sections 1, 2, 3 of our paper. We think that by addressing your points and incorporating your suggestions here, we can clarify the difficult sections of the paper.

W1. Clarity and Presentation

Drop in clarity in section 4. Due to lack of space, we agree that some things were introduced tersely.

  • We will revise the simple neuron equations to be more prominently presented and clarified, and revise the notation to reflect their dependence on network weights.
  • We will add intuition for Theorem 4.7 and Corollary 4.8, though we may relegate the proof sketch to the appendix as we don't think it's strictly necessary in the main text.
  • We agree with you and will state Theorem 4.7 as $m'(1-\delta) > h(k)$. Thank you for the suggestion.

W2. Quality

The concerns and questions you raise here appear to be in direct correspondence with Q1, Q2, Q3, and Q4, so we address them in the Questions section.

Questions

Q1. What qualitative differences should I be noting in Figure 1?

  • Fig. 1 should qualitatively show the effect of remapping. Neurons with different frequencies in MLPs, clocks and pizza transformers look like sine functions, but appear to have slight qualitative differences, e.g. MLP looks less structured compared to pizza, but after remapping, they’re qualitatively both sine functions with frequency 1.
  • Due to our exhaustive search over training settings, we found neurons missed by prior work that qualitatively looked like sine functions, yet $R^2$ scores implied they weren't quantitatively.
  • After remapping, we saw these neurons had learned sawtooth functions, not sine functions.
  • Sawtooth functions are a sum of sine functions with different frequencies. Note: the first sine function in the sum explains some of the variance, so a sine function of best fit captures the key quantitative properties.
  • These neurons emerge at the boundary of generalization: with hyperparameters set just before the point where the network fails to generalize to 100% test accuracy during training.
  • Replacing all neurons (including sawtooth neurons) with sine functions of best fit doesn't change the network's test accuracy because the extra sine functions they learned were helping them overfit the training set—they were not relevant for generalization.
  • We study various properties of these neurons in Appendices G.4, G.5 and G.6.
  • Of note, Figure 28 in Appendix G.4 shows how remapping can qualitatively reveal whether a different function was learned. Figure 30 shows that the sawtooth in Figure 29 is a sum of sine functions with frequencies {42, 35, 28, 21, 14, 7} (a toy numerical sketch of such a decomposition appears after this list).
  • We aren't sure whether these are of practical relevance: they would never be found by a deep learning practitioner tuning for good hyperparameters. Indeed, this is why prior works missed them.
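To make the sawtooth point concrete, here is a minimal numerical sketch (ours, not the paper's code): the modulus n = 113 and the 1/k amplitude fall-off are illustrative choices, the harmonics are the multiples of 7 noted above, and "remapping" is implemented in the natural way we read it, namely relabelling inputs with the modular inverse of the fundamental frequency.

```python
import numpy as np

n = 113                                # illustrative modulus, not tied to a specific figure
freqs = [7, 14, 21, 28, 35, 42]        # harmonics of a sawtooth with fundamental frequency 7
x = np.arange(n)

# A sawtooth-like profile over Z_n: a sum of sines with 1/k amplitude fall-off.
saw = sum(np.sin(2 * np.pi * f * x / n) / (i + 1) for i, f in enumerate(freqs))

# Least-squares fit of a single frequency-7 sinusoid (free amplitude and phase).
# The first harmonic explains only part of the variance, so R^2 sits well below 1,
# even though the profile qualitatively resembles a sine function.
A = np.column_stack([np.sin(2 * np.pi * 7 * x / n), np.cos(2 * np.pi * 7 * x / n)])
coef, *_ = np.linalg.lstsq(A, saw, rcond=None)
r2 = 1 - ((saw - A @ coef) ** 2).sum() / ((saw - saw.mean()) ** 2).sum()
print(f"single-sine fit R^2: {r2:.2f}")

# Remapping: relabel inputs by the modular inverse of the fundamental frequency.
# Every harmonic 7*j becomes frequency j, so the remapped profile is a canonical
# sawtooth with fundamental frequency 1.
k_inv = pow(7, -1, n)
remapped = saw[(k_inv * x) % n]
```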

Thus, we will make the caption of Fig. 1 point to appendix G.4, Figure 28, so that interested readers can see how remapping can qualitatively discern different functions.

Q2. Why is an $R^2$ score of 0.3 acceptable in Figure 7 as long as the green tick is there?

Because the green tick indicates that 100% test accuracy was preserved after replacing all neurons with their sine fits. Even if the $R^2$ of individual neurons is low (e.g., for sawtooths), the network's behavior remains unchanged, showing that sawtooths aren't critical for generalization beyond the training data.

Q3. How does the presence of a green tick in Figures 6 and 7 correspond to the actual behaviour or representations of the network?

The green tick shows whether the network gets the correct answer after we replace all neurons with their best fits. It serves to elucidate the boundary cases, where hyperparameters are bad, but not so bad as to prevent generalization.

Elaboration on the weakness. The paper states: "all architectures trained to perform modular addition are abstracted well by approximate cosets." You ask: "I am not certain what the alternative would be - is this not true by virtue of the network being trained to perform modular addition itself? This alone does not seem to be a tight enough correspondence to draw conclusions about the actual circuitry of the network itself."

There are two questions here, and the first one is truly profound. The answer is no: it's not true simply by virtue of the task. As for the second point ("This alone does not seem to be a tight enough correspondence to draw conclusions about the actual circuitry of the network itself."), abstracting away from the circuitry is precisely our goal. We are not making claims about the circuitry. We abstract the specific details of the circuitry away with a data structure: paths on Cayley graphs called approximate cosets. We use this abstraction to rigorously prove that the disparities among prior interpretations arose because the weights of different architectures converged to different implementations of one divide and conquer algorithm, which we give as abstract Algorithm 4.6.

Our idea to abstract away circuits is very timely. A recent ICML 2025 position paper, “We Need an Algorithmic Understanding of Generative AI” [1], argues that the ML community must move beyond fragmented, bottom-up interpretability, which often focuses on individual neurons, components, or circuits without a guiding hypothesis. Their position is to prioritize top-down hypothesis/theory-driven research that explains how models implement full algorithms. The authors write in reference to prior approaches and even cite the works we unify: "current findings are still largely fragmented, and we lack a solid theoretical foundation for understanding how these various components come together to implement algorithms." Furthermore, they argue that such an approach would close the theory-interpretability gap, since both communities are operating rather disjointly.

Our idea to abstract away circuit-level details gave two contributions that we believe are substantial: (1) we address the call to close the theory-interpretability gap, providing a theoretical framework and showing how to use abstraction to form an algorithm out of learned circuits; (2) to our knowledge, we are the first to prove that the solutions learned by stochastic gradient methods can converge to implementations of divide and conquer algorithms, by providing an example of it happening. Our Conjecture 4.9 gives a route to proving that neural networks learn implementations of divide and conquer algorithms on many tasks.

Back to the profound question: the community has no theoretical reason to believe the network should prefer to learn cosets or approximate cosets. Recall that our approximate coset definition is a path on a Cayley graph. Since neurons only activate on approximate cosets, each neuron effectively asks: "is the answer in my part of the Cayley graph?" and activates when the answer is yes. But why learn approximate cosets? Ultimately, approximate cosets are functions on the Cayley graph, and any function can pick elements of the Cayley graph to serve as the basis for such questions. Indeed, approximate cosets tend to divide a Cayley graph in half; this is in fact why we conjectured an $O(\log n)$ bound and later proved it. But many other functions that divide the Cayley graph in half could have been learned!
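As a deliberately simplified illustration of this picture (a sketch of the geometric idea, not the paper's formal Definition 4.2; the modulus n = 59 and frequency k = 17 are arbitrary choices), the snippet below takes a simple ReLU neuron reading the sum (a + b) mod n and checks that its active set is exactly one contiguous path, covering roughly half of the Cayley graph of Z_n with generator k^{-1}.

```python
import numpy as np

n, k = 59, 17                          # illustrative modulus and neuron frequency
k_inv = pow(k, -1, n)                  # generator of the Cayley graph we walk below

def neuron(c):
    """Simple ReLU neuron reading the sum c = (a + b) mod n, with zero phase."""
    return max(0.0, np.cos(2 * np.pi * k * c / n))

# Elements of Z_n on which the neuron is active.
active = {c for c in range(n) if neuron(c) > 0}

# Walk the Cayley graph of Z_n with generator k^{-1}, starting at the peak c = 0.
# Each step changes the neuron's phase argument k*c by exactly one, so the active
# set should form a single contiguous path covering roughly half the graph.
path, c = [0], k_inv % n
while c in active:
    path.append(c)
    c = (c + k_inv) % n
c = (-k_inv) % n
while c in active:
    path.insert(0, c)
    c = (c - k_inv) % n

assert set(path) == active             # the neuron fires exactly on this path
print(f"{len(active)} of {n} elements active -- roughly half the graph")
```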

The goal of Morwani et al. was to propose a theory to explain why networks learn sinusoids. They rigorously proved that learning at least 4 sine functions of each frequency is the max-margin solution. They tested this in 1-layer MLPs and found that empirical reality matched. Our tests on multi-layer networks tell us this isn't the full story: networks with two or more layers no longer learn the max-margin solution.

Indeed, we prove that deep networks utilize $O(\log n)$ types of approximate cosets (features) rather than the brute-force $O(n)$ max-margin solution, thus achieving the algorithmic efficiency of a divide and conquer strategy. We form a connection to the famous Chinese Remainder Theorem divide and conquer algorithm, showing it utilizes $O(\log n)$ cosets, which we point out are modular subsystems of the original problem (subproblems).
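For readers less familiar with the exact Chinese Remainder Theorem that the approximate version is named after, here is a minimal reference sketch (ours; the modulus n = 3·5·7·11 is an illustrative choice): addition modulo n reduces to addition modulo each small coprime factor plus a recombination step, and since every factor is at least 2, at most log2(n) such subproblems are needed.

```python
from math import prod

moduli = [3, 5, 7, 11]                 # pairwise coprime factors; illustrative choice
n = prod(moduli)                       # 1155

def crt(residues, moduli):
    """Recombine residues into the unique x modulo prod(moduli)."""
    n = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        N = n // m                             # product of the other moduli
        x = (x + r * N * pow(N, -1, m)) % n    # N * N^{-1} is 1 mod m and 0 mod the rest
    return x

a, b = 1000, 777
residues = [(a + b) % m for m in moduli]      # the small "subproblems"
assert crt(residues, moduli) == (a + b) % n   # recombination recovers the full answer
print(residues, "->", crt(residues, moduli))
```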

Thus, neither the community nor we have an answer. Lines 312-313 in future work propose further investigation into this.

Q4. Why is the maximum output logit over the entire dataset used in Theorem 4.7, and is there no tighter bound for this?

The maximum logit is taken for each individual piece of data: for any input $(a, b)$ to the network, the maximum output logit is the correct logit. Thus, the bound is tight for all data. This will be clarified in the text.

We think it's unlikely a better bound exists because the bound matches experimental results tightly. Recall that it predicts $O(\log n)$ features will be learned on average, and the $R^2 \geq 0.989$ values for the logarithmic fits in Figure 3 support this. It also predicts the margin will grow logarithmically (before applying the softmax exponential), which is verified experimentally in Figure 12 in Appendix C with $R^2 \geq 0.988$.

Final thoughts

We hope that, in light of these clarifications and our discussion of a recent position paper, you see added value in our contributions at this time. We look forward to our continued discussion.

References

[1] Eberle et al., "Position: We Need An Algorithmic Understanding of Generative AI", ICML 2025.

Comment

Hi Reviewer,

As the discussion period is wrapping up, we wanted to check in and see if you need any further clarifications from us. You mentioned

I would definitely be inclined to raise my score if some of the questions were answered and it turns out I have missed something important.

and we’ve done our best to respond to your concerns. Please let us know if there's anything else we can help clarify.

Comment

I thank the authors for their detailed response.

My questions have been answered sufficiently and I now have a better understanding of some of the key aspects of this work. I recommend the authors make the promised changes and added detail, which will prevent future readers from similarly missing some of the subtle points.

I will raise my score to advocate for acceptance in light of the rebuttal.

Review
4

The paper presents a novel theoretical framework for understanding how neural networks solve modular addition tasks, called the approximate Chinese Remainder Theorem (aCRT). The authors propose that various neural network architectures utilize the same underlying algorithm to tackle these tasks, relying on a mechanism based on approximate cosets. The paper also extends the framework to group multiplication tasks, offering empirical results to support the validity of the proposed theory.

Strengths and Weaknesses

Strengths:

  • This paper presents a novel approach by unifying previous interpretations of modular addition and extending this to group multiplication, which could have broader implications for understanding neural network mechanisms.
  • The concept of the aCRT and the use of approximate cosets to explain modular addition provide a new perspective on neural network behavior and offer potential computational benefits, improving efficiency.
  • The theoretical framework is backed by solid empirical results that demonstrate its applicability to neural network architectures.

Weaknesses:

  • The experiments in the paper primarily focus on modular addition tasks, with limited validation in more complex or real-world tasks. While the theoretical framework is strong, there is less discussion on its practical applications. Could you provide more experiments on the framework's application to more complex tasks, especially those beyond modular addition and group multiplication? This would help better understand the framework's generalizability and practical feasibility.
  • The paper introduces the aCRT framework but does not directly compare it with existing similar methods. A comparison with existing approaches would offer clearer insights into the advantages and limitations of the framework in practical applications.

Questions

  • How does the proposed framework extend beyond modular addition?
  • Can you provide experiments showing its applicability to more complex tasks?
  • How does the framework apply to tasks beyond modular arithmetic or group multiplication?
  • Are there limitations in generalizing the aCRT framework to other operations?
  • I would appreciate some clarification on these points, particularly in the context of LLM applications. How could aCRT be applied to LLMs? Can it improve efficiency or interpretability, especially for tasks like reasoning or generation?

Limitations

Yes

Final Justification

This paper explores the idea that neural networks solving modular addition may share a common underlying algorithmic structure. I initially raised questions about its generalizability beyond modular addition and the lack of comparison with similar methods. The rebuttal clarified these points and better positioned the contribution. While experiments remain focused on modular addition, the theoretical insight is strong and may inspire future work.

Formatting Issues

No major formatting issues were found.

Author Response

Thank you for your review. We’re glad that you highlighted the unification of prior interpretations of modular addition (and its extension to group multiplication) as a key strength, as this was a central goal. We also appreciate your recognition of the novelty of the approximate Chinese Remainder (aCRT) framework, its potential to shed light on broader neural network mechanisms, and the potential computational benefits suggested by this perspective. Finally, we're glad you found our theoretical contributions well-supported by broad empirical results across architectures, which we see as essential to making robust and testable interpretive claims.

The concerns mostly relate to the framework's applicability to more complex tasks. We address all concerns below.

Weaknesses

W1. The paper introduces the aCRT framework but does not directly compare it with existing similar methods. A comparison with existing approaches would offer clearer insights into the advantages and limitations of the framework in practical applications.

We utilize the methods of prior work and in fact build on them.

  • Nanda et al.: Fit second order (degree 2) sines and cosines through neuron activations, then replaced the neurons by the fit to ensure accuracy remained 100%. We augment this with a more rigorous evaluation framework, testing order 1 (degree 1) sine and cosine fits as well as many more hyperparameters. This gave two novelties:
    • Figure 8 shows that order 2 sinusoids are more expressive than necessary. Note: we will update the caption to state clearly that the purpose of this figure is to show that Nanda et al.’s second order sinusoid framework is too expressive for layer 1.
    • The discovery of fine-tuning neurons that learn sawtooth functions, studied in Appendices G.4, G.5 and G.6. Notably, sawtooth functions violate the interpretations of all prior work, but they still fire on approximate cosets as remapping shows in Figure 28 in the appendix. Remark: they are only found with bad hyperparameters that cause overfitting of the training set so they may not be of relevance to deep learning practitioners, but serve to show how exhaustively we pushed prior methods.
  • Zhong et al.: Primarily utilized Principal Component Analysis (PCA) to conclude that two algorithms exist: pizza and clock.
    • Please see Appendix G.1, where we answer the open questions presented by Zhong et al. about their principal component analyses (PCA). By augmenting their PCA plots with the 2D Discrete Fourier Transform (DFT), we show their circularity metric isn't needed and that non-circular (Lissajous) embeddings are simply an artifact of PCA, i.e., Lissajous embeddings do not correspond to networks learning differences (a minimal numerical sketch of this 2D-DFT diagnostic appears after this list). We show a direct comparison between Zhong et al.'s PCA plots and our DFT augmentation in Figures 14-23.
    • See Appendix E.1, where we state that we downloaded checkpoints for their networks (clock and pizza) from their GitHub.
  • Morwani et al.: Showed empirically that 1-layer networks learn $O(n)$ features (frequencies) and rigorously proved it. Morwani et al. only study the margin in shallow, 1-layer networks.
    • We generalize to depth (see Fig. 4, showing that 1-layer networks learn $O(n)$ features whereas deep networks learn $O(\log n)$); the caption will be updated to indicate that Morwani et al.'s case, shown in blue, learns $O(n)$ features.
    • See Appendix C “Proofs and details for Theorem 4.7 and Corollary 4.8” where Figure 12 studies the empirical average margin over 500 networks. This supports the predicted margin of Corollary 4.8 for deeper networks.
    • Appendix G.9.3, Fig. 43 shows how the margin and the cross-entropy loss correlate as the number of layers is increased.
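As a minimal sketch of the 2D-DFT diagnostic referenced in the Zhong et al. item above (ours, not the paper's exact procedure; the activation table is synthetic, with illustrative n = 59 and k = 17): taking the DFT of a neuron's activations over the (a, b) grid concentrates the power at the frequency pair (k, k) (equivalently (n-k, n-k)), which can be read off numerically even when a PCA scatter plot of the embeddings looks non-circular.

```python
import numpy as np

n, k = 59, 17                                   # illustrative modulus and frequency
a = np.arange(n)[:, None]
b = np.arange(n)[None, :]

# Synthetic activation table of one neuron over all inputs (a, b); a real analysis
# would use the trained network's activations instead of this stand-in.
act = np.maximum(0.0, np.cos(2 * np.pi * k * (a + b) / n))

# 2D DFT over the (a, b) grid. Because the activation depends only on (a + b) mod n,
# all power lies on the diagonal, dominated by (k, k) and (n-k, n-k) once the
# DC term introduced by the ReLU is removed.
power = np.abs(np.fft.fft2(act)) ** 2
power[0, 0] = 0.0                               # drop the DC component
fa, fb = np.unravel_index(np.argmax(power), power.shape)
print("dominant frequency pair:", fa, fb)       # k or n - k along each axis
```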

W2. The experiments in the paper primarily focus on modular addition tasks, with limited validation in more complex or real-world tasks. While the theoretical framework is strong, there is less discussion on its practical applications. Could you provide more experiments on the framework's application to more complex tasks, especially those beyond modular addition and group multiplication? This would help better understand the framework's generalizability and practical feasibility.

We believe this is outside the scope of the paper. We worry such additions could distract from our core message and the importance of confirming Conjecture 4.9. We think our result that different architectures do not learn disparate algorithms, as prior work suggested, but simply learn different implementations of one abstract algorithm is fundamental. We'd like to note that reviewer EPiJ stated under significance: "Thus, the work stands to be highly significant." and later: "I think this work will likely inform a lot of subsequent work and provide a useful perspective on the emergent behaviour of neural networks on algorithmic tasks."

That said, you may find it interesting that training language models to output a sorted list when given an unsorted list as input can be rewritten formally as a group multiplication. Thus, Conjecture 4.9 predicts that such models will utilize approximate cosets, so we give a direct path for future work to immediately study more complex questions with language models using our framework.
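A toy sketch of that remark (our illustration, not the paper's construction): with distinct elements, an unsorted list encodes a permutation sigma of the sorted list, and sorting amounts to applying sigma^{-1}, i.e., a multiplication in the symmetric group.

```python
import random

m = 8
sorted_list = list(range(m))           # the "identity" element of S_m, written as a list
unsorted = sorted_list[:]
random.shuffle(unsorted)               # unsorted[i] = sigma(i): the input encodes sigma

# Invert the permutation encoded by the unsorted list.
sigma_inv = [0] * m
for i, s in enumerate(unsorted):
    sigma_inv[s] = i

# Applying sigma^{-1} to the unsorted list recovers the sorted one:
# sigma^{-1} * sigma = identity, a statement about group multiplication in S_m.
assert [unsorted[sigma_inv[i]] for i in range(m)] == sorted_list
```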

Questions

Q1. How does the proposed framework extend beyond modular addition?

Recall: approximate cosets are literally subproblems of the original problem; they divide the Cayley graph into smaller sets. Thus, the confirmation of Conjecture 4.9, that approximate cosets will be learned for all group multiplication tasks, would provide the foundation for a falsifiable theory of deep learning: the weights of neural networks converge to implementations of abstract divide and conquer algorithms.

Q2. Can you provide experiments showing its applicability to more complex tasks?

See our response to W2 in the Weaknesses section above.

Q3. How does the framework apply to tasks beyond modular arithmetic or group multiplication?

See Weakness 2).

Q4. Are there limitations in generalizing the aCRT framework to other operations?

There are no fundamental limitations in generalizing it to other groups, though the work to do so is constructive and will build on our framework. To generalize the aCRT to all tasks, Conjecture 4.9 should first be resolved. Its formal resolution will yield hints for how to generalize the framework toward proving that neural networks may implement abstract divide and conquer algorithms on all datasets. We could state such a broader conjecture now, since it does make sense (it would explain the unreasonable efficiency of deep learning), but we believe it's too early while the conjecture remains unresolved even for group multiplications.

Q5. I would appreciate some clarification on these points, particularly in the context of LLM applications. How could aCRT be applied to LLMs? Can it improve efficiency or interpretability, especially for tasks like reasoning or generation?

Primarily, this work serves to help interpretability by resolving the disparate interpretations of prior work and re-opening the universality hypothesis as a now-testable conjecture. Secondarily, the ideas in this work have the potential to scale to address what LLMs learn on non-group tasks. Thus, if they work out, they could provide the groundwork for a falsifiable theory of deep learning.

A recent position paper gives context: our work is timely

The ICML 2025 position paper, “We Need an Algorithmic Understanding of Generative AI” [1], argues that the ML community must move beyond fragmented, bottom-up interpretability, which often focuses on individual neurons, components, or circuits without a guiding hypothesis. Instead, they propose the prioritization of top-down hypothesis/theory-driven research that explains how models implement full algorithms. The authors write in reference to prior approaches: "current findings are still largely fragmented, and we lack a solid theoretical foundation for understanding how these various components come together to implement algorithms." Furthermore, they argue that such an approach would close the theory-interpretability gap, since both communities are operating rather disjointly. And this would lead to better understood, safer, and more efficient models.

Our core message is in service of the research agenda outlined in [1]; we think this will help our work be extended to more complex tasks and is relevant to questions about LLMs.

Final thoughts

We hope that we’ve successfully clarified our contributions and how our research fits into the literature and provides ample opportunities for future work. If this addresses your concerns clearly, we would greatly appreciate if you could consider increasing your score.

References

[1] Eberle et al., "Position: We Need An Algorithmic Understanding of Generative AI", ICML 2025.

Comment

Hi Reviewer,

As the discussion period is wrapping up, we wanted to check in and see if you need any further clarifications from us.

As your primary concern related to future generality, we'd like to mention that you may be interested in the discussion with reviewer RppL, particularly this part of our rebuttal:

We acknowledge that our methodology is focused on modular addition, and explicitly address this with Conjecture 4.9, to encourage future extensions, but we do not view this as a weakness because:

  • Our paper unifies five prior works (Nanda et al., Chughtai et al., Zhong et al., Gromov, Morwani et al.) on modular addition, resolving previously conflicting interpretations and other open questions
  • Additional tasks would harm the clarity of this contribution. Since the goal of this work is to reconcile and unify prior interpretations on modular addition, expanding to new tasks would distract readers, especially because prior work was scoped to modular addition
  • Even if our methodology doesn’t transfer to other tasks, our work provides a complete theory of how deep networks implement modular addition, something no prior work has achieved
  • We’re first to present a theoretical bound for deep networks on this task. Prior work addressed 1 layer MLPs

We recognize generalization is a natural next step, which is why we framed Conjecture 4.9 as a key forward-looking result. But within its scope, this work offers a definitive resolution to longstanding contradictions in the literature, especially those that challenged the universality hypothesis, and we believe this stands firmly on its own merit.

Reviewer RppL also asked a question, which we address, that should be of interest to you concerning LLMs:

Q1. Do you have an explanation of why a 1-layer model cannot learn this? Is it related to the famous paper from Anthropic about induction heads and the non-learnability of the copy task by a 1-layer model?

Please let us know if there's anything else we can help clarify.

Comment

The authors’ rebuttal addressed many of my earlier concerns, particularly regarding the scope and positioning of the aCRT framework and its connection to prior work. Their clarifications around Conjecture 4.9, Theorem 4.7, and comparisons with existing methods helped strengthen the theoretical foundation.

After reviewing the discussion, I find the paper presents a meaningful conceptual contribution to understanding modular addition and group-related tasks in neural networks. While the experiments remain centered on modular addition, the framework shows promise for broader applications.

I maintain my original rating of Borderline Accept, as the paper offers solid theoretical value, though the experiments are still mostly limited to modular addition.

Comment

Hi Reviewer PXn7,

Thank you again for your thoughtful engagement throughout the review and discussion period.

We understand your remaining concern: that the paper is borderline due to its experimental focus on modular addition. We'd like to respectfully offer context. Several influential papers that we unify, including Nanda et al. and Zhong et al., were scoped only to modular addition. Our work builds on theirs but goes significantly further: we reconcile their fragmented findings, extend the analyses across architectures and depths, and provide the first unifying theoretical framework. Our framework is the first to extend to depth, yielding quantitatively accurate predictions of feature learning in multilayer networks ($R^2 \geq 0.99$) in scaling experiments (Fig. 3).

We especially appreciate your recognition that the paper "presents a meaningful conceptual contribution" and that our clarifications "strengthened the theoretical foundation". We would add that we resolve a major open concern in the interpretability community: Zhong et al.'s "Clock and Pizza" paper suggested that models can learn fundamentally different circuits to solve the same task, raising concerns about the transferability of mechanistic insights. Our analysis shows that these seemingly distinct circuits are different implementations of Algorithm 4.6, thereby relieving this worry.

This has broader implications for interpretability. Zhong et al.’s results were especially pressing because if models diverge on modular addition, they may do so even more drastically on large-scale tasks like language modeling with LLMs. Our results suggest a more optimistic view: even when circuits appear different, they may reflect different implementations of one algorithmic idea.

We transparently highlighted future work with Conjecture 4.9, but we believe that presenting a state-of-the-art interpretation for modular addition, with theoretical and empirical rigor, constitutes a standalone contribution.

We’re grateful for your time and thoughtful feedback, and hope this response clarifies the scope and significance of our contribution in light of past literature scoped only to modular addition. We hope this addresses your final concern.

(added via edit): This worry is echoed in papers citing Zhong et al., including MoSSAIC: AI Safety After Mechanism (Farr et al.), which writes: "Researchers have successfully reverse-engineered these models and discovered that they learn a specific, intricate algorithm based on trigonometric identities and Fourier transforms, implemented via clock-like representations in the attention heads. The specific implementation is fundamentally tied to the architectural properties of the transformer substrate. A different architecture would almost certainly learn a different algorithm, rendering this detailed explanation obsolete."

Final Decision

Scientific claims and findings: This paper proposes a unifying theoretical framework, the approximate Chinese Remainder Theorem (aCRT), to explain how neural networks solve modular addition. The authors claim that seemingly different solutions learned by various architectures are actually implementations of this single, universal abstract algorithm. A key finding is the theoretical and empirically verified result that deep networks require only $\mathcal{O}(\log n)$ features for this task, a significant efficiency gain over shallow networks.

Strengths:

  • Unifying Theory (aCRT): The primary strength is its success in unifying several disparate and conflicting prior interpretations of how neural networks perform modular addition under a single, coherent framework, the approximate Chinese Remainder Theorem (aCRT).
  • Novel Concepts: It introduces novel concepts like "approximate cosets" to provide a more general explanation that holds across different training conditions and architectures (MLPs, Transformers).
  • Strong Empirical Support: The theoretical claims, particularly the $\mathcal{O}(\log n)$ feature scaling for deep networks, are backed by extensive experiments across over a million trained models, demonstrating the robustness of the findings.

Weaknesses:

  • Clarity in Technical Sections: Some reviewers found the initial presentation of the core theoretical concepts in Section 4 to be dense and lacking intuition, though the authors committed to revisions to improve clarity based on feedback.

Reason for Accept: The paper's main contribution is conceptual and empirical: it resolves a key puzzle in the interpretability community by showing that different learned "circuits" can be manifestations of the same abstract algorithm. While the paper presents some theoretical analysis regarding margins, it does not culminate in a complete, end-to-end argument. In particular, a more relevant margin notion would be the normalized margin, i.e., the margin divided by the product of the norms of the parameters of each layer. (Otherwise, a large margin could be trivially attained by simply scaling up the last layer.) Nevertheless, this paper provides an interesting and insightful new perspective on an important problem for the community, and it is likely to inspire follow-up work.