We sincerely thank the reviewer for the thoughtful and encouraging feedback. We are glad that the reviewer recognizes the novelty, significance, and robustness of our theoretical framework and its practical relevance. Below, we address each of the raised points.

W1 & Q3. Extension to recurrent architectures and models with joint space-time filters
We fully agree that extending the analysis to recurrent architectures or models with joint space-time filters is a valuable direction for future work. In this paper, we focused on delivering a rigorous and targeted analysis of spatiotemporal over-squashing within a widely-used class of factorized convolutional STGNNs.
Recurrent STGNNs following a T&S paradigm exhibit fundamentally different information propagation dynamics. In these models, the spatial receptive field typically expands with the sequence length, differently from the TTS counterpart, where the entire temporal sequence is encoded into a latent state later used for message passing. A thorough theoretical analysis of recurrent STGNNs would require substantially broader analysis, which we view as out of scope for a conference-length submission.
Conversely, joint space-time filters are more closely aligned with our setting. The factorized convolutional models we study can be viewed as a special case of joint space-time architectures with a constrained parameterization. Therefore, our theoretical results apply directly to a subset of joint models where cross-dimensional edges are not considered. However, generalizing our sensitivity bounds to arbitrary joint space-time filters – which by design follow a T&S scheme only – would require analyzing a larger architectural space. We consider this a promising future direction and have added a note on this in the discussion.
W2 & Q1. Reduced sensitivity towards earlier time steps
We agree that the finding appears counterintuitive at first glance. Our analysis shows that this effect emerges at depth, rather than in shallow causal convolutional networks. Specifically, Proposition 4.2 shows that as more convolutional layers are stacked, the influence of temporally recent inputs diminishes relative to more distant ones. This behavior stems from the structure of causal convolutions, which incrementally incorporate more information into a fixed-length context vector. When viewed through the matrix-multiplication perspective (i.e., using the Toeplitz matrix representation), it becomes evident that causal convolutions propagate information along powers of a directed path graph. Over multiple layers, earlier time steps accumulate influence through an increasing number of propagation paths, while more recent inputs have fewer paths for propagating their initial information. Importantly, because causal convolutions are forward-only, each time step can preserve its information in the associated context vector through self-loops only, with a major impact on the last time step in the sequence. When a task requires preserving local or recent information, this primacy bias can negatively impact performance even in trained neural networks, as demonstrated in the CopyLast experiment (Fig. 3). We have expanded the relevant discussion to clarify this mechanism and its practical implications for tasks that rely heavily on recent inputs.
W3 & Q2. Limitations of mitigation strategies
Thank you for highlighting the limitations of the proposed mitigation techniques.

Dilated convolutions (): The parameters (dilation base) and (reset modulo) are primarily chosen to control the temporal receptive field at each layer. To avoid the sink effect (seen in Fig. 2(b.2)), we require , i.e., no resets within the model depth. However, when the dilation rate grows beyond the window size at layer , the convolution degenerates into a fully-connected layer shared across time. In these cases, resets become necessary for temporal propagation, but reintroduce the risk of over-squashing. Smaller allows more flexibility in setting before hitting this tradeoff.
Row-normalization (): While effective in retaining local information for the last time step – useful for encoder-decoder architectures – the behavior for earlier steps is still similar to the vanilla model. As shown in Fig. 2 (c.2), however, paths toward the final time step dominate compared to any other paths in the temporal graph. Thus, for tasks requiring readout at intermediate time steps (e.g., imputation), the benefits of may be limited. We have added a clarifying note in the discussion of mitigation strategies.

We appreciate the suggestion to conduct a systematic sensitivity analysis of the mitigation strategies. While we agree that this could yield interesting insights, we believe it would fall outside the intended scope and focus of this work. As the reviewer kindly acknowledged, our paper is primarily theoretical, and our experiments were designed to validate core theoretical claims rather than to fully optimize or benchmark each mitigation strategy. We have added a brief mention of this limitation and encourage future work to build upon these mitigation techniques with a more exhaustive empirical lens.