Image as First-Order Norm+Linear Autoregression: Unveiling Mathematical Invariance
Images share a mathematical property: first-order norm+linear autoregression.
Abstract
Reviews and Discussion
This paper presents a First-Order Norm+Linear Autoregression method (called FINOLA) which represents each image in the latent space as a first-order autoregressive process. The authors then validate the FINOLA property on image reconstruction and self-supervised learning. Experiments demonstrate the effectiveness of the proposed method.
Strengths
This paper represents each image in the latent space as a first-order autoregressive process, and conducts experiments on image reconstruction and self-supervised learning to validate the method.
Weaknesses
Some details of the figures and the method are not clear. The experiment section can be improved.
Questions
-
Some details of the figures are not clear. For example, what do the red and blue arrows in z(x, y) mean? How do they affect the reconstruction? Does a single vector q mean the feature of the whole image or of a specific pixel? In the figure, there appears to be a loss of high-frequency detail in the reconstruction.
-
How are the mean $\mu$ and the standard deviation $\sigma$ calculated? In Eqn. (1), how are the matrices A and B initialized? What are the constraints on the matrices A and B?
-
The authors mainly compare with VQGAN (Esser et al. (2021)) and Stable Diffusion (Rombach et al. (2021)) in image reconstruction. It would be better to compare with recent image reconstruction methods. In addition, could you compare with a convolutional U-Net for image reconstruction?
-
In Table 4 (a), using PSNR to compare VQGAN and Stable Diffusion is not convincing because these methods are generative methods. It would be better to use LPIPS and FID.
Details of Ethics Concerns
None
Thank you for dedicating your time and effort to provide feedback on our work. Below, we answer the questions that have been raised.
Q1: Some details of the figures are not clear.
For example, what do the red and blue arrows in z(x, y) mean? The red and blue arrows in z(x, y) represent the two autoregression paths of FINOLA for any given position, starting from the center where the globally pooled vector is placed. Each arrow signifies a FINOLA step, and the color denotes the order of regression. The red path involves horizontal regression followed by vertical regression, while the blue path involves vertical regression followed by horizontal regression. The results of these two paths are averaged to obtain the final feature vector.
How do they affect the reconstruction? It's crucial to note that we repeat this "two-path" autoregression until the entire feature map is generated. Subsequently, the feature map goes through the decoder to reconstruct the images.
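To make the two-path procedure concrete, here is a minimal NumPy sketch of our reading of it. The names `finola_step` and `feature_at` and the toy sizes are illustrative assumptions, not the paper's code, and only positive offsets from the center are handled:

```python
import numpy as np

def finola_step(z, M, eps=1e-6):
    # One FINOLA step (Eq. 1): normalize z over its C channels, then
    # move to the next position via a linear map of the normalized feature.
    z_hat = (z - z.mean()) / (z.std() + eps)
    return z + M @ z_hat

def feature_at(q, A, B, dx, dy):
    # Red path: dx horizontal steps (A) then dy vertical steps (B).
    # Blue path: dy vertical steps (B) then dx horizontal steps (A).
    # The two results are averaged, as described above.
    z_red, z_blue = q.copy(), q.copy()
    for _ in range(dx):
        z_red = finola_step(z_red, A)
    for _ in range(dy):
        z_red = finola_step(z_red, B)
    for _ in range(dy):
        z_blue = finola_step(z_blue, B)
    for _ in range(dx):
        z_blue = finola_step(z_blue, A)
    return 0.5 * (z_red + z_blue)

# Toy sizes for illustration only.
C = 8
rng = np.random.default_rng(0)
q = rng.normal(size=C)
A, B = 0.1 * rng.normal(size=(C, C)), 0.1 * rng.normal(size=(C, C))
z = feature_at(q, A, B, dx=3, dy=2)  # feature 3 steps right, 2 steps down from the center
```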
Does a single vector q mean the feature of the whole image or of a specific pixel? "A single vector q" indeed denotes the feature of the entire image after pooling.
In the figure, there appears to be a loss of high-frequency detail in the reconstruction. As for the concern about the loss of high-frequency details in the reconstructed images, we acknowledge this limitation, and it is discussed in detail in Appendix B. We attribute this limitation to the choice of the loss function. Our intentional use of a simple loss function is to highlight that FINOLA is not heavily dependent on complex reconstruction loss functions. To improve image quality, an alternative approach could involve exploring the use of a combination of perceptual loss and an adversarial training procedure with a patch-based discriminator, as outlined in the VQGAN framework.
Q2: How are the mean $\mu$ and the standard deviation $\sigma$ calculated? In Eqn. (1), how are the matrices A and B initialized? What are the constraints on the matrices A and B?
The mean $\mu$ and the standard deviation $\sigma$ are calculated per position on the feature map $z$. For each position $(x, y)$, the corresponding feature $z(x, y)$ is a vector with $C$ channels: $z(x, y) = (z_1, \dots, z_C)$. $\mu$ and $\sigma$ are the mean and standard deviation of these $C$ values.
Matrices $A$ and $B$ are randomly initialized, and no specific constraints are placed on them.
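For concreteness, a minimal NumPy sketch of the per-position statistics and the unconstrained random initialization described above; the shapes and scales are illustrative assumptions, not values from the paper:

```python
import numpy as np

# For a feature map z of shape (C, H, W), the mean and standard deviation
# are computed over the C channels at each position, so both have shape (H, W).
C, H, W = 16, 16, 16  # illustrative sizes
rng = np.random.default_rng(0)
z = rng.normal(size=(C, H, W))

mu = z.mean(axis=0)                # (H, W)
sigma = z.std(axis=0)              # (H, W)
z_hat = (z - mu) / (sigma + 1e-6)  # per-position normalized features

# A and B are plain C x C matrices, randomly initialized with no constraints.
A = 0.02 * rng.normal(size=(C, C))
B = 0.02 * rng.normal(size=(C, C))
```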
Q3: The authors mainly compare with VQGAN (Esser et al. (2021)) and Stable Diffusion (Rombach et al. (2021)) in image reconstruction. It would be better to compare with recent image reconstruction methods. In addition, could you compare with a convolutional U-Net for image reconstruction?
Comparison with convolutional U-Net: The table below provides a comparative analysis between our method and a convolutional U-Net for image reconstruction (measured by PSNR). Both approaches share the same Mobile-Former encoder and have identical bottleneck dimensions (C=1024 or 3072). In our method, FINOLA generates a 16x16 feature map, followed by a convolutional decoder that reconstructs an image at a size of 256x256. The U-Net, on the other hand, has a deeper decoder that utilizes convolution and upsampling for image reconstruction. Both methods have a similar model size. The results showcase the superior performance of FINOLA over the convolutional U-Net, indicating that a single-layer FINOLA is more effective than multi-layer convolution and upsampling.
| Method | C=1024 | C=3072 |
|---|---|---|
| Convolutional U-Net | 23.4 | 25.1 |
| FINOLA (ours) | 23.7 | 25.8 |
Comparison with ViT-VQGAN: In the following comparison, Masked FINOLA is evaluated against ViT-VQGAN ("Vector-Quantized Image Modeling with Improved VQGAN, Yu et al., ICLR 2022") using ImageNet linear probing. FINOLA achieves higher accuracy with a much more compact encoder.
| Method | Params | Top-1 |
|---|---|---|
| ViT-VQGAN | 650M | 65.1 |
| FINOLA (ours) | 28M | 66.4 |
For a detailed comparison in image reconstruction, we will address this in the subsequent question (Q4).
Q4: In Table 4 (a), using PSNR to compare VQGAN and Stable Diffusion is not convincing because these methods are generative methods. It would be better to use LPIPS and FID.
Below, we present a comparison of FINOLA with the first stage of multiple generative methods, assessed using both PSNR and FID metrics. While FINOLA outperforms in PSNR, it lags in FID. This discrepancy can be attributed to our deliberate choice of loss function. We intentionally opted for a simple loss function to underscore that FINOLA does not heavily rely on intricate reconstruction loss functions. To enhance FID, we could explore perceptual and GAN losses, which have proven successful in VQGAN, ViT-VQGAN, and Stable Diffusion.
| Method | Latent Size | Channels | Logit-Laplace loss | $\ell_2$ loss | Perceptual loss | GAN loss | FID↓ | PSNR↑ |
|---|---|---|---|---|---|---|---|---|
| DALL-E | 16x16 | -- | ✓ | | | | 32.0 | 22.8 |
| VQGAN | 16x16 | 256 | | | ✓ | ✓ | 4.98 | 19.9 |
| ViT-VQGAN | 32x32 | 32 | ✓ | ✓ | ✓ | ✓ | 1.28 | -- |
| Stable Diffusion | 16x16 | 16 | | | ✓ | ✓ | 0.87 | 24.1 |
| FINOLA (ours) | 1x1 | 3072 | | ✓ | | | 27.8 | 25.8 |
It's essential to clarify that FINOLA does not operate as a generative model. Our primary goal is not image generation; instead, we aim to unveil mathematical properties within the latent space. Image reconstruction, one of our tasks alongside self-supervised learning, serves as a means to validate FINOLA — a first-order norm+linear autoregressive model — as a fundamental mathematical property shared by all images.
FINOLA is also implemented differently from autoregression-based generative models. The latter typically consists of two stages: the first stage involves learning an autoencoder and vector quantization in the latent space, while the second stage focuses on learning autoregression in the quantized latent space. In contrast, FINOLA operates as a single-stage model, seamlessly learning the encoder, autoregression, and decoder end-to-end, with autoregression performed on a continuous (non-quantized) latent space.
This paper introduces FINOLA (First-Order Norm+Linear Autoregression), a novel framework for image representation with powerful autoregressive capabilities. First, the image is encoded into a single vector q. Then, FINOLA autoregresses from the vector q, placed at the center, to the full feature map through two partial differential equations. Experiments show that this pre-trained representation excels in various downstream tasks, including image classification and object detection, without the need for extensive fine-tuning.
Strengths
- The proposed method sounds interesting and novel. Its structure is simple but it shows powerful representation ability in spatial representation.
- This paper provides a new perspective to describe the intrinsic relationship of image feature maps through partial differential equations. This may have some implications for simplifying neural networks.
- Extensive experiments prove the effectiveness of this method, especially on image classification and object detection tasks.
Weaknesses
- The author mentioned "this intriguing property reveals a mathematical invariance". What "invariance" refers to should be further explained.
- The author mentioned "providing insights into the underlying mathematical principles" more than once, but did not provide an in-depth explanation or comparison. It is recommended to provide more details.
- When validating the norm+linear approach, the authors repeated q W × H times. In the original setting, the authors learned two matrices A and B, with more parameters. Thus, the comparison seems a bit unfair.
- In the comparison with Stable Diffusion, the input of the paper's method is an image of the same size as the output, while the input of Stable Diffusion is a lower-resolution image. The tasks of the two are different, so it may not be appropriate to compare them together.
Questions
Please refer to the weakness part.
Thank you for dedicating your time and effort to provide feedback on our work. Below, we answer the questions that have been raised.
Q1: The author mentioned "this intriguing property reveals a mathematical invariance". What "invariance" refers to should be further explained.
The term "mathematical invariance" refers to the consistent relationship, shared by all images, between feature values and their spatial derivatives on the feature map. Mathematically, for any given image, the feature value $z(x, y)$ at any position on the feature map determines its spatial derivatives as follows:

$$\frac{\partial z}{\partial x} = A\,\hat{z}(x, y), \qquad \frac{\partial z}{\partial y} = B\,\hat{z}(x, y),$$

where $\hat{z}(x, y) = \frac{z(x, y) - \mu}{\sigma}$ is the feature normalized over its channels.
It's essential to note that matrices $A$ and $B$ exhibit invariance across different images and pixels on the feature map, emphasizing the consistent mathematical relationship that holds universally.
Q2: The author mentioned "providing insights into the underlying mathematical principles" more than once, but did not provide an in-depth explanation or comparison. It is recommended to provide more details.
Thank you for the valuable suggestion. Allow us to provide a more detailed explanation. The crux of our mathematical insights lies in the understanding that the feature maps of all images adhere to two partial differential equations (PDEs):

$$\frac{\partial z}{\partial x} = A\,\hat{z}(x, y), \qquad \frac{\partial z}{\partial y} = B\,\hat{z}(x, y), \qquad \hat{z}(x, y) = \frac{z(x, y) - \mu}{\sigma}.$$

These equations reveal that the spatial rate of change, or derivatives, in the feature map is entirely determined by the feature values at the current location. It's essential to note that matrices $A$ and $B$ exhibit invariance across different images and pixels on the feature map, emphasizing the consistent mathematical relationship that holds universally.
Another mathematical insight is related to the role of masking, particularly in the context of masked FINOLA for self-supervised learning. This technique plays a crucial role in facilitating the extraction of meaningful semantic representations from the feature maps. However, it comes with an inherent trade-off, impacting the preservation of intricate image details. Notably, compared to vanilla FINOLA, masked FINOLA exhibits a substantial increase in Gaussian curvature on the surfaces of critical features. This observed increase suggests a heightened curvature in the latent space, underlining its effectiveness in capturing and emphasizing semantic information.
Q3: When validating the norm+linear approach, the authors repeated q W × H times. In the original setting, the authors learned two matrices A and B, with more parameters. Thus, the comparison seems a bit unfair.
The objective of this comparison is to demonstrate the superiority of using norm+linear over its absence. Repetition, requiring zero additional effort, naturally stands out as a straightforward option. In this context, our intention is to specifically exclude the scenario where the zero-effort repetition achieves performance comparable to our more complex norm+linear approach. Such a scenario would contradict the essential need for norm+linear. The poor performance of repetition serves as evidence that norm+linear cannot be easily replaced by a simpler and more straightforward solution, highlighting the necessity of our chosen approach, even in the presence of additional parameters when learning matrices $A$ and $B$.
Q4: In the comparison with Stable Diffusion, the input of the paper's method is an image of the same size as the output, while the input of Stable Diffusion is a lower-resolution image. The tasks of the two are different, so it may not be appropriate to compare them together.
Let's clarify the comparison with Stable Diffusion. Our comparison is specifically with the first stage of Stable Diffusion, known as perceptual image compression, which involves training an autoencoder to reconstruct images. We explicitly refrain from comparing with the second stage of Stable Diffusion, which includes generation of super-resolution from lower-resolution images. The comparison is thus limited to the common task of image reconstruction and does not involve tasks related to super-resolution from lower-resolution inputs.
This paper proposed an auto-regression model in latent feature space using norm-linear transform to regress the features in high-resolution from a global feature. The proposed norm-linear transform is simple to regress features, and can be decoded with lightweight network to generate the reconstructed image. The proposed model can also be applied as a pre-training method, having good generalization ability to recognition and object detection.
Strengths
-
The auto-regression for generating high-resolution features in the feature space is an interesting idea, and the proposed regression model is simple. The regression model in feature space can be seen as a discretized PDE.
-
The proposed model can be taken as a self-supervised learning approach based on masked region prediction. Sufficient experiments show that it can achieve good pretraining results for recognition and detection.
Weaknesses
-
There are previous regression-based generative models (e.g., refer to related works) in feature space, and what are the major difference and advantage of this approach compared with these models? Is it possible to compare with them for the generation quality and computational speed?
-
Are the matrices A and B shared for all different images and pixels in the feature space? If so, why can the learned constant A and B deduce good feature regression?
-
The paper states that this work does not aim to achieve SoTA results; however, comparisons with SoTA regression models or other variants, e.g., using nonlinear regression instead of a linear transform, would give better insights to the audience.
-
The regression model has to gradually regress dense feature maps, correct? What is the computational overhead/speed for training and inference using this model?
-
What is the meaning of the mathematical invariance in this paper for the proposed model?
Questions
Please see my questions above.
Thank you for dedicating your time and effort to provide feedback on our work. Below, we answer the questions that have been raised.
Q1: There are previous regression-based generative models (e.g., refer to related works) in feature space, and what are the major difference and advantage of this approach compared with these models? Is it possible to compare with them for the generation quality and computational speed?
First and foremost, it's essential to clarify that FINOLA does not function as a generative model. Our primary objective is not image generation but rather the revelation of mathematical properties within the latent space. Image reconstruction, one of our tasks (alongside self-supervised learning), serves as a means to validate FINOLA — a first-order norm+linear autoregressive model — as a fundamental mathematical property shared by all images.
It's crucial to note that FINOLA is both functionally and technically distinct from autoregression-based generative models. The latter typically consists of two stages: the first stage involves learning an autoencoder and vector quantization in the latent space, while the second stage focuses on learning autoregression in the quantized latent space. In contrast, FINOLA operates as a single-stage model, seamlessly learning the encoder, autoregression, and decoder end-to-end, with autoregression performed on a continuous (non-quantized) latent space.
Given that FINOLA doesn't fall into the category of generative models, it isn't suitable for direct comparison in terms of generation quality and computational speed with models designed for such purposes.
Q2: Are the matrices A and B shared for all different images and pixels in the feature space? If so, why can the learned constant A and B deduce good feature regression?
Yes, the matrices $A$ and $B$ are shared for all images and pixels in the feature space.
The success of using constant $A$ and $B$ for achieving good feature regression can be attributed to three primary factors. Firstly, images inherently exhibit significant spatial redundancy, leading to strong correlations between consecutive positions within the feature map. Secondly, our hypothesis regarding the existence of a feature map where spatial derivatives correlate with feature values appears to be supported. Lastly, the effectiveness of deep learning allows us to validate this hypothesis by seamlessly learning the encoder, FINOLA, and decoder components in an end-to-end manner.
Q3: The paper states that this work does not aim to achieve SoTA results; however, comparisons with SoTA regression models or other variants, e.g., using nonlinear regression instead of a linear transform, would give better insights to the audience.
Excellent point! We have conducted a comparison between our norm+linear approach and non-linear regression, which involves replacing the linear transform with two MLP layers incorporating GELU activation in between. This evaluation was performed for both 1024 and 3072 channels. Surprisingly, our norm+linear, despite having fewer parameters, closely trails the performance of the two-layer MLP. This result underscores the effectiveness of norm+linear in modeling the transition between pixels in the feature map.
| Regression | C=1024 | C=3072 |
|---|---|---|
| Non-Linear (2 layer MLP) | 23.8 | 25.8 |
| Linear (ours) | 23.7 | 25.8 |
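As a sketch of the two transition functions being compared, here is our own PyTorch rendering, under the assumption that the MLP simply replaces the linear map inside the recurrence; `linear_head`, `mlp_head`, and `step` are hypothetical names, not the paper's code:

```python
import torch
import torch.nn as nn

C = 1024  # channel width from the table above

# Norm+linear (ours): a single linear map applied to the normalized feature.
linear_head = nn.Linear(C, C, bias=False)

# Non-linear baseline: two MLP layers with a GELU activation in between.
mlp_head = nn.Sequential(nn.Linear(C, C), nn.GELU(), nn.Linear(C, C))

def step(z, head, eps=1e-6):
    # z: (batch, C). Normalize each sample over its channels, then take
    # one autoregression step with the given transition head.
    z_hat = (z - z.mean(dim=1, keepdim=True)) / (z.std(dim=1, keepdim=True) + eps)
    return z + head(z_hat)

z = torch.randn(4, C)
z_next_linear = step(z, linear_head)  # our norm+linear transition
z_next_mlp = step(z, mlp_head)        # the two-layer MLP variant
```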
We further benchmark our method against an alternative non-linear solution: employing a multi-layer convolutional approach to upsample the vector to a 16x16 feature map. This method undergoes four upsampling stages, each incorporating three 3x3 convolution layers. The results below demonstrate the superior performance of a single-layer FINOLA over the multi-layer convolutional baseline.
| Regression | C=1024 | C=3072 |
|---|---|---|
| convolution | 23.4 | 25.1 |
| FINOLA (ours) | 23.7 | 25.8 |
Q4: The regression model has to gradually regress dense feature maps, correct? What is the computational overhead/speed for training and inference using this model?
Autoregressive reconstruction: The training of the autoregressive model involves regressing dense feature maps, and the computational requirements increase with the size of the feature map. For instance, training on a 16x16 feature map with 3072 latent channels for 100 epochs on ImageNet takes approximately 8 days with 8 V100 GPUs, and extending to a larger feature map like 64x64 increases the training time to 18 days using the same GPU setup.
The complete inference pipeline, encompassing encoding, FINOLA, and decoding, for a 256x256 image on a MacBook Air with an Apple M2 CPU, takes approximately 1.2 seconds.
Self-supervised learning: Training a self-supervised learning model (Masked-FINOLA-B) for 800 epochs on ImageNet demands around 10 days with 8 V100 GPUs and a batch size of 128 per GPU.
It's worth noting that following self-supervised learning, only the encoder is typically used for inference in downstream tasks. Our method utilizes Mobile-Former as an encoder, which is a lightweight architecture efficiently combining MobileNet and Transformer. The original paper "Mobile-Former: Bridging MobileNet and Transformer, CVPR 2022" shows detailed information on runtime performance (comparable with MobileNet and ShuffleNet).
Q5: What is the meaning of the mathematical invariance in this paper for the proposed model?
The term "mathematical invariance" refers to the consistent relationship, shared by all images, between feature values and their spatial derivatives on the feature map. Mathematically, for any given image, the feature value $z(x, y)$ at any position on the feature map determines its spatial derivatives as follows:

$$\frac{\partial z}{\partial x} = A\,\hat{z}(x, y), \qquad \frac{\partial z}{\partial y} = B\,\hat{z}(x, y),$$

where $\hat{z}(x, y) = \frac{z(x, y) - \mu}{\sigma}$ is the feature normalized over its channels.
It's essential to note that matrices $A$ and $B$ exhibit invariance across different images and pixels on the feature map, emphasizing the consistent mathematical relationship that holds universally.
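A small numerical illustration of this invariance (our own sketch with illustrative sizes): generate one row of features with the horizontal rule and check that the finite difference at every position equals $A\hat{z}$, i.e., the same fixed $A$ governs every position.

```python
import numpy as np

def normalize(v, eps=1e-6):
    # Normalize a feature vector over its C channels.
    return (v - v.mean()) / (v.std() + eps)

C, W = 6, 10
rng = np.random.default_rng(1)
A = 0.1 * rng.normal(size=(C, C))

# Build a row by the horizontal FINOLA rule: z(x+1) = z(x) + A @ z_hat(x).
row = [rng.normal(size=C)]
for _ in range(W - 1):
    row.append(row[-1] + A @ normalize(row[-1]))

# The discrete counterpart of dz/dx = A z_hat holds at every position,
# with the single matrix A shared across all of them.
for x in range(W - 1):
    assert np.allclose(row[x + 1] - row[x], A @ normalize(row[x]))
```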
This paper proposes First-Order Norm+Linear Autoregression (FINOLA). The proposed method is a new type of autoregressive model that can be used for self-supervised representation learning. Comprehensive experiments show that FINOLA has relatively few parameters but can retain enough information for image reconstruction. The authors also show that FINOLA can be used as a feature extractor for downstream tasks.
Strengths
- Reasonable and understandable writing.
- Technically novel.
- Comprehensive experiments.
Weaknesses
- Justification is needed for the novel design choices (which seem heuristic) compared to the regular AR model:
- Predefined assignments in Block-wise Masked FINOLA
- Predicting three points in Block-wise Masked FINOLA
- Why first order?
- Lack of interpretation of the derived PDE.
- Some parts are unclear (please see the Questions below)
Questions
-
(page 2) The authors argued that "the coefficient matrices A and B capture the relationship between each position". How do the coefficient matrices directly know the spatial relationship? They do not take any spatial information as input. I believe there are some pieces of missing information:
- Coefficient matrices A and B capture a channel-wise relationship.
- The pattern and correlation encoded in the channels are related to the positional information.
- The coefficient matrices A and B (indirectly) capture the relationship between each position. Maybe it is trivial for some readers, but it is not for me.
-
There are some questions about the PDEs.
- (page 4) “They represent a theoretical extension of FINOLA from a discrete grid to continuous coordinates.“ This is not intuitive to me. As far as I understand, the proposed method in this paper is also using the discrete grid. Assuming that the extension to continuous coordinates is to get a better theoretical understanding, what is the insight and take away from the fact that Eq. 1 becomes the formulation of PDE in Eq. 4?
- (page 4) “Establishing their theoretical validity poses a substantial challenge.” What is the substantial challenge?
-
The specific method of the block-wise Masked FINOLA is unclear. For example, if we look at the Corner case, let's say an input coordinate is (3,2) and it is used for predicting {(11,10), (11,2), (3,10)}. I guess for obtaining (11,2), for instance, the horizontal FINOLA step is applied repeatedly, which means (4,2) is predicted first and it is used as an input for predicting (5,2), and so forth. My question is, considering that we already have ground truth within (0,2), (1,2), …, (7,2), why not use the ground-truth information?
-
How is the Gaussian curvature related to capturing semantics? Could you add more detailed descriptions of how it is computed?
-
(Table 4 (a)) Even though Stable Diffusion is a Generative Model, I believe it should have better PSNR than FINOLA for the image reconstruction task. How did you implement the image reconstruction task for Stable Diffusion?
-
How fast is the parallel implementation (Fig. 3) compared to the regular AR setting?
-
Is this method fast enough to use as a feature extractor for the downstream task?
Thank you for dedicating your time and effort to provide feedback on our work. Below, we answer the questions that have been raised.
Weakness 1: Justifications for the novel design choice (but seems heuristic) compared to the regular AR model.
Thanks for the suggestion. Below we provide justification for each item.
Predefined assignments and predicting three points in Block-wise Masked FINOLA
This design is meticulously crafted to achieve several key objectives: (a) Full coverage: ensuring that all masked positions receive attention, (b) Balance: distributing the usage of unmasked positions equally, and (c) Efficiency: utilizing each unmasked position to predict three masked positions, especially under the condition where 75% of positions are masked. Additionally, this design can be implemented in a block-wise manner to further enhance efficiency, a crucial consideration for the extended training durations involved in self-supervised learning.
Why first order?
The selection of first-order autoregression is driven by its inherent simplicity and local property, where the derivatives along the $x$ and $y$ axes are exclusively determined by the current position. This strategic choice not only simplifies the computational process but also aims to unveil valuable mathematical insights that could be shared across different images, provided the approach proves effective.
Weakness 2: Lack of interpretation of the derived PDE.
The derived partial differential equations (PDEs) in Eq. 4 provide a theoretical extension of FINOLA from a discrete grid to continuous coordinates, establishing a conceptual bridge between the discrete representations of neural networks and the continuous nature of human vision. Although our digital images and feature maps are inherently discrete, the human vision system perceives continuous signals.
This theoretical extension allows us to hypothesize about the mathematical properties of continuous feature representations in the human vision system. We posit that the proposed PDEs in Eq. 4, derived from their discrete counterparts in Eq. 1, capture the patterns inherent in continuous features within human vision.
This is related to Q2 below.
Q1: How do the coefficient matrices A and B directly know the spatial relationship? They do not take any of the spatial information.
The coefficient matrices $A$ and $B$ encode the relationship between a feature vector $z(x, y)$ and its spatial derivatives, namely $\partial z/\partial x$ and $\partial z/\partial y$. This relationship is mathematically expressed as:

$$\frac{\partial z}{\partial x} = A\,\hat{z}(x, y), \qquad \frac{\partial z}{\partial y} = B\,\hat{z}(x, y),$$

where $\hat{z}(x, y) = \frac{z(x, y) - \mu}{\sigma}$.
It's essential to note that matrices $A$ and $B$ exhibit invariance across different images and pixels on the feature map once they are learned from the data.
1.1 Coefficient matrices $A$ and $B$ capture channel-wise relationship.
Indeed, the difference of the channel vector between consecutive positions, such as $z(x+1, y) - z(x, y)$, represents a linear combination of normalized channel values at the current position $(x, y)$.
1.2: The pattern and correlation encoded in the channel are related to the positional information.
The correlation does NOT pertain to the positional information; instead, it is associated with the derivatives (or rate of change) along the $x$ and $y$ axes, often referred to as "spatial derivatives".
1.3: The coefficient matrices $A$ and $B$ (indirectly) capture the relationship between each position. Maybe it is trivial for some readers but it is not for me.
As discussed earlier, the matrices $A$ and $B$ explicitly capture the relationship between the feature at a given position and its derivatives along the $x$ and $y$ axes. Consequently, this encoding implicitly extends to capturing the relationship between consecutive positions, such as from $(x, y)$ to $(x+1, y)$.
Q2: There are some questions about the PDEs.
2.1 (page 4) “They represent a theoretical extension of FINOLA from a discrete grid to continuous coordinates.“ This is not intuitive to me. As far as I understand, the proposed method in this paper is also using the discrete grid. Assuming that the extension to continuous coordinates is to get a better theoretical understanding, what is the insight and take away from the fact that Eq. 1 becomes the formulation of PDE in Eq. 4?
The mention of extending from a discrete grid to continuous coordinates in Eq. 4 represents a theoretical consideration. While our digital images and their feature maps are inherently discrete, we draw a connection to the human vision system, which perceives continuous signals. The extension to continuous coordinates in Eq. 4 is more of a conceptual bridge to relate our discrete representation to the continuous nature of human vision.
This theoretical extension allows us to postulate the mathematical characteristics of continuous feature representations within the human visual system. We posit that the PDEs proposed in Eq. 4, evolving from their discrete counterparts in Eq. 1, effectively encapsulate the inherent patterns of continuous features in human vision.
Furthermore, employing PDEs serves as a guide for validation. If these PDEs hold true, we should be able to leverage them to generate features at multiple resolutions. Our experimental results indeed validate this hypothesis, demonstrating that FINOLA consistently performs well across various resolutions, ranging from 8×8 to 256×256. Particularly noteworthy is the performance at the highest resolution (256x256), where the feature map matches the image size, and the decoder remains lightweight with only three 3x3 convolutional layers. This showcases the effectiveness of our approach.
2.2: (page 4) “Establishing their theoretical validity poses a substantial challenge.” What is the substantial challenge?
The substantial challenge in establishing the theoretical validity of the partial differential equations (PDEs) over the feature map on continuous coordinates lies in the inaccessibility of the feature space of the human vision system. The theoretical validation involves hypothesizing that these PDEs reveal the mathematical properties of continuous feature representations in the human vision system.
However, the direct validation of these hypotheses on the feature space of the human vision system is impractical due to its inaccessibility. Unlike the feature maps generated by neural networks, which are discrete and accessible, the true continuous feature space in human vision remains beyond our reach for direct theoretical validation. Therefore, we resort to empirical validation on the discrete grid of feature maps generated by neural networks as a practical and accessible means to support our hypotheses and demonstrate the effectiveness of our approach.
Q3: The specific method of the block-wise Masked FINOLA is unclear. For example, if we look at the Corner case, let's say an input coordinate is (3,2) and it is used for predicting {(11,10), (11,2), (3,10)}. I guess for obtaining (11,2), for instance, the horizontal FINOLA step is applied repeatedly, which means (4,2) is predicted first and it is used as an input for predicting (5,2), and so forth. My question is, considering that we already have ground truth within (0,2), (1,2), …, (7,2), why not use the ground-truth information?
Great observation! The current design of block-wise Masked FINOLA intentionally avoids utilizing available ground truth along the predicting path (e.g., excluding (0,2), (1,2), ..., (7,2)). This decision stems from careful considerations related to both training speed and the difficulty of the prediction task.
Training speed: Employing multiple ground truths in the unmasked block to predict every masked position would be slow. In our current solution, each masked position is predicted from a single unmasked position, enhancing efficiency. Furthermore, the prediction process can be implemented in parallel at the group level. For instance, applying horizontal FINOLA eight times on the whole block (0:8, 0:8) to predict the block on the right (8:16, 0:8) is possible because all unmasked positions within a group (0:8, 0:8) share the same offset towards their assigned masked positions; a sketch of this block-level parallel step is given below. This parallelism is crucial for self-supervised pre-training, which requires an extended training period (800 or 1600 epochs). However, this advantage is lost when multiple ground truths are involved in predicting every masked position.
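Here is a minimal NumPy sketch of this group-level parallelism as we describe it; `finola_step_block` and `predict_right_block` are illustrative names, and all sizes are toy values:

```python
import numpy as np

def finola_step_block(Z, A, eps=1e-6):
    # Z: (C, H, W) block of features. Normalize each position over its C
    # channels, then advance every position one step to the right at once;
    # a single matrix product covers all H*W positions in parallel.
    mu, sigma = Z.mean(axis=0), Z.std(axis=0)
    Z_hat = (Z - mu) / (sigma + eps)
    return Z + np.einsum('cd,dhw->chw', A, Z_hat)

def predict_right_block(Z_unmasked, A, steps=8):
    # Apply horizontal FINOLA `steps` times to the whole unmasked block, so
    # each position (x, y) predicts its assigned masked position (x+steps, y).
    Z = Z_unmasked
    for _ in range(steps):
        Z = finola_step_block(Z, A)
    return Z

# Toy example: an 8x8 unmasked block predicts the 8x8 block to its right.
C = 8
rng = np.random.default_rng(0)
Z0 = rng.normal(size=(C, 8, 8))
A = 0.1 * rng.normal(size=(C, C))
Z_right = predict_right_block(Z0, A, steps=8)
```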
Difficulty of prediction: Using closer ground truth (e.g., predicting (11,2) from (7,2) rather than from (3,2)) reduces the prediction challenge, potentially compromising representation learning. We conducted an experiment by constraining the prediction offset to less than 4. For instance, if the unmasked block is (0:8, 0:8), the final feature map is (0:12, 0:12) rather than (0:16, 0:16). Specifically, we used (4:8, 0:8) to predict (8:12, 0:8), (0:8, 4:8) to predict (0:8, 8:12), and (4:8, 4:8) to predict (8:12, 8:12). The linear probing performance dropped from 66.4 to 63.9.
The design of block-wise Masked FINOLA carefully balances considerations of training speed and pre-training difficulty. We appreciate your feedback and will enhance the clarity of these details in the final draft.
Q4: How is the Gaussian curvature related to capturing semantics? Could you add more detailed descriptions of how it is computed?
Calculation of Gaussian curvature: The Gaussian curvature is computed per feature channel. Consider the $k^{th}$ channel, whose feature map $f(x, y)$ represents a 2D surface in 3D space. At each position $(x, y)$, the Gaussian curvature is determined as follows:

$$K(x, y) = \frac{f_{xx} f_{yy} - f_{xy}^2}{\left(1 + f_x^2 + f_y^2\right)^2}.$$

Finite differences are employed to approximate the first- and second-order derivatives, resulting in a curvature map. The root mean square of the peak positive curvature ($K_{+}$) and the peak negative curvature ($K_{-}$) per channel is then computed as $\kappa = \sqrt{(K_{+}^2 + K_{-}^2)/2}$. Channels are ranked based on $\kappa$, focusing on the top-K channels with significantly curved feature surfaces.
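For reference, a short NumPy sketch of this computation as we read it; the RMS form of $\kappa$ above is our reconstruction, so treat `channel_score` as an assumption rather than the paper's exact code:

```python
import numpy as np

def gaussian_curvature(f):
    # f: (H, W) single-channel feature map, viewed as a surface z = f(x, y).
    fy, fx = np.gradient(f)        # first-order finite differences
    fyy, _ = np.gradient(fy)       # second-order derivatives
    fxy, fxx = np.gradient(fx)
    return (fxx * fyy - fxy ** 2) / (1.0 + fx ** 2 + fy ** 2) ** 2

def channel_score(f):
    # RMS of the peak positive and peak negative Gaussian curvature.
    K = gaussian_curvature(f)
    return np.sqrt(0.5 * (K.max() ** 2 + K.min() ** 2))

# Rank the channels of a (C, H, W) feature map by curvature score.
rng = np.random.default_rng(0)
feat = rng.normal(size=(32, 16, 16))
scores = np.array([channel_score(feat[k]) for k in range(feat.shape[0])])
top_k = np.argsort(scores)[::-1][:8]  # indices of the 8 most curved channels
```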
Relation to capturing semantics: The correlation between Gaussian curvature and capturing semantics is observed through experiments. Firstly, Masked FINOLA (pre-trained with masking) exhibits significantly higher image classification performance than vanilla FINOLA (66.4 vs. 17.9 on linear probing) but performs significantly worse on image reconstruction (17.3 vs. 25.8 PSNR), indicating that the Masked FINOLA pre-trained encoder trades off details for capturing more semantics. Secondly, Masked FINOLA demonstrates significantly larger Gaussian curvature on the top-K channels compared to vanilla FINOLA. This correlation suggests a link between semantics and larger Gaussian curvature on critical features. Further exploration of this correlation is left for future work.
Q5: (Table 4 (a)) Even though Stable Diffusion is a Generative Model, I believe it should have better PSNR than FINOLA for the image reconstruction task. How did you implement the image reconstruction task for Stable Diffusion?
It's important to note that even though Stable Diffusion is a generative model, the comparison in this context was limited to the performance of the first stage, referred to as perceptual image compression. In this stage, an autoencoder is trained to reconstruct images. The performance metrics for Stable Diffusion were obtained from the original paper, "High-Resolution Image Synthesis with Latent Diffusion Models, CVPR 2022", specifically from Table 8 in Appendix D. The latent size chosen for this comparison was 16x16 (downsampling factor $f=16$).
Q6: How fast is the parallel implementation (Fig. 3) compared to the regular AR setting?
The table below presents a comparison between our parallel implementation and the regular autoregressive (AR) setting in terms of inference runtime. The runtime evaluation considers the complete inference pipeline, including encoding, autoregression, and decoding, for a 256x256 image on a MacBook Air with an Apple M2 CPU. The comparison is conducted for two feature map sizes, 16x16 and 64x64.
| AR | 16x16 | 64x64 |
|---|---|---|
| Regular AR | 1.7s | 7.7s |
| Parallel (ours) | 1.2s | 2.6s |
The parallel implementation demonstrates superior speed compared to the regular AR setting. It achieves a 30% speedup at a resolution of 16x16 and a threefold increase in speed at a higher resolution of 64x64.
Q7: Is this method fast enough to use as a feature extractor for the downstream task?
Yes, our method utilizes Mobile-Former as an encoder, which is a lightweight architecture efficiently combining MobileNet and Transformer. The original paper "Mobile-Former: Bridging MobileNet and Transformer, CVPR 2022" shows detailed information on runtime performance (comparable with MobileNet and ShuffleNet). This paper demonstrates its effective performance for downstream tasks like image classification and object detection, after FINOLA pre-training.
Dear Reviewers,
We hope this message finds you well. We want to extend a friendly reminder that the author-reviewer discussion period is nearing its close. Your input and perspective have been highly valuable throughout this process. If any questions or concerns about our rebuttal arise, we're here to listen and address them in a timely manner. Your insights are crucial in refining our manuscript further.
Understanding the demands on your time, we deeply appreciate your active participation in this rebuttal discussion. Your contributions are invaluable to us, and we genuinely thank you for your dedication.
Should you have any inquiries or thoughts, please don't hesitate to reach out.
Best regards,
-authors
Dear Reviewers,
We are writing to remind you that the author-reviewer discussion period is soon coming to an end. If you have any concerns or questions regarding our rebuttal, please do not hesitate to inform us. Your input will be valuable in helping us refine our manuscript.
Understanding the demands on your schedule, we sincerely appreciate the time you've dedicated to this rebuttal discussion. Your thoughtful feedback is immensely helpful to us.
Thank you for your attention.
Best regards,
-authors
This paper introduces FINOLA, a novel framework that represents each image as a first-order norm+linear autoregressive process, revealing the presence of underlying partial differential equations (PDEs) governing the latent feature space. The authors validate the proposed method through experiments in image reconstruction and self-supervised learning. While the method itself is interesting and novel, improvements are needed in writing and narrative clarity. Some details and necessary explanations are omitted, causing confusion. The rebuttals partially addressed concerns raised by reviewers. However, during internal discussions, all reviewers unanimously agree to reject the paper due to unconvincing experimental results (Reviewer QCbo and Reviewer P6tB), lack of comparison with state-of-the-art regression-based models (Reviewer P6tB), and unclear statements and claims (Reviewer P6tB and Reviewer JB6i). Considering these factors, the Area Chair recommends rejecting the paper in its current state. The Area Chair encourages the authors to enhance their paper by carefully considering all the valuable suggestions provided by the reviewers, aiming for submission to the next venue.
Why Not a Higher Score
All the reviewers reached a consensus to reject the paper due to the following concerns: unconvincing experimental results (Reviewer QCbo and Reviewer P6tB), lack of comparison with state-of-the-art regression-based models (Reviewer P6tB), and unclear statements and claims (Reviewer P6tB and Reviewer JB6i). The paper is not ready for publication.
Why Not a Lower Score
NA
Reject