PaperHub
5.5 / 10
Poster · 4 reviewers
Min: 3 · Max: 3 · Std: 0.0
Scores: 3, 3, 3, 3
ICML 2025

Hierarchical Masked Autoregressive Models with Low-Resolution Token Pivots

Submitted: 2025-01-23 · Updated: 2025-07-24

Abstract

Keywords
Masked Autoregressive Models, Image Generation

Reviews and Discussion

Review
Rating: 3

This paper proposes a Hierarchical Masked Autoregressive Model based on MAR (Li et al., 2024) by introducing a low-resolution modeling phase, which is claimed to provide global structure guidance for generating dense image tokens. Specifically, 2x-lower-resolution image tokens are first modeled by a scale-aware transformer block in a bidirectional masked-modeling manner, followed by an MLP diffusion head that predicts the continuous latents. Then, in the second phase, the predicted latents, rather than the ground-truth tokens, are used as additional conditions for generating high-resolution tokens with the same scale-aware transformer but a different transformer-based diffusion head. The proposed model is tested on ImageNet and MS-COCO for C2I and T2I generation and compared with existing approaches.
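
To make the two-phase pipeline summarized above concrete, below is a minimal, hypothetical sketch of the generation flow in PyTorch. All module names, dimensions, and the single forward pass per phase are assumptions made for readability, not the authors' actual implementation (which samples masked tokens iteratively with diffusion heads).

```python
# Hypothetical sketch of the two-phase Hi-MAR generation flow summarized above.
# Module names, dimensions, and the single pass per phase are illustrative assumptions.
import torch
import torch.nn as nn

class ScaleAwareTransformer(nn.Module):
    """Shared bidirectional transformer; a scale embedding tells it which phase it serves."""
    def __init__(self, dim=256):
        super().__init__()
        self.scale_emb = nn.Embedding(2, dim)  # phase 0: low-res pivots, phase 1: dense tokens
        self.block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)

    def forward(self, tokens, phase):
        return self.block(tokens + self.scale_emb(torch.tensor([phase])))

def generate(model, mlp_head, dit_head, n_low=64, n_high=256, dim=256):
    # Phase 1: bidirectional masked modeling of the 2x-lower-resolution tokens.
    low = torch.zeros(1, n_low, dim)              # stand-in for fully masked low-res tokens
    cond_low = model(low, phase=0)                # conditional tokens from the shared transformer
    _ = mlp_head(cond_low)                        # MLP diffusion head -> low-res continuous latents

    # Phase 2: the *predicted* conditional tokens (not ground-truth tokens) guide dense prediction.
    high = torch.zeros(1, n_high, dim)            # stand-in for masked high-res tokens
    cond_high = model(torch.cat([cond_low, high], dim=1), phase=1)[:, n_low:]
    return dit_head(cond_high)                    # Diffusion Transformer head -> dense latents

if __name__ == "__main__":
    model = ScaleAwareTransformer()
    mlp_head = nn.Linear(256, 256)                # placeholder for the per-token MLP diffusion head
    dit_head = nn.TransformerEncoderLayer(256, nhead=4, batch_first=True)  # placeholder DiT head
    print(generate(model, mlp_head, dit_head).shape)  # torch.Size([1, 256, 256])
```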

update after rebuttal: Based on the reviews, rebuttal, and discussions, I find this paper to be borderline. While the additional information and experiments provided in the rebuttal help strengthen the paper within its reasonable scope, I remain unconvinced by its level of novelty. That said, I have raised my final rating to a weak accept.

Questions for Authors

I have specified my concerns and questions in the sections above; please refer to those parts.

Claims and Evidence

The essential claims, including the lack of global context, training-inference discrepancy, independent sampling issue, and speed/accuracy trade-off, are well-justified and largely supported by empirical results within the scope of this paper.

However, whether these claims are conclusive for large-scale problems or models is unclear. For example,

  • The training-inference discrepancy issue is only addressed for passing tokens from phase 1 to phase 2, while during the autoregressive process of stage 1 and stage 2, teacher-forcing learning is still applied.
  • It is unclear how speed and accuracy behave for longer visual sequences. For example, will the additional phase 1 and the Diffusion Transformer head in phase 2 lead to much higher computational cost for higher-res images?

Methods and Evaluation Criteria

The proposed method is largely based on MAR (and incorporates many ideas from the VAR paper about scales). It is reasonable, well-motivated, and clearly feasible.

The paper follows MAR and uses the most widely applied ImageNet dataset to evaluate 256x256-res C2I generation. The paper also uses MS-COCO to evaluate T2I results. Both are very common practices and reasonable.

Theoretical Claims

The essential claims, including the lack of global context, training-inference discrepancy, independent sampling issue, and speed/accuracy trade-off, are well-justified and largely supported by empirical results within the scope of this paper. I have mentioned potential concerns in Claims And Evidence and Experimental Designs Or Analyses in my review; please refer to those sections for details.

Experimental Designs or Analyses

  1. I am not satisfied with the T2I experiment. The model is only tested on MS-COCO and only compares FID, which can hardly reflect the actual performance.
  • Other more comprehensive benchmarks should be evaluated, such as T2I-CompBench (Huang et al., 2023) and GenEval (Ghosh et al., 2023).
  • Only a small-scale model (Hi-MAR-S) is tested. By the way, the exact configuration of Hi-MAR-S is not specified in Table 1.
  2. Is the diffusion loss at the second stage the only training objective? How many steps are the Diffusion heads trained for? Is it the same as MAR (1000 steps)? And do the models in Figure 4 follow this training setup?

  3. Table 2 (w/ CFG) shows that as the model size increases, the gap between Hi-MAR and MAR reduces. I am concerned that the proposed method is not scalable to larger images and large models.

Supplementary Material

N/A. No supplementary material has been submitted.

Relation to Broader Scientific Literature

The proposed method is largely based on MAR (Li et al., 2024) and incorporates many ideas about scales from the VAR (Tian et al., 2024) paper. In short, MAR proposes a diffusion head based on masked-modeling bidirectional autoregressive model for predicting continuous latents instead of discrete code. VAR proposes a scale-wise instead of token-wise autoregressive paradigm. The key contribution of this paper, introducing an extra smaller-scale autoregressive modeling phase to the single-stage MAR framework, is highly relevant to the VAR's scale-wise idea.

Essential References Not Discussed

This paper includes and clearly refers to most of the essential literature, and I have no problem with this. There are many concurrent works about AR+continuous latent and AR+multi-scale (e.g., Fluid, Infinity, FlowAR, HART, FlexVAR, FractalAR, ...) that can be added to the later version of this paper.

Other Strengths and Weaknesses

I like the overall idea of introducing scale-wise modeling to MAR, along with other nice adaptations. To me, this is a safe innovation but somewhat incremental. My major concern is that the scope and depth of this study are too limited to reveal the potential of the approach.

Other Comments or Suggestions

At the current stage, I think the most doable items are expanding the T2I experiment, providing more visualizations, and improving the analysis to include more insightful studies, such as the impact of the phase 1 resolution and other design choices of the diffusion heads.

To strengthen this paper, I expect experiments on longer visual sequences (higher-resolution images) and larger models to justify the proposed method's scalability, efficiency, and generalization ability.

Author Response

Q1: Training-inference discrepancy issue

Yes. We mainly focus on the design of the hierarchical masked autoregressive model, which addresses the training-inference discrepancy when passing tokens from phase 1 to phase 2, while the discrepancy caused by the inherent autoregressive process within each stage still remains. Such a discrepancy also occurs in most existing autoregressive models. We will discuss this.

Q2: Speed and accuracy for longer visual sequences

As suggested, we experimented at the larger 512 resolution: Hi-MAR-L (FID: 1.62) outperforms MAR-L (FID: 1.73), while its computational cost is 20.9% lower than that of MAR-L. We will add this.

Q3: More comprehensive benchmarks

Thanks. As suggested, we evaluate the T2I model on the T2I-CompBench and GenEval benchmarks. Due to limited computational resources, we compare our Hi-MAR with other state-of-the-art methods (e.g., U-ViT-S and AutoNAT-S) that are trained on MS-COCO with a similar parameter size. Larger models with billions of parameters trained on billions of images (e.g., SDXL, SD3) are not included. As shown in the following tables, our Hi-MAR consistently outperforms the other baselines of comparable parameter size. We will add this.

GenEval:

Method    | Single Obj. | Two Obj. | Counting | Colors | Positions | Color Attri. | Overall
U-ViT-S   | 83.75 | 18.69 | 16.25 | 38.03 | 3.00 | 0.75 | 26.75
AutoNAT-S | 81.25 | 18.18 | 16.56 | 36.70 | 3.75 | 1.25 | 26.28
Hi-MAR-S  | 89.06 | 22.73 | 17.50 | 44.41 | 2.50 | 2.25 | 29.74

T2I-CompBench:

Method    | Color  | Shape  | Texture | Spatial | Non-Spatial | Complex
U-ViT-S   | 0.3626 | 0.2682 | 0.3474  | 0.0353  | 0.2693 | 0.2219
AutoNAT-S | 0.3225 | 0.2466 | 0.3389  | 0.0453  | 0.2468 | 0.2024
Hi-MAR-S  | 0.3862 | 0.2782 | 0.3945  | 0.0409  | 0.2690 | 0.2313

Q4: Configuration of Hi-MAR-S

For fair comparison, Hi-MAR-S follows the configuration of U-ViT-S/2 (Deep) and its exact configuration is shown in the following table. We will add this in Table 1.

Model    | Hi-MAR Transformer (#Layers / Hidden size) | Diff. Head1 (#Layers / Hidden size) | Diff. Head2 (#Layers / Hidden size) | #params
Hi-MAR-S | 17 / 512 | 5 / 512 | 5 / 512 | 108M

Q5: Training setup

To be clear, we only employ the diffusion loss as the training objective in both stages. Following MAR, we set the maximum timestep as 1,000 for the Diffusion heads. The models in Figure 4 follow this training setup.
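
For clarity, the following is a minimal sketch of what a MAR-style per-token diffusion loss with a 1,000-step schedule could look like; the linear beta schedule, tensor shapes, and the `head` callable are assumptions for illustration, not the paper's exact setup.

```python
# Minimal sketch of a MAR-style diffusion loss with T = 1000 timesteps (illustrative only).
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)             # assumed linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def diffusion_loss(head, x0, z):
    """x0: ground-truth continuous tokens (B, N, D); z: conditioning tokens from the transformer."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                 # random timestep per sample
    a_bar = alphas_bar[t].view(b, 1, 1)
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise   # forward process q(x_t | x_0)
    return F.mse_loss(head(x_t, t, z), noise)     # the head predicts the added noise

# Toy usage with a dummy head (shape-checking only):
dummy_head = lambda x_t, t, z: torch.zeros_like(x_t)
print(diffusion_loss(dummy_head, torch.randn(2, 64, 16), torch.randn(2, 64, 16)))
```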

Q6: The gap between Hi-MAR and MAR reduces. The scalability of Hi-MAR to larger images and larger models

The performance on ImageNet is almost saturated, and it is relatively difficult to achieve a large margin of improvement. Considering this comment, we conducted the suggested experiments at a larger resolution (i.e., 256 -> 512), and the FID reaches 1.62, which improves over MAR-L by 0.11. Furthermore, we scale both MAR and Hi-MAR to 2B parameters, and the FID of MAR and Hi-MAR is 1.49 and 1.45, respectively. The results basically demonstrate that Hi-MAR is scalable to both larger images and larger models. We will add this.

Q7: Concurrent works

We appreciate the suggested concurrent works and are happy to discuss them in the revised version.

Q8: Safe innovation

Please refer to Q1 of Reviewer SRgF for more discussion on technical contribution against existing works.

Q9: More visualizations

As suggested, we provide more visualization results at the provided link. We will add this.

Q10: Impact of phase 1 resolution

We experimented with a smaller resolution (i.e., 64x64) for phase 1, and the FID score degrades to 2.06 because the quality of images generated at such a small resolution is relatively low. Therefore, we choose the 128x128 resolution for the first phase. We will add this.

Phase 1 Resolution | FID
64x64   | 2.06
128x128 | 1.93

Q11: Other design choices of the diffusion heads

We also experimented with replacing the self-attention layer with cross-attention in the Diffusion Transformer head to mine the context among all tokens, and the FID degrades slightly, by 0.05, compared to the final version of Hi-MAR. We will discuss this in the ablation study.

Reviewer Comment

Thanks to the authors for responding to my questions and providing additional results that further support the claims. I don't have any further questions, and I will decide the final rating based on all the reviews, rebuttals, and discussions. Thanks.

Review
Rating: 3

This paper proposes a hierarchical masked AR visual generation model with low-resolution tokens as pivots. By first generating low-resolution image tokens, which provide global structure, the second generation phase can benefit from the global context. Besides, a Diffusion Transformer head is proposed to further improve the results. Experimental results show that Hi-MAR obtains better performance than baselines.

Questions for Authors

  1. Have you tried other resolution settings, like 128->512, 256->512?

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes

Theoretical Claims

No theoretical claims

Experimental Designs or Analyses

Yes.

Supplementary Material

No Supplementary Material

Relation to Broader Scientific Literature

I think the key problem with Hi-MAR is its limited novelty relative to Muse [1], VAR [2], and HART [3].

  1. Muse also uses a super-resolution transformer to generate the final image, with low-resolution information fused by cross-attention.
  2. VAR proposes multi-scale generation for image generation. Hi-MAR only uses one low-resolution scale; I think this is a special case of VAR.
  3. In HART (missing reference), residual diffusion is used to improve the performance of VAR. Considering these three papers, the novelty of Hi-MAR is quite limited.

[1] Muse: Text-to-Image Generation via Masked Generative Transformers. ICML 2023.
[2] Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction. NeurIPS 2024.
[3] HART: Efficient Visual Generation with Hybrid Autoregressive Transformer. ICLR 2025.

Essential References Not Discussed

See Relation To Broader Scientific Literature.

Other Strengths and Weaknesses

Strengths:

  1. The paper is easy to follow, and the idea is intuitively easy to understand.
  2. The performance is impressive.
  3. Reducing the AR steps in the second phase improves the generation speed.

Weaknesses:

  1. Limited novelty. See Relation To Broader Scientific Literature.
  2. Missing speed comparison with VAR and HART.

Other Comments or Suggestions

N/A

Author Response

Q1: Novelty

Thanks. We summarize the differences between Hi-MAR and conventional multi-scale generation models as follows:

  1. During training, conventional models (i.e., Muse, VAR, HART) commonly utilize the ground-truth low-resolution visual tokens directly to guide the next-phase prediction. Instead, Hi-MAR takes the conditional tokens estimated by the Hi-MAR Transformer from the low-resolution visual tokens as the condition. Such a design can mitigate the training-inference discrepancy as discussed in Section 3.2. As shown in Table 4, the FID score improves from 2.28 to 2.07 when adopting the conditional tokens to mitigate such discrepancy.
  2. Both VAR and HART model the multi-scale probability distribution via a shared Transformer without additional guidance. This leaves the inherently different peculiarities of each scale in autoregressive modeling not fully exploited, resulting in a sub-optimal solution for multi-scale token prediction. In contrast, our Hi-MAR incorporates a scale-aware Transformer block that provides scale guidance to the Transformer tailored to each phase.
  3. Both VAR and Muse discretize images by VQGAN, resulting in severe information loss. Instead, Hi-MAR adopts a continuous tokenizer via a diffusion loss, overcoming the poor generation upper bound caused by vector quantization.
  4. Muse utilizes two different models for low/high-resolution image generation, and the two models are trained separately. Instead, our Hi-MAR jointly optimizes the probability distribution of low/high-resolution tokens with a shared scale-aware masked autoregressive Transformer and two small diffusion heads, which is more parameter-efficient.
  5. In contrast to HART, which utilizes an MLP-based diffusion head to model each token's probability distribution individually, Hi-MAR devises a Diffusion Transformer head that exploits self-attention to model the interdependency among tokens (see the sketch after this list). Note that we appreciate the suggested reference to the concurrent work HART (published at ICLR 2025, with camera-ready deadline on Mar 14, 2025). We will add the discussion.
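
The sketch below illustrates the contrast drawn in point 5: a per-token MLP denoiser versus a Transformer denoiser whose self-attention models the interdependency among tokens. It is a schematic assumption (layer sizes, omission of timestep/condition inputs), not the authors' actual configuration.

```python
# Schematic contrast between the two kinds of diffusion heads discussed in point 5.
# Timestep and conditioning inputs are omitted for brevity; sizes are illustrative assumptions.
import torch
import torch.nn as nn

class MLPHead(nn.Module):
    """MAR/HART-style head: each token is denoised independently (no cross-token interaction)."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, x_t):          # x_t: (B, N, D) noisy tokens
        return self.net(x_t)         # applied token-wise

class TransformerHead(nn.Module):
    """Hi-MAR-style Diffusion Transformer head: self-attention mixes information across tokens."""
    def __init__(self, dim=256, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.net = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, x_t):
        return self.net(x_t)         # every token attends to every other token

x_t = torch.randn(1, 256, 256)
print(MLPHead()(x_t).shape, TransformerHead()(x_t).shape)   # both torch.Size([1, 256, 256])
```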

Q2: Speed comparison with VAR and HART

As suggested, we compare the speed of our Hi-MAR with the mentioned VAR and HART in the following table. The results basically demonstrate that Hi-MAR achieves superior performance against VAR and HART with comparable computational costs. We will add this in the revision.

Method    | #Para. | FID  | Phase_1 Steps | Phase_2 Steps | Diff. Head_1 Steps | Diff. Head_2 Steps | Inference Time (per image)
VAR-d20   | 600M   | 2.57 | -             | 10            | -                  | -                  | 0.14489
HART-d20  | 649M   | 2.39 | -             | 10            | -                  | 8                  | 0.15401
DiT-XL/2  | 675M   | 2.27 | -             | -             | -                  | 250                | 0.78970
MAR-B     | 208M   | 2.31 | -             | 256           | -                  | 100                | 0.52641
Hi-MAR-B  | 244M   | 2.00 | 32            | 4             | 100                | 50                 | 0.13587
Hi-MAR-B  | 244M   | 1.93 | 32            | 4             | 100                | 250                | 0.28552

Q3: Other resolution settings

Thanks. We experimented by equipping Hi-MAR-L with a larger resolution (i.e., 256 -> 512), and the FID score reaches 1.62, which outperforms MAR-L at 512 resolution by 0.11. The result again validates the effectiveness of exploiting hierarchical autoregression and modeling the interdependency among tokens. We will add this.

Review
Rating: 3

The paper introduces Hi-MAR, a hierarchical masked generative model for visual generation. Hi-MAR first predicts low-resolution image tokens as global structural pivots, which then guide the next phase of dense token prediction, enhanced by a Diffusion Transformer head for better global context modeling. Experiments on image generation tasks show that it outperforms baselines and is computationally efficient.

Questions for Authors

N/A

Claims and Evidence

Fine. The claims are intuitive and easy to follow.

Methods and Evaluation Criteria

Fine.

Theoretical Claims

N/A

Experimental Designs or Analyses

Given the success of recent next-scale prediction approaches, such as VAR, it is not surprising that the proposed cascaded method could be beneficial. However, its practical advantage over VAR remains unclear. Additionally, the method explicitly adopts a "two-stage" approach—would further stages yield additional improvements? More experiments and analyses may be required to address the two concerns.

Supplementary Material

N/A

Relation to Broader Scientific Literature

N/A

Essential References Not Discussed

N/A

Other Strengths and Weaknesses

The reviewer acknowledges that the paper's technical contributions to visual generation are acceptable but not particularly significant, considering the success of VAR. The approach appears to be a specific "two-scale" variant of VAR, replacing discrete tokenization (i.e., VQ) with continuous diffusion, inspired by MAR. The authors are encouraged to provide deeper insights into the discussion and include necessary comparisons to "VAR w/ diffusion heads" to assess whether additional stages could lead to further improvements. As I am not an expert in this area, I will seek input from other reviewers for a more objective evaluation. This recommendation is not final.

Other Comments or Suggestions

Has the autoregressive relationship in Figure 1(a) been drawn incorrectly? Is the causal relationship represented by the black connecting lines reversed? It seems like a left-right mirror flip of the figure would be correct.

Post-rebuttal: Thanks to the authors for the detailed responses. I am ok with the rebuttal. I hope the discussions can be incorporated into the revision. Good luck.
Author Response

Q1: Discussion on VAR and comparisons to "VAR w/ diffusion heads"

Thanks. We summarize the contributions of our Hi-MAR against VAR in two points:

  1. VAR utilizes the low-resolution visual tokens directly to guide the next-phase prediction, which causes a training-inference discrepancy as discussed in Section 3.2. During training, VAR takes the ground-truth low-resolution tokens as the condition for the next phase. Since no ground-truth token is available at inference, VAR has to take the predicted noisy low-resolution tokens as the condition, resulting in a training-inference discrepancy. Instead, to mitigate such discrepancy, Hi-MAR takes the conditional tokens estimated by the Hi-MAR Transformer from the low-resolution visual tokens to trigger the second phase. As shown in Table 4, when replacing the pivots of visual tokens (similarly used in VAR) with our conditional tokens, the FID score clearly improves from 2.28 to 2.07, which validates the effect of the conditional tokens in mitigating the training-inference discrepancy.
  2. VAR models multi-scale probability distributions via a shared Transformer without additional guidance. This leaves the inherently different peculiarities of each scale in autoregressive modeling not fully exploited, resulting in a sub-optimal solution for multi-scale token prediction. In contrast, our Hi-MAR incorporates a scale-aware Transformer block that provides scale guidance to the Transformer tailored to each phase.

Moreover, as suggested, we experimented by implementing VAR with diffusion heads, and "VAR w/ diffusion heads" (FID: 2.67) manages to outperform VAR-d16 (FID: 3.30). Nevertheless, the performance of "VAR w/ diffusion heads" (FID: 2.67) is still inferior to our Hi-MAR (FID: 1.93), which demonstrates the effectiveness of hierarchical autoregressive modeling. We will add all discussions in revision.

Q2: Would further stages yield additional improvements

Appreciate this comment. We experimented by stacking one more stage (64x64 resolution) ahead of the low-resolution stage (128x128 resolution), and the FID score only fluctuates within the range of 0.03. We speculate that the low-resolution stage (128x128 resolution) has already provided sufficient global structure guidance for the next high-resolution stage. The use of an additional stage (64x64 resolution) might introduce unnecessary/redundant global structure information. We will discuss this in the revised version.

Q3: Autoregressive relationship of Figure 1 (a)

Thanks. To be clear, Figure 1(a) correctly illustrates the left-to-right autoregressive relations among the image token sequence. That is, each predicted token at position i (see the bottom output sequence) can only be emitted conditioned on the previous input tokens at positions less than i (see the top input sequence). We will clarify this in revision.
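
For reference, the convention being described is the standard left-to-right autoregressive factorization over the token sequence:

```latex
% Left-to-right autoregressive factorization over the image token sequence (standard convention).
p(x_1, \dots, x_N) = \prod_{i=1}^{N} p(x_i \mid x_{<i})
```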

Reviewer Comment

Thank you to the author for the clarification, which has resolved some of my concerns. I will maintain my rating as Borderline.

Overall, I acknowledge the author's exploration in engineering and the empirical results. However, I remain concerned about the significance of the technical improvements of Hi-MAR compared to VAR. From the perspective of technical contribution, this is a fairly marginal paper (meaning it has a probability of being either accepted or rejected at any top ML/CV conference). Technically, it is a combination and repackaging of existing work, with incremental contributions, including:

  1. Drawing inspiration from the multi-scale approach of VAR. Specifically, the authors explore a two-stage design.
  2. At each scale, the authors adopt a non-autoregressive masked prediction task, similar to the MaskGIT and MAGViT series. Compared to VAR’s “one-step prediction” approach at each scale, this can be seen as sacrificing some computational efficiency by using more inference steps in exchange for improved prediction quality within the scale.
  3. Additionally, to enhance the visual quality of the generated data, the authors replace the VQ operation with a diffusion head inspired by MAR, improving fidelity.

Some further suggestions:

Conduct analysis experiments to evaluate the trade-off between effectiveness and efficiency. Specifically, starting from a two-stage VAR, progressively modify the approach by:

  • Changing "directly predicting the next-scale feature map" to "iteratively predicting the next-scale feature map via masked prediction."
  • Replacing the VQ head with a diffusion head.

Evaluate the impact of these changes on both performance and efficiency.

Besides, Figure 4 in the main paper should include VAR for comparison.

Question:

I understand the author’s mention of the "training-inference discrepancy in VAR"—I believe this is a common issue for most autoregressive models, namely, the accumulation of errors during inference. However, I am still unclear on why Hi-MAR alleviates this issue. The accumulation of errors should still occur in Hi-MAR’s inference process, whether during intra-scale multi-step masked prediction or next-scale prediction.

Author Comment

Q1: Technical contribution

Appreciate your response. We would like to provide a more detailed clarification on the technical contributions of our Hi-MAR, especially compared to existing works such as VAR:

  1. The use of conditional tokens to alleviate training-inference discrepancy across scales is novel: We identify and alleviate the training-inference discrepancy across scales, which is a common yet underexplored issue in multi-scale prediction models (e.g., VAR, FlowAR, Muse). As shown in Table 4 of our paper, simply using low-resolution visual tokens as pivots to guide denser token prediction leads to marginal FID improvement (2.31 to 2.28), due to the inconsistency of pivot tokens between training and inference. To mitigate this, we propose using low-resolution conditional tokens generated by Hi-MAR Transformer to guide denser token prediction. This strategy ensures consistency between training and inference. As shown in Table 4, replacing the visual token pivots (as used in VAR) with our conditional tokens yields a notable FID improvement (2.28 to 2.07), which validates our proposal. A detailed explanation is provided in Q4 below.
  2. The proposal of the scale-aware Transformer block is novel. We introduce a scale-aware Transformer block that provides tailored scale guidance for each phase. This design is novel, effective, and not introduced in VAR. It is also worth noting that our hierarchical modeling can be easily applied to most VAEs, without the need to train a multi-scale autoencoder as required by VAR.

We therefore kindly invite Reviewer kLHw to reconsider the assessment of Hi-MAR's essential technical contributions in light of the above discussions.

Q2: Effectiveness-efficiency trade-off

As suggested, we show a detailed comparison of effectiveness and efficiency across different methods in a new table (see the link). Starting from a two-stage VAR, we progressively apply modifications: 1) adopt masked autoregression for each stage (row 2 in the table); 2) add a diffusion head for each scale (row 3 in the table). Note that we change the dimension and depth of VAR so that the parameter count of the modified VAR is similar to that of Hi-MAR-B. As shown in the table, while these modifications improve performance, they still lag behind Hi-MAR in both accuracy and speed. Notably, even with these modifications, the best FID of these variants (2.30) only approaches that of Hi-MAR pivoting on ground-truth visual tokens (FID 2.28), whereas Hi-MAR further improves to 2.07 by introducing conditional tokens, highlighting the importance of addressing the training-inference discrepancy.

Q3: Figure 4 should include VAR

Thanks. As suggested, we include VAR for comparison in the revised Figure 4 (see this link). We will add this in revision.

Q4: Training-inference discrepancy

We clarify the discrepancy issue and how Hi-MAR resolves it. Let us consider a simplified two-scale setting. In VAR:

  • Training: The model learns to predict large-scale tokens $x_l$ conditioned on ground-truth small-scale tokens $x_s$, i.e., $P(x_l \mid x_s)$.
  • Inference: The model first predicts small-scale tokens $\hat{x}_s$, which may contain errors, and then uses them to predict $x_l$, i.e., $P(x_l \mid \hat{x}_s)$.

This mismatch between $x_s$ (ground truth) in training and $\hat{x}_s$ (noisy) in inference introduces a training-inference discrepancy, leading to error accumulation and degraded generation quality.

In Hi-MAR:

  • Training: In the first phase, a proportion of the small-scale visual tokens $x_s$ are masked, and the remaining unmasked ones $x_{s,v}$ are fed into the Hi-MAR Transformer. The Hi-MAR Transformer outputs conditional tokens $z_{s,m}$, which are further fed into the diffusion head for predicting the masked tokens $x_{s,m}$, as in MAR. In the second phase, a similar masking procedure is applied to the denser visual tokens $x_l$. Instead of using the ground-truth $x_s$, the Hi-MAR Transformer takes the small-scale conditional tokens $z_{s,m}$ from the first phase, along with the unmasked visual tokens $x_{l,v}$, as input to generate denser conditional tokens $z_{l,m}$. Finally, the Diffusion Transformer head conditioned on $z_{l,m}$ is adopted to predict the denser masked tokens $x_{l,m}$.
  • Inference: We follow the same procedure (i.e., first predict small-scale conditional tokens $z_s$, and then predict denser conditional tokens $z_l$ based on $z_s$), ensuring the consistency of pivot tokens between training and inference.

This design ensures that both training and inference in Phase 2 rely on predicted conditional tokens rather than ground-truth tokens. As shown in Table 4, this leads to a notable FID improvement (2.28 to 2.07), validating our proposal.
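
For reference, the two conditioning schemes described above can be summarized compactly in the notation of this response; this is our shorthand (with $f_\theta$ denoting the shared Hi-MAR Transformer), not an equation from the paper. Note that $x_{s,v}$ are ground-truth visible tokens during training and previously generated tokens at inference, but Phase 2 always conditions on the predicted $z_{s,m}$.

```latex
% Compact summary of the two conditioning schemes, in the notation of this response
% (f_theta denotes the shared Hi-MAR Transformer; shorthand, not an equation from the paper).
\begin{align*}
\text{VAR, training:}\quad  & p_\theta(x_l \mid x_s) \quad \text{(ground-truth small-scale tokens)}\\
\text{VAR, inference:}\quad & p_\theta(x_l \mid \hat{x}_s) \quad \text{(predicted, possibly noisy tokens)}\\
\text{Hi-MAR, both:}\quad   & z_{s,m} = f_\theta(x_{s,v}),\quad
                              z_{l,m} = f_\theta(z_{s,m},\, x_{l,v}),\quad
                              p_\theta(x_{l,m} \mid z_{l,m})
\end{align*}
```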

Review
Rating: 3

This paper improves Masked Autoregressive models (MAR) by introducing hierarchical modeling; specifically, low-resolution tokens are used as pivots. Additionally, the MLP-based diffusion head is changed to a global diffusion head to further improve performance.

Questions for Authors

This paper provides an intuitive method to improve MAR and shows its effectiveness on a range of tasks; I tend to accept this paper.

Claims and Evidence

The claims made in this paper are supported by both qualitative and quantitative results.

Methods and Evaluation Criteria

The proposed method is evaluated on class-conditional image generation and text-to-image generation.

Theoretical Claims

Not applicable.

Experimental Designs or Analyses

Section 4.3 to 4.5 give a comprehensive and solid evaluation on the proposed method.

Supplementary Material

Not applicable.

Relation to Broader Scientific Literature

This paper is closely related to autoregressive visual generation, which has broad impacts.

Essential References Not Discussed

None.

Other Strengths and Weaknesses

My biggest concern about this paper is the introduction of global Diffusion.

I admit it helps improve the overall performance, but it is well known that an additional diffusion module will improve the image generation ability no matter what methods are used before it. The focus of this paper is supposed to be proving the effectiveness of hierarchical modeling, so my suggestion for Table 4 is to add a new row of "Pivots + MLP-based diffusion heads" to show the gain from hierarchical modeling only.

Other Comments or Suggestions

None.

Author Response

Q1: Add a new row of "Pivots + MLP-based diffusion heads" in Table 4

Appreciate this comment. As suggested, we conducted experiments including a new ablated run of "Pivots + MLP-based diffusion heads" in Table 4. This ablated run enables hierarchical modeling with shared MLP-based diffusion heads, without using any additional diffusion module. As shown in the table below, this new ablated run (the third row) outperforms MAR (the first row) in FID by 0.17, which clearly validates the effectiveness of the hierarchical modeling. We will add the discussion in the revision.

Pivots             | Diff. Head_1     | Diff. Head_2     | Scale vector | #Para. | FID
-                  | -                | MLP-based        | -            | 208M   | 2.31
visual tokens      | MLP-based        | MLP-based        | -            | 245M   | 2.28
conditional tokens | Shared MLP-based | Shared MLP-based | -            | 208M   | 2.14
conditional tokens | MLP-based        | MLP-based        | -            | 245M   | 2.07
conditional tokens | MLP-based        | Transformer      | -            | 239M   | 1.98
conditional tokens | Transformer      | Transformer      | -            | 233M   | 1.98
conditional tokens | MLP-based        | Transformer      | ✓            | 242M   | 1.93
Reviewer Comment

Thanks to the authors for providing additional experiments. After reading the rebuttal and other reviews, I keep the score.

Final Decision

The authors present Hierarchical Masked Autoregressive models (Hi-MAR) for image generation. Specifically, Hi-MAR first generates low-resolution tokens, which are used to guide the generation of high-resolution tokens. The authors also propose a Diffusion Transformer head to improve global context modeling. The effectiveness of the proposed method is demonstrated on ImageNet (class-conditional) and COCO (text-to-image).

In the initial reviews, the reviewers expressed several concerns:

  • Reviewer dCxF: Clarification of the training-inference discrepancy issue; behavior of speed and accuracy for longer visual sequences; more text-to-image benchmark evaluations; inclusion of related works (Fluid, Infinity, FlowAR, HART, FlexVAR, FractalAR); ablation on the effect of phase 1 resolution, and so on.

  • Reviewer SRgF: Limited novelty due to the incremental improvement over Muse, VAR, and HART; missing speed comparison with VAR and HART; ablation on different resolutions.

  • Reviewer kLHw: Detailed comparison with VAR; ablation on additional stages; effectiveness-efficiency trade-off

  • Reviewer S3ag: Additional ablation study for Tab. 4.

The provided rebuttal assuaged the reviewers' concerns to some degree, although Reviewers kLHw and dCxF remain borderline. After carefully considering the reviews, author rebuttal, discussions, and the draft, the AC agrees with the reviewers that the proposed method is somewhat incremental, but appreciates the authors' efforts. As a result, the AC recommends acceptance of the paper.

However, the authors are strongly encouraged to incorporate feedback from the reviews and discussion into the final version. For example, it is critical to clearly discuss the difference between the proposed method and existing works (e.g., the discussion/comparison with VAR/Muse/HART/FlowAR). As a side note, the AC observes that the proposed scale-aware Transformer shares similarities with the scale-wise Flow Matching module in FlowAR; a more detailed discussion on this point would be greatly appreciated.