OmniGen-AR: AutoRegressive Any-to-Image Generation

NeurIPS 2025 (Poster) · 4 reviewers
Overall score: 6.8/10 · Ratings: 4, 4, 4, 5 (min 4, max 5, std 0.4) · Confidence: 4.3
Novelty 2.5 · Quality 2.5 · Clarity 2.5 · Significance 2.5
Submitted: 2025-05-07 · Updated: 2025-10-29

Keywords: autoregressive visual generation

Reviews and Discussion

Official Review
Rating: 4

The authors propose OmniGen-AR, a unified autoregressive framework capable of generating images from various input modalities, including text prompts, spatial conditions (e.g., segmentation masks, depth maps), and visual contexts (e.g., image editing, frame prediction, video generation). OmniGen-AR utilizes a shared visual tokenizer for discretizing various visual inputs and incorporates a separate text tokenizer for text prompts. A key challenge addressed is the tendency of autoregressive models to overfit to input images in editing tasks, leading to poor instruction adherence. To address this issue, the authors propose Disentangled Causal Attention (DCA), an attention mechanism that probabilistically masks the input image during the training phase. This prevents the model from excessively relying on the input image, thereby enhancing its ability to follow instructions accurately.
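For readers who want a concrete picture of the setup described above, here is a minimal, hypothetical sketch of how one training sequence for such a unified AR model could be assembled. The tokenizer objects, the begin/end-of-image marker ids, and the helper name are illustrative assumptions, not the authors' code.

```python
import torch

def build_sequence(text_tokenizer, visual_tokenizer, prompt, cond_image, target_image,
                   boi_id=1, eoi_id=2):
    """Illustrative only: concatenate text tokens, discretized condition-image tokens,
    and discretized target-image tokens into one sequence for a single causal transformer."""
    text_ids = torch.tensor(text_tokenizer.encode(prompt), dtype=torch.long)
    cond_ids = visual_tokenizer.encode(cond_image).flatten().long()   # e.g. a 32x32 code map -> 1024 tokens
    tgt_ids = visual_tokenizer.encode(target_image).flatten().long()
    boi, eoi = torch.tensor([boi_id]), torch.tensor([eoi_id])         # assumed begin/end-of-image markers
    seq = torch.cat([text_ids, boi, cond_ids, eoi, boi, tgt_ids, eoi])
    cond_len = seq.numel() - (tgt_ids.numel() + 1)                    # positions before the target tokens
    return seq, cond_len
```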

Strengths and Weaknesses

Strengths

S1. The authors tackle an important problem in image generation by proposing a unified framework that flexibly accommodates diverse input modalities.

S2. The authors provide comprehensive experiments revealing that autoregressive image generation models tend to overfit to input images, leading to poor instruction adherence.

Weaknesses

W1. The paper lacks comparison with existing models. The paper lacks a thorough review and comparison with existing autoregressive image generation models. In particular, EditAR [1] also encodes multiple visual modalities using a unified visual tokenizer and employs an autoregressive transformer for conditional generation tasks, including diverse image-editing operations. The authors should cite this work and clarify the differences between OmniGen-AR and EditAR.

[1] Mu, Jiteng, Nuno Vasconcelos, and Xiaolong Wang. "EditAR: Unified Conditional Generation with Autoregressive Models." arXiv preprint arXiv:2501.04699 (2025).

W2. Experiments need to be improved. W2.1. The authors should include EditAR[1] as a direct competitor, as mentioned in W1, to support their claims on performance and generality. W2.2. Table 5 reports results only on text-to-video, image-editing, and mask-to-image tasks, omitting key tasks such as text-to-image, frame prediction, and depth-to-image. These omissions weaken the claim that DCA improves robustness across diverse generation tasks by reducing over-reliance on segmentation inputs. W2.3. Although the authors describe conceptual differences between DCA and classifier-free guidance [2], they do not provide any empirical comparison, making it difficult to assess the effectiveness of the proposed technique.

[2] Ho, Jonathan, and Tim Salimans. "Classifier-Free Diffusion Guidance." NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications.

W3. The experimental analysis seems incomprehensive. W3.1. In Table 4 (right), ControlAR [3], which is cited as the most relevant baseline, outperforms the proposed OmniGen-AR on both mask-to-image and depth-to-image tasks. However, the authors do not address or explain this discrepancy, making it unclear what advantages OmniGen-AR offers. W3.2. In Table 6, the authors attribute performance degradation in T2I and T2V tasks under joint training to the “lower visual quality of editing and spatial-conditioned datasets,” but providing actual sample outputs would help validate this claim. W3.3. In Table 5, the effect of varying DCA probabilities is reported, but the underlying trade-offs (e.g., why performance drops after 10%) are not analyzed.

[3] Li, Zongming, et al. "ControlAR: Controllable Image Generation with Autoregressive Models." arXiv preprint arXiv:2410.02705 (2024).

Questions

Please refer to W1, W2, and W3.

Limitations

Yes

Justification for Final Rating

I raised my rating from 3 to 4, as the rebuttal successfully addressed key concerns.

Resolved Issues:

  • Provided comparisons with existing models
  • Additional results on text-to-image, frame prediction, and depth-to-image
  • Empirical comparison between DCA and classifier-free guidance
  • Analysis of the discrepancy with ControlAR
  • Analysis of DCA probability effects.

Points that would further strengthen the paper:

  • Add direct experiments with EditAR beyond seg-to-img and depth-to-img, such as image editing and edge-to-image.
  • Provide a qualitative comparison among OmniGen-AR, EditAR, and ControlAR.

Formatting Concerns

N/A

Author Response

Q1: Comparison with existing models
A1: Thank you for pointing this out. We appreciate the reviewer bringing EditAR to our attention and agree that it is relevant work that should be cited and discussed. We will include a citation and clarify the distinctions in the revised version of the paper:

While EditAR and our method both employ autoregressive transformers and support multiple conditional image generation tasks, they differ in two fundamental aspects: 1) EditAR is specifically designed for image editing and low-level control tasks (e.g., depth-to-image, edge-to-image, segmentation-to-image), while OmniGen-AR is a unified Any-to-Image framework that handles a broader range of input modalities. 2) EditAR aims to improve text-image alignment by introducing a distillation loss, while our model prevents information leakage through DCA, a novel training-time attention mechanism that disentangles the condition and content attention paths.

Q2: More results on text-to-image, frame prediction, and depth-to-image
A2: Thanks! We will add more quantitative results in the revised paper:

DCA Prob. | T2I | VBench | FP | Emu-CT | Mask | Depth
0%  | 0.53 | 70.33 | 779 | 0.15 | 24.76 | 30.35
5%  | 0.54 | 74.55 | 682 | 0.17 | 25.78 | 31.57
10% | 0.55 | 74.72 | 613 | 0.20 | 25.33 | 32.94
20% | 0.52 | 73.28 | 610 | 0.21 | 25.16 | 31.64
30% | 0.54 | 71.69 | 628 | 0.19 | 21.49 | 29.31

Q3: Empirical comparison between DCA and classifier-free guidance
A3: Thanks for your great suggestion! The proposed DCA is fundamentally different from CFG, as it does not directly drop the conditions. Instead, it calibrates the attention between condition and content tokens to allow the model to better disentangle and leverage conditional information. Below we show an empirical comparison between them. As can be seen, compared to CFG, DCA leads to significantly better results on tasks like frame prediction, editing, and controllable generation.

Method | T2I | VBench | FP | Emu-CT | Mask | Depth
DCA | 0.55 | 74.72 | 613 | 0.20 | 25.33 | 32.94
CFG | 0.56 | 72.63 | 854 | 0.17 | 23.43 | 27.98
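For context, classifier-free guidance for discrete AR decoding is typically applied at sampling time by mixing conditional and unconditional logits. The sketch below is a generic illustration of that baseline; the guidance scale, the model call returning (batch, length, vocab) logits, and the way the unconditional context is built are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def cfg_next_token(model, cond_ids, uncond_ids, scale=3.0):
    """Generic CFG step for an AR model: run the model with and without the condition,
    then push the conditional logits away from the unconditional ones."""
    logits_c = model(cond_ids)[:, -1, :]      # logits given the full conditional context
    logits_u = model(uncond_ids)[:, -1, :]    # logits with the condition dropped/blanked
    logits = logits_u + scale * (logits_c - logits_u)
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```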

Q4: Analysis of the discrepancy with ControlAR
A4: We acknowledge that ControlAR achieves higher mIoU on both mask-to-image and depth-to-image tasks. However, it’s important to highlight that ControlAR is specifically designed for controllable generation and relies on an additional control encoder (initialized from DINOv2-S), which is pretrained for dense visual understanding. This encoder introduces strong spatial priors that benefit structured control tasks but are task-specific and less generalizable to other modalities.

In contrast, OmniGen-AR maintains a unified architecture without any task-specific heads or encoders, supporting a wider variety of inputs, including text, segmentation, depth, canny edges, image editing, and video frame prediction, all using a single causal transformer. While we may not outperform ControlAR on a few specific spatial tasks, we offer greater flexibility, simplicity, and generality, which are essential for Any-to-Image generation.

Q5: Visualization of editing and spatial-conditioned datasets
A5: Thanks! We will add this to our revised paper.

Q6: Analysis of the effects of DCA probabilities
A6: We thank the reviewer for raising this point. While DCA helps prevent condition-content leakage during training, applying it too frequently disrupts the model’s ability to fully leverage causal dependencies, especially when condition tokens carry rich information (e.g., dense segmentation or image context). In our experiments, 10% DCA strikes a balance: it encourages robust conditioning without significantly limiting the model’s access to informative context. We will update the paper to include more detailed analysis.

Comment

Thank you for your detailed responses addressing some of the key points raised in my initial review. While several concerns have been resolved, a few points remain unclear:

Q1-1. Q4. While the rebuttal appears reasonable at first glance, several key claims remain insufficiently supported by evidence. For example, please list input modalities supported by OmniGen-AR but not by EditAR. The following claim is not fully substantiated: “offer greater flexibility, simplicity, and generality, which are essential for Any-to-Image generation.”

Q1-2. As requested in W2.1, have you conducted direct performance comparisons with EditAR on overlapping tasks (e.g., image editing, depth-to-image, edge-to-image, segmentation-to-image)? Section 1 states that DCA not only prevents information leakage but also promotes more instruction-compliant predictions. Since both DCA in OmniGen-AR and the distillation loss in EditAR aim to enhance instruction compliance, such comparisons would clarify which model achieves stronger compliance.

Q2. The additional results you reported seem inconsistent with those in the paper. Specifically, the DCA 10% results for FP, Emu-CT, and Depth are 613, 0.20, and 32.94, whereas Table 4 reports 429, 0.23, and 37.42. Could you clarify the cause of this discrepancy?

Comment

Thanks a lot for your feedback! We are glad that our rebuttal "addressed some of the key points raised in your initial review". Below we respond to your additional questions.

Q1-1. Q4. While the rebuttal appears reasonable at first glance, several key claims remain insufficiently supported by evidence. For example, please list input modalities supported by OmniGen-AR but not by EditAR. The following claim is not fully substantiated: “offer greater flexibility, simplicity, and generality, which are essential for Any-to-Image generation.”

A1-1: Our original statement was: “While we may not outperform ControlAR on a few specific spatial tasks, we offer greater flexibility, simplicity, and generality, which are essential for Any-to-Image generation.” The statement was made in the context of comparing OmniGen-AR to ControlAR, not EditAR.

In the rebuttal, we have discussed the discrepancy with ControlAR (Q4) and EditAR (Q1). Compared to ControlAR, we additionally support text modality (for text-to-image generation) and visual context (for image editing and video generation).

Q1-2. As requested in W2.1, have you conducted direct performance comparisons with EditAR on overlapping tasks (e.g., image editing, depth-to-image, edge-to-image, segmentation-to-image)? Section 1 states that DCA not only prevents information leakage but also promotes more instruction-compliant predictions. Since both DCA in OmniGen-AR and the distillation loss in EditAR aim to enhance instruction compliance, such comparisons would clarify which model achieves stronger compliance.

A1-2: Thanks for raising this. We completely agree that both DCA in OmniGen-AR and the distillation loss in EditAR are designed to enhance instruction compliance, and that a direct comparison helps clarify the effectiveness of these approaches.

Below we compare the depth-to-image (on MultiGen) and segmentation-to-image (on COCOStuff) performance of both models. Since EditAR only reported the segmentation-to-image result on COCOStuff, we evaluate our model on the same benchmark to ensure fairness. The results demonstrate that even with fewer parameters, our model outperforms EditAR on both tasks, indicating that DCA is an effective and lightweight mechanism for improving conditional alignment in spatially grounded tasks, without requiring additional pretrained distillation targets.

Method | Mask (mIoU↑) | Depth (RMSE↓)
EditAR (initialized from LlamaGen-0.8B) | 22.62 | 34.93
Ours-0.5B | 23.73 | 32.94

We will include this comparison in the revised version of the paper.
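As background on the metrics in the table, mask mIoU is usually computed by running an off-the-shelf segmenter on the generated image and comparing its prediction against the conditioning mask, and depth RMSE compares a re-estimated depth map against the conditioning depth. A generic sketch of the two metrics (not the authors' evaluation pipeline) is:

```python
import torch

def mean_iou(pred_seg, gt_seg, num_classes):
    """mIoU between a predicted segmentation of the generated image and the input mask."""
    ious = []
    for c in range(num_classes):
        inter = ((pred_seg == c) & (gt_seg == c)).sum().float()
        union = ((pred_seg == c) | (gt_seg == c)).sum().float()
        if union > 0:
            ious.append(inter / union)
    return torch.stack(ious).mean()

def depth_rmse(pred_depth, gt_depth):
    """RMSE between a re-estimated depth map of the generated image and the input depth."""
    return torch.sqrt(((pred_depth - gt_depth) ** 2).mean())
```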

Q2. The additional results you reported seem inconsistent with those in the paper. Specifically, the DCA 10% results for FP, Emu-CT, and Depth are 613, 0.20, and 32.94, whereas Table 4 reports 429, 0.23, and 37.42. Could you clarify the cause of this discrepancy?

A2: As we mentioned in the paper, the reported results, 429, 0.23, and 37.42, are from the 1.5B model, while 613, 0.20, and 32.94 are from the 0.5B model.

Here we would also like to discuss the reason why our 1.5B model performs worse than the 0.5B model on the depth-to-image generation task. We hypothesize this may be due to the use of a large amount of depth-to-image training data, combined with the nature of depth signals, which provide dense low-level geometric constraints. Such signals can lead larger models to overfit or learn spurious patterns. The smaller 0.5B model, with its limited capacity, may capture the direct mapping from depth to image more stably under the current setup.

Comment

Thank you for your response. My concerns have been addressed.

Comment

Thank you very much for your response. We’re pleased to hear that your concerns have been addressed. We appreciate your time and thoughtful review, and hope our clarifications could contribute positively to your evaluation.

Official Review
Rating: 4

The paper introduces a unified autoregressive model for any-to-image generation. The model takes as input a discrete sequence of text and a condition image (depth/semantic map) and then generates a sequence of discrete visual tokens. In addition, the model is capable of generating videos from text. The authors also propose Disentangled Causal Attention to mitigate information leakage.

Strengths and Weaknesses

Strengths

  • The paper is well written and easy to follow.
  • The experiments and evaluation are fairly extensive.

Weaknesses:

  • The authors claim the introduced model is the first AR model for Any-to-Image generation, which is incorrect. There have been several attempts to train autoregressive image generation models that take both text and images as input [1,2,3].
  • The authors claim to achieve state-of-the-art results on benchmarks, which does not seem to be true. For instance, their model achieves (substantially) lower performance than several models of the same size, including AR [4], diffusion [5], and hybrid [6] models.
  • The novelty of the paper is limited. The authors simply extend conventional next-token prediction on latents from a VQ-VAE. Thus, I do not see how the model can tackle the inherent problems of previous AR/next-token image generation models, e.g., slow inference speed and unidirectional generation. The main novelty lies in DCA, which is borderline in my view.
  • For me, the main purpose of using AR image generation over diffusion or VAR is the simplicity of implementing joint training with language modeling tasks. This may enable several useful properties such as in-context learning and complex language instructions. The paper does not demonstrate any convincing reason for using AR. In fact, the performance and speed seem to be worse than diffusion/flow matching.

[1] Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

[2] Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

[3] Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining

[4] Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

[5] SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers

[6] Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

Questions

  • The paper mentions "adopt Qwen2.5 as the text tokenizer and transformer model". Does that mean the authors fine-tune a pretrained Qwen2.5 for image generation, or initialize a model with an architecture similar to Qwen2.5 and train it from scratch? If the authors fine-tuned a pretrained model, how does it perform on natural language understanding tasks after "fine-tuning"? If the model is trained from scratch, what is the reasoning for not using pretrained weights?
  • What is the inference speed of the introduced model? How does it compare to AR and diffusion/flow-matching models of the same scale?

Limitations

Yes

Justification for Final Rating

The authors addressed most of my concerns in the rebuttal, so I slightly increased my score.

Formatting Concerns

No

Author Response

Q1: The claim of "the first AR for Any-to-Image generation"
A1: Thank you for the valuable comment. We agree that there have been prior autoregressive models that support multiple input modalities, including text and image. However, our goal is to develop a unified autoregressive visual generation model that explicitly accommodates three types of conditions (text, spatial, and visual context) in a single framework. While models like Unified-IO and Unified-IO 2 indeed support text and spatial inputs, they do not explicitly unify visual context as a generative condition. Lumina-mGPT is trained with diverse inputs, but to the best of our knowledge, it only reports performance on text-to-image generation, without demonstrating generalization to the broader range of input conditions explored in our work. We will revise our claim in the paper to better reflect this nuance.

Q2: The claim of "achieving state-of-the-art on benchmarks"
A2: Thank you for your comment. We will revise the paper to more accurately characterize our results in comparison to prior work. Below we would like to provide some important context regarding the comparisons mentioned: 1) Diffusion models like [5] typically rely on additional text encoders that are not counted in the model size. For example, [5] uses a strong 2B-parameter text encoder (Gemma-2B), which is non-negligible compared to the diffusion model itself. In contrast, our model uses a single causal transformer without external text encoders, keeping the architecture clean and parameter-efficient. 2) The AR model [4] and hybrid model [6] are trained on much more data than our model. Specifically, our model is trained on approximately 56M text-to-image pairs and 13M examples for image editing and control generation. In contrast, [4] reports using around 140M samples, while [6] uses roughly 2B examples. Despite these significant parameter and data gaps, our model achieves competitive performance across a range of benchmarks, especially given our unified support for diverse conditional generation tasks.

Q3: Novelty of this work
A3: Thank you for the thoughtful comments. We would like to clarify the novelty and motivation of our work: we do not seek to alter the autoregressive modeling paradigm, but to build a unified AR framework that accommodates diverse conditional inputs. To alleviate the information leakage issue from condition tokens to content tokens in causal modeling, we introduce Disentangled Causal Attention (DCA), a training-time regularization scheme that carefully preserves the causal nature of AR generation while enabling the model to learn condition-aware generation without overfitting or shortcutting. Though lightweight by design, DCA proves effective in improving the instruction-following capability of our model, as evidenced by our ablation studies.

Q4: The advantage of using AR for image generation
A4: Thanks for pointing this out. Our choice of an autoregressive (AR) model is motivated by its ability to flexibly accommodate diverse input conditions within a single causal transformer, in addition to the advantage you noted. There is currently no clear consensus in the community regarding which family of generative models (AR, diffusion, or VAR) is fundamentally superior for visual generation. Each has its own trade-offs in terms of quality, extensibility, and inference speed. We believe AR models remain an important and under-explored direction, especially in the context of multi-conditional visual generation, and our work aims to push this frontier forward.

Q5: Qwen initialization
A5: Thank you for the question. We fine-tune a pretrained Qwen2.5 model for image generation tasks. Specifically, we adopt Qwen2.5 both as the text tokenizer and as the initialization for the transformer decoder. During training, we compute the autoregressive loss on the entire sequence, including both the text tokens (input prompts) and the visual tokens (generated image latents).

Method | GenEval | MMLU
Qwen2.5-0.5B | - | 47.5
Pretrain w/o init | 0.29 | -
Pretrain w/ init | 0.31 | 12.7
  • Performance with and without Qwen initialization: we conducted experiments using both Qwen-initialized and randomly initialized models and found that both setups can achieve similar performance on text-to-image (T2I) benchmarks after convergence (0.29 vs 0.31 on GenEval after 512 resolution pretraining). However, Qwen initialization leads to more stable training in the early stages, with lower loss.
  • Natural language understanding performance after T2I fine-tuning: after fine-tuning Qwen2.5 on T2I data, we observe a significant drop in language understanding performance, with MMLU accuracy decreasing from 47.5 to 12.7. This degradation is expected as we do not include any text-only data during fine-tuning. Although we compute loss on text tokens, the learning signal is relatively simple and repetitive, and lacks the diversity and complexity of language modeling tasks. We believe that the text understanding capability can be better preserved by mixing in text-only data for joint training, and we leave a systematic study of its impact on both language and T2I performance for future work.
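A minimal sketch of the training objective described above, i.e., next-token cross-entropy over the concatenated text and visual tokens. The function and variable names are illustrative assumptions, and the model is assumed to return logits of shape (batch, length, vocab); this is not the authors' code.

```python
import torch.nn.functional as F

def full_sequence_ar_loss(model, input_ids, pad_id=-100):
    """Next-token prediction over the whole sequence, so both the text prompt
    and the visual tokens contribute to the training signal."""
    logits = model(input_ids)                      # (B, L, V)
    shift_logits = logits[:, :-1, :].contiguous()  # predict token t+1 from tokens <= t
    shift_labels = input_ids[:, 1:].contiguous()
    return F.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)),
                           shift_labels.view(-1), ignore_index=pad_id)
```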

Q6: Inference speed
A6: Thank you for the question. We evaluate the inference speed of our model alongside other models of comparable scale. Since Janus only supports 384×384 resolution, we report its speed based on 24×24 tokens. As shown below, benefiting from KV caching, our model is clearly faster than Show-o. While our generation speed is still slower than SANA, deploying with vLLM could significantly narrow the gap.

Method | # of Tokens | Speed (Sec/Image)
Janus (1.5B) | 24×24 | 11.92
Show-o (1.3B) | 32×32 | 269.10
SANA (1.6B) | 32×32 | 1.77
Ours (1.5B) | 32×32 | 43.55
Ours + vLLM (1.5B) | 32×32 | 5.40
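To illustrate why KV caching matters for these numbers: once the keys/values of earlier positions are cached, each of the 32×32 = 1024 image tokens needs only a forward pass over the single newest token. The sketch below assumes a HuggingFace-style causal-LM interface (`use_cache`, `past_key_values`, `.logits`) and greedy decoding; it is not the authors' inference code.

```python
import torch

@torch.no_grad()
def generate_image_tokens(model, prompt_ids, num_tokens=1024):
    """Greedy AR decoding with a key/value cache: the prompt is encoded once,
    then each step only feeds the most recent token."""
    out = model(prompt_ids, use_cache=True)             # assumed HF-style signature
    past, token = out.past_key_values, out.logits[:, -1:].argmax(-1)
    generated = [token]
    for _ in range(num_tokens - 1):
        out = model(token, past_key_values=past, use_cache=True)
        past, token = out.past_key_values, out.logits[:, -1:].argmax(-1)
        generated.append(token)
    return torch.cat(generated, dim=1)                  # (B, num_tokens) discrete image tokens
```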
Comment

Dear reviewer, thank you again for your insightful comments on our paper, and we genuinely hope that our response could address your concerns. As the discussion is about to end, we are sincerely looking forward to your feedback. Please feel free to contact us if you have any further inquiries.

Comment

Hi Reviewer EJMS, since the reviewer-author discussion window will close soon, could you please read the rebuttal as soon as possible so that we can address any remaining concerns you may have? Thank you for your effort!

Comment

Thanks for the detailed rebuttal; most of my concerns are addressed. I'll adjust my score accordingly. A minor detail, but it would be clearer if the authors could also report the GPU model and whether tensor parallelism is used for the inference speed table.

Comment

Thanks for your reply! We evaluated all the models on an NVIDIA A100 GPU, and tensor parallelism was not used.

Official Review
Rating: 4

This paper presents OmniGen-AR, a unified autoregressive framework for any-to-image generation that can handle diverse conditional inputs including text prompts, segmentation masks, depth maps, and reference images within a single model. The key technical innovation is the use of a shared visual tokenizer for different visual conditions, and a Disentangled Causal Attention (DCA) mechanism that aims to prevent information leakage between condition tokens and content tokens during training, improving instruction-following behavior. OmniGen-AR is evaluated on a wide variety of tasks (text-to-image, text-to-video, image editing, frame prediction, depth-to-image, segmentation-to-image) and demonstrates strong performance across established benchmarks, even surpassing diffusion-based competitors with comparable or fewer parameters.

Strengths and Weaknesses

Strengths:

  1. The paper shows a solid empirical evaluation over many tasks, with quantitative comparisons to both AR and diffusion baselines, as well as reasonable ablation studies.
  2. The writing is clear, with well-motivated design choices (e.g., why DCA is needed), and reasonable diagrams.
  3. The work demonstrates that autoregressive models can be scaled to unified, multi-conditional image and video generation, which is significant given most recent focus on diffusion models.
  4. While unifying multiple conditions is not a new direction, applying a shared visual tokenizer with disentangled causal attention within an AR framework is novel and technically interesting.

Weaknesses:

  1. The empirical evaluation, although extensive, still shows weaker results on some metrics (e.g., segmentation-to-image mIoU is worse than ControlAR/OmniGen) and does not fully analyze why.
  2. The treatment of failure cases is rather brief; the analysis in Section 4.4 is minimal, leaving unclear how to address the observed editing failures.
  3. The DCA mechanism is an elegant solution, but its rationale is somewhat ad hoc, relying on intuition rather than rigorous theoretical justification.
  4. The paper lacks deeper discussion on training cost, sampling latency, or practical deployment of a large AR model for real-time tasks, which would help judge its feasibility in practice.

Questions

  1. Can you provide a more formal or theoretical justification for DCA beyond the intuition of “blocking leakage”? For example, how does it relate to causal language modeling guarantees or to other attention regularizers?
  2. In Section 4.4, you mention failures in removing objects or correctly following edits. Could you analyze why these errors happen — is it data sparsity, model capacity, or the autoregressive formulation? This would help readers understand the model’s true limitations.
  3. AR models are generally slower than diffusion models for image synthesis due to token-by-token decoding. Could you quantify the latency or token generation speed in practice, compared to diffusion models? This would be important for practitioners considering adoption.
  4. The results on segmentation-conditioned tasks are slightly worse than ControlAR or OmniGen. Do you have insights on how your approach might be adapted to improve geometric consistency in such spatially precise tasks?
  5. The paper mentions failure cases, but are there potential misuse or bias issues with a general-purpose any-to-image generator? Please consider adding more discussion on this front.

Limitations

Yes, it's discussed.

Justification for Final Rating

I thank the authors for the reply, and part of my concerns are resolved, hence I raise my rating accordingly. However, I agree with the other reviewers' comments that the quality and novelty of the paper could still be improved.

Formatting Concerns

N/A

Author Response

Q1: Weaker results on some metrics and further analysis
A1: Thanks for pointing this out! We acknowledge that our model shows weaker performance on segmentation-to-image in terms of mIoU compared to ControlAR and OmniGen, and will add more analysis in the revised paper.

  • For ControlAR, it leverages a dedicated control encoder initialized from DINOv2-S, which provides strong representations tailored for visual understanding. In contrast, our model uses a unified causal transformer without condition-specific encoders, aiming for architectural simplicity and generality across a wide range of tasks—not only spatial control but also text, video, and editing-based generation. Moreover, ControlAR is limited to controllable generation tasks, while our framework supports a broader set of inputs, including text-to-image, image editing, depth, segmentation, and even text-to-video.
  • For OmniGen, the performance advantage stems in large part from the scale and diversity of its training data. OmniGen is trained on a massive self-constructed dataset, covering a wide spectrum of tasks including text-to-image (~80M), image editing, human motion, virtual try-on, style transfer, subject-driven generation (e.g., GRIT-Entity, Web Images), and structured control generation (e.g., LAION with depth, segmentation, pose, and canny annotations, RefCOCO, ADE20k, and ReasonSeg). In comparison, our model is trained on a more modest dataset: ~56M T2I samples and ~13M for editing and control generation, using only publicly available sources. Despite these differences, our model still achieves strong and competitive results across most benchmarks, while maintaining a clean, unified autoregressive architecture that handles diverse input modalities without task-specific heads or encoders. We believe this highlights the scalability and generality of our approach.

Q2: The analysis in Section 4.4 is minimal, leaving unclear how to address observed editing failures
A2: Thank you for pointing this out. We will expand Section 4.4 in the revised version to include a more detailed analysis of failure cases: As shown in Figure 8, failure cases can be broadly categorized into two types: 1) Instruction-following capability. For instance, in the first row of Figure 8, the instruction is “Remove the bag on the bench next to the person sitting at the bus stop”, but the model removes the person instead. This indicates a failure in grounding fine-grained spatial and referential cues from language into visual modifications. 2) Low-quality generations under sparse control signals. Examples in the second row (depth-to-image and segmentation-to-image) show blurry or structurally inconsistent results, which likely stem from noisy supervision and sparse training coverage for these conditions. These failure modes suggest two potential directions for future work: 1) Scaling up the model and training data to build a stronger base model with improved generalization and instruction-following ability across diverse visual tasks. 2) Leveraging chain-of-thought (CoT) to improve the model’s reasoning ability on complex prompts.

Q3: The rationale of DCA mechanism
A3: Thank you for your comment. Disentangled Causal Attention (DCA) is designed to address a concrete and critical issue in conditional autoregressive generation: the risk of information leakage from condition tokens to content tokens during training. In standard causal attention, the model may inadvertently learn to exploit trivial correlations or positional cues between condition and target tokens, rather than learning meaningful conditional generation.

DCA introduces a principled modification to the attention mask, splitting it into condition and content causal regions. This separation enforces a clearer dependency structure, encouraging the model to learn robust alignment between inputs and outputs without relying on token overlap or order-based shortcuts. While DCA is simple in form, it is grounded in a clear rationale: to preserve the autoregressive training signal while regularizing the attention pathway in a way that reflects the asymmetric role of condition versus content tokens. Its effectiveness is consistently validated across multiple conditional tasks, as shown in our ablation results, and it plays a key role in enabling our model to generalize across diverse inputs.
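A minimal sketch of one way such a disentangled mask could be constructed, based on our reading of the description above; the probability p, the per-sample application, and the choice to block only the content-to-condition path are assumptions rather than the authors' exact design.

```python
import torch

def disentangled_causal_mask(cond_len, content_len, batch_size, p=0.1):
    """Start from a standard causal mask over [condition | content] tokens and,
    with probability p per sample, cut the attention path from content positions
    back to condition positions, so content tokens must rely on their own context."""
    L = cond_len + content_len
    mask = torch.tril(torch.ones(L, L, dtype=torch.bool)).expand(batch_size, L, L).clone()
    drop = torch.rand(batch_size) < p            # which samples get the disentangled mask
    mask[drop, cond_len:, :cond_len] = False     # block content -> condition attention
    return mask                                  # True = attention allowed
```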

Q4: Discussion on training cost, sampling latency, or practical deployment of a large AR model
A4: Thanks for your valuable suggestions! We will include the discussion on training cost, sampling latency, and practical deployment in our revised paper:

  • Training cost: We train our model on 64 A100 GPUs. The 512-resolution pretraining of the 0.5B/1.5B models took 6/10 days respectively, while the 1024-resolution SFT stage took 4/6 days.
  • Sampling latency and practical deployment: below we show that the AR visual generative model is compatible with the vLLM framework and deploying with it could significantly improve the inference speed.
Method | # of Tokens | Speed (Sec/Image)
SANA (1.6B) | 32×32 | 1.77
Ours (1.5B) | 32×32 | 43.55
Ours + vLLM (1.5B) | 32×32 | 5.40
Comment

Dear reviewer, thank you again for your insightful comments on our paper, and we genuinely hope that our response could address your concerns. As the discussion is about to end, we are sincerely looking forward to your feedback. Please feel free to contact us if you have any further inquiries.

Comment

I thank the authors for the reply, and part of my concerns are resolved, hence I raise my rating accordingly.

Comment

Thanks a lot for your positive feedback and raising your rating accordingly!

Official Review
Rating: 5

This paper introduces OmniGen-AR, the first unified autoregressive (AR) model that can generate images from a wide variety of inputs like text, segmentation maps, or video frames, all within a single framework. Its success comes from two key ideas: a shared tokenizer that turns all visual inputs into a common language, and a special training technique called Disentangled Causal Attention (DCA) that stops the model from simply copying the input. The model achieves state-of-the-art results on numerous tasks, proving that AR models are a powerful alternative to diffusion models for universal image generation.

Strengths and Weaknesses

Strengths:

  1. It is the first work to systematically develop and validate a unified autoregressive framework for such a broad range of conditional image generation and understanding tasks.
  2. The authors astutely identify "information leakage" as a novel and crucial challenge in a unified AR setting. The proposed Disentangled Causal Attention (DCA) is an elegant and effective solution.

Weaknesses:

  1. The ablation study (Table 6) commendably reports that multi-task training, while beneficial for editing and spatial control, slightly degrades performance on the core text-to-image task. This indicates the presence of negative transfer or task interference, an important issue that is not fully resolved. It suggests that simply mixing all tasks together may not be the optimal strategy and that more advanced multi-task learning techniques might be necessary.
  2. There are still some meaningful ablation experiments that could be conducted.

Questions

  1. What is the total amount of training data used by the model? A brief comparison with other methods is needed.
  2. Is it too straightforward to mask out the condition? Is it possible to use loss constraints in a way that prevents information leakage?
  3. Are there more quantitative experimental comparisons between the 1.5B model and the 0.5B model?

Limitations

See Questions

Formatting Concerns

N/A

Author Response

Q1: Total amount of training data and a brief comparison with other methods
A1: Thanks for your suggestion. Below we compare the training data with several baselines:

  • LLamaGen: T2I: Laion (50M) + Internal (10M); T2V: -; Editing: -; ControlGen: -
  • OmniGen: T2I: DataComp (56M) + SA (11M) + Laion (4M) + ShareGPT4v (1.26M) + ALLaVA-4V (1M) + DOCCI (15) + DenseFusion (1M) + JourneyDB (4M) + Internal (16M); T2V: -; Editing: MagicBrush + Instruct-Pix2Pix + SEED-Edit + SomethingSomething + HR-VITON + FashionTryon + StyleBooth; ControlGen: MultiGen
  • Unified-IO 2: T2I: LAION-400M + CC3M + CC12M + RedCaps + OBELICS; T2V: YT-Temporal-1B + ACAV100M + AudioSet + WebVid-10M + HDVILA-10M + Ego4D; Editing: -; ControlGen: -
  • EditAR: T2I: initialized from LLamaGen; T2V: -; Editing: SEED-Data-Edit-Unsplash (1.5M) + PIPE (1.8M); ControlGen: COCOStuff + MultiGen
  • ControlAR: T2I: initialized from LLamaGen; T2V: -; Editing: -; ControlGen: LAION-Aesthetics, ImageNet + ADE20K + COCOStuff + MultiGen
  • Janus-Pro (1B): T2I: Laion (12M) + ImageNet (1M) + SA (11M) + OpenImages (8M) + Megalith (8M) + YFCC (15M) + JourneyDB (4M) + Dalle3-high-quality-captions (1M) + PixelProse (16M) + Internal (72M); T2V: -; Editing: -; ControlGen: -
  • Show-O (1.3B): T2I: ImageNet (1.28M) + CC12M + SA (11M) + Laion (35M) + DataComp + COYO (2B) + JourneyDB (4M); T2V: -; Editing: -; ControlGen: -
  • Ours: T2I: CC3M + CC12M + OpenImages (8M) + SA (11M) + Megalith (8M) + JourneyDB (4M) + Dalle3-high-quality-captions (1M) + Internal (10M); T2V: Panda + HD-VILA (9M) + OpenSora pexels (45k) + OpenVid-1M + Internal (0.5M); Editing: MagicBrush + Instruct-Pix2Pix + SEED-Edit; ControlGen: MultiGen

Q2: Ablation studies on using loss constraints to prevent information leakage
A2: Great suggestion! We conducted experiments to apply loss reweighting between condition and content tokens—specifically, assigning a lower weight (0.1) to condition tokens and a standard weight (1.0) to content tokens during training. The results below show that the proposed DCA outperforms loss reweighting on all benchmarks by clear margins.

Method | VBench | Emu-CT | Mask
baseline | 70.33 | 0.15 | 24.76
DCA | 74.72 | 0.20 | 25.33
Loss reweighting | 72.54 | 0.14 | 24.94
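For reference, the loss-reweighting baseline in the table can be sketched as a per-token weighted cross-entropy in which condition-token positions receive weight 0.1 and content-token positions weight 1.0. The names and shapes below are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def reweighted_ar_loss(logits, labels, cond_mask, cond_weight=0.1):
    """Per-token cross-entropy where condition-token positions get a smaller weight
    (0.1) than content-token positions (1.0), as in the ablation above."""
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1),
                           reduction="none").view_as(labels)
    weights = torch.where(cond_mask, torch.full_like(loss, cond_weight),
                          torch.ones_like(loss))
    return (loss * weights).sum() / weights.sum()
```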

Q3: Quantitative experimental comparisons between the 1.5B model and the 0.5B model
A3: Below we report more quantitative comparisons between the 0.5B and 1.5B models; the 1.5B model consistently achieves better results than the 0.5B model.

Model Size | FramePred | Emu-CT | Mask
0.5B | 613 | 0.20 | 25.33
1.5B | 429 | 0.23 | 35.28
Final Decision

This paper proposes OmniGen-AR, a unified AR framework for any-to-image generation that can take as input diverse conditions including text prompts, segmentation masks, depth maps, and reference images within a single model. The key technical innovation is the use of a shared visual tokenizer for different visual conditions, and a Disentangled Causal Attention (DCA) mechanism to prevent information leakage between condition tokens and content tokens during training, improving instruction-following behavior.

The strengths of the work include clear motivation (a unified AR model with any input), moderate novelty (a shared visual tokenizer with DCA), an elegant solution (DCA), solid empirical evaluation, and clear writing. Some negative aspects include weaker results on some metrics, some unaddressed failure cases, lack of theoretical justification, and lack of computational cost analysis, most of which were resolved by the author rebuttal. Therefore, all the reviewers are ultimately satisfied with the work and give Accept (x1) and Borderline Accept (x3). Based on the above analysis, the AC thinks the paper deserves an Accept.