Speculative Jacobi-Denoising Decoding for Accelerating Autoregressive Text-to-image Generation
Abstract
Reviews and Discussion
This paper proposes Speculative Jacobi-Denoising Decoding (SJD2) to enable controlled refinement in SJD. A novel fine-tuning strategy based on next-clean-token prediction is designed to predict the clean token at the next position from noised tokens as input. After verification, a prefix of the tokens is accepted and the remaining tokens are refined. The experiments show that the proposed method largely accelerates inference while maintaining generation quality.
Strengths and Weaknesses
Strengths:
- The proposed SJD2 is novel and shows impressive acceleration and maintains the generation quality.
- The proposed fine-tuning strategy enables low-cost fine-tuning for the consecutive acceleration method.
Weaknesses:
- The description of the method is dense and requires prior familiarity with Jacobi decoding, speculative decoding, and diffusion models. This may hinder readability for readers who are not familiar with some of these topics.
- While the denoising process is justified by analogy to diffusion models, the behavior of denoising in discrete token space is not fully analyzed. More investigation of the claim that diffusion models improve the refinement, e.g., showing how the output tokens converge, would be helpful.
Questions
- Is the proposed SJD2 able to perform batch inference?
- What's the specific choice of and in Equation (3)?
Limitations
Yes
Final Justification
The authors' clever rebuttal addressed my main concerns, and I am inclined to keep my positive rating.
Formatting Issues
No
Response to Reviewer 3CPQ
Thank you for your thoughtful and constructive feedback on our paper. We are particularly grateful for your recognition of our novelty and high performance.
1. Enhanced preliminaries on speculative decoding and Jacobi decoding
Thank you for your suggestion. To improve clarity and readability, we improve the preliminary section and add more background details related to our core design. Specifically, we refine the descriptions of speculative decoding and Jacobi decoding:

(A) Speculative decoding: Speculative decoding employs a small model to accelerate the sequential generation of a large autoregressive model. This small model is trained on the same domain as the large model and is small enough for fast generation. In each inference step, the small model first generates a draft sequence with its own inference paradigm. The large model then verifies this sequence in a single forward pass, selecting a prefix to serve as part of the final output. This verification ensures that each sampled token adheres to the probability distribution parameterized by the large model. Since multiple tokens can be accepted into the final output with only one forward pass of the large model, acceleration is achieved.

(B) Jacobi decoding: Jacobi decoding treats autoregressive inference as solving for the fixed point of a nonlinear triangular system of equations. This decoding algorithm iteratively performs multi-token decoding and can be executed without fine-tuning or auxiliary modules. The specific process of Jacobi decoding is as follows: first, given the previously pre-filled or decoded tokens, we randomly initialize a sequence of candidate tokens. Then, in each iteration, we execute one forward pass of the autoregressive model for all the candidate tokens with a causal mask. The predicted probabilities are then used to sample tokens, typically via greedy sampling, and these sampled tokens are taken as the inputs of the next iteration. This process can be formulated as $x_i^{(j+1)} = \arg\max_{x} \, p_\theta\big(x \mid x_{<i}^{(j)}\big)$, where $i$ denotes the token index and $j$ denotes the iteration index. The Jacobi decoding process iterates until convergence is reached, determined by the deterministic criterion that the tokens remain unchanged between consecutive iterations.
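Below is a minimal sketch of the Jacobi decoding loop described above, not the authors' implementation. `model` is assumed to expose a `vocab_size` attribute and to return next-token logits for every input position in one causal forward pass; greedy sampling is used, as in the description.

```python
import torch

@torch.no_grad()
def jacobi_decode(model, prefix_ids, num_candidates, max_iters=64):
    # Randomly initialize the candidate tokens that follow the prefix.
    cand = torch.randint(0, model.vocab_size, (1, num_candidates))
    for _ in range(max_iters):
        seq = torch.cat([prefix_ids, cand], dim=1)
        logits = model(seq)                               # (1, L, V), causal mask inside the model
        # The logits at position i-1 predict token i, so slice the window
        # that corresponds to the candidate positions.
        start = prefix_ids.shape[1] - 1
        new_cand = logits[:, start:-1, :].argmax(dim=-1)  # greedy sampling
        if torch.equal(new_cand, cand):                   # fixed point reached: convergence
            return new_cand
        cand = new_cand                                   # inputs of the next iteration
    return cand
```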
2. Investigations on the refinement
Thank you for your suggestion. Since we cannot include links in this rebuttal, we use a pair of tables to show how the token categories change at each sampling step and token position; each table thus contains an example of a token trajectory. In SJD2 (each column of Table A), over 25 sampling steps, the predicted tokens start out identical (e.g., token value 167282). As the noise level decreases, the token categories become diverse, but some token categories begin to repeat, which means the trajectory stabilizes. In contrast, for SJD (the columns of Table B), the token categories change irregularly, appearing to oscillate and remain unstable. We will provide more detailed visualizations using figures in the camera-ready version.
Table A: The discrete token trajectory in SJD2 with Emu3 as baseline:
| Sampling Steps \ Token index | Token 1 | Token 2 | Token 3 | Token 4 | Token 5 |
|---|---|---|---|---|---|
| Step 1 | 167282 | 167282 | 167282 | 167282 | 167282 |
| Step 2 | 167282 | 167282 | 167282 | 167282 | 167282 |
| Step 3 | 167282 | 167282 | 167282 | 167282 | 167282 |
| Step 4 | 167282 | 167282 | 167282 | 167282 | 167282 |
| Step 5 | 167728 | 167728 | 167728 | 167728 | 167728 |
| Step 6 | 167728 | 159459 | 159459 | 167728 | 159459 |
| Step 7 | 167282 | 167282 | 167282 | 167282 | 167282 |
| Step 8 | 167282 | 167282 | 163124 | 167282 | 152950 |
| Step 9 | 154915 | 153542 | 163124 | 156927 | 160877 |
| Step 10 | 155499 | 158046 | 158230 | 153542 | 152491 |
| Step 11 | 154578 | 157671 | 152153 | 153840 | 153399 |
| Step 12 | 154794 | 155092 | 152611 | 154282 | 153088 |
| Step 13 | 153692 | 153700 | 158037 | 156095 | 154951 |
| Step 14 | 152784 | 160040 | 156856 | 152551 | 155736 |
| Step 15 | 152388 | 153700 | 155986 | 152395 | 151927 |
| Step 16 | 158046 | 160040 | 165298 | 159796 | 151980 |
| Step 17 | 152388 | 152357 | 153737 | 155415 | 153182 |
| Step 18 | 158046 | 160040 | 155986 | 152006 | 153182 |
| Step 19 | 152388 | 152357 | 155986 | 152006 | 152268 |
| Step 20 | 154500 | 155680 | 153737 | 155493 | 152268 |
| Step 21 | 152388 | 152357 | 155986 | 154177 | 152268 |
| Step 22 | 158046 | 155680 | 155986 | 152006 | 152268 |
| Step 23 | 152388 | 152357 | 155986 | 152006 | 152268 |
| Step 24 | 152388 | 155680 | 153216 | 152006 | 152519 |
| Step 25 | 152388 | 155680 | 153951 | 157603 | 152268 |
Table B: The discrete token trajectory in SJD with Emu3 as baseline:
| Sampling Steps \ Token index | Token 1 | Token 2 | Token 3 | Token 4 | Token 5 |
|---|---|---|---|---|---|
| Step 1 | 178678 | 154122 | 180405 | 156118 | 153692 |
| Step 2 | 165560 | 164685 | 162805 | 167164 | 155025 |
| Step 3 | 166380 | 154949 | 164287 | 166096 | 157596 |
| Step 4 | 156660 | 159530 | 152575 | 152503 | 166994 |
| Step 5 | 158407 | 151855 | 153356 | 157268 | 165140 |
| Step 6 | 160136 | 152594 | 160547 | 153105 | 152290 |
| Step 7 | 162159 | 156850 | 156404 | 159634 | 155860 |
| Step 8 | 165406 | 154488 | 158930 | 159965 | 153749 |
| Step 9 | 164959 | 151927 | 160406 | 165482 | 154677 |
| Step 10 | 154998 | 155867 | 166882 | 159530 | 166918 |
| Step 11 | 155948 | 159306 | 153403 | 156096 | 162431 |
| Step 12 | 160407 | 152980 | 155008 | 157592 | 152012 |
| Step 13 | 160393 | 156175 | 153776 | 156225 | 165458 |
| Step 14 | 155559 | 157817 | 164655 | 166620 | 159092 |
| Step 15 | 161548 | 153057 | 164198 | 155867 | 167001 |
| Step 16 | 152190 | 155639 | 164741 | 165587 | 151964 |
| Step 17 | 155733 | 163798 | 153648 | 163956 | 160777 |
| Step 18 | 152029 | 156397 | 154102 | 153091 | 155513 |
| Step 19 | 165985 | 152276 | 154563 | 159234 | 153776 |
| Step 20 | 161888 | 154110 | 167323 | 155040 | 154248 |
| Step 21 | 160647 | 152453 | 157631 | 153105 | 157318 |
| Step 22 | 161724 | 154949 | 159608 | 154419 | 155366 |
| Step 23 | 155774 | 152065 | 164738 | 157863 | 157972 |
| Step 24 | 152548 | 153464 | 155459 | 154419 | 156705 |
| Step 25 | 164009 | 158589 | 160031 | 153100 | 155766 |
3. About the possibility of batch inference
In principle, our SJD2 can inherently support batch inference as a variant of speculative decoding [A]. However, our current implementation has not yet been optimized for batch processing due to engineering challenges such as synchronization overhead. We are actively exploring engineering optimizations, including dynamic batching and asynchronous verification, to enable efficient batch inference in future versions.
[A] Qian H, Gonugondla S K, Ha S, et al. BASS: Batched attention-optimized speculative sampling[J]. arXiv preprint arXiv:2404.15778, 2024.
4. About details of Equation 3
As shown in Appendix A, we set and for Equation 3.
Thank you for your detailed and thoughtful rebuttal. I appreciate the additional empirical study you provided, which effectively addresses my initial concern about the behavior of denoising in discrete token space.
This paper proposes Speculative Jacobi-Denoising Decoding (SJD2) to accelerate autoregressive text-to-image models. By incorporating diffusion denoising into speculative decoding for AR models, the paper achieves a 4x speedup on common AR image generation models.
Strengths and Weaknesses
Strengths
- The background of SJD2 is well introduced in the first section.
- SJD2 achieves good acceleration with almost no loss in image quality.
- The method works on both text-to-image and text-to-video models.
Weaknesses
- Building on the previous work SJD, this work offers only an incremental contribution; besides, as stated in the related work, the integration of diffusion has been studied in many visual generation works.
- The method section is not well organized and needs more details, such as the fine-tuning objective, the timestep injection, and so on.
- The difference between SJD2 and AR+diffusion methods is not discussed in the related work.
- The method comparison is not comprehensive; only SJD is compared.
Questions
- The indexing of the embedding in Eq. (2) is confusing; please explain why the superscript skips from 0 to , and what and denote in the subscript.
- Does the original AR model utilize normalized embedding as input?
- Please explain the training objective and process. I only see a cross-entropy loss for the AR model; how does this work for the diffusion part? If there is a multi-step denoising process, would this involve high-order gradients?
- What is the fine-tuning effort of SJD2, e.g., the dataset size and training time?
- Please compare with more parallel decoding methods, like Jacobi Decoding, Lookahead Decoding, CLLMs, and ZipAR.
- How effective is the refinement with denoising? Is this refinement the difference between SJD and SJD2? If not, relevant ablation experiments are lacking.
Limitations
See Weaknesses and Questions
Final Justification
My final rating is 4: Borderline accept. The authors proposed a decoding method to accelerate AR image generation models, which shows benefits. My concerns have been answered; the paper should be revised to make it clearer for the audience.
Formatting Issues
NA
Response to Reviewer eMe4
Thank you for your thoughtful and constructive feedback on our paper. We appreciate your recognition of our introduction and the significant acceleration.
1. About our contribution
We feel it necessary to clarify the contribution of our work: SJD2 introduces a new advancement by integrating a denoising process into autoregressive (AR) models to stabilize the Jacobi iteration in SJD. We achieve this by fine-tuning AR models to be compatible with noisy embeddings (so that denoising can be integrated into the Jacobi iteration, as shown in Eq. 3, lines 191-202), while distinctively preserving the core AR mechanisms (discrete tokenization, next-token prediction with cross-entropy loss, and Jacobi iteration) via the proposed techniques such as noise perturbation on normalized embeddings. Unlike existing AR+diffusion models that rely on diffusion losses and auxiliary decoders for image generation, SJD2 maintains the standard AR components for image generation (i.e., predicting the probability of discrete tokens and training with cross-entropy loss) while delivering significant acceleration gains (shown in Table 1).
2. About details of our fine-tuning objective and timestep injection
Thank you for your suggestion. We will improve the clarity of the method section in the camera-ready version, and we detail our fine-tuning objective and timestep injection here:

(A) Timestep injection: Since we avoid introducing additional adapters, we utilize a flexible operator, i.e., the attention mechanism, for our token-wise timestep injection. Specifically, we take the sinusoidal encodings of timesteps as a sequence of special token embeddings and append them to the sequence of input token embeddings. The resulting sequence, which comprises clean input token embeddings, noisy input token embeddings, and timestep encodings, is fed into the transformer blocks. Within the attention modules of these blocks, we use the attention mask to force each noisy token embedding to attend to the corresponding timestep encoding that indicates its noise level. To ensure that the distribution of the sinusoidal timestep encodings aligns with that of the token embeddings, we apply a normalization-then-de-normalization process to these encodings (normalization with the statistics of the sinusoidal encodings and de-normalization with the statistics of the token embeddings), similar to Eq. 4.

(B) Fine-tuning objective: SJD2 fine-tunes a pre-trained autoregressive text-to-image model to predict the next clean token from noisy token embeddings. During fine-tuning, input token embeddings are perturbed with Gaussian noise and then fed into the transformer blocks, followed by a prediction head which outputs logits representing the categorical probability distribution of the next clean token. The cross-entropy loss is computed between these logits and the ground-truth token categories, optimizing the model to denoise its inputs while maintaining autoregressive prediction. This objective enables SJD2 to handle noisy inputs during inference.
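For concreteness, here is a minimal PyTorch-style sketch of one fine-tuning step as we understand it from the description above. The module names (`tok_embed`, `backbone`, `head`), the helper `sinusoidal_encoding`, and the omitted segment-wise attention mask are illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn.functional as F

def finetune_step(model, token_ids, timesteps, sinusoidal_encoding):
    """Next-clean-token prediction on noise-perturbed embeddings (sketch).

    token_ids: (B, L) clean discrete tokens; timesteps: (B, L) noise levels in [0, 1].
    """
    emb = model.tok_embed(token_ids)                 # (B, L, D)
    mu, std = emb.mean(), emb.std()

    # Perturb the *normalized* embeddings with the flow-matching schedule
    # alpha_t = 1 - t, sigma_t = t, then rescale back for the backbone.
    emb_norm = (emb - mu) / std
    t = timesteps.unsqueeze(-1)
    noisy = (1.0 - t) * emb_norm + t * torch.randn_like(emb_norm)
    noisy = noisy * std + mu

    # Append timestep encodings as extra "tokens"; a custom attention mask
    # (omitted here) lets each noisy token attend to its own encoding.
    t_enc = sinusoidal_encoding(timesteps)           # (B, L, D)
    hidden = model.backbone(torch.cat([noisy, t_enc], dim=1))
    logits = model.head(hidden[:, : token_ids.shape[1]])   # (B, L, V)

    # Cross-entropy with a one-position offset: position i predicts clean token i+1.
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.shape[-1]),
        token_ids[:, 1:].reshape(-1),
    )
```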
3. Difference between SJD2 and AR+diffusion methods
In our related work section (Integration of autoregression and continuous diffusion), we have discussed models like Diffusion Forcing and Transfusion. However, given the recent publication of numerous new AR+diffusion works, we now provide an updated version of this part of the related work: Recent diffusion-forcing models (e.g., Self-Forcing [A], MAGI-1 [B], and CausVid [C]) introduce autoregressive sampling into the temporal dimension of continuous video diffusion models. They employ causal masks to support history-conditioned video generation. Multimodal large language models (MLLMs) (e.g., BLIP-3o [D], Emu2 [E], and Seed-X [F]) generate images via pre-trained diffusion models (e.g., SDXL) which take the output features of autoregressive models as conditions. Moreover, unified models like Bagel [G] and JanusFlow [H] directly integrate the diffusion process into autoregressive backbones. These models train the autoregressive backbones together with lightweight decoders through a diffusion loss (e.g., noise prediction or velocity prediction) for image generation, and the backbone can perform both autoregressive decoding and continuous diffusion sampling. In contrast, SJD2 uniquely integrates denoising into pre-trained autoregressive models without modifying their core: it preserves the discrete tokenizers, next-token prediction with cross-entropy loss, and the established inference mechanisms. Our innovation lies in fine-tuning with noise-perturbed normalized embeddings, enabling ODE-based denoising in autoregressive models. We do not rely on a diffusion loss or additional decoders.
[A] Huang X, Li Z, He G, et al. Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion[J]. arXiv 2025.
[B] Teng H, Jia H, Sun L, et al. MAGI-1: Autoregressive Video Generation at Scale[J]. arXiv 2025.
[C] Yin T, Zhang Q, Zhang R, et al. From slow bidirectional to fast autoregressive video diffusion models[C]. CVPR 2025.
[D] Chen J, Xu Z, Pan X, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset[J]. arXiv 2025.
[E] Sun Q, Cui Y, Zhang X, et al. Generative multimodal models are in-context learners[C]. CVPR 2024.
[F] Ge Y, Zhao S, Zhu J, et al. Seed-x: Multimodal models with unified multi-granularity comprehension and generation[J]. arXiv 2024.
[G] Deng C, Zhu D, Li K, et al. Emerging properties in unified multimodal pretraining[J]. arXiv 2025.
[H] Ma Y, Liu X, Chen X, et al. Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation[C]. CVPR 2025.
4. Comparison to modern parallel decoding methods
We compare our method with the recent state-of-the-art and classic speculative/parallel decoding methods, including Lantern (ICLR 2025) [A], ZipAR (ICML 2025) [B], Eagle [C] and Jacobi Decoding [D], on COCO2017 validation set with Lumina-mGPT as baseline. As shown in the table below, our approach achieves superior acceleration while maintaining comparable visual quality.
| Configuration | Acceleration (Latency) | Acceleration (Steps) | CLIP-Score |
|---|---|---|---|
| AR | | | |
| Jacobi Decoding | | | |
| SJD | | | |
| EAGLE | | | |
| LANTERN | | | |
| ZipAR | | | |
| Ours | | | |
[A] Jang D, Park S, Yang J Y, et al. Lantern: Accelerating visual autoregressive models with relaxed speculative decoding[C]. ICLR 2025.
[B] He Y, Chen F, He Y, et al. ZipAR: Parallel Autoregressive Image Generation through Spatial Locality[C]. ICML 2025.
[C] Li Y, Wei F, Zhang C, et al. EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty[C]. ICML 2024.
[D] Song Y, Meng C, Liao R, et al. Accelerating feedforward computation via parallel nonlinear equation solving[C]. PMLR 2021.
5. About the superscript and subscript in Equation 2
The superscript in Equation (2) indicates the noise level applied to the embedding, and the skip from 0 to reflects a deliberate design in the application of noise perturbation. As demonstrated in Figure 3, some regions of the embeddings remain unperturbed by noise (denoted by the superscript 0), while their adjacent regions are perturbed to a minimal noise level (denoted by the superscript ). The subscript identifies the position of a specific token embedding, and is an integer in the range , used to specify the position where the noise level shifts from to .
6. About inputs in original autoregressive models
The pre-trained autoregressive models used as baselines are incompatible with normalized embedding inputs. Consequently, after performing denoising on normalized embedding inputs, we employ a denormalization step (Equation 4) to restore these noisy normalized embeddings to the original embedding space.
7. About the training process
Unlike standard diffusion models that rely on a diffusion loss (i.e., the noise-prediction objective $\mathbb{E}_{x_0,\epsilon,t}\big[\lVert \epsilon - \epsilon_\theta(x_t, t)\rVert^2\big]$), the training objective of SJD2 is based solely on the cross-entropy loss. This is because SJD2 is not a standard diffusion model: it only leverages the denoising concept from diffusion models and does not use the diffusion loss. Specifically, during training, we perturb token embeddings with noise, and the cross-entropy loss is applied to the next-clean-token predictions made from these noisy embeddings. During inference, the ODE denoising process is fused with Jacobi decoding, which is also a multi-step decoding algorithm (lines 191-202 with Equation 3). There are no high-order gradient computations in our fine-tuning.
8. About the fine-tuning efforts
To build the training dataset for fine-tuning, we collect about 80,000 synthesized images from Hugging Face and recaption them with Qwen-VL. We perform fine-tuning on 8 NVIDIA GPUs (80GB memory each) with a global batch size of 64, while leveraging DeepSpeed ZeRO-3 or FSDP with gradient checkpointing to save GPU memory at the cost of increased training time. All model parameters are updated during fine-tuning. The fine-tuning requires 6 epochs, which costs about A100 hours for Lumina-mGPT and H100 hours for Emu3.
9. About the refinement with denoising
This refinement with denoising is the key distinction between SJD2 and SJD. As shown in Table 1, SJD2 achieves greater acceleration than SJD.
Thanks for the rebuttal; my questions are partially answered. The authors should make the proposed method (AR & diffusion) clearer in the paper, covering both the training and inference procedures, in the text descriptions and figures (like Fig. 3).
In my view, the proposed method predicts the next token until the noisy embedding is acceptable. If so, I'd like to know how many denoising steps it takes during the generation of each token.
We appreciate your feedback and are pleased that the previous concerns have been addressed. We will incorporate the improvements to enhance the clarity of our methodology in the final version of the paper. Here, we will address the new concerns:
About the illustration of training and inference procedure
Thank you for your suggestion. We will improve clarity of writing in our method section. Here, we explain the specific training and inference procedures of SJD2 using Figure 2 and Figure 3, with the following text describing the figures:
Figure 2: Overview of our decoding process. First, given a sequence of noisy normalized token embeddings (illustrated as blue-bordered patches in the first row) and prefilling/already-accepted tokens (depicted as green circles in the first row), the noise levels of the token embeddings are set to increase non-strictly monotonically. Next, these token embeddings undergo one Jacobi-denoising iteration, in which they are fed into the neural network along with timestep encodings for a single parallel forward pass under a causal attention mask. The network predicts the conditional probability and performs token sampling for the next clean token at each position. Tokens sampled from noisy embedding inputs are denoted as the one-position-offset predictions, marked by the down-right blue solid arrows, while those sampled from prefilling or accepted token inputs are denoted as autoregressive (AR) predictions (i.e., the standard next-token predictions), marked by green solid arrows. Subsequently, a prefix of the sampled tokens is accepted based on the probabilistic criterion outlined in Equation 1 (e.g., in the second row, the first two sampled tokens are accepted and marked with green borders). Then, a denoising step (Equation 3) is performed for the remaining tokens, involving a linear combination of the token embedding from the previous iteration at the same spatial position (indicated by vertical blue dashed lines) and the embedding of the predicted clean token from the one-position offset (the down-right blue solid arrows). This iterative process repeats until all required tokens are accepted, serving as the final outputs.
Figure 3: This figure illustrates our training strategy and model pipeline. Starting with normalized input embeddings rather than raw input indices (i.e., token categories), the process is depicted as follows: (a) During training, the normalized embeddings (initially clean) are first transformed into noisy embeddings, as shown in the right-side image. In this image, noise levels increase across patches (i.e., token positions) in raster-scan order (left to right, top to bottom). When a randomly determined position is reached, the noise level stops increasing and the noise level of the next position resets to zero; this forms segments of increasing noise levels, so the noise levels are non-monotonic over the whole token sequence. Next, these noisy embeddings are appended with timestep encodings, which indicate the noise level at each position. Together, the embeddings and encodings are fed into the transformer blocks and a prediction head to produce logits for each position. The cross-entropy loss is then applied at each position, using the clean token indices as labels with a one-position offset (shown by the pink dotted frame) for next-clean-token prediction. (b) During inference, the input normalized embeddings (already noisy) are not further perturbed. These embeddings are also appended with timestep encodings and processed through the transformer blocks and prediction head to generate logits. As described in Equation 2, these logits are used for token sampling, and the sampled clean tokens are transformed back into normalized token embeddings, consistent with the input normalized embeddings shown in this figure.
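To make the decoding description above more concrete, here is a minimal sketch of one Jacobi-denoising iteration. The interfaces (`forward_window`, `normalized_embed`), the `accept_prefix` verification standing in for Eq. 1, and the exact linear-combination coefficients of the denoising step (Eq. 3) are assumptions for illustration; we also assume at least one token is accepted per iteration (in SJD-style verification, the first candidate is drafted conditioned only on accepted tokens, so it passes). In the full algorithm, fresh noisy embeddings would be appended to keep the window size fixed.

```python
import torch

@torch.no_grad()
def jacobi_denoise_step(model, accepted_ids, noisy_emb, t_cur, t_next, accept_prefix):
    """One iteration of the decoding loop (sketch).

    accepted_ids: (1, N) already-accepted tokens; noisy_emb: (1, W, D) normalized
    noisy embeddings of the current window; t_cur, t_next: (W,) noise levels
    before/after this iteration (non-strictly increasing along the window).
    """
    # Single parallel forward pass with a causal mask and timestep encodings.
    logits = model.forward_window(accepted_ids, noisy_emb, t_cur)   # (1, W, V)
    probs = logits.softmax(dim=-1)
    sampled = torch.multinomial(probs[0], num_samples=1).T          # (1, W)

    # Probabilistic verification (Eq. 1): accept a prefix of the sampled tokens.
    n_ok = accept_prefix(sampled, probs)                            # n_ok >= 1 assumed
    accepted_ids = torch.cat([accepted_ids, sampled[:, :n_ok]], dim=1)

    # Denoising step (Eq. 3) for the remaining positions: linearly combine the
    # previous noisy embedding at the same position with the embedding of the
    # clean token sampled at the preceding position (one-position offset).
    clean_emb = model.normalized_embed(sampled)                     # (1, W, D)
    prev = noisy_emb[:, n_ok:]                                      # same-position term
    pred = clean_emb[:, n_ok - 1 : -1]                              # one-position-offset term
    w = (t_next[n_ok:] / t_cur[n_ok:].clamp(min=1e-6)).view(1, -1, 1)  # assumed weights
    next_emb = w * prev + (1.0 - w) * pred
    return accepted_ids, next_emb
```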
About the denoising process in the procedure of token prediction
Yes. Given a sequence of token embeddings, SJD2 performs next-clean-token prediction for each embedding until its corresponding token is accepted. The number of denoising steps in SJD2 is set to 25.
This paper introduces Speculative Jacobi-Denoising Decoding (SJD2), a method for enabling parallel token generation in autoregressive models. The authors propose a next-clean-token-prediction task that allows a model to predict multiple future tokens with only light fine-tuning. Experiments on MSCOCO and GenEval show that this approach can achieve a speedup of around 4x without a significant drop in performance.
Strengths and Weaknesses
Strengths
- The paper is well-written, and the ideas and methods are described clearly. The main figure illustrates the core idea very well.
- The evaluation on MSCOCO shows that the method can speed up models like Lumina-mGPT and Emu3 by about 4x without harming performance too much.
- The qualitative results look good and effectively support the main experimental claims.
Weaknesses
- Limited Scope of Model Compatibility: It is unclear if the proposed SJD2 method works with a wider range of autoregressive models. For example, compatibility with other advanced autoregressive approaches like VAR or Janus Pro is not discussed.
- Insufficient Baseline Comparisons: The main experiments lack comparisons with other modern speculative or parallel decoding methods. The only baseline compared against is the original SJD, which makes it hard to judge SJD2's performance relative to the state of the art.
- Missing Comparison with Diffusion Models: The paper does not include a comparison of the efficiency/performance trade-off against state-of-the-art diffusion models.
- Weak Performance on GenEval: The overall performance on the GenEval benchmark does not appear to be competitive with many existing methods.
Questions
- Why was there no comparison with the Emu3 model on the GenEval benchmark, even though it was used in other experiments?
- Are there any comparisons of computational cost in terms of FLOPs? This would provide a more detailed efficiency analysis beyond latency.
- How does the proposed SJD2 approach compare to other parallel decoding methods from recent literature? A broader comparison is needed to position this work.
- Is SJD2 compatible with autoregressive models that already use a form of parallel decoding, such as MAR models?
- Could you provide an analysis or comparison of the training cost required for the proposed SJD2 approach?
Limitations
Yes
Final Justification
The author's rebuttal resolves most of my concerns and I'm willing to increase the score. Authors are encouraged to update the manuscript to reflect the comparison with more baselines.
Formatting Issues
N/A
Response to Reviewer Q75X
Thank you for your thoughtful and constructive feedback on our paper. We particularly appreciate your recognition of the clear methodological presentation and our performance of acceleration. Our primary contribution lies in enabling efficient parallelization of existing autoregressive architectures through lightweight fine-tuning, rather than pursuing state-of-the-art performance.
1. Results on more autoregressive models
Thank you for your suggestion. As our SJD2 is developed for the next-token-prediction paradigm, we choose Janus-Pro-1B for these experiments. The paradigm of VAR is next-scale prediction, which could require a different acceleration approach and is thus beyond the scope of our paper. Specifically, we employ our paired text-image data to tune Janus-Pro-1B to be compatible with noisy embedding inputs and use our SJD2 for decoding. The results are in the following tables. We observe that SJD2 can still accelerate Janus-Pro without sacrificing generated image quality, as evidenced by the GenEval metrics below.
| Method | Latency | Steps |
|---|---|---|
| Janus-Pro-1B | 9.1s | 576 |
| Janus-Pro-1B + SJD2 | 2.5s | 144 |
| Method | Colors | Position | Counting | Two | Color Attri | Single | Overall |
|---|---|---|---|---|---|---|---|
| Janus-Pro-1B (AR) | 0.82 | 0.39 | 0.39 | 0.55 | 0.42 | 0.94 | 0.59 |
| Janus-Pro-1B (SJD2) | 0.83 | 0.45 | 0.37 | 0.59 | 0.38 | 0.96 | 0.60 |
2. Comparison to modern parallel decoding methods
We compare our method with the recent state-of-the-art and classic speculative/parallel decoding methods, including Lantern (ICLR 2025) [A], ZipAR (ICML 2025) [B], Eagle [C] and Jacobi Decoding [D], on COCO2017 validation set with Lumina-mGPT as baseline. As shown in the table below, our approach achieves superior acceleration while maintaining comparable visual quality.
| Configuration | Acceleration (Latency) | Acceleration (Steps) | CLIP-Score |
|---|---|---|---|
| AR | | | |
| Jacobi Decoding | | | |
| SJD | | | |
| EAGLE | | | |
| LANTERN | | | |
| ZipAR | | | |
| Our SJD2 | | | |
[A] Jang D, Park S, Yang J Y, et al. Lantern: Accelerating visual autoregressive models with relaxed speculative decoding[C]. ICLR 2025.
[B] He Y, Chen F, He Y, et al. ZipAR: Parallel Autoregressive Image Generation through Spatial Locality[C]. ICML 2025.
[C] Li Y, Wei F, Zhang C, et al. EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty[C]. ICML 2024.
[D] Song Y, Meng C, Liao R, et al. Accelerating feedforward computation via parallel nonlinear equation solving[C]. PMLR 2021.
3. About the lack of comparison to diffusion models
Our work focuses on improving the inference efficiency of standard autoregressive (AR) models for image generation. Thus, we choose these baselines: Lumina-mGPT, Emu3, and Janus-Pro (supplemented in this rebuttal). AR models have advantages in the unified modeling of multimodal data (predicting discrete tokens for both linguistic and visual tasks, unlike diffusion models) and are thus worth studying [A, B, C]. While many AR models currently underperform state-of-the-art diffusion models in image quality and face acceleration challenges, our SJD2 narrows the speed gap. In the following table, we evaluate the inference latency of several commonly used diffusion models (smaller than 3B) and Janus-Pro at the same resolution (). We set the number of sampling steps for the diffusion models to 50. The results demonstrate that our SJD2 reduces the latency of Janus-Pro-1B from 9.1s to 2.5s, narrowing the gap between Janus-Pro and advanced diffusion models like SD3. Moreover, this means Janus-Pro-1B with SJD2 already outperforms SDXL in speed (2.5s vs. 4.3s):
| Method | Latency | Steps |
|---|---|---|
| Janus-Pro-1B | 9.1s | 576 |
| SD1.5 | 1.7s | 50 |
| SDXL | 4.3s | 50 |
| SD3-Medium | 1.7s | 50 |
| Janus-Pro-1B + SJD2 | 2.5s | 144 |
[A] Chen X, Wu Z, Liu X, et al. Janus-pro: Unified multimodal understanding and generation with data and model scaling[J]. arXiv preprint arXiv:2501.17811, 2025.
[B] Wang X, Zhang X, Luo Z, et al. Emu3: Next-token prediction is all you need[J]. arXiv preprint arXiv:2409.18869, 2024.
[C] Team C. Chameleon: Mixed-modal early-fusion foundation models[J]. arXiv preprint arXiv:2405.09818, 2024.
4. Regarding Performance on GenEval
This work focuses on accelerating existing pre-trained autoregressive models while preserving visual quality, not on achieving state-of-the-art image quality. As shown in Table 2, our method successfully maintains visual fidelity on GenEval. While current autoregressive baselines like Lumina-mGPT generally underperform compared to advanced diffusion-based generation models, our proposed method has the potential to be applied to stronger autoregressive base models in the future.
5. Performance of Emu3 on the GenEval benchmark
In the following table, we compare our SJD2 to autoregressive decoding (AR) and SJD on the GenEval benchmark with Emu3 as the baseline. Our method achieves an overall score of 0.49, which is close to the 0.52 of AR and the 0.48 of SJD, demonstrating preserved visual quality.
| Method | Colors | Position | Counting | Two | Color Attri | Single | Overall |
|---|---|---|---|---|---|---|---|
| AR | 0.78 | 0.15 | 0.33 | 0.69 | 0.16 | 0.98 | 0.52 |
| SJD | 0.79 | 0.12 | 0.28 | 0.61 | 0.13 | 0.97 | 0.48 |
| SJD2 | 0.73 | 0.14 | 0.28 | 0.61 | 0.24 | 0.96 | 0.49 |
6. FLOPs in inference
We present the average FLOPs per output token in the table below. The results show that autoregressive decoding requires fewer FLOPs than SJD2. Although more FLOPs are used for decoding, the practical latency is lower. This FLOPs overhead stems from the drafting-and-verification paradigm shared by all speculative decoding methods, which inherently introduces computational overhead: the number of accepted tokens per sampling step is substantially lower than the number of input draft tokens. Previous studies have confirmed this observation. For example, as demonstrated in Tables 4 and 5 of [A], Medusa (a speculative decoding method) consumes 403 GFLOPs per token, whereas vanilla autoregressive decoding requires only 19 GFLOPs.
| Method | GFLOPs | Latency |
|---|---|---|
| Lumina-mGPT (AR) | 18.72 | 88.55s |
| Lumina-mGPT (SJD2) | 219.60 | 33.64s |
| Emu3 (AR) | 18.15 | 375.29s |
| Emu3 (SJD2) | 465.92 | 147.65s |
[A] Lin C H, Tuli S, Smith J, et al. SLiM: Speculative decoding with hypothesis reduction[C]//Findings of the Association for Computational Linguistics: NAACL 2024. 2024: 1005-1017.
7. About compatibility with parallel-decoding models
Thank you for your question. Speculative decoding was originally designed to accelerate standard autoregressive models that decode tokens sequentially (one token per forward pass); its lossless acceleration relies on better utilizing GPU parallelism. However, models employing parallel decoding techniques (e.g., MaskGIT or the next-set prediction of MAR) already utilize GPU parallelism well. Consequently, speculative decoding methods like SJD2 would require redesign and new theoretical foundations to assist these frameworks, which can be explored in future work. Regarding the diffusion head of MAR, which operates on continuous latents, speculative decoding would need to be redesigned to be compatible with Gaussian inputs; some concurrent efforts have been made and are still under exploration [A].
[A] Wang Z, Zhang R, Ding K, et al. Continuous speculative decoding for autoregressive image generation[J]. arXiv preprint arXiv:2411.11925, 2024.
8. Comparison of the training cost of SJD2
Regarding the fine-tuning cost: the baseline, Lumina-mGPT, requires 10 million image-text pairs for pre-training. In contrast, our fine-tuning uses only 80,000 images over 6 epochs, totaling approximately 0.5 million image samples seen during training. This indicates that at most 5% of the original computational resources are used for fine-tuning our method. We only use A100 hours for fine-tuning, highlighting the efficiency.
Thanks to the authors for the detailed feedback. SJD2 seems to perform well on token-by-token autoregressive vision models. However, I still have concerns about it not being generalizable to more autoregressive approaches. Also, the numbers for Emu3 on GenEval seem to be lower than the ones reported in the original paper.
Thank you for your feedback. We will address the new concerns here:
About the number of autoregressive base models
In this paper and our prior rebuttal, we have conducted experiments on three base models: Lumina-mGPT, Emu3, and Janus-Pro. Following the suggestions in the prior reviews, we included Janus-Pro to demonstrate the effectiveness of SJD2 on advanced autoregressive models. Notably, existing papers on accelerating autoregressive text-to-image generation typically use 2–3 base models [A, B] to validate generalizability, in line with the number of autoregressive base models used for SJD2.
[A] He Y, Chen F, He Y, et al. ZipAR: Parallel Autoregressive Image Generation through Spatial Locality[C]. ICML 2025.
[B] Jang D, Park S, Yang J Y, et al. LANTERN: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding[C]. ICLR 2025.
About the GenEval performance on Emu3
First, the baseline performance of Emu3 with original autoregressive decoding (AR), shown in the first row of the GenEval-Emu3 table (reproduced here from our prior rebuttal), aligns with the original Emu3 paper (Page 21, Table 7, second-to-last line, Emu3-DPO, overall score: 0.52) [A]. Second, our SJD2 method achieves a GenEval overall score of 0.49, closely approaching the performance of Emu3 with AR decoding (0.52). Furthermore, the speculative decoding method SJD [B] scores 0.48, demonstrating that our approach is competitive with existing acceleration methods.
| Method | Colors | Position | Counting | Two | Color Attri | Single | Overall |
|---|---|---|---|---|---|---|---|
| AR | 0.78 | 0.15 | 0.33 | 0.69 | 0.16 | 0.98 | 0.52 |
| SJD | 0.79 | 0.12 | 0.28 | 0.61 | 0.13 | 0.97 | 0.48 |
| SJD2 | 0.73 | 0.14 | 0.28 | 0.61 | 0.24 | 0.96 | 0.49 |
[A] Wang X, Zhang X, Luo Z, et al. Emu3: Next-token prediction is all you need[J]. arXiv preprint arXiv:2409.18869, 2024.
[B] Teng Y, Shi H, Liu X, et al. Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding[C]. ICLR 2025.
Speculative Jacobi-Denoising Decoding is the next step in proposing novel improvements to Speculative Jacobi Decoding for parallel token generation in autoregressive models. These changes result in faster convergence, reducing computational overhead, as well as equivalent or higher quality on several metrics demonstrated on image generation tasks.
Strengths and Weaknesses
Strengths: Detailed ablations and metric studies, as well as a clear description of the approach. I think this is a good, informative paper which is useful for people to get introduced to Jacobi decoding methods, as well as see the line of research leading up to this new method. The experimental section is focused on images, but clear in the results and goals. Descriptions of the inference time procedure are high quality, and presented with great figures and examples.
Weaknesses: Given the widespread applicability of Jacobi decoding and descendants, it would be really interesting to see this approach applied to other generative applications, like text or audio token generation. It may be out of scope for this review cycle, but really worth considering for broad appeal of the method.
It is not clear to me if the full model is trained during the fine-tuning procedure, or only a small part. There are some other complaints below, but a bit more in-depth treatment of the tuning stage, and what parameters are involved, would be useful, especially for follow-up work.
There is a strong conceptual convergence between the approach detailed here and the paper "Rolling Diffusion Models", which adds a rolling time-window approach to continuous diffusion. It is probably worth referencing or briefly discussing in the background material.
There was some mention of compute costs, but it would be useful to detail how long the finetuning takes in terms of approximate wall-clock time, as well as further details on the finetune hyperparameters in general.
There is a statement "For fine-tuning, we divide the input sequence into randomly sized segments and add noise to each segment."; it would be useful to have details about the procedure to select the randomly sized adjacent segments, and how the noise level is chosen and applied over training. There is "We follow the advanced flow matching setting with α_t = 1 − t and σ_t = t for our denoising process", but more details would be better.
Similarly, there is a statement "we take the normalized Fourier encodings of timesteps" - is this different than a standard positional encoding? If so, it would be good to add more detail, or cite a relevant paper defining this. Otherwise, change the writing slightly to tie in to the more standard naming with "Fourier positional encodings" or "Transformer sinusoidal position encodings", or something like this.
Questions
Do the authors have any user preference A/B tests to complement the GenEval result? It would be great to correlate these quantitative metrics to human A/B.
In eq(1), if the 1 in the lefthand side is decreased, what happens to the resulting quality? Do the authors have any study, reference, or ablation on this - I think it might also improve the convergence speed of Speculative Jacobi Decoding, but maybe at the expense of quality?
Do the authors have any high-order statistics, like skewness, kurtosis, and so on about the distribution of embeddings before and after normalization? It is slightly surprising to me that simple mean standard deviation normalization is enough to Gaussianize sufficiently to prevent the failure modes discussed in the paper. It would be good to have some information, if people want to apply this method but have different embedding distributions up front.
Is there any potential of a "training free" version of this method, if not what are the key roadblocks (technical, computational, theoretical) to such a version?
Will the authors release code to help reproduction in open source tools?
Limitations
Yes
Final Justification
I like this paper and think it has large applicability with broad interest for a variety of autoregressive modeling approaches. Given the focused scope of the work and the clarity of discourse, I think it should be a "strong accept" for any conference. I really like the clear explanation and practicality of the method.
Formatting Issues
None
Response to Reviewer ucHV
Thank you for your thoughtful and constructive feedback on our paper. We particularly appreciate your recognition of our detailed ablation studies, clear methodological explanations, and high-quality experimental demonstrations.
1. About potential application to other domains like text/audio
Thank you for your thoughtful suggestion about extending Jacobi decoding to text and audio generation. While such an extension falls beyond the scope of the current review cycle, we plan future research in this direction, including curating diverse text/audio datasets and exploring integrations of SJD2 with autoregressive LLMs like Llama and Qwen.
2. About fine-tuning details and the training/inference cost
All model parameters are updated during fine-tuning. To build the dataset, we collect about 80,000 synthesized images from Hugging Face and recaption them with Qwen-VL. We perform fine-tuning on 8 NVIDIA GPUs (80GB memory each) with a global batch size of 64, a learning rate of 2e-5, and the AdamW optimizer (β1=0.9, β2=0.95), while leveraging DeepSpeed ZeRO-3 or FSDP with gradient checkpointing to save GPU memory at the cost of increased training time. The fine-tuning requires 6 epochs, which costs about A100 hours for Lumina-mGPT and H100 hours for Emu3. The inference cost is shown in the following table:
| Method | GPU memory | Latency |
|---|---|---|
| Lumina-mGPT (AR) | 17G | 88.55s |
| Lumina-mGPT (SJD2) | 20G | 33.64s |
| Emu3 (AR) | 20G | 375.29s |
| Emu3 (SJD2) | 23G | 147.65s |
3. About discussion on Rolling Diffusion Models
Thank you for your suggestion. SJD2 accelerates autoregressive text-to-image generation by integrating denoising with Jacobi iterations, using a fixed sliding window for parallel token prediction in a discrete token space. It fine-tunes pre-trained models to predict the conditional probability given noisy token embeddings as input, and uses the cross-entropy loss for supervision. During inference, the token embeddings are first denoised and then seamlessly enter the standard speculative Jacobi iterations to draft and verify the corresponding discrete tokens and probabilities. In contrast, Rolling Diffusion Models apply a continuous diffusion process with a rolling time-window to time-series data, trained with a diffusion loss to predict noise in continuous space. While both use window-based denoising for efficiency, SJD2 focuses on the discrete tokens of autoregressive models, whereas Rolling Diffusion Models focus on adapting continuous diffusion to temporal sequences. We will add this discussion in the camera-ready version.
4. About segment size and noise level selection
The procedure for dividing the sequence into segments and selecting noise levels is as follows. First, we sample a monotonically increasing timestep sequence from the range of , where is randomly determined in the range (the typical number of steps in diffusion sampling). The elements in this timestep sequence can adhere to the Karras timestep schedule [A]. Then, for an input token sequence of length (), we uniformly partition it into segments ( is randomly chosen from the range ), and fill each segment uniformly with the timestep sequence across all positions. These timesteps are used to calculate the parameters α_t and σ_t for the noise perturbation on the token embeddings. A minimal sketch of this procedure is given after the reference below.
[A] Karras T, Aittala M, Aila T, et al. Elucidating the design space of diffusion-based generative models[J]. Advances in neural information processing systems, 2022, 35: 26565-26577.
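The sketch below reflects our reading of the segmenting-and-noising procedure; the exact ranges, the Karras parameters (e.g., `rho`), and the way the timestep sequence is stretched to the segment length are assumptions, not the authors' released code.

```python
import torch

def sample_segment_timesteps(seq_len, max_steps=25, max_segments=8,
                             t_min=1e-3, t_max=1.0, rho=7.0):
    """Assign a noise level to every token position (sketch)."""
    # 1) Sample a monotonically increasing timestep sequence of random length,
    #    spaced with a Karras-style schedule.
    n = int(torch.randint(1, max_steps + 1, (1,)))
    ramp = torch.linspace(0.0, 1.0, n)
    t_seq = (t_min ** (1 / rho) + ramp * (t_max ** (1 / rho) - t_min ** (1 / rho))) ** rho

    # 2) Uniformly partition the token sequence into a random number of
    #    segments and fill each segment with the timestep sequence.
    num_seg = int(torch.randint(1, max_segments + 1, (1,)))
    seg_len = -(-seq_len // num_seg)                 # ceil division
    idx = torch.linspace(0, n - 1, seg_len).long()   # stretch to the segment length
    t = t_seq[idx].repeat(num_seg)[:seq_len]         # per-position noise level

    # 3) Flow-matching coefficients used for perturbing the embeddings.
    alpha, sigma = 1.0 - t, t
    return t, alpha, sigma
```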
5. About the writing of the positional encoding
Thank you for your suggestion. Our timestep encoding takes the same form as the sinusoidal position encodings used in standard Transformers. We will add this explanation in the camera-ready version.
6. About the user preference A/B tests
We conducted a user preference study on images generated by autoregressive decoding and SJD2. We use Lumina-mGPT as the baseline and select 16 pairs of images, with 10 users participating in the study. The results are in the following table. We observe that the preference for the two decoding methods is close, with the images generated by autoregressive decoding being slightly more preferred.
| Method | Preference |
|---|---|
| AR | 54.7% |
| SJD2 | 45.3% |
7. The influence of decreasing the 1 on the left-hand side of Equation 1
Thank you for your question. For simplicity, consider replacing the constant "1" on the left-hand side of Equation 1 with a variable. Decreasing this variable would reduce the acceptance rate, as a lower value makes the acceptance condition harder to satisfy. We conducted corresponding experiments, and the results align with our hypothesis. Thus, the constant 1 not only conforms to the theory of speculative sampling but also guarantees a relatively large acceptance rate. A minimal sketch of this acceptance test is given after the table below.
| Method | Steps |
|---|---|
| Emu3 + SJD () | 3537 |
| Emu3 + SJD () | 4969 |
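For clarity, the sketch below shows the standard speculative acceptance test with the constant 1 replaced by a tunable value (written here as `tau`, an assumed notation); `p_tok` and `q_tok` denote the target-model and draft probabilities of the sampled token.

```python
import torch

def accept_token(p_tok: torch.Tensor, q_tok: torch.Tensor, tau: float = 1.0) -> bool:
    # Standard speculative sampling accepts with probability min(1, p/q).
    # Replacing the 1 with tau < 1 caps this probability at tau, so the
    # condition becomes harder to satisfy and the acceptance rate drops.
    u = torch.rand(())
    return bool(u < torch.clamp(p_tok / q_tok, max=tau))
```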
8. About the relationship between normalization and the failure mode
We do not utilize high-order statistics like skewness or kurtosis for normalization and de-normalization. The failure mode depicted in Figure 5 arises when the noise-perturbation-based fine-tuning (and the denoising sampling) is applied directly to token embeddings without normalization and de-normalization. As we discussed in lines 213-220, normalization and de-normalization primarily serve to scale the input values appropriately for the noise perturbation and the transformer blocks, respectively. The noise perturbation formula $x_t = \alpha_t x_0 + \sigma_t \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$, can produce extreme values, i.e., $x_t$ may reach magnitudes of 6 or -6, because $\epsilon$ is sampled from a standard Gaussian distribution. However, we observe that the token embedding statistics are consistently small across many models (e.g., mean = 0.0179, std = 0.2709). It is challenging for models to adapt to such large-value inputs within a few fine-tuning epochs. Consequently, we apply the noise perturbation to normalized embedding values and subsequently rescale them for model input.
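As a small numerical illustration (using the embedding statistics quoted above and the flow-matching schedule; the tensor sizes and noise level are assumed), the snippet below contrasts perturbing raw embeddings with perturbing normalized embeddings and then rescaling:

```python
import torch

torch.manual_seed(0)
mu, std = 0.0179, 0.2709                      # observed token-embedding statistics
emb = mu + std * torch.randn(4096, 64)        # stand-in for raw token embeddings
t = 0.9                                       # high noise level: alpha = 1 - t, sigma = t
noise = torch.randn_like(emb)

raw_noisy = (1 - t) * emb + t * noise                   # perturb raw embeddings directly
norm_noisy = (1 - t) * (emb - mu) / std + t * noise     # perturb normalized embeddings
rescaled = norm_noisy * std + mu                        # de-normalize before the model

print(emb.abs().max().item())        # original embedding scale (small)
print(raw_noisy.abs().max().item())  # several times larger: dominated by the Gaussian noise
print(rescaled.abs().max().item())   # stays on the original embedding scale
```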
9. The possibility of a training-free version of our method
Currently, it is challenging to achieve a training-free variant of our method. The difficulty stems from the fact that autoregressive models are not exposed to Gaussian-perturbed tokens during their standard training process. These models are typically optimized for next-token prediction using discrete tokens produced by the tokenizer encoder (e.g., a VQGAN with a CNN encoder), and the inputs are never intentionally corrupted with Gaussian noise throughout pre-training. Consequently, pre-trained autoregressive models may not have encountered noisy token representations, making it difficult to generalize to such inputs without explicit training adaptation.
10. About the open-source
We will release our code upon paper acceptance.
Thank you for the detailed response. This answered my questions in detail, and really enhances the work in my opinion.
On 6 - if adding the A/B data to the paper or appendix, it would be worth specifying the confidence interval as well, and also potentially looking at per-image-pair agreement across raters. But this already gives a reliable signal that the approach here is, if nothing else, quite close in preference and quality to AR, as the preference was not by any means a blowout (and we expect variance between examples just due to the nature of generative models).
On the rate - thank you for the correction, this makes sense!
My question on normalization partially stems from related methods in "discrete diffusion" on top of pretrained and from-scratch embeddings (such as SSD-LM or Dirichlet Flow Matching), where even with normalization there can also be reprojection methods, or other things to try and craft the right inputs and outputs. Scaling is also critical and necessary, and the explanation here makes sense, for 8 and 9.
Thank you for your valuable feedback. We are glad that the rebuttal has addressed your questions and we appreciate your suggestions. We will include the improvements in our final paper.
This paper introduces Speculative Jacobi-Denoising Decoding (SJD2), a novel acceleration framework for autoregressive text-to-image generation. The core scientific claim is that by integrating denoising into Jacobi iterations and introducing a next-clean-token prediction task, SJD2 enables parallel token generation, achieving up to 4x inference speedup while largely preserving visual quality. The paper demonstrates this on multiple recent generative models (Lumina-mGPT, Emu3, Janus-Pro) with both quantitative and qualitative results, including user preference studies. The strengths of the work lie in its clear motivation, methodological novelty, careful experimental validation, and practical acceleration gains. The method is also relatively lightweight, requiring only modest fine-tuning compared to full pretraining.
The weaknesses identified by reviewers include incrementality over prior speculative Jacobi decoding, limited baseline comparisons (initially only against SJD), and questions about generalizability to other AR paradigms (e.g., VAR, MAR). During rebuttal, the authors responded constructively: they added experiments on Janus-Pro, compared against modern speculative/parallel methods (Lantern, ZipAR, Eagle), provided detailed fine-tuning and FLOP/latency analysis, clarified noise segmentation and timestep injection strategies, and conducted user preference A/B testing. The rebuttal substantially mitigated the main criticisms, with reviewers acknowledging improved clarity and breadth of evaluation. Considering the strong technical contribution, convincing empirical validation, and convincing rebuttal, the AC recommends acceptance. The paper makes a solid and timely contribution to the acceleration of autoregressive generative models.