Towards Better & Faster Autoregressive Image Generation: From the Perspective of Entropy
Improve both the generation quality and efficiency of autoregressive image generation by leveraging the entropy of token-wise probability distributions.
Abstract
Reviews and Discussion
The authors propose to measure entropy and use it to set the sampling temperature dynamically, improving both the speed and quality of autoregressive image generation models. They ran initial experiments measuring CLIP score and FID versus temperature for different entropy ranges and observed that:
- In high-entropy regions, lower randomness helps improve text-image consistency.
- In regions with extremely low entropy, increasing sampling randomness consistently improves visual quality while having negligible impact on text alignment.
Based on these observations, they propose setting the temperature dynamically according to entropy to improve the sampling process. The authors further demonstrate the effectiveness of the method on mask-prediction and scale-wise models. Finally, they use entropy to improve speculative decoding and thereby accelerate inference. Experiments show promising results and validate the method.
Strengths and Weaknesses
Strengths:
- The idea is novel.
- The paper is well written and easy to follow.
- In addition to the main results, the authors have conducted various ablation studies to validate the proposed method.
Weaknesses:
- More details are needed about the initial finding in Figure 3(c). Ideally, a different model and dataset should be used to make sure the conclusion is solid.
- "Mask-prediction models" section needs some improvement. I did not get why "This design further encourages randomness in low-entropy regions while enhancing accuracy in high-entropy regions"
- The 15% speedup does not seem to be a large improvement for acceleration.
Questions
- Could you explain more about why "regions with simple content (e.g., solid colors) typically exhibit lower entropy while more complex foreground (e.g., objects, structures, and textures) areas have higher entropy"? The entropy here sounds similar to frequency; is there any connection?
- For Figure 3(c), could you add more details on how the experiment is performed? Which dataset is used, and which model is used to generate the distribution?
Limitations
yes
Final Justification
My concerns are addressed.
Formatting Issues
N/A
We sincerely appreciate your time and effort in reviewing our paper. Below are our point-by-point responses:
1. Details & more experiments about Figure 3(c): Details: Figure 3(c) is based on the LlamaGen-stage1 model and computed on the full COCO2017-val dataset at a resolution of 256. The CFG scale is set to 7.5, with no top-k or top-p sampling applied. Except for the temperature, all other settings follow the parameters in the official GitHub repository.
For temperature perturbation, as described in the main text, our temperature perturbation strategy involves computing the entropy of each token during generation. Tokens are categorized into five predefined entropy intervals ([0,2), [2,4), [4,6), [6,8), [8,∞)), and for each interval, we apply a specific temperature to adjust the token distribution. The overall image quality is then assessed on the dataset.
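For concreteness, a minimal PyTorch-style sketch of this bucketed perturbation is shown below. The interval edges match those listed above, but the entropy base (nats), the helper name `entropy_bucketed_sample`, and the values in `INTERVAL_TEMPS` are our own illustrative placeholders, not the released implementation or tuned settings.

```python
import torch
import torch.nn.functional as F

ENTROPY_EDGES = [0.0, 2.0, 4.0, 6.0, 8.0, float("inf")]
INTERVAL_TEMPS = [2.0, 1.5, 1.0, 0.8, 0.8]  # placeholder: one temperature per entropy interval

def entropy_bucketed_sample(logits: torch.Tensor) -> torch.Tensor:
    """Sample one token per row, re-tempering the logits by entropy bucket.

    logits: [batch, vocab] next-token logits (after CFG merging).
    """
    probs = F.softmax(logits, dim=-1)
    # Shannon entropy (nats) of each token's predictive distribution.
    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum(dim=-1)  # [batch]

    # Look up the temperature of the entropy interval each row falls into.
    temps = torch.ones_like(entropy)
    for lo, hi, t in zip(ENTROPY_EDGES[:-1], ENTROPY_EDGES[1:], INTERVAL_TEMPS):
        temps = torch.where((entropy >= lo) & (entropy < hi), torch.full_like(temps, t), temps)

    # Re-temper the distribution and sample.
    tempered = F.softmax(logits / temps.unsqueeze(-1), dim=-1)
    return torch.multinomial(tempered, num_samples=1).squeeze(-1)
```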
More experiments: Additionally, we have added the evaluation metrics for the STAR model on COCO2017 (due to limited time, we will provide evaluations on more models and datasets in the next version), as shown in the table below:
| Entropy Range | Temperature | FID | CLIP |
|---|---|---|---|
| Baseline | all 1.0 | 35.05 | 25.43 |
| Ours | dynamic | 32.75 | 25.56 |
| [0, 2) | 2.5 | 33.36 | 25.43 |
| [0, 2) | 2.0 | 33.98 | 25.40 |
| [0, 2) | 1.5 | 35.01 | 25.42 |
| [0, 2) | 1.2 | 35.05 | 25.43 |
| [0, 2) | 0.8 | 35.16 | 25.44 |
| [0, 2) | 0.5 | 35.19 | 25.41 |
| [2, 4) | 2.0 | 33.06 | 25.40 |
| [2, 4) | 1.5 | 33.69 | 25.45 |
| [2, 4) | 1.2 | 34.16 | 25.46 |
| [2, 4) | 0.8 | 35.54 | 25.42 |
| [2, 4) | 0.5 | 35.99 | 25.44 |
| [4, 6) | 2.0 | 37.63 | 25.14 |
| [4, 6) | 1.5 | 36.15 | 25.29 |
| [4, 6) | 1.2 | 35.25 | 25.37 |
| [4, 6) | 0.8 | 34.87 | 25.48 |
| [4, 6) | 0.5 | 36.15 | 25.48 |
| [6, 8) | 1.5 | 35.76 | 25.30 |
| [6, 8) | 1.2 | 35.29 | 25.36 |
| [6, 8) | 0.8 | 34.76 | 25.49 |
| [6, 8) | 0.5 | 35.38 | 25.59 |
| [8, inf) | 1.5 | 35.19 | 25.42 |
| [8, inf) | 1.2 | 35.18 | 25.41 |
| [8, inf) | 0.8 | 35.19 | 25.42 |
| [8, inf) | 0.5 | 35.20 | 25.42 |
2. Why masking order helps: In mask-based generation models, each step selects which tokens to keep based on a predefined rule—typically by adding random noise to the confidence scores and keeping the top-k tokens (see Equations 4 and 5 in the main paper). This can be viewed as a confidence-based top-k sampling strategy, where higher-confidence tokens at a given timestep are more likely to be selected, and the noise level controls the randomness.
From our observations, due to the mask scheduler’s design, low-entropy tokens (which tend to have higher confidence) are selected earlier in generation, while many high-entropy tokens are accepted together in later stages (as shown in Figure 4). This can hurt image quality. By adjusting the noise level based on entropy—using lower noise in high-entropy regions and more randomness in low-entropy ones—we encourage more cautious token selection where needed and allow more exploration in simpler areas, which leads to better image quality.
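As a rough sketch of this entropy-modulated selection (a hypothetical PyTorch helper under our own assumptions; the `1 / (1 + entropy)` noise modulation is an illustrative choice, not the paper's exact Equations 4 and 5):

```python
import torch
import torch.nn.functional as F

def entropy_aware_selection(logits, sampled_ids, num_to_keep, base_noise=1.0):
    """Choose which masked positions to commit at this decoding step.

    logits:      [num_masked, vocab] predictions at the still-masked positions.
    sampled_ids: [num_masked] token ids drawn for those positions.
    num_to_keep: how many positions the mask scheduler unmasks this step.
    """
    probs = F.softmax(logits, dim=-1)
    conf = probs.gather(-1, sampled_ids.unsqueeze(-1)).squeeze(-1)      # confidence of the drawn token
    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum(dim=-1)  # per-position entropy

    # Noise shrinks as entropy grows: explore in simple regions, stay cautious in complex ones.
    noise_scale = base_noise / (1.0 + entropy)
    u = torch.rand_like(conf).clamp(1e-12, 1.0 - 1e-12)
    gumbel = -torch.log(-torch.log(u))
    score = torch.log(conf.clamp_min(1e-12)) + noise_scale * gumbel

    return torch.topk(score, k=num_to_keep).indices  # positions to commit this step
```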
3. Speedup Ratio: Our proposed acceleration strategy reduces inference cost by an additional ~15% on top of an existing acceleration method [1], achieving around a 2.3× speedup compared to the original model. Although the current performance gain is modest, this approach highlights the potential of exploiting image non-uniformity for faster generation. We are actively investigating ways to further improve its effectiveness.
Reference:
[1] Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding
Thanks for the rebuttal. I would like to keep my rating.
This research tackles the problem of inefficient and suboptimal sampling methods in autoregressive image generation models. The key insight is that image tokens exhibit lower and more unevenly distributed information density across different spatial areas compared to text tokens. To solve this issue, the authors introduce an entropy-based sampling strategy with two main components: first, a dynamic temperature adjustment system that varies the level of sampling randomness according to the spatial entropy patterns in token distributions; and second, an entropy-based acceptance criterion for speculative decoding that aims to speed up inference while preserving generation quality. The researchers test their approach on various autoregressive image generation frameworks, including next-token prediction, masked token generation, and scale-based models. Results demonstrate enhanced performance in both image quality metrics (such as FID, CLIP-Score, DPG, and HPS) and computational efficiency during inference.
Strengths and Weaknesses
Strengths
- Clear Motivation: Empirical analysis (Figure 2a) demonstrates spatial non-uniformity in image token distributions, justifying entropy-aware methods
- Effective Method: Dynamic temperature mechanism well-linked to spatial entropy distributions, showing model-agnostic improvements across multiple AR generation models
- Comprehensive Evaluation: Thorough experiments with multiple metrics (FID, CLIP-Score, etc.) and complete ablation studies demonstrating method efficacy
- Practical Value: Entropy-aware speculative decoding reduces inference steps and latency by ~15% without quality loss
- Clear Visualizations: Effective illustrations of method improvements and comparisons with existing techniques
Weaknesses
- Limited Novelty: Significant overlap with prior entropy-based decoding work in text (e.g., EDT), with insufficient differentiation of novel aspects
- Generalizability Concerns: Experiments mainly focused on COCO2017 and few AR models; limited discussion of generalization to different datasets and non-natural image tasks
- Missing Failure Analysis: Insufficient discussion of scenarios where the method might degrade performance, especially in high-entropy edge cases
- Weak Theoretical Foundation: Lacks theoretical justification for entropy-to-temperature mapping, relying primarily on empirical settings
- Statistical Significance: Missing confidence intervals and significance testing for reported improvements
- Implementation Details: Some key implementation aspects under-explained, potentially hindering reproducibility
Questions
- Clarification on Entropy Calculation: In Section 3.2, is the entropy always calculated per-token across the entire vocabulary regardless of mask/scale, or is it pooled or smoothed spatially in any way for efficiency or noise reduction? If so, could the authors elaborate on implementation trade-offs and robustness?
- Extension Beyond AR Models: The method is said to be broadly applicable to “AR frameworks based on discrete token prediction,” but could the authors comment (with any preliminary evidence or arguments) on expected utility in hybrid AR-diffusion or continuous generative models, where tokenization or spatial entropy may be ill-defined or noisy?
- Failure Case Analysis: Could the authors provide additional visual or quantitative analysis of cases where the entropy-aware method causes degeneration or perceptual artifacts, especially in ambiguous or extremely high-entropy settings?
- Statistical Robustness: Are the improvements (especially in FID and DPG) significant over multiple runs or seeds? Please report confidence intervals or standard deviations if available. Would such variability affect conclusions?
Limitations
Yes
Final Justification
My concerns have been addressed. I will keep my current score.
Formatting Issues
N/A
Thank you for your valuable feedback! Below are our responses to your concerns:
1. Novelty: Thank you for your concern regarding novelty. While entropy has been extensively studied in autoregressive text generation (e.g., EDT [1]), we observe that there are key differences between autoregressive image generation and text generation, which make it inappropriate to directly apply these methods to image generation. In fact, EDT proposes reducing the entropy in low-entropy tokens to sharpen the distribution and increasing randomness in high-entropy regions to improve diversity. In contrast, our work focuses on the sampling strategy for autoregressive image generation and proposes increasing randomness in low-entropy regions, which is opposite to EDT’s approach and brings better image generation quality.
2. Generalizability: Due to resource limitations, it is challenging for us to systematically evaluate a wide range of scenarios. Besides the COCO dataset, we also report results on DPG and HPS—where DPG assesses the model’s ability to follow complex text prompts, and HPS evaluates the aesthetic quality of generated images. These metrics are commonly used to evaluate multiple aspects of image generation models. We believe they provide meaningful insights, and we plan to conduct more comprehensive evaluations in future work.
3. Failure analysis: Due to policy restrictions, we are unable to provide visualizations. In some models, such as Lumina-mGPT, high-entropy regions may actually reflect good diversity, and lowering temperature there can reduce visual quality. Encouraging diversity in low-entropy regions may also occasionally introduce artifacts. We will include further analysis and visual cases in future versions. For a brief discussion of limitations, please refer to Section E.3 in the supplementary material.
4. Theoretical foundation: Thank you for pointing this out. Indeed, the current entropy-based temperature adjustment is empirically designed. We believe this originates from the fact that continuous image information is discretized into token indices, and the information distributions of images and text differ significantly. Since generation quality is highly sensitive to these distributions and the learned distribution is often imperfect, we apply further adjustment based on token-level entropy. As we currently lack a suitable theoretical framework to rigorously analyze this phenomenon, we have instead provided empirical evidence in the main text and hope to establish a stronger theoretical foundation in future work.
5. Statistical significance: Thank you for pointing this out. We found that generation-related metrics are indeed sensitive to random seeds. To further analyze this, we randomly selected 10 seeds, ran the generation model 10 times (once per seed), and computed the mean and standard deviation of the results, as shown in the tables below. The random seeds have little impact on the performance improvement brought by our method, which further supports its positive effect on performance.
LlamaGen:
| Metric | Baseline (mean) | Ours (mean) | Baseline (std) | Ours (std) |
|---|---|---|---|---|
| FID | 21.79 | 20.24 | 0.161 | 0.086 |
| CLIP | 25.95 | 25.95 | 0.022 | 0.036 |
| DPG | 43.74 | 48.87 | 0.348 | 0.326 |
| HPS | 21.23 | 21.40 | 0.031 | 0.042 |
Meissonic:
| Metric | Baseline (mean) | Ours (mean) | Baseline (std) | Ours (std) |
|---|---|---|---|---|
| FID | 53.31 | 48.18 | 0.326 | 0.285 |
| CLIP | 25.30 | 25.62 | 0.015 | 0.025 |
STAR:
| Metric | Baseline (mean) | Ours (mean) | Baseline (std) | Ours (std) |
|---|---|---|---|---|
| FID | 35.48 | 33.12 | 0.291 | 0.494 |
| CLIP | 25.47 | 25.63 | 0.024 | 0.027 |
Note that due to time constraints, we were unable to complete this analysis for all models on all metrics, but we will include the full results in a future version.
6. Clarification on entropy calculation & Implementation details: The entropy described in Section 3.2 is computed independently for every spatial token—specifically, we calculate the entropy of the probability distribution used when generating each token, without applying any denoising or spatial smoothing strategies. Our code will be open-sourced later.
7. Hybrid AR and continuous models: This is an interesting question. The concept of entropy is specific to discrete token-based generation, where each token has an associated probability. This issue does not naturally arise in continuous generation frameworks.
Of course, we believe there can be some alternative ideas in continuous visual generation. For example, the sampling process in discrete models is somewhat analogous to the noise injection process in continuous methods. In multi-step SDE-based models, it may be possible to apply different noise levels to different regions of the image based on content complexity, which could potentially improve generation quality. However, since entropy is not defined in continuous space, other indicators like variance may be needed to guide such adjustments.
For hybrid discrete-continuous models, it depends on how the two parts are combined. Our method is designed to balance diversity and stability in discrete generation. If the discrete part is responsible for generating fine details (as in models like HART [2]), our approach may still be helpful. Conversely, if the discrete AR component is only used to generate high-level (like BLIP3-o [3]) or abstract information, our method may not be effective in that case. The effect of noise in the latent space on entropy remains unclear and would require further investigation.
Reference
[1] EDT: Improving Large Language Models’ Generation by Entropy-based Dynamic Temperature Sampling
[2] HART: Efficient Visual Generation with Hybrid Autoregressive Transformer
[3] BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
Thank you for your reply. I would like to hold my current positive score on this paper.
Thank you very much for your reply. The issues and suggestions you pointed out have been of great help to us, allowing us to further improve our work and inspiring potential research directions for the future. Once again, thank you!
The paper proposes a new approach to improve the inference of autoregressive (AR) image generation models by leveraging entropy-based sampling strategies. The authors demonstrate that image tokens exhibit low information density and a non-uniform spatial distribution. To address this, they introduce an entropy-informed decoding strategy including (a) an entropy-based sampling temperature for better quality of generated images, and (b) an entropy-aware acceptance rule in speculative sampling for better acceleration. The proposed method is validated on the COCO2017val benchmark and various frameworks.
Strengths and Weaknesses
Strengths
- The proposed method demonstrates consistent performance and improved inference speed across multiple generation models, including standard AR, mask-based, and VAR models.
- The paper provides thorough quantitative and qualitative analyses that illustrate the entropy characteristics of visual generation models that use discrete tokens and demonstrate the effectiveness of the proposed method.
- This paper is well-written and easy to follow.
Weaknesses
- The paper lacks a theoretical analysis of whether the proposed speculative sampling strategy (Section 3.4) preserves the original token distribution or quantifies the error bound of potential quality loss.
- The paper lacks an evaluation on the COCO2014val benchmark, which contains 30k images, to provide precise FID values.
- The paper does not investigate whether the proposed entropy-aware temperature could benefit NLP models.
Questions
Please refer to weakness.
Limitations
yes
Final Justification
Thank you for your rebuttal. This rebuttal has solved most of my problems. Experiments have proven the proposed method works well across multiple baselines. Considering it lacks a theoretical proof, I'll hold a Borderline accept score.
Formatting Issues
N/A
Thank you for your recognition and correction. Here are our responses to your concerns:
1. Token distribution of speculative strategy: Thanks for pointing this out. Our proposed speculative decoding modification does not fully preserve the original model’s token distribution. In fact, we found that appropriately adjusting the token probability distribution of the autoregressive generation model—such as with the entropy-aware temperature method described in our approach—helps produce higher-quality images. Specifically, the scale term in Equation 8 of the main paper encourages the acceptance of low-entropy tokens by increasing randomness in their distribution, while the random term sharpens the sampling distribution for high-entropy tokens by suppressing excessive randomness.
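For illustration only, the snippet below sketches one way such an entropy-modulated acceptance test could look in PyTorch: the verifier's distribution is re-tempered according to its entropy before the usual min(1, p/q) check. The function name, the interpolation constants, and the exponential mapping are our own assumptions; this is not the paper's Equation 8.

```python
import torch
import torch.nn.functional as F

def entropy_aware_accept(target_logits, draft_probs, draft_token,
                         tau_min=0.8, tau_max=1.5, scale=4.0):
    """Accept or reject one drafted token under an entropy-adjusted verifier distribution.

    target_logits: [vocab] verifier (target model) logits at this position.
    draft_probs:   [vocab] draft model's sampling distribution at this position.
    draft_token:   int, token proposed by the draft model.
    """
    p = F.softmax(target_logits, dim=-1)
    entropy = -(p * torch.log(p.clamp_min(1e-12))).sum()
    # Softer distribution (temp > 1) for low-entropy tokens, sharper (temp < 1) for high-entropy ones.
    temp = tau_min + (tau_max - tau_min) * torch.exp(-entropy / scale)
    p_adj = F.softmax(target_logits / temp, dim=-1)

    accept_prob = torch.clamp(p_adj[draft_token] / draft_probs[draft_token].clamp_min(1e-12), max=1.0)
    return bool(torch.rand(()) < accept_prob)
```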
2. COCO2014val dataset evaluation: Thank you for your comment. Here, we provide FID and CLIPScore results on the COCO2014val dataset, shown in the table below:
| Model | Method | FID | CLIP |
|---|---|---|---|
| Llamagen | Baseline | 13.37 | 0.2561 |
| Llamagen | Ours | 11.59 | 0.2560 |
| Meissonic | Baseline | 44.54 | 0.2567 |
| Meissonic | Ours+Mask | 38.95 | 0.2590 |
| STAR | Baseline | 24.86 | 0.2581 |
| STAR | Ours+Scale | 22.09 | 0.2598 |
| Lumina-mGPT* | Baseline | 20.78 | 0.2641 |
| Lumina-mGPT* | Ours | 18.84 | 0.2660 |
Note that * means result from a subset with around 14,000 images.
Due to time constraints, the Lumina-mGPT model has not completed the standard 30k-image inference; we report its metrics on a subset of ~14,000 images and commit to including the full COCO2014val results in a future version.
Additionally, since the images generated by the Meissonic model differ significantly from the COCO style, we also provide its performance on the MJHQ-30k dataset: our proposed method reduces the FID on MJHQ from 20.50 to 17.96 and improves the CLIP-score from 27.33 to 27.61.
3. Entropy and NLP models: Thank you for your insightful comment. While entropy has been extensively studied in the context of NLP, especially in works such as EDT [1] and SED [2], our study focuses on the entropy distribution of tokens and the corresponding sampling strategies in autoregressive visual generation—a perspective that, to the best of our knowledge, has not been thoroughly explored in prior work. For instance, EDT recommends reducing the entropy of low-entropy tokens to sharpen distributions and increasing randomness in high-entropy regions to promote diversity, while SED emphasizes that high-entropy tokens are more likely to lead to confusion and require careful handling. Moreover, several recent GRPO strategies also incorporate entropy-aware mechanisms. We believe that our visual generation setting poses distinct challenges that motivate a different design choice, as discussed in our work.
Reference:
[1] EDT: Improving Large Language Models’ Generation by Entropy-based Dynamic Temperature Sampling
[2] SED: Self-Evaluation Decoding Enhances Large Language Models for Better Generation
Thank you for your rebuttal. This rebuttal has solved most of my problems. Experiments have proven the proposed method works well across multiple baselines. Considering it lacks a theoretical proof, I'll hold a Borderline accept score.
Thank you very much for your follow-up review. We are glad that our rebuttal has addressed most of your concerns and that the experimental results have been recognized. We also appreciate your constructive feedback on the lack of theoretical analysis, which will guide our future work.
The authors hypothesize that, given the lower information density of image tokens and their spatially non-uniform distribution, entropy is a good indicator for balancing diversity and accuracy in the generations. Based on this premise, they propose a new decoding approach which, simply put, tries to increase diversity in easier regions of the image while prioritising accuracy in detailed regions. The authors present detailed evaluations and ablations to justify the various engineering choices they make in their approach.
Strengths and Weaknesses
Strengths:
This is a well written paper that methodically identifies the distinction between text and image token distributions, and tries to address them through the lens of entropy. The method is clearly explained and intuitively makes sense. The paper also includes experiments on a variety of methods involving next token generation, masked generation and scale wise generation. The paper also makes a lot of engineering choices, which are well addressed in the ablations. The latency benefits as well as improved metrics through a purely new decoding strategy is a positive step for autoregressive visual generation.
Weaknesses, Questions and Suggestions:
- Eqn 6b is slightly confusing; I believe the LHS should have a \tilde, perhaps like Eqn 3, which is then carried forward appropriately?
- I appreciate the visualization of entropy maps after the proposed methods has been applied in the supplementary, can you also add histograms from similar regions (background and foreground) after your method -- as in Fig. 3a. Do you see a more uniformly distributed histogram now given that for low density regions, diversity has been prioritised?
- I understand that easier regions have better diversity using this method, but in the visuals it seems that the dense-information regions (foreground) are also better in some sense, increasing the overall image quality, even though the accuracy factor has more emphasis there. Can the authors shed some light on what specifically happens in the foreground part? I would have expected those areas to be almost the same as with the base method.
- For completeness, can the authors compare their improvements with the current SOTA visual generation methods as well, just as a reference to understand where their improvements in AR generation stand compared to the diffusion/flow methods that currently lead the field?
Questions
Please see the strengths and weaknesses section. I would be happy to raise my score if my concerns are properly addressed.
Limitations
yes
Final Justification
Based on the provided rebuttal, as well as the contents in the original paper, the authors present a strong entropy driven framework that nicely balances diversity and quality. Given that this idea can be potentially applied in a variety of visual generation settings, I support this paper's acceptance, and will raise my overall score to 5 (Accept).
Formatting Issues
NA
Thanks for your affirmation. We respond to your concerns as follows:
1. Typo in equation: Thank you for pointing this out. We agree that Equation 6(b) may cause confusion; as you suggest, the LHS should carry a tilde (as in Equation 3), which is then carried forward appropriately. We will correct this in the revised version.
2. Proposed method on histogram: After applying the proposed entropy-aware sampling method, we observed a more concentrated entropy variance—that is, the distribution of token entropy across the dataset becomes more focused, with fewer extremely high or low-entropy tokens and more tokens falling in the middle range.
While we cannot provide visualizations due to policy constraints, similar examples are shown in Figure 6 of the supplementary material. Compared to the baseline, which tends to generate images with flatter backgrounds and less detailed foregrounds (with entropy concentrated in certain regions), our sampling strategy leads to a more balanced spatial distribution of entropy, resulting in more visually diverse and structured outputs. We will add more related analysis in future versions to further support our observations.
3. What happens in the foreground: Our method not only increases randomness in low-entropy regions to promote diversity, but also reduces it in high-entropy regions to ensure stability. According to Equation 2 in the main text, when entropy is large the temperature approaches its lower bound, which is typically less than 1; this lowers randomness and encourages more confident sampling in complex (foreground) areas. Conversely, when entropy is small, the temperature approaches its upper bound, which is usually greater than 1, introducing more randomness to help generate diverse content in simpler regions. Thus, for these simpler regions, the model’s stochastic exploration is encouraged to enhance the richness of the generated image content, while for more uncertain regions (those with higher entropy) the model adopts a more conservative sampling strategy to ensure stable generation. Together, this approach improves the consistency between image and text while maintaining generation diversity.
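Since Equation 2 itself is not reproduced in this thread, the following is only a minimal sketch of the behaviour described above, assuming a simple exponential interpolation between an upper temperature (approached at low entropy) and a lower one (approached at high entropy); the constants, function name, and functional form are our assumptions, not the paper's parameterization.

```python
import math

def dynamic_temperature(entropy: float,
                        tau_min: float = 0.8,   # assumed lower bound, approached at high entropy
                        tau_max: float = 1.5,   # assumed upper bound, approached at low entropy
                        scale: float = 4.0) -> float:
    """Monotonically map token entropy to a sampling temperature."""
    w = math.exp(-entropy / scale)  # w ~ 1 at low entropy, ~ 0 at high entropy
    return tau_min + (tau_max - tau_min) * w
```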
4. Comparison with SOTA diffusion methods: Thanks for pointing this out. Here we provide quantitative results for several diffusion and flow models, including SD v2.1, PixArt-Alpha, SDXL, and SD v3, on FID, CLIP-score (COCO 2017 validation), DPG, and HPS for comparison, as shown in the table below:
| Method | FID | CLIP | DPG | HPSv2.1 |
|---|---|---|---|---|
| SDv2.1 | 22.87 | 26.31 | 68.09 | 26.38 |
| PixArt-alpha | 33.23 | 25.70 | 71.52 | 30.04 |
| SDXL (base-1.0) | 23.20 | 26.46 | 74.21 | 28.54 |
| SDv3-medium | 29.82 | 26.24 | 85.85 | 30.22 |
| LlamaGen (Baseline) | 21.94 | 25.95 | 43.51 | 21.24 |
| LlamaGen (Ours) | 20.36 | 25.96 | 48.63 | 21.39 |
| Lumina-mGPT (Baseline) | 29.15 | 26.04 | 79.68 | 28.92 |
| Lumina-mGPT (Ours) | 27.44 | 26.25 | 79.77 | 28.87 |
| Meissonic (Baseline) | 53.61 | 25.27 | 63.83 | 29.33 |
| Meissonic (+Prob.) | 48.37 | 25.49 | 66.19 | 29.94 |
| Meissonic (+Masking) | 48.43 | 25.54 | 67.08 | 30.04 |
| STAR (Baseline) | 35.05 | 25.43 | 70.25 | 28.79 |
| STAR (+Prob.) | 32.75 | 25.56 | 70.83 | 28.93 |
| STAR (+Scale-wise) | 32.37 | 25.61 | 70.86 | 29.06 |
I thank the authors for their responses, and based on this additional info, I am happy to inc my score to 5. Thanks for the good work!
We sincerely thank the reviewer for the positive feedback and the recognition of our work. We are glad that our additional information and clarifications have addressed your concerns, and we truly appreciate your willingness to increase the score.
This paper presents an entropy-informed decoding strategy to enhance autoregressive (AR) image generation, addressing lower information density and non-uniform spatial distribution of image tokens. The approach introduces dynamic temperature control and entropy-aware speculative decoding, improving quality and speed. Reviewers praised its clear motivation, model-agnostic effectiveness across AR frameworks, comprehensive experiments, and practical latency reductions (~15% inference cost). Concerns included limited theoretical foundation for entropy-temperature mapping, generalizability beyond tested datasets/models, missing failure analysis, and modest speedup magnitude. However, the authors addressed key questions in their rebuttal, clarifying implementation details and validating robustness. Despite minor limitations, the work makes valuable contributions to AR image generation by leveraging entropy to balance diversity and accuracy, with strong empirical support. Its practical improvements and broad applicability justify recognition. The AC concurs this is a worthwhile advancement and recommends acceptance.