Improving Autoregressive Image Generation by Mitigating Gradient Bias in Softmax
Abstract
Reviews and Discussion
This paper presents Gradient Suppressed Softmax for autoregressive image generation. The authors show that multiple tokens can be valid at each generation step and that the traditional softmax compresses the probabilities of non-target tokens. They then design two new types of PMF for the softmax operation, which yield slightly better FID for image generation.
Strengths
- The studied problem is interesting. While auto-regressive image generation is popular, the sampling during inference is indeed an issue.
- Comprehensive ablation study of the proposed operations.
Weaknesses
- The proposed method feels more like a trick than a substantial contribution.
- The improvement in FID is really marginal. On larger models, IS decreases with the proposed method.
- Precision is missing from Table 2.
Questions
See above
This paper analyzes the negative impact of Softmax on current autoregressive generation tasks from the perspective of gradient bias in probabilistic activation functions. It points out that Softmax's over-penalization of high-probability non-target classes harms the diversity of autoregressive generation. It then proposes a new activation function, Gradient Suppressed Softmax (GS-Softmax), which reduces the gradient contribution of high-probability non-target classes. Finally, experiments demonstrate that this activation function improves both the diversity of generated content and optimization convergence.
Strengths
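To make the mechanism concrete: the sketch below is not the paper's GS-Softmax formulation (which is not reproduced in this review) but a minimal illustration of the general idea of damping the cross-entropy gradient sent to high-probability non-target classes, implemented here with a detach-based blend; `tau` and `alpha` are hypothetical hyper-parameters.

```python
import torch
import torch.nn.functional as F

def gradient_suppressed_ce(logits, target, tau=0.1, alpha=0.2):
    """Illustrative sketch only (not the paper's GS-Softmax formulation):
    cross-entropy whose backward pass scales the gradient reaching
    high-probability non-target logits by `alpha`, while the forward
    value stays unchanged. `tau` and `alpha` are hypothetical knobs."""
    with torch.no_grad():
        probs = F.softmax(logits, dim=-1)
        onehot = F.one_hot(target, num_classes=logits.size(-1)).bool()
        suppress = (probs > tau) & ~onehot  # high-probability non-target classes
    # Detach-based blend: same forward value as `logits`, but gradients
    # flowing through suppressed positions are multiplied by `alpha`.
    damped = alpha * logits + (1.0 - alpha) * logits.detach()
    logits_mod = torch.where(suppress, damped, logits)
    return F.cross_entropy(logits_mod, target)

# Toy usage: batch of 4 positions over a vocabulary of 8 tokens.
logits = torch.randn(4, 8, requires_grad=True)
target = torch.randint(0, 8, (4,))
gradient_suppressed_ce(logits, target).backward()
```

The forward pass is unchanged, so only the optimization signal pushing down those classes is reduced.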
- This paper offers a new perspective, namely the gradient behavior of Softmax during optimization, for discussing the issue of impaired diversity in autoregressively generated content.
- This paper proposes three criteria to guide the creation of probabilistic activation functions that conform to autoregressive characteristics, and introduces the corresponding Gradient Suppressed Softmax (GS-Softmax) function.
- The writing style of this paper is clear, making it easy to understand.
Weaknesses
- This paper lacks analysis and discussion of an important issue. In the training of traditional large language models, the richness and diversity of the training data can often mitigate the over-penalization caused by Softmax. Does the improvement proposed in this paper work only when data is limited? This needs to be explored and experimentally verified.
- The improvements in experimental results are too small. In all the experiments comparing Softmax and GS-Softmax, the enhancements in FID values are really minimal (for example, from 10.50 to 10.47), which seems to indicate that the impact of GS-Softmax on model performance is not critical or significant.
- There is a lack of necessary experiments to support the motivation. In principle, the improvement from GS-Softmax on text should be more significant than the FID gains on images, which would also support the reasoning behind the motivation. This experiment is important, yet the paper fails to provide it.
Questions
Justification For Recommendation And Suggestions For Rebuttal:
- Justification For Recommendation: see the paper strengths listed above.
- Suggestions For Rebuttal:
- The analysis of experimental results needs to be more detailed.
- Present more experimental data to support the motivations behind the paper.
Additional Comments For Authors:
To enhance clarity and persuasiveness, the authors should rectify vague descriptions and inaccuracies in the details.
This paper has the potential to question one of the most commonly used functions in deep learning. It introduces a variant of softmax that is claimed to improve generation diversity and convergence, particularly for autoregressive image models. The proposed gradient-suppressed softmax learns potential candidates for pixels rather than only the obvious ones during training. The claim is supported by demonstrating improved diversity (better diversity-oriented metrics such as sFID and FID) and convergence (assuming autoregressive models need training for roughly 300 epochs or more).
Strengths
- The problem is well understood and clearly articulated
- Limitations are considered and argued
- Supported by detailed experiments
Weaknesses
- Is this function scalable to larger context windows, or is it, as shown, fixed to the 256x256 and 384x384 resolutions?
- I'd like to know how this activation function generalizes to autoregressive language models, and how next-token diversity would vary with the context window.
- It would be informative to see the loss curves to verify the convergence trend of the models over 300 epochs (perplexity alone is only a single score).
- In addition to the "Eskimo dog" class, it would have been interesting to see evaluations for other classes as well.
Questions
- Please add GPT model details and citation
- Table 2: It is worth noting that the IS for larger models is better with Softmax
- Tables 3 and 6: use "IS" instead of "Inception" for consistency within the paper
- line#431: 0.8% increase in GPU seconds per iteration (not 0.7%)
- Table 5, right: I assume iteration is the epoch and not GPU seconds per training batch
- line#425: typo
This paper starts from the observation that the Softmax module in most models over-penalizes non-target classes with high prediction scores, which may be harmful for autoregressive tasks since multiple valid predictions exist. To alleviate this, the authors propose Gradient Suppressed Softmax (GS-Softmax), which reduces the gradient contributions of high-probability non-target classes. Experimental results show that the proposed module improves generation quality on image tasks.
Strengths
- The motivation and the proposed method are easy to follow.
- The proposed criteria for the gradient-suppressed softmax are general.
Weaknesses
- The main weakness of the paper is the validity of the motivation. The authors claim that the softmax function suppresses possibly valid non-target predictions. However, there are two concerns: 1) The authors do not provide any experimental evidence supporting this claim. During training or inference, does the model indeed produce only one high-probability prediction? 2) While the softmax function appears to suppress a non-target prediction based on its formulation, there may be instances in the dataset that share the same conditional context but have that prediction as the target; in such cases, it would not be suppressed by the model overall. (The standard softmax cross-entropy gradient is recalled after this list for reference.)
- Additionally, the experimental results presented are insufficient to demonstrate the effectiveness of the proposed method. Improvements across all four evaluation metrics are generally no more than 5%, and the visualization results in Figure 2 do not show any clear superiority.
- There is a minor issue with the use of confusing brackets in Equation (5). It is recommended to use \left( and \right) in LaTeX to enhance clarity.
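For reference when weighing the over-penalization argument above (this is the standard softmax cross-entropy result, not anything specific to the paper): with logits $z$, probabilities $p = \mathrm{softmax}(z)$, and target class $t$, the gradient is

$$
\frac{\partial \mathcal{L}_{\mathrm{CE}}}{\partial z_j} \;=\; p_j - \mathbf{1}[j = t],
$$

so every non-target class is pushed down in direct proportion to its current probability $p_j$, which is exactly the behavior questioned in the first point.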
Questions
- The paradigm of autoregressive generation originates from the NLP field, and the motivation and method seem equally valid there. Would the authors consider conducting experiments on natural language tasks? Such experiments could demonstrate the generalizability of the proposed method.
- Are there any figures illustrating the training loss curve that could support the claims made in Section 5.3 regarding training efficiency?
There is still room for improvement in this work, and we will continue our efforts to enhance it.