PaperHub
NeurIPS 2024 (Poster)
Overall rating: 6.7/10 (3 reviewers; min 6, max 7, std 0.5)
Individual ratings: 7, 7, 6
Confidence: 3.3 · Correctness: 3.0 · Contribution: 3.3 · Presentation: 3.0

Unlocking the Capabilities of Masked Generative Models for Image Synthesis via Self-Guidance

Submitted: 2024-05-12 · Updated: 2024-11-06
TL;DR

We propose a guided sampling technique for masked generative models and empirically demonstrate its effectiveness for image generation.


Keywords

Image synthesis, discrete diffusion models, masked generative models, sampling guidance, parameter-efficient fine-tuning

Reviews and Discussion

Review (Rating: 7)

This work proposes a self-guidance method for masked generative models to improve the diversity and quality of class-conditional image generation. The main challenge is to design a semantically meaningful smoothing for the discrete VQ token space such that coarse-scale information can be extracted. The authors propose an auxiliary task, error-token correction, to minimize fine-scale details, and utilize TOAST for efficient model fine-tuning.

Strengths

  • The research problem of introducing self-guidance in discrete space is well-motivated. The background section is well-written, and the discussion of related work is thorough. I find the analogy to guidance of continuous diffusion models very helpful.
  • The figures are well-made, such as the qualitative visualization in Fig. 1.
  • The quantitative metrics from the experiments also demonstrate the effectiveness of the method. The ablation studies on different hyperparameters are informative.

Weaknesses

  • From the qualitative visualizations alone (Fig. 4), the proposed method does not seem to produce very different outputs compared with MaskGIT (the middle column). I wonder if the benefit of the guidance is stronger at other resolutions, such as 128×128 or 64×64.
  • Some figures could be added to accompany Section 3.2 to better illustrate the process of the auxiliary task.

Questions

  • What is $\mathbf{m}_i$ in equation (7)? Could you provide more intuition or visualizations for the claims in lines 192-195? Why does this auxiliary task naturally minimize fine-scale details?
  • What is the range of the experimented temperature values in Fig. 3? From Fig. 3, it seems that the proposed sampling method performs very differently depending on the choice of temperature. The authors also mention that they choose different temperatures based on the resolution and sampling steps. I wonder why the proposed method is so sensitive to temperature.
  • Have the authors tested the model on the 128x128 resolution ImageNet benchmark? What does the FID/IS curve look like?
  • What happens with strongly guided samples? For instance, for classifier-free guidance in diffusion models, strongly guided samples exhibit saturated colors. I wonder if there is any artifact related to this proposed method when the choice of guidance scale or sampling temperature is high?

Limitations

The authors already mention some directions for future work, such as demonstrating the guidance on text-conditioned image generation.

Author Response

We appreciate your thorough review. We address your concerns and questions below. All visual results can be found in the submitted PDF. Please zoom in on the figures of the submitted PDF for the best view.

**Clarified visual comparison.**

In Fig. 6 of the submitted PDF, we directly compare the sampled images using MaskGIT and the proposed method with the same random seed on ImageNet 256 and 512 resolutions. With the proposed guidance, fine-scale details are enhanced, generating higher-quality images. Please refer to the Global Response for a detailed description.

**Impact of guidance at lower resolutions.**

Since the proposed guidance utilizes semantic smoothing, its effectiveness can be affected by the resolution of the dataset or the input token length. We first note that the proposed guidance is built upon a pre-trained vector quantizer and MGM. However, we could not find publicly available vector quantizers and MGMs trained on ImageNet at 128 resolution. We tried to train both VQGAN and MaskGIT, which must be trained sequentially due to their dependency. However, within the limited time frame and computational budget, we were unable to obtain reliable results from MaskGIT, whose outputs have very low visual quality (an inception score of around 20).

Although the VQGAN and MaskGIT pretraining has not converged, we measure the performance improvement obtained by attaching the proposed guidance to this MaskGIT at ImageNet 128 resolution. We measure IS (larger is better) and FID (lower is better) with a sampling temperature of 2.0. The IS increases from 21.2 to 29.5, and the FID decreases from 44.0 to 33.6. Although the hyperparameter search is limited, the FID improvement (23%) is substantial, though smaller than the improvements at 256 (50%) and 512 (36%) resolutions. We expect the performance improvement at lower resolutions to be comparable when a well-pretrained quantizer and MaskGIT are used and the sampling hyperparameters are suitable.

**More intuition for the auxiliary task and visualization.**

Thank you for your recommendation, which helps us better illustrate the proposed auxiliary task. We thought it would be beneficial to share deeper insights into the underlying intuition and the visual results with all reviewers, so we have provided our response in the Global Response. Thank you once again for your valuable recommendation!

**Choice of sampling temperatures.**

The sampling temperatures used in Fig. 5 of the main manuscript are listed below:

  • 256: [3, 4, 5.5, 15, 20, 23, 25, 30, 40, 60, 80]
  • 512: [6, 10, 15, 20, 25, 30, 35, 40, 45, 50, 70, 90]

The sampling process of MGMs is known to be sensitive to the choice of various sampling hyperparameters [1]. Sampling temperature plays a crucial role in resolving the multi-modality problem, as explained in Section 2. Thus, a moderate sampling temperature can increase sample quality and diversity simultaneously. Beyond some point, however, increasing the temperature increases diversity while quality deteriorates. This is because a large sampling temperature injects strong randomness into the re-masking of MGMs ($p(m_{t-1} \mid m_t, \hat{x}_{0,t})$ in line 99), potentially masking relatively accurate tokens while leaving unrealistic tokens unmasked. As a result, the sampling of MGMs is sensitive to the sampling hyperparameters. We believe that our extensive ablation experiments can provide valuable insights for the community.
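
As a minimal illustration of how temperature enters the re-masking step, the sketch below follows the common MaskGIT-style confidence-plus-Gumbel-noise token selection; the function name and exact form are our illustrative assumptions, not the paper's implementation:

```python
import torch

def remask(logits: torch.Tensor, num_keep: int, temperature: float) -> torch.Tensor:
    """Return a boolean mask (True = keep token) for one MGM sampling step."""
    probs = logits.softmax(dim=-1)                      # (tokens, vocab)
    sampled = torch.multinomial(probs, num_samples=1)   # sample x_hat_{0,t}
    confidence = probs.gather(-1, sampled).squeeze(-1).log()
    # temperature-scaled Gumbel noise perturbs the confidences; with a large
    # temperature, accurate tokens can be re-masked while poor ones survive
    u = torch.rand_like(confidence).clamp_(1e-9, 1 - 1e-9)
    score = confidence + temperature * (-torch.log(-torch.log(u)))
    keep = torch.zeros_like(score, dtype=torch.bool)
    keep[score.topk(num_keep).indices] = True           # keep top-k, re-mask the rest
    return keep
```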

**Impact of strong guidance and high sampling temperature.**

In Fig. 11 of the submitted PDF, we compare samples from our default configuration with those using strong guidance ($s=5$) and a high sampling temperature ($\tau=50$). We found that strong guidance often leads to undesirably highlighted fine-scale details (left column in Fig. 11 b) or saturated colors (right column in Fig. 11 b), similar to a large CFG scale. High temperatures often lead to the collapse of the overall structure. As an extension of Fig. 5 a of the main manuscript, we further plot the FID and IS curves using larger guidance scales in Fig. 6 d. A large guidance scale decreases fidelity and diversity due to the phenomena mentioned above. We will add this analysis and figure to the final manuscript for a thorough investigation and further insight into the proposed guidance.

**Further clarification.**

The $\mathbf{m}_i$ in Eq. 7 denotes whether the $i$-th input token of $z_t$ is masked: $\mathbf{m}_i = 0$ if the $i$-th input token is masked and $\mathbf{m}_i = 1$ otherwise. Since the masked indices of $z_t$ and $x_t$ are identical, the $\mathbf{m}_i$ in Eq. 7 equals the $\mathbf{m}_i$ in Eq. 1. We will clarify this in the final manuscript.


[1] Chang, Huiwen, et al. "Muse: Text-to-image generation via masked generative transformers." arXiv preprint arXiv:2301.00704 (2023).

Comment

Thank you to the authors for their detailed response and the valuable insights provided on the intuition behind the proposed method. Based on this additional information, I have decided to increase my rating to 7: Accept.

Review (Rating: 7)

This paper explores the use of masked generative models (MGMs) for image generation. While these methods are typically efficient, they sometimes fall short of diffusion models in terms of quality. To reduce this gap, the paper proposes a self-guidance algorithm, which is further improved with semantic smoothing. These ideas are shown to be effective for improved image generation using MGMs. The method is compared against a wide variety of baselines on class-conditional generation using common metrics.

Strengths

  • This paper tackles an important problem in image generation: the use of masked generative models for efficient image generation. This family of models has been shown to be more efficient than diffusion models but has so far struggled to match their quality. This paper improves the quality of these models, making them competitive in quality while preserving their efficiency benefits.
  • At the core of the method lies a self-guidance mechanism, which I believe is a sound and well-motivated idea that could potentially be incorporated into generative models for other applications.
  • The ideas in the paper are simple in a good way, and they are not particularly tied to a specific application. I believe they could be generalizable to many other problems.
  • I believe the related work is well analysed, as far as I am familiar with previous work on this topic.
  • Evaluation is extensive, many different hyperparameters are evaluated with respect to their impact on generation quality.

Weaknesses

  • The paper is at times hard to follow, lacking clarity in some sections (particularly Section 5). Images and figures are sometimes small and hard to see.
  • I do not believe the implementation details are sufficient for reproducibility. Code will not be provided, which will make this work hard to reproduce and will reduce its potential impact and room for future comparisons.
  • Given that a particular benefit of the proposed method is its computational efficiency compared to diffusion-based approaches, I believe a detailed analysis of this efficiency should be included.
  • I believe the paper would benefit from showing results on text-to-image generation.

Questions

  • How would this method generalize to text-to-image methods or image-to-image translation problems?
  • How tied is this method to VQGAN? Could other encoders be used? If so, what would be their impact?

Limitations

The paper is fine in this regard. It could, however, provide more details on how the limitations could be addressed in the future.

Author Response

Thank you for your review. We address your concerns below. All visual results can be found in the submitted PDF. Please zoom in on the figures in the PDF for the best view.

**Clarifications of implementation details, figures, and code release.**

Thank you for your suggestion to make our research clearer and more reliable. We will clarify and magnify the figures in the final manuscript. We will add more visual results for a clear comparison of the proposed method to the base MaskGIT model, as illustrated in Fig. 6 of the submitted PDF. Please refer to the Global Response for a detailed description.

The proposed method consists of two parts: MaskGIT and TOAST. Since we have not changed the MaskGIT architecture, we briefly explain the implementation details of TOAST below. The TOAST module consists of three parts: a token selection module, a channel selection module, and linear feed-forward networks.

  • (i) The token selection module selects task- or class-relevant tokens by measuring similarity with a learnable anchor vector $\xi_c$. We generate the class-conditional anchor $\xi_c$ with simple class-conditional MLPs.

  • (ii) Channel selection is applied with a learnable linear transformation matrix $P$. The output of the token & channel selection module is then calculated as $z_i = P \cdot \mathrm{sim}(z_i, \xi_c) \cdot z_i$, where $z_i$ denotes the $i$-th input token.

  • (iii) After feature selection, the output is processed with $L$ MLP layers, where $L$ equals the number of Transformer layers. The output of the $l$-th MLP block is added to the value matrix of the attention block in the $(L-l)$-th Transformer layer (top-down attention steering); a minimal sketch of this module follows this list. Following the previous work [1], we add a variational loss to regularize the top-down feedback path. A more detailed process and theoretical background can be found in [1].
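
To make the three steps concrete, here is a minimal PyTorch sketch of such a token/channel selection module with top-down feedback. All class and attribute names are our illustrative assumptions, not the authors' code or the official TOAST implementation:

```python
import torch
import torch.nn as nn

class TopDownSteering(nn.Module):
    """Sketch of a TOAST-style selection module (hypothetical names)."""

    def __init__(self, dim: int, num_classes: int, num_layers: int):
        super().__init__()
        # (i) class-conditional MLP producing the anchor vector xi_c
        self.anchor_mlp = nn.Sequential(
            nn.Embedding(num_classes, dim),
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim),
        )
        # (ii) learnable channel-selection matrix P
        self.P = nn.Linear(dim, dim, bias=False)
        # (iii) L MLP blocks, one per Transformer layer
        self.mlps = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(num_layers)
        )

    def forward(self, z: torch.Tensor, class_id: torch.Tensor) -> list[torch.Tensor]:
        # z: (batch, tokens, dim); class_id: (batch,)
        xi_c = self.anchor_mlp(class_id).unsqueeze(1)        # (B, 1, D)
        sim = torch.cosine_similarity(z, xi_c, dim=-1)       # (B, T) token relevance
        z = self.P(sim.unsqueeze(-1) * z)                    # z_i = P * sim(z_i, xi_c) * z_i
        feedback = []
        for mlp in self.mlps:
            z = mlp(z)
            feedback.append(z)
        # the l-th MLP output steers the (L - l)-th Transformer layer,
        # added to that layer's value matrix
        return feedback[::-1]
```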

Most of the implementation can be found in the official TOAST repository. Since our work is also based on a public repository, as mentioned in the main manuscript, we believe the proposed guidance can be easily implemented and applied in various code bases. We will add the implementation details to the final manuscript for understandability and reproducibility. We also plan to release the source code as soon as we complete the documentation.

**Measuring the computational efficiency.**

We measure computational efficiency by timing sampling on an A6000 with a batch size of 50. In Fig. 9 of the submitted PDF, we plot sampling time against FID/IS values for diffusion-based methods and various MGMs. Note that we exclude Token-Critic and DPC since their code is not publicly available. Although each model has slightly different architectural implementation details, the order-of-magnitude smaller NFEs of MGMs (see Table 1 of the main manuscript) lead to a significantly more efficient sampling process than diffusion-based methods. With the proposed self-guidance, our method surpasses both diffusion-based approaches and MaskGIT in sampling efficiency and performance.

**Generalization to text-to-image methods or image-to-image translation.**

The proposed method can be generally adopted for various tasks of MGMs. We briefly explain how the proposed method can be extended to various tasks.

  • Text-to-image generation (T2I): Because the proposed guidance can be generally defined with a text condition, the proposed method can be easily extended to T2I generation by replacing the class-conditional module in TOAST with a text-conditional module. Apart from the architecture, the training and sampling strategies illustrated in Fig. 2 of the main manuscript can be adopted for text embeddings without modification.

  • Image-to-image generation (I2I): For image-to-image generation, such as style transfer, the proposed method can be adopted to further enhance sample quality. For instance, MaskSketch [2] utilizes pretrained MaskGIT to generate images from sketch input. The proposed method, fine-tuned on the target dataset, can additionally be attached for enhanced sample quality.

**Choice of the quantizer.**

Similar to continuous diffusion, the (discrete) latent space brings several strengths, such as semantic richness and computational efficiency. As a result, modern MGMs and various discrete-domain generative image models mostly utilize vector-quantization-based encoder-decoders such as VQGAN and VQVAE. In this regime, our work is built upon MGMs that operate on the VQ space. The latent spaces of various VQ-based encoders show similar characteristics, such as rich semantic information and, in contrast to pixel space, the lack of a continuous semantic structure. Furthermore, the proposed guidance and auxiliary task can be generally defined in any discrete space. As a result, we expect that the proposed guidance is not tied to a specific quantizer architecture and that the performance improvement would be similar.


[1] Shi, Baifeng, et al. "TOAST: Transfer Learning via Top-Down Attention Steering."

[2] Bashkirova, Dina, et al. "Masksketch: Unpaired structure-guided masked image generation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

Review (Rating: 6)

The paper focuses on enhancing masked generative models (MGMs) for image synthesis. The authors identify several reasons for the underperformance of MGMs: 1) lack of sequential dependencies, 2) the multi-modality problem, 3) compounding decoding errors, 4) limitations of low-temperature sampling, and 5) ineffective guidance techniques. To address these issues, the authors present novel techniques and methodologies to improve performance.

The main contributions of the paper are as follows:

  1. Generalized Guidance Formulation.
  2. Self-Guidance Sampling Method.
  3. Parameter-Efficient Fine-Tuning.

Strengths

  • Originality The paper presents a highly original approach by extending guidance methods from continuous diffusion models to MGMs. In particular, the use of a self-guidance sampling method and an auxiliary task for semantic smoothing in the VQ token space are novel approaches that address specific challenges in MGMs.

  • Quality The paper demonstrates a high level of technical quality. The authors have thoroughly explained why their methods can enhance performance and have supported their claims with extensive experiments and ablation studies on various variables.

  • Clarity The paper is clearly written and well-structured. The authors provide detailed context relative to prior work, facilitating a clear understanding of the background and contributions of the proposed approach. The detailed explanations of the equations and methodologies, accompanied by ample examples, significantly enhance the overall readability and accessibility of the paper.

  • Significance The contributions of the paper are significant for the advancement of the MGMs field. By addressing key challenges in MGMs, the proposed methods offer substantial improvements in the quality and diversity of generated samples.

Weaknesses

Some aspects are difficult to comprehend, particularly due to inadequate visual results. Section 3.3 asserts that TOAST is the most suitable model for the study's task but does not sufficiently explain why TOAST is the only suitable choice. The argument would be more convincing if the paper showed visual results from using other models, which would provide sufficient justification.

Explanations for the choice of hyperparameters are insufficient. While the paper demonstrates trade-offs in hyperparameter settings across some datasets in the ablation experiments, the sampling time steps do not exhibit such a trade-off. It is crucial to explain why these settings differ from those used for the ImageNet 512×512 dataset.

Questions

Most hyperparameters in Figure 5 show a trade-off between FID and IS scores. However, increasing the number of sampling time steps improves performance on both metrics. Could you explain why the sampling time steps were set to 12 and 18 in each experiment? Even though other methods are known to use hundreds of sampling time steps, even 50 seems quite small in comparison.

Limitations

The authors did not fully address the limitations of the paper but did mention the need for attention to AI ethics due to the rapid growth of generative models, touching on the societal impact. To improve the paper, we suggest the following:

  1. Although ImageNet is a comprehensive dataset, it is insufficient to prove generalizability. Therefore, conducting experiments on other datasets used in existing MGMs, such as MSCOCO or Conceptual Captions, will make the paper more complete.
  2. Include more visual results. To clarify the reasons for selecting specific models or hyperparameters used in the paper, it is much more effective to show them through visual results.

Author Response

Thank you for your comprehensive review and for identifying the ambiguous aspects of our experimental designs. We address your concerns and questions below. All visual results can be found in the submitted PDF. Please zoom in on the figure of the submitted PDF for the best view.

**Include more visual results for clarification.**

To clearly show the improvement of the proposed guidance over MaskGIT, we add more visual results in Fig. 6 of the submitted PDF. Please refer to the Global Response for a detailed description. We also visualize the impact of different sampling hyperparameters in Fig. 11 of the submitted PDF. A high guidance scale may deteriorate sample quality by overly enhancing fine details, and a high sampling temperature often causes the overall structure to collapse.

**Using another model for the auxiliary task.**

Although we argued that TOAST can be efficiently and naturally adopted to learn the auxiliary task, other PEFT methods can also be adapted to generate guidance in a similar way. Before presenting the validation, we note that the benefits of using TOAST are twofold:

  1. MGMs exhibit a training bias of only predicting the [MASK] tokens while treating other input tokens as ground truth. As a result, MGMs tend not to correct errors in the input tokens. We resolve this by simply masking all tokens in the 2nd stage of the TOAST prediction.

  2. Since the output of the 1st stage of TOAST is directly reused to sample $p_\theta(\hat{x}_{0,t})$, it brings more efficiency to sampling (see Fig. 2 of the main manuscript).

We experiment with a prompt-tuning-based approach using the proposed auxiliary loss. Prompt tuning shows reasonable performance for transfer learning of MaskGIT [1]. The prompt token length is set to 128 following [1], and the batch size is reduced to 128 due to memory capacity. We first forward $z_t$ with the prompt tokens to obtain $\mathcal{H}(z_t)$ in Eq. 7. Then, we forward MaskGIT with $\mathcal{H}(z_t)$ to calculate $\mathcal{L}_{aux}$ in Eq. 7. To train the model, we directly input $\mathcal{H}(z_t)$ after the embedding layer of MaskGIT.
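
For concreteness, a rough sketch of how such a prompt-tuning forward pass might look is given below; `embed` and `transformer` are assumed interfaces on the MaskGIT backbone, not the authors' actual code:

```python
import torch
import torch.nn as nn

class PromptTunedMGM(nn.Module):
    """Hypothetical prompt-tuning wrapper producing H(z_t) from a frozen MGM."""

    def __init__(self, maskgit: nn.Module, dim: int, prompt_len: int = 128):
        super().__init__()
        self.maskgit = maskgit
        # learnable prompt tokens, prepended after the embedding layer
        self.prompts = nn.Parameter(torch.randn(prompt_len, dim) * 0.02)

    def forward(self, z_t: torch.Tensor) -> torch.Tensor:
        # z_t: (batch, tokens) discrete VQ indices
        emb = self.maskgit.embed(z_t)                        # (B, T, D)
        prompts = self.prompts.expand(emb.size(0), -1, -1)   # (B, P, D)
        h = self.maskgit.transformer(torch.cat([prompts, emb], dim=1))
        return h[:, prompts.size(1):]                        # H(z_t), prompts dropped
```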

We set the guidance scale to 0.2 since we empirically found the resulting guidance to be too strong. The results in the table below and Fig. 8 of the submitted PDF show that the improvement is marginal compared to ours. We suspect that naïve prompt tuning does not suitably handle error correction due to the bias of MGMs (see the loss curve in Fig. 8). Furthermore, it requires three model forwards for each sampling step. With the above discussion and experiments, as well as the discussion in the main manuscript, we verify that TOAST is a more suitable approach than other PEFT methods in terms of performance and efficiency. We expect that a model designed more meticulously for the proposed guidance could further improve efficiency and performance, and we leave this for future work.

| Method | NFEs | FID ↓ | IS ↑ |
|---|---|---|---|
| MaskGIT | 18 | 6.56 | 203.6 |
| Prompt tuning w/ $\mathcal{L}_{aux}$ | 18×3 | 5.78 | 209.8 |
| Ours (TOAST) | 18×2 | **3.22** | **263.9** |

**Explanation for the choice of hyperparameter (sampling steps).**

In our preliminary research on the impact of sampling steps ($T$), we mainly investigated $T$ around 18, which is commonly adopted by prior works. We further explore larger $T$ in Fig. 10 of the submitted PDF using $T=24$ and $36$ with sampling temperatures $\tau=35$ and $60$, respectively. The results indicate that using $T$ larger than 18 does not ensure a performance increase, as the metrics either saturate or deteriorate despite higher computational costs.

Similar to the findings in MaskGIT, where the optimal performance-efficiency trade-off is observed around $T=8$, our method shows a "sweet spot" around $T=18$. Up to this point, both quality and diversity increase as sampling timesteps and computational costs increase, while outperforming MaskGIT with similar NFEs. This demonstrates that the proposed method is more scalable and efficient than MaskGIT. We found a similar phenomenon in the experiments on ImageNet 512×512, where the optimal timestep is around 12-18 steps. In this context, we opt for sampling timesteps of 12 and 18 for the reported results. We will add the above experimental results and our analysis to the final manuscript to give thorough insight into choosing the sampling steps.

**Generalization to more complex datasets.**

Thank you for your valuable comments. Since the suggested datasets are primarily utilized for text-to-image (T2I) generation tasks, pretrained VQGAN and MaskGIT are not publicly available.

Instead, the T2I MGM MUSE [2] could be utilized to demonstrate generalization. Though the main focus of the paper is improving the generative capabilities of MGMs in class-conditional image generation, the proposed guidance and the fine-tuning architecture can be extended to T2I generation. Given that the only difference is the replacement of the class condition with a text condition, we can simply substitute the class-conditional module in TOAST with a text-conditional module while maintaining the other fine-tuning and sampling strategies. (Please refer to our response to Reviewer Lc5P for detailed TOAST implementation.)

However, within the limited time frame, it is challenging for us to find reliable code (since the official MUSE is not publicly available), preprocess the large datasets, and ablate the generalizability of the proposed guidance. Therefore, we leave the generalization performance on T2I as future work.


[1] Sohn, Kihyuk, et al. "Visual prompt tuning for generative transfer learning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

[2] Chang, Huiwen, et al. "Muse: Text-to-image generation via masked generative transformers." arXiv preprint arXiv:2301.00704 (2023).

Author Response (Global Response)

Dear Reviewers and Area Chairs,

We thank the reviewers for their constructive feedback and are glad to incorporate their many helpful comments to clarify and complete our work. The reviewers agreed on the originality, motivation, soundness, and significance of the paper. At the same time, they raised concerns about the broader applicability of the proposed guidance, the clarity of our visual representations and implementation descriptions, and the effect of a more diverse set of hyperparameters.

Here, we first address the most common concerns and questions from the reviewers. Throughout all responses, we refer to the originally submitted paper as the **main manuscript** and the PDF file submitted during the rebuttal period as the **submitted PDF**. Due to the page limit, figures and captions in the submitted PDF may be small. Please zoom in on the figures for the best view.

**More clarified visual results.** (qSSj, Lc5P, cCTZ)

All reviewers suggested providing more visual results to better clarify, demonstrate, and strengthen the effectiveness of the proposed guidance. In response, we have sampled visual results from ImageNet at 512×512 and 256×256 resolutions, shown in Fig. 6 of the submitted PDF. We generate paired samples from MaskGIT and the proposed method by fixing all random seeds. The qualitative results show that the proposed guidance enhances fine details, generating samples with high fidelity. Detailed facial attributes and local structures are generated accurately with the proposed guidance, whereas MaskGIT fails to generate plausible images.

**More intuition for the auxiliary task and visualization of its impact.** (cCTZ)

Reviewer cCTZ raised a concern that further intuition or visualization for the auxiliary task is required to support the claim. We are glad to take these helpful comments to clarify our work. Though it is not a common concern, we respond here to share more intuition and insight about the proposed guidance with all reviewers. To learn the auxiliary task, we randomly replace a proportion of input tokens with error tokens, which act as semantic outliers among the input tokens. To capture these semantic outliers, we argue that the model implicitly learns to smooth the vicinity of the data $z_t$ by leveraging coarse information from the surrounding context while minimizing fine-scale details. As a simple example, consider detecting numerical outliers in a 1-dimensional signal: a straightforward way to correct outliers such as impulse noise is to measure the distance to the local mean and correct by smoothing the signal, e.g., with a low-pass filter. Similarly, to correct the unknown error tokens, which are semantic outliers, the model implicitly learns to smooth the input $z_t$.
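
As a toy illustration of this 1-D analogy (our own example, not from the paper), the snippet below corrects an impulse outlier by comparing each sample to its local mean:

```python
import numpy as np

def correct_impulse_noise(x: np.ndarray, window: int = 5, k: float = 3.0) -> np.ndarray:
    """Replace samples far from their local mean with the local mean itself."""
    pad = window // 2
    padded = np.pad(x, pad, mode="edge")
    # local mean via moving average (the "smoothed" signal)
    kernel = np.ones(window) / window
    local_mean = np.convolve(padded, kernel, mode="valid")
    # flag samples whose distance to the local mean exceeds k * global std
    outliers = np.abs(x - local_mean) > k * x.std()
    return np.where(outliers, local_mean, x)

t = np.linspace(0, 2 * np.pi, 200)
signal = np.sin(t)
signal[50] += 5.0                          # inject an impulse outlier
corrected = correct_impulse_noise(signal)
print(corrected[50])                       # spike pulled back toward the local mean
```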

To demonstrate the above process, we visualize the effect of guidance using guidance scales of 5 and -5. Given that the guidance logit is, by definition, a semantically smoothed output, positive guidance steers the sampling process toward enhancing fine details. Conversely, a negative guidance scale steers the sampling process toward reducing fine details. More specifically, with the auxiliary task, the guidance logit contains semantically smoothed information; thus, samples with a negative guidance scale exhibit reduced semantic information, such as fine details and local patterns. We visualize the effect of positive and negative guidance scales in Fig. 7 of the submitted PDF. We mask 80% of the input VQ tokens and visualize the one-step prediction outputs (b) without guidance, (c) with a positive guidance scale, and (d) with a negative guidance scale. The visualization shows that positive guidance enhances semantic fine details, while negative guidance reduces them, such as feather patterns and facial attributes. This demonstrates that the output logit trained with the auxiliary task effectively captures a semantically smoothed logit, and the resulting guidance helps to improve sample quality by enhancing fine details.
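
By analogy with classifier-free guidance (the paper's exact formulation may differ; this is only our hedged reading of the sign convention described above), the guided logits would extrapolate away from the smoothed logits:

```python
import torch

def self_guided_logits(logits: torch.Tensor, smoothed: torch.Tensor, s: float) -> torch.Tensor:
    # s > 0 pushes away from the semantically smoothed prediction (more fine detail);
    # s < 0 pulls toward it (fewer fine details), matching the Fig. 7 observation.
    return logits + s * (logits - smoothed)
```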

**Numerical error (typo) in the paper.**

We apologize for the error in the main manuscript. In Table 2 of the main manuscript, we found that the FID and IS values for our method should be corrected from 3.04 and 240.8 to 3.22 and 263.9, respectively, following the values in the main table (Table 1). We note that it does not impact the discussion and conclusions of our study since it still outperforms other ablation results. We appreciate your understanding!

Comment

Dear reviewers and authors,

After the authors' rebuttal, my concerns regarding the method and its reproducibility, the visual results, and other minor concerns have been adequately addressed. I am therefore leaning towards increasing my score.

Could the other reviewers provide feedback?

Final Decision

This paper received unanimous recommendations for acceptance from all three reviewers. In particular, the reviewers praised the novelty of using MGMs for more efficient image generation. However, as they noted, the performance on larger-scale datasets and higher-resolution images remains unknown. Therefore, the AC recommends acceptance as a poster presentation.