Ultra-Resolution Adaptation with Ease
Abstract
We present key factors for adapting a pre-trained text-to-image diffusion model to ultra-resolution with data and parameter efficiency.
Reviews and Discussion
This paper introduces URAE, a framework for efficiently adapting text-to-image diffusion models to ultra-high resolutions (e.g., 4K) while minimizing computational costs and data requirements. The approach is based on three key ideas:
- Data Efficiency: Using synthetic images generated by a teacher model improves convergence.
- Parameter Efficiency: Fine-tuning the minor singular components of weight matrices is more effective than traditional low-rank adaptation methods like LoRA.
- Classifier-Free Guidance (CFG) Control: Disabling CFG during adaptation (i.e., setting the guidance scale to $g = 1$) improves training consistency.
Experiments demonstrate state-of-the-art performance at 2K and 4K resolutions while requiring significantly fewer training samples and iterations.
Questions for Authors
This is good work, but I still have several general questions:
- Theoretical Justification: Can you formally analyze why tuning minor singular components leads to better adaptation performance in ultra-resolution tasks?
- Hyperparameter Sensitivity: How sensitive is the approach to the choice of singular component rank ($r$)? Would an adaptive selection method improve results?
- Applicability to Other Architectures: Have you considered applying this method to other generative models beyond diffusion models, such as GANs or autoregressive transformers?
- Computational Costs: Given that URAE is designed to be parameter-efficient, have you conducted inference speed and memory consumption benchmarks?
Claims and Evidence
- Claim: Ultra-resolution adaptation can be efficiently achieved without large-scale data or full model fine-tuning.
  - Evidence: Empirical results demonstrate that the method achieves competitive performance at 4K resolution with limited data and training iterations.
- Claim: Synthetic data can significantly accelerate convergence.
  - Evidence: Theoretical support via the bound $E[\|W_T - W^*\|_2^2] \leq E[\|(I - \eta M)^T\Delta_0\|_2^2] + \eta^2 \left(p(1-p)E[\delta^2]+(1-p)\sigma^2\right)\sum_{i=1}^{N}\frac{(1-(1-\eta\lambda_i)^T)^2}{\lambda_i}+p^2\|W_{ref}-W^*\|_2^2$.
  - Evidence: Experimental validation shows faster convergence with synthetic data augmentation (see the toy simulation sketch after this list).
- Claim: Tuning minor singular components of weights outperforms traditional LoRA-based fine-tuning.
  - Evidence: Ablation studies in Table 3 confirm that fine-tuning the minor singular components improves performance.
- Claim: Disabling classifier-free guidance during training is necessary.
  - Evidence: Table 2 shows significant degradation in performance when CFG is enabled during adaptation.
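The convergence claim for synthetic data above can be made concrete with a toy NumPy simulation. This is an illustrative sketch, not the paper's experimental setup: labels come either from a noisy "real" source or from a slightly biased but noise-free "teacher"; the dimensions, learning rate, noise level, and mixing probability `p_synth` are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, steps, lr = 16, 2000, 0.05
W_star = rng.normal(size=d)                     # ground-truth linear map (toy stand-in for W*)
W_ref = W_star + 0.02 * rng.normal(size=d)      # "teacher": close to the optimum, small bias
sigma = 0.5                                     # label noise on real data

def run(p_synth):
    """SGD on squared loss; each label comes from the noisy real source with
    probability (1 - p_synth) or from the teacher with probability p_synth."""
    W = np.zeros(d)
    for _ in range(steps):
        x = rng.normal(size=d)
        if rng.random() < p_synth:
            y = W_ref @ x                        # synthetic label: biased but noise-free
        else:
            y = W_star @ x + sigma * rng.normal()  # real label: unbiased but noisy
        W -= lr * (W @ x - y) * x                # gradient of 0.5 * (W·x - y)^2
    return np.linalg.norm(W - W_star) ** 2

for p in (0.0, 0.5, 1.0):
    print(f"p_synth={p}: final ||W_T - W*||^2 = {run(p):.4f}")
```

With an accurate teacher, increasing the synthetic proportion removes the noise-driven error term at the cost of a bias term proportional to the teacher's own error, which is the trade-off the bound above captures.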
Methods and Evaluation Criteria
- Data Generation: Uses synthetic images from pre-trained models (e.g., FLUX-1.1) at lower resolutions for training guidance.
- Parameter-Efficient Fine-tuning: The authors fine-tune the minor singular components of weight matrices instead of major components:
$
W = U\Sigma V, \quad W_{small} = U[:, -r:]\Sigma[-r:, -r:]V[-r:, :].
$
- Classifier-Free Guidance Control: CFG is disabled during training and only applied at inference:
$
\epsilon_{\theta}(z_t, t, \emptyset) + g \cdot (\epsilon_{\theta}(z_t, t, y)-\epsilon_{\theta}(z_t,t,\emptyset))
$
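For concreteness, the two ingredients above can be sketched in a few lines of PyTorch. This is a minimal illustration of the general idea rather than the authors' implementation; in particular, the square-root split of the minor singular values between the two factors and the `model(z_t, t, cond=...)` call signature are assumptions made for the example.

```python
import torch

def split_minor_components(W: torch.Tensor, r: int):
    """Split a weight matrix into a frozen major part and trainable minor factors.

    Returns (W_major, B, A) with W_major + B @ A ≈ W, where B @ A spans the r
    smallest singular directions; their small singular values keep the initial
    gradients of B and A numerically small.
    """
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    W_major = U[:, :-r] @ torch.diag(S[:-r]) @ Vh[:-r, :]   # frozen during adaptation
    B = U[:, -r:] @ torch.diag(S[-r:].sqrt())               # trainable factor
    A = torch.diag(S[-r:].sqrt()) @ Vh[-r:, :]              # trainable factor
    return W_major, B, A

def guided_prediction(model, z_t, t, y, g: float = 1.0):
    """Classifier-free guidance recombined only at inference; g = 1 (the setting
    used during adaptation) reduces to the plain conditional prediction."""
    eps_uncond = model(z_t, t, cond=None)
    eps_cond = model(z_t, t, cond=y)
    return eps_uncond + g * (eps_cond - eps_uncond)
```

In this sketch, only `B` and `A` would receive gradient updates during adaptation, while larger guidance scales `g > 1` are applied solely at sampling time.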
Evaluation Metrics:
- Quantitative: FID, LPIPS, MAN-IQA, QualiCLIP, HPSv2.1, PickScore.
- Qualitative: GPT-4o preference scores.
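As a brief illustration of how two of the quantitative metrics listed above are commonly computed, here is a generic sketch using the public `torchmetrics` and `lpips` packages; it is not the authors' evaluation code, and the batch shapes and random tensors are placeholders.

```python
import torch
import lpips
from torchmetrics.image.fid import FrechetInceptionDistance

# Dummy stand-ins for real and generated image batches: (N, 3, H, W), uint8.
real = torch.randint(0, 256, (8, 3, 299, 299), dtype=torch.uint8)
fake = torch.randint(0, 256, (8, 3, 299, 299), dtype=torch.uint8)

# FID compares Inception feature statistics of the two image sets.
fid = FrechetInceptionDistance(feature=2048)
fid.update(real, real=True)
fid.update(fake, real=False)
print("FID:", fid.compute().item())

# LPIPS is a pairwise perceptual distance; inputs are expected in [-1, 1].
lpips_fn = lpips.LPIPS(net="alex")
to_pm1 = lambda x: x.float() / 127.5 - 1.0
print("LPIPS:", lpips_fn(to_pm1(real), to_pm1(fake)).mean().item())
```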
Theoretical Claims
- Strengths:
  - The use of synthetic data to improve convergence is supported by a well-defined mathematical bound (Theorem 2.4).
  - The choice to tune minor singular components instead of major ones is motivated by SVD properties.
- Weaknesses:
  - The theoretical motivation for minor singular component tuning is not rigorously justified beyond empirical observations.
  - No formal analysis of convergence or stability guarantees of the fine-tuning method.
Experimental Designs or Analyses
- Strengths:
  - Comprehensive ablation studies validate key design choices.
  - Performance comparisons with state-of-the-art methods support claims of efficiency.
- Weaknesses:
  - Experiments are limited to diffusion models, leaving uncertainty about applicability to other generative models (GANs, autoregressive models).
  - The computational efficiency gains are not well-quantified in terms of inference latency and resource consumption.
Supplementary Material
Yes, the supplementary material was reviewed.
- Additional ablation studies and qualitative results were useful in supporting the claims.
- However, some details regarding computational efficiency (e.g., training time comparisons, memory usage) were missing.
Relation to Broader Scientific Literature
- The paper builds upon well-established literature in diffusion models and parameter-efficient adaptation, particularly works on LoRA and DreamBooth.
- Explicitly contrasts the proposed approach with previous fine-tuning strategies.
- Missing discussion on how this method compares with latent-space adaptation techniques used in recent diffusion models.
Essential References Not Discussed
- The paper thoroughly covers diffusion model fine-tuning literature but could benefit from:
- Comparison with other LoRA modifications (e.g., tuning minor components in other applications).
- Discussion on alternative ultra-resolution adaptation techniques, such as patch-based super-resolution models.
Other Strengths and Weaknesses
Strengths:
- The approach is practical and efficient, providing clear guidelines for ultra-resolution adaptation.
- The synthetic data augmentation strategy is well-supported theoretically and experimentally.
- Extensive benchmarking against diffusion models provides strong empirical validation.
Weaknesses:
- The paper lacks a formal explanation of why minor singular components are more effective than major ones. The results are compelling but require stronger theoretical justification.
- Computational overhead: While the method is designed to be efficient, there is no detailed profiling of inference efficiency (e.g., time per image, GPU memory consumption).
Other Comments or Suggestions
- Provide a deeper theoretical analysis on why minor singular component tuning is optimal for ultra-resolution tasks.
- Try to extend experiments to larger datasets (e.g., ImageNet, LAION-HR) to validate scalability.
- Analyze the impact on inference time to better quantify computational benefits.
We deeply thank Reviewer FjKy for the valuable comments and are glad that the reviewer finds our method practical, efficient, and empirically strong. We would like to address the concerns as below.
- Theoretical motivation for minor singular component tuning.
- We would like to supplement the following theoretical analysis towards tuning minor singular components, which will be included in our revision.
- Consider the low-rank adapter $W' = W_0 + BA$, where $W_0$ is frozen and $B$, $A$ are trainable. The loss is denoted as $\mathcal{L}$. Then, the gradients of $\mathcal{L}$ w.r.t. $B$ and $A$ are $\frac{\partial \mathcal{L}}{\partial W'}A^\top$ and $B^\top\frac{\partial \mathcal{L}}{\partial W'}$, respectively.
- In vanilla LoRA, $W_0$ is the original weight matrix $W$, while $A$ and $B$ are initialized as random values and zeros, respectively. Thus, in the initial adaptation stage, due to the joint influence of $A$'s random initialization and noise in data, the gradients of $A$ and $B$ can be highly random, potentially leading to instability.
- In our approach of tuning minor components, as shown in Eqs. 5 and 6 of the main manuscript, if derived by SVD, $W_0$ corresponds to the major components $U[:, :\!-r]\Sigma[:\!-r, :\!-r]V[:\!-r, :]$, while $B$ and $A$ are initialized from the minor components such that $BA = W_{small}$ initially. Consequently, the gradients of $B$ and $A$ are influenced by the minor components of the original weight matrix $W$, which tend to be numerically small and more stable compared to standard LoRA. For ultra-resolution adaptation, where major semantics and appearances remain unchanged, tuning minor components helps preserve knowledge in $W$ by effectively regulating the gradients of $B$ and $A$.
- No formal analysis of convergence or stability guarantees of the fine-tuning method.
- We conduct a theoretical analysis of the upper bound on the distance between the solution $W_T$ after $T$ iterations and the optimal $W^*$, where the bound is expressed via the SVD of the weights and $\lambda_{\min}$, the smallest non-zero eigenvalue of the Hessian matrix.
- This bound indicates that the training converges to its theoretical optimal solution at a linear rate. We will include the proof in the revision.
- Applicability to other models.
- Thanks. We conduct experiments on Infinity, a text-to-image visual autoregressive model, to adapt it from 1K to 2K scale. The following results confirm the applicability to various models.

| | QualiCLIP | MAN-IQA | HPSv2.1 |
|-|-|-|-|
| Infinity-8B | 0.5233 | 0.3226 | 32.26 |
| w/ URAE | 0.5570 | 0.3584 | 32.35 |
- Quantified computational efficiency.
- In fact, the parameter efficiency here refers to training efficiency, as it only requires tuning a small number of parameters. As indicated in Sec. 4, this work does not focus on inference efficiency since it is orthogonal to our main contributions. The inference cost is the same as the original FLUX operating at the corresponding resolutions. On H100:

| | 2K | 4K |
|-|-|-|
| Inference Time (28 Steps/Image) | 36.5 Sec. | 330.4 Sec. |
| GPU Memory | 27.5 GB | 39.5 GB |
- For the quantified analysis of data efficiency and parameter efficiency, we kindly refer the reviewer to our response to Q1 of Reviewer u5sm.
- Tuning minor components in other applications.
- We note that (Wang et al., 2024a) focuses on LLM fine-tuning and also applies minor-component tuning. However, it lacks a critical analysis of the applicability of this approach across various scenarios. In contrast, we demonstrate that the method can improve performance when (1) the data contain significant noise and (2) the target distribution does not shift too much from the source, e.g., 4K generation. In other cases, when clean data are available, we find that vanilla LoRA can be more effective, as shown in Tab. 2.
- Our response to Q1 provides theoretical insights on this. Due to the small singular values, the gradients w.r.t. $B$ and $A$ are numerically small, which may lead to insufficient adaptation when training data are accurate.
- Discussion on alternative ultra-resolution adaptation techniques.
- We show comparisons and integrations with some works in Tab. 1 and Fig. 5 and include some discussions in Sec. A.2.
- We kindly refer the reviewer to our response to Q1 of Reviewer wYHL for more discussions.
- Extend experiments to larger datasets (e.g., ImageNet, LAION-HR).
- In fact, as shown in Line 263 (right), our 4K-generation model is already trained with LAION-HR data.
- Sensitivity to the choice of singular component rank ($r$).
- We conduct the following studies to analyze the sensitivity to the rank $r$:

| Rank | 1 | 4 | 16 (Default) | 64 | 256 |
|-|-|-|-|-|-|
| ImageReward | 0.9291 | 1.0150 | 1.0923 | 0.9442 | 0.9239 |

Overall, the performance remains stable when $r$ is around 16.
Thank you for the thorough rebuttal and the additional theoretical and experimental clarifications on minor singular component tuning. The convergence analysis, Infinity model experiments, and expanded insights on computational efficiency and parameter usage all help solidify the practicality and applicability of URAE for ultra-resolution diffusion models. These details significantly strengthen the paper's overall contribution.
We would like to sincerely thank Reviewer FjKy for acknowledging our response and for the encouraging positive feedback. Following the suggestions, we will include these results in our revision. We truly appreciate the reviewer's constructive input to our manuscript.
The paper "Ultra-Resolution Adaptation with Ease" presents a novel approach called URAE for adapting text-to-image diffusion models to generate ultra-high-resolution images (e.g., 4K) with limited training data and computational resources. The key contributions include:
- Theoretical and empirical evidence showing that synthetic data from teacher models can significantly enhance training convergence.
- A parameter-efficient fine-tuning strategy that tunes minor components of weight matrices, outperforming widely-used low-rank adapters when synthetic data is unavailable.
- The importance of disabling classifier-free guidance during adaptation for models leveraging guidance distillation.
- Extensive experiments demonstrating that URAE achieves performance comparable to state-of-the-art closed-source models like FLUX1.1 [Pro] Ultra with only 3K samples and 2K iterations for 2K generation, while setting new benchmarks for 4K-resolution generation.
Questions for Authors
- How would the performance of URAE scale with additional training data beyond the 3K samples used in the experiments?
- What specific architectural modifications would be needed to combine URAE with recent efficient diffusion backbone designs (linear attention, SSM)?
Claims and Evidence
The claims made in the submission are generally supported by clear and convincing evidence. The authors provide theoretical analysis (Theorem 2.4) to demonstrate the potential benefits of using synthetic data for training convergence. They also conduct extensive experiments to validate the effectiveness of their proposed methods, including ablation studies on key components (data source, parameter tuning strategy, classifier-free guidance). The results show significant improvements over baseline methods and state-of-the-art models in both quantitative metrics and qualitative visual comparisons. The claims about the effectiveness of tuning minor components when synthetic data is unavailable are well-supported by experimental results in the 4K generation task.
Methods and Evaluation Criteria
The proposed methods make sense for the problem of ultra-resolution adaptation. The approach of using synthetic data from teacher models addresses the challenge of limited high-quality training data for ultra-resolution images. The parameter-efficient fine-tuning strategy that focuses on minor components of weight matrices is innovative and appropriate for scenarios where synthetic data is unavailable.
The evaluation criteria, including FID, LPIPS, MAN-IQA, QualiCLIP, HPSv2.1, and PickScore, are standard and relevant for assessing image generation quality. The use of GPT-4o for AI preference studies adds a novel dimension to the evaluation, providing insights into human-like preferences for generated images.
Theoretical Claims
The theoretical claims are correct. The authors provide a detailed proof (Theorem B.1) for their main theoretical result regarding the error bound when training with a mixture of real and synthetic data. The proof follows standard optimization analysis for neural networks and correctly accounts for the impact of label noise and model discrepancies. The assumptions made (infinite-width neural networks, linear approximation) are standard in theoretical analyses of neural network training.
Experimental Designs or Analyses
The experimental designs are sound and valid. The authors conduct experiments on both 2K and 4K resolution tasks, comparing against multiple baseline methods and state-of-the-art models. The ablation studies effectively isolate the impact of different components of their approach. The user study for 4K generation provides additional validation of the practical effectiveness of their method. The experimental setup, including training details and implementation specifics, is well-documented and allows for reproducibility.
Supplementary Material
I reviewed the supplementary material, including the theoretical proof in Appendix B and additional experimental details in Appendix C and D. The theoretical proof is thorough and correctly supports the main claims. The additional experimental results provide further validation of the method's effectiveness across different evaluation dimensions and qualitative examples.
Relation to Broader Scientific Literature
The key contributions of this paper are well-situated within the broader scientific literature on text-to-image generation and diffusion models. The work builds upon recent advances in diffusion models, parameter-efficient fine-tuning, and high-resolution image generation. It addresses the practical challenge of adapting existing models to ultra-resolution settings with limited resources, which is a significant concern in the field.
Essential References Not Discussed
The paper cites relevant prior work in text-to-image diffusion models, high-resolution generation, and parameter-efficient fine-tuning. However, it could benefit from discussing more recent works on high-resolution generation [1,2,3,4], especially training-free ones.
[1] Jin, Zhiyu, et al. "Training-free diffusion model adaptation for variable-sized text-to-image synthesis." Advances in Neural Information Processing Systems 36 (2023): 70847-70860.
[2] Cao, Boyuan, et al. "Ap-ldm: Attentive and progressive latent diffusion model for training-free high-resolution image generation." arXiv preprint arXiv:2410.06055 (2024).
[3] Qiu, Haonan, et al. "Freescale: Unleashing the resolution of diffusion models via tuning-free scale fusion." arXiv preprint arXiv:2412.09626 (2024).
[4] Kim, Younghyun, et al. "Diffusehigh: Training-free progressive high-resolution image synthesis through structure guidance." arXiv preprint arXiv:2406.18459 (2024).
Other Strengths and Weaknesses
Strengths:
- The paper addresses a significant practical problem in the field of text-to-image generation.
- The proposed URAE framework is comprehensive, addressing both data and parameter efficiency.
- The theoretical analysis provides valuable insights into the effectiveness of synthetic data.
- The experimental validation is extensive and rigorous.
Weaknesses:
- The paper could benefit from more detailed comparisons with very recent works on efficient high-resolution generation, for example, the methods mentioned in the "Essential References Not Discussed" section.
- The computational efficiency during inference is not specifically optimized, which could be a limitation for real-time applications.
Other Comments or Suggestions
The paper is well-written and well-structured, making it accessible to both experts and those new to the field. The visualizations of results are clear and effectively demonstrate the quality improvements achieved by URAE.
We sincerely thank Reviewer wYHL for the positive feedback on the manuscript and are very excited that the reviewer mentions the strengths of addressing a significantly practical problem with a comprehensive framework, insightful theoretical analysis, extensive experiments, and well-written manuscript. The questions are addressed below.
- The paper could benefit from more detailed comparisons with very recent works on efficient high-resolution generation. For example, those methods mentioned in the “Essential References Not Discussed” section.
- Thanks for bringing these related works to our attention. We show comparisons and integrations with some training-free works in Tab. 1 and Fig. 5. For the works mentioned by the reviewer, we would like to supplement the comparison results using consistent COCO validation prompts here:

| | FID | HPSv2.1 | ImageReward | PickScore |
|-|-|-|-|-|
| AP-LDM [2] | 48.50 | 30.40 | 0.6874 | 22.80 |
| FreeScale [3] | 48.87 | 31.19 | 0.7494 | 22.66 |
| DiffuseHigh [4] | 49.02 | 30.16 | 0.6182 | 22.77 |
| URAE (Ours) | 38.85 | 31.50 | 1.0923 | 23.21 |
- We would like to include the following discussions on these references in the revision:
- [1] proposes a resolution-adaptive attention scale factor, which has already been adopted in a series of works including FLUX-1.dev and I-Max in Tab. 1.
- [2] proposes an attention-guidance scheme and a progressive upsampling strategy.
- [3] adopts a global-local self-attention mechanism and a tailored self-cascade upscaling strategy with region-aware detail control.
- [4] proposes a DWT-based structural guidance to guide the high-resolution generation with the structural information of the low-resolution images.
These works mentioned by the reviewer tackle the problem of high-resolution image generation from a training-free perspective by designing effective strategies, e.g., progressive generation, to leverage pre-trained diffusion models at their native scales, whereas our method focuses on adapting these models from a training-based perspective so that they can directly operate at a high-resolution scale. Therefore, as mentioned in Sec. A.2 of the appendix, the two lines of research address the problem from orthogonal directions, i.e., strategy vs. model, and can be readily integrated for better performance, as shown in Tab. 1 and Fig. 5.
- The computational efficiency during inference is not specifically optimized, which could be a limitation for real-time applications.
- Thanks for pointing this out. Although this work does not specifically optimize inference latency, we would like to share our latest observation that, even without any additional training, a trained adapter on FLUX.1-dev can be migrated onto FLUX.1-schnell, which generates high-quality results with only 4 denoising steps and achieves acceleration compared with FLUX.1-dev (25.8 vs. 36.5 sec./image). The performance under this setting is shown below:

| | FID | HPSv2.1 | ImageReward | PickScore |
|-|-|-|-|-|
| FLUX-schnell | 42.42 | 27.97 | 0.6902 | 22.07 |
| FLUX-schnell* | 42.20 | 28.17 | 0.7446 | 22.38 |
| w/ URAE | 38.66 | 29.63 | 0.9999 | 22.74 |

We will include these results in our revision, which suggest significant potential for acceleration.
- How would the performance of URAE scale with additional training data beyond the 3K samples used in the experiments?
- Thanks for the insightful question. We are actively collecting more data from FLUX1.1 [Pro] Ultra and training new models. As scaling up data collection, preprocessing, and training requires significant resources, the experiments are still ongoing, and we will include the results in our revision.
- What specific architectural modifications would be needed to combine URAE with recent efficient diffusion backbone designs (linear attention, SSM)?
- Thanks for the valuable question. We are continuing to work on improving the architectural efficiency of the proposed URAE. Our latest exploration suggests the feasibility of replacing the original full attention with the linearized attention structure introduced in (Liu et al., 2024a). We find that even without further adaptation, the trained adapters in URAE are compatible with these novel attention layers. We present some examples via this anonymous link. The models with linearized attention achieve acceleration at 2K resolution (25.8 vs. 36.5 sec./image) and at 4K resolution (124.2 vs. 330.4 sec./image).
We would like to thank Reviewer wYHL again for the in-depth reviews. We would definitely love to further interact with the reviewer if there are any further questions.
This paper tackles the challenge of efficiently adapting text-to-image diffusion models to ultra-high resolutions (2K and 4K). Traditional approaches demand massive amounts of 4K training data and expensive fine-tuning of the entire model, making them difficult to deploy at scale. In contrast, URAE explores two main dimensions—data efficiency and parameter efficiency—and provides guidelines that yield strong ultra-resolution results with only thousands of samples and minimal GPU resources. By combining synthetic teacher-generated data (when available) and targeted parameter-efficient fine-tuning, URAE achieves state-of-the-art 2K image quality comparable to closed-source models such as FLUX1.1. It also sets new benchmarks in 4K resolution, demonstrating its adaptability under data-scarce conditions.
Questions for Authors
- Computational Footprint: Could you provide more details on training cost comparisons, e.g., GPU hours, memory usage, or speedups over full fine-tuning?
- UNet-Based Models: Does URAE also apply neatly to stable diffusion–type backbones, or are modifications needed?
Claims and Evidence
- Claim: URAE Achieves Ultra-Resolution Adaptation with Minimal Data
  - Evidence: The authors fine-tune a base diffusion model (FLUX.1-dev) on just 3K synthetic samples for 2K tasks and achieve results close to or better than advanced closed-source models. The theoretical analysis (Theorem 2.4) shows how synthetic data from a high-quality teacher can expedite training convergence.
- Claim: Parameter-Efficient Fine-Tuning Is More Effective Than Full Model Tuning
  - Evidence: Through ablation, they show that focusing on particular "minor" or "major" singular values outperforms commonly used LoRA in certain scenarios, especially for 4K adaptation when synthetic data is unavailable. Empirical benchmarks in Tables 1–3 confirm superior performance over baseline or naive approaches.
- Claim: Disabling Classifier-Free Guidance (CFG) During Training Improves Stability
  - Evidence: The authors discover that for guidance-distilled models like FLUX, setting the CFG scale to 1 (effectively "off") during fine-tuning leads to better adaptation performance. Results in Table 2 and Figures 3 & 7 illustrate the negative impact of leaving CFG on during adaptation.
- Claim: Compatibility with Training-Free High-Resolution Pipelines
  - Evidence: URAE can be employed in conjunction with existing post-processing or upscale pipelines (e.g., SDEdit, I-Max). Figure 5 shows that URAE effectively upgrades their output from 1024×1024 to 2048×2048, surpassing conventional super-resolution baselines like Real-ESRGAN and SinSR.
Methods and Evaluation Criteria
- Methods:
  - URAE advocates fine-tuning on high-quality synthetic data generated by a teacher model.
  - At 2K resolution (with synthetic data), focusing on major components (LoRA) works well. At 4K resolution (less reliable data), tuning minor singular values preserves the model's essential capacities and avoids overfitting to noise.
  - For models that rely on guidance distillation, turning off CFG (g=1) eliminates mismatched training objectives.
- Evaluation:
  - Datasets & Benchmarks: HPD, DPG, LAION-5B for real data, and teacher-synthesized data from FLUX1.1 [Pro] Ultra for synthetic data.
  - Metrics: FID, LPIPS, MAN-IQA, QualiCLIP, user preference metrics (HPSv2.1, PickScore), and GPT-4-based AI preference scores.
  - Baselines: Includes stable or reference models like PixArt-Sigma, Sana-1.6B, Real-ESRGAN, SinSR, FLUX-1.dev, etc.
Overall, comprehensive quantitative and qualitative comparisons highlight URAE’s effectiveness.
Theoretical Claims
The authors introduce a linearized neural tangent kernel perspective (Theorem 2.4) to show that mixing real and synthetic data can accelerate learning, provided the synthetic data come from a sufficiently good teacher. This analysis solidifies the data-efficiency claim, as it mathematically bounds the distance to the optimal solution under varying real/synthetic data proportions.
Experimental Designs or Analyses
- Extensive Benchmarks:
  - 2K Results: Detailed in Table 1 and Fig. 4–6, showing URAE outperforms baseline or SOTA methods in image fidelity and preference tests.
  - 4K Results: Evaluated in Table 3 and Fig. 8, highlighting that "minor" component tuning without synthetic data can still yield strong high-resolution outputs.
  - Ablation Studies: Table 2 and Fig. 7 analyze the effect of (i) synthetic vs. real data, (ii) tuning major vs. minor components, and (iii) CFG on or off.
- User Studies & AI-Assisted Scoring: Incorporating GPT-4 preference evaluation (Fig. 4, Table 4) yields additional insights into alignment, aesthetics, and overall image quality.
Supplementary Material
- Appendices offer:
  - Detailed theoretical proofs (Appendix B) for Theorem 2.4.
  - Additional ablations, hyperparameter details, and user study prompts (Appendix C–D).
  - More visual examples of high-resolution outputs, reinforcing URAE's texture fidelity advantages.
The authors mention that the code will be released publicly in the future.
Relation to Broader Scientific Literature
- URAE aligns with growing research on large diffusion transformers, e.g., FLUX, PixArt, SANA, focusing primarily on data-efficiency, rather than training entire massive backbones.
- The approach extends beyond standard LoRA, referencing PISSA, FedPara, and other recent minor-component methods.
- Although mainly tested on image generation, the method could integrate well with broader multimodal large language models.
Essential References Not Discussed
Yang, Zhuoyi, et al. "Inf-dit: Upsampling any-resolution image with memory-efficient diffusion transformer." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024.
Other Strengths and Weaknesses
Strengths:
- Practical Data Efficiency: Demonstrates that 3K–30K images are enough to scale from 2K to 4K resolution, far below prior 4K training demands.
- Detailed Ablation: Comprehensive analysis of synthetic vs. real data usage, plus major/minor SVD component choices.
- Strong Empirical Evidence: Includes GPT-4 preference ranking, user studies, and well-known objective metrics.
Weaknesses:
- Limited Real-World Cost Analysis: While fewer iterations are praised (2K–10K), a clearer breakdown of training time, memory usage, or energy consumption would better illustrate URAE’s resource savings.
- Focus on DiT-Style Models: The method’s adaptability to other architectures (e.g., UNet-based) is suggested but not deeply tested.
- Inference Efficiency: The paper admits it does not optimize for inference latency, which might matter for large-scale industrial use.
Other Comments or Suggestions
In Figure 7, it’s not immediately clear how using synthetic data differs from using real data. Could you explain why the visual differences in Figure 7 appear subtle, and what evidence in the paper supports the conclusion that synthetic data ultimately improves training and performance?
We sincerely appreciate Reviewer u5sm for the constructive comments. We are happy that the reviewer finds our data efficiency practical, ablation detailed, and empirical evidence strong. We would like to address the concerns and questions reflected in the review below.
- Limited Real-World Cost Analysis: While fewer iterations are praised (2K–10K), a clearer breakdown of training time, memory usage, or energy consumption would better illustrate URAE’s resource savings.
- We would like to sincerely thank the reviewer for the constructive suggestions. Following the suggestions, according to publicly available information, we summarize "# of Training Iterations × Batch Size" of various methods, which reflects the total number of samples seen during training, to present a clearer breakdown of the required resources:

| | PixArt-Sigma-XL | Sana-1.6B | Ours |
|-|-|-|-|
| # of Training Iterations × Batch Size | 64K | ≥320K | 16K |

Since different methods use varying base models and hardware for training, we exclude training time as a direct indicator of resource savings. Nevertheless, even for FLUX, the largest open-source diffusion model with 12B parameters, our 4K model can still be trained within a day on an 8×H100 server.
- For memory usage, we conduct the following studies on the training-time GPU memory requirement (MB) with respect to various ranks of the adapters:

| Rank | 1 | 4 | 16 (Default) | 64 | 256 | 1536 | 3072 (Full) |
|-|-|-|-|-|-|-|-|
| 2K | 35916 | 35958 | 36124 | 36816 | 39884 | 52102 | 77880 |
| 4K | 62806 | 62850 | 63010 | 63704 | 66114 | 80332 | OOM |

We observe that compared with full-rank adaptation, the low-rank adapters save GPU memory by 50%+.
- Focus on DiT-Style Models: The method’s adaptability to other architectures (e.g., UNet-based) is suggested but not deeply tested.
- Thanks for the constructive suggestion. Following the suggestion, we conduct an experiment on SD-1.5 to adapt it from 512 to 1024 resolution. The synthetic data used are 10K samples generated by SD3. Results are shown below:

| | FID | HPSv2.1 | PickScore |
|-|-|-|-|
| SD 1.5 | 47.55 | 23.66 | 20.69 |
| SD 1.5* | 45.15 | 23.72 | 20.71 |
| SD 1.5 w/ [a] | 43.07 | 24.36 | 21.32 |
| SD 1.5 w/ Ours | 31.06 | 28.93 | 21.98 |

The FID is computed against 5K images in COCO2014val following [b]. SD 1.5* denotes using the proportional attention strategy similar to FLUX-1.dev* in Tab. 1. [a] is a state-of-the-art training-free high-resolution generation baseline based on resolution-aware downsampling and upsampling. The results verify the adaptability of our method to UNet-based diffusion models and its superior high-resolution generation capacity.
- Inference Efficiency: The paper admits it does not optimize for inference latency, which might matter for large-scale industrial use.
- Thanks for pointing this out. Although this work does not specifically optimize inference latency, we would like to share our latest observation that, without any additional training, a trained adapter on FLUX.1-dev can be migrated onto FLUX.1-schnell, which generates high-quality results with only 4 denoising steps and achieves acceleration compared with FLUX.1-dev (25.8 vs. 36.5 sec./image). The performance under this setting is shown below:

| | FID | HPSv2.1 | ImageReward | PickScore |
|-|-|-|-|-|
| FLUX-schnell | 42.42 | 27.97 | 0.6902 | 22.07 |
| FLUX-schnell* | 42.20 | 28.17 | 0.7446 | 22.38 |
| w/ URAE | 38.66 | 29.63 | 0.9999 | 22.74 |

We will include these results in our revision, which suggest significant potential for acceleration.
- In Figure 7, it’s not immediately clear how using synthetic data differs from using real data. Could you explain why the visual differences in Figure 7 appear subtle, and what evidence in the paper supports the conclusion that synthetic data ultimately improves training and performance?
- Thanks for the good question. In fact, Fig. 7 in the manuscript empirically verifies the theoretical result in Theorem 2.4 that synthetic data improve performance by diminishing label noise. Comparing results from synthetic and real data, we observe that the latter introduces many unrelated petals, whereas the former exhibits a cleaner layout. Additionally, the synthetic data produce a brighter, more vivid color tone and sharper contours with higher saturation.
- Furthermore, the results in Tab. 2 quantitatively demonstrate the superiority of synthetic data.
We would like to thank Reviewer u5sm again for the valuable feedback. Hope our responses alleviate the reviewer's concerns and we are happy to answer additional questions if there are.
[a] HiDiffusion: Unlocking Higher-Resolution Creativity and Efficiency in Pretrained Diffusion Models, Zhang et al., ECCV 2024
[b] DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models, Li et al., CVPR 2024
Thank you for the comprehensive answers. My concerns are fully addressed.
We are more than glad to know that our responses have fully resolved the raised concerns. We deeply value the reviewer’s insightful comments and constructive suggestions, which will be reflected in our revision and have significantly contributed to refining our manuscript. We are truly grateful for Reviewer u5sm’s time, effort, and thoughtful engagement throughout this process.
This paper explores the adaptation of existing models to ultra-resolution image generation. The authors categorize the challenges into two key aspects: data efficiency and parameter efficiency. Regarding data efficiency, the authors argue that synthetic data can serve as a valuable resource for model convergence in data-scarce scenarios. Regarding parameter efficiency, the proposed approach focuses on tuning minor components when adapting existing models to ultra-resolution. This method offers a promising direction for expanding existing models to ultra-resolution image generation by leveraging a small set of synthetic data for efficient adaptation.
Questions for Authors
How can this approach be integrated with existing tuning-free high-dimensional image generation methods?
Claims and Evidence
It is somewhat unclear whether synthetic data generated by teacher models can theoretically promote training convergence significantly, as the authors provide only empirical evidence without a formal theoretical justification. In Section 2.2, synthetic data would be beneficial only if the reference model generating these data is highly accurate. However, it is not guaranteed that this approach avoids mode collapse.
From visual inspection, the generated images appear to exhibit highly similar patterns, suggesting possible mode collapse. For instance: In Figure 8 (URAE Minor-4K), the second image contains many repetitive flower patterns, whereas PixArt-Sigma-XL generates more diverse floral structures. Similarly, in the giraffe example, the URAE-generated image displays repetitive mountain patterns, whereas PixArt-Sigma-XL and Sana-1.6B show greater variation. Such repetitive patterns are also widely noticeable in Figure 1, further supporting this concern.
Additionally, it is unclear how closely FLUX-1.1 [Pro] Ultra resembles real data. The authors appear to assume FLUX-1.1 [Pro] Ultra as real and measure FID scores relative to it in Table 1, yet it is still synthetically generated data.
While the paper argues that existing 2K or 4K resolution benchmarks do not exist, an alternative approach could be to reduce the resolution for quantitative evaluation. For example, adapting a 512-resolution model to generate 1024-resolution images could provide meaningful comparative insights.
Methods and Evaluation Criteria
See the Claims And Evidence part.
Theoretical Claims
It is commendable that the paper includes a theoretical proof in Section 2; however, the derivation does not clearly support the results claimed in the paper. Please refer to the "Claims and Evidence" section for further clarification.
Experimental Designs or Analyses
It appears that MAN-IQA and QualiCLIP may not be reliable metrics for evaluating 4K resolution, as their rankings differ significantly from user study results. For instance, while FLUX-1.dev* ranks second in MAN-IQA, it exhibits noticeable artifacts in Figure 8, raising concerns about the alignment between automated metrics and perceptual quality.
Additionally, could the authors clarify how LPIPS is measured? Specifically, which images are used as the source and which as the target in the comparison?
Supplementary Material
I have reviewed the supplementary material.
Relation to Broader Scientific Literature
The key contribution of this paper is adapting existing models to a data-scarce domain by leveraging a smaller set of synthetic data. This approach could be highly beneficial for various domains where obtaining high-dimensional data is significantly more challenging than acquiring low-dimensional data.
Essential References Not Discussed
The paper provides a well-discussed review of existing works, effectively situating its contributions within the broader research landscape.
Other Strengths and Weaknesses
Strengths
- The paper addresses the important problem of adapting existing models for ultra-resolution image synthesis.
- The writing is well-structured and easy to follow, making the paper accessible to readers.
- Exploring this problem from multiple perspectives (e.g., data, parameters) provides valuable insights and contributes to a broader understanding of the challenges involved.
Weaknesses: Please address the concerns raised in the "Claims and Evidence" and "Experimental Designs or Analyses" sections.
Other Comments or Suggestions
Given the existence of numerous tuning-free approaches for expanding models to high-resolution image synthesis, including ultra-resolution adaptation, it is difficult to claim that this paper is the first to tackle adaptation as a primary contribution.
We appreciate Reviewer 5Gfa's thoughtful comments and are glad that the significance and insights of our work are recognized. We would like to address the concerns as below.
- Theoretical analysis on synthetic data and mode collapse.
- Theorem 2.4 illustrates that, by diminishing label noise, accurate synthetic data achieve lower error than real data, which theoretically supports the effectiveness.
- For mode collapse, we theoretically verify that the difference in the diversity of generated samples between the trained and optimal models is tightly bounded, highlighting its robustness against this issue. Specifically, under the settings of Theorem 2.4, the distance between the variance of samples generated by the model after $T$ iterations and that of the optimal model is bounded by a quantity involving the r.h.s. of Eq. 3, which concerns the accuracy of the synthetic data. We will include the proof in the revision.
- Visual inspection and mode collapse.
- Mode collapse refers to a lack of diversity across generated samples, which is, in fact, not equivalent to similar patterns within an image. According to its definition, we do not encounter this issue, as validated by the FID against 2K real images in COCO2014val below:

| 2K | FID | LPIPS | 4K | FID | LPIPS |
|-|-|-|-|-|-|
| PixArt-Sigma-XL | 57.02 | 0.5075 | PixArt-Sigma-XL | 75.81 | 0.5066 |
| Sana-1.6B | 54.57 | 0.5122 | Sana-1.6B | 73.46 | 0.5108 |
| Ours | 52.95 | 0.4669 | Ours | 70.44 | 0.4647 |
| FLUX1.1 [Pro] Ultra | 47.12 | 0.4518 | FLUX1.1 [Pro] Ultra | - | - |
- Possibly caused by the powerful spatial attention, FLUX itself tends to yield similar patterns, which are also reflected in Fig. 1 and Fig. 14 of I-Max (Du et al., 2024b) and can be inherited by models based on it.
- Empirically, we observe that when objects or patterns are explicitly specified in prompts, the results tend to follow the prompts rather than exhibiting similarity. We validate this through the GenEval scores below, which assess the precision of position, instance appearance, etc.

| | PixArt-Sigma-XL | Sana-1.6B | Ours |
|-|-|-|-|
| GenEval Score | 0.5422 | 0.6892 | 0.6913 |
- Sincerely hope our responses can alleviate this concern and we will further clarify it with visualizations in our revision.
- It's unclear how closely FLUX-1.1 [Pro] Ultra resembles real data.
- As FLUX1.1 [Pro] Ultra ranks top on multiple text-to-image leaderboards and our goal is to achieve on-par performance with it, we adopt its generated images as targets in Tab. 1.
- We also supplement results computed against real images in COCO2014val. Please refer to our response to Q2 for details.
- Reduce the resolution for quantitative evaluation.
- Thanks for the suggestion. We evaluate our URAE on SD1.5 and adapt it from the 512 to the 1024 scale. The training data are generated by SD3. We kindly refer the reviewer to our response to Q2 of Reviewer u5sm for the results, which demonstrate that URAE achieves superior high-resolution generation capacity.
- MAN-IQA and QualiCLIP may not be reliable metrics for evaluating 4K resolution.
- In fact, various metrics have varying preferences and biases, so we include diverse metrics to demonstrate the superiority of our method across various aspects. By downsampling the generated 4K images to the required resolution for evaluation, we supplement more metrics here to reinforce the conclusion:

| 4K | HPSv2.1 | ImageReward | PickScore | GPT-4o Aesthetic | GPT-4o Prompt Alignment | GPT-4o Overall |
|-|-|-|-|-|-|-|
| PixArt-Sigma-XL | 31.02 | 0.9342 | 22.76 | 87.66 | 87.00 | 86.28 |
| Sana-1.6B | 32.00 | 1.0886 | 22.86 | 87.71 | 89.94 | 86.83 |
| Ours | 32.85 | 1.1484 | 23.38 | 89.65 | 90.50 | 87.58 |
- GPT-4o scores for human-like evaluation are also included.

| Win Rate | Aesthetic | Prompt Alignment | Overall |
|-|-|-|-|
| vs. PixArt | 67.90% | 61.60% | 59.50% |
| vs. Sana | 66.60% | 50.40% | 57.30% |
- How LPIPS is measured?
- In Tab. 1, similar to FID, images generated by FLUX1.1 [Pro] Ultra are used as the target for LPIPS, while the sources are images generated by various methods, following [a]. In our response to Q2, we also supplement the LPIPS results computed against real images.
- Relationship with tuning-free high-dimensional image generation methods.
- The "ultra-resolution adaptation" in the manuscript refers to the training-based adaptation. We will explicitly clarify this in our revision.
- The relationships are discussed in Sec. A.2 of the appendix, where we mention that the two lines of research tackle the problem from two orthogonal perspectives: model and pipeline.
- As shown in Line 308 (left), we apply the adapters trained by our method to these training-free solutions in the high-resolution stage. Results can be found in Tab. 1 and Fig. 5.
[a] DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models, Li et al., CVPR 2024
Thanks to the authors for the detailed answers. The additional experiments and evaluations address my concerns, and I am happy to raise my scores.
We are truly grateful for the reviewer's thoughtful and constructive feedback, which has been instrumental in improving our work. We are more than encouraged to hear that the reviewer's concerns have been addressed. Thanks again for the reviewer's time and valuable input throughout the review process :)
This paper proposes techniques for adapting pretrained text-to-image diffusion models to high resolutions. It studies this problem from two aspects: data efficiency and parameter efficiency. The reviewers have reached a unanimous positive rating of the work after extensive discussions. The AC believes that this is a strong submission and recommends acceptance.