PaperHub
Score: 6.1 / 10
Poster · 4 reviewers
Ratings: 3, 3, 4, 3 (min 3, max 4, std 0.4)
ICML 2025

Skrr: Skip and Re-use Text Encoder Layers for Memory Efficient Text-to-Image Generation

OpenReview · PDF
Submitted: 2025-01-12 · Updated: 2025-07-24
TL;DR

In this paper, we aim to reduce the size of the text encoder in T2I diffusion models with Skip and Re-use.

Abstract

Keywords
Generative model, Efficiency, Pruning

Reviews and Discussion

Review
Rating: 3

This paper suggests reducing the number of parameters in Text-to-Image (T2I) generative models by pruning the transformer layers in the text encoder used for conditioning. The authors highlight that most of the parameters in T2I models reside in the text encoder. Thus, they propose Skip and re-use layers (Skrr), which prunes layers in the text encoder via beam search and re-uses non-pruned layers in place of the pruned ones. As the similarity metric for pruning, the authors use MSE, since they observe that cosine similarity leads to divergence of the feature norm of the unconditional embedding. Quantitative and qualitative experiments show improved generation quality compared to other LLM pruning methods, and ablation studies show the significance of re-use and beam search.

Questions for Authors

  • The maximum sparsity mentioned in the paper is 41.9%, and I wonder whether the authors have tested a more highly compressed scenario.

Claims and Evidence

  • They claim that most of the parameters in the latest text-to-image generative models are in the text encoder (e.g., the T5-XXL encoder in SD3), so they need to focus on reducing the number of parameters of the text encoder to achieve a memory-efficient T2I generative model. I agree with this claim as the size of the text encoder significantly affects the memory and size of the checkpoint in storage.

Methods and Evaluation Criteria

  • The pruning in the proposed method computes similarity in the embedding space of the T2I generative model. This makes sense, as this space should contain the information the T2I model actually uses.

  • The authors justify the use of MSE as the similarity metric by showing the problem with cosine similarity (see the sketch after this list).

  • The authors show the similarity between features in adjacent layers, which strengthens their argument for re-using non-pruned layers.

  • Evaluating T2I generation on MS-COCO is a common protocol as far as I know, and they report various evaluation metrics for T2I performance.
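
A minimal sketch of the metric choice discussed in the list above, assuming the discrepancy is measured on text-encoder output embeddings of shape [batch, seq_len, hidden]; the function and tensor shapes are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def block_discrepancy(ref_emb: torch.Tensor, pruned_emb: torch.Tensor, metric: str = "mse") -> float:
    """Compare text-encoder outputs before and after skipping a candidate block.

    ref_emb, pruned_emb: [batch, seq_len, hidden] embeddings fed to the T2I model.
    Illustrative sketch only, not the paper's exact metric.
    """
    if metric == "mse":
        # MSE is sensitive to the feature norm, so a diverging norm of the
        # unconditional (null-prompt) embedding is penalized.
        return F.mse_loss(pruned_emb, ref_emb).item()
    elif metric == "cosine":
        # Cosine similarity ignores the norm, which can hide that divergence.
        cos = F.cosine_similarity(pruned_emb.flatten(1), ref_emb.flatten(1), dim=-1)
        return (1.0 - cos.mean()).item()
    raise ValueError(f"unknown metric: {metric}")
```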

Theoretical Claims

  • It seems that their claim makes sense, but I didn't check the theoretical proof thoroughly.

Experimental Design and Analyses

  • Most of the experimental designs seem clear and valid.

  • In the discussion, the authors relate the enhancement in FID of the pruned model to recent guidance methods using perturbed or degraded models. Does this imply that the unconditional score of the pruned model would be a good guidance signal, like the degraded counterpart in autoguidance? Then, what if we compute the conditional score from the pruned model and guide it with the original model? Would it worsen the FID?

  • As I understand it, a lower score is better for both metric_1 and metric_2, since it implies the pruned model behaves like the original model. In that case, for every metric, (c) is better than (b) in Figure 4. However, the text in Figure 4 claims that metric_2 is better than metric_1, which confuses me.

Supplementary Material

  • I just skimmed the supplementary material, and did not thoroughly review it.

Relation to Broader Literature

  • Compressing large-scale text-to-image generative models is a significant problem for their usage in real-world scenarios. This paper argues for the importance of considering the text encoder when building memory-efficient T2I generative models.

Essential References Not Discussed

N/A

Other Strengths and Weaknesses

One weakness of this paper is that it only reduces memory and parameters, not the actual computational cost such as FLOPs.

Other Comments or Suggestions

N/A

Author Response

We sincerely appreciate the reviewer's insightful comments and encouraging feedback regarding our motivation for pruning text encoders, our embedding-space-based similarity approach, the justification for using MSE over cosine similarity, and the clarity and validity of our experimental design and evaluations.


Question 1. Clarification on pruned model's unconditional score effectiveness and impact on FID when guiding original conditional scores.

Response: We believe that this interesting phenomenon of decreasing FIDs is worth reporting and discussing (at least in the Discussion section), since it has also been observed in many different contexts such as fine-tuning for concept erasing [1-3], distillation [4, 5], quantization [6], and compression [7, 8], despite the intuitive expectation of performance degradation. Furthermore, as you suggested, our experimental results, summarized in Table 1, indicate that guiding conditional embeddings from the Skrr-compressed model with unconditional embeddings from the dense model leads to degraded FID compared to taking both conditional and unconditional embeddings entirely from the Skrr-compressed model. A deeper exploration of this intriguing phenomenon and its underlying mechanisms would constitute compelling future work.

Table 1. FID and CLIP score with combinations of conditional / unconditional scores from dense / compressed models in PixArt-Σ.

Conditional | Unconditional | FID↓ | CLIP↑
Dense | Dense | 22.89 | 0.314
Compress w/ Skrr | Compress w/ Skrr | 19.93 | 0.312
Compress w/ Skrr | Dense | 24.31 | 0.307
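
To make the guidance setup behind Table 1 concrete, here is a minimal classifier-free-guidance sketch in which the conditional and unconditional text embeddings may come from different encoders; `denoiser`, `skrr_encoder`, and `dense_encoder` are hypothetical handles, not the authors' code:

```python
def cfg_noise_pred(denoiser, x_t, t, cond_emb, uncond_emb, guidance_scale=4.5):
    """One classifier-free-guidance step: `denoiser` maps (latent, timestep, text emb) to noise."""
    eps_cond = denoiser(x_t, t, cond_emb)
    eps_uncond = denoiser(x_t, t, uncond_emb)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Hypothetical text-encoder handles: `dense_encoder` is the original T5-XXL,
# `skrr_encoder` is the Skip-and-Re-use compressed one.
# Row 2 of Table 1 uses the compressed encoder for both embeddings:
#   cond_emb   = skrr_encoder(prompt_ids)
#   uncond_emb = skrr_encoder(null_ids)
# Row 3 mixes the two (compressed conditional, dense unconditional):
#   cond_emb   = skrr_encoder(prompt_ids)
#   uncond_emb = dense_encoder(null_ids)
# eps = cfg_noise_pred(denoiser, x_t, t, cond_emb, uncond_emb)
```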

[1] Lu, et al. Mace: Mass concept erasure in diffusion models. CVPR (2024).
[2] Zhang, et al. Forget-me-not: Learning to forget in text-to-image diffusion models. CVPR (2024).
[3] Lee, et al. Concept pinpoint eraser for text-to-image diffusion models via residual attention gate. ICLR (2025).
[4] Zhao, et al. MobileDiffusion: Instant Text-to-Image Generation on Mobile Devices. In ECCV (2024).
[5] Feng, et al. Relational diffusion distillation for efficient image generation. ACM MM (2025).
[6] Li, et al. SVDQuant: Absorbing Outliers by Low-Rank Component for 4-Bit Diffusion Models. ICLR (2025).
[7] Yuan, et al. Ditfastattn: Attention compression for diffusion transformer models. NeurIPS (2024).
[8] Chen, et al. Δ-DiT: A Training-Free Acceleration Method Tailored for Diffusion Transformers. arXiv (2024).


Question 2. Typo in Figure 4.

Response: Your understanding is correct, and the point you mention is indeed a typo. Thank you for pointing this out. The values of Metric_1 for (b) and (c) in Figure 4 should be swapped.


Question 3. Experiments on extreme sparsity.

Response: Thank you for proposing this insightful experiment. We conducted evaluations under high sparsity levels beyond 50%, and these results are presented in Table 2. While other pruning methods produce images with severely compromised fidelity under such extreme sparsity, Skrr, despite exhibiting some degree of performance degradation, continues to generate images with relatively high fidelity that remain reasonably aligned with the textual descriptions. We will include these detailed experimental results in the supplementary materials of our revised manuscript.
Table 2. Quantitative results on sparsity over 50% with PixArt-Σ.

Method | Sparsity | FID↓ | CLIP↑ | DreamSim↑ | Single↑ | Two↑ | Count.↑ | Colors↑ | Pos.↑ | Color attr.↑ | Overall↑
Dense | 0.0 | 22.89 | 0.314 | 1.0 | 0.988 | 0.616 | 0.475 | 0.795 | 0.108 | 0.255 | 0.539
LaCo | 48.6 | 280.3 | 0.194 | 0.170 | 0.003 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.001
FinerCut | 51.3 | 154.7 | 0.191 | 0.176 | 0.078 | 0.008 | 0.0 | 0.027 | 0.0 | 0.0 | 0.019
Ours (Skrr) | 51.4 | 20.04 | 0.307 | 0.699 | 0.888 | 0.333 | 0.366 | 0.686 | 0.038 | 0.055 | 0.394

Question 4. FLOPs reduction in Skrr.

Response: Skrr not only reduces memory and parameters but also lowers FLOPs compared to the dense model, as shown in Table 2 of the main manuscript. While the Re-use mechanism slightly increases FLOPs relative to pruning-only methods, the text encoder constitutes a minor fraction of the total T2I pipeline. As a result, Skrr reduces overall FLOPs by 0.04% compared to the dense model, with at most a 0.17% increase over other pruning approaches.

Review
Rating: 3

This paper proposes the Skrr method, which effectively reduces the memory consumption of the text encoder in Text-to-Image (T2I) models while maintaining the quality of image generation.

Questions for Authors

NA

Claims and Evidence

The motivation is reasonable.

Methods and Evaluation Criteria

The paper proposes a two-stage pruning method (Skip and Re-use), which is clearly presented. However, the beam search may introduce additional computational overhead, and the authors did not quantitatively analyze the efficiency and cost of the pruning process itself. The authors use multiple widely recognized evaluation metrics and several existing state-of-the-art diffusion models as evaluation benchmarks.

Theoretical Claims

It seems correct.

Experimental Design and Analyses

When pruning with high sparsity, the FID score (an image quality metric) actually improves (i.e., image quality improves). The authors briefly discuss this in the Discussion section (from line 433 onwards), suggesting that it may be due to the perturbation of the null-condition vector. However, without sufficient and clear quantitative experiments or theoretical support for this phenomenon, the improvement in FID may simply be an accidental phenomenon or evaluation bias. More detailed experimental analysis should be conducted on this phenomenon (such as evaluating the stability of FID scores under different random seeds or prompt distributions) to confirm whether the FID improvement is truly statistically significant.

Supplementary Material

The supplementary material provides the detailed experimental setup and additional experiments, such as more visual comparisons.

Relation to Broader Literature

NA

Essential References Not Discussed

NA

Other Strengths and Weaknesses

see above

Other Comments or Suggestions

NA

Author Response

We sincerely appreciate the reviewer's insightful comments and encouraging feedback regarding the clarity of our proposed two-stage pruning method (Skip and Re-use), as well as our comprehensive evaluation using multiple widely recognized metrics and state-of-the-art diffusion models.


Question 1. Computational cost of beam search-based algorithm.

Response: Thank you for the valuable suggestion. As you correctly pointed out, beam search introduces additional computational overhead, and we have clearly articulated its time complexity with respect to the number of transformer blocks L and beam size k in Table 1. However, since k is set to a relatively small value (specifically, k=3 in Skrr) compared to L (with L=48 for T5-XXL), the additional computational overhead remains minimal. Furthermore, it is crucial to emphasize that this overhead occurs only once during the pruning stage when blocks are selected and does not recur during the actual image synthesis phase. Thus, the computational cost from beam search has negligible practical impact on the applicability and efficiency of Skrr. We will include a detailed analysis addressing this aspect explicitly in our revised manuscript.

Table 1. Time complexity of each block-wise pruning method for calibration. For Skrr, k=3.

ShortGPT | LaCo | FinerCut | Ours (Skrr)
O(L) | O(L) | O(L^2) | O(kL^2)
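
For illustration of where the O(kL^2) term comes from, below is a minimal beam-search sketch over candidate skip sets; `score_skip_set` is a placeholder for the paper's calibration-based discrepancy metric, and the loop structure is an assumption rather than the authors' exact algorithm:

```python
def beam_search_skip(num_layers: int, target_skips: int, beam_size: int, score_skip_set):
    """Beam search over sets of transformer blocks to skip.

    score_skip_set(skip_set) -> float: lower means the pruned encoder's
    embeddings stay closer to the dense encoder's on calibration prompts
    (placeholder). Each call is one calibration pass, so each round issues
    up to beam_size * num_layers calls; with ~L rounds this is O(k * L^2).
    """
    beams = [frozenset()]  # start with no layers skipped
    for _ in range(target_skips):
        candidates = {}
        for skip_set in beams:
            for layer in range(num_layers):
                if layer in skip_set:
                    continue
                cand = skip_set | {layer}
                if cand not in candidates:
                    candidates[cand] = score_skip_set(cand)
        # keep the beam_size best-scoring skip sets for the next round
        beams = sorted(candidates, key=candidates.get)[:beam_size]
    return min(beams, key=score_skip_set)
```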

Question 2. Clarification on statistical significance and stability of FID improvements observed under text encoder pruning.

Response: We emphasize that our evaluation of FID scores was conducted using the MS-COCO dataset with 30,000 prompts, a sufficiently large and widely used benchmark to ensure robustness and statistical reliability. The use of such an extensive dataset substantially reduces the likelihood that the observed improvements in FID scores at high sparsity are due to randomness or evaluation bias. Additionally, it should be noted that the calibration subset derived from CC12M and the MS-COCO dataset employed for FID evaluations are entirely disjoint, further diminishing concerns regarding dataset-induced bias.
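
For completeness, a seed-stability check of the kind the reviewer suggests could be scripted roughly as below; `generate_images` is a hypothetical wrapper around the (pruned) T2I pipeline, and `clean-fid` is one publicly available FID implementation, not necessarily the one used in the paper:

```python
import numpy as np
from cleanfid import fid

def fid_over_seeds(generate_images, prompts, ref_dir, seeds=(0, 1, 2)):
    """Re-generate the evaluation set under several seeds and report the FID spread.

    generate_images(prompts, seed, out_dir): hypothetical helper that writes
    one image per prompt to out_dir using the pruned T2I pipeline.
    """
    scores = []
    for seed in seeds:
        out_dir = f"samples_seed{seed}"
        generate_images(prompts, seed, out_dir)
        scores.append(fid.compute_fid(out_dir, ref_dir))
    return float(np.mean(scores)), float(np.std(scores))
```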

We believe that this interesting phenomenon of decreasing FIDs is worth reporting and discussing (at least in the Discussion section), since it has also been observed in many different contexts such as fine-tuning for concept erasing [1-3], distillation [4, 5], quantization [6], and compression [7, 8], despite the intuitive expectation of performance degradation. A deeper exploration of this intriguing phenomenon and its underlying mechanisms would constitute compelling future work.

[1] Lu, et al. Mace: Mass concept erasure in diffusion models. CVPR (2024).
[2] Zhang, et al. Forget-me-not: Learning to forget in text-to-image diffusion models. CVPR (2024).
[3] Lee, et al. Concept pinpoint eraser for text-to-image diffusion models via residual attention gate. ICLR (2025).
[4] Zhao, et al. MobileDiffusion: Instant Text-to-Image Generation on Mobile Devices. In ECCV (2024).
[5] Feng, et al. Relational diffusion distillation for efficient image generation. ACM MM (2025).
[6] Li, et al. SVDQuant: Absorbing Outliers by Low-Rank Component for 4-Bit Diffusion Models. ICLR (2025).
[7] Yuan, et al. Ditfastattn: Attention compression for diffusion transformer models. NeurIPS (2024).
[8] Chen, et al. Δ-DiT: A Training-Free Acceleration Method Tailored for Diffusion Transformers. arXiv (2024).

Reviewer Comment

I have read the author response to my review and it addressed my concerns.

Author Comment

We are pleased that our responses addressed the concerns raised and sincerely appreciate the reviewer’s feedback.

Review
Rating: 4

This paper introduces Skrr, a pruning strategy for text encoders in text-to-image (T2I) diffusion models. Skrr reduces memory usage by selectively skipping and reusing transformer layers, leveraging the redundancy in T2I-specific text encoders. Experimental results on FID, CLIP, DreamSim, and GenEval are reported.

Update after rebuttal

Thanks for the new evaluations. The Flux-based results may be presented in the main text as well. The field is changing rapidly -- I'm sure the authors know about the native image generation in GPT-4o, which seems to use special autoregressive tokens as conditioning for diffusion. Maybe in future versions some discussion of how this work fits into the 4o-style structure would make the paper more interesting.

Questions for Authors

  1. Flux is tested, but why are numerical evaluations not reported?
  2. Fig. 3 seems like a general text-encoder study. Does it relate to T2I? This is misleading.

Claims and Evidence

On the Claim of Novelty: The paper states, “To the best of our knowledge, this is the first work to tackle the challenge of constructing a lightweight text encoder for T2I tasks.” While technically correct, this phrasing may be misleading. Many text encoder pruning methods can be considered as contributing to lightweight text encoders. The unique aspect of this work lies in how it prunes the text encoder specifically for T2I tasks. However, the specific challenges in optimizing text encoders for T2I models are not clearly articulated, making it difficult to assess the true novelty of the approach.

Methods and Evaluation Criteria

The paper uses GenEval as a benchmark for evaluating text encoder performance. However, GenEval primarily consists of simple prompts that do not fully test the capabilities of powerful text encoders. More challenging benchmarks, such as the DSG score (Davidsonian Scene Graph: Improving Reliability in Fine-Grained Evaluation for Text-to-Image Generation, ICLR 2024), would provide a better assessment of how well the pruned text encoder captures fine-grained text-image alignment.

Theoretical Claims

The paper presents Lemma 3.1 (Error Bound of Two Transformers) and Theorem 3.2 (Tighter Error Bound of Re-use) to support its pruning strategy. However, these theorems involve Lipschitz continuity, which the paper does not explicitly evaluate on real models. Without verifying the Lipschitz properties in practical settings, the theoretical claims may appear somewhat loose, and their applicability to actual T2I models remains uncertain.

Experimental Design and Analyses

The primary baseline for evaluation is PixArt (in Tab 1), which is known to be a relatively weak T2I model. Furthermore, the paper tests its method on only this single model. As the proposed pruning strategy is intended to be a general solution for text encoder optimization, it should be validated on stronger T2I models—especially those that demonstrate superior text generation capabilities, such as Flux. Additionally, because PixArt’s performance on GenEval is highly variable, the reported results may lack reliability. Testing on multiple high-performing models would strengthen the claims of generalizability.

Supplementary Material

n/a

Relation to Broader Literature

not related

Essential References Not Discussed

n/a

Other Strengths and Weaknesses

no other comments

Other Comments or Suggestions

no other comments

Author Response

We sincerely appreciate the reviewer's insightful comments and encouraging feedback.


Question 1. Novelty claim regarding the task.

Response: We acknowledge the reviewer's comment regarding the overstatement of novelty. We appreciate this important point and will carefully revise our manuscript to more accurately reflect the nature of our contribution. Specifically, we will clarify that our primary novelty lies in effectively addressing the particular challenges involved in pruning text encoders within text-to-image (T2I) diffusion models, rather than implying broader methodological novelty.


Question 2. Evaluation on additional benchmark.

Response: GenEval was selected in this work because it is a widely adopted benchmark in prior T2I works and thus enables direct comparison with our proposed method. To address your concern, we now evaluate Skrr on two additional benchmarks (Tables 1 and 2 below), T2I-CompBench [1] (generating 4 images per prompt, 3,600 images in total) and DSG-1K [2] (generating 4 images per prompt, 4,240 images in total), reporting the BLIP-VQA score and DSG score, respectively. These new results are consistent with our prior evaluation, demonstrating Skrr's robustness across diverse and rigorous benchmarks.
Table 1. Quantitative results on BLIP-VQA score on the T2I-CompBench dataset with PixArt-Σ.

Method | Sparsity | Complex↑ | Shape↑ | Texture↑
Dense | 0.0 | 0.5137 | 0.4595 | 0.5715
LaCo | 40.5 | 0.3127 | 0.3155 | 0.3370
FinerCut | 41.7 | 0.4162 | 0.3146 | 0.4155
Ours (Skrr) | 41.9 | 0.4438 | 0.3756 | 0.4721

Table 2. Quantitative results on DSG score on the DSG-1K dataset with PixArt-Σ.

Method | Sparsity | TIFA↑ | Paragraph↑ | Relation↑ | Count↑ | Real user↑ | Pose↑ | Defying↑ | Text↑ | Overall↑
Dense | 0.0 | 0.879 | 0.886 | 0.836 | 0.740 | 0.600 | 0.626 | 0.808 | 0.683 | 0.761
LaCo | 40.5 | 0.763 | 0.812 | 0.630 | 0.602 | 0.518 | 0.594 | 0.595 | 0.582 | 0.649
FinerCut | 41.7 | 0.715 | 0.661 | 0.716 | 0.569 | 0.470 | 0.585 | 0.549 | 0.502 | 0.597
Ours (Skrr) | 41.9 | 0.797 | 0.791 | 0.757 | 0.636 | 0.537 | 0.553 | 0.691 | 0.605 | 0.677

[1] Huang, et al. T2I-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. NeurIPS (2023).
[2] Cho, et al. Davidsonian scene graph: Improving reliability in fine-grained evaluation for text-to-image generation. ICLR (2024).


Question 3. About Lipschitz continuity assumption.

Response: The assumption of Lipschitz continuity is widely used in deriving convergence rates and error bounds. Prior work has empirically shown that reasonably trained large transformer models satisfy this condition [3], and theoretical analyses have confirmed its validity in diffusion models [4]. Thus, our use of this assumption is well-justified, supporting the relevance of our theoretical framework to real-world T2I models.
[3] Khromov & Singh. Some Fundamental Aspects about Lipschitz Continuity of Neural Networks. ICLR (2024).
[4] Liang, et al. Unraveling the smoothness properties of diffusion models: A gaussian mixture perspective. arXiv (2024).


Question 4. Evaluation on FLUX.

Response: We provide extensive additional qualitative comparisons for stronger T2I models, such as SD3 and FLUX, in the appendix. Nevertheless, to directly address your concern regarding generalizability, we have also conducted quantitative evaluations on the FLUX.1-schnell model, presented in Table 3 below. The outcomes clearly demonstrate that our method preserves the performance of the dense model on FLUX better than the other baselines, reinforcing our claim regarding the broad applicability and generalizability of our pruning strategy across multiple strong T2I diffusion models.

Table 3. Quantitative results on FLUX.1-schnell.

Method | Sparsity | FID↓ | CLIP↑ | DreamSim↑ | Single↑ | Two↑ | Count.↑ | Colors↑ | Pos.↑ | Color attr.↑ | Overall↑
Dense | 0.0 | 20.45 | 0.312 | 1.0 | 0.994 | 0.879 | 0.603 | 0.738 | 0.280 | 0.500 | 0.666
LaCo | 40.5 | 28.67 | 0.292 | 0.631 | 0.756 | 0.275 | 0.275 | 0.372 | 0.025 | 0.053 | 0.293
FinerCut | 41.7 | 38.87 | 0.268 | 0.536 | 0.522 | 0.169 | 0.072 | 0.298 | 0.008 | 0.008 | 0.179
Ours (Skrr) | 41.9 | 24.28 | 0.300 | 0.698 | 0.925 | 0.439 | 0.300 | 0.617 | 0.058 | 0.053 | 0.399

Question 5. Relationship between Figure 3 and T2I models.

Response: Figure 3 presents experimental results obtained specifically from the T5-XXL encoder, which is widely adopted in T2I models. Note that T5-XXL is trained independently of T2I models, so it is reasonable to investigate this component separately while it still has strong relevance to the T2I context. To avoid potential confusion, we will revise the text to explicitly highlight the connection to T2I.
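
As an illustration of the kind of adjacent-layer analysis in Figure 3, the following sketch measures cosine similarity between hidden states of neighboring T5 encoder blocks with Hugging Face `transformers`; the checkpoint (a small T5 variant standing in for T5-XXL) and the prompt are illustrative choices, not the paper's setup:

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

# "google/t5-v1_1-small" stands in for the T5-XXL encoder used in T2I models.
name = "google/t5-v1_1-small"
tok = AutoTokenizer.from_pretrained(name)
enc = T5EncoderModel.from_pretrained(name).eval()

inputs = tok("a corgi surfing a wave at sunset", return_tensors="pt")
with torch.no_grad():
    out = enc(**inputs, output_hidden_states=True)

# hidden_states: tuple of (num_layers + 1) tensors of shape [1, seq_len, hidden]
hs = out.hidden_states
for i in range(1, len(hs) - 1):
    sim = torch.nn.functional.cosine_similarity(
        hs[i].flatten(1), hs[i + 1].flatten(1), dim=-1
    ).item()
    print(f"block {i} -> {i + 1}: cosine similarity {sim:.3f}")
```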

Review
Rating: 3

This paper introduces Skip and Re-use Layers (Skrr), a method for compressing text encoders in text-to-image (T2I) diffusion models to improve memory efficiency. The large-scale text encoders in T2I diffusion models consume a significant amount of memory despite contributing little to FLOPs. Skrr addresses this by selectively skipping and reusing layers in the text encoder. It uses a T2I diffusion-tailored discrepancy metric and beam search in the Skip phase to identify layers for pruning, and in the re-use phase, it reintroduces adjacent layers to skipped ones to mitigate performance degradation. Experiments show that Skrr outperforms existing blockwise pruning techniques, maintaining image quality and text alignment even at high sparsity levels.

Questions for Authors

Based on the above concerns, I suggest that i) the authors conduct more extensive validation of their method, including comparisons on additional datasets; and ii) a rigorous analysis be performed to verify the consistency between the calibration dataset and text-to-image datasets. Most importantly, the current method appears to sacrifice too much performance, which significantly limits its practical applicability. Addressing this performance degradation is critical for real-world deployment. Given these issues, I currently vote for weak reject.

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes

Theoretical Claims

Yes. Skip Algorithm and Re-use Algorithm.

Experimental Design and Analyses

Yes

Supplementary Material

Yes, B.2, B.3, and B.4.

Relation to Broader Literature

NA

Essential References Not Discussed

NA

Other Strengths and Weaknesses

Strengths:

  • This paper proposes Skrr, which significantly reduces the memory usage of text encoders in T2I diffusion models.
  • It outperforms existing blockwise pruning techniques such as ShortGPT, LaCo, and FinerCut.
  • The paper provides theoretical support for the Re-use phase. Theorem 3.2 shows that incorporating the Re-use phase can lead to a tighter error bound compared to just skipping layers.
  • Extensive experiments across various metrics show the effectiveness of the proposed method.

Weaknesses:

  • The GenEval score appears to be insufficient as a standalone metric. Are there alternative evaluation benchmarks, such as T2I-CompBench or DPG-Bench, that could provide a more comprehensive assessment of the proposed method?
  • Even when considering the GenEval score, a significant performance drop is observed compared to the dense model (where sparsity is 0), as shown in Table 1. This substantial decline severely limits the practical applicability of Skrr in real-world scenarios.
  • Whether memory is a limiting factor in practice remains an open question. During training, it is feasible to extract features offline using the text encoder since backpropagation (BP) is not required. In such cases, optimizing memory usage may not be a critical concern. However, if the proposed algorithm sacrifices significant performance to achieve memory efficiency, its practical utility could be severely constrained, which still needs further discussion.
  • The authors state that the calibration dataset consists of 1k text prompts sourced from CC12M. However, it is unclear whether this dataset is representative of the text prompts used in text-to-image generation tasks. There is insufficient validation to ensure that the results derived from this calibration dataset generalize well to text-to-image tasks. For instance, the authors might have overfit to the calibration dataset when selecting re-used or skipped layers, which could lead to suboptimal layer choices for text-to-image generation. This issue requires thorough exploration and justification.

Other Comments or Suggestions

See weaknesses.

Author Response

We sincerely appreciate the reviewer's insightful comments and encouraging feedback regarding our method's superior performance over existing blockwise pruning techniques, theoretical support, and extensive experiments across diverse metrics.


Question 1. Evaluation on additional benchmark.

Response: GenEval was selected in this work because it is a widely adopted benchmark in prior T2I works and thus enables direct comparison with our proposed method. To address your concern, we now evaluate Skrr on two additional benchmarks (Tables 1 and 2 below), T2I-CompBench [1] (generating 4 images per prompt, 3,600 images in total) and DSG-1K [2] (generating 4 images per prompt, 4,240 images in total), reporting the BLIP-VQA score and DSG score, respectively. These new results are consistent with our prior evaluation, demonstrating Skrr's robustness across diverse and rigorous benchmarks.

Table 1. Quantitative results on BLIP-VQA score on the T2I-CompBench dataset with PixArt-Σ.

Method | Sparsity | Complex↑ | Shape↑ | Texture↑
Dense | 0.0 | 0.5137 | 0.4595 | 0.5715
LaCo | 40.5 | 0.3127 | 0.3155 | 0.3370
FinerCut | 41.7 | 0.4162 | 0.3146 | 0.4155
Ours (Skrr) | 41.9 | 0.4438 | 0.3756 | 0.4721

Table 2. Quantitative results on DSG score on the DSG-1K dataset with PixArt-Σ.

Method | Sparsity | TIFA↑ | Paragraph↑ | Relation↑ | Count↑ | Real user↑ | Pose↑ | Defying↑ | Text↑ | Overall↑
Dense | 0.0 | 0.879 | 0.886 | 0.836 | 0.740 | 0.600 | 0.626 | 0.808 | 0.683 | 0.761
LaCo | 40.5 | 0.763 | 0.812 | 0.630 | 0.602 | 0.518 | 0.594 | 0.595 | 0.582 | 0.649
FinerCut | 41.7 | 0.715 | 0.661 | 0.716 | 0.569 | 0.470 | 0.585 | 0.549 | 0.502 | 0.597
Ours (Skrr) | 41.9 | 0.797 | 0.791 | 0.757 | 0.636 | 0.537 | 0.553 | 0.691 | 0.605 | 0.677

[1] Huang, et al. T2I-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. NeurIPS (2023).
[2] Cho, et al. Davidsonian scene graph: Improving reliability in fine-grained evaluation for text-to-image generation. ICLR (2024).


Question 2. Clarification on Skrr's practical applicability considering performance relative to dense models.

Response: True practicality requires not only excellent performance but also efficient memory usage, and we believe that our proposed Skrr provides a way of controlling this trade-off while minimizing performance degradation. As shown in all quantitative results, including Tables 1 and 2 in Question 1, Skrr consistently achieves better performance than all baseline pruning techniques, even at high sparsity. Furthermore, qualitative examples in both the main text (Fig. 5) and the appendix (Figs. A10-A16) clearly show that Skrr maintains visual fidelity and semantic alignment with the dense model. These results underscore Skrr's strong practical applicability despite compression. Additionally, Skrr is entirely training-free, which enhances its practicality and distinguishes it as a meaningful contribution to the field.


Question 3. Practical utility considering memory efficiency versus performance trade-off in text encoder pruning.

Response: We would like to clarify that our method is training-free; thus, we make no claims regarding training, and our approach is explicitly designed for inference scenarios. In practical situations where incoming text prompts are difficult to anticipate, offline feature extraction becomes impractical. Consequently, reducing the size of the text encoder is essential for effective model deployment. Reviewer v9A7 also acknowledged this motivation, explicitly stating, "I agree with this claim as the size of the text encoder significantly affects the memory and size of the checkpoint in storage." Concerns regarding performance drops associated with memory-efficient designs have already been addressed above.


Question 4. Clarification on the representativeness and generalizability of the CC12M calibration dataset for text-to-image tasks.

Response: We would like to clarify that the CC12M dataset is widely recognized and extensively utilized for training and evaluating various T2I models, including Stable Diffusion 3. To address your concern about potential overfitting to our calibration dataset, we emphasize two key points: First, all evaluation metrics employed in our experiments—FID, CLIP, DreamSim, GenEval, T2I-CompBench, and DSG score—were computed on datasets entirely distinct and mutually exclusive from the calibration set. Consistent performance across these diverse benchmarks strongly suggests minimal risk of overfitting. Second, since all baseline methods evaluated in our experiments utilized the same calibration set for their respective pruning strategies, the fairness and comparability of our results remain intact.

Final Decision

This paper proposes techniques to reduce the memory cost of language encoders in text-to-image models. The motivation is that language encoders account for most of the model's parameters, so shrinking them is desirable from a memory-saving perspective. Experiments are conducted on PixArt and demonstrate superior performance over competitive pruning methods. In the rebuttal, the authors show additional results on Flux that support the same claims. Reviewers are overall positive, recognizing that it is a simple and effective method. Some concerns remain; for example, the proposed method does cause a significant drop in performance, and it could be better motivated in terms of how the method is specific to text-to-image settings rather than LLMs in a general sense. Overall, I think this paper has sound ideas and results within a small but valid scope, and I lean towards accepting it.