Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think
Abstract
Reviews and Discussion
This paper proposes an improvement to REPA, a regularization designed to accelerate the training of diffusion transformers. While the original REPA aligns the spatial features of vision encoders like DINO with the features of the diffusion transformer, the proposed method, REG, further aligns the class token from DINO as well. In addition, the diffusion transformer is trained to predict not only the latent tokens but also an additional class token. This approach is simple, introduces minimal computational overhead, and significantly improves training speed, as demonstrated on ImageNet generation experiments.
Strengths and Weaknesses
[Strengths]
- The proposed method is simple and effective.
- The paper is clearly written and easy to understand.
- Experimental validation is comprehensive, including both unguided and guided generation comparisons, ablation studies, and CKNNA analysis.
[Weaknesses]
- The paper only presents results for shorter training iterations compared to SiT and REPA. Although REG demonstrates higher training efficiency, it is unclear whether the performance remains strong or even improves with extended training. For instance, would training SiT-L/2 to 700K iterations or SiT-XL/2 to 4M iterations yield better results (Table 1)? Similarly, would REG benefit from training to 800 epochs (Table 2)?
- REG incorporates four loss terms: velocity prediction and representation alignment for both latent and class tokens. However, ablation studies only remove the class token’s representation alignment loss (Table 6). There are no experiments that remove the class token’s velocity prediction loss or employ alignment using only class tokens. More detailed ablation is necessary to identify which loss components contribute most to REG’s performance.
- Both REPA and this paper on REG omit a simple but strong baseline: allocating more training computation to timesteps that contribute more to performance. For example, the EDM [A, B] or Stable Diffusion 3 [C] paper samples timesteps from a logit-normal distribution instead of uniform, and emphasizes higher noise levels through timestep shifting for high-resolution images. Given that the core motivation of REG is to improve training speed, excluding this baseline weakens the claim.
- Table 2 lacks baseline CFG performance without guidance interval, making comparisons incomplete.
[A] Karras et al., Elucidating the Design Space of Diffusion-Based Generative Models, NeurIPS 2022.
[B] Karras et al., Analyzing and Improving the Training Dynamics of Diffusion Models, CVPR 2024.
[C] Esser et al., Scaling Rectified Flow Transformers for High-Resolution Image Synthesis, ICML 2024.
Questions
- In the appendix, ImageNet 512×512 results are presented, but there is no EDM2 [B] baseline, and REG underperforms compared to EDM2. While outperforming EDM2 is not necessarily the goal, it would be valuable to include a discussion on REG’s applicability to EDM2 or on why EDM2 performs better.
- In Table 3, why does DINO-B outperform DINO-L? An explanation or analysis would help clarify this counterintuitive result.
[B] Karras et al., Analyzing and Improving the Training Dynamics of Diffusion Models, CVPR 2024.
Limitations
Yes
Final Justification
The extended training results and the 512×512 results are impressive, and the ablation studies are thorough.
Formatting Concerns
None
Thank you very much for your valuable comments and the time you devoted to reviewing our manuscript. We address each of your comments in detail below.
[W1]: REG improves the performance with extended training.
We further train REG, and as training progresses, it consistently shows performance improvements across different task settings. For example, the two tables below present the ImageNet results without and with CFG, respectively.
ImageNet without CFG. Compared to other approaches, REG achieves the best performance, reaching an FID of 1.8 at 4M training steps, which is 4.1 lower than REPA.
| Model | Iter | FID↓ |
|---|---|---|
| SiT-XL/2 | 400K | 17.2 |
| SiT-XL/2 | 7M | 8.3 |
| +REPA | 200K | 11.1 |
| +REPA | 400K | 7.9 |
| +REPA | 1M | 6.4 |
| +REPA | 4M | 5.9 |
| + REG (ours) | 200K | 5.0 |
| + REG (ours) | 400K | 3.4 |
| + REG (ours) | 1M | 2.7 |
| + REG (ours) | 2.4M | 2.2 |
| + REG (ours) | 4M | 1.8 |
ImageNet with CFG. REG exhibits consistently improving performance with extended training time, achieving its best FID score of 1.36 at 800 epochs.
| Model | Epochs | FID↓ | sFID↓ | IS↑ | Pre.↑ | Rec.↑ |
|---|---|---|---|---|---|---|
| SiT-XL/2 | 1400 | 2.06 | 4.50 | 270.3 | 0.82 | 0.59 |
| + REPA | 800 | 1.42 | 4.70 | 305.7 | 0.80 | 0.65 |
| + REG (ours) | 80 | 1.86 | 4.49 | 321.4 | 0.76 | 0.63 |
| + REG (ours) | 160 | 1.59 | 4.36 | 304.6 | 0.77 | 0.65 |
| + REG (ours) | 480 | 1.40 | 4.24 | 296.9 | 0.77 | 0.66 |
| + REG (ours) | 800 | 1.36 | 4.25 | 299.4 | 0.77 | 0.66 |
[W2]: The effectiveness of an extra discriminative class token and of its denoising task.
Ablation studies (Table 6) evaluate entanglement alone by removing the alignment loss (the REPA loss) and assessing the extra noised class token together with its denoising task (the velocity prediction loss). The results show that introducing a noised class token without a matching denoising objective causes a large performance drop (FID 98.67). This is because, without class token denoising, the model cannot autonomously generate a discriminative class token to guide inference; it also creates an inconsistency between training and inference that disrupts the generation process, since no explicit class token is provided at inference and the diffusion model cannot rely on it as it did during training.
In contrast, introducing the noised class token together with a dedicated denoising task during training yields the best performance (FID 26.67). This is because the class token denoising loss enables the diffusion model to autonomously generate a semantically discriminative class token during inference, which can then effectively guide the generation of the other tokens. Moreover, this setup ensures consistency between training and inference.
| Method | Class token | Velocity prediction loss for class token | FID↓ | sFID↓ | IS↑ |
|---|---|---|---|---|---|
| SiT-B/2 | × | × | 33.0 | 6.46 | 43.70 |
| + Entanglement | | | | | |
| DINOv2-B | ✓ | × | 98.67 | 9.80 | 12.36 |
| DINOv2-B | ✓ | ✓ | 26.67 | 6.88 | 59.37 |
Therefore, using the class token alone without the corresponding denoising task (velocity prediction loss) yields no substantial benefit, while employing both jointly leads to clear improvements.
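To make the two entanglement settings in the table concrete, below is a minimal, self-contained sketch of the entangled forward pass under our own simplifying assumptions (placeholder tensors, a stand-in transformer instead of SiT, and one common flow-matching convention for the velocity target); it is an illustration rather than our released implementation.

```python
import torch
import torch.nn as nn

B, N, D, D_vf = 4, 256, 1152, 768            # hypothetical sizes (SiT-XL tokens, DINOv2-B class token)
z0  = torch.randn(B, N, D)                   # clean VAE latent tokens after patch embedding (placeholder)
cls = torch.randn(B, D_vf)                   # clean DINOv2-B class token (placeholder)

proj = nn.Linear(D_vf, D)                    # learnable projector into the SiT token space
x0 = torch.cat([proj(cls).unsqueeze(1), z0], dim=1)   # [B, N+1, D] entangled clean sequence

t   = torch.rand(B, 1, 1)                    # one shared timestep per sample
eps = torch.randn_like(x0)
xt  = (1 - t) * x0 + t * eps                 # every token, including the class token, is noised
v_target = eps - x0                          # flow-matching velocity target (one common convention)

# Stand-in for the SiT backbone; timestep/label conditioning omitted for brevity.
sit = nn.TransformerEncoder(nn.TransformerEncoderLayer(D, 8, batch_first=True), num_layers=2)
v_pred = sit(xt)                             # joint velocity prediction for class + image tokens

beta = 0.03                                  # weight of the class-token velocity loss
loss = ((v_pred[:, 1:] - v_target[:, 1:]) ** 2).mean() \
       + beta * ((v_pred[:, :1] - v_target[:, :1]) ** 2).mean()
loss.backward()                              # third table row; dropping the beta term gives the second row
```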
[W3, Q1]: REG achieves better performance in the higher-resolution ImageNet 512×512 experiments.
Following your suggestion, we further train REG on ImageNet 512×512 and compare it with EDM2 [1]. Among all methods in the table below, REG achieves the best performance.
| Model | Epochs | FID↓ | sFID↓ | IS↑ | Pre.↑ | Rec.↑ |
|---|---|---|---|---|---|---|
| Pixel diffusion | | | | | | |
| VDM++ | — | 2.65 | — | 278.1 | — | — |
| ADM-G, ADM-U | 400 | 2.85 | 5.86 | 221.7 | 0.84 | 0.53 |
| Simple diffusion (U-Net) | 800 | 4.28 | — | 171.0 | — | — |
| Simple diffusion (U-ViT, L) | 800 | 4.53 | — | 205.3 | — | — |
| EDM2 | — | 1.81 | — | — | — | — |
| Latent diffusion, Transformer | | | | | | |
| MaskDiT | 800 | 2.50 | 5.10 | 256.3 | 0.83 | 0.56 |
| DiT-XL/2 | 600 | 3.04 | 5.02 | 240.8 | 0.84 | 0.54 |
| SiT-XL/2 | 600 | 2.62 | 4.18 | 252.2 | 0.84 | 0.57 |
| + REPA | 80 | 2.44 | 4.21 | 247.3 | 0.84 | 0.56 |
| + REPA | 100 | 2.32 | 4.16 | 255.7 | 0.84 | 0.56 |
| + REPA | 200 | 2.08 | 4.19 | 274.6 | 0.83 | 0.58 |
| + REG (ours) | 80 | 1.68 | 3.87 | 306.9 | 0.80 | 0.63 |
[1] Karras et al., Analyzing and Improving the Training Dynamics of Diffusion Models, CVPR 2024.
[W4]: REG performs better under CFG without the guidance interval.
Following your suggestion, we add the CFG results without the guidance interval that were missing from Table 2, and find that REG performs better than REPA.
| Model | Epochs | FID↓ | sFID↓ | IS↑ | Pre.↑ | Rec.↑ |
|---|---|---|---|---|---|---|
| SiT-XL/2 | 1400 | 2.06 | 4.50 | 270.3 | 0.82 | 0.59 |
| + REPA w/o interval | 200 | 1.96 | 4.49 | 264.0 | 0.82 | 0.60 |
| + REG w/o interval (ours) | 200 | 1.68 | 4.36 | 312.3 | 0.80 | 0.62 |
[Q2]: Explanation and analysis of why REG performs better with DINOv2-B than with DINOv2-L.
Thank you for your question. We provide the explanation and analysis for this issue below:
REG performs better with DINOv2-B, which is consistent with REPA. The observation that REG performs better with DINOv2-B than with DINOv2-L is consistent with findings reported in the REPA study. REPA does not exhibit a performance improvement when using larger encoders. Please refer to the content reproduced below from Table 2 of the original REPA paper, along with our experimental results.
| Method | Target Repr. | FID↓ |
|---|---|---|
| SiT-L/2 | - | 18.8 |
| + REPA | DINOv2-B | 9.7 |
| + REPA | DINOv2-L | 10.0 |
Alignment difficulty and hyperparameter selection affect performance. Specifically, aligning shallow-layer representations of SiT (e.g., 4 layers) with deep-layer representations of DINOv2 (the final layer) may result in representational mismatches and increased training instability, thereby hindering effective alignment. For REG, final converged loss values reflect alignment and generation quality. Ablations on REG + SiT-B/2 (targeting DINOv2-B/L) use REPA’s loss formulation, reporting denoising (velocity prediction), alignment (negative cosine similarity), and total losses.
Specifically, lower velocity prediction and total losses indicate better generation, while an alignment loss (in [−1, 1]) closer to −1 indicates stronger alignment between SiT and DINOv2, reflecting more effective semantic understanding. The results show that all three losses are lower when aligning with DINOv2-B than with DINOv2-L. This highlights that SiT-B/2 aligns more easily and effectively with DINOv2-B, resulting in superior generative performance relative to DINOv2-L (a sketch of how these losses combine is given after the table).
| Method | Target Repr. | FID↓ | Velocity prediction loss↓ | Alignment loss↓ | Total loss↓ |
|---|---|---|---|---|---|
| REG + SiT-B/2 | DINOv2-B | 15.22 | 0.77 | -0.31 | 0.46 |
| REG + SiT-B/2 | DINOv2-L | 17.36 | 0.78 | -0.29 | 0.49 |
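For reference, a hedged sketch of how these reported losses fit together in the overall training objective (our shorthand notation; the exact weights and normalization follow Equation 11 of the paper, and β is the class-token weight discussed next):

```latex
\mathcal{L}_{\text{total}}
  \;=\; \underbrace{\mathcal{L}_{\text{vel}}^{\text{img}} + \beta\,\mathcal{L}_{\text{vel}}^{\text{cls}}}_{\text{velocity prediction}}
  \;+\; \lambda\,\underbrace{\mathcal{L}_{\text{align}}}_{\text{negative cosine similarity}}
```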
Furthermore, our hyperparameter ablation experiments for DINOv2-L show that, with appropriate tuning of β (0.03 → 0.05), the weight of the class-token velocity prediction loss, the performance gap between DINOv2-B and DINOv2-L can be significantly reduced.
| Method | Target Repr. | β | FID↓ |
|---|---|---|---|
| SiT-B/2 | - | - | 33.00 |
| + REG | DINOv2-B | 0.03 | 15.22 |
| + REG | DINOv2-L | 0.01 | 22.23 |
| + REG | DINOv2-L | 0.03 | 17.36 |
| + REG | DINOv2-L | 0.05 | 15.53 |
| + REG | DINOv2-L | 0.07 | 16.76 |
Overall, alignment difficulty and hyperparameter selection affect performance. To ensure a fair comparison, we follow the same experimental setup as REPA and adopt DINOv2-B as the alignment target across all models in our experiments.
Thank you for the detailed rebuttal. The extended training results and the 512×512 results are impressive, and the ablation studies are thorough. I will raise my score to 5.
Dear Reviewer xw43,
We are happy to hear that our rebuttal addressed your concerns well. Also, we appreciate your support for our work. Following your insightful suggestions, we will update the corresponding experimental results in future versions to provide a more comprehensive evaluation.
Thank you once again for your invaluable feedback.
Best regards,
Authors
The authors propose REG (Representation Entanglement for Generation) which improves/accelerates diffusion model training by integrating a high-level class token from pretrained vision models (e.g., DINO v2) into low-level image latents. Unlike REPA, which implicitly aligns the discriminative clean features with the image latents only during training, REG extends the alignments to be more explicit, by entangling the low-level image latents with the high-level class token from the pretrained vision models. This enables generation of coherent image-class pairs from noise, enhancing quality and training speed with minimal computational cost.
Strengths and Weaknesses
- Strengths:
- The paper is written clearly.
- The proposed method is simple but effective.
- Good visualizations of the proposed method.
- Experiment results are impressive.
- Weaknesses
- Performance in T2I setting is questionable.
- Applicability of the proposed method to the widely-used pretrained diffusion models in T2I setup.
- Some details are not clear.
- Missing baseline; REPA-E.
Overall, this paper lacks a baseline comparison with REPA-E, and performance report in T2I setting, which is more widely used than class-conditional setting. It would be much more impactful if the proposed method can be applicable in T2I setting, and even to the pretrained large models. However, the proposed method has a reasonable motivation with a suitable method, and shows good performance in class-conditional setting compared to the recent baseline.
I think the proposed method is good enough to give accept. However, I am not sure if it could be impactful as the performance in T2I setting (which is more widely used in practice) is questionable. If the authors can clarify some of my concerns, especially performance in T2I setting, I would raise my score to strong accept.
Questions
- Can this method be still effective in T2I task? I guess the text embedding includes rich information as much as dino cls embeddings, and therefore the proposed method might not be as effective as in the simple class conditional setting.
Where do you get the parameters for the linear layer that maps the class token into the diffusion model's embedding space? Is it trained together?
- in ? How? and .
Dimension of the reference features and the output of the projector — are they in the same space? I guess they should be, as one is kind of a proxy ground truth. Now, how do you maximize cosine similarity? The inner product might yield a full token-by-token similarity matrix. Do you only maximize the diagonal terms (perhaps 1 if normalized)?
Related works involve the conditioning signal in the diffusion process, as this paper proposes, i.e., the condition is noised and denoised together. Can this only be used for training from scratch? Can it be used for finetuning pretrained large models?
- How is FID measured?
- Are the same models used for Tab 1 and Tab 2?
- For Tab 1, I guess it is unconditional generation? 50k for synthetic and 50k for real?
- For Tab 2, it is confusing. CFG requires the model to take a conditional signal. Is it given through cross attention layer? Is one-hot class conditioned? How many samples are used to measure?
- “Applying the loss in earlier layers yields superior results.” Why is this?
Limitations
- Performance in T2I setting is not reported.
- Applicability to pretrained T2I large models, e.g., SD1, SD3, FLUX.
- Comparison with REPA-E
Final Justification
The author response during the rebuttal has resolved my concerns.
To my understanding, the strengths of this paper are: on top of REPA (which proposed a soft regularization that can boost diffusion training speed by leveraging the pretrained knowledge of DINOv2), this paper further suggests a method to incorporate the class token of pretrained large-scale discriminative models into diffusion models. The experiment results show that their idea yields non-trivial benefits (faster training and better performance). They also show that their method can be generalized beyond the classifier-guidance setting.
Thus, I recommend to accept.
Formatting Concerns
N/A
Thank you very much for your valuable comments and the time you devoted to reviewing our manuscript. We address each of your comments in detail below.
[W1, Q1]: Performance in T2I setting.
Thank you for your question. We add text-to-image (T2I) results in the Appendix by adopting the same experimental configurations as REPA and UViT [1], utilizing MMDiT [2] as the backbone network. The results are presented below. REG demonstrates superior generation quality compared to alternative approaches on MS-COCO [3].
| Method | Type | FID↓ |
|---|---|---|
| AttnGAN | GAN | 35.49 |
| DM-GAN | GAN | 32.64 |
| VQ-Diffusion | Discrete Diffusion | 19.75 |
| DF-GAN | GAN | 19.32 |
| XMC-GAN | GAN | 9.33 |
| Frido | Diffusion | 8.97 |
| LAFITE | GAN | 8.12 |
| U-Net | Diffusion | 7.32 |
| U-ViT-S/2 | Diffusion | 5.95 |
| U-ViT-S/2 (Deep) | Diffusion | 5.48 |
| MMDiT (SDE; NFE=250) | Diffusion | 5.30 |
| MMDiT + REPA (SDE; NFE=250) | Diffusion | 4.14 |
| MMDiT + REG (SDE; NFE=250) | Diffusion | 3.07 |
[1] All are Worth Words: A ViT Backbone for Diffusion Models, CVPR 2023
[2] Scaling Rectified Flow Transformers for High-Resolution Image Synthesis, ICML 2024
[3] Microsoft COCO: Common Objects in Context, ECCV 2014
[W2, Q5]: Applicability to pretrained T2I large models.
REG is not limited to training from scratch; it can also be applied to fine-tuning pretrained large-scale models. However, due to time constraints and limited computational resources, we currently lack sufficient resources to fine-tune REG on pretrained large-scale text-to-image (T2I) models (e.g., SD1, SD3, FLUX). In future work, we plan to progressively extend REG to these models in order to evaluate its generalization capabilities and scalability in more complex generative settings.
[W3, Q2]: Explanation about the linear layer.
The parameters of the linear layer are randomly initialized and trained together with the diffusion model. The specific process is: the class token of DINOv2-B is projected by a learnable linear layer from the DINOv2-B feature dimension to the SiT hidden dimension. This linear layer is jointly trained with the entire diffusion model (SiT) to minimize the total loss function (Equation 11).
[W3, Q3, Q4]: Details of calculating the negative cosine similarity.
Thank you for your question. In the final version of the manuscript, we will incorporate further revisions to improve clarity and facilitate better understanding. The following provides a detailed description:
- The reference representation from DINOv2 (pseudo ground truth) is obtained by concatenating the DINOv2 class token with the patch-token features.
- The diffusion-side features are the output of the n-th SiT block, mapped by a trainable MLP projection.
- The alignment loss (Equation 9) computes the negative cosine similarity between these two feature sets and is minimized to encourage alignment.
- We compute the negative cosine similarity only between corresponding token positions (i.e., the diagonal elements), yielding a [B, N+1] matrix rather than a full [B, N+1, N+1] similarity matrix.
- The detailed pseudocode is presented below, where `F_0` and `W` denote the two feature sets being aligned (see the comments in the code) and `D_vf` denotes the feature dimension.
import torch
import torch.nn.functional as F

# B: batch size, N+1: sequence length (class token + N patch tokens), D_vf: feature dimension
# F_0: [B, N+1, D_vf] - reference features (e.g., from SiT)
# W: [B, N+1, D_vf] - predicted features (e.g., from DINOv2)
# Step 1: Normalize both feature sets along the feature dimension
F_0 = F.normalize(F_0, dim=-1) # [B, N+1, D_vf]
W = F.normalize(W, dim=-1) # [B, N+1, D_vf]
# Step 2: Compute element-wise dot product between corresponding vectors
temp = F_0 * W # [B, N+1, D_vf]
# Step 3: Sum across feature dimension to compute negative cosine similarity for each position
align_loss = -torch.sum(temp, dim=-1) # [B, N+1]
# Step 4: Average across all tokens and batches to get scalar alignment loss
align_loss = align_loss.mean() # Final scalar loss (e.g., -0.26)
[Q6]: How is FID measured?
We follow the evaluation protocol of REPA for computing the Fréchet Inception Distance (FID) [1], using the Inception-V3 model to extract 2048-dimensional features from both generated and real images. The FID is then calculated based on the mean and covariance of these feature distributions over 50,000 samples.
[1] Gans trained by a two time-scale update rule converge to a local nash equilibrium, NeurIPS 2017
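For completeness, a hedged sketch of the standard Fréchet-distance computation on such features (the textbook formula with tiny placeholder arrays so it runs standalone; our actual evaluation uses the common reference implementation):

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between two Gaussians fitted to Inception-V3 pool features."""
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):            # drop small imaginary parts from sqrtm
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

# In practice feats_real / feats_fake are [50000, 2048] Inception-V3 features;
# tiny random placeholders are used here only to keep the snippet runnable.
feats_real = np.random.randn(1000, 16)
feats_fake = np.random.randn(1000, 16)
mu_r, cov_r = feats_real.mean(0), np.cov(feats_real, rowvar=False)
mu_f, cov_f = feats_fake.mean(0), np.cov(feats_fake, rowvar=False)
print(frechet_distance(mu_f, cov_f, mu_r, cov_r))
```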
[Q6.1]: Are the same models used for Table 1 and Table 2?
Yes, the same models are used for both Table 1 and Table 2. Table 1 reports results without using Classifier-Free Guidance (CFG), whereas Table 2 reports results with CFG.
[Q6.2]: Explanation of Table 1 and Table 2.
Both Table 1 and Table 2 report models trained for conditional generation using label inputs. In both tables, 50,000 images are generated and evaluated against real images using standard quantitative metrics.
[Q6.3]: Details of Classifier-Free Guidance (CFG) in REG.
Classifier-Free Guidance enhances the model's responsiveness to conditioning signals during inference by combining conditional and unconditional predictions. This improves the quality and diversity of generated images while avoiding the complexity of external classifier guidance. Following SiT and REPA, we sample 50K images for evaluation.
- Form of conditioning input
- In REG + SiT, the conditioning input is formed by the sum of the timestep embedding and the label embedding following DiT [1]. This combined vector serves as the conditioning signal.
- Condition injection mechanism (AdaLN)
- The conditioning vector is passed through an MLP and fed into the Adaptive Layer Normalization (AdaLN) [1] modules of each SiT block.
- AdaLN maps the conditioning vector to layer-wise `scale` and `shift` parameters that modulate the feature maps (see the sketch below).
[1] DiT: Scalable Diffusion Models with Transformers, ICCV 2023
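For clarity, hedged sketches of the two mechanisms described above, written in standard DiT/CFG notation rather than copied from our implementation (c is the summed timestep and label embedding, w the guidance scale, ∅ the null label):

```latex
% AdaLN modulation inside each SiT block
h \;\leftarrow\; \mathrm{scale}(c)\,\odot\,\mathrm{LayerNorm}(h) \;+\; \mathrm{shift}(c)

% Classifier-free guidance applied to the predicted velocity at sampling time
\hat{v}_\theta(x_t, c) \;=\; v_\theta(x_t, \varnothing) \;+\; w\,\bigl(v_\theta(x_t, c) - v_\theta(x_t, \varnothing)\bigr)
```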
[Q7]: Further explanation of why applying the alignment loss in earlier layers yields superior results.
REG is built upon REPA and adopts the same configuration. We provide detailed ablation studies on alignment depth below. First, REG consistently outperforms REPA across different alignment depths, and applying the alignment loss in earlier layers yields superior results, which is consistent with REPA’s findings.
| Method | Target Repr. | Depth | FID↓ | sFID↓ | IS↑ | Pre.↑ | Rec.↑ |
|---|---|---|---|---|---|---|---|
| SiT-B/2 | - | - | 33.00 | 6.46 | 43.70 | 0.53 | 0.63 |
| + REPA | DINOv2-B | 4 | 24.40 | 6.40 | 59.90 | 0.59 | 0.65 |
| + REG | DINOv2-B | 2 | 18.19 | 6.67 | 83.96 | 0.62 | 0.64 |
| + REG | DINOv2-B | 4 | 15.22 | 6.89 | 94.64 | 0.64 | 0.63 |
| + REG | DINOv2-B | 6 | 16.31 | 7.11 | 91.72 | 0.63 | 0.64 |
| + REG | DINOv2-B | 8 | 17.31 | 7.23 | 87.78 | 0.63 | 0.63 |
We believe this reflects an inherent trade-off: the alignment depth determines the balance between the diffusion model's ability to understand and its ability to generate. Aligning at earlier layers allows the diffusion model to acquire DINOv2 semantics in its early stages, enabling subsequent layers to build on a strong representation and focus on capturing more high-frequency details for generation. However, aligning too early is not always better. For example, "depth = 2" performs worse than "depth = 4", indicating a trade-off between representation understanding and generation. Overall, earlier layers tend to yield superior results.
[Q8]: Comparison with REPA-E.
Thank you for your suggestion. Following the REPA-E paper, we compare the results of REG and REPA-E, and find that REG achieves better performance on ImageNet 256×256.
| Model | Tokenizer | Epochs | FID↓ | sFID↓ | IS↑ |
|---|---|---|---|---|---|
| REPA-E | SD-VAE* | 20 | 12.83 | 5.04 | 88.8 |
| REPA-E | SD-VAE* | 40 | 7.17 | 4.39 | 123.7 |
| REPA-E | SD-VAE* | 80 | 4.07 | 4.60 | 161.8 |
| REG (ours) | SD-VAE | 20 | 8.90 | 5.50 | 125.3 |
| REG (ours) | SD-VAE | 40 | 5.02 | 4.88 | 161.2 |
| REG (ours) | SD-VAE | 80 | 3.44 | 4.87 | 184.1 |
'*' indicates that the VAE is updated during end-to-end training by REPA-E.
Thank you for the detailed response. I raised my score. Please merge the rebuttal update to the final version of the paper.
Dear Reviewer vBTu,
We are happy to hear that our detailed rebuttal addressed your concerns well. Also, we appreciate your support for our work. Following your insightful suggestions, we will merge the rebuttal content and update the corresponding results in the final version of the paper.
Thank you once again for your invaluable feedback.
Best regards,
Authors
REPA allows training diffusion models faster through representation alignment with a pre-trained vision foundation models such as DINO. However, during sampling, models trained with REPA are not conditioned on any information from the vision backbone used for alignment. REG concatenates a noised class token from foundation models (e.g., DINOv2) with noised image latents as input to the diffusion transformer, enabling joint reconstruction of both image and semantic content. Experiments on ImageNet 256×256 show significant training acceleration and improved generation quality.
Strengths and Weaknesses
Strengths
Simplicity: REG is simple, general and easy to implement. It does not seem to require intensive hyperparameter tuning
Clarity: The background section and explanation of the method are easy to follow, and can be understood on the first read. Figure 2 is easy to understand, and allows the reader to quickly understand their approach.
Weaknesses
Training overhead: while the authors discuss the sampling overhead, starting at line 219, they do not discuss the training overhead compared to training a regular SiT. In particular, they do not report the total cost of training with REG compared to a vanilla SiT without alignment.
Unfair claims: occasionally, the authors make claims that I would judge to be unfair and that could mislead the reader. I would like the authors to revise their manuscript to clarify the following parts. Namely:
- Lines 99-100
However, these methods rely on external alignment mechanisms that remain inactive during denoising inference, fundamentally limiting their capacity to fully exploit discriminative representations.
REG does not use the external alignment mechanisms either during sampling. The above suggests that it does, hence it is misleading to the reader.
- Lines 136-138
To address the problems of REPA, the external alignment is absent during the entire denoising inference process, falling short of fully harnessing the potential of discriminative information.
Similarly, this is misleading, since REG does not rely on the external alignment module either.
- Lines 165-166:
Full utilization of discriminative information.
"Full" suggests that REG cannot be improved further. This cannot be proved empirically, hence it is not a fair claim.
- Lines 169-189
This addresses a critical limitation of REPA, which relies on the external alignment mechanism that is absent during denoising inference.
This is not a fair claim: REG does not rely on the external alignment mechanism either during inference, only on the class token.
Questions
Noise schedule: have the authors considered using a different noise scheduler for the diffusion process on the class embedding? Intuitively, I would expect that using a noise scheduler that is quicker to denoise the class token could be beneficial. Indeed, it means that the other tokens (that estimate the score for image tokens) could rely on the class embedding more, potentially leading to increased performance. I would appreciate if you could include such an experiment in your submission.
Additional parameters: Why does REG require 2M more parameters than REPA (Table 1)? I did not see an explanation for this fact.
Failure Cases: Under what conditions does REG perform poorly? Are there specific classes or image types where the method struggles? Including a more fine-grained evaluation (e.g., performance per class) would be interesting, for example compared to REPA or SiT.
Limitations
The only limitation discussed by the authors is that models might be under-trained. No other limitations are discussed. One could argue that while their approach is effective, just as REPA, it requires using an external model, hence the approach is not end-to-end.
Final Justification
I believe the paper should be accepted: it's simple, well-written, well-tested, and works well. I do not think it is ground-breaking though, which is why I did not give it a higher score.
Formatting Concerns
No particular formatting concerns
Thank you very much for your valuable comments and the time you devoted to reviewing our manuscript. We address each of your comments in detail below.
[W1]: Analysis of training overhead in REG.
During training, REG requires alignment with DINOv2-B representations. We pre-process the DINOv2-B features offline and store them locally. As a result, during the training of the diffusion model, the DINOv2-B features can be directly loaded through the dataloader with almost no additional FLOPs (a small sketch of this pre-processing is given after the table below).
We summarize the total training overhead in the table below, reporting the costs required to reach the same performance upper bounds claimed in the original SiT paper. All experiments are conducted on 8 NVIDIA A40 GPUs. Our results show that REG requires only 110K training steps to reach the performance level of SiT trained for 7M steps, reducing GPU hours by 98.36%. In addition, the single-step training speed of the two models shows no significant difference. These results highlight the training efficiency of REG, demonstrating faster convergence and significantly lower training overhead compared to prior methods.
| Model | FID↓ | Training Step | All GPU hours↓ | Single step training speed↓ |
|---|---|---|---|---|
| SiT-XL/2 | 8.3 | 7M | 2380 | 1.22 s/step |
| + REG (ours) | 8.2 | 110K | 39 (-98.36%) | 1.28 s/step |
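A minimal sketch of this offline pre-processing, assuming the public torch.hub DINOv2 entry point; the exact preprocessing and storage format in our pipeline may differ:

```python
import torch
from PIL import Image
from torchvision import transforms

# Hedged sketch: extract and cache DINOv2-B features once, before diffusion training.
encoder = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14').eval().cuda()
prep = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

@torch.no_grad()
def cache_features(image_path: str, out_path: str) -> None:
    x = prep(Image.open(image_path).convert('RGB')).unsqueeze(0).cuda()
    out = encoder.forward_features(x)        # dict with normalized class / patch tokens
    torch.save({'cls': out['x_norm_clstoken'].cpu(),
                'patch': out['x_norm_patchtokens'].cpu()}, out_path)

# The training dataloader then loads these cached tensors alongside the VAE latents,
# so no DINOv2 forward pass is needed while the diffusion model trains.
```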
[W2]: Modify some claims.
Thank you very much for pointing out the inappropriate phrasing in our manuscript. We will revise our manuscript accordingly in the final version to ensure clarity and accuracy.
- Lines 99-100 revision: However, these methods rely on external alignment mechanisms that neither take the discriminative representations as input nor denoise them, and are therefore unable to produce discriminative representations during inference to guide the generation process.
- Lines 136-138 revision: In REPA, the inability to autonomously generate discriminative representations to guide generation during inference may limit how effectively discriminative information is leveraged.
- Lines 165-166 revision: Improved utilization of discriminative information.
- Lines 169-189 revision: This design aims to address a limitation of REPA, which cannot autonomously generate discriminative representations to guide generation during inference, because it relies on an external alignment mechanism during training to utilize discriminative features, rather than incorporating them as input and applying the corresponding denoising task.
[Q1]: Faster denoising noise scheduler for class token.
Thank you for your suggestion. Following your recommendation, we explore using different types of noise schedulers for different tokens when transforming velocity predictions to estimate the score. Specifically, we apply a cosine path type to the class token, aiming to achieve faster denoising. For the other tokens, we retain the linear path type, assuming they could benefit from stronger reliance on the class token during the generation process. We provide the detailed ablation results below.
| Model | FID↓ | class token | the other tokens |
|---|---|---|---|
| REG + SiT-XL/2 | 1.8 | linear | linear |
| REG + SiT-XL/2 | 2.2 | cosine | linear |
However, this modification does not lead to performance improvements. We infer that the lack of gain stems from the training–inference inconsistency. During training, all tokens, including the class token, use the linear interpolant path for noise scheduling. Therefore, employing a different noise scheduler for the class token during inference creates a mismatch that negatively affects the model's performance.
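For reference, the two interpolant paths mentioned above can be written in the standard SiT-style form (a hedged sketch with x_0 the clean token and ε Gaussian noise; exact parameterizations follow the SiT codebase):

```latex
\text{linear:} \quad x_t = (1 - t)\,x_0 + t\,\epsilon
\qquad
\text{cosine (GVP):} \quad x_t = \cos\!\left(\tfrac{\pi t}{2}\right) x_0 + \sin\!\left(\tfrac{\pi t}{2}\right)\epsilon
```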
Due to time constraints, we have not yet retrained the model under consistent training and inference configurations. In future work, we plan to retrain a model where the class token is trained and inferred under a faster denoising scheduler to further evaluate the effectiveness of this strategy.
[Q2]: Explanation of the additional parameters in REG.
Thank you for your comment. We explain the additional parameters (2M) introduced by REG below. Specifically, REG includes the following extra components: a Linear layer that maps the class token from the DINOv2-B representation to the SiT space, a LayerNorm layer applied to the mapped class token, and an additional Linear layer at the end of the network that decodes the class token into the final output prediction. These components collectively enable the transformation and utilization of the class token within the REG framework. The detailed implementation and parameter calculations are provided below:
#DINOv2-B hidden_size
cls_token_dim = 768
#SiT-XL hidden_size
hidden_size = 1152
# project the class token into the SiT embedding space via a linear layer
self.cls_projectors2 = nn.Linear(in_features=cls_token_dim, out_features=hidden_size, bias=True)
# apply normalization to the mapped class token
self.norm = nn.LayerNorm(hidden_size, elementwise_affine=True, eps=1e-6)
# decode class token into the final output prediction
self.linear_cls = nn.Linear(hidden_size, cls_token_dim, bias=True)
# Total extra parameters = cls_projectors2 + norm + linear_cls
#   = (768 × 1152 + 1152) + (1152 + 1152) + (1152 × 768 + 768)
#   = 1.77M ≈ 2M
[Q3]: Discussion on failure cases.
Thank you for your suggestion. We conducted a large-scale comparison of the visualization results between REPA and REG. Overall, REG demonstrates a substantial improvement in image quality compared to REPA. However, for a small subset of challenging samples, such as tench (a kind of fish), both REG and REPA perform poorly. We attribute this to label noise in the ImageNet training data. For instance, these images depict a person holding a tench, but the provided label corresponds only to the tench. This discrepancy can introduce ambiguity during training and hinder the model's ability to learn accurate class-conditional alignment. Due to time constraints, we leave a more fine-grained evaluation for future work.
[Q4]: Discussion on limitations.
Thank you for your comment. We agree that the reliance on an external model, such as DINOv2, limits the end-to-end nature of our approach. In addition, REG has not yet been extensively explored in terms of architecture and task diversity. There are several promising directions for future work:
- Extending to broader tasks and models, such as video generation and integration with autoregressive generative models, which may benefit from more complex discriminative guidance.
- Adopting stronger variants of VAE to enhance the model's ability to generate more accurate and diverse representations, potentially improving the overall performance in generative tasks.
We plan to investigate these directions, along with improved integration schemes and generalization to more diverse domains in future work.
Thank you for replying to my reviews. While I was curious about additional ablations, such as using a different noise schedule for the class token, I understand that it might not be feasible to run experiments in a week.
Nonetheless, I encourage you to perform this ablation for the camera-ready, as I believe it would be a meaningful improvement to your work, and I suspect that it could lead to stronger performance too.
Dear Reviewer 1Q9L,
Thank you very much for your response and thoughtful feedback. Due to time constraints, we have not yet explored more noise scheduling strategies. However, we sincerely appreciate your insightful suggestion. We are particularly interested in your suggestion for using different noise schedules, especially for the class token. We believe that such meaningful scheduling may improve the convergence behavior and the overall generation quality.
As you suggested, we plan to conduct additional ablation studies on this aspect and include a more comprehensive exploration of noise scheduling schemes in the camera-ready version and future work.
Thank you once again for your invaluable feedback.
Best regards,
Authors
The paper proposes a new method for training diffusion models for image synthesis.
A one-dimensional vector, typically obtained from self-supervised vision foundation models, that represents the "semantic contents" of the image is chosen and is noised similarly to the latents (from a VAE) corresponding to the image.
The DiT is then tasked to denoise (predict the flow field of) both the noisy latents and the noisy representational feature.
Through this auxiliary token injection into the image sequence and modeling of the task, the authors show that the representation quality and generation quality of the DiT thus obtained are greatly improved. Furthermore, they also show that training convergence is dramatically accelerated.
Strengths and Weaknesses
Strengths: Thorough experimental study. Good ablations and analysis. Well-written and easy to understand.
Weakness: Effectiveness of entanglement alone: While there are results showcasing the effectiveness of entanglement in terms of generation quality, it's not clear how the representation quality benefits from the addition of the extra token and denoising task.
REPA shows that the representation quality in the DiT and the performance improve with a better vision encoder. This study seems to contradict that, with performance decreasing with a bigger encoder (DINOv2-L). This might be because of the chosen hyperparameters / having the same timestep sampling schedule for the representation part as well as the latents. But there's some missing stuff here.
While the comparison with respect to training iterations is helpful, it's also important to compare against SiT with respect to training FLOPs. REG, like REPA, has an alignment loss which requires additional FLOPs compared to the standard SiT formulation.
Evaluation of the generated representation: Experiments indicating the ability of the DiT to generate an accurate representation are missing.
Questions
How is it different than https://arxiv.org/abs/2504.16064v1?
What happens to the representation quality in the standalone case (when you don't have the REPA loss)?
I feel like a concrete explanation/investigation into how exactly the extra token and task help image quality is lacking. It is mentioned that the key limitation of REPA is that it doesn't have some alignment option during inference, but it's not obvious why that is needed or is a limitation in general.
How do you think the vision foundation model chosen for the representation affects learning in the REG case compared to REPA?
Limitations
Yes
Formatting Concerns
No
Thank you very much for your valuable comments and the time you devoted to reviewing our manuscript. We address each of your comments in detail below.
[W1, Q2]: The effectiveness of an extra noised class token and of its denoising task.
Ablation studies (Table 6) evaluate entanglement alone by removing the alignment loss (the REPA loss) and assessing the extra noised class token together with its denoising task (the velocity prediction loss). The results show that introducing a noised class token without a matching denoising objective causes a large performance drop (FID 98.67). This is because, without class token denoising, the model cannot autonomously generate a discriminative class token to guide inference; it also creates an inconsistency between training and inference that disrupts the generation process, since no explicit class token is provided at inference and the diffusion model cannot rely on it as it did during training.
In contrast, introducing the noised class token together with a dedicated denoising task during training yields the best performance (FID 26.67). This is because the class token denoising loss enables the diffusion model to autonomously generate a semantically discriminative class token during inference, which can then effectively guide the generation of the other tokens. Moreover, this setup ensures consistency between training and inference.
| Method | Class token | Velocity prediction loss for class token | FID↓ | sFID↓ | IS↑ |
|---|---|---|---|---|---|
| SiT-B/2 | × | × | 33.0 | 6.46 | 43.70 |
| + Entanglement | | | | | |
| DINOv2-B | ✓ | × | 98.67 | 9.80 | 12.36 |
| DINOv2-B | ✓ | ✓ | 26.67 | 6.88 | 59.37 |
Therefore, using the class token alone without the corresponding denoising task (velocity prediction loss) yields no substantial benefit, while employing both jointly leads to clear improvements.
[W2]: Explanation and analysis of why REG performs better with DINOv2-B than with DINOv2-L.
Thank you for your question. We provide the explanation and analysis for this issue below:
REG performs better with DINOv2-B, which is consistent with REPA. The observation that REG performs better with DINOv2-B than with DINOv2-L is consistent with findings reported in the REPA study. REPA does not exhibit a performance improvement when using larger encoders. Please refer to the content reproduced below from Table 2 of the original REPA paper, along with our experimental results.
| Method | Target Repr. | FID↓ |
|---|---|---|
| SiT-L/2 | - | 18.8 |
| + REPA | DINOv2-B | 9.7 |
| + REPA | DINOv2-L | 10.0 |
Alignment difficulty and hyperparameter selection affect performance. Specifically, aligning shallow-layer representations of SiT (e.g., 4 layers) with deep-layer representations of DINOv2 (the final layer) may result in representational mismatches and increased training instability, thereby hindering effective alignment. For REG, final converged loss values reflect alignment and generation quality. Ablations on REG + SiT-B/2 (targeting DINOv2-B/L) use REPA’s loss formulation, reporting denoising (velocity prediction), alignment (negative cosine similarity), and total losses.
Specifically, lower velocity prediction and total losses indicate better generation, while an alignment loss (in [−1, 1]) closer to −1 indicates stronger alignment between SiT and DINOv2, reflecting more effective semantic understanding. The results show that all three losses are lower when aligning with DINOv2-B than with DINOv2-L. This highlights that SiT-B/2 aligns more easily and effectively with DINOv2-B, resulting in superior generative performance relative to DINOv2-L.
| Method | Target Repr. | FID↓ | Velocity prediction loss↓ | Alignment loss↓ | Total loss↓ |
|---|---|---|---|---|---|
| REG + SiT-B/2 | DINOv2-B | 15.22 | 0.77 | -0.31 | 0.46 |
| REG + SiT-B/2 | DINOv2-L | 17.36 | 0.78 | -0.29 | 0.49 |
Furthermore, our hyperparameter ablation experiments for DINOv2-L show that, with appropriate tuning of β (0.03 → 0.05), the weight of the class-token velocity prediction loss, the performance gap between DINOv2-B and DINOv2-L can be significantly reduced.
| Method | Target Repr. | β | FID↓ |
|---|---|---|---|
| SiT-B/2 | - | - | 33.00 |
| + REG | DINOv2-B | 0.03 | 15.22 |
| + REG | DINOv2-L | 0.01 | 22.23 |
| + REG | DINOv2-L | 0.03 | 17.36 |
| + REG | DINOv2-L | 0.05 | 15.53 |
| + REG | DINOv2-L | 0.07 | 16.76 |
Overall, alignment difficulty and hyperparameter selection affect performance. To ensure a fair comparison, we follow the same experimental setup as REPA and adopt DINOv2-B as the alignment target across all models in our experiments.
[W3]: Analysis of Training Overhead in REG.
During training, REG requires alignment with DINOv2-B representations. We pre-process the DINOv2-B features offline and store them locally. As a result, during the training of the diffusion model, the DINOv2-B features can be directly loaded through the dataloader with almost no additional FLOPs.
We summarize the total training overhead in the table below, reporting the costs required to reach the same performance upper bounds claimed in the original SiT paper. All experiments are conducted on 8 NVIDIA A40 GPUs. Our results show that REG requires only 110K training steps to reach the performance level of SiT trained for 7M steps, reducing GPU hours by 98.36%. In addition, the single-step training speed of the two models shows no significant difference. These results highlight the training efficiency of REG, demonstrating faster convergence and significantly lower training overhead compared to prior methods.
| Model | FID↓ | Training Step | All GPU hours↓ | Single step training speed↓ |
|---|---|---|---|---|
| SiT-XL/2 | 8.3 | 7M | 2380 | 1.22 s/step |
| + REG (ours) | 8.2 | 110K | 39 (-98.36%) | 1.28 s/step |
[W4]: The linear probing accuracy of the generated representation.
Following REPA’s setup, we compare REG and REPA’s linear probing accuracy on ImageNet. Results show REG’s superior performance, reflecting stronger representational understanding.
| Model | Acc↑ |
|---|---|
| REPA + SiT-XL/2 | 74.6 |
| REG + SiT-XL/2 (ours) | 76.3 |
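For context, a minimal hedged sketch of the linear-probing protocol (placeholder features and sizes; the actual protocol follows REPA and trains only a linear classifier on features extracted from a frozen layer of the diffusion backbone):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feature_dim = 1152                       # hypothetical: hidden size of the frozen SiT-XL/2 backbone
probe = nn.Linear(feature_dim, 1000)     # only this linear head is trained

feats  = torch.randn(8, feature_dim)     # placeholder pooled features from a frozen intermediate layer
labels = torch.randint(0, 1000, (8,))    # placeholder ImageNet labels
loss = F.cross_entropy(probe(feats), labels)
loss.backward()
```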
[Q1]: What are the differences between REG and ReDi?
REG and ReDi [1] are proposed during the same period. REG is simpler, faster in both training and inference, and achieves superior performance. We provide a detailed comparison below:
- Motivation:
  - REG: Overcomes REPA's unaddressed issue: external alignment makes REPA unable to produce discriminative representations during inference to guide generation.
  - ReDi: Investigates integrating representation learning with generative modeling, enforcing the diffusion model to learn data-latent joint distributions.
- Representation selection and usage methods:
  - REG (Representation Entanglement with Class Token): This method employs a single global class token, which is concatenated with low-level latent tokens as the input.
  - ReDi (Representation Fusion with Dense Tokens): In contrast, this approach utilizes multiple semantic dense tokens, which are fused with low-level latent tokens via Principal Component Analysis (PCA) to construct the input.
- Computational overhead of REG is lower:
  - The extra cost of REG is handling a single class token.
  - The extra cost of ReDi is running PCA on all dense tokens.
- REG performs significantly better than ReDi:

| Model | Iter | FID↓ |
|---|---|---|
| ReDi | 1M | 5.1 |
| ReDi | 3M | 3.7 |
| REG | 400K | 3.4 |
| REG | 1M | 2.7 |
[1] ReDi: https://arxiv.org/abs/2504.16064v1
[Q3]: Mechanism Difference between REPA and REG.
REPA aligns with discriminative representations during training, without incorporating them as input and applying the corresponding denoising loss. This design causes REPA to be unable to produce discriminative representations during inference to guide the generation process. In contrast, REG treats the discriminative class token as part of the input and applies a corresponding denoising loss, allowing it to guide training. During inference, REG can explicitly generate the discriminative class token and leverage it to guide the generation process.
[Q4]: The choice of vision foundation model compared with REPA.
REG's strategy for choosing the vision foundation model is consistent with REPA, as explained in our response to W2. Specifically, DINOv2-B performs better than DINOv2-L for both REG and REPA, according to Table 2 of REPA and our ablation experiments for REG. Furthermore, to ensure a fair comparison, we adopt the same vision foundation model as used in REPA.
The paper introduces REG, a method for accelerating the training of diffusion models. It is a simple extension to REPA that aligns spatial information like in REPA, but additionally denoises the class tokens from the pretrained vision models. This approach significantly accelerates training and improves generation quality with minimal overhead. The method is simple and effective, and the proposed empirical evaluation is convincing. Because the approach is very simple, and the topic very timely, I think this paper makes a very interesting contribution. All reviewers have acknowledged the quality of this submission. A shared concern about ViT-{L/B} results was properly addressed by the authors. Because of all the above, I recommend accepting this paper as an oral at NeurIPS 2025.