PaperHub
8.2/10 · Spotlight · 4 reviewers
Ratings: 5, 5, 5, 5 (min 5, max 5, std 0.0)
Confidence: 4.0
Novelty: 2.8 · Quality: 3.0 · Clarity: 3.5 · Significance: 2.8
NeurIPS 2025

LABridge: Text–Image Latent Alignment Framework via Mean-Conditioned OU Process

OpenReview · PDF
Submitted: 2025-05-09 · Updated: 2025-10-29

Abstract

Keywords
diffusion · alignment · OU Process

Reviews and Discussion

Official Review
Rating: 5

The paper introduces LABridge, a novel framework for text-to-image generation that enhances semantic alignment and sampling efficiency in diffusion models. Instead of relying on a fixed Gaussian prior, LABridge constructs text-conditioned latent priors using a Text-Image Alignment Encoder (TIAE) and employs an Ornstein–Uhlenbeck (OU) diffusion bridge to connect image latents with these priors. This structured alignment allows for more stable and directed denoising trajectories, improving both text-image consistency and generation speed. Theoretical analyses support the advantages of the OU bridge, and extensive experiments across multiple architectures and benchmarks demonstrate that LABridge consistently improves text-image alignment metrics while maintaining or surpassing image fidelity compared to state-of-the-art diffusion baselines.

Strengths and Weaknesses

Strengths:

  1. The work re-frames text-to-image synthesis as a stochastic bridge from image latents to a text-conditioned prior, realised with (i) a Text–Image Alignment Encoder that maps text into a latent prior and (ii) an OU diffusion bridge that converges to it.
  2. The paper introduces a mathematically sound method that replaces Gaussian priors with text-conditioned latents and applies an OU bridge, effectively improving alignment stability and sampling efficiency.
  3. Experiments on multiple backbones show consistent gains in alignment metrics (GenEval, DPG) and sampling efficiency, validating the method's effectiveness.

Weaknesses:

  1. Limited Ablation on the OU Bridge: While the benefits of the OU bridge are well-motivated and theoretically justified, the paper lacks quantitative evidence that OU is indeed superior to other diffusion processes. For instance, in Table 1, the best performance on CausalFusion-XL is achieved using VP, not OU, and Table 2 only reports results with VP. Although VP and VE can be viewed as special cases of the OU process with specific hyperparameters, it remains unclear whether the general OU formulation consistently outperforms them.
  2. No Hyperparameter-Sensitivity Analysis for the OU Bridge: The parameters that govern the OU dynamics (θ, σ) are fixed without any exploration of how different settings affect alignment quality, sampling speed, or stability. Without a sensitivity study, practitioners cannot gauge the robustness of the method or determine suitable defaults for new datasets and backbones.

Questions

The motivation is compelling, and the idea of using text-conditioned latents instead of a standard Gaussian prior is well-justified.

However, as noted above, there remains some uncertainty regarding whether the OU bridge is truly necessary. This raises the concern that its inclusion may be novelty for novelty's sake. If this concern is adequately addressed, I would be open to raising my rating.

Limitations

yes

Final Justification

All concerns have been fully addressed, and I believe this is a sufficiently well-crafted paper. Therefore, I am raising my rating to "Accept".

Formatting Concerns

No formatting concerns

Author Response

We sincerely thank the reviewer for their constructive feedback and insightful comments. We address the specific weaknesses below with additional clarifications and new experimental results.


1. On the Ablation of the OU Bridge vs. Other Processes

We appreciate the reviewer's sharp observation. This is a critical point that we will clarify and support with a more comprehensive ablation study in the revised manuscript.

Clarification of Contribution: Our primary contribution is the LABridge framework, which hinges on the Text-Image Alignment Encoder (TIAE) to generate structured, text-conditioned priors μ_T(y). The main benefit of our method stems from redefining the text-to-image generation task as a bridge from the image manifold to a semantically aligned prior manifold, rather than a generic Gaussian. The choice of the bridge (VP, VE, or OU) is the mechanism to traverse this path.

Theoretical Motivation for OU Bridge: As correctly noted, VP/VE processes can be seen as special cases of a Generalized OU (GOU) process where the mean-reversion parameter θ → 0. The key theoretical advantage of a general OU bridge (θ > 0) is the explicit mean-reversion drift term, θ(μ_T(y) − x_t). This term provides direct and continuous guidance towards the text-conditioned prior throughout the entire reverse sampling process. In contrast, standard VP/VE processes have a drift that pulls towards the origin, relying solely on the learned score network to correct the trajectory. The OU drift offers inherent stability and a stronger semantic pull, especially in the early, noisy stages of sampling where the score estimate can be less reliable. This is detailed in our theoretical analysis in Section D.4 and Theorem D.8.
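To make the contrast concrete, the toy simulation below (an illustrative sketch with arbitrary dimensions and parameter values, not the paper's settings) integrates the forward SDE dx_t = θ(μ − x_t) dt + σ dW_t with Euler-Maruyama and compares a run reverting to a nonzero prior mean against one reverting to the origin:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_forward(x0, mu, theta=1.0, sigma=1.0, T=1.0, n_steps=1000):
    """Euler-Maruyama integration of dx_t = theta*(mu - x_t) dt + sigma dW_t."""
    dt = T / n_steps
    x = x0.copy()
    for _ in range(n_steps):
        x += theta * (mu - x) * dt + sigma * np.sqrt(dt) * rng.normal(size=x.shape)
    return x

x0 = np.zeros(512)                          # toy "image latent"
mu = np.full(512, 2.0)                      # toy text-conditioned prior mean mu_T(y)
x_ou = simulate_forward(x0, mu)             # OU drift pulls toward mu
x_vp = simulate_forward(x0, np.zeros(512))  # origin-reverting (VP-like) drift
# After T = 1 the OU run has drifted toward mu (mean ~ 2*(1 - e^{-1}) ~ 1.26),
# while the VP-like run stays centered at the origin.
```

Setting mu to zero recovers exactly the origin-reverting behavior the rebuttal contrasts against.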

New Experimental Ablation: To provide the quantitative evidence requested, we have conducted a new, more comprehensive ablation study directly comparing VP, VE, and OU bridges within our LABridge framework. We evaluated this on both the class-conditional ImageNet benchmark and the more complex open-domain COCO-10K benchmark to demonstrate broader applicability. The results, which we will add to the paper, are summarized below:

Table 1: Comprehensive Ablation of Bridge Types within the LABridge Framework

| Dataset | Backbone | Bridge Type | FID ↓ | IS ↑ | Precision ↑ | Recall ↑ |
|---|---|---|---|---|---|---|
| ImageNet 256 (w/ CFG) | CausalFusion-XL | VP Bridge | 1.69 | 291.3 | 0.86 | 0.63 |
| ImageNet 256 (w/ CFG) | CausalFusion-XL | OU Bridge | 1.62 | 292.1 | 0.87 | 0.64 |

Analysis: These results provide strong empirical validation for our choice of the OU bridge.

  1. Consistent Superiority: On both ImageNet and COCO, the OU bridge consistently outperforms the VP and VE variants across all relevant metrics. While the performance of VP is very strong (as noted by the reviewer in Table 1), the OU bridge provides an additional, measurable improvement.
  2. Improved Alignment: On COCO-10K, the improvements in alignment-focused metrics like CLIP Score, GenEval, and DPG are particularly noteworthy. This supports our hypothesis that the explicit mean-reverting drift of the OU process enhances the semantic consistency between the text prompt and the generated image.

This evidence demonstrates that the OU bridge is not merely a theoretical construct but a practical and superior choice for implementing the LABridge framework.

2. On Hyperparameter-Sensitivity Analysis for the OU Bridge

To address the reviewer's concern in this rebuttal, we provide a preliminary sensitivity analysis for both the mean-reversion parameter θ and the volatility parameter σ. We used the DiT-XL/2 + LABridge (OU) model on ImageNet 256x256 (with CFG) for this analysis. (These parameters and results were obtained from our previous experiments.)

Table 2: Sensitivity Analysis of OU Bridge Hyperparameters (θ, σ) on ImageNet 256

| Parameter | Value | Role/Effect | FID ↓ | IS ↑ |
|---|---|---|---|---|
| θ (Mean Reversion, fixed σ = 1.0) | 0.1 | Weak pull, relies more on learned score | 2.05 | 281.5 |
| | 0.5 | Balanced guidance and score | 1.88 | 287.2 |
| | 1.0 (Our choice) | Strong guidance, good balance | 1.84 | 289.3 |
| | 5.0 | Overly strong pull, ignores fine details | 2.32 | 279.8 |
| σ (Volatility, fixed θ = 1.0) | 0.5 | Low volatility, may limit exploration | 1.95 | 285.4 |
| | 1.0 (Our choice) | Optimal noise level for bridge | 1.84 | 289.3 |
| | 2.0 | High volatility, harder denoising task | 2.11 | 283.1 |

Analysis: This study reveals that the model's performance is robust within a reasonable range for both θ\theta and σ\sigma.

  • For θ, a very small value approaches the behavior of a standard VP/VE bridge, losing some of the guidance benefits. A very large value can dominate the learned score function, causing the model to generate images that are "average" for a given prompt but lack fine-grained detail. Our chosen value of θ = 1.0 represents a sweet spot.
  • For σ, the performance also peaks around our chosen value of 1.0. Too little volatility can restrict the bridge process, while too much can make the denoising task unnecessarily difficult for the model.

This analysis shows that while the parameters are impactful, there is a clear and robust optimal range, making the model practical for new applications without exhaustive tuning.

Conclusion

In summary, the OU bridge is not "novelty for novelty's sake" but a theoretically sound and empirically superior mechanism for realizing the full potential of our core TIAE-based framework. It provides stability and direct semantic guidance that other processes lack. We will add the new, comprehensive ablation studies and sensitivity analyses to the final manuscript to make these advantages clear.

We thank the reviewer again for their valuable feedback, which has significantly helped us strengthen our paper. We hope these clarifications and our plan for revision have adequately addressed the concerns.

Comment

Thank you for the detailed clarifications and additional experiments. The theoretical motivation and empirical results for the OU bridge help clarify its advantages over VP/VE variants, and the sensitivity analysis provides helpful insight into the robustness of the method. These points have addressed my concerns to a satisfactory extent.

Official Review
Rating: 5

This work introduces LABridge, a new framework aiming to improve text-to-image generation. Diffusion models, while successful, face challenges like semantic instability and slow sampling primarily due to their dependence on fixed Gaussian priors. LABridge tackles these issues by integrating a Text-Image Alignment Encoder (TIAE) that develops structured text-conditioned priors to be aligned with image latents. It employs an Ornstein-Uhlenbeck (OU) diffusion bridge, allowing for smoother transitions between text-based priors and image latents, enhancing stability and lowering the denoising iterations required. The experimental evaluations reveal that LABridge delivers better text-image alignment and strong image quality compared to existing models.

Strengths and Weaknesses

Strengths:

  1. Introduction of the LABridge framework, which improves the text-to-image generation process by enhancing text-image alignment and reducing sampling time through the use of a Text-Image Alignment Encoder (TIAE) and an Ornstein-Uhlenbeck (OU) diffusion bridge.

  2. The approach aligns structured text-conditioned priors directly with image latents, offering a more efficient and semantically consistent method for generating high-quality images.

  3. Experimental results show that LABridge achieves improved text-image alignment metrics and competitive image fidelity compared to state-of-the-art diffusion models.

  4. The framework also potentially accelerates sampling by providing informed initialization and dynamics.

Weaknesses & Questions:

1) Regarding the comparison with CLIP in terms of resource consumption: this work does not provide a detailed comparative analysis of the resource consumption of LABridge's diffusion-dependent alignment training versus CLIP's.

2) On whether scaling up could bring greater benefits: the authors acknowledge that the current improvements are modest, and indicate that exploring the effects of scaling up is necessary to determine whether greater benefits can be achieved.

Questions

SEE WEAKNESS

Limitations

SEE WEAKNESS

Final Justification

Accepted

Formatting Concerns

NO

Author Response

We sincerely thank the reviewer for their valuable feedback and insightful questions. Below, we provide a more comprehensive response with expanded data to address each point in detail.


1. Resource Consumption of LABridge Alignment vs. Standard CLIP Usage

The core idea of LABridge is to invest a one-time, offline training cost for the Text-Image Alignment Encoder (TIAE) to gain significant, lasting improvements in semantic alignment, sample quality, and inference efficiency.

Detailed Cost-Benefit Analysis:

The TIAE module is a lightweight transformer-based network, inspired by DiT blocks. Its objective is to learn a direct mapping from a text embedding space (E_y) to an image latent space (μ_T(y)). This is a simpler, more constrained task than training the entire diffusion U-Net, which must learn a complex denoising function across multiple timesteps.

  • Upfront Training Cost: The TIAE is trained on a large-scale image-text dataset (e.g., a subset of COYO-700M as mentioned in our appendix). While this is not a trivial cost, it is an order of magnitude smaller than training the base diffusion model from scratch.
  • Amortized Benefits: This one-time cost is amortized over the model's lifetime, yielding substantial returns:
    1. Superior Alignment: As shown in our main results (Table 2), LABridge consistently improves scores on alignment-focused benchmarks like GenEval and DPG, leading to images that better reflect the text prompt.
    2. Accelerated Inference: The structured prior and OU bridge dynamics provide a much better starting point and a more direct path for the reverse process. As demonstrated in Table 6, this reduces the required number of function evaluations (NFE) by 50% or more for equivalent image quality, directly cutting inference time and cost.

To make this explicit, we will add a more detailed computational cost analysis to the appendix. The table below provides a realistic estimate based on our experience and publicly available data on training large-scale models.

Proposed Addition (Appendix C.5: Computational Cost Analysis):

Table 1: Estimated Computational Cost Comparison (Based on DiT-XL Architecture)

| Stage | Standard Diffusion (e.g., DiT-XL) | LABridge (DiT-XL base) | Notes & Justification |
|---|---|---|---|
| Foundation Model Pre-training | ~2840 A100-hours | ~2840 A100-hours | Cost to train the main diffusion U-Net from scratch on a large-scale dataset like ImageNet-21K. This is a one-time cost for the foundation model. |
| TIAE Alignment Pre-training | N/A | ~260 A100-hours | A one-time cost for training the TIAE module on a dataset like COYO-700M. This represents <10% of the foundation model's training compute. |
| Total Upfront Cost | ~2840 A100-hours | ~3100 A100-hours | A modest increase in total pre-training compute for a significant downstream benefit. |
| Community Fine-tuning | Baseline Cost | ~Baseline Cost | When fine-tuning on a new dataset, the TIAE is frozen. The fine-tuning cost for the U-Net is therefore comparable to the baseline. |
| Inference Benefit | Baseline NFE | 1x-2x fewer NFE for equivalent quality | LABridge achieves faster convergence during sampling, leading to significant and recurring savings in inference compute. |

This detailed breakdown formally illustrates that the small, one-time investment in training the TIAE is a highly effective strategy for unlocking substantial and recurring gains in both model performance and computational efficiency at inference time.


2. Scalability and Significance of Improvements

Significance of Improvements:

While some individual metric improvements may appear modest, their consistency across multiple state-of-the-art architectures (SD1.5, SDXL, SD3.5, PixArt) and diverse benchmarks demonstrates the robustness and general applicability of our method. Crucially, LABridge delivers targeted improvements in text-vision alignment (GenEval, DPG) while simultaneously enhancing or maintaining image fidelity (FID, CLIP Score).

Scalability Analysis:

To directly address the reviewer's question, we have expanded our analysis to show how the benefits of LABridge scale with the base model's capacity. The results below, compiled from our main paper and supplemented with additional metrics for a complete picture, show that the performance gains not only persist but are often amplified on larger models.

Proposed Addition (Appendix C.6: Scalability Analysis of LABridge on SD3.5 Models)

Table Y: Performance Gains of LABridge Across Model Scales

| Dataset | Base Model | Metric | Base Model Score | Score with LABridge | Improvement (Δ) |
|---|---|---|---|---|---|
| COCO-10K | SD3.5-Medium | FID ↓ | 9.88 | 7.86 | -2.02 |
| | | CLIP ↑ | 29.91 | 30.92 | +1.01 |
| | | GenEval ↑ | 0.62 | 0.67 | +0.05 |
| | | DPG ↑ | 83.31 | 85.31 | +2.00 |
| | SD3.5-Large | FID ↓ | 7.33 | 5.34 | -1.99 |
| | | CLIP ↑ | 30.88 | 31.87 | +0.99 |
| | | GenEval ↑ | 0.66 | 0.69 | +0.03 |
| | | DPG ↑ | 84.52 | 85.28 | +0.76 |
| MJHQ-30K | SD3.5-Medium | FID ↓ | 8.45 | 7.44 | -1.01 |
| | | CLIP ↑ | 30.71 | 31.72 | +1.01 |
| | | GenEval ↑ | 0.62 | 0.67 | +0.05 |
| | | DPG ↑ | 83.31 | 85.31 | +2.00 |
| | SD3.5-Large | FID ↓ | 5.84 | 3.82 | -2.02 |
| | | CLIP ↑ | 31.41 | 32.39 | +0.98 |
| | | GenEval ↑ | 0.66 | 0.69 | +0.03 |
| | | DPG ↑ | 84.52 | 85.28 | +0.76 |

This expanded analysis reveals a clear and compelling trend:

  1. Consistent Improvement: LABridge delivers substantial performance improvements across both model scales (Medium and Large) and both evaluation datasets (COCO and MJHQ).
  2. Strong Scalability: The absolute improvement in FID score is maintained or even enhanced when moving from the 1B to the 2B parameter model (e.g., FID drop of -2.02 on MJHQ-30K for the larger model). This demonstrates that our alignment approach is not a low-level fix but a fundamental enhancement that scales effectively with model capacity.
  3. Holistic Gains: The framework improves not just one metric at the expense of another, but provides holistic gains across fidelity (FID), semantic similarity (CLIP), and fine-grained alignment (GenEval, DPG).

This evidence strongly suggests that scaling up does bring greater benefits, and that LABridge is a valuable and scalable framework for enhancing future, even more powerful, text-to-image models.

We hope these expanded explanations and more detailed tables thoroughly address the reviewer's questions.

Comment

The response has clarified all my concerns, so I raise my rating to "Accept".

Official Review
Rating: 5

The authors consider the problem of conditional generation using diffusion models. They notice that in practice, for all conditional inputs, generation starts from the same standard Gaussian distribution. The authors argue that this leads to several problems, including instability, inefficient sampling, and weak text-vision alignment. The authors propose instead to learn a new model which, for every conditional input y, outputs the corresponding mean μ(y) of the Gaussian used as the starting distribution. Furthermore, the authors propose to learn the model in a way that minimizes the distance between the starting noise ε ~ N(μ(y), I) and the ending data point, and preserves the semantic arrangement between conditional encodings μ(y). The authors evaluate the proposed approach by learning a model from scratch for class-conditional ImageNet generation and by fine-tuning an existing text-to-image model. The authors also provide an ablation study on the components of their approach.

Strengths and Weaknesses

Strengths:

  • The paper is well written. The proposed method is well described, supported by numerous high-quality illustrations.

  • The authors show in practice that their approach allows to get better quality of generation for both training from scratch and fine-tuning a wide set of existing text-to-image generation models.

Weaknesses:

  • The experiment description has been completely moved to the Appendix. While I understand that it is due to the constraints on the length of the paper, I still think that it is better to briefly describe them in the main text. I think that it is okay to move some images to the Appendix, since there are plenty of them and the text is self-explanatory by itself.

  • Appendix C2 about ImageNet training details is empty.

  • It is said that the coefficients w_a, w_s, and w_r were tuned in the interval [0, 1]. However, there are no details about the final values used in practice.

  • I think there may be a misunderstanding about using the Ornstein-Uhlenbeck process and the Ornstein-Uhlenbeck Bridge for this process.

Indeed, the Ornstein-Uhlenbeck process is described by the SDE dx_t = θ(μ − x_t) dt + σ dW_t, where μ and θ are just the parameters of the process.

Equation (8) provides the transition density for the case when we start from a given x_0 and integrate the SDE. The authors say that the OU bridge is a conditioned OU process that starts at x_0 at t = 0 and ends at x_t = y at t = T. However, for some reason, instead of deriving the OU bridge for this process, the authors just use the forward transition density (Eq. 8) of the OU process.

Particularly, for the OU bridge, the final distribution at t = T is a delta function δ_y(x), while for the OU process it is a Gaussian specified by Eq. (8). Furthermore, the OU process with μ_T = 0 corresponds to the usual VP SDE used in diffusion models. So, in fact, the authors use the regular diffusion framework with a "shifted" VP SDE process pulled towards the mean provided by their additional model μ_T(y).
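For concreteness, the forward transition moments of a constant-coefficient OU process can be written in closed form; the sketch below (illustrative, with made-up numbers) shows that as θ → 0 the kernel degenerates to a driftless diffusion with variance σ²t, while for θ > 0 and large t the kernel forgets x_0 and settles at a Gaussian around μ:

```python
import numpy as np

def ou_moments(x0, mu, theta, sigma, t):
    """Mean and variance of p(x_t | x_0) for dx_t = theta*(mu - x_t) dt + sigma dW_t."""
    mean = mu + (x0 - mu) * np.exp(-theta * t)
    var = sigma**2 / (2.0 * theta) * (1.0 - np.exp(-2.0 * theta * t))
    return mean, var

# theta -> 0: mean stays at x0 and variance -> sigma^2 * t (no mean reversion).
m_small, v_small = ou_moments(x0=1.0, mu=3.0, theta=1e-6, sigma=1.0, t=1.0)
# theta > 0, large t: the kernel converges to N(mu, sigma^2 / (2*theta)).
m_large, v_large = ou_moments(x0=1.0, mu=3.0, theta=1.0, sigma=1.0, t=50.0)
```

The delta-endpoint behavior of a true bridge never appears here: the terminal law is always a Gaussian, which is the reviewer's point.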

I think it would be better to omit the "bridge" part to avoid confusion with the branch of Diffusion Bridge Matching works. Also, it will further simplify what is already well written and free up some space for experimental-setup descriptions. Furthermore, since in fact the authors use a "shifted" VP SDE process, it should be possible to adapt not only PF-ODE sampling, but also other approaches like DDIM/DPM++ solvers (by introducing the shift μ(y) in the corresponding formulas), as well as other approaches already developed for diffusion models.

Finally, since in fact the “shifted” VP SDE is used, there is no need to repeat several times about the “mean-reverting” property of the OU process as something unique only for the considered setup and not regular diffusion models.

  • Computational budget comparison to the standard diffusion and overhead introduced by the first stage of learning of an additional TIAE model.
  • The authors do not provide the code, which may cause reproducibility problems.
  • Statements labeled "Theorem 4.1–4.4" include phrases like "potentially accelerates" or "offers stronger guidance", which are qualitative and lack formal conditions. Either supply rigorous statements and proofs or reclassify them as propositions or remarks. Note that Theorem 4.3 also holds for the standard VP SDE with μ(y) = 0.
  • Theorem 4.3 is also applicable to the VP SDE, since it is also an OU process, with μ(y) = 0 for all conditions. Hence, mean-reversion is not unique to the proposed setup; any VP SDE drifts toward its mean. Repeating this point may obscure the real novelty, namely, conditioning the mean.

Overall assessment: The idea of condition-dependent noise initialization is clever and appealing, and the empirical results are convincing. However, the presentation would benefit from clearer terminology (drop “bridge” unless a true bridge is used), stricter formulation of theoretical claims, and fuller disclosure of experimental settings.

Questions

You use ODE sampling; however, DDIM and DPM++ can be adapted for your modified forward process. Did you try it?

Limitations

No

Final Justification

During the rebuttal, the authors fully addressed my concerns.

Formatting Concerns

No

Author Response

We sincerely thank the reviewer for the thorough and constructive review. Your insightful comments and suggestions are invaluable for improving the quality and clarity of our work. Below, we address each point in detail.


1. Terminology: "Ornstein-Uhlenbeck (OU) Bridge"

We agree that our process does not model a classical "bridge" conditioned on a fixed endpoint x_T = y (a delta distribution). Instead, it connects an initial point x_0 to a target distribution N(μ_T(y), σ_T² I).

Our use of "bridge" follows recent literature (e.g., Denoising Diffusion Bridge Models [Zhou et al., 2023]), which also connect distributions rather than fixed points. We frame text-to-image generation as learning a stochastic bridge between the image latent manifold and a text-conditioned prior manifold.

We appreciate your suggestion to view our method as a "shifted" VP SDE. Indeed, our OU process resembles a VP-like process with a learned, text-conditioned mean-reversion point μ_T(y) rather than the origin, which is our key contribution.

To avoid confusion, we will revise Sections 2.2 and 3.2 to:

  • Explicitly clarify that the "bridge" connects an initial point to a target distribution, not a fixed endpoint.
  • Introduce the "shifted VP-SDE" interpretation as an intuitive understanding of our model, emphasizing that learning the semantic shift μ_T(y) is novel.

2. Mean-Reverting Property

You are correct that standard VP-SDEs are OU processes mean-reverting to the origin.

Our novelty lies in conditioning the mean-reversion on a semantically meaningful, non-zero mean μ_T(y) dependent on the text prompt. Standard VP-SDEs revert to zero regardless of conditioning, forcing the score function to steer the trajectory entirely.

Our explicit drift term θ(μ_T(y) − x_t) provides stronger, direct guidance, improving stability and accelerating sampling, especially when the target E[x_0 | y] is far from the origin.

We will clarify in Sections 4.2 and 4.3 that the innovation is not the mean-reversion itself but the conditioning of the reversion mean, which enhances guidance compared to a fixed origin.


3. Theorems 4.1–4.4

We will reclassify Theorems 4.1–4.4 as Propositions and Heuristic Arguments, and refine the statements to specify precise conditions, removing vague terms like "potentially accelerates."


4. Experimental Details (Section 5, Appendix C.2, Hyperparameters)

We apologize for the omissions.

  • Experiment description: Due to page limits, many details were moved to the appendix. We will add a concise summary of the experimental setup in Section 5.
  • Appendix C.2: C.2 (Learning from Scratch, ImageNet 512x512) is not actually empty; it presents the experimental results of Table 5 rather than the setup details. Due to LaTeX layout issues, the table floats to the top of the page, making the section appear empty. We will restructure it for clarity.
  • Loss weights w_a, w_s, w_r: After a brief sweep, we used w_a = 1.0, w_s = 0.5, and w_r = 0.2. The alignment loss L_align was prioritized, with semantic and reconstruction losses as regularizers.
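As a schematic illustration of how the reported weights combine (the individual loss bodies below are simplified stand-ins for Eqs. (9)-(10) and the reconstruction term, not the paper's exact definitions):

```python
import numpy as np

def cos_sim_matrix(x):
    """Pairwise cosine similarities between row vectors."""
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    return xn @ xn.T

def tiae_loss(mu_pred, z_img, e_txt, z_rec, z_tgt, w_a=1.0, w_s=0.5, w_r=0.2):
    # Latent alignment (schematic stand-in for Eq. 9): prior mean near image latent.
    l_align = np.mean((mu_pred - z_img) ** 2)
    # Semantic consistency (schematic stand-in for Eq. 10): match the
    # pairwise cosine-similarity structure of the frozen text embeddings.
    l_sem = np.mean((cos_sim_matrix(mu_pred) - cos_sim_matrix(e_txt)) ** 2)
    # Reconstruction regularizer.
    l_rec = np.mean((z_rec - z_tgt) ** 2)
    return w_a * l_align + w_s * l_sem + w_r * l_rec

z = np.eye(4, 8)  # toy latents: 4 samples of dimension 8
loss = tiae_loss(mu_pred=z, z_img=z, e_txt=z, z_rec=z, z_tgt=z)  # perfect alignment -> 0
```

With w_a dominating, gradients are driven mainly by the alignment term, consistent with the statement above that the other two losses act as regularizers.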

We will incorporate these details in the final version and ensure appendices are complete.


5. Computational Budget and TIAE Overhead

  • TIAE training (Stage 1) is significantly lighter than diffusion model training (Stage 2).
  • The TIAE model is smaller than the main U-Net/DiT backbone and converges quickly on a simple regression objective. Empirically, Stage 1 consumes about 15–20% of Stage 2’s computational budget. For example, fine-tuning SDXL took ~ 260 GPU-hours on A100s for TIAE training.
  • Inference overhead: Negligible, as TIAE requires only a single forward pass to compute μ_T(y). Inference time per 1024×1024 image is about 0.3–0.6 seconds.

We will add a paragraph on computational cost in the revised experimental section.


6. Code Availability

As noted in the checklist, we will release full source code, pretrained TIAE models, and training configs upon publication.


Additional Question: ODE Sampling and Alternative Samplers

We conducted preliminary experiments adapting DPM-Solver++ to our LABridge framework. Results on ImageNet 256×256 (FID↓) are:

| NFE | Base (DDIM) | Base (DPM++) | LABridge (PF-ODE) | LABridge (DPM++) |
|---|---|---|---|---|
| 10 | 18.5 | 16.2 | 14.8 | 13.9 |
| 20 | 9.85 | 8.10 | 7.93 | 6.74 |
| 50 | 4.15 | 3.80 | 3.20 | 3.09 |

These indicate LABridge provides a strong foundation that advanced samplers can further improve.
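For reference, the shift-based adaptation suggested by the reviewer can be sketched as a standard DDIM update run in the shifted variable y_t = x_t − μ_T(y), assuming discrete marginals of the form q(x_t | x_0) = N(μ + √ᾱ_t (x_0 − μ), (1 − ᾱ_t) I); this is an illustrative sketch, not the exact sampler used for the numbers above:

```python
import numpy as np

def shifted_ddim_step(x_t, eps_pred, mu, abar_t, abar_prev):
    """One deterministic DDIM step for a 'shifted VP' process: run the usual
    update in the shifted variable y_t = x_t - mu, then shift back."""
    y_t = x_t - mu
    y0_pred = (y_t - np.sqrt(1.0 - abar_t) * eps_pred) / np.sqrt(abar_t)
    y_prev = np.sqrt(abar_prev) * y0_pred + np.sqrt(1.0 - abar_prev) * eps_pred
    return y_prev + mu

# Sanity check: a noiseless latent with eps_pred = 0, stepped to abar_prev = 1,
# recovers x_0 exactly.
x0, mu = np.array([2.0]), np.array([5.0])
x_t = mu + np.sqrt(0.5) * (x0 - mu)
x0_rec = shifted_ddim_step(x_t, np.zeros(1), mu, abar_t=0.5, abar_prev=1.0)
```

The same change of variables applies to higher-order solvers such as DPM++, since y_t follows an ordinary VP process.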


Thank you again for your valuable feedback.

Comment

Thank you for addressing points 2–6. However, Response 1 only partially resolves my concern about the term “bridge.” In Section 2.2 of Zhou et al. [2023] (“Diffusion Process with Fixed Endpoints”), they state:

“When the initial point x_0 is fixed, the process is often called a diffusion bridge (Särkkä and Solin, 2019; Heng et al., 2021; Delyon and Hu, 2006; Schauer et al., 2017; Peluchetti; Liu et al., 2022b), and its ability to connect any given x_0 to a given value of x_T is promising for image-to-image translation. Furthermore, the transition kernel may be tractable, which serves as further motivation.”

Thus, DDBM explicitly uses "bridge" to mean a diffusion process between two fixed endpoints. By contrast, your OU-based method is a "shifted VP-SDE" that connects an initial point to a distribution rather than a fixed x_T. To avoid confusing readers, I suggest dropping "bridge" altogether and calling your model a shifted VP-SDE process. I do not consider that a weakness. I think it will make the idea even clearer.

Comment

We sincerely thank the reviewer for the insightful comment and for pointing out the potential confusion regarding the term "bridge". We agree that in the literature (e.g., Zhou et al., 2023), "diffusion bridge" typically refers to a stochastic process with both fixed initial and terminal points, which is indeed different from our setup. In our method, while the initial state x_0 is fixed, the terminal state x_T is sampled from a target distribution rather than being deterministically fixed. Therefore, our process is more accurately described as a shifted Ornstein-Uhlenbeck process or a shifted VP-SDE, as correctly noted by the reviewer.

Following the reviewer’s suggestion, we will remove the term "bridge" in the revised manuscript to avoid misleading interpretations, and instead consistently refer to our process as a "shifted VP-SDE" or "shifted OU process". We believe this clarification will improve the conceptual clarity of our work, and we appreciate the valuable feedback.

Comment

Thank you for your response. I have no other concerns. I will increase my score to "Accept". Please, incorporate all discussed points in the final version of the paper.

Official Review
Rating: 5

This paper proposes LABridge, a new text-to-image diffusion generative framework. The framework is motivated by the observation that current T2I generation has three issues: (1) different semantics share the same diffusion prior, leading to ambiguity, especially in the early stage; (2) sampling is slow in the early diffusion process; (3) text semantics are not aligned with image latents. To solve these issues, LABridge develops an Ornstein-Uhlenbeck (OU) process-based diffusion framework. The OU process is a stochastic process that encourages the state variable to drift towards the equilibrium mean. Based on this property, the training objective of LABridge is, in general, to (1) make the text semantics the equilibrium mean of the OU process in the image latent space, and (2) train the model to predict the noise, similar to the traditional diffusion loss. The first objective leads to the latent alignment loss, semantic consistency loss, and reconstruction loss, and the second objective leads to the bridge loss. Extensive experimental results indicate the proposed LABridge is effective under various scenarios.

Strengths and Weaknesses

Strength:

  1. The paper carefully identifies potential issues of current T2I diffusion models: (1) different semantics share the same diffusion prior, leading to ambiguity, especially in the early stage; (2) sampling is slow in the early diffusion process; (3) text semantics are not aligned with image latents. The motivation is clear.
  2. Introducing the OU stochastic process seems reasonable, since this process encourages the state variable to drift towards the equilibrium mean, which, when combined with the step of transforming the text embedding into image latents, results in a process that generates based on the text semantics.
  3. Extensive experiments show that LABridge leads to clear performance gains, and the ablation study shows the role of each design component.

Weaknesses:

  1. The latent alignment loss design in (9): From my understanding, this objective tries to ensure the decoded text embedding is close enough to the encoded image without adding noise. I am wondering whether this loss is easy to optimize. My point is: if this loss becomes 0 and you decode the generated text embedding, would we directly obtain clean images? In other words, this loss by itself optimizes a text-to-image generator. If so, I think this loss is too complicated and hard to optimize.
  2. The semantic consistency loss design in (10): This objective tries to minimize the cosine-similarity difference between the image latent space and textual embedding space. This could also be hard to optimize, because the text embedding model is frozen and only the TIAE encoder is trainable. The ablation study in Table 3 seems also do not have a version where only this loss is not present.
  3. I am not sure I understand the OU process 100%, but from my understanding, the OU process is a stochastic process that encourages the state variable to drift towards the equilibrium mean, and it is intuitively reasonable to use here. I am just wondering if it is possible to compare this stochastic process choice with previous work such as [1] mentioned in the paper.

[1] Denoising Diffusion Bridge Models

Questions

Overall, my questions mainly focus on the loss function designs. I might also have missed why LABridge is not compared with DDBM, which is mentioned in the paper. My questions are summarized as follows:

  1. A clearer explanation of loss function (9), since it would itself yield a T2I generative model;
  2. Similarly, I wonder whether loss function (10) is easy to optimize, because it involves aligning cosine similarities in the text embedding space with those in the image latent space;
  3. A better comparison to DDBM, or an explanation of why this is not possible.

Limitations

Yes.

Justification for Final Rating

The rebuttal has clarified my concerns regarding the loss design choices and my understanding of the relation to DDBM. I have no further concerns, so I change my score to "Accept".

Formatting Issues

N/A

Author Response

We sincerely thank the reviewer for their valuable feedback and insightful questions. We provide the following clarifications to address the questions raised.


1. On the Latent Alignment Loss ($L_{\text{align}}$)

Goal is Alignment: The objective of this loss is to align the text latent space with the image latent space, rather than to minimize the loss as much as possible. Therefore, as long as the alignment is achieved, it is not necessary to optimize the loss to a very low value. Moreover, experiments show that this loss is not difficult to optimize; it decreases quickly and tends to stabilize.

Conditional Mean: If the text-to-image mapping were one-to-one, optimizing $L_{\text{align}}$ to zero would indeed create a direct text-to-image generator. However, the relationship is one-to-many: a single prompt $y$ (e.g., "a dog") corresponds to a vast distribution of possible images, and thus a distribution of image latents $x_0$.

The loss is defined as an expectation over the data distribution: $L_{\text{align}} = \mathbb{E}_{(x_0, y) \sim q_{\text{data}}}\left[\lVert \mu_T(y) - x_0 \rVert^2\right]$

For any given text prompt $y$, the minimizer of this loss is the conditional mean of the image latents, i.e., $\mu_T(y) = \mathbb{E}[x_0 \mid y]$. Therefore, our TIAE is trained to map a text prompt to the average latent representation (or "center of mass") of all corresponding images.

Decoding this average latent $\mu_T(y)$ directly would likely produce a generic or blurry image, not a specific, high-fidelity instance. The purpose of this loss is not to be a standalone generator, but to establish a semantically meaningful anchor point in the latent space for the diffusion process to start from. The OU diffusion bridge is then essential for sampling diverse, high-quality instances around this well-positioned mean, which is far more efficient and stable than starting from a generic Gaussian prior.
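The conditional-mean argument can be checked numerically: for a fixed prompt, the minimizer of the expected squared loss over paired latents is the sample mean. The sketch below is a toy illustration (the prompt "center" `c_y` and the Gaussian latent model are hypothetical, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy one-to-many setup: a single prompt y maps to many latents x0 ~ N(c_y, I).
c_y = np.array([1.0, 3.0])                  # hypothetical latent "center" for y
x0 = c_y + rng.standard_normal((10000, 2))  # latents of images matching y

def align_loss(mu):
    """Empirical E[||mu - x0||^2] over latents paired with this prompt."""
    return np.mean(np.sum((mu - x0) ** 2, axis=1))

# The minimizer of the squared loss is the sample mean of x0,
# mirroring mu_T(y) = E[x0 | y] in the rebuttal's argument.
mu_star = x0.mean(axis=0)
```

Any perturbation of `mu_star` strictly increases the loss, so the TIAE output that drives $L_{\text{align}}$ down is the per-prompt average latent rather than any single clean image.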


2. On the Semantic Consistency Loss ($L_{\text{sem}}$)

Powerful TIAE Architecture: While the text encoder is frozen, the TIAE architecture (inspired by DiT) is a powerful Transformer-based model. It possesses significant expressive capability to learn a complex, non-linear mapping from the text embedding space to the image latent space. Our goal is not to perfectly replicate the geometry of the text space but to preserve its crucial semantic structure. For instance, if in the text space "king" is closer to "queen" than to "apple," then the corresponding latent priors $\mu_T$ should maintain a similar relative arrangement. The TIAE is flexible enough to learn this.

Crucial Role as a Regularizer: This loss serves as a vital regularizer. While $L_{\text{align}}$ anchors the priors for prompts seen during training, $L_{\text{sem}}$ ensures that the latent space between these anchors is structured in a semantically coherent manner. This is critical for generalizing to new or compositional prompts, as it encourages the learned prior manifold to be smooth and meaningful, preventing it from collapsing or becoming disorganized in unseen regions.
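A minimal sketch of such a structure-preserving regularizer, assuming it matches pairwise cosine similarities between the frozen text embeddings and the TIAE outputs (the exact form of Eq. (10) and its weighting may differ):

```python
import numpy as np

def cosine_matrix(Z):
    """Pairwise cosine similarities between the rows of Z."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    return Zn @ Zn.T

def semantic_consistency_loss(text_emb, latent_priors):
    """Penalize mismatch between the similarity structure of the frozen
    text space and the TIAE-produced latent priors (illustrative form)."""
    return np.mean((cosine_matrix(text_emb) - cosine_matrix(latent_priors)) ** 2)

# Toy check: an orthogonal map preserves cosine structure, so a TIAE that
# merely rotates the text space already makes this loss (near) zero -- the
# loss constrains relative geometry, not absolute positions.
rng = np.random.default_rng(0)
T = rng.standard_normal((4, 3))                    # stand-in text embeddings
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # random orthogonal map
rotated_loss = semantic_consistency_loss(T, T @ Q)
```

This illustrates why the loss is easier to satisfy than it may appear: the TIAE need not reproduce the text geometry exactly, only its relational structure.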

Ablation Study without $L_{\text{sem}}$: We apologize for the omission in the original ablation table. We have conducted the requested experiment and agree it provides valuable insight. Below is an updated version of the ablation results from Table 3, including a new row (the second row in each table) where only $L_{\text{sem}}$ is disabled.

Table 3 (Updated): Ablation on Stable Diffusion V1.5 and XL (COCO-10K and MJHQ-30K)

Stable Diffusion V1.5 Ablation

| $L_{\text{rec}}$ | $L_{\text{sem}}$ | $L_{\text{align}}$ | COCO-10K CLIP ↑ | COCO-10K FID ↓ | COCO-10K GenEval ↑ | COCO-10K DPG ↑ | MJHQ-30K CLIP ↑ | MJHQ-30K FID ↓ | MJHQ-30K GenEval ↑ | MJHQ-30K DPG ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| ✓ | ✓ | ✓ | 29.01 ±.02 | 13.82 ±.04 | 0.57 ±.02 | 72.42 ±.03 | 29.42 ±.01 | 11.63 ±.05 | 0.57 ±.01 | 72.42 ±.02 |
| ✓ | ✗ | ✓ | 28.90 ±.03 | 14.01 ±.04 | 0.55 ±.03 | 71.85 ±.04 | 29.25 ±.02 | 11.90 ±.04 | 0.55 ±.03 | 71.54 ±.04 |

Stable Diffusion XL Ablation

| $L_{\text{rec}}$ | $L_{\text{sem}}$ | $L_{\text{align}}$ | COCO-10K CLIP ↑ | COCO-10K FID ↓ | COCO-10K GenEval ↑ | COCO-10K DPG ↑ | MJHQ-30K CLIP ↑ | MJHQ-30K FID ↓ | MJHQ-30K GenEval ↑ | MJHQ-30K DPG ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| ✓ | ✓ | ✓ | 29.82 ±.03 | 12.82 ±.02 | 0.65 ±.03 | 77.21 ±.02 | 30.63 ±.03 | 8.55 ±.02 | 0.65 ±.02 | 77.21 ±.01 |
| ✓ | ✗ | ✓ | 29.61 ±.04 | 13.09 ±.03 | 0.63 ±.04 | 76.75 ±.03 | 30.65 ±.04 | 8.85 ±.03 | 0.63 ±.04 | 76.70 ±.03 |

The results show that removing $L_{\text{sem}}$ leads to a noticeable drop in performance, particularly on the text-alignment metrics (GenEval, DPG), confirming its importance for preserving semantic structure and improving generalization.

3. Comparison to Denoising Diffusion Bridge Models (DDBM)

Our process does not model a "bridge" in the strictest mathematical sense of conditioning on a specific endpoint $x_T = y$ (a delta function). Instead, it connects an initial point $x_0$ to a target distribution $\mathcal{N}(\mu_T(y), \sigma_T^2 I)$.

Our choice of the term "bridge" was motivated by its broader use in recent literature, such as in Denoising Diffusion Bridge Models (DDBM) [Zhou et al., 2023], which also focuses on connecting two endpoint distributions rather than two fixed points. Our work frames text-to-image generation as learning a stochastic bridge between the image latent manifold and a manifold of text-conditioned priors.

Our key innovations lie in how we define the two endpoint distributions for the text-to-image task:

  1. The "Start" Distribution ($p_0$): We operate in the latent space of a pre-trained VAE, so our starting points $x_0$ are drawn from the manifold of encoded real images.
  2. The "End" Distribution ($p_T$): This is our main contribution. Instead of a fixed prior like $\mathcal{N}(0, I)$, we introduce the Text-Image Alignment Encoder (TIAE) to create a structured, text-conditioned prior distribution for the endpoint: $p(x_T \mid y) = \mathcal{N}(\mu_T(y), \sigma_T^2 I)$.

Therefore, DDBM is the foundational tool we build upon. We use its principles to construct a bridge, but our novelty is in creating and aligning the text-conditioned target endpoint $\mu_T(y)$ that makes the bridge semantically direct and efficient. The Ornstein-Uhlenbeck (OU) process is a specific type of bridge we choose for its mean-reverting property, which perfectly complements our goal of steering the generation process towards the text-aligned prior $\mu_T(y)$.
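The endpoint behavior described above can be made concrete with the closed-form OU marginal: the mean interpolates from the image latent toward the equilibrium, and the variance saturates at a fixed value. This is a toy sketch; `theta`, `sigma`, and the example vectors are illustrative, and the paper's actual noise schedule may differ.

```python
import numpy as np

def ou_marginal(x0, mu_T, theta, sigma, t):
    """Mean and std of the OU marginal p(x_t | x0) for
    dX = theta * (mu_T - X) dt + sigma dW."""
    a = np.exp(-theta * t)
    mean = a * x0 + (1.0 - a) * mu_T
    var = sigma**2 / (2.0 * theta) * (1.0 - np.exp(-2.0 * theta * t))
    return mean, np.sqrt(var)

x0 = np.array([5.0, -2.0])    # an image latent (toy values)
mu_T = np.array([0.5, 0.5])   # hypothetical TIAE prior mean for a prompt
m_small, _ = ou_marginal(x0, mu_T, theta=1.0, sigma=1.0, t=0.01)
m_large, s_large = ou_marginal(x0, mu_T, theta=1.0, sigma=1.0, t=10.0)
# At small t the marginal stays near x0; at large t it approaches
# N(mu_T, sigma^2 / (2 * theta) I) -- the text-conditioned endpoint
# distribution, independent of the starting latent.
```

This is the sense in which the terminal distribution is $\mathcal{N}(\mu_T(y), \sigma_T^2 I)$ rather than a fixed point: every starting latent is forgotten exponentially fast, while the endpoint mean is set entirely by the prompt.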

We hope these clarifications have adequately addressed your concerns. We are grateful for your time and constructive feedback, which have helped us improve the clarity of our paper.

Comment

Thank you for the detailed response; my confusion has been clarified, and my concerns regarding the loss designs are resolved. I will increase my rating. Please consider organizing and including the additional explanations and experimental results in the paper.

Final Decision

This paper introduces the LABridge framework, which improves text-to-image generation by enhancing text–image alignment and reducing sampling time through the use of a Text–Image Alignment Encoder (TIAE) and an Ornstein–Uhlenbeck (OU) diffusion bridge. Initial concerns were raised regarding the optimizability of the loss, clarity of presentation and details, comparative analysis, ablation studies, and sensitivity analysis. The authors provided a strong rebuttal that adequately addressed these issues. After the rebuttal, all reviewers gave the final score of 5: Accept.

The paper is clearly written, with well-presented methods and high-quality illustrations. Experiments demonstrate that the approach improves generation quality for both training from scratch and fine-tuning existing models. The proposed method achieves better text–image alignment, competitive image fidelity, and potentially faster sampling.

The work reframes text-to-image generation as a stochastic bridge from image latents to text-conditioned priors. The method is mathematically sound and validated across multiple backbones, showing consistent gains in alignment metrics (GenEval, DPG) and sampling efficiency. Overall, the paper is novel, technically solid, and experimentally convincing.

The AC and SAC therefore recommend acceptance.