Rethinking Score Distillation as a Bridge Between Image Distributions
Abstract
Reviews and Discussion
The paper analyzes score distillation methods in a single framework and hypothesizes two possible error sources: the single-step ODE solver approximation and the mismatch between the assumed and true source image distribution. It then proceeds to tackle the latter using a custom negative prompt design. The authors show this improves the quality of the generated images in both 2D and 3D tasks, as measured by quantitative metrics and in a user study.
Strengths
S1) The analysis of different methods in a single framework and with illustrations and experiments is quite interesting for a reader and useful for understanding the SotA and its motivation.
S2) The exposition is very clear except for the treatment of the related work discussion.
S3) The proposed negative prompt idea is very effective in the tested scenarios.
S4) Quantitative metrics are complemented by a user study.
S5) The "3D Sketch-to-Real" task hidden in the Appendix seems quite original but it would require more examination (and probably a separate paper) to become a substantial contribution on its own.
Weaknesses
W1) The technical contribution of the paper is limited. Specifically, I see the DDS gradient [18] and the new proposed gradient (eps_ours) as closely related. As far as I can tell, the formulas are the same and the difference is in the choice of the source prompt. Since DDS is presented to work with arbitrary source and target prompt pairs, I feel the new method could be postulated as a specific realization of DDS for one specific source prompt. The authors also omit DDS from all comparisons which supports this viewpoint. The paper still presents a novel analysis of performance in setups that were not shown in the original DDS paper [18] but I am worried the way the method is presented here ("Ours") is not optimal from the perspective of originality.
W2) The idea of discouraging artifacts using a negative prompt is likely not a substantial scientific contribution since negative prompts are commonly used by practitioners in image generating interfaces such as Midjourney.
W3) The authors identify two sources of errors - the 1st order approximation and the source distribution match - but they mostly only analyze the latter. The former is speculated to improve performance of ISM but if this hypothesis should be considered seriously, I would expect the multi-step ODE solver to be tested in combination with the various gradient definitions. This could show it in fact helps consistently and how orthogonal it is to the other error type. Currently, not even the basic ISM is included in Table 1 so I find the "First-order approximation error" theory plausible but untested.
W4) Other suggestions for the analytical Sec. 2.3:
- The analysis of SDS focuses on cases with s >> 1 but it does not explain the poor performance with s = 1.
- Unlike for SDS, the analysis and figure for DDS omit the effect of CFG while the original work uses s > 1 ([18]: Eq. 3 and Fig. 5). I would expect to see some of the same distribution shifts as in SDS.
W5) The related work is deferred to the appendix. The discussion covers the various distillation methods and their applications quite well, but due to its location the paper alone is sometimes a bit hard to follow. Therefore, I am personally not very keen on this design choice. Notice that I do not consider this a reason for rejection, but I still put it among the negatives since I see it as a minor weakness.
Minor presentation issues and suggestions:
- [25] is NeurIPS 2022
- When Fig. 1 is first referenced the meaning of different eps suffixes has not yet been defined in the text.
Justification of recommendation
Overall this could be a very nice meta-paper on score distillation but to that goal it unfortunately only explores one part of the problem - the score definition - and not the ODE approximation. The negative prompting idea seems effective but not very novel given the common usage of negative prompts in image generation and the similarity of the score gradient to DDS. Therefore I lean towards rejection.
Post rebuttal, I increase my rating because the authors could show that their method is useful also in 3D generation and that it can be combined with the multi-step inversion to demonstrate the influence of both limiting factors discussed in the theoretical part. These should be integrated into the paper. I keep my score just positive since I still see the method as technically very incremental (e.g., the main difference from DDS is just the input image) and I await the discussion of the other reviewers.
Questions
Q1) The authors use CFG strength s=40 (L483) or even s=100 (L487) to test their method which is a lot more than the range 3-15 considered in the DDS paper [18]. Why the difference?
Q2) The definition of the score function in Eq. 2 differs from [25] and also from what I would expect. It is a bit hard to judge because the noise scale $\sigma_t$ is also not defined at all, but I would expect it to play a role, i.e., something like $\nabla_{x_t} \log p(x_t) \propto -\epsilon/\sigma_t$ as in [25, Eq. 3]. Can you please clarify this?
Limitations
The authors discuss the impact of AI art on society and mention that the performance of their method does not match that of reverse sampling.
- The formulas of DDS and the proposed method are the same up to the source prompt.
A small yet fundamental difference is that DDS is a specialized method for editing that computes the source distribution direction based on a reference image instead of the current optimized image. That is, DDS computes the direction to the source distribution from the reference image, as shown in the equation above line 155, while we compute it from the current optimized image, as shown in the equation above line 186. This makes DDS incompatible with generation tasks since it requires a reference image. In addition, even when reference images are available, it induces additional source distribution error, as we explained in line 155.
- The idea of using a negative prompt is likely not a substantial scientific contribution.
We emphasize in the global response that our main contribution is to provide a new view of the sources of errors in SDS. To demonstrate that reducing the source distribution approximation error can improve generation quality, we propose a two-stage optimization process. Although using a negative prompt is a common practice, the proper way to use it with SDS is not well explored. For example, as shown by the two-stage ablation experiment in Figure A1, always using the negative prompt in SDS leads to divergence and poor geometry.
- The authors identify two sources of errors - the 1st order approximation and the source distribution match - but they mostly only analyze the latter.
Please find the qualitative results of ISM in Figure A6. We notice that ISM generally outputs sharper results than SDS. However, it actually still uses single-step estimation to approximate the bridge. Although it uses multi-step inversion to produce a noisy sample instead of adding randomly sampled noise, it computes two epsilon directions using diffusion models to estimate the gradient, similar to SDS.
Instead, here we propose to resolve the first-order approximation error by solving the entire PF-ODE path to recover the dual bridge and estimate the endpoint of the bridge that is coupled with the current sample. In this way, we obtain the most accurate gradient direction with minimal approximation error. We refer to this approach as “full path”. However, solving the inversion ODE is not trivial. We noticed that the inversion can exaggerate the distribution mismatch error and cause the optimization to get stuck at a local optimum at the beginning of the optimization. Instead, the high variance of the single-step methods often demonstrates more robustness to different initializations. Therefore, we first perform the single-step score distillation optimization to obtain reasonable results before moving to solving the full bridge.
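To make this concrete, below is a minimal sketch of how such a full-path target could be computed with deterministic DDIM steps. This illustrates the idea rather than our exact implementation: the `unet(x, t, emb)` signature, the `src_emb`/`tgt_emb` prompt embeddings, and the `alphas_cumprod` schedule are placeholder assumptions for a standard epsilon-prediction latent diffusion model.

```python
import torch

@torch.no_grad()
def ddim_step(x, t_from, t_to, eps, alphas_cumprod):
    # One deterministic DDIM step between arbitrary timesteps (eps-prediction).
    a_from, a_to = alphas_cumprod[t_from], alphas_cumprod[t_to]
    x0_pred = (x - (1 - a_from).sqrt() * eps) / a_from.sqrt()
    return a_to.sqrt() * x0_pred + (1 - a_to).sqrt() * eps

@torch.no_grad()
def bridge_endpoint(x0, src_emb, tgt_emb, unet, alphas_cumprod, n_steps=20):
    # Invert along the source PF-ODE (t: 0 -> T), then integrate back along
    # the target PF-ODE (t: T -> 0) to estimate the coupled bridge endpoint.
    T = len(alphas_cumprod) - 1
    ts = torch.linspace(0, T, n_steps + 1).long()
    x = x0
    for t_from, t_to in zip(ts[:-1], ts[1:]):                   # inversion
        x = ddim_step(x, t_from, t_to, unet(x, t_from, src_emb), alphas_cumprod)
    for t_from, t_to in zip(ts.flip(0)[:-1], ts.flip(0)[1:]):   # sampling
        x = ddim_step(x, t_from, t_to, unet(x, t_from, tgt_emb), alphas_cumprod)
    return x

def full_path_loss(render, src_emb, tgt_emb, unet, alphas_cumprod):
    # Gradient of this loss w.r.t. `render` is (render - endpoint), i.e. the
    # bridge direction without the first-order approximation.
    endpoint = bridge_endpoint(render.detach(), src_emb, tgt_emb,
                               unet, alphas_cumprod)
    return 0.5 * ((render - endpoint) ** 2).sum()
```

In our experiments, this loss is only enabled after the single-step warm-up described above.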
With this approach, we can now explore addressing both the first and second sources of error: the first source (linear approximation) with “full-path” and the second source (source distribution mismatch) with the two-stage process. As shown in the table below, we find that using the “full-path” multi-step estimate (mitigating error source 1) always outperforms the single-step methods, achieving a lower FID. However, the same trend does not fully transfer to the text-to-3D experiments. We observe that it typically introduces additional artifacts and makes the optimization less stable. We leave the best way of leveraging this gradient as a future research exploration.
| Method | Addressing linear approx. error | Addressing dist. mismatch error | FID |
|---|---|---|---|
| SDS | No | No | 79.95 |
| with two-stage | No | Yes | 69.82 |
| with full-path PF-ODE | Yes | No | 66.51 |
| with two-stage & full-path | Yes | Yes | 62.69 |
4.a) The analysis of SDS does not explain the poor performance with s = 1.
SDS requires $s \gg 1$ to make the text-conditioned term dominate the gradient direction. The $-\epsilon$ term effectively averages the optimized image by adding different Gaussian noise. As shown in Figure A5 in the uploaded PDF, a small value of $s$ would make the generation over-smoothed and lacking in details.
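For reference, a standard way of writing the SDS gradient with classifier-free guidance (DreamFusion-style notation; $y$ is the text condition and $\varnothing$ the null prompt) makes this explicit. The $-\epsilon$ term is zero-mean but high-variance, so at $s = 1$ the conditional signal is weak relative to this noise:

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[ w(t)\,\big(\hat{\epsilon}^{\,s}_\phi(x_t; y, t) - \epsilon\big)\,
    \frac{\partial x}{\partial \theta} \right],
\qquad
\hat{\epsilon}^{\,s}_\phi
  = \epsilon_\phi(x_t; \varnothing, t)
  + s\,\big(\epsilon_\phi(x_t; y, t) - \epsilon_\phi(x_t; \varnothing, t)\big).
```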
4.b) Unlike for SDS, the analysis for DDS omits the effect of CFG.
We believe that the effect of changing s is equivalent to changing the learning rate of the optimization, which controls the strength of the editing in DDS without changing the gradient direction. Therefore, we omit it in our gradient analysis.
5, 6. The related work is deferred to the appendix. Minor issues.
Thank you, we will address these in the revised version.
Q1. Why use CFG strength s=40 or even s=100, which is more than the range 3-15 in DDS?
s does not affect the direction of the gradient and can be absorbed into the learning rate. In their experiment, a learning rate of 0.1 is used, while we mostly use a learning rate of 0.01. Equivalently, we could use s = 4 with a learning rate of 0.1 (since 40 × 0.01 = 4 × 0.1), which results in a similar scale to the DDS hyperparameters.
Q2. Can you please clarify the definition of the score function in Eq. 2, compared with [25]?
We follow the notation in the DDPM paper. Suppose that $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$; then the score can be computed as $\nabla_{x_t} \log p(x_t) = -\epsilon_\theta(x_t, t)/\sqrt{1-\bar\alpha_t}$.
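For completeness, this identity follows in one line from the Gaussian perturbation kernel (standard DDPM algebra):

```latex
x_t \mid x_0 \sim \mathcal{N}\!\big(\sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\,I\big)
\;\Longrightarrow\;
\nabla_{x_t} \log p(x_t \mid x_0)
  = -\frac{x_t - \sqrt{\bar\alpha_t}\,x_0}{1-\bar\alpha_t}
  = -\frac{\epsilon}{\sqrt{1-\bar\alpha_t}}\,,
```

and substituting the learned predictor $\epsilon_\theta(x_t, t)$ for $\epsilon$ yields the expression above.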
[25] uses a different noising function, thus a different score function.
Thank you for answering my questions.
Thanks for updating your review!! We will incorporate these suggested experiments in our revised version.
Regarding DDS, we also wanted to mention that it uses the same negative prompt throughout the optimization, which our new ablation study shows is highly ineffectual. We want to reiterate that the contribution of our paper is the proposed framework, which enables the understanding and the proposal of a two-stage optimization pipeline.
This paper proposes interpreting score distillation sampling (SDS), a widely used method for generating 3D, 4D, and vector graphics, through the lens of the Schrödinger Bridge (SB) problem. According to the paper, SDS is a linear approximation of the optimal path moving from the current distribution to the target distribution. The paper identifies two sources of approximation error: first-order approximation error and source distribution mismatch. To address the second error source, the authors suggest a simple method: using a text prompt that describes the source distribution instead of a null prompt. This approach is computationally more efficient than the best variant of SDS, VSD, and the authors demonstrate its effectiveness across various tasks such as text-to-image, text-to-3D, painting-to-real, and illusion generation.
Strengths
- The interpretation that SDS finds the optimal path connecting two distributions is novel and aids in understanding the behavior of the widely used SDS.
- The paper addresses the oversaturation problem of conventional SDS and is computationally more efficient than VSD.
- The paper demonstrates effectiveness across a wider range of tasks compared to previous papers (SDS, NFSD, and VSD).
- The paper is well-written and easy to follow.
Weaknesses
- Up to section 2.3, I enjoyed reading and expected a principled solution. However, the solution in section 2.4 was quite naive and heuristic. A major drawback of this solution is that the pre-trained diffusion model needs to accurately match the proposed descriptions such as "oversaturated, smooth,…" with the source distribution. Considering that the text-to-3D experiments are based on Threestudio, my assumption is that the text-to-image model used in this paper is Stable Diffusion 2-base (the exact model is not mentioned in the paper). It is questionable whether other models (MVDream, SDXL, PixArt, SD3, etc.) can understand these descriptions well. My guess is that diffusion models trained on such high-quality data will still generate clean images even when descriptions like "oversaturated, smooth,…" are appended, and therefore will still suffer from the source distribution mismatch problem.
- The paper does not consider the Janus problem, which frequently occurs in text-to-3D. It is questionable whether the proposed methodology would be effective with MVDream [50], a pre-trained diffusion model that addresses this issue.
- VSD has the strength of not only ensuring the quality of rendered images but also achieving sample diversity as the number of particles increases. Therefore, line 282 is not true, and the proposed method falls behind VSD in terms of sample diversity.
Questions
- What pre-trained diffusion model did you use in this paper?
- I personally tried text-to-3D with SDS using Diffusion-DPO [A, B], which is a post-trained diffusion model for aesthetic quality, and failed to generate plausible geometry. Can this phenomenon be explained from the perspective of the Schrödinger Bridge problem?
[A] https://huggingface.co/mhdang/dpo-sd1.5-text2image-v1
[B] Wallace et al., Diffusion Model Alignment Using Direct Preference Optimization, CVPR 2024.
Limitations
The limitations of this paper are stated in section 4, but they need improvement.
As mentioned in the weaknesses, unlike the method proposed in this paper, VSD can resolve the diversity issue by increasing the number of particles, so line 282 is misleading.
Additionally, it is necessary to mention the slow generation speed of SDS and the failure to address the Janus problem in text-to-3D, for the benefit of the readers.
- The solution in section 2.4 was quite naive and heuristic. A major drawback is whether other models (MVDream, SDXL, PixArt, SD3, etc.) can understand the descriptions "oversaturated, smooth,…" well.
Although using a negative prompt is a common practice in text-based diffusion models, how to use it with SDS is not well explored. As shown in the global response and Figure A1, simply using it all the time during the optimization process leads to inferior results. Instead, the two-stage process consistently outperforms the single-stage baselines.
We show that the prompt generally works well with other models like MVDream and SDXL.
We perform our 2D generation experiment with MVDream using SDS and our two-stage optimization process. We use the same negative descriptors for the source prompt as proposed in our experiments with Stable Diffusion 2.1. As shown in Figure A4-a, we show that the two-stage optimization (bottom) produces more convincing colors and additional realistic high frequency details compared to an SDS baseline (top) for the same MVDream model.
We do the same with the SDXL base model, again using the same negative descriptors proposed for Stable Diffusion 2.1. In Figure A4-b, the two-stage optimization process (bottom) produces fewer saturation artifacts and more high-frequency details than the SDS baseline (top). SDXL is known to perform poorly in the SDS setting and is therefore not commonly used, but we include these results to demonstrate the universality of the proposed optimization.
Despite the fact that these diffusion base models are trained on multi-view or high-quality images whose captions may not contain the proposed negative descriptors, we argue that the powerful pretrained text encoders that embed their prompts represent these artifacts well. For example, the embedding of “oversmoothed” is likely far from the embedding of “detailed” and close to other negative descriptors like “blurry”. Empirically, we find that the same negative descriptors work across base models without needing retuning.
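This embedding-geometry claim is easy to sanity-check. The snippet below is a hypothetical probe rather than part of our method: for brevity it uses the OpenAI CLIP text encoder via HuggingFace `transformers` (SD 2.1 actually uses an OpenCLIP ViT-H encoder) and pooled cosine similarity as a rough proxy for prompt distance.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPTokenizer, CLIPTextModel

# Illustrative choice of encoder; swap in the base model's own text encoder
# for a faithful check.
name = "openai/clip-vit-large-patch14"
tok = CLIPTokenizer.from_pretrained(name)
enc = CLIPTextModel.from_pretrained(name).eval()

def embed(text):
    ids = tok(text, return_tensors="pt", padding=True)
    with torch.no_grad():
        pooled = enc(**ids).pooler_output   # (1, hidden) sentence embedding
    return F.normalize(pooled, dim=-1)

neg = embed("oversmoothed, blurry")
pos = embed("detailed, sharp")
other_neg = embed("hazy, noisy, malformed")
print("neg vs pos:      ", (neg @ pos.T).item())        # expect: lower
print("neg vs other neg:", (neg @ other_neg.T).item())  # expect: higher
```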
- The paper does not consider the Janus problem.
Our proposed framework is focused on analyzing and improving the diffusion gradient in SDS. MVDream addresses the Janus problem, which requires data priors on objects by training on multi-view data and conditioning generations with camera pose. These problems are orthogonal, and we show that our two-stage optimization process works well with MVDream to address both issues in Figure A4-a.
- VSD has the strength of not only ensuring the quality of rendered images but also achieving sample diversity as the number of particles increases. Therefore, line 282 is not true, and the proposed method falls behind VSD in terms of sample diversity.
Thanks, we will make clear the advantage VSD has on sample diversity. In line 282, we wrote, “...neither approaches have yet to achieve the quality and diversity of images generated by the reverse process.” We did not intend to claim that our two-stage optimization achieves better diversity than VSD, but rather that all SDS variants still induce lower diversity than the reverse process. The reason for that is still unclear. We hope that our analysis may inspire future research to better understand this problem.
For instance, to analyze the diversity issue within our framework, we notice that the ending point of the bridge is deterministically decided by the initial conditions and the ODE processes. When training a LoRA on all the particles, the loss encourages different ODE processes on individual particles. As a result, the LoRA module assigns slightly different directions to each particle and improves diversity. A recent study [1] reinforces this point by introducing a repulsive ensemble method to VSD. In general, this is beyond the scope of our paper. We will add this discussion in the paper.
[1] https://arxiv.org/abs/2406.16683
Question 1: What pre-trained diffusion model did you use in this paper?
We use stable-diffusion-v2-1-base for our experiments. We will make this clear in our revised version.
Question 2: I personally tried text-to-3D with SDS using Diffusion-DPO [A, B], which is a post-trained diffusion model for aesthetic quality, and failed to generate plausible geometry. Can this phenomenon be explained from the perspective of the Schrödinger Bridge problem?
Thanks for bringing up this interesting observation. We have also observed that models like SDXL fail to generate reasonable geometry in practice. Similar to Diffusion-DPO, SDXL filters its data using the aesthetic scores. Our hypothesis is that the images with high aesthetic scores overrepresent canonical views of the object. For example, the front view of a dog is often deemed to be more aesthetic than its back view. As a result, this induces an issue that the target distribution of SB heavily biases toward this canonical view. When applying these models to 3D generation, there could be more inconsistency across different views, which makes the optimization less stable.
Since most text-to-image models use a frozen text encoder, untrained negative descriptor embeddings are likely to be far from positive descriptor embeddings, as the authors mentioned in the rebuttal. However, a text-to-image model would not be able to output a good score or gradient for an untrained negative embedding. I believe this is why SDXL performs poorly in the SDS setting.
Considering that recent text-to-image models are increasingly trained on high-quality images, the proposed method seems difficult to apply beyond stable-diffusion-v2-1-base and MVDream, which is a post-trained version of stable-diffusion-v2-1-base.
While I see value in this paper's explanation of the behavior of SDS and its variants (including VSD as addressed in the rebuttal) from the perspective of the SB problem, I feel that the proposed method for addressing the issues with SDS needs further improvement and is not yet ready for publication in NeurIPS.
Therefore, I will maintain my current score. I remain open to further discussion.
Thanks for the response!
However, a text-to-image model would not be able to output a good score or gradient for an untrained negative embedding. I believe this is why SDXL performs poorly in the SDS setting.
This may be a misunderstanding—SDS is not performing poorly because of the negative embedding (since naive SDS uses no negative embedding!). Naive SDS just performs poorly overall with SDXL. In fact, the experiments in the rebuttal PDF actually show that our approach, and therefore adding the negative description, improves upon the SDXL-based SDS baseline, producing results with fewer color artifacts. This is evidence that our method is not difficult to apply beyond SDv2.1.
The objective of this paper was to analyze the sources of error in SDS. We hypothesized that accurately representing the current source distribution is one key to enhancing SDS quality. To validate this, we introduced an experimental alternative to SDS that appends negative modifiers to more effectively model the source distribution.
In our initial submission, all our experiments used SDv2.1 as the base model because our goal was to compare to naive SDS, and at the time of submission, Stable Diffusion 2.1 was (and still is) the main image generator used for score distillation sampling experiments. Delving further into how the SDS performance varies across SOTA text-to-image diffusion models, especially those trained on aesthetic images, is an interesting and under-explored research direction that is beyond the scope of this paper.
In the future, as novel image generators appear, there may be more effective ways of modeling the source distribution. We believe our provided experiments in the paper and the new additions in the rebuttal have sufficiently validated our analysis, and these insights will be applicable to newer image generation models.
Thank you for the detailed response. I apologize for the late additional question, but I have one more.
In the experiments conducted in the paper and rebuttal, including those with SDXL, within what interval were the diffusion timesteps sampled, and what was the weighting scheme for each timestep?
For example, DreamFusion uses sigma-weighted SDS, and VSD uses timesteps only within the range [0.5, 0] during the refinement stage.
Since the sampling distribution and weights of timesteps in diffusion models are known to be important [A], I am curious about this aspect. I’m curious if the effect of negative prompts might manifest at specific diffusion timesteps and weighting schemes.
[A] Kingma et al., Understanding Diffusion Objectives as the ELBO with Simple Data Augmentation, NeurIPS 2023.
Thank you for the follow-up and sharing this interesting paper!
For the text-to-2D experiments, including the SDXL experiment in the response, we adopt the timestep sampling and weighting as proposed in DreamFusion (i.e., sigma weighting and $t \sim \mathcal{U}(0.02, 0.98)$). In general, we find that annealing the maximum timestep is helpful, but exclude that from text-to-2D experiments for a more direct comparison. For text-to-3D experiments, to make a fair comparison with VSD, we use the configuration from ProlificDreamer. That is, we use sigma weighting and sample $t \sim \mathcal{U}(0.02, 0.98)$ for the first 5k steps and $t \sim \mathcal{U}(0.02, 0.5)$ for the remaining steps. Overall, we observe similar effects when tuning these hyperparameters in SDS and our proposed second stage. We will add these details in our revision.
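As a sketch, the sampling and weighting above amount to roughly the following; the thresholds and ranges mirror the configuration described in this reply and should be treated as tunable settings, and `alphas_cumprod` denotes the model's cumulative noise schedule:

```python
import torch

def sample_timestep(step, anneal_at=5000, n_train_t=1000):
    # Annealed schedule: t ~ U(0.02, 0.98) early, t ~ U(0.02, 0.50) after
    # `anneal_at` optimization steps (values as described above).
    t_max = 0.98 if step < anneal_at else 0.50
    u = torch.empty(()).uniform_(0.02, t_max)
    return (u * n_train_t).long()

def sigma_weight(t, alphas_cumprod):
    # One common reading of "sigma weighting": w(t) = sigma_t^2 = 1 - alpha_bar_t.
    return 1.0 - alphas_cumprod[t]
```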
Thank you for providing the detailed information. It seems that hyperparameters are consistent across comparisons with other methods, and there appears to be room for further performance improvement by adjusting the weightings or noise schedules of diffusion models. While I still don't believe the proposed method is a principled solution, through the rebuttal and discussion, I have come to see this work as a step toward an optimal solution in the future. Therefore, I would like to raise my score to 5.
This paper revisits the application of Score Distillation Sampling (SDS) for tasks with limited data availability by proposing a new interpretation based on Schrödinger Bridges for optimal-cost transport between distributions. The paper highlights that existing SDS methods produce artifacts due to linear approximations and poor estimates of source distributions. By aligning the text conditioning more closely with the source distribution's characteristics, the authors demonstrate significant improvements in image generation tasks such as text-to-2D and 3D, and translation between art styles. The proposed method avoids the computational overhead of previous approaches while achieving comparable or superior results in terms of image quality across various domains.
Strengths
- Overall I find that the writing is clear, concise, and well-structured, making it easy for readers to follow the arguments and understand the key points. I really like the view of optimal transport between source and target distributions to understand score distillation.
- This paper provides a comprehensive analysis of existing methods from a unified point of view. They further propose a simple yet effective method for transferring the 2D diffusion prior to 3D scene generation or editing. In contrast to the prior state of the art, ProlificDreamer, it does not require fine-tuning of diffusion models, which may introduce training inefficiency and instabilities.
Weaknesses
- The description of the proposed method is quite concise. Some technical parts lack enough rationales. For example, estimating the source distribution by negative text prompts lacks the rationales. I believe providing detailed analysis with ablation studies will make the paper more informative to answer the following questions: How do you choose these negative prompts? Does it need to be hand-crafted carefully every time adapting to a new domain/task? Why do you have to propose a two-stage optimization pipeline?
- How do you choose the value of w? If the value of w is still very large, then the proposed method likely inherits the weaknesses of using a high CFG with standard SDS: less diversity of generations for a given prompt, less realistic generations, and oversaturated colors. Besides, if the value of w is still very large, I think the proposed method is more similar to directly using the original negative prompts as guidance.
- The text-to-3D experimental results are somewhat not convincing since the comparison of both qualitative and quantitative results is inadequate and more competitive baselines should be included, such as Fantasia3D, Magic3D, CSD, NFSD, etc.
Questions
See the weakness section for questions. I would like to raise my scores if the above questions are addressed.
Limitations
Yes.
- Some technical parts lack enough rationales, such as the negative prompt and two-stage optimization process. How do you choose these negative prompts?
We propose a two-stage optimization process motivated by our SB framework. To reduce the effect of the distribution mismatch error, we aim to more accurately describe the current distribution across different stages of optimization. While we experimented with more complicated approaches to achieve this, we found prompting to be unreasonably effective and much simpler. The negative prompts were selected based on the characteristic SDS artifacts, e.g., "oversaturated, smooth, pixelated,.." (rest in paper). This description does not require hand-crafting based on problem domains—it is kept fixed across all shown examples and use-cases.
We do not claim that these specific prompts are optimal or even of particular interest. To demonstrate this, we prompted GPT-4 through ChatGPT a single time to generate alternative negative prompts and ablated the results in Figure A2 with otherwise identical hyperparameters.
ChatGPT Prompt:
Here's a set of "negative prompts" to append to a text-to-image prompt that describe undesirable image characteristics: ", oversaturated, smooth, pixelated, cartoon, foggy, hazy, blurry, bad structure, noisy, malformed" I want to try a variety of them, please brainstorm many of roughly the same length
Generated Variations:
1. ", washed out, grainy, distorted, flat, smeared, overexposed, undefined, choppy, glitchy, dull"
2. ", low contrast, jumbled, faint, abstract, over-sharpened, muddy, cluttered, vague, jagged, poor detail"
3. ", soft focus, muffled, streaky, patchy, ghosted, murky, unbalanced, skewed, mismatched, overcrowded"
4. ", overbright, scrambled, bleary, blocky, misshapen, uneven, fragmented, obscured, chaotic, messy"
5. ", dull tones, compressed, smeary, out of focus, unrefined, lopsided, erratic, irregular, spotty, stark"
In addition, we show that the same negative descriptors work across different base models, such as MVDream. Since MVDream denoises four camera-conditioned images jointly, we treat the canvas of four images as a single optimization variable for the SDS gradient. In Figure A4-a, we compare the SDS baseline (top) to the proposed two-stage optimization (bottom), in which we generate more natural colors and detail. This is especially noticeable in the grass around the crocodile. Due to the space limit, we will add more results in the revised version.
We also ablate the proposed two-stage optimization process in the global response. As the source distribution keeps changing along with the optimization process, it is necessary to update the source distribution. We effectively achieve this by first running SDS (stage 1) then updating our source distribution with negative prompts to steer the optimization away from the artifacts (stage 2). In the global response, we show that if we start with such “negative” prompts from the beginning, which do not accurately describe the rendered images at initialization, it causes additional distribution approximation error and fails to generate a plausible object.
- How do you choose the value of w? Is a large w value similar to a large CFG scale that causes artifacts?
We choose $w$ to produce a gradient on a similar scale as SDS with its usual CFG scale. This is because, in text-to-3D, it is crucial to balance many other regularization losses on sparsity, opacity, and so on. Using a similar scale allows us to adopt the same hyperparameters to compare fairly with other SDS variants. In addition, $w$ is different from CFG since SDS also incorporates a term with sampled Gaussian noise, and the CFG term needs to be large enough to make the text-conditioned direction dominate it. Instead, $w$ simply scales the gradient, unlike the CFG scale in SDS, which changes the direction of the gradient.
We also notice that $w$ is relatively robust and gives similar results across a range of values. In contrast, when the CFG scale is small in SDS, the averaging effect of Gaussian noise dominates and creates oversimplified 3D objects. When $w$ or the CFG scale is large, with other loss weights intact, the optimization becomes unstable and may diverge. See Figure A5 in the uploaded PDF for this qualitative comparison.
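To make the distinction concrete, compare the two directions (superscripts denote the conditioning; notation compressed):

```latex
\nabla_{\mathrm{SDS}} \;\propto\;
  \underbrace{\epsilon^{\varnothing}_\phi
    + s\,\big(\epsilon^{y}_\phi - \epsilon^{\varnothing}_\phi\big)
    - \epsilon}_{\text{direction changes with } s}
\qquad \text{vs.} \qquad
\nabla_{\mathrm{ours}} \;\propto\;
  \underbrace{w\,\big(\epsilon^{y_{\mathrm{tgt}}}_\phi - \epsilon^{y_{\mathrm{src}}}_\phi\big)}_{w\ \text{only rescales the direction}}
```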
- Missing comparison in text-to-3D with more competitive baselines.
Thank you for the suggestion. We ran a comparison with Fantasia3D, Magic3D, and CSD through a drop-in replacement of SDS with our method. We did not compare with NFSD as they did not release the official code and we empirically found that its results resemble SDS results. Specifically, all three methods optimize a textured DMTet, which is initialized from an SDS-optimized NeRF, using SDS or CSD for 5k or 10k iterations. We replace the SDS or CSD stage of these approaches with the two-stage optimization motivated by our framework. Just like our text-to-3D NeRF experiment, we perform the first stage for 60% of iterations and the second stage for 40% of iterations. Note that we keep all the other hyperparameters the same, which were tuned for the baselines, not our method. This replacement leads to the same optimization time as the original methods. For Fantasia3D and Magic3D, we use threestudio for fair comparison (Magic3D does not have code available) and the default prompts, which are generally believed to work the best with this reimplementation. For CSD, we use the official implementation. As shown in Figure A3, our method improves the visual quality of all the methods by reducing the oversaturated artifacts of SDS and improving the details.
Thanks for your reply. I would like to maintain my score.
We thank all reviewers for their thoughtful feedback. We propose an optimal transport view to understand score distillation, which reviewers “really like” (Cgf2) and find “novel” (sBp4). We provide illustrations and experiments under this single framework, which reviewer XvCb finds “quite interesting … and useful for understanding the state-of-the-art and its motivation.” They also find the method we propose to improve SDS based on this interpretation simple yet effective (Cgf2, sBp4, XvCb), efficient (Cgf2, sBp4), and explained with clear exposition (Cgf2, sBp4, XvCb).
We want to stress that our primary contribution is the analysis of the sources of error in SDS and its variants (i.e., why it does worse than sampling with reverse process)—forming the hypothesis that accurately expressing the current source distribution is crucial for improving the quality of SDS. We validate this hypothesis with a simple approach that appends negative modifiers to better model the source distribution. While this method is likely useful in itself, it primarily serves as a practical way to empirically support our hypothesis (i.e., since it outperforms baseline methods that less accurately model the source distribution).
Most of the reviewer concerns were centered around the particular design decisions of our experimental optimization approach, or noted similarities with existing methods. In this rebuttal (both here and in the individual reviewer responses) we detail some of the motivations for these design decisions and answer some recurring questions:
Our Schrödinger Bridge (SB) interpretation presents SDS as transporting the optimization variable from a source distribution toward a target distribution. This interpretation highlights the importance of accurately modeling the source distribution, where inaccuracies may cause the characteristic artifacts of SDS (e.g., saturation, over-smoothing, etc.). To validate this, we devise an experimental solution that aims to better model the source distribution at different stages of optimization, and compare it to SDS and its variants. Our experimental optimization procedure has two stages. At early stages of optimization, we use the standard SDS optimization objective, since the source distribution estimated by SDS (i.e., the model’s unconditional prediction) is a reasonably good approximation of the sample distribution at initialization (blob initialization in NeRF optimization or zero initialization for images). At later stages, once the SDS objective has begun to instill many of the characteristic artifacts in the optimized solution, we change to modeling the source distribution with the target scene description, appended with a set of standard negative modifiers that approximately describe the collection of SDS artifacts (“oversaturated, smooth, pixelated, cartoon, foggy, hazy, blurry, bad structure, noisy, malformed”). While this descriptor is fixed across all sequences (and therefore does not require per-instance/domain hand-crafting), it much more accurately models the intermediate source distribution, and thus we find that it is effective at steering the optimization toward an artifact-free solution.
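As a compact illustration of this schedule, here is a minimal sketch of the two-stage gradient. It is a simplification rather than our exact implementation: `unet`, `encode`, and the latent handling are placeholder assumptions, and the 60/40 split follows the configuration in the individual responses.

```python
import torch

NEG = ("oversaturated, smooth, pixelated, cartoon, foggy, hazy, blurry, "
       "bad structure, noisy, malformed")

def distill_loss(render, step, total_steps, prompt,
                 unet, encode, alphas_cumprod, w=1.0):
    # Surrogate loss whose gradient w.r.t. `render` is the two-stage
    # distillation direction.
    t = torch.randint(20, 981, (1,))
    a_bar = alphas_cumprod[t]
    noise = torch.randn_like(render)
    x_t = a_bar.sqrt() * render.detach() + (1 - a_bar).sqrt() * noise
    with torch.no_grad():
        eps_tgt = unet(x_t, t, encode(prompt))
        if step < 0.6 * total_steps:
            eps_src = noise                      # stage 1: standard SDS direction
        else:                                    # stage 2: negative modifiers
            eps_src = unet(x_t, t, encode(prompt + ", " + NEG))
    grad = w * (eps_tgt - eps_src)
    return (grad * render).sum()                 # d(loss)/d(render) = grad
```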
SDS by itself is shown to be notably worse in all our comparisons, but one may additionally wonder—why not only use the second stage? As noted above, our objective is to improve the estimate of the source distribution, and at early stages, SDS’s unconditional sample already models the source distribution (random initialization) reasonably well. In fact, the negative prompt modifiers likely model this distribution particularly poorly. We experimented with this variant, and found that optimization either (1) diverges, resulting in an entirely black volume or (2) generates very unreasonable geometry. A three-way comparison is shown in the attached PDF Figure A1, showing that our proposed two-stage approach is clearly better than both (1) only SDS, and (2) only stage-2.
Our analysis of the error present in SDS includes two potential sources: (1) the first-order approximation error, as well as (2) the source distribution mismatch error described above. SDS incurs the first error by using only a single-step estimate of the Schrödinger Bridge rather than solving it fully through the full-path PF-ODE. To validate that reducing the first source of error can further improve SDS (XvCb), we perform an experiment in which we solve the entire PF-ODE path to recover the dual bridge (instead of using a first-order approximation) and estimate the endpoint of the bridge that is coupled with the current sample. We use this endpoint as the target. Although this is slow, it consistently produces better results, as shown by the lower COCO-FID scores in the table below. Please see the response to XvCb for more details:
| Method | Addressing linear approx. error | Addressing dist. mismatch error | FID |
|---|---|---|---|
| SDS | No | No | 79.95 |
| with two-stage | No | Yes | 69.82 |
| with full-path PF-ODE | Yes | No | 66.51 |
| with two-stage & full-path | Yes | Yes | 62.69 |
We have also performed more experiments as suggested, described in the individual responses below. We will add these additional experiments and figures in the revised paper. We hope these additions strengthen the proposed Schrödinger Bridge SDS interpretation, which is our primary contribution.
The paper proposes a new interpretation of content generation via score distillation sampling, which leads to a range of quality improvements when existing approaches are adjusted according to this insight.
This submission received borderline scores. The author response seems to have been effective, as two reviewers increased their scores. This indicates the authors will be willing and able to produce a much improved final manuscript.
No blocking issues seem to have been brought up during the review and discussion.
An ethics aspect was raised, the authors replied, and it seems that this is a simple user study about the fidelity of images, where the user responses are not used to train any model or influence the method at all. So no issue here.
The AC did not read the paper or understand it in detail, but as the scores seem good now, this probably is an accept.