PaperHub
Overall score: 7.8/10 · Poster · 4 reviewers
Ratings: 5, 5, 5, 4 (min 4, max 5, std 0.4) · Average confidence: 3.5
Novelty 2.8 · Quality 3.0 · Clarity 2.5 · Significance 2.8
NeurIPS 2025

Conditioning Matters: Training Diffusion Policies is Faster Than You Think

OpenReview · PDF
Submitted: 2025-05-09 · Updated: 2025-10-29

Abstract

Keywords
Embodied AI; Diffusion policies; Imitation learning; Diffusion models; Flow matching; Vision-language action models

Reviews and Discussion

Review
Rating: 5

The paper shows that diffusion-based imitation learning approaches can suffer from loss collapse, meaning that the diffusion process can no longer differentiate between different conditions. The authors show theoretically that loss collapse can be avoided if the prior distribution of the diffusion process is also conditioned. To do so, the mean of the prior is learned via an auto-encoding framework. The authors show the effectiveness of the approach on several simulated (LIBERO and MetaWorld) and real robot tasks. Comparisons include baselines with fine-tuned VLA models, where the given approach reaches similar performance with fewer updates. The paper also contains an illustrative study of the effect of conditioning on different modalities, showing that the approach can reduce modality over-reliance.

Strengths and Weaknesses

Strengths:

  • Good theoretical result (even though I could not check the proofs in detail), which is underpinned with empirical data
  • Rather simple approach that can be used to extend various diffusion / flow based architectures
  • Convincing results in sim and on real robot
  • Exhaustive analysis of the properties of the algorithms and ablations

Weakness:

  • Writing could be clearer at some places
  • Other ways to learn the prior distribution should be discussed (and compared) in more detail.

Questions

  • In other applications, the prior of the diffusion process has also been learned, e.g., DALL-E-style priors in image generation or PriorGrad in speech synthesis. It should be discussed why such ways of learning the prior cannot be transferred to this diffusion-policy setup.
  • Could the prior also be learned by supervised learning (i.e., fitting a Gaussian log-likelihood with mean and variance given by a DNN)? This would at least be a good baseline.
  • For a reader who is not fully familiar with the conditional flow matching literature, Section 2 was quite confusing to read, and it only became clearer once I had read the remaining sections. When introducing the variable z, it should be better explained what this variable corresponds to.

Limitations

Limitations are appropriately discussed.

Justification for Final Rating

It's a good paper with nice theoretical insights that uncovers a key problem in the conditioning of diffusion policies. The results are also convincing, and the authors were able to address my minor concerns in their rebuttal.

Formatting Issues

None

Author Response

Question 1: Other ways to learn the prior distribution

This is a very insightful question. First, we'd like to categorize image generation and speech synthesis under the umbrella of AIGC. We want to emphasize that the challenges diffusion models face in Embodied AI versus AIGC are likely very different. AIGC currently grapples with generating increasingly complex modal distributions, which places significant demands on a model's generative capacity. In contrast, action chunks are fundamentally low-dimensional, smooth trajectories, meaning the main difficulty for the model lies in accurately understanding the condition $c$.

Therefore, we believe the motivation behind Cocos is different from the AIGC works you're referring to. For AIGC, their prior distributions primarily encode information about the data itself, aiming to improve the realism and fidelity of synthetic data. However, Cocos is designed to enhance the model's ability to differentiate between conditions and to prevent loss collapse.

What about learning a data prior in a supervised learning way? More specifically, using an MSE loss to learn an action prior, and then using a diffusion model to refine this prior distribution? We did experiment with this approach: we removed the autoencoder's decoder and used the encoder as an action prior model, predicting the action directly with an MSE loss. Similar to the autoencoding version, we added Gaussian noise with $\text{std}=0.2$ to the predicted action to represent $q(x_0|c)$. In this case, $x_0$ becomes the "not-so-accurate" action overlaid with Gaussian noise.

Here are the results of this experiment:

| Method | 5k | 10k | 20k | 30k |
| --- | --- | --- | --- | --- |
| cocos (autoencoding) | 83.1 ± 0.9 | 81.1 ± 1.3 | 92.2 ± 1.3 | 93.8 ± 0.3 |
| cocos (action prior) | 70.0 ± 1.8 | 81.0 ± 1.0 | 84.5 ± 0.6 | 87.2 ± 0.6 |
| vanilla (w/o cocos) | 66.5 ± 1.6 | 78.3 ± 0.5 | 83.7 ± 0.3 | 82.9 ± 0.4 |

We observed that while this approach still improves training efficiency and scores compared to a vanilla Gaussian prior $p(x_0)$, it does not perform as well as autoencoding $c$. This suggests that using an explicit action prior is not necessarily a better choice. We believe that compressing as much information as possible from the condition (e.g., vision-language cues) into $p(x_0|c)$ might be a more critical choice.
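For concreteness, here is a minimal sketch of the sampling difference being discussed: one conditional flow-matching training step where the source sample is drawn from a condition-dependent Gaussian instead of $\mathcal{N}(0, I)$. The function and module names (`cfm_step`, `v_theta`, `prior_encoder`) and the linear interpolation path are our illustrative assumptions, not the exact implementation in the paper.

```python
import torch

def cfm_step(v_theta, prior_encoder, x1, c, sigma=0.2, cond_prior=True):
    """One conditional flow-matching step on expert actions x1: (B, D).

    With cond_prior=True, x0 ~ N(prior_encoder(c), sigma^2 I), i.e. q(x0|c);
    otherwise x0 ~ N(0, I), the vanilla prior under which loss collapse
    can occur. v_theta(x_t, t, c) predicts the velocity field.
    """
    if cond_prior:
        mu = prior_encoder(c)                  # condition-dependent mean
        x0 = mu + sigma * torch.randn_like(x1)
    else:
        x0 = torch.randn_like(x1)              # condition-independent prior

    t = torch.rand(x1.shape[0], 1, device=x1.device)
    xt = (1.0 - t) * x0 + t * x1               # linear probability path
    target = x1 - x0                           # constant-velocity target
    return ((v_theta(xt, t, c) - target) ** 2).mean()
```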




Question 2: Section 2 Clarity

We sincerely apologize that Section 2 was confusing to read. We understand the importance of clarity and will thoroughly revise this section. We'll ensure that all notations, especially the definition of $z$, are explained more precisely. Our goal is to make it much easier for readers to smoothly follow the main argument of the paper.




Thank you once again for your insightful suggestions. We'll incorporate all the discussions above into our revision, conduct a broader survey of prior distributions, and delve deeper into their exploration. We look forward to your further feedback and will respond promptly.

Comment

Thanks for the additional experiments and clarifications, they will further improve the paper. However, as my score is already an accept, I remain with my already positive score.

Comment

I'm glad our rebuttal addressed your concerns. We will continue to refine the revised paper!

Review
Rating: 5

This paper uses conditioning-variable dependent priors as means to increase the performance of flow-based policies, particularly by improving both training time and sensitivity to conditioning variables.

Strengths and Weaknesses

Strengths:

(1) This method clearly works very, very well. Its performance is almost uniformly an improvement over the state of the art.
(2) The method is quite modular and easy to use; indeed, the principle is compatible with any particular choice of prior. Moreover, the VAE method has competitive performance with the optimal \beta choice, which limits additional hyperparameters, and the EMA version has comparable performance.
(3) Cocos drastically enhances policy expressivity, allowing the model to compete with far more complex architectures.
(4) The authors identify context sensitivity as a key bottleneck to performance.
(5) Generally, the experiments were very diligent and comprehensive. The visualizations representing the actions, I find, were very insightful.
(6) Overall, the writing is rather clear (with the exception of the theorem below).
(7) The benchmarks, real-hardware experiments, and use of text conditioning as well as observation conditioning make the results very compelling.

Weaknesses:

(1) The theorem is incredibly informal. The authors do not define certain key terms in the theorem statement, and the phrasing is colloquial. Moreover, it is not clear to what extent the theorem "adds something new", in the sense that all that is being said is that if your scores are Lipschitz in c, your network will do similar things for similar conditions. This is rather tautological. I understand what it is trying to say, but I would encourage the authors to include a more formal statement if they intend to include a theorem.

(2) I am a little unclear about how the encoding of the action behaves. What seems strange is that F(E(c)) doesn't really seem like an action prior, in the sense that it is trained purely via an autoencoder loss rather than, say, a Gaussian approximation to a target action distribution. Said otherwise, the distribution \mu(x_0 | c) is "separated" for different c, but the prior doesn't ensure that \mu(x_0 | c) is close to the target p(x_1 | c). I think this is OK - in some sense, you just want to separate out the network gradients - but it is counterintuitive and bears some explanation. Moreover, it might make sense if \mu(x_0 | c) were related to p(x_1 | c) in some way (but perhaps it doesn't have to be?).

Questions

Can the authors describe their design choices for learning $\mu(x_0|c)$ purely via an autoencoder loss? Do the authors think performance could be improved if the action prior were somehow closer to the action distribution?

Limitations

Yes, the authors adequately describe limitations.

Justification for Final Rating

This is a very nice contribution that elucidates the effect of coupled priors on better action generation. I think this will be a useful contribution to the literature.

Formatting Issues

None

Author Response

Question 1: Improve Theorem quality

We sincerely apologize for the current quality of this section! Thank you for your suggestion; we will certainly improve the rigor and clarity of this content, including more formal statements. Due to NeurIPS policy this year, we cannot provide external links or PDFs, but we will include the revised version in our updated paper.




Question 2: Design Choice for $\mu(x_0|c)$

Our design choice for this form primarily stems from our finding that the way to avoid loss collapse is to sample training data from $q(x_1, c)\,q(x_0|c)$ instead of $q(x_1, c)\,q(x_0)$. We elaborate on this in the final part of Theorem 1. (Sorry again for any unclear expressions in the Theorem; we will improve it.) Therefore, the simplest way to recover $q(x_0|c)$ is to perform autoencoding, making $x_0$ a latent representation of $c$, which doesn't necessarily have to be an action prior.

We also experimented with making $q(x_0|c)$ directly an action prior using a supervised MSE loss. Specifically, we removed the autoencoder's decoder and used the encoder as an action prior model, predicting the action directly with an MSE loss. Similar to the autoencoding version, we added Gaussian noise with $\text{std}=0.2$ to the predicted action to represent $q(x_0|c)$. In this case, $x_0$ becomes the "not-so-accurate" action overlaid with Gaussian noise.

Here are the results of this experiment:

| Method | 5k | 10k | 20k | 30k |
| --- | --- | --- | --- | --- |
| cocos (autoencoding) | 83.1 ± 0.9 | 81.1 ± 1.3 | 92.2 ± 1.3 | 93.8 ± 0.3 |
| cocos (action prior) | 70.0 ± 1.8 | 81.0 ± 1.0 | 84.5 ± 0.6 | 87.2 ± 0.6 |
| vanilla (w/o cocos) | 66.5 ± 1.6 | 78.3 ± 0.5 | 83.7 ± 0.3 | 82.9 ± 0.4 |

We observed that while this approach still improves training efficiency and scores compared to a vanilla Gaussian prior $p(x_0)$, it does not perform as well as autoencoding $c$. Overall, we believe modeling action distributions for diffusion models is a relatively easy task: they aren't as complex as modalities like images. The current bottleneck for diffusion policies likely lies in more accurately understanding $c$.
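To make the two variants compared above concrete, here is a minimal sketch of how the prior model could be trained in each case; the class name, layer sizes, and loss wiring are illustrative assumptions rather than our exact implementation.

```python
import torch
import torch.nn as nn

class CondPrior(nn.Module):
    """Sketch of the two prior variants compared above (our naming).

    'autoencoding': x0 = E(c) is a latent code of the condition, trained
        only to reconstruct c through a decoder (no action supervision).
    'action_prior': the decoder is dropped and E(c) regresses the expert
        action directly with an MSE loss.
    Both variants add N(0, 0.2^2) noise at sampling time to form q(x0|c).
    """

    def __init__(self, cond_dim, act_dim, mode="autoencoding"):
        super().__init__()
        self.mode = mode
        self.enc = nn.Sequential(
            nn.Linear(cond_dim, 256), nn.ReLU(), nn.Linear(256, act_dim)
        )
        if mode == "autoencoding":
            self.dec = nn.Sequential(
                nn.Linear(act_dim, 256), nn.ReLU(), nn.Linear(256, cond_dim)
            )

    def loss(self, c, action):
        z = self.enc(c)
        if self.mode == "autoencoding":
            return ((self.dec(z) - c) ** 2).mean()  # reconstruct the condition
        return ((z - action) ** 2).mean()           # regress the action

    def sample_x0(self, c, std=0.2):
        z = self.enc(c)
        return z + std * torch.randn_like(z)        # draw from q(x0 | c)
```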




Thank you very much for your valuable suggestions. We will incorporate all the new content discussed above into our revision. Our goal is to refine any imprecise descriptions and avoid all potential ambiguities. We look forward to your further feedback and will respond promptly.

Comment

Thank you for the feedback! It's interesting to see that the VAE prior works better; perhaps this is due to the Gaussian mean prior leading to greater mode collapse. I remain positive and maintain my score!

Comment

I'm glad our rebuttal addressed your concerns. We will continue to refine the revised paper!

Review
Rating: 5

This paper proposes a modification to conditional flow matching for training diffusion policies. The authors identify a failure mode called loss collapse, where the policy ignores condition inputs and degenerates to modeling the marginal action distribution. To mitigate this, the paper replaces the standard Gaussian prior with a condition-aware source distribution derived from a vision-language encoder. Extensive experiments on the LIBERO, MetaWorld, and real-world robot benchmarks demonstrate improved training efficiency and performance.

Strengths and Weaknesses

Strengths

  • Well-identified failure mode: The analysis of loss collapse is insightful. The authors present both empirical and theoretical evidence of a degenerative training loop, wherein weak condition integration leads to diminished conditioning influence, further reinforcing the collapse.

  • Simple and intuitive solution: Modifying the source distribution to be condition-dependent is a natural and plausible fix. Anchoring the prior around condition semantics better aligns the training objective with the desired conditional generation behavior.

  • Comprehensive experiments: The method is thoroughly evaluated across two simulation benchmarks and 10 real-world tasks.

Weaknesses

  • Lack of clarity in analysis metrics: The use of cosine similarity and norm scale metrics to analyze condition utilization is central to the paper’s claims. However, their interpretations are deferred to the appendix, which makes it harder to follow the main argument.

  • Incomplete mitigation of root cause: It is questionable whether the method fundamentally resolves the loss collapse issue or merely reduces its likelihood. The method strengthens conditional input but does not seem to prevent collapse.

  • Potential limitations in diversity and generalization: The demonstrations used in LIBERO and MetaWorld are relatively deterministic. It remains unclear how well the method scales to environments with highly diverse or multimodal demonstrations.

  • Questionable semantic prior assumption: The assumption that vision-language features offer a good prior for action distributions may not always hold. In some settings, similar instructions can require significantly different actions due to scene geometry or context.

Questions

  • How does the method prevent loss collapse, or does it only reduce the chance of collapse?

  • How does the method perform when applied to datasets with highly diverse demonstrations?

  • Can you provide more insight into what would happen when semantic vision-language priors misalign with the actual distribution (e.g., occlusions, ambiguous phrasing)?

Limitations

The paper briefly discusses limitations regarding the design of the condition-aware prior (e.g., fixed variance Gaussian) and the scope of current evaluations.

Justification for Final Rating

Interesting analysis and neat solution.

Formatting Issues

No

Author Response

Question 1: Lack of Clarity in Analysis Metrics

Thanks for your feedback. We apologize for the oversight regarding clarity; we'll move the explanations of these metrics from the appendix to the main body in the revision. This will ensure readers can follow the main arguments more smoothly.




Question 2: How does the method prevent loss collapse, or does it only reduce the chance of collapse?

We believe this is a misunderstanding stemming from insufficient clarity in our explanation. In the final part of Theorem 1, we theoretically demonstrate that loss collapse is entirely avoided by sampling training data from $q(x_1, c)\,q(x_0|c)$ instead of $q(x_1, c)\,q(x_0)$. This approach ensures that the upper bound on the distance between loss gradients for different conditions can be arbitrarily large, preventing collapse from occurring. This isn't a heuristic that merely reduces the chance of collapse. We'll adjust the relevant section in the revision to highlight this explanation more clearly and prevent any misunderstandings.
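In symbols (our transcription, assuming the standard linear path $x_t = (1-t)x_0 + t x_1$ with velocity target $x_1 - x_0$), the two objectives differ only in how $x_0$ is drawn:

```latex
% Independent coupling: the source ignores c, and the optimal velocity
% field can degenerate toward the marginal one (loss collapse).
\mathcal{L}_{\mathrm{indep}}(\theta)
  = \mathbb{E}_{(x_1,c)\sim q(x_1,c),\; x_0\sim q(x_0),\; t\sim\mathcal{U}[0,1]}
    \big\| v_\theta(x_t, t, c) - (x_1 - x_0) \big\|^2

% Conditional coupling: x_0 ~ q(x_0|c), so the interpolant x_t itself
% carries information about c and the collapse argument no longer applies.
\mathcal{L}_{\mathrm{cond}}(\theta)
  = \mathbb{E}_{(x_1,c)\sim q(x_1,c),\; x_0\sim q(x_0\mid c),\; t\sim\mathcal{U}[0,1]}
    \big\| v_\theta(x_t, t, c) - (x_1 - x_0) \big\|^2
```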




Question 3: Potential Limitations in Diversity and Generalization

We agree that demonstrations in MetaWorld are relatively deterministic. However, Libero offers diversity. For instance, all tasks in Libero-Goal are set in nearly identical environments (except for some randomness in object layout). Demonstrations involve multiple tasks performed in the same scene, requiring the model to fully understand language instructions to differentiate them. Additionally, due to layout randomness and variations in human teleoperator habits, demonstrations themselves are diverse and multi-modal. Furthermore, our real-robot experiments also incorporate randomness in object placement and maintain diversity in manipulation actions during teleoperation data collection. Therefore, we believe our current experimental design ensures sufficient diversity, except for MetaWorld.




Question 4: Questionable Semantic Prior Assumption

As with our response to Question 2, we believe this is a misunderstanding. Our core argument is that the source distribution should contain sufficient information about the condition $c$, not necessarily that it must be an action prior. We conducted further experiments to investigate this:

We also experimented with making $q(x_0|c)$ directly an action prior using a supervised learning loss. Specifically, we removed the autoencoder's decoder and used the encoder as an action prior model, predicting the action directly with an MSE loss. Similar to the autoencoding version, we added Gaussian noise with $\text{std}=0.2$ to the predicted action to represent $q(x_0|c)$. In this case, $x_0$ becomes the "not-so-accurate" action overlaid with Gaussian noise.

Here are the results of this experiment:

| Method | 5k | 10k | 20k | 30k |
| --- | --- | --- | --- | --- |
| cocos (autoencoding) | 83.1 ± 0.9 | 81.1 ± 1.3 | 92.2 ± 1.3 | 93.8 ± 0.3 |
| cocos (action prior) | 70.0 ± 1.8 | 81.0 ± 1.0 | 84.5 ± 0.6 | 87.2 ± 0.6 |
| vanilla (w/o cocos) | 66.5 ± 1.6 | 78.3 ± 0.5 | 83.7 ± 0.3 | 82.9 ± 0.4 |

We observed that while this approach still improves training efficiency and scores compared to a vanilla Gaussian prior $p(x_0)$, it does not perform as well as autoencoding $c$. This suggests that using an action prior is not necessarily a better choice.

> In some settings, similar instructions can require significantly different actions due to scene geometry or context.

We don't believe this scenario is a concern. Even though the source distribution contains instruction information, the diffusion model still learns a distribution-to-distribution transform. This allows it to effectively model multi-modal distributions and, consequently, output significantly different actions even for similar instructions when context or scene geometry demands it.
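As a sketch of this point: at inference time one still integrates the learned velocity field starting from a random draw of the conditional prior, so different noise draws can be transported to different action modes even under the same instruction. The Euler integrator and names below are our assumptions for illustration.

```python
import torch

@torch.no_grad()
def sample_action(v_theta, prior_encoder, c, sigma=0.2, steps=10):
    """Euler integration of the learned flow from x0 ~ q(x0|c) to an action.

    Distinct draws of the prior noise can land in distinct modes of
    p(x1|c), even though the prior mean encodes the instruction.
    """
    mu = prior_encoder(c)
    x = mu + sigma * torch.randn_like(mu)   # sample the conditional source
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0], 1), i * dt, device=x.device)
        x = x + dt * v_theta(x, t, c)       # follow the velocity field
    return x
```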




Thank you very much for your valuable suggestions. We will incorporate all the new content discussed above into our revision. Our goal is to refine any imprecise descriptions and avoid all potential ambiguities. We look forward to your further feedback and will respond promptly.

Comment

Thank you for the rebuttal. The authors have addressed most of my concerns. I am therefore updating my score to Accept.

Comment

I'm glad our rebuttal addressed your concerns. We will continue to refine the revised paper!

Review
Rating: 4

This paper proposes a simple idea to avoid conditional collapse in diffusion policy learning. The authors claim that condition collapse occurs because the training objective degenerates into modeling the marginal action distribution. To prevent this, they use a simple trick: instead of using a simple isotropic Gaussian to model the noise, they employ a conditional Gaussian that depends on the conditioning variables. Experiments show that the approach is very effective.

Strengths and Weaknesses

Strengths:

  1. The method is simple, yet effective.
  2. Cocos improves over not using it in many experiments.

Weaknesses:

  1. The theoretical section is a bit hard to parse and understand. I would encourage the authors to simplify the writing a bit.
  2. Adding more experiments on datasets like Robomimic, and applying Cocos to different models (different sizes, larger models) or methods, would make the experimental section stronger.
  3. I would have liked to see more analysis of the conditional vectors. For example, how does the \mu vector look for different conditions? Is there a difference in the distributions when instructions change, for example, move left vs. move right? Such analysis would make the paper more interesting.
  4. The method still lags behind methods like Seer. I understand the method is smaller and undertrained. But, if trained with larger models, will it help?

Questions

  1. Does the performance improvement seen with Cocos scale with model size?
  2. Can this method be used for general tasks like text to image generation? I wonder if the findings in this paper help across other tasks.

Limitations

I feel the idea is simple and effective which is nice. But the experimental section seems somewhat weak.

Justification for Final Rating

I thank the authors for all the additional experiments in the rebuttal. It addressed many of my concerns, and hence I upgrade my rating to borderline accept.

Formatting Issues

Author Response

Question 1: Simplify the Writing of Theorem

Thanks for the suggestion! We'll rewrite this section to make the key points clearer, avoid misunderstandings, and simplify the writing to be much easier to follow.




Question 2: Adding More Experiments on Datasets like Robomimic

We didn't initially consider Robomimic primarily because this benchmark seems to be largely solved. Also, since Libero and Robomimic share the same simulation engine, we opted for the more challenging Libero. To address your concern, we've now included some experiments on Robomimic. The tasks seem to be simple, and we don't see a significant difference in performance:

| Method | Square (ph) | Transport (ph) | ToolHang (ph) |
| --- | --- | --- | --- |
| w/ Cocos | 1.0 | 1.0 | 0.98 |
| w/o Cocos | 1.0 | 0.98 | 0.98 |

Note: Since Robomimic tasks lack language instructions, we removed the T5 encoder, and the policy network only accepted vision embeddings. All other settings remained consistent with Libero.



Question 3: Does Cocos Scale with Model Size?

That's an interesting idea! Testing scaling capabilities requires not only increasing model size but also correspondingly increasing data size. Given the limited time for rebuttal, we couldn't undertake large-scale pretraining. Instead, we tried fine-tuning PI0 (a very powerful and popular pre-trained VLA with 3 billion parameters) with Cocos on Libero. Here are the results:

| Method | Goal | Spatial | Object | Long |
| --- | --- | --- | --- | --- |
| w/ Cocos | 86.2 | 88.0 | 93.4 | 72.5 |
| w/o Cocos | 80.4 | 82.7 | 91.6 | 61.2 |

Note: The "w/o Cocos" results show some discrepancy from the official OpenPI report. This is mainly because the limited time and computational resources during the rebuttal forced us to adopt a lightweight fine-tuning strategy: we froze PaliGemma, only trained the action expert, used bf16 precision, a batch size of 32, and did not use PI0's pre-trained stats.

The results show that Cocos can indeed bring benefits when fine-tuning large-sized VLA models.
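For reference, a rough, hypothetical sketch of the lightweight recipe from the note above (frozen backbone, trainable action expert, bf16 autocast, batch size 32), using generic stand-in modules rather than the actual OpenPI classes:

```python
import torch
import torch.nn as nn

# Generic stand-ins; the real PI0 modules (PaliGemma VLM + action expert)
# are far larger and structured differently.
vlm_backbone = nn.Linear(512, 512)
action_expert = nn.Linear(512, 32)

for p in vlm_backbone.parameters():
    p.requires_grad = False            # freeze the VLM backbone

opt = torch.optim.AdamW(action_expert.parameters(), lr=1e-4)

obs = torch.randn(32, 512)             # batch size 32, as in the note above
target = torch.randn(32, 32)           # placeholder action target
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = (action_expert(vlm_backbone(obs)) - target).pow(2).mean()
loss.backward()                        # gradients flow only to the expert
opt.step()
```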




Question 4: The Method Still Lags Behind Methods like Seer

The figures might have been misread; DP+Cocos actually surpasses Seer on Libero. However, it's true that it still doesn't reach the levels of OpenVLA-oft and PI0. Rather than attributing this to model size, we believe it's more likely because these VLA models have undergone large-scale pretraining, whereas DP+Cocos is trained from scratch on specific benchmarks.

Our work primarily aims to explore the capability bottlenecks of current diffusion policies, rather than to release a state-of-the-art VLA model. Therefore, we plan to discuss the application of Cocos in large-scale pretraining more extensively in future work.




Question 5: Can this method be used for general tasks like text-to-image generation?

Theoretically, our method could be beneficial for tasks like text-to-image generation, as our theory is not limited to specific modalities. However, we want to emphasize that the challenges faced by diffusion models in Embodied AI versus AIGC are likely very different.

AIGC currently grapples with generating increasingly complex modal distributions, posing significant demands on a model's generative capacity. In contrast, action chunks are fundamentally low-dimensional, smooth trajectories, and the model's main difficulty lies in accurately understanding the condition $c$. This is why AIGC models are developing increasingly powerful modal compressors to reduce the complexity of the distributions diffusion models need to fit, making the design of a data prior potentially more crucial for image generation. Cocos, on the other hand, is specifically tailored to the challenges of diffusion policies in Embodied AI, aiming to help the model better differentiate between various conditions.




Thank you very much for your valuable suggestions. We will incorporate all the new content discussed above into our revision. Our goal is to refine any imprecise descriptions and avoid all potential ambiguities. We look forward to your further feedback and will respond promptly.

Comment

Thank you again for your suggestions. We wanted to gently remind you that the reviewer-author discussion period is set to conclude in the next one or two days. We have not yet received a response from you, including the mandatory acknowledgement. We are very much looking forward to hearing from you and having the opportunity to discuss our work.

Comment

Dear Reviewers,

We thank all the reviewers for their thoughtful and constructive feedback. We are grateful for your time and for the positive reception of our work. We're encouraged to see that reviewers found our analysis of the loss collapse problem insightful and our solution both simple and effective.

Based on all suggestions, we conducted several new experiments and have clarified key aspects of our method. These results will be included in our revised paper to further strengthen our claims.

  • Design Choice for the Conditional Prior. Reviewers asked about our decision to use an autoencoding loss for the conditional prior $p(x_0|c)$ instead of a supervised action prior. To address this, we ran an experiment directly comparing against a supervised action prior. The results show that the autoencoding method outperforms the supervised action prior, confirming that simply learning a coarse action prior is not the primary benefit. Instead, compressing rich conditional information from vision and language cues into the prior is more crucial for preventing loss collapse.

    | Method | 5k | 10k | 20k | 30k |
    | --- | --- | --- | --- | --- |
    | cocos (autoencoding) | 83.1 ± 0.9 | 81.1 ± 1.3 | 92.2 ± 1.3 | 93.8 ± 0.3 |
    | cocos (action prior) | 70.0 ± 1.8 | 81.0 ± 1.0 | 84.5 ± 0.6 | 87.2 ± 0.6 |
    | vanilla (w/o cocos) | 66.5 ± 1.6 | 78.3 ± 0.5 | 83.7 ± 0.3 | 82.9 ± 0.4 |
  • Scalability with Model Size. To investigate if our method scales with model size, we applied it to fine-tuning PI0, a very powerful pre-trained VLA model with 3 billion parameters. Our method provided significant performance boosts across a variety of tasks on the LIBERO benchmark. This demonstrates that our approach is not limited to smaller models and can effectively enhance the performance of large-scale, pre-trained policies.

    | Method | Goal | Spatial | Object | Long |
    | --- | --- | --- | --- | --- |
    | PI0 w/ Cocos | 86.2 | 88.0 | 93.4 | 72.5 |
    | PI0 w/o Cocos | 80.4 | 82.7 | 91.6 | 61.2 |
  • Performance on Robomimic. We also ran experiments on the Robomimic benchmark. We found that since many of these tasks are less complex, our method provides marginal, but consistent, improvements.

    | Method | Square (ph) | Transport (ph) | ToolHang (ph) |
    | --- | --- | --- | --- |
    | w/ Cocos | 1.0 | 1.0 | 0.98 |
    | w/o Cocos | 1.0 | 0.98 | 0.98 |
Final Decision

This paper studies the training of CNF models for robot control, which approximate distributions over actions given state or goal information, such as images. A small modification is introduced into the modelling: instead of transporting a fixed source distribution to a conditional target by a learned conditional vector field, it is proposed to make the initial distribution dependent on the condition using a pretrained representation. This modification is shown to improve success rates on a variety of simulated and real robot control tasks.

Main strengths, all identified by multiple reviewers:

  • This is a simple, intuitive, widely applicable idea that delivers very strong results on a variety of tasks.
  • Comprehensive experiments, including applications to real robots and well-chosen illustrations.
  • Clear exposition.

Main weaknesses:

  • The theoretical guarantees are not very meaningful and not stated in proper mathematical language. The authors did not respond to this point beyond promising improvement. The authors are strongly encouraged to improve this section.
  • Questions about the manner of defining the conditional source distribution, addressed by a new experiment using a different pretrained representation that approximates the action distribution.
  • Questions about how the method handles multimodality in the action distribution and about scaling, both answered satisfactorily in the rebuttal.

The reviewers unanimously recommend acceptance, and I agree with this recommendation.