Uncertainty-aware Preference Alignment for Diffusion Policies
We propose Diff-UAPA, a novel framework that aligns diffusion policies with human preferences by integrating uncertainty-aware objectives and MAP estimation.
Abstract
Reviews and Discussion
This paper addresses the challenge of sample efficiency and robustness in preference-based reinforcement learning (PbRL), where the learning agent relies on human preference feedback instead of explicit numerical rewards. The authors propose an uncertainty-aware PbRL framework that explicitly models the uncertainty in human feedback and integrates it into both the preference prediction and policy optimization processes. Specifically, the method employs a Bayesian preference model to capture epistemic uncertainty and leverages this information to guide more informative queries and safer policy updates. The paper demonstrates the proposed approach on several benchmark continuous control tasks, showing that the method achieves improved sample efficiency and robustness compared to existing PbRL baselines. The key contributions include the integration of uncertainty estimation into the preference model, the design of an active query strategy based on uncertainty, and an uncertainty-aware policy optimization algorithm.
Strengths and Weaknesses
Strengths:
- This paper accurately identifies a crucial but often overlooked issue in PbRL: the uncertainty of preference data. This is an important topic in current alignment research. The authors propose using a Beta distribution as a prior for trajectory advantage values, combined with regret-based preference modeling and MAP inference, enabling the model to capture preference uncertainty explicitly.
- The paper introduces Diff-UAPA, which is designed for both streaming and batch preference inputs, demonstrating broad applicability.
- The authors tightly integrate the preference alignment problem into the diffusion policy framework, proposing chain reward and advantage modeling based on diffusion trajectories, which addresses limitations faced by DPO/CPL methods in this setting.
- The experimental validation covers diverse scenarios, including synthetic tasks, real-world preference datasets, and real robot control tasks. Moreover, the authors conduct sensitivity analyses under various noise levels and types, along with ablation studies, to demonstrate the method's advantages in dynamic preference iteration and uncertain data conditions.
Weaknesses:
- Although the Beta distribution can theoretically represent uncertainty in the strength of preferences, its effectiveness under complex preference distributions (such as asymmetric or multi-modal cases) remains questionable. Additionally, the paper initializes the Beta parameters α and β to 1. How would varying the initialization affect the model's performance under different scenarios (e.g., the various noise conditions shown in Table 5)?
- Equation 8 and Equation 17 introduce additional computational overhead. How does this overhead compare to previous methods? It would be helpful if the authors could provide a more detailed comparison of the computational cost relative to prior approaches.
- The comparative experiments only benchmark against diffusion-based methods. Why not compare with non-diffusion baselines (such as INPO, IPO, or Nash-MD-PG)?
Others:
- Line 106: “datatset” → “dataset”
Questions
- In Equation 11, multiple layers of approximation are introduced. Could the accumulation of these approximations affect the final performance?
- Proposition 4.1 assumes that trajectories are i.i.d., but in practice, trajectories may be dependent. How can you ensure that this holds in an MDP?
- In Algorithm 1, how is the reference policy initialized? How sensitive is the final performance to the choice of the reference policy?
Limitations
yes
Final Justification
The authors' rebuttal has addressed most of my concerns. Based on the current state of the paper and the authors' clarifications, I consider the paper marginally acceptable.
Formatting Issues
No formatting issues
Dear Reviewer,
We sincerely appreciate the time and effort you have dedicated to reviewing our work. We have carefully addressed each of your comments with detailed responses and clarifications. We hope these will effectively resolve your concerns.
Q1. Although the Beta distribution can theoretically represent uncertainty in the strength of preferences, its effectiveness under complex preference distributions (such as asymmetric or multi-modal cases) remains questionable.
Response. We thank the reviewer for raising this important point. To clarify, we do not use the Beta distribution to model preference strength directly. Instead, as detailed in Lines 213–217, we use a Beta distribution as a prior on the win-probability, i.e., the probability that a trajectory wins against the average candidate. Human pairwise feedback is a binary comparison and thus inherently Bernoulli (win vs. loss), so the Beta distribution is a natural conjugate choice: its two parameters flexibly capture both skewed (peaked near 0 or 1) and broad (centered near 0.5) win-probability beliefs without much added complexity. Empirically, even under asymmetric or multi-modal cases, the aggregated win-probabilities remain effectively unimodal and are well fit by a single Beta; for example, they concentrate near 0 or 1 when consensus is strong. We acknowledge that truly multi-modal preference patterns may necessitate more expressive priors such as mixtures of Beta distributions, which is a promising direction for future work.
Q2. Additionally, the paper initializes the Beta parameters α and β to 1. How would varying the initialization affect the model's performance under different scenarios (e.g., the various noise conditions shown in Table 5)?
Response. Thank you for raising this question. In this work, we use Beta(1, 1) as an uninformative prior. This corresponds to a uniform distribution over [0, 1], which is a natural choice for an uninformed belief and provides a neutral starting point with the least information.
In terms of varying α and β to other values, this indeed has an impact on model performance. For instance, setting α = 9999 and β = 1 introduces strong prior information into the model (implying 9999 persons vote 1 and only one person votes 0). Moreover, the absolute values of α and β critically affect prior strength: although a weak symmetric prior and a highly concentrated symmetric prior both have a mean of 0.5 (treating every trajectory as an average player), the latter encodes a far stronger initial belief in this "half-win" probability. Such a concentrated prior can dominate the posterior and hinder adaptation to true preference signals. For these reasons, we choose the uniform distribution for initialization.
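To make the effect of initialization concrete, here is a minimal sketch (illustrative only, not the paper's code; the vote counts are hypothetical) of how the conjugate posterior reacts to the same ten preference votes under different Beta initializations:

```python
from scipy.stats import beta

# Hypothetical example: 7 annotators prefer a trajectory, 3 do not.
wins, losses = 7, 3

for a0, b0 in [(1, 1), (1000, 1000), (9999, 1)]:
    post = beta(a0 + wins, b0 + losses)  # conjugate Beta-Bernoulli update
    print(f"Beta({a0},{b0}) prior -> posterior mean {post.mean():.3f}, std {post.std():.3f}")

# Beta(1,1) is uniform on [0,1], so the ten votes dominate (posterior mean ~0.67).
# Beta(1000,1000) keeps the posterior pinned near 0.5, and Beta(9999,1) near 1,
# largely regardless of the observed preferences.
```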
Q3. Equation 8 and Equation 17 introduce additional computational overhead. How does this overhead compare to previous methods? It would be helpful if the authors could provide a more detailed comparison of the computational cost relative to prior approaches.
Response. Thank you for your thoughtful comment. We address the computational overhead introduced by the diffusion policy (Eq. 8) and the Beta model (Eq. 17) in Appendix D.3. To summarize, training the diffusion policy takes approximately twice as long as our Transformer baseline. However, this overhead is partially mitigated by the action-sequence prediction strategy adopted from [1], which improves sample efficiency. As for the Beta prior, it is learned using efficient reparameterization techniques, incurring a cost comparable to fitting a standard reward model. In practice, this adds only a few additional minutes to the overall training time.
[1] Diffusion policy: Visuomotor policy learning via action diffusion. International Journal of Robotics Research, 2023.
Q4. The comparative experiments only benchmark against diffusion-based methods. Why not compare with non-diffusion baselines (such as INPO, IPO, or Nash-MD-PG)?
Response. Thank you for the comment. We would like to clarify that our evaluation does include non-diffusion baselines. Specifically, as shown in Lines 265–270 and Table 1, there is a line of baselines based on the Behavior Transformer (BET), which is a representative non-diffusion policy optimization method. We assume that the "non-diffusion" approaches you mention (INPO, IPO, Nash-MD-PG) correspond to "non-DPO" methods. Indeed, our experiments also include non-DPO methods like UA-PbRL and RIME (see Lines 336–340 and Table 4).
To address your concerns, we conducted additional experiments using the baselines you mentioned.
| Method | Lift | Can | Square | Transport | p1 (kitchen) | p2 (kitchen) | p3 (kitchen) | p4 (kitchen) |
|---|---|---|---|---|---|---|---|---|
| InPO | 53.2 ± 1.4 | 54.2 ± 2.2 | 57.2 ± 5.3 | 50.4 ± 4.3 | 100.0 ± 0.0 | 98.2 ± 1.4 | 90.1 ± 2.8 | 52.2 ± 1.1 |
| IPO | 50.7 ± 0.9 | 53.0 ± 4.3 | 54.7 ± 5.8 | 53.0 ± 12.1 | 99.3 ± 0.4 | 99.3 ± 0.4 | 92.3 ± 0.0 | 55.3 ± 1.4 |
| Diff-UAPA | 56.1 ± 0.9 | 61.3 ± 2.2 | 68.1 ± 0.6 | 64.0 ± 4.0 | 100.0 ± 0.0 | 99.7 ± 0.2 | 95.4 ± 0.6 | 70.9 ± 4.6 |
The results highlight the uncertainty-aware capability of Diff-UAPA. In contrast, InPO and IPO perform worse due to the absence of mechanisms for handling uncertainty in noisy labels.
Q5. In Equation 11, multiple layers of approximation are introduced. Could the accumulation of these approximations affect the final performance?
Response. Thank you for your valuable question. We assume that the 'approximation' you mention refers to the 'inequality' and 'expectation' in Eq. 11. We have discussed the reasons and offered a more detailed justification of the necessity and rationale behind our model in Lines 193–197.
To clarify, approximating an expectation via sampling is common practice. Regarding Jensen's inequality:
- Jensen's inequality gives an upper bound on the negative log-sigmoid, and the resulting error does not grow without bound with the variance, so training with the loss of Eq. 11 yields results similar to Eq. 10.
- Evaluating Eq. 10 directly means computing probabilities at every diffusion step, which is an expensive process where small errors can be amplified by the outer sigmoid. By applying Jensen's inequality to the log-expectation, we obtain a simpler surrogate loss that is both faster to compute and more stable.
It is also worth noting that the derivation from Eq. 10 to Eq. 11 follows the standard practice for MLE-based diffusion policy alignment. Similarly, [2] applies an analogous approximation when moving from Eq. 13 to Eq. 14.
[2] Diffusion model alignment using direct preference optimization. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
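For concreteness, a worked sketch of the bound discussed above (notation is illustrative rather than the paper's exact symbols; Δ stands for a per-step preference score sampled along the diffusion chain):

```latex
% -\log\sigma(x) = \log(1 + e^{-x}) is convex, so Jensen's inequality gives
\[
  -\log \sigma\big(\mathbb{E}[\Delta]\big) \;\le\; \mathbb{E}\big[-\log \sigma(\Delta)\big].
\]
% Minimizing the right-hand side (the sampled, per-step surrogate corresponding to
% Eq. 11) therefore minimizes an upper bound on the log-expectation objective of
% Eq. 10, without evaluating probabilities at every diffusion step.
```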
Q6. Proposition 4.1 assumes that trajectories are i.i.d., but in practice, trajectories may be dependent. How can you ensure that this holds in an MDP?
Response. We appreciate the reviewer's question regarding Proposition 4.1. In fact, Proposition 4.1 rests on the classic Beta–Bernoulli conjugacy, which formally assumes that the wins observed across comparisons behave like conditionally independent Bernoulli trials given a latent win-probability. Crucially, one only needs the weaker condition of exchangeability: De Finetti's theorem guarantees that any exchangeable Bernoulli sequence can be viewed as i.i.d. draws from some latent win-probability, yielding the Beta posterior. In episodic RL, independent environment resets, a stationary policy during data collection, and randomized initial states ensure that the sequence of win/loss labels across trajectories is effectively exchangeable (and hence conditionally independent given the latent win-probability), even though state–action pairs within each trajectory remain correlated by the MDP dynamics. Therefore, Proposition 4.1 does not require trajectories to be strictly i.i.d.; it only requires exchangeability of the binary preference outcomes, a condition naturally met by standard episodic rollout protocols.
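The conjugate update behind the proposition is the standard Beta–Bernoulli result; written out with illustrative notation (p the latent win-probability, k wins in n comparisons):

```latex
\[
  p \sim \mathrm{Beta}(\alpha, \beta), \quad
  y_1, \dots, y_n \mid p \overset{\text{i.i.d.}}{\sim} \mathrm{Bernoulli}(p)
  \;\;\Longrightarrow\;\;
  p \mid y_{1:n} \sim \mathrm{Beta}\big(\alpha + k,\; \beta + n - k\big),
  \quad k = \textstyle\sum_{i} y_i ,
\]
% and by De Finetti's theorem the conditional i.i.d. structure follows from
% exchangeability of the binary labels alone.
```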
Q7. In Algorithm 1, how is the reference policy initialized? How sensitive is the final performance to the choice of the reference policy?
Response. Thank you for your valuable question. The reference policy is first randomly initialized and then trained via standard behavior cloning on the entire set of state-action pairs extracted from the preference dataset, until it reaches a predefined performance threshold (e.g., a target cumulative reward); a minimal sketch of this initialization appears after the table below. To address your concern about sensitivity to the performance of the reference policy, we conducted additional experiments varying its performance in the Robomimic Lift environment.
| Reference Policy | BET-CPL | BET-DPO | Diff-CPL | FKPD | Diff-UAPA-C | Diff-UAPA-I |
|---|---|---|---|---|---|---|
| 25% | 40.7 ± 4.2 | 37.3 ± 5.0 | 40.6 ± 5.7 | 42.0 ± 2.0 | 44.7 ± 4.1 | 41.7 ± 2.3 |
| 75% | 79.0 ± 7.8 | 76.3 ± 1.3 | 80.0 ± 2.0 | 82.3 ± 1.3 | 84.0 ± 3.5 | 82.7 ± 3.1 |
From the experimental results, it can be seen that the quality of the initial reference policy affects the final result, but Diff-UAPA still outperforms the other baseline methods.
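For clarity, a minimal behavior-cloning sketch of the initialization step (illustrative only: the actual reference policy is a diffusion policy trained with its own denoising objective, and `eval_fn` / `target_return` are hypothetical stand-ins for the performance threshold):

```python
import torch
import torch.nn as nn

def pretrain_reference_policy(policy: nn.Module, states, actions,
                              eval_fn, target_return, epochs=100, lr=1e-4):
    """Behavior cloning on (state, action) pairs until a performance threshold is met."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        loss = nn.functional.mse_loss(policy(states), actions)  # BC regression loss
        opt.zero_grad()
        loss.backward()
        opt.step()
        if eval_fn(policy) >= target_return:  # stop once the threshold is reached
            break
    return policy
```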
Q8. Some typos.
Response. Thank you for your valuable suggestions. We have corrected them accordingly and thoroughly checked the paper to avoid similar issues.
Thank the authors for their carefully prepared rebuttal. It addresses most of my concerns. I have raised my rating to 4.
We are immensely grateful for your updated evaluation. Your thoughtful and thorough feedback has been invaluable in refining our work, and we deeply appreciate the time and effort you have dedicated to this process. Your constructive input has significantly contributed to the improvement of our paper, and we are honored by your positive reassessment. Thank you once again for your generous consideration and support.
This paper presents Diff-UAPA, an uncertainty-aware preference alignment algorithm for diffusion policies. The algorithm employs a maximum posterior objective with an informative Beta prior to align policies with regret-based preferences, enabling direct optimization without explicit reward functions while mitigating inconsistent preferences across groups. Extensive experiments on simulated and real-world robotics tasks demonstrate Diff-UAPA's effectiveness.
Strengths and Weaknesses
Strengths:
- The method is reasonable, and the paper is clearly written.
- Diff-UAPA shows demonstrable effectiveness in simulated and real-world robotics tasks.
Weaknesses:
- This method builds upon several existing works [1,2,3], which may limit the novelty of the proposed approach. In my understanding, compared to [3], this paper primarily replaces the reward distribution with an advantage distribution and represents it using a Beta distribution (where the advantage derivation process is similar to [1]), while the optimization objective derivation for Diffusion Policy is largely based on [2]. The paper would benefit from clarifying the distinctions from these prior works to better highlight its contributions.
- The experimental evaluation mainly employs label flipping in preference datasets, which may not fully replicate the inherent uncertainties characteristic of human preferences.
- The necessity of the Beta distribution warrants further justification—whether directly predicting advantage distributions (as in [3]) or employing alternative distributions would be equally effective. Moreover, the Beta distribution's unimodal nature may limit its ability to capture diverse human preferences.
[1] From r to Q*: Your Language Model is Secretly a Q-Function. COLM 2024.
[2] Diffusion model alignment using direct preference optimization. CVPR 2024.
[3] A distributional approach to uncertainty-aware preference alignment using offline demonstrations. ICLR 2025.
Questions
- In the introduction, Line 54, the authors mention "interpreting preference alignment as a voting process," but this concept is not elaborated upon or referenced again in the main text.
- It appears that UA-PbRL [1] and other uncertainty-aware preference alignment algorithms are not compared in the main experimental evaluation.
- While the paper frequently refers to iterative preference alignment, the model training appears to be offline, which seems contradictory.
[1] A distributional approach to uncertainty-aware preference alignment using offline demonstrations. ICLR 2025.
Limitations
Yes.
Final Justification
While the author has resolved most concerns, the overall contribution remains at the same level. I maintain my original score.
Formatting Issues
No.
Dear Reviewer, we sincerely value your time and effort in evaluating our work. We have prepared comprehensive responses and clarifications to address each point you raised. We hope these responses can resolve your concerns.
Q1. This method builds upon several existing works ....
Response. Thank you for raising this concern. We would like to provide a more detailed discussion highlighting the differences between this paper and the three prior works as follows.
- Difference with [1]. Though both works start from maximum entropy reinforcement learning, [1] uses the cumulative reward of the trajectory as the comparison metric, while ours adopts the advantage function.
  - Preference model. In [1], the preference model employs the cumulative reward of a trajectory as the comparison metric, utilizing an implicit reward model. Since our research is based on diffusion policies, we specifically define a chain advantage function for the diffusion policy that includes all intermediate latent variables. The marginalized form of the chain advantage function serves as the advantage function in our setting (Eq. 8). That means we need to consider the probability of the whole diffusion path rather than the probability of a single state-action pair.
- Difference with [2]. While we adopt some techniques from [2], such as Jensen's inequality and the convexity of the negative log-sigmoid when deriving Eq. 12 from Eq. 11 (we cite [2] accordingly in the derivation), the problem setting, preference model, and derivation procedure are different. Additionally, we would like to highlight that the alignment of the diffusion policy via MLE is not our primary contribution. Instead, we primarily follow the approach developed for LLMs in [2] and adapt it to the RL setting in our work.
  - Preference model and derivation procedure. [2] employs a reward-based preference model, which relies on the reward of the LLM's final output (see Eq. 3 in [2]) while regularizing the KL-divergence with respect to a reference policy. In contrast, this work adopts a regret (advantage)-based preference model (see Eq. 5) within the maximum entropy RL framework; a schematic form of this preference model is sketched after the references below. Although both preference models ultimately lead to similar objectives for aligning the diffusion policy, the derivation procedures are somewhat different. Specifically, the derivation in this work is based on trajectory-wise alignment, whereas [2] only addresses step-wise alignment. To accomplish this using the advantage-based preference model, we introduce the chain advantage function (i.e., Eq. 8), which is defined based on the chain reward function presented in [2]. We then calculate the expected value of the chain advantage function with respect to the diffusion latent variables, which serves as a key index for evaluating the trajectory.
- Difference with [3]. While both works utilize the Beta prior with a MAP objective to model uncertainty during the alignment process, this work differs significantly regarding the problem formulation, underlying motivation, and approach to incorporating the Beta prior. Specifically,
  - Implementation. [3] employs an iterative update rule, which is an approximation approach from numerical analysis (see Eq. 8 in [3]). This approach implicitly assumes an average reward of zero and allows for substantial fluctuation in the Beta prior parameters (α, β). In contrast, our framework integrates the Beta prior directly into a trainable loss function for the target policy. We use a marginalized chain advantage function as our comparison metric, computed by taking the expectation of the chain advantage over the diffusion latent variables. This distinction means the assumption of a zero-mean reward function is not applicable to our method. Furthermore, we have to carefully control the numerical magnitudes of α and β to appropriately regularize the influence of the prior loss during training. These fundamental differences in implementation highlight a notable gap between the two approaches.
  - Approach. [3] adopts a two-step procedure, proposing a MAP objective for learning a distributional reward model. To achieve this, [3] introduces an iterative update rule that refines the reward model using the α and β parameters output by the learned Beta prior model, until it converges to the maximum of its MAP objective. This reward model is then used for policy learning. In contrast, this work derives a unified MAP objective for directly aligning the diffusion policy in a single-step process, without the need for reward learning. We update our diffusion policy using a unified loss function (Eq. 17). By maximizing the likelihood of the diffusion policy's output under the learned Beta distribution, the process guides the policy to align the estimated win-probabilities with their learned prior distribution. This approach is more straightforward and efficient.
[1] From r to Q*: Your Language Model is Secretly a Q-Function. COLM 2024.
[2] Diffusion model alignment using direct preference optimization. CVPR 2024.
[3] A distributional approach to uncertainty-aware preference alignment using offline demonstrations. ICLR 2025.
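As referenced above, a schematic form of a regret (advantage)-based Bradley–Terry preference model of the kind Eq. 5 builds on (notation is illustrative and not necessarily identical to the paper's):

```latex
\[
  P\big(\tau^{+} \succ \tau^{-}\big)
  = \sigma\!\Big(\sum_{t} \gamma^{t} A\big(s_{t}^{+}, a_{t}^{+}\big)
               - \sum_{t} \gamma^{t} A\big(s_{t}^{-}, a_{t}^{-}\big)\Big),
\]
% where A is the maximum-entropy advantage and \gamma the discount factor; in our
% setting A is replaced by the expectation of the chain advantage over the diffusion
% latent variables, making the comparison trajectory-wise rather than step-wise.
```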
Q2. The experimental evaluation ....
Response. We appreciate your observation regarding the use of label flipping in our experimental evaluation of preference datasets. To address the inherent uncertainties characteristic of human preferences, we incorporated multiple noise models to simulate inconsistent human behavior. Specifically, our evaluation, as detailed in [1], includes the Stochastic Noise Model, the Myopic Noise Model, and the Irrational Noise Model, with the latter being equivalent to label flipping. The experiment results are provided in Table 4. We believe these diverse models collectively provide a robust framework for capturing the complexities and uncertainties inherent in human preferences.
[1] Robust preference-based reinforcement learning with noisy preferences. ICML 2024.
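For reference, a simplified sketch of how such noisy annotators can be simulated (our own illustrative reading of the three noise models, not the exact implementation used in the experiments; `r1` and `r2` are per-step reward arrays of two compared segments):

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_label(r1, r2, temp=1.0):
    # Boltzmann-rational annotator: prefers the higher-return segment only
    # with probability given by a softmax over the two returns.
    p1 = 1.0 / (1.0 + np.exp((r2.sum() - r1.sum()) / temp))
    return int(rng.random() < p1)          # 1 = segment 1 preferred

def myopic_label(r1, r2, gamma=0.9):
    # Myopic annotator: later timesteps receive higher weight.
    w = gamma ** np.arange(len(r1))[::-1]
    return int((w * r1).sum() > (w * r2).sum())

def irrational_label(r1, r2, eps=0.2):
    # Irrational annotator: correct comparison, flipped with probability eps
    # (equivalent to label flipping).
    label = int(r1.sum() > r2.sum())
    return 1 - label if rng.random() < eps else label
```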
Q3. The necessity of the Beta distribution ... diverse human preferences.
Response. Thank you for your question. We would like to provide a deeper discussion of the Beta distribution.
- Effectiveness of the Beta distribution. The Beta distribution proves highly effective for modeling human preferences in our framework due to its intuitive representation of confidence and uncertainty. It is the conjugate prior for the Bernoulli distribution over binary preference outcomes. Its two parameters, α and β, directly reflect accumulated "positive" and "negative" votes (e.g., preferred vs. unpreferred trajectory instances), allowing for a flexible update of beliefs as new preference data arrives (Lines 219–220). Other distributions are neither naturally bounded on [0, 1] nor built for voting, which limits their suitability for proportion-based data. The Normal distribution's unbounded support and symmetric shape, the Uniform's fixed variance, and the Gamma's semi-infinite range hinder their flexibility for diverse preference patterns.
- Unimodality of the Beta distribution. Since our study adheres to the standard PbRL setting where preferences for a single sample are binary (either positive or negative), the aggregated win-probabilities remain effectively unimodal and are well fit by a single Beta. For example, they either concentrate near 0 or 1 when consensus is strong, or near 0.5 when feedback is mixed. We acknowledge that truly multi-modal or list-wise preference patterns may necessitate more expressive priors such as mixtures of Beta distributions, which is a promising direction for future work.
Q4. In the introduction Line 54 ... in the main text.
Response. We apologize for the missing explanation. This process is very similar to choosing one of two candidates. Human annotators act as voters, casting preferences for trajectories (candidates) through pairwise comparisons in an election. The Diff-UAPA framework aggregates these "votes" using a MAP objective, incorporating a Beta prior to model uncertainty arising from diverse or conflicting preferences, akin to handling voter disagreements in a probabilistic voting system.
Q5. It appears that UA-PbRL ....
Response. Thanks for your question. In this paper, we compare against UA-PbRL in our in-depth study (i.e., Table 4), evaluating its performance in MuJoCo environments under different types of noise. However, we acknowledge the importance of testing it in the main experimental evaluation; the results are:
| Method | Lift | Can | Square | Transport | p1 (kitchen) | p2 (kitchen) | p3 (kitchen) | p4 (kitchen) |
|---|---|---|---|---|---|---|---|---|
| UA-PbRL | 55.3 ± 2.2 | 57.4 ± 2.3 | 62.3 ± 2.7 | 60.6 ± 5.7 | 100.0 ± 0.0 | 98.7 ± 1.3 | 92.2 ± 3.7 | 66.3 ± 4.2 |
| Diff-UAPA | 56.1 ± 0.9 | 61.3 ± 2.2 | 68.1 ± 0.6 | 64.0 ± 4.0 | 100.0 ± 0.0 | 99.7 ± 0.2 | 95.4 ± 0.6 | 70.9 ± 4.6 |
This demonstrates that Diff-UAPA outperforms UA-PbRL, primarily owing to the powerful modeling capabilities of the diffusion policy.
Q6. While the paper ... which seems contradictory.
Response. Thank you for raising this point. The iterative preference alignment process in our framework is distinct from the offline or online RL setting. As outlined in Lines 126–134, iterative preference alignment refers to how the preference dataset is constructed, labeled by various groups of people, and utilized during training. This process does not specify the data collection mechanism, such as whether the agent interacts with the environment or relies on a pre-collected dataset. Thus, it is fully compatible with the offline training approach described in the paper, resolving any apparent contradiction.
Thank you so much for your thoughtful and detailed responses. They have thoroughly addressed all of my concerns, and based on these responses, I will maintain my positive rating.
Thank you for your acknowledgment and confirmation. We are truly pleased to hear that the explanation successfully addressed your concerns. Your thoughtful and detailed feedback has been invaluable in refining our work, and we greatly appreciate the time and effort you dedicated to reviewing our paper. Thank you!
The paper proposes Diff-UAPA, a direct preference alignment algorithm for diffusion policies, designed to address inherent uncertainties in human preferences across diverse user groups. Diff-UAPA uses a MAP objective, guided by an informative Beta prior for capturing the uncertainty. Experiments on both simulated robotic and real-world tasks demonstrate the robustness of Diff-UAPA for the uncertainty in preference data.
Strengths and Weaknesses
Strengths.
- The proposed method advances beyond the traditional MLE objective in PbRL, which implicitly assumes a uniform prior and consequently lacks sensitivity to the inherent uncertainty in the preference dataset, by using an informative Beta prior and a MAP objective.
- The paper provides comprehensive empirical validation, consistently demonstrating Diff-UAPA's superior performance and robustness across both simulated and real-world robotics tasks, and various human preference configurations including synthesized, realistic, and noisy data.
Weaknesses.
- Training the informative Beta prior leads to additional computational costs.
- It seems that the baseline methods are not iteratively trained to adapt to evolving preference signals in the same manner as Diff-UAPA.
Questions
Q1. Can we come up with the proposed method's online counterpart?
Q2. Could the author provide a discussion on the advantages and disadvantages of the proposed method over simple robust methods such as data filtering and label smoothing?
Q3. Could the author consider other iterative method baselines for a fair comparison with Diff-UAPA?
Limitations
The authors present a limitation section in Appendix.
Final Justification
The author's rebuttal has addressed all my questions and concerns. Considering the rebuttal, along with the soundness and completeness of the methodology and experiments, I keep my positive score.
Formatting Issues
None
Dear Reviewer, we greatly appreciate your constructive comments. We have seriously considered your suggestions, and we hope the following responses can address your concerns:
Q1. Training the informative Beta prior leads to additional computational costs.
Response. Thank you for raising this concern. As discussed in Appendix D.3, we use efficient techniques such as the reparameterization trick to improve scalability. In practice, the computational cost of training the Beta model is similar to training a reward model in traditional PbRL; since our method avoids training a separate reward model, the net overhead relative to conventional PbRL is small. Additionally, the extra computational cost only slightly increases training time by a few minutes, while the subsequent RL phase is much more demanding, often taking several hours.
Q2. It seems that the baseline methods are not iteratively trained to adapt to evolving preference signals in the same manner as Diff-UAPA.
Response. Thank you for your insightful comment, and apologies for the confusion. As described in Line 260, our experiments consist of four rounds of iterative updates, each with a fixed number of episodes. This protocol is applied identically to all methods, including the baselines, so every algorithm is retrained on the newly acquired preference data at each round. We have clarified this procedure in the revised manuscript.
Q3. Can we come up with the proposed method's online counterpart?
Response. Thanks for your question. The proposed Diff-UAPA can be effectively generalized to an online setting due to the inherent flexibility of its core components, namely the Beta prior and MAP objective, which are not limited to offline datasets. The Beta prior's ability to model uncertainty in human preferences makes it particularly suitable for handling dynamically evolving feedback from diverse user groups in real time. To build this online counterpart, a feedback interface can be implemented to collect pairwise preference data (or feedback from a pre-trained critic model) during or after trajectory execution in an interactive environment. This feedback can be stored in a dynamic preference replay buffer, which is incrementally updated as new trajectories are generated. The Beta prior parameters (α, β) are then updated, enabling the diffusion policy to adapt continuously to new preferences while maintaining robustness to uncertainty.
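A hypothetical sketch of such an online pipeline (all names are illustrative, not part of the paper's code; trajectory identifiers are assumed hashable): new pairwise labels enter a bounded replay buffer, and the Beta pseudo-counts are updated incrementally as votes arrive.

```python
from collections import deque

class OnlinePreferenceBuffer:
    def __init__(self, capacity=10000, alpha0=1.0, beta0=1.0):
        self.pairs = deque(maxlen=capacity)   # stores (traj_a, traj_b, label)
        self.alpha, self.beta = {}, {}        # per-trajectory win/loss pseudo-counts
        self.alpha0, self.beta0 = alpha0, beta0

    def add(self, traj_a, traj_b, label):
        """label = 1 if traj_a is preferred over traj_b, else 0."""
        self.pairs.append((traj_a, traj_b, label))
        for t in (traj_a, traj_b):
            self.alpha.setdefault(t, self.alpha0)
            self.beta.setdefault(t, self.beta0)
        winner, loser = (traj_a, traj_b) if label == 1 else (traj_b, traj_a)
        self.alpha[winner] += 1               # incremental conjugate update
        self.beta[loser] += 1
```

The diffusion policy could then be periodically fine-tuned on the buffer with the same MAP objective, using the current (α, β) estimates as the prior.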
Q4. Could the author provide a discussion on the advantages and disadvantages of the proposed method over simple robust methods such as data filtering and label smoothing?
Response. Thanks for your valuable question. We would like to provide a more detailed discussion as follows:
- Advantages: Diff-UAPA offers significant advantages over data filtering and label smoothing. These robust PbRL methods (e.g., label smoothing, data filtering) generally aim to exclude noisy or inconsistent data from the training process. While this may help reduce the impact of outliers, it also risks discarding valuable information if certain data points are mistakenly deemed outliers, resulting in lost opportunities to learn from diverse, potentially useful preferences. Besides, such techniques are usually static, lacking mechanisms to adjust to temporal or contextual changes in preference data. In contrast, Diff-UAPA employs a Beta prior to explicitly model the uncertainty inherent in diverse human preferences. Rather than eliminating uncertain data points, our framework assigns them lower confidence weights, allowing these samples to contribute conservatively to policy learning (a toy sketch contrasting the two strategies appears below). This ensures that outliers or inconsistent preferences are not outright discarded but are instead integrated in a manner that minimizes their negative impact while preserving the richness of the preference dataset.
- Disadvantage: Data filtering and label smoothing are computationally lightweight, as they involve simple rule-based preprocessing of the dataset. Training the Beta prior model in Diff-UAPA requires variational inference, which adds computational overhead compared to simpler methods. However, a small additional time cost (a few minutes) compared to the more demanding reinforcement learning phase is acceptable. (See more details in Appendix E.)
We have included such a discussion in our revised paper to enhance clarity.
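A toy sketch of the contrast described above (both tensors are hypothetical placeholders): hard filtering discards low-confidence pairs outright, whereas soft confidence weighting keeps every pair but shrinks the contribution of uncertain ones.

```python
import torch

per_pair_loss = torch.rand(8)   # preference loss for each labeled pair (placeholder)
confidence = torch.rand(8)      # e.g., derived from the Beta posterior (placeholder)

# Hard filtering: drop pairs below a confidence threshold (risks discarding signal).
mask = confidence > 0.5
filtered_loss = per_pair_loss[mask].mean()

# Soft weighting: keep every pair, but let uncertain ones contribute less.
weighted_loss = (confidence * per_pair_loss).sum() / confidence.sum()
```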
Q5. Could the author consider other iterative method baselines for a fair comparison with Diff-UAPA?
Response. Thank you for your question, and we apologize for any confusion. While none of our baseline methods include mechanisms to handle noisy preference labels, they all undergo four rounds of iterative updates during training, following the same procedure as Diff-UAPA (see Line 260). Therefore, these baseline methods are also iterative in nature.
To better address your concern, we additionally included an iterative preference alignment method, IPO [1]. The results are as follows:
| Method | Lift | Can | Square | Transport | p1(kitchen) | p2(kitchen) | p3(kitchen) | p4(kitchen) |
|---|---|---|---|---|---|---|---|---|
| IPO | 55.5 ± 1.4 | 54.0 ± 2.2 | 59.2 ± 5.3 | 56.7 ± 4.3 | 100.0 ± 0.0 | 97.4 ± 1.4 | 90.8 ± 2.8 | 61.5 ± 1.1 |
| Diff-UAPA | 56.1 ± 0.9 | 61.3 ± 2.2 | 68.1 ± 0.6 | 64.0 ± 4.0 | 100.0 ± 0.0 | 99.7 ± 0.2 | 95.4 ± 0.6 | 70.9 ± 4.6 |
We observe that Diff-UAPA consistently outperforms IPO, primarily due to its uncertainty-aware capabilities.
[1] IPO: Iterative Preference Optimization for Text-to-Video Generation. arXiv, 2025.
Thank you for the detailed responses. They have addressed all my questions and concerns. Considering the rebuttal, along with the soundness and completeness of the methodology and experiments, I will maintain my positive score.
We sincerely appreciate your acknowledgment and confirmation. We are delighted to learn that our response has fully addressed your concerns. Your thoughtful and detailed feedback has been instrumental in refining our work, and we are deeply grateful for the time and effort you invested in reviewing our paper. Thank you once again for your invaluable contribution.
The paper proposes Diff-UAPA, an uncertainty-aware framework for aligning diffusion policies with human preferences using a Beta prior to model preference inconsistencies. The approach integrates a maximum posterior objective with iterative preference alignment, demonstrating robust performance across simulated and real-world robotics tasks. Extensive experiments validate its effectiveness against baselines, including ablation studies and noise sensitivity tests. However, a critical question arises regarding the originality of the core contributions, as key components appear to build heavily on prior works.
Strengths and Weaknesses
Strengths
- The paper is well-written and clearly organized, with a logical flow from problem statement to experimental validation.
- Important Motivation: The focus on handling uncertainty in preference alignment is timely and relevant, as human preferences from diverse groups inherently contain inconsistencies. The work addresses a critical gap in existing PbRL methods that overlook such uncertainties.
- The empirical evaluation is robust, covering multiple robotic tasks, real human preference datasets, and various noise conditions. The comparison with state-of-the-art baselines (e.g., BET, Diff-CPL, FKPD) and ablation studies (e.g., Beta prior ablation) effectively demonstrate the method's superiority and validate the contribution of each component.
Weaknesses:
- The paper's novelty is unclear, as the two key components seem to rely heavily on prior works: the diffusion policy alignment framework in Section 4.1 builds directly on Wallace et al. (2024), which introduces Direct Preference Optimization (DPO) for text-to-image diffusion models, and the Beta prior model in Section 4.2 is inspired by Xu et al. (2025), which uses a Beta distribution for uncertainty-aware preference alignment in offline settings. While the combination of these components is novel, the paper fails to sufficiently demonstrate its unique innovations tailored to the problem at hand, weakening its claim to breakthrough innovation.
- The theoretical contribution is primarily a reapplication of Beta priors (a well-known conjugate prior for Bernoulli distributions) to the diffusion policy context, without substantial new theoretical insights. The proof in Proposition 4.1, while correct, formalizes a standard Bayesian modeling approach rather than introducing a novel framework.
Questions
See weakness.
Limitations
Yes.
Final Justification
The authors have adequately addressed my concerns, and thus I maintain my positive evaluation score.
Formatting Issues
No.
Dear Reviewer, we sincerely value your time and effort in evaluating our work. We have prepared comprehensive responses and clarifications to address each point you raised. We hope these responses can resolve your concerns.
Q1. The paper’s novelty is unclear, as the two key components seem to rely heavily on prior works: The diffusion policy alignment framework in Section 4.1 builds directly on Wallace et al. (2024), which introduces Direct Preference Optimization (DPO) for text-to-image diffusion models. The Beta prior model in Section 4.2 is inspired by Xu et al. (2025), which uses a Beta distribution for uncertainty-aware preference alignment in offline settings. While the combination of these components is novel, the paper fails to sufficiently demonstrate its unique innovations tailored to the problem at hand, weakening its claim to breakthrough innovation.
Response. Thank you for raising this concern. We would like to provide a more detailed discussion highlighting the differences between this paper and the two prior works as follows.
- Difference with [1]. While we adopt some techniques from [1], such as Jensen's inequality and the convexity of the negative log-sigmoid when deriving Eq. 12 from Eq. 11 (we cite [1] accordingly in the derivation), the problem setting, preference model, and derivation procedure are different. Additionally, we would like to highlight that the alignment of the diffusion policy via MLE is not our primary contribution. Instead, we primarily follow the approach developed for LLMs in [1] and adapt it to the RL setting in our work. Specifically,
  - Problem setting. [1] is formulated in the context of LLM alignment, where rewards are assigned exclusively at the final step, and preferences are based solely on the final output of the LLM. This approach optimizes the LLM's ultimate output, without considering intermediate steps or the overall process leading to the final result (see Eq. 14 in [1]). In contrast, Eq. 12 in this paper extends the framework to a trajectory-wise setting within the RL field. The key distinction is that in RL, we incorporate intermediate rewards, adjusted by a discount factor, accounting for the cumulative discounted rewards throughout the trajectory. As a result, this paper formulates preference alignment based on the entire trajectory rather than focusing solely on the final state-action pair.
  - Preference model and derivation procedure. [1] employs a reward-based preference model, which relies on the reward of the LLM's final output (see Eq. 3 in [1]) while regularizing the KL-divergence with respect to a reference policy. In contrast, this work adopts a regret (advantage)-based preference model (see Eq. 5) within the maximum entropy RL framework. Although both preference models ultimately lead to similar objectives for aligning the diffusion policy, the derivation procedures are somewhat different. Specifically, the derivation in this work is based on trajectory-wise alignment, whereas [1] only addresses step-wise alignment. To accomplish this using the regret (advantage)-based preference model, we introduce the chain advantage function (i.e., Eq. 8), which is defined based on the chain reward function presented in [1]. We then calculate the expected value of the chain advantage function with respect to the diffusion latent variables, which serves as a key index for evaluating the trajectory.
- Difference with [2]. While both works utilize the Beta prior with a MAP objective to model uncertainty during the alignment process, this work differs significantly regarding the problem formulation, underlying motivation, and approach to incorporating the Beta prior. Specifically,
  - Problem formulation and motivation. In [2], the authors introduced an uncertainty-aware PbRL framework that specifically addresses epistemic uncertainty arising from an offline preference dataset with imbalanced comparison frequencies across different trajectories. In their setting, less frequently compared trajectories introduce greater epistemic uncertainty in reward prediction, due to the insufficient data available for these samples. Conversely, our work targets the aleatoric uncertainty associated with human preferences in an iteratively updated dataset as defined in Definition 3.1. In this setting, different groups of human annotators may provide inconsistent or even conflicting preferences for the same pair of trajectories, causing aleatoric uncertainty that represents inherent stochasticity. In other words, by interpreting the parameters α and β of the Beta distribution as counts of "vote" and "unvote" human feedback, [2] uses a Beta prior to model differences in the absolute magnitudes of α and β across trajectories (e.g., Beta(10,2) vs. Beta(100,20), where the former exhibits greater uncertainty due to fewer counts), whereas this work employs the Beta prior to model the relative strength of α versus β for different trajectories (e.g., Beta(6,6) vs. Beta(10,2), where the former exhibits greater uncertainty due to inconsistency of votes); a short numerical check of these examples appears after the references below. (Part 1 of Section C)
  - Approach. [2] adopts a two-step procedure, proposing a MAP objective for learning a distributional reward model. To achieve this, [2] introduces an iterative update rule that refines the reward model using the α and β parameters output by the learned Beta prior model, until it converges to the maximum of its MAP objective. This reward model is then used for policy learning. In contrast, this work derives a unified MAP objective for directly aligning the diffusion policy in a single-step process, without the need for reward learning. We update our diffusion policy using a unified loss function (Eq. 17). By maximizing the likelihood of the diffusion policy's output under the learned Beta distribution, the process guides the policy to align the estimated win-probabilities with their learned prior distribution. This approach is more straightforward and efficient.
[1] Diffusion model alignment using direct preference optimization. CVPR, 2024.
[2] A distributional approach to uncertainty-aware preference alignment using offline demonstrations. ICLR, 2025.
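As noted above, a short numerical check of these examples (illustrative):

```python
from scipy.stats import beta

for a, b in [(10, 2), (100, 20), (6, 6)]:
    d = beta(a, b)
    print(f"Beta({a},{b}): mean={d.mean():.3f}, var={d.var():.4f}")

# Beta(10,2) and Beta(100,20) share the same mean (~0.833), but the latter is far
# more concentrated: epistemic uncertainty shrinks as the vote counts grow.
# Beta(6,6) has a larger variance than Beta(10,2): aleatoric uncertainty from
# inconsistent votes, which is the case Diff-UAPA targets.
```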
Q2. The theoretical contribution is primarily a reapplication of Beta priors (a well-known conjugate prior for Bernoulli distributions) to the diffusion policy context, without substantial new theoretical insights. The proof in Proposition 4.1, while correct, formalizes a standard Bayesian modeling approach rather than introducing a novel framework.
Response. Thank you for your insightful feedback. While both approaches indeed leverage the Beta prior within a Bayesian framework, there are important distinctions in how the policies and the Beta prior are updated.
- Iterative Alignment for Diffusion. Diff-UAPA introduces a novel iterative preference alignment framework tailored specifically for diffusion policies (Definition 3.1). Unlike traditional static offline preference alignment methods, our framework incrementally adapts the diffusion policy to preferences from diverse user groups, leveraging the sequential nature of diffusion models. This iterative process, supported by the Beta prior, enables dynamic updates that account for evolving preferences over time (Lines 284–287). In contrast, the approach in [1] operates in a static offline setting, lacking mechanisms for iterative adaptation to new preference data.
- Dynamic Adaptation of Beta Prior. Our proposed variant, Diff-UAPA-I, further distinguishes itself by enabling incremental training of the Beta prior, allowing it to adapt dynamically to changing preferences and environmental conditions. This dynamic adaptation is theoretically grounded in our iterative alignment framework, which ensures robust handling of evolving uncertainties in preference data. In contrast, the Beta prior in [1] is learned solely from offline demonstrations and does not incorporate mechanisms to adapt to new or evolving preferences, limiting its flexibility in dynamic settings.
[1] A distributional approach to uncertainty-aware preference alignment using offline demonstrations. ICLR, 2025.
Thank you for your detailed response. I have no further questions.
Thank you for your recognition and confirmation! Your thoughtful and detailed review has been invaluable in helping us refine and improve our work. We deeply appreciate the time, effort, and expertise you brought to reviewing our paper. Thank you once again!
This paper studies aligning diffusion policies with human preference data. The goal is to tackle the uncertainty in the human preference data. The key idea is to learn a Beta distribution to capture the uncertainty and then incorporate it into policy learning. All reviewers have acknowledged the contributions of this paper, and the rebuttal addressed all concerns raised in the first round. Overall, the paper offers a unique perspective and makes contributions to advancing diffusion policy alignment. I recommend acceptance.