PaperHub
7.0/10
Spotlight · 3 reviewers
Ratings: 4, 4, 3 (min 3, max 4, std 0.5)
ICML 2025

Learning Parametric Distributions from Samples and Preferences

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24
TL;DR

Preference feedback can significantly enhance parameter estimation in continuous parametric distributions, achieving faster convergence rates and lower variance compared to sample-only methods.

Abstract

Recent advances in language modeling have underscored the role of preference feedback in enhancing model performance. This paper investigates the conditions under which preference feedback improves parameter estimation in classes of continuous parametric distributions. In our framework, the learner observes pairs of samples from an unknown distribution along with their relative preferences depending on the same unknown parameter. We show that preference-based M-estimators achieve a better asymptotic variance than sample-only M-estimators, further improved by deterministic preferences. Leveraging the hard constraints revealed by deterministic preferences, we propose an estimator achieving an estimation error scaling of $\mathcal{O}(1/n)$---a significant improvement over the $\Theta(1/\sqrt{n})$ rate attainable with samples alone. Next, we establish a lower bound that matches this accelerated rate, up to problem-dependent constants. While the assumptions underpinning our analysis are restrictive, they are satisfied by notable cases such as Gaussian or Laplace distributions for preferences based on the log-probability reward.
Keywords
Statistical learning · Continuous parametric distributions · Preference feedback · Estimation error rate

Reviews and Discussion

Official Review
Rating: 4

This paper studies the conditions under which preference feedback improves parameter estimation. The authors show that preference-based estimators can achieve a better asymptotic variance than sample-only estimators. When incorporating the hard constraints induced by deterministic preferences, the authors prove an estimation error of $\mathcal{O}(1/n)$, improving upon the traditional rate of $\mathcal{O}(1/\sqrt{n})$, under some restrictive assumptions. They also develop a matching lower bound.

Questions for Authors

1.

Do the results in Section 4 rely on the specific reward model $r_\theta$? Does it have to be the log-likelihood $\log p_\theta$? If not, can the authors provide some examples satisfying all the assumptions while supporting general reward functions?

2.

Can the authors provide more high-level intuition behind the accelerated rate of $\mathcal{O}(1/n)$? For example, when the reward model is just the log-likelihood, the deterministic preference label only provides additional information on the ordering of $\log p(x)$ and $\log p(y)$. Why is this sufficient to improve the estimation error?

3.

Is it possible to consider the misspecification case, i.e., $\theta_* \notin \Theta$? In this case, $\hat{\theta}$ should converge to an optimal estimator in $\Theta$. Would similar acceleration effects hold?

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes

Theoretical Claims

No

Experimental Design and Analysis

Yes, in Section 6

Supplementary Material

Yes, the main framework of the proofs

Relation to Broader Scientific Literature

It is broadly related to the machine learning community.

Missing Essential References

No

Other Strengths and Weaknesses

Strengths

The paper studies a very interesting problem, i.e., how preference labels improve statistical estimation. The theoretical results are solid and overall surprising, providing important insights into the benefits of preferences.

Weaknesses

As mentioned by authors, the assumptions are quite restrictive and are only verified under simple setup. This is not a big issue, though.

Other Comments or Suggestions

See the Strengths and Weaknesses section above.

Author Response

We thank Reviewer yquN for the time spent and the positive feedback. We address the reviewer’s questions below.

1. Reward models. Except for Theorem 4.3, all the derivations in Section 4 hold for general (hence reward-based) preference models provided Assumptions 4.2, 4.4, 4.5, and 4.7 hold. Characterizing the expressivity of parametric rewards satisfying those assumptions is interesting, yet challenging. We provide two positive examples and one negative example (a small numerical illustration follows the list).

  • Positive: monotonic reward. Suppose that $\tilde\ell_{\theta}(x,y) = f(p_{\theta}(x)) - f(p_{\theta}(y))$ where $f$ is increasing on $[0,1]$. Since $\mathrm{sign}(\tilde\ell_{\theta}) = \mathrm{sign}(\ell_{\theta})$, the parameters with zero classification loss, and hence our estimators, are the same. Therefore, our results hold for this class of rewards when our assumptions hold for the log-likelihood reward. When $f$ is decreasing, the preferences are “reversed”, and similar arguments can be made. This example includes (1) normalization by a multiplicative constant (e.g., temperature $\beta$) and (2) the odds-ratio reward-based preference based on $f(x) = \log(x/(1-x))$ and defined by Hong et al. (2024, ORPO: Monolithic Preference Optimization without Reference Model).
  • Positive: margin with Gaussian. Suppose that $\tilde\ell_{\theta} = \ell_{\theta} + c$ where $c$ is a constant and $\ell_{\theta}$ is the Gaussian log-likelihood preference. By extending our computations from Appendix E, Assumptions 4.2, 4.4, 4.5, and 4.7 hold with $c$-dependent positive constants. Margins are used by Meng et al. (2024, SimPO: Simple Preference Optimization with a Reference-Free Reward) and IPO from Azar et al. (2023, A General Theoretical Paradigm to Understand Learning from Human Preferences).
  • Negative: reference model with Gaussian. Suppose that $\tilde\ell_{\theta} = \ell_{\theta} - \ell_{\theta_0}$ where $\theta_0$ is known and $\ell_{\theta}$ is the Gaussian log-likelihood preference. Since $\tilde\ell_{\theta}(x,y) = \langle x-y, \theta - \theta_0 \rangle$ and $\nabla_{\theta} \tilde\ell_{\theta}(x,y) = x-y$, Assumption 4.5 is violated for $u = \theta^\star - \theta_0$. Not all direct alignment algorithms rely on a reference model (see SimPO or ORPO).
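The following is a minimal numerical sketch (our own illustration, not from the paper) of the sign-invariance argument for the monotonic-reward example above, assuming a one-dimensional Gaussian with the log-likelihood preference; the odds-ratio and square-root maps are illustrative choices of increasing $f$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
theta_star = 1.3
x, y = rng.normal(theta_star, 1.0, size=(2, 10_000))

# Log-likelihood preference and two increasing transforms f of the density.
px, py = norm.pdf(x, loc=theta_star), norm.pdf(y, loc=theta_star)
log_pref = np.log(px) - np.log(py)                          # ell_theta(x, y)
odds_pref = np.log(px / (1 - px)) - np.log(py / (1 - py))   # ORPO-style f(t) = log(t / (1 - t))
sqrt_pref = np.sqrt(px) - np.sqrt(py)                       # another increasing f

# A monotone f preserves the sign of the preference, hence the same hard constraints.
print(np.all(np.sign(odds_pref) == np.sign(log_pref)))      # True
print(np.all(np.sign(sqrt_pref) == np.sign(log_pref)))      # True
```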

2. Accelerated rate. Accelerated rates arise when aggregating random variables that have a positive density at a specific point through a minimum (or maximum) operator (a short simulation of the classical uniform-location example is sketched after the list below).

  • When estimating the location parameter $\theta$ of a uniform distribution over $[\theta,\theta+1]$, the optimal estimator achieving the accelerated rate is the minimum of uniform observations, whose density is positive at $\theta$.
  • For deterministic preferences with log-likelihood rewards, we observe the true ordering between likelihoods. This enforces a hard constraint on the admissible parameters, which can be expressed with a minimum operator. More precisely, the maximal deviation $R_{n,u}$ along direction $u$ is upper bounded by the minimum of positive random variables whose density is positive at zero under Assumption 4.7. With high probability, this min operator is upper bounded by $O(1/n)$. The proof combines Lemma 4.6 and an upper bound on the inverse of the cdf based on a Taylor expansion around $0$.
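As a sanity check of this mechanism, here is a minimal simulation sketch of the classical uniform-location example (not the paper's DP MLE): the minimum-based estimator concentrates at rate $1/n$, while a moment-based estimator concentrates at rate $1/\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 2.0
for n in (100, 1_000, 10_000):
    reps = 2_000
    samples = rng.uniform(theta, theta + 1.0, size=(reps, n))
    err_min = np.abs(samples.min(axis=1) - theta)          # hard-constraint-style estimator
    err_mom = np.abs(samples.mean(axis=1) - 0.5 - theta)   # moment estimator (mean minus 1/2)
    print(n, err_min.mean(), err_mom.mean())               # first error shrinks ~1/n, second ~1/sqrt(n)
```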

3. Misspecification. Under misspecification, the deterministic preferences might not provide separability within $\Theta$ since $\theta^\star \notin \Theta$. Then, DP MLE should be defined like SP MLE by using the 0-1 loss $\mathbb{1}(u < 0)$ instead of the logistic loss $-\log \sigma(u)$. This combines a cross-entropy loss and a 0-1 classification loss, reweighted by a regularization parameter $\lambda > 0$. This objective is reminiscent of single-stage alignment procedures such as ORPO and ASFT, see Gorbatovski et al. (2025, The Differences Between Direct Alignment Algorithms are a Blur). Without separability, computing DP MLE can be NP-hard. Under sufficient regularity, DP MLE converges to $\theta_0 \in \arg\min_{\theta \in \Theta} \mathrm{KL}(\theta^\star,\theta) + \lambda m(\theta)$ with $m$ as in line 270, where $\theta_0 \ne \theta^\star$. This minimization is challenging, as $\theta \mapsto m(\theta)$ might not be convex. Deriving a tractable ELBO method for this optimization is an interesting direction to obtain tractable and robust estimators. As $\theta_0$ lies on the boundary of $\Theta$, we should control the maximal deviation with respect to $\theta_0$ for directions that point towards the interior of $\Theta$ to prove an accelerated rate. While some elements of our analysis might be salvaged, we believe that finer technical arguments are needed to capture this interesting setting.
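For concreteness, here is a minimal sketch (our own, under stated assumptions) of the composite objective described above for a one-dimensional unit-variance Gaussian location model: a sample cross-entropy term plus $\lambda$ times the 0-1 preference loss. The function names are illustrative, and a grid search stands in for the (possibly NP-hard) minimization.

```python
import numpy as np

def neg_log_lik(theta, x, y):
    # Sample-only cross-entropy term for a unit-variance Gaussian location model (up to constants).
    data = np.concatenate([x, y])
    return 0.5 * np.sum((data - theta) ** 2)

def pref_01_loss(theta, x, y, z):
    # ell_theta(x, y) = log p_theta(x) - log p_theta(y) for the Gaussian location model.
    ell = 0.5 * ((y - theta) ** 2 - (x - theta) ** 2)
    return np.sum(z * ell < 0)                     # hard 0-1 classification loss

def dp_quasi_mle(x, y, z, lam=10.0):
    # The 0-1 term is piecewise constant, so a grid search replaces gradient-based optimization.
    grid = np.linspace(-5.0, 5.0, 20_001)
    objective = [neg_log_lik(t, x, y) + lam * pref_01_loss(t, x, y, z) for t in grid]
    return grid[int(np.argmin(objective))]
```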

Restrictive assumptions. See the answer to Reviewer vgrP for a detailed discussion on how to weaken them to local conditions.

Reviewer Comment

Thanks for the authors' feedback. I keep my recommendation for acceptance.

Official Review
Rating: 4

This paper studies when adding preference feedback can boost parameter estimation in the cases of Gaussian and Laplace distributions. The results are mainly theoretical, containing three parts: (1) For M-estimators, adding an additional "preference" term related to the logarithm of the probability helps reduce the asymptotic covariance; (2) For estimators based on hard preference constraints, the error converges at a rate of $\mathcal{O}(1/n)$ with high probability; (3) This rate is minimax optimal up to dimension and problem-dependent constants, using Assouad's lemma.

Questions for Authors

None.

Claims and Evidence

Most of the results are theoretical and supported by proofs. The assumptions are satisfied by the Gaussian or the Laplace distributions.

Methods and Evaluation Criteria

The estimators in this paper are mostly theoretical. SO and SP are typical M-estimators. AE and DP require (1) solving a feasibility problem, which can be NP-hard in general, and (2) the hard-preference assumption (analogous to the feasibility condition). I do not think this would be the case in practice.

Theoretical Claims

I checked the proof sketch. It makes sense to me.

Experimental Design and Analysis

I checked the experiment results. As I mentioned before, this paper is theoretical. Yet there are still some minor issues.

  1. It seems a little weird that any randomized estimator in $\mathcal{C}_n$ (RU) outperforms the one that maximizes the log-likelihood in $\mathcal{C}_n$ (DP) in Figure 1(a).
  2. The legend doesn't match the figure's line style.

Supplementary Material

No.

Relation to Broader Scientific Literature

This paper provides a new perspective on analyzing the role of preferences. This paper's setting is not the same as human preference alignment (e.g., RLHF): the paper studies an estimator "plus" some preference data as an additional source of information, while RLHF or DPO tries to learn something from only the preference data, implying that the paper may be of limited value to the LLM literature. However, this paper still provides some interesting observations, which might be of interest to the statistics community.

Missing Essential References

No (as far as I know).

Other Strengths and Weaknesses

None.

Other Comments or Suggestions

None.

Author Response

We thank Reviewer Gth4 for the time spent and the encouraging feedback. We address the reviewer’s questions below.

Iterative human preference alignment. We investigate the case where pairs of observations and their preferences are tied together, which includes the log-likelihood ratio as preference. We detail the connections with iterative human preference alignment.

  • Many human preference alignment methods build on the Bradley-Terry (BT) model for preferences, based on rewards. Direct alignment algorithms use variants of the log-likelihood to define the implicit reward of a policy. Choosing $\ell_{\theta}(x,y) = \log p_{\theta}(x) - \log p_{\theta}(y)$ coincides with the optimal policy for maximum-entropy RL (see, e.g., Swamy et al., 2025, All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning).
  • For offline preference data, the assumption $(X,Y) \sim p_{\theta^\star}^{\otimes 2}$ is unrealistic as $\ell_{\theta^\star}$ is collected from a fixed data set of pairs of observations. Recent LLMs are built on iterative alignment procedures. At stage $N$, the model $p_{\theta_N}$ is trained based on the preference data for generations by the previous model, i.e., $(X,Y) \sim p_{\theta_{N-1}}^{\otimes 2}$. Under the realizability assumption and without mode collapse, this self-refinement paradigm converges towards the true model $p_{\theta^\star}$. Our setting characterizes the limiting behavior of this iterative process, i.e., preferences based on $\ell_{\theta^\star}$ for observations from $p_{\theta^\star}$.

1. RU versus DP. Figure 1(a) provides evidence suggesting that the randomized estimator (RU) and the worst-case estimator (WE) perform on par with DP MLE: RU is slightly better than DP, itself slightly better than WE. Figures 1(b) and 2 highlight that DP outperforms WE and AE for larger dimensions, where the gap increases when $d$ is nonnegligible compared to $n$. Therefore, only DP obtains the best-of-both-worlds estimation error rate. For large $d$, implementing RU is challenging. We conjecture it suffers from the same limitation as AE. This is supported by additional experiments on new estimators using the setting of Section 6, see the anonymous plots at https://anonymous.4open.science/r/ICML25SuppExp . A one-dimensional sketch of $\mathcal{C}_n$ and the estimators discussed below is given after the following list.

  • Figure 1(a) extended. The center estimator (CE) returns the center of the interval $\mathcal{C}_n$. The truncated Gaussian estimator (TrG) returns a realization from a Gaussian distribution with mean CE and variance $4/n$, truncated to $\mathcal{C}_n$. TrG performs on par with RU. CE outperforms both TrG and RU. This suggests that being far away from the boundary of $\mathcal{C}_n$ improves performance compared to DP, which lies on the boundary of $\mathcal{C}_n$ as observed empirically. Moreover, randomization on $\mathcal{C}_n$ worsens performance compared to CE. Using the derivation in lines 55-63, it is coherent that CE improves on DP by a multiplicative constant: the average of those two (non-independent) random variables decreases faster. This can be proved by refining the proof of Lemma 4.6 to account for the fact that $n = N_{\theta^\star,-1} + N_{\theta^\star,1}$ (defined in Line 658).
  • Figures 1(b) and 2 extended. For $d>1$, multiple centers exist and we use the Chebyshev center estimator (CCE) of $\mathcal{C}_n$. While CCE outperforms AE by a constant margin, CCE only outperforms DP in the regime of large $n$ compared to $d$. It performs worse than SO for small $n$. Geometrically, for small $n$ and large $d$, the random polytope $\mathcal{C}_n$ is more likely to be “spiky” along some directions. Due to those distant vertices, the center becomes a worse estimator than DP, since the “average” is intuitively less robust to outliers. In contrast, DP dominates SO statistically (Lemma 4.1), hence it achieves the rate $O(\sqrt{d/n})$ when $n$ is small compared to $d$.
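To make the geometry of $\mathcal{C}_n$ concrete, here is a minimal one-dimensional sketch, assuming a unit-variance Gaussian location model with log-likelihood preferences; the estimator names mirror the discussion above, but the implementation details (and the clipped-mean shortcut for DP) are our own simplifications.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star, n = 0.7, 1_000
x, y = rng.normal(theta_star, 1.0, size=(2, n))

# Deterministic preference: sign of log p(x) - log p(y) = (x - y) * (theta_star - (x + y) / 2).
z = np.sign((x - y) * (theta_star - (x + y) / 2))

# Each preference is a half-line constraint on theta through the midpoint (x + y) / 2.
mid = (x + y) / 2
lower = mid[z * (x - y) > 0].max()   # constraints of the form theta >= midpoint
upper = mid[z * (x - y) < 0].min()   # constraints of the form theta <= midpoint

# DP clips the sample mean to C_n = [lower, upper]; CE returns the center of C_n.
dp = np.clip(np.concatenate([x, y]).mean(), lower, upper)
ce = (lower + upper) / 2
print(upper - lower, abs(dp - theta_star), abs(ce - theta_star))   # |C_n| shrinks like 1/n
```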

2. Line style. The dashed line is shorter than the solid line, see SP (sto) and SO, yet the others are not distinguishable. We will correct this.

Official Review
Rating: 3

The paper provides a set of estimators and conditions to improve the estimation error in learning the parameters of continuous parametric distributions when additional preference feedback is available. More concretely, the question is the following: For a continuous parametric distribution $p_\theta$ with i.i.d. samples $\{(x_i, y_i)\}_i$ and a known reward function $r_\theta$, how/when does including noisy/deterministic preferences $z_i \propto r_\theta(x_i) - r_\theta(y_i)$ improve the estimation error of the parameter $\theta$?

To answer the above question, the authors first leverage the asymptotic theory of M-estimators, showing that a maximum-likelihood estimator (MLE) that takes preference data into account has the same standard error rate as sample-only estimators, i.e., $\Theta(1/\sqrt{n})$, while achieving a potentially improved asymptotic variance for noisy preferences and a further improved variance for deterministic preferences.

For deterministic preferences, they take a further step and provide another estimator: an MLE with the hard constraints given by preferences. They then make several assumptions on $p_\theta$ and $r_\theta$ to show that this new estimator can achieve an accelerated error rate of $\mathcal{O}(1/n)$ compared to the standard $\Theta(1/\sqrt{n})$. In particular, they show that Normal and Laplace distributions with log-probability reward functions satisfy these assumptions.

Finally, they prove that the rate of $\mathcal{O}(1/n)$ is minimax optimal up to problem-specific dimensions and logarithmic factors. Toy experiments on a multivariate Normal distribution are provided to support the theoretical findings.
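For readers less familiar with this setup, here is a minimal data-generation sketch (our own illustration), assuming a one-dimensional unit-variance Gaussian with the log-probability reward; the Bradley-Terry logistic model is a standard choice for the noisy-preference case.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star, n = 0.5, 500
x, y = rng.normal(theta_star, 1.0, size=(2, n))

# Reward gap r_theta(x) - r_theta(y) with the log-probability reward of a unit-variance Gaussian.
gap = 0.5 * ((y - theta_star) ** 2 - (x - theta_star) ** 2)

z_deterministic = np.sign(gap)                                        # deterministic preferences
z_stochastic = 2 * rng.binomial(1, 1.0 / (1.0 + np.exp(-gap))) - 1    # Bradley-Terry (logistic) preferences
```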

Questions for Authors

Despite the limitations I mentioned, I still think the paper has the potential to be accepted at the conference. The deciding factor for me is the authors' response to the limitations I highlighted under "Significance of the Results for the Community" in the weaknesses section. Could the authors elaborate on the implications of model misspecification in DP-MLE and also explain how one can verify assumptions 4.4, 4.5, and 4.7 for realistic models? I don't necessarily expect proof that DP-MLE is robust to misspecification, but I would expect at minimum an acknowledgment of this as a major limitation of the work.

Minor Questions:

  1. In Figure 1.a, how does the RU method outperform DP? This seems somewhat counter-intuitive. Could you elaborate on this observation?

  2. In Section 6, what is the goal and implication of including the paragraph about covariance gap starting in line 434? It seems disconnected from the other points in the experiment section. Could you provide more context for its relevance?

  3. How do you envision applying the DP-MLE method for preference learning in realistic language model training? Can you provide at least an outline of when/how this approach might be feasible in practice?

Claims and Evidence

The paper is a theory paper, where all its theoretical claims have been rigorously proved under the stated assumptions. The authors do not overstate their contributions and clearly acknowledge the limitations of the work (e.g., the restrictiveness of the assumptions). Moreover, the toy experiments in Section 6 are consistent with and provide empirical support for the theoretical claims established in the previous sections.

Methods and Evaluation Criteria

The paper primarily contributes to the theoretical aspects of preference learning. The main methodology described is the deterministic preferences MLE (DP-MLE) presented in Section 4, which uses a 0-1 loss to constrain the set of feasible parameters based on the implicit assumption that reward models are well-specified. This approach makes sense if there are good reasons to believe the reward model is indeed well-specified. However, I do have some concerns about the assumptions used in the analysis, which I will outline in the following sections.

Theoretical Claims

The paper has several theoretical claims:

  1. Lemma 3.1 (checked) and Lemma 3.2 on the asymptotic variance of the preference-based M-estimators.
  2. Lemma 4.1 (checked) on the benefits of using their proposed constraint-based estimator compared to M-estimators for Normal distributions.
  3. Theorem 4.8 (checked; including Lemma 4.6, also checked) and its corollary Theorem 4.3 (checked) on proving the accelerated rate of $\mathcal{O}(1/n)$ for their proposed estimator.
  4. Theorem 5.3 (including Lemma 5.1) on the estimation lower bound for the deterministic feedback case.
  5. Proving that the Normal and Laplace distributions satisfy Assumptions 4.2, 4.4, 4.5, 4.7, and 5.2, which are necessary for the theoretical claims in the paper (Appendices E, F).

I have only checked the correctness of the results marked "(checked)" above (originally highlighted in green) and did not find any issues.

Experimental Design and Analysis

As far as I can tell, the experiment section presents a toy multivariate Gaussian setting with the sole purpose of supporting the theoretical findings. The code is provided, but I did not verify it directly. However, the experimental results appear sound and provide appropriate empirical support for the theoretical claims.

Supplementary Material

I have reviewed the proofs provided in Appendix B (except for section B.2) and Appendix C. I did not review the results in Appendices D, E, and F (corresponding to the theoretical results I have not proof-checked).

Relation to Broader Scientific Literature

The contributions are directly related to the empirical success of preference-based fine-tuning of large language models through methods like RLHF, compared to methods that only rely on positive examples such as supervised fine-tuning. In this context, the work attempts to develop a deeper theoretical understanding of when such preferences can help improve learning, using a simplified parametric setting.

The ideas presented in Section 3, regarding the effect of adding preferences to standard M-estimators, are primarily based on the well-established asymptotic normality theory of parametric MLE. The authors apply similar techniques and tools to calculate the Fisher information matrix in the preference-based setup and investigate the conditions under which it can be strictly more informative than the standard M-estimator.

The results in Section 4, however, appear more novel and rely on the hard constraints imposed by deterministic preferences. The authors have appropriately discussed previous related results that use similar hard constraints to achieve better estimation error: for example, the parameter estimation of a uniform distribution on $[\theta, \theta + 1]$ by taking the minimum of samples (Wainwright, 2019), which has a known minimax rate of $\Theta(1/n)$.

Missing Essential References

As far as I know, the essential references have been discussed.

Other Strengths and Weaknesses

+ Soundness, Novelty, and Technical Contributions

All assumptions are concretely specified, and I find the technical contribution around achieving the accelerated rate of $\mathcal{O}(1/n)$ both interesting and novel. While I have concerns about the restrictiveness of the assumptions, the authors demonstrate that both Normal and Laplace distributions satisfy these assumptions, which can be seen as a meaningful degree of practicality. All results are well-supported by rigorous proofs and tested by the toy experiments.

- Quality of Presentation

The presentation of the work could be significantly improved. The current writing, especially in Sections 1 to 3, reads more like a collection of independent chunks of information without a cohesive story connecting them. I believe the authors have developed several ideas and attempted to articulate them, but in doing so, they relied on implicit contextual understanding that isn't provided in the text. I would suggest approaching the writing from the perspective of a reader encountering the paper with no prior knowledge of the work and providing sufficient context throughout. Additionally, in Section 2, definitions and motivations are sometimes intermingled, making it unclear what constitutes a formal definition versus what serves as intuitive examples. The following concrete instances illustrate these issues, though there are more cases:

  1. I needed to read the entire paper first to understand the paragraph in lines 57-63 about how hard constraints can help achieve better estimation error for a standard normal distribution. The presentation would benefit from more context regarding what is known and not known by the estimator about the estimation task, what the parameter of interest actually represents, and why $S_i$ is defined in this way.

  2. In Section 2, the paragraph about informative preferences (lines 137-150) defines the two sets $\mathcal{G}_0$ and $\mathcal{G}_1$ based on a vague notion of "informativeness" without providing context for why one might be interested in samples with non-zero preference gradients. What does it mean to say "only preferences of samples in $\mathcal{G}_1(\theta^\star)$ can provide information on $\theta^\star$"? Why is $\mathcal{G}_0$ defined if informativeness is only based on $\mathcal{G}_1$?

  3. Also in Section 2, the paragraph on negative examples (lines 152-164) is extremely unclear. This paragraph could be placed anywhere in the paper without affecting the overall narrative. The claims lack concreteness, and no proof or proof sketch is provided.

- Significance of the Results for the Community

The main limitation of this work is perhaps its overly restrictive set of assumptions, which may limit its significance and applicability in the broader community, particularly in preference learning. Reading the first lines of the abstract, it seems that the authors motivate the applicability of their theory based on advances in preference learning for language models. While the authors acknowledge the restrictiveness of their assumptions, they rely on their results showing that Normal and Laplace distributions satisfy these assumptions to claim broader applicability. However, I do not believe that merely demonstrating compliance with these assumptions for Normal/Laplace distributions is sufficient to establish applicability in more complex scenarios like preference learning in language models. My concerns are twofold:

  1. The deterministic method (DP-MLE) with 0-1 loss in Section 4 relies on the well-specification of the reward (preference) model. Although standard asymptotic theory for M-estimators also assumes well-specification of the parametric model class, there seems to be a big difference. In standard MLE, if the model is misspecified, one can employ quasi-MLE to obtain robust estimation (see [1]). However, in the deterministic case, if the reward model is misspecified, the constraint set $\mathcal{C}_n$ may not necessarily converge to a set containing the true parameter $\theta^\star$ as $n \to \infty$. Since DP-MLE is constrained to $\mathcal{C}_n$, I suspect it could yield arbitrary estimates under misspecified models and lack robustness. This is particularly concerning given that model misspecification is almost always a possibility, especially when dealing with human annotators, where reward models are known to be misspecified [2].

  2. Even assuming correctly specified models, the paper provides no recipe to verify whether Assumptions 4.4, 4.5, and 4.7 hold for a given parametric model. These assumptions appear extremely difficult to check for arbitrary parametric models. The authors devote six pages of mathematical derivations just to prove them for the relatively simple cases of Normal and Laplace distributions.

References

[1] White, Halbert. "Maximum likelihood estimation of misspecified models." Econometrica: Journal of the econometric society (1982).

[2] Casper, Stephen, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman et al. "Open problems and fundamental limitations of reinforcement learning from human feedback." arXiv preprint arXiv:2307.15217 (2023)

Other Comments or Suggestions

  1. The definition of 0-1 loss for stochastic preferences in lines 230-231 is unclear. I suggest the authors clarify what this means exactly and explicitly state why minimizing such a loss is NP-hard.

  2. Theorem 4.3 is presented as a corollary of the main Theorem 4.8, yet it appears in an earlier section. This may cause confusion for readers. I suggest the authors first state Theorem 4.8 and then present the corollary specifically for Normal/Laplace distributions to improve logical flow.

  3. The proof of Theorem 4.8 could be easier to grasp if the authors provided some intuition behind the definition of $V_{\theta^\star, u}$ on line 302.

  4. There appears to be a typo in lines 431-432.

Ethics Review Issues

N/A

Author Response

We thank Reviewer vgrP for the time spent and the detailed comments. Due to the limited space, we only address some of the reviewer’s concerns.

Restrictive assumptions. While our research question is inspired by iterative human preference alignment (see the answer to Reviewer Gth4), we do not claim the direct applicability of DP MLE for realistic LLM training. When studying DP MLE only, we conjecture that the “global” Assumptions 4.2 and 4.4 can be weakened to local versions. Using time-uniform concentration results, we can build a sequence of shrinking confidence regions $(R_n)_n$ around SO MLE that contains $\theta^\star$ for all times $n$ with high probability (whp). Then, we modify DP MLE to be constrained on $R_n \cap C_n$, which contains $\theta^\star$ whp. For $n$ large enough, $R_n \cap C_n$ will be included in a local neighborhood of $\theta^\star$ under which the “local” Assumptions 4.2 and 4.4 are satisfied. Given that Assumption 4.4 is based on “ignoring” the remainder term in a first-order Taylor expansion, assuming a local version is a significantly weaker requirement.

1. Misspecification. There are two possible sources of misspecification not taken into account by our current analysis.

  • Misspecified observations. See the answer to Reviewer yquN when $\theta^\star \notin \Theta$. When $p^\star \notin F$, $L_n(\theta)$ is a quasi-log-likelihood term, as $F$ doesn't contain the true structure. Under sufficient regularity, SO quasi-MLE converges towards $\theta_0 \in \arg\min_{\theta \in \Theta} \mathrm{KL}(p^{\star}, p_{\theta})$ where $p^\star \ne p_{\theta_0} \in F$. Without the separability from well-specified deterministic preferences, we define DP quasi-MLE as in the answer to Reviewer yquN. Under sufficient regularity, this estimator converges towards the minimizer of a similar optimization problem based on the above KL and a misspecified equivalent of $m(\theta)$.
  • Misspecified preferences. The Bradley-Terry (BT) model that uses reward-based preferences has limited expressivity as it doesn’t allow for intransitive preferences. Even when individuals exhibit transitive preferences, their averaged preferences might be intransitive due to disagreements. See Munos et al. (2024, Nash Learning from Human Feedback) or Swamy et al. (2024, A Minimaximalist Approach to Reinforcement Learning from Human Feedback).

2. Verifying our assumptions. In all generality, it is challenging to give a general recipe to formally verify those assumptions. A formal verifier (Lean) or software (SageMath) might be useful given a closed-form definition. Numerically, those assumptions can be confirmed or rejected by sampling from $p_{\theta^\star}^{\otimes 2}$. Assumption 4.4 is rejected by exhibiting $(X_i,Y_i) \in \tilde D(\theta^\star,\theta) \setminus D(\theta^\star,\theta)$. Assumptions 4.2 and 4.5 are confirmed by finding $(X_i,Y_i) \in D(\theta^\star,\theta)$ and $(X_i,Y_i) \in G_{1}(\theta^\star,u)$. The sampling complexity of such tests scales as the inverse of the event's probability. Using the Dvoretzky–Kiefer–Wolfowitz (DKW) inequality, $F_{\theta^\star,u}$ can be estimated to verify that Assumption 4.7 holds. Our additional experiments with accelerated rates include Laplace and Rayleigh distributions, see the anonymous plots at https://anonymous.4open.science/r/ICML25SuppExp .
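The DKW-based check can be sketched generically as follows (our own illustration; the nonnegative statistic `v` below is a toy stand-in for the quantity whose cdf is $F_{\theta^\star,u}$, not the paper's exact definition): a lower confidence band on the empirical cdf near zero that stays proportional to $t$ is consistent with a density that is positive at $0$.

```python
import numpy as np

def dkw_lower_bound(v, t=0.05, delta=0.05):
    """Lower confidence bound on F(t) from samples v, using the DKW inequality."""
    n = len(v)
    eps = np.sqrt(np.log(2.0 / delta) / (2.0 * n))   # uniform deviation of the empirical cdf
    return max(np.mean(v <= t) - eps, 0.0)

rng = np.random.default_rng(0)
v = np.abs(rng.normal(size=100_000))   # toy nonnegative statistic with positive density at 0
for t in (0.2, 0.1, 0.05):
    print(t, dkw_lower_bound(v, t))    # the bound stays roughly proportional to t
```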

RU versus DP. See the answer to Reviewer Gth4.

Covariance Gap. Our simulations suggest that the asymptotic gaps between SO and SP are mild. The empirical gap is also small for moderate $n$.

0-1 loss. See the answer to Reviewer yquN. For non-separable data, the minimization of the 0-1 classification loss can be NP-hard even for the simple class of linear classifiers, e.g., Feldman et al. (2018, Agnostic Learning of Monomials by Halfspaces is Hard). Inspired by Tang et al. (2024, Generalized Preference Optimization: A Unified Approach to Offline Alignment), we implement estimators based on other convex surrogates: Hinge, Square, Truncated square, Savage, and Exponential. All estimators perform on par with the logistic loss, see the plot at https://anonymous.4open.science/r/ICML25SuppExp .
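As a small illustration of the surrogate-loss idea (our own sketch, assuming a one-dimensional unit-variance Gaussian location model with log-likelihood preferences; the surrogate formulas are the standard textbook forms, not necessarily the exact ones used in the authors' experiments):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
theta_star, n = 0.7, 2_000
x, y = rng.normal(theta_star, 1.0, size=(2, n))
z = np.sign((x - y) * (theta_star - (x + y) / 2))   # deterministic preferences

# Convex surrogates of the 0-1 loss, applied to the margin u = z * ell_theta(x, y).
surrogates = {
    "logistic": lambda u: np.log1p(np.exp(-u)),
    "hinge": lambda u: np.maximum(0.0, 1.0 - u),
    "square": lambda u: (1.0 - u) ** 2,
    "exponential": lambda u: np.exp(-u),
}

for name, phi in surrogates.items():
    def objective(theta, phi=phi):
        # Sample cross-entropy term plus the surrogate preference term.
        margin = z * 0.5 * ((y - theta) ** 2 - (x - theta) ** 2)   # z * ell_theta(x, y)
        return 0.5 * np.sum((x - theta) ** 2 + (y - theta) ** 2) + np.sum(phi(margin))
    estimate = minimize_scalar(objective, bounds=(-5.0, 5.0), method="bounded").x
    print(name, abs(estimate - theta_star))   # all surrogates land close to theta_star
```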

Intuition on $V_{\theta^\star,u}$. It quantifies the amount of information in $(X_i,Y_i)$ to discriminate $\theta^\star$ from other parameters on the half-line directed by $u$. The lower $V_{\theta^\star,u}(X_i,Y_i)$ is, the more discriminative $(X_i,Y_i)$ is.

Informative preferences. For observations with a null preference gradient, parameters close to $\theta^\star$ could have similar preferences. Therefore, those samples are not sufficient to discriminate between them.

Negative examples. Those claims are a direct consequence of the definitions and will be proved in the Appendix for completeness.

Typo. It will be fixed.

Reviewer Comment

Thanks for clarifying the main concerns. I’ve raised my score based on the additional context you provided. However, I would still like this discussion—at least in part—to be included in the main paper, especially the sections on restrictive assumptions and misspecification. For this reason, I vote for acceptance, conditional on updating the camera-ready paper based on this discussion.

Author Comment

We appreciate the reviewer’s support for the acceptance of our work based on the additional context. We will use the extra page in the main paper to include these interesting discussions, such as misspecification, verification and relaxation of our assumptions, alternative reward models, and more detailed intuitions. Additionally, we will expand the Appendices with those supplementary experiments and provide detailed proofs to support our added comments.

Final Decision

The paper shows how learning with preferences can improve parameter estimation rates for Gaussian and Laplace distributions. The authors show that for estimators based on hard preferences the error converges at a rate of $O(1/n)$ and that this rate is minimax optimal up to dimension and problem-dependent constants. The reviewers found the paper measured in its statements, the contributions novel, and well supported by theory and experiments. All reviewers agreed the work should be accepted. Given the widespread use of supervised fine-tuning and preference-based learning to align LLMs, I do think some of the insights here can be translated into practical learning algorithms to improve LLMs -- this may be worth commenting on within the discussion in terms of the practical ramifications of this work. I recommend the authors use the extra page to incorporate vgrP's suggestions to reframe the writing of Sections 1-3 with an emphasis on clarity, since doing so should mitigate some of the questions raised by other reviewers as well.