PaperHub
Overall: 6.0 / 10 · Poster · 5 reviewers
Reviewer scores: 3, 3, 3, 4, 3 (min 3, max 4, std 0.4)
ICML 2025

Ranking with Multiple Oracles: From Weak to Strong Stochastic Transitivity

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-07-24

Abstract

We study the problem of efficiently aggregating the preferences of items from multiple information sources (oracles) and infer the ranking under both the weak stochastic transitivity (WST) and the strong stochastic transitivity (SST) conditions. When the underlying preference model satisfies the WST condition, we propose an algorithm named RMO-WST, which has a bi-level design: at the higher level, it actively allocates comparison budgets to all undetermined pairs until the full ranking is recovered; at the lower level, it attempts to compare the pair of items and selects the more accurate oracles simultaneously. We prove that the sample complexity of RMO-WST is $ \tilde O( N\sum_{i=2}^{N}H_{\sigma^{-1}(i),{\sigma^{-1}(i-1)}} )$, where $N$ is the number of items to rank, $H$ is a problem-dependent hardness factor, and $\sigma^{-1}(i)$ represents the $i$-th best item. We also provide a tight lower bound that matches the upper bound of approximate ranking under the WST condition, answering a previously open problem. In addition, when the SST condition is satisfied, we propose an algorithm named RMO-SST, which can achieve an $\tilde{O}(\sum_{i=1}^{N} H_i \log(N))$ sample complexity. This outperforms the best-known sample complexity by a factor of $\log(N)$. The theoretical advantages of our algorithms are verified by empirical experiments in a simulated environment.
Keywords
active ranking · stochastic transitivity · rank aggregation · dueling bandits · lower bound · upper bound

Reviews and Discussion

Review 1
Rating: 3

The paper studies the problem of ranking a set of $N$ items using $M$ ranking oracles under weak or strong stochastic transitivity conditions. Each oracle $u$, when queried with a pair of items $i$ and $j$, independently returns an indicator of its preference, represented by a probability $p_{i,j}^u$. The objective is to determine the ranking of all items while minimizing the total number of queries to the oracles.
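To make the interaction protocol concrete, here is a minimal simulation sketch of the setup described above (my own illustration, not the authors' code): $N$ items, $M$ oracles, and a query interface in which oracle $u$ prefers $i$ over $j$ with probability $p_{i,j}^u$; the quantity to minimize is the total number of such queries.

```python
import numpy as np

# Illustrative sketch only (not from the paper): a multi-oracle comparison
# environment matching the setup above. preference[u, i, j] is the probability
# that oracle u prefers item i over item j.
class MultiOracleEnv:
    def __init__(self, preference, seed=0):
        self.p = preference                      # shape (M, N, N)
        self.rng = np.random.default_rng(seed)
        self.num_queries = 0                     # the complexity measure

    def query(self, u, i, j):
        """Ask oracle u to compare items i and j; True means i wins."""
        self.num_queries += 1
        return self.rng.random() < self.p[u, i, j]
```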

The paper establishes tight bounds under both weak and strong stochastic transitivity conditions. The upper bound under the strong stochastic transitivity condition matches existing lower bounds, whereas the upper bound under the weak stochastic transitivity condition matches the lower bound presented in the paper.

Experiments justify the theoretical findings of the paper.

Questions for Authors

  1. Assumption 3.1 seems strong, as it requires all oracles to have consistent ratings for all pairs of items. This assumption may not hold in practice—for example, when evaluating two LLMs, some users may favor one model over the other, while others may not. Is it possible to relax this assumption?

  2. What are the logarithmic factors hidden in Theorem 5.4?

Claims and Evidence

Yes.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

I didn’t go through the proofs.

Experimental Design and Analysis

The experimental section discusses an improved algorithm presented in the Appendix.

Supplementary Material

I didn’t go through the supplementary materials.

Relation to Broader Literature

Ranking with multiple noisy oracles should have applications in many fields.

Essential References Not Discussed

No.

Other Strengths and Weaknesses

  1. The paper is well written overall.

  2. The problem studied is of practical importance and could have applications in multiple fields.

Other Comments or Suggestions

  1. Line 104–105, $\Delta := |p_{i,j} - 1/2|$. Something may be missing here, e.g., min or sum.

  2. Line 247, “After N-1 rounds of finding the maximal item…” Why are there $N-1$ rounds here? Does $U$ contain $N$ items?

Author Response

Dear reviewer, thank you for your time and effort in reviewing and for your suggestions. We address your concerns below:

Q1: Line 104–105, $\Delta := |p_{i,j} - \frac{1}{2}|$. Something may be missing here, e.g., min or sum.
A1: Thanks for spotting this; the correct definition should be $\Delta := \min_{(i,j) \in [N] \times [N]} |p_{i,j} - \frac{1}{2}|$.

Q2: Line 247, “After N-1 rounds of finding the maximal item…” Why are there N−1 rounds here? Does U contain N items?
A2: This sentence, “After N-1 rounds of finding the maximal item…”, actually refers to Algorithm 1, where it finds the current maximal item and deletes it. After $N-1$ rounds, the remaining item will be the minimum item out of the $N$ items. We will move it to the correct place. We apologize for the confusion.
As for Algorithm 2, the candidate set $U$ has at most $N$ items. Every time a new pairwise relation is revealed, one of the two items cannot be the maximum item, so we remove it from $U$. After at most $N$ rounds, the maximum item will be found.
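As a rough illustration of the elimination argument in A2 (a sketch under the simplifying assumption that each revealed pairwise relation is correct, not the paper's Algorithm 2):

```python
# Sketch of the candidate-set elimination described above: every decided pair
# removes one item that cannot be the maximum, so at most N - 1 decisions are
# needed before a single candidate (the maximum) remains.
def find_max(items, decide_pair):
    # decide_pair(i, j) is assumed to return the truly preferred item.
    candidates = list(items)
    while len(candidates) > 1:
        i, j = candidates[0], candidates[1]
        winner = decide_pair(i, j)
        candidates.remove(j if winner == i else i)   # drop the loser
    return candidates[0]
```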

Q3: Assumption 3.1 seems strong, as it requires all oracles to have consistent ratings for all pairs of items. This assumption may not hold in practice—for example, when evaluating two LLMs, some users may favor one model over the other, while others may not. Is it possible to relax this assumption?
A3: It is possible to consider more general cases. Without a strict consistency guarantee, we need to define a ground-truth ranking using numerical indicators like win rate. For example, one may use the Copeland/Borda score to determine a ground-truth ranking when there is inconsistency among oracles.
Additionally, these methods typically involve examining all oracles with the same accuracy and then concluding with a majority vote, which usually leads to a less efficient algorithm (introducing an additional factor of $M$).
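For concreteness, here is a minimal sketch of the Copeland-style alternative mentioned in A3 (a hypothetical illustration, not the paper's RMO algorithms): aggregate all $M$ oracles by majority vote on each pair and rank items by their number of pairwise wins.

```python
import numpy as np

# Hypothetical baseline sketch (not from the paper): majority vote over all M
# oracles on each pair, then rank by Copeland score (number of pairwise wins).
# Querying every oracle on every pair is what introduces the extra factor of M.
def copeland_ranking(votes):
    # votes[u, i, j] = 1 if oracle u preferred i over j, 0 otherwise (i != j)
    majority_wins = votes.mean(axis=0) > 0.5     # aggregate oracles per pair
    copeland_score = majority_wins.sum(axis=1)   # wins of each item
    return np.argsort(-copeland_score)           # best item first
```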

Q4: What are the logarithmic factors hidden in Theorem 5.4?
A4: $\log^2(M)\log(H_i)$ is the hidden factor.

Review 2
Rating: 3

This paper studies the problem of ranking items using noisy pairwise comparisons from multiple oracles under both the weak stochastic transitivity (WST) and strong stochastic transitivity (SST) conditions. While previous work has explored ranking under SST with single and multiple oracles and WST for a single oracle, this paper extends the study to WST with multiple oracles, completing the literature in this area. Secondly, the authors derive a lower bound for the WST setting, which also holds for SST, resolving a prior conjecture. As a third contribution, they propose an improved algorithm for SST that reduces the sample complexity by a logarithmic factor compared to previous methods.

Questions for Authors

In addition to the question under Claims and Evidence, I have one further question:

  1. The authors assume there is no ‘best’ oracle for all pairs of comparisons. If there is a single best oracle, does the algorithm quickly detect that oracle and reduce the complexity to closely match the single-oracle case?

Claims and Evidence

  1. The notation $\tilde{O}$ hides logarithmic factors. It is not specified which logarithmic factors are hidden. Therefore, it is difficult to verify the third claim in the contributions, where the authors claim to reduce the sample complexity by a logarithmic factor.

See questions

Methods and Evaluation Criteria

Yes, the problem is standard and with well-defined evaluation criteria.

Theoretical Claims

Essentially correct.

Experimental Design and Analysis

The experiments are fine, considering the main contribution of the work is primarily theoretical.

Supplementary Material

N/A

Relation to Broader Literature

The work is well positioned with respect to other works in the literature.

Essential References Not Discussed

N/A

Other Strengths and Weaknesses

N/A

Other Comments or Suggestions

Other minor comments

  1. “medium” in Algorithm 4 should be defined in the text.

  2. Assumptions 3.2 and 3.3 can be named as definitions.

  3. Page 3, Lines 138–139: “Iterative-Insertion-Ranking (IIR) [Ren et al., 2019] proposes an algorithm ...” requires rewording.

  4. The notations $H_i$, $H_{i,j}$ cause confusion (especially in the table and main contributions) since they are not defined until page 4, and I had to refer to the referenced papers to determine their precise definitions. Also, in Table 1, $\Delta_i$ is not defined, even in the text (I believe).

Author Response

Dear reviewer, thanks for your positive comments. We address the questions you raised below:

Q1: “medium” in Algorithm 4 should be defined in the text.
A1: “medium” follows the definition in conventional statistics: it is the number (item) that is ranked in the middle of all numbers in the set of interest. We will add an explanation to the main text.

Q2: Assumptions 3.2 and 3.3 can be named as definitions.
A2: Thanks for pointing it out. Indeed, we can define the two settings first. We will modify them in our revision.

Q3: Page 3, Lines 138–139: “Iterative-Insertion-Ranking (IIR) [Ren et al., 2019] proposes an algorithm ...” requires rewording.
A3: That's a great catch; we will revise it to: “Ren et al. (2019) propose the Iterative-Insertion-Ranking (IIR) algorithm, which uses preference interval trees to perform binary search and sequentially insert items into an already ranked list.”

Q4: The notations $H_i$, $H_{i,j}$ cause confusion (especially in the table and main contributions) since they are not defined until page 4, and I had to refer to the referenced papers to determine their precise definitions. Also, in Table 1, $\Delta_i$ is not defined, even in the text (I believe).
A4: To improve the readability of the paper, we are considering moving the “Preliminaries” section right after the “Introduction”, so that $H$ can be defined earlier. For the $H$ used in the “Abstract” and “Introduction” sections, we have revised the wording so that understanding it as a hardness factor is enough to proceed with reading. $\Delta_i$ is defined in the right column at line 174, based on the definition of $\Delta$.

Q5: The authors assume there is no ‘best’ oracle for all pairs of comparisons. If there is a single best oracle, does the algorithm quickly detect that oracle and reduce the complexity to closely match the single-oracle case?
A5: That is right; if there is indeed a single best oracle, the algorithm can be slightly modified to keep only the decent oracles found in previous rounds, so that eventually only the best oracle is left to handle all of the remaining pairs.
Line 96, right column, mentions an algorithm that maintains two sets of estimates globally: one for items and one for oracles (Wu et al., 2022). That paper studies exactly the setting you mention under SST, and it can be easily extended to the WST setting as well. We will revise the original text to incorporate this.

Review 3
Rating: 3

This paper considers the problem of estimating the ranking of $N$ items using pairwise comparisons. Two different assumptions are considered: weak stochastic transitivity (if $p_{ij} \ge 1/2$ and $p_{jk} \ge 1/2$, then $p_{ik} \ge 1/2$) and strong stochastic transitivity ($p_{ik} \ge \max(p_{ij}, p_{jk})$), where $p_{ij}$ is the probability that the oracle answers $i \succ j$ when comparing $i$ and $j$. There are multiple oracles that return the correct answer with different probabilities. This paper proposes an approach with improved sample complexity that combines Probe-Max, proposed by Lou et al. (2022), for ranking estimation and Compare, proposed by Saad et al. (2023), for oracle selection. They also prove a lower bound on the sample complexity.
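For readability, the two conditions as stated above can be displayed as:

```latex
% Restating the two conditions from the summary above:
\[
  \text{WST: } \big(p_{ij} \ge \tfrac{1}{2} \ \wedge\ p_{jk} \ge \tfrac{1}{2}\big)
  \ \Longrightarrow\ p_{ik} \ge \tfrac{1}{2},
  \qquad
  \text{SST: } p_{ik} \ge \max\!\big(p_{ij},\, p_{jk}\big).
\]
```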

Update after rebuttal

I keep my positive score based on the authors' rebuttal. I hope that the authors update the manuscript so that the two independent components of the proposed algorithm are clearly presented.

Questions for Authors

Not particularly.

Claims and Evidence

The contribution of the paper is clearly stated, and all the theoretical results are provided with the full proofs.

Methods and Evaluation Criteria

The empirical results use only randomly generated datasets. It is not fully satisfactory, but reasonable just for complementing the theoretical results.

Theoretical Claims

I did not check the details of the proofs, but briefly checked them and found no fatal errors.

Experimental Design and Analysis

In my opinion, the experiments could be improved by adding more benchmark algorithms. Currently, the experiments compare the proposed method with a single competitor, which is a naive extension of Probe-Max originally developed for the single-oracle setting. Since the proposed method is a combination of high-level ranking algorithm and a low-level oracle selection algorithm, it is possible to consider benchmarks based on existing methods given in Table 1, even if they are for the single-oracle setting. If the authors add more benchmark algorithms to the experiments, the empirical contribution of this paper would be increased.

Supplementary Material

I briefly checked the proofs.

Relation to Broader Literature

The proposed method in this paper is built on existing algorithms for the single-oracle setting, such as Probe-Max by Lou et al. (2022) and IIR by Ren et al. (2019), and on the multiple-oracle algorithm of Saad et al. (2023). The sample complexity upper bound is improved, and a lower bound is newly proved.

Essential References Not Discussed

To my knowledge, all related works are clearly discussed.

Other Strengths and Weaknesses

Not particularly.

Other Comments or Suggestions

The introduction and problem setting are clearly written, but the algorithm description is complicated. I believe it can be simplified by separately describing building blocks. In my understanding, the proposed method is a combination of ranking estimation and oracle selection, and both of them can be independently combined with other methods.

Author Response

Dear reviewer, thank you for your time and effort in reviewing and for your suggestions. We address your concerns below:

Q1: Since the proposed method is a combination of high-level ranking algorithm and a low-level oracle selection algorithm, it is possible to consider benchmarks based on existing methods given in Table 1, even if they are for the single-oracle setting.
A1: Thanks for the suggestion. The main difficulty in including more benchmarks is that most such benchmarks are offline datasets. There are no widely acknowledged benchmarks for online, interactive algorithms. This is the main reason why we use a synthetic environment with multiple oracles. Ren et al. also used a synthetic environment for the single-oracle setting.

Q2: I believe it can be simplified by separately describing building blocks. In my understanding, the proposed method is a combination of ranking estimation and oracle selection, and both of them can be independently combined with other methods.
A2: Thanks for your suggestion. You are correct that the algorithms can be broken down into two major pieces. We will revise and restructure the text to make it easier to read.

Review 4
Rating: 4

The paper studies the problem of identifying the ranking of a set of items by querying pairwise preferences from several oracles. In particular, they study two settings: (1) weak stochastic transitivity (WST), where there exists a ranking (permutation) of items such that higher-ranked items have a higher probability of being preferred by all oracles; (2) strong stochastic transitivity (SST), where, in addition to (1), if an item i ranks higher than j and j ranks higher than item k, then it is easier (the preference probability is further away from 0.5) to identify that i is better than k than to decide the other two comparisons. They measure the performance of an algorithm by the number of oracle queries.

In the WST setting, they provided a lower bound on the number of queries, and proposed an algorithm that achieves an upper bound that matches the lower bound. In the SST setting, they proposed an algorithm that improves upon prior best-known sample complexity by log factors, and matches the lower bound of the problem.

"## update after rebuttal

I would like to keep my evaluation.

Questions for Authors

no

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes

Theoretical Claims

I did not check proofs.

Experimental Design and Analysis

I did not find problems.

Supplementary Material

No

Relation to Broader Literature

The paper studies a problem of rank aggregation, which is a problem studied by many prior works.

The paper specifically focuses on the setting where there are multiple oracles, where there is only one work that studies it under the SST condition. The paper proposed an algorithm that achieves better theoretical guarantees than the prior work under the SST setting. The paper also studies the problem under the WST setting.

The proposed method borrows some ideas from prior works.

Essential References Not Discussed

I am not aware of any

Other Strengths and Weaknesses

Strengths

  1. The paper is well written and a pleasure to read.

  2. The paper presents a theoretical analysis of the problem, providing both upper and lower bounds for the WST setting, improving upon prior work under the SST setting, and matching the lower bound of the problem.

Weaknesses:

  1. Some assumptions might not hold in practical real-world applications. For instance, different people might rank different LLMs differently.

Other Comments or Suggestions

In the bounds, the confidence term $\delta$ is missing. I guess it is in the log terms. It might be good to present it so readers can see how it affects the bound.

Author Response

Dear reviewer, thanks for your positive comments. We address the questions you raised below:

Q1: Some assumptions might not hold in practical real-world applications. For instance, different people might rank different LLMs differently.
A1: It is possible to consider more general cases. Without a strict consistency guarantee, we need to define a ground-truth ranking using numerical indicators like win rate. For example, one may use the Copeland/Borda score to determine a ground-truth ranking when there is inconsistency among oracles. Additionally, these methods typically involve examining all oracles with the same accuracy and then concluding with a majority vote, which usually leads to a less efficient algorithm (introducing an additional factor of $M$).

Q2: In the bounds, the confidence term $\delta$ is missing. I guess it is in the log terms. It might be good to present it so readers can see how it affects the bound.
A2: We omit some of the terms in the main text to reduce clutter and ease reading. In Table 1, we include the log dependence on $\delta$.
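As a generic illustration of why $\delta$ appears only inside logarithms (a standard Hoeffding argument, not the paper's exact bound): deciding a single pair whose preference probability is $\Delta$ away from $1/2$ at confidence $1-\delta$ needs roughly the following number of comparisons.

```latex
% Generic Hoeffding-style sketch (not the paper's exact bound): with n i.i.d.
% comparisons of a pair whose gap is \Delta, the empirical mean is within
% \Delta of the truth with probability at least 1 - \delta once
\[
  n \;\ge\; \frac{1}{2\Delta^{2}} \log\frac{2}{\delta},
\]
% so \delta enters sample-complexity bounds only through logarithmic terms.
```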

Review 5
Rating: 3

This paper addresses the problem of efficiently aggregating preferences from multiple oracles to determine rankings under different stochastic transitivity conditions. The authors propose two main algorithms: RMO-WST for the Weak Stochastic Transitivity (WST) setting, and RMO-SST for the Strong Stochastic Transitivity (SST) setting. Experimental results validate the theoretical advantages of these algorithms, showing consistent improvement over a baseline method.

Questions for Authors

  1. The consistency assumption (Assumption 3.1) seems quite strong in practical settings. Have you considered a relaxed version where oracles can disagree on some small fraction of pairs? How would this affect your theoretical guarantees?
  2. Your RMO-SST algorithm achieves a log(N) factor improvement over prior work. Is this improvement purely analytical, or does it correspond to a specific algorithmic innovation that wasn't present in previous approaches?
  3. For the WST setting, you provide matching upper and lower bounds. Do you believe similar tight bounds exist for the SST setting? If there's a gap, where might the improvement come from?
  4. The empirical evaluation focuses on synthetic data with a specific pattern. How would performance change with more complex accuracy distributions, such as when different oracles have expertise on different subsets of items?
  5. Your algorithms require the number of oracles M as an input parameter. In crowdsourcing scenarios, this number might grow over time. Is there a natural way to adapt your algorithms to an online setting where new oracles can join?

Claims and Evidence

The claims in this paper are supported by evidence:

  • The sample complexity bounds for both algorithms are derived through detailed mathematical proofs
  • The lower bound for WST ranking matches the upper bound, providing a complete characterization
  • Experimental results confirm the theoretical advantages, showing that RMO-WST consistently outperforms the Probe-Max baseline.

Evidence is provided for the theoretical contributions. The empirical evaluation, while supporting the claims, is somewhat limited in scope (only synthetic data with one specific pattern of oracle accuracies).

Methods and Evaluation Criteria

The proposed methods are well-designed for the ranking problem:

  • The bi-level design effectively separates the ranking strategy from the comparison mechanism.
  • The oracle selection approach intelligently adapts to varying oracle accuracies.
  • The WST and SST conditions appropriately model different real-world preference scenarios.

Theoretical Claims

The theoretical analysis appears sound:

  • Theorem 5.1 (Compare algorithm guarantees): Builds on Saad et al. (2023) with appropriate modifications
  • Theorem 5.2 (RMO-WST sample complexity): Relies on an accounting method that tracks and bounds the total number of comparisons made by the algorithm.
  • Theorem 5.4 (RMO-SST sample complexity): Replaces the Attempt-to-Compare (ATC) subroutine from Ren et al. (2019) with their improved Compare algorithm (Algorithm 3) while keeping the overall structure of the Iterative-Insertion-Ranking (IIR) framework.
  • Theorem 5.8 (Lower bound): Uses a reduction to a hard instance and information-theoretic arguments.

Experimental Design and Analysis

The experimental design compares RMO-WST against Probe-Max across various settings (number of ranking items and number of oracles). The experiments are sound but limited in scope. Testing with different oracle accuracy distributions or real-world preference data would have provided more comprehensive validation.

Supplementary Material

I skimmed through it but did not read it thoroughly.

Relation to Broader Literature

The work extends and connects to several research areas: single-oracle active ranking, and multi-oracle ranking approaches.

Essential References Not Discussed

Based on my assessment, the authors have thoroughly addressed the key findings present in existing literature on this topic.

Other Strengths and Weaknesses

Strengths and weaknesses have been highlighted in other questions.

Other Comments or Suggestions

None.

Author Response

Dear reviewer, thanks for your positive and constructive comments. We address your questions below:

Q1: The consistency assumption (Assumption 3.1) seems quite strong in practical settings. Have you considered a relaxed version where oracles can disagree on some small fraction of pairs? How would this affect your theoretical guarantees?
A1: Great suggestion! It is possible to generalize to a broader, more flexible case where oracles can disagree on a small group of pairs. However, without a strict consistency guarantee, we can only define an objective ranking using numerical indicators like win rate. For example, one may use the Copeland/Borda score to determine a ground-truth ranking when there is inconsistency among oracles. Additionally, these methods typically involve examining all oracles with the same accuracy and then concluding with a majority vote, which usually leads to a less efficient algorithm (introducing an additional factor of $M$).

Q2: Your RMO-SST algorithm achieves a log(N) factor improvement over prior work. Is this improvement purely analytical, or does it correspond to a specific algorithmic innovation that wasn't present in previous approaches?
A2: The improvement comes from a new algorithm design that was not present in Saad's work. In the previous work by Saad et al., when inserting an item into the sorting tree, their algorithm simply divided the confidence budget into $\log N$ pairwise comparisons, leading to an additional $\log N$ factor. Our improvement comes from treating the insertion as a whole process and carefully assigning the budget when inserting an item into the tree.
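In terms of the bounds quoted in the abstract (a sketch, under the assumption that the prior bound carries the extra factor described above):

```latex
% Sketch based on the bounds stated in the abstract: splitting the confidence
% budget per comparison costs an extra \log N factor, while budgeting the whole
% insertion at once removes it.
\[
  \underbrace{\tilde{O}\Big(\sum_{i=1}^{N} H_i \log^{2}(N)\Big)}_{\text{per-comparison budget (prior work)}}
  \;\longrightarrow\;
  \underbrace{\tilde{O}\Big(\sum_{i=1}^{N} H_i \log(N)\Big)}_{\text{RMO-SST (this paper)}}
\]
```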

Q3: For the WST setting, you provide matching upper and lower bounds. Do you believe similar tight bounds exist for the SST setting? If there's a gap, where might the improvement come from?
A3: For the SST case, the lower bound is order-wise tight, given the proof presented in Ren et al. (2019), which is briefly mentioned in Remark 5.5, line 339. Ren et al.'s lower bound is for a single oracle, but it can be easily modified to hold for multiple oracles.

Q4: The empirical evaluation focuses on synthetic data with a specific pattern. How would performance change with more complex accuracy distributions, such as when different oracles have expertise on different subsets of items?
A4: The algorithm and theoretical analysis cover the case where different oracles have expertise on different subsets of items. For each subset of items (or each specific pair, in the context of this paper), the best oracles are elected from scratch for that specific pair. In this regard, our proposed algorithm continues to enjoy at least the efficiency that a non-RMO version of the algorithm (e.g., Probe-Max) has.

Q5: Your algorithms require the number of oracles M as an input parameter. In crowdsourcing scenarios, this number might grow over time. Is there a natural way to adapt your algorithms to an online setting where new oracles can join?
A5: That's a great question. Our algorithm can be viewed as selecting the best oracles for different pairs. Thus, when new oracles (crowd-sourced labellers) arrive, they can wait until the algorithm attempts to compare the next new item pair. In this way, the algorithm works without any modification.

Final Decision

The paper studies sample complexity bounds for ranking from pairwise comparisons collected from multiple oracles. The theoretical results include tight bounds when weak stochastic transitivity holds, and an algorithm whose upper bound likewise matches the lower bound when strong stochastic transitivity holds (improving on what was previously known by a $\log N$ factor). The experimental validation shows that the proposed algorithms perform better than the baselines.

All the reviewers agree that the theoretical contributions are well rounded. However, a few raised concerns about the novelty of the results, since they rely on previously known single-oracle algorithms, but the authors argued that the improved results come from new algorithmic ideas not present in previous works. A few also raised concerns about the experimental validation, but the authors have sufficiently addressed these concerns. Since the theoretical contributions are significant, I think this is a good enough contribution to appear at ICML. The authors are advised to update the final version of the paper according to the reviewers' suggestions about writing.