Selective Preference Aggregation
We propose Selective Preference Aggregation (SPA), which abstains on disputed preference pairs to produce robust, supported rankings that better align with raters.
Abstract
Reviews and Discussion
This paper proposes aggregating ordinal preferences by producing selective rankings. The proposed selective aggregation framework explicitly reveals and controls dissent. The authors develop efficient graph-based algorithms (Algorithm 1 and Algorithm 2) with theoretical guarantees on correctness, uniqueness, and runtime. Experiments across diverse real‐world datasets demonstrate that selective rankings are more robust, transparent, and fair compared to standard voting rules and ranking algorithms.
Questions for Authors
See above
Claims And Evidence
- The paper asserts that standard aggregation methods hide disagreement by forcing a total order, while selective aggregation naturally reveals dissent. Although the paper provides theoretical definitions and proofs, the empirical evaluation is questionable. The experiments appear contrived and lack a convincing demonstration that selective rankings offer a meaningful advantage in real-world scenarios. Moreover, the claim of “extensive” evaluation is overstated given the limited scope and sometimes arbitrary selection of datasets.
- The algorithms (Algorithm 1 and Algorithm 2) are presented as efficient and unique (optimal) solutions for the selective aggregation problem.
- The selective rankings are claimed to be more robust and transparent compared to traditional methods (Borda, Copeland, Kemeny, MC4). The robustness metrics show some promise, yet the overall performance differences are marginal and not convincingly argued to translate into practical benefits. Furthermore, the paper glosses over the trade-offs involved in choosing the dissent parameter τ.
Methods and Evaluation Criteria
- The main contribution, the idea of using selective rankings to “agree to disagree” is not clearly motivated by practical needs. In many applications, a total order is still necessary, and the paper does not adequately address how its partial order can be converted or used in those contexts.
- The algorithms are described in mathematical detail.
- The evaluation is carried out on several datasets converted into pairwise preferences.
Theoretical Claims
The paper’s theoretical results (e.g., Theorems on uniqueness and robustness) are mathematically rigorous under ideal assumptions.
Experimental Design and Analysis
- The experiments do not convincingly show that selective aggregation improves decision quality; instead, they often merely illustrate that the method can “abstain” from making comparisons—a trivial consequence of the design.
- The paper fails to explore how sensitive the results are to the choice of τ.
Supplementary Material
.
Relation to Existing Literature
- The paper is grounded in classic social choice theory and traditional ranking algorithms.
- The paper also relates to contemporary research in machine learning where transparency and robustness of aggregated annotations (e.g., in RLHF) are increasingly important.
- There is little discussion of alternative robust ranking methods, and key references (e.g., for baselines like ORPO or CPO) are not adequately introduced when they first appear.
- The approach of simply abstaining from comparisons does not address situations where a complete ranking is essential, and the paper lacks discussion on how its method compares to other state-of-the-art aggregation techniques under realistic conditions.
Missing Important References
.
Other Strengths and Weaknesses
Strengths
- The paper is mathematically detailed, providing rigorous proofs and a comprehensive supplementary section.
- It introduces the idea of selective aggregation, which is novel in its explicit treatment of dissent.
Weaknesses
- The overall novelty is limited; the idea of abstaining from forced arbitration is not revolutionary and may offer little benefit in many practical applications.
- The method’s sensitivity to the dissent parameter is underexplored, and the paper does not offer practical guidance on setting this parameter.
- Presentation issues (such as unclear figures and poor integration of citations) further detract from the work.
Other Comments or Suggestions
- The paper would benefit from a clearer statement of its primary contribution. Is the focus on the algorithmic framework for selective aggregation, or on its application to a specific problem domain?
- More extensive and realistic experiments are needed to justify claims of improved transparency and robustness.
- A discussion on how to convert a selective (partial) order into a total order when required would strengthen the practical relevance of the approach.
Response: Thank you for your feedback! We address your concerns below and include tables at https://tinyurl.com/2ybsfs95
Is the focus on the algorithmic framework for selective aggregation, or on its application to a specific problem domain?
The primary contribution is the proposed algorithmic framework. We will revise the text to state this. The explored problem domains serve as illustrations of the framework's capabilities in different contexts.
The approach of simply abstaining from comparisons does not address situations where a complete ranking is essential …
We appreciate this concern. Many relevant real-world tasks (e.g. RLHF, content moderation, search ranking) do not require total orders. When necessary, the dissent parameter can be raised to 0.5, and further until the graph disconnects (with loss of guarantees). If there is still no total order, then either there is no majority preference between certain items or a cycle is formed. If a total ordering is truly necessary, other methods can be applied within tiers to produce local complete orderings. This process distinguishes comparisons supported by the algorithm's guarantees (between tiers) from those less well-founded. Online, the approach highlights where more information could resolve disagreement. SPA serves as a robust first step: identify consensus comparisons first, then apply targeted methods (experts, more users) where disagreement persists. We're happy to add this to the text.
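For concreteness, a minimal sketch of such a within-tier refinement (the scoring helper is a hypothetical stand-in for any secondary rule, e.g. Borda scores):

```python
# Minimal sketch (assumed interfaces): keep SPA's between-tier order and break
# ties within each tier with a secondary scoring rule supplied by the caller.
def refine_to_total_order(tiers, score):
    """tiers: list of item lists, best tier first; returns one flat total order."""
    total_order = []
    for tier in tiers:
        # Between-tier comparisons keep SPA's guarantees; within-tier ones do not.
        total_order.extend(sorted(tier, key=score, reverse=True))
    return total_order

# Usage with made-up scores:
print(refine_to_total_order([["A", "B"], ["C"]], {"A": 2, "B": 5, "C": 1}.get))
# -> ['B', 'A', 'C']
```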
The paper glosses over the trade-offs involved in choosing the dissent parameter τ.
We have several strong points, already in the paper, that we can consolidate into a new subsection in Section 3 (Algorithms): The path algorithm (Appendix) allows us to avoid preselecting a value of τ. Importantly, computing all possible rankings for any τ < 0.5 incurs no additional asymptotic cost.
Theoretical guarantees can guide its selection. The discussion starting at line 281 gives an example of how we might select tau based on assumptions about noise or missing preferences.
There is little discussion of alternative robust ranking methods, and key references (e.g., for baselines like ORPO or CPO) are not adequately introduced when they first appear.
We're happy to add discussion of ORPO/CPO/other methods. ORPO focuses on alignment during the training process. SPA is less constrained - it may be possible to use it in tandem; SPA could filter for high-quality responses with ORPO used for fine-tuning.
CPO more closely resembles SPA, but its purpose is to contrast “near-perfect but flawed translations”. While comparing top tiers in SPA could have a similar purpose, our method captures preferences across all items. As a result, we feel rank aggregation methods are more appropriate baselines.
The selective rankings are claimed to be more robust and transparent … yet the overall performance differences are marginal and not convincingly argued.
We appreciate this and will highlight SPA's benefits. SPA provides practical benefits through transparency and robustness. It reveals underlying consensus strength (e.g., Sushi's weak majority via path, App D.3). Existing methods obscure this. As noted, SPA offers stronger robustness guarantees (see dGCS). SPA exhibits no inversions while others vary significantly - this stability is crucial in domains like RLHF, preventing reward signals from flipping due to small sample changes. SPA consistently yields lower disagreement rates (Table 1) compared to baselines (often 0-6% for SPA vs. 4-12%+ for baselines).
The overall novelty is limited; the idea of abstaining from forced arbitration is not revolutionary and may offer little benefit in many practical applications. The main contribution, the idea of using selective rankings to “agree to disagree” is not clearly motivated ….
Beyond abstaining, SPA reveals preference strength, crucial for trustworthy AI. For instance, methodologies aiming to learn from demonstrations [Brown et al. 2019] or generative reward models [Mahan et al. 2024] use steps with simple majority rules. Our rankings are always supported by a majority of users, clarify where preferences conflict, remain stable under noise, and can be adjusted. Our response to sHCm further details future use cases in ML workflows like RLAIF and personalized models.
- Mahan, et al. "Generative reward models." arXiv preprint arXiv:2410.12832 (2024).
- Brown, et al. "Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations." ICML. PMLR, 2019.
More extensive and realistic experiments are needed ….
We appreciate the feedback. Table 8 (in our response) shows performance in scenarios with and without consistent user preferences. Only SPA reveals user contradictions. See also our responses to dGCS (ablation) and fgRQ (dropped comparisons).
We hope that our response resolves any misunderstandings! Thank you again for your time, and we look forward to resolving any remaining concerns that you may have.
The paper introduces a new preference aggregation solution, called Selective Preference Aggregation (SPA). Its essential feature is to return a partial order of items based on beyond-majority principles. More precisely, for any dissent level τ, SPA constructs a total order over the finest partition of the items such that a derived order cannot occur if more than a τ fraction of users disagree with it. In a sense, the partial order is "safe", i.e., it only makes an ordering decision if a large, potentially overwhelming, majority agrees with the decision. The paper proves desirable properties of SPA and evaluates it empirically on several datasets.
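For concreteness, here is a minimal sketch (assumed, not necessarily the paper's Algorithm 1 or 2) of one graph-based construction consistent with this description: keep a constraint edge a → b whenever placing b strictly above a would be overruled by more than a τ fraction of users, collapse cyclic constraints into tiers, and order the tiers topologically.

```python
# Illustrative sketch only; the paper's actual algorithm may differ, and ties
# between unordered components are broken arbitrarily here.
import networkx as nx

def tiered_ranking(items, prefer_frac, tau):
    """prefer_frac[a][b]: fraction of users who strictly prefer a over b."""
    g = nx.DiGraph()
    g.add_nodes_from(items)
    for a in items:
        for b in items:
            if a != b and prefer_frac[a][b] > tau:
                g.add_edge(a, b)  # a must sit weakly above b
    cond = nx.condensation(g)  # DAG over groups of items forced together
    return [sorted(cond.nodes[c]["members"]) for c in nx.topological_sort(cond)]

# Toy example: a 2/3-majority cycle over {A, B, C} plus a unanimous loser D.
prefer = {
    "A": {"B": 2/3, "C": 1/3, "D": 1.0},
    "B": {"A": 1/3, "C": 2/3, "D": 1.0},
    "C": {"A": 2/3, "B": 1/3, "D": 1.0},
    "D": {"A": 0.0, "B": 0.0, "C": 0.0},
}
print(tiered_ranking(["A", "B", "C", "D"], prefer, tau=0.4))
# -> [['A', 'B', 'C'], ['D']]: the cycle collapses into a single tier above D.
```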
Questions for Authors
Could the authors more rigorously define the adaptation of SPA in Section 6?
I would also be interested in the authors' thoughts on how SPA could be adapted for highly sparse comparisons, as is the case in RLHF and recommendation AIs (see e.g. [3]).
Additionally, one interesting feature of pol.is is to leverage community detection to find agreement across (unbalanced) communities. How could SPA be adapted to such settings (especially when the communities are not predefined, and have instead to be learned)?
(note that combining the two problems, namely sparsity and cross-community agreements, makes the problem even more challenging, as the sparsity may be adversarial, i.e. some community overrates some items rather than others)
[3] Plackett-Luce regression mixture model for heterogeneous rankings. M. Tkachenko, H. W. Lauw (2016). https://dl.acm.org/doi/abs/10.1145/2983323.2983763
Claims And Evidence
The paper shows that:
- The values of SPA for all τ < 0.5 can be computed efficiently, using graph algorithms.
- For a suitable choice of τ, SPA isolates an existing Condorcet winner.
- SPA is "safe": under reasonable conditions, adding missing preferences cannot modify previously derived orders.
- Adding a new item to SPA (with comparisons to other items) will not invert orders (though it may cancel some).
While I have not carefully analyzed the provided proofs, I am fully convinced that the theorems hold.
The method is also evaluated on 5 datasets, with findings consistent with the intuition, i.e. SPA makes fewer mistakes by abstaining from returning a total order, though this can lead to a very coarse partition (and thus a very sparse comparison graph). The results seem reasonable.
Finally, the paper shows how SPA can be used for preference learning in a machine learning context (and thus, in principle, with generalizations to non-evaluated items). I am less convinced by the value of this experimental setup (see below).
Methods and Evaluation Criteria
The experiments in Section 5 make sense to me, and I am satisfied with their design, their presentation and the results.
However, the experiment in Section 6 is less compelling. In particular, the value of doing machine learning is to allow for generalization to non-evaluated items. Yet, as far as I understand, the trained models are evaluated on the training set (line 422, left column). I would suggest that they instead separate the DiCES dataset into two subsets, one for training and the other for evaluation. The reported prediction error should then correspond to the evaluation set.
Also, I am not sure I fully understood how SPA was adapted to Section 6.
Theoretical Claims
While I have not carefully analyzed the provided proofs, I have overviewed the Appendix. I am fully convinced that the theorems hold, as they leverage classical graph theoretical constructions.
Experimental Design and Analysis
As I said above, I believe that Section 6 would gain by separating the dataset into a training and an evaluation set, to evaluate the out-of-training-set predictions of the SPA-based trained model.
Supplementary Material
I did not thoroughly review the supplementary material. However, the results seem reasonable to me.
Relation to Existing Literature
The literature review seems satisfactory to me. Perhaps the authors could add references to more datasets that fit their algorithms (see suggestions below).
Missing Important References
The paper does not seem to be missing any key references.
Other Strengths and Weaknesses
I really appreciated the paper's motivation section, especially with respect to the limitations of the principle of majority and the need to go beyond this, not only for legitimacy reasons, but also because of security. This yields an originality that is of great strength to the paper.
The paper is also very well written, with an exception for Section 6 which I found confusing at times (see below).
I think that, especially in the current age of irreconcilable judgments on AIs' preferred behaviors, especially in the context of social media moderation and content amplification, the paper is of very high significance.
Other Comments or Suggestions
I did not understand how SPA was precisely defined in Section 6 (line 383, right column). The paper writes "SPA for the largest value of dissent, which leads to the clearest distinctions among conversations". How is "distinctions among conversations" measured? Moreover SPA creates a partition. Are the authors guaranteeing a partition into two subsets (toxic and non-toxic)? What if there are more subsets? I would appreciate clarifications on this definition.
Additionally (and perhaps relatedly), the definition of label error (line 420, left column) introduces a variable t which is never defined (except using the word "consensus", which I do not fully understand, see below), and whose value in the experiments does not seem to be given. Could the authors clarify?
Less importantly, I would urge the authors to look into the data of https://pol.is [1]. There has been recent efforts to make such data openly available, and they really match the DiCES dataset structure. But there are many more such data. Most importantly, I believe that SPA would be a great addition to the pol.is website.
The authors could also be interested to look into the Tournesol dataset [2], though the comparisons in this dataset are a lot more sparse, which raises additional challenges.
Finally, I am confused about the use of the word "consensus" throughout the paper. In plain English, "consensus" seems synonymous with "quasi-unanimity", which suggests using τ close to 0. But this is not how it seems to be used, e.g. line 323, right column, or line 420, left column.
[1] Polis: Scaling Deliberation by Mapping High-Dimensional Opinion Spaces. Christopher Small, Michael Bjorkegren, Timo Erkkilä, Lynette Shaw, Colin Megill (2021). Departament de Filosofia, Sociologia i Comunicació Audiovisual i Publicitat.
[2] The Tournesol dataset: Which videos should be more largely recommended? Lê-Nguyên Hoang, Romain Beylerian, Julien Fageot, Louis Faucon, Aidan Jungo, Adrien Matissart, Nathaël Noguès (2024). https://openreview.net/forum?id=5WFzk0H27p
Thank you for your time and feedback! We appreciate your detailed suggestions for improvement, including the additional datasets that could strengthen our work. We provide tables at https://tinyurl.com/2ybsfs95
However, the experiment in Section 6 is less compelling … The reported prediction error should then correspond to the evaluation set.
In our revised setup, we use only a subset of users and items (an 80/20 train/test split) to create rankings for the given items. We switch to using a pre-trained model (bert-mini) as a starting point, and replace the pairwise majority with the Copeland method for consistency with our experiments. We then report the total per-user error, as well as how well a model trained on these rankings generalizes to new users, in Table 10.
We also provide a table of model generalization to new items (with train users), new users (on train items), and new users on new items. We binarize by picking the threshold with the maximum TPR such that FPR is capped at 10% (Table 11). We hope to spend further time exploring other pre-trained setups and architectures to ensure higher performance for test-set items and users, and to demonstrate SPA's performance advantages under other choices of threshold.
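For reference, a minimal sketch of this thresholding rule (hypothetical variable names, not our actual evaluation code):

```python
# Pick the score threshold with maximum TPR subject to FPR <= 10% (sketch).
import numpy as np
from sklearn.metrics import roc_curve

def pick_threshold(y_true, scores, max_fpr=0.10):
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    allowed = fpr <= max_fpr           # operating points within the FPR cap
    best = np.argmax(tpr * allowed)    # highest TPR among allowed points
    return thresholds[best]

# Toy usage: binary consensus labels and model scores (made up).
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.6])
print(pick_threshold(y_true, scores))
```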
Also, I am not sure I fully understood how SPA was adapted to Section 6.
We made an error in our description of the adaptations, which is now corrected. DICES uses annotators who rate an item toxic/non-toxic, but the level of toxicity beyond that is not clarified. In our new setup, in order to avoid excessive (and incorrect) levels of "ties", we only include preferences where there is a distinction (toxic vs. non-toxic). We then scale the weights of each preference pair to the same total weight, to make each item pair equally important.
How is "distinctions among conversations" measured?
“Distinctions between conversations” is used to specify the greatest number of tiers (greatest comparability). We have added text to our manuscript to make that distinction clear. There are several tiers created - the chosen threshold determines which are grouped under toxic/non-toxic.
The definition of label error (line 420, left column) introduces a variable t … Could the authors clarify?
t in this instance represents the majority threshold, i.e., the point at which a majority of users rate a conversation toxic. We have revised the text to make this clear as well.
Finally, I am confused about the use of the word "consensus" throughout the paper. In plain English, "consensus" seems synonymous with "quasi-unanimity", which suggests using τ close to 0. But this is not how it seems to be used, e.g. line 323, right column, or line 420, left column.
We agree that the term was not well-defined and should be clarified. In this setting, we use “consensus” to refer to the true majority vote of annotators.
Additionally, one interesting feature of pol.is … especially when the communities are not predefined, and have instead to be learned)?
This is a fascinating direction for potential future work! We take the question to mean that we want to find cases where multiple distinct communities agree (please correct us if needed!). One could imagine applying this approach hierarchically in such a setting - finding a selective preference aggregation for each community individually, then treating each of those tiered rankings as an individual “judge” and finding a preference aggregation across communities. If communities need to be learned, one could imagine assigning each individual a weight based on an estimated probability of belonging to each community (see our response to reviewer dGCS for further details on adding weights to our approach). SPA could also be used as part of the identification process; identifying groups of users who agree with each other in areas where overall disagreement is high (within a tier, for instance) can be used to help determine these communities.
I would also be interested in the authors' thoughts on how SPA could be adapted for highly sparse comparisons, as is the case in RLHF and recommendation AIs (see e.g. [3]).
In future work, we plan to explore different ways of adapting SPA for use in larger/sparser datasets. The most straightforward is to build on existing work using LLMs to make judgments (RLAIF). We could use multiple LLMs to make judgements to avoid amplifying model biases; one could also use lightweight models and “recruit” larger models only when preferences conflict to save on computational cost.
SPA could also be used alongside modeling. It could help limit uncertain responses from existing models, or help create models that better capture users' preferences. Comparisons within tiers could be used to train a base model for accuracy, and identified groups (see previous response) could be used to train personalized models with high accuracy for each group.
We hope our responses clarify your points! Thank you for your feedback and suggestions.
The paper introduces a new framework for ranking via preference aggregation while allowing for disagreement among the voters. Unlike many traditional methods that enforce a total order, the approach aims to construct a partial ranking, only comparing items where a sufficient majority agrees. The paper proposes a graph-based algorithm to construct ordering relationships over the items, and provides a correctness analysis and the asymptotic runtime. The authors also apply their method to small datasets.
Questions for Authors
- In the problem statement, what exactly is Comparisons(T)? Also, 𝒯 is undefined; I guess it is the family of tiered rankings.
- What do the colors in Figure 2 represent?
- What is the unit of measurement for the runtimes in Figure 4? Seconds?
Claims And Evidence
- The authors claim that the algorithm is fast and scalable; however, this is not convincing, especially from an experimental perspective. Indeed:
  - The datasets are way too small. The highest number of items is 175. Since the runtime seems to be quadratic (in the number of items), I don't see a reason not to apply it to thousands of items (also because other methods can handle such dimensions).
  - Often, only on the order of n log n pairwise comparisons are available (because that many comparisons are enough to rank n items), but the only case analyzed in the paper with a nonzero percentage of missing pairwise comparisons is the one with 7 items. It would be interesting to see the behaviour of the algorithm in other cases, i.e., with missing pairwise comparisons on bigger datasets.
  - SPA_0 and SPA_min very often do not produce actual rankings. Often the number of tiers is 1, which would mean that all items are incomparable and thus not very informative.
  - Figure 4 does not report the unit of measurement for the runtime. I think it is seconds, but 60-100 seconds for 500 items seems rather slow.
From a theoretical perspective, the runtime of O(n^2 * p) is reasonable.
Methods and Evaluation Criteria
See the issues above on the datasets and experiments.
Theoretical Claims
The theoretical analysis seems rigorous and sound. I checked Algorithm 1 and its runtime.
Experimental Design and Analysis
Besides the issues previously mentioned in Claims And Evidence, the rest is valid and clear.
Supplementary Material
I reviewed the runtime of Algorithm 1 and read the additional theoretical results.
Relation to Existing Literature
The distinction from ranking with ties is not very clear. The reason why the method would deem items incomparable rather than equivalent is not very well explained. In particular, items deemed incomparable might actually be evaluated and deemed equal by some individuals/evaluators. I think this distinction needs to be clarified theoretically. The empirical analysis would also benefit from direct comparisons with ranking-with-ties methods.
Missing Important References
Perhaps the Bradley-Terry model, which is commonly used to rank from pairwise comparisons, could be mentioned: Bradley, Ralph Allan; Terry, Milton E. (1952). "Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons". Biometrika. 39 (3/4): 324–345.
Other Strengths and Weaknesses
NA
Other Comments or Suggestions
Typos and similar:
- line 48R, the sentence seems incomplete
- appendix A2, issues with the references and an extra ,
Response: Thank you for your time and feedback! We include supplementary tables here: https://tinyurl.com/2ybsfs95
The authors claim that the algorithm is fast and scalable however this is not convincing … 60-100 seconds for 500 items seems rather slow.
As you’ve noted, our empirical runtimes were slower than our asymptotic analysis suggests. That is due to some inefficiencies that we have since resolved in our implementation. We provide a table of updated runtimes (in seconds) in Table 6 and have updated the figure in our manuscript. (Note that the naive method remains slow because it does not leverage the path approach’s tricks.)
We’re happy that you agree our proven asymptotic runtime of O(n^2*p) is reasonable. Indeed, many competing methods can be quite computationally intensive - for instance, Kemeny-Young is NP-Hard and must be solved with integer programming in many contexts. Our algorithm’s asymptotic scaling - which is linear in the number of input pairwise preferences - gives a strong characterization of our algorithm’s scalability, independent of any quirks in implementation or hardware.
Often, there are (of the order of) n*log n pairwise comparisons available…, for missing pairwise comparisons for bigger datasets.
We appreciate your point regarding the lack of missing pairwise comparisons. We provide Table 7 highlighting the number of comparisons at τ_max with 5% of all pairwise comparisons dropped. We also refer you to the values of Δ-sampling, which report the median change with 10% of samples dropped, equivalent to 10% missing comparisons.
SPA_0 and SPA_min very often do not produce actual rankings …
To clarify, we do feel these values (even when creating minimal tiers) highlight key information. At SPA_0, any tiers > 1 indicate unanimous preference among users. The dissent rate at which we see SPA_min can be informative as well - in certain datasets such as Sushi, this value is high (> 0.4), which reveals high levels of underlying disagreement. This information can also be found essentially for free, since Algorithm 2 gives these tiered rankings as part of the process of finding more granular rankings like SPA_max.
The datasets are way too small …
We appreciate the reviewer's suggestion. Existing larger datasets are generally sparse; we discuss future adaptations with reviewer sHCm that would enable SPA. We note that the number of potential pairwise preferences processed in existing scenarios is already substantial - over 7.5 million for DICES (Section 6) – demonstrating SPA's ability to handle a considerable volume of preference data. SPA’s design and theoretical guarantees (Section 4) ensure predictable behavior at any scale. Several applications remain at small values of n; reviewer dGCS points out potential applications of SPA in RLHF at n < 5, and we note applications that exist at the current scale. If the reviewer wants further clarity on larger scale data, we are willing to conduct experiments on synthetic data.
The distinction with ranking with ties is not very clear … this distinction theoretically.
We use 'incomparable' when no strict pairwise preference exists between items. Items in the same tier are thus 'incomparable', which can arise from various situations like cycles, evenly split preferences, or judges explicitly marking items as equivalent (see Fig 3). Therefore, 'incomparable' (in the same tier) does not necessarily mean 'equivalent'. For example, if ⅔ of judges think A > B, ⅔ think B > C, and ⅔ think C > A, that does not necessarily mean A = B = C. Perhaps 0 judges think any of those items are equivalent. Users can state equivalence, and our ranking considers it. However, the distinction between abstention/disagreement and asserted equality is a key benefit within our tiers.
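For concreteness, one profile (assumed purely for illustration) that produces exactly this pattern, with three judges who each give a strict total order and never state an equivalence:

```python
# Classic Condorcet-cycle profile: every pairwise majority is 2/3, yet no judge
# ever marks two items as equivalent.
from itertools import combinations

judges = [["A", "B", "C"],   # judge 1: A > B > C
          ["B", "C", "A"],   # judge 2: B > C > A
          ["C", "A", "B"]]   # judge 3: C > A > B

for x, y in combinations("ABC", 2):
    frac = sum(j.index(x) < j.index(y) for j in judges) / len(judges)
    print(f"{x} over {y}: {frac:.2f}, {y} over {x}: {1 - frac:.2f}")
# Each of A > B, B > C, and C > A is backed by a 2/3 majority, so the three
# items land in one tier as incomparable, not as equivalent.
```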
Also the empirical analysis should benefit from a direct comparisons with ranking with ties methods.
Regarding comparison to tie-handling methods: standard baselines like Borda or MC4 do produce ties when scores are equivalent, but by design these instances are unlikely. Our mechanism - abstaining based on exceeding a pairwise dissent threshold - differs from methods inducing ties via score equivalence.
What exactly is Comparisons(T)? Also 𝒯 is undefined; I guess it is the family of tiered rankings.
We define Comparisons(T) in Section 2, although we acknowledge formatting issues in our submission that we have now fixed. Please let us know if you would like further clarification on our definition of tiered rankings (𝒯) in Definition 2.1.
What the colors in figure 2 represent?
The colors denote items in the same tier.
Perhaps the Bradley-Terry model….
We have added a reference, and include an additional experiment in our response to reviewer mXsg.
We hope these clarifications address your concerns! We are happy to resolve any further questions.
Thanks for your reply. However, several of my concerns have not been addressed.
- The method has not been shown to work with n log n pairwise comparisons (10% missing comparisons is almost like having all of them). Collecting n^2 comparisons is almost always infeasible; indeed, the authors admit that 'Existing larger datasets are generally sparse', but there is no test on these real datasets.
- the datasets are way too small, with way too few items
- Minor, but I feel some baselines against methods with ties are needed. Bradley-Terry has extensions with ties ([Rao and Kupper, 1967] - Rao, P. and Kupper, L. L. Ties in paired-comparison experiments: A generalization of the Bradley-Terry model. Journal of the American Statistical Association, 62(317): 194–204, 1967.), but one could also simply declare ties when two scores produced by a method differ by less than a certain threshold. You could easily make tiers based on the ties.
- Furthermore, Reviewer mXsg is on point with 'The experiments do not convincingly show that selective aggregation improves decision quality; instead, they often merely illustrate that the method can “abstain” from making comparisons—a trivial consequence of the design.' Did you consider datasets with a ground truth ranking, besides the observed pairwise comparisons to rank from?
Thanks! We've included some responses below, but we'd like to start by addressing a broader misunderstanding.
Collecting n^2 is almost always unavailable… but there is no test on these real datasets
Our work focuses on a large set of real-world tasks like college rankings, subjective recommendations, and toxicity prediction (Sec 5). These tasks cover major applications of preference aggregation and benchmark our approach across domains, preference types, disagreement, missingness, and noise. The datasets in our paper are smaller because they are cases where we can gather sufficient pairwise comparisons to make reliable claims.
It seems like you are concerned we did not test our method on a sparse dataset with millions of items and/or users. This is not because our method doesn’t scale but because we know how it would behave given the degree of sparsity. In such tasks, where we are missing so many preferences, every selective ranking would have a single tier. To be clear, this is a feature, not a bug. There is no collective preference claim that we can make that is robust.
We want to be clear that we see these problems as important, but out of scope. In this case, we can extend our paradigm to handle them in several ways (e.g., by imputing missing preferences from a plausible distribution and constructing selective rankings). We see them as out of scope since they require fundamentally different approaches and detract from the fact that we need a different paradigm in the first place. Here, it is important to establish the foundations of the method and highlight that it works correctly for important problems.
The method has not been shown to work with n log n pairwise comparisons
There is no reason that it would not. We can scale up the synthetic datasets.
If you meant collecting ~n log(n) comparisons per judge (assuming transitivity), our approach is fully compatible via preprocessing, although this assumption may not hold in the real-world.
10% missing comparisons is almost like having all of them.
We can include ablation studies where we drop more comparisons (or the threshold fraction of dropped preferences at which we obtain a single tier). Again, this is a feature not a bug.
..baselines against methods with ties are needed.
We note that the methods do have ties. The broader issue is that they do not arise often. We are happy to include these in the table and discuss them in a revision.
Did you consider datasets with a ground truth ranking? You could easily make tiers based on the ties.
Thanks for bringing this up. We'd like to use this as an opportunity to address an important point: there is no ground truth for many tasks where we aggregate preferences - i.e., what is the ground truth when we vote, rank colleges, or rank sushi? In such tasks, we'd view the ground truth as the set of individual preferences. Standard aggregation will distort the ground truth as a result of reconciliation. In contrast, selective aggregation returns "as much ground truth as possible".
Our algorithm identifies exactly the subset of items on which sufficiently many users agree to disagree. We want to highlight that this behavior is the result of deliberate algorithm design. Thresholding doesn't inherently capture preference cycles or the depth of disagreement revealed by dissent levels.
We do want to note that we have cases where we show that we return a ground-truth ranking. Specifically, in Section 6, we return a ranking that minimizes per-user disagreement relative to other methods and generalizes to new users (see our response to sHCm). See our response to reviewer dGCS for recovery of rankings under user noise.
The experiments do not convincingly show that selective aggregation improves decision quality; instead, they often merely illustrate that the method can “abstain” from making comparisons—a trivial consequence of the design.
We agree that the discussion points do not clearly articulate this. We have been revising the paper and this should come across far more clearly now.
Our experiments show that selective aggregation leads to better decisions because:
- It only highlights where people agree
- It is robust by design, when other methods are brittle
In comparison, existing methods lead to "bad decisions" because:
- They overrule users (because they aim to return complete orders)
- Their output changes drastically under different conditions
The first point shows that existing methods will not lead to "good decisions" because they inherently overrule users. The second shows that – even in settings where they are willing to tolerate disagreement – existing methods may still lead to bad decisions because their output is sensitive to the realities of preference data and aggregation. Specifically, we show that their output will change dramatically when we drop a little data or add a little noise. These are all realistic scenarios that would lead existing methods to fail. However, our method is robust by design.
This paper introduces selective preference aggregation (SPA), a framework that aggregates ordinal preferences into partial orders (tiered rankings) to avoid arbitrating disagreements. The core contributions include a graph-based algorithm, theoretical guarantees (e.g., stability under missing data), and empirical validation across datasets like NBA rankings and toxicity detection. SPA demonstrates improved transparency and robustness compared to traditional methods (e.g., Borda, Kemeny).
Questions for Authors
- In RLHF settings where the number of items n is typically small (<5), SPA’s tiered rankings may collapse to trivial solutions (e.g., all items in a single tier due to insufficient data). To validate SPA’s utility in AI alignment tasks, could you provide ablation studies or theoretical analysis demonstrating its behavior for n < 5? This would clarify whether SPA’s advantages (e.g., robustness, transparency) persist in the small-scale preference comparisons characteristic of RLHF.
- SPA assumes uniform weighting of user preferences. However, in real-world applications (e.g., expert-driven labeling), users may have heterogeneous weights (e.g., experts' preferences matter more). Can SPA be directly applied in such a scenario? And can you analyze how this affects tiered rankings and the guarantees (e.g., stability under missing data)? This would enhance SPA's applicability to scenarios where users have varying levels of credibility or expertise.
Claims And Evidence
Most claims are supported by evidence, but critical gaps in statistical rigor and baseline comparisons weaken their persuasiveness.
- SPA's 0% inversion rate under Δ-Gaming (Table 1) supports robustness claims, but the absence of p-values or confidence intervals undermines statistical significance.
- SPA's 18.4% label error in toxicity detection (Figure 5) is compelling, but the expert baseline's high error (43%) limits its interpretability.
- RLHF applicability: The claim that SPA generalizes to RLHF with small n (n < 5) lacks empirical support.
Methods and Evaluation Criteria
The proposed methods are theoretically sound, but evaluation criteria and baseline selections are outdated and contextually limited.
Theoretical Claims
Key theorems are valid under stated assumptions but require clarification to resolve inconsistencies.
- Proposition 4.2: Stability under missing data relies on imputing preferences as indifference (=0), which may not hold for non-random missingness.
Experimental Design and Analysis
Experimental designs are generally rigorous but lack critical controls and scalability tests.
- SPA’s linear runtime is untested on large n (>10^4), limiting real-world applicability.
- Robustness claims (e.g., Δ-Gaming) lack significance tests, making it hard to validate improvements.
Supplementary Material
Yes. I have reviewed Appendix D to understand the experiment sections (Section 5/6).
Relation to Existing Literature
The paper situates itself within social choice theory and machine learning.
Missing Important References
N/A
Other Strengths and Weaknesses
The paper’s originality and practical impact are strong, but structural and clarity issues hinder accessibility.
Weaknesses
- Related work is embedded in the Introduction, lacking a dedicated section. This prevents readers from gaining a systematic understanding of the field’s context.
- Non-essential theorems (e.g., Section 4) and complex proofs in appendices impede readability.
Other Comments or Suggestions
Typos
- Missing notation of "Δ-Gaming" in Table 1. It should be "Δ-Adversarial", based on the caption.
- L349, "We randomly sample or flip" should be "We randomly drop or flip"
We thank you for your response! We provide tables at https://tinyurl.com/2ybsfs95
In RLHF settings where the number of items n is typically small (<5) …This would clarify whether SPA’s advantages (e.g., robustness, transparency) persist in the small-scale preference comparisons characteristic of RLHF.
Certainly! SPA's ranking robustness primarily hinges on the number of judges (p) and their agreement (1 - τ). High consensus is needed for admitted pairwise preferences. More judges clarify patterns and enable rankings at lower dissent. Low granularity suggests the need for more judges. Deep disagreement might warrant other methods (like expert input), highlighting SPA's transparency. (See our responses to reviewers mXsg/FgRQ for details.) We provide an example in Table 9. Increasing p (10 -> 30) allows more tiers (3 -> 4) at lower dissent (0.5 -> 0.333). Limited rankings (dissent < 0.5) can signal the need for more judges and identify items with strong agreement.
SPA assumes uniform weighting of user preferences. However, in real-world applications (e.g., expert-driven labeling), users may have heterogeneous weights (e.g., experts' preferences matter more). Can SPA be directly applied in such a scenario? And can you analyze how this affects tiered rankings and guarantees (e.g., stability under missing data)?
This is possible. The most trivial way is to include expert preferences as equivalent to multiple judges. We can also use arbitrary nonnegative weights on each judge. The guarantees then become relative to not overruling a certain sum over weighted preferences of judges.
Just set dissent(a ≻ b) = Σ_j w_j · 1[judge j prefers b over a], where w_j is the weight of a given judge, renormalized so that Σ_j w_j = 1. We can maintain many of the guarantees premised on the disagreement constraint - for example, stability under missing data. Some theory does change; for example, the disagreement measure is no longer necessarily linear in p.
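A minimal sketch of this weighted variant (assumed data structures, not our implementation):

```python
# Weighted dissent: the normalized sum of weights of judges who prefer b over a.
def weighted_dissent(a, b, judge_prefs, weights):
    """judge_prefs[j][(a, b)]: +1 if judge j prefers a over b, -1 if b over a, 0 if indifferent."""
    total = sum(weights.values())
    against = sum(w for j, w in weights.items() if judge_prefs[j].get((a, b), 0) < 0)
    return against / total  # compare to tau exactly as in the unweighted case

# Giving an expert weight 3 is equivalent to counting them as three identical judges.
prefs = {"expert": {("A", "B"): +1}, "u1": {("A", "B"): -1}, "u2": {("A", "B"): -1}}
w = {"expert": 3.0, "u1": 1.0, "u2": 1.0}
print(weighted_dissent("A", "B", prefs, w))  # 0.4, so "A over B" is kept for tau >= 0.4
```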
Key theorems are valid under stated assumptions but require clarification to resolve inconsistencies …may not hold for non-random missingness.
Proposition 4.2 does hold even for adversarial, non-random missingness - that's one of the more fascinating ramifications of our approach, actually. The proposition holds because indifference is a conservative assumption: it contributes at least as much disagreement with any possible comparison as any other imputed value would. We never make comparisons that would be invalid given the true values, regardless of the missingness mechanism.
It may be possible to model missing preferences to extend coverage. See our response to reviewer sHCm (future work).
SPA’s linear runtime is untested on large n (>10^4), limiting real-world applicability.
We appreciate your prior note that there are applications for n < 5; please also see our responses to FgRQ and sHCm, which address the scale of experiments and use cases. We would like to note that in many domains (e.g., fine-tuning), a limited number of examples (50-100) is enough to improve performance. From Meta: “A general trend we’ve seen is that quality is more important than quantity … documentation suggests even a 50- to 100-example dataset can potentially make a difference.”
SPA’s 0% inversion rate under Δ-Gaming (Table 1) supports robustness claims, but the absence of p-values or confidence intervals undermines statistical significance. Robustness claims (e.g., Δ-Gaming) lack significance tests, making it hard to validate improvements.
We would like to clarify the nature of the Δ-Gaming (Δ-Adversarial) metric reported in Table 1. SPA exhibited zero inversions across 100 simulations, each with 10% adversarial preference flips. This 0% reflects the maximum observed inversion rate under the worst-case condition, demonstrating strong stability against manipulation. As such, CIs/p-values on this reported maximum are not directly applicable. We refer you to Appendix C for guarantees on stability and robustness.
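To make the computation concrete, here is a sketch of how this metric can be read (the `aggregate` callable and the preference encoding are hypothetical stand-ins, not our actual experiment code): flip 10% of pairwise preferences, re-aggregate, record the fraction of item pairs whose strict order inverted, and take the maximum over simulations.

```python
# Sketch of a worst-case inversion-rate metric under adversarial preference flips.
import random
from itertools import combinations

def max_inversion_rate(prefs, aggregate, items, n_sims=100, flip_frac=0.10, seed=0):
    """prefs: list of (winner, loser) pairs; aggregate: prefs -> {item: rank position}."""
    rng = random.Random(seed)
    base = aggregate(prefs)
    n_pairs = len(items) * (len(items) - 1) // 2
    worst = 0.0
    for _ in range(n_sims):
        corrupted = list(prefs)
        for i in rng.sample(range(len(corrupted)), int(flip_frac * len(corrupted))):
            a, b = corrupted[i]
            corrupted[i] = (b, a)  # adversarial flip of one pairwise preference
        new = aggregate(corrupted)
        inverted = sum((base[a] < base[b]) != (new[a] < new[b])
                       for a, b in combinations(items, 2)
                       if base[a] != base[b] and new[a] != new[b])
        worst = max(worst, inverted / max(1, n_pairs))
    return worst
```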
The proposed methods are theoretically sound, but evaluation criteria and baseline selections are outdated and contextually limited.
We wish to clarify our intent with the choice of metrics. Our chosen evaluation criteria (e.g., Disagreement Rate, Abstention Rate, Δ-Adversarial robustness) directly quantify the core characteristics of SPA – its ability to manage the trade-off between coverage and dissent, and the resulting robustness gained through abstention. If the reviewer feels specific metrics would enhance the evaluation, we are open to revising/adding these where possible.
We hope our responses clarify the points raised and are happy to address any remaining questions.
The paper received mixed reviews. The discussion led to the feeling that the main issue is whether one finds the paper well-motivated/significant or not. As even the negative reviewers wrote that the paper is mathematically detailed, I am willing to believe that if the paper raises controversies, it might actually be a valuable addition to the conference, and may set some paths for future research.