PaperHub
Overall score: 7.0/10 · Poster · 3 reviewers
Ratings: 4, 5, 4 (mean 4.0; min 4, max 5, std 0.5)
Novelty: 3.0 · Quality: 3.3 · Clarity: 3.3 · Significance: 2.7
NeurIPS 2025

Post Hoc Regression Refinement via Pairwise Rankings

Submitted: 2025-05-11 · Updated: 2025-10-29

Abstract

Keywords
prediction refinement, pairwise ranking, molecular property prediction

Reviews and Discussion

Review
Rating: 4

This work introduces RankRefine, a model-agnostic, post-hoc method to refine regression predictions by incorporating pairwise ranking information from human experts or general-purpose LLMs, to overcome the limitations of deep-learning regressors when data is scarce. Experiments on synthetic and real-world benchmarks demonstrate the effectiveness of the proposed method.

Strengths and Weaknesses

Strength:

  • This method is model-agnostic and plug-and-play. This means there is no need to post-train the model, which reduces implementation efforts.
  • The paper is well-motivated and the studied scenario is meaningful: label scarcity issues do exist in real-world experiments across various scientific domains and tasks.
  • The experimental section demonstrates the effectiveness and robustness of the proposed method and provides some insights, e.g., that moderate ranker accuracies already yield non-marginal performance gains.

Weakness:

The reference set is a core component of the method's effectiveness, but the paper does not discuss it sufficiently (only the effect of the reference set size is analyzed). The paper uses random sampling for simplicity, which is reasonable, but a deeper analysis of the reference set's quality and sampling strategy could empirically strengthen the work. Here are some discussion points that might be interesting:

  • The sampling strategy: Would a more advanced sampling strategy improve the results?
  • The distribution misalignment between reference and test: Would performance degrade if the reference and test distributions differ largely? The training-set (label, molecular structure, etc.) distribution could differ largely from the test distribution. This is common in practical scientific scenarios. Considering this in this paper could improve practicality.
  • The diversity of the selected reference data points: How significant is it to ensure
      1. label diversity, meaning the reference set spans the full range of property values (avoiding bias toward extreme or median values), or
      2. molecular scaffold diversity, meaning the reference set covers various molecular scaffolds (not a single structure type)?

Questions

I have listed the three questions I am interested in under the weaknesses section above.

Limitations

Yes.

Final Justification

During the rebuttal, the authors addressed several of the concerns I raised through added experiments on alternative sampling strategies and distribution shift. I will keep my score.

Formatting Issues

I have not noticed any major formatting issues in this paper.

Author Response

We thank the reviewer for their thoughtful and constructive feedback. We address the concerns regarding the reference set’s sampling strategy, distribution alignment, and diversity with new empirical analyses and clarifications below.

Q1: Will a more advanced sampling strategy improve the results?

We agree that the reference set is central to RankRefine's effectiveness. While we used random sampling for simplicity, we now include additional experiments to study the impact of biased sampling. We use synthetic data where the reference and query labels are drawn uniformly within fixed ranges. We simulate predictions by injecting zero-mean Gaussian noise into the query labels.

As shown in Table 1, we systematically restrict the reference set to increasingly narrow subsets of the query labels' range to simulate non-uniform sampling (e.g., bias toward certain property values). Even under severe reference bias (e.g., 70% range coverage centered away from query), RankRefine maintains strong performance. For example, with 60% ranker accuracy and 70% reference set bias, we still achieve a 12.3% reduction in MAE over the base regressor.

The fact that 0% reference bias achieves the best performance indicates that sampling from the full label range (uniform, unbiased sampling) is preferable; however, even clustered references retain strong utility. While more advanced strategies such as adaptive, query-aware sampling (e.g., using the base regressor's prediction to guide selection) could further improve performance, we leave this as a promising direction for future work.
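For illustration, a minimal sketch of how such an experiment could be set up (this is our own reading of the protocol, not the authors' code: "bias" is interpreted as the fraction of the label range excluded from the reference set, and the ranker is simulated by flipping comparisons at random):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_biased_references(bias, n_refs=20, lo=0.0, hi=1.0):
    """Draw reference labels from a sub-range covering only (1 - bias) of the full label range."""
    width = (hi - lo) * (1.0 - bias)
    start = rng.uniform(lo, hi - width)                 # the sub-range may sit far from many query labels
    return rng.uniform(start, start + width, size=n_refs)

def simulate_ranker(y_query, refs, accuracy):
    """Pairwise comparisons that are correct with probability `accuracy` (True = query above reference)."""
    correct = y_query > refs
    flip = rng.uniform(size=refs.size) > accuracy
    return np.where(flip, ~correct, correct)

refs = sample_biased_references(bias=0.7)               # clustered reference set (70% of the range excluded)
comparisons = simulate_ranker(y_query=0.5, refs=refs, accuracy=0.6)
```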

Table 1. RankRefine performance, measured in β ($\beta = \frac{\mathrm{MAE}_\text{refined}}{\mathrm{MAE}_\text{non-refined}}$, lower is better), with non-uniform sampling of the reference set. We restrict the reference set range to increasingly narrow subsets of the ground truth label range, simulating clustered or biased samples. Best performance is achieved when the reference set spans the entire range of the ground truth labels, but clustered reference sets still provide substantial regression improvements.

| Ranker Accuracy | 50% | 55% | 60% | 65% | 70% |
| --- | --- | --- | --- | --- | --- |
| Reference Set Bias (RB) = 0% range | 0.8828 | 0.8706 | 0.8343 | 0.8443 | 0.8598 |
| RB = 10% range | 0.9034 | 0.8731 | 0.8493 | 0.8529 | 0.8768 |
| RB = 20% range | 0.9078 | 0.8753 | 0.8366 | 0.8694 | 0.9002 |
| RB = 30% range | 0.9076 | 0.8701 | 0.8439 | 0.8733 | 0.9280 |
| RB = 40% range | 0.9143 | 0.8767 | 0.8618 | 0.8824 | 0.9414 |
| RB = 50% range | 0.9253 | 0.8786 | 0.8640 | 0.8769 | 0.9480 |
| RB = 60% range | 0.9349 | 0.8858 | 0.8712 | 0.8929 | 0.9546 |
| RB = 70% range | 0.9385 | 0.8919 | 0.8770 | 0.8957 | 0.9597 |
| RB = 80% range | 0.9551 | 0.8984 | 0.8798 | 0.8943 | 0.9558 |
| RB = 90% range | 0.9664 | 0.9042 | 0.8842 | 0.9035 | 0.9631 |

Q2: Would performance degrade if the reference and test distribution differ largely?

To test the impact of distribution shift between reference and query, we created two challenging scenarios in which the reference labels and query labels are drawn uniformly from different value ranges:

  • Disjoint distributions: reference labels in [-1, 0] or [1, 2], query labels in [0, 1]. We averaged the results of the two cases, that is, between {ref = [-1, 0], query = [0, 1]} and {ref = [1, 2], query = [0, 1]}.
  • Partial overlap: 50% overlap between reference and query label range
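A minimal sketch of these two scenarios (the exact placement of the partially overlapping range and the number of references are our own illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_refs, n_queries = 20, 1000

query = rng.uniform(0.0, 1.0, size=n_queries)            # query labels in [0, 1]

# Disjoint: reference labels drawn entirely outside the query range (two cases, results averaged).
refs_disjoint_low = rng.uniform(-1.0, 0.0, size=n_refs)
refs_disjoint_high = rng.uniform(1.0, 2.0, size=n_refs)

# Partial overlap: the reference range shares half of its width with the query range.
refs_partial = rng.uniform(0.5, 1.5, size=n_refs)
```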

In both cases, we fix the reference label range width and vary the ranker accuracy. Despite the mismatch, RankRefine still improves the base regressor when the ranker is moderately accurate (≥65% in the disjoint case, ≥55% in the partial overlap case). These findings suggest that RankRefine is robust to the distribution shifts that are common in real-world settings with scarce or heterogeneous data.

Table 2. β for disjoint and partially overlapping reference sets.

| Ranker Accuracy | 50% | 55% | 60% | 65% | 70% |
| --- | --- | --- | --- | --- | --- |
| Disjoint | 1.2906 | 1.1629 | 1.0170 | 0.9437 | 0.9219 |
| Partial Overlap | 1.0160 | 0.9465 | 0.8700 | 0.8607 | 0.8898 |

Q3: How does label and scaffold diversity in the reference set affect performance?

Label diversity plays a key role. Table 1 demonstrates that reference sets spanning the full range of ground-truth labels (0% reference bias) consistently achieve the best results, underscoring the importance of covering the full spectrum of ground-truth labels.

We agree that scaffold diversity is also an important factor in molecular applications. However, we did not explicitly control for it in our experiments, as our goal is to present RankRefine as a general-purpose method applicable across domains. Measuring or enforcing scaffold (i.e., design) diversity is specific to molecular data and does not easily transfer to other domains where structural diversity is harder to define. We view scaffold-aware reference selection as an important, domain-specific extension of our work and leave it for future work.

Comment

I thank the authors for the clarifications and added experiments on alternative sampling strategies and distribution shift, which have addressed several of my concerns. I understand that some of my questions (e.g., scaffold-aware and adaptive sampling) can be left as future work. Taking all of this into consideration, I will keep my score.

Review
Rating: 5

This paper proposes RankRefine: a method for improving the estimates of uncertainty-aware regressors, for scientific tasks for example, by utilizing pairwise rankings generated either by LLMs or humans in the loop. Given a query, RankRefine first obtains an estimate of the mean and variance using a regressor. Then, the query is ranked against other known references by asking either a human expert or an LLM. These rankings are then used to generate a rank-based estimate of the mean and variance. Finally, the two estimates are combined with inverse-variance weighting to produce an improved estimate with lower variance. Under some assumptions, the authors show that this improved estimate will have lower mean absolute error. Experiments show that these improvements hold up in practice.
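For readers unfamiliar with inverse-variance weighting, a minimal sketch of the combination step described above (variable names are illustrative, not taken from the paper):

```python
import numpy as np

def inverse_variance_combine(mu_reg, var_reg, mu_rank, var_rank):
    """Fuse a regressor estimate and a rank-based estimate.

    Each estimate is treated as an independent Gaussian; the fused mean is the
    precision-weighted average and the fused variance is the inverse of the
    summed precisions, so it is never larger than either input variance.
    """
    precision = 1.0 / var_reg + 1.0 / var_rank
    mu = (mu_reg / var_reg + mu_rank / var_rank) / precision
    var = 1.0 / precision
    return mu, var

# Example: regressor says 2.0 with variance 0.25, rank-based estimate says 1.4 with variance 0.09.
mu, var = inverse_variance_combine(2.0, 0.25, 1.4, 0.09)
print(round(mu, 3), round(float(np.sqrt(var)), 3))  # fused mean is pulled toward the lower-variance estimate
```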

Strengths and Weaknesses

The quality of this work is excellent. The figures are clear, all claims are justified with assumptions stated, and the authors are honest about the limitations of their work. That said, I do feel that there are a lot of limitations, in particular the assumption that the regressor is unbiased. It would be nice to see a bit more detail on this.

The clarity is also excellent. I had no issues understanding the method and feel like enough detail was provided to implement it myself.

I would rate the significance as fair. The experimental results show improvements when using RankRefine; however, this involves 20 pairwise rankings per query, which could be expensive for a human in practice. The idea of using an LLM is promising and shows some good results, although it's not clear how much of this is due to memorization of the LLM's training data.

The originality is good. Previous work has explored using pairwise rankings to generate predictions, and LLMs have shown some promise in answering scientific ranking questions. Combining these two ideas to create a more accurate estimate is fairly novel.

Questions

  1. How bad is a biased regressor?

In many practical applications I can imagine that the regressor might systematically over- or under-predict by a constant factor. If this were the case, how would the performance of the proposed estimator deteriorate? You could potentially make a table with the performance as a function of different degrees of over- or under-prediction.

  2. Does the LLM simply memorize some of the ranking answers?

It seems plausible that the LLM might simply memorize whether a particular molecule is more or less soluble than another, as an example. Is there evidence that the LLM can successfully reason through the question to come to the right conclusion? I'll leave my recommendation at borderline pending the authors' response to these first two questions.

  3. Have you considered non-uniform sampling of the reference points?

It seems the reference points are sampled uniformly randomly from the labeled data. Could the variance of the rank-based estimate be reduced by a smarter choice of references? Intuitively I would imagine that references that aren't clustered together would be better.

Limitations

Yes.

Final Justification

The authors have addressed my main concerns in their rebuttal. The additional ablation demonstrates that their method maintains performance improvements even with a biased estimator. The use of a proprietary dataset mitigates concerns about GPT memorization, and the authors have also provided satisfactory responses to my minor questions. For these reasons, I am increasing my recommendation to accept.

Formatting Issues

None.

Author Response

We thank the reviewer for the thoughtful and constructive feedback. We are encouraged by the positive remarks on the quality, clarity, and originality of our submission. Below, we address each of the reviewer’s concerns in detail and provide new empirical results to support our claims.

Q1: How bad is a biased regressor? In many practical applications, the regressor may systematically under or overestimate.

We agree that this is an important, practical concern. Our experiments on the TDC ADMET datasets (Figure 3 in the main paper) demonstrate that RankRefine works in real-world scenarios. We have also conducted new experiments where we systematically inject constant bias into an otherwise unbiased regressor. Specifically:

  1. We start from synthetic predictions obtained by injecting zero-mean Gaussian noise into arbitrary labels, yielding a mean absolute error (MAE) close to one standard deviation of the ground truth labels.
  2. We then add a constant offset, increasing in magnitude from 0% to 70% of the label standard deviation (SD).
  3. We report the refined-to-original MAE ratio ($\beta = \frac{\mathrm{MAE}_\text{refined}}{\mathrm{MAE}_\text{non-refined}}$, lower is better) for varying ranker accuracies.

The results (see Table 1) demonstrate that RankRefine remains effective even under substantial regressor bias. For example, with a 60% bias and only 60% ranker accuracy, RankRefine achieves β = 0.9649, i.e., a 3.51% gain.
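For illustration, a minimal sketch of the bias-injection setup described above (`rank_refine` in the comment is a hypothetical stand-in for the refinement step, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)

y = rng.uniform(0.0, 1.0, size=1000)                  # synthetic ground-truth labels
sd = y.std()
y_pred = y + rng.normal(0.0, sd, size=y.size)         # unbiased noisy predictions, MAE close to 1 SD

for bias_frac in (0.0, 0.2, 0.4, 0.6):
    y_biased = y_pred + bias_frac * sd                # inject a constant offset
    mae_base = np.abs(y_biased - y).mean()
    # beta = np.abs(rank_refine(y_biased) - y).mean() / mae_base   # hypothetical refinement call
    print(f"bias = {bias_frac:.0%} of SD, base MAE = {mae_base:.3f}")
```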

Table 1. RankRefine performance, measured in β, on a biased regressor. We inject constant bias into an otherwise unbiased regressor, where the magnitude of the offset is measured as a percentage of the ground truth labels standard deviation. RankRefine offers improvement even when the regressor is significantly biased (e.g., 60% of the SD of the ground truth labels)

| Ranker Accuracy | 50% | 55% | 60% | 65% | 70% |
| --- | --- | --- | --- | --- | --- |
| Bias (B) = 0% SD | 0.8911 | 0.8436 | 0.8694 | 0.8668 | 0.8949 |
| B = 10% SD | 0.8995 | 0.8800 | 0.8623 | 0.8602 | 0.8969 |
| B = 20% SD | 0.9084 | 0.8869 | 0.8704 | 0.8704 | 0.9047 |
| B = 30% SD | 0.9224 | 0.8997 | 0.8844 | 0.8861 | 0.9192 |
| B = 40% SD | 0.9409 | 0.9172 | 0.9059 | 0.9071 | 0.9395 |
| B = 50% SD | 0.9645 | 0.9403 | 0.9329 | 0.9343 | 0.9654 |
| B = 60% SD | 0.9942 | 0.9702 | 0.9649 | 0.9670 | 0.9967 |
| B = 70% SD | 1.0291 | 1.0062 | 1.0022 | 1.0049 | 1.0326 |

Q2: Does the LLM simply memorize some of the ranking answers?

This is a valid concern. To test for potential memorization, we used a proprietary compound activity dataset (75 compounds; IC50 labels) collected by our collaborators (experimental biochemists). This dataset is unpublished and therefore not part of any public training corpora, ruling out direct memorization. ChatGPT-4o achieved 60.14% pairwise ranking accuracy on this private dataset. This result indicates generalization rather than memorization.

Q3: Have you considered non-uniform sampling of the reference points? Intuitively, references that aren't clustered together should be better.

We agree with the reviewer’s intuition that a clustered reference set would reduce the performance gains. We evaluated the effect of non-uniformity on synthetic data (similar to Q1) by restricting the reference set range to increasingly narrow subsets of the ground truth label range, simulating clustered or biased samples.

As shown in Table 2, RankRefine’s performance remains stable as long as the reference set overlaps with the query label range. Even with severe bias (e.g., 60% range), the method still improves the regressor estimate. For instance, with 60% ranker accuracy and 70% reference bias, RankRefine achieves a 12.30% MAE reduction (β = 0.8770). We also confirm, as mentioned by the reviewer, that RankRefine performs best when the reference set labels are spread across the full range of the ground truth labels and not clustered.

Table 2. RankRefine performance, measured in β, with non-uniform sampling of the reference set. We restrict the reference set range to increasingly narrow subsets of the ground truth label range, simulating clustered or biased samples. Best performance is achieved when the reference set spans the entire range of the ground truth labels, but clustered reference sets still provide substantial regression improvements.

| Ranker Accuracy | 50% | 55% | 60% | 65% | 70% |
| --- | --- | --- | --- | --- | --- |
| Reference Set Bias (RB) = 0% range | 0.8828 | 0.8706 | 0.8343 | 0.8443 | 0.8598 |
| RB = 10% range | 0.9034 | 0.8731 | 0.8493 | 0.8529 | 0.8768 |
| RB = 20% range | 0.9078 | 0.8753 | 0.8366 | 0.8694 | 0.9002 |
| RB = 30% range | 0.9076 | 0.8701 | 0.8439 | 0.8733 | 0.9280 |
| RB = 40% range | 0.9143 | 0.8767 | 0.8618 | 0.8824 | 0.9414 |
| RB = 50% range | 0.9253 | 0.8786 | 0.8640 | 0.8769 | 0.9480 |
| RB = 60% range | 0.9349 | 0.8858 | 0.8712 | 0.8929 | 0.9546 |
| RB = 70% range | 0.9385 | 0.8919 | 0.8770 | 0.8957 | 0.9597 |
| RB = 80% range | 0.9551 | 0.8984 | 0.8798 | 0.8943 | 0.9558 |
| RB = 90% range | 0.9664 | 0.9042 | 0.8842 | 0.9035 | 0.9631 |

Q4: Pairwise rankings could be expensive for a human in practice.

While 20 comparisons per query may seem expensive, we note that many scientific domains already rely on human expert ranking (e.g., drug design). Our method is also compatible with low-cost and parallelizable rankers, such as LLMs (which, as we have shown, are not simply memorizing answers) or crowd workers, offering a promising tradeoff between accuracy and human effort.
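For concreteness, a pairwise ranking query to an LLM could look like the following hypothetical prompt template (not the exact prompt used in the paper):

```python
PROMPT_TEMPLATE = (
    "You are an expert medicinal chemist.\n"
    "Which of the following two compounds has the higher aqueous solubility?\n"
    "Compound A (SMILES): {smiles_a}\n"
    "Compound B (SMILES): {smiles_b}\n"
    "Answer with a single letter: A or B."
)

def build_ranking_prompt(smiles_a: str, smiles_b: str) -> str:
    """Fill in the two candidates; the returned string is sent to the LLM as one pairwise query."""
    return PROMPT_TEMPLATE.format(smiles_a=smiles_a, smiles_b=smiles_b)
```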

Comment

I appreciate the authors response to my questions and concerns, particularly the biased estimator and non-uniform sampling results. That said, could the authors explain why the performance is noticeably worse with a ranker accuracy of 70%? Presumably beta should decrease monotonically with higher ranker accuracy, at least in the case with 0% bias.

Comment

Thank you for the question. The reviewer is correct that performance, measured by β, should ideally improve (i.e., decrease) with increasing ranker accuracy, especially under 0% bias. The dip at 70% (and higher) accuracy is due to bias in the rank-based estimates in our implementation.

Specifically, we solve Eq. (2) using an optimizer from SciPy. Post-submission, we found that this optimizer can produce estimates outside the range of the reference set, particularly when ranker accuracy is high and pairwise comparison contradictions are rare. This extrapolation leads to degraded final estimates.

To address this, we can apply a simple clipping heuristic that constrains the rank-based estimate to remain within the reference set's range, that is, $\hat{y}^\text{*rank}_0 \leftarrow \max(\min(\hat{y}^\text{*rank}_0, UB), LB)$, where $UB = \max(\mathbb{A} \cup \mathbb{B})$ and $LB = \min(\mathbb{A} \cup \mathbb{B})$; $\mathbb{A}$ and $\mathbb{B}$ (from Eq. 2) contain the references ranked below and above $x_0$, respectively. This intervention restores the expected monotonicity, even under biased (clustered) reference sets. In Table 3 below, we report results under the same setup as Table 2 (Q3), now with the clipping applied.
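In code, this clipping amounts to the following (a sketch; `a_refs` and `b_refs` are assumed to hold the label values of the references in $\mathbb{A}$ and $\mathbb{B}$):

```python
import numpy as np

def clip_rank_estimate(y_rank, a_refs, b_refs):
    """Constrain the rank-based estimate to the label range spanned by the references."""
    all_refs = np.concatenate([a_refs, b_refs])
    return float(np.clip(y_rank, all_refs.min(), all_refs.max()))
```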

We note that this heuristic is not yet included in the submitted manuscript, but we will incorporate it in the final revision. We hope this clarification resolves the reviewer’s concern.

Table 3. RankRefine performance (β) with non-uniform reference set sampling. Clipping the rank-based estimate prevents degradation at higher ranker accuracy.

| Ranker Accuracy | 50% | 55% | 60% | 65% | 70% |
| --- | --- | --- | --- | --- | --- |
| RB = 0% | 0.8323 | 0.8187 | 0.7972 | 0.7837 | 0.7786 |
| RB = 10% | 0.8308 | 0.8140 | 0.7915 | 0.7878 | 0.7752 |
| RB = 20% | 0.8237 | 0.8125 | 0.7914 | 0.7862 | 0.7702 |
| RB = 30% | 0.8237 | 0.8149 | 0.7898 | 0.7813 | 0.7754 |
| RB = 40% | 0.8289 | 0.8128 | 0.7942 | 0.7800 | 0.7748 |
| RB = 50% | 0.8338 | 0.8188 | 0.8001 | 0.7862 | 0.7813 |
| RB = 60% | 0.8417 | 0.8274 | 0.8086 | 0.7960 | 0.7896 |
| RB = 70% | 0.8520 | 0.8373 | 0.8232 | 0.8109 | 0.8053 |
| RB = 80% | 0.8642 | 0.8537 | 0.8425 | 0.8330 | 0.8300 |
| RB = 90% | 0.8800 | 0.8740 | 0.8663 | 0.8634 | 0.8620 |
Comment

Thank you for the updated results. The rebuttal and detailed follow up discussion have addressed my main concerns. I am updating my score to accept.

Review
Rating: 4

This paper introduces RankRefine, a novel post hoc method for refining regression predictions by incorporating pairwise rankings. The approach is model-agnostic and does not require retraining the base regressor, making it highly versatile and practical, especially in low-data scenarios. Extensive experiments on multiple molecular property prediction benchmarks and diverse tabular datasets demonstrate consistent improvements in predictive performance across different settings.

Strengths and Weaknesses

Strengths

  1. The paper introduces RankRefine, a novel post hoc method that refines regression predictions by incorporating pairwise rankings. This approach is model-agnostic and does not require retraining the base regressor, making it highly versatile and practical for various applications, especially in low-data scenarios.
  2. Extensive experiments on multiple molecular property prediction benchmarks and diverse tabular datasets demonstrate that RankRefine consistently improves predictive performance across different settings.

Weaknesses

  1. The theoretical analysis relies on several assumptions, such as the regressor and rank-based estimates having unbiased, Gaussian-distributed errors and being independent. These assumptions may not hold in real-world scenarios, especially when dealing with heavy-tailed or skewed noise.
  2. The paper mainly focuses on using LLMs as rankers but does not explore other potential types of rankers in detail; other ranking approaches or models could be considered.

Questions

  1. What is the optimal size of the reference set for different types of regression tasks and ranker accuracies?
  2. How does RankRefine compare with other ranking methods or models that could potentially be used as rankers in different domains?

Limitations

The authors have already articulated several limitations within the paper. However, there are additional concerns that merit further consideration.

  1. The method involves comparing the query sample with a reference set to obtain pairwise rankings. As the size of the reference set increases, the computational cost of generating and processing these rankings may become a concern.
  2. The oracle experiments assume uniformly random ranking errors, but real-world rankers may exhibit systematic biases.

Final Justification

The authors provide detailed rebuttal and additional experiments. My concerns have been addressed. I will keep my positive score.

Formatting Issues

No concerns regarding formatting.

Author Response

We thank the reviewer for their thoughtful assessment and constructive feedback. We appreciate the recognition of RankRefine's novelty, versatility, and empirical strength across diverse benchmarks. Below, we address each of the reviewer’s concerns in detail and hope our clarifications and new experimental results strengthen your confidence in our work. We believe RankRefine offers a practical and broadly applicable contribution, especially in settings where ranking information is more accessible than precise regression labels.

Q1: The method relies on assumptions, such as an unbiased regressor, that may not hold in real-world scenarios.

We agree that this is an important, practical concern. Our experiments on the TDC ADMET datasets (Figure 3 in the main paper) demonstrate that RankRefine works in real-world scenarios. We have also conducted new experiments where we systematically inject constant bias into an otherwise unbiased regressor. Specifically:

  1. We start from synthetic predictions obtained by injecting zero-mean Gaussian noise into arbitrary labels, yielding a mean absolute error (MAE) close to one standard deviation of the ground truth labels.
  2. We then add a constant offset, increasing in magnitude from 0% to 70% of the label standard deviation (SD).
  3. We report the refined-to-original MAE ratio ($\beta = \frac{\mathrm{MAE}_\text{refined}}{\mathrm{MAE}_\text{non-refined}}$, lower is better) for varying ranker accuracies.

The results (see Table 1) demonstrate that RankRefine remains effective even under substantial regressor bias. For example, with a 60% bias and only 60% ranker accuracy, RankRefine achieves β = 0.9649, i.e., a 3.51% gain.

Table 1. RankRefine performance, measured in β, on a biased regressor. We inject constant bias into an otherwise unbiased regressor, where the magnitude of the offset is measured as a percentage of the ground truth labels standard deviation. RankRefine offers improvement even when the regressor is significantly biased (e.g., 60% of the SD of the ground truth labels)

| Ranker Accuracy | 50% | 55% | 60% | 65% | 70% |
| --- | --- | --- | --- | --- | --- |
| Bias (B) = 0% SD | 0.8911 | 0.8436 | 0.8694 | 0.8668 | 0.8949 |
| B = 10% SD | 0.8995 | 0.8800 | 0.8623 | 0.8602 | 0.8969 |
| B = 20% SD | 0.9084 | 0.8869 | 0.8704 | 0.8704 | 0.9047 |
| B = 30% SD | 0.9224 | 0.8997 | 0.8844 | 0.8861 | 0.9192 |
| B = 40% SD | 0.9409 | 0.9172 | 0.9059 | 0.9071 | 0.9395 |
| B = 50% SD | 0.9645 | 0.9403 | 0.9329 | 0.9343 | 0.9654 |
| B = 60% SD | 0.9942 | 0.9702 | 0.9649 | 0.9670 | 0.9967 |
| B = 70% SD | 1.0291 | 1.0062 | 1.0022 | 1.0049 | 1.0326 |

Q2: How does RankRefine compare with other ranking methods or models that could potentially be used as rankers in different domains?

Thank you for this suggestion. While RankRefine is agnostic to the choice of ranker, we recognize the value of exploring different types. To that end, we experimented with a Random Forest (RF) classifier as the ranker on the Caco2 dataset (TDC ADMET benchmark). Despite being a simple model, it achieves 75.78% pairwise ranking accuracy.

This highlights the flexibility of RankRefine: it can leverage any ranker (few-shot LLMs, learned classifiers, domain-specific heuristics), making it adaptable to different domains.
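One plausible way to set up such a ranker (a sketch under our own assumptions, not the authors' implementation) is to train a binary classifier on feature differences of molecule pairs, predicting which member of a pair has the larger label:

```python
import numpy as np
from itertools import combinations
from sklearn.ensemble import RandomForestClassifier

def make_pairwise_data(X, y):
    """Pairwise training data: features are differences, label is 1 if the first item has the larger value."""
    pairs = list(combinations(range(len(y)), 2))
    X_pair = np.array([X[i] - X[j] for i, j in pairs])
    y_pair = np.array([int(y[i] > y[j]) for i, j in pairs])
    return X_pair, y_pair

# X_train: molecular descriptors or fingerprints; y_train: property values (assumed available)
# X_pair, y_pair = make_pairwise_data(X_train, y_train)
# ranker = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_pair, y_pair)
# p = ranker.predict_proba((x_query - x_ref).reshape(1, -1))[0, 1]   # P(query > reference)
```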

Q3: What is the optimal size of the reference set?

Thank you for raising this point. As shown in Figure 3 (main paper), RankRefine’s performance improves as reference set size increases, up to a point. We observe that performance saturates around 20 samples, with diminishing gains beyond that (e.g., 30 samples). This suggests that RankRefine is not only effective but also efficient, requiring only a small number of reference samples to yield performance improvements.

Additional Comments

We also note the reviewer's concern about theoretical assumptions (e.g., Gaussian errors, independence). While these assumptions simplify the analysis, we validate RankRefine extensively in real-world settings (e.g., molecular property prediction, tabular data) where such assumptions do not strictly hold. The consistent empirical improvements across diverse tasks suggest that RankRefine is robust beyond the theory's assumptions.

On the concern of ranking cost, we note that:

  • Pairwise comparisons can be parallelized.
  • Smaller reference sets (≤ 20) are sufficient for substantial improvements in most tasks.
  • RankRefine requires only inference-time computation; no retraining is involved, preserving efficiency.
Comment

Thank you to the authors for the response and detailed clarifications. I have no further questions, and I would like to congratulate you on the good work.

Final Decision

The paper proposes a conceptually simple and general method for tuning regression models given preference rankings. The core experiments are done using LLMs to elicit rankings that are used to improve regression models.

Reviewers voted for accepting the work. They appreciated that (a) the method is model agnostic (as shown by results that use RF), (b) shows consistent empirical gains, (c) the paper is clearly written.

Memorization was among the raised concerns: it is not clear to what extent the prior exposure of LLMs to large swaths of data allows them to simply leak test information. This was partially addressed during the rebuttal phase by adding a private test set.

The novelty of the work is limited. First, approaches such as GRPO are routinely used to adjust (generative) models' responses based on preference data, also stemming from the Bradley-Terry model. Second, test-time adaptations are common, including various prompting techniques. Third, there are existing post-hoc adjustment methods (included in the baseline comparison). The AC's assessment of novelty differed from the reviewers', and this was discussed during the private discussion phase.

Relatedly, the paper misses an important baseline: adjusting an LLM directly based on the preference data (vs. adapting a regressor based on the LLM). In the default setting this would naturally require retraining, but it could also be done in-context with prompting.

However, given that the paper compares to two prior methods for adjusting regression models based on ranking information, the paper does make an advance in a timely application.

All in all, I am recommending acceptance of the paper.