Multivariate Conformal Selection
We propose Multivariate Conformal Selection (mCS), a framework extending Conformal Selection to multivariate settings, achieving finite-sample False Discovery Rate control and enhanced selection power with applications in drug discovery and beyond.
Abstract
Reviews and Discussion
This paper introduces Multivariate Conformal Selection (mCS), a new approach for selecting candidates in settings with multivariate responses, such as drug discovery and large language model alignment. Unlike traditional Conformal Selection, which is limited to single-variable outputs, mCS extends the framework by leveraging multivariate nonconformity scores and ensuring False Discovery Rate (FDR) control. The authors propose two variants: mCS-dist, which uses distance-based scores, and mCS-learn, which optimizes selection criteria through differentiable learning.
Questions For Authors
I would be grateful if the authors could clarify my question on the proof of Theorem 3.5.
Claims And Evidence
Through both simulated and real-world experiments, the study demonstrates that mCS enhances selection accuracy while rigorously controlling FDR. These numerical results support the theoretical claims of the paper.
I have some concerns on the proof of one of the main results of the paper (please refer to the section Theoretical Claims).
Methods And Evaluation Criteria
Proposed methods and/or evaluation criteria make sense for the problem at hand.
Theoretical Claims
I would be grateful if the authors could clarify the following aspects of the proof of Theorem 3.5:
- On page 12, just before "The last equality is again by", should be (the same later in the sentence and in the next equation). With these corrections, it seems to me that the equality before "By definition, {p_l}_{l \neq j} is invariant after..." no longer holds. Indeed, the event "j \in S^*_{j\to 0}" is equal to "0 \leq q |S^*_{j\to 0}| / m", since the p-value corresponding to j in S^*_{j\to 0} is 0.
- In the proof of Theorem 3.5, could the authors comment on this sentence: "Above two cases happen with probability 1 since there are no ties almost surely."? It was not clear to me why there are no ties almost surely. It might be related to the fact that the proof is derived considering deterministic p-values (and thus the terms in Eq. (3) for the tie-breaking of the nonconformity scores are not there), but I was not able to understand it.
Experimental Design And Analyses
The authors analyze their method across different dimensions of the output space and different properties of the target region, and they discuss results on both simulated and real data. The empirical evaluation is solid.
I would have been interested in a more concrete/detailed description of the biological meaning of the output space for the application on real data.
Supplementary Material
I read Sections
In Section A.1, it would be good to provide a reference when it is stated: "By a standard result from conformal inference ..."
Relation To Broader Scientific Literature
The paper extends the work from Jin & Candès "Selection by prediction with conformal p-values" to the multivariate response setting. Some key concepts originally introduced by Jin & Candès are adapted to the multivariate setting.
Essential References Not Discussed
Relevant related works are cited as far as I am aware.
There is another line of work proposing conformal methods to control the FDR on the set of true edges in a graph when considering a link prediction problem (such as "Conformal link prediction for false discovery rate control", Ariane Marandon). These works are not cited by the authors but, while being connected, they address a different task than the one considered in the paper and thus are not essential.
Other Strengths And Weaknesses
Other Comments Or Suggestions
Here are some typos:
- "but replaces the the denominator" (Sec. 2)
- In Algorithm 1, line 1, r_j should be r_{n+j}.
- On the left side of page 5, F(V(x, y), U) ~ U(0,1) and not (-1,1).
Ethics Review Issues
None
On page 12, just before "The last equality is again by", should be replaced with ... since the p-value corresponding to j in S^*_{j\to 0} is 0.
A1: Thank you very much for pointing out the typo and the error in our proof. We have corrected the typo accordingly.
Let us clarify the corrected reasoning. In the original manuscript, we aimed to show an inequality that is correct, yet we used a wrong path to prove it: we claimed an intermediate equality that is incorrect since, as you pointed out, the j-th argument there is 0, not the original p-value. However, a corrected chain of inequalities allows us to properly continue the proof as originally intended. We have updated the manuscript to clearly reflect this correction.
In the proof of Theorem 3.5, could the authors comment on this sentence... but I was not able to understand it.
A2: Thank you for pointing this out. In fact, the assumption that the scores have no ties is unnecessary. We can simply handle the tied case by combining it directly with the scenario originally labeled (i). After this adjustment, the proof proceeds without difficulty.
We have clarified this point and updated the manuscript accordingly.
In Section A.1, it would be good to provide a reference when it is stated: "By a standard result from conformal inference..."
A3: We added a reference (Vovk et al., 2005) to corroborate the claim.
Other Comments Or Suggestions
A4: We have corrected all the typos and addressed all the minor suggestions. We highly appreciate your careful review, which has helped us improve the clarity and quality of our manuscript.
This paper addresses the important problem of multivariate conformal selection. While multivariate conformal prediction is relatively well studied, this appears to be the first work on multivariate selection tasks.
Update after rebuttal: I did not change my score as the recommendation is already to accept.
Questions For Authors
Could you expand more about tie breaking? What if one used a worst-case rule for ties instead of breaking them at random, so that one has a deterministic test procedure?
Footnote 2: The choice of r_{n+j} needs to be exchangeable in which sense? Regarding its coordinates, or ensuring exchangeability with the calibration data?
How would one define regional monotonicity for a classification problem?
In Theorem 4.1 why do you need to assume i.i.d.; would exchangeability suffice?
Claims And Evidence
The claims are supported by proofs and simulation studies.
Methods And Evaluation Criteria
As this is the first paper on multivariate conformal selection, there do not seem to be any benchmark datasets available.
It would be good to see an evaluation on classification / binary outcomes as well.
Theoretical Claims
The proofs look plausible.
Experimental Design And Analyses
The experimental setup is well explained in the supplementary material.
Supplementary Material
I looked over it but mainly for information.
Relation To Broader Scientific Literature
The following paper seems also related
Klein, Michal, et al. "Multivariate Conformal Prediction using Optimal Transport." arXiv preprint arXiv:2502.03609 (2025).
Essential References Not Discussed
None came to mind
Other Strengths And Weaknesses
Figure 1 is too small to read properly. Perhaps move Task 2 to the supplementary?
Other Comments Or Suggestions
See above
It would be good to see an evaluation on classification outcomes as well.
A1: In our paper, we chose to focus primarily on regression tasks because they represent a more challenging setting for conformal selection.
In fact, the selection problem for classification (univariate or multivariate) can be directly reduced to the univariate conformal selection framework introduced by Jin and Candes (2023). This reduction explains why we did not include separate evaluations specifically for classification scenarios, as such evaluations would essentially revisit methods already established in the literature.
To clarify this point, let us briefly demonstrate the reduction explicitly:
- For the univariate classification setting, suppose the response space is composed of K classes \{1, \dots, K\} with target region R \subset \{1, \dots, K\}. Then, by defining a binary response \tilde{y} = 1\{y \in R\}, the original selection problem directly translates into a univariate conformal selection problem, where we select samples with \tilde{y} = 1.
- For the multivariate classification case, suppose responses are drawn from a joint class space Y = Y_1 \times \dots \times Y_d, with target region R \subset Y. Again, we define a binary indicator \tilde{y} = 1\{y \in R\}, converting the original multivariate selection problem into a standard univariate selection task. Since \tilde{y} is binary, there is a direct correspondence between the multivariate and univariate nonconformity scores, and regional monotonicity reduces to the usual monotonicity condition of univariate conformal selection.
Thus, every classification-based selection task can be naturally and effectively solved using existing univariate conformal selection methods. We will clarify this point explicitly in our final manuscript.
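To make the reduction above concrete, here is a minimal sketch (our own illustration, not code from the paper): it collapses univariate or multivariate class labels into the binary indicator 1\{y \in R\}, after which any univariate conformal selection implementation applies. The helper name `to_binary_response`, the labels, and the region `R` are all hypothetical.

```python
import numpy as np

def to_binary_response(Y, target_region):
    """Map (possibly multivariate) class labels to the indicator 1{y in R}."""
    Y = np.asarray(Y)
    if Y.ndim == 1:  # univariate classification: labels are scalars
        return np.array([int(y in target_region) for y in Y])
    # multivariate classification: labels are tuples in a joint class space
    return np.array([int(tuple(y) in target_region) for y in Y])

# Hypothetical example: d = 2 classification with a joint target region.
Y = [[0, 1], [1, 1], [2, 0], [1, 0]]
R = {(1, 1), (1, 0)}  # select samples whose joint label lies in R
print(to_binary_response(Y, R))  # [0 1 0 1] -> feed into univariate CS
```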
The following paper seems also related "Klein et al. (2025)".
A2: We have now cited this paper.
Figure 1 is too small to read properly.
A3: We have now increased the font size and improved the clarity of Figure 1.
Could you expand more about tie breaking? ...
A4: In our work, we follow the standard practice in conformal methods and break ties randomly. If instead one used a deterministic worst-case tie-breaking rule (always ranking the test score behind tied calibration scores), the conformal p-value becomes p_{n+j} = (1 + #\{i : V_i \leq V_{n+j}\}) / (n+1). This rule yields a (weakly) larger p-value for every test sample, making the test deterministic and conservative (though not uniformly distributed). Consequently, applying the BH procedure to these deterministic p-values yields a fully reproducible selection rule, but with a slight reduction in statistical power compared to random tie-breaking. We will clarify this trade-off explicitly in the final manuscript.
Footnote 2: The choice of r_{n+j} needs to be exchangeable in which sense? ...
A5: In the original manuscript, footnote 2 mistakenly suggested that the choice of r_{n+j} required some form of exchangeability. After careful reconsideration, we see clearly now that no such condition is needed. The point r_{n+j} can be chosen arbitrarily from the region R. From the viewpoint of our proof, the critical step is the regional monotonicity condition. This condition alone ensures the needed relationship between the scores under the null hypothesis. The oracle p-values are uniform precisely because of the exchangeability of the calibration data. The choice of r_{n+j}, therefore, does not influence the oracle p-value and imposes no extra exchangeability constraints. We have revised the manuscript accordingly to state this simpler and accurate assumption.
How would one define regional monotonicity for a classification problem?
A6: Please see Q1.
In Theorem 4.1 why do you need to assume i.i.d.?...
A7: The proof of Theorem 4.1 indeed relies crucially on the strong law of large numbers. This result requires independence (or at least certain mixing conditions); exchangeability alone is not sufficient here. Please see Appendix B for more detail.
Thank you for the explanation. For the strong law of large numbers under exchangeability this paper seems to do the trick:
Etemadi, N., and M. Kaminski. "Strong law of large numbers for 2-exchangeable random variables." Statistics & Probability Letters 28.3 (1996): 245-250.
Thank you very much for your insightful comments and for providing this useful reference. In the final manuscript, we will update the relevant discussion in the Appendix to reflect the strong law of large numbers under the 2-exchangeability condition, as suggested.
This paper proposes multivariate conformal selection (mCS), which extends conformal selection to multi-response settings by introducing the concept of regional monotonicity, generalizing univariate monotonicity, and defining multivariate non-conformity scores. mCS guarantees finite-sample false discovery rate (FDR) control. Two types of non-conformity scores are proposed. The first, mCS-dist, is based on predefined distances. The second, mCS-learn, involves a term that is learned via gradient-based optimization where the sorting has been replaced with soft (differentiable) sorting. Experiments on simulated and real-world datasets show that mCS enhances selection power while maintaining FDR control.
Update after rebuttal
The new experiments regarding hyperparameters, numerical stability, and calib/test data sizes are convincing. I recommend the paper's acceptance.
Questions For Authors
- The choice of r_{n+j} (choosing it on the boundary of R that would be most informative for the given x) also seems to offer room to improve selection power, as long as the choice doesn't violate exchangeability. Is this correct? If so, did the authors consider simply optimizing this choice within the mCS-dist variant?
- Why is it necessary to further split the training set into two?
- How does the method have to adapt when the underlying predictor outputs a conditional distribution rather than a point estimate?
Claims And Evidence
- The extension of conformal selection to the multi-response setting is a novel and significant contribution.
- The framework guarantees finite-sample FDR control. The theoretical arguments seem correct.
- The proposed non-conformity scores are well-motivated.
- There were some ambiguities in the definition and implementation of the mCS-learn algorithm for me to assess it fully (please see comments in the "Methods And Evaluation Criteria" section).
- Overall, the clarity of the presentation could be significantly improved. I suggest moving up some of the experimental details (such as the definition of settings and tasks) to the main text so that the figures and tables can be interpreted from information given in the main text.
- Both mCS-dist and mCS-learn perform well against baselines in the simulated setting. All of the figures and tables in the main text are on the simulated setting, however, and not the real data application.
- If the authors are planning to make the aggregated and imputed ADMET dataset public, that would be a significant contribution worth mentioning up front.
Methods And Evaluation Criteria
I had multiple questions and concerns about the form of the mCS-learn method as well as its presentation:
- The method requires an extra split of the calibration set into training, validation, and proper calibration sets. I would like to see more discussion on the signal lost from a reduced calibration set size, which I'd presume would be more significant in higher dimensions (many targets).
- Why is a further split of the training set necessary in line 3 of Algorithm 2?
- What was the exact algorithm used for soft ranking? The paper cites both Blondel et al. 2020 and Cuturi et al. 2019, but does not indicate the exact algorithm nor the hyperparameters it would introduce.
- Related to the above, soft sorting algorithms are typically very sensitive to the regularization parameter, and small numerical errors may impede training. From Algorithm 2, it seems like only one of the hyperparameters was chosen via a held-out validation split, while the sorting regularization wasn't. If it was predetermined, please explain the procedure.
- The second loss function (Eq. 16) introduces another hyperparameter. How was this determined?
- In Algorithm 2, line 8, what is this quantity? Why does repeated application of mCS on the training data lead to different power values?
- Section 5.2: what value was used here?
- It is very difficult to interpret the tables of results. The "settings" are only defined in the Appendix -- it would also help to refer to them by their respective characteristics in the main text (for example, setting 1 = "linear, Gaussian noise"). Please also indicate the nominal level in the table caption.
Theoretical Claims
The proof of Theorem 3.5 seems correct.
Experimental Design And Analyses
- For the simulated study, as there is full control over the data generation, the authors can use a bigger test set than 100 for cleaner evaluation.
- Please also report the standard errors of the metrics across runs in the tables.
- Please share the anonymized code so the details of the implementation can be checked.
Supplementary Material
Appendix A.2 (proof of Theorem 3.5), C, D
Relation To Broader Scientific Literature
- Extending CS to multi-response settings is a significant contribution. The paper introduces the concept of regional monotonicity, which generalizes univariate monotonicity, required to guarantee FDR control.
- Even in the univariate setting, regional monotonicity generalizes the threshold-based framework of Jin and Candes 2023 to arbitrary sets of intervals.
- The authors provide finite‐sample theoretical guarantees (Theorem 3.5) using similar arguments as Jin and Candes 2023.
Essential References Not Discussed
N/A
Other Strengths And Weaknesses
N/A
Other Comments Or Suggestions
- Algorithm 1, line 1: r_j is confusing as it reuses the index j --> r_{n+j}
- Please explain the role of r_{n+j} up front when it is introduced, right after Eq. 8 and also Eq. 11
- In Algorithm 2, please include a line that assigns the value of this quantity (currently introduced without definition)
- In Algorithm 2, the notation is overloaded with the previous discussion of Theorem 4.1
- Fig 1: Please make the figure axis labels and legend font larger
- Fig 1: Next to task number, please include a few-word description of what the task is designed for
The method requires an extra split... in higher dim.
A1: From our experience, the (multivariate) conformal selection procedure is insensitive to the size of the calibration data: the calibration scores only affect the resolution of the p-values, which is typically sufficient once the calibration set reaches a moderate size.
To support this point, we ran an additional experiment on real-data Task 2, holding the other sample sizes fixed and varying the calibration set size n_calib, shown below:
| n_calib | FDR | Power |
|---|---|---|
| 100 | 0.486 | 0.594 |
| 500 | 0.501 | 0.603 |
| 3000 | 0.499 | 0.598 |
As shown, even with a small calibration set, the method maintains FDR control and strong power.
Why is a further split of the training set necessary...
A2: In short, this is because the computation of smoothed p-values requires both calibration and test data; therefore, to train based on the performance of a given score, we need to split the training set into two parts, which serve as calibration data and test data, respectively.
What was the exact Algo used for soft ranking...typically very sensitive...explain
A3: We adopted the implementation in Blondel et al. (2020), with the regularization strength set to 0.1. Preliminary tests showed that, within a reasonable range, different regularization strengths produce similar overall behavior in mCS. While cross-validation could optimize this parameter further, its exact value is not central to our primary contributions. To maintain clarity and simplicity, we chose not to include additional tuning steps for this hyperparameter.
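As a concrete illustration (our own sketch, not the paper's code), the snippet below shows how a soft rank can stand in for the hard rank inside a differentiable selection loss. It assumes the torchsort package, one public implementation of Blondel et al. (2020); the scores, the loss, and the regularization strength of 0.1 are illustrative.

```python
import torch
import torchsort  # pip install torchsort; implements Blondel et al. (2020)

# Hypothetical nonconformity scores for a pooled batch of points, produced
# by a score network we want to train (hence requires_grad=True).
scores = torch.tensor([[0.7, 0.1, 0.4, 0.9, 0.3]], requires_grad=True)

# Soft (differentiable) ranks over the last dimension.
soft_ranks = torchsort.soft_rank(
    scores, regularization="l2", regularization_strength=0.1
)

# A smoothed p-value proxy: the normalized soft rank of each score.
smoothed_p = soft_ranks / scores.shape[1]

# Stand-in for a power-style objective; gradients flow through the sort.
loss = smoothed_p.sum()
loss.backward()
print(scores.grad)  # nonzero: the ranking step is differentiable
```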
The 2nd loss introduces another hyperparameter ...
A4: This value was chosen based on a series of preliminary experiments. Similar to the previous question, while we could cross-validate this hyperparameter, for simplicity of presentation we fix it to 0.5 in our main procedure.
In Algo 2, line 8, what is ...
A5: To accurately estimate the power of the selection rule (defined as an expectation in Eq. 2), we average the empirical selection power over multiple random partitions of the training data. Each random partition produces different power values, even with the same score, and averaging over many partitions provides a stable and accurate approximation. We clarified this point in the updated version of our manuscript.
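A minimal sketch of this averaging step, under our own assumptions: `select` is any selection rule mapping a calibration/test score split to selected test indices (e.g., conformal p-values plus BH), and the helper name `estimate_power` is hypothetical.

```python
import numpy as np

def estimate_power(scores, qualified, n_cal, select, B=50, seed=0):
    """Average empirical selection power over B random partitions.

    scores: pooled nonconformity scores; qualified: boolean flags for the
    truly qualified points; select(cal, test) -> selected test indices.
    """
    scores, qualified = np.asarray(scores), np.asarray(qualified)
    rng = np.random.default_rng(seed)
    powers = []
    for _ in range(B):
        idx = rng.permutation(len(scores))
        cal, test = idx[:n_cal], idx[n_cal:]
        sel = select(scores[cal], scores[test])
        q_test = qualified[test]
        if q_test.sum() > 0:  # power is undefined without qualified points
            powers.append(q_test[sel].sum() / q_test.sum())
    return float(np.mean(powers)) if powers else 0.0
```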
For the simulated study... bigger test set than 100.
A6: We appreciate the suggestion. While a test size of 100 may seem small for evaluating general machine learning tasks, conformal selection performance is generally insensitive to the test set size (as noted in Sec. 3.1 of Jin and Candes, 2023). Additionally, we repeated our experiments over 100 independently generated datasets, ensuring the results are stable. This stability is also supported by our simulation results provided in response to subsequent comments.
Please report the std. errors...
A7: For both simulation and real-data experiments, each iteration involves sampling a new dataset, introducing variation in both the data and the model training. Reporting standard errors under these conditions would primarily reflect data variability rather than the inherent stability of the selection methods.
However, we agree numerical stability is important. Thus, in Appendix D, we provide an additional real-data experiment with a fixed dataset across iterations. Below are the observed standard deviations of FDR and power (over 100 iterations) from this experiment:
FDR/Power:
| Task | q | CS_int | CS_ib | CS_is | bi | mCS-d | mCS-l |
|---|---|---|---|---|---|---|---|
| 1 | 0.3 | 0.000/0.000 | 0.000/0.000 | 0.248/0.022 | 0.000/0.000 | 0.000/0.000 | 0.000/0.000 |
| 2 | 0.3 | - | - | - | 0.000/0.000 | 0.000/0.000 | 0.002/0.001 |
| 3 | 0.3 | - | - | - | 0.000/0.000 | 0.021/0.005 | 0.004/0.002 |
These results confirm strong numerical stability for our proposed methods. Note that for mCS-learn, the trained model was fixed, so variability from model training is not considered here.
The choice of r_{n+j} ...?
A8: We have added a detailed discussion about the choice of r_{n+j}; please refer to our response under A5 for Reviewer 2 (PjuQ).
For both the generalized signed score and the clipped score, choosing r_{n+j} on the boundary of R is already optimal: in this case, the first term of the score is minimized (equal to 0), so no further optimization is needed to enhance selection power.
How ... to adapt when ... outputs a conditional distribution?
A9: When the model outputs an estimated conditional distribution of the response, the second term of the score can be replaced by the predicted probability of the response lying in the target region, i.e. a term that decreases as \hat{P}(Y \in R | X = x) increases. This serves the same purpose: points with a high predicted probability of satisfying the selection criterion will receive lower scores and are more likely to be selected.
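A minimal sketch of this adaptation, under our own assumptions: `sampler` is a hypothetical interface to the fitted conditional model, `in_region` encodes the target region R, and a Monte Carlo estimate of \hat{P}(Y \in R | X = x) is turned into a score that is low when selection is likely.

```python
import numpy as np

def prob_based_score(sampler, x, in_region, n_draws=1000, seed=0):
    """Score = 1 - Monte Carlo estimate of P(Y in R | X = x).

    sampler(x, n, rng) -> (n, d) draws from the conditional distribution;
    in_region(y) -> bool membership test for the target region R.
    """
    rng = np.random.default_rng(seed)
    draws = sampler(x, n_draws, rng)          # hypothetical model interface
    p_hat = np.mean([in_region(y) for y in draws])
    return 1.0 - p_hat                        # low score = likely to qualify

# Hypothetical Gaussian conditional model for a 2-D response.
def sampler(x, n, rng):
    return rng.normal(loc=x, scale=1.0, size=(n, 2))

in_region = lambda y: y[0] > 0.0 and y[1] > 0.0  # R = positive quadrant
print(prob_based_score(sampler, np.array([1.0, 1.0]), in_region))
```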
Thank you for addressing my questions regarding the hyperparameters, numerical stability, and calib/test data sizes. The new demonstrations are convincing. I'll raise my score to an accept.
Thank you very much for your thoughtful review and constructive feedback. We're pleased that our additional demonstrations addressed your questions. Your insights have greatly helped us clarify and improve our work.
This paper introduces Multivariate Conformal Selection (mCS), extending Conformal Selection to multivariate response settings. By leveraging regional monotonicity and multivariate nonconformity scores, mCS ensures finite-sample False Discovery Rate control. Two variants—distance-based and learned scores—demonstrate strong performance in simulations and real-world tasks, improving selection power while maintaining rigorous uncertainty guarantees. All reviewers provided positive feedback. The concerns raised were mostly addressed in the rebuttal.