PaperHub
Score: 5.5/10 · Poster · 4 reviewers
Ratings: 3, 2, 4, 3 (min 2, max 4, std 0.7)
ICML 2025

Bridging Fairness and Efficiency in Conformal Inference: A Surrogate-Assisted Group-Clustered Approach

OpenReview · PDF
Submitted: 2025-01-22 · Updated: 2025-07-24
TL;DR

We propose a novel surrogate-assisted clustered conformal inference framework that empowers the construction of efficient prediction sets by pooling the protected groups into larger clusters and leveraging the surrogates.

Abstract

Keywords
Conformal inference · Individualized prediction · Clustering · Surrogate outcomes · Group fairness

Reviews and Discussion

Review
Rating: 3
  • This paper presents a novel approach to fair conformal inference.
  • Specifically, the proposed method is based on group-conditional conformal inference.
  • The proposed method rests on two key ideas: (1) clustering score values and (2) the use of an influence function for prediction sets.
  • Experimental results suggest that the proposed method is competitive with three alternative approaches and one existing approach.

Update after rebuttal

  • I thank the authors for their clarifications. Based on their response, which includes additional experiments, I will raise my score from 2 to 3.

Questions For Authors

  • Could the authors clarify the meaning of the subscript $f$ in several notations such as $V_f, Y_f, W_f$ in eq. (1)?
  • What model was used in practice for estimating $\hat{E}$ in the definition of $R(W,Y)$ in Section 3.2?

Claims And Evidence

  • The group fairness notion in coverage, which is introduced in Section 3.3, could be clarified with concrete justifications or examples. Why is fair coverage particularly important in social contexts? Introducing examples of undesirable consequences due to unfair coverage would be helpful.
  • Furthermore, if a given prediction model does not achieve demographic parity but achieves fair coverage in conformal inference, does this imply that decision-making based on this model is fair?
  • This paper focuses on protected group-wise fairness, while Ding et al. (2023) focused on the label class-wise conformal prediction. That is, the concept of "group-conditional" conformal inference is not entirely novel within the broader machine learning research, though its application to algorithmic fairness is new.

Methods And Evaluation Criteria

  • The experiment part would be more convincing if the authors could explain details about the baseline method WCQR from Lei & Candes (2021), as WCQR is originally designed for counterfactuals and individual treatment effects. Is WCQR an appropriate baseline for SAGCCI?

Theoretical Claims

  • I have briefly checked the proofs of Lemma 4.3 and Theorem 4.4.

Experimental Designs Or Analyses

  • The authors claim that SAGCCI outperforms existing methods; however, several experimental results (e.g., Figure 3, Table 1, Figure 4) show that a baseline (Surro+Group) is competitive with, or sometimes outperforms, SAGCCI.
  • Since Section 4.4 notes that SAGCCI requires additional computation for split conformal inference and non-conformity scores, a comparison of running times between SAGCCI and the baseline methods would enrich the experimental section.

Supplementary Material

  • No explicit supplementary materials attached.

Relation To Broader Scientific Literature

  • This work would help emphasize the need for uncertainty quantification techniques in algorithmic fairness.

Essential References Not Discussed

Other Strengths And Weaknesses

  • N/A

Other Comments Or Suggestions

  • (typo) line 39: estaimaled -> estimated
Author Response

Thanks for your careful review. We provide responses to your questions:

Claims And Evidence

  1. Fair coverage is crucial in socially consequential domains, where prediction intervals can help guide decision-making related to access to resources, opportunities, or fair treatment. For example, if a mortgage lender’s prediction intervals systematically under-cover minority borrowers, they may face inflated interest rates or higher denial rates. In healthcare, if limited supplies of critical antibiotics demand triaging decisions, under-coverage for certain demographics could result in denying them lifesaving medications. We will include these examples in the paper.

  2. If a given prediction model achieves fair coverage in terms of conformalized prediction sets, this does not necessarily imply fairness in point predictions. If the lengths of the prediction sets are the same and the sets are symmetric about the point predictions, fair coverage in prediction sets could imply fairness in point predictions. Therefore, fair coverage in prediction sets can be viewed as an extension of group fairness (e.g., demographic parity) in point predictions, since coverage in prediction sets additionally accounts for the uncertainty of the point predictions (our primary motivation). Even if a given model does not achieve group fairness in terms of point predictions, it can achieve fair coverage in the prediction sets, and one could conclude that its prediction is fair with $(1-\alpha)$ confidence given a user-specified miscoverage level $\alpha$. This point will be emphasized in our paper.

Methods And Evaluation Criteria

  1. The baseline WCQR from Lei & Candes (2021) was designed for counterfactuals (or individual treatment effects); it essentially aims to account for covariate shifts between groups with observed outcomes and groups with missing outcomes. It is therefore directly applicable in our setting, which involves covariate shifts between the source data (outcomes observed) and the target data (outcomes missing), making WCQR an appropriate baseline comparator for SAGCCI. We highlight the additional efficiency SAGCCI gains over WCQR by estimating the thresholds via the EIF.

Experimental Designs Or Analyses

  1. For the presented experiments with the number of groups $M=3$ and the number of clusters $K=2$, SAGCCI and Surro+Group have similar performance when $N \geq 3000$. When we increase $M$ to 10 and 20 while fixing $K=5$, Surro+Group performs much worse even when $N \geq 5000$, since each group has fewer samples with which to fit the non-conformity scores, which deteriorates both AvgSize and CovGap. The new results are presented here.

  2. All the considered conformal inference methods are implemented with the split conformal inference strategy, so their running times are similar. For example, for $N=1000$, the running times of these methods are all around 1 second when 25% of the data is used for calibration and to construct the conformalized prediction sets. This point will be made clear in the paper.

Essential References Not Discussed

  1. Liu et al. (2022) and Vadlamani et al. (2025) focus on learning a fair quantile function (termed ConformalFairQuantl) to construct the non-conformity score and deliver reliable fair prediction sets. However, our method is more general: it is applicable to any non-conformity score and is not restricted to these conformalized quantile scores. Their methods do not account for distributional shifts in covariates and are therefore not directly suitable for our setting, where the outcomes from the target data are missing; see the updated results here, where the performance of ConformalFairQuantl is not satisfactory under our problem setup.

Other Comments Or Suggestions & Questions For Authors

  1. We have fixed the typos and removed the subscript in $V_f$ to avoid any confusion. Now, we have a new data set $V = (W, Z, Y) \sim P_{D_0}$. We use a quantile random forest to fit the quantiles $q_{\alpha/2}$ and $q_{1-\alpha/2}$, and the non-conformity scores are constructed as the conformalized quantile residual $R(W, Y) = \max(q_{\alpha/2}(W) - Y,\; Y - q_{1-\alpha/2}(W))$.
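For concreteness, a minimal sketch of this score construction in Python, using scikit-learn's gradient-boosting quantile regression as a stand-in for the quantile random forest (function names and data layout are illustrative, not the authors' code):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

alpha = 0.1  # user-specified miscoverage level

# Stand-in for the quantile random forest: separate regressors for the
# lower (alpha/2) and upper (1 - alpha/2) conditional quantiles of Y | W.
q_lo = GradientBoostingRegressor(loss="quantile", alpha=alpha / 2)
q_hi = GradientBoostingRegressor(loss="quantile", alpha=1 - alpha / 2)

def fit_quantiles(W_train, Y_train):
    """Fit both quantile models on source data with observed outcomes."""
    q_lo.fit(W_train, Y_train)
    q_hi.fit(W_train, Y_train)

def conformity_score(W, Y):
    """Conformalized quantile residual:
    R(W, Y) = max(q_{alpha/2}(W) - Y, Y - q_{1-alpha/2}(W))."""
    return np.maximum(q_lo.predict(W) - Y, Y - q_hi.predict(W))
```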

References

  1. Lei, L. and Candès, E. J. Conformal inference of counterfactuals and individual treatment effects. Journal of the Royal Statistical Society Series B: Statistical Methodology, 83:911–938, 2021.

  2. Liu, M., et al. Conformalized fairness via quantile regression. Advances in Neural Information Processing Systems, 35:11561–11572, 2022.

  3. Vadlamani, A. T., et al. A generic framework for conformal fairness. In The Thirteenth International Conference on Learning Representations, 2025.

Review
Rating: 2

The authors introduce a new strategy for achieving equitable group coverage in conformal inference. The approach consists of two components: (1) rather than using raw sensitive groups, they cluster groups with similar conformal score distributions via K-means and (2) they leverage surrogate information correlated with the outcome to obtain tighter prediction sets. This method is supported by theoretical guarantees. To evaluate its effectiveness, the authors conduct experiments on a synthetic dataset and a real-world dataset, analyzing the impact of each component as well as the overall performance.

Updated after rebuttal

I sincerely appreciate the authors’ detailed response and the additional experimental results. These new findings seem promising. However, I remain concerned about the empirical evaluation, which I find to be quite limited for a venue like ICML. I believe that the theory needs to be justified with more rigorous experiments and examples, highlighting the benefits and limitations of the approach. In its current form, the evaluation does not, in my view, provide sufficient evidence to support strong conclusions regarding the effectiveness of the proposed approach. The experimental analysis focuses primarily on a toy problem (which could be constructed to highlight the strengths of the proposed method) and only a single real-world dataset. This narrow evaluation makes it challenging to fully assess the practical benefits and general applicability of the approach. Moreover, the comparison to SOTA methods is rather limited, which further hinders the understanding of the method’s relative performance.

Finally, since the proposed approach appears to build upon the method by Ding et al., with additional features intended to improve fairness, it would be particularly valuable to include a direct comparison to Ding et al. in the empirical analysis. Such a comparison would help to clearly demonstrate the impact of the proposed enhancements.

Therefore, in principle, I will maintain my score.

Questions For Authors

Q1 - The role of each dataset is not entirely clear. Do you use $D_1$ in a specific step of the training process (e.g., clustering) and $D_0$ in another? It seems that the entire training process requires access to $Y$, suggesting that the approach is exclusively trained on $D_1$. Is this true? Or does $D_0$ have a significant role in training, or is its contribution primarily in another stage?

Q2 - Which are the main differences between your work and the work by Ding et al. (2024)?

Q3 - In real-world applications the number of sensitive groups can be high, especially when intersectionality is considered. How does your method behave as the number of sensitive groups increases?

Q4 - Another common real-world scenario is having missing sensitive information. Could this method be extended to handle such scenarios?

Q5 - It is known that when employing K-means for clustering, the results can vary considerably with the value of K. Since the clustering is a key component of the proposed approach, I would like to ask how sensitive your results are to different values of K.

Q6 - Is it always the case that higher variance in the surrogate variable indicates higher predictiveness of the primary outcome $Y$? Or is there an upper limit on the value this variance can take (above which this informativeness is lost, or at least not improved)?

Claims And Evidence

Most claims are supported by the analysis conducted in the paper. However, some statements raise concerns:

(1) The authors describe their experiments as "extensive synthetic simulations," yet only a single synthetic problem is considered. Referring to this as extensive seems like an overstatement of the empirical analysis provided.

(2) The term "uncertainty quantification" appears in both the abstract and the final paragraph of the introduction. However, this term is broad, as uncertainty can stem from multiple sources. The paper does not further discuss uncertainty estimation or quantification, nor does it provide specific theoretical or empirical analysis on the topic. To avoid unsupported claims, the authors should either use a more precise term or clarify their intended meaning on this matter.

Methods And Evaluation Criteria

The synthetic and real-world datasets chosen are well-suited for evaluating the proposed approach. Additionally, analyzing the individual contributions of clustering and surrogate information through ablation studies is highly valuable for understanding their respective effects and whether their combination leads to the best performance.

To further strengthen the empirical evaluation, it would be beneficial to include comparisons with existing state-of-the-art methods (e.g., Ding et al. (2024), Gao et al. (2024)). This would provide a clearer assessment of the method's impact and contribution within the broader literature. Especially since the text often lacks clarity on how the proposed approach aligns with or differs from the referenced works in the literature.

Theoretical Claims

I did not check the correctness of the proofs in detail.

However I have several concerns:

(1) You define the efficiency gain as $V_{err}^r - V_{err}$; however, the former term refers to the EIF without surrogates and the latter to the EIF with surrogates. Therefore, the bigger the EIF of the surrogate information, the more negative this quantity will be? In order to refer to a gain, wouldn't it be $V_{err} - V_{err}^r$?

(2) Throughout the text, starting from equation (1), you estimate a probability based on the quantity $Y$ over the target set $D_0$, where, according to your definition, $Y$ is unknown... can you clarify this matter?

(3) In Assumption 4.1 you employ the term $P(D=1 \mid x)$, but then you claim that the assumption "states that each individual should have probability of at least $c_0$ of being included in the target data (...)", while $D=1$ refers to the source data. Could you clarify this matter?

Experimental Designs Or Analyses

Some results could benefit from a more nuanced discussion to avoid potential overestimation of their implications.

For instance, in Figure 3, the primary difference between SURRO+GROUP and SAGCCI emerges when the number of instances is $N=1000$; for other values, their performance is quite similar across both considered metrics. At $N=1000$, SURRO+GROUP achieves better coverage but results in larger prediction sets. Given this trade-off, it is challenging to conclude that SAGCCI is definitively superior. Moreover, the performance gap appears to widen as the surrogate information becomes more representative, which is somewhat unexpected since both methods leverage surrogate information. Why does this difference arise, given that the only difference between the two approaches is that one employs the clusters and the other the raw group information? A deeper discussion of why these differences arise would strengthen the analysis. Additionally, when evaluating multiple objectives, it is important to avoid suggesting a dominance relationship unless it is well supported. Since the two objectives are measured on different scales, a difference of 1 in one metric may not correspond to an equivalent change in the other, further complicating direct comparisons.

Similarly, in Section 6, the claim that "SAGCCI achieves the best overall performance" may be somewhat misleading. The evaluation is multi-objective, and while SAGCCI performs best on some metrics, it ranks second on others. Since it does not consistently outperform across all criteria, asserting overall superiority requires additional justification. Given that the metrics are on different scales, it is also unclear whether a simple aggregation would confirm this claim.

Clarifying these aspects and being precise about comparative statements would enhance the rigor of the evaluation and make the findings more transparent.

Moreover, building on the previous comments, the claim at the beginning of Section 5 ("Empirical results show (...) while significantly shrinking prediction set sizes relative to existing methods") appears to be somewhat overstated and could benefit from a more precise formulation. The phrase "significantly shrinking" is subjective and lacks quantitative support. It would be more informative to specify the extent of the reduction in numerical terms, such as stating that the prediction set size is reduced by a certain percentage (e.g., 50%). Additionally, the assertion that the method always produces the smallest prediction set by a large margin with respect to other methods is not entirely accurate. It would be helpful to clarify under what conditions the method outperforms others and acknowledge any cases where it does not. This refinement would improve the clarity and credibility of the statement.

On the other hand, I also have an additional concern regarding the experiments in Section 6 (real data). While the study describes different sensitive groups based on race, the fairness evaluation is conducted in terms of age-morbidity groups.

Supplementary Material

I did not check the proofs provided in the supplementary material.

Relation To Broader Scientific Literature

The authors propose a combination of two existing techniques within the literature of conformal inference that are the clustering of the groups (e.g., Ding et al. (2024)) for a fair coverage, and employing surrogate information correlated with the true class label to construct tighter prediction sets (e.g., Gao et al. (2024)).

Essential References Not Discussed

The related works section focuses exclusively on the use of surrogates. Expanding it to include a discussion on conformal inference and its intersection with fairness guarantees would provide a more comprehensive overview of the field. This addition would help readers better understand the broader context and position the contribution within existing research.

Additionally, the paper does not clearly discuss the similarities and differences between the proposed approach and state-of-the-art methods. Providing such a comparison would highlight the specific advantages of the proposed method relative to existing works.

Furthermore, in Section 3.3, when discussing group fairness notions, the authors cite Hardt et al. (2016). However, this work specifically addresses a post-processing method for enforcing equality of opportunity and equalized odds, rather than general group fairness metrics. A more appropriate reference, such as [1], would better align with the discussion on group fairness notions.

[1] Pessach, D., & Shmueli, E. (2022). A review on fairness in machine learning. ACM Computing Surveys (CSUR), 55(3), 1-44.

Other Strengths And Weaknesses

This is a very important problem, and the authors' approach to analyzing it offers a particularly compelling perspective.

Other Comments Or Suggestions

The section titles do not adhere to the conference's recommended style, which requires capitalization of the words.

Author Response

Thanks for the comprehensive review. We now provide detailed responses to your concerns.

Theoretical Claims

  1. $V_{eff}^r$ and $V_{eff}$ refer to the variances of the estimators. Therefore, the more predictive the surrogates are, the smaller $V_{eff}$ will be and the larger (more positive) the quantity $V_{eff}^r - V_{eff}$ will be; see the display equation after this list.

  2. The quantile functions are trained on the source data $D_1$ and can be used on the target data $D_0$ under Assumption 4.2(b), $D \perp Y \mid S, X$.

  3. Thanks for catching this; we have updated Assumption 4.1 to read $c_0 < P(D=1 \mid x)$ instead.
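To pin down the sign convention discussed in item 1, the efficiency gain can be restated in display form (our restatement of the rebuttal's convention; the symbol $\Delta_{\mathrm{eff}}$ is illustrative and not from the paper):

```latex
% V_eff^r: asymptotic variance of the threshold estimator WITHOUT surrogates
% V_eff:   asymptotic variance WITH surrogates
% More predictive surrogates shrink V_eff, so the gain is nonnegative.
\Delta_{\mathrm{eff}} = V_{\mathrm{eff}}^{r} - V_{\mathrm{eff}} \geq 0
```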

Experimental Designs Or Analyses

  1. The improved efficiency of SAGCCI is achieved by approximately guaranteeing group-conditional coverage, a tradeoff between efficiency and fairness. The intention is to produce prediction sets that have good group-conditional coverage AND are not too large to be useful. We will remove overstatements of the superiority of our method and add: "the choice of an appropriate conformal prediction method should be assessed based on the problem setting at hand and the objectives of the analysis."

  2. For the presented experiments, SAGCCI and Surro+Group have similar performance when $N \geq 3000$. When we increase the number of groups $M$ to 10 and 20 while fixing $K=5$, Surro+Group performs much worse, even when $N \geq 5000$, since each group has fewer samples with which to fit the non-conformity scores, which deteriorates both AvgSize and CovGap. The new results are presented here. This point will be made clear in the paper.

  3. When $\sigma_S^2$ increases, the surrogate information becomes less representative across the groups, since each group tends to have more disparate surrogates in finite samples. Therefore, Surro+Group has a larger AvgSize in finite samples, since it is influenced more by the surrogates, which have larger variability.

  4. A more precise illustration is now presented: when $\sigma_S = 3$, the proposed SAGCCI reduces the size of prediction sets by over 60% on average compared to Surro+Group at $N=1000$, but increases the CovGap by only 7%. Compared to Surro+Standard, the sizes of the prediction sets are similar, but SAGCCI reduces the CovGap by over 80% at $N=3000$.

  5. Age-morbidity groups are of clinical relevance, and the fairness evaluation is of real-world concern -- we will update the paper to motivate the sensitive groups based on these subgroups rather than on race.

Essential References Not Discussed

  1. We now provide more discussion of the proposed method relative to state-of-the-art methods. Existing fair conformalized prediction methods (Liu et al., 2022; Vadlamani et al., 2025) focus on learning a fair quantile function (termed ConformalFairQuantl) to construct the non-conformity score and deliver reliable fair prediction sets. However, our method is more general: it is applicable to any non-conformity score and is not restricted to conformalized quantile scores. Their methods do not account for distributional shifts in covariates and cannot be used in our setting, where outcomes from the target data are missing; see the new results here, which show that the performance of ConformalFairQuantl is not satisfactory under our problem setup. The review paper and other fair conformalized prediction methods will be referenced in the paper.

Answers to Questions For Authors

  1. We realized that we were not clear about the implementation of the proposed method. A full algorithm is now presented here. Most of the models (e.g., quantiles and clustering) are fitted on $I_1 \cap D_1$, while the density ratio is fitted on $I_1$ with both source and target data.

  2. Although the clustering idea is introduced in Ding et al., its application to fairness-related problems is new, as noted by Reviewers G4EX ("Relation to Broader Scientific Literature") and uHDb ("Claims And Evidence").

  3. New results for a larger number of protected groups are presented here.

  4. If a subject's sensitive information is missing, we could assume that all subjects with missing sensitive information belong to one cluster and proceed with our approach.

  5. The cluster number is now chosen automatically by the simple heuristic described in Appendix B. To select the optimal cluster number, one could also use human-in-the-loop strategies, such as the Elbow method and other information criteria (e.g., AIC, BIC). This point will be added to Appendix B.

  6. Since $S$ is included in the generation of $Y$, higher variance in $S$ indicates that $S$ is more predictive of $Y$. Therefore, there is no upper limit on the efficiency gain as long as $S$ is included in the data generation.
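As a sanity check of this specific claim (a toy simulation under our own assumptions, not the paper's actual data-generating process), one can verify numerically that when $S$ enters the generation of $Y$, increasing the variance of $S$ increases the share of $\mathrm{Var}(Y)$ explained by $S$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta = 100_000, 1.0

# Toy data-generating process (not the paper's DGP): S enters Y directly,
# so Var(S) controls the share of Var(Y) explained by the surrogate.
for sigma_S in (0.5, 1.0, 3.0):
    S = rng.normal(0.0, sigma_S, n)
    Y = beta * S + rng.normal(0.0, 1.0, n)  # unit residual noise
    r2 = np.var(beta * S) / np.var(Y)       # fraction of Var(Y) due to S
    print(f"sigma_S = {sigma_S}: share of Var(Y) explained by S ~ {r2:.2f}")
```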

Reviewer Comment

Firstly, I would like to thank the authors for responding to my questions in such detail.

The examples you provided helped in understanding the differences between the fractional and the integral setting, and why the latter is important. For instance, the example you mention about vaccines motivates very well the problem you study, and it is in fact really interesting and important in the real world since their allocation is made online in many situations and it should be fair. I would encourage the authors to provide more such examples since it really helps to increase the readability and clarity of the paper. Furthermore, the examples help put more in context how important such a contribution is.

However, I believe that the paper still has room for improvement in terms of its writing, the clarity of the presented ideas, the cohesion and flow of the text, and in providing a stronger connection between theoretical findings and real-world implications or examples. Besides, the contribution appears weak without an empirical comparison to SOTA approaches. Providing further insight into the model's behaviour through empirical analysis would help better identify the settings in which the proposed approach excels and provide a better understanding of the underlying reasons. In a similar line, in its current version, it is difficult to situate the contribution within the literature and to distinguish which techniques or procedures are the authors' contributions and which are adopted from existing work (e.g., in the case of LiLA).

Therefore, in principle, I will maintain my score.

Author Comment

We thank the reviewer for the assessment of our paper and response to our rebuttals. Other examples to motivate fairness could be the Equal Credit Opportunity Act enacted in 1974, which is designed to ensure fair lending practices.

We would like to emphasize that one of the SOTA approaches the reviewer refers to is the fair conformalized method (denoted ConformalFairQuantl). We argue that the current fair conformalized methods focus on continuous outcomes, and the performance of our method, SAGCCI, on categorical outcomes is superior to these SOTA methods, as we show in experiments -- see here.

We will emphasize our contributions more clearly in the paper. We re-emphasize them here: 1) the novel development of clustered conformal inference to handle fairness regardless of the type of outcome, whereas the current SOTA (i.e., fair conformal methods) focuses only on conformalized quantile scores, which may not generalize directly to scores for categorical outcomes; 2) the development of efficient surrogate-assisted conformal inference with provable guarantees on (conditional) coverage and efficiency gains, which should be a valuable contribution to the ML community.

Review
Rating: 4

This work studies how to provide (conformal) prediction intervals for individual outcomes when additional information, in the form of surrogate outcomes, is available for both source and target datasets. The goal is to provide tight (short) prediction intervals that also achieve nominal coverage for all subgroups. The proposed method achieves this by clustering the groups based on their conformal score distributions, then leveraging the surrogates to provide per-cluster thresholds for the conformal scores (i.e. to construct prediction intervals). The authors provide theoretical results about the efficacy of their method and give experiments on both simulated and real data.

Questions For Authors

  • Can you comment on the relationship between $K$ and $M$? Experimentally, what did you set $K$ to? Did you try experiments for large $M$, or are there barriers to doing so with this method?
  • Can you comment on the extent to which you might generally expect score distributions to actually be clusterable?

Claims And Evidence

  • See below. Overall yes, though there are limitations to the main theoretical result as written; unclear if the experimental setup captures everything that the work hopes to.

Methods And Evaluation Criteria

  • Synthetic and real datasets make sense. See below for comments on experiments.

Theoretical Claims

  • I did not check proofs in detail; the statements and high-level structure of the arguments are consistent with what I expect.
  • I am guessing that the constants in Theorem 4.6, if they had been worked out, might be very large; thus, though the result is technically given as a finite-sample bound, its interpretation should probably be more qualitative ("we should more or less expect this to work") rather than precisely quantitative ("we need X samples to guarantee $(\alpha+\epsilon)$-valid intervals with probability $1-\delta$"). That said, I think this is fine given that the experiments seem to suggest the method can do well with reasonably small $n$.
  • A detail that seems important but is mostly glossed over is that (a) the score distributions must be clusterable and (b) the clustering method must work well, since this error shows up additively. It is not obvious to me that (a) is always possible.

Experimental Designs Or Analyses

  • Based on the exposition in Section 4, I thought that $M$ might be very large and that $K \ll M$. I was thus a bit confused/surprised to see $M = 3$; and I couldn't find what $K$ was set to (although for it to be anything nontrivial, I'm guessing it must be 2). Unless the $M$ groups are actually split up into $K \gg M$ clusters, so that cluster $k$ is only a subset of the samples from group $m$? And then the clustering approach also has the added benefit of adding structure to intra-group variability? But in that case I feel like Section 4 should be written differently to reflect that.
  • The benchmarks/ablations all make sense, though I’m not sure why each plot only shows a subset of them.
  • Based on Figure 3, it seems that for $n \geq 3000$, SAGCCI and Surro+Group are essentially identical?

Supplementary Material

  • Experimental details were reviewed. Proofs were not checked in detail, except to get a sense of where constants are coming from.

Relation To Broader Scientific Literature

  • This work utilizes tools and ideas from causal inference (surrogates) and combines them with a clustering idea (which, though proposed in other fairness-related works, I haven't seen in this context) to address a well-established problem in the conformal prediction community.
  • It would be nice to understand whether, for problem settings where surrogate outcomes are unavailable, clustering might still be beneficial.

Essential References Not Discussed

  • N/A to my knowledge

Other Strengths And Weaknesses

  • N/A

Other Comments Or Suggestions

  • It would be nice, in terms of presentation, to have a pseudocode algorithm box summarizing the procedure. I think Figure 1 is supposed to illustrate this, but it's honestly very confusing to me.
Author Response

We thank the reviewer for the detailed comments and provide itemized responses to your questions.

Theoretical Claims & Questions For Authors

  1. The asymptotic coverage in Theorem 4.6 should be interpreted as $P_{V \sim P_{D_0}}(y \in C(W; r_{\alpha}^k) \mid Z = z) \geq 1 - \alpha - o(1)$ as $N \rightarrow \infty$, since the constants do not depend on the sample size $N$. Therefore, the $(\alpha, \delta)$-PAC guarantee can be approximately met with negligible errors; see the last paragraph of Section 3.1 of the paper by Yang et al. titled "Doubly Robust Calibration of Prediction Sets under Covariate Shift", published in JRSS-B. The constants are a result of the use of concentration inequalities. We will update our statements to be more precise in the revised paper.

  2. Thanks for pointing out that the protected groups may not always be clusterable. We feel this question falls under the more general concern of "clustering on non-clusterable groups". We have provided a simple heuristic in Appendix B for choosing the number of clusters, inspired by Appendix B.4 of "Class-Conditional Conformal Prediction with Many Classes". We could further resort to human-in-the-loop strategies to avoid over-clustering non-clusterable groups, such as the Elbow method and other information criteria (e.g., AIC, BIC) in a cross-validation setting. This point will be added to Appendix B of our paper.
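A minimal sketch of such an elbow-style selection, assuming each protected group is embedded as a vector of empirical quantiles of its non-conformity scores (this embedding choice is our assumption, in the spirit of the clustering step; helper names are hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans

def group_embeddings(scores_by_group, probs=(0.1, 0.25, 0.5, 0.75, 0.9)):
    """Embed each group as a vector of empirical score quantiles
    (an assumed embedding; the paper's exact features may differ)."""
    return np.array([np.quantile(s, probs) for s in scores_by_group])

def elbow_inertias(embeddings, k_max):
    """Within-cluster sums of squares for K = 1..k_max; pick K at the
    'elbow' where the inertia stops dropping sharply."""
    return [
        KMeans(n_clusters=k, n_init=10, random_state=0)
        .fit(embeddings)
        .inertia_
        for k in range(1, k_max + 1)
    ]
```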

Experimental Designs Or Analyses & Questions For Authors

  1. The number of protected groups $M$ for the presented experiments is 3, and the number of clusters is $K=2$, as expected. To achieve the tradeoff between efficiency and fairness, $K$ should be no larger than $M$ but larger than 1 (i.e., $1 < K \leq M$).

  2. The reason we only show a subset of these considered methods is to avoid presenting too much information in one figure and to avoid clutter. Figure 2 focuses on comparing SAGCCI with NoSurro+Cluster and WCQR (with surrogates) to show that we have leveraged surrogates in the most efficient way by leveraging the EIF. Figure 3 focuses on comparing SAGCCI with Surro+Group and Surro+Standard to showcase the advantages of using group-clustered conformal inference to achieve a good balance between efficiency and fairness.

  3. For the presented experiments, SAGCCI and Surro+Group achieve similar performance when $N \geq 3000$. When we increase the number of groups $M$ to 10 and 20 while fixing $K=5$, Surro+Group has noticeably worse performance when $M=20$, even when $N \geq 5000$, since each group has fewer samples with which to fit the non-conformity scores, which deteriorates both AvgSize and CovGap. These new results are presented here. This point will be made clear in the revised paper.

Relation To Broader Scientific Literature

  1. When surrogate outcomes are unavailable, the benefits of clustering can be shown by comparing the performance of NoSurro+Cluster with NoSurro+Standard and NoSurro+Group; new results are shown here, where $\sigma_S = 0$ indicates that there are no surrogate effects. Full comparisons will be added to the Appendix for completeness.

Other Comments Or Suggestions

  1. We agree that Figure 1 can be confusing to readers and have therefore replaced it with a detailed pseudo-code algorithm, which can be found here.
Reviewer Comment

Thanks for the detailed response and new results; they address most of my concerns and I will update my score.

On the theory front, I understand the coverage guarantee of $1 - \alpha - o(1)$; my point is that the meaning of asymptotic coverage is itself a subtle point (notably different from, e.g., the conventional guarantees of vanilla conformal prediction, which do not require concentration-type arguments), and it would be worth highlighting that in the paper.

Author Comment

Thank you for taking the time to review our rebuttal, new results, and updating your score, as well as clarifying your point about the asymptotic coverage vs. conventional guarantees of vanilla conformal. We will make sure to highlight this point in the paper!

Review
Rating: 3

This paper introduces a conformal inference algorithm (SAGCCI) aimed to produce prediction sets that satisfy group-conditional coverage guarantees. The algorithm is intended to be robust to missing information settings, where sometimes the primary outcome (around which we are trying to produce prediction sets) is missing in the training data, but assuming the existence of a surrogate outcome that may be correlated and easier to collect data for. It also achieves small mis-coverage rates even on groups of small size with little data to extrapolate from.

To achieve this, SAGCCI does two main things: (1) it clusters together groups with similar non-conformity score distributions to more accurately estimate the $(1-\alpha)$-quantile for each of them, by pooling larger amounts of data, and (2) it uses an efficient influence function that leverages the surrogate outcomes to estimate the non-conformity score threshold needed to satisfy the desired coverage level, even when primary outcomes are absent and the distribution may differ.

The main contribution is combining the clustering approach (allowing you to extrapolate about low data groups using other groups with similar score distributions) with using EIF for a more efficient estimator of the non-conformity threshold (to produce tighter prediction sets).

Questions For Authors

  1. I am quite unclear about where the source data (where primary outcomes are observable) and the target data are used together, and where they are kept separate. In Section 3.2, it seems that the source data is used to estimate a non-conformity threshold over the target data (which may come from a different distribution). But in the implementation section (Section 4.4), it says that the data is combined and then split into folds, and in the experiments they seem to be used together as well. Could the authors clarify this point?

  2. If the fold $I_1$ includes some target data (Section 4.4), how are the quantile models trained when some of the primary outcome data is missing?

Claims And Evidence

The theoretical claims seem sound, and the experiments bear out that the paper’s algorithm achieves smaller prediction sets than existing methods.

Methods And Evaluation Criteria

Yes, it seems fine.

Theoretical Claims

I did not check the proof for Theorem 4.4 (in the appendix), but the remaining proofs look good.

Experimental Designs Or Analyses

The experiments looked good. Since the role of clustering is to provide better coverage guarantees for smaller groups, I would have liked to see the sizes of the groups over which we are trying to achieve coverage; I am not sure whether this is mentioned somewhere, but I couldn't find it. Also, it is not clear what group-conditional conformal inference algorithm the authors are using as a baseline to compare against -- is it the approach of calibrating thresholds separately for each group, as in Vovk (2012)? The authors cite "Conformal Prediction with Conditional Guarantees" (Gibbs et al., 2023) as having algorithms for group-conditional guarantees, but I don't think they compare against this.

Supplementary Material

I looked at the additional experimental results and some of the lemma proofs.

Relation To Broader Scientific Literature

The closest connection in terms of the clustering idea is “Class-Conditional Conformal Prediction with Many Classes” (Ding et al) as the authors cite. This work does not extend the ideas of the original paper – the clustering is applied to abstractly defined disjoint groups here, rather than groups defined implicitly by class label.

There is existing work on conformal inference for achieving group-conditional guarantees (Jung et al 2022, Gibbs et al 2023) in distributional settings as well as in adversarial settings (Bastani et al 2022), which may have connections as the adversarial case can handle settings where primary outcomes may be unavailable. These algorithms may not be directly comparable to SAGCCI as they also allow for overlapping groups.

Essential References Not Discussed

Not to my knowledge.

Other Strengths And Weaknesses

Strengths:

The setting for achieving group-conditional coverage even in missing information settings is interesting. The experiments show the efficacy of this approach with smaller prediction set sizes.

Weaknesses:

Via the given algorithm, group-wise coverage can only be achieved on disjoint groups in the dataset; if one wanted to achieve fairness in terms of coverage over protected groups that overlap, this algorithm could not be used. I also find the paper a bit lacking in originality. The authors claim two innovations: using clustering, and the surrogate-assisted EIF. As far as clustering goes, the process seems identical to that in Ding et al. (2024), applied to general disjoint groups rather than class-label groups. I am not very familiar with derivations of EIFs, so I am not sure whether the derivation here is standard; perhaps other reviewers could weigh in, or maybe the authors could clarify this.

Other Comments Or Suggestions

I would have liked to see the size of the groups defined for group-conditional coverage in each of the experiments.

Author Response

Thanks for the careful review and nice comments! We here provide detailed responses to your comments.

Experimental Designs Or Analyses & Other Comments Or Suggestions

  1. The number of groups was $M=3$. We have conducted additional experiments with a larger number of groups (e.g., $M=10, 20$), where similar conclusions can be drawn; the results are presented here. The considered group-conditional conformal algorithm (i.e., Surro+Group) is extended from Vovk (2012) and is similar to Gibbs et al. (2023), where we further account for the covariate shifts across the groups. However, this method neglects the fairness condition and may have an inflated CovGap, as seen in Figure 3. We will emphasize this point in the paper.

Other Strengths And Weaknesses & Relation To Broader Scientific Literature

  1. Thanks for bringing up the question about overlapping protected groups! Our method is actually applicable to overlapping protected groups with a relatively minor extension. Let $G_{fair}$ represent the set of groups (potentially overlapping) across which we aim to impose fairness, which is essentially a function of the protected groups $Z$ (see Assumption 2.1 in "Fairness with Overlapping Groups"). Then our method can be used on this new set of groups $G_{fair}$. Furthermore, our method can incorporate clustering mappings other than k-means, such as fuzzy k-means or overlapping k-means, which allow for overlapping clusters as well. Thank you for pointing us to more papers on conformal inference for achieving group-conditional guarantees. We will check them and reference them in the revised paper.

  2. Although the clustering idea is introduced in Ding et al., its application to fairness-related problems is new, as noted by Reviewers G4EX ("Relation to Broader Scientific Literature") and uHDb ("Claims And Evidence"). Furthermore, the derivation of the EIF with surrogates is an interesting theoretical result in its own right. We further quantify the efficiency gain from leveraging surrogates in Corollary 4.5 and establish the coverage guarantee in Theorem 4.6. See the JRSS-B paper "On the role of surrogates in the efficient estimation of treatment effects with limited outcome data" for related theory, which develops the EIF with surrogates to estimate average treatment effects. Our work extends this theory to the conformal prediction setting. This point will be mentioned in the paper.

Questions For Authors

  1. We noticed that we were not clear in explaining the split conformal inference strategy in Section 4.4. We have provided new pseudo-code to illustrate this algorithm. To answer your question specifically: the two datasets are combined and split into training and calibration folds $I_1$ and $I_2$. The intersection of the training and source data (i.e., $I_1 \cap D_1$) is used to fit the non-conformity scores (e.g., the quantile functions), and the full training fold $I_1$ is used to fit the density ratio $\pi_D(X)$. The reason for the random splitting of the combined data is to ensure that the density ratio $\pi_D(X)$ remains exchangeable across folds $I_1$ and $I_2$. Next, these fitted models are used in the EIF, evaluated on the calibration set $I_2$, to obtain the estimator of the cluster-specific threshold $r_{\alpha}^k$ over the target data. The full algorithm is provided here.
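To make the data flow concrete, here is a simplified Python sketch of this split strategy. It is our reconstruction from the description above, not the authors' code: it substitutes a plug-in weighted quantile for the EIF-based threshold estimator and omits the surrogate augmentation, and the helper names are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def calibrate_cluster_thresholds(X, D, Y, clusters, fit_score, alpha=0.1):
    """D[i] = 1 for source rows (Y observed), 0 for target rows; `clusters`
    holds each row's cluster label; `fit_score(X, Y)` returns a function
    (X, Y) -> non-conformity scores (all hypothetical helpers)."""
    # Random split of the combined source/target data into a training fold
    # I1 and a calibration fold I2 keeps the density ratio exchangeable.
    i1, i2 = train_test_split(np.arange(len(X)), test_size=0.25,
                              random_state=0)

    # Non-conformity score fitted on I1 ∩ D1 (source rows of the training fold).
    src1 = i1[D[i1] == 1]
    score = fit_score(X[src1], Y[src1])

    # Density ratio pi_D(x), proportional to P(D=0|x)/P(D=1|x), fitted on
    # the full training fold I1 via logistic regression.
    clf = LogisticRegression(max_iter=1000).fit(X[i1], D[i1])

    # Cluster-specific thresholds: weighted (1 - alpha) quantile of the
    # calibration scores, reweighting source points towards the target.
    src2 = i2[D[i2] == 1]
    p1 = clf.predict_proba(X[src2])[:, 1]
    w = (1.0 - p1) / p1
    s = score(X[src2], Y[src2])
    thresholds = {}
    for k in np.unique(clusters[src2]):
        m = clusters[src2] == k
        order = np.argsort(s[m])
        cum = np.cumsum(w[m][order]) / w[m].sum()
        idx = min(np.searchsorted(cum, 1 - alpha), m.sum() - 1)
        thresholds[k] = s[m][order][idx]
    return thresholds
```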

References

  1. Vovk, V. Conditional validity of inductive conformal predictors. In Asian conference on machine learning, pp. 475–490. PMLR, 2012.

  2. Gibbs, I., Cherian, J. J., and Candès, E. J. Conformal prediction with conditional guarantees. arXiv preprint arXiv:2305.12616, 2023.

Final Decision

The paper introduces SAGCCI, an innovative algorithm for group-conditional conformal inference, leveraging clustering and surrogate outcomes to enhance fairness and efficiency in prediction intervals. The theoretical foundations are sound and well-aligned with the broader literature, offering an interesting and novel perspective on conformal inference. However, the empirical evaluation raises some concerns, which limit the ability to fully assess the method's practical contributions.

The primary issue lies in the scope and depth of the empirical evaluation. While the authors conduct experiments on a synthetic dataset and a single real-world dataset, these settings may not be sufficient to establish the generalizability and robustness of the proposed approach. Reviewers expressed a desire for additional benchmarks against state-of-the-art methods, as well as a more detailed exploration of key factors, such as the variability of group sizes and the trade-offs between fairness and efficiency metrics. These elements are critical to understanding the practical implications of the method and are not fully addressed in the current work.

Although the authors provided further experiments and clarifications in their rebuttal, some aspects remain unclear. For instance, the reported improvements in prediction set sizes are accompanied by increases in coverage gaps, but the trade-offs are not analyzed in sufficient depth. Additionally, the definition and interpretation of the efficiency gain metric, a key contribution of the paper, could be more transparent and grounded in practical scenarios.

Overall, while the paper offers a promising theoretical framework and addresses an important problem, the current empirical evaluation does not fully substantiate its claims. Expanding the experimental analysis to include a broader range of datasets, comparisons with existing methods, and a deeper exploration of the trade-offs would significantly strengthen the paper’s impact and clarity.