PaperHub
6.8 / 10
Poster · 4 reviewers (min 4, max 5, std. dev. 0.4)
Ratings: 5, 4, 4, 4
Confidence: 3.8
Novelty: 3.0 · Quality: 2.8 · Clarity: 2.8 · Significance: 2.8
NeurIPS 2025

Synthetic-powered predictive inference

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29
TL;DR

Predictive inference framework that leverages synthetic data to improve sample efficiency in data-scarce settings, while providing rigorous, distribution-free finite-sample coverage guarantees.

Abstract

Keywords
Distribution-free Inference, Conformal Prediction, Uncertainty Quantification, Domain Adaptation, Generative Models

Reviews and Discussion

Official Review
Rating: 5

This paper proposes a new method called Synthetic-Powered Predictive Inference (SPPI), which focuses on improving conformal prediction (CP) when the amount of real calibration data is small. The key idea is to use synthetic data from a different distribution and apply a score transporter that maps scores from the real data to match the distribution of synthetic scores. The method provides a nearly tight coverage guarantee as well as a worst-case coverage guarantee, and is shown to work well in both theory and experiments.
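For concreteness, here is a minimal sketch of the core idea as just described (an illustration, not the authors' exact SPPI algorithm): the real test score is transported into the synthetic score space via an empirical quantile mapping and then compared against the conformal quantile computed on the large synthetic calibration set; the β-windows and guardrail machinery of the paper are omitted.

```python
import numpy as np

def sppi_idea_sketch(real_scores, synth_scores, test_score, alpha=0.05):
    """Simplified illustration of the SPPI idea (not the paper's exact algorithm):
    transport the test score from the real score distribution to the synthetic
    score distribution via empirical quantile mapping, then compare it to the
    (1 - alpha) conformal quantile of the abundant synthetic scores."""
    m, N = len(real_scores), len(synth_scores)
    # empirical quantile level of the test score among the scarce real scores
    level = np.searchsorted(np.sort(real_scores), test_score, side="right") / (m + 1)
    # map that level onto the synthetic score distribution (the "transport" step)
    transported = np.quantile(synth_scores, level)
    # standard conformal threshold, computed on the large synthetic calibration set
    k = min(int(np.ceil((1 - alpha) * (N + 1))), N)
    threshold = np.sort(synth_scores)[k - 1]
    return transported <= threshold  # whether this candidate would be kept in the set

# toy usage with well-aligned synthetic scores
rng = np.random.default_rng(0)
real = rng.normal(size=20)      # scarce real calibration scores
synth = rng.normal(size=2000)   # abundant synthetic scores
print(sppi_idea_sketch(real, synth, test_score=1.2))
```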

Strengths and Weaknesses

Strengths

  • This paper is based on a brilliant idea and presents a solid method. It proposes a new way to build prediction sets when applying conformal prediction with a pretrained model and only a very small amount of calibration data is available. Also, the authors carefully prepare for possible drawbacks in Section 3.4, which I really appreciated.

  • The presentation of the paper is also very clear. When the authors explain the kind of problems their method can be applied to, they describe the feasibility of the situation very well in lines 25–34. This is actually an important and practical issue in the CP literature. Also, they clearly state the theoretical results of their method.

Weaknesses

  • The only question I had while reading this paper is the following. Theorem 3.5 provides a safety bound when P and Q are not close. While the authors acknowledge this potential discrepancy, their explicit lower bound on coverage when P and Q are not the same is described somewhat vaguely. In particular, there is no quantitative analysis showing, under some reasonable experimental settings, how the lower bound behaves for a fixed value of m and different levels of discrepancy between P and Q.

Questions

I think this is a solid work. But if the authors clearly answer the questions and suggestions below, I’d be happy to further increase my score.

Questions

  • Is there a specific analysis which shows the empirical behavior of lower bounds when P and Q show different levels of discrepancy?

  • Since this algorithm has an extra hyperparameter β that controls the window size, I'm curious about how sensitive the prediction sets are to different values of β.

Suggestions

  • The idea in Section 3.4.1 about matching the synthetic scores depending on real scores deserves more than just a short remark. To me, using another model (e.g. least squares) to align synthetic calibration scores to match the distribution of real scores might raise concerns about overfitting when new test points come in. But the authors show that this doesn’t happen, which is surprising and kind of counterintuitive. I think it would be great if the paper could discuss this part in more detail.

Limitations

yes

Final Justification

The authors addressed my (minor) concerns about coverage change under the level of distribution shift, and the sensitivity under the choice of β, with clear theoretical and empirical support. I think this work is very solid, and I increase my rating.

Formatting Concerns

none

Author Response

Thank you for your thoughtful review and encouraging feedback. Below, we address your comments in detail.

  1. The only question I had while reading this paper is the following. Theorem 3.5 provides a safety bound when P and Q are not close. While the authors acknowledge this potential discrepancy, their explicit lower bound on coverage when P and Q are not the same is described somewhat vaguely. In particular, there is no quantitative analysis showing, under some reasonable experimental settings, how the lower bound behaves for a fixed value of m and different levels of discrepancy between P and Q.

The bounds in Theorem 3.5 are "distribution-free guardrail bounds" that do not depend on either P or Q, or on their level of discrepancy. In other words, the lower bound is for the following quantity:

\inf_{\text{all distributions } P, Q} \mathbb{P}(\text{coverage event}),

and similarly, the upper bound is for the supremum of the coverage rates.

These bounds are determined solely by the sample sizes N, m and the levels α, β, and are independent of any unknown distributions or observed data. Therefore, once N, m, α, and β are fixed, these bounds are constant and known. In fact, one can set β in advance to control the lower bound on the coverage, as detailed in Algorithm 4 of the Appendix. In our experimental plots (e.g., Figures 1 and 2), these bounds are also shown (labeled as "Lower bound" and "Upper bound"), and we observe that, across different levels of synthetic data quality, the coverage of the SPPI method always lies within these bounds---as guaranteed by Theorem 3.5.

This is, in fact, one of the main advantages of SPPI: it provides a guardrail bound on the quality of inference. While high-quality synthetic data is certainly preferable, the method still yields meaningful inference even when the synthetic data is of low quality---which is important, given that we typically lack prior knowledge about the data quality.

  2. Is there a specific analysis which shows the empirical behavior of lower bounds when P and Q show different levels of discrepancy?

Regarding the lower bounds in Theorem 3.5, we believe that our response to the previous comment, along with the experimental results in the manuscript, sufficiently addresses the question.

Theorem 3.3, on the other hand, provides coverage bounds that indeed depend on the discrepancy between the real and synthetic score distributions. The final theoretical bounds are given by the tighter of the two---that is, the maximum of the lower bounds and the minimum of the upper bounds from Theorems 3.3 and 3.5.

If the reviewer is referring to the bounds in Theorem 3.3, we kindly refer the reviewer to the new results presented in Item 1 in our response to Reviewer aUpi. There, the table presents both the bounds from Theorems 3.3 and 3.5---as well as the best of the two (denoted as "effective bounds")---on synthetic data where the total variation distance between the real and synthetic score distributions is known. These results demonstrate how the bounds adapt to different levels of discrepancy between the real and synthetic score distributions: Theorem 3.3 dominates the bound when P and Q are close, whereas Theorem 3.5 dominates when P and Q are more distinct.

  3. Since this algorithm has an extra hyperparameter β that controls the window size, I'm curious about how sensitive the prediction sets are to different values of β.

We agree that it would be interesting to examine the performance of our method under different choices of hyperparameters. We have conducted additional experiments to explore this, and we will include the relevant results in the revision (due to space limitation and the changed rules that do not allow pdf responses, we are only able to provide a summary here). We appreciate the suggestion.

At a theoretical level, the sensitivity depends on the similarity between the synthetic and real score distributions. Specifically, this can be seen in the score-transportation step: if the two score distributions differ substantially, then the transported score T(S_{m+1}) is likely to lie at one of the endpoints of the corresponding window---even under a small β. As a result, the coverage of the SPPI prediction set will reach the guardrail bound in Theorem 3.5, indicating that we gain little from the synthetic data.

On the other hand, if the synthetic scores closely resemble the real scores, the transported score is likely to lie within the corresponding window rather than at its endpoints---even for relatively large values of β. In this case, the SPPI prediction set resembles conformal prediction with a larger sample size, achieving near 1-α coverage, and the effect of the hyperparameter becomes relatively minor.

We include experiments on the ImageNet data below. The table reports the performance of all methods across different values of β for the Siberian husky and lighter classes. The results align with the behavior described above: for the lighter class, the synthetic scores closely resemble the real ones, and accordingly, the performance of SPPI remains roughly the same across different β values. In contrast, for the Siberian husky class, where the synthetic scores deviate significantly from the real scores, the performance of SPPI closely aligns with the guardrail bounds. Note that this table also presents how the lower bound of Theorem 3.5 increases with β, becoming closer to the nominal 1-α level.

Table: Results for the ImageNet data as a function of β
Conditional coverage (%) and prediction set size across 100 trials for different β values, for the lighter and Siberian husky classes. The target coverage is 95%. Standard errors are in parentheses. See Figure 1 (main text) for additional setup details.

| Class | β | THM 3.5 lower bound (%) | Coverage: OnlyReal | Coverage: OnlySynth | Coverage: SPPI | Size: OnlyReal | Size: OnlySynth | Size: SPPI |
|---|---|---|---|---|---|---|---|---|
| Lighter, light | 0.005 | 75 | 100 (±0) | 95.2 (±0.1) | 95.3 (±0.1) | 30 (±0) | 3.9 (±0) | 4.5 (±0) |
| | 0.01 | 75 | 100 (±0) | 95.2 (±0.1) | 95.3 (±0.1) | 30 (±0) | 3.9 (±0) | 4.5 (±0) |
| | 0.05 | 81.2 | 100 (±0) | 95.2 (±0.1) | 95.3 (±0.1) | 30 (±0) | 3.9 (±0) | 4.5 (±0) |
| | 0.1 | 87.5 | 100 (±0) | 95.2 (±0.1) | 95.5 (±0.1) | 30 (±0) | 3.9 (±0) | 4.8 (±0) |
| | 0.2 | 87.5 | 100 (±0) | 95.2 (±0.1) | 95.5 (±0.1) | 30 (±0) | 3.9 (±0) | 4.8 (±0) |
| | 0.3 | 87.5 | 100 (±0) | 95.2 (±0.1) | 95.5 (±0.1) | 30 (±0) | 3.9 (±0) | 4.8 (±0) |
| | 0.4 | 93.8 | 100 (±0) | 95.2 (±0.1) | 96.8 (±0.2) | 30 (±0) | 3.9 (±0) | 5.8 (±0) |
| | 0.5 | 93.8 | 100 (±0) | 95.2 (±0.1) | 96.8 (±0.2) | 30 (±0) | 3.9 (±0) | 5.8 (±0) |
| Siberian husky | 0.005 | 75 | 100 (±0) | 56.7 (±0.3) | 74.9 (±1.0) | 30 (±0) | 2.3 (±0) | 2.7 (±0) |
| | 0.01 | 75 | 100 (±0) | 56.7 (±0.3) | 74.9 (±1.0) | 30 (±0) | 2.3 (±0) | 2.7 (±0) |
| | 0.05 | 81.2 | 100 (±0) | 56.7 (±0.3) | 80.8 (±1.0) | 30 (±0) | 2.3 (±0) | 2.8 (±0) |
| | 0.1 | 87.5 | 100 (±0) | 56.7 (±0.3) | 87.4 (±0.9) | 30 (±0) | 2.3 (±0) | 3.2 (±0) |
| | 0.2 | 87.5 | 100 (±0) | 56.7 (±0.3) | 87.4 (±0.9) | 30 (±0) | 2.3 (±0) | 3.2 (±0) |
| | 0.3 | 87.5 | 100 (±0) | 56.7 (±0.3) | 87.4 (±0.9) | 30 (±0) | 2.3 (±0) | 3.2 (±0) |
| | 0.4 | 93.8 | 100 (±0) | 56.7 (±0.3) | 94.1 (±0.6) | 30 (±0) | 2.3 (±0) | 4.2 (±0) |
| | 0.5 | 93.8 | 100 (±0) | 56.7 (±0.3) | 94.1 (±0.6) | 30 (±0) | 2.3 (±0) | 4.2 (±0) |

  4. The idea in Section 3.4.1 about matching the synthetic scores depending on real scores deserves more than just a short remark. To me, using another model (e.g. least squares) to align synthetic calibration scores to match the distribution of real scores might raise concerns about overfitting when new test points come in. But the authors show that this doesn't happen, which is surprising and kind of counterintuitive. I think it would be great if the paper could discuss this part in more detail.

Thank you for pointing this out; we will clarify this issue in the revised paper. The core concern---how a transformation that depends on the real scores can avoid overfitting and still provide valid inference---is precisely what is addressed by Theorem 3.5 and its extensions (Corollaries 3.6 and 3.7). These results show that the guardrail coverage bounds hold without any restrictions on the synthetic scores---they do not have to be i.i.d. and can even depend on the real data. Roughly speaking, this is because, in the derivation of the bounds, any effect of the synthetic data is eliminated by a deterministic component inherent in the structure of the SPPI procedure.

Therefore, at least in terms of maintaining the guardrail bounds, we are allowed to do "anything"---including double dipping, overfitting, and other steps that would typically compromise statistical validity in other settings and procedures.

This theoretical flexibility allows users to adapt the synthetic scores to better align with the real scores in a data-driven manner, without compromising the validity of our guardrail coverage guarantees and without requiring data splitting. In our experiments (Figure 2 and Appendix I.2), we demonstrate that the K-nearest subset approach can improve performance by bringing the empirical coverage rate closer to 1-α. This method is motivated by prior work (e.g., "Class-conditional conformal prediction with many classes" by Ding et al.), which suggests that certain classes may have score distributions more similar to the class of interest. More generally, whenever there is reason to believe that a subset of the synthetic data---or a transformation thereof---better aligns with the real scores, it is worthwhile to apply such methods.

A closing remark

Thank you for the opportunity to clarify our theoretical results and for your thoughtful feedback on our method. If any part remains unclear, we would be happy to elaborate further. We sincerely believe our response addresses your concerns and will be helpful as you reconsider your evaluation.

Comment

Thanks for the detailed and thoughtful response. Your clarification on the theoretical bounds under the discrepancy between P and Q addressed my concern. Also, the additional empirical result about β is convincing and really addressed my concerns. Given all that, I will update my score in support of acceptance. Great work!

Comment

We sincerely thank the reviewer for their thoughtful engagement and encouraging feedback. We’re truly gratified that our clarifications and additional experiments addressed your concerns.

All the clarifications and new results prompted by your comments—as well as those from other reviewers—will be incorporated into the final version of the paper. We greatly appreciate your support for acceptance. Thank you!

Official Review
Rating: 4

This paper proposes SPPI, a framework for utilizing synthetic data in constructing conformal prediction sets. The authors propose to use a score transporter to bring the real scores to synthetic scores, and then use the large set of synthetic scores to calibrate the prediction set. Theoretical guarantees in terms of the distributional gap between real and synthetic data, and in terms of worst-case performance are provided. The proposed method is applied to three applications which show the advantages of the proposed methods in scenarios where high-quality data are scarce.

Strengths and Weaknesses

Strengths:

  1. The paper studies an important problem
  2. The experiments cover rich applications
  3. The paper is clearly written and relatively easy to follow

Weaknesses: (see Questions for detailed questions)

  1. The theoretical guarantees are a bit hard to interpret
  2. The intuitions of the proposed approaches are not adequately discussed
  3. The choice of hyperparameters is not clear in the experiments

Questions

  1. In Theorem 3.3, the term \varepsilon_{P,Q}^{m+1} depends on the TV distance of order statistics, which is a bit mysterious. How might this term depend on the TV distance between P and Q?

  2. The slack term β is discussed in lines 167-172 as only taking effect in extreme scenarios. Is there a way to derive a bound for that extreme event in order to obtain a sharper analysis?

  3. How should one interpret the quantities in Theorem 3.5? In particular, what do R_j^+ and R_j^- quantify, intuitively?

  4. In general, I think the theoretical ideas behind the method are not discussed adequately. It might be helpful to include some high-level discussion on why the method could improve the efficiency and how the coverage is affected theoretically.

  5. It is also unclear why one might improve statistical efficiency by matching the distribution of real and synthetic scores. Intuitively, any efforts to align the scores will be subject to the intrinsic 1/\sqrt{m} uncertainty in the real scores. Is this uncertainty still there, while you address the "bias" due to insufficient data?

  6. In Section 3.4, the theoretical results are not very illuminating, since attaining the bounds in Theorem 3.5 really doesn't tell us anything about how the coverage is improved. Can you derive some theory regarding the coverage/efficiency that is related to the quality of learning the synthetic score function?

  7. Why do you choose β = 0.4 in the experiments? How do the results change with the choice of β? How would you recommend people to choose the value of β?

  8. In Section 4.1.2, it is intuitively unclear to me why selecting samples from different label classes can help the prediction for a specific label class. Could you explain?

  9. The choice of m = 15 is extremely small for Section 4.2. How would the results change with the value of m? Can practitioners get a sense of how large N should be compared with m to ensure that one gets some improvement?

Limitations

Yes.

Final Justification

Revisions addressed some of my concerns.

Formatting Concerns

N/A

Author Response

Thank you for your thoughtful review. We're glad you found the paper easy to follow. It appears that your major concerns are related to high-level theoretical aspects of our method and the data-driven adaptation of the synthetic scores. We start by addressing these two concerns and then proceed by responding to the rest of your comments.

Q4: It might be helpful to include some high-level discussion on why the method could improve the efficiency and how the coverage is affected theoretically.

We thank the reviewer for this suggestion. As discussed in the introduction, the key question is how to leverage synthetic data---which can be abundantly generated through generative AI, for instance---to improve the sample efficiency of predictive inference in settings where data is scarce.

A naive approach of simply applying existing methods to the large synthetic dataset leverages the increased sample size, but coverage is not guaranteed due to the unreliability of the synthetic data. One might suggest evaluating the quality of the synthetic data in advance, but such evaluation is not feasible in the setting of interest, where real data is scarce.

The proposed method, SPPI, addresses this issue by providing distribution-free guardrail bounds, enabling the procedure to achieve both of the following properties:

  • When P ≈ Q, SPPI behaves like conformal prediction with a large sample size, improving efficiency and reducing coverage variability. This is essentially the result of Theorem 3.3.
  • When Q does not approximate P well, SPPI provably remains robust and is guaranteed to yield prediction sets whose coverage rate falls within the theoretical guardrail bounds established in Theorem 3.5.

Our theoretical result is based on both Theorems 3.3 and 3.5, where the effective bounds are the tighter of the two in each case. Specifically, Theorem 3.3 provides coverage bounds that depend on the distance between the real and synthetic score distributions---the smaller this distance, the closer the bound is to the nominal 1-α level. On the other hand, Theorem 3.5 provides bounds that are determined solely by the sample sizes N, m and the levels α, β. In fact, one can set β in advance to control the lower bound on the coverage, as detailed in Algorithm 4 of the Appendix.

In sum, SPPI does not require prior knowledge of how well Q approximates P, and yet it automatically adapts to the quality of the synthetic data, ensuring valid coverage.

Q6-A: In Section 3.4, the theoretical results are not very illuminating, since attaining the bounds in Theorem 3.5 really doesn't tell us anything about how the coverage is improved.

As noted above, Theorem 3.5 does not quantify the impact of synthetic data quality on coverage. Instead, it provides guardrail coverage guarantees that hold regardless of the synthetic score distribution.

Perhaps the reviewer refers to Theorem 3.3, as that characterizes how the coverage bounds depend on the similarity between the synthetic and real score distributions. In Section 3.4, we discuss strategies to improve the quality of synthetic data. When such strategies reduce the total variation distance between the real and modified synthetic score distributions, Theorem 3.3 implies that the empirical coverage will better follow the nominal 1-α level.

Q6-B: Theory for the coverage/efficiency that is related to the quality of learning the synthetic score function

When data splitting is used to improve the quality of the synthetic scores, we already provide a variant of Theorem 3.3 that is formally stated in Appendix Section E.

For a more challenging setup, where synthetic scores are modified without data splitting, the real and synthetic scores are no longer independent, and thus Theorem 3.3 does not apply. Yet, the bounds of Theorem 3.5 continue to hold! In practice, we expect better empirical performance when the synthetic scores are better aligned---as we observe in our K-nearest subset experiments (Figure 2 and Appendix I.2).

Lastly, we kindly refer the reviewer to a related discussion on theoretical analysis for efficiency (i.e., prediction set size) in our response to Reviewer pG2H, Item 3.

Q5: Unclear why one might improve statistical efficiency by matching the distribution of real and synthetic scores. Any efforts to align the scores will be subject to the intrinsic 1/\sqrt{m} uncertainty in the real scores. Is this uncertainty still there, while you address the "bias" due to insufficient data?

Yes, the intrinsic uncertainty still remains. The goal of the score transformation is to shift the synthetic score distribution Q appropriately so that it better resembles the real score distribution P, and the "uncertainty" in the real scores is part of the information they provide about P. Yet, in some cases, we can still benefit from the larger sample size of the transformed synthetic scores. This is evident from our K-nearest subsets experiments, which we elaborate on in Q8 below.

Q8: Why can selecting samples from different label classes help the prediction for a specific label class?

Our method relies on synthetic data only through their scores---i.e., the prediction error of the fitted model---not through the raw inputs or labels. The experimental setting is motivated by prior work (e.g., ``Class-conditional conformal prediction with many classes'' by Ding et al.), which suggests that certain classes may have score distributions more similar to the class of interest. Selecting a subset of such classes can help align the synthetic and real score distributions, potentially improving statistical efficiency.

Q1: In Theorem 3.3, \varepsilon_{P,Q}^{m+1} depends on the TV distance of order statistics, which is a bit mysterious. How might this term depend on the TV distance between P and Q?

The following relation holds for the term \varepsilon_{P,Q}^{m+1}:

\varepsilon_{P,Q}^{m+1} \leq \mathrm{d}_{\mathrm{TV}}(P^{m+1}, Q^{m+1}),

where P^{m+1} denotes the joint distribution of m+1 i.i.d. samples from P, and Q^{m+1} is defined similarly. Thus, the quantity is upper bounded by the TV distance between the product distributions P^{m+1} and Q^{m+1}. Although the right-hand side is simpler and more intuitive, we chose to include the \varepsilon_{P,Q}^{m+1} term to present a tighter bound. We will add this discussion to the revised version.
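A standard fact about total variation over product measures (a general bound, not a statement from the paper) makes the dependence on the TV distance between P and Q explicit:

\varepsilon_{P,Q}^{m+1} \leq \mathrm{d}_{\mathrm{TV}}(P^{m+1}, Q^{m+1}) \leq (m+1)\,\mathrm{d}_{\mathrm{TV}}(P, Q),

so the term is small whenever the per-sample TV distance is small relative to 1/(m+1).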

Q2: The slack term β is discussed in lines 167-172 as only taking effect in extreme scenarios. Is there a way to derive a bound for that extreme event to obtain a sharper analysis?

Thank you for pointing this out---we found that the statements in lines 167-172 are somewhat vague, and we will make the necessary revisions. We apologize for the confusion. Specifically, what the term β accounts for is the case where the test score S_{m+1} is not covered by the corresponding window, rather than "extreme" cases. Although the step where β appears involves a tight inequality, the final bounds we obtain can be loose, as some of the proof steps apply the union bound---which we believe is inevitable for the goal of deriving distribution-free bounds.

In fact, we claim that the bounds provided in this work are already quite tight. We kindly refer the reviewer to Item 1 in our response to Reviewer aUpi for further discussion on this topic and new experimental results.

Q3: How should one interpret the quantities in Theorem 3.5? In particular, what do R_j^+ and R_j^- quantify, intuitively?

Figure S3 in the appendix provides a concrete example of the real and synthetic scores along with their corresponding windows, illustrating how the guardrail bounds are derived. Specifically, R_j^+ and R_j^- are ranks within the set of synthetic scores. The ranks R_j^+ and R_j^- are chosen appropriately so that "the j-th smallest real score lies between the R_j^--th and R_j^+-th synthetic scores with sufficient probability." Therefore, roughly speaking, the interval [R_j^-, R_j^+] is a prediction interval for the rank of S_{(j)} in the set of synthetic scores.

The lower bound in Theorem 3.5 represents the proportion of windows that lie entirely below the conformal threshold \tilde{Q}_{1-\alpha} based on synthetic data---which is equivalent to the proportion of the R_j^+ that are less than \lceil(1-\alpha)(N+1)\rceil.
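The counting step described in the previous paragraph is a one-liner; the sketch below assumes the ranks R_j^+ are already available from the paper's window construction (which is not reproduced here), so it only illustrates that counting step, not the full Theorem 3.5 bound.

```python
import math
import numpy as np

def thm35_lower_bound_sketch(R_plus, N, alpha):
    """Counting step described above: the fraction of window upper ranks R_j^+
    that fall strictly below the synthetic conformal index ceil((1-alpha)(N+1)).
    The ranks in R_plus are assumed to come from the paper's window construction."""
    k = math.ceil((1 - alpha) * (N + 1))
    return float(np.mean(np.asarray(R_plus) < k))
```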

Q7-A: Why do you choose β = 0.4 in the experiments? How would you recommend choosing the value of β?

We choose β = 0.4 based on the guardrail bounds in Theorem 3.5, and we recommend selecting its value following the same principle. Specifically, given the sample sizes m, N and the target level α, Algorithm 4 in the appendix provides a way to set β to meet a user-specified lower bound on coverage.
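To make the selection principle concrete, here is a hedged sketch of a grid search in this spirit. The helper guardrail_lower_bound(m, N, alpha, beta) is a hypothetical stand-in for the Theorem 3.5 lower-bound computation performed by the paper's Algorithm 4 (not reproduced here); the sketch simply returns the smallest β on the grid that meets a user-specified worst-case coverage target.

```python
def choose_beta_sketch(m, N, alpha, target_lower_bound, guardrail_lower_bound,
                       grid=None):
    """Sketch of the beta-selection principle described above. The callable
    guardrail_lower_bound(m, N, alpha, beta) is a hypothetical stand-in for
    the Theorem 3.5 bound computed by the paper's Algorithm 4."""
    if grid is None:
        grid = [i / 100 for i in range(1, 51)]  # candidate beta values in (0, 0.5]
    for beta in grid:  # the bound grows with beta, so the first hit is the smallest
        if guardrail_lower_bound(m, N, alpha, beta) >= target_lower_bound:
            return beta
    return None  # no beta on the grid meets the requested worst-case coverage
```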

Q7-B: How do the results change with the choice of beta?

We kindly refer the reviewer to the discussion and new results provided in Item 3 of our response to Reviewer UUws.

Q9: m=15 is extremely small for Section 4.2. How would the results change with the value of m? How large N is compared with m to ensure that one gets some improvement?

Note that relevant discussions are provided in Sections I.1.1 and J.1, "The effect of the real calibration set size," along with supporting experiments---in case this was previously overlooked. There, we demonstrate meaningful improvements for larger values of m, such as 50. But, naturally, the degree of improvement from leveraging synthetic data tends to diminish as m grows.

The amount of improvement depends not only on the synthetic sample size N but also on the quality of the synthetic scores. If the synthetic score distribution is nearly identical to the real score distribution, then any N > m will lead to improvement. However, if the synthetic scores are of low quality such that the procedure frequently hits the guardrail, the degree of improvement will be relatively minor, even with a large N.

Comment

I thank the authors for the detailed response. Just some clarification questions:

  1. Q3&6: do you mean the quantities in Thm 3.5 can be directly computed in a new task and bound the actual coverage?
  2. Re choice of β: did you choose it so that with this specific value of (m, n, α, β), the coverage is at least 1-α?
Comment

Thank you for the follow-up and the opportunity to clarify these points.

Re 1: On Theorem 3.5 and actual coverage

Yes, the lower and upper guardrail bounds in Theorem 3.5 can be directly computed for any new task, and they are guaranteed to bound the actual coverage. These guardrails are independent of the underlying real or synthetic score distributions and are only a function of the constants (m, n, α, β). This property allows the user to pre-define a worst-case coverage level and choose β accordingly using Algorithm 4.

For example, in Figures 1-3 of the main manuscript, the red and green dashed lines correspond to the lower and upper bounds of Theorem 3.5. As seen in these and all other experiments, the actual empirical coverage consistently lies within these guardrail bounds, which can be directly computed.

Re 2: On the choice of β

No, we did not choose β to ensure that the guardrail lower bound exactly meets the target level 1-α. Instead, we selected β so that the guardrail lower bound is slightly below 1-α. This choice reflects the fact that Theorem 3.5 provides a worst-case coverage guarantee that holds even when the synthetic scores are highly dissimilar from the real ones. If we had set β such that this worst-case lower bound is exactly 1-α, our SPPI would essentially default to standard conformal prediction. By allowing a slightly lower worst-case bound, we still maintain validity guarantees while also enabling efficiency gains (e.g., smaller prediction sets, smaller coverage variability) when the synthetic data are of decent quality (as formulated in Theorem 3.3).

For instance, in Figure 1 (ImageNet experiment), the target coverage is 1-α = 98%, and our choice of β = 0.4 results in a lower guardrail coverage bound that is slightly lower than 98%. As illustrated, SPPI results in actual coverage that closely matches the target level for classes whose synthetic scores are aligned with the real scores; at the same time, when the synthetic scores are of low quality, the average coverage of SPPI never falls below the lower guardrail coverage bound, in contrast with the heuristic OnlySynthetic method.

We sincerely thank the reviewer for their engagement and thoughtful questions. All clarifications and additional experiments resulting from your feedback and that of other reviewers will be incorporated into the final version of the paper. Thank you!

Comment

Thank you again for your thoughtful comments. We hope our responses have addressed your concerns. If you have any remaining questions or feedback, we’d be grateful to hear them.

Comment

Thank you for the follow-up! I would suggest clarifying the choice of β in the paper to guide practical use. I've raised my score.

Comment

Thank you for your valuable feedback and for helping us improve our manuscript. In the revised version, we will clarify the choice of β and address all other comments you raised. We greatly appreciate your engagement and your decision to raise your score.

Official Review
Rating: 4

This paper proposes a framework to leverage a synthetic dataset when applying conformal prediction in the case of scarce real data. To do so, it constructs a mapping called a "transporter" using both real and synthetic calibration scores. Applying this mapping to the real test score allows comparing it to the synthetic quantile instead of the real quantile, where the former is computed with many more samples than the latter.

They provide a fast computation of their proposed conformal set with theoretical guarantees.

In the case of i.i.d. samples for both the real and synthetic data, upper and lower bounds for the marginal coverage are provided, which essentially depend on the total variation distance between the synthetic and real data distributions.

Experiments are provided for the task of image classification using either diffusion models or unseen classes as synthetic dataset to illustrate their findings.

Strengths and Weaknesses

Strengths:

  • The tackled problem of data augmentation for conformal prediction is of interest. The paper is clearly written and enjoyable to read.
  • The transporter proposed is sound and intuitive.
  • Theoretical guarantees are provided and describe the deviation from the target 1-α coverage.
  • Experimental results show the applicability of their method.

Weaknesses:

  • There is no in-depth discussion about the error terms in the bounds of Theorems 3.3 and 3.5. In particular, it seems that the sum ε + 1/(N+1) is a bias-variance trade-off.
  • The lower bound in Theorem 3.3 seems conservative, as taking large values of β compared to α in the experiments still yields good coverage.
  • I am slightly dubious about the assertion Line 167 "when P=Q [...] our procedure provides a tighter upper bound than split conformal prediction". If one has P=Q then there is no problem anymore to solve in the first place.

Questions

  • Could β be correlated to the difference between P and Q? Since the transporter makes the real and synthetic scores "similar", I would assume that β could be used to calibrate that step, similarly to results for weighted exchangeability and distributional shift. In particular, could deriving Lemma 3.1 in the case where P is not equal to Q (with additional error terms that illustrate the shift) provide tighter coverage bounds?
  • Do the authors have an idea of how the bias and variance (mentioned above) scale in m, N, even in simple cases, and in particular whether there is some intuition on how to choose the scaling of N compared to m?
  • The framework is extremely general and could include weighted exchangeability, distributional shift. For these specific cases, does the algorithm proposed obtain similar guarantees to existing ones ?

I believe the proposed method is interesting but the guarantees do not allow to understand how to fully leverage it. Addressing the questions raised above could make me increase my score to accept the paper.

Limitations

Yes.

Final Justification

While the theoretical analysis could benefit from even tighter investigation, their proposed method is of interest as it requires little assumption, is flexible and authors have demonstrated strong numerical evidence of its applicability.

Formatting Concerns

No concern.

Author Response

Thank you for your thoughtful review. We are happy that you found the paper clear and enjoyable to read. It appears that your major concern is with the tightness of our theoretical results. We start by addressing this point and then proceed by responding to the rest of your comments.

  1. The lower bound in Theorem 3.3 seems conservative, as taking large values of β compared to α in the experiments still yields good coverage.

We would like to argue that the bounds provided in this work are already quite tight. Our theoretical result is based on both Theorems 3.3 and 3.5, where the effective bounds are the tighter of the two in each case.

We first mention that the bounds in Theorems 3.3 and 3.5 are distribution-free uniform bounds---specifically, the lower bound in Theorem 3.3 essentially states the following:

\inf_{\text{all distributions } P, Q} \left[\mathbb{P}(Y_{m+1} \in \widehat{C}(X_{m+1})) - (1-\alpha-\beta+\varepsilon_{P,Q}^{m+1})\right] \geq 0.

This bound was constructed to control this infimum, and thus the "conservativeness" should be discussed in terms of how close the above infimum is to zero. For specific distributions P and Q, it is possible that the situation is better and we achieve a higher coverage.

In practice, one would prefer a tight bound that holds for any unknown data distribution. Such a uniformly tight bound is a challenging goal and is likely infeasible (how can we be both distribution-free and distribution-adaptive?). However, our second result---Theorem 3.5---essentially addresses this concern. Recall that Theorem 3.5 provides "guardrail" bounds, which approach 1-α as β increases (discussed in Section D.1 as well). Importantly, these bounds do not depend on the distributions of the real or synthetic scores. Further, the lower guardrail bound is tunable: for given sample sizes m, N and target level α, the hyperparameter β can be set to meet the user-specified guardrail coverage bound (as detailed in Algorithm 4 in the appendix).

In fact, if β is large, then---depending on the quality of the synthetic scores---the bound in Theorem 3.5 may dominate that of Theorem 3.3 and provide a tighter guarantee.

To illustrate this point, we ran new experiments with synthetic data where the total variation (TV) distance between the real and synthetic score distributions can be computed exactly. The table below presents the coverage rates for all methods, along with the theoretical bounds from Theorems 3.3 and 3.5, and the effective bounds in each case. We will include the new results, along with this discussion, in the revised version.

Table: Results on synthetic data.
Coverage rates across different TV distances between the real and synthetic score distributions, along with bounds from Theorems 3.3 and 3.5, and the effective bounds (minimum/maximum of the two). Standard errors are shown in parentheses. Results are shown for 1-α = 0.95 and β = 0.05.

| TV distance | Coverage: OnlyReal | Coverage: OnlySynth | Coverage: SPPI | THM 3.3 bounds | THM 3.5 bounds | Effective bounds |
|---|---|---|---|---|---|---|
| 0 | 100 (±0) | 95.2 (±0.1) | 95.3 (±0.1) | (0.9, 1.0) | (0.812, 1.0) | (0.9, 1.0) |
| 0.04 | 100 (±0) | 94.1 (±0.1) | 94.2 (±0.1) | (0.86, 1.0) | (0.812, 1.0) | (0.86, 1.0) |
| 0.08 | 100 (±0) | 92.8 (±0.1) | 93.0 (±0.1) | (0.82, 1.0) | (0.812, 1.0) | (0.82, 1.0) |
| 0.159 | 100 (±0) | 89.5 (±0.2) | 90.2 (±0.2) | (0.741, 1.0) | (0.812, 1.0) | (0.812, 1.0) |
| 0.236 | 100 (±0) | 85.3 (±0.2) | 87.1 (±0.3) | (0.664, 1.0) | (0.812, 1.0) | (0.812, 1.0) |
| 0.311 | 100 (±0) | 80.2 (±0.2) | 84.2 (±0.5) | (0.589, 1.0) | (0.812, 1.0) | (0.812, 1.0) |
| 0.383 | 100 (±0) | 74.2 (±0.3) | 82.2 (±0.7) | (0.517, 1.0) | (0.812, 1.0) | (0.812, 1.0) |

  2. I am slightly dubious about the assertion Line 167 "when P=Q [...] our procedure provides a tighter upper bound than split conformal prediction". If one has P=Q then there is no problem anymore to solve in the first place.

We are considering the case where the quality of the synthetic data is unknown in advance, as is often the case in practice. In particular, we do not know beforehand whether Q equals P or not.

However, we can still leverage synthetic data to improve inference without such prior knowledge, and that is precisely the goal of our work. We address this problem by proposing a procedure with guardrails---with the following properties (which hold despite us not knowing whether P ≈ Q or not!):

  • When P ≈ Q, SPPI behaves like conformal prediction with a large sample size, thereby improving upon the naive conformal prediction based solely on limited real data.

  • When Q does not approximate P well, SPPI still provides meaningful inference to some extent, supported by the distribution-free guardrail bounds established in Theorem 3.5.

These properties are also validated empirically: our experiments show that SPPI yields more efficient prediction sets when P ≈ Q, and continues to provide valid inference when this approximation is poor.

  3. There is no in-depth discussion about the error terms in the bounds of Theorems 3.3 and 3.5. In particular, it seems that the sum ε + 1/(N+1) is a bias-variance trade-off.

Thank you for pointing this out. We agree with the interpretation in terms of the bias-variance tradeoff. Specifically, the fundamental goal of this work is to develop a method that leverages synthetic data—which may follow a different distribution from the real data and thus introduce some bias—to improve the sample efficiency of the inference. One important point to note is that what is being traded for the bias is not only the variance. Increasing the sample size in conformal prediction has two effects:

  • It reduces the variability of the prediction set: a larger sample size leads to smaller variance in the score bound S_{(\lceil(1-\alpha)(n+1)\rceil)}, which is a sample quantile of the calibration scores.

  • It reduces the conservativeness of the prediction set: recall that in split conformal prediction, a mass of 1/(n+1) is assigned to +\infty to account for the unobserved test score. This is the source of the upper bound 1-α + 1/(n+1) (recalled in the display after this list).

The proposed SPPI method allows for some bias from the synthetic data in order to gain advantages in both aspects.
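For reference, the two effects above can be read off the standard split-conformal coverage sandwich (a well-known result, recalled here for context; the upper bound assumes almost surely distinct scores):

1 - \alpha \leq \mathbb{P}\left(Y_{n+1} \in \widehat{C}(X_{n+1})\right) \leq 1 - \alpha + \frac{1}{n+1}.

A larger calibration size n both stabilizes the sample quantile defining \widehat{C} and shrinks the 1/(n+1) slack on the right-hand side.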

  4. Do the authors have an idea of how the bias and variance (mentioned above) scale in m, N, even in simple cases, and in particular whether there is some intuition on how to choose the scaling of N compared to m?

We would say that "a larger N is always better." Specifically, regarding the "bias" term \varepsilon_{P,Q}^{m+1}, it does not depend on the synthetic sample size N, whereas the term 1/(N+1) decreases as N grows. Thus, at least in terms of N, there is no tradeoff---a larger N represents a strict improvement.

Moreover, the sample size of the real data, m, affects the granularity of the bounds in Theorem 3.5. These bounds can be interpreted as 1-α-ε_1 and 1-α+ε_2, where a larger m allows for a tighter lower worst-case bound (for an appropriate choice of β), thereby reducing the worst-case bias.

  5. The framework is extremely general and could include weighted exchangeability and distributional shift. For these specific cases, does the algorithm proposed obtain similar guarantees to existing ones?

Thank you for this interesting question. Extending the method to the covariate shift setting is in fact not straightforward. Specifically, the SPPI method relies on a score transporter that maps each order statistic of the real scores to the space of synthetic scores. A key property used in the theoretical analysis is that the rank of the test score is independent of the (ordered) values of the scores. However, this independence no longer holds under covariate shift. We believe this presents an interesting direction for further investigation.

  6. Could β be correlated to the difference between P and Q? Since the transporter makes the real and synthetic scores "similar", I would assume that β could be used to calibrate that step, similarly to results for weighted exchangeability and distributional shift. In particular, could deriving Lemma 3.1 in the case where P is not equal to Q (with additional error terms that illustrate the shift) provide tighter coverage bounds?

We appreciate this insightful question. If we can somehow construct a window satisfying Lemma 3.1 in the case of P ≠ Q (through some (P,Q)-dependent β), then yes---the SPPI procedure will achieve the bounds in Theorem 3.3, but without the \varepsilon_{P,Q}^{m+1} error terms.

However, it seems that such a construction of the window could require knowledge of the exact distributions P and Q, rather than merely a similarity measure between them. Moreover, our primary focus is on settings with a limited number of real data points, where accurately estimating P or the similarity between P and Q a priori is not a reasonable target.

Notably, the weighted conformal method that works under general distribution shift ("Conformal Prediction Beyond Exchangeability" by Barber et al., mentioned by the reviewer) requires the weights to be set a priori and relies on the TV distance between the weighted score distributions to quantify the drop in coverage due to the shift. Consequently, it is unclear how the latter could tighten Theorem 3.3, especially in light of our Theorem 3.5, which allows users to adapt the synthetic scores to better align with the real scores in a data-driven manner, without compromising the validity of our guardrail coverage guarantees and without requiring data splitting.

A closing remark

Thank you for the opportunity to clarify why we argue that our theoretical analysis is quite tight, and for your thoughtful feedback on our method. If any part remains unclear, we would be happy to elaborate further. We sincerely believe our response addresses your concerns and will be helpful as you reconsider your evaluation.

Comment

I thank the authors for their response, which I consider satisfactory. Given that their proposed method is sound with working experiments, I will increase my score towards acceptance of the paper.

Comment

Thank you for your thoughtful engagement and encouraging feedback. We’re glad to hear that our response addressed your concerns. We truly appreciate your support for acceptance.

All clarifications and new results prompted by your comments—as well as those from other reviewers—will be incorporated into the final version of the paper. Thank you again!

Official Review
Rating: 4

This paper introduces Synthetic-powered Predictive Inference (SPPI), a new framework that enhances the sample efficiency of conformal prediction by incorporating synthetic data generated from large generative models or alternative real datasets. The main innovation is a score transporter: an empirical quantile mapping that aligns nonconformity scores between real and synthetic calibration data. SPPI provides finite-sample, distribution-free coverage guarantees regardless of the similarity between real and synthetic data, and can substantially tighten prediction sets when synthetic and real data distributions are well aligned. The approach is demonstrated on both multi-class classification (ImageNet, using diffusion models) and tabular regression (MEPS panel data), showing improved efficiency over standard conformal prediction in data-scarce settings.

Strengths and Weaknesses

Strengths:

The method tackles a core limitation of conformal prediction—its inefficiency in low-data regimes—by leveraging abundant synthetic data in a theoretically principled manner. The authors provide finite-sample coverage guarantees.

Weaknesses:

  1. The paper could further discuss or systematically analyze the impact of different types of synthetic data and score functions on efficiency and coverage—especially when using less realistic or lower-quality generators.

  2. SPPI method does not guarantee that the prediction set (or interval) will always be smaller than (or even no larger than) the one constructed using only real data.

  3. More practical guidance on choosing hyperparameters (e.g., β for windows, subset size for selection), diagnostics for misalignment, and post-hoc adjustment strategies would make the framework easier to deploy for non-expert users.

Questions

  1. Can the authors suggest practical ways to diagnose or quantify when the synthetic and real score distributions are too dissimilar for SPPI to be effective? Are there warning signs or automated checks practitioners could use?

  2. How does the method perform with synthetic data from weaker generators or less relevant domains? Can the authors comment on robustness in such scenarios?

  3. For large-scale settings, how does the computational cost of SPPI compare to standard conformal prediction or other recent efficient conformal methods?

Limitations

See Weaknesses.

Final Justification

Overall, this is a solid contribution (score: 4). However, the paper does not provide efficiency guarantees; therefore, the procedure may produce wider intervals than standard methods, constraining practical utility.

Formatting Concerns

No

Author Response

Thank you for your thoughtful review. It appears that your major concern is with the robustness of our method to the quality of the synthetic data. We start by addressing this point and then proceed by responding to the rest of your comments.

  1. How does the method perform with synthetic data from weaker generators or less relevant domains? Can the authors comment on robustness in such scenarios?

First, we would like to point out that our method relies on the synthetic data only through their scores---i.e., the prediction error of the fitted model---not through the raw inputs or labels. When the synthetic data are of low quality (i.e., the synthetic score distribution differs significantly from the real scores), SPPI provably remains robust and is guaranteed to yield prediction sets whose coverage rate falls within the theoretical guardrail bounds established in Theorem 3.5. These bounds hold regardless of the quality of the synthetic data. Moreover, the lower guardrail bound is tunable: for given sample sizes m, N and target level α, the hyperparameter β can be set to meet the user-specified guardrail coverage bound (as detailed in Algorithm 4 in the appendix).

At the same time, if the synthetic data are of high quality (i.e., the synthetic score distribution closely resembles the real scores), SPPI benefits from using the synthetic data. The empirical coverage is tightly regulated around the target level 1-α, with low variability. This aligns with the results in Theorem 3.3, as the total variation distance is small in this case.

Importantly, our method also allows for improving the alignment between synthetic and real scores using the real scores themselves, and without requiring any data-splitting. For instance, our K-nearest subset approach selects subsets of synthetic scores that are more aligned with the real ones to improve statistical efficiency. Crucially, such refinements do not compromise the coverage guarantees; indeed, we demonstrate in the K-nearest subset experiments (Figure 2 and Appendix I.2) that using this approach can improve performance.

2. Discuss or systematically analyze the impact of different types of synthetic data and score functions on efficiency and coverage—especially when using less realistic or lower-quality generators.

We thank the reviewer for the suggestion. Note that the experiments in the manuscript already include four types of synthetic data: for image classification, we consider images generated by Stable Diffusion, Flux, and images from different classes; for the MEPS dataset, we use past panel data.

We agree that further comparisons across different synthetic sources and score functions are valuable. To that end, we include in the table below new results for image classification using the HPS score function---complementing the APS-based results in the manuscript---evaluated across three different synthetic datasets. We observe a similar trend to the one observed in the manuscript.

Regarding the effect of the quality of synthetic data on performance, we kindly refer the reviewer to our response to Reviewer aUpi, Item 1, where we present new experiments with synthetic data for which the total variation distance between the real and synthetic score distributions can be computed. These experiments demonstrate how varying levels of the quality of synthetic data affect the theoretical coverage bounds, and also the empirical coverage is always within the effective bounds defined as the tighter of Theorems 3.3 and 3.5, as our theory suggests. We will include all the new results in the revised version.

Table: Results for the ImageNet data using the HPS score function
Marginal and per-class conditional coverage for three synthetic datasets. The target coverage level is 1 − α = 0.95. Other details are as in Figure 1 from the manuscript.

| Class | Stable Diffusion: OnlySynth | Stable Diffusion: SPPI | FLUX: OnlySynth | FLUX: SPPI | Auxiliary data: SPPI-Whole | Auxiliary data: SPPI-Subset |
|---|---|---|---|---|---|---|
| Marginal | 89.9 (±0.1) | 95.3 (±0.3) | 93.1 (±0.1) | 95.9 (±0.3) | 94.9 (±0.4) | 94.8 (±0.4) |
| Siberian husky | 32.4 (±0.2) | 94.4 (±0.5) | 36.0 (±0.2) | 94.4 (±0.5) | 96.1 (±0.3) | 96.1 (±0.3) |
| Beaver | 79.7 (±0.2) | 93.5 (±0.6) | 35.8 (±0.2) | 93.4 (±0.6) | 97.4 (±0.1) | 96.0 (±0.3) |
| American robin | 87.4 (±0.1) | 94.4 (±0.4) | 71.7 (±0.1) | 93.7 (±0.6) | 98.9 (±0.1) | 96.2 (±0.2) |
| Lighter, light | 94.8 (±0.1) | 96.5 (±0.2) | 30.2 (±0.2) | 94.2 (±0.7) | 94.5 (±0.5) | 94.5 (±0.5) |
| Barracouta | 98.4 (±0.0) | 98.6 (±0.1) | 99.6 (±0.0) | 99.6 (±0.0) | 95.1 (±0.4) | 95.6 (±0.4) |

  3. The SPPI method does not guarantee that the prediction set (or interval) will always be smaller than (or even no larger than) the one constructed using only real data.

There are scenarios in which the SPPI prediction set is always shorter than the conformal prediction set based solely on real data. For example, when the real data size is very small---specifically, m < 1/α + 1---the naive conformal prediction always yields a trivial prediction set, whereas SPPI can produce nontrivial, finite-width prediction sets.

But, more generally, we would like to note that while "constructing a valid prediction set whose length (or more generally, measure) is almost surely shorter than that of the naive conformal prediction" may sound appealing, it may be a very hard or even infeasible target in the distribution-free setting---especially when leveraging synthetic data whose distribution is unknown. The same holds even for weaker targets, such as achieving a smaller expected width. Intuitively, the reason is that the distribution of the synthetic data can range from very good to very bad, and so the only way to make sure that we do not harm the coverage guarantee of standard conformal prediction (which is already very tight) would be to ignore the synthetic data. Our SPPI method sidesteps this challenge by introducing the guardrails, as explained in the paper.

Nevertheless, as discussed in Item 1, when the synthetic scores are aligned with the real ones, SPPI yields shorter, more efficient prediction sets. When the synthetic data are of low quality, the method remains safe, with coverage bounded by the worst-case guarantees of Theorem 3.5. These two properties are validated through our extensive experiments: in nearly all of our experimental results, the proposed procedure outperforms naive conformal prediction in terms of prediction set width, while still achieving reasonable coverage rates.

4. More practical guidance on choosing hyperparameters, diagnostics for misalignment, and post-hoc adjustment strategies.

We thank the reviewer for this suggestion, and we will add the following discussion in the revised version.

For the hyperparameter β, we recommend selecting its value based on the desired guardrail bounds, e.g., 90% coverage with the ideal target level 1-α = 95%. Specifically, Algorithm 4 provides a procedure to obtain such a β for given values of N, m, α that meets the user-specified guardrail coverage bound.

Regarding the subset size for selection, we suggest following our experiments and fixing N to be reasonably large (e.g., at least 300-500) and then selecting the K nearest synthetic subsets.

As for diagnostics and post-hoc adjustment strategies, we first recall that only the distribution of the scores matters for our method, not the raw inputs or labels. Therefore, diagnostics should be done in the score space. To assess similarity between the real and synthetic scores, one can apply any one-dimensional distributional test, such as the Cramér-von Mises or Kolmogorov-Smirnov statistics, or visually inspect the histograms of the real and synthetic scores. However, since m is often relatively small, these diagnostics will be noisy but still useful to detect major deviations. Importantly, such diagnostics do not compromise the validity of our coverage guarantees, as discussed below.
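As a minimal example of the score-space diagnostic just described (an illustrative sketch; the exact workflow is up to the practitioner), the two-sample tests mentioned above are available in SciPy:

```python
import numpy as np
from scipy.stats import ks_2samp, cramervonmises_2samp

def diagnose_score_alignment(real_scores, synth_scores, level=0.1):
    """Compare the empirical distributions of real and synthetic nonconformity
    scores, as suggested above. With small m the tests are noisy but can flag
    major deviations; running them does not affect the guardrail guarantees."""
    ks = ks_2samp(real_scores, synth_scores)
    cvm = cramervonmises_2samp(real_scores, synth_scores)
    return {
        "ks_pvalue": float(ks.pvalue),
        "cvm_pvalue": float(cvm.pvalue),
        "flag_misalignment": bool(min(ks.pvalue, cvm.pvalue) < level),
    }

# toy usage: scarce real scores vs. shifted synthetic scores
rng = np.random.default_rng(1)
print(diagnose_score_alignment(rng.normal(size=30), rng.normal(0.5, 1.0, size=1000)))
```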

5. Suggest practical ways to diagnose or quantify when the synthetic and real score distributions are too dissimilar for SPPI to be effective. Are there warning signs or automated checks practitioners could use?

Our main focus in this work is on the setting where the real data is limited and insufficient to provide useful inference---in such a setting, diagnostics based on the real data may be noisy. Importantly, our approach is supported with distribution-free guardrails: regardless of the quality of the synthetic data, the coverage rate always lies between some fixed coverage levels 1-α-ε_1 and 1-α+ε_2 (this is essentially the result of Theorem 3.5), where the lower coverage level can be pre-defined by the user. Put simply, there is no need for any "warning signs."

Nevertheless, one can apply the diagnostics described in the previous item to detect when the synthetic scores are far off from the real scores, indicating that the synthetic data is not useful for the inference task or could be modified to better align with the real data. In fact, the latter is precisely the approach we take in our K-nearest subsets method, which allows us to benefit from better-aligned synthetic data. As demonstrated in the paper, with this approach, the coverage rate becomes closer to 1-α rather than hitting the guardrails.

6. For large-scale settings, how does the computational cost of SPPI compare to standard conformal prediction or other recent efficient conformal methods?

SPPI incurs almost no additional computational cost compared to standard conformal prediction at inference time. To see this, note that conformal prediction requires finding a sample quantile of the scores, while SPPI performs essentially the same computation three times. Specifically, the simplified formula for SPPI (Section 3.2) consists of three terms: S_{(\tilde{R}^-)} and S_{(\tilde{R}^+)}---sample quantiles of the real scores---and \tilde{Q}'_{1-\alpha}, a sample quantile of the synthetic scores.

Comment

Thank you again for your thoughtful review. We have responded to the comments and questions you raised and hope our responses have addressed your concerns. If you have any remaining questions or feedback, we’d be grateful to hear them.

Comment

I appreciate the authors’ rebuttal, which resolves most of my concerns. However, I would like to further raise a point regarding the efficiency of the proposed approach. As I understand it, the method does not guarantee narrower intervals compared to those constructed using only real data. In practice, how should a practitioner decide when to use the approach involving synthetic data versus relying solely on real data? Additionally, if multiple sources of synthetic data are available, how should one determine which to use?

Comment

Thank you for the follow-up and the opportunity to clarify these points. We're glad to hear that we addressed most of your concerns, and we hope the responses below will further clarify the remaining ones.

Re 1: As I understand it, the method does not guarantee narrower intervals compared to those constructed using only real data.

Indeed, providing such a guarantee is not a feasible target in the distribution-free setting we consider in this work, except in situations where the real dataset is very small, in which case standard conformal prediction yields trivial prediction sets. In this regime, our method is shown to produce non-trivial prediction sets. Further, across all the extensive experiments we provided, our method produced smaller prediction sets than standard conformal prediction, even in cases with larger real datasets or higher α levels, where standard conformal prediction does not produce trivial sets. These empirical results suggest that, although almost-surely narrower prediction sets are not guaranteed, our method offers a practical improvement in many realistic scenarios.

More broadly, our work proposes a theoretical framework for incorporating synthetic data into conformal prediction under unknown distributions---something that was not previously possible. We believe this opens opportunities for future research on the principled design of synthetic data for predictive inference, with the goal of improving the alignment with real data and reducing prediction set sizes.

Re 2: In practice, how should a practitioner decide when to use the approach involving synthetic data versus relying solely on real data?

We hope the discussion provided above better clarifies the efficiency issue. This question also complements our earlier response to Comments 4 and 5 raised by the reviewer, in which we discussed practical diagnostics to assess the alignment between synthetic and real scores.

In general, the closer the synthetic scores resemble the real ones, the more benefit our method can provide. To evaluate this alignment in practice, we suggest that practitioners apply standard goodness-of-fit tests to compare the empirical CDFs of the real and synthetic scores.

Importantly, there are standard statistical tests available, such as the Cramér-von Mises and Kolmogorov-Smirnov tests. These test the null hypothesis that the real and synthetic scores are drawn from the same distribution. One can then run such a test and avoid using the synthetic data in case the test rejects the null, suggesting a significant discrepancy between the score distributions. We recommend that practitioners perform such a test at a chosen significance level---e.g., 0.1.

We thank the reviewer for prompting this clarification, and we will incorporate this diagnostic recommendation into the revised manuscript.

Re 3: Additionally, if multiple sources of synthetic data are available, how should one determine which to use?

This scenario aligns with the experimental setup in our K-nearest subset approach, where we selected the K subsets whose score CDFs were closest to the real data. In the multiple-source setting raised by the reviewer, each subset corresponds to a distinct synthetic source. While we found this method to be effective in practice, alternative approaches may also be promising, and we will include this as a potential direction for future work.

Final Decision

Claims and findings: This paper is about conformal prediction with synthetic data, motivated by data-starved scenarios. The basic idea is to show that it is possible to use synthetic samples by building mappings between the relevant quantities (e.g., scores) derived from synthetic and real samples.

Strengths: The problem is important. The idea the authors develop is intuitive but is also pretty powerful. The paper is written in a clear way.

Weaknesses: There are some practical questions (e.g., how to set some of the hyperparameters, how robust the method is to different choices).

Decision: This is a nice paper that has a lot of intriguing results. It is worth accepting.

Rebuttal discussion: The reviewers brought up some great questions, especially around practical use of the method. The authors’ answers were very helpful and will be valuable for the final version of the paper.