PaperHub
Overall: 6.8/10
Poster · 4 reviewers
Ratings: 3, 5, 5, 4 (min 3, max 5, std 0.8)
Confidence: 3.3
Novelty: 3.0
Quality: 3.0
Clarity: 3.0
Significance: 2.3
NeurIPS 2025

Document Summarization with Conformal Importance Guarantees

OpenReview · PDF
Submitted: 2025-05-08 · Updated: 2025-10-29
TL;DR

We apply conformal prediction to provide statistical guarantees that all important information within a long-text is captured by an automatically generated summary.

Abstract

Keywords
Document Summarization, Conformal Prediction, Large Language Models

Reviews and Discussion

Review (Rating: 3)

This paper introduces Conformal Importance Summarization, the first framework for importance-preserving summary generation that uses conformal prediction to provide rigorous, distribution-free coverage guarantees.

Strengths and Weaknesses

Strengths:

  1. The paper offers a novel perspective by bridging conformal prediction with extractive summarization.

  2. The analysis is both sound and creatively executed.

Weaknesses:

  1. The paper contains a large number of formulas, many of which are adaptations of existing proofs, which negatively affects readability. The authors are encouraged to summarize their main contribution to conformal prediction or extractive summarization. Additionally, please clarify: is this paper fundamentally about applying conformal prediction to the extractive summarization task to achieve its benefits—namely, rigorous, distribution-free coverage guarantees—in the resulting summaries?

  2. The motivation behind this work is not clearly articulated. Given that the method is restricted to extractive summarization, is this still meaningful in the era of large language models (LLMs)?

  3. The paper lacks reporting on a primary metric in conformal prediction: Set Size, which quantifies prediction uncertainty.

  4. Instead of defining the threshold $\hat{q}$ using a preset $\alpha$, an alternative approach would be to use p-values, as suggested by Vovk et al. (2005). It would be interesting to evaluate the performance of the proposed framework using the p-value method for constructing prediction sets.

Questions

Please see Weaknesses above

Limitations

Yes

Final Justification

After carefully considering the authors' detailed responses and the original review comments, I maintain my initial assessment of the paper. The authors argue that the density of formulas is necessary to rigorously establish their novel adaptation of conformal prediction (CP) techniques to extractive summarization. While I acknowledge that theoretical rigor is important, I think the current presentation remains challenging for readers unfamiliar with CP’s technical foundations. However, the authors’ clarification that their work extends CP techniques (not standard CP) to a new setting is valid. Thus, while readability could be enhanced, the core theoretical claims are justified.

Formatting Issues

No

Author Response

Thank you for your efforts to review our work! It is great to see that you found our perspective on summarization novel. Below, we will address the weaknesses you brought up.

W1:

The paper contains a large number of formulas … which negatively affects readability. The authors are encouraged to summarize their main contribution to conformal prediction or extractive summarization. … is this paper fundamentally about applying conformal prediction to the extractive summarization task …?

It is possible that there are some misunderstandings in your review about how we brought together conformal prediction and document summarization. Our contribution is not to apply standard conformal prediction to summarization tasks, but to apply the techniques behind conformal prediction to summarization tasks. Our settings and results (Theorem 1) are entirely different from those of conformal prediction as you may have seen it applied to classification or regression tasks. Thus formulas and proofs are needed to make it rigorous. Conformal factuality [41] is another example of applying the techniques of conformal prediction to a novel LLM setting (question-answering), instead of applying standard conformal prediction. For that reason, [41] also provides proofs and formulas.

In summary:

  • conformal prediction [57,54, 2] aims for a coverage guarantee that the ground truth answer is contained in the prediction set (Equation 1);
  • conformal factuality [41] aims for a coverage guarantee that all retained claims are factual (Equation 3);
  • conformal summarization (our work) aims for a coverage guarantee that all (or a fixed percentage of) the important sentences are retained (Equation 4).

While there are common ideas behind all these approaches, the settings and the essential coverage guarantees are notably different, and each requires independent proof that coverage is obtained.
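To make the coverage target in Equation 4 concrete, the calibration step can be sketched as follows. This is a minimal illustration rather than the paper's exact algorithm: it assumes each calibration document comes with per-sentence importance scores and binary importance labels, and it uses standard split-conformal quantile indexing.

```python
import numpy as np

def calibrate_threshold(cal_docs, alpha=0.2, beta=0.8):
    """Pick a score threshold q_hat so that, for an exchangeable test document,
    keeping every sentence scored >= q_hat retains at least a fraction beta of
    its labeled-important sentences with probability >= 1 - alpha.

    cal_docs: iterable of (scores, labels) pairs, one per calibration document;
    scores are per-sentence importance scores, labels mark the important sentences.
    """
    per_doc = []
    for scores, labels in cal_docs:
        imp = np.sort(np.asarray(scores)[np.asarray(labels, dtype=bool)])[::-1]
        if len(imp) == 0:           # no important sentences: any threshold works
            continue
        k = int(np.ceil(beta * len(imp)))
        per_doc.append(imp[k - 1])  # largest threshold that still keeps k of them
    q = np.sort(per_doc)
    j = int(np.floor(alpha * (len(q) + 1)))
    return -np.inf if j == 0 else q[j - 1]

def conformal_summary(sentences, scores, q_hat):
    """Extractive summary: keep exactly the sentences scoring at or above q_hat."""
    return [s for s, r in zip(sentences, scores) if r >= q_hat]
```

Under this sketch, a new document misses the recall target only if its own per-document threshold falls below $\hat{q}$, which by exchangeability happens with probability at most $\alpha$.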

W3:

The paper lacks reporting on a primary metric in conformal prediction: Set Size, which quantifies prediction uncertainty.

Following the above explanation, since conformal summarization is not the same as conformal prediction, we did not report “set size” directly. Set size is indeed a primary metric for conformal prediction and quantifies uncertainty. In a simple classification problem, for example, there is a fixed set of labels to choose from for every datapoint, and the conformal set is just a subset of all possible labels. For the same level of coverage $1-\alpha$, smaller sets are more useful (L72), which makes average set size a good metric.

For our problem of document summarization the label set is no longer fixed. Instead, each summary contains a subset of all sentences from the long-texts $x$, which vary in length. The number of sentences retained in the summary is still a good measure of quality, as we want the shortest summaries that retain all (or a fraction $\beta$) of the ground-truth important sentences (L111). It also still quantifies uncertainty, as longer summaries indicate the model is more uncertain about what content is important. But, since the total number of sentences in $x$ varies, rather than use the count of sentences in the summary (similar to set size in the conformal prediction setting), it makes more sense to normalize the “set size” as the fraction of sentences removed (L233). This is the measure of “conformal set size” we report in Table 2 and other figures.
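For reference, the normalized measure is just the complement of relative summary length; a hypothetical helper (not from the paper's code) makes this explicit:

```python
def fraction_removed(num_doc_sentences: int, num_summary_sentences: int) -> float:
    # Normalized analogue of conformal "set size": 0.0 keeps every sentence,
    # values near 1.0 correspond to a very concise summary.
    return 1.0 - num_summary_sentences / num_doc_sentences
```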

W2:

The motivation behind this work is not clearly articulated. Given that the method is restricted to extractive summarization, is this still meaningful in the era of large language models (LLMs)?

Instruction-tuned LLMs are powerful tools and natively can perform abstractive summarization with basic prompts. One might wonder why we bother with extractive summarization at all. We comment on this in the paragraph on L25. The main motivation is that prompted LLMs do not provide any guarantees that they will extract the information a user deems important. Even worse, they are well-known to sometimes hallucinate information beyond what was provided, or make logical errors in interpreting that information. For example, a study using GPT-4 to summarize emergency department visits found hallucinations in 42% of summaries and that 47% omitted clinically important information [A]. Many similar evaluations of LLMs are available in the literature. In response to Reviewer fdXq (Q1) we tested how well GPT-4o mini retains ground-truth important information with naive abstractive summarization. The results show only 26% of summaries had recall over 80% (low coverage), and overall recall was only 52% (low recall).

By using extractive summarization it is impossible to fabricate information, and by using conformal prediction we can align the extracted information with a user’s preferences (via the labeled calibration set) and give statistical guarantees on recall and coverage.

Let’s consider an analogy to the popular task of question-answering. Many production systems avoid “pure” LLM generation and instead use Retrieval-Augmented Generation (RAG) [B], which grounds generative outputs in retrieved text to mitigate hallucinations while preserving fluency. RAG is a two-step process that first retrieves from a document corpus and then generates from the retrieved documents. The first step involves finding the most relevant document to the question, in which the output format is controlled (not in the form of free text). By design there cannot be hallucination in the retrieval step, since it only accesses a pre-defined corpus. The second step, generation, has its output as free-form text, i.e. uncontrolled. Thus it could contain hallucinations. But as the inputs to this step are controlled, RAG greatly reduces hallucinations on this step.

Our framework for summarization addresses the same reliability concerns as RAG, and is less prone to hallucination than “pure” LLM summarization. Our method can be thought of similarly to the retrieval step in RAG, in that the output is controlled -- the output sentences are directly from the original text. Just like the retrieval in RAG, we score and rank content, but go further and provide coverage guarantees. The output of our approach can be fed into a second stage of abstractive summarization using an LLM, similar to the generation step in RAG, which serves to improve fluency and further shorten the text. Thus our approach and abstractive LLM summarization are complementary. We provide example experiments in response to Reviewer fdXq (Q3) that show using our conformal summarization as a pre-processing step improves the recall and coverage of abstractive summarization.

[A] Williams et al. “Evaluating large language models for drafting emergency department encounter summaries” PLOS Digital Health 2025

[B] Lewis et al. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” NeurIPS 2020

W4:

Instead of defining the threshold $\hat{q}$ using a preset $\alpha$, an alternative approach would be to use p-values, as suggested by Vovk et al. (2005). It would be interesting to evaluate the performance of the proposed framework using the p-value method for constructing prediction sets.

This is another interesting question, and luckily there is well understood theory around it. Essentially, the viewpoints of conformal prediction via coverage (modern view [2], which we used) and p-values (classic view [57]) are equivalent and interchangeable. They provide a different statistical interpretation, but they produce the same results. Hence, there would be no difference to the performance of our framework if we were to use the p-value method for constructing prediction sets.

Let’s take the classification setting again for example. In the modern view of CP for coverage guarantees we want to construct a prediction set that contains the true label with high probability $1-\alpha$. The interpretation here is that the sets quantify the uncertainty of the underlying predictive model, with larger sets indicating more uncertainty. Mechanically, we compute a conformal score for each possible label and include it in the set if the score is greater than a calibrated threshold, which is a quantile of scores on a calibration set. The quantile to use depends on the desired coverage $1-\alpha$.

In the classic view using p-values we want to construct a prediction set of all labels that are statistically consistent with observations from the calibration set. The interpretation is that CP conducts multiple hypothesis tests and controls Type I errors. Mechanically, for each possible label we compute a p-value representing what fraction of examples in the calibration set are as non-conforming as the proposed label. Then we include all labels where the p-value is greater than a significance level $\alpha$.

The definitions of the quantile and p-value in the two views, along with their set construction methods, can be exactly mapped onto one another, making them totally equivalent in practice.
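As an illustration of this equivalence (our own toy check on synthetic nonconformity scores, not an experiment from the paper), the two constructions can be compared directly; with continuous, tie-free scores they select exactly the same labels:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n = 0.1, 200
cal_nonconf = rng.normal(size=n)          # nonconformity scores on the calibration set
candidates = rng.normal(size=10)          # nonconformity scores of candidate labels for one test point

# Coverage view: include a candidate if its nonconformity is <= the calibrated quantile.
k = int(np.ceil((1 - alpha) * (n + 1)))   # standard split-conformal index
q_hat = np.sort(cal_nonconf)[k - 1]
set_coverage = {i for i, a in enumerate(candidates) if a <= q_hat}

# p-value view: include a candidate if its conformal p-value exceeds alpha.
def p_value(a):
    return (np.sum(cal_nonconf >= a) + 1) / (n + 1)

set_pvalue = {i for i, a in enumerate(candidates) if p_value(a) > alpha}

assert set_coverage == set_pvalue         # identical sets for continuous, tie-free scores
```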

Comment

Dear Reviewer HYEU,

Please double-check your review and provide any comments or updates in response to the authors’ rebuttal. Although you have completed the mandatory acknowledgment, I could not find any changes to your review.

Best wishes, AC

Comment

Dear Reviewer,

We are open to discussion around your review and our rebuttal. Please let us know of any concerns that remain after you have gone through our rebuttal.

Thanks - the Authors

Review (Rating: 5)

The paper proposes the application of the conformal prediction framework to extractive summarization. It formalizes the task of conformal importance summarization based on an importance score function $R$ for each sentence in a document and a conformal threshold that is calibrated on a small dataset. The experiments on diverse summarization tasks demonstrate the effectiveness of the method for controlling recall of important sentences, as well as the impact of different importance models.

Strengths and Weaknesses

Strengths

  • The application of conformal importance is quite suitable for extractive summarization, especially in the high-stakes settings where controllability is important.
  • The method also has advantages as it requires a small dataset for calibration.
  • The experiments are comprehensive and well designed, which helps to support the paper claims.

Weaknesses

  • The proposed approach might not scale to longer documents as it requires per-sentence judgment of importance using LLMs.
  • The effects of the calibration data is not explored. It would be informative to have some measure of performance of top-performing importance model as a function of the calibration dataset size.
  • The Appendix is missing, so I could not check some details like the prompts for sentence-level importance.

Questions

In Table 2, it would be helpful to have a baseline line vanilla LexRank, subject to a summary budget in words matching the average words from conformal importance summaries.

Limitations

yes

Final Justification

The authors provide additional experiments that strengthen evidence of the effectiveness of the proposed method over baselines. In the rebuttal, they also address other important aspects like the impact of the size of calibration data. Thus, I revise my recommendation to accept the paper.

Formatting Issues

Missing Appendix.

Author Response

Thank you for your time spent reviewing our paper, and for noting that our experiments were comprehensive. We are happy to respond to your questions and the weaknesses you raised below.

W1:

The proposed approach might not scale to longer documents as it requires per-sentence judgment of importance using LLMs.

We acknowledged in L322 that the datasets we used were limited in the length of documents they contained; however, there is no fundamental limitation to applying our method on longer documents, only increased cost to score sentences. As such, we have repeated our experimental setup behind Table 2 on longer documents from the SciTLDR dataset [9]. We originally used the Abstract-Introduction-Conclusion subset of SciTLDR, with the average input document length being ~41 sentences. Here we use the full-document subset of SciTLDR, where the entire paper is used as the input document, increasing the input length to an average of 216 sentences, with resulting performance below:

Importance Score      | AUPRC | Fraction of Sentences Removed
Original Article      | 0.014 | 0.00
Cos. Sim. Centrality  | 0.14  | 0.50
Sentence Centrality   | 0.09  | 0.45
GUSUM                 | 0.09  | 0.20
LexRank               | 0.14  | 0.35
Gemini 2.0 Flash-Lite | 0.25  | 0.45

For the longer documents the positive label rate is much lower (1.4%), and so we see the method is able to remove a higher fraction of sentences, while still maintaining a low rate $\alpha=0.2$ of failure to achieve high recall $\beta=0.8$.

While we used sentences as a fundamental unit of text to perform conformal summarization, this was a choice for convenience; the underlying method is flexible for any defined span of text like paragraphs or strings of $N$ words. Simply break the long-text into spans, score their importance, and filter out spans with scores lower than the calibrated threshold. This is one way to scale to very long documents if per-sentence judgements become prohibitively expensive: use paragraph-sized spans to reduce the number of spans that need to be scored.
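As a sketch of this flexibility (with a placeholder scorer and a naive splitter, not the paper's implementation), only the unit that gets scored changes; the threshold must of course be calibrated on spans of the same granularity:

```python
def conformal_summary_over_spans(text, score_fn, q_hat, unit="paragraph"):
    """Keep the spans whose importance score clears the calibrated threshold.

    score_fn is a placeholder for any importance scorer (an LLM prompt, LexRank,
    centrality, ...); q_hat must be calibrated on spans of the same unit.
    """
    if unit == "paragraph":
        spans = [p.strip() for p in text.split("\n\n") if p.strip()]
    else:  # crude sentence split; a real system would use a proper tokenizer
        spans = [s.strip() + "." for s in text.split(".") if s.strip()]
    return [sp for sp in spans if score_fn(sp) >= q_hat]
```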

W2:

The effects of the calibration data is not explored. It would be informative to have some measure of performance of top-performing importance model as a function of the calibration dataset size.

The effect of calibration dataset size $n$ in conformal prediction is well understood theoretically; in short, it has no effect on summary conciseness on average. The coverage guarantee of conformal prediction is famously “valid in finite samples”, which means that it holds statistically for any finite calibration dataset. In practice, $n$ controls the variance of the coverage viewed as a random variable over the calibration data.

CP assumes the calibration data is drawn from a distribution, and that test datapoints are exchangeable with the calibration data. The conformal threshold $\hat{q}$ is a random variable that depends on the calibration data; drawing $n$ new calibration points and rerunning CP would give a different value for $\hat{q}$. In turn, the conformal prediction sets would also differ, leading to slight differences in empirical coverage and conciseness. The main theorems of CP say that the coverage, as a random variable over the calibration data, will have an average that is no less than $1-\alpha$ (e.g. our Theorem 1, or Equation 1). However, we know more about the distribution than just its average. For typical applications of CP, it follows a Beta distribution with a variance that scales like $O(n^{-1/2})$. For a textbook-style explanation of these details, see Sec 3.2 of [2].

So, while slightly different conformal thresholds will in practice give different recall and conciseness for summaries, the fluctuations are random and are not expected to give better or worse conciseness as $n$ increases; $n$ controls the size of the random fluctuations.
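A small Monte Carlo sketch on synthetic scores (our illustration, not an experiment from the paper) shows both effects: the mean of the empirical coverage stays near or above $1-\alpha$ for every $n$, while its spread shrinks as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n_test, trials = 0.2, 2000, 500

for n_cal in (25, 50, 75, 100):
    coverages = []
    for _ in range(trials):
        cal = rng.normal(size=n_cal)             # synthetic calibration nonconformity scores
        test = rng.normal(size=n_test)           # synthetic test nonconformity scores
        k = int(np.ceil((1 - alpha) * (n_cal + 1)))
        q_hat = np.sort(cal)[min(k, n_cal) - 1]  # clip for tiny calibration sets
        coverages.append(np.mean(test <= q_hat))
    print(f"n={n_cal}: mean coverage {np.mean(coverages):.3f}, std {np.std(coverages):.3f}")
```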

The only other notable effect of $n$ is on the tightness of the upper bound in our Theorem 1 (compare to Equation 1 from [2]).

Recognizing the costly nature of obtaining labels, we limited our experiments to $n=100$ and still obtained strong coverage results, demonstrating that very little labeled data is needed in practice.

W3:

The Appendix is missing…

The appendix is provided as Supplementary Material rather than added to the main PDF, both of which were options given in the PC’s instructions this year. Please download the zip file from OpenReview, and we can answer any further questions it brings up for you.

Q1:

In Table 2, it would be helpful to have a baseline line vanilla LexRank, subject to a summary budget in words matching the average words from conformal importance summaries.

In the paper we used LexRank as a scoring function for conformal summarization, and based on your suggestion have now run a baseline using LexRank directly for summarization on the ECT dataset, using a word budget that matches the average from the other conformal methods. For reference, this ranges from 834 words (Gemini 2.5 Flash) to 1101 words (Cosine Similarity Centrality).

We tested LexRank in both continuous and discrete modes. For the discrete mode, we used the four edge-weight thresholds (0.1, 0.2, 0.3, 0.4) as done in [20]. Additionally, we experimented with two initialization strategies for LexRank’s TF-IDF-based embeddings: one using only the calibration dataset, and another using the combined calibration and test datasets. We only report the latter, as it is what we used in the main paper and we found the difference in results to be negligible.

We evaluate the recall of important sentences under this framework, the results of which can be found in the table below. Columns indicate which alternative conformal method was used to derive the average text length to use as the budget for LexRank.

Edge Weight Threshold | Gemini 2.0 Flash-Lite | Gemini 2.5 Flash | GPT-4o mini | Llama3-8B | Qwen3-8B | Cos. Sim. Centrality | GUSUM | Sentence Centrality
continuous            | 0.76 | 0.71 | 0.79 | 0.82 | 0.83 | 0.83 | 0.86 | 0.82
0.1                   | 0.75 | 0.69 | 0.77 | 0.81 | 0.82 | 0.83 | 0.85 | 0.81
0.2                   | 0.75 | 0.69 | 0.78 | 0.82 | 0.83 | 0.84 | 0.86 | 0.82
0.3                   | 0.77 | 0.72 | 0.79 | 0.83 | 0.84 | 0.84 | 0.86 | 0.83
0.4                   | 0.77 | 0.72 | 0.79 | 0.82 | 0.84 | 0.84 | 0.87 | 0.83

Recall varies between 0.69 (0.1 edge weight threshold, based on Gemini 2.5 summaries) up to 0.87 (0.4 edge weight threshold, GUSUM summaries). By comparison, all our conformal methods from Table 2 achieve recall $\geq 0.80$ for 80% of the datapoints. Hence, conformal summarization with the best LLMs (Gemini or GPT models) tends to outperform LexRank at retaining important information for the same length budget.

This is also consistent with our AUPRC measurement, which is purely a measure of how well an NLP method can discriminate between important and unimportant information (i.e. it does not rely on conformal methods at all). From Table 2, LexRank has AUPRC of 0.22 on ECT, significantly worse than Gemini or GPT models (0.30 - 0.43 on ECT), but still roughly comparable to other classical NLP techniques.

The fact that base LexRank’s recall is not controllable, unlike the conformal methods to which we compare it, also shows the utility of conformal prediction in the summarization setting. Integrating LexRank with conformal summarization removes this limitation of the base method alone.

Comment

Thanks for providing clarifications and additional experiments.

Regarding W1, while you can choose other semantic units like paragraphs, those hyperparameters are not explored in this work. Thus, the quality of the summarizer in those settings is not known.

W2: I was concerned actually about the scenario where one might have fewer than 100 samples. What is the impact on the variance of coverage if I use 50 samples instead? I also suppose it depends on the quality of the scoring model, so is there a way to optimize the number of samples for a given model (to achieve a desired coverage variance)? Given how costly it is to obtain gold summaries, these are relevant questions for the application of summarization models in domains like the ones mentioned in the paper. Of course, you can refer the reader to related theoretical work, but since you are introducing conformal importance and summarization, this is the kind of discussion I would expect to see in this paper.

I adjusted the quality score and I am keeping my recommendation.

Comment

Thank you for clarifying your concerns with W1 and W2. For W1, we refrained from claiming in the paper that the proposed method works well on spans other than sentences. We agree that this would need to be experimentally investigated before such claims are made. Still, we hope we have addressed your original concern with the new results on larger documents which show that sentence-level scoring can still work at the scale of hundreds of sentences.

For W2 on the importance of calibration set size, we can easily explore the coverage and its variance for even smaller calibration sets to cover the case where human-labeled examples are very expensive. Below, we show results similar to Figure 2 where we track the empirical coverage for various target coverage levels ($1-\alpha$). We now ablate over calibration set sizes with values ranging from 25 to 100 samples, and present both the mean and standard deviation of the empirical coverage random variable. Since 100 samples were originally provisioned for the calibration set, we simply subsample them at random for the ablations with fewer samples. The rest of the experimental settings match those of Figure 2, with $\beta=0.8$ and Gemini 2.5 Flash.

Mean of Empirical Coverage

Target Coverage | n=25 | n=50 | n=75 | n=100
0.60            | 0.61 | 0.61 | 0.60 | 0.61
0.65            | 0.65 | 0.67 | 0.66 | 0.65
0.70            | 0.73 | 0.71 | 0.71 | 0.70
0.75            | 0.77 | 0.77 | 0.75 | 0.75
0.80            | 0.81 | 0.81 | 0.80 | 0.80
0.85            | 0.89 | 0.86 | 0.86 | 0.85
0.90            | 0.92 | 0.90 | 0.91 | 0.90
0.95            | 0.96 | 0.96 | 0.96 | 0.95

Std of Empirical Coverage

Target Coverage | n=25 | n=50 | n=75 | n=100
0.60            | 0.09 | 0.07 | 0.05 | 0.05
0.65            | 0.09 | 0.07 | 0.06 | 0.05
0.70            | 0.09 | 0.06 | 0.05 | 0.05
0.75            | 0.09 | 0.06 | 0.05 | 0.05
0.80            | 0.08 | 0.05 | 0.05 | 0.04
0.85            | 0.06 | 0.05 | 0.04 | 0.04
0.90            | 0.05 | 0.04 | 0.03 | 0.03
0.95            | 0.04 | 0.03 | 0.02 | 0.02

As guaranteed by our Theorem 1, the mean empirical coverage is consistently above the target minimum coverage $1-\alpha$, and below the upper bound $1-\alpha + \frac{1}{n+1}$, which is 1 to 4 percentage points higher depending on $n$.

More importantly for your question, we find the variance decreases as the calibration set size increases, which is predicted by the theory we referenced previously. At $n=50$, which you asked about, the standard deviation of empirical coverage is 1 to 2 percentage points higher than at $n=100$. However, the mean is also higher by 1 to 2 percentage points, meaning that the “failure cases”, where the empirical coverage falls below the target level due to random fluctuations, have about the same severity.

You originally asked about the performance of the methods as a function of calibration set size. Below we show the main metric for performance -- fraction of sentences removed -- for the same settings as just discussed with $1-\alpha=0.8$. We have rerun the method for 20 random calibration/test splits and present the mean and standard deviation.

Fraction of Sentences Removed:

Cal Set Size | Mean | Std
25           | 0.28 | 0.09
50           | 0.31 | 0.08
75           | 0.32 | 0.08
100          | 0.33 | 0.05

As $n$ decreases there is a slight decrease in metric values, which corresponds to the higher average coverage we found above. Overall, there is not expected to be a strong dependence of performance on calibration set size from the theory of conformal prediction. The mechanism of CP tries to estimate the value of the conformal threshold which will lead to a desired coverage level. The “true” value of the threshold that will achieve a target coverage $1-\alpha$ is the same for all calibration set sizes, and changing the number of calibration points only affects the variance of the estimated threshold. In turn, the selected threshold directly determines the method’s performance regardless of $n$.

If you find these points valuable, using the additional content page for the camera-ready version we can include a complete discussion, covering all conformal scoring functions.

Comment

Thanks for those results! To me it sounds counterintuitive that the summarization performance is the same regardless of the calibration size. But in a high-stakes domain, the variance of the coverage (that is, how trustworthy your summaries are) is an important quality factor. I also can see that the performance starts to converge around 75 samples, which means I could save 25% of the data collection effort for a similar result.

As a follow-up question, do you think that the quality of the scorer would influence those calibration curves? If you compare the strongest scorer (the best LLM) with the weakest one (maybe LexRank?), would the latter require more samples for the same variance/performance metrics?

Yes, I would consider adding this discussion if you have space available, as sample efficiency is important for summarization.

Comment

We agree, the variance of the coverage is an important factor for practical deployments since the coverage guarantee in Theorem 1 is probabilistic. The only way to consistently influence this variance is through the size of the calibration set. As another point of comparison based on your suggestion, we reran our method with the LexRank importance scoring function, which is much weaker than the Gemini 2.5 Flash scorer used in the previous comment (cf. Table 2). All other settings are the same as above.

Mean of Empirical Coverage

Target Coverage | n=25 | n=50 | n=75 | n=100
0.60            | 0.61 | 0.61 | 0.60 | 0.60
0.65            | 0.65 | 0.66 | 0.66 | 0.65
0.70            | 0.72 | 0.70 | 0.71 | 0.70
0.75            | 0.77 | 0.76 | 0.75 | 0.75
0.80            | 0.81 | 0.80 | 0.80 | 0.80
0.85            | 0.88 | 0.86 | 0.86 | 0.85
0.90            | 0.92 | 0.90 | 0.91 | 0.90
0.95            | 0.96 | 0.96 | 0.96 | 0.95

Std of Empirical Coverage

Target Coverage | n=25 | n=50 | n=75 | n=100
0.60            | 0.09 | 0.07 | 0.06 | 0.05
0.65            | 0.09 | 0.06 | 0.05 | 0.05
0.70            | 0.08 | 0.06 | 0.05 | 0.05
0.75            | 0.08 | 0.06 | 0.05 | 0.04
0.80            | 0.07 | 0.05 | 0.04 | 0.04
0.85            | 0.06 | 0.05 | 0.04 | 0.04
0.90            | 0.05 | 0.04 | 0.03 | 0.03
0.95            | 0.03 | 0.03 | 0.02 | 0.02

The average coverage and variance results are very similar between scoring methods because they are controlled by theoretical guarantees and follow known distributions that depend on nn, not the efficacy of the scoring method.

The big difference is of course in performance -- the conciseness of summaries at a given coverage level.

Fraction of Sentences Removed:

Cal Set Size | Mean | Std
25           | 0.14 | 0.04
50           | 0.13 | 0.03
75           | 0.14 | 0.03
100          | 0.14 | 0.03

LexRank is much worse at estimating the importance of sentences than Gemini 2.5 Flash, and so does not produce concise summaries. However, the trend of performance not being strongly affected by calibration set size is consistent. Increasing the calibration set size will not improve the overall performance of summarization, regardless of the scoring method used; it only has a clear effect of decreasing the variance of the empirical coverage.

Review (Rating: 5)

Given a document, the proposed Conformal Importance Summarization (CIG) framework scores each sentence’s importance and filters out less important ones using a threshold calibrated to ensure that the resulting summary retains at least a fraction $\beta$ of important content with probability $\geq 1-\alpha$.

CIG is model-agnostic and supports user-tunable tradeoffs between conciseness and completeness. Experiments across five datasets show that CIG consistently meets its theoretical recall guarantees, while offering controllable summary length and outperforming standard scoring baselines in recall and precision under varying $\alpha$ and $\beta$ settings.

Strengths and Weaknesses

Strengths

  • This work is the first to apply conformal prediction to document summarization, which provides distribution-free statistical guarantees on retaining important content in summaries.
  • The method is grounded in solid conformal prediction theory, where authors formalize a coverage guarantee for summarization.
  • Experiment results confirm the theoretical coverage guarantee.
  • The paper is quite transparent about implementations and supplementary details, which can help a lot with reproducibility.

Weaknesses

  • The method depends on having a predefined set of important sentences for each document (for calibration and evaluation). In many real-world cases, these important labels can be subjective or expensive to collect. The quality of the summary’s guarantee is only as good as the quality of these labels. There could be some more discussion on how robust this method is to noisy or imperfect importance labels.
  • While the motivation is to improve summarization in high-stakes domains like health and law, evaluation is only run on standard summarization benchmarks (news, scientific papers, dialogues). In a truly high-stakes setting, importance could be multifaceted and harder to capture with a single proxy.
  • The method is heavily focused on coverage guarantees, and does not consider other aspects of summary quality. While completeness (high recall) is critical, things like coherence or usefulness are not being evaluated.

Questions

  • Although existing abstractive summarizers (e.g., GPT-4 or PEGASUS) rarely reproduce the exact sentences labeled important, they often paraphrase them well enough that the underlying facts are still conveyed. Measuring how much of this critical content they actually cover (e.g., by computing the recall of important sentences using semantic entailment or similarity rather than verbatim overlap) would clarify how frequently those baselines meet a chosen recall target $\beta$, even without any formal guarantees. Reporting this would shed some more light on the practical advantage of CIG.
  • Given that the motivation is centered on critical-domain summaries (medical, legal, etc.), have you considered how to validate CIG in such settings? What should such expert-defined important content look like, and what would be the obstacles and ways to overcome them?
  • Have you thought about combining the current approach with abstractive summarization techniques? For instance, you could first use CIG to identify a set of important sentences, then generate a refined summary possibly with an LLM. If you compare summaries generated via this two-stage approach to those generated via single-stage prompting of the same LLM, how do they perform in terms of quality and coverage?

Limitations

yes

Final Justification

The rebuttal addressed my points, so I choose to maintain my original positive rating.

Formatting Issues

N/A

Author Response

Thank you for your positive review! We are pleased to see that you found our theory “solid”, and will address your questions and comments below.

W1:

The method depends on having a predefined set of important sentences … can be subjective or expensive to collect. The quality of the summary’s guarantee is only as good as the quality of these labels. There could be some more discussion on how robust this method is to noisy or imperfect importance labels.

We agree with your argument that in real-world, professional, and sensitive domains, sentence level importance labels would be hard/expensive to collect. While modern LLMs have been shown to achieve good performance on many out-of-domain scenarios with zero/few-shot inference, it is known that their performance could drop on niche and/or sensitive domains. It should be recognized that conformal prediction does not require many labels compared to other ML-based approaches. We used only 100 labeled samples in our calibration sets and achieved tight coverage at high recalls (Fig 2).

Other than the direct and expensive approach of hiring professionals to label training samples, one possible idea is to adapt domain-finetuned LLMs for the purpose of either scoring or labeling. There has been a significant amount of work demonstrating that adapted LLMs show increases in accuracy, reduction in hallucination, and/or more alignment with expert preferences (e.g. [A,B]). All these indicate that they would serve better as both a scoring function as well as a labeling mechanism.

There has been research around conformal prediction’s robustness toward label noise [C], as well as how the conformal process could be adapted to explicitly handle label noise [D]. Since we use the same theoretical principles as these works, their conclusions may extend to conformal summarization, although it is not the focus of our present paper.

[A] Woo et al. “Synthetic data distillation enables the extraction of clinical information at scale”, npj Digital Medicine, 2025

[B] Fatemi et al. “A Comparative Analysis of Instruction Fine-Tuning LLMs for Financial Text Classification”, arxiv:2411.02476

[C] Einbinder et al. “Label Noise Robustness of Conformal Prediction”, Journal of Machine Learning Research 2024

[D] Penso et al. “Estimating the Conformal Prediction Threshold from Noisy Labels”, arXiv:2501.12749

W2:

While the motivation is to improve summarization in high-stakes domains like health and law, evaluation is only run on standard summarization benchmarks (news, scientific papers, dialogues). In a truly high-stakes setting, importance could be multifaceted and harder to capture with a single proxy.

One of the five datasets we used did cover the domain of healthcare (MTS-Dialog). We agree that importance can be multifaceted. We discussed in and after L168 how our method can accommodate differing viewpoints on what is important through user-specific calibration set labels and importance score functions. In this way, each user can receive tailored summaries with coverage guarantees that correspond to their personal definition of importance.

W3:

The method is heavily focused on coverage guarantees, and does not consider other aspects of summary quality. While completeness (high recall) is critical, things like coherence or usefulness are not being evaluated.

Given that coverage is satisfied, the main metric that demonstrates the effectiveness of our method is the conciseness of the summary -- how much non-important information can be filtered out without sacrificing recall of important information. Since we are focusing on extractive summarization, the notions of coherence and usefulness are less relevant. There is no expectation that an extractive summary preserves coherent paragraph structure and flow, while usefulness is solely dictated by the recall of important information. Below in the response to Q3 we discuss extending our extractive method with abstractive post-processing, where your suggested metrics become more relevant.

Q1:

Although existing abstractive summarizers rarely reproduce the exact sentences labeled important, they often paraphrase them well enough that the underlying facts are still conveyed. Measuring how much of this critical content they actually cover (e.g., by computing the recall of important sentences using semantic entailment or similarity rather than verbatim overlap) would clarify how frequently those baselines meet a chosen recall target $\beta$, even without any formal guarantees.

This is a great suggestion! We tested GPT-4o mini on ECTSum (2,322 test samples), prompting it to produce abstractive summaries directly from the original source text, with the target of retaining all important content ($\beta=1$). While Table 2 reports results for our conformal methods at $\beta=0.8$, here we used $\beta=1$ to measure the LLM’s raw ability to preserve information. For evaluation we used semantic entailment to determine recall; each ground-truth important sentence was compared to the generated summary using an LLM-based evaluator to check if the information was retained in the summary.
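The evaluation loop can be sketched as follows (our illustration; `judge` is a placeholder for the LLM-based entailment check, not an API from the paper):

```python
def semantic_recall(important_sentences, summary, judge):
    """Fraction of ground-truth important sentences whose content the summary
    still conveys, as decided by judge(claim, summary) -> bool (e.g. a prompted
    LLM evaluator or an entailment model)."""
    if not important_sentences:
        return 1.0
    kept = sum(bool(judge(s, summary)) for s in important_sentences)
    return kept / len(important_sentences)

def coverage_at(recalls, beta=0.8):
    """Fraction of documents whose semantic recall meets the target beta."""
    return sum(r >= beta for r in recalls) / len(recalls)
```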

For this direct, single-stage abstractive summarization approach we found:

  • Coverage: Only 26% of abstractive summaries had recall $\beta \geq 0.8$, compared to the $1-\alpha=0.8$ attained with our conformal methods.
  • Recall: Average recall was 52% (std=35%), far below the instructed target of $\beta=1.0$.
  • Length: For the summaries that retained a fraction $\beta \geq 0.8$, the mean fraction of sentences removed was 65% (std=25%), which was higher than the best conformal method at 37% (Table 2).

These results show that naive prompting of an LLM is not likely to correctly retain the ground-truth important information, and clearly does not give any control over recall or coverage. LLMs tend to aggressively shorten content when instructed to summarize.

We ran a similar experiment but with 10 in-context labeled examples added to the prompt. Please see our response to Reviewer sN9c, W3/Q1. This increased the coverage to 40% and recall to 67%, but decreased conciseness to 58%.

Q2:

Given that the motivation is centered on critical-domain summaries (medical, legal, etc.), have you considered how to validate CIG in such settings? What should such expert-defined important content look like, and what would be the obstacles and ways to overcome them?

We did test our method on one medical dataset (MTS-Dialog), and this is a great example of what expert-defined importance could look like. The MTS-Dialog dataset consists of doctor-patient conversations. It is reasonable to expect that in a real-world setting like this, both the doctor and patient may want a summary of their conversation, but the material each person considers important could be very different. A doctor may focus more on the symptoms described by the patient, while a patient would highlight the doctor’s medical recommendations as most important. For the same set of conversations, we may have a set of ground truth labels from doctors, and another from patients, each of which would lead to different conformal thresholds and hence prediction sets.

Going further, the expert definitions of importance could be incorporated into the importance score function. We found that LLM scoring with a basic prompt performed very well across datasets (Table 2). However, incorporating more information about what the expert deems important could be done by providing few-shot examples labeled by the expert in the prompt to activate the in-context learning capabilities of LLMs. In this way, the LLM scoring would not simply rely on parametric knowledge of what is important from the LLM’s pre-training data, but would directly incorporate expert knowledge about the dataset and preferences at hand.

Q3:

Have you thought about combining the current approach with abstractive techniques? You could first identify a set of important sentences, then generate a refined summary with an LLM. If you compare summaries generated via this two-stage approach to those generated via single-stage prompting of the same LLM, how do they perform in terms of quality and coverage?

Adding our extractive approach as pre-processing before using abstractive summarization could remove some of the unimportant noise sentences and result in a more focused summary.

We provide experiment results of this two-stage approach: (1) use our conformal summarization to extract a set of sentences meeting a target recall threshold ($\beta=0.8$) with high coverage ($1-\alpha=0.8$), and (2) prompt GPT-4o mini to generate a cohesive abstractive summary from the extracted sentences. Note that we did not instruct the LLM to retain all important information, just to summarize the remaining text. The results on ECTSum (2,322 samples) are below, and we compare to the single-stage abstractive summarization from Q1 above:

  • Coverage: 42% of abstractive summaries reached $\beta \geq 0.8$, compared to 26% without pre-processing.
  • Recall: Average recall was 67% (std=31%), compared to 52%.
  • Length: For the summaries that retained a fraction $\beta \geq 0.8$, the mean fraction of sentences removed was 57% (std=25%), compared to 65%.

Hence, there does appear to be a benefit in how the LLM is able to identify important information if the pre-processing is used, although summaries were not quite as concise.

For another perspective - testing how well an LLM can preserve the recall and coverage guarantees in a two-stage setup - please see our response to Reviewer sN9c in W1/Q2.

Comment

Thank you for addressing my comments; it is nice to see that single-stage abstractive summarization by prompting an LLM fails to achieve a good balance between coverage and conciseness. These discussions would make a valuable addition to the paper.

For the 2-stage experiment, have you tried instructing the abstractive summarizer to rewrite the extracted sentences into a coherent paragraph while keeping it concise (instead of telling it to summarize these sentences)?

Comment

Yes, we have results on this. We took the extractive summary from our method and passed it to GPT-4o mini with instructions to shorten the text, but without removing any important content. This could improve the flow and style of writing which may be broken up when only extracting sentences. For a similar setting as above, the results on ECTSum are as follows:

Coverage: Overall coverage was 61%, a drop from the 80% achieved with our method alone, but higher than 42% when the LLM was instructed to summarize.

Recall: Overall recall was 75% (std=34%), compared to 67% (std=31%) with the summarization instruction.

Length: The abstractive post-processing step slightly increased the length (sentence count) compared to the extractive summary by 3% on average, though with very high variance (std=51%).

So while there may be some stylistic benefits to an abstractive post-processing step, we did not see any further reduction in length, but there was a loss of some important information.

Review (Rating: 4)

This paper introduces an extractive summarization framework using conformal importance. To do that, the method first assigns importance scores to sentences. It then uses conformal prediction to calibrate importance thresholds for sentence selection. The method is evaluated on multiple datasets using various scoring methods. Results show their method obtained controllable recall and performed competitively with standard extractive baselines. The idea of using conformal prediction for summarization is interesting. However, the paper lacks novelty and fundamental impact for the extractive summarization field.

Strengths and Weaknesses

Strengths:

  1. Novel integration of conformal prediction with extractive summarization
  2. The proposed CIS method works with various importance scoring-based methods
  3. Overall, the paper is well-written.

Weaknesses:

  1. The method cannot handle abstractive, long-range, or structurally complex documents, which are common in legal, scientific, and technical domains. The assumption that important content exists at the sentence level is limited.
  2. The method performance depends entirely on the quality of the sentence scoring function.
  3. The experiment lacks direct comparison with abstractive methods or GPT-style Models for document summarization.

Questions

  1. Please provide direct comparison with SOTA LLM-generated summaries.
  2. Please provide examples and analysis on long document summarization tasks.

Limitations

Yes.

Final Justification

Thank you for the authors' responses. I'd like to change my overall rating to WA.

Formatting Issues

No.

Author Response

Thank you for reviewing our paper! We are glad to hear you found our use of conformal prediction to be “novel” and will address the weaknesses and questions you raised.

W1/Q2:

The method cannot handle abstractive, long-range, or structurally complex documents… The assumption that important content exists at the sentence level is limited. Please provide examples and analysis on long document summarization tasks.

Abstractive: Our method could be combined with today’s highly performant LLMs to perform abstractive summarization as a post-processing step on our extractive summaries. The extractive summary with coverage guarantees can be fed into an LLM with instructions to further paraphrase and shorten the text, but without removing any important content. As long as the LLM can maintain information and avoid hallucination, it will provide the benefits of abstraction while maintaining our statistical guarantees.

However, in practice we find that LLMs are not good enough yet to maintain all important content while meaningfully reducing summary length further. Still, there can be other benefits of abstractive summarization, such as improving the flow and style of writing which may be broken up when only extracting sentences.

We demonstrate progress in this regard by using our conformal method with a target recall threshold $\beta=0.8$ and $1-\alpha=0.8$ coverage, and then prompting GPT-4o mini to paraphrase and shorten the text while retaining all of the information of the extractive summary. The results on ECTSum (2,322 samples) are summarized below:

  • Length: The abstractive post-processing step slightly increased the length (sentence count) compared to the extractive summary by 3% on average, though with very high variance (std=51%).
  • Coverage: Overall coverage was 61%, a drop from the 80% achieved with our method alone.
  • Recall: Overall recall was mean 75%, std 34%.

So while there may be some stylistic benefits to an abstractive post-processing step, there is no further reduction in length and a loss of some important information.

Note that the evaluation of recall for the abstractive method must be done on a semantic basis and is therefore less precise. For a given datapoint, each ground-truth important sentence was passed to an LLM evaluator along with the generated summary, and the evaluator was prompted to determine if the ground truth information still appeared in the summary.


Long-range: We acknowledged in L322 that the datasets we used were limited in the length of documents they contained, but this does not mean that our proposed method fails on longer documents. As a test, we have run our method on a dataset with longer document length: in the paper, we used the Abstract-Introduction-Conclusion subset of SciTLDR, with the average input document length being ~41 sentences. Here we use the full-document subset of SciTLDR, where the entire paper is used as the input document, increasing the input length to an average of 216 sentences, with resulting performance below (all experiment settings are as in Table 2):

Importance Score      | AUPRC | Fraction of Sentences Removed
Original Article      | 0.014 | 0.00
Cos. Sim. Centrality  | 0.14  | 0.50
Sentence Centrality   | 0.09  | 0.45
GUSUM                 | 0.09  | 0.20
LexRank               | 0.14  | 0.35
Gemini 2.0 Flash-Lite | 0.25  | 0.45

For the longer documents the positive label rate is much lower (1.4%), and so we see the method is able to remove a higher fraction of sentences, while still maintaining a low rate $\alpha=0.2$ of failure to achieve high recall $\beta=0.8$.


Sentence-level: While we used sentences as a fundamental unit of text to perform conformal summarization, this was a choice for convenience; the underlying method is flexible for any defined span of text like paragraphs or strings of $N$ words. Simply break the long-text into spans, score their importance, and filter out spans with scores lower than the calibrated threshold. This is another way to address summarization of very long documents: use paragraph-sized spans, for example, to reduce the number of spans that need to be scored.

The assumption is not so much that important content exists at the sentence level, but rather that it can be captured using non-overlapping spans. However, the upside of using this assumption and method is statistical guarantees on coverage and recall, something that cannot be obtained with abstractive techniques (see W3/Q1 below).

W2:

The method performance depends entirely on the quality of the sentence scoring function.

It is natural that any conformal prediction approach relies on the quality of its conformal score function, so we hardly see this as a weakness. For classification problems, ample research has been done on the design of high-quality scoring functions [52, 3, 30], and there is room for future research to build on our framework with new score functions. We proposed (L175ff) and evaluated (Table 2) several score functions to demonstrate the effect of score function design on the overall performance of our method, and found that today’s strong LLMs give the best results. This is encouraging, as our method can easily swap in new LLMs as they are developed over time, likely giving improved summarization performance “for free”.

W3/Q1:

The experiment lacks direct comparison with abstractive methods or GPT-style Models for document summarization. Please provide direct comparison with SOTA LLM-generated summaries.

We have run several more tests using LLMs purely in an abstractive capacity and produced metrics on recall and coverage to compare with our proposed extractive method. We evaluated GPT-4o mini on ECTSum (2,322 test samples), prompting it with 10 labeled examples from the calibration set to enable in-context learning, with instructions to produce abstractive summaries that retain all important information (target $\beta=1$). Even though in Table 2 our conformal methods targeted $\beta=0.8$, we used $\beta=1$ here to focus the evaluation on the LLM’s ability to capture important information. The evaluation of recall was done in the same way as described above for W1/Q2.

Even with in-context examples, GPT-4o mini failed to reliably retain all the important information:

  • Coverage: Only 40% of abstractive summaries achieved $\beta \geq 0.8$, compared to the $1-\alpha=0.8$ attained with our conformal method, underscoring the lack of control over recall with naive LLM prompting.
  • Recall: Average recall was 67% (std=31%), far below the instructed target of $\beta=1.0$ and the guaranteed recall we achieved with our conformal method (minimum $\beta=0.8$ with high probability).
  • Length: For the summaries that retained a fraction $\beta \geq 0.8$ of the important information, the mean fraction of sentences removed was 58% (std=26%), which was higher than the best conformal method at 37% (Table 2).

In summary, abstractive summarization with LLMs, even with labeled in-context examples, tends to aggressively shorten text and results in low recall. The biggest drawback here is that the LLM does not give the user any control over recall or coverage (we prompted for 100% recall and achieved 67%). In contrast, our conformal approach consistently meets user-specified recall and coverage targets.

For comparison, we ran a similar experiment without in-context examples, but this resulted in even more aggressive shortening, leading to worse coverage. Please see the response to Reviewer fdXq if you are interested.

Comment

Dear Reviewer,

We are open to discussion around your review and our rebuttal. Please let us know of any concerns that remain after you have gone through our rebuttal.

Thanks - the Authors

Comment

Dear reviewers, please go over and respond to the authors' rebuttal. Best wishes, AC

Final Decision

This paper introduces Conformal Importance Summarization, an extractive summarization framework that applies conformal prediction to guarantee recall of important content. It is model-agnostic, requires little calibration data, and shows controllable performance across benchmarks. Weaknesses include restriction to extractive summarization, reliance on sentence-level importance and scoring quality, limited high-stakes evaluation, and heavy use of formulas.

In rebuttal, the authors added experiments on long documents, abstractive baselines, hybrid approaches, and calibration set size, clarifying that their work extends conformal theory with practical guarantees. Reviewers’ responses were largely positive after rebuttal, with raised scores despite some concerns on readability.

Overall, this is a novel, well-justified contribution to reliable summarization, and I recommend acceptance.