PaperHub
5.7 / 10
Poster · 3 reviewers
Ratings: 6, 6, 5 (min 5, max 6, std. dev. 0.5)
Confidence: 4.0
Correctness: 3.0
Contribution: 2.0
Presentation: 2.7
ICLR 2025

Conformal Structured Prediction

OpenReview · PDF
Submitted: 2024-09-26 · Updated: 2025-05-17
TL;DR

We construct structured conformal prediction sets via integer programming.

Abstract

Keywords
Conformal Prediction · Structured Prediction · Integer Programming

Reviews & Discussion

Review (Rating: 6)

The paper "Conformal Structured Prediction" introduces a novel framework to extend conformal prediction methods to complex, structured output spaces. Conformal prediction typically provides sets of labels with a high probability of containing the true label, giving a quantified uncertainty. However, for tasks involving structured or hierarchical outputs—such as text generation, image classification with hierarchical categories, or question answering with date ranges—traditional conformal methods produce prediction sets that can be large and hard to interpret.

The authors propose a method to generate interpretable structured prediction sets using a conformal predictor that works within a directed acyclic graph (DAG) representing the label hierarchy. This approach maintains coverage guarantees while reducing prediction set complexity. The paper introduces algorithms for efficiently determining the smallest structured prediction set that meets a specified confidence level, and it adapts these methods to provide both marginal and Probably Approximately Correct (PAC) guarantees.
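For intuition, the core selection problem described here (the smallest structured prediction set over a DAG subject to a coverage constraint) could be sketched as a toy brute-force search in place of the paper's integer program; all names and the exhaustive search below are our illustration, not the authors' implementation.

```python
from itertools import combinations

def smallest_structured_set(nodes, leaves_of, probs, tau, m):
    """Toy stand-in for the integer program: among all subsets of at most
    m DAG nodes, return the induced leaf set with probability mass >= tau
    that covers the fewest leaves. Exhaustive search; toy-sized DAGs only."""
    best = None
    for k in range(1, m + 1):
        for combo in combinations(nodes, k):
            covered = set().union(*(leaves_of[v] for v in combo))
            if sum(probs[leaf] for leaf in covered) >= tau:
                if best is None or len(covered) < len(best):
                    best = covered
    return best

# Tiny hierarchy: internal node "A" covers leaves 0,1; "B" covers 2,3.
leaves_of = {"A": {0, 1}, "B": {2, 3}, 0: {0}, 1: {1}, 2: {2}, 3: {3}}
probs = {0: 0.5, 1: 0.3, 2: 0.1, 3: 0.1}
print(smallest_structured_set(list(leaves_of), leaves_of, probs, tau=0.8, m=1))
```

With a node budget of m = 1, the single coarse node "A" (mass 0.8) is the smallest qualifying set; the actual method replaces this exponential search with an integer program.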

The authors demonstrate the utility of their approach through experiments on MNIST digits, hierarchical ImageNet classification, and a question answering dataset focused on years as answers. The results show that their framework achieves desired coverage rates while keeping prediction sets concise and interpretable.

Strengths

  1. Extension of conformal prediction to structured outputs using DAGs, combining conformal prediction with hierarchical representations.
  2. Rigorous theoretical development with both marginal and PAC coverage guarantees, validated through experiments in diverse domains.
  3. Generally well-organized with clear explanations and helpful visual aids, making complex concepts accessible.
  4. Addresses an important gap, potentially impacting applications that require structured uncertainty quantification.

Weaknesses

  1. While the paper’s application of conformal prediction to structured outputs is valuable, similar approaches have been explored in hierarchical classification and structured prediction. For instance, previous works have used hierarchical structures (e.g., DAGs or trees) to improve interpretability in label prediction. The paper could benefit from a more thorough comparison to these existing methods, as well as a deeper explanation of what sets its approach apart. Highlighting any unique technical advancements in how the proposed framework constructs prediction sets in hierarchical settings (e.g., advantages of the DAG-based approach) would further clarify its contributions.

Set-Valued Prediction in Hierarchical Classification with Constrained Representation Complexity. Mortier et al. (2022): This work focuses on hierarchical classification with constrained set-valued predictions, utilizing hierarchical label spaces in which classes are structured in a directed acyclic graph (DAG) or tree. It emphasizes creating interpretable predictions that adhere to hierarchical constraints, much like structured conformal prediction but without formal coverage guarantees.

  2. The experiments are limited to specific domains (MNIST digits, ImageNet, and SQuAD). Although these domains represent a variety of structured prediction tasks, they are relatively controlled environments and may not fully reflect the challenges of deploying conformal structured prediction in more complex real-world applications. For instance, the framework’s performance on prediction sets for multi-label classification or in contexts with high label ambiguity (e.g., complex multi-class categorization in medical or legal documents) remains untested.

Deploying conformal structured prediction in a healthcare setting, particularly for diagnostic tasks with hierarchical or multi-label structures (e.g., identifying conditions or diseases from imaging data or lab results), would offer insights into the model's reliability and interpretability under more variable, high-stakes conditions. This field often requires nuanced coverage guarantees and interpretability to support clinical decision-making.

Real-world document classification often involves hierarchical categories (e.g., legal documents, financial reports) and multi-label classifications. Testing in this setting could reveal how well the model scales with complex, unbalanced label hierarchies, providing additional insights into its generalizability to larger, noisy datasets that are typical in business and legal contexts.

  3. The paper assumes that the label hierarchy can be represented by a DAG, which works well for hierarchical classification but may be restrictive for tasks with overlapping or cyclical dependencies. In complex scenarios where relationships between classes are non-hierarchical or not acyclic, this assumption may not hold, potentially limiting the framework’s applicability to structured outputs with more intricate dependencies.

  4. The inclusion of PAC guarantees is a strong point, but the differences between the marginal and PAC guarantees could be better explored. The PAC guarantee is inherently conservative, and the experiments demonstrate that it leads to larger prediction sets in some cases. However, there is little analysis of scenarios where a PAC guarantee might be more beneficial than a marginal guarantee, or vice versa, depending on the task or risk tolerance.

Questions

  1. How does the proposed method for conformal structured prediction fundamentally differ from or improve upon prior hierarchical prediction approaches?

  2. The framework assumes that the label space is represented by a DAG. How does this assumption impact generalizability to label structures that are non-hierarchical, cyclical, or have overlapping dependencies?

  3. Integer programming (IP) can be computationally intensive, particularly for large DAGs. Have you measured or benchmarked the runtime performance and scalability of the IP formulation, especially in tasks with larger label hierarchies?

  4. Could you elaborate on practical scenarios or domains where marginal coverage versus PAC coverage would be preferable? How should a practitioner decide between the two guarantees in real-world settings?

  5. Have you considered comparing this approach to other recent extensions of conformal prediction tailored to structured outputs or complex tasks (e.g., those applied to natural language or image data)?

  6. How sensitive is the framework to the hyperparameter m (the maximum number of nodes in the prediction set)? Is there a recommended method for tuning m based on the domain or task?

Comment

Thank you for taking the time and effort to read and provide comments on our work. We provide answers to all the remarks and questions below.

How does the proposed method for conformal structured prediction fundamentally differ from or improve upon prior hierarchical prediction approaches?

Thank you for sharing this related work. We note that our goals are significantly different from the goals in [1]. In particular, the goal in [1] is to find the prediction set with the highest probability mass (as indicated by Eq. (3) in [1]), whereas our goal is to find the smallest possible prediction sets under a constraint that the coverage rate meets some desired level. As indicated by Table 2 of [1], their approach would obtain much lower coverage (e.g., <50%), whereas our results demonstrate that our approach achieves the desired coverage rate.

[1] T. Mortier et al. Set-Valued Prediction in Hierarchical Classification with Constrained Representation Complexity, 2022.

Conduct tests in more complex environments (e.g., multi-label classification, multi-class classification in healthcare, or real-world document settings).

Thank you for this suggestion. We have added a new experiment to demonstrate the application of our framework to predict the emotion labels in a given piece of text. In particular, we use the GoEmotions dataset [2], which consists of 27 emotion categories annotated on 58,000 English Reddit comments. They also provide a hierarchical structure on these categories (Figure 2 in [2]). We show coverage rates and average prediction set sizes here (https://ibb.co/pvttLc9) using the same format as the other results in our paper. We will add these results to an updated version of our paper.

[2] Demszky, D., Movshovitz-Attias, D., Ko, J., Cowen, A., Nemade, G., & Ravi, S. (2020). GoEmotions: A dataset of fine-grained emotions. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 4040-4054).

The framework assumes that the label space is represented by a DAG. How does this assumption impact generalizability to label structures that are non-hierarchical, cyclical, or have overlapping dependencies?

We emphasize that our DAG structure is a structure on the space of prediction sets, and can differ from the structure of the label space. In applications where the label space has a tree or DAG structure, we can naturally consider structured prediction sets that conform to this structure, but this is not a requirement. For instance, if the label space has a graph structure, we could still construct prediction sets representing sets of labels, and impose a DAG structure on these prediction sets (e.g., based on set inclusion). Thus, our approach can be flexibly applied to more complex domains. If you have any suggestions about specific kinds of label structures that might be of interest, we are happy to discuss them in detail.
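The set-inclusion ordering mentioned here can be sketched in a few lines; the function name and edge convention below are our illustration of the idea, not code from the paper.

```python
def inclusion_edges(candidate_sets):
    """Order candidate prediction sets by set inclusion: add an edge i -> j
    whenever set i strictly contains set j (so j refines i). The resulting
    relation is acyclic even if the underlying label space is an arbitrary
    graph, since strict inclusion admits no cycles."""
    return [(i, j)
            for i, a in enumerate(candidate_sets)
            for j, b in enumerate(candidate_sets)
            if i != j and a > b]  # a > b: proper superset

sets = [{1, 2, 3}, {1, 2}, {1}]
print(inclusion_edges(sets))
```

Note this builds the full transitive relation rather than a minimal Hasse diagram, which suffices to show that a DAG structure on prediction sets need not mirror the label-space structure.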

Could you elaborate on practical scenarios or domains where marginal coverage versus PAC coverage would be preferable? How should a practitioner decide between the two guarantees in real-world settings?

The main difference between the two is the margin of safety that is provided. Marginal guarantees are on average over both the training set and the new examples. Thus, for a given training set, the resulting classifier might fail to satisfy the guarantee. In contrast, PAC prediction sets hold with high probability over the training set. This difference is illustrated by comparing the coverage rate plots for the two different approaches in our paper. When using marginal guarantees, the average coverage across different random seeds is above the desired coverage level, but any individual random seed may fall above or below this level. In contrast, for PAC guarantees, the coverage is above the desired coverage level for almost all random seeds.

Given these properties, PAC coverage generally provides greater reliability than marginal coverage. Thus, in domains where it is critical for the deployed model to satisfy the coverage guarantee, PAC prediction sets should be used; otherwise, marginal guarantees may suffice.
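The distinction between the two guarantee types can be simulated generically: with Uniform(0,1) nonconformity scores, the coverage of a threshold t is exactly t, so we can compare the order-statistic index used by a marginal (split conformal) guarantee against a more conservative PAC index chosen via a binomial tail bound. This is a standard textbook illustration under our own assumptions, not the paper's algorithm.

```python
import math
import random

def marginal_index(n, eps):
    # order-statistic index for the standard split-conformal (marginal) guarantee
    return math.ceil((n + 1) * (1 - eps))

def pac_index(n, eps, delta):
    # smallest k with P[Binomial(n, 1 - eps) >= k] <= delta, so that the
    # k-th order statistic over-covers with probability at least 1 - delta
    p = 1 - eps
    for k in range(n, 0, -1):
        tail = sum(math.comb(n, j) * p**j * (1 - p)**(n - j)
                   for j in range(k, n + 1))
        if tail > delta:
            return k + 1
    return 1

random.seed(0)
n, eps, delta, trials = 1000, 0.1, 0.05, 300
km, kp = marginal_index(n, eps), pac_index(n, eps, delta)
below_m = below_p = 0
for _ in range(trials):
    s = sorted(random.random() for _ in range(n))
    # with Uniform(0,1) scores, the coverage of threshold t is exactly t
    below_m += s[km - 1] < 1 - eps
    below_p += s[kp - 1] < 1 - eps
print(km, kp, below_m / trials, below_p / trials)
```

Across repeated calibration draws, the marginal threshold under-covers on roughly half the runs (its coverage is only correct on average), while the PAC threshold under-covers on at most a delta fraction of runs, at the cost of a larger index and hence larger sets.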

Integer programming (IP) can be computationally intensive, particularly for large DAGs. Have you measured or benchmarked the runtime performance and scalability of the IP formulation, especially in tasks with larger label hierarchies?

The plot here (https://ibb.co/2y276tG) illustrates the average time required to solve each IP problem as we vary m while fixing ε = 0.1, for each of our four domains.

(continued on next comment)

Comment

(continued from the previous comment)

As can be seen, the MNIST 2-digit and ImageNet problems exhibit similar computation times, both demonstrating fast solve times. As explained in the paper, the MNIST-2 tree has a depth of 3, 100 leaves, and 111 nodes, whereas the ImageNet tree is much larger, with 18 layers, 1000 leaves, and 1816 nodes. In our framework, increasing the scale of the DAG does not necessarily result in increased computation time for the IP problem.

However, when extending to the MNIST 3-digit problem, the computation time for the IP becomes slightly slower. The MNIST-3 tree has a depth of 4, 1000 leaves, and 1111 nodes, which significantly increases the number of nodes compared to the MNIST-2 tree and notably increases the tree density compared to the ImageNet tree. The IP computation can become intensive as both the number of nodes and the density of the DAG scale up. A practical strategy to alleviate this computational burden is to simplify the DAG by removing some internal nodes while preserving the overall hierarchy. In particular, if a node v is removed, its parent nodes become the parents of each of v's children. This approach allows us to maintain the structured prediction set property while improving computational efficiency.
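The node-removal simplification described in the previous paragraph amounts to a small graph rewrite; the sketch below (our own representation of the DAG as a child-set dict) shows the reattachment step.

```python
def remove_internal_node(children, v):
    """Simplify a DAG (given as {node: set of children}) by deleting
    internal node v and reattaching each of v's parents directly to
    v's children, preserving the overall hierarchy."""
    grandchildren = children.pop(v)
    for kids in children.values():
        if v in kids:          # this node was a parent of v
            kids.remove(v)
            kids |= grandchildren
    return children

dag = {"root": {"a", "v"}, "v": {"x", "y"},
       "a": set(), "x": set(), "y": set()}
print(remove_internal_node(dag, "v")["root"])
```

After removing "v", "root" is connected directly to "x" and "y", so every leaf remains reachable while the DAG has one fewer internal node for the IP to consider.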

Finally, the structure for the question-answering task is not a tree but a DAG, where a single node can have multiple parents. The DAG structure used for this task has 51 layers, 51 leaves, and 650 nodes, making it relatively sparse compared to structures for the other three domains. In this case, there is a small jump in computation time when m = 2, but the overall computation time remains manageable.

Have you considered comparing this approach to other recent extensions of conformal prediction tailored to structured outputs or complex tasks (e.g., those applied to natural language or image data)?

We have added a comparison to [1], which considers a tree structure in the code domain (specifically, the abstract syntax tree), since this approach also targets conformal structured prediction in a specific domain. We have extended their approach to general structured prediction settings to perform our comparison; see below for details.

Another paper applying conformal prediction to natural language is [2]; however, they still use sets of samples, which may suffer from lack of interpretability. Finally, [3] considers conformal prediction for question answering; their algorithm is closely related to the learn-then-test algorithm. However, they do not consider a tree structure of structured prediction sets; instead, they only consider a single sequence of increasingly coarse-grained labels. Thus, their approach is not applicable to our problem formulation.

Now, we return to the comparison to [1]. While the strategy proposed in [1] is specialized to the code domain, we generalize their framework to apply to arbitrary DAG structures. Our results (below) show that we significantly outperform this baseline, both in terms of prediction set size and computational cost. The main shortcoming of this approach is that it leverages existing PAC prediction set algorithms, which require that the monotonicity assumption holds. Thus, their algorithm restricts the structure of the prediction sets across different values of τ to enforce monotonicity. In contrast, our approach proves a novel conformal prediction bound (Theorem 3.2 in our paper) to avoid the need for monotonicity. We show results for the question-answering task here (prediction set sizes: https://ibb.co/rp2VWqs; computational cost: https://ibb.co/ss1B9wp). These plots show the prediction set sizes using the same annotations as Figure 4 of our paper. The baseline results are represented by dashed lines. For both prediction set size and computational cost, our approach outperforms the baseline for almost every parameter setting, and significantly so for m = 2 and m = 4. Results in other domains exhibit similar trends, and we will add them in an updated version of the paper (which we plan to submit by the end of the rebuttal period).

[1] A Khakhar, et al. PAC Prediction Sets for Large Language Models of Code, ICML 2023.

[2] V Quach, et al. Conformal Language Modeling, 2023.

[3] C Mohri, T Hashimoto. Language Models with Conformal Factuality Guarantees, 2024.

(continued on next comment)

Comment

(continued from the previous comment)

How sensitive is the framework to the hyperparameter m (the maximum number of nodes in the prediction set)? Is there a recommended method for tuning m based on the domain or task?

In general, the choice of m depends on the needs of the given application domain. It governs the trade-off between the interpretability and granularity of the resulting prediction sets. Specifically, larger values of m allow more labels to be included in the set, often capturing finer-grained categories such as “attire” or “Blenheim spaniel”. Conversely, smaller values of m result in sets with fewer labels, which can be more interpretable for users. We suggest that practitioners try different values of m, and manually examine the resulting prediction sets to determine which choices offer the best tradeoff between interpretability and coverage.

Comment

We would like to add an additional comment regarding the question on the sensitivity of m. We apologize for omitting this in the previous response.

The sensitivity of the hyperparameter m depends on the DAG structure and the performance of the underlying model. Across the four tasks we evaluated, m is more sensitive in the question-answering problem than in the other experiments, due to the more ambiguous problem setting and the poorer performance of the underlying model. Since all nodes tend to have similar probability mass, usually without any dominant ones, changes in m can significantly affect the selected nodes when m is small (e.g., 1 or 2; see Figures 4a and 4b in our paper). However, the prediction sets generally become less sensitive to m as it grows (e.g., all tasks tend to exhibit smaller changes in prediction set size when m is increased from 4 to 8).

Comment

Thank you once again for taking the time to review our work. We have carefully addressed the concerns you raised to the best of our ability, and your feedback has been invaluable in improving the quality and clarity of our submission.

As the discussion period is nearing its end, we would greatly appreciate the opportunity to confirm if we have addressed your concerns satisfactorily. We hope to have the chance to discuss this further today before the discussion period closes. Have we addressed all your comments?

Comment

Thank you for addressing all the comments. I will maintain my positive score of acceptance.

Review (Rating: 6)

This paper proposes a framework for conformal structured prediction, i.e., conformal prediction in the structured prediction setting. The proposed framework outputs structured prediction sets that achieve marginal or PAC coverage guarantees while minimizing prediction set size. Representing the output space as a set of nodes in a directed acyclic graph, the prediction set is a small subset of coarse labels, which corresponds to the set of the coarse labels' fine-grained descendants. The paper presents an empirical analysis of the approach in three domains to demonstrate its performance.

Strengths

  1. The paper is well-organized for the most part.
  2. The paper is technically sound in its description of problem formulation and the marginal and PAC guarantees.
  3. Construction of prediction sets in the structured prediction setting and in the context of nodes in a directed acyclic graph is an important problem.

Weaknesses

  1. Missing discussion of important related work [1, 2]: I believe the paper does not cite or compare with important related work on conformal risk control [1]. [1] considers hierarchical image classification on ImageNet, similar to this paper, and controls the graph distance between nodes. Additionally, the RAPS method in [2] is a conformal prediction method that introduces regularization to encourage smaller and more stable sets, and is worth comparing to given the paper's focus on reducing average set size.

  2. The empirical evaluation can certainly benefit from more analysis. In the current form, the contribution and significance of the method are not demonstrated very clearly:

    • It is hard to understand the utility of the method without comparison with more baselines; I believe doing this is especially possible for the marginal guarantees. A qualitative comparison of the prediction sets would also help demonstrate the utility of structured prediction sets. I see the paper discusses one example in the main text; however, there is certainly value in adding more examples here (also mentioning the error level used for standard conformal prediction and other details for fair comparison).
    • Following from above, I appreciate Table 1 in the paper as it helps understand the influence of hyperparameters better. I would suggest adding similar examples for other datasets as well.
  3. The motivation of the paper is not very clear in the beginning and only becomes clearer as the method and examples are discussed later in the paper. While the introduction has sufficient details about the method, I would suggest making the motivation for structured prediction sets clearer early on.

Minor comments:

  1. L60: the parameter τ has not been defined and is referenced early, I believe without sufficient context here.
  2. Similar comment for Figure 2. The caption makes reference to τ, whereas the notation has not been introduced earlier in the text or in the caption.
  3. L306: typo/incomplete -> (2)
  4. L416-417: possibly missing word after ‘values’; “ in contrast, for the PAC guarantee, coverage for all values within one standard deviation...”

[1] Anastasios Nikolas Angelopoulos, Stephen Bates, Adam Fisch, Lihua Lei, and Tal Schuster. Conformal Risk Control. International Conference on Learning Representations, 2024.

[2] Anastasios Nikolas Angelopoulos, Stephen Bates, Michael Jordan, and Jitendra Malik. Uncertainty Sets for Image Classifiers using Conformal Prediction. International Conference on Learning Representations, 2021.

Questions

How should m be selected in practice? From the experiments, this selection appears as an important choice for the quality of prediction sets, however the paper lacks discussion on this aspect.

Comment

Thank you for taking the time and effort to read and provide comments on our work. We provide answers to all the remarks and questions below.

Discuss and compare to related works:

Conformal Risk Control

Uncertainty Sets for Image Classifiers using Conformal Prediction

We thank the reviewer for pointing out these two related works. We want to note the following differences between our proposed algorithm and these works.

In general, the goals in the “Conformal Risk Control” paper are qualitatively very different – their goal is to minimize more general risk functions (instead of prediction set size), whereas our goal is to extend conformal prediction to structured prediction. In Section 3.3, their purpose is to control an alternative risk function based on the hierarchical label structure, while still constructing traditional prediction sets. Indeed, we believe that our approach could be combined with theirs to further improve interpretability of the resulting prediction sets. For this specific task, their algorithm can be viewed as producing a structured prediction set; however, even in this context, their search is constrained to only parent nodes of the class with the highest estimated probability ŷ, whereas our search space is much more general (and can be much more flexibly specified by the user).

The goal of the “Uncertainty Sets for Image Classifiers using Conformal Prediction” paper is also very different from ours. The goal of their paper is to reduce the average prediction set size by adding a regularization term (Eq. (4) in their paper), whereas our goal is to compute structured prediction sets.

Compare with more baselines

We compare our method with a baseline strategy adapted from [1]. While the strategy proposed in [1] is specialized to the code domain, we generalize their framework to apply to arbitrary DAG structures. Our results (below) show that we significantly outperform this baseline, both in terms of prediction set size and computational cost. The main shortcoming of this approach is that it leverages existing PAC prediction set algorithms, which require that the monotonicity assumption holds. Thus, their algorithm restricts the structure of the prediction sets across different values of τ to enforce monotonicity. In contrast, our approach proves a novel conformal prediction bound (Theorem 3.2 in our paper) to avoid the need for monotonicity. We show results for the question-answering task here (prediction set sizes: https://ibb.co/rp2VWqs; computational cost: https://ibb.co/ss1B9wp). These plots show the prediction set sizes using the same annotations as Figure 4 of our paper. The baseline results are represented by dashed lines. For both prediction set size and computational cost, our approach outperforms the baseline for almost every parameter setting, and significantly so for m = 2 and m = 4. Results in other domains exhibit similar trends, and we will add them in an updated version of the paper (which we plan to submit by the end of the rebuttal period).

[1]: Adam Khakhar, Stephen Mell, and Osbert Bastani. PAC prediction sets for large language models of code. In International Conference on Machine Learning, pp. 16237–16249. PMLR, 2023.

Include more qualitative examples (e.g. Figure 1) with details, such as the error level used for standard conformal prediction.

We first thank the reviewer for pointing out the missing error-level information. The error level for both standard conformal prediction and our conformal structured prediction set is 0.05 (i.e., the desired coverage level is 0.95).

As suggested by the reviewer, we provided another qualitative example here https://ibb.co/VvRQFM2, comparing our method to the standard PAC prediction set algorithm. We set the error level to 0.05 and the confidence level to 0.99 (δ = 0.01), for both the standard PAC prediction set and our algorithm. As can be seen from the example, when constructing the prediction set using the standard PAC prediction set algorithm, the resulting prediction set is much larger than the one constructed by our algorithm. The large prediction set includes labels from very distant categories, making it much harder to interpret compared to having just two coarse-grained labels that summarize the labels. In our prediction set, the different dog breeds have been summarized as simply “dog”, and the different man-made artifacts have been summarized as “artifact”. The fact that the labels are quite coarse reflects the inherent uncertainty in the prediction for this image.

(continued on next comment)

Comment

(continued from the previous comment)

Provide examples similar to Table 1 for other datasets

We provide the table for the example discussed above (the wig picture) in Table 2 (https://ibb.co/GWYSQ3s). Similar to our findings in Table 1, with smaller m, our prediction sets contain fewer labels, making them more interpretable, while prediction sets with larger m contain more fine-grained labels. Also, with higher ε, our algorithm is allowed to make more errors, resulting in much smaller prediction sets.

Clarify motivation earlier in the paper

We are happy to update our paper to clarify our goals.

How should m be selected in practice?

In general, the choice of m depends on the needs of the given application domain. It governs the trade-off between the interpretability and granularity of the resulting prediction sets. Specifically, larger values of m allow more labels to be included in the set, often capturing finer-grained categories such as “attire” or “Blenheim spaniel”. Conversely, smaller values of m result in sets with fewer labels, which can be more interpretable for users. We suggest that practitioners try different values of m, and manually examine the resulting prediction sets to determine which choices offer the best tradeoff between interpretability and coverage.

Comment

Thank you for including discussion of relevant related work. I also appreciate the additional empirical analysis. The paper can certainly benefit from more examples and greater clarity in writing to better motivate the work (as I suggested earlier as well). However, given the responses to the raised concerns, I would like to update my score towards acceptance.

Comment

Thanks for taking the time to read our response and updated paper!

We did not have time during the response period to improve the clarity of our paper but we will work hard to do so for our next revision.

Review (Rating: 5)

The authors propose conformal structured prediction, which addresses the interpretability issue of conformal prediction when the output set is complex. Their algorithm also differs from existing algorithms in that, for conformal structured prediction, the search space over τ is no longer monotone. This approach can be applied to tasks where the output space can be represented as a directed acyclic graph and has a hierarchical structure. The authors provide formal coverage guarantees (marginal and PAC) and evaluate their approach on number prediction with MNIST digits, ImageNet classification, and temporal Q&A.

Strengths

  • It is an interesting problem, particularly how best to use the external structure of the labels to generate a better 'curve', i.e. recall at given output size.
  • The experimental setups were quite interesting, e.g. MNIST with number ranges.
  • The proposed method seems to extend well to DAG spaces (beyond trees), though I suppose it is still restricted to DAGs rather than general graphs, in order to sum the probabilities of the final leaf nodes.

Weaknesses

  • I would love to see a baseline where we don't use the structure at all and instead rely on regular P/R curve characteristics. Does the AUC of this model behave better? It is not clear to me as such.

  • Even if we do use the external structure and are forced to predict only internal nodes in the DAG (as opposed to arbitrary sets of leaf nodes), it would still be useful to understand whether the P/R curve looks significantly different with the proposed models. There are plenty of baselines where we can do prediction on internal nodes in addition to leaf nodes.

Questions

(see above)

Comment

Thank you for taking the time and effort to read and provide comments on our work. We provide answers to the questions below.

As with the broader conformal prediction literature, our goal is not to achieve better AUC, but to provide statistical guarantees on the coverage of our algorithm. In fact, the conformal prediction setting is closely related to the P/R curve: “coverage” is equivalent to recall, and “prediction set size” is closely related to precision (in particular, if we ignore miscoverages, then prediction set size = 1/precision; it is straightforward to modify this equation to account for miscoverages). The key distinction is the focus on obtaining provable finite-sample guarantees on the coverage (a.k.a. recall). Our main contribution is an algorithm for constructing a conformal predictor that satisfies this guarantee. While we propose an algorithm for solving this problem, we emphasize that our formulation is much more generally applicable than hierarchical label spaces; we target general structured prediction settings. Thus, we do not believe comparisons to existing approaches for hierarchical label prediction are useful for understanding the properties of our approach.
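The recall/precision correspondence described in this response can be made concrete with a small helper; the function below is our illustration, not code from the paper.

```python
def coverage_and_precision(pred_sets, labels):
    """Set-valued analogues of recall and precision: coverage is the
    fraction of examples whose true label lands in the prediction set;
    per-example precision is 1/|S| when covered and 0 otherwise, so
    ignoring miscoverage, prediction set size = 1 / precision."""
    cov = [y in s for s, y in zip(pred_sets, labels)]
    prec = [1.0 / len(s) if c else 0.0 for s, c in zip(pred_sets, cov)]
    n = len(labels)
    return sum(cov) / n, sum(prec) / n

# First example covered by a size-2 set, second example miscovered.
print(coverage_and_precision([{1, 2}, {3}], [1, 4]))
```

Varying the conformal error level ε traces out coverage/precision pairs, which is the sense in which a conformal predictor sweeps a P/R-like curve.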

Nevertheless, we appreciate the reviewer’s suggestion to include a comparison with a baseline to highlight our method’s contribution. To this end, we have compared to a baseline from the conformal prediction literature.

We compare our method with a baseline strategy adapted from [1]. While the strategy proposed in [1] is specialized to the code domain, we generalize their framework to apply to arbitrary DAG structures. Our results (below) show that we significantly outperform this baseline, both in terms of prediction set size and computational cost. The main shortcoming of this approach is that it leverages existing PAC prediction set algorithms, which require that the monotonicity assumption holds. Thus, their algorithm restricts the structure of the prediction sets across different values of τ to enforce monotonicity. In contrast, our approach proves a novel conformal prediction bound (Theorem 3.2 in our paper) to avoid the need for monotonicity. We show results for the question-answering task here (prediction set sizes: https://ibb.co/rp2VWqs; computational cost: https://ibb.co/ss1B9wp). These plots show the prediction set sizes using the same annotations as Figure 4 of our paper. The baseline results are represented by dashed lines. For both prediction set size and computational cost, our approach outperforms the baseline for almost every parameter setting, and significantly so for m = 2 and m = 4. Results in other domains exhibit similar trends, and we will add them in an updated version of the paper (which we plan to submit by the end of the rebuttal period).

Finally, to understand the relationship to the P/R curve, consider this scatter plot (https://ibb.co/ZBgcpqS), which shows the relationship between recall (i.e., coverage rate) and precision (i.e., average inverse prediction set size) for the question answering task, with m = 4 for both methods. As can be seen, our approach significantly outperforms the baseline.

[1]: Adam Khakhar, Stephen Mell, and Osbert Bastani. PAC prediction sets for large language models of code. In International Conference on Machine Learning, pp. 16237–16249. PMLR, 2023.

Comment

We would greatly appreciate your feedback on whether we have addressed all your comments before the discussion period ends today. If your concerns and questions have been resolved, we would appreciate it if you reconsider your score.

Thank you once again for your review.

Comment

Thank you once again for taking the time to review our work. We have carefully addressed the concerns you raised to the best of our ability, and your feedback has been invaluable in improving the quality and clarity of our submission.

As the discussion period is nearing its end, we would greatly appreciate confirmation of whether we have addressed your concerns satisfactorily. We hope to have the chance to discuss this further today before the discussion period closes. Have we addressed all your comments?

Comment

Dear Reviewers,

Thank you for your thoughtful and detailed feedback on our submission. We have carefully addressed your comments and provided clarifications. As the rebuttal deadline approaches, we kindly request you to review our responses to ensure that we have adequately addressed your concerns. If there are any remaining unclear points, we would be happy to provide further clarifications.

If you find that all your questions and concerns have been satisfactorily addressed, we kindly ask you to consider adjusting your scores accordingly.

Thank you for your time and consideration!

Comment

Dear Reviewers,

We have uploaded a revision of the paper, incorporating our responses to all the reviews from the previous round. The revised texts are written in blue. Below are the major updates:

  1. We clarified the definition of τ and the error level used in Figure 1 in the Introduction.
  2. We expanded the analysis in the Related work section and included the papers suggested by the reviewers to highlight how our work improves upon prior studies.
  3. We further clarified the generalizability of our DAG structure in representing the space of prediction sets in Section 4.
  4. We included a new experiment in the domain of emotion prediction and described the baseline strategy used for comparison in Section 5.1.
  5. In Section 5.2, we added:
    • A comparison of results with the baseline,
    • An analysis of the differences between marginal and PAC guarantees and how a practitioner should decide between the two,
    • A discussion of the sensitivity to the hyperparameter m, and
    • Guidance on how m should be selected in practice.
  6. We updated the quantitative plots by removing the case for ε = 0.4 to focus on the more pivotal value of ε = 0.15 and added baseline results to the original prediction set size plots. (The plots are the same as the ones we provided earlier in the responses.)
  7. In the Appendix, we added:
    • A discussion on computational cost and scalability,
    • Analyses of additional qualitative examples,
    • Multiple plots showcasing quantitative results from the new experiments, along with baseline comparisons, and
    • A helpful illustration demonstrating the relationship between our study and the P/R curve literature.

We hope this revision addresses your concerns. If you have any further questions or comments, we would be happy to discuss and address them before the discussion period ends. We look forward to any thoughts you may have!

AC Meta-Review

The paper tackles ways to efficiently apply conformal prediction to structured prediction. In the proposed framework, labels are organized in a Directed Acyclic Graph (DAG). Each node corresponds to a label or a set of labels representing either coarse-grained categories (e.g., "Animal") or fine-grained categories (e.g., "Dog", "Cat"), while edges capture hierarchical or inclusion relationships between labels. This ensures that prediction sets can balance granularity, interpretability, and coverage guarantees, making the approach suitable for structured prediction tasks.

The core technical contribution lies in formulating the selection of prediction sets as an integer programming problem aimed at minimizing the size of the prediction set. To ensure coverage, the approach sums the probabilities of leaf nodes covered by the selected nodes in the DAG. Marginal coverage guarantees are achieved by leveraging a previously established method that applies learn-then-test techniques to estimate thresholds for the scoring function.

Computational time is a core issue for such an approach and, in my opinion, deserves a clear description in the main paper (not in the appendix).

Additional Comments on Reviewer Discussion

The reviewers appreciated the novel framework for structured conformal prediction, highlighting its ability to construct interpretable prediction sets with formal marginal and PAC coverage guarantees. Strengths included the generalizability to structured output spaces (e.g., DAGs) and applications to tasks like MNIST, ImageNet, and question answering. Initial weaknesses identified included missing comparisons to related works (e.g., Conformal Risk Control, RAPS), limited experimental baselines, and unclear motivation in the introduction. The authors addressed these by adding comparisons, qualitative examples, and a new experiment on the GoEmotion dataset, while clarifying hyperparameter sensitivity and computational performance. While reviewers noted that writing clarity and real-world applicability could be further improved, most leaned toward acceptance, recognizing the framework’s theoretical contributions and practical value.

Final Decision

Accept (Poster)