PaperHub
Average rating: 5.5 / 10 · Poster · 4 reviewers
Individual ratings: 2, 4, 3, 3 (min 2, max 4, std 0.7)
ICML 2025

MAPLE: Many-Shot Adaptive Pseudo-Labeling for In-Context Learning

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-07-24

Abstract

Keywords

In-context learning · Many-shot In-context learning · Large language models

Reviews and Discussion

Review (Rating: 2)

The paper proposes an effective framework for enhancing many-shot ICL performance in scenarios with limited labeled data. By selecting unlabeled samples for pseudo-labeling based on their influence on labeled data, and by adaptively selecting demonstrations tailored to each query, the method significantly reduces the reliance on costly labeled data while improving in-context learning performance.

Questions for Authors

No.

Claims and Evidence

Yes.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

I skimmed through it, but I didn’t carefully examine the proof.

Experimental Design and Analysis

Yes. I checked the experimental designs including baselines, implementations and ablations.

Supplementary Material

Yes, including Appendix B/C/D.

Relation to Prior Literature

The paper proposes selecting the most valuable demonstrations for many-shot pseudo-label learning.

Missing Important References

No.

Other Strengths and Weaknesses

Strengths:

  1. The paper is well-written and well-structured, making it easy to follow.

  2. The ablation study is comprehensive, providing useful insights into the contributions of different components.

Weaknesses:

  1. Practicality Concerns: The proposed method is overly complex and impractical for real-world applications. The framework requires constructing two influence graphs and selecting different demonstrations for each test query, resulting in significant computational costs. However, the performance improvements do not appear substantial enough to justify the high cost of this approach.

  2. Unstable Gains from Pseudo-Labeling: As shown in Figure 3, increasing the number of pseudo-labeled samples does not consistently lead to performance improvements across most tasks. This introduces an additional challenge of tuning the pseudo-labeling budget. In practical scenarios, there is no reliable way to verify the correctness of pseudo-labels, which may lead to worse performance than a simple few-shot approach.

  3. Limited Experimental Scope: The experiments are somewhat narrow in scope. The authors only evaluate the framework using the Gemini model and on relatively simple tasks. To strengthen the empirical validation, the paper should include experiments with additional models, such as LLaMA and Qwen. Moreover, the selected tasks should be more representative and general, such as those from the SuperGLUE and MATH benchmarks.

  4. Lack of Discussion on Theoretical Upper Bound: The paper does not discuss the upper-bound performance of the proposed method. For instance, if many-shot data were fully annotated with golden labels, how would the proposed approach compare against a retrieval-augmented generation (RAG) baseline? A discussion on this aspect would provide a clearer perspective on the fundamental limitations of the method.

  5. Comparison to More Direct Labeling Approaches: With the cost of large model inference decreasing, a straightforward alternative would be to use a SoTA model (like GPT-4) to label all data directly. The paper does not clearly articulate the advantages of the proposed method over this simpler and more practical alternative. A thorough comparison is necessary to justify the additional complexity introduced by the framework.

Other Comments or Suggestions

No.

Author Response

W1. Practicality Concerns.

Response: Thank you for raising concerns regarding computational complexity. To address practicality, our framework incorporates strategies to improve efficiency significantly:

By employing a KV cache, we reduce computational costs by fixing labeled and pseudo-labeled demonstrations across queries, allowing caching of demonstrations within the LLM prior to inference. As demonstrated in Table 2, this approach effectively enhances efficiency without loss of accuracy, making MAPLE more practical for real-world scenarios. A detailed analysis is provided in Appendix D.

Additionally, the influence graph construction process is precomputed. In this way, we ensure that its computational cost does not scale with the number of queries, further enhancing efficiency and practical feasibility.
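To make the caching point concrete, here is a minimal sketch (ours, not MAPLE's actual code; all names are illustrative) of the fixed-prefix prompt layout that makes KV caching applicable:

```python
# Illustrative sketch: with a fixed demonstration block shared across queries,
# an LLM serving stack that supports prefix caching only needs to re-encode
# the short per-query suffix.

def build_prompt(fixed_demos: list[str], query: str) -> str:
    # The demonstration block is byte-identical for every query,
    # so its key/value states can be computed once and reused.
    prefix = "\n\n".join(fixed_demos)
    return f"{prefix}\n\nQ: {query}\nA:"

demos = [f"Q: example {i}\nA: label {i}" for i in range(100)]   # many-shot block
prompts = [build_prompt(demos, q) for q in ("query one", "query two")]
# Both prompts share the same long prefix; only the short suffix differs.
```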

W2. Unstable Gains from Pseudo-Labeling.

Response: We would like to clarify that in our experiments, MAPLE consistently outperforms the other baselines. MAPLE is primarily compared quantitatively against 5 baseline methods on 8 datasets under 5 settings (i.e., numbers of pseudo-labeled samples). The results are shown in Figure 3. Among the 8 datasets, MAPLE performs the best on 5 datasets in all settings; on the other 3 datasets, it performs the best in 4 out of 5 settings. These results validate MAPLE's strong performance.

W3. Limited Experimental Scope.

Response: We appreciate your suggestion regarding the experiments. Our current experiments include both Gemini 1.5 Pro and Gemini 1.5 Flash, which are widely adopted and representative models for many-shot ICL [1,2]. Due to the limited rebuttal time, we were unable to include additional models such as LLaMA and Qwen, but we plan to explore them in future work. To broaden task diversity, we have added results on the math benchmark GSM8K (table below), which demonstrate the superiority of MAPLE on math tasks. We will expand to more datasets in the next version.

[1] Agarwal et al. Many-Shot In-Context Learning. NeurIPS 2024.

[2] Baek et al. Revisiting In-Context Learning with Long Context Language Models. arXiv 2024.

GSM8K accuracy (columns: # pseudo-labeled samples):

Method   20     60     100
Random   90.0   91.0   91.5
RAG      90.5   92.0   93.0
MAPLE    92.5   94.0   95.0

W4. Lack of Discussion on Theoretical Upper Bound.

Response: We appreciate your point. While deriving a theoretical upper bound for many-shot ICL is impractical due to the complexity of LLMs like Gemini 1.5 Flash, we provide an empirical upper bound by comparing MAPLE to RAG using 40, 80, and 120 fully labeled examples. MAPLE consistently outperforms RAG even under full annotation, highlighting its effectiveness and robustness beyond limited-label settings.

Method         GPQA             Banking77
RAG+Golden     36.8/37.8/40.4   78.0/81.7/83.3
MAPLE+Golden   38.3/42.4/44.9   79.3/81.7/86.2

(Accuracy with 40/80/120 fully labeled examples.)

W5. Comparison to More Direct Labeling Approaches.

Response: Thank you for raising this important point. We agree that directly using a state-of-the-art model like GPT-4 (or Gemini) to label all data is a natural and increasingly viable alternative. In fact, our zero-shot baseline involves using Gemini to label the entire dataset directly. As shown in Figure 3, MAPLE significantly outperforms this zero-shot approach across all tasks.

Moreover, MAPLE only requires labeling a very small portion of the data. For instance, XSum contains over 2 million training examples, yet MAPLE achieves strong performance with at most 100 pseudo-labeled samples—representing a reduction of over 99.99% in labeling cost, even with decreasing inference costs.
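For concreteness, the quoted saving is just the ratio of pseudo-labeled to total training examples: $100 / 2{,}000{,}000 = 0.005\%$, i.e., a $99.995\%$ ($> 99.99\%$) reduction in labeling cost.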

To further highlight MAPLE’s advantage, we include results on GPQA using Gemini 1.5 Pro. In Table 1, MAPLE achieves 44.9% accuracy, while—as reported in Figure 8 in [1]—even fully using all labeled data yields less than 44% accuracy. This clearly demonstrates the effectiveness and efficiency of our method.

[1] Agarwal et al. Many-Shot In-Context Learning. NeurIPS 2024.

Reviewer Comment

I appreciate the authors’ detailed rebuttal and the additional experiments, especially the inclusion of new results on GSM8K and comparisons under fully labeled settings.

While the authors make a reasonable case for improved efficiency through caching and precomputing influence graphs, my primary concern around the practicality remains, i.e., the proposed method still involves multiple parts—pseudo-labeling, influence estimation, adaptive demonstration selection. In many realistic settings where simplicity and interpretability are crucial, this level of complexity may be difficult to justify, especially given the relatively modest performance gains in some scenarios.

In light of the new evidence, I will raise my score to weak reject, acknowledging the merits of the empirical updates and clearer discussion.

Author Comment

We thank you for your further comments and appreciate the opportunity to respond. While we understand your concerns regarding practicality and perceived complexity, we would like to emphasize that the performance improvements are meaningful, and the modular design does not render the method impractical. Our clarifications are as follows:

  1. Performance improvements are meaningful. To address the concern about simplicity and interpretability, we directly compare MAPLE with the zero-shot labeling baseline, which you've described in Weakness 5 as “more practical and straightforward” (i.e., using a SoTA model to label all data directly). We report average performance across different task types and find that MAPLE significantly outperforms this practical baseline, particularly on classification tasks, where it achieves a relative improvement of over 50%. These substantial gains underscore that, despite involving multiple components, MAPLE delivers performance improvements that justify the added complexity.
Method      Summarization   Reasoning   Classification   Question Answering
Zero-shot   16.1            49.5        42.5             34.3
MAPLE       20.1            53.7        66.9             37.7
  2. The modular design does not render the method impractical.
  • First, we would like to emphasize that each component in MAPLE contributes meaningfully to the final performance. We provide detailed evidence through breakdowns and ablation studies, as referenced in our responses to Reviewer K6CP (W2) and Reviewer qNQK (Q3).
  • Second, the computational cost of MAPLE remains manageable.
    • (1) Pseudo-labeling: we enhance efficiency by selectively pseudo-labeling only the top-P nodes with the highest influence scores, instead of pseudo-labeling all train data, significantly reducing API calls.
    • (2) Influence estimation: The graph construction requires computing the relevance score $r$ between every pair of nodes, which is $\mathcal{O}(|\mathcal{V}|^2)$. To compute shortest paths, we run breadth-first search from each labeled node; each search costs $\mathcal{O}(|\mathcal{V}|+|\mathcal{E}|) = \mathcal{O}(|\mathcal{V}|)$ since $|\mathcal{E}|=\mathcal{O}(k|\mathcal{V}|)$. Therefore, the whole shortest-path computation costs $\mathcal{O}(|\mathcal{D}_L||\mathcal{V}|) = \mathcal{O}(|\mathcal{V}|)$ as $|\mathcal{D}_L|\ll |\mathcal{D}|=|\mathcal{V}|$. Notably, this cost is incurred only once before inference and does not scale with the number of test-time queries; with more queries at test time, the computational cost of the graph becomes increasingly negligible.
    • (3) Adaptive demonstration selection: We emphasize that adaptive demonstration selection is an optional component that offers a trade-off between efficiency and performance, as discussed in Sec. 4.5. In MAPLE, we incorporate personalized demonstrations for each query, which incurs additional cost but effectively filters out unhelpful examples, leading to better performance. As shown in Figure 3, this adaptive strategy also improves performance in RAG settings. To accommodate efficiency-focused scenarios, we also provide a variant of MAPLE with fixed demonstration selection and KV caching (Sec. 4.5). This variant enables faster inference with only a mild sacrifice in performance, offering a flexible solution based on deployment needs. The complexity comparison of MAPLE and the KV cache variant is provided in Appendix D.

In summary, each component of MAPLE is either lightweight or designed to offer a meaningful trade-off between performance and efficiency. We also provide practical variants to accommodate different deployment scenarios. Therefore, we sincerely hope that our detailed responses can help clarify the practical aspects of our framework and address your concerns. Thank you so much for your effort in reviewing our work!

Sincerely,
Authors of Submission 14586

Review (Rating: 4)

This paper presents MAPLE, a method for pseudo-labeling in many-shot ICL settings. Key innovations include similarity-based selection of samples for pseudo-labeling and of demonstration examples.

Questions for Authors

  1. Do you have any explanation why MAPLE works well for GPQA, given the diversity of topics in GPQA?

Claims and Evidence

  1. It’s interesting to study many-shot ICL under pseudo-label settings, which has practical value.
  2. The authors claim “strong performance” for MAPLE, but do not specify the baselines or quantitative results for comparison.

Methods and Evaluation Criteria

  1. The baselines are reasonable, and the datasets are up-to-date and commonly used.
  2. It seems the embedding module is important for MAPLE, and an ablation on that is important.

Theoretical Claims

NA

Experimental Design and Analysis

See methods

Supplementary Material

Yes

Relation to Prior Literature

It's an interesting extension towards both many-shot ICL and pseudo-labeling.

Missing Important References

NA

Other Strengths and Weaknesses

NA

Other Comments or Suggestions

  1. IMPORTANT: Most figures are not rendered correctly. Labels for legends and axes are missing.

Author Response

Claim: The authors claim “strong performance” for MAPLE, but do not specify the baselines or quantitative results for comparison.

Response: In our experiments, MAPLE is primarily compared quantitatively against 5 baseline methods on 8 datasets under 5 settings (i.e., numbers of pseudo-labeled samples). The results are shown in Figure 3. Among the 8 datasets, MAPLE performs the best on 5 datasets in all settings; on the other 3 datasets, it performs the best in 4 out of 5 settings. These results validate MAPLE's strong performance.

Exp: It seems the embedding module is important for MAPLE, and an ablation on that is important.

Response: Thank you for the suggestion. We have conducted ablations using Sentence-BERT (SBert) [1] and DeBERTa [2] as alternative embedding models, evaluating MAPLE with 20, 60, and 100 pseudo-labeled examples. While performance varies across models, MAPLE consistently outperforms the baselines, demonstrating its robustness and effectiveness regardless of the specific embedding choice.

Embed           Date             GoEmotion
RAG+SBert       51.4/52.4/54.4   31.3/32.7/33.3
MAPLE+SBert     52.7/54.0/55.2   34.7/36.7/37.3
RAG+DeBERTa     52.0/53.6/55.2   32.7/33.7/34.4
MAPLE+DeBERTa   54.4/55.2/57.6   37.3/37.2/39.3

(Accuracy with 20/60/100 pseudo-labeled examples.)

[1] Reimers and Gurevych. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019.

[2] He et al. DeBERTa: Decoding-enhanced BERT with Disentangled Attention. ICLR 2021.

Comm 1: IMPORTANT: Most figures are not rendered correctly. Labels for legends and axes are missing.

Response: We sincerely appreciate your feedback on figure clarity. In the revision, we will ensure all figures are correctly formatted and rendered.

Q1. Do you have any explanation why MAPLE works well for GPQA, given the diversity of topics in GPQA?

Response: Thank you for pointing this out. MAPLE's strong performance on GPQA can be attributed to its adaptive demonstration selection, which tailors pseudo-labeled demonstrations specifically for each test query. This adaptability allows MAPLE to effectively handle the topic diversity in GPQA by selecting demonstrations that are contextually relevant to each individual query. Consequently, MAPLE can leverage pseudo-labeled samples to improve performance even when the topics are diverse.
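As a rough illustration of this per-query adaptivity (our sketch, with plain embedding similarity standing in for MAPLE's graph-based influence on the query node; all names are hypothetical):

```python
# Hypothetical sketch of adaptive demonstration selection: rank the
# (pseudo-)labeled candidates by relevance to the test query and keep the
# top-m. MAPLE ranks by influence on the query node in its graph; embedding
# similarity is used here only as a stand-in.
import numpy as np

def select_demos(query_emb: np.ndarray, cand_embs: np.ndarray,
                 candidates: list[str], m: int) -> list[str]:
    scores = cand_embs @ query_emb           # one relevance score per candidate
    top = np.argsort(scores)[-m:][::-1]      # indices of the m best, best first
    return [candidates[i] for i in top]

# e.g., pick 4 demonstrations for one query out of 10 candidates
rng = np.random.default_rng(0)
demos = select_demos(rng.normal(size=8), rng.normal(size=(10, 8)),
                     [f"demo {i}" for i in range(10)], m=4)
```

Because this selection runs once per query, it is the component whose cost scales with the number of test queries, which is exactly the trade-off the fixed-demonstration KV-cache variant avoids.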

Reviewer Comment

Thanks for the response. You can consider putting the ablation results in the Appendix. I don't have any other comments, and I'll keep my current rating.

Author Comment

Dear Reviewer y2DN,

Thank you for your thoughtful review and constructive feedback. We appreciate your time and effort, and to strengthen the quality of our work, we will include the ablation results in the Appendix in the final version.

Best regards,

Authors

Review (Rating: 3)

The paper considers a semi-supervised many-shot in-context learning setting, i.e., having small labeled and large unlabeled support sets to perform in-context learning with long-context LLMs. The authors argue that within this problem setting it is beneficial to (i) identify the most impactful unlabeled samples to pseudo-label, and, subsequently, (ii) use an adaptive example-selection mechanism to choose examples for each test query from the union of the pseudo-labeled and labeled examples. To achieve both, the authors leverage the concept of node influence in graphs. In particular, a graph is first built with nodes representing examples from both the labeled and unlabeled sets, and edges assigned according to the similarity of the examples in some embedding space. Subsequently, the Top-P nodes from the set of unlabeled examples are selected according to a score that lower-bounds the node influence. Adaptive demonstration selection is built in a similar fashion, relying on the node influence of the labeled and pseudo-labeled samples on the test query. The authors evaluate their approach on a diverse set of problems and show that it outperforms the considered baselines.

Questions for Authors

  1. Do you really need to cut Top-K edges for each node? Since you will still be selecting only the Top-P nodes in the end, and it is a one-time cost, couldn't you just run Top-P selection over the fully connected graph? In other words, is it correct that Top-K only controls speed and does not affect the performance, or is there some interplay between Top-K and Top-P that affects the performance?

  2. Given that Figure 6 shows it is mostly more beneficial to put pseudo-labeled examples at the back and examples with ground-truth labels closer to the query, could you also run the baseline where you just include the questions of unlabeled samples (without pseudo-labels)? This is basically the Unsupervised ICL of [1], which was shown to improve upon the few-shot baseline that employs only the labeled set.

  3. What I am currently missing is an ablation showing which part of the proposed approach is actually the most important, or evidence that both parts are important.

  4. There are some inconsistencies in set of baselines for different tables and Figures. For example, Table 3 misses zero-shot, few-shot and RAG-Adapt (compared to Figure 1). Similarly, Figure 4 misses RAG-Adapt.

  5. Maybe I missed it somewhere, but what was used as the embedding model $f$ to construct graphs?

  6. Can we compute node influence directly rather than relying on the lower bound? Does the lower bound have some benefits? How would the performance differ if the node influence were computed directly?

[1] Agarwal et al. Many-Shot In-Context Learning. NeurIPS 2024.

Claims and Evidence

See Questions for Authors.

Methods and Evaluation Criteria

The proposed method is evaluated using challenging datasets and with the recent models.

Theoretical Claims

I checked Theorem 3.2 and briefly checked the proof of Lemma A.1.

Experimental Design and Analysis

Overall, seems valid.

Supplementary Material

I checked the proof of Theorem 3.2 and briefly checked the proof of Lemma A.1. The rest consists of the dataset details and the prompts used, along with KV Cache analysis section which was briefly checked as well.

Relation to Prior Literature

Many-shot in-context learning is a very recent and promising approach to performing adaptation of long-context LLMs. Given the high cost of data labeling, it is important to consider the semi-supervised setting; thus I believe the paper studies an important topic.

Missing Important References

N/A

Other Strengths and Weaknesses

See Questions for Authors.

Other Comments or Suggestions

N/A

Author Response

Q1. Do you really need to cut Top-K edges for each node? Is it correct that Top-K only controls speed and does not affect the performance, or there is some interplay between both Top-K and Top-P that affect the performance?

Response: Thank you for the question. We want to clarify that Top-K pruning is essential and does influence selection. Our influence score in Eq. (8) depends on the shortest path and its count. In a fully connected graph, all node pairs are directly connected with only one shortest path of length 1, which removes meaningful structural differences and reduces influence estimation to near-random. Moreover, computing shortest paths on a fully connected graph incurs $\mathcal{O}(|\mathcal{V}|^2)$ complexity. Thus, Top-K is important for both efficiency and effective selection.
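To illustrate the pruning step, here is a minimal sketch under our own assumptions (cosine similarity over precomputed, row-normalized embeddings; not the paper's implementation):

```python
# Minimal sketch of Top-K graph construction: keep only each node's K most
# similar neighbors. With K << |V|, shortest-path lengths and counts carry
# structural information; in the fully connected case every pair of nodes
# sits at distance 1 with a single shortest path.
import numpy as np

def build_topk_graph(emb: np.ndarray, k: int) -> dict[int, list[int]]:
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-norm rows
    sim = emb @ emb.T                    # pairwise relevance scores
    np.fill_diagonal(sim, -np.inf)       # exclude self-loops
    return {i: np.argsort(sim[i])[-k:].tolist() for i in range(len(emb))}
```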

Q2. Could you run the baseline that includes only the questions of unlabeled samples (without pseudo-labels), i.e., the Unsupervised ICL of [1]?

Response: Thank you for the suggestion. We have added results using only unlabeled questions without pseudo-labels, following the Unsupervised ICL setup (#labeled=20, #unlabeled=100). Consistent with Figure 6, we observe that placing labeled examples closer to the query still leads to better performance. However, without pseudo-labels, the benefit from unlabeled examples is limited (compared to Figure 6), highlighting the importance of label information—even if approximated—for effective many-shot ICL.

Method              Salient     GPQA
Rand (w/ / w.o. R)  65.6/66.0   33.3/34.3
RAG (w/ / w.o. R)   68.8/68.2   33.8/34.8
MAPLE (w/ / w.o. R) 70.4/70.8   35.3/36.1

Q3. What part of the proposed approach is actually the most important?

Response: We appreciate the reviewer’s question. The RAG-Adapt baseline in our paper can be seen as a variant of MAPLE without the graph structure, as both rely on Contriever for relevance score. Without pseudo-labeling, as in our response to Q2, performance drops due to the lack of label information. Further, in our response to Reviewer K6CP W2, we detail how removing individual components of the influence score degrades performance. Together, these results demonstrate that each part of MAPLE is crucial to its overall effectiveness.

Q4. Inconsistencies in the set of baselines across different tables and figures.

Response: Thank you for pointing this out. We believe the reference is to Table 1 (rather than Table 3) since Table 3 is a list of prompts. Due to rebuttal time constraints and API budget limits, we have now additionally run Gemini 1.5 Flash for zero-shot, few-shot, and RAG-Adapt on Table 1, as well as RAG-Adapt for Figure 4. We will include these updated results in the revised version of the paper.

Dataset     0-shot   few-shot   20     40     60     80     100
Banking77   75.1     76.9       77.0   76.9   76.5   76.7   78.5
Date        49.1     52.9       53.6   53.5   53.1   55.4   56.1
GPQA        34.3     35.8       37.1   36.2   35.7   32.5   34.0

RAG-Adapt for Figure 4:

Dataset     50     100    150    200
Banking77   79.3   81.3   81.3   81.7
Date        56.4   59.2   59.4   63.4

Q5. What was used as the embedding model to construct graphs?

Response: Thank you for the question. We use Contriever as the embedding model. This is mentioned in the right column of line 115: “...the relevance score $r(v_i, v_j)$, as defined in Contriever...” For clarity, we will also explicitly restate this in the Implementation section in the revised version.

Q6. Can we compute node influence directly and not rely on the lower-bound? Does lower bound have some benefits? How the performance will be different if the node influence would be computed directly?

Response: Directly computing the node influence is computationally impractical. As stated in Eq. (17) in Appendix A, the node influence between any pair of nodes $v_i$ and $v_j$ is calculated through an iterative expansion over the neighboring nodes of $v_i$. This involves all nodes on any path between $v_i$ and $v_j$, and their corresponding derivatives with respect to the embedding of $v_j$. This can be computationally prohibitive when the number of such nodes is massive due to the long distance between $v_i$ and $v_j$.

Therefore, we propose to rely on the lower bound to compute the influence score instead of directly computing the node influence. With our proposed Theorem 3.2, the influence score is computed based on the shortest path distance and the number of shortest paths. Thus, it is much easier to compute, compared to the massive computation of derivatives in the original node influence.
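As a sketch of why the lower bound is cheap to evaluate (our illustration: a single BFS per source node yields both the shortest-path distance and the number of distinct shortest paths; the exact combination in Eq. (8) is not reproduced here, so the last function uses a hypothetical decay-weighted form):

```python
# One BFS from a source gives, for every reachable node, the shortest-path
# distance and the number of distinct shortest paths to it.
from collections import deque

def bfs_dist_and_count(adj: dict[int, list[int]], src: int):
    dist, count = {src: 0}, {src: 1}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:                # first visit: next BFS layer
                dist[v], count[v] = dist[u] + 1, count[u]
                queue.append(v)
            elif dist[v] == dist[u] + 1:     # another shortest path into v
                count[v] += count[u]
    return dist, count

def influence_lower_bound(adj, src, dst, decay=0.5):
    # Hypothetical combination: shorter distance and more shortest paths both
    # raise the score. The paper's Eq. (8) may combine them differently.
    dist, count = bfs_dist_and_count(adj, src)
    return 0.0 if dst not in dist else count[dst] * decay ** dist[dst]
```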

Review (Rating: 3)

This work develops a semi-supervised in-context learning framework by exploiting a small amount of labeled data and a large unlabelled dataset. A kNN graph is built over the labeled and unlabelled data. The unlabelled samples (nodes) that are similar to the labeled ones are selected for pseudo-labelling. Finally, demonstrations that are highly relevant to the test query are selected from the combined dataset for prediction.

Questions for Authors

  • Please further explain why increasing the number of pseudo labels may harm certain datasets.

Claims and Evidence

The claims made in this paper are mainly supported with empirical results.

Methods and Evaluation Criteria

The methodology and benchmarking datasets are mostly appropriate.

Theoretical Claims

Not thoroughly checked.

Experimental Design and Analysis

The experiment designs are mostly appropriate. However, evaluations with higher numbers of pseudo-labeled samples are missing, which could demonstrate the potential upper bound of the proposed method.

Supplementary Material

No supplementary material was submitted.

Relation to Prior Literature

Semi-supervised learning has been thoroughly investigated. The proposed method exploits a graph, and the influence score shares similarities with graph-based SSL and label propagation. However, integrating SSL into ICL is novel.

Missing Important References

N/A

Other Strengths and Weaknesses

Strengths:

  1. Exploiting a graph for relevance calculation can capture the data manifold and potentially leads to better results.

  2. Semi-supervised in-context learning can alleviate the reliance on excessive human annotation.

Weakness:

  1. Building a graph for selecting relevant unlabelled samples itself induces additional computation overhead. There is no discussion of the computational cost of graph construction.

  2. The design of the influence score seems arbitrary. It would be good to see whether keeping only the shortest-path length or only the number of shortest paths is worse.

  3. It is worth noting that the impact of increasing the number of pseudo-labeled samples is inconsistent. For GoEmotion, Banking77, and Date, the performance may still go up with more pseudo-labels, while Tracking7 does not seem to benefit from pseudo-labeled samples at all. A deeper analysis is necessary.

Other Comments or Suggestions

It is unclear why the x-axis starts at 20 in Figure 3. With 0 pseudo-labeled samples, would this be equivalent to few-shot?

Author Response

W1. Building a graph for selecting relevant unlabelled samples itself induces additional computation overhead. There is no discussion of the computational cost of graph construction.

Response: Thank you for bringing up this point. The graph construction requires computing the relevance score $r$ between every pair of nodes, which is $\mathcal{O}(|\mathcal{V}|^2)$. To compute shortest paths, we run breadth-first search from each labeled node, and each search costs $\mathcal{O}(|\mathcal{V}|+|\mathcal{E}|) = \mathcal{O}(|\mathcal{V}|)$ since $|\mathcal{E}|=\mathcal{O}(k|\mathcal{V}|)$. Therefore, the whole shortest-path computation costs $\mathcal{O}(|\mathcal{D}_L||\mathcal{V}|)$. Notably, the above cost is incurred only once before inference and is thus independent of the number of queries; with more queries at test time, the graph's computational cost becomes increasingly negligible. Moreover, as shown in Table 2 in our paper, the adaptive demonstration selection component does not incur much computational cost.

Thank you so much for your suggestion. We will include this discussion in the appendix.

W2. The design of the influence score seems arbitrary. It would be good to see whether keeping only the shortest-path length or only the number of shortest paths is worse.

Response: We appreciate your suggestion and have added results (#labeled=20, #pseudo-labeled=100) using only the shortest-path length and only the number of shortest paths. While the shortest-path length captures how quickly information can travel, it overlooks robustness: relying on a single path can be fragile to noise or minor data variations. On the other hand, using only the number of shortest paths captures redundancy but disregards distance; many long paths may not imply strong influence. Our influence score is designed to capture both efficiency (via short paths) and robustness (via multiple paths), resulting in a more reliable and informative demonstration selection for many-shot ICL.

Method               Banking77   GoEmotion   GPQA
len(shortest path)   75.3        37.6        36.4
#(shortest paths)    78.6        37.2        36.9
Influence score      80.8        38.1        37.4

W3 & Q1. It is worth noting that the impact of increasing the number of pseudo-labeled samples is inconsistent. For GoEmotion, Banking77, and Date, the performance may still go up with more pseudo-labels, while Tracking7 does not seem to benefit from pseudo-labeled samples; a deeper analysis is necessary. Please further explain why increasing the number of pseudo-labels may harm certain datasets.

Response: Thank you for highlighting this point. The inclusion of additional pseudo-labeled samples can sometimes harm performance because pseudo-labels generated by LLMs are not always accurate. Incorrect pseudo-labels may introduce misleading information when used as demonstrations, negatively influencing LLM predictions for specific datasets [1]. However, our method addresses this issue by adaptively selecting pseudo-labeled samples based on their relevance to each test query. This approach mitigates the negative impact of inaccurate pseudo-labeling, as demonstrated by consistent improvements across most tasks. We note that the performance decline is limited to certain datasets, likely due to the higher difficulty or inherent ambiguity of the samples. Nevertheless, MAPLE consistently outperforms baselines even though the number of pseudo-labeled samples is not optimal.

[1] Agarwal et al. Many-Shot In-Context Learning. NeurIPS 2024.

Comm 1: It is unclear why the x-axis starts at 20 in Figure 3. With 0 pseudo-labeled samples, would this be equivalent to few-shot?

Response: Thank you for the observation. Yes, when the number of pseudo-labeled samples is 0, Random, RAG, and MAPLE all reduce to the few-shot setting. To avoid redundancy, we omit x=0 from the x-axis and instead include the few-shot performance as a green horizontal dashed line in Figure 3 for comparison. We will clarify this explicitly in the revised version.

Final Decision

This paper proposes a semi-supervised many-shot in-context learning approach: using small labeled and large unlabeled support sets to perform in-context learning with long-context LLMs.

Strengths:

  • Very important line of work and has potential to be used across various academic and industrial applications.

  • Solid experiments and results.

I enjoyed reading this work, and the authors provide a detailed rebuttal addressing most of the concerns raised by reviewers. Specifically, the major concerns about the computation overhead of graph construction and the ablation of different components of MAPLE are addressed in the rebuttal. Overall, I recommend acceptance.