PaperHub
7.3 / 10
Poster · 4 reviewers
Lowest 4 · Highest 5 · Std. dev. 0.5
Scores: 4, 4, 5, 5
Confidence: 3.8
Novelty: 3.5
Quality: 3.3
Clarity: 2.8
Significance: 3.3
NeurIPS 2025

Bandit Guided Submodular Curriculum for Adaptive Subset Selection

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29
TL;DR

We propose an adaptive subset selection framework for supervised learning, where submodular functions quantify sample difficulty to guide curriculum learning.

Abstract

Keywords
Curriculum Learning · Adaptive Subset Selection · Multi-Armed Bandit · Submodular Functions · No-regret Analysis

Reviews and Discussion

Review (Rating: 4)

This paper addresses curriculum learning by reinterpreting adaptive subset selection as a multi-armed bandit problem, where each arm represents a submodular function guiding sample selection. The authors propose OnlineSubmod, an online greedy policy that optimizes a utility-driven reward and achieves no-regret performance. Unlike traditional methods that rely on predefined difficulty measures, OnlineSubmod adaptively learns effective curricula using validation-driven reward metrics. Experiments show it outperforms standard curriculum learning and bi-level optimization approaches in terms of accuracy and efficiency across vision and language tasks.

Strengths and Weaknesses

In fact, I could hardly find any flaws or weaknesses in this paper. The presentation is thorough and well-structured, with ample theoretical analysis, visualization studies, and experiments covering a wide range of scales. I was unable to identify any significant shortcomings.

Questions

Actually, I have two main questions or points of curiosity.

First, some methods for coreset selection or data pruning could also be applied to perform a similar task. For instance, the three papers I listed below—could the authors discuss any existing gaps or differences in their applicability compared to the proposed method?

CCS: Coverage-centric Coreset Selection for High Pruning Rates. ICLR-2023

D2Pruning: D2 Pruning: Message Passing for Balancing Diversity and Difficulty in Data Pruning. ICLR-2024.

InfoMax: Data Pruning by Information Maximization. ICLR-2025.

Second, have the authors considered evaluating their approach on much larger-scale datasets, such as those used in pre-training tasks for large language models (LLMs)?

I would consider further increasing my score if my questions are addressed in the authors' responses.

Limitations

Not any.

Justification for Final Rating

Thank you for your reply, authors! My questions have been answered. I recommend that this article be accepted.

Formatting Issues

Not any.

Author Response

We sincerely thank the reviewer for their time and thoughtful evaluation of our work. We are especially grateful for the encouraging remarks regarding the originality, significance, and clarity of our approach. We look forward to addressing any further questions or suggestions from the reviewer.



Discussion on Additional Related Work

InfoMax: Data Pruning by Information Maximization. ICLR-2025

  • The InfoMax formulation is a direct restructuring of the Graph Cut function (as shown in their Section 3.3), which is monotone submodular, using a similarity kernel such as DINO embeddings (for image settings) or gradient scores.

  • This is aligned with our work, where we are already using Graph Cut explicitly as a bandit arm in our current framework.

  • The key difference is that InfoMax operates in a static setting, selecting samples over the entire dataset.

  • In contrast, our method is dynamic and modular—an InfoMax-like objective can be plugged in as a bandit arm and used in batch-level pruning during training.

  • This makes our method easier to deploy in real-world settings, where adaptive selection over batches is necessary for scalability.

  • Importantly, our framework is modular—methods like InfoMax can be seamlessly integrated and evaluated within OnlineSubmod, offering better scalability in large-scale training pipelines.


D2Pruning: D2 Pruning: Message Passing for Balancing Diversity and Difficulty in Data Pruning. ICLR-2024.

  • In this paper, the data subset selection problem is framed as a subgraph pruning task over the entire training dataset, represented as a graph whose similarity metric is based on embedding or feature similarity across data points.

  • The pruning mechanism relies on message-passing algorithms, and the selection optimization problem can be viewed as a graphical model. However, due to the heavy computational cost of message passing, this approach may require modifications for large-scale setups.

  • Similar to InfoMax, this method operates at a static level, not a dynamic batch-level setting.

  • As a direction for future work, it would be interesting to explore how batch-level message passing-based pruning can be adapted within a bandit arm framework, especially using diversity parameters as proposed in their paper.


CCS: Coverage-centric Coreset Selection for High Pruning Rates. ICLR-2023

  • Like other works, CCS considers a static setting to select examples that ensure both coverage and diversity.

  • A key contribution of CCS is its theoretical analysis of the pruning budget beyond which catastrophic accuracy drop can occur.

  • In our revised manuscript, we plan to explore whether similar catastrophic accuracy drop effects appear at the batch level, though this is less common than in static selection setups.

In summary, while all three related works contribute valuable insights into data pruning and coreset selection, they primarily operate in static, full-dataset regimes. In contrast, our proposed method addresses batch-level adaptive pruning in a scalable and modular manner using a bandit-driven curriculum in which the reward is guided by validation performance, filling a key gap in making such strategies practical for large-scale, dynamic training setups. Further, our method can easily be modified to incorporate other reward strategies, such as the forgetting score used in CCS or other metrics discussed in the related work above.
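To make the batch-level, bandit-driven selection loop described above more concrete, here is a minimal sketch in Python. It is an illustration under our own assumptions — the function names, the uniform mixing of exploration, and the multiplicative-weights update are not the paper's exact algorithm; only the overall structure (submodular arms, subset selection per batch, reward as the marginal drop in validation loss) follows the description above.

```python
import math
import random

def online_submod_sketch(arms, batches, budget, train_step, val_loss,
                         rounds=1000, eta=0.1):
    """Illustrative sketch of a bandit-guided, batch-level selection loop.

    arms       : list of callables; each maps (batch, budget) -> selected subset,
                 one per submodular objective (e.g. facility location, graph cut).
    train_step : callable(subset) -> None, performs one model update on the subset.
    val_loss   : callable() -> float, loss on a held-out validation mini-batch.
    The reward is the marginal drop in validation loss after the update; the
    multiplicative-weights rule below is an assumption, not the paper's exact rule.
    """
    weights = [1.0] * len(arms)
    for t in range(rounds):
        batch = batches[t % len(batches)]
        total = sum(weights)
        probs = [w / total for w in weights]
        arm = random.choices(range(len(arms)), weights=probs)[0]
        subset = arms[arm](batch, budget)          # greedy submodular selection
        before = val_loss()
        train_step(subset)                         # one gradient step on the subset
        reward = before - val_loss()               # validation-driven utility
        weights[arm] *= math.exp(eta * reward)     # favor arms that improve generalization
    return weights
```

The key point the sketch captures is that each submodular objective acts as an arm, and an arm is credited with the marginal validation improvement produced by training on the subset it selects.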


Second, have the authors considered evaluating their approach on much larger-scale datasets, such as those used in pre-training tasks for large language models (LLMs)?

  • We thank the reviewer for this insightful suggestion. As part of our initial experiments, we had evaluated OnlineSubmod on a Pythia-70M model using OpenWebText, which represents a significantly larger-scale dataset. We were able to complete one full setup in this regime, demonstrating that our method is applicable in such settings.

  • Given that GREATS is a strong baseline in our finetuning experiments, we had planned to conduct pretraining comparisons against GREATS as well. However, due to the lack of open-source code for GREATS pretraining configuration, we were unfortunately unable to include those results at this time.

  • Moreover, given limited compute availability apart from finetuning experiments, we were unable to secure sufficient resources to conduct large-scale pretraining beyond the initial Pythia-70M run. That said, we plan to extend our evaluation to GPT-2 (small) and (medium) scale pretraining setups prior to the camera-ready deadline.

We hope these clarifications and additions address the reviewer’s concerns. Our primary contribution lies in proposing a novel bandit-guided curriculum learning framework that incorporates submodular subset selection for adaptive data sampling. We are grateful for the reviewer’s positive feedback on the core ideas presented.

We will revise the manuscript to explicitly incorporate the related works mentioned, which will help better position our method in relation to prior efforts and clarify its distinct contributions. We would sincerely appreciate a reconsideration of the overall score, which would be instrumental in supporting the acceptance of our paper.

Comment

Dear Reviewer v9ha,

We sincerely thank you for submitting the acknowledgement and for championing our work. We hope our rebuttal has fully addressed your concerns. As the response period nears its end, please don’t hesitate to let us know if there’s anything further we can clarify.

Thank you once again for your time, support, and thoughtful feedback.

We look forward to your support towards the paper acceptance.

Sincerely, The Authors of #28216

Review (Rating: 4)

This paper tackles the curriculum learning problem by framing it as a multi-armed bandit problem, where each arm corresponds to a submodular function that defines a different difficulty-based sampling strategy. The authors leverage validation losses as the reward signal for a greedy policy that guides adaptive subset selection. They provide theoretical results demonstrating that the proposed subset selection strategy incurs almost no regret by the end of training. Experimental results on LLM fine-tuning, image classification, and ablation studies show that the proposed method significantly outperforms existing approaches.

Strengths and Weaknesses

Strengths:

  1. Framing the curriculum learning problem as a multi-armed bandit problem is a novel and interesting approach. This formulation is principled and general, allowing for a structured exploration-exploitation trade-off.

  2. The authors address computational cost issues by utilizing submodular relaxations and low-cost gradient proxies, such as last-layer gradients or LoRA adapters. This enables the proposed algorithm to be efficient and scale well.

  3. The theoretical results provide provable regret bounds that show the proposed method can guarantee no regret by the end of training.

  4. Comprehensive experimental results show that the proposed method consistently outperforms existing sample selection techniques and achieves better accuracy-efficiency trade-offs than prior work.

Weaknesses:

The proposed method heavily relies on a validation-driven reward, and the authors assume access to a validation mini-batch at each time step. This raises several concerns:

  1. How sensitive is the performance to the configuration of the validation dataset?
  2. In the early stages of training, what happens if the validation dataset contains difficult samples that the model has not yet learned? A more detailed analysis of the impact of the validation dataset would make the paper more convincing.

Some typos: L270, L411, L495: wrong reference "Appendix ??"; L296: comparision, datsets; L305: submdoular.

Questions

Regarding the concerns on the impact of the validation dataset, here are several questions for the authors.

  1. How sensitive is the performance to the configuration of the validation dataset?
  2. In the early stages of training, what happens if the validation dataset contains difficult samples that the model has not yet learned?

Limitations

Yes

Justification for Final Rating

Most of my concerns are well addressed by the rebuttal. So, I'll keep my positive rating.

Formatting Issues

None

Author Response

We sincerely thank the reviewer for their time and thoughtful evaluation of our work on Bandit-Guided Submodular Curriculum. We are glad that the reviewer found our framing of curriculum learning as a multi-armed bandit problem both novel and principled, and that the theoretical and empirical contributions were appreciated. We're also encouraged by the positive recognition of our approach’s scalability and efficiency, as well as its performance improvements over existing methods.

We now address the questions raised by the reviewer.

 The proposed method heavily relies on a validation-driven reward, and the authors assume access to a validation mini-batch at each time step. This raises several concerns:
  • First, validation-based signals are widely used in curriculum learning and subset selection to guide model development. This includes prior work such as GREATS[1] and TracIn[2], where the validation loss (or its proxy) serves as a reliable indicator of generalization. Our approach extends this principle by using the marginal drop in validation loss as a utility function for adaptive selection, and our theoretical guarantees hinge on this consistency.

  • Second, we agree that the quality and composition of the validation dataset are critical. In practice, we assume access to a relatively clean validation set — a standard assumption in most training pipelines. If the validation set contains outliers or disproportionately difficult examples, especially in early training stages, the reward signal may be noisy. To probe this, we conducted controlled experiments (see below) on CIFAR-100 where we varied the hardness of the validation samples (based on gradient norm) and observed consistent performance degradation when the validation set was biased toward difficult samples early in training.

  • Lastly, while our current instantiation uses validation loss, our framework is flexible. If a clean validation set is unavailable, alternative reward signals — such as teacher model outputs, self-training confidence, or auxiliary task feedback — can be substituted with minimal change to the bandit formulation; we leave this for future work.

  How sensitive is the performance to the configuration of the validation dataset?   In the early stages of training, what happens if the validation dataset contains difficult samples that the model has not yet learned? A more detailed analysis of the impact of the validation dataset would make the paper more convincing.

To better understand this issue, we conducted a controlled experiment on CIFAR-100 (300 epochs) where we varied the hardness of the validation dataset. Hardness was measured via gradient norm (higher norm ∼ harder example). We compared four validation subset configurations:

  • Easiest: Lowest gradient norms
  • EasyHard: Easy samples early, hard samples later
  • HardEasy: Hard samples early, easy samples later
  • Hardest: Highest gradient norms

Each configuration was evaluated at validation subset sizes of 10%, 20%, and 30%. Below are the final test accuracies:

Validation Subset   10%     20%     30%
Easiest             72.3    74.4    76.03
EasyHard            73.1    74.5    76.4
HardEasy            72.29   74.6    76.2
Hardest             71.31   74.3    75.9

Findings:

  • Hardest validation sets yield lower performance, particularly at smaller subset sizes, likely due to noisy or overly pessimistic reward signals early in training.
  • Mixed configurations like EasyHard and HardEasy perform best, suggesting robustness to how difficult samples are distributed over time.

These results indicate that exploring how difficult samples are ordered within the validation dataset—framed through a curriculum learning perspective—could be a promising direction for future work. Nonetheless, even validation sets containing difficult examples early in training do not lead to instability or collapse.
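For reference, below is a minimal sketch of the kind of gradient-norm hardness proxy used to build these validation splits. The helper `hardness_by_grad_norm` and the specific proxy (norm of the per-sample cross-entropy gradient with respect to the logits) are our illustrative assumptions, not necessarily the exact procedure from the paper.

```python
import torch
import torch.nn.functional as F

def hardness_by_grad_norm(model, loader, device="cpu"):
    """Score samples by a cheap gradient-norm proxy for difficulty.

    For cross-entropy with a linear head, the per-sample gradient of the loss
    w.r.t. the logits is softmax(z) - one_hot(y); its norm is used here as the
    hardness score (higher norm ~ harder sample). Illustrative assumption only.
    """
    model.eval()
    scores = []
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            logits = model(x)
            probs = F.softmax(logits, dim=1)
            one_hot = F.one_hot(y, num_classes=logits.size(1)).float()
            scores.append((probs - one_hot).norm(dim=1).cpu())
    return torch.cat(scores)

# Example use: take the lowest-scoring samples as the "Easiest" split, the
# highest-scoring as "Hardest", and time-ordered mixtures for EasyHard/HardEasy.
```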


We thank the reviewer again for their valuable suggestions and constructive feedback, which we will incorporate into the final version. We also appreciate their positive assessment of the paper and would be grateful for a corresponding increase in score to further support its acceptance.


[1] Wang et al., 2024. GREATS: Online Selection of High-Quality Data for LLM Training in Every Iteration.

[2] Pruthi et al., 2020. Estimating Training Data Influence by Tracing Gradient Descent. arXiv:2002.08484.

Comment

Dear Reviewer,

As the author response deadline approaches, we wanted to kindly check if there are any remaining concerns or clarifications we can provide. If our responses have addressed your concerns satisfactorily, we would be grateful if you might consider indicating your willingness to raise the score.

Thank you once again for your thoughtful review and time.

Comment

Dear Reviewer 3iJi,

Thank you for taking the time to review our rebuttal. As the author-reviewer discussion period is nearing its end, we would greatly appreciate it if you could provide any further comments at your convenience.

This will allow us to address any further questions or concerns you may have before the discussion period ends. If our responses satisfactorily address your concerns, please feel free to confirm.

Thank you very much for your time and effort!

Sincerely,

The Authors of Submission #28216

Review (Rating: 5)

The paper introduces an adaptive curriculum learning framework called ONLINESUBMOD, which casts subset selection as a multi-armed bandit problem, where each arm corresponds to a distinct submodular function guiding data sampling. ONLINESUBMOD uses a reward function defined based on the marginal utility of a training subset in improving validation loss. The authors theoretically show that the proposed greedy policy achieves no-regret performance under certain assumptions. Empirical evaluations in the vision and language domains demonstrate that ONLINESUBMOD outperforms baselines in terms of accuracy and efficiency trade-offs.

Strengths and Weaknesses

Strengths:

  • Quality:
    • ONLINESUBMOD is a technically sound and well-motivated method with no-regret guarantees.
    • Empirical results show consistently superior performance over strong baselines.
  • Clarity:
    • The authors describe the problem in a well-structured manner. The algorithm and theoretical results are well-explained in the main document and supported by details in the appendix.
  • Significance:
    • The method addresses adaptive data selection for efficient training, a practical problem, using a scalable and general approach.
  • Originality:
    • The combination of submodular curriculum design via a multi-armed bandit and the reward formulation using validation-based utility seem novel.

Weaknesses:

  • Quality:
    • The assumption of stationary reward distributions does not seem realistic.
  • Clarity:
    • A summary or diagram of utility and influence functions could be beneficial for the reader.
  • Significance:
    • I'm unsure how complex it is to design a set of submodular functions as arms for a particular application.
  • Originality:
    • Although the proposed method seems novel, the discussion on the related work can be extended for non-expert readers to get a deeper understanding of the novelty of the work.

Questions

  • Can you clarify the sensitivity of the method to the choice and number of submodular arms?
  • Can you explain the role of exploration hyperparameters and how to tune them in practice?

Limitations

Yes.

Justification for Final Rating

The authors' response has clarified my understanding of their work and its contribution.

Formatting Issues

.

Author Response

We thank the reviewer for the thoughtful and constructive feedback. We are glad that the reviewer found our method to be technically sound, well-motivated, and empirically effective. We appreciate the reviewer's recognition of the method's clarity, practicality, and originality, and we are happy to address the concerns and questions raised:


On Clarity of Utility and Influence Functions:

Thank you for the suggestion. We will include a summary figure illustrating how utility and influence functions interact in the reward formulation, especially highlighting their roles in guiding the greedy sampling decisions. Along with this, we will add a summary of how such influence functions are utilized in the existing literature in our final camera-ready version. We agree this will significantly improve readability for non-expert audiences.

On the Complexity of Designing Submodular Arms:

In practice, we draw from a small, domain-agnostic library of well-established submodular functions—such as facility location and graph cut—that have demonstrated effectiveness across a range of tasks. These functions are not only interpretable but also align with familiar sampling heuristics like representativeness and diversity. For example, representative functions (e.g., facility location, graph cut) aim to select samples that best capture the structure of the data by covering clusters, while diversity-based functions explicitly encourage selection of dissimilar items to avoid redundancy. From a computational standpoint, we construct a similarity matrix over item gradients within each batch, which remains tractable due to the modest batch sizes typically used in training. Moreover, once the base modular function is defined, a wide range of submodular objectives can be constructed by applying concave transformations, enabling fine-grained control over the trade-off between representativeness and diversity.
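To illustrate how such arms can be instantiated from a batch-level similarity matrix, here is a simplified sketch. The objective forms, the naive greedy maximizer, and the random gradient proxies are assumptions for illustration; the paper's exact instantiations may differ, and applying concave transformations to the coverage term would yield further arms.

```python
import numpy as np

def facility_location(S, subset):
    """Representativeness: every item is covered by its most similar selected item."""
    if not subset:
        return 0.0
    return float(S[:, subset].max(axis=1).sum())

def graph_cut(S, subset, lam=1.0):
    """Graph cut: similarity to the whole batch minus redundancy within the subset."""
    if not subset:
        return 0.0
    idx = np.asarray(subset)
    return float(S[:, idx].sum() - lam * S[np.ix_(idx, idx)].sum())

def greedy_select(objective, S, budget):
    """Naive greedy maximization of a submodular objective over one mini-batch."""
    selected = []
    for _ in range(budget):
        best = max(
            (j for j in range(S.shape[0]) if j not in selected),
            key=lambda j: objective(S, selected + [j]) - objective(S, selected),
        )
        selected.append(best)
    return selected

# Similarity kernel over per-item gradient proxies within one mini-batch.
rng = np.random.default_rng(0)
G = rng.normal(size=(32, 16))            # 32 items, 16-dim gradient proxies
S = np.maximum(G @ G.T, 0.0)             # nonnegative similarity matrix
print(greedy_select(facility_location, S, budget=8))   # one representative arm
print(greedy_select(graph_cut, S, budget=8))            # another arm
```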

On Related Work Discussion

We appreciate the recommendation to expand our discussion of related work. While we have included additional related-work discussion in the Supplementary, we will revise Section 2 to more clearly position our method in relation to prior curriculum learning methods that rely on fixed or hand-crafted difficulty metrics, whether model-dependent or model-agnostic.


Response to Questions raised by Reviewer

 Can you clarify the sensitivity of the method to the choice and number of submodular arms?
  • If the set of submodular functions acting as arms is assumed to come from multiple (more than two) classes (in our case, we considered two broad classes: representative and diversity-based submodular functions), and if some classes are highly skewed (i.e., some classes of submodular functions have more instantiations than others), then the submodular function arms from the more frequent classes would be statistically favoured over the long-term horizon by the exploration branch of the algorithm.
  • In our use case we consider a well-balanced setting so that both classes have equal representation. In the skewed case, in line 4 of the algorithm's exploration branch, instead of uniform weighting across all arms we would sample according to the inverse frequency of the submodular function arms to accommodate the skewness.

 Can you explain the role of exploration hyperparameters and how to tune them in practice?
  • As discussed in lines 222 to 233 (Section 3), the exploration hyperparameters govern when to prioritize exploration versus exploitation, as is typical in the multi-armed bandit literature.
  • More specifically, λ(t) (exploration dampening) and π(t) (exploration sharpness) function as curriculum schedulers, as they decide the ordering of the submodular function arms by tuning the exploitation/exploration tradeoff dynamically.
  • If a uniformly sampled value ζ satisfies ζ > Ξ_t, the algorithm switches to exploitation mode and selects the arm that maximizes ϑ(a ∣ B_t). Otherwise, an arm is sampled uniformly at random.
  • Exploration focuses on trying different arms to discover potentially better or more diverse options, even if a best arm is already known. Exploitation favors reusing the previously selected arm that has performed well so far.

In practice, we follow different variations of λ(t), as indicated in our ablations in Section 5.3 (Figs. 4 and 6): exponentially increasing, exponentially decreasing, or constant through time. The effect of these variations is discussed more broadly in Section 5.3. To tune the exploration parameters, we performed grid-search experiments over a discrete search space of exponentially increasing, decreasing, and constant schedules.
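A compact sketch of the switching rule described above is given below. The concrete form of the threshold Ξ_t (modeled here as an exponentially decaying λ(t)) and of the per-arm score ϑ(a ∣ B_t) are illustrative assumptions; only the ζ > Ξ_t switch between exploitation and uniform exploration follows the description above.

```python
import math
import random

def choose_arm(scores, t, lam0=1.0, decay=0.01):
    """Sketch of the exploration/exploitation switch.

    scores : dict mapping each submodular arm to its current score ϑ(a | B_t).
    Ξ_t is modeled here as an exponentially decaying λ(t); the actual schedules
    (exponentially increasing, decreasing, or constant) are grid-searched, so
    this particular form is only an illustrative assumption.
    """
    xi_t = lam0 * math.exp(-decay * t)          # exploration threshold Ξ_t via λ(t)
    zeta = random.random()                      # uniform draw ζ in [0, 1)
    if zeta > xi_t:
        return max(scores, key=scores.get)      # exploitation: best-scoring arm
    return random.choice(list(scores))          # exploration: uniform over arms
```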


Thank you again for the helpful suggestions and valuable feedback, which we will incorporate in the final version. We are glad to hear your positive inclination toward accepting the paper and hope that our responses have adequately addressed your concerns.

We kindly request your support in reflecting this through an appropriate score increase to help strengthen the case for acceptance.

Comment

Dear Reviewer Bq1H,

As the response window draws to a close, we wanted to reach out to see if there are any outstanding issues or points of clarification we can address before the deadline. If you feel that our replies have resolved your concerns, we would be sincerely thankful if you could consider raising the current score in your overall evaluation.

We truly appreciate your time and the valuable feedback you’ve provided.

Best regards, The Authors of #28216

Comment

Thank you for addressing my questions and comments. Now I have a clearer understanding of your work and its contribution, hence I decided to increase my scores.

Comment

Dear Reviewer Bq1H,

Thank you for taking the time to review our work and for your thoughtful comments. We’re glad our responses helped clarify our approach and contributions. We sincerely appreciate your decision to raise your scores and your support toward our paper.

Sincerely,

The Authors

Review (Rating: 5)

The paper addresses the well-known problem of curriculum learning by reformulating it as a multi-armed bandit problem. Specifically, the authors point out that a reliable criterion for difficulty is still not firmly established in previous works, which not only compute difficulty in separate modules but are also not adaptive to the evolving training dynamics. The solution is to adopt a multi-armed bandit formulation with greedy utility-based rewards, with each arm of the bandit being a submodular function guiding sampling. The authors demonstrate their work on both language and vision tasks and show superior performance.

Strengths and Weaknesses

Strengths

  1. The paper adapts the theoretically well-grounded multi-armed bandit framework to the practical curriculum learning problem. The integration appears successful, as shown in the experimental results.
  2. The proposed method is demonstrated thoroughly in both large-scale vision and language tasks, revealing its practical value.
  3. The proposed method adds minimal computational overhead and therefore is practical.

Weaknesses

Motivation

  1. In Figure 1, the difference between the differently scheduled samplers that the authors try to convey seems negligible. The end points at 300 epochs are hard to distinguish from each other.
  2. The choice of submodular functions in Table 1 would be easier to understand if the authors elaborated on why they chose them. Without that explanation, the choice of submodular functions looks arbitrary, despite the paper's good results.

Experiments

  1. Figure 2 shows the first 1700 steps of training, whereas the GREATS paper conducted experiments over a longer period, e.g., 3k–10k iterations. Is there a reason the authors trained for only about half of the iterations?

Visualization & Presentation

These are minor concerns.

  1. I would like to mention that the paper is a bit hard to read due to the excessive number of highlights throughout the article. Please consider revising the manuscript in a more planar tone and style.
  2. It would be nicer to have consistent styles throughout the figures. The styles in Figures 1, 2, 3, and 4 are inconsistent and confusing. Furthermore, the dot size in Figure 4 is too large, making it hard to interpret the values.
  3. Table styles can also be improved. Table 1 has italic caption whereas others do not have.
  4. It is also hard to discern the individual lines in Figure 3.

Overall, I regard this paper as having an acceptable level of quality, with both theoretical and practical contributions. However, I have a few unresolved issues. I will raise my score if all are resolved.

Questions

  1. How do different types of submodular functions affect training quality individually? That is, do the dynamics change dramatically if only a single arm is selected over time? I ask because I see only a small variance between the two options shown in Figure 1.
  2. Although Figure 5 seems to show that the arm selection distribution changes over time, I am curious about purely random selection of sampling policies compared to the multi-armed bandit. Would this make a difference?
  3. Lastly, I believe that training on CIFAR (from scratch) and fine-tuning LLMs can have different dynamics and therefore require different types of submodular functions. It would be interesting to see whether there is a difference by plotting the distribution of arm selection for LLM fine-tuning.

Please note that these questions are minor concerns and are not counted in my overall scoring.

Limitations

Yes.

Justification for Final Rating

Most of my initial concerns are resolved by the rebuttal, and the authors have promised significant revisions regarding presentation. I will raise my score.

Formatting Issues

None.

Author Response

We thank the reviewer for their time in reviewing our submission and for providing encouraging feedback on our draft. We are pleased to know that the reviewer found our work promising both from a theoretical perspective and in terms of real-world applicability. We address each of the questions below.


Clarification on Fig 1 sampling policy accuracy gap: In Figure 1b, our primary goal is to illustrate that the blue curve—which corresponds to using representative submodular functions in the initial epochs—achieves higher accuracy earlier in training compared to the red curve, where diversity-based functions are used first. While the final accuracy at epoch 300 may appear similar, the distinction lies in the training dynamics. Notably, between epochs 200–250, the performance gap is more pronounced, indicating that scheduling representative functions early and shifting to diversity-based sampling later leads to more efficient learning. This supports our hypothesis that a predefined curriculum induced by submodular structure benefits optimization more when scheduled in a particular order. There are many possible orderings of representative and diversity-based functions; Figure 1 illustrates just one, highlighting how the choice of order affects accuracy and convergence.


Motivation behind the choice of submodular functions: We thank the reviewer for this feedback. We do elaborate on the rationale behind our choice of submodular functions in Appendix Section F (page 34), where we present their mathematical formulations and structural properties. This section details how each function promotes either diversity or representativeness with respect to the similarity kernel. To strengthen the motivation, we plan to add additional synthetic experiments that illustrate how diversity-focused submodular functions select varied samples, while representative functions favor prototypical examples—empirically validating the behavior of each function according to its intended category.



Addressing minor concerns over visualization/presentation: Thank you for the detailed feedback on the presentation quality. We appreciate the suggestions and will incorporate them in the final version.

  • We acknowledge that the current version may contain overly emphasized highlights. In the camera-ready version, we will adopt a more neutral and planar tone, improving overall readability and flow.
  • We agree that figure consistency enhances clarity. We will standardize the styling across all figures—including color palettes, legend placement, and line widths—for better visual coherence. We will also reduce the dot size in Figure 4 to improve interpretability.
  • Thank you for pointing out the inconsistency in table captions. We will fix it to non-italics for uniformity across all table captions.
  • For Figure 3, we will improve line distinguishability through clearer labeling and contrast-adjusted color choices to make trends easier to follow.


Response to Questions raised by Reviewer

Effect of submodular functions individually, and simple random selection over arms

Thank you for the thoughtful question. To assess the importance of the multi-armed bandit setup, we conducted additional experiments on CIFAR-100 (10% subset, 300 epochs) under two ablations: (a) using a single fixed arm throughout training (i.e., no bandit), and (b) randomly selecting an arm at each round (i.e., no explore-exploit strategy). Results are shown below:

Selection Strategy                  Accuracy (%)
DisparitySum (Div.) (Static)        68.6
FacilityLocation (Rep.) (Static)    72.0
LogDeterminant (Div.) (Static)      71.1
GraphCut (Rep.) (Static)            72.6
Random arm per round                72.0
OnlineSubmod                        73.6

These results show that while some individual arms (e.g., GraphCut, FacilityLocation) perform reasonably well, others (e.g., DisparitySum) are suboptimal, making fixed-arm strategies brittle. Random arm selection does moderately well but falls short of the full method; moreover, neither random nor static arm selection uses improvement in validation performance as a reward metric. This confirms that adaptive explore-exploit selection via the bandit improves robustness and performance over naive or static alternatives.

Difference in arm distribution between image and language training setups

This is an interesting question. While it is true that the underlying mechanisms are different, the final distribution over the chosen arms depends more on the underlying training distribution. From our experiments, we observe that Log-Determinant (Div.) is almost never active in the LLM experiments, while Graph-Cut (Rep.) is more dominant. We tested on the MMLU (Sociology) subset. This indicates that, at least for this subset, representativeness seems more important than diversity for test perplexity. We will add an ablation on this in our final revised draft.


Thank you again for the helpful suggestions and important feedback, which we will incorporate in the final version, and for your reassurance given your positive inclination towards accepting the paper.

We would appreciate an appropriate increase in the score to support the paper's acceptance.

Comment

Dear Reviewer,

As the author response period is nearing its deadline, we wanted to check if there are any remaining concerns or questions from your end that we could help clarify within the timeframe. If our responses have addressed your concerns satisfactorily, we would greatly appreciate it if you could consider indicating your willingness to raise the score.

Thank you again for your time and thoughtful feedback.

Comment

Thank you for providing the comment.

Regarding Answer 1:

  1. I still believe that Fig 1 does not reveal the benefits of the presented method very clearly. Perhaps enlarging the figure and removing the unnecessary gradient coloring in the background would help. It is a matter of style, so I won't consider this in my score, but I still do not understand why the authors chose to color the background of the figures, which greatly blurs the message of the plots. Moreover, the differences between the two lines in Figure 1 are hidden inside the dots, so they appear insignificant.
  2. Since neither plot in Figure 1 shows a plateau, I believe we cannot say anything about training efficiency here. Even though 300 epochs of training has been a typical choice for CIFAR-100, learning efficiency should be discussed based on the time and performance at which the model reaches the final stage, i.e., the plateau, or the point where the model starts to overfit. Even though the model can learn fast at the beginning, it does not necessarily reach the final stage faster or better. I still think that Table 1 is significant, though, so I won't decrease the score.

Regarding Answer 2: I have read through the appendix, and with Answer 4 of your rebuttal, I believe the motivation for each submodular function is sufficiently provided. However, I do believe that including simple experimental results validating the motivation of each submodular function would make the work more compelling.

Regarding Answers 3 and 4: Thank you for the response. The related questions are mostly resolved.

Comment

Dear Reviewer N27T,

Thank you very much for your detailed follow-up comments. We sincerely appreciate your constructive feedback and the time you’ve taken to carefully review our work.

Regarding Figure 1, we acknowledge your concern about the background coloring and the visibility of differences between the lines. In the camera-ready version, we will enlarge the plots, remove the gradient background, and use clearer line styles, along with a magnified subplot of the near-plateau region to better highlight differences at later stages. We also understand your point about the plateau and training efficiency; we will clarify that Figure 1 illustrates initial learning speed, i.e., how quickly the model reaches higher accuracy.

Regarding the motivations for each submodular function (Answer 2), we are glad our rebuttal explanations were clear. We agree that simple validation experiments illustrating each submodular function's motivation would strengthen the paper, along with our rebuttal experiment results above. We will add them in the final version.

We are grateful that you found Table 1 and the other clarifications satisfactory, and we will ensure the final version reflects your suggestions for clarity and completeness.

Thank you again for your valuable input and for helping us further improve our work.

Best regards,

The Authors of Submission #28216

Comment

We sincerely thank all reviewers for their detailed and constructive feedback, which has significantly improved our paper.

In summary, our work introduces a principled, theoretically grounded curriculum learning method governed by submodular functions and formulated as a bandit problem. Through extensive experiments on both large-scale language and vision tasks, we demonstrate its practical utility alongside its theoretical rigor.

This has been unanimously recognized by all reviewers for its originality, clarity, significance, and quality, as reflected in their comments:

Reviewer N27T : The proposed method is demonstrated thoroughly in both large-scale vision and language tasks, revealing its practical value. The proposed method adds minimal computational overhead and therefore is practical.

Reviewer Bq1H : The method addresses adaptive data selection for efficient training, a practical problem, using a scalable and general approach. The combination of submodular curriculum design via a multi-armed bandit and the reward formulation using validation-based utility seem novel.

Reviewer 3iJi : Framing the curriculum learning problem as a multi-armed bandit problem is a novel and interesting approach. This formulation is principled and general, allowing for a structured exploration–exploitation trade-off. Comprehensive experimental results show that the proposed method consistently outperforms existing sample selection techniques and achieves better accuracy–efficiency trade-offs than prior work.

Reviewer v9ha : In fact, I could hardly find any flaws or weaknesses in this paper. The presentation is thorough and well-structured, with ample theoretical analysis, visualization studies, and experiments covering a wide range of scales. I was unable to identify any significant shortcomings.


Addressing Reviewer Feedback Summary

Reviewer N27T – We appreciate the reviewer noting that most concerns are resolved. We will improve Figure 1 by enlarging the plots, removing gradient backgrounds, refining line clarity, and adding a magnified near-plateau view. The reviewer found Table 1 significant and the motivations for submodular functions sufficient; we have already added simple validation experiments in our rebuttal demonstrating each function's intended behavior to further support these motivations, which we will integrate into our final draft.

Reviewer 3iJi – On reliance and sensitivity to validation-driven rewards, we clarified that such signals are standard in prior work and our framework supports alternative rewards if a clean validation set is unavailable. Controlled CIFAR-100 experiments varying validation set hardness showed that while hardest-only sets slightly reduce performance at small sizes, mixed easy–hard configurations perform best without causing instability.

Reviewer Bq1H – We clarified the roles of utility and influence functions and will add a summary figure. We explained that our method uses a small, interpretable, domain-agnostic library of submodular arms with balanced class representation, and described handling skew via inverse-frequency weighting. We also detailed how exploration hyperparameters act as curriculum schedulers, tuned via grid search, with effects demonstrated in our ablations.

Reviewer v9ha – We will integrate detailed positioning against CCS (ICLR'23), D2Pruning (ICLR'24), and InfoMax (ICLR'25). While these operate in static full-dataset regimes, our approach is dynamic, modular, and scalable, making it capable of incorporating such objectives as bandit arms for batch-level adaptive selection. The reviewer's strong assessment of the paper's novelty, clarity, and completeness is deeply appreciated.

In summary, we sincerely thank all reviewers for their thoughtful feedback and constructive suggestions, which have strengthened the clarity, positioning, and impact of our work, and we are especially grateful to the AC for their support and careful oversight throughout the review process.

Final Decision

This paper frames curriculum learning as a multi-armed bandit problem, where each arm represents a submodular function guiding sample selection, and the reward is defined by the marginal utility of the selected samples on a validation set. The authors propose OnlineSubmod, an online greedy policy that optimizes a utility-driven reward and achieves no-regret performance under certain assumptions. Reviewers are enthusiastic about the paper across the board, appreciating the MAB framing and finding the proposed method technically sound with strong empirical confirmation. One potential weakness is the reliance on validation-set performance, but the authors argue it is common practice in the curriculum learning literature, and reviewers seem fine with it.