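Average rating: 5.3/10

Decision: Rejected (4 reviewers)

Lowest rating: 5, highest: 6, standard deviation: 0.4

Average confidence: 4.0

Soundness: 3.0, Contribution: 2.0, Presentation: 2.8

ICLR 2025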

TapWeight: Reweighting Pretraining Objectives for Task-Adaptive Pretraining

Ruiyi Zhang,Sai Ashish Somayajula,Pengtao Xie

Submitted: 2024-09-24, updated: 2025-02-05

Keywords

task-adaptive pretraining, continued pretraining, multi-level optimization, hyper-parameter optimization

Reviews and Discussion

Review

Rating: 6, Confidence: 4, Date: 2024-11-01

The paper introduces a new task-adaptive pretraining (TAP) framework, called TapWeight, designed to improve downstream model performance by automatically optimizing the importance of each pretraining objective. Unlike traditional TAP methods that rely on manually set tradeoff parameters between multiple pretraining objectives, TapWeight dynamically learns these parameters by solving a multi-level optimization (MLO) problem. The authors report empirical results from the domain of molecular property prediction and natural language understanding, demonstrating its performance across 13 molecular and 8 language datasets compared to baseline TAP methods.

Strengths

The proposed framework automates the weighting of pretraining objectives by optimizing the importance of each objective in a dynamic, data-driven manner.
The authors showcase experiments across two domains: molecular property prediction and NLU.
The authors provide an open-source implementation of their approach, adding transparency and reproducibility.

Weaknesses

The tasks chosen for evaluation are primarily classification. No analysis is shown for current state-of-the-art LLMs and VLMs or for their corresponding benchmarks. For example, math problem-solving datasets such as Math QA and MetaMath could constitute the pretraining tasks, with evaluation on a task such as GSM8K.
The number of baselines used for comparison is limited. For example, the paper "Don’t Stop Pretraining? Make Prompt-based Fine-tuning Powerful Learner" is a strong baseline candidate for the proposed method, as it builds further upon the TAP method.
Increased computational complexity limits the scalability of the method to larger models.
The paper does not discuss or provide an empirical analysis of how quickly the model converges compared to fixed-weight pretraining methods.

Questions

Could you provide more details on the convergence rate of TapWeight, especially compared to traditional pretraining methods with fixed weights? How does the computational overhead impact practical training times on large-scale datasets?
I am curious to know whether there are any patterns observed in the reweighting behavior across tasks, such as prioritizing certain objectives over others based on task type. Were there some pre-training tasks that were given negligible weights?
If possible, can you provide an analysis using larger models, and generative tasks, instead of classification? Additionally, a framework would be considered stable if an ablation could be shown across model families and varying scales.
To test the robustness of the approach, a random task unrelated to the downstream task could be added to the set of pretraining tasks, to see whether TapWeight assigns it a very low weight.
A suggestion: please keep each table on the page where its description appears; this improves the readability of the paper.

Details of Ethics Concerns

There are no ethical concerns as per my understanding.

Comment - Author Response 2

2024-11-22

5. Patterns for Pretraining Objectives Given Downstream Tasks

We observe that similar downstream tasks tend to assign similar weights to specific pretraining objectives. For instance, the Esol and Freesolv datasets, which both involve predicting physical chemistry properties of molecules, assign large weights to the MG3 pretraining objective. Conversely, the Toxcast and Clintox datasets, which focus on predicting molecular toxicity, assign small weights to MG3. We have updated Section 4.4 of the manuscript to include these observations. As the interactions between pretraining objectives and downstream task types are inherently complex, we plan to further investigate this direction in future work.

6. Convergence of TapWeight Compared to General Continued Pretraining

All experiments in this paper, including those for TapWeight and fixed-weight continued pretraining (CP), have reached empirical convergence. Empirically, TapWeight requires slightly more time to converge compared to CP with fixed weights. For example, using the MUV dataset as a downstream task, TapWeight takes approximately 1.94 times longer to converge than CP. We plan to conduct a theoretical analysis of TapWeight's convergence rate in future work.

7. Table Locations in the Paper

We thank the reviewer for the helpful suggestion. In response, we have reorganized the tables in the revised manuscript as per the reviewers’ comments to improve readability and presentation clarity.

[1] Don’t Stop Pretraining? Make Prompt-based Fine-tuning Powerful Learner, Zhengxiang Shi, Aldo Lipani, NeurIPS 2023.

Comment - Official Comment by Reviewer ca73

2024-11-26

Thank you for the responses! Most of my questions were clarified, and some were left to be explored in future work. I am increasing my score accordingly.

Comment - Appreciation for Constructive Feedback

2024-11-27

Thank you once again for your thorough and thoughtful review of our manuscript. We deeply appreciate your acknowledgment of the strengths of our work, and your insights have been crucial in improving it. We are sincerely grateful for your valuable time and expertise.

Comment - Author Response 1

2024-11-22

We appreciate your constructive feedback very much, and provide our response to your questions as follows.

1. Experiments on Larger Models

We thank the reviewer for this insightful suggestion. In response, we conducted additional experiments by applying TapWeight to a larger language model, RoBERTa-large, and compared it with baseline methods on three additional datasets: RCT, AGNews, and IMDB. As shown in Table 4 of the revised manuscript, TapWeight consistently outperforms the baseline methods across all three datasets, demonstrating its robustness and scalability across different sizes of pretrained models.

2. Comparison with Additional Baselines

We thank the reviewer for the helpful comments. In response, we conducted additional experiments to compare our method with the more advanced baseline, PCP [1]. As shown in Table 3 of the revised manuscript, experimental results on the GLUE benchmark demonstrate that TapWeight outperforms the PCP baseline, highlighting its advantage over more recent and powerful continued pretraining methods.

3. Experiments on Additional Tasks

We thank the reviewer for this valuable suggestion. In the revised manuscript, we evaluated our method on three commonly used datasets for continued pretraining: RCT, AGNews, and IMDB. These datasets cover a variety of NLP tasks beyond the NLU tasks in the GLUE benchmark: RCT involves classifying sentences in biomedical texts based on their functional roles, AGNews focuses on topic classification of news articles, and IMDB addresses sentiment analysis of movie reviews. As shown in Table 4 of the revised manuscript, TapWeight consistently outperforms both the baseline without continued pretraining and the baseline continued pretraining methods, TAPT and SimCSE, across all three datasets. These additional experiments strengthen the evidence of TapWeight's effectiveness across diverse NLP tasks outside of natural language understanding in GLUE.

We would also like to clarify why our method was not applied to text generation tasks in this paper. Most state-of-the-art language models for text generation, such as the GPT and LLaMA series, predominantly rely on a single auto-regressive loss for next-token prediction during pretraining. As this objective is the standard choice for such models, we did not prioritize applying TapWeight to reweight multiple objectives in this context. In contrast, masked language models and foundational models in specialized domains such as biology and medicine often employ multiple pretraining objectives, creating a clear need for methods like TapWeight to automate the tradeoff parameter selection. Nonetheless, we recognize the potential of applying TapWeight to text generation models and plan to explore this direction in future work.

4. Experiments with a Random Task as Objective

We thank the reviewers for the insightful suggestion. Based on this, we conducted an experiment where an additional pretraining task was introduced: predicting labels randomly sampled from a Gaussian distribution. Since these labels have no dependency on the real input data, this task does not contribute meaningfully to the continued pretraining process and is expected to receive zero weight. We added this random task alongside the original five pretraining objectives for molecular property prediction in TapWeight. Experimental results demonstrate that the weight for this random task rapidly decreases to zero within 100 training iterations. This finding highlights the ability of TapWeight to identify and effectively minimize the influence of useless pretraining objectives.

Review

Rating: 5, Confidence: 4, Date: 2024-11-03

This paper introduces a method called TapWeight to automatically tune the loss weights of multiple pretraining objectives during task-adaptive continued pretraining. In task-adaptive continued pretraining, a new mixture of pretraining objectives is introduced, targeted to improve a specific suite of downstream tasks. Downstream task performance is measured by applying SFT to the new pretraining checkpoint and evaluating on the validation or test split. The main problem the paper aims to solve is how to balance the pretraining objectives in the continued pretraining mixture in order to maximize downstream task suite performance.

TapWeight tackles this problem by formally framing it as a three-tiered multi-level optimization (MLO) problem: Tier 1 is continued pretraining given some loss weights, Tier 2 is SFT, and Tier 3 optimizes the pretraining loss weights to maximize validation performance. Note that this paper only studies encoder-only models and SFT (until convergence). The SFT phase is slightly altered in the training setup to induce a dependence on the other tiers (using the continued-pretraining weights as regularization instead of initialization), and the MLO problem is solved with the Betty library (with some tricks to approximate the inverse Hessian, as noted in the Appendix).
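The three tiers described above can be sketched as a nested optimization problem. The notation is an assumption made for illustration and is not taken from the paper: $\lambda$ denotes the loss weights, $\theta$ the continued-pretraining parameters, $\phi$ the fine-tuning parameters, $\gamma$ a regularization strength, and the squared-distance penalty is only one plausible form of the regularization mentioned above.

```latex
\begin{aligned}
\text{Tier 1 (CP):} \quad
  & \theta^{*}(\lambda) = \arg\min_{\theta} \sum_{i=1}^{K} \lambda_i \, \mathcal{L}^{\mathrm{pt}}_{i}(\theta) \\
\text{Tier 2 (SFT):} \quad
  & \phi^{*}(\lambda) = \arg\min_{\phi} \; \mathcal{L}^{\mathrm{ft}}_{\mathrm{train}}(\phi)
    + \gamma \, \lVert \phi - \theta^{*}(\lambda) \rVert_2^2 \\
\text{Tier 3 (weights):} \quad
  & \min_{\lambda} \; \mathcal{L}^{\mathrm{ft}}_{\mathrm{val}}\bigl(\phi^{*}(\lambda)\bigr)
\end{aligned}
```

In this reading, the penalty term is what ties the Tier 2 solution to the Tier 1 solution, so the validation loss in Tier 3 can carry a gradient back to $\lambda$.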

The approach is evaluated on a popular, standard molecular property prediction model, Imagemol, as well as on NLP tasks (GLUE) via RoBERTa. In both domains, the method outperforms the compared baselines. The authors also provide ablation results for each component of their method.

Strengths

Finding effective ways of optimizing the choice of pretraining objective to optimize downstream performance is an important problem that would be of interest to the greater community. The authors provide a novel connection to MLO in this work, and the formal solving of this problem is valuable and welcome given the amount of empirical hillclimbing typically done in this space.
The paper shows convincing results that the formal optimization of the whole MLO objective works from a quality perspective and yields results that improve over the baselines considered. Given the complexity of solving such an MLO problem, these results are promising and this technique may be of interest to the community.
The presentation of the work is nice. It is well written, structured, easy to follow, and free of grammatical mistakes. There are some aspects of the method that could be clearer, but overall this does not detract from the presentation of the paper.
Evaluation of the method on both molecular biology and language demonstrates the generality of the method well, especially as many (5) pretraining objectives are considered for the molecular image model.
Reproducibility is excellent. The authors have already open sourced the code and provide all information needed to reproduce their experiments.

Weaknesses

1.) Soundness: Most of the experiments in this work compare against baselines that do not apply task-adaptive continued pretraining. It is only in Table 4 that the authors show this ablation result for the molecular results (but not the NLP results). In Table 3, it is unclear whether the method is gaining because of continued pretraining in general or because of TapWeight. TAPT represents a continued pretraining comparison, but from a different data distribution.

This work would benefit from having more baselines that represent basic ways of hillclimbing the pretraining mixture. Some examples:

All pretraining objectives weighted evenly in the mixture.
Leave-one-out or one-at-a-time ablations for the pretraining mixture.
Even weighting of only the objectives that maximize certain tasks in the eval suite + any other grid-search like approach.
This dataset ablation can take place over smaller models (to use less compute), before being applied to the final model size.

In Table 5, TapWeight is 1.5x the cost of normal training, so it is possible that TapWeight may be a way to save compute over traditional hillclimbing, but explicit compute and quality comparisons against more baselines would greatly strengthen the paper.

2.) Contribution: The contribution of this work is solid for the targeted application: continued pretraining (CP) of encoder-only models to maximize SFT tasks. However, there is also an implicit assumption that the continued pretraining computation budget is small. In Table 5, the training budget of CP is only a little more expensive than that of SFT. It is unclear how true this assumption is for most modern models. It is also unclear how TapWeight would perform or handle much longer training horizons, or cases where CP>>SFT in terms of computational budget and the two need to see a different budget of tokens.

Also note: Table 5 only reports time rather than FLOPs, and it's not possible for the reader to understand whether each phase occurs on the same amount of hardware.

The paper is missing a crucial baseline that might help the reader understand the wider applicability of the technique: training CP+SFT (Table 5) for the same time / FLOPs as TapWeight, to observe whether the gains from TapWeight outweigh the computation cost. (E.g., if you make the cheaper alternatives use as much compute, would they do better? Does TapWeight yield training compute efficiency gains?)

Note however that the authors are relatively upfront about limitations here.

3.) Presentation: The description of the method relies on an understanding of MLO, which is OK, but the paper would be clearer if more description were provided, especially as it applies to CP. For example, it seems like MLO does CP and SFT in a single phase, as implied by Table 5 and the optimization equations. However, Figure 1 somewhat suggests that CP is done and then an entire SFT run is done before updating the loss weights. Exactly how many tokens of CP vs. SFT are seen per step is also unclear (or is it assumed they see the same number of tokens, 1:1 for each step?).

There's also some work that may be relevant but is missing from the related work:

UL2: https://arxiv.org/pdf/2205.05131
U-PaLM: https://arxiv.org/abs/2210.11399 (CP version)

A lot of manual ablation was done for the construction of UL2 and UL2R in U-PaLM. These would serve as very natural candidates for a more learned continual pretraining approach.

4.) Soundness: The authors claim on Line 88 that “Moreover, TapWeight is applicable to any pretrained model with multiple pretraining objectives”. This is likely true in theory, but the authors lack empirical evidence for this given their limited setting (e.g. how about decoder-only models?).

TapWeight as described also is only defined over loss weights of multiple pretraining objectives, but this is not well defined for pretraining objectives that need to see different amounts of different pretraining objectives, like UL2 mentioned above. Please correct me if I am wrong here.

Questions

Could the authors provide more information about how the TapWeight update works? What is the granularity? How many tokens of CP vs. SFT are seen per step? Is it equal? Is my understanding correct that CP+SFT is jointly optimized in TapWeight?

If the above is true, is it true then that the loss weights adaptively update during training? Have you tried taking the final weights and doing another run with the final weights, fixed?

Is extra SFT applied after TapWeight?

How many A100s are used in each phase of training, and for how long? (More information about Table 5).

Comment - Author Response 2

2024-11-22

5. Relevant Related Works

We thank the reviewer for the suggestion. We have revised the manuscript and added the suggested papers to the related work section to provide a more comprehensive literature review.

6. Comparison with CP+SFT Under the Same Budget

We conducted additional experiments on three datasets from MoleculeNet (Bace, Clintox, and HIV) to compare the performance of the CP w/o reweighting method under a similar training budget as TapWeight. The baseline achieved an average score of 78.0, which is lower than the 80.9 score achieved by TapWeight. These results demonstrate that simply increasing the training time of CP does not significantly improve downstream task performance. As shown in Table 1 and Table 5 of the revised manuscript, CP w/o reweighting often fails to outperform the simple finetuning baseline without CP. This is likely because the CP stage was conducted on an unlabeled text corpus in the general domain, which is not strongly related to the downstream tasks. Without feedback from downstream tasks, as implemented in TapWeight, performing CP on this general corpus does not positively impact some downstream tasks. This is especially true when the continued pretraining methods are applied on models like RoBERTa, which have already been pretrained on general domain text.

7. Comparison with CP Using TapWeight Searched Trade-Off Weights

We thank the reviewer for the insightful suggestion. To address this, we conducted additional experiments on three MoleculeNet datasets (Bace, Clintox, and HIV) to evaluate the performance of CP using tradeoff weights searched by TapWeight. On average, this baseline achieved a score of 80.6, which is comparable to the 80.9 score of TapWeight, demonstrating the effectiveness of the tradeoff weights identified by TapWeight. However, this approach requires additional computational cost due to an extra round of CP without yielding significant performance improvements. Therefore, it is more practical to directly use the model weights from TapWeight without conducting another CP stage.

8. Additional Details on Computational Resources

All experiments were conducted using a single A100 GPU. We have revised the experimental setup section in the manuscript to make this clear.

9. Revision of Line 88

We thank the reviewer for pointing this out. We have changed this line in the revised manuscript to ‘TapWeight is broadly applicable to pretrained models with multiple pretraining objectives across various data modalities and downstream task types’ to enhance its precision.

[1] Don’t Stop Pretraining? Make Prompt-based Fine-tuning Powerful Learner, Zhengxiang Shi, Aldo Lipani, NeurIPS 2023.

Comment - Author Response 1

2024-11-22

We appreciate your constructive feedback very much, and provide our response to your questions as follows.

1. Comparison with Baseline Methods

We apologize for any confusion. As shown in Table 3, in the NLU experiments on the GLUE benchmark, the SimCSE baseline method uses the same continued pretraining data as TapWeight, with MLM and CL as the pretraining objectives. Results indicate that TapWeight outperforms SimCSE on the GLUE benchmark, demonstrating that TapWeight's superior performance on NLU tasks stems from its reweighting strategy rather than the continued pretraining process itself.

Additionally, we conducted further experiments to compare our method with a more advanced baseline, PCP [1], for continued pretraining. As shown in Table 3, experimental results on the GLUE benchmark demonstrate that TapWeight outperforms the PCP baseline, highlighting its advantages over more recent and powerful continued pretraining methods.

2. Comparison to Different Ways of Hill Climbing the Pretraining Mixture

We thank the reviewer for the valuable suggestion. Regarding the proposed baseline method where "all pretraining objectives are weighted evenly in the mixture," this corresponds to the CP w/o reweighting ablation setting in Table 5 of the revised manuscript. Results indicate that TapWeight outperforms this baseline, highlighting its effectiveness as a method for optimizing the pretraining mixture.

For the grid search-based baseline method, if each of the five pretraining objectives in the molecular property prediction task has four possible tradeoff weight options, this would result in approximately 10^3 times the time consumption of a single continued pretraining run. While this computational cost could potentially be reduced using a smaller proxy model, we were unable to complete this experiment in the response period. We plan to further explore this direction of optimizing the pretraining mixture in future work.
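The order-of-magnitude estimate above follows from counting grid points. A quick check, assuming the grid is the full Cartesian product of the options and that each grid point costs one continued pretraining run (the variable names are illustrative):

```python
from itertools import product

num_objectives = 5         # pretraining objectives in the molecular setting
options_per_objective = 4  # candidate tradeoff weights per objective

# Closed-form count of grid points: one CP run per point.
num_configs = options_per_objective ** num_objectives

# Same count by explicitly enumerating the Cartesian product.
enumerated = sum(
    1 for _ in product(range(options_per_objective), repeat=num_objectives)
)

print(num_configs, enumerated)  # 1024 1024
```

With 4^5 = 1024 configurations, the search costs roughly 10^3 single runs, which matches the figure quoted above.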

3. Scenario for Longer Training Horizons

We thank the reviewer for the insightful suggestion. For smaller downstream datasets such as RTE, Bace, or Sider, the training time for the continued pretraining (CP) stage is more than 5 times longer than that of the supervised fine-tuning (SFT) stage, due to the small size of these downstream datasets. In these scenarios, TapWeight consistently outperforms baseline methods, demonstrating its robustness across varying scales of training horizons. We plan to further investigate the robustness of TapWeight in scenarios where the CP training extends to durations as long as several weeks in future work.

4. Clarifications on Update Strategies of TapWeight Framework

Is CP+SFT jointly optimized in TapWeight?

CP (level I) and SFT (level II) are two distinct levels in the TapWeight framework, and their optimizations are performed iteratively within a multi-level optimization framework. Importantly, the models at these two levels have different parameters. This differs from a joint optimization of CP+SFT, where both stages are updated in the same optimization step, typically using the same set of parameters.

How many tokens in CP and SFT are seen per-step?

The number of tokens processed in CP and SFT is determined by the respective batch sizes of each level, as specified in the hyperparameter section (Appendices B.4 and C.4). For example, when applying TapWeight to the continued pretraining of a RoBERTa-base model with feedback from the MRPC dataset, the batch size is 512 for the CP stage (level I) and 32 for the SFT stage (level II). The number of tokens processed is therefore not equal, reflecting the specified hyperparameters.

Is it true that the loss weights adaptively update during training?

Yes, the loss weights adaptively update during training, as illustrated in Figure 2.

Is extra SFT applied after TapWeight?

Yes, after TapWeight continued pretraining, the model weights from the CP stage (level I) are used for supervised fine-tuning (SFT) to achieve optimal performance on the downstream task.

Can TapWeight be used for ‘pretraining objectives that need to see different amounts of different pretraining objectives, like UL2’?

In UL2, the amounts of different objectives are manually determined and serve a role similar to the tradeoff weights in TapWeight. However, UL2 samples these objectives during pretraining rather than optimizing them simultaneously, making TapWeight not directly applicable due to the non-differentiability of the sampling process. That said, applying TapWeight in such a scenario could be feasible with techniques like reparameterization, which we plan to explore further in the future.
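As an illustration of what such a reparameterization could look like, the sketch below draws a relaxed (soft) one-hot sample over objectives with the Gumbel-softmax trick. This is an assumption about one possible technique, not a method named in the response; the helper name `gumbel_softmax_sample` and the example logits are hypothetical. Pure Python has no automatic differentiation, so a real implementation would live in an autodiff framework.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature=1.0, rng=random):
    """Draw one relaxed one-hot sample over objectives.

    Hypothetical helper: `logits` are learnable log-scores, one per objective.
    """
    noisy = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the perturbed scores.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


random.seed(0)
sample = gumbel_softmax_sample([0.0, 1.0, -1.0], temperature=0.5)
print(sample)  # three non-negative values that sum to 1
```

Lower temperatures push the sample closer to a hard one-hot choice of objective, while the output remains a smooth function of the logits.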

Comment - Kindly Reminder

2024-11-27

Dear Reviewer,

We greatly appreciate the thoughtful feedback and suggestions you provided. We have tried our best to address your questions in our response. As the discussion period is nearing its conclusion, we kindly request your input on whether any additional questions remain or if our response has satisfactorily resolved your initial concerns. We welcome any further inquiries you may have.

Warm regards,

The Authors

Comment - Revised score

2024-11-27

I thank the authors for the effort put into this revision, the clarifications, and the new experimental results.

Re: Baselines. Thanks for the clarification. You are right that SimCSE includes both an MLM and a CL loss. I also appreciate the added PCP results. Table 5 does also provide the equal-weight baseline I suggested. I think there is decent evidence that this method is effective. The results with the TapWeight weights fixed from the beginning of CP are also relevant and important to include, along with the results under a fixed training FLOP budget.

However, the spirit of my suggestion is more rooted in the question: how does this compare to what a practitioner would do today by manually tuning weights? I think this is not well understood given the ablations done in the paper. While I agree that exhaustive grid search is not feasible, I think there are many other basic comparisons to be done here: ablate with 90% weight on one loss at a time, observe the win/loss tradeoffs, assign new weights based on these quality tradeoffs, and hillclimb for a few iterations.
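The first step of such a manual procedure can be sketched as follows. This is one reading of the suggestion, under stated assumptions: each configuration puts 90% of the weight on a single loss and splits the remaining 10% evenly across the others, and the count of five objectives is borrowed from the molecular setting. The function name is hypothetical.

```python
def dominant_weight_configs(num_objectives, dominant=0.9):
    """Return one weight vector per objective, with `dominant` mass on that
    objective and the remainder split evenly across the others."""
    if num_objectives < 2:
        raise ValueError("need at least two objectives")
    rest = (1.0 - dominant) / (num_objectives - 1)
    return [
        [dominant if j == i else rest for j in range(num_objectives)]
        for i in range(num_objectives)
    ]


configs = dominant_weight_configs(5)
for weights in configs:
    print([round(w, 3) for w in weights])
```

Each of the five resulting runs would then be compared on downstream quality before reassigning weights and iterating, which costs far fewer runs than an exhaustive grid.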

Correct me if I'm wrong, but while the baselines in this paper consider multiple losses, they don't seem to systematically optimize the loss weights. This paper is still missing mixture optimization baselines. It is likely that TapWeight is still more training-FLOP efficient, but this is not evaluated in the paper.

Summary:

I will update my score to a 5 to reflect the amount of revision done to the paper. The new results do help strengthen the value of TapWeight: it does find weights that are useful, not just due to the adaptive reweighting, and the extra FLOPs seem to be worth it. If the clarifications provided here are also included in the paper, it would make the entire work stronger. I maintain that this general problem of optimizing pretraining objectives to optimize for post-SFT performance is extremely important. In the future, this paper might benefit from focusing on this framing rather than an MLO one.

At this time I do not think I could go higher than this. This paper would really benefit from a heavier revision / reframing than what's possible during this discussion period. Why?

Baselines are still a bit weak, as described above. While the extra baseline comparisons are nice, it's missing comparison to other mixture / loss weight optimization approaches.
Even if the baselines were stronger: this paper's empirical contribution is limited. It studies only traditional encoder-only models and MLO formulated this way is only applicable to loss mixtures that are differentiable. This is not very common in modern LLMs which use a single objective (causal language modeling). Keeping it encoder-only also limits us from evaluating the method on a lot of harder tasks as suggested by Reviewer tMTB.
Unclear how this would scale to more realistic continued pretraining horizons. Yes CP here is sometimes 5x the size of a SFT dataset, but typically CP could be ~50B to 1T+ tokens.

In the end, I think there is a niche case that could benefit from this technique (e.g. Imagemol), but the contribution is limited overall.

Comment - Appreciation for Constructive Feedback

2024-11-28

Thank you once again for your insightful review of our manuscript. We deeply appreciate your recognition of the strengths of our work and your thoughtful suggestions for improvement. In particular, your comments on incorporating stronger baselines and exploring longer horizons for continued pretraining are extremely insightful and valuable, and we aim to address these aspects in the future.

We also acknowledge the potential limitations of TapWeight in certain use cases, such as modern LLMs that utilize a single autoregressive objective, as highlighted by the reviewer. However, we believe that, despite these limitations, TapWeight retains significant potential across a broad range of applications, particularly in the rapidly growing field of biomedical foundation models beyond traditional image and text modalities. For instance, Transformer encoder-based models for single-cell RNA-seq [1], proteins [2], and molecular graphs [3] have demonstrated remarkable success in their respective fields. The successful application of TapWeight to Imagemol highlights its potential for broader use in AI for science.

Once again, we extend our gratitude for your valuable time and expertise. Your thoughtful feedback has been instrumental in guiding our work.

[1] Theodoris et al., Transfer learning facilitates predictions in network biology, Nature, 2023.

[2] Lin et al., Evolutionary-scale prediction of atomic level protein structure with a language model, Science, 2023.

[3] Méndez-Lucio et al., MolE: a foundation model for molecular graphs using disentangled attention, Nature Communications, 2024.

Review

Rating: 5, Confidence: 4, Date: 2024-11-03

This paper proposes TapWeight, an approach to adapt weights for different pre-training objectives according to downstream tasks. The authors propose a three-level optimization, where the first level is the traditional pre-training level, the second level is the downstream fine-tuning level, and the third-level is downstream validation level to finally provide signals for weight updates. In the experiments, the authors demonstrate that TapWeight can outperform baselines on molecular tasks, and outperform the pre-train fine-tuning pipeline on NLU tasks. They also conduct a few ablation studies to show the importance of reweighting and multi-level optimization.

Strengths

The proposed TapWeight method can effectively increase model performance on targeted downstream tasks that are different from the pre-training domain.
The authors also conduct a few ablation studies to show the importance of reweighting and multi-level optimization.
The code is released for reproducibility.

Weaknesses

The proposed multi-level optimization is not new, and can actually be seen as meta-learning.
The experimental results are not well presented. For example, Table 2 reports "results" of molecular prediction, but the authors do not explain the metrics in the table and do not use up or down arrows to show whether higher or lower numbers are better.

Questions

Can the authors include experiment results of more types of tasks other than NLU and molecular prediction tasks? E.g., legal domain, financial domain, etc. Otherwise, I don't think this method has wide applications.

Details of Ethics Concerns

No ethics concerns.

Comment - Author Response

2024-11-22

We appreciate your constructive feedback very much, and provide our response to your questions as follows.

1. Novelty of the Proposed Method

While our method shares some similarities with meta-learning approaches, it is distinct in several key aspects. Gradient-based meta-learning typically operates under a bi-level optimization framework and trains models from scratch, aiming to learn model parameters that quickly adapt to specific domains with limited training examples. These methods are rarely applied in the context of continued pretraining.

In contrast, TapWeight is specifically designed for learning the tradeoff weights for multiple continued pretraining objectives. Its goal is to determine the optimal weights such that, after continued pretraining and fine-tuning, the model achieves the best performance on the validation split of the downstream dataset. Our approach introduces a novel three-level optimization framework, which fundamentally differs from the traditional bi-level meta-learning paradigm and highlights the unique contributions of our method.

2. Presentation of Experimental Results

We apologize for any confusion caused by the presentation of our results. In response to the reviewer's comments, we have revised the manuscript to clearly specify the metric used in each table and explicitly indicate in the captions whether a higher or lower value represents better performance.

3. Additional Experiments on Other Domains

We thank the reviewer for this valuable suggestion. In the revised manuscript, we evaluated our method on three commonly used text datasets from different domains for continued pretraining: RCT (biomedical), AGNews (news), and IMDB (movie reviews). As presented in Table 4 of the revised manuscript, TapWeight consistently outperforms both the baseline method without continued pretraining and the baseline continued pretraining methods, TAPT and SimCSE, across all three datasets. These additional experiments across diverse domains further validate the effectiveness and robustness of our method in various contexts.

Comment - Kind Reminder

2024-11-27

Dear Reviewer,

We greatly appreciate the thoughtful feedback and suggestions you provided. We have tried our best to address your questions in our response. As the discussion period is about to end, we are wondering whether any additional questions remain or if our response has successfully resolved your initial concerns. We welcome any further inquiries you may have.

Warm regards,

The Authors

Review

Rating: 5 · Confidence: 4 · 2024-11-04

This paper proposes TapWeight, which automatically determines optimal loss weights for different pretraining objectives through a three-level optimization approach.

Strengths

The proposed three-level approach is flexible and demonstrates transparent performance improvements.
The qualitative analysis effectively shows how different downstream tasks require different weights for pretraining objectives, justifying the need for adaptive weighting.

Weaknesses

Major weakness:

The evaluation relies on the GLUE benchmark, which has been saturated in recent years, and would benefit from including more challenging datasets, e.g., CrossFit (160 downstream tasks): https://arxiv.org/pdf/2104.08835
The effectiveness of the pretraining objectives on more powerful generative models or LLMs remains unexplored. For example: (1) GPT-3.5 results on the more challenging GLUE-X benchmark: https://aclanthology.org/2023.findings-acl.806.pdf (2) a three-stage LLM method for task-adaptive pretraining: https://arxiv.org/abs/2402.05140
Missing reference to and comparison with other recent MLM models that achieve higher performance on the GLUE benchmark, e.g., Pre-training Language Model as a Multi-perspective Course Learner: https://arxiv.org/pdf/2305.03981

Other weakness:

The selection criteria for the multi-objective pretraining tasks in Level I need justification, particularly why certain tasks were chosen over alternatives such as token detection or swap detection.
Lines 452-457: The effects of pretraining tasks on downstream tasks need a more in-depth investigation. In particular, given the tradeoff score results, how can these findings inform the design of new pretraining tasks or the matching of existing pretraining tasks to downstream tasks?

Questions

See the weaknesses section.

Comment - Author Response

2024-11-22

We appreciate your constructive feedback very much, and provide our response to your questions as follows.

1. Evaluation on Additional Datasets

We thank the reviewer for this valuable suggestion. In the revised manuscript, we have evaluated our method on three widely used datasets for continued pretraining: RCT, AGNews, and IMDB. As presented in Table 4 of the revised manuscript, TapWeight consistently outperforms both the baseline method without continued pretraining and the baseline continued pretraining methods, TAPT and SimCSE, across all three datasets. These additional experiments further validate the effectiveness of our approach.

We also thank the reviewer for suggesting the CrossFit dataset. While CrossFit is a comprehensive and challenging benchmark, it is primarily designed for evaluating few-shot learning methods with only 16 training examples per task. Although TapWeight could be applied to investigate robustness in low-resource scenarios, this is slightly beyond the scope of our current work, as TapWeight focuses on continued pretraining for general downstream tasks. Due to time constraints during the author response period, we were unable to conduct experiments on CrossFit. However, we plan to explore TapWeight's potential on this benchmark in future work.

2. Comparison with Additional Baseline Methods and Larger LLMs

We thank the reviewers for the insightful suggestions. In response, we conducted additional experiments comparing our method to a more advanced baseline for continued pretraining, PCP $1$. As shown in Table 3 of the revised manuscript, results on the GLUE benchmark show that TapWeight outperforms PCP, demonstrating its superiority over recent and powerful continued pretraining methods.

Furthermore, we extended our experiments by applying TapWeight to a larger language model, RoBERTa-large, and comparing it against baseline methods, on RCT, AGNews, and IMDB datasets. As shown in Table 4 of the revised manuscript, the results indicate that TapWeight maintains its superior performance over baseline methods across all three datasets tested, highlighting its robustness across models of varying scales.

We would also like to clarify the goals of our paper and address any potential misunderstandings:
(1) We do not propose new pretraining objectives. Instead, our approach TapWeight leverages existing continued pretraining objectives, automatically learning tradeoff weights between these objectives based on feedback from downstream tasks.
(2) Our focus is on proposing a novel continued pretraining (CP) method and comparing it against other CP baselines such as TAPT and PCP under comparable computational budgets and model scales, as specified in those works. TapWeight demonstrates significant performance gains under these conditions. Due to computational constraints, we were unable to conduct continued pretraining experiments on models of the scale of GPT-3 or GPT-3.5, as referenced in the Tag-LLM and GLUE-X papers.
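As an illustration of point (1), the sketch below shows one way existing objectives could be combined under learnable tradeoff weights. It is a minimal example under stated assumptions, not the authors' implementation: the function names, the softmax normalization of the weights, and the example loss values are all hypothetical, and the step that actually updates the weights from downstream validation feedback is omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combined_pretraining_loss(objective_losses, tradeoff_logits):
    """Weighted sum of per-objective pretraining losses.

    objective_losses: dict mapping objective name -> scalar loss value.
    tradeoff_logits:  dict mapping objective name -> unnormalized weight.
    Softmax normalization is an assumption of this sketch; the paper may
    parametrize the tradeoff weights differently.
    """
    names = sorted(objective_losses)
    weights = softmax([tradeoff_logits[name] for name in names])
    total = sum(w * objective_losses[name] for w, name in zip(weights, names))
    return total, dict(zip(names, weights))


# Hypothetical loss values for the three NLU objectives named in the response.
losses = {"mlm": 2.0, "cl": 1.0, "sop": 0.5}
logits = {"mlm": 0.0, "cl": 0.0, "sop": 0.0}  # equal logits -> uniform weights
total_loss, weights = combined_pretraining_loss(losses, logits)
```

In a full pipeline the logits would be the quantities adjusted by the outermost optimization level, while the model parameters are trained against `total_loss`.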

3. Selection of Pretraining Objectives

In our NLU experiments, we selected pretraining objectives that are widely used and generally recognized as effective, such as MLM (BERT), CL (SimCSE), and SOP (ALBERT). These objectives were chosen based on their demonstrated success in prior studies, particularly under continued pretraining settings. For instance, MLM and CL objectives have been validated as effective for continued pretraining by SimCSE. In contrast, objectives like token detection and swap detection have shown effectiveness in ELECTRA-style training, which involves both a generator and a discriminator. This setup differs from the continued pretraining framework employed in this paper. While these objectives were not prioritized in our previous experiments, we plan to incorporate them in future studies to assess their impact and offer more comprehensive insights.

4. Analysis of Tradeoff Score Results

We observe that downstream tasks with similar objectives tend to share similar weights for pretraining tasks to some extent. For instance, the Esol and Freesolv datasets, which both involve predicting physical chemistry properties of molecules, assign large weights to the MG3 pretraining objective. In contrast, the Toxcast and Clintox datasets, both focused on predicting molecular toxicity, assign smaller weights to the MG3 objective. We have updated Section 4.4 of the revised manuscript to incorporate these observations. As the interactions between pretraining objectives and downstream task types are complex, we plan to investigate this further in future work to provide deeper insights.

$1$ Don’t Stop Pretraining? Make Prompt-based Fine-tuning Powerful Learner, Zhengxiang Shi, Aldo Lipani, NeurIPS 2023.

Comment - Appreciation for Constructive Feedback

2024-11-27

Thank you again for your detailed and considerate review of our manuscript. Your acknowledgement of our efforts has been greatly motivating. Once more, we extend our gratitude for your valuable time and expertise.

AC Meta-Review

2024-12-20

This paper introduces TapWeight, a task-adaptive pretraining framework that automatically optimizes the importance of various pretraining objectives based on downstream feedback through a multi-level optimization problem. TapWeight aims to improve the performance of models on specific downstream tasks by making pretraining more relevant to the target domain.

However, as pointed out by the reviewers, there are several concerns about the methodology and experimental setup: The use of the GLUE benchmark, which is considered to be saturated, limits the ability to truly gauge the improvements made by TapWeight; there is a lack of comparison with more current and challenging datasets and with recent generative models or LLMs which might have provided a clearer picture of TapWeight's advancements; the selection criteria for multi-objective pretraining tasks are not sufficiently justified, leaving questions about the rationale behind choosing certain tasks over others; and the experimental results presentation lacks clarity in some aspects, such as metrics explanation and the significance of numerical outcomes.

Even though the authors tried to address these concerns during the rebuttal, the reviewers were not fully satisfied, particularly with the response to the novelty and the generalizability of the approach.

Additional Comments on Reviewer Discussion

Nil

Final Decision: Reject

2025-01-22

Reject