PaperHub

NeurIPS 2024 · Poster · 4 reviewers

Overall: 6.3/10
Ratings: 4, 7, 7, 7 (min 4, max 7, std 1.3)
Confidence: 3.5 · Correctness: 2.5 · Contribution: 3.0 · Presentation: 2.3

From Instance Training to Instruction Learning: Task Adapters Generation from Instructions

OpenReview · PDF

Submitted: 2024-05-11 · Updated: 2024-11-06
TL;DR

The paper introduces TAGI, a novel approach that enables large language models to learn from instructions rather than extensive data instances, significantly enhancing their adaptability and efficiency in real-world tasks.

Abstract

Keywords

Hypernetwork · Generalization · Instruction Learning

Reviews and Discussion

Review (Rating: 4)

This paper tackled zero-shot learning of LLMs to acquire cross-task generalization. The authors focused on the LoRA adapter, one of the methods used for LLMs for parameter-efficient fine-tuning. They generate task adapters by feeding the task instruction to the hypernetwork. Experimental results based on T5-LM-Adapt and T0 models clarified that the proposed method outperformed previous hypernetwork-based methods.

Strengths

  1. The alignment of LLMs to task instructions is one of the most important capabilities of LLMs. The proposed method achieved performance comparable to the simple but effective standard fine-tuning baseline at only 39% of the inference cost.

Weaknesses

  1. Because the proposed method underperformed standard fine-tuning on the RougeL metric, the advantage of the proposed method is inference cost. However, the explanation with respect to computational efficiency is limited. I understood that the low inference cost comes from separating the self-attention for the task input $x$ and the instruction $i$. Let $L$ be the length of $x$ and $M$ be the length of $i$. Because the proposed method adds cross-attention between $x$ and $i$, its computational cost is $L^2 + M^2 + LM$, in which each term corresponds to the self-attention of $x$, the self-attention of $i$, and the cross-attention, respectively. However, standard self-attention requires only $(L+M)^2$. I consider that the difference is small and cannot understand why the proposed method saves 61% of the inference cost compared to self-attention.

  2. This paper lacks the most important ablation study. Specifically, Table 5 lacks an ablation of the LoRA weights. The low inference cost discussed in the item above is caused by the separation of the self-attention. Therefore, the advantage of the main contribution of the proposed method, generating the LoRA weights from the hypernetwork, was not clarified by the experiments. I consider that replacing the standard self-attention with two separate self-attentions and a cross-attention does not largely worsen performance if the pre-training is sufficient.

  3. This paper lacks implementation details. More importantly, the proposed method seems to increase the number of parameters compared to the T0/Tk-Instruct baselines due to the added cross-attention and hypernetwork. Even if the proposed method is superior in terms of time complexity (the number of FLOPs), the number of parameters and the space complexity should be discussed.

Questions

See Weaknesses.

Limitations

  1. This paper mentions in the appendix the increased training cost that comes with the reduced inference cost. This is a fair discussion.

  2. See the third weakness.

Author Response

Dear reviewer 2ySK:

We greatly appreciate your professional review of our article. Here is our feedback.

Response to W1:

W1. Because the proposed method underperformed the standard fine-tuning...

  1. Training Phase: Gradient computations for each sample are unnecessary. Our method aims to minimize training costs for new tasks. Unlike traditional methods that require extensive labeled data and gradient updates, our approach bypasses gradient updates by directly generating task model parameters from the task description.

  2. Inference Phase: Instruction processing is only required once. We agree with the reviewer's point about our method's reduced inference costs: by processing the instruction a single time, we cut down on both time and computational expenses. However, it appears there may be a misunderstanding—our method's time and self-attention costs for a single example are similar to those of standard approaches since the instruction isn't repeatedly processed.

The primary cost savings of our method are evident when inferring multiple instances for the same task. For $n$ inference instances, our method incurs a computational cost of $L^2 + n \cdot (M^2 + LM)$, whereas the standard self-attention method costs $n \cdot (L + M)^2$. Our approach is most effective when handling more instances of the same task with sufficiently lengthy instructions.

To compare inference costs, consider processing $n$ samples of length $i$ with a task instruction of length $t$, ignoring the output sequence length. Let $N$ represent the number of FLOPs per token for an encoder-decoder model. The total computational cost for full fine-tuning is $\mathrm{FLOPs}_{\text{standard}} = N \cdot n(t + i)$. For TAGI, the cost is $\mathrm{FLOPs}_{\text{TAGI}} = N \cdot (t + ni)$. It can be seen that the effectiveness of TAGI increases significantly as $n$ and $t$ grow.
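As an illustrative check (our sketch, not the paper's code; variable names are ours), plugging the median SNI lengths reported in the global author rebuttal Q1 (instruction $t \approx 69$ tokens, instance $i \approx 44$ tokens) into the two formulas above:

```python
# Illustrative sketch of the two cost formulas above (names are ours).
def flops_standard(N, n, t, i):
    # Standard fine-tuning re-encodes the instruction with every instance.
    return N * n * (t + i)

def flops_tagi(N, n, t, i):
    # TAGI encodes the instruction once, then processes only the instances.
    return N * (t + n * i)

# Median SNI lengths from the global rebuttal: t = 69 tokens, i = 44 tokens.
N, t, i = 1.0, 69, 44  # N normalized to 1 FLOP per token for the ratio
for n in (1, 10, 100):
    ratio = flops_tagi(N, n, t, i) / flops_standard(N, n, t, i)
    print(f"n={n:3d}: TAGI uses {ratio:.0%} of the standard cost")
```

With these lengths the ratio approaches roughly 40% as $n$ grows, consistent with the savings discussed in this thread.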

Due to length limitations, we reported the analysis of length in the global author rebuttal Q1.

Response to W2:

W2. This paper lacks the most important ablation study. Specifically, Table 5 ...

Generating LoRA weights from the hypernetwork is central to our parameter-efficient fine-tuning approach. The TAGI model converts instructions and task-specific data (like few-shot examples) into efficient modules within the pre-trained language model. This not only reduces inference costs but also boosts few-shot learning in instruction-based meta-learning. Furthermore, our method can be adapted to create other efficient modules, such as adapters and prefixes, for fine-tuning.

For the ablation involving cross-attention, we combined it with the instruction for the experiment, as indicated by "Instr. Fus." in Table 4. However, during our experiments, we observed that the cross-attention module's contribution was limited, and as such, we did not report it separately.

We have included the following ablation experiments: the LoRA results for Tk-Instruct, and the outcomes of our method without the hypernetwork, directly using LoRA. The results are as follows:

| Method | Def | Def + 2 Pos. |
| --- | --- | --- |
| Tk-Instruct | 48.0 | 54.0 |
| Tk-Instruct-LoRA | 47.5 | 54.6 |
| Hypertuning | 38.9 | 48.6 |
| HINT | 47.2 | 53.2 |
| TAGI | 48.4 | 56.3 |
| *Ablation Study* | | |
| w/o pretraining | 47.1 | 55.6 |
| w/o Instr. Fus. | 35.1 | 40.6 |
| w/o $L_{ce}$ | 47.6 | 55.4 |
| w/o $L_{kl}$ | 45.7 | 53.9 |
| w/o $L_{ins}$ | 47.5 | 55.2 |
| w/o Hypernetwork | 43.8 | 50.7 |

The results in the last row show a significant performance drop without the hypernetwork, highlighting its crucial role in cross-task generalization. This is consistent with Table 6 of the HINT study [1]. The hypernetwork enhances task understanding by dynamically generating efficient module parameters from instructions, rather than just embedding knowledge, as with LoRA. This dynamic adaptability improves cross-task generalization. We will further explore the hypernetwork's advantages in the appendix.

[1] HINT: Hypernetwork Instruction Tuning for Efficient Zero- & Few-Shot Generalisation, ACL 2023

Response to W3:

W3. This paper lacks implementation details. More importantly, the proposed...

We have detailed the experimental procedures in Section 4.3 and Appendix A.4, and the relevant code has been submitted. The hypernetwork described in Section 3.2.1 (L237) uses a 2-layer MLP. We will update the appendix accordingly.
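For concreteness, a minimal sketch of what the 2-layer MLP head of such a hypernetwork could look like; this is an illustrative reconstruction from the description above, with placeholder dimensions, not the authors' released code:

```python
import torch
import torch.nn as nn

class LoRAHypernetwork(nn.Module):
    """Illustrative sketch of a 2-layer MLP head that maps a pooled instruction
    encoding to flattened LoRA A/B matrices for one target module.
    Reconstructed from the description in this thread, not the released code."""
    def __init__(self, enc_dim=768, hidden_dim=512, d_model=768, rank=32):
        super().__init__()
        self.d_model, self.rank = d_model, rank
        out_dim = 2 * d_model * rank  # LoRA A (d x r) and B (r x d), flattened
        self.mlp = nn.Sequential(
            nn.Linear(enc_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, instruction_enc):
        # instruction_enc: (batch, enc_dim), e.g. mean-pooled encoder states
        flat = self.mlp(instruction_enc)
        A, B = flat.split(self.d_model * self.rank, dim=-1)
        return (A.view(-1, self.d_model, self.rank),  # LoRA down-projection
                B.view(-1, self.rank, self.d_model))  # LoRA up-projection

# One forward pass generates adapter weights for a task from its instruction.
hyper = LoRAHypernetwork()
instr = torch.randn(1, 768)    # placeholder pooled instruction encoding
lora_A, lora_B = hyper(instr)  # shapes: (1, 768, 32) and (1, 32, 768)
```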

Parameter Costs

The T0/Tk-Instruct baselines represent full fine-tuning, as illustrated in Figure 2. Our approach involves updating only the hypernetwork and generating parameters for efficient fine-tuning modules. Adding cross-attention does not significantly increase the number of fine-tuning parameters since it is incorporated only into the even layers of the encoder. The additional fine-tuning parameters are as follows:

| Method (T5-Base) | # Params |
| --- | --- |
| T0/Tk-Instruct | 250M |
| LoRA | 28.3M |
| TAGI | 26.1M |

Memory Costs

Let $l$ be the number of layers, $d$ the model hidden dimension, $h$ the number of heads, $k$ the size of the keys/values, and $s$ the sequence length to be saved.

In the default TAGI settings, we store two main components: the processed instruction sequence with $ds$ values (one vector per token), and the LoRA weights, totaling $2 \times 32 \times l \times h \times k \times 2$ values (32-rank LoRA for both key and value per layer and head). The total memory cost is $ds + 128lhk$. With $kh = d$, the memory cost of TAGI scales proportionally with $ds + ld$. Note that the default LoRA rank is 32 but can be adjusted to manage memory costs.

Comment

Dear Reviewer 2ySK:

We would like to thank you again for your detailed reviews. We hope that we have satisfactorily addressed your concerns. The deadline for the rebuttal period is approaching, and we sincerely hope to receive your comments and suggestions soon. Please feel free to let us know if there are any further issues we can address.

Respectfully,

Authors

Comment

Thank you for your follow-up comments and for acknowledging the clarifications provided in our response.

Response to W1:

W1. regarding effectiveness (the RougeL score): In comparison with Tk-Instruct in Table 2, there was little improvement. Although TAGI outperformed T0 in Table 3, it may be due to the knowledge distillation from Tk-instruct because TAGI used Tk-instruct, which is a refined version of T0, as a teacher model. That is, TAGI would not be effective in comparison with the standard instruction tuning.

In Table 2, our method outperforms similar hypernetwork-based approaches, surpassing the previous best method HINT by more than 1.5. Although the improvement over the traditional full fine-tuning method Tk-Instruct in the 'Def' case is modest, our approach uses fewer fine-tuning parameters, fewer training resources, and less training time. In the 'Def + 2 Pos.' case, our method significantly outperforms Tk-Instruct, except for the 11B model.

The results in Table 3 do not pertain to Tk-Instruct. Instead, 'Metatrain' refers to a model we fully meta-trained, built in versions based on two different backbones: T5-LM and T0. From Table 3, it is evident that our method surpasses the existing meta-training results, including multi-task standard instruction tuning, across all TAGI variants. Additionally, our training parameters and overhead are considerably reduced, leading to a 10% reduction in computational requirements for inference.

Additionally, in the ablation study shown in Table 4, the T5-XL model achieves a score of 57.3 even without knowledge distillation (w/o $L_{kl}$), which surpasses the Metatrain result of 53.1.

Response to W2:

W2. regarding efficiency (inference cost): I understood that there is an advantage when there are multiple samples. On the other hand, even with standard self-attention, it is possible to decode multiple samples efficiently by using the KV cache if the model is decoder-type. L237 states `We limit our scope to encoder-decoder models for our experiment'. However, it is misleading to state that only the scope of the experiment is limited. I consider that the scope of the research (or, the usefulness of the proposed method) is limited to the encoder-decoder type, and this should be discussed as a limitation.

Thank you for highlighting this issue. We acknowledge it as a limitation and have discussed it in Appendix B.1. Our approach aimed to compare with prior work, and therefore, we followed the settings of those studies. However, contemporary decoder-only models such as Llama3, Qwen2, and other advanced models have already undergone extensive supervised fine-tuning (SFT) on QA-style instruction data, serving as robust baselines. Additionally, the continued improvement in model performance using efficient parameter fine-tuning methods like LoRA on decoder-only models suggests the feasibility of generating efficient fine-tuning modules via hypernetworks. Even though the use of a KV cache can reduce computation, it does not undermine our method's initial goal of encoding instructions only once to minimize computational load on the input side. In Table 2, we used the results from the HINT paper for GPT-2 XL (1.5B) and OPT (13B), which perform worse than encoder-decoder models of the same caliber. However, these results do not represent the most advanced decoder-only models available today.

Next, let's analyze the KV cache.

Here, we let $l$ be the number of layers, $d$ the model hidden dimension, $h$ the number of heads, $k$ the size of the keys/values, and $s$ the length of the sequence we want to save. We ignore bias terms for simplicity.

For decoder-only models, if we want to cache the key/value pairs for a given sequence, we will store $2lhks$ values: a key and a value for every head in every layer, for each item in the sequence. Assuming $kh = d$, the memory cost of decoder-only models is proportional to $lsd$.

In the default TAGI settings, we store two main components: the processed instruction sequence, which consists of $ds$ values (one vector per token), and the LoRA weights, totaling $2 \times 32 \times l \times h \times k \times 2$ values (with a 32-rank LoRA for both key and value per layer per head). The total memory cost is therefore $ds + 128lhk$. Note that the default LoRA rank is set to 32, but this parameter can be adjusted to manage memory costs. Assuming $kh = d$, the memory cost of TAGI is proportional to $ds + ld$.

In conclusion, although decoder-only models can save computation through KV caching, our method only computes the instruction once, directly reducing the processing length at the input end. Therefore, it has better scalability in terms of sequence length (larger $s$) and model size (larger $d$, $l$).
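As a small numeric illustration of the two formulas above (our sketch, with T5-Base-like shapes; constants follow the text, names are ours):

```python
# Sketch of the two memory formulas above (illustrative only).
def kv_cache_values(l, h, k, s):
    # Decoder-only KV cache: a key and a value per head, per layer, per token.
    return 2 * l * h * k * s

def tagi_values(d, s, l, h, k, rank=32):
    # TAGI: the encoded instruction (d values per token) plus the generated
    # LoRA weights (rank-r A and B for key and value, per layer and head).
    return d * s + 2 * rank * l * h * k * 2

# T5-Base-like shapes: 12 layers, 12 heads, head size 64, hidden size 768.
l, h, k, d = 12, 12, 64, 768
for s in (128, 512, 2048):
    print(f"s={s:5d}: KV cache {kv_cache_values(l, h, k, s):>12,} "
          f"vs TAGI {tagi_values(d, s, l, h, k):>12,} values")
```

The KV cache grows linearly in $s$ with a $2lhk$ factor, while TAGI's instruction storage grows only with $ds$, so the gap widens with longer sequences.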

Comment

Thank you for your detailed response. I raised my score because my concern about the effectiveness is resolved. I consider that an additional description of the detailed setup of Metatrain and the discussion compared to Metatrain would be beneficial. Also, I would appreciate it if the discussion on the limitations and advantages regarding the transformer architectures were added to the paper.

Comment

We are pleased to have addressed some of your concerns and appreciate the valuable feedback and improved score. We recognize that these are areas where our work needs to be revised and supplemented. We will continue to refine our paper and experiments based on the feedback, especially regarding the limitations.

Comment

Thank you for offering such a thorough response to my concerns and additional results. I understood that TAGI is not comparable to Full-FT on the target task because TAGI is not fine-tuned on the target task. I raised the score accordingly. On the other hand, I have the following concerns:

W1. regarding effectiveness (the RougeL score): In comparison with Tk-Instruct in Table 2, there was little improvement. Although TAGI outperformed T0 in Table 3, it may be due to the knowledge distillation from Tk-instruct because TAGI used Tk-instruct, which is a refined version of T0, as a teacher model. That is, TAGI would not be effective in comparison with the standard instruction tuning.

W2. regarding efficiency (inference cost): I understood that there is an advantage when there are multiple samples. On the other hand, even with standard self-attention, it is possible to decode multiple samples efficiently by using the KV cache if the model is decoder-type. L237 states `We limit our scope to encoder-decoder models for our experiment'. However, it is misleading to state that only the scope of the experiment is limited. I consider that the scope of the research (or, the usefulness of the proposed method) is limited to the encoder-decoder type, and this should be discussed as a limitation.

In general, I am concerned that the proposed method is not more effective than standard instruction tuning, and that its efficiency gains are limited to encoder-decoder models. In other words, I am concerned that TAGI may provide neither effectiveness nor efficiency for instruction tuning of the current mainstream decoder-type transformers.

Review (Rating: 7)

The paper addresses limitations of current instruction fine-tuning approaches for large language models, which rely heavily on instance training with extensive task data. This limits adaptability to real-world scenarios where labelled task instances are scarce and broader task generalisation is needed.

The contributions of this work are listed as follows:

  1. The paper introduces Task Adapters Generation from Instructions (TAGI), which automatically constructs task-specific models in a parameter generation manner based on given task instructions, without retraining for unseen tasks.

  2. TAGI uses knowledge distillation to enhance consistency between models developed through "Learning with Instruction" and "Training with Instance" by aligning labels, output logits, and adapter parameters.

  3. TAGI enables cross-task generalisation through a two-stage training process including hypernetwork pretraining and fine-tuning.

  4. Experimental results on Super-Natural Instructions and P3 datasets show TAGI can match or outperform traditional meta-trained models and other hypernetwork models, while significantly reducing computational requirements.

  5. TAGI demonstrates effectiveness in generating adapters for unseen tasks without additional parameter updating or gradient back-propagation, avoiding inefficiencies of repeatedly encoding instructions during inference.

The authors state their approach "enables the model to achieve specified tasks and learn from instructions on how to address a category of problems" while reducing inference overhead and boosting generalisation performance.

Strengths

Overall, this work is of high quality, and its strengths from my perspective are listed below.

  1. Novel learning paradigm: Beyond the traditional instance-based training, the authors propose a new paradigm, "learning with instructions", to mimic human learning by understanding and transforming task guidelines to adapters (parameters), which represents a shift from instance-based training to instruction-based learning.

  2. Novel problem reformulation: This paper reframes the challenge of cross/multi-task generalisation by formulating it as a task-specific parameter generation problem. This addresses a key limitation in current approaches.

  3. Thorough comparisons and ablation studies: The authors evaluate TAGI on two large-scale instruction datasets (SNI and P3), which provides a robust test of their method. The work also compares TAGI against a wide range of baselines, including methods based on zero-shot in-context learning and hypernetwork. The authors conduct detailed ablation studies to understand the contribution of each component in their model, strengthening the validity of their results.

  4. Significance for bridging human and machine learning: This is the most interesting aspect of the work. By mimicking human instruction-based learning, this work contributes to the broader goal of developing AI systems that can learn more like humans.

Weaknesses

While the paper presents a novel approach with promising results, there are several areas where it could be improved:

  1. Limited model size exploration: The authors primarily focus on models up to 3B parameters, with only limited experiments on an 11B model. This leaves open questions about how well TAGI scales to larger models, which are increasingly common in state-of-the-art NLP. The authors acknowledge this limitation, but could strengthen their work by solving the problem.

  2. Narrow focus on encoder-decoder models: The paper exclusively uses encoder-decoder models (T5-LM and T0), citing their superior cross-task generalisation. However, this limits the generalisability of their findings.

  3. Limited analysis of instruction quality: While the paper focuses on learning from instructions, there's little discussion on how the quality or format of instructions affects performance.

  4. Limited exploration of hyperparameter sensitivity: While the authors provide some ablation studies, a more comprehensive analysis of hyperparameter sensitivity would strengthen the paper. This could include: a) Exploring the impact of different LoRA ranks beyond the chosen value of 32; b) Analysing how the size of the hypernetwork affects performance and efficiency; c) Investigating the trade-offs between pretraining steps and finetuning steps;

Addressing these points would strengthen the paper's contributions and provide a more comprehensive understanding of TAGI's capabilities and limitations.

Questions

Please refer to my weakness section.

Limitations

Please refer to my weakness section.

Author Response

Dear reviewer Wqw9:

We greatly appreciate your professional review of our article. Here is our feedback.

Response to W1:

W1. Limited model size exploration: The authors primarily focus on models up to 3B parameters, with only limited experiments on an 11B model. This leaves questions about how well TAGI scales to larger models, which are increasingly common in state-of-the-art NLP. The authors acknowledge this limitation but could strengthen their work by solving the problem.

We fully acknowledge this concern, as highlighted in the limitations section. Due to resource constraints and the challenges associated with training larger models, our results may not be as comprehensive as desired. We plan to refine our methodology and extend the scope of our experiments in future work.

Response to W2:

W2. The narrow focus on encoder-decoder models: The paper exclusively uses encoder-decoder models (T5-LM and T0), citing their superior cross-task generalization. However, this limits the generalisability of their findings.

Thank you for your insightful comments, which align with our own concerns. In Appendix B.1, we explained our rationale for selecting Encoder-Decoder models. Encoder-decoder (enc-dec) structure has advantages over decoder-only (dec-only) models regarding task generalization and instruction-following capabilities.

We also recognize the potential of current decoder-only large models (such as Llama3, Qwen2) that have undergone extensive fine-tuning on instruction tasks, potentially serving as a strong baseline similar to TK-Instruct. However, the effectiveness of our method on these models remains to be determined through further experimental evaluation.

In the near future, we plan to conduct experiments with the latest decoder-only large models to assess their applicability.

Response to W3:

W3. Limited analysis of instruction quality: While the paper focuses on learning from instructions, there's little discussion on how the quality or format of instructions affects performance.

We primarily explored the impact of the number of instructions on performance, as shown in Figure 3. Under our experimental settings, the quality and format of the instructions, which are optimal ("gold"), play a crucial role. The main experimental results in Table 2 show that hypernetwork-based methods do not perform as well as Tk-Instruct, which also hints at the importance of instructions for completing tasks. Furthermore, Table 4 demonstrates a significant decline in performance in our ablation experiments when instruction fusion is omitted (w/o Instr. Fus.), underscoring the importance of instructions.

Response to W4:

W4. Limited exploration of hyperparameter sensitivity: While the authors provide some ablation studies, a more comprehensive analysis of hyperparameter sensitivity would strengthen the paper. This could include: a) Exploring the impact of different LoRA ranks beyond the chosen value of 32; b) Analysing how the size of the hypernetwork affects performance and efficiency; c) Investigating the trade-offs between pretraining steps and finetuning steps;

Thank you for your valuable feedback.

Firstly, in response to point a), we present the impact of varying LoRA ranks on performance in Table 7 in the appendix. We selected a rank of 32 based on a balance between performance and computational overhead.

Regarding point b), we employed a two-layer MLP as the hypernetwork for efficiency. While exploring different hypernetwork sizes, such as larger models like T5 or GPT-2, might yield additional insights, it would significantly increase overhead, as observed in prior work.

For point c), Table 7 in the appendix details the effects of different fine-tuning steps on performance.

Review (Rating: 7)

This work proposes a new learning paradigm to train large language models (LLMs) for better task adaptation and generalization ability. Specifically, they propose the method called TAGI, which follows a two-stage teacher-student scheme: first learning a set of task-specific LoRA weights, then using their outputs to align and meta-learn a hypernetwork that constructs adapters for unseen tasks. The encoded task instructions are transformed into the adapter weights for downstream task adaptation. These networks undergo alignment training based on the generation distribution as well as broader task-specific training. TAGI is evaluated on the SNI multi-instruction tasks, demonstrating good performance against the baseline learning methods.

Strengths

  • The proposed method is neat and cleverly integrates the hypernetwork style of training with the teacher-student alignment methods.
  • The demonstrated efficiency compared with other hypernetwork-based methods is intriguing.
  • The additional studies on ICL as well as self-ablation studies are appreciated.
  • The paper is quite easy to follow and well illustrated.

Weaknesses

  • While I do appreciate the meta-train varying experiments, the distribution shifts of tasks in the meta-train or meta-test datasets seem to be consistent (i.e., I did not find explicit evidence in the manuscript saying that some of the meta-test tasks are altered). Experiments on a few trials of train-test splitting combinations are worth investigating to provide a more comprehensive analysis (with mean and variance).
  • Why is there no comparison between TAGI to prefix-based [1] methods?
  • Slightly minor: Particularly for the ICL, how would this work complement or benefit methods such as sample selection [2] or example-optimization approaches?
  • This is more like a suggestion: It would be good to also have a plotted analysis on the parameter size (can be presented in terms of percentage of the entire network size) of the generated hypernetworks against the corresponding performance.

[1] Li, Xiang Lisa, and Percy Liang. "Prefix-tuning: Optimizing continuous prompts for generation." ACL 2021.

[2] Gupta, Shivanshu, Matt Gardner, and Sameer Singh. "Coverage-based example selection for in-context learning." EMNLP-Findings 2023.

Questions

  • What if you use significantly more positive examples to define the task and generate the hypernetworks?

Limitations

  • The authors address the limitations of the work quite thoroughly in the appendix.

Author Response

Dear reviewer UQYV:

We greatly appreciate your professional review of our article. Here is our feedback.

Response to W1:

W1. While I do appreciate the meta-train varying experiments, the distribution shifts of tasks in meta-train or test datasets seem to be consistent (i.e., I did not find explicit evidence in the manuscript saying that some of the meta-test tasks are altered). Experiments on a few trials of the train-test splitting combinations are worth investigating to provide a more comprehensive analysis (with mean and variance).

As demonstrated in Tables 10 and 11, using the P3 dataset as an example, the meta-test tasks include Sentence Completion, Natural Language Inference, Coreference Resolution, and Word Sense Disambiguation. In contrast, the meta-train tasks consist of Multiple-Choice QA, Sentiment Analysis, Structure-to-Text, and Summarization, among others. The distributions of these tasks are not expected to be consistent.

Figures 3, 4, and 5 illustrate the effects of varying the number of meta-training tasks on performance. We will include the mean and variance in the results in future updates.

Response to W2:

W2. Why is there no comparison between TAGI to prefix-based [1] methods?

Certainly, during our experiments, we attempted to use generated prefixes for fine-tuning, but the results were not as promising as expected. We also explored generating prefixes alongside LoRA, but the benefits remained minimal. This may be due to potential issues in our implementation, which we have yet to identify. We appreciate your feedback and will revisit the experiment with prefixes to address this issue. Thank you for highlighting it.

Here, we provide supplementary experimental results for LoRA fine-tuning (rank = 32) (Tk-Instruct-LoRA) and prefix fine-tuning (num_virtual_tokens = 32) (Tk-Instruct-prefix), as well as results of TAGI with the hypernetwork replaced by LoRA (w/o Hypernetwork):

| Method | Def | Def + 2 Pos. |
| --- | --- | --- |
| Tk-Instruct | 48.0 | 54.0 |
| Tk-Instruct-LoRA | 47.5 | 54.6 |
| Tk-Instruct-prefix | 42.6 | 54.2 |
| Hypertuning | 38.9 | 48.6 |
| HINT | 47.2 | 53.2 |
| TAGI | 48.4 | 56.3 |
| *Ablation Study* | | |
| w/o pretraining | 47.1 | 55.6 |
| w/o Instr. Fus. | 35.1 | 40.6 |
| w/o $L_{ce}$ | 47.6 | 55.4 |
| w/o $L_{kl}$ | 45.7 | 53.9 |
| w/o $L_{ins}$ | 47.5 | 55.2 |
| w/o Hypernetwork | 43.8 | 50.7 |

We observe that the prefix performs exceptionally well in the "Def + 2Pos." scenario, likely because it effectively incorporates information from the positive examples. In contrast, the "Def" scenario shows less satisfactory performance, suggesting that instructions alone are insufficient for achieving optimal results.

Response to W3:

W3. Slightly minor: Particularly for the ICL, how would this work complement or benefit methods such as sample selection [2] or example-optimization approaches?

Better in-context learning (ICL) examples typically result in improved performance. Our method leverages this principle by incorporating ICL instances into the hypernetwork's encoding process, alongside the instructions, as illustrated in Figure 2. The information from ICL examples is integrated into the decoder through instruction fusion and concatenation. This approach allows the hypernetwork to generate more accurate inference parameters based on high-quality examples. We have included results comparing regular ICL examples with those without ICL examples, demonstrating that we benefit from ICL. Similarly, our method also stands to gain from enhancements to ICL techniques.

| # Meta-train Tasks | Def + 2POS (T5-Base), w/o / w ICL | Def (T5-Base), w/o / w ICL | Def + 2POS (T5-Large), w/o / w ICL |
| --- | --- | --- | --- |
| 6 | 38.48 / 38.85 | 31.73 / 32.28 | 52.00 / 52.93 |
| 31 | 40.91 / 41.12 | 32.66 / 33.68 | 53.08 / 53.66 |
| 60 | 41.94 / 42.51 | 34.11 / 34.45 | - |

Response to W4:

W4. This is more like a suggestion: It would be good to also have a plotted analysis on the parameter size (which can be presented in terms of percentage of the entire network size) of the generated hypernetworks against the corresponding performance.

In Table 7 of the appendix, we present the results of varying LoRA ranks and their impact on model performance. The table illustrates how different parameter sizes affect the model's efficacy. The corresponding table is included below:

| Setting | Rank | % Params | RougeL |
| --- | --- | --- | --- |
| Def + 2POS (T5-Base) | 16 | 11.3% | 41.8 |
| | 32 | 11.8% | 42.5 |
| | 64 | 12.2% | 42.3 |
| | Metatrain | 100% | 42.1 |
| Def (T5-Base) | 16 | 11.3% | 34.8 |
| | 32 | 11.8% | 35.3 |
| | 64 | 12.2% | 35.4 |
| | Metatrain | 100% | 35.3 |
| P3 (T5-Base) | 16 | 11.3% | 44.8 |
| | 32 | 11.8% | 45.6 |
| | 64 | 12.2% | 45.5 |
| | Metatrain | 100% | 44.1 |

The global author rebuttal Q3 has updated the images, where you can further examine the trends.

Comment

Thanks for the detailed rebuttal, I hereby retain my positive score.

Comment

We deeply appreciate you taking the time to thoroughly review our work and provide thoughtful feedback. We are pleased the additional experiments adequately addressed your questions. Thank you for your openness and willingness to reconsider based on new evidence. Your flexible and fair review process facilitates quality research!

Review (Rating: 7)

The authors introduce Task Adapters Generation from Instructions (TAGI), which converts instructions into task-specific adapters using a hypernetwork. They employ the Knowledge Distillation framework and a two-stage training process: first, hypernetwork pretraining on standard text pretraining data, followed by fine-tuning on meta-training tasks. They evaluate the proposed method on the Super-Natural Instructions and P3 datasets, demonstrating that TAGI can effectively generate adapters for unseen tasks.

Strengths

  • The limitations of existing methods were clearly explained.
  • The writing was clear, making it easy to grasp the proposed methodology.
  • The concept figure was intricate and facilitated easy understanding.
  • An ablation study was conducted, demonstrating the validity of the proposed methodology's elements.

Weaknesses

  • Because the method assigns indices to the LoRA layers, the flexibility of the LoRA architecture is reduced.
  • This methodology seems limited to encoder-decoder structured LLMs.

Questions

  • Please explain how $\lambda_1$ and $\lambda_2$ were tuned.
  • Could you provide more details on the process of updating LoRA parameters via $\mathcal{L}_{ins}$?
  • How were the tasks categorized? Please specify the criteria used.

Limitations

Please refer to the Weakness section.

Author Response

Dear reviewer z6rG:

We greatly appreciate your professional review of our article. Here is our feedback.

Response to W1:

W1. Because the method assigns indices to the LoRA layers, the flexibility of the LoRA architecture is reduced.

Our method dynamically generates LoRA weights, offering the same level of flexibility as the LoRA approach. We can customize the rank of the LoRA as well as select which modules require the addition of LoRA weights. Furthermore, our method can be extended to generate parameters for other efficient fine-tuning modules.

Response to W2:

W2. This methodology seems limited to encoder-decoder structured LLMs.

Thank you for your insightful comments, which align with our own concerns. In Appendix B.1, we explained our rationale for selecting Encoder-Decoder models. Encoder-decoder (enc-dec) structure has advantages over decoder-only (dec-only) models regarding task generalization and instruction-following capabilities.

We also recognize the potential of current decoder-only large models (such as Llama3, Qwen2) that have undergone extensive fine-tuning on instruction tasks, potentially serving as a strong baseline similar to TK-Instruct. However, the effectiveness of our method on these models remains to be determined through further experimental evaluation.

Response to Q1:

Q1. Please explain how $\lambda_1$ and $\lambda_2$ were tuned.

In the experiment, we set $\lambda_1 = 5$ and $\lambda_2 = \mathrm{sigmoid}(\mathcal{L}_{ins})$. The effects of different $\lambda$ values on the results are shown below. We will include this supplementary experiment in the appendix in the near future.

| Setting | $\lambda_1$ | $\lambda_2$ | RougeL |
| --- | --- | --- | --- |
| Def + 2POS (T5-Base) | 0.5 | $\mathrm{sigmoid}(\mathcal{L}_{ins})$ | 40.1 |
| | 2 | $\mathrm{sigmoid}(\mathcal{L}_{ins})$ | 40.9 |
| | 5 | $\mathrm{sigmoid}(\mathcal{L}_{ins})$ | 42.5 |
| | 10 | $\mathrm{sigmoid}(\mathcal{L}_{ins})$ | 38.7 |
| | 5 | 0.2 | 41.3 |
| | 5 | 0.5 | 41.6 |
| | 5 | 1.0 | 41.2 |

The global author rebuttal Q2 has updated the images, where you can further examine the trends.
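The exact combination of the three losses is given by the paper's equations and is not reproduced in this thread; purely for illustration, assuming the common weighted-sum form (our assumption, not confirmed by the paper), the adaptive weighting described above could look like:

```python
import torch

def tagi_total_loss(l_ce, l_kl, l_ins, lambda1=5.0):
    # Assumed weighted-sum combination of the three losses named in the
    # ablation (L_ce, L_kl, L_ins); the exact form is in the paper's equations.
    lambda2 = torch.sigmoid(l_ins.detach())  # adaptive weight; detaching is our assumption
    return l_ce + lambda1 * l_kl + lambda2 * l_ins

# Toy values, just to show the weighting:
loss = tagi_total_loss(torch.tensor(1.2), torch.tensor(0.3), torch.tensor(0.8))
print(loss)  # 1.2 + 5 * 0.3 + sigmoid(0.8) * 0.8
```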

Response to Q2:

Q2. Could you provide more details on the process of updating LoRA parameters via $\mathcal{L}_{ins}$?

Certainly. As detailed in Equation 4 of the main text, we calculate the Mean Squared Error (MSE) loss between the LoRA parameters generated by the hypernetwork and those of the specific task model fine-tuned with LoRA. As outlined in Section 3.2.2, similar to LoraHub, we fine-tune task-specific models with LoRA during meta-training and save the LoRA parameters. During training, the hypernetwork generates LoRA parameters from the provided instructions and examples in the forward pass. We then compute the MSE loss between the saved LoRA parameters $\Delta$ of the task model corresponding to the instruction and the parameters generated by the hypernetwork. This loss is backpropagated to update the hypernetwork.
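A minimal sketch of the alignment loss as described above (an illustrative reconstruction; function and variable names are assumptions, not the released code):

```python
import torch
import torch.nn.functional as F

def instruction_alignment_loss(generated, reference):
    """Sketch of the described L_ins: MSE between LoRA parameters emitted by the
    hypernetwork and the saved parameters of the LoRA-fine-tuned task model.
    Names and structure are our assumptions, not the released code."""
    losses = [F.mse_loss(g, r) for g, r in zip(generated, reference)]
    return torch.stack(losses).mean()

# Toy example: generated LoRA tensors vs. their fine-tuned references.
gen = [torch.randn(768, 32, requires_grad=True),
       torch.randn(32, 768, requires_grad=True)]
ref = [torch.randn(768, 32), torch.randn(32, 768)]
loss = instruction_alignment_loss(gen, ref)
loss.backward()  # in training, gradients flow back into the hypernetwork
```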

Response to Q3:

Q3. How were the tasks categorized? Please specify the criteria used.

For the SNI dataset, we adhere to the original methodology and dataset setup for pure English tasks as established by the authors [1], with the number of meta-training tasks set to 756. The major task categories and the number of associated subtasks are detailed in Appendix Table 10. The meta-test set comprises 119 previously unseen tasks. For the P3 dataset, we follow the task division approach outlined in the original T0 work[2], separating it into meta-training and meta-testing sets. The specific divisions are presented in Appendix Table 11.

[1] Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks, EMNLP 2022

[2] Multitask Prompted Training Enables Zero-Shot Task Generalization, ICLR 2022

Comment

Thank you for the detailed response.

Most of my concerns have been resolved, and exploring whether the approach remains effective beyond encoder-decoder models could be an interesting direction for future work.

I will keep my score as it is. Thanks.

Comment

Thanks for your suggestion! We are pleased the additional experiments adequately addressed your questions. We are currently expanding this work to models based on decoder-only architectures.

Author Response

Q1:

Q1. Instruction length analysis. We supplement the instruction length analysis of the experimental datasets here, theoretically supporting the method's effectiveness and the savings in inference cost.

Median sequence lengths, given in the number of T5 tokens, for Super-Natural Instructions and P3: we calculate the medians for each format separately, rather than adding the instance and instruction-only values together (hence the mismatch in values). The calculated values are provided in the table below. We find that P3 inputs mostly consist of the instance, with prompt templates contributing relatively few tokens, while SNI inputs consist mostly of instructions.

| Text Sequence | SNI (Median # Tokens) | P3 (Median # Tokens) |
| --- | --- | --- |
| Instance only | 44 | 81 |
| Instruction only | 69 | 24 |
| Instruction + Instance | 133 | 103 |
| Instruction + 2 positives | 197 | - |
| Instruction + 2 pos. + instance | 199 | - |
| Output | 1 | 6 |
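For reference, a small sketch of how such medians can be computed with a T5 tokenizer (using the transformers library; the example records and field names are placeholders, not the actual dataset schema):

```python
from statistics import median
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("t5-base")

def median_token_len(texts):
    # Median length in T5 tokens over an iterable of strings.
    return median(len(tok(t).input_ids) for t in texts)

# Placeholder records; real SNI/P3 examples would be loaded from the datasets.
examples = [
    {"instruction": "Classify the sentiment of the sentence.", "input": "Great movie!"},
]
print("instruction:", median_token_len(e["instruction"] for e in examples))
print("instance:   ", median_token_len(e["input"] for e in examples))
```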

Q2:

Q2. How $\lambda_1$ and $\lambda_2$ were tuned.

In the experiment, we set $\lambda_1 = 5$ and $\lambda_2 = \mathrm{sigmoid}(\mathcal{L}_{ins})$. The effects of different $\lambda$ values on the results are shown below. We will include this supplementary experiment in the appendix soon.

| Setting | $\lambda_1$ | $\lambda_2$ | RougeL |
| --- | --- | --- | --- |
| Def + 2POS (T5-Base) | 0.5 | $\mathrm{sigmoid}(\mathcal{L}_{ins})$ | 40.1 |
| | 2 | $\mathrm{sigmoid}(\mathcal{L}_{ins})$ | 40.9 |
| | 5 | $\mathrm{sigmoid}(\mathcal{L}_{ins})$ | 42.5 |
| | 10 | $\mathrm{sigmoid}(\mathcal{L}_{ins})$ | 38.7 |
| | 5 | 0.2 | 41.3 |
| | 5 | 0.5 | 41.6 |
| | 5 | 1.0 | 41.2 |

The images can be seen in the PDF.

Q3:

Q3. The parameter size of the generated hypernetworks against the performance

In Table 7 of the appendix, we present the results of varying LoRA ranks and their impact on model performance. The table illustrates how different parameter sizes affect the model's efficacy. The corresponding table is included below:

| Setting | Rank | % Params | RougeL |
| --- | --- | --- | --- |
| Def + 2POS (T5-Base) | 16 | 11.3% | 41.8 |
| | 32 | 11.8% | 42.5 |
| | 64 | 12.2% | 42.3 |
| | Metatrain | 100% | 42.1 |
| Def (T5-Base) | 16 | 11.3% | 34.8 |
| | 32 | 11.8% | 35.3 |
| | 64 | 12.2% | 35.4 |
| | Metatrain | 100% | 35.3 |
| P3 (T5-Base) | 16 | 11.3% | 44.8 |
| | 32 | 11.8% | 45.6 |
| | 64 | 12.2% | 45.5 |
| | Metatrain | 100% | 44.1 |

The images can be seen in the PDF.

Comment

We appreciate all the reviewers for their constructive feedback and recognition of the contributions of our paper!

Strengths:

  1. Novelty: Reviewer UQYV and Reviewer Wqw9 value the innovative approach of instruction-based learning, which emulates human learning by comprehending task guidelines and converting them into adapters (parameters). This marks a departure from instance-based training to instruction-based learning, treating all instances of a task collectively and thereby improving cross-task generalization.

  2. Sound Methodology and Strong Performance: Reviewer z6rG and Reviewer UQYV note that the proposed TAGI method ingeniously combines the hypernetwork training style with the teacher-student alignment method in a concise manner. It validates the effectiveness of the method using large-scale datasets P3 and SNI and outperforms previous hypernetwork-based methods, while also having a clear inference advantage over the original instruction fine-tuning meta-training.

  3. Efficiency: Reviewer 2ySK highlights that our proposed method achieved comparable performance with the simple but effective standard fine-tuning baseline even with 39% of the inference cost.

  4. Improved Generalization and Adequate Ablation: Reviewer z6rG, Reviewer Wqw9 and Reviewer UQYV highlighted that converting instructions and task-related data into parameter-efficient modules enhances task generalization. Meanwhile, comprehensive ablation studies have demonstrated the effectiveness of each component of the method.

Supplementary Materials:

Once again, thank you to all reviewers for your constructive comments! We have added additional experiments and will provide further explanations:

  1. How $\lambda_1$ and $\lambda_2$ were tuned; (Reviewer z6rG)
  2. Details on the process of updating LoRA parameters via $\mathcal{L}_{ins}$ and how the tasks were categorized; (Reviewer z6rG)
  3. Comparison of TAGI with prefix-based methods, LoRA-based methods, and TAGI with the hypernetwork replaced by LoRA (w/o Hypernetwork); (Reviewer UQYV, 2ySK)
  4. Analysis of the parameter size against the performance. (Reviewer UQYV)
  5. Comparisons of trainable parameters were conducted alongside additional theoretical analyses of inference complexity and space complexity. (Reviewer 2ySK)
  6. More analysis of hyperparameters and instructions. (Reviewer Wqw9)
  7. In the limitations section, we clarify that the current research is limited to models with encoder-decoder structures. (Reviewer Wqw9, z6rG)
Final Decision

This paper proposes a novel approach to zero-shot learning: infer the weights of a task-specific module for an LLM based on natural language descriptions of tasks. It can be viewed as a combination of hypernetwork training and teacher-student teaching. The empirical performance is strong. The authors were able to address most of the reviewers' concerns during the discussion period and the authors are encouraged to incorporate these clarifications into the final version of the paper.