PaperHub
Rating: 5.6/10 (Poster; 5 reviewers; min 4, max 7, std 1.4)
Individual ratings: 7, 4, 4, 7, 6
Confidence: 4.0 · Correctness: 2.8 · Contribution: 2.6 · Presentation: 2.4
NeurIPS 2024

In-Context Learning State Vector with Inner and Momentum Optimization

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2024-11-06

Abstract

Keywords
in-context learning, language models, task representations, mechanistic interpretability

Reviews and Discussion

Review
Rating: 7

This work introduces the novel concept of the state vector, which encapsulates the information of ICL examples using separate tokens as anchors. Drawing inspiration from the duality between transformer attention and gradient descent, the authors implement inner and momentum optimization to generate task-specific state vectors. To address the context length limitation of LLMs, a divide-and-conquer strategy is employed to aggregate information from multiple examples into a single state vector. Experimental results demonstrate that patching the state vector during the inference forward pass outperforms the previously proposed task vectors and function vectors.

Strengths

  1. Although the idea of using vectors to capture in-context learning is not new, the proposed approach of optimizing the state vector using inner/momentum optimization is well-motivated.

  2. The paper is well-organized and technically sound, with thorough experiments and ablation studies.

  3. The experimental results and the analyses provide some useful insights.

Weaknesses

  1. It is not clear whether larger models or more complex datasets can still benefit from this work. Is this work only applicable to relatively small models?

  2. It is unclear why a dummy query has to be included when extracting the state vector, whose task is to encapsulate the examples only. What happens if it is omitted?

  3. Can we use this vector to align LLMs further, considering encapsulating them into memory/vector bases for reuse and continuous learning of new knowledge?

Questions

Summary of review

In general, the paper presents an interesting concept, and the overall submission is technically sound. Therefore, I am positive about this paper at this point.

However, I hope the authors can address the several concerns raised in the weaknesses section by providing extra explanations and discussions regarding the unclear aspects. Also, please respond to the additional questions below.

If the authors address my concerns/questions well, I will keep the positive accept rating.

Additional questions

  1. I would like to know whether the ICL state vector can be used in actual alignment scenarios. If we switch to a long-context version of the model, how will the ICL state vector be different?

  2. A previous study demonstrates that label words are information anchors [1], while this work aggregates the information of separate tokens in the state vector. Is there any evidence that the separate tokens can also gather in-context information in a progressive manner?

[1] Lean Wang, Lei Li, Damai Dai, Deli Chen, Hao Zhou, Fandong Meng, Jie Zhou, and Xu Sun. 2023. Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9840–9855, Singapore. Association for Computational Linguistics.

Limitations

n/a

Author Response

Dear Reviewer MvM2,

Thank you for your review. According to the feedback from you and other reviewers, we have conducted additional experiments and analysis. We have uploaded a Rebuttal PDF that contains new figures and tables. We provide the analysis in General Response (shown in a separate comment on this page).

We would like to address your concerns and questions in detail below.


Concern 1: Limitation in Model Size and Datasets

To address your concerns, we have conducted additional experiments using the larger Llama-2-70B model to evaluate the effectiveness of our proposed optimization method and the D&C aggregation. Detailed results and analysis can be found in the "Experiment on Llama-2-70B" section of our General Response. Notably, while the performance of ICL shows some improvement with the larger Llama-2-70B model, our method continues to significantly enhance the state vector and surpasses regular ICL in few-shot settings.

Regarding the dataset limitations, please refer to the response to "Question 2" where we present experiments on more complex alignment tasks.

We hope that these additional results satisfactorily address your concerns.

Question 1: Reason for Introducing the Dummy Query

In our preliminary experiments, we observed a decline in performance in state vector extraction when we omitted the dummy query and simply added a separate token right after the demonstration. This finding aligns with earlier studies [1,2]. The issue likely arises because omitting the query alters the input format significantly from regular ICL. Therefore, we follow the previous settings by introducing a "dummy query" and extracting the state vector from the separator token that follows this dummy query.
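To make the input format concrete, here is a minimal sketch of the extraction input described above (the word-pair task and the " : " separator are illustrative assumptions, not the paper's exact prompt):

```python
def build_extraction_prompt(demonstrations, dummy_query):
    """demonstrations: list of (input, label) pairs; dummy_query: an unlabeled input."""
    lines = [f"{x} : {y}" for x, y in demonstrations]
    # The dummy query keeps the input format identical to regular ICL; the
    # state vector is then read at the separator token that follows it.
    lines.append(f"{dummy_query} :")
    return "\n".join(lines)

print(build_extraction_prompt([("hot", "cold"), ("big", "small")], "fast"))
# hot : cold
# big : small
# fast :
```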

Thank you for your insightful question. We appreciate your feedback and hope this clarifies our approach.

[1] In-context learning creates task vectors

[2] Function vectors in large language models

Question 2: Performance on Alignment Task

Thank you for your valuable question. We present the performance of the optimized state vector on the alignment task in a zero-shot setting. The results are shown in Table 8 in the Rebuttal PDF. Our findings indicate that although the state vector is slightly inferior to the regular ICL baseline, it still demonstrates significant potential as an effective alignment approach. By omitting the demonstration in the input, our method significantly reduces the time cost (e.g., by 5.4× on Llama-2-7B) while achieving 90% of the performance of regular ICL.

Moreover, compared to mainstream alignment-tuning methods such as instruction fine-tuning and reinforcement learning from human feedback, our method achieves an average of 85% of their performance without requiring any additional training. Notably, with models like Mistral-7B and Llama-2-70B, our state vectors can achieve 80% of the alignment performance of GPT-4-0613. These results demonstrate that state vectors can enable efficient and effective model alignment, and that inner optimization is beneficial for complex alignment tasks.

We appreciate your feedback and hope this addresses your question comprehensively.

Question 3: Performance with a Long-Context Model

Thank you for your insightful question. Switching to a long-context version of the model will result in a slightly different ICL state vector due to several factors:

  1. Extended Token History: The state vector can access a more extended history of tokens, allowing it to retain more contextual information and improve performance on tasks with long-term dependencies.

  2. Capturing Intricate Patterns: With a longer context, the state vector can capture more intricate patterns and relationships within the data, leading to more accurate and contextually appropriate responses.

  3. Coherent Outputs: Long-context models generally produce more coherent outputs, as the state vector maintains a stable understanding over extended sequences, reducing the risk of losing context.

  4. Enhanced Complexity Handling: The state vector in a long-context model can handle complex tasks more effectively, benefiting from the comprehensive integration of extensive contextual information.

We appreciate your feedback and hope this explanation clarifies the potential impacts of using a long-context model.

Discussion about information anchors

Thank you for your insightful comments. We are pleased to discuss the mechanism of information gathering in ICL.

Our research provides further evidence that separate tokens can progressively gather in-context information. In this paper, we investigate the impact of layer selection on state vector extraction in Section 6.2 Layer Selection. We observe that initially, increasing the number of layers for state vector extraction improves performance, and it reaches peak performance around the middle layers. This indicates that ICL information is progressively integrated into the separate token. In the work "Label Words are Anchors," it is found that information flows first from the example to its corresponding label words in the early layers, and then from the label words' representation to the final separate token representation in different layers. This also shows that ICL information progressively flows to the separate token and peaks around the middle layers. These results are consistent with our observations.

We hope this explanation addresses your concerns. If you have any further questions or need additional clarification, please let us know. We are more than happy to discuss this further.

[1] Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning.


Thank you very much again for your great questions and suggestions. Please let us know if you have any further questions, as we are happy to continue the discussion.

Comment

Dear Reviewer MvM2,

We hope you are doing well. As the discussion period is coming to an end (Aug 13), we wanted to reach out to see if you have any follow-up questions. If so, we would appreciate the opportunity to respond before the discussion period ends. We believe our above messages should have addressed your concerns, and therefore may warrant an increase in score if you find them helpful as well. Would you please let us know if you have any further questions or concerns? We are happy to continue the discussion.

Thank you very much again for your thoughtful review and help in improving the paper. We appreciate your time and consideration.

Best regards,

Paper15186 Authors

Comment

Thank you for your response to the rebuttal.

I appreciate that my concerns have been taken into consideration. I suggest that the authors enhance the paper by integrating feedback from other reviewers, including the use of larger-scale models and addressing presentation issues.

As a result, I maintain my acceptance of the paper at this stage.

Comment

Dear Reviewer MvM2,

We are glad to have addressed your concerns. Please let us know if you have any further questions, as we are happy to continue the discussion.

Review
Rating: 4

The paper presents an in-depth analysis and optimization of In-Context Learning (ICL) within large language models (LLMs). In particular, it proposes inner and momentum-based optimization techniques to increase the performance of in-context learning.

Strengths

  1. In-depth analysis of ICL vectors across 12 different datasets (linguistic, translation, and knowledge). The analysis provides insights into ICL vectors and motivates the optimization.
  2. The optimization method can be divided into inner and momentum-based stages, each of which has shown promising improvement.
  3. The ablation study is abundant.
  4. The presentation of the research question is clear to read and follow.

Weaknesses

  1. The method itself is somewhat simple. Specifically, the inner optimization is merely a uniform averaging.
  2. The definition of state vectors lacks a rigorous theoretical basis, making it difficult to infer the approach's generalizability and reliability across different NLP tasks.
  3. The experiments are conducted on Llama-2-7B and GPT-J-6B, which are limited in terms of scale and performance.

Questions

In principle, ICL plays a more significant role in larger models (also with better long-sequence performance). Does this method benefit these models?

Limitations

NA

Author Response

Dear Reviewer 4gb3,

Thank you for your review. According to the feedback from you and other reviewers, we have conducted additional experiments and analysis. We have uploaded a Rebuttal PDF that contains new figures and tables. We provide the analysis in General Response (shown in a separate comment on this page).

We would like to address your concerns and questions in detail below.


Concern 1: Method is too Simple

We respectfully disagree with the comments on the simplicity of our work. Before we dive into the concrete points, we would like to clarify our motivations and contributions:

  • We have conducted a comprehensive study of the working mechanism of the compressed vector in ICL. We demonstrated that attention activations can be equivalent to updated parameters via gradient descent under specific conditions. Leveraging this understanding, we proposed the definition of the state vector. To the best of our knowledge, we are the first to analyze the compressed vector in ICL through the dual view of ICL and gradient descent.
  • Inspired by the concepts of model soup and momentum gradient optimizer, we proposed the Inner Optimization and Momentum Optimization methods for the state vector. Extensive experiments on a range of tasks have shown that our methods significantly enhance the effectiveness and efficiency of the state vector.
  • To address the example limitation caused by model input length constraints, we introduced the Divide-and-Conquer Aggregation method. Extensive experiments demonstrate that our approach effectively aggregates information from state vectors, successfully scaling up to a large number of examples.

Although our inner optimization can be viewed as a kind of uniform averaging, it is a simple yet effective method to enhance the state vector within a single forward pass. Additionally, our proposed momentum optimization further enhances the inner optimized state vector. Our work not only provides a novel theoretical perspective on ICL compressed vectors but also introduces practical applications for the study of ICL's working mechanism. In this way, our work, as the first to establish a bridge between the theoretical understanding of ICL's working mechanism and the application of ICL compressed vectors, offers a novel integration in the field.

We hope this explanation addresses your concerns. If you have any further questions, please feel free to let us know, we would be happy to discuss them with you.

Concern 2: Limitation in Theoretical Basis

We acknowledge that our theoretical development is still in the early stages. Due to the complexity of transformers, rigorously proving the mechanisms of ICL is very challenging. However, our method demonstrates that under specific conditions, ICL can execute gradient descent algorithms. Previous works [1,2] have also built upon this understanding, conducting experiments to support it through attention pattern similarity.

While we cannot mathematically infer the generalizability and reliability of our method, we have extensively validated it across 12 datasets. The experiments demonstrate that our method is both general and robust. Additionally, we have provided results on more complex and realistic alignment tasks. Please refer to the "State Vector for Alignment" section of our General Response for detailed experiments and analysis. Notably, our optimized state vectors can achieve 85% of the alignment performance of GPT-4 on Mistral-7B and Llama-2-70B.

We hope this explanation and the additional experiments address your concerns regarding the theoretical basis and practical applicability of our approach.

[1] Why can gpt learn in-context? language models implicitly perform gradient descent as meta-optimizers

[2] In-context Learning and Gradient Descent Revisited

Concern 3: Limitation in Model Size

Thank you for your insightful feedback and for pointing out the potential limitations concerning model sizes.

In response to your concerns, we have performed additional experiments using the larger Llama-2-70B model to evaluate the effectiveness of our proposed optimization method and the D&C aggregation. Due to resource constraints, we were unable to extend our method to even larger models. Detailed results and analysis can be found in the "Experiment on Llama-2-70B" section of our General Response. Importantly, although the performance of ICL exhibits some improvement with the larger Llama-2-70B, our method continues to significantly enhance the state vector and surpasses regular ICL in few-shot settings.

We hope that these additional results satisfactorily address your concerns.


Thank you very much again for your great questions and suggestions. Please let us know if you have any further questions, as we are happy to continue the discussion. If you find that our response addresses your concerns, would you kindly consider raising your rating score for our paper? We greatly appreciate your consideration.

Comment

Dear Reviewer 4gb3,

We hope you are doing well. As the discussion period is coming to an end (Aug 13), we wanted to reach out to see if you have any follow-up questions. If so, we would appreciate the opportunity to respond before the discussion period ends. We believe our above messages should have addressed your concerns, and therefore may warrant an increase in score if you find them helpful as well. Would you please let us know if you have any further questions or concerns? We are happy to continue the discussion.

Thank you very much again for your thoughtful review and help in improving the paper. We appreciate your time and consideration.

Best regards,

Paper15186 Authors

Comment

I thank the authors for this detailed reply. However, regarding my points, the response is not satisfactory.

  1. The formalization of the state vector is the key point if the authors claim this method is "the first to establish a bridge between the theoretical understanding of ICL's working mechanism and the application of ICL compressed vectors". What is the advance of this method from this perspective? I think this would enhance the theoretical basis. How does this improve the authors' understanding and motivate this simple method?

  2. The added experiment in Figure 10 actually raises more questions. Firstly, it looks like the "regular" baseline is missing. Also, for the few-shot settings on Llama-2-70B, the method does not produce significant results (near or lower than ICL), especially for the AVG. Does this mean average aggregation is not very effective (given its simple form)? The same holds for main Figure 2. For the 70B results, D&C also improves less over AVG. Why does this happen? Improvement on larger/more powerful base models is of more concern given the nature of ICL.

All in all, I am currently unable to change my score.

Comment

We sincerely appreciate your thoughtful feedback and questions. We are more than happy to engage in a discussion to address the concerns regarding the formalization of the state vector and the additional aggregation experiment on Llama-2-70B.


The Advantage of State Vector Formalization

We are glad to clarify that our formalization of the state vector contributes to the advancement of research in both the ICL working mechanism and ICL compression vector domains.

ICL Working Mechanism: Previous studies [1, 2, 3] based on the dual view of ICL and gradient descent have primarily provided evidence from the perspective of attention pattern similarity. In contrast, our research demonstrates that certain optimization algorithms applied to the gradient can also effectively refine the state vector. This finding introduces new evidence from the perspective of ICL compression vector application. Although we acknowledge that this approach does not conclusively establish the ICL mechanism, we argue that our study presents a novel perspective and empirical support that contribute to advancing the understanding of ICL's underlying mechanism.

ICL Compression Vector: Earlier studies [4, 5] on ICL compression vectors suggested that hidden states or attention activations in transformers contain information related to the ICL task function. However, these studies were largely based on hypotheses and empirical analysis, lacking a solid theoretical foundation. Our work extends this line of inquiry by leveraging the dual view of ICL and gradient descent. We propose that attention activations can be viewed as trained parameters. This offers theoretical justification for the idea that attention activations could store the ICL task function, thereby providing a more robust basis for understanding their role.

In summary, we argue that our work provides new insights and evidence that contribute to the advancement of research in both the ICL working mechanism and the application of compression vectors.

[1] Why can gpt learn in-context? language models implicitly perform gradient descent as meta-optimizers

[2] In-context Learning and Gradient Descent Revisited

[3] Transformers Learn In-Context by Gradient Descent

[4] In-context learning creates task vectors

[5] Function vectors in large language models

Comment

The Experimental Results in Figure 10

We sincerely appreciate your insightful observations regarding the experiment in Figure 10. Thank you for pointing out the issue with the missing regular baseline. To address this and supplement our analysis, we have provided the original results of the Llama-2-70B aggregation experiment below. We will also revise our tables accordingly. Due to space constraints, we only present the results for the 10-example and 100-example settings. We categorize the results into zero-shot and few-shot settings.

Zero-shot:

| | Regular | Avg. (10 examples) | Avg. (100 examples) | D&C (10 examples) | D&C (100 examples) |
| --- | --- | --- | --- | --- | --- |
| Person-Occupation | 0.0 | 7.6 | 18.1 | 0.7 | 31.8 |
| Antonym | 0.1 | 61.0 | 63.1 | 51.7 | 64.5 |
| English-French | 0.3 | 35.7 | 34.1 | 26.7 | 46.9 |
| Product-Company | 0.1 | 11.0 | 17.5 | 32.9 | 34.2 |

Few-shot:

| | ICL baseline | Avg. (10 examples) | Avg. (100 examples) | D&C (10 examples) | D&C (100 examples) |
| --- | --- | --- | --- | --- | --- |
| Person-Occupation | 70.4 | 68.7 | 71.1 | 48.0 | 72.6 |
| Antonym | 68.1 | 67.7 | 68.3 | 62.2 | 71.7 |
| English-French | 82.9 | 82.1 | 82.6 | 77.5 | 84.6 |
| Product-Company | 80.1 | 80.2 | 80.9 | 75.5 | 85.1 |

As shown in the few-shot results, our D&C aggregation performs worse than the ICL baseline and average aggregation when the number of aggregated examples is small. The primary reason is that the conquer stage relies on a limited number of examples (e.g., only 1-shot in the conquer stage of D&C aggregation, compared to 10-shot for average aggregation when using a total of 10 examples). This limitation prevents the model from effectively compressing information from the group-specific state vectors into the final D&C aggregated state vector. However, as the number of examples increases, the performance of D&C aggregation improves significantly. At the 100-example setting, it outperforms the ICL baseline across all four datasets (with improvements of 2.2 on Person-Occupation, 3.6 on Antonym, 1.7 on English-French, and 5.0 on Product-Company). Additionally, in the experiments shown in Figure 2, our method also achieves notable improvements (ranging from 1.4 to 11.3 on Llama-2-7B and from 1.3 to 10.1 on GPT-J). We achieved similar improvements on the larger Llama-2-70B model as on the smaller Llama-2-7B model, which demonstrates the effectiveness of our D&C aggregation method.

Regarding the average aggregation, we would like to emphasize that we included it as a straightforward and intuitive baseline rather than as a proposed method (it is not one of our main contributions). Consistent with your observations, we also found that averaging aggregation does not yield significant results on Llama-2-70B. We attribute this to the fact that the simple average, while providing some improvement, is insufficient for state vector aggregation. In contrast, our proposed D&C aggregation surpasses average aggregation under the 100-shot setting, which shows that our method is more effective and offers greater improvement than average aggregation.


Thank you very much again for your thoughtful review and feedback. Please let us know if you have any further questions, as we are happy to continue the discussion.

Review
Rating: 4

The authors focus on the compressed vectors in In-Context Learning (ICL). They highlight the similarities between the compressed vectors and parameters trained via gradient descent, proposing the formulation of the state vector. They then apply two optimization algorithms to progressively refine the state vector. The proposed method achieves SotA performance on diverse tasks.

Strengths

  1. The authors introduce the formulation of the state vector that encapsulates the processing state of ICL stored in the attention activations.
  2. The proposed two optimization methods for the state vector seem to be effective.
  3. The research topic is significant and intriguing.

Weaknesses

  1. The proposed state vector seems to benefit from examples (as shown in Eqn. 5); could you explain how it works without demonstrations in the zero-shot setting (as shown in Table 1)?
  2. As the ICL baseline makes predictions conditioned on the demonstrations, could you explain why its performance remains unchanged with different numbers of examples in Figure 2?
  3. The proposed method mainly draws inspiration from trainable parameters and gradient optimization algorithms. However, the performance of applying other classical gradient optimization algorithms deteriorates significantly in Table 2. And the authors argue that "The outcome indicates a discrepancy between the state vector and updated parameters with gradient descent." This confuses me a lot. The authors should state their similarities and differences more clearly and precisely.

Questions

See weakness.

Limitations

See weakness.

Author Response

Dear Reviewer yzeU,

Thank you for your review. According to the feedback from other reviewers, we have conducted additional experiments and analysis. We have uploaded a Rebuttal PDF that contains new figures and tables. We provide the analysis in General Response (shown in a separate comment on this page).

We would like to address your concerns and questions in detail below.


Question 1: State Vector Usage in Zero-shot

We would like to clarify the input of our optimization method. When extracting the state vector, we use demonstrations and a dummy query as input. Then we extract attention activations from the forward pass to form the state vector. Thus, the process is the same for both the zero-shot and few-shot settings. Subsequently, we obtain the processed state vector through optimization or aggregation methods. The extraction and the subsequent optimization or aggregation are performed only once, since they are unrelated to the test data.

During testing, our experiments are divided into zero-shot and few-shot settings, which differ based on the input to the test data. In the zero-shot setting, we only input the query, whereas in the few-shot setting, we input both the demonstrations and the query. During the forward pass, we use the attention activations stored in the state vector to replace the ones computed in the transformer.
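To illustrate the patching step, here is a minimal PyTorch-style sketch (the module path `model.blocks[l].attn` and the activation shape `[batch, seq_len, hidden]` are assumptions, and this is not the authors' implementation):

```python
import torch

def patch_state_vector(model, layers, sep_pos, state_vector):
    """Replace the attention activation at the separator position `sep_pos`
    with the stored state vector during a forward pass. Assumes each attention
    module returns a single tensor of shape [batch, seq_len, hidden]."""
    handles = []
    for l in layers:
        def make_hook(layer_idx):
            def hook(module, inputs, output):
                patched = output.clone()
                patched[:, sep_pos, :] = state_vector[layer_idx]
                return patched  # returning a tensor overrides the module output
            return hook
        handles.append(model.blocks[l].attn.register_forward_hook(make_hook(l)))
    return handles  # call handle.remove() on each after inference
```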

We hope this explanation addresses your question. Should you have any further inquiries, we are at your disposal for any additional discussions.

Question 2: Baseline Result In Figure 2

Thank you for your question.

In Figure 2, the ICL baseline denotes the 10-shot ICL result. Due to the input length limitations of the model, regular ICL cannot be directly applied to many examples (e.g., the average input length of AG News in the 10-shot setting is 3872 tokens, while the maximum input length of Llama-2 is 4096). Therefore, we only use 10-shot ICL as the baseline, and we additionally provide averaging aggregation as a stronger baseline. We hope this explanation addresses your question.

Question 3: First-Order Gradient Optimizer Result

Here, we would like to provide a more detailed analysis. First, we present the update formulas for all optimizers. In the following formulas, $g_t$ and $v_t$ denote the initial gradient and the optimized gradient at the $t$-th iteration, respectively, and $\beta$ denotes the momentum coefficient.

1. Polyak Momentum Optimizer (mom.)

$v_t = \beta v_{t-1} + (1-\beta) g_t$

2. AdaGrad Optimizer (adag.)

$s_t = s_{t-1} + g_t \cdot g_t$

$v_t = \frac{1}{\sqrt{s_t + \epsilon}} \cdot g_t$

3. RMSprop Optimizer (rms.)

$s_t = \beta s_{t-1} + (1-\beta) g_t \cdot g_t$

$v_t = \frac{1}{\sqrt{s_t + \epsilon}} \cdot g_t$

4. Adam Optimizer (adam.)

$s_t = \beta_1 s_{t-1} + (1-\beta_1) g_t \cdot g_t$

$h_t = \beta_2 h_{t-1} + (1-\beta_2) g_t$

$\hat{s}_t = \frac{s_t}{1-\beta_1^t}$

$\hat{h}_t = \frac{h_t}{1-\beta_2^t}$

$v_t = \frac{1}{\sqrt{\hat{s}_t + \epsilon}} \cdot \hat{h}_t$

We believe there are two main reasons why first-order gradient optimizers are not well-suited for state vector optimization:

  • First-order gradient optimizers use adaptive learning rates, which require a significant amount of historical information. This strategy often shows instability and reduced effectiveness when data is limited.

  • First-order gradient optimizers have more complex hyper-parameters than the Momentum Optimizer, making it challenging to find the optimal hyper-parameters. Experimental results indicate that directly using hyper-parameter settings commonly used in gradient descent results in poor performance.

We hope this explanation resolves your concerns. If you have any further questions, please feel free to let us know, and we would be happy to discuss them with you.


Thank you very much again for your great questions and suggestions. Please let us know if you have any further questions, as we are happy to continue the discussion. If you find that our response addresses your concerns, would you kindly consider raising your rating score for our paper? We greatly appreciate your consideration.

Comment

Dear Reviewer yzeU,

We hope you are doing well. As the discussion period is coming to an end (Aug 13), we wanted to reach out to see if you have any follow-up questions. If so, we would appreciate the opportunity to respond before the discussion period ends. We believe our above messages should have addressed your concerns, and therefore may warrant an increase in score if you find them helpful as well. Would you please let us know if you have any further questions or concerns? We are happy to continue the discussion.

Thank you very much again for your thoughtful review and help in improving the paper. We appreciate your time and consideration.

Best regards,

Paper15186 Authors

Review
Rating: 7

This paper aims to reveal the working mechanism of the compressed state vector in in-context learning. The authors first show that state vectors are similar to parameters trained via gradient descent. They then propose three methods to optimize such parameters: (1) inner optimization, which averages each vector of the separate token; (2) momentum optimization, which uses a momentum-based optimization algorithm; (3) a divide-and-conquer strategy that aggregates several vector groups. They test these methods on some medium-sized LLMs. Experiments show some improvements against other compressed vector baselines.

Strengths

  1. The paper is well written and easy to follow.
  2. The contribution of this paper is significant. It attempts to provide understanding of the mechanism of ICL via a gradient descent perspective. The proposed state vector can be viewed as trained parameters so that they can be optimized with future methods.
  3. The proposed optimization methods are reasonable. Experiments with these methods support their claims. The performance can be improved in most scenarios.

Weaknesses


Questions

  1. Does the state vector without optimization actually function the same as the task vector? If not, I think there should be another baseline called state vector (w/o optimization). The authors should clarify whether this is a new compressed vector method or just a proof of similarity to gradient descent.
  2. Can the authors provide more rationale for optimizing the state vector? I understand the state vector is a compression of the previous context, but what would it represent after the optimization?
  3. Experiments show that the inner optimization and momentum optimization prefer different tasks. Is there a conclusion on when to use which? Or is there a combination that adaptively covers all situations?
  4. Can the authors elaborate more on the differences between averaging and D&C? Why is D&C worse when the number of examples equals 10 (Figure 2)? In my understanding, they should be the same, since the authors use 10-shot ICL (there should be only 1 group).
  5. The authors should check the correctness of the experiment results. For example, in Table 5, they report Llama-2 Zero-shot Regular with six zeros but with an average of 0.2.

Limitations

See questions

Author Response

Dear Reviewer FZSw,

Thank you for your review. According to the feedback from other reviewers, we have conducted additional experiments and analysis. We have uploaded a Rebuttal PDF that contains new figures and tables. We provide the analysis in General Response (shown in a separate comment on this page).

We would like to address your concerns and questions in detail below.


Question 1: Difference with Task Vector

Thank you for your feedback. Since the task vector aims to extract the hidden state in the initial $L$ layers for intervention, it is functionally equivalent to our state vector without optimization, which extracts attention activations in the initial $L$ layers. However, we must emphasize that the proposed state vector differs significantly in its integration into the model. In our theoretical framework, the state vector is defined based on attention activations of the separator token, which can be viewed as updated parameters. Therefore, even though it is mathematically equivalent to the task vector, the state vector possesses a stronger theoretical foundation and interpretability.

Question 2: Contribution Clarification

We would like to accurately describe our work and contributions. Previous work [1] proposed the dual view of ICL and gradient descent and provided evidence from the perspective of attention pattern similarity. Our work extends this understanding and applies it to the working mechanism of ICL compressed vectors. Based on our understanding and theoretical framework, we propose the definition of the state vector and introduce its optimization and aggregation methods. Our paper can be considered as primarily proposing a new compressed vector method, inspired by the similarity between ICL and gradient descent. Therefore, our work establishes a bridge between the study of ICL's working mechanism and the application of ICL compressed vectors.

[1] Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers

Question 3: Rationality of Optimization

Thank you for your question. From our perspective, the state vector is not simply a compression of the demonstration text, but rather a store of the implicit ICL task function information contained within the demonstrations. Our optimization methods enhance the accuracy of this task information. The experiments in Section 6.1 (Qualitative Study) demonstrate this point intuitively. As shown in Figure 4, a trend is observable in the movement of these clusters as the example position increases, indicating that as the number of demonstrations increases, the state vector becomes closer to the ideal task representation. Our momentum optimization effectively utilizes this trend and captures the underlying task dynamics within the demonstrations, thereby enhancing the state vector.

Question 4: Choice of Optimization Method

Thank you for your insightful question.

Based on the experimental results, we find that momentum optimization performs better on Knowledge tasks. The reason may be that more examples can effectively promote the model's recall of internally stored knowledge, and 10 examples may be insufficient. Therefore, our momentum optimization can enhance the accuracy of the knowledge-based task function stored in the state vector by extending its trend.

On the other hand, inner optimization shows better average performance on Linguistics and Translation tasks. This might be because 10 examples are sufficient for these tasks, and momentum optimization could over-edit the state vector, thereby introducing noise.

In light of these findings, we suggest using momentum optimization when increasing the number of examples effectively enhances ICL performance. Conversely, when an increase in examples does not contribute much to performance improvement, using only inner optimization is recommended.

We hope this explanation addresses your question. Please let us know if you have any further questions; we would be happy to discuss them with you.

Question 5: Difference in Aggregation Results in the 10-shot Setting

Thank you for your question. We would like to elaborate on the differences between averaging aggregation and D&C aggregation in the 10-shot setting.

In the 10-shot setting, we have only one demonstration group. For averaging aggregation, we directly extract the group-specific state vector from this single demonstration group, which becomes the final aggregated state vector.

For D&C aggregation, the process is slightly different. In the divide stage, we extract the group-specific state vector from the group-specific dummy query. In the conquer stage, we first pair the group-specific dummy query with its label to form a one-shot demonstration. Then, we input a new dummy query with this one-shot demonstration and extract the aggregated state vector from the dummy query. Thus, the D&C aggregated state vector and the averaging aggregated state vector are not the same.

Our analysis suggests that the D&C aggregated state vector performs worse when the number of examples is small because the conquer stage relies on a limited number of examples. This limitation prevents the model from effectively compressing the information from the group-specific state vector into the final D&C aggregated state vector.
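To summarize the flow, here is a high-level sketch of the two stages (the helper functions `extract` and `extract_with_patch` are hypothetical stand-ins for the extraction procedure, and the exact way the group-specific vectors enter the conquer-stage forward pass is our assumption):

```python
def dnc_aggregate(groups, group_query, group_label, new_query,
                  extract, extract_with_patch):
    """groups: list of demonstration groups, each a list of (input, label) pairs."""
    # Divide stage: one group-specific state vector per group, each read at
    # the group-specific dummy query.
    group_vecs = [extract(demos=g, query=group_query) for g in groups]
    # Conquer stage: pair the group-specific dummy query with its label to
    # form a one-shot demonstration, then extract the aggregated state vector
    # at a new dummy query while the group vectors are patched in.
    one_shot = [(group_query, group_label)]
    return extract_with_patch(demos=one_shot, query=new_query,
                              patch_vectors=group_vecs)
```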

Typo in Table 5

We would like to clarify that the average result in Table 5 represents the average performance across 12 tasks, not just the 6 additional tasks. We appreciate your feedback and will revise our manuscript to ensure this is more clearly communicated.


Thank you very much again for your great questions and suggestions. Please let us know if you have any further questions, as we are happy to continue the discussion.

Comment

Dear Reviewer FZSw,

We hope you are doing well. As the discussion period is coming to an end (Aug 13), we wanted to reach out to see if you have any follow-up questions. If so, we would appreciate the opportunity to respond before the discussion period ends. We believe our above messages should have addressed your concerns, and therefore may warrant an increase in score if you find them helpful as well. Would you please let us know if you have any further questions or concerns? We are happy to continue the discussion.

Thank you very much again for your thoughtful review and help in improving the paper. We appreciate your time and consideration.

Best regards,

Paper15186 Authors

Review
Rating: 6

The paper considers the problem of using compressed vectors to replace explicit demonstrations, with potential benefits of providing new perspectives to better understand the mechanism of In-Context Learning (ICL), and addressing the issue of overly long demonstrations by summarizing them into a vector. The main challenge is the drop in accuracy. The paper proposes new methods to enhance compressed vectors, using ideas of ensemble averaging, momentum (though details are insufficient), and Divide-and-Conquer aggregation. Many ablation experiments are conducted to prove the effectiveness and answer different research questions.

Strengths

  1. The paper conducts a large number of ablation experiments.
  2. The methods proposed are quite novel, including the heuristic approaches and Divide-and-Conquer aggregation.
  3. In practice, the methods perform very well on certain tasks.
  4. The work provides new perspectives to better understand the mechanism of ICL.

Weaknesses

  1. The development is still in its early stages, mostly heuristic, lacking any theoretical guarantees. However, I think this is not a critical issue, as the final performance is good in practice for certain tasks, and a rigorous analysis is very hard due to the complexity of transformers.
  2. The presentation needs improvement; e.g., Figure 4 should have a color legend for different positions, and the specific parameter $L$ used should be clarified in Section 5.1.
  3. The strict definition of the momentum gradient optimization algorithm used needs to be clearly stated.

Questions

  1. What is the strict definition of the momentum gradient optimization algorithm used as opt? Is it Polyak Momentum or Nesterov momentum?
  2. In the code, what do FixOptimizer, OneStepOptimizer, and FixOneStepOptimizer represent respectively? Which one represents momentum?
  3. For Divide-and-Conquer aggregation, how does the performance compare to directly averaging the different group-specific state vectors?

Limitations

  1. The presentation needs improvement.
  2. It is worth considering larger models, e.g., Llama-2-70B. But the existing experiments on Llama-2-7B, Llama-2-13B, and GPT-J-6B are strong enough.
  3. It is worth discussing the performance on more complex ICL tasks, such as math and text summarization.
Author Response

Dear Reviewer pJQh,

Thank you for your review. According to the feedback from you and other reviewers, we have conducted additional experiments and analysis. We have uploaded a Rebuttal PDF that contains new figures and tables. We provide the analysis in General Response (shown in a separate comment on this page).

We would like to address your concerns and questions in detail below.


Concern 1: Limitation of Model Size

Thank you for your valuable feedback and for highlighting the potential limitations related to the model sizes.

In response to your concern, we have conducted additional experiments using the larger Llama-2-70B model to evaluate the effectiveness of the proposed optimization method and the D&C aggregation. Please refer to the "Experiment on Llama-2-70B" section of our General Response for detailed experiments and analysis. Notably, while the performance of ICL improves with the larger Llama-2-70B, our method still effectively enhances the state vector and outperforms regular ICL in the few-shot setting.

We appreciate your feedback and hope these additional results address your concerns.

Concern 2: Limitation of Datasets

Thank you for your feedback and for highlighting the potential limitations associated with the tasks.

To explore the effectiveness of state vectors on more complex tasks, we conducted experiments on the alignment task, which involves aligning LLMs with human preferences and is inherently more complex and diverse. Please refer to the "State Vector for Alignment" section of our General Response for detailed experiments and analysis.

Notably, our state vectors can achieve 85% of the performance of mainstream alignment-tuning methods (i.e., instruction-finetuning and reinforcement learning from human feedback) without training. With Mistral-7B and Llama-2-70B, our state vectors can achieve 80% of the alignment performance of GPT-4-0613.

We hope these additional experiments provide a thorough understanding of our method's capabilities on complex tasks.

Question 1: Definition of Momentum Gradient Optimization

Thank you for your valuable feedback and for raising an important point regarding the strict definition of the momentum gradient optimization algorithm used in our study.

In this paper, we employed the Polyak Momentum Optimization algorithm. To ensure clarity and enhance the reader's understanding, we provide below the update rules for all optimization algorithms used in Table 2, corresponding to the $opt(*)$ function described in Equation 9 of the paper.

To maintain consistency with the notation used in the paper, in the following formulas, $E_i$ denotes the $i$-th ($1 \le i \le N$) state vector (we ignore the hyper-parameter $L$ for simplicity), and $N$ is the number of state vectors. Additionally, let $v_t$ represent the update vector at the $t$-th iteration, initialized as $v_0 = \boldsymbol{0}$. Here, $\alpha$ denotes the learning rate, $\beta$ denotes the momentum coefficient, and $\epsilon$ is a very small constant for numerical stability. Below are the detailed update rules for each optimizer:

1. Polyak Momentum Optimizer (mom.)

$g_t = E_t - E_{t-1}$

$v_t = \beta v_{t-1} + (1-\beta) g_t$

2. AdaGrad Optimizer (adag.)

$g_t = E_t - E_{t-1}$

$s_t = s_{t-1} + g_t \cdot g_t$

$v_t = \frac{1}{\sqrt{s_t + \epsilon}} \cdot g_t$

3. RMSprop Optimizer (rms.)

$g_t = E_t - E_{t-1}$

$s_t = \beta s_{t-1} + (1-\beta) g_t \cdot g_t$

$v_t = \frac{1}{\sqrt{s_t + \epsilon}} \cdot g_t$

4. Adam Optimizer (adam.)

$g_t = E_t - E_{t-1}$

$s_t = \beta_1 s_{t-1} + (1-\beta_1) g_t \cdot g_t$

$h_t = \beta_2 h_{t-1} + (1-\beta_2) g_t$

$\hat{s}_t = \frac{s_t}{1-\beta_1^t}$

$\hat{h}_t = \frac{h_t}{1-\beta_2^t}$

$v_t = \frac{1}{\sqrt{\hat{s}_t + \epsilon}} \cdot \hat{h}_t$

After $N$ iterations, we use the final update vector $v_N$ to optimize the state vector:

$opt([E_i]_{i=1}^N) = \alpha * v_N$
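For concreteness, here is a minimal NumPy sketch of the Polyak momentum rule above (a sketch under the stated notation, not the authors' implementation; treating the state vectors as plain arrays is an assumption):

```python
import numpy as np

def momentum_opt(E, alpha=1.0, beta=0.9):
    """E: list of N state vectors E_1..E_N as equal-shape arrays.
    Returns alpha * v_N following the update rules above."""
    v = np.zeros_like(E[0])                # v_0 = 0
    for t in range(1, len(E)):
        g = E[t] - E[t - 1]                # g_t = E_t - E_{t-1}
        v = beta * v + (1 - beta) * g      # Polyak momentum update
    return alpha * v                       # opt([E_i]) = alpha * v_N
```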

We hope this clarifies the strict definition and update rules for the momentum gradient optimization algorithm used in our study. Thank you once again for your insightful question.

Question 2: Comparison between D&C and Averaging Aggregation

In our aggregation experiment, we compared our D&C aggregation with the average aggregation method. The average aggregation method involves averaging the group-specific state vectors. As shown in Figures 2 and 6, although average aggregation benefits from an increasing number of examples, D&C aggregation outperforms average aggregation when the number of examples is the same. This is primarily due to the fact that D&C aggregation better captures the information in the group-specific state vector, leading to more robust and better performance.
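For reference, the averaging aggregation described above amounts to an element-wise mean over the group-specific state vectors (a one-line sketch, assuming the vectors are available as NumPy arrays in a hypothetical `group_vectors` list):

```python
import numpy as np

# Averaging aggregation: element-wise mean of the group-specific state vectors.
avg_state_vector = np.mean(np.stack(group_vectors), axis=0)
```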

Thank you for your question. We hope this comparison adequately addresses your question.

Question 3: Optimization in Code

In the code, FixOptimizer denotes the inner optimization. OneStepOptimizer denotes direct momentum optimization on the original state vector, which we did not apply. FixOneStepOptimizer denotes momentum optimization on the inner optimized state vector, and this represents the "momentum optimization" mentioned in the paper.

Suggestions on Presentation

We will revise Figure 4 to include a color legend for the different positions to enhance its clarity. The specific parameters used in the paper are provided in the "implementation details" in Appendix A. We appreciate your suggestions and hope these address your concerns.


Thank you very much again for your great questions and suggestions. Please let us know if you have any further questions, as we are happy to continue the discussion. If you find that our response addresses your concerns, would you kindly consider raising your rating score for our paper? We greatly appreciate your consideration.

Comment

Thank you for your detailed response. It answers most of my questions.

For Divide-and-Conquer aggregation, how does the performance compare to directly averaging the different group-specific state vectors?

My question is not about average aggregation, but about averaging the group-specific state vectors. My motivation is that the aggregation of group-specific state vectors in this paper seems non-trivial to me, and I was wondering what the performance would be if a naive average aggregation of group-specific state vectors were used. I understand this is a new algorithm and implementing and evaluating it comes with cost, and I think this new experiment is not necessary. Existing experiments are strong enough.

It is worth discussing the performance on more complex ICL tasks, such as math and text summarization

we conducted experiments on the alignment task, which involves aligning LLMs with human preferences and is inherently more complex and diverse

Thanks for the additional experiment. I think one fundamental question in this field that is still unclear is why substituting the state vector works. A relevant question is what information can be stored in a state vector. Answering these questions would be far beyond the scope of this paper. But given these open questions, it would be beneficial to provide more evidence on the state vector method on tasks of different complexity, even if the state vector method turns out to perform badly in some cases. I suspect that on more complex tasks, the information stored in state vectors would not be sufficient and performance improvements would diminish. This motivated me to ask the question about more complex ICL tasks. But again, this is not necessary, and I think the existing experiments are strong enough.

Based on your promise to improve the presentation, I will increase my score.

Comment

For Divide-and-Conquer aggregation, how does the performance compare to directly averaging the different group-specific state vectors?

Thank you for your question. We apologize for the earlier misunderstanding and value the chance to further discuss the naive average aggregation of state vectors that you have mentioned. However, we are slightly confused, as our understanding is that the averaging aggregation state vector proposed in our work indeed uses the average of group-specific state vectors. We believe this is equivalent to the naive average aggregation of group-specific state vectors that you referred to.

Since the naive averaging algorithm is similar to our Inner Optimization approach, we conducted an additional experiment to compare the performance of the averaging state vector with that of the inner optimized state vector. For the naive averaging algorithm, we used a total of 100 examples, with 10 examples per group. The results for the naive averaging of group-specific state vectors under this setting are provided below.

| | Antonym | English-French | Person-Instrument |
| --- | --- | --- | --- |
| inner (zero-shot) | 61.0±1.0 | 66.5±1.0 | 67.4±2.6 |
| average (zero-shot) | 59.3±1.4 | 67.1±0.8 | 66.7±2.2 |
| inner (few-shot) | 66.2±1.6 | 74.6±0.9 | 70.1±4.3 |
| average (few-shot) | 65.7±1.1 | 75.1±1.5 | 70.5±3.1 |

Our observations indicate that both methods enhance the state vector in a similar manner, leading to comparable improvements in performance and robustness. However, in terms of efficiency, our Inner Optimization approach holds an advantage as it requires only a single forward pass and fewer examples.

We hope our response addresses your question. If we have misunderstood your proposed algorithm, we kindly ask you to provide a more detailed description of the difference between your proposed algorithm and our averaging aggregation, as well as specify the experiment results you are interested in. We would be more than happy to continue this discussion with you.


It is worth discussing the performance on more complex ICL tasks, such as math and text summarization

we conducted experiments on the alignment task, which involves aligning LLMs with human preferences and is inherently more complex and diverse

We sincerely appreciate your thoughtful feedback and your emphasis on the importance of the state vector method in more complex tasks. While our current work focuses on explaining the working mechanism and applying it to several basic tasks, we fully recognize the significance of exploring its application in more complex scenarios. We plan to investigate the performance and applicability of state vectors in these more challenging tasks in our future work.


Thank you so much for raising the score and your very supportive comments on our paper! We will revise the paper according to your suggestions and comments. Please let us know if you have any further questions, as we are happy to continue the discussion.

Author Response

Dear Reviewers,

We greatly appreciate your insightful reviews and are delighted that you have acknowledged our paper's strengths. We briefly summarize them as follows:

  • Novelty: "The methods proposed are quite novel.", "The research topic is significant and intriguing.", "The paper presents an interesting concept.", etc.
  • Deep Analysis and Abundant Experiment: "The paper presents an in-depth analysis and optimization of ICL.", "The paper conducts a large number of ablation experiments.", etc.
  • Writing: "The paper is well written and easy to follow.", "The presentation of the research question is clear to read and follow.", "The paper is well-organized.".
  • Fully Motivated: "The proposed optimization methods are reasonable.", "the proposed approach of optimizing the state vector using inner/momentum optimization is well-motivated.", etc
  • Effectiveness: "The proposed two optimization methods for the state vector seem to be effective.", "The optimization method has shown promising improvement.", etc

We present additional experiments on a larger model and more complex datasets. Please check out the Rebuttal PDF. Brief analyses and findings follow:

Experiment on Llama-2-70B

We provide the optimization and aggregation results on the Llama-2-70B model. Due to resource constraints, we are unable to apply our method to even larger models. The results of the optimization method are presented in Table 7, and the aggregation method results are shown in Figure 10. We find that, compared to smaller models, all results improve when applied to Llama-2-70B. Furthermore, both the inner optimization and momentum optimization effectively enhance the state vector, either outperforming or being comparable to regular ICL in zero-shot and few-shot settings. For the aggregation results, the trends observed in smaller models remain evident in the larger model, with the D&C method still outperforming the averaging aggregation in multiple-example settings. These results indicate that our inner and momentum optimization, as well as the D&C aggregation method, can also benefit the state vector on the larger Llama-2-70B model.

State Vector for Alignment

We present the performance of the inner optimized state vector on the alignment task in a zero-shot setting. We evaluate our method using two automatic alignment benchmarks: alpaca-eval 2.0 [1] and just-eval [2]. The results shown in Table 8 indicate that although our state vector is slightly inferior to the regular ICL baseline, it still demonstrates significant potential as an effective alignment approach. By omitting the demonstration in the input, our method significantly reduces the time cost (e.g., by 5.4× on Llama-2-7B) while achieving 90% of the performance of regular ICL. Compared to mainstream alignment-tuning methods (i.e., instruction fine-tuning and reinforcement learning from human feedback), our method achieves an average of 85% of their performance without requiring any training. Notably, with Mistral-7B and Llama-2-70B, our state vectors can achieve 80% of the alignment performance of GPT-4-0613. These results demonstrate that state vectors can enable efficient and effective model alignment, and that inner optimization is beneficial for complex alignment tasks.

[1] Length-controlled alpacaeval: A simple way to debias automatic evaluators.

[2] The unlocking spell on base llms: Rethinking alignment via in-context learning.


We sincerely thank the reviewers for their constructive suggestions and questions to enhance our paper. We will address your questions and concerns below and incorporate your valuable suggestions into our revisions. Please reply if you have any further questions, and we will be more than happy to continue the discussion.

Final Decision

Summary:

The paper explores compressed vectors to replace explicit demonstrations in In-Context Learning (ICL), aiming to improve understanding and address the issue of lengthy demonstrations. It introduces novel methods like ensemble averaging, momentum, and Divide-and-Conquer aggregation, supported by extensive ablation experiments.

Strengths:

  • Innovative methods for enhancing compressed vectors in ICL.
  • Extensive ablation studies provide valuable insights.
  • Strong practical performance on certain tasks.

Weaknesses:

  • Methods are largely heuristic and lack theoretical guarantees.
  • Presentation needs improvement, with clearer explanations of key details.

Reason to Accept: The paper's novel approaches and solid experimental validation offer valuable contributions to ICL, making it a worthwhile acceptance despite some early-stage development and presentation issues.