PaperHub
Overall score: 5.7/10 · Decision: Rejected · 3 reviewers
Ratings: 3, 6, 8 (min 3, max 8, std 2.1) · Average confidence: 3.3
ICLR 2024

Enhancing Length Extrapolation in Sequential Models with Pointer-Augmented Neural Memory

OpenReview · PDF
Submitted: 2023-09-22 · Updated: 2024-02-11
TL;DR

Motivated by pointers in computer programming, we present a new neural memory model (PANM) capable of manipulating pointers to learn symbol-processing rules, thereby achieving better systematic generalization to longer sequences.

Abstract

Keywords
memory-augmented neural network, generalization, sequence learning, indirection

Reviews and Discussion

Official Review (Rating: 3)
  • First, I want to highlight that I served as a reviewer for this manuscript during its submission to NeurIPS 2023. It is noticeable that several amendments, which were assured to the four reviewers, are absent in the current version presented at ICLR 2024. I strongly suggest that the authors revise the manuscript as promised, either in a camera-ready version if accepted, or in the next submission or an arXiv version if rejected this time.

In this paper, the capabilities of the "Pointer-Augmented Neural Memory (PANM)" system are extensively explored, demonstrating its proficiency in managing long data sequences through the integration of an external memory that operates on physical addressing. The system significantly enhances the performance of RNNs/Transformers in reasoning tasks. PANM achieves this by employing pointers for memory content retrieval and mastering a variety of operations via comprehensive end-to-end training, as is validated by experimental results.

Strengths

  1. This paper introduces a novel method for symbolic processing in neural sequence modeling, utilizing "physical pointers" for enhanced generalization to longer sequences and manipulating pointers for symbolic processes. This versatile plug-and-play module seamlessly integrates with various neural models.

  2. The innovative approach of employing a pointer to govern memory decoding stands out, with even a straightforward GRU model yielding impressive performance, showcasing the method's effectiveness and potential applications.

Weaknesses

  1. The paper's limited evaluation tasks hinder its research impact, despite effectively enhancing Transformers and LLMs with symbolic reasoning. Its performance in broader applications is unconvincing, focusing mainly on QA tasks already validated by other networks. A wider range of NLU and NLG tasks is needed to prove its versatility and value.

  2. The baselines used for comparison are outdated, and large language models (LLMs) with over 10B parameters could outperform the proposed method on reasoning tasks. The paper should either compare the method with these LLMs or show how PANM enhances LLMs' reasoning abilities.

  3. The paper downplays the similarities between PANM and the Transformer, particularly in terms of positional encodings and attention computations. The current notation makes these connections unclear, and the paper should work to make them more explicit. By presenting PANM as a modification of the Transformer, rather than something entirely different, the distinctions and innovations of PANM could be more easily understood. The choice of notation, such as using h and g instead of query and key, further obscures the connection, and should be addressed to enhance clarity.

Questions

  1. Why not conduct an experiment on an LLM? I do think some billion-parameter networks could be used in a normal research lab, especially with advanced packages such as DeepSpeed, etc.
Comment

Thank you for your insightful review. We address your concerns and questions point by point below.

Summary

"First, I want to highlight ...". Could you specify the amendment you expected to see in our paper? We will try to address them if they are appropriate.

Weaknesses

  1. Thank you for sharing your opinions. We appreciate your insights and agree that expanding the evaluation to a wider range of NLU and NLG applications would be beneficial. However, we do not think that the lack of wider NLP evaluations is a good reason to reject a paper that focuses on improving the symbol-processing capability of general deep learning models. Generalization to longer sequences is a crucial aspect with far-reaching implications in various AI applications, particularly in addressing a core limitation of current deep learning architectures. While we acknowledge the importance of broader evaluations, we believe that our paper makes a meaningful contribution by emphasizing and advancing the field's understanding of symbol processing and length extrapolation within the context of general deep learning models. Furthermore, we included experiments on question answering and machine translation to demonstrate the broad applicability of our approach. In total, we have conducted experiments on 4 major categories of tasks with 12 different datasets and compared our method with 16 baselines, where our method consistently outperforms the other models. Therefore, we believe that our evaluation is convincing and comprehensive.
  2. As explained in the Baseline Choice section of Appendix D, our baselines are competitive with the SOTA on the considered tasks. Regarding your mention of LLMs, could you clarify the setting in which an LLM with over 10B parameters could outperform our method? If the LLM is not finetuned on the task, we have shown that ChatGPT with 175B parameters still fails on our tasks and can be considered to underperform our method (Appendix D6). If the setting is to finetune an LLM on the task, we would need to conduct experiments to confirm that the finetuned LLM could outperform our PANM+LLM, and this is not trivial because finetuning a 10B-parameter LLM requires huge computing resources.
  3. We devote a paragraph in Appendix A to comparing positional encoding and our physical pointers. Similarly, we highlighted the difference between our attention and the softmax attention used by the Transformer in the introduction and Sec. 2.3.1. In this revision, we have added another table (Appendix A Table 3) to summarize the differences between the Transformer and our method (see below).

| Difference | Transformer | PANM (Ours) |
| --- | --- | --- |
| Key generation | Keys are computed from the input data. Hence, when meeting novel data during testing, the Transformer observes novel keys and cannot work properly. | Our keys are generated as fixed numbers, specifically physical memory addresses. These keys are entirely separate from the data. |
| Extendability to longer sequences | The dimension of the attention weights varies with the input length, making arithmetic transformations on these attention weights infeasible as the sequence length increases. | The fixed nature of our physical addresses allows our pointers to be easily manipulated and extended to longer sequences. |
| Symbol-processing advantages | The use of attention weights as implicit pointers may lack the explicitness needed for effective symbol processing. | Using physical addresses as keys is crucial for symbol processing, as it explicitly allows pointer assignment, dereference, and arithmetic operations. |
| Physical addresses vs. positional encoding | Positional encodings can be generated independently of the data, but they are not separated from the input data as our physical addresses are, and there is no explicit mechanism in the Transformer to attend only to them. | Our physical addresses are detached from the data, supporting the transformation of pointers across timesteps and isolating pointer manipulation from the input. |
Comment

We respectfully disagree with the reviewer's suggestion to present PANM as a modification of the Transformer. PANM is based on the memory-augmented neural network (MANN) framework [Graves, Alex, Greg Wayne, and Ivo Danihelka. "Neural Turing Machines." arXiv preprint arXiv:1410.5401 (2014).], which was introduced even before the Transformer. The notation we use follows the classical notation of MANNs, which was later rebranded in the Transformer paper. In particular, h denotes the hidden state of the GRU (a kind of controller in a MANN), and g is our function that maps the memory content to the same space as h. They are not required to be written as K and V as in the Transformer. That said, we agree that there are similarities between PANM and the Transformer, and we have added detailed comparisons above to clarify the differences, showing the innovation of our approach with respect to not only the Transformer but also other approaches relying on softmax attention. The choice of notation does not hinder the readability of our paper.

Questions

It is not trivial with our resources to finetune an LLM (billions of parameters), and we aim to improve fundamental, general deep neural networks rather than to scale up with LLMs, which can be pursued as separate work. It is crucial to realize that our paper is not an NLP paper; rather, it contributes a fundamental method for the general machine learning community. Please contribute to the discussion if you still insist that finetuning an LLM is required to accept our paper.

Official Review (Rating: 6)

Systematic generalization is the ability of a model to recognize and apply abstract rules to novel contexts that are semantically similar to the training data. Existing models tend to overfit the training data and perform poorly on out-of-distribution samples, such as longer sequences or sequences with novel compositions. The paper discusses the challenge of achieving compositional generalization in neural networks and the proposal of a novel memory architecture called Pointer-Augmented Neural Memory (PANM) to address this issue.

The work proposes a pointer-based mechanism inspired by both the human brain's symbolic processing through variable binding and computer programs' use of pointers to dynamically access data and programs. PANM explicitly models pointers as physical addresses and strictly isolates pointer manipulation from input data. PANM incorporates a memory structure similar to RAM in Von Neumann's architecture, where each memory slot consists of data and address components. The addresses are generated to explicitly simulate physical memory addresses, allowing for better generalization to longer sequences. To manipulate a pointer, the authors introduce an address bank containing physical addresses corresponding to the input sequence and a neural network called the Pointer Unit responsible for transforming pointers based on attention to the address bank. The paper describes two modes of pointer-based access: pointer dereference (Mode-1) and relational access (Mode-2).
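To make the described mechanism concrete, here is a minimal sketch of an address bank plus a GRU-based Pointer Unit, written from this summary rather than from the authors' code: the incremental-binary address codes, the soft attention over the bank, and all names (`address_bank`, `PointerUnit`) and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def address_bank(seq_len: int, addr_bits: int = 10) -> torch.Tensor:
    """Fixed, data-independent keys: slot i gets the binary code of i."""
    codes = [[(i >> b) & 1 for b in range(addr_bits)] for i in range(seq_len)]
    return torch.tensor(codes, dtype=torch.float32)            # (seq_len, addr_bits)

class PointerUnit(nn.Module):
    """Recurrently transforms a pointer by attending over the address bank."""
    def __init__(self, addr_bits: int = 10, hidden: int = 64):
        super().__init__()
        self.gru = nn.GRUCell(addr_bits, hidden)
        self.query = nn.Linear(hidden, addr_bits)

    def forward(self, pointer, state, addresses):
        # pointer: (batch, addr_bits) current soft pointer; addresses: (seq_len, addr_bits)
        state = self.gru(pointer, state)                               # update pointer state
        attn = F.softmax(self.query(state) @ addresses.t(), dim=-1)    # (batch, seq_len)
        return attn @ addresses, attn, state                           # next pointer, weights, state

# Mode-1-style dereference: the attention over fixed address keys selects a slot,
# and the data stored there is read out.
seq_len, data_dim, batch = 12, 32, 2
data = torch.randn(seq_len, data_dim)                                  # encoded input sequence
addr = address_bank(seq_len)
unit = PointerUnit()
pointer = addr[:1].repeat(batch, 1)                                    # start at slot 0
state = torch.zeros(batch, 64)
pointer, attn, state = unit(pointer, state, addr)
content = attn @ data                                                  # *p: soft memory read
```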

The authors demonstrate the effectiveness of PANM in symbol-processing domains, including algorithms and context-free grammars, where it works seamlessly with LSTM and StackRNN. PANM is also applied to improve generalization in Transformer models on tasks such as compositional learning, SCAN, and mathematics datasets. The paper concludes by emphasizing the importance of integrating fundamental generalizing principles into existing deep learning models to address their limitations in systematic generalization.

Strengths

The paper introduces a novel memory architecture, Pointer-Augmented Neural Memory (PANM), which explicitly models pointers as physical addresses. This architecture provides a new approach to enhance compositional generalization in neural networks, addressing a significant challenge in deep learning.

The method demonstrates improved generalization, especially on symbol-processing tasks such as algorithms, context-free grammars, and compositional learning. It enhances the model's ability to work with longer sequences and compositions.

The model provides two explicit access modes, covering both simple and complex cases, and the paper provides an analysis of them.

The paper showcases the versatility of PANM by applying it to different tasks, including question-answering and machine translation.

Weaknesses

While PANM is proposed as a solution to compositional generalization, the paper does not provide a comprehensive comparison with existing memory architectures or models designed for similar purposes. It would be helpful to assess its performance against competing approaches.

The paper focuses on demonstrating the improvements achieved by PANM in specific tasks, such as algorithms and instruction generalization, but it is unknown if the method could be extended to practical adoption in natural language domains.

More related work on systematic generalization or compositional generalization from language or multimodal perspectives should be discussed, e.g.:
  • Compositional generalization through meta sequence-to-sequence learning (2019)
  • A systematic assessment of syntactic generalization in neural language models (2020)
  • Analogical reasoning for visually grounded language acquisition (2020)
  • Generalization in multimodal language learning from simulation (2021)

Questions

Refer to above comments

Details of Ethics Concerns

NA

Comment

Thank you for your valuable feedback. We address your concerns and questions point by point below.

  • "While PANM is proposed ...". As explained in Appendix D's Baseline choice section, we have compared with competitive and SOTA baselines in the task we considered. In particular, we have compared our method with 16 baselines across tasks. Most of them are memory-based models and designed for symbolic generalization. We note that, as explained in Related Work's second paragraph, we have not compared our model with task-specific methods using reinforcement learning or neural-symbolic architecture. Our focus is improving the systematic generalization of fundamental differentiable models.
  • "The paper focuses on ...". The reason we focus more on symbolic processing tasks is that these tasks remain very challenging for current deep neural networks and urgently demand new approaches like our solution. That said, as shown in Sec 3.4, we have tested our methods on practical NLP tasks such as Question Answering (QA) and Machine Translation (MT). The significant gains, ranging from 1-6% in QA and 5-40 perplexity scores in MT, show the potential of extending our method to practical NLP domains.
  • "More related work for ...". Thank you for your suggestions on the related works. We have incorporated the below discussion in our section "More Discussion on Related Works" (Appendix A). Among your suggested papers, the last two works on reasoning and generalization using image inputs, which is beyond the scope of our paper only targeting generalization for longer sequences of text/discrete inputs. The other two papers are more relevant. The approach by Lake et al., 2019 addresses systematic generalization with meta-learning training while our method adopts standard supervised training. The two approaches are complementary rather than competing. By focusing on improving the model architecture rather than the training procedures, our method can be used in various settings, not just SCAN tasks. The study conducted by Hu et al., 2020 targets a different problem-syntactic generalization, while our paper focuses on length extrapolation with different benchmarks. Some baselines such as LSTM and Transformer examined in the paper have been chosen as baselines in our paper.
Official Review (Rating: 8)

The authors propose an external memory module called Pointer-Augmented Neural Memory (PANM), leveraging two observations: 1. from neuroscience, that the brain performs symbol processing using neural pointers and variable binding (so, maybe, it is the architecture that needs an update), and 2. that computers perform well at symbol processing. PANM is made up of three sub-modules: 1. memory slots, each with data and an associated address, which the authors represent using incremental binary numbers; they claim this leads to better generalization to longer sequences over approaches that rely on softmax-based attention for addressing; 2. a GRU-based pointer unit for pointer operations; 3. a controller, which can produce outputs based on the pointer values and (optionally) the input. The aim is to improve the generalization abilities of conventional sequential models, and this is evaluated on mathematics datasets, more realistic question answering, and machine translation tasks.
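As a small illustration of the length-extrapolation claim (using the same assumed incremental-binary addressing as the earlier sketch, not necessarily the authors' exact scheme): the address codes assigned at training lengths reappear unchanged as the prefix of a longer test sequence's address bank, so the keys a learned pointer refers to do not drift with unseen data.

```python
def binary_addresses(n: int, bits: int = 10):
    """Incremental binary codes used as physical addresses (illustrative; at most 2**bits slots)."""
    return [tuple((i >> b) & 1 for b in range(bits)) for i in range(n)]

train_addrs = binary_addresses(10)   # address bank for a training-length sequence
test_addrs = binary_addresses(50)    # address bank for a longer, unseen test length

# The keys for the first 10 slots are identical, so pointer manipulations learned on
# short sequences still refer to the same keys at test time; only new slots get new codes.
assert train_addrs == test_addrs[:10]
```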

Strengths

  • Using explicit pointers made of addresses can by design help with generalization to longer sequences by being independent of the sequence length.
  • Experimental: the generalisation results on algorithmic reasoning are stronger than other benchmarks by a huge margin of 10-20%.
  • The paper is extremely readable and well-presented.

Weaknesses

  • I didn’t understand the different choices of baselines for different tasks. For question answering tasks— why not consider models such as the Universal Transformer, which perform as well as the PANM-augmented Transformer on SCAN and also perform well on question answering? Is the point here to show that PANM also helps Transformers on NLP tasks? Or, can a broader claim be made that PANM can also perform better than existing transformer based approaches that perform well on the considered NLP tasks?
  • How many parameters does PANM add? The authors state that DNC and NTM have ~1.2 million parameters, but I think I am missing something: I cannot see the parameter count for PANM.
  • Say I am a practitioner and I want to decide if I want to choose PANM or U. TRM— looking at the performance of both the models, it is not obvious to me why I should prefer PANM over U. TRM since U. TRM performs very similarly to PANM on the tasks considered— could you highlight its advantages a bit more strongly over U. TRM, not necessarily in an experimental sense, but intuitively speaking what the benefits/non-benefits of an explicit pointer based memory module system are vis-a-vis a generalised transformer model design wise?
  • The comparison with Chat-GPT on algorithmic tasks is interesting— in Table 3, what would the failure examples of PANM look like, as are shown for Chat-GPT? Also, I may be missing something— but are there any comparison tables on the other tasks apart from the couple of examples— mathematics, and SCAN? You can just point me to them, or state why looking at them would or would not be relevant.

It is quite possible that I didn't understand or missed some things, and I am willing to adjust my score accordingly on the resolution of these doubts.

Questions

  • Just a small suggestion, but it could be helpful to see what p_t^a, *p_t^a, and z_t correspond to in the input sequence in Figure 1 as an example.
  • Possible typo in the first line of section 2.3.1— PANC instead of PANM.
  • The authors state that to use PANM in a plug-and-play fashion with an encoder-decoder based architecture, one can reduce the number of parameters in the decoder to reduce the overall number of parameters, which will increase by introducing PANM. It is not obvious to me what effects this will have— could you please elaborate on it? And what does it mean that the parameter counts of PANM and the Transformer baseline are kept the same when PANM is added as an external module— does this happen by cutting down on the decoder parameters? If so, what parameters are cut down? Let me know if this is a standard practice.
  • Also, is there a name for the combined module consisting of the address bank and the data memory?

Details of Ethics Concerns

No ethics concerns.

Comment

Thank you for your positive review. We address your concerns and questions point by point below.

Weaknesses

  • "I didn’t understand ...". Firstly, we would like to correct that our method outperforms Universal Transformer. Even with Relational Position Encoding, Universal Transformer (UT) falls short of our PANM in SCAN's most challenging scenarios. Specifically, when training lengths are set at 22, 24, and 25, PANM exhibits performance advantages of 2%, 35%, and 29%, respectively. Similarly, in more demanding mathematical tasks, PANM outshines UT+RPE by 11%. I (Universal Transformer without RPE is even worse). In the QA task, we opted against using UT as the backbone primarily due to its slow speed resulting from recurrent computation. Additionally, in our bAbI setting, Transformer and UT+RPE demonstrate minimal performance differences (see our revised Appendix Table 12), thus, we chose to integrate PANM with Transformer. In the SQUAD task, it is well known that BERT serves as a much stronger baseline compared to Universal Transformer. The successful enhancement of a potent baseline like BERT with PANM substantiates the practical efficacy of our approach. Our objective aligns closely with your second hypothesis: to demonstrate that PANM "can also perform better than existing transformer based approaches that perform well on the considered NLP tasks"
  • "How many parameters ...". It depends on the task and choice of the Controller. Usually, PANM's module's size is similar to the size of an attention-layer in Transformer. We report PANM model size in Appendix D1, the final sentence of the fifth paragraph. In the considering task, the whole architecture with PANM has ≈1.1 million parameters.
  • "Say I am a ...". As clarified earlier, PANM outperforms U. TRM (with and without RPE) in the toughest generalization testbed. We will clarify the benefit of our approach vs Transformer: | Difference | Transformer | PANM (Our) | | ----------- | ----------- |-----------| | Key Generation | Keys are computed based on input data. Hence, when meeting novel data during testing, Transformer will observe novel keys, and cannot work properly. | The keys in our approach are generated as fixed numbers, specifically physical memory addresses. These keys are entirely separate from the data.| |Extendable to Longer Sequences|The dimension of attention weights varies with input length, making arithmetic transformations on these attention weights infeasible as the sequence length increases.|The fixed nature of our physical addresses allows our pointers to be easily manipulated and extendable to longer sequences.| |Symbol Processing Advantages| The use of attention weights as implicit pointers may lack the explicitness needed for effective symbol processing.| Using physical addresses as keys in our approach is crucial for symbol processing as it explicitly allows pointer assignment, dereference, and arithmetic operations.| |Physical Address vs Positional Encoding| Positional encoding can be generated independently from data. However, they are not separated from the input data as our physical addresses. There is no explicit mechanism in Transformer to attend only to these | Our physical addresses are detached from the data, supporting the transformation of pointers through timesteps and isolating pointer manipulation from the input.|
  • "The comparison with Chat-GPT ...." PANM failure will focus on the end of the sequence. When sequence length increases, more characters at the end of the output sequence are likely wrong. For example, in the reverse task, input 123456789, true output 987654321, PANM output can be 987654322. The reason for this failure pattern is that PANM uses a Pointer Unit to recurrently transform the pointer across timestep. As the transformation is not always perfect, there could be error accumulation as the number of timesteps increases. Looking at Fig 2c as an example, attention errors (red) become clearer at the end of the decoding. Although in this example, the majority of attention is still correct, pointing to the right address. As the test length increases, these errors will amplify and possibly make the final prediction incorrect. We explained the reason for not carefully testing ChatGPT on other tasks in Appendix D6's first paragraph. There are two main reasons: (1) the task input is not suitable for text representation. (2) the test data is public and can be seen by ChatGPT. Another reason is the cost of systematically evaluating ChatGPT on these tasks: we have to manually extract the relevant answer from ChatGPT's output to have a reliable evaluation.
Comment

Questions

  • "Just a small ...". Thank you for your suggestion. We have revised the figure to connect the p_t^a, *p_t^a to the input sequence as you suggested. z_t is the decoding input, not part of the input sequence.
  • "Possible typo ...". Thank you for pointing out the typo. We have fixed it.
  • "The authors state ...". We explained it in Appendix D3. We will provide more information here. PANM module can be viewed as one neural network layer. When plugging it into any encoder-decoder architecture, we just need to replace the last layer of the decoder with PANM to create a new model with a similar parameter count. For example, a Transformer with 3-layer attention encoder and 3-layer attention decoder could become Transformer+PANM consisting of the same 3-layer encoder and 2-layer attention decoder + PANM layer (PANM Memory+PANM Controller see Appendix Fig. 3). In this case, the last layer's parameters of the decoder are cut down. For small backbones and small datasets, it is a common practice to keep the baselines' model size the same to have a fair comparison. For a bigger backbone like BERT (>100 million parameters), adding PANM layer does not create a major change to the model size.
  • "Also, is their ...". We have not named the module consisting of the address bank and the data memory. We can call {the address bank, data memory and pointer unit} PANM memory (Fig 1a,b,c and Appendix Fig. 3).
Comment

Thanks for the clarifying comments on my questions and considering the minor edits. I have more clarity on the implementation of the PANM module now.

Comment

We thank the reviewers for their valuable feedback. We are glad that all reviewers appreciate the novelty of our approach. We will address your concerns and questions in individual responses. Here, we summarize the changes in this revision:

  • We have fixed typos and writing errors
  • We have improved Figure 1 as Reviewer fEJf suggested
  • We have added discussions of related papers as Reviewer 2DCJ suggested
  • We have added a discussion of our method vs. the Transformer to address Reviewer fEJf's and VDhU's concerns.
AC Meta-Review

The paper introduces the Pointer-Augmented Neural Memory (PANM) model, designed to enhance the generalization capabilities of neural networks for longer sequence processing. The experiments demonstrate that PANM effectively improves generalization accuracy.

The reviewers acknowledge the novelty of this method and the good empirical results in the presented experiments. However, the reviewers also have reservations about the evaluation methodology and scope. In particular, the experiments on natural language tasks are rather small-scale and compare PANM with only one standard Transformer baseline. Thus, it is not clear whether PANM would benefit real-world natural language applications. Besides, the authors should also include more comparisons of PANM with different LLMs and advanced reasoning techniques. Overall, the experiments are insufficient for a comprehensive evaluation.

Overall, this paper is a borderline case. We recommend rejection, but if there is a lack of enough strong papers in this or other areas, we are okay with flipping the decision to an acceptance.

Why Not a Higher Score

The paper, while presenting a novel approach, does not provide a comprehensive evaluation across a sufficient range of tasks and strong baselines. The limited scope of evaluation and the lack of detailed comparison with state-of-the-art models are major drawbacks.

Why Not a Lower Score

N/A

Final Decision

Reject