PaperHub
Rating: 6.0/10 · Poster · 3 reviewers (scores 6 / 6 / 6; min 6, max 6, std 0.0)
Confidence: 2.7 · Correctness: 3.3 · Contribution: 3.0 · Presentation: 3.0
NeurIPS 2024

Open-Book Neural Algorithmic Reasoning

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2024-11-06

Keywords
Neural Algorithmic Reasoning · Graph Neural Network

Reviews and Discussion

Official Review (Rating: 6)

This paper proposes an open-book learning framework that allows networks to utilize the entire training dataset during reasoning, significantly enhancing performance on the CLRS Algorithmic Reasoning Benchmark and revealing intrinsic connections between different tasks through an attention-based mechanism.

Strengths

Innovation: The use of training datasets to enhance algorithmic reasoning tasks is novel.

Workload: The research effort is substantial, and the article provides sufficient experimental support.

Writing quality: Your paper is well-written, with clear and precise language and a smooth flow of ideas. The structure is reasonable, and the logic is sound, making it very enjoyable to read.

Experimental analysis: The experimental analysis is rigorous; the experimental design is reasonable and the analysis methods are scientifically reliable.

Weaknesses

  1. Introducing the additional memory seems to have a large storage overhead if the training set is large.
  2. The authors' proposal is similar to retrieval-augmented generation (RAG) in NLP, and I hope this connection can be discussed.

Questions

What are the implications if the testing phase is supported by unseen datasets?

Limitations

The limitations of the study and the possible negative social impact have been well documented by the authors.

Author Response

We thank the reviewer for the positive comments. The following addresses the questions and concerns one by one.

W1: Introducing the additional memory seems to have a large storage overhead if the training set is large.

A: The proposed open-book framework is flexible enough to adapt to different computational environments. As the number of auxiliary data points decreases, the memory required by the framework also decreases; when the number of auxiliary data points is 0, the memory requirement of the framework is the same as that of the original methods. In our experimental setting, the memory required increased by an average of 30% compared to the original methods.

W2: The proposal is similar to retrieval-augmented generation (RAG) in NLP, and this connection should be discussed.

A: Thanks for pointing this out. The proposed framework shares some similarities with the technique of retrieval-augmented generation in LLMs. Both approaches involve augmenting the final step of model generation with additional information to aid the output. Although the high-level concept is similar, our open-book framework targets smaller models and opens the black box of these models, resulting in some technical differences compared to LLMs. We'll add a discussion on this point in the final version.

Q: What are the implications if the testing phase is supported by unseen datasets?

A: Due to the design of the gate mechanism (see line 19 of Algorithm 1), our method is robust in this setting. When supported by unseen data in the testing phase, the framework can adjust the influence of the auxiliary data points through the gate function, maintaining stable performance.
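For illustration, a minimal sketch of such a gating step (hypothetical module and names, not the paper's Algorithm 1): a learned scalar in [0, 1] is computed from the node state and the retrieved auxiliary message, and scales the auxiliary contribution so that unrelated support data can be suppressed.

```python
import torch
import torch.nn as nn

class AuxiliaryGate(nn.Module):
    """Sketch of a gate that modulates retrieved auxiliary information.

    Hypothetical; illustrates only the general idea that a learned scalar
    in [0, 1] can down-weight unhelpful auxiliary messages.
    """
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, h: torch.Tensor, aux: torch.Tensor) -> torch.Tensor:
        # h:   (num_nodes, hidden_dim) node states from the base processor
        # aux: (num_nodes, hidden_dim) message retrieved from auxiliary data
        g = self.gate(torch.cat([h, aux], dim=-1))  # (num_nodes, 1)
        # If the support data is unrelated to the query (e.g. unseen datasets
        # at test time), g can approach 0 and the model falls back to h.
        return h + g * aux
```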

Official Review (Rating: 6)

This paper presents open-book Neural Algorithmic Reasoning (NAR). The central claim the authors investigate is whether open-book reasoning -- allowing a model to query information from its training set relevant to the current query -- can be unified with existing NAR architectures. In doing so, the authors present a general framework for open-book NAR. The authors find that their framework is more performant on the CLRS benchmark. Furthermore, the authors investigate open-book NAR with multi-task training and find that it achieves similar performance to the current best multi-task NAR algorithm -- exceeding the baseline on some tasks.

Strengths

  • Significance: Neural algorithmic reasoning models have been highly effective in real world use cases [1] and are extremely relevant to the NeurIPS community. This paper introduces an orthogonal direction of improvement over current work in the field. As such, it seems that this paradigm is applicable to any NAR model that follows the Encode - Process - Decode paradigm.
  • Clarity: The manuscript is well-structured and easy to read.

[1]: https://arxiv.org/abs/2108.11482

Weaknesses

Robustness and Generalization: While I understand the logic of why open-book reasoning is relevant to NAR, wouldn't letting the model rely on features learned at train time cause it to overfit to the training set and hurt out-of-domain performance? I'm concerned that, as the distribution shift increases, the efficacy of an open-book NAR model will decrease considerably faster than that of a vanilla NAR model. Concretely, I recommend the authors compare with the algorithm and dataset presented in [3] (one of the papers cited in the introduction). This experiment has the added advantage of lending credence to the generalization claim in L54-57 because, if open-book learning is more performant, it must be extracting "background knowledge" features invariant to distribution shift.

Presently, I'm recommending a Borderline Acceptance. Open book NAR seems to be highly desirable in real world scenarios, but I'm concerned that the performance gains come at the cost of OOD performance, which seems integral to the practical benefits of NAR algorithms. I'm willing to change my recommendation based on future discussion with the authors.

Minor comments:

  • L36: CLRS is definitely a great litmus test for NAR algorithms. I recommend the authors look into SALSA-CLRS [2] as well, where open-book performance might admit better scalability than current baselines.
  • L223: This section will benefit from an explanation of what metrics were used for each use case and why.

[2]: https://arxiv.org/pdf/2309.12253

[3]: https://arxiv.org/pdf/2302.10258


Increasing score to Weak Accept after discussions with the authors.

Questions

(addressed in weaknesses section)

Limitations

The authors have adequately addressed limitations, though I recommend an explicit limitations section in the appendix.

Author Response

We thank the reviewer for the valuable comments. The following addresses the questions and concerns one by one.

W1: Concerns about the OOD performance.

A: The results presented in the paper are exactly the out-of-distribution performance of the proposed framework. As pointed out in Line 42 of the paper, the test instances are substantially larger in scale than those in the training set. For each dataset in the CLRS benchmark, the training set graphs contain approximately 12 nodes on average, while the test set graphs contain 64 nodes and are specifically generated to evaluate the model's OOD performance (as stated by the CLRS creators). To further validate our framework's OOD performance, we include two additional experiments in the PDF attached to the global response. The overall design principle is consistent with that of the SALSA-CLRS dataset. In Figure 2, we fix the training set and increase the test set's graph size; in Figure 3, we fix the test set and vary the training set's graph size. We observe that even when the ratio of test to training graph sizes grows exponentially, the framework's performance across the different datasets changes in a relatively smooth manner.

W2: This section will benefit from an explanation of what metrics were used for each use case and why.

A: Thanks for this comment. In the paper, we follow the standard evaluation metric used in the CLRS benchmark to ensure a fair comparison with previous work. The metric is called the F1 score, which involves dividing the algorithm's output into different types (such as node states, edge states, etc.), calculating the percentage of output elements that match the ground truth for each type, and then taking the average. We’ll add this detail in the final version.
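For concreteness, a minimal sketch of the evaluation just described (hypothetical names; not the CLRS implementation): compute, for each output type, the fraction of elements matching the ground truth, then average over types.

```python
import numpy as np

def clrs_style_score(preds: dict, targets: dict) -> float:
    """Sketch of the metric described above (hypothetical; the CLRS
    implementation differs in details such as masking and pointer types).
    Each dict maps an output type (e.g. "node_state", "edge_state") to an
    array; per-type element-wise accuracy is averaged over types."""
    per_type = []
    for key in targets:
        match = np.asarray(preds[key]) == np.asarray(targets[key])
        per_type.append(match.mean())
    return float(np.mean(per_type))

# Toy usage:
preds = {"node_state": np.array([0, 1, 1, 2]), "edge_state": np.array([1, 0])}
targets = {"node_state": np.array([0, 1, 2, 2]), "edge_state": np.array([1, 0])}
print(clrs_style_score(preds, targets))  # (3/4 + 2/2) / 2 = 0.875
```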

Comment

Thank you for your response. The authors have addressed my main concern with the paper. As such, I've increased the score to Weak Accept.

Official Review (Rating: 6)

The paper proposes a method to use the training dataset more explicitly during test-time inference to improve performance. This is done with a dataset encoder module plus another processor module, named the open-book processor. The authors validate their method in the single- and multi-task setups with good results. Ablations are performed in the multi-task setup to determine which algorithms are most helpful.
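For orientation, a minimal sketch of such a design (all module names and shapes are hypothetical and not taken from the paper; a GRU cell stands in for the actual GNN processor):

```python
import torch
import torch.nn as nn

class OpenBookNAR(nn.Module):
    """Hypothetical sketch: a standard encode-process-decode NAR model plus
    an open-book processor that cross-attends over encodings of training-set
    (auxiliary) examples."""
    def __init__(self, in_dim: int, hidden_dim: int, out_dim: int):
        super().__init__()
        self.encoder = nn.Linear(in_dim, hidden_dim)
        self.processor = nn.GRUCell(hidden_dim, hidden_dim)  # stand-in for a GNN step
        self.open_book = nn.MultiheadAttention(hidden_dim, num_heads=1,
                                               batch_first=True)
        self.decoder = nn.Linear(hidden_dim, out_dim)

    def forward(self, x: torch.Tensor, book: torch.Tensor, steps: int = 3):
        # x:    (num_nodes, in_dim) current query instance
        # book: (num_aux, hidden_dim) encoded auxiliary training examples
        h = self.encoder(x)
        for _ in range(steps):
            h = self.processor(h, h)  # placeholder for message passing
        # Cross-attend from the query's node states to the open book.
        aux, _ = self.open_book(h.unsqueeze(0), book.unsqueeze(0),
                                book.unsqueeze(0))
        return self.decoder(h + aux.squeeze(0))
```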

Strengths

  • The paper is clear and easy to understand.
  • The idea is interesting and well-executed.
  • The evaluation is good and convincing (thank you for the error bars).

Weaknesses

  • The related work is very brief, arguably too brief. For instance, [1,2,3] are missing.
  • There is no ablation for the architectural design decisions.
  • The description of hyper-parameters should be in the Appendix of the paper rather than the reader needing to open a different paper to find them. This would help the paper stand on its own.
  • It would be good to re-iterate the metric (F1 score, I believe) and the training vs. test graph sizes.
  • Only the final performance is measured on the larger graphs. Another important and interesting metric is how well the actual algorithm is learned. For a given task, e.g. single-source shortest paths, there are several algorithms that solve the problem (e.g. Dijkstra and Bellman-Ford). Not all algorithms are equally easy to learn (e.g. Bellman-Ford is much more nicely aligned with the GNN architecture than Dijkstra, and thus easier). The extent to which the correct algorithm is learned can be shown by the accuracy on algorithm-specific intermediate hints, e.g. the next node selected in Dijkstra. I suspect that, for instance in the multi-task setup, the "wrong" algorithm is sometimes learned (see Table 2: Dijkstra relying on Bellman-Ford). I think this matters because it is a kind of shortcut that the model may learn, contrary to what we desire (as defined by the training data + loss).

[1] https://proceedings.neurips.cc/paper_files/paper/2023/file/a2370db7c99791ad5d9f3ef48ad6d464-Paper-Conference.pdf
[2] https://proceedings.neurips.cc/paper_files/paper/2021/file/a2802cade04644083dcde1c8c483ed9a-Paper.pdf
[3] https://arxiv.org/pdf/2406.09308

I am willing to raise my score further if the above weaknesses and questions are addressed.

Questions

  • How does the network perform as you scale the available training data points (as far as I can tell only 240 examples are tested)?
  • What does the generalisation curve look like across graph sizes (64,128,256,...)?
  • How many attention heads are you using for the cross-attention?

Limitations

  • Few ablations (see questions)
  • Related work is too brief (see weaknesses)

Author Response

We thank the reviewer for the constructive comments. This paper explores a new paradigm in neural algorithmic reasoning (NAR): we propose an open-book framework and provide a concrete implementation to demonstrate that open-book information does have the potential to enhance neural networks' reasoning capabilities. We believe this work will influence future directions in NAR, as the open-book framework, combined with other advanced techniques, will likely lead to more powerful neural architectures; our implementation is just the beginning. The following addresses the questions and concerns one by one.

Q1: How does the network perform as you scale the available training data points (as far as I can tell only 240 examples are tested)?

A: In our experiments, we set the number of auxiliary data points to 240 (= 30*8), where 30 represents the number of algorithms (tasks) and 8 represents the number of task categories. This ensures that both the single-task and multi-task settings have the same number of auxiliary data points, and can facilitate future research on the impact of open-book information across different categories in the multi-task setting. Additional results are provided in the PDF attached to the global response. The performance of our method remains robust as the number of auxiliary data points varies; note that when the number of auxiliary data points is zero, our method becomes the existing model.

Q2: What does the generalisation curve look like across graph sizes (64,128,256,...)?

A: This is a very constructive question and provides new insight into further testing the reasoning capability. In the original test data provided in CLRS, the graph size is set to 64. We construct larger testing datasets and present our performance curve when the test graph size grows exponentially from 64 to 512 in Figure 2 of the attached PDF (in the global response). We observe that the curves for different algorithmic tasks vary, likely due to the complexity of the solution space for each task. Overall, as the testing size grows exponentially, the performance in most tasks declines smoothly. Another related experiment is shown in Figure 3, where we fix the testing size and vary the training size. We conclude that the average performance of our method remains robust as the testing/training ratio increases exponentially.

Q3: How many attention heads are you using for the cross-attention?

A: The results presented in the paper use only one attention head. We've tried other settings, but the results were almost the same. We'll add a short discussion about this in the final version.
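As an aside for readers, the head count in such a cross-attention is just a constructor argument; a hedged usage sketch (tensor names and sizes are hypothetical):

```python
import torch
import torch.nn as nn

hidden_dim, num_nodes, num_aux = 128, 64, 240
cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=1, batch_first=True)

h = torch.randn(1, num_nodes, hidden_dim)   # queries: current instance's node states
book = torch.randn(1, num_aux, hidden_dim)  # keys/values: auxiliary encodings
aux_msg, attn_weights = cross_attn(h, book, book)
# attn_weights: (1, num_nodes, num_aux); inspecting these weights is one way
# to see which auxiliary data points (and hence tasks) the model attends to.
```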

W1: Concerns about learning "wrong" algorithms.

A: For the CLRS models, the training loss is defined as the average loss over the outputs and all intermediate hints (if a hint is categorical, we calculate the cross-entropy loss; if it is scalar, we use the MSE loss). Thus, the network is always trained to mimic each step of the algorithm rather than focusing solely on the final score. This training approach helps prevent shortcut learning. In Figure 4 of the attached PDF, we show the loss during the training of the Dijkstra algorithm. Furthermore, in Figure 5, we present the loss trajectory on the validation set, demonstrating that open-book information indeed enhances the step-by-step imitation process.
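For illustration, a minimal sketch of such a loss (hypothetical names and reductions; not the CLRS source):

```python
import torch
import torch.nn.functional as F

def hint_loss(pred: torch.Tensor, target: torch.Tensor, hint_type: str) -> torch.Tensor:
    """Per-hint loss as described above: cross-entropy for categorical hints,
    MSE for scalar hints (a sketch; not the CLRS implementation)."""
    if hint_type == "categorical":
        # pred: (N, num_classes) logits; target: (N,) class indices
        return F.cross_entropy(pred, target)
    if hint_type == "scalar":
        return F.mse_loss(pred, target)
    raise ValueError(f"unknown hint type: {hint_type}")

def total_loss(output_losses: list, hint_losses: list) -> torch.Tensor:
    # Averaging over the final outputs AND every intermediate hint trains the
    # network to imitate each step of the algorithm, not just the result.
    return torch.stack(output_losses + hint_losses).mean()
```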

W2: It would be good to re-iterate the metric (f1 score I believe) and training graph vs test graph size.

A: You're right. The evaluation metric is the F1 score, where we first partition the algorithm's output into different types (such as node states, edge states, etc.), calculate the percentage of output elements that match the ground truth for each type, and then take the average. The maximum graph size in the training set is 16, while in the testing data, the graph size is 64.

W3: Other weaknesses.

A: Thank you for these valuable comments. We will add the missing references as well as more training details, and incorporate additional ablation studies in the final version.

Global Author Response

We thank all reviewers for the helpful and constructive comments. We greatly appreciate the time and effort you put into this review, and we will incorporate your suggestions into the final version. The paper explores a new paradigm in neural algorithmic reasoning. We propose an open-book framework and demonstrate that open-book information can indeed enhance the reasoning capabilities of existing architectures. We believe that there is significant potential for further exploration within the open-book framework, and that this work could influence future directions in NAR. To address some concerns raised by reviewers, we provide additional experimental results in the attached PDF file. Due to space constraints, we show only five figures covering representative algorithmic tasks, as the remaining tasks exhibit similar trends. We will add these additional results to the paper's full version.

Final Decision

After carefully reading the reviews and rebuttal, I agree with the reviewers that the work is interesting for the community. The few questions raised by the reviewers were answered fully in the rebuttal, and I urge the authors to incorporate these answers into the final version of the paper. In particular, revisit the related work section (and the comments about it) and incorporate the additional results that were provided in the rebuttal.

Overall, I believe the work could also benefit from more analysis and ablations (e.g. checking when and how the examples from the episodic memory -- the open book -- are being used, and whether this makes sense semantically), rather than focusing only on the performance of the system in different settings. This would add another dimension of understanding and certainty to the work's starting hypothesis. However, while additional analysis would be nice to have, even without it the work is already sufficiently interesting and useful for the community. So as long as all the details in the rebuttal are incorporated, the paper should be ready for acceptance.