PaperHub
Overall score: 6.3/10 · Poster · 4 reviewers
Individual ratings: 7, 6, 6, 6 (min 6, max 7, std 0.4)
Average confidence: 4.3
COLM 2024

Chain-of-Symbol Prompting For Spatial Reasoning in Large Language Models

Submitted: 2024-03-23 · Updated: 2024-08-26
TL;DR

A symbolic prompting method that represents complex spatial relationships with simple symbols for LLMs

Keywords
In-Context Learning, Chain-of-Thought Prompting, Spatial Reasoning

Reviews and Discussion

Review (Rating: 7)

This paper proposes to use abstract chain-of-thought prompts that simply use symbols, rather than natural language. This results in notable improvements across a variety of planning and spatial reasoning tasks. This seems like a useful insight that will improve applications of language models, and is also somewhat conceptually interesting.

Reasons to Accept

  • Clear, simple idea that illustrates a more general point: natural language is not necessarily the most effective way to get language models to reason.
    • I actually felt that this point could be more strongly emphasized than it is; perhaps including natural language in the CoT "distracts" the model with the fuzzier processes of next word prediction and/or makes the distribution higher entropy, while the CoS prompting is more formulaic and rigid and therefore avoids one or both of these. It seems worth discussing these possibilities, and ideally investigating them in some way (see below).
  • Strong improvements on the tasks evaluated.
    • As a bonus, it saves some compute/tokens.
  • Some nice robustness & scaling experiments.

Reasons to Reject

  • The paper doesn't thoroughly quantify when and why CoS is beneficial. While there is a diversity of comparisons, it would be useful to dig into some of them in more depth; for example, are there characteristic patterns of errors that CoT makes but CoS is more likely to avoid? Conversely, where CoS makes an error, does CoT tend to make the same error, or are they uncorrelated? Etc.
    • It would be even better if it were possible to link some of these analyses to the CoS itself; e.g. showing that it contains more correct steps than the CoT.
    • If possible, it would also be very interesting to quantify the distributional effects of the CoT to CoS switch along the lines of the hypothesis proposed above: for example, is the distribution over next tokens higher entropy in CoT than in CoS? Could that help explain the benefit? (A minimal probe sketch follows this list.)
  • It would be nice to see some evaluations on more challenging benchmarks like https://arxiv.org/abs/2206.10498
  • If I understand correctly, in Fig. 3 "None" means no symbol or space between the letters; it seems to me that having a space would also be a useful comparison condition to run (to see if the eponymous symbols are actually critical, or if it's just necessary to have some break in between to avoid tokenizer weirdness). A quick tokenizer check is also sketched below.
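To make the entropy hypothesis concrete, here is the kind of minimal probe I have in mind, assuming an arbitrary HuggingFace causal LM; the model name and the two example strings are placeholders, not the paper's actual exemplars:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder; any causal LM would do for a first look
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def mean_next_token_entropy(text: str) -> float:
    """Average entropy (nats) of the model's next-token distribution
    across all positions of the given text."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits            # (1, seq_len, vocab_size)
    log_p = torch.log_softmax(logits, dim=-1)
    entropy = -(log_p.exp() * log_p).sum(-1)  # (1, seq_len)
    return entropy.mean().item()

# Hypothetical CoT vs. CoS renderings of the same reasoning step.
cot = "The blue brick B is on top of the yellow brick A, so remove B before A."
cos = "A/B, so: B, A"
print("CoT mean entropy:", mean_next_token_entropy(cot))
print("CoS mean entropy:", mean_next_token_entropy(cos))
```

If CoS consistently yields lower mean entropy than matched CoT on the same instances, that would support the "more formulaic, lower-entropy" explanation.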
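And here is the quick tokenizer check for the separator question, using tiktoken's cl100k_base encoding (the GPT-3.5-turbo tokenizer); the letter sequences are purely illustrative:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-3.5-turbo's encoding

# How does the tokenizer split a letter sequence with no separator,
# a plain space, a comma, or a symbolic separator?
for variant in ["ABCD", "A B C D", "A, B, C, D", "A/B/C/D"]:
    ids = enc.encode(variant)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{variant!r:>14} -> {len(ids):2d} tokens: {pieces}")
```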
Author Response

Dear reviewer,

Thank you very much for taking the time to evaluate the paper, and for your appreciation of our work.

We sincerely thank you for your suggestions on the potential development of CoS. We will incorporate them as soon as we obtain the results in future work. They are:

  1. Deeper analysis of when and why CoS is beneficial.
  2. Carefully check the number of correct steps with CoS and CoT.
  3. Check the next-token entropy to probe CoS.
  4. Evaluate CoS on more challenging benchmarks like https://arxiv.org/abs/2206.10498.
  5. Add more comparison conditions to Fig. 3.

Again, thank you for your effort in evaluating the paper. Have a nice day.

Best Regards,

Authors

Comment

Thanks to the authors for the interesting work, I am looking forward to seeing these results and the follow-ups in the future!

Review (Rating: 6)

The work proposes a new in-context learning method called Chain-of-Symbol (CoS), which utilizes symbolic representations for spatially related objects rather than natural language representations. The authors claim the proposed technique is more effective and reliable than commonly used CoT techniques. They support the claim using four tasks involving spatial understanding of various dimensions (1D, 2D, and 3D spatial relations) with closed-source and open-source LLMs. The results indicate that CoS outperforms CoT in both performance and resource usage on all tasks.

Reasons to Accept

The proposed technique provides another strong representation for performing spatial reasoning on LLMs. The notable reasons are stated below:

  • The proposed method outperforms CoT on several benchmarks covering both simple and complex spatial relations, while also saving resources
  • The benchmarks used in the experiments have good coverage of applications involving spatial relations
  • Provides results that support the performance of the method on both open-source and closed-source LLMs
  • Based on the analysis of one of the benchmarks, the proposed method is effective in different language settings
  • Easy to apply as in-context learning

Reasons to Reject

  • The improvement drops rapidly on complex tasks compared to simple ones, which may indicate little difference between symbolic and natural-language representations in the complex setting
  • Applying the method to realistic scenarios will be challenging because symbolic representations are not straightforward to obtain with reasonable coverage of concepts
  • According to the Llama2 results, the proposed technique only outperforms CoT with the largest model

Questions to Authors

  • You provided the result of GPT-4 on the SPARTUN dataset; what are the results of GPT-4 on the other benchmarks?
  • Can you visualize the Llama2 results in the same format and detail as the GPT results? More detail would be helpful since the Figure 4 plots are not very conclusive.
  • Would it be possible to apply the CoS approach to other types of reasoning beyond spatial?
  • In the analysis part of Brick World, did you experiment with different symbolic representations? What would be the expected results in a more complex setting if the symbolic representations are changed? The extreme case would be using <- to represent left instead of right. Would the model still understand it?
  • Can you justify using GPT for creating the initial explanation compared to manually crafting CoS? Is the motivation reducing the human annotation effort?
  • For CoT, the experiment is performed using manually crafted CoT. What about the results from CoT if you follow the same steps as CoS?
Author Response

Dear reviewer,

Thank you for your review and your effort in evaluating our paper. There may be some miscommunication that we would like to clarify.

For your first concern: The improvement drops rapidly on complex tasks compared to simple ones, which may indicate little difference between symbolic and natural-language representations in the complex setting

This paper encourages the community to pre-train LLMs in a symbolic manner in order to achieve higher performance on complex tasks.

For your second concern: Applying the method to realistic scenarios will be challenging because symbolic representations are not straightforward to obtain with reasonable coverage of concepts

This is true for some tasks. Yet, in addition to spatial reasoning, coding and logical reasoning in first-order logic can also benefit from CoS. These are important and useful tasks where CoS could potentially be applied easily.

For your third concern: According to the Llama2 results, the proposed technique only outperforms CoT with the largest model

This paper encourages the community to pre-train LLMs in a symbolic manner so that smaller LLMs will also benefit from CoS.

For your first question: You provided the result of GPT-4 on the SPARTUN dataset; what are the results of GPT-4 on the other benchmarks?

We will include more results in the near future with more resources.

For your second question: Can you visualize the Llama2 results in the same format and detail as the GPT results? More detail would be helpful since the Figure 4 plots are not very conclusive.

This will be included in the camera-ready version.

For your third question: Would it be possible to apply the CoS approach to other types of reasoning beyond spatial?

See the reply to the second concern above.

For your fourth question.

We think the results in Table 6 might provide insights into your question. Yes, the model still understands the case you suggested.

For the fifth and sixth questions.

There may be some miscommunication: the CoT is generated automatically by GPT, and we manually convert it into CoS for a fair comparison.

Again, thank you for your effort in evaluating the paper. Have a nice day.

Best Regards,

Authors

Comment

Thank you for your response. I have read all the other reviews and the responses. Based on the responses, it does not seem to me that there was any misunderstanding in my evaluation. My overall evaluation was positive, and I keep it the same.

Review (Rating: 6)

The paper presents the implementation and evaluation of a symbolic language for communicating with a large language model. The translation of the natural language query or instruction into its symbolic form is done manually by the user. The results are compared with those obtained with zero-shot chain-of-thought prompting and with few-shot chain-of-thought prompting, where the prompts are also manually drafted but are in natural language.

Reasons to Accept

  • The idea is original.
  • The idea is evaluated with different tasks (brick world spatial reasoning, natural language visual reasoning, natural language navigation and spatial question answering).

Reasons to Reject

  • The expressiveness of the symbolic language is not theoretically defined and evaluated. It is not clear whether it exhaustively covers complex spatial configurations of objects in real-world settings.
  • The paper claims that the symbolic language is easy to use, but a user study that supports this claim is lacking.
  • The contribution towards novel techniques to improve language modeling is not clear. The paper rather suits a human-computer interaction venue.
  • It is not surprising that the obtained results improve compared to zero- and few-shot chain-of-thought reasoning as the symbolic language has less complex patterns than natural language and is an unambiguous controlled language to attend to when querying the LLM, but it requires knowledge engineering expertise from the user.
  • The paper seems to be written in a hurry and contains typos and grammatical errors.
  • Overall, the paper is not yet mature to be published but the authors are encouraged to improve their work.

Questions to Authors

  • See the remarks above.
Author Response

Dear reviewer,

Thank you for your review.

For your concern: The expressiveness of the symbolic language is not theoretically defined and evaluated.

CoS is an empirical method, so we doubt the necessity of a theoretical definition of the method; however, we clearly describe the detailed pipeline of constructing CoS and transforming from CoT to CoS in Section 3 and Appendix A.2.2.

For your concern: The paper claims that the symbolic language is easy to use, but a user study that supports this claim is lacking.

Our claim about 'easy to use' is based on the fact that our method doesn't need additional training and doesn't need a complex pipeline; it only needs a simple transformation from CoT. Our method also requires fewer tokens compared with CoT. Because of its simplicity and its use of a structured symbolic language, it is easy for users to read and understand (see Figure 1). A minimal sketch of such a transformation follows.
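As an illustration, a rule-based CoT-to-CoS rewrite might look like the sketch below; the phrase-to-symbol mapping and the example sentence here are hypothetical (not our actual exemplars), and token counts use tiktoken's cl100k_base encoding:

```python
import re
import tiktoken

# Hypothetical phrase-to-symbol rules; the paper's actual mapping may differ.
RULES = [
    (r"\bis on top of\b", "/"),
    (r"\bis to the left of\b", "<"),
    (r"\bis to the right of\b", ">"),
    (r"\bthe (blue|yellow|red|white) brick\b", ""),  # drop color adjectives
]

def cot_to_cos(cot: str) -> str:
    """Apply the replacement rules, then collapse leftover whitespace."""
    cos = cot
    for pattern, symbol in RULES:
        cos = re.sub(pattern, symbol, cos, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", cos).strip()

enc = tiktoken.get_encoding("cl100k_base")
cot = "The blue brick B is on top of the yellow brick A."
cos = cot_to_cos(cot)
print(cos)  # -> "B / A."
print(f"tokens: {len(enc.encode(cot))} (CoT) vs. {len(enc.encode(cos))} (CoS)")
```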

For your concern: The contribution towards novel techniques to improve language modeling is not clear. The paper rather suits a human-computer interaction venue.

This conference is not limited to topics about 'improving language modeling'. As you can see, we chose the research area of this paper as 'Alignment, Evaluation, LMs and interactions', which we think is aligned with the content of our paper based on https://colmweb.org/cfp.html. Using symbolic languages for in-context learning and prompting is worth exploring, and we believe the LLM community is open to diverse and meaningful topics like this.

For your concern: 'It is not surprising that the obtained results improve compared to zero- and few-shot chain-of-thought reasoning as the symbolic language has less complex patterns than natural language and is an unambiguous controlled language to attend to when querying the LLM, but it requires knowledge engineering expertise from the user.'

In fact, 'symbolic languages have less complex patterns and are unambiguous' is also our motivation for using symbols rather than natural language, which we describe in the introduction. As for 'requires knowledge engineering expertise from the user': in the construction process of CoS in the paper, we only use a few simple symbols to replace object relationships in CoT, which does not indicate that knowledge engineering expertise is required.

Again, thank you for your effort in evaluating the paper. Have a nice day.

Best Regards,

Authors

Comment

Thank you for the answers which are clarifying.

Review (Rating: 6)

This paper focuses on spatial reasoning using LLMs and it proposes a prompting method called "Chain-of-Symbol (CoS)", which follows the same idea of Chain-of-Thought (CoT) but expresses the reasoning process using concise symbols. This approach was evaluated on four datasets: Brick World, NLVR-based Manipulation, Natural Language Navigation, and SPARTUN. Few-shot CoS was shown to outperform few-shot CoT in most cases.

Reasons to Accept

The paper presents a new method called Chain-of-Symbol (CoS), which offers the benefit of shorter prompts and is hence more cost- and time-efficient than CoT. It was also shown to outperform CoT.

Reasons to Reject

  1. The experimental designs need more justification.

(1) Concern about the validity of the baseline: To show the effect of using concise symbols, a fairer comparison should design CoS and CoT to follow the same reasoning process while differing only in their expression (i.e., symbols vs. text). Only in this more controlled comparison can one see whether symbols are more effective than textual reasoning. However, prompts in the Appendix show inconsistent designs between CoS and CoT. For example, on Brick World (Table 4 vs. Table 5), while CoS reasons about the spatial relations before giving the answer sequence, CoT directly presents the answer sequence. Similarly, on Natural Language Navigation (Table 8 vs. Table 9), while CoS enumerates the stops of each path and shows its distance calculation before reaching the conclusion of the shortest choice, CoT seems to miss some of the calculation details (e.g., "4. The road from bank A to store B is shorter than the road from bank A to store G" without showing the calculation of the latter path). As a result, in the current less fair experimental setup, it is hard to tell whether the gain of CoS over CoT is an effect of using symbols, or whether it is simply due to prompting the LLM with different reasoning logic. (A hypothetical paired exemplar illustrating this controlled setup is sketched after this list.)

(2) Experiments were performed using GPT-3.5-turbo, so it is unclear if the advantage of CoS holds for more advanced LLMs such as GPT-4.

(3) Concern about the choice of datasets for spatial reasoning: I have found the use of "NLVR-based Manipulation" and "Natural Language Navigation" as benchmark datasets for spatial reasoning confusing. They do not seem to focus on spatial reasoning. For "NLVR-based Manipulation", the involved "spatial" aspect is limited to left vs. middle vs. right boxes, whereas the main challenge of this dataset seems to be about "perception" (e.g., recognizing the color, size, and shape of objects). For "Natural Language Navigation", it is more about motion planning while there's no obvious challenge in inferring the spatial relationship among intermediate stops on a path.

  2. Novelty of CoS: How does CoS compare to Program of Thoughts, which also uses symbols to represent the reasoning process of problem solving?

Chen, Wenhu, et al. "Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks." Transactions on Machine Learning Research (2023).
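For concern (1), a controlled pairing would hold the reasoning steps fixed and vary only the surface form, roughly like the following hypothetical Brick World exemplars (invented for illustration, not the paper's actual prompts):

```python
# Both exemplars walk through identical steps; only the notation differs.
cot_exemplar = (
    "Brick B is on top of brick A. Brick C is on top of brick B.\n"
    "Step 1: C is uncovered, so remove C first.\n"
    "Step 2: B is now uncovered, so remove B next.\n"
    "Answer: C, B, A"
)
cos_exemplar = (
    "A/B, B/C\n"
    "Step 1: C uncovered -> remove C\n"
    "Step 2: B uncovered -> remove B\n"
    "Answer: C, B, A"
)
```

Any performance gap between such a pair would then be attributable to notation rather than to differences in reasoning logic.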

Questions to Authors

Please address my concerns in Reasons to Reject.

Figure 2: some texts are not properly displayed in "Natural Language Navigation".

Author Response

Dear reviewer, thank you for your review.

For your first concern,

(1) As described in Appendix A.2.2, we create exemplars based on the zero-shot CoT results and manually convert them to the CoS version. As we state there, 'We attempted our best efforts in tuning the baselines, and we report the best results we achieved.' There are versions that are strictly parallel, but we find that CoT performs even lower than the current version. Still, your point makes sense, and we will add a fairer comparison in a later version of the paper. In Tables 8 and 9, there is only one place (as you mentioned) that is not aligned; this is due to generation mistakes by the LLM, but the other exemplars are aligned, which might mitigate this mistake and unfairness, and our results remain significant.

(2) We would like to conduct experiments using GPT-4, but unfortunately we do not have enough budget for the GPT-4 API.

(3) For the choice of datasets, we chose NLVR-based Manipulation and Natural Language Navigation because they are both representative scenarios of reasoning/planning, and they are both easy to create and easy to understand. As the CoT results in Table 2 show, these two tasks are still challenging, which also indicates that the CoS method can generalize to diverse scenarios beyond pure spatial reasoning tasks.

For your second concern,

For the novelty of CoS: we are indeed different from Program of Thoughts [1]. They generate code and execute the generated program (which leverages external tools), while we use only the LM itself to reason through symbolic language.

The core novelty of this idea is that we are the first to discuss how to directly use symbols (without any external tools or program executors) for language models, and we find that it can both save tokens (inference time) and improve performance in some planning and reasoning tasks.

Again, thank you for your effort in evaluating the paper. Have a nice day.

Best Regards,

Authors

[1] Chen, Wenhu, et al. "Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks." Transactions on Machine Learning Research (2023).

Comment

1.1 I hope the authors can expose the details and the results in the next version. A sentence like 'We attempted our best efforts in tuning the baselines' is very ambiguous :) However, I find the result surprising (a less rigorous step-by-step CoT outperforming the more rigorous one) and believe that more discussion is needed.

1.2 I'm sorry about the cost concern, but an experiment using GPT-4 makes more sense and would certainly tell people how much CoS is needed.

1.3 The answer is not convincing. The original NLVR was designed to evaluate visual reasoning, including spatial reasoning, but the NLVR-based Manipulation dataset in the experiments, as I understand it, was annotated by the authors. Based on the examples in the paper, the kind of spatial reasoning it needs is very shallow. The Natural Language Navigation benchmark was also created by the authors, not by any prior work; hence "they are both representative benchmarks" is very misleading.

  2. Thanks for the justification.

I increased my score as parts of my concerns were addressed.

Final Decision

The authors propose chain-of-symbol prompting for spatial-reasoning-from-text tasks. The method involves manually constructing more compact/symbolic versions of chain-of-thoughts generated by models in a zero-shot fashion, and using these updated representations as in-context examples. This method outperforms a baseline that does not make the proposed symbol substitution, sometimes significantly, with GPT-3.5-turbo and Llama-2 (and does so using fewer tokens).

Reviewers appreciated the simplicity, effectiveness, and efficiency of the proposed method. A few themes among the weaknesses:

  • Some reviewers (e.g., iTYN) hoped for more experiments about when/why the CoS method works, others (e.g., C6LX, cm3f) hoped for more models to be covered, and still others (e.g., iTYN) hoped for more datasets

  • Some reviewers highlighted presentation issues (e.g., ST4m brought up the "easy to use" discussion)

In my view, the two biggest remaining shortcomings are:

  • Why does CoS help? Are there error analyses that could be conducted on the individual instances where improvement occurs?

  • Does CoS really generalize to other similar tasks? Or to tasks beyond spatial reasoning from text?

Concerns were mostly addressed in conversation. Overall, this work provides (another) striking illustration of when and how representations can matter for LLMs: while CoT and CoS represent the same information, CoS clearly does so in a manner more accessible to LLMs. The additional experiments the authors promised in their responses are likely to provide the missing context requested by reviewers.