PaperHub

Overall rating: 5.2 / 10 (5 reviewers; min 3, max 7, std. dev. 1.5)
Individual ratings: 7, 3, 6, 6, 4
Confidence: 3.2 · Correctness: 2.8 · Contribution: 2.6 · Presentation: 3.6
Venue: NeurIPS 2024 (Poster)

OccamLLM: Fast and Exact Language Model Arithmetic in a Single Step

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2024-11-06
TL;DR

We propose a framework that enables exact arithmetic in a single autoregressive step, providing faster, more secure, and more interpretable LLM systems with arithmetic capabilities.

Abstract

Despite significant advancements in text generation and reasoning, Large Language Models (LLMs) still face challenges in accurately performing complex arithmetic operations. Language model systems often enable LLMs to generate code for arithmetic operations to achieve accurate calculations. However, this approach compromises speed and security, and fine-tuning risks the language model losing prior capabilities. We propose a framework that enables exact arithmetic in *a single autoregressive step*, providing faster, more secure, and more interpretable LLM systems with arithmetic capabilities. We use the hidden states of an LLM to control a symbolic architecture that performs arithmetic. Our implementation using Llama 3 with OccamNet as a symbolic model (OccamLlama) achieves 100% accuracy on single arithmetic operations ($+,-,\times,\div,\sin{},\cos{},\log{},\exp{},\sqrt{}$), outperforming GPT 4o with and without a code interpreter. Furthermore, OccamLlama outperforms GPT 4o with and without a code interpreter on average across a range of mathematical problem solving benchmarks, demonstrating that OccamLLMs can excel in arithmetic tasks, even surpassing much larger models. Code is available at https://github.com/druidowm/OccamLLM.
Keywords
LLM, Language Model, Arithmetic, OccamNet, Llama

Reviews and Discussion

Review
Rating: 7

This paper introduces a framework enabling Large Language Models (LLMs) to perform exact arithmetic operations efficiently and securely. By utilizing hidden states from an LLM to control a symbolic architecture named OccamNet, the framework achieves high accuracy and interpretability without compromising speed or security.

Strengths

  1. Addressing a Key Weakness: Popular LLMs struggle with simple arithmetic, limiting their understanding of physics and mathematics. This paper overcomes this limitation by integrating a symbolic model, OccamNet, which performs basic arithmetic with 100% accuracy.
  2. Innovative Integration: The authors propose a decoder mechanism to determine whether the next token should be generated from the LLM or OccamNet, and which operation OccamNet should execute. This approach of integrating OccamNet with other LLMs opens interesting future research directions, particularly in combining other tools with LLMs.

Weaknesses

Decoder's Robustness: The effectiveness of OccamLLM heavily depends on the decoder's training across various prompts to accurately discern the required operation. There is a risk that unfamiliar prompts could lead the decoder to instruct OccamNet to perform incorrect operations.

Questions

Handling Unrelated Numbers: How does the model manage sentences with numbers unrelated to the arithmetic operation, such as historical data in the question 'In 1932, 100 planets were discovered and in 1933, another 200 were found. What is the total number of planets discovered between 1932 and 1933?' How does the string parser determine which numbers should be input into OccamNet?

Limitations

Compound Expressions: The framework's reliance on prompting LLMs to decompose compound expressions is a significant limitation. If LLMs fail to adequately simplify these expressions, the decoder might incorrectly determine the operation to be performed or fail to revert to the LLM for complex expressions. Additionally, the paper lacks experimental validation that LLMs can effectively handle compound expressions.

Comment

Questions:

Handling Unrelated Numbers: How does the model manage sentences with numbers unrelated to the arithmetic operation, such as historical data in the question 'In 1932, 100 planets were discovered and in 1933, another 200 were found. What is the total number of planets discovered between 1932 and 1933?' How does the string parser determine which numbers should be input into OccamNet?

We understand your concern about arithmetic problems in which relevant numbers are not the most recent numbers in the text. We are happy to assert that, in most cases, OccamLlama is capable of solving them. To illustrate, below is OccamLlama's response to your example problem:

In 1932, 100 planets were discovered and in 1933, another 200 were found. What is the total number of planets discovered between 1932 and 1933? The number of planets discovered in 1932 was 100. The number of planets discovered in 1933 was 200. So the total number of planets discovered is 100 + 200 = 300\n\n#### 300.

As you can see, most language models tend to repeat the computation they wish to perform prior to stating the answer, which reinserts the relevant numbers as the most recent numbers in the text. The reasoning is done by the LLM and the arithmetic by OccamNet, showcasing the unique synergy that the OccamLLM system achieves.

At the same time, it is possible to train an OccamLlama with the ability to perform arithmetic correctly when the relevant numbers are not the most recent in the context. This can be done by simply providing OccamLlama with, for example, the ten most recent numbers in the text instead of the two most recent. In practice, we did not find this to be necessary.
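To make the parsing step concrete, here is a minimal sketch of a regex-based extractor for the n most recent numbers in the generated text (an illustration under our own assumptions, not the authors' actual parser):

```python
import re

def most_recent_numbers(text: str, n: int = 2) -> list[float]:
    """Return the n most recent numeric literals in the text.

    Illustrative sketch only: the real parser may handle signs,
    thousands separators, or scientific notation differently.
    """
    tokens = re.findall(r"-?\d+(?:\.\d+)?", text)
    return [float(t) for t in tokens[-n:]]

# After the model writes "... is 100 + 200 = ", the two most recent
# numbers are 100.0 and 200.0 -- exactly the operands OccamNet needs.
```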

Limitations:

Compound Expressions: The framework's reliance on prompting LLMs to decompose compound expressions is a significant limitation. If LLMs fail to adequately simplify these expressions, the decoder might incorrectly determine the operation to be performed or fail to revert to the LLM for complex expressions.

Thank you for your comment. We note that our existing methods are already capable of performing arbitrarily complex arithmetic by chaining simpler calculations. Existing language models are generally trained or prompted to reason step-by-step, meaning that their generation can often be supplemented by a single-layer OccamNet. This is evidenced by our reasoning benchmarks, in which OccamLlama (with a single-layer OccamNet) performs quite well across a wide range of tasks.

Additionally, we can further improve OccamLlama systems with single-layer OccamNets by finetuning and/or prompting the model to think step-by-step.

Continued below

Author Response

Strengths:

Addressing a Key Weakness: Popular LLMs struggle with simple arithmetic, limiting their understanding of physics and mathematics. This paper overcomes this limitation by integrating a symbolic model, OccamNet, which performs basic arithmetic with 100% accuracy.

Innovative Integration: The authors propose a decoder mechanism to determine whether the next token should be generated from the LLM or OccamNet, and which operation OccamNet should execute. This approach of integrating OccamNet with other LLMs opens interesting future research directions, particularly in combining other tools with LLMs.

We appreciate your kind remarks!

Weaknesses:

Decoder's Robustness: The effectiveness of OccamLLM heavily depends on the decoder's training across various prompts to accurately discern the required operation. There is a risk that unfamiliar prompts could lead the decoder to instruct OccamNet to perform incorrect operations.

Thank you for your comment. We agree that this is a potential concern. However, we contend that out-of-distribution inputs are a problem for all machine learning models, so this issue is not specific to OccamLlama.

In fact, we find that OccamLlama displays remarkable generalization capabilities on out-of-distribution problems. To demonstrate this, we train the OccamNet decoder from scratch, using only numeric expressions and absolutely no text at all. This means that any problem with text, such as a word problem, is far out of distribution of the OccamNet decoder's training data. We test this model (using the standard router) on the mathematical reasoning benchmarks and obtain remarkably good results, shown visually in Figure b) of the global response PDF and presented in the table below. The OccamLlama 8B decoder trained only with numerical expressions performs on par with the model trained with both numbers and text, even achieving higher accuracy on some benchmarks. This shows that the OccamLLM framework is robust and points towards the fact that the representations of arithmetic that are built in the transformer body of the LLM and extracted by the OccamLLM Decoder are very general.

Table: Accuracy on reasoning tasks. Higher is better.

| Dataset | OccamLlama | OccamLlama Arith | Llama 2 7B Chat | Llama 3 8B Instruct | GPT 3.5 Turbo | GPT 4o | GPT 4o Code |
|---|---|---|---|---|---|---|---|
| AddSub | 91.6 ± 1.4 | 92.7 ± 1.3 | 78.0 ± 2.1 | 93.4 ± 1.2 | 95.4 ± 1.1 | 97.0 ± 0.9 | 97.5 ± 1.1 |
| GSM8K | 73.5 ± 1.2 | 71.6 ± 1.2 | 36.0 ± 1.3 | 79.8 ± 1.1 | 84.8 ± 1.0 | 96.1 ± 0.5 | 94.0 ± 1.7 |
| MultiArith | 99.2 ± 0.4 | 98.5 ± 0.5 | 76.0 ± 1.7 | 99.8 ± 0.2 | 97.2 ± 0.7 | 99.7 ± 0.2 | 99.5 ± 0.5 |
| MultiArith Float | 98.2 ± 0.5 | 95.3 ± 0.9 | 23.3 ± 1.7 | 57.3 ± 2.0 | 77.3 ± 1.7 | 96.2 ± 0.8 | 89.5 ± 2.2 |
| MATH401 | 85.0 ± 1.8 | 85.8 ± 1.7 | 43.9 ± 2.5 | 60.3 ± 2.4 | 63.1 ± 2.4 | 76.6 ± 2.1 | 78.0 ± 2.9 |
| Single Eq | 92.9 ± 1.1 | 92.1 ± 1.2 | 79.1 ± 1.8 | 96.3 ± 0.8 | 97.8 ± 0.6 | 98.0 ± 0.6 | 99.0 ± 0.7 |
| SVAMP | 88.6 ± 1.0 | 88.8 ± 1.0 | 61.5 ± 1.5 | 86.3 ± 1.1 | 87.8 ± 1.0 | 96.2 ± 0.6 | 96.5 ± 1.3 |
| Average | 89.9 ± 1.1 | 89.3 ± 1.1 | 56.8 ± 1.8 | 81.9 ± 1.3 | 86.2 ± 1.2 | 94.2 ± 0.8 | 93.4 ± 1.5 |

In contrast, we expect that finetuning Llama to perform arithmetic using only numeric examples and no text whatsoever would lead to extreme catastrophic forgetting and poor arithmetic performance on word problems. As such, we believe this data shows a remarkable robustness of OccamLLM.
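For concreteness, purely numeric training expressions of the kind described could be generated along these lines (a hypothetical sketch; the authors' templates, operation set, and number ranges are not specified here):

```python
import random

# Illustrative subset of the paper's operation set (+, -, *, /, sin, ...).
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}

def numeric_example() -> str:
    """One text-free training expression, e.g. '417.2 * 38.9 = 16229.08'."""
    op = random.choice(list(OPS))
    a = round(random.uniform(-1000, 1000), 2)
    b = round(random.uniform(1, 1000), 2)  # avoids division by zero
    return f"{a} {op} {b} = {round(OPS[op](a, b), 2)}"
```

Any word problem is then far outside this training distribution, which is what makes the reported transfer notable.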

As a second example of generalization, we tested OccamLlama on the Multilingual Grade School Math Benchmark (MGSM), a dataset consisting of GSM8K translated into 10 languages (Bengali, Chinese, French, German, Japanese, Russian, Spanish, Swahili, Telugu, and Thai). We compute the drop in accuracy when switching from English to another language, given by the accuracy of a model on the English dataset minus the accuracy of the model on the dataset in a given language. For OccamLlama in Bengali and Telugu, due to time constraints, we evaluated on a subset of 100 and 50 randomly chosen problems, respectively. We will update the table below with the full results soon. We show the results visually in Figure c) of the global response PDF and in the table below:

Table: Accuracy drop on multilingual reasoning tasks. Lower is better.

| Dataset | OccamLlama | Llama 3 8B Instruct |
|---|---|---|
| Bengali | 27.0 ± 5.8 | 43.2 ± 3.9 |
| Chinese | 21.2 ± 4.3 | 17.6 ± 3.7 |
| French | 13.6 ± 4.2 | 13.6 ± 3.7 |
| German | 14.0 ± 4.2 | 11.6 ± 3.6 |
| Japanese | 47.2 ± 3.9 | 63.6 ± 3.4 |
| Russian | 15.2 ± 4.2 | 12.8 ± 3.6 |
| Spanish | 5.6 ± 4.1 | 9.2 ± 3.5 |
| Swahili | 32.0 ± 4.2 | 38.4 ± 3.9 |
| Telugu | 56.0 ± 6.0 | 61.6 ± 3.5 |
| Thai | 20.0 ± 4.3 | 21.2 ± 3.8 |
| Average | 38.0 ± 4.2 | 29.3 ± 3.7 |

The table above shows that OccamLlama has, on average, a smaller performance drop than Llama between the English dataset and the non-English datasets. The fact that OccamLlama (the decoders for which have never been trained on other languages) has on average better out-of-distribution behavior than Llama (a model trained on over 750 billion tokens of non-English text) is in our opinion quite remarkable.

We believe that these two tests demonstrate OccamLlama's robustness against out-of-distribution data.

Continued below

Comment

Additionally, the paper lacks experimental validation that LLMs can effectively handle compound expressions.

Thank you for your comment. We believe that the mathematical reasoning benchmarks provide experimental validation that OccamLlama can effectively handle compound expressions. These problems are multi-step math problems requiring multiple calls to OccamNet to be completed successfully. Given OccamLlama's strong performance on these benchmarks, we hope you will agree that OccamLlama can effectively handle compound expressions. We will make this clearer in the paper.

Since submitting to NeurIPS, we identified a few minor bugs in our code which were limiting performance and slightly reweighted the proportions of the different training datasets. These changes maintained the perfect performance on arithmetic tasks and improved performance on mathematical reasoning tasks. The updated results are shown visually in Figure a) of the global response PDF and the table below:

Table: Accuracy on reasoning tasks. Higher is better.

| Dataset | OccamLlama | Llama 3 8B Instruct | GPT 3.5 Turbo | GPT 4o | GPT 4o Code |
|---|---|---|---|---|---|
| AddSub | 91.6 ± 1.4 | 93.4 ± 1.2 | 95.4 ± 1.1 | 97.0 ± 0.9 | 97.5 ± 1.1 |
| GSM8K | 73.5 ± 1.2 | 79.8 ± 1.1 | 84.8 ± 1.0 | 96.1 ± 0.5 | 94.0 ± 1.7 |
| MultiArith | 99.2 ± 0.4 | 99.8 ± 0.2 | 97.2 ± 0.7 | 99.7 ± 0.2 | 99.5 ± 0.5 |
| MultiArith Float | 98.2 ± 0.5 | 57.3 ± 2.0 | 77.3 ± 1.7 | 96.2 ± 0.8 | 89.5 ± 2.2 |
| MATH401 | 85.0 ± 1.8 | 60.3 ± 2.4 | 63.1 ± 2.4 | 76.6 ± 2.1 | 78.0 ± 2.9 |
| Single Eq | 92.9 ± 1.1 | 96.3 ± 0.8 | 97.8 ± 0.6 | 98.0 ± 0.6 | 99.0 ± 0.7 |
| SVAMP | 88.6 ± 1.0 | 86.3 ± 1.1 | 87.8 ± 1.0 | 96.2 ± 0.6 | 96.5 ± 1.3 |
| Average | 89.9 ± 1.1 | 81.9 ± 1.3 | 86.2 ± 1.2 | 94.2 ± 0.8 | 93.4 ± 1.5 |

Here, we see that OccamLlama outperforms even GPT 4o and GPT 4o with Code on benchmarks requiring challenging arithmetic (MultiArith Float and MATH401), areas where OccamLlama shines. Additionally, the gap between OccamLlama and Llama 3 8B is substantially smaller than before on GSM8K. On average, OccamLlama outperforms both Llama and GPT 3.5 Turbo, with accuracy approximately halfway between that of GPT 3.5 and GPT 4o.

We also find that the performance of OccamLlama on reasoning benchmarks improves dramatically as we increase the base model from Llama 3 8B to Llama 3 70B (OccamLlama 70B).

To ensure we had results in time for the rebuttal deadline, we sampled 100 random questions from each dataset to determine OccamLlama 70B's performance, which is the cause for the relatively large error bars. We will update this response with the full results once they complete. The results are shown visually in Figure d) of the global response PDF and the table below:

Table: Accuracy on reasoning tasks. Higher is better.

| Dataset | OccamLlama 8B | OccamLlama 70B | Llama 3 8B Instruct | GPT 3.5 Turbo | GPT 4o | GPT 4o Code |
|---|---|---|---|---|---|---|
| AddSub | 91.6 ± 1.4 | 98.0 ± 1.4 | 93.4 ± 1.2 | 95.4 ± 1.1 | 97.0 ± 0.9 | 97.5 ± 1.1 |
| GSM8K | 73.5 ± 1.2 | 87.0 ± 3.4 | 79.8 ± 1.1 | 84.8 ± 1.0 | 96.1 ± 0.5 | 94.0 ± 1.7 |
| MultiArith | 99.2 ± 0.4 | 100.0 ± 0.0 | 99.8 ± 0.2 | 97.2 ± 0.7 | 99.7 ± 0.2 | 99.5 ± 0.5 |
| MultiArith Float | 98.2 ± 0.5 | 99.0 ± 1.0 | 57.3 ± 2.0 | 77.3 ± 1.7 | 96.2 ± 0.8 | 89.5 ± 2.2 |
| MATH401 | 85.0 ± 1.8 | 91.0 ± 2.9 | 60.3 ± 2.4 | 63.1 ± 2.4 | 76.6 ± 2.1 | 78.0 ± 2.9 |
| Single Eq | 92.9 ± 1.1 | 93.0 ± 2.6 | 96.3 ± 0.8 | 97.8 ± 0.6 | 98.0 ± 0.6 | 99.0 ± 0.7 |
| SVAMP | 88.6 ± 1.0 | 96.0 ± 2.0 | 86.3 ± 1.1 | 87.8 ± 1.0 | 96.2 ± 0.6 | 96.5 ± 1.3 |
| Average | 89.9 ± 1.1 | 94.9 ± 1.9 | 81.9 ± 1.3 | 86.2 ± 1.2 | 94.2 ± 0.8 | 93.4 ± 1.5 |
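For reference, these error bars are consistent with the binomial standard error of an accuracy estimate over $n$ sampled questions (our back-of-the-envelope check, not a method the authors state):

$$\mathrm{SE} = \sqrt{\frac{p\,(1-p)}{n}}, \qquad p = 0.87,\; n = 100 \;\Rightarrow\; \mathrm{SE} \approx 0.034,$$

which matches the reported 87.0 ± 3.4 for OccamLlama 70B on GSM8K.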

These results demonstrate that OccamLLM is capable of robust performance on large-scale models. Training the OccamLLM 70B system (in 16-bit precision and unoptimized code) required approximately 1.5-2 days on a system with two A100 nodes, demonstrating the low compute requirements of our method.

OccamLlama 70B shows consistent improvement over OccamLlama 8B. We believe this results from a combination of Llama 3 70B's improved reasoning capabilities and Llama 3 70B's improved representations (which enable OccamLlama to generalize more effectively). We expect that training with even larger base models would lead to further improvements. Remarkably, OccamLlama 70B outperforms GPT 4o and GPT 4o with Code on average across these benchmarks. It also outperforms GPT 3.5 Turbo on all but Single Eq.


We are very grateful for your comments and suggestions. We hope you will agree that the additional experiments and clarifications will strengthen our paper.

Comment

Dear Reviewer,

We are writing to provide updated results from our completed evaluation runs. Below, we show the final results for OccamLlama 70B. We note that OccamLlama 70B maintains its strong performance when tested on the complete datasets. It still outperforms both GPT 4o and GPT 4o with Code on average. It also improves substantially compared to the 100-datapoint performance estimate on Single Eq. OccamLlama outperforms GPT 3.5 Turbo on all datasets except Single Eq, where it is less than one percentage point behind and within error bars.

| Dataset | OccamLlama 8B | OccamLlama 70B | Llama 3 8B Instruct | Llama 3 70B Instruct | GPT 3.5 Turbo | GPT 4o | GPT 4o Code |
|---|---|---|---|---|---|---|---|
| AddSub | 91.6 ± 1.4 | 96.5 ± 0.9 | 93.4 ± 1.2 | 97.2 ± 0.8 | 95.4 ± 1.1 | 97.0 ± 0.9 | 97.5 ± 1.1 |
| GSM8K | 73.5 ± 1.2 | 90.1 ± 0.8 | 79.8 ± 1.1 | 94.8 ± 0.6 | 84.8 ± 1.0 | 96.1 ± 0.5 | 94.0 ± 1.7 |
| MultiArith | 99.2 ± 0.4 | 98.5 ± 0.5 | 99.8 ± 0.2 | 98.2 ± 0.5 | 97.2 ± 0.7 | 99.7 ± 0.2 | 99.5 ± 0.5 |
| MultiArith Float | 98.2 ± 0.5 | 97.7 ± 0.6 | 57.3 ± 2.0 | 76.3 ± 1.7 | 77.3 ± 1.7 | 96.2 ± 0.8 | 89.5 ± 2.2 |
| MATH401 | 85.0 ± 1.8 | 89.5 ± 1.5 | 60.3 ± 2.4 | 71.6 ± 2.3 | 63.1 ± 2.4 | 76.6 ± 2.1 | 78.0 ± 2.9 |
| Single Eq | 92.9 ± 1.1 | 96.9 ± 0.8 | 96.3 ± 0.8 | 97.6 ± 0.7 | 97.8 ± 0.6 | 98.0 ± 0.6 | 99.0 ± 0.7 |
| SVAMP | 88.6 ± 1.0 | 93.2 ± 0.8 | 86.3 ± 1.1 | 94.5 ± 0.7 | 87.8 ± 1.0 | 96.2 ± 0.6 | 96.5 ± 1.3 |
| Average | 89.9 ± 1.2 | 94.6 ± 0.9 | 81.9 ± 1.5 | 90.0 ± 1.2 | 86.2 ± 1.4 | 94.2 ± 1.0 | 93.4 ± 1.7 |

Above, we also include results from Llama 3 70B Instruct. OccamLlama 70B substantially outperforms Llama 3 70B on MultiArith Float and MATH401. It is roughly the same (1.3 percentage point deviation or less) as Llama 3 70B on all other datasets (ones with arithmetic simple enough for Llama to perform satisfactorily) except for GSM8K, which demonstrates OccamLlama's ability to not interfere with reasoning while still assisting with arithmetic. On average, OccamLlama 70B outperforms Llama 3 70B by nearly five percentage points, and OccamLlama 8B approximately matches Llama 3 70B's average performance.

We also provide the final table for the MGSM dataset:

| Dataset | OccamLlama | Llama 3 8B Instruct |
|---|---|---|
| Bengali | 31.6 ± 4.2 | 43.2 ± 3.9 |
| Chinese | 21.2 ± 4.3 | 17.6 ± 3.7 |
| French | 13.6 ± 4.2 | 13.6 ± 3.7 |
| German | 14.0 ± 4.2 | 11.6 ± 3.6 |
| Japanese | 47.2 ± 3.9 | 63.6 ± 3.4 |
| Russian | 15.2 ± 4.2 | 12.8 ± 3.6 |
| Spanish | 5.6 ± 4.1 | 9.2 ± 3.5 |
| Swahili | 32.0 ± 4.2 | 38.4 ± 3.9 |
| Telugu | 48.4 ± 3.9 | 61.6 ± 3.5 |
| Thai | 20.0 ± 4.3 | 21.2 ± 3.8 |
| Average | 24.9 ± 4.2 | 29.3 ± 3.7 |

We note that previously, there was a typo in this table: the averages incorrectly showed that OccamLlama had a larger performance drop than Llama on average. We apologize for any confusion. Our verbal statements in the rebuttal were correct, and Figure c) in the general rebuttal PDF had the correct averages for the previous data. Other than this typo, the table above closely resembles the previously reported table, so we do not comment further.

Thank you for your consideration.

Comment

Dear Reviewer,

Thank you for your question. We believe that OccamLLM provides numerous advantages relative to code generation:

  • Improved generation speed and cost: LLM code generation requires generating tokens for code, leading to the use of potentially many more tokens than are needed to respond to a query. This can make LLMs with code generation (unlike OccamLLM) slow and expensive. For example, in the arithmetic benchmarks, we found that GPT 4o with code required on average more than 50 completion tokens, far more than the single autoregressive step required by OccamLlama to answer each question. Additionally, our total expenses for GPT 4o with code were more than double those for GPT 4o without code, even though we made almost 5 times fewer queries to GPT 4o with code, suggesting that GPT 4o with code is approximately 10 times more expensive per query than GPT 4o. Since OccamLLM does not require any additional autoregressive steps of the LLM, this suggests that OccamLLM is faster and cheaper than code generation.
  • Better results: The LLM code generation method we compared against, GPT 4o with Code Interpreter, actually performed worse than GPT 4o on average. We believe these results demonstrate that, for arithmetic, LLM code generation is not an optimal solution. OccamLlama 70B outperforms GPT 4o with code on our benchmarks, suggesting that OccamLLM provides a better solution.
  • No catastrophic forgetting: Code generation generally requires fine-tuning an LLM to write and run code. This risks catastrophic forgetting.
  • Improved security: Running LLM-generated code is potentially unsafe. For example, the LLM could ask to run the Python code __import__('os').system('rm -rf /'). It is in general extremely challenging to check code for malicious behavior. OccamLLM does not run arbitrary code, so it avoids these risks.
  • Reduced engineering: For the reasons mentioned in the previous bullet, building a secure system for LLM code execution, such as by hosting an LLM and a sandbox to safely run code, requires substantial engineering effort. Users can simply download our OccamLLM models and run them; no engineering is required.

As a result of OccamLlama's improved speed, cost, performance, security, and ease of use, we believe it is a compelling alternative to LLM code generation for arithmetic. We will make these advantages clearer in our paper.

Thank you again,

The Authors

Comment

I thank the authors for their responses.

...it is possible to train an OccamLlama with the ability to perform arithmetic correctly...done by simply providing OccamLlama the ten most recent numbers in the text...

Regarding your statement that it is possible to train an OccamLlama to perform arithmetic correctly by simply providing it with the ten most recent numbers in the text, I understand that the input to OccamLlama includes the output of the decoder and the most recent numbers. Could you please clarify if this means that by training OccamLlama, the model can determine which numbers to use in the arithmetic tasks? For example, if the input numbers are '1932, 100, 1933, 200,' would the training allow OccamLlama to learn to use 100 and 200?

Comment

Dear Reviewer,

Thank you for your question.

Your interpretation is correct. If the input numbers are '1932, 100, 1933, 200,' the training would allow OccamLlama to learn to use 100 and 200.

During training, we specify how many numbers in the text are fed into OccamNet. OccamNet learns which of these numbers to use (and also in which order to feed them into operations). This occurs in our model now when an operation such as sqrt, sin, cos, exp, etc., is chosen, because these operations only require one input, but OccamNet receives the two most recent numbers.
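To illustrate this input-selection mechanism concretely, here is a toy sketch (our simplification with made-up names, not the actual OccamNet code) in which each argument slot of an operation carries a learned, directly inspectable distribution over the candidate input numbers:

```python
import torch
import torch.nn.functional as F

class ArgumentSelector(torch.nn.Module):
    """Toy selector: learns which of n_inputs numbers feed each of the
    two argument slots of a fixed binary operation (here, addition)."""

    def __init__(self, n_inputs: int):
        super().__init__()
        # One logit vector per argument slot, over the candidate inputs.
        self.logits = torch.nn.Parameter(torch.zeros(2, n_inputs))

    def forward(self, numbers: torch.Tensor) -> torch.Tensor:
        probs = F.softmax(self.logits, dim=-1)  # readable off the weights
        idx = probs.argmax(dim=-1)              # most probable input per slot
        return numbers[idx[0]] + numbers[idx[1]]

# With inputs [1932., 100., 1933., 200.], training would push the two slot
# distributions toward indices 1 and 3, so the layer computes 100 + 200.
```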

So, in summary, OccamLlama is able to handle questions where the relevant numbers are dispersed in the text. It can determine which numbers are relevant. We will make this point clearer in the paper.

We appreciate your questions and comments,

The Authors

Comment

I appreciate the authors' response, which addressed all my questions. I have increased my rating to 7. Great work!

Comment

Dear Reviewer,

We are extremely grateful for your thoughtful consideration of our work, and for increasing your score to a 7!

Thank you again,

The Authors

Comment

I thank the authors for their responses.

Compared to LLMs that generate code to solve arithmetic, OccamLLM requires training a decoder and OccamLlama but does not need to fine-tune the LLMs on arithmetic code generation. Are there any other advantages compared to LLMs that generate code to solve arithmetic?

...it is possible to train an OccamLlama with the ability to perform arithmetic correctly...done by simply providing OccamLlama the ten most recent numbers in the text...

Regarding your statement that it is possible to train an OccamLlama to perform arithmetic correctly by simply providing it with the ten most recent numbers in the text, I understand that the input to OccamLlama includes the output of the decoder and the most recent numbers. Could you please clarify if this means that by training OccamLlama, the model can determine which numbers to use in the arithmetic tasks? For example, if the input numbers are '1932, 100, 1933, 200,' would the training allow OccamLlama to learn to use 100 and 200?

Review
Rating: 3

The paper tackles the well-known issue of LLMs performing arithmetic operations. In particular, the authors show the following:

  1. A Llama + OccamNet model can be trained to perform a single arithmetic task.
  2. The OccamNet model and Occam switch can be trained using synthetic data, covering both pure arithmetic tasks and simple reasoning problems.
  3. The joint model outperforms strong LLMs on single arithmetic operations.

Strengths

  1. Adding an external OccamNet avoids finetuning the LLM (which would expose the model to the risk of catastrophic forgetting or becoming less safe).
  2. The manual data collection is either cheap or entirely unnecessary, so the idea can be replicated for other reasoning tasks.
  3. Introducing MultiArith Float is a significant contribution, enabling further research in the same domain.
  4. The method is clearly faster than current LLMs, although only on a very small proportion of tokens.

Weaknesses

  1. My main concern is about novelty. The idea of offloading calculation to another specialized model with a Switch is not new.
  2. The work cannot easily be extended to multi-step calculations. In particular, for a task of length n you probably require an n-layer OccamNet, which imposes a big constraint on the calculations you can actually do.
  3. There is no comparison between OccamNet and another baseline, i.e. a trained Transformer that generates the Python code for the calculations only. The baseline should be easy to train, with no safety risk and no catastrophic forgetting, and it also handles multi-step calculations for free.
  4. The methodology depends on a constrained set of operations and numbers only, and the paper doesn't discuss how OccamLLM, and the Switch in particular, handles ambiguous or uncertain inputs (e.g. "compute one plus two"), making it less relevant for real-world applications.
  5. On math benchmarks, especially on GSM8K, the results are pretty weak. The authors explained the reason, but it's not clear how to fix it without adding more data. In that case, I'm not clear on the benefit of training the OccamNet and Switch compared to adding these data to the LLM pretraining dataset.

Questions

  1. How did you generate the OccamLLM switch training data ("we manually created and labeled several other examples for diverse scenarios, to explicitly teach the system in which cases it should or should not use OccamNet")?
  2. How robust is the model to unseen training data?

Limitations

The authors did some work to address the limitations, but making Llama do the multi-step calculations actually defeats the purpose of the work.

Author Response

Strengths:

Adding an external OccamNet avoids finetuning the LLM (which would expose the model to the risk of catastrophic forgetting or becoming less safe). The manual data collection is either cheap or entirely unnecessary, so the idea can be replicated for other reasoning tasks.

Introducing MultiArith Float is a significant contribution, enabling further research in the same domain.

The method is clearly faster than current LLMs, although only on a very small proportion of tokens.

We appreciate your kind remarks!

Weaknesses:

My main concern is about novelty. The idea of offloading calculation to another specialized model with a Switch is not new.

We acknowledge that offloading computation with a switch is a common procedure in mixture of experts models. However, to our knowledge, our work is unique in that we implement a decoder module which initializes the parameters of a second computational model at each autoregressive step. In our mind, this is closest to a hypernetwork, but unlike a hypernetwork we initialize OccamNet at each autoregressive step instead of just once per input.

We believe this step is quite novel. To use an analogy, if this same idea were used in the context of linear layers (i.e. using one linear layer to initialize another), one arrives at a model similar to attention. In the same way that we believe attention is sufficiently distinct from a linear layer to be considered novel, we believe that our method is sufficiently distinct from other gated computation methods to be considered novel.
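Schematically, the distinction the authors draw could look like the following (names and shapes are our own illustration, not the paper's implementation):

```python
import torch

class OccamDecoderHead(torch.nn.Module):
    """Maps the LLM hidden state at the current token to a fresh set of
    symbolic-model parameters. A classical hypernetwork would run once per
    input; here the mapping runs at every autoregressive step."""

    def __init__(self, hidden_dim: int, occamnet_param_dim: int):
        super().__init__()
        self.proj = torch.nn.Linear(hidden_dim, occamnet_param_dim)

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        return self.proj(hidden_state)  # parameters for this step's OccamNet
```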

The work cannot easily be extended to multi-step calculations. In particular, for a task of length n you probably require an n-layer OccamNet, which imposes a big constraint on the calculations you can actually do.

Thank you for your interest in OccamLLM using a multi-layer OccamNet. We are exploring this direction in future work, and we believe we can train performant OccamLLM systems with a many-layer OccamNet. In the present paper, we simply wish to demonstrate the potential promise of using multiple-layer OccamNets.

We note that our existing methods are already capable of performing arbitrarily complex arithmetic by chaining simpler calculations. Existing language models are generally trained or prompted to reason step-by-step, meaning that their generation can often be supplemented by a single-layer OccamNet. This is evidenced by our reasoning benchmarks, in which OccamLlama (with a single-layer OccamNet) performs quite well across a wide range of tasks.

Additionally, we can further improve OccamLlama systems with single-layer OccamNets by finetuning and/or prompting the model to think step-by-step.

There is no comparison between OccamNet and another baseline, i.e. a trained Transformer that generates the Python code for the calculations only. The baseline should be easy to train, with no safety risk and no catastrophic forgetting, and it also handles multi-step calculations for free.

In the text, we describe what we believe are compelling reasons why OccamLLM works most naturally when representing the probability distribution over functions using OccamNet.

Regarding transformers and RNNs, we believe that OccamNet possesses a key advantage of being interpretable; simply by looking at the weights, it is possible for a human to determine which functions OccamNet assigns a high probability. We believe that this interpretability will make OccamNet easy for a decoder to initialize with the desired distribution. On the other hand, an RNN or transformer has substantially more complex relations between the weights and the corresponding probability distribution, which we hypothesize would make learning a decoder for such models difficult.

This leads us to a key point: transformers and RNNs are effective for modeling complex multimodal distributions, but for this problem, we want to select a single function for each token, so the extra expressivity of these models is unneeded and likely detrimental to performance. We believe that OccamNet, a much simpler architecture, enables better parameter efficiency and performance.

We also tested using the Equation Learner (EQL), but we found it to be unstable. The EQL model represents a single function instead of a probability distribution over functions, and therefore the RL-inspired OccamNet loss cannot be used. The mean squared error loss used in this case requires many adjustments and tricks to enable the model to train, and we find that EQL explores the function space much less efficiently. Thus, it gets stuck in local minima. We did not manage to get good performance with EQL.

Regarding your proposal to replace OccamNet with a transformer which produces Python code for the relevant calculation, we believe this still presents a security risk in that code such as __import__('os').system('rm -rf /') could be generated and run. Although this is a particularly obvious example, it is in general extremely challenging to check general-purpose code for malicious behavior. We believe that one potential solution would be to train a transformer to generate a restricted language that is not Turing-complete, such as a prefix-notation formula. We are happy to run this baseline and report the result if you would like. At the same time, we are confident that, for the reasons listed above, OccamNet will perform better than this alternative.
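For reference, a formula language of the restricted kind mentioned can be evaluated without executing arbitrary code. A minimal sketch of a prefix-notation evaluator with a whitelisted operation set (our illustration, not a baseline the paper ran):

```python
import math

SAFE_OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
            "*": lambda a, b: a * b, "/": lambda a, b: a / b,
            "sqrt": math.sqrt, "log": math.log}
ARITY = {"+": 2, "-": 2, "*": 2, "/": 2, "sqrt": 1, "log": 1}

def eval_prefix(tokens: list) -> float:
    """Evaluate a prefix formula, e.g. ['+', '1', '*', '2', '3'] -> 7.0.
    Only whitelisted operations run, so no arbitrary code execution."""
    tok = tokens.pop(0)
    if tok in SAFE_OPS:
        args = [eval_prefix(tokens) for _ in range(ARITY[tok])]
        return SAFE_OPS[tok](*args)
    return float(tok)
```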

Continued below

Comment

Questions:

How did you generate the OccamLLM switch training data ("we manually created and labeled several other examples for diverse scenarios, to explicitly teach the system in which cases it should or should not use OccamNet")?

Thank you for your question.

The OccamLLM Switch training data comes from different sources. 25% comes from numerical expressions and 5% comes from multi-step reasoning problems, as explained in the text. The remaining 70% was sampled from a set of 27 manually created prompts that we expected to cover most of the cases in which we needed to teach the system when it should and should not use the single-layer OccamNet. Each prompt was sampled hundreds of times with different numerical values to add diversity. Below, we present some of these prompts and explain which cases they cover (a sketch of the resulting token labeling follows the examples). The tokens of the numbers in bold were marked with a 1 to signal that the switch should route to OccamNet, and the rest of the numbers and all of the text tokens were marked with a 0 to signal that OccamNet should not be called in these cases.

The cases we cover with these prompts are:

  1. Single-layer OccamNet should be called to perform two-input arithmetic operations.
  2. OccamNet should not be called to write numbers that already appeared and do not require further arithmetic.
  3. OccamNet should not be called to round numbers.
  4. OccamNet should not be called to perform inline operations nested within a larger expression because it might contradict the assumption that the last two numbers are involved in the operation.
  5. OccamNet should not be used to convert magnitudes when the LLM does not explicitly write the conversion factor and relies on its internal knowledge; without the numeric value of the factor present in the text, OccamNet can't give the correct result.
  6. OccamNet can't compute percentages, as this operation was not included in the training set of operations. It could be done in two steps.
  7. Single-layer OccamNet can't compute a fraction of a number, as this would require two operations.
  8. Single-layer OccamNet can't perform operations with more than 2 arguments.

User query: An author writes 7.6 pages per session. After 9 sessions, the total pages written are

Assistant response: The author writes 7.6 pages per session. After 9 sessions, the author will have written 7.6 * 9 = 68.4 pages. The answer is 68.

Cases: 1, 2, 3


User query: I have 10 oranges and 6 apple trees, each of them with 3 apples. How much fruit do I have?

Assistant response: The total number of pieces of fruit is 10 oranges + 6 trees times 3 apples = 10 + (6 * 3) = 10 + 18 = 28. The answer is 28.

Cases: 1, 2, 4


User query: Sally has 7 pennies and 8 nickels. How many pennies does she have in total?

Assistant response: 1. Convert the number of nickels to pennies: 8 nickels is 40 pennies.

  2. Add the total number of pennies: 40 + 7 = 47.

The answer is 47.

Cases: 1, 2, 5


User query: What is 84 per cent of 230.77?

Assistant response: 84% of 230.77 = 193.85

Explanation:

84 / 100 = 0.84

0.84 * 230.77 = 193.85

Cases: 1, 2, 6


User query: There were 39 cats on a rock. 3 boats came and carried away 5 cats each, and later, 3/7 of the remaining cats ran after a mouse they'd seen. How many cats were left on the rock?

Assistant response: There were originally 39 cats. 3 boats came and each took away 5 cats. So 3 x 5 = 15

39 - 15 = 24

So there were 21 cats left. Then 3/7 of them ran away. 3/7 of 21 is 9

24 - 9 = 15

So there were 15 cats left.

Cases: 1, 2, 7


User query: 11 people have 2 apples, 2 oranges and 2 peaches each. How many pieces of fruit do they have combined?

Assistant response: Each person has 2 + 2 + 2 = 6 pieces of fruit.

In total, they have 6 * 11 = 66.

The answer is 66.

Cases: 1, 2, 8


Continued below

Comment

How robust is the model to unseen training data?

We find that OccamLlama displays remarkable generalization capabilities on out-of-distribution problems. To demonstrate this, we train the OccamNet decoder from scratch, using only numeric expressions and absolutely no text at all. This means that any problem with text, such as a word problem, is far out of distribution of the OccamNet decoder's training data. We test this model (using the standard router) on the mathematical reasoning benchmarks and obtain remarkably good results, shown visually in Figure b) of the global response PDF and the table below. The OccamLlama 8B decoder trained only with numerical expressions performs on par with the model trained with both numbers and text, even achieving higher accuracy on some benchmarks. This shows that the OccamLLM framework is robust, and points towards the fact that the representations of arithmetic that are built in the transformer body of the LLM and extracted by the OccamLLM Decoder are very general.

Table: Accuracy on reasoning tasks. Higher is better.

| Dataset | OccamLlama | OccamLlama Arith | Llama 2 7B Chat | Llama 3 8B Instruct | GPT 3.5 Turbo | GPT 4o | GPT 4o Code |
|---|---|---|---|---|---|---|---|
| AddSub | 91.6 ± 1.4 | 92.7 ± 1.3 | 78.0 ± 2.1 | 93.4 ± 1.2 | 95.4 ± 1.1 | 97.0 ± 0.9 | 97.5 ± 1.1 |
| GSM8K | 73.5 ± 1.2 | 71.6 ± 1.2 | 36.0 ± 1.3 | 79.8 ± 1.1 | 84.8 ± 1.0 | 96.1 ± 0.5 | 94.0 ± 1.7 |
| MultiArith | 99.2 ± 0.4 | 98.5 ± 0.5 | 76.0 ± 1.7 | 99.8 ± 0.2 | 97.2 ± 0.7 | 99.7 ± 0.2 | 99.5 ± 0.5 |
| MultiArith Float | 98.2 ± 0.5 | 95.3 ± 0.9 | 23.3 ± 1.7 | 57.3 ± 2.0 | 77.3 ± 1.7 | 96.2 ± 0.8 | 89.5 ± 2.2 |
| MATH401 | 85.0 ± 1.8 | 85.8 ± 1.7 | 43.9 ± 2.5 | 60.3 ± 2.4 | 63.1 ± 2.4 | 76.6 ± 2.1 | 78.0 ± 2.9 |
| Single Eq | 92.9 ± 1.1 | 92.1 ± 1.2 | 79.1 ± 1.8 | 96.3 ± 0.8 | 97.8 ± 0.6 | 98.0 ± 0.6 | 99.0 ± 0.7 |
| SVAMP | 88.6 ± 1.0 | 88.8 ± 1.0 | 61.5 ± 1.5 | 86.3 ± 1.1 | 87.8 ± 1.0 | 96.2 ± 0.6 | 96.5 ± 1.3 |
| Average | 89.9 ± 1.1 | 89.3 ± 1.1 | 56.8 ± 1.8 | 81.9 ± 1.3 | 86.2 ± 1.2 | 94.2 ± 0.8 | 93.4 ± 1.5 |

In contrast, we expect that finetuning Llama to perform arithmetic using only numeric examples and no text whatsoever would lead to extreme catastrophic forgetting and poor arithmetic performance on word problems. As such, we believe this data shows a remarkable robustness of OccamLLM.

As a second example of generalization, we tested OccamLlama on the Multilingual Grade School Math Benchmark (MGSM), a dataset consisting of GSM8K translated into 10 languages (Bengali, Chinese, French, German, Japanese, Russian, Spanish, Swahili, Telugu, and Thai). We compute the drop in accuracy when switching from English to another language, given by the accuracy of a model on the English dataset minus the accuracy of the model on the dataset in a given language. For OccamLlama in Bengali and Telugu, due to time constraints, we evaluated on a subset of 100 and 50 randomly chosen problems, respectively. We will update the table below with the full results soon. We show the results visually in Figure c) of the global response PDF and in the table below:

Table: Accuracy drop on multilingual reasoning tasks. Lower is better.

| Dataset | OccamLlama | Llama 3 8B Instruct |
|---|---|---|
| Bengali | 27.0 ± 5.8 | 43.2 ± 3.9 |
| Chinese | 21.2 ± 4.3 | 17.6 ± 3.7 |
| French | 13.6 ± 4.2 | 13.6 ± 3.7 |
| German | 14.0 ± 4.2 | 11.6 ± 3.6 |
| Japanese | 47.2 ± 3.9 | 63.6 ± 3.4 |
| Russian | 15.2 ± 4.2 | 12.8 ± 3.6 |
| Spanish | 5.6 ± 4.1 | 9.2 ± 3.5 |
| Swahili | 32.0 ± 4.2 | 38.4 ± 3.9 |
| Telugu | 56.0 ± 6.0 | 61.6 ± 3.5 |
| Thai | 20.0 ± 4.3 | 21.2 ± 3.8 |
| Average | 38.0 ± 4.2 | 29.3 ± 3.7 |

The table above shows that OccamLlama has, on average, a smaller performance drop than Llama between the English dataset and the non-English datasets. The fact that OccamLlama (the decoders for which have never been trained on other languages) has on average better out-of-distribution behavior than Llama (a model trained on over 750 billion tokens of non-English text) is in our opinion quite remarkable.

We believe that these two tests demonstrate OccamLlama's robustness against out-of-distribution data.

Limitations:

The authors did some work to address the limitations, but making Llama do the multi-step calculations actually defeats the purpose of the work.

We respectfully disagree. OccamLlama ensures that arithmetic is correct. As we mention above, most language models already perform step-by-step arithmetic, so OccamLlama does not use any more tokens than a normal language model generation.


We are very grateful for your comments and suggestions. We hope you will agree that the additional experiments and clarifications will strengthen our paper.

Comment

The methodology depends on a constrained set of operations and numbers only, and the paper doesn't discuss how OccamLLM, and the Switch in particular, handles ambiguous or uncertain inputs (e.g. "compute one plus two"), making it less relevant for real-world applications.

We understand your concern about less structured arithmetic problems, and we are happy to assert that OccamLlama is capable of solving these problems. For example, below we show input-output pairs of basic queries to OccamLlama that demonstrate its ability to perform arithmetic even when numbers are provided as words:

  • Six minus seven =? -1\n\nSix minus seven is equal to -1.
  • Three + four = ? 7\n\n3 + 4 = 7\n\nLet me know if you have any other questions!

The above calculations are computed correctly because we use a number parser that parses strings such as 'three' and 'six' as numbers. One can change OccamNet's behavior on tasks such as this by changing the number parser. We did not parse words as numbers when performing our benchmarks, but we do not expect that this would change OccamLlama's performance substantially.
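A minimal sketch of the word-to-number normalization described (illustrative only; the actual parser's coverage and behavior are not documented in this thread):

```python
WORD_NUMBERS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
                "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9,
                "ten": 10}

def normalize_numbers(text: str) -> str:
    """Replace spelled-out numbers with digits so the numeric parser
    (and hence OccamNet) can pick them up."""
    return " ".join(str(WORD_NUMBERS.get(w.lower(), w)) for w in text.split())

# normalize_numbers("Six minus seven =")  ->  "6 minus 7 ="
```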

On math benchmarks, especially on GSM8K, the results are pretty weak. The authors explained the reason, but it's not clear how to fix it without adding more data. In that case, I'm not clear on the benefit of training the OccamNet and Switch compared to adding these data to the LLM pretraining dataset.

Thank you for your comment. Since submitting to NeurIPS, we identified a few minor bugs in our code which were limiting performance and slightly reweighted the proportions of the different training datasets. These changes maintained the perfect performance on arithmetic tasks and improved performance on mathematical reasoning tasks. The updated results are shown visually in Figure a) of the global response PDF and the table below:

Table: Accuracy on reasoning tasks. Higher is better.

| Dataset | OccamLlama 8B | OccamLlama 70B | Llama 3 8B Instruct | GPT 3.5 Turbo | GPT 4o | GPT 4o Code |
|---|---|---|---|---|---|---|
| AddSub | 91.6 ± 1.4 | 98.0 ± 1.4 | 93.4 ± 1.2 | 95.4 ± 1.1 | 97.0 ± 0.9 | 97.5 ± 1.1 |
| GSM8K | 73.5 ± 1.2 | 87.0 ± 3.4 | 79.8 ± 1.1 | 84.8 ± 1.0 | 96.1 ± 0.5 | 94.0 ± 1.7 |
| MultiArith | 99.2 ± 0.4 | 100.0 ± 0.0 | 99.8 ± 0.2 | 97.2 ± 0.7 | 99.7 ± 0.2 | 99.5 ± 0.5 |
| MultiArith Float | 98.2 ± 0.5 | 99.0 ± 1.0 | 57.3 ± 2.0 | 77.3 ± 1.7 | 96.2 ± 0.8 | 89.5 ± 2.2 |
| MATH401 | 85.0 ± 1.8 | 91.0 ± 2.9 | 60.3 ± 2.4 | 63.1 ± 2.4 | 76.6 ± 2.1 | 78.0 ± 2.9 |
| Single Eq | 92.9 ± 1.1 | 93.0 ± 2.6 | 96.3 ± 0.8 | 97.8 ± 0.6 | 98.0 ± 0.6 | 99.0 ± 0.7 |
| SVAMP | 88.6 ± 1.0 | 96.0 ± 2.0 | 86.3 ± 1.1 | 87.8 ± 1.0 | 96.2 ± 0.6 | 96.5 ± 1.3 |
| Average | 89.9 ± 1.1 | 94.9 ± 1.9 | 81.9 ± 1.3 | 86.2 ± 1.2 | 94.2 ± 0.8 | 93.4 ± 1.5 |

Here, we see that OccamLlama outperforms even GPT 4o and GPT 4o with Code on benchmarks requiring challenging arithmetic (MultiArith Float and MATH401), areas where OccamLlama shines. Additionally, the gap between OccamLlama and Llama 3 8B is substantially smaller than before on GSM8K. On average, OccamLlama outperforms both Llama and GPT 3.5 Turbo, with accuracy approximately halfway between that of GPT 3.5 and GPT 4o.

We also find that the performance of OccamLlama on reasoning benchmarks improves dramatically as we increase the base model from Llama 3 8B to Llama 3 70B.

To ensure we had results in time for the rebuttal deadline, we sampled 100 random questions from each dataset to determine OccamLlama 70B's performance, which is the cause for the relatively large error bars. We will update this response with the full results once they complete. The results are shown visually in Figure d) of the global response PDF and the table above.

These results demonstrate that OccamLLM is capable of robust performance on large-scale models. Training the OccamLLM 70B system (in 16-bit precision and unoptimized code) required approximately 1.5-2 days on a system with two A100 nodes, demonstrating the low compute requirements of our method.

OccamLlama 70B shows consistent improvement over OccamLlama 8B. We believe this results from a combination of Llama 3 70B's improved reasoning capabilities and Llama 3 70B's improved representations (which enable OccamLlama to generalize more effectively). We expect that training with even larger base models would lead to further improvements. Remarkably, OccamLlama 70B outperforms GPT 4o and GPT 4o with Code on average across these benchmarks. It also outperforms GPT 3.5 Turbo on all but Single Eq.

Continued below

Comment

Dear Reviewer,

We are writing to provide updated results from our completed evaluation runs. Below, we show the final results for OccamLlama 70B. We note that OccamLlama 70B maintains its strong performance when tested on the complete datasets. It still outperforms both GPT 4o and GPT 4o with Code on average. It also improves substantially compared to the 100-datapoint performance estimate on Single Eq. OccamLlama outperforms GPT 3.5 Turbo on all datasets except Single Eq, where it is less than one percentage point behind and within error bars.

| Dataset | OccamLlama 8B | OccamLlama 70B | Llama 3 8B Instruct | Llama 3 70B Instruct | GPT 3.5 Turbo | GPT 4o | GPT 4o Code |
|---|---|---|---|---|---|---|---|
| AddSub | 91.6 ± 1.4 | 96.5 ± 0.9 | 93.4 ± 1.2 | 97.2 ± 0.8 | 95.4 ± 1.1 | 97.0 ± 0.9 | 97.5 ± 1.1 |
| GSM8K | 73.5 ± 1.2 | 90.1 ± 0.8 | 79.8 ± 1.1 | 94.8 ± 0.6 | 84.8 ± 1.0 | 96.1 ± 0.5 | 94.0 ± 1.7 |
| MultiArith | 99.2 ± 0.4 | 98.5 ± 0.5 | 99.8 ± 0.2 | 98.2 ± 0.5 | 97.2 ± 0.7 | 99.7 ± 0.2 | 99.5 ± 0.5 |
| MultiArith Float | 98.2 ± 0.5 | 97.7 ± 0.6 | 57.3 ± 2.0 | 76.3 ± 1.7 | 77.3 ± 1.7 | 96.2 ± 0.8 | 89.5 ± 2.2 |
| MATH401 | 85.0 ± 1.8 | 89.5 ± 1.5 | 60.3 ± 2.4 | 71.6 ± 2.3 | 63.1 ± 2.4 | 76.6 ± 2.1 | 78.0 ± 2.9 |
| Single Eq | 92.9 ± 1.1 | 96.9 ± 0.8 | 96.3 ± 0.8 | 97.6 ± 0.7 | 97.8 ± 0.6 | 98.0 ± 0.6 | 99.0 ± 0.7 |
| SVAMP | 88.6 ± 1.0 | 93.2 ± 0.8 | 86.3 ± 1.1 | 94.5 ± 0.7 | 87.8 ± 1.0 | 96.2 ± 0.6 | 96.5 ± 1.3 |
| Average | 89.9 ± 1.2 | 94.6 ± 0.9 | 81.9 ± 1.5 | 90.0 ± 1.2 | 86.2 ± 1.4 | 94.2 ± 1.0 | 93.4 ± 1.7 |

Above, we also include results from Llama 3 70B Instruct. OccamLlama 70B substantially outperforms Llama 3 70B on MultiArith Float and MATH401. It is roughly the same (1.3 percentage point deviation or less) as Llama 3 70B on all other datasets (ones with arithmetic simple enough for Llama to perform satisfactorily) except for GSM8K, which demonstrates OccamLlama's ability to not interfere with reasoning while still assisting with arithmetic. On average, OccamLlama 70B outperforms Llama 3 70B by nearly five percentage points, and OccamLlama 8B approximately matches Llama 3 70B's average performance.

We also provide the final table for the MGSM dataset:

| Dataset | OccamLlama | Llama 3 8B Instruct |
|---|---|---|
| Bengali | 31.6 ± 4.2 | 43.2 ± 3.9 |
| Chinese | 21.2 ± 4.3 | 17.6 ± 3.7 |
| French | 13.6 ± 4.2 | 13.6 ± 3.7 |
| German | 14.0 ± 4.2 | 11.6 ± 3.6 |
| Japanese | 47.2 ± 3.9 | 63.6 ± 3.4 |
| Russian | 15.2 ± 4.2 | 12.8 ± 3.6 |
| Spanish | 5.6 ± 4.1 | 9.2 ± 3.5 |
| Swahili | 32.0 ± 4.2 | 38.4 ± 3.9 |
| Telugu | 48.4 ± 3.9 | 61.6 ± 3.5 |
| Thai | 20.0 ± 4.3 | 21.2 ± 3.8 |
| Average | 24.9 ± 4.2 | 29.3 ± 3.7 |

We note that previously, there was a typo in this table: the averages incorrectly showed that OccamLlama had a larger performance drop than Llama on average. We apologize for any confusion. Our verbal statements in the rebuttal were correct, and Figure c) in the general rebuttal PDF had the correct averages for the previous data. Other than this typo, the table above closely resembles the previously reported table, so we do not comment further.

Thank you for your consideration.

Comment

Dear Reviewer,

In discussions with another reviewer, that reviewer found the following clarifications helpful, leading them to raise their score to a 7. We are sharing them here in case they help your consideration of our rebuttal.

We believe that OccamLLM provides numerous advantages relative to code generation:

  • Improved generation speed and cost: LLM code generation requires generating tokens for code, leading to the use of potentially many more tokens than are needed to respond to a query. This can make LLMs with code generation (unlike OccamLLM) slow and expensive. For example, in the arithmetic benchmarks, we found that GPT 4o with code required on average more than 50 completion tokens, far more than the single autoregressive step required by OccamLlama to answer each question. Additionally, our total expenses for GPT 4o with code were more than double those for GPT 4o without code, even though we made almost 5 times fewer queries to GPT 4o with code, suggesting that GPT 4o with code is approximately 10 times more expensive per query than GPT 4o. Since OccamLLM does not require any additional autoregressive steps of the LLM, this suggests that OccamLLM is faster and cheaper than code generation.
  • Better results: The LLM code generation method we compared against, GPT 4o with Code Interpreter, actually performed worse than GPT 4o on average. We believe these results demonstrate that, for arithmetic, LLM code generation is not an optimal solution. OccamLlama 70B outperforms GPT 4o with code on our benchmarks, suggesting that OccamLLM provides a better solution.
  • No catastrophic forgetting: Code generation generally requires fine-tuning an LLM to write and run code. This risks catastrophic forgetting.
  • Improved security: Running LLM-generated code is potentially unsafe. For example, the LLM could ask to run the Python code __import__('os').system('rm -rf /'). It is in general extremely challenging to check code for malicious behavior. OccamLLM does not run arbitrary code, so it avoids these risks.
  • Reduced engineering: For the reasons mentioned in the previous bullet, building a secure system for LLM code execution, such as by hosting an LLM and a sandbox to safely run code, requires substantial engineering effort. Users can simply download our OccamLLM models and run them; no engineering is required.

As a result of OccamLlama's improved speed, cost, performance, security, and ease of use, we believe it is a compelling alternative to LLM code generation for arithmetic. We will make these advantages clearer in our paper.

One other attractive feature we clarified is that OccamLlama is able to handle questions where the relevant numbers are dispersed in the text. It can determine which numbers are relevant in the text and perform arithmetic on those numbers.

Thank you again,

The Authors

Comment

I want to thank authors for the very detailed explanation.

In our mind, this is closest to a hypernetwork, but unlike a hypernetwork we initialize OccamNet at each autoregressive step instead of just once per input.

This is a very strong claim, which cannot be sustained. In particular, there is no guarantee that your additional model is correct (it is not achieving 100% on your arithmetic benchmark), and I suspect that a parser with implicit code execution capability would still do the job. I'm not saying that the real-world applications are limited, but this method doesn't guarantee 100% precision on a subject that models are not designed to perform well on.

Regarding transformers and RNNs, we believe that OccamNet possesses a key advantage of being interpretable; simply by looking at the weights, it is possible for a human to determine which functions OccamNet assigns a high probability. We believe that this interpretability will make OccamNet easy for a decoder to initialize with the desired distribution. On the other hand, an RNN or transformer has substantially more complex relations between the weights and the corresponding probability distribution, which we hypothesize would make learning a decoder for such models difficult.

Although this is a particularly obvious example, it is in general extremely challenging to check general-purpose code for malicious behavior. We believe that one potential solution would be to train a transformer to generate a restricted language that is not Turing-complete, such as a prefix-notation formula. We are happy to run this baseline and report the result if you would like.

My feeling is that we are over-engineering a problem which can be solved just by training a model to produce Python code (and the model is different, so no safety problem would occur).

We understand your concern about less structured arithmetic problems, and we are happy to assert that OccamLlama is capable of solving these problems. For example, below we show input-output pairs of basic queries to OccamLlama that demonstrate its ability to perform arithmetic even when numbers are provided as words:

Thanks for this.

Below, we present some of these prompts and explain which cases they cover. The tokens of the numbers in bold were marked with a 1 to signal that the switch should route to OccamNet, and the rest of the numbers and all of the text tokens were marked with a 0 to signal that OccamNet should not be called in these cases.

Thanks for providing the data. From this data (in particular the cats example), it's very clear that a parser (of the raw data or the generated tokens) would probably work better. So the innovation would still be in the Switch, although this is not new in the literature.

OccamLlama ensures that arithmetic is correct.

This is not correct, as authors showed. Occam cannot perform the calculation at 100% correctness, defeating the overall purpose.

datasets [...] except for GSM8K, which demonstrates OccamLlama's ability to not interfere with reasoning while still assisting with arithmetic

This is a bold claim. OccamLlama performs 6% worse than Llama 8B and 4% worse than Llama 70B. So there is some interference from Occam on the hardest elementary math benchmark for LLMs.

Comment

Author comment: In our mind, this is closest to a hypernetwork, but unlike a hypernetwork we initialize OccamNet at each autoregressive step instead of just once per input.

Reviewer comment: This is a very strong claim, which cannot be sustained. In particular, there is no guarantee that your additional model is correct (it is not achieving 100% on your arithmetic benchmark), and I suspect that a parser with implicit code execution capability would still do the job. I'm not saying that the real-world applications are limited, but this method doesn't guarantee 100% precision on a subject that models are not designed to perform well on.

We apologize for any misunderstanding. We agree that there is no guarantee our additional model is correct. The connection we see with hypernetworks is that both our method and hypernetworks use one model to initialize another.

Reviewer comment: My feeling is that we are over-engineering a problem which can be solved just by training a model to produce Python code (and the model is different, so no safety problem would occur).

Thank you for your comment. We agree that it is viable to use a symbolic model which produces Python code. In fact, in our paper, we included such a model as a valid OccamLLM approach: "In principle, OccamLLM can work with any symbolic model, i.e., any model that can parameterize a set of symbolic functions or a distribution over such functions. We choose OccamNet as opposed to, for example, a transformer...." We consider the key contributions of our paper the architecture in which a decoder initializes a symbolic model and a switch, and we put forward OccamNet as a reasonable choice for such a symbolic model, among other potential symbolic models. We will clarify this point in the paper.

Reviewer comment: Thanks for providing the data. From this data (in particular the cats example), it's very clear that a parser (of the raw data or the generated tokens) would probably work better. So the innovation would still be in the Switch, although this is not new in the literature.

Thank you for your comments. We understand why one would arrive at this conclusion from the training data we provided. However, we note that OccamLlama works for problems expressed in words, which cannot easily be solved by a math parser. For example, OccamLlama can solve the question "Alice has 3 apples and Bob has four apples. How many apples do they have together?" without writing any equations, demonstrating that it goes beyond the ability of a parser. We will emphasize this in the paper.

Author comment: OccamLlama ensures that arithmetic is correct.

Reviewer comment: This is not correct, as authors showed. Occam cannot perform the calculation at 100% correctness, defeating the overall purpose.

Thank you for your comment, and we agree that OccamLLM can make mistakes in choosing which operation to perform. We simply meant here that OccamLLM performs arithmetic operations which are exact, as opposed to operations performed internally by the LLM which are approximations. We find that when OccamLLM is trained, it can almost always sustain correct arithmetic through many steps of reasoning, leading to exactly correct answers.

In practice, the cases where OccamLlama fails appear to almost always be due to an imperfect switch. When the switch triggers correctly, the decoder almost always works correctly. We emphasize that the key contribution of our paper is not the switch; as reviewers have pointed out, there are probably many other people training switches. As such, the switch could be improved in future work, and we did not focus heavily on it here.

Author comment: datasets [...] except for GSM8K, which demonstrates OccamLlama's ability to not interfere with reasoning while still assisting with arithmetic

Reviewer comment: This is a bold claim. OccamLlama performs 6% worse than Llama 8B and 4% worse than Llama 70B. So there is some interference from Occam on the hardest elementary math benchmark for LLMs.

We apologize for any misunderstanding. We agree that there is some interference. We simply meant that OccamLlama is able to assist with arithmetic while not interfering with reasoning, not that it always does. We acknowledge that OccamLlama is not perfect, and we have plans to improve it through self-supervised training in future work. As we mentioned above, OccamLlama's mistakes stem mainly from the switch, which is not the focus of our work.


While we understand your concerns, we hope you will consider increasing your score.

Thank you,

The Authors

审稿意见
6

This paper proposes OccamLLM, an approach for giving models better arithmetic abilities. Although LLMs have impressive generative abilities, they struggle with simple arithmetic. The paper therefore integrates OccamNet, a symbolic neural architecture that can perform arithmetic operations, into the decoding process of the LLM. The approach adds two further steps to the decoding process: first, a text parser extracts all numeric values from the input text, which are fed into OccamNet, which evaluates the expression; second, a decoder determines whether to use the language model or OccamNet for generating the next token. Both systems are trained: OccamNet with synthetic examples from templates, and the selection decoder with several manually annotated examples. The paper then evaluates OccamLLM on simple arithmetic tasks and finds that it achieves 100% accuracy on these 7-digit calculations, while the base Llama 3 8B model struggles with these computations. Further, the system is applied to general reasoning tasks, as well as 2-step and 3-step arithmetic tasks.

优点

  • S1) The proposed method offers some valuable advantages over existing approaches. It does not need to update the model weights and therefore will not suffer from catastrophic forgetting (as integrating LLM tools may), while being more computationally efficient than code generation, which requires two stages and intermediate code that can be 50x longer.

  • S2) For single-step arithmetic with 7-digit inputs, OccamLLM is very robust and is always able to generate the correct answer, while the baseline Llama 3 8B model struggles with these calculations.

  • S3) The use of an OccamNet is highly interpretable, as one can look at the predicted computational DAG, which can also be useful for interpreting why a calculation goes wrong when it does.

缺点

  • W1) When applied to word-based mathematical reasoning problems, the system only performs better than its baseline (Llama3-8B) when high precision is required (e.g. MultiArith Float and MATH401) but otherwise performance is quite similar.

  • W2) The strongest results shown are for 1-step arithmetic; however, this is a very simple setup and is matched to the training methodology. Whether OccamNet generalises well to similar 1-step arithmetic when embedded in more complicated text-based questions, closer to the results in Section 4.2, is less apparent.

  • W3) As mentioned in the limitation, many of the results focus on single 1-step arithmetic problems, and although results for 2 and 3 step are still reasonable, there is performance degradation. I’d speculate that the code-based LLMs may not suffer from such issues.

问题

How did you generate the training and evaluation dataset, and what templates were used? I’m aware that Appendix A.1 details split sizes and the type of questions, but are all the queries just the raw calculation, e.g. input=’1254+854’ (+ all the prompting such as putting ‘Answer:’ or requesting to give the response in decimals).

局限性

The authors have adequately discussed limitations

作者回复

Strengths:

S1) The proposed method offers some valuable advantages over existing approaches. It does not need to update the model weights and therefore will not suffer from catastrophic forgetting (like integrating LLM tools may), while being more computationally efficient than code generation which will require 2 stages and intermediate code that can be 50x longer.

S2) For single step arithmetic with 7-digit inputs, the OccamLLM is very robust and is always able to generate the correct answer, while the baseline LLama3 8B model struggles with these calculations.

S3) The use of an OccamNet is highly interpretable, as one can look at the predicted computational DAG, which can also be useful for interpreting why a calculation goes wrong when it does.

We appreciate your kind remarks!

Weaknesses:

W1) When applied to word-based mathematical reasoning problems, the system only performs better than its baseline (Llama3-8B) when high precision is required (e.g. MultiArith Float and MATH401) but otherwise performance is quite similar.

We acknowledge that the OccamLLM framework does not bring substantial improvements in general reasoning tasks that do not require challenging arithmetic. Indeed, the target of this system is to ensure that the arithmetic is performed correctly, as incorrect arithmetic disturbs the reasoning process of LLMs. It is expected that the improvements provided by this method are only prominent in tasks that require challenging arithmetic operations; otherwise, the LLM does not require OccamNet and can successfully follow its natural reasoning path. The reasoning heavy-lifting is still performed by the LLM, and OccamNet assists it to ensure that the computations are accurate and the reasoning can proceed satisfactorily.

Since submitting to NeurIPS, we identified a few minor bugs in our code which were limiting performance and slightly reweighted the proportions of the different training datasets. These changes maintained the perfect performance on arithmetic tasks and improved performance on mathematical reasoning tasks. The updated results are shown visually in Figure a) of the global response PDF, and we also present them in the table below:

Table: Accuracy on reasoning tasks. Higher is Better.

| Dataset | OccamLlama | Llama 3 8B Instruct | GPT 3.5 Turbo | GPT 4o | GPT 4o Code |
|---|---|---|---|---|---|
| AddSub | 91.6 ± 1.4 | 93.4 ± 1.2 | 95.4 ± 1.1 | 97.0 ± 0.9 | 97.5 ± 1.1 |
| GSM8K | 73.5 ± 1.2 | 79.8 ± 1.1 | 84.8 ± 1.0 | 96.1 ± 0.5 | 94.0 ± 1.7 |
| MultiArith | 99.2 ± 0.4 | 99.8 ± 0.2 | 97.2 ± 0.7 | 99.7 ± 0.2 | 99.5 ± 0.5 |
| MultiArith Float | 98.2 ± 0.5 | 57.3 ± 2.0 | 77.3 ± 1.7 | 96.2 ± 0.8 | 89.5 ± 2.2 |
| MATH401 | 85.0 ± 1.8 | 60.3 ± 2.4 | 63.1 ± 2.4 | 76.6 ± 2.1 | 78.0 ± 2.9 |
| Single Eq | 92.9 ± 1.1 | 96.3 ± 0.8 | 97.8 ± 0.6 | 98.0 ± 0.6 | 99.0 ± 0.7 |
| SVAMP | 88.6 ± 1.0 | 86.3 ± 1.1 | 87.8 ± 1.0 | 96.2 ± 0.6 | 96.5 ± 1.3 |
| Average | 89.9 ± 1.1 | 81.9 ± 1.3 | 86.2 ± 1.2 | 94.2 ± 0.8 | 93.4 ± 1.5 |

Here, we see that OccamLlama outperforms even GPT 4o and GPT 4o with Code on benchmarks requiring challenging arithmetic (MultiArith Float and MATH401), areas where OccamLlama shines. Additionally, the gap between OccamLlama and LLama 3 8B is substantially smaller on GSM8K. On average, OccamLlama outperforms both Llama and GPT 3.5 Turbo, with accuracy approximately halfway between that of GPT 3.5 and GPT 4o.

We also find that the performance of OccamLlama on reasoning benchmarks improves dramatically as we increase the base model from Llama 3 8B to Llama 3 70B (OccamLlama 70B).

To ensure we had results in time for the rebuttal deadline, we sampled 100 random questions from each dataset to determine OccamLlama 70B's performance, which is the cause for the relatively large error bars. We will update this response with the full results once they complete. The results are shown visually in Figure d) of the global response PDF and are also presented in the table below:

Table: Accuracy on reasoning tasks. Higher is better.

| Dataset | OccamLlama 8B | OccamLlama 70B | Llama 3 8B Instruct | GPT 3.5 Turbo | GPT 4o | GPT 4o Code |
|---|---|---|---|---|---|---|
| AddSub | 91.6 ± 1.4 | 98.0 ± 1.4 | 93.4 ± 1.2 | 95.4 ± 1.1 | 97.0 ± 0.9 | 97.5 ± 1.1 |
| GSM8K | 73.5 ± 1.2 | 87.0 ± 3.4 | 79.8 ± 1.1 | 84.8 ± 1.0 | 96.1 ± 0.5 | 94.0 ± 1.7 |
| MultiArith | 99.2 ± 0.4 | 100.0 ± 0.0 | 99.8 ± 0.2 | 97.2 ± 0.7 | 99.7 ± 0.2 | 99.5 ± 0.5 |
| MultiArith Float | 98.2 ± 0.5 | 99.0 ± 1.0 | 57.3 ± 2.0 | 77.3 ± 1.7 | 96.2 ± 0.8 | 89.5 ± 2.2 |
| MATH401 | 85.0 ± 1.8 | 91.0 ± 2.9 | 60.3 ± 2.4 | 63.1 ± 2.4 | 76.6 ± 2.1 | 78.0 ± 2.9 |
| Single Eq | 92.9 ± 1.1 | 93.0 ± 2.6 | 96.3 ± 0.8 | 97.8 ± 0.6 | 98.0 ± 0.6 | 99.0 ± 0.7 |
| SVAMP | 88.6 ± 1.0 | 96.0 ± 2.0 | 86.3 ± 1.1 | 87.8 ± 1.0 | 96.2 ± 0.6 | 96.5 ± 1.3 |
| Average | 89.9 ± 1.1 | 94.9 ± 1.9 | 81.9 ± 1.3 | 86.2 ± 1.2 | 94.2 ± 0.8 | 93.4 ± 1.5 |

These results demonstrate that OccamLLM is capable of robust performance on large-scale models. Training the OccamLLM 70B system (in 16-bit LLM precision and unoptimized code) required approximately 1.5-2 days on a system with two A100 nodes, demonstrating the low compute requirements of our method.

OccamLlama 70B shows consistent improvement over OccamLlama 8B. We believe this results from a combination of Llama 3 70B's improved reasoning capabilities and Llama 3 70B's improved representations (which enable OccamLlama to generalize more effectively). We expect that training with even larger base models would lead to further improvements. Remarkably, OccamLlama 70B outperforms GPT 4o and GPT 4o with Code on average across these benchmarks. It also outperforms GPT 3.5 Turbo on all but Single Eq.

Continued below

评论

W2) The strongest results shown are for 1-step arithmetic; however, this is a very simple setup and is matched to the training methodology. Whether OccamNet generalises well to similar 1-step arithmetic when embedded in more complicated text-based questions, closer to the results in Section 4.2, is less apparent.

Thank you for your comment. We believe that the mathematical reasoning benchmarks provide experimental validation that OccamLlama can effectively handle more complicated text-based questions. These problems are multi-step math problems requiring multiple calls to OccamNet to be completed successfully. Given OccamLlama's strong performance on these benchmarks, we hope you will agree that OccamLlama can effectively handle compound expressions. We will make this clearer in the paper.

Regarding generalization, we find that OccamLlama displays remarkable generalization capabilities on out-of-distribution problems. To demonstrate this, we train the OccamNet decoder from scratch using only numeric expressions and no text at all. This means that any problem with text, such as a word problem, is far outside the distribution of the OccamNet decoder's training data. We test this model (using the standard switch) on the mathematical reasoning benchmarks and obtain remarkably good results, shown visually in Figure b) of the global response PDF and presented in the table below. The OccamLlama 8B decoder trained only on numerical expressions performs on par with the model trained on both numbers and text, even achieving higher accuracy on some benchmarks. This shows that the OccamLLM framework is robust, and it suggests that the representations of arithmetic built in the transformer body of the LLM and extracted by the OccamLLM Decoder are very general.

Table: Accuracy on reasoning tasks. Higher is Better.

| Dataset | OccamLlama | OccamLlama Arith | Llama 2 7B Chat | Llama 3 8B Instruct | GPT 3.5 Turbo | GPT 4o | GPT 4o Code |
|---|---|---|---|---|---|---|---|
| AddSub | 91.6 ± 1.4 | 92.7 ± 1.3 | 78.0 ± 2.1 | 93.4 ± 1.2 | 95.4 ± 1.1 | 97.0 ± 0.9 | 97.5 ± 1.1 |
| GSM8K | 73.5 ± 1.2 | 71.6 ± 1.2 | 36.0 ± 1.3 | 79.8 ± 1.1 | 84.8 ± 1.0 | 96.1 ± 0.5 | 94.0 ± 1.7 |
| MultiArith | 99.2 ± 0.4 | 98.5 ± 0.5 | 76.0 ± 1.7 | 99.8 ± 0.2 | 97.2 ± 0.7 | 99.7 ± 0.2 | 99.5 ± 0.5 |
| MultiArith Float | 98.2 ± 0.5 | 95.3 ± 0.9 | 23.3 ± 1.7 | 57.3 ± 2.0 | 77.3 ± 1.7 | 96.2 ± 0.8 | 89.5 ± 2.2 |
| MATH401 | 85.0 ± 1.8 | 85.8 ± 1.7 | 43.9 ± 2.5 | 60.3 ± 2.4 | 63.1 ± 2.4 | 76.6 ± 2.1 | 78.0 ± 2.9 |
| Single Eq | 92.9 ± 1.1 | 92.1 ± 1.2 | 79.1 ± 1.8 | 96.3 ± 0.8 | 97.8 ± 0.6 | 98.0 ± 0.6 | 99.0 ± 0.7 |
| SVAMP | 88.6 ± 1.0 | 88.8 ± 1.0 | 61.5 ± 1.5 | 86.3 ± 1.1 | 87.8 ± 1.0 | 96.2 ± 0.6 | 96.5 ± 1.3 |
| Average | 89.9 ± 1.1 | 89.3 ± 1.1 | 56.8 ± 1.8 | 81.9 ± 1.3 | 86.2 ± 1.2 | 94.2 ± 0.8 | 93.4 ± 1.5 |

In contrast, we expect that finetuning Llama to perform arithmetic using only numeric examples and no text whatsoever would lead to extreme catastrophic forgetting and poor arithmetic performance on word problems. As such, we believe this data shows a remarkable robustness of OccamLLM.

As a second example of generalization, we tested OccamLlama on the Multilingual Grade School Math benchmark (MGSM), a dataset consisting of GSM8K translated into 10 languages (Bengali, Chinese, French, German, Japanese, Russian, Spanish, Swahili, Telugu, and Thai). We compute the drop in accuracy when switching from English to another language, given by a model's accuracy on the English dataset minus its accuracy on the dataset in the given language. For OccamLlama in Bengali and Telugu, due to time constraints, we evaluated on subsets of 100 and 50 randomly chosen problems, respectively. We will update the table below with the full results soon. We show the results visually in Figure c) of the global response PDF and in the table below:

Table: Accuracy drop on multilingual reasoning tasks. Lower is better.

| Dataset | OccamLlama Arith | Llama 3 8B Instruct |
|---|---|---|
| Bengali | 27.0 ± 5.8 | 43.2 ± 3.9 |
| Chinese | 21.2 ± 4.3 | 17.6 ± 3.7 |
| French | 13.6 ± 4.2 | 13.6 ± 3.7 |
| German | 14.0 ± 4.2 | 11.6 ± 3.6 |
| Japanese | 47.2 ± 3.9 | 63.6 ± 3.4 |
| Russian | 15.2 ± 4.2 | 12.8 ± 3.6 |
| Spanish | 5.6 ± 4.1 | 9.2 ± 3.5 |
| Swahili | 32.0 ± 4.2 | 38.4 ± 3.9 |
| Telugu | 56.0 ± 6.0 | 61.6 ± 3.5 |
| Thai | 20.0 ± 4.3 | 21.2 ± 3.8 |
| Average | 38.0 ± 4.2 | 29.3 ± 3.7 |

Continued below

评论

The table above shows that OccamLlama has on average a smaller performance drop than Llama between the English dataset and the non-English datasets. The fact that OccamLlama (whose decoders have never been trained on other languages) has on average better out-of-distribution behavior than Llama (a model trained on over 750 billion tokens of non-English text) is, in our opinion, quite remarkable.

We believe that these two tests demonstrate OccamLlama's robustness against out-of-distribution data.

W3) As mentioned in the limitation, many of the results focus on single 1-step arithmetic problems, and although results for 2 and 3 step are still reasonable, there is performance degradation. I’d speculate that the code-based LLMs may not suffer from such issues.

Thank you for your interest in OccamLLM using a multi-layer OccamNet. We are exploring this direction in future work, and we believe we can train performant OccamLLM systems with a many-layer OccamNet. In the present paper, we simply wish to demonstrate the potential promise of using multiple-layer OccamNets.

We note that our existing methods are already capable of performing arbitrarily complex arithmetic by chaining simpler calculations. Existing language models are generally trained or prompted to reason step-by-step, meaning that their generation can often be supplemented by a single-layer OccamNet. This is evidenced by our reasoning benchmarks, in which OccamLlama (with a single-layer OccamNet) performs quite well across a wide range of tasks.

Additionally, we can further improve OccamLlama systems with single-layer OccamNets by finetuning and/or prompting the model to think step-by-step.
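As a toy illustration of this chaining (our own example, with made-up numbers), a compound expression such as sqrt(3*3 + 4*4) decomposes into four single operations, each within reach of a one-layer OccamNet, while the LLM's step-by-step text carries the intermediate results:

```python
import math

# Each step is a single operation, as a one-layer OccamNet would perform it;
# the LLM's step-by-step reasoning supplies the intermediate results as text.
step1 = 3 * 3             # "3 * 3 = 9"
step2 = 4 * 4             # "4 * 4 = 16"
step3 = step1 + step2     # "9 + 16 = 25"
step4 = math.sqrt(step3)  # "sqrt(25) = 5.0"
print(step4)              # 5.0
```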

Questions:

How did you generate the training and evaluation dataset, and what templates were used? I’m aware that Appendix A.1 details split sizes and the type of questions, but are all the queries just the raw calculation, e.g. input=’1254+854’ (+ all the prompting such as putting ‘Answer:’ or requesting to give the response in decimals).

The training data follows the structure presented in Appendix A: one or several concatenated simple input-output pairs are passed to the model. For diversity, we use both unformatted inputs and data following the chat format on which Llama 3 8B Instruct was finetuned. For the OccamNet decoder, we append "Answer = " at the beginning of the assistant response in the first round of training and leave it blank in the second round, as explained in Appendix A.1. The two examples below show the different cases. The ■ marks where the decoder is trained to initialize OccamNet based on the previous token to predict the right number.


Unformatted text:

```
242 - -47 = 289.0000
sqrt(82) = 9.0554
-7 + -10 = -17.0000
0.6531 + 0.4395 = 1.0926
log(8767.8356) = 9.0788
log(3553.0728) = 8.1756
cos(-4.4649) = -0.2450
sin(-7.5997) = -0.9678
cos(11.1069) = ■
```


Formatted text, with or without "Answer = ":

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

<|eot_id|><|start_header_id|>user<|end_header_id|>

336.75 * 60.96 =<|eot_id|><|start_header_id|>assistant<|end_header_id|>

(Answer = )■
```


Although these two examples show numeric data points without text, the training dataset also includes word problems. We give examples of word problems in our Continuation of Rebuttal (2) for reviewer Z5Za.
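For concreteness, single-operation pairs like those above could be produced by a small template generator along the following lines. This is a hypothetical sketch of the format, not our actual data pipeline:

```python
import math
import random

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}
UNARY = {"sqrt": math.sqrt, "log": math.log, "exp": math.exp,
         "sin": math.sin, "cos": math.cos}

def binary_example():
    a, b = random.randint(-100, 100), random.randint(-100, 100)
    op = random.choice(list(OPS))
    if op == "/" and b == 0:  # avoid division by zero
        b = 1
    return f"{a} {op} {b} = {OPS[op](a, b):.4f}"

def unary_example():
    name = random.choice(list(UNARY))
    # keep sqrt/log arguments positive; other functions take signed inputs
    x = random.uniform(0.1, 10000) if name in ("sqrt", "log") else random.uniform(-12, 12)
    return f"{name}({x:.4f}) = {UNARY[name](x):.4f}"

print(binary_example())  # e.g. "242 - -47 = 289.0000"
print(unary_example())   # e.g. "log(8767.8356) = 9.0788"
```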

For the evaluation datasets, we used the benchmarks described in the paper and included each question in the user prompt. The system prompts are as described in the paper.


We are very grateful for your comments and suggestions. We hope you will agree that the additional experiments and clarifications will strengthen our paper.

评论

Dear Reviewer,

We are writing to provide updated results from our completed evaluation runs. Below, we show the final results for OccamLlama 70B. We note that OccamLlama 70B maintains its strong performance when tested on the complete datasets. It still outperforms both GPT 4o and GPT 4o with Code on average. It also improves substantially compared to the 100-datapoint performance estimate on Single Eq. OccamLlama outperforms GPT 3.5 Turbo on all datasets except Single Eq, where it is less than one percentage point behind and within error bars.

| Dataset | OccamLlama 8B | OccamLlama 70B | Llama 3 8B Instruct | Llama 3 70B Instruct | GPT 3.5 Turbo | GPT 4o | GPT 4o Code |
|---|---|---|---|---|---|---|---|
| AddSub | 91.6 ± 1.4 | 96.5 ± 0.9 | 93.4 ± 1.2 | 97.2 ± 0.8 | 95.4 ± 1.1 | 97.0 ± 0.9 | 97.5 ± 1.1 |
| GSM8K | 73.5 ± 1.2 | 90.1 ± 0.8 | 79.8 ± 1.1 | 94.8 ± 0.6 | 84.8 ± 1.0 | 96.1 ± 0.5 | 94.0 ± 1.7 |
| MultiArith | 99.2 ± 0.4 | 98.5 ± 0.5 | 99.8 ± 0.2 | 98.2 ± 0.5 | 97.2 ± 0.7 | 99.7 ± 0.2 | 99.5 ± 0.5 |
| MultiArith Float | 98.2 ± 0.5 | 97.7 ± 0.6 | 57.3 ± 2.0 | 76.3 ± 1.7 | 77.3 ± 1.7 | 96.2 ± 0.8 | 89.5 ± 2.2 |
| MATH401 | 85.0 ± 1.8 | 89.5 ± 1.5 | 60.3 ± 2.4 | 71.6 ± 2.3 | 63.1 ± 2.4 | 76.6 ± 2.1 | 78.0 ± 2.9 |
| Single Eq | 92.9 ± 1.1 | 96.9 ± 0.8 | 96.3 ± 0.8 | 97.6 ± 0.7 | 97.8 ± 0.6 | 98.0 ± 0.6 | 99.0 ± 0.7 |
| SVAMP | 88.6 ± 1.0 | 93.2 ± 0.8 | 86.3 ± 1.1 | 94.5 ± 0.7 | 87.8 ± 1.0 | 96.2 ± 0.6 | 96.5 ± 1.3 |
| Average | 89.9 ± 1.2 | 94.6 ± 0.9 | 81.9 ± 1.5 | 90.0 ± 1.2 | 86.2 ± 1.4 | 94.2 ± 1.0 | 93.4 ± 1.7 |

Above, we also include results from Llama 3 70B Instruct. OccamLlama 70B substantially outperforms Llama 3 70B on MultiArith Float and MATH401. It is roughly the same (1.3 percentage point deviation or less) as Llama 3 70B on all other datasets (ones with arithmetic simple enough for Llama to perform satisfactorily) except for GSM8K, which demonstrates OccamLlama's ability to not interfere with reasoning while still assisting with arithmetic. On average, OccamLlama 70B outperforms Llama 3 70B by nearly five percentage points, and OccamLlama 8B approximately matches Llama 3 70B's average performance.

We also provide the final table for the MGSM dataset:

| Dataset | OccamLlama | Llama 3 8B Instruct |
|---|---|---|
| Bengali | 31.6 ± 4.2 | 43.2 ± 3.9 |
| Chinese | 21.2 ± 4.3 | 17.6 ± 3.7 |
| French | 13.6 ± 4.2 | 13.6 ± 3.7 |
| German | 14.0 ± 4.2 | 11.6 ± 3.6 |
| Japanese | 47.2 ± 3.9 | 63.6 ± 3.4 |
| Russian | 15.2 ± 4.2 | 12.8 ± 3.6 |
| Spanish | 5.6 ± 4.1 | 9.2 ± 3.5 |
| Swahili | 32.0 ± 4.2 | 38.4 ± 3.9 |
| Telugu | 48.4 ± 3.9 | 61.6 ± 3.5 |
| Thai | 20.0 ± 4.3 | 21.2 ± 3.8 |
| Average | 24.9 ± 4.2 | 29.3 ± 3.7 |

We note that there was previously a typo in this table: the averages were wrong, incorrectly showing that OccamLlama had a larger performance drop than Llama on average. We apologize for any confusion. Our verbal statements in the rebuttal were correct, and Figure c) in the general rebuttal PDF had the correct averages for the previous data. Other than this typo, the table above closely resembles the previously reported table, so we do not comment further.

Thank you for your consideration.

评论

Dear Reviewer,

In discussions with another reviewer, the reviewer found the following clarifications helpful, leading them to raise their score to a 7. We are sending them here in case it helps your consideration of our rebuttal.

We believe that OccamLLM provides numerous advantages relative to code generation:

  • Improved generation speed and cost: LLM code generation requires generating tokens for code, leading to the use of potentially many more tokens than are needed to respond to a query. This can make LLMs with code generation (unlike OccamLLM) slow and expensive. For example, in the arithmetic benchmarks, we found that GPT 4o with code required on average more than 50 completion tokens, far more than the single autoregressive step required by OccamLlama to answer each question. Additionally, our total expenses for GPT 4o with code were more than double that of GPT 4o without code, even though we made almost 5 times fewer queries to GPT 4o with code, suggesting that GPT 4o with code is approximately 10 times more expensive per query than GPT 4o (we work out this estimate after this list). Since OccamLLM does not require any additional autoregressive steps of the LLM, this suggests that OccamLLM is faster and cheaper than code generation.
  • Better results: The LLM code generation method we compared against, GPT 4o with Code Interpreter, actually performed worse than GPT 4o on average. We believe these results demonstrate that, for arithmetic, LLM code generation is not an optimal solution. OccamLlama 70B outperforms GPT 4o with code on our benchmarks, suggesting that OccamLLM provides a better solution.
  • No catastrophic forgetting: Code generation generally requires fine-tuning an LLM to write and run code. This risks catastrophic forgetting.
  • Improved security: Running LLM-generated code is potentially unsafe. For example, the LLM could ask to run the Python code __import__('os').system('rm -rf /'). It is in general extremely challenging to check code for malicious behavior. OccamLLM does not run arbitrary code, so it avoids these risks.
  • Reduced engineering: For the reasons mentioned in the previous bullet, building a secure system for LLM code execution, such as by hosting an LLM and a sandbox to safely run code, requires substantial engineering effort. Users can simply download our OccamLLM models and run them; no engineering is required.
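To spell out the per-query estimate from the first bullet: writing $C$ for our total GPT 4o spend without code and $Q$ for the corresponding number of queries, the with-code runs cost roughly $2C$ over roughly $Q/5$ queries, so the per-query cost ratio is

$$\frac{2C/(Q/5)}{C/Q} = 10.$$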

As a result of OccamLlama's improved speed, cost, performance, security, and ease of use, we believe it is a compelling alternative to LLM code generation for arithmetic. We will make these advantages clearer in our paper.

One other attractive feature we clarified is that OccamLlama is able to handle questions where the relevant numbers are dispersed in the text. It can determine which numbers are relevant in the text and perform arithmetic on those numbers.

Thank you again,

The Authors

评论

Dear Reviewer,

We are writing to gently ask that you consider our rebuttal. We appreciate your thoughtful review and would greatly appreciate your further comments and thoughts.

Thank you,

The Authors

评论

Hi, I thank the authors for the incredibly detailed responses and additional results.

With respect to W1), I believe that my initial point remains and the provided results demonstrate similarly that performance improves only for tasks which require high precision (MultiArith Float, MATH 401). I agree that the model does seem to aid with challenging maths problems where there are several decimal places or irrational numbers present, but as stated, there aren't benefits for general mathematical reasoning tasks. Not saying this isn't a meaningful contribution alone, just that it's a limitation of the current results.

I believe that W2 has been adequately addressed, by noting that the model is trained with only numerical values but not text, yet the model learns to generalize and be applicable OOD

W3) Though it is noted that OccamNet may have practical advantages over code generation, it does degrade as the number of steps increases, and in this circumstance code generation may perform better, which I believe remains a limitation of the work compared to alternative code-generation approaches.

I thank the authors for their efforts with the rebuttal, but I elect to maintain my score of 6.

评论

Hi, I thank the authors for the incredibly detailed responses and additional results.

Thank you!

With respect to W1), I believe that my initial point remains and the provided results demonstrate similarly that performance improves only for tasks which require high precision (MultiArith Float, MATH 401). I agree that the model does seem to aid with challenging maths problems where there are several decimal places or irrational numbers present, but as stated, there aren't benefits for general mathematical reasoning tasks. Not saying this isn't a meaningful contribution alone, just that it's a limitation of the current results.

We appreciate your comments. We agree that OccamLLM does not improve reasoning. At the same time, we would not expect a method for improving arithmetic to improve reasoning.

I believe that W2 has been adequately addressed, by noting that the model is trained with only numerical values but not text, yet the model learns to generalize and be applicable OOD

Thank you!

W3) Though it is noted that OccamNet may have practical advantages over code generation, it does degrade as the number of steps increases, and in this circumstance code generation may perform better, which I believe remains a limitation of the work compared to alternative code-generation approaches.

We appreciate your comment. While the chance of OccamLlama making a mistake increases with generation length, we believe this is an issue common to all autoregressive models, including a code-generation model.

Additionally, we note that the cases where OccamLlama fails appear to almost always be due to an imperfect switch. When the switch triggers correctly, the decoder almost always works correctly. We emphasize that the key contribution of our paper is not the switch; as reviewers have pointed out, there are probably many other people training switches. As such, the switch could be improved in future work, and we did not focus heavily on it here.


We appreciate your thoughtful response. While we understand your comments, we hope the above responses will alleviate any concerns. We thank you again for voting to accept our paper.

The Authors

审稿意见
6

This paper introduces OccamLLM, a novel framework that enables exact arithmetic in a single autoregressive step and enhances the arithmetic capabilities of LLMs. In detail, OccamLLM adds several additional decoders to the original LLM to help it conduct arithmetic tasks using a symbolic model named OccamNet. While all the decoders take the hidden states of the LLM as input, one of them (the "OccamLLM Switch") focuses on predicting whether the LLM should use OccamNet at the current time step, and the others (the "OccamLLM Decoder") predict the weights of OccamNet that are used during the arithmetic task. These additional decoders are trained on synthetic datasets, during which the original weights of the LLM are frozen to avoid catastrophic forgetting. Evaluation shows that OccamLLM can significantly improve the arithmetic capabilities of the LLM without harming its reasoning and language modeling capabilities.

优点

  • The idea of adding OccamLLM Decoders to LLM and using it to predict the weights of OccamNet for arithmetic tasks is novel.
  • The paper is generally well-written.

缺点

  • The improvements brought by OccamLLM on general reasoning tasks are limited. While OccamLlama outperforms Llama 3 8B on MultiArith Float and MATH401, it fails to bring meaningful improvements on all the other reasoning benchmarks.
  • It is unclear whether OccamLLM can consistently improve the arithmetic capabilities of larger models with 13B and 70B parameters.

问题

  • What would be the performance of OccamLlama on GSM8K when using a two layer Complete OccamNet?

局限性

  • The limitation section of this paper is rather comprehensive.
作者回复

The idea of adding OccamLLM Decoders to LLM and using it to predict the weights of OccamNet for arithmetic tasks is novel. The paper is generally well-written.

We appreciate your kind remarks!

The improvements brought by OccamLLM on general reasoning tasks are limited. While OccamLlama outperforms Llama 3 8B on MultiArith Float and MATH401, it fails to bring meaningful improvements on all the other reasoning benchmarks.

Thank you for your comment. Since submitting to NeurIPS, we identified a few minor bugs in our code and slightly reweighted the proportions of the different training datasets. These changes maintained the perfect performance on arithmetic tasks and improved performance on mathematical reasoning tasks. The updated results are shown in Figure a) of the global response PDF and the table below:

Table: Accuracy on reasoning tasks. Higher is Better.

| Dataset | OccamLlama | Llama 3 8B Instruct | GPT 3.5 Turbo | GPT 4o | GPT 4o Code |
|---|---|---|---|---|---|
| AddSub | 91.6 ± 1.4 | 93.4 ± 1.2 | 95.4 ± 1.1 | 97.0 ± 0.9 | 97.5 ± 1.1 |
| GSM8K | 73.5 ± 1.2 | 79.8 ± 1.1 | 84.8 ± 1.0 | 96.1 ± 0.5 | 94.0 ± 1.7 |
| MultiArith | 99.2 ± 0.4 | 99.8 ± 0.2 | 97.2 ± 0.7 | 99.7 ± 0.2 | 99.5 ± 0.5 |
| MultiArith Float | 98.2 ± 0.5 | 57.3 ± 2.0 | 77.3 ± 1.7 | 96.2 ± 0.8 | 89.5 ± 2.2 |
| MATH401 | 85.0 ± 1.8 | 60.3 ± 2.4 | 63.1 ± 2.4 | 76.6 ± 2.1 | 78.0 ± 2.9 |
| Single Eq | 92.9 ± 1.1 | 96.3 ± 0.8 | 97.8 ± 0.6 | 98.0 ± 0.6 | 99.0 ± 0.7 |
| SVAMP | 88.6 ± 1.0 | 86.3 ± 1.1 | 87.8 ± 1.0 | 96.2 ± 0.6 | 96.5 ± 1.3 |
| Average | 89.9 ± 1.1 | 81.9 ± 1.3 | 86.2 ± 1.2 | 94.2 ± 0.8 | 93.4 ± 1.5 |

OccamLlama outperforms even GPT 4o and GPT 4o with Code on benchmarks requiring challenging arithmetic (MultiArith Float and MATH401). Additionally, the gap between OccamLlama and LLama 3 8B is substantially smaller on GSM8K. On average, OccamLlama has accuracy approximately halfway between that of GPT 3.5 and GPT 4o.

The performance of OccamLlama also improves dramatically when trained with Llama 3 70B, as discussed below.

It is unclear whether OccamLLM can consistently improve the arithmetic capabilities of larger models with 13B and 70B parameters.

Thank you for your comments. To demonstrate OccamLLM's ability to scale to larger models, we trained OccamLlama using Llama 3 70B Instruct as a base model (OccamLlama 70B). We evaluated the results on the mathematical reasoning benchmarks. To ensure we had results in time for the rebuttal deadline, we sampled 100 random questions from each dataset to determine OccamLlama 70B's performance, which is the cause for the relatively large error bars. We will update this response with the full results once they complete. The results are shown visually in Figure d) of the global response PDF and the table below:

Table: Accuracy on reasoning tasks. Higher is better.

| Dataset | OccamLlama 8B | OccamLlama 70B | Llama 3 8B Instruct | GPT 3.5 Turbo | GPT 4o | GPT 4o Code |
|---|---|---|---|---|---|---|
| AddSub | 91.6 ± 1.4 | 98.0 ± 1.4 | 93.4 ± 1.2 | 95.4 ± 1.1 | 97.0 ± 0.9 | 97.5 ± 1.1 |
| GSM8K | 73.5 ± 1.2 | 87.0 ± 3.4 | 79.8 ± 1.1 | 84.8 ± 1.0 | 96.1 ± 0.5 | 94.0 ± 1.7 |
| MultiArith | 99.2 ± 0.4 | 100.0 ± 0.0 | 99.8 ± 0.2 | 97.2 ± 0.7 | 99.7 ± 0.2 | 99.5 ± 0.5 |
| MultiArith Float | 98.2 ± 0.5 | 99.0 ± 1.0 | 57.3 ± 2.0 | 77.3 ± 1.7 | 96.2 ± 0.8 | 89.5 ± 2.2 |
| MATH401 | 85.0 ± 1.8 | 91.0 ± 2.9 | 60.3 ± 2.4 | 63.1 ± 2.4 | 76.6 ± 2.1 | 78.0 ± 2.9 |
| Single Eq | 92.9 ± 1.1 | 93.0 ± 2.6 | 96.3 ± 0.8 | 97.8 ± 0.6 | 98.0 ± 0.6 | 99.0 ± 0.7 |
| SVAMP | 88.6 ± 1.0 | 96.0 ± 2.0 | 86.3 ± 1.1 | 87.8 ± 1.0 | 96.2 ± 0.6 | 96.5 ± 1.3 |
| Average | 89.9 ± 1.1 | 94.9 ± 1.9 | 81.9 ± 1.3 | 86.2 ± 1.2 | 94.2 ± 0.8 | 93.4 ± 1.5 |

These results demonstrate that OccamLLM is capable of robust performance on large-scale models. Training the OccamLLM 70B system (in 16-bit LLM precision and unoptimized code) required approximately 1.5-2 days on a system with two A100 nodes, demonstrating the low compute requirements of our method.

OccamLlama 70B shows consistent improvement over OccamLlama 8B. We believe this results from a combination of Llama 3 70B's improved reasoning capabilities and Llama 3 70B's improved representations (which enable OccamLlama to generalize more effectively). We expect that training with even larger base models would lead to further improvements. Remarkably, OccamLlama 70B outperforms GPT 4o and GPT 4o with Code on average across these benchmarks. It also outperforms GPT 3.5 Turbo on all but Single Eq.

What would be the performance of OccamLlama on GSM8K when using a two layer Complete OccamNet?

Thank you for your interest in OccamLLM using a multi-layer OccamNet. We are exploring this direction in future work, and we believe we can train performant OccamLLM systems with a many-layer OccamNet. In the present paper, we simply wish to demonstrate the potential promise of using multiple-layer OccamNets.

We note that our existing methods are already capable of performing arbitrarily complex arithmetic by chaining simpler calculations. Language models are generally trained or prompted to reason step-by-step, meaning that their generation can often be supplemented by a single-layer OccamNet. This is evidenced by our reasoning benchmarks, in which OccamLlama (with a single-layer OccamNet) performs quite well across a wide range of tasks.

Additionally, we can further improve OccamLlama systems with single-layer OccamNets by finetuning and/or prompting the model to think step-by-step.


We are very grateful for your comments and suggestions. We hope you will agree that the additional experiments and clarifications will strengthen our paper.

评论

Dear Reviewer,

We are writing to provide updated results from our completed evaluation runs. Below, we show the final results for OccamLlama 70B. We note that OccamLlama 70B maintains its strong performance when tested on the complete datasets. It still outperforms both GPT 4o and GPT 4o with Code on average. It also improves substantially compared to the 100-datapoint performance estimate on Single Eq. OccamLlama outperforms GPT 3.5 Turbo on all datasets except Single Eq, where it is less than one percentage point behind and within error bars.

| Dataset | OccamLlama 8B | OccamLlama 70B | Llama 3 8B Instruct | Llama 3 70B Instruct | GPT 3.5 Turbo | GPT 4o | GPT 4o Code |
|---|---|---|---|---|---|---|---|
| AddSub | 91.6 ± 1.4 | 96.5 ± 0.9 | 93.4 ± 1.2 | 97.2 ± 0.8 | 95.4 ± 1.1 | 97.0 ± 0.9 | 97.5 ± 1.1 |
| GSM8K | 73.5 ± 1.2 | 90.1 ± 0.8 | 79.8 ± 1.1 | 94.8 ± 0.6 | 84.8 ± 1.0 | 96.1 ± 0.5 | 94.0 ± 1.7 |
| MultiArith | 99.2 ± 0.4 | 98.5 ± 0.5 | 99.8 ± 0.2 | 98.2 ± 0.5 | 97.2 ± 0.7 | 99.7 ± 0.2 | 99.5 ± 0.5 |
| MultiArith Float | 98.2 ± 0.5 | 97.7 ± 0.6 | 57.3 ± 2.0 | 76.3 ± 1.7 | 77.3 ± 1.7 | 96.2 ± 0.8 | 89.5 ± 2.2 |
| MATH401 | 85.0 ± 1.8 | 89.5 ± 1.5 | 60.3 ± 2.4 | 71.6 ± 2.3 | 63.1 ± 2.4 | 76.6 ± 2.1 | 78.0 ± 2.9 |
| Single Eq | 92.9 ± 1.1 | 96.9 ± 0.8 | 96.3 ± 0.8 | 97.6 ± 0.7 | 97.8 ± 0.6 | 98.0 ± 0.6 | 99.0 ± 0.7 |
| SVAMP | 88.6 ± 1.0 | 93.2 ± 0.8 | 86.3 ± 1.1 | 94.5 ± 0.7 | 87.8 ± 1.0 | 96.2 ± 0.6 | 96.5 ± 1.3 |
| Average | 89.9 ± 1.2 | 94.6 ± 0.9 | 81.9 ± 1.5 | 90.0 ± 1.2 | 86.2 ± 1.4 | 94.2 ± 1.0 | 93.4 ± 1.7 |

Above, we also include results from Llama 3 70B Instruct. OccamLlama 70B substantially outperforms Llama 3 70B on MultiArith Float and MATH401. It is roughly the same (1.3 percentage point deviation or less) as Llama 3 70B on all other datasets (ones with arithmetic simple enough for Llama to perform satisfactorily) except for GSM8K, which demonstrates OccamLlama's ability to not interfere with reasoning while still assisting with arithmetic. On average, OccamLlama 70B outperforms Llama 3 70B by nearly five percentage points, and OccamLlama 8B approximately matches Llama 3 70B's average performance.

Thank you for your consideration.

评论

Dear Reviewer,

In discussions with another reviewer, the reviewer found the following clarifications helpful, leading them to raise their score to a 7. We are sending them here in case it helps your consideration of our rebuttal.

We believe that OccamLLM provides numerous advantages relative to code generation:

  • Improved generation speed and cost: LLM Code generation requires generating tokens for code, leading to the use of potentially many more tokens than are needed to respond to a query. This can make LLMs with code generation (unlike OccamLLM) slow and expensive. For example, in the arithmetic benchmarks, we found that GPT 4o with code required on average more than 50 completion tokens, far more than the single autoregressive step required by OccamLlama to answer each question. Additionally, our total expenses for GPT 4o with code were more than double that of GPT 4o without code, even though we made almost 5 times fewer queries to GPT 4o with code, suggesting that GPT 4o with code is approximately 10 times more expensive per query than GPT 4o. Since OccamLLM does not require any additional autoregressive steps of the LLM, this suggests that OccamLLM is faster and cheaper than code generation.
  • Better results: The LLM code generation method we compared against, GPT 4o with Code Interpreter, actually performed worse than GPT 4o on average. We believe these results demonstrate that, for arithmetic, LLM code generation is not an optimal solution. OccamLlama 70B outperforms GPT 4o with code on our benchmarks, suggesting that OccamLLM provides a better solution.
  • No catastrophic forgetting: Code generation generally requires fine-tuning an LLM to write and run code. This risks catastrophic forgetting.
  • Improved security: Running LLM-generated code is potentially unsafe. For example, the LLM could ask to run the Python code __import__('os').system('rm -rf /'). It is in general extremely challenging to check code for malicious behavior. OccamLLM does not run arbitrary code, so it avoids these risks.
  • Reduced engineering: For the reasons mentioned in the previous bullet, building a secure system for LLM code execution, such as by hosting an LLM and a sandbox to safely run code, requires substantial engineering effort. Users can simply download our OccamLLM models and run them; no engineering is required.

As a result of OccamLlama's improved speed, cost, performance, security, and ease of use, we believe it is a compelling alternative to LLM code generation for arithmetic. We will make these advantages clearer in our paper.

One other attractive feature we clarified is that OccamLlama is able to handle questions where the relevant numbers are dispersed in the text. It can determine which numbers are relevant in the text and perform arithmetic on those numbers.

Thank you again,

The Authors

评论

Thanks for the detailed response with clarifications and additional experiments. However, while OccamLlama 70B shows consistent improvement over OccamLlama 8B, the improvements of OccamLlama 8B over Llama 3 8B on most reasoning tasks apart from MultiArith Float and MATH401 are still not clear to me, even with the updated results. As a result, I will be keeping my score the same.

评论

Thanks for the detailed response with clarifications and additional experiments. However, while OccamLlama 70B shows consistent improvement over OccamLlama 8B, the improvements of OccamLlama 8B over Llama 3 8B on most reasoning tasks apart from MultiArith Float and MATH401 are still not clear to me, even with the updated results. As a result, I will be keeping my score the same.

We appreciate your consideration of our rebuttals. We understand your concern.

We believe that OccamLlama dramatically outperforms Llama across model sizes on MultiArith Float and MATH401 because these two datasets are the ones with challenging arithmetic that Llama cannot perform by itself.

We would not expect OccamLlama to improve over Llama on problems where Llama can already perform the arithmetic, which is the case for the other benchmarks we tested.

Unfortunately, we could not find many benchmarks with math problems containing both challenging reasoning and challenging arithmetic. This could be an area for future work. However, we see OccamLlama's competitive performance on benchmarks for which Llama already performs arithmetic well as a demonstration that OccamLlama is robust to a wide range of problems and is capable of being used in general mathematical problem-solving tasks.


While we understand your comment, we hope the above response will alleviate any concerns. We thank you again for voting to accept our paper.

The Authors

审稿意见
4

The paper "OccamLLM: Fast and Exact Language Model Arithmetic in a Single Step" introduces a novel framework for enabling exact arithmetic within large language models (LLMs) by integrating a symbolic architecture called OccamNet. This framework, termed OccamLLM or OccamLlama when using the Llama model, allows for single-step arithmetic computations directly within the LLM’s autoregressive process. The key contributions include achieving 100% accuracy on various arithmetic operations, outperforming models like GPT-4 without code interpreters, and demonstrating superior performance on complex arithmetic reasoning tasks.

优点

Originality: The integration of OccamNet with LLMs for single-step arithmetic is a novel approach.

Quality: The experiments are comprehensive, covering a range of arithmetic operations and reasoning tasks.

Clarity: The paper is well-organized, with clear descriptions of the methodology and results.

Significance: This work addresses a critical limitation in LLMs' arithmetic capabilities, potentially impacting various applications requiring precise mathematical computations.

缺点

Limited Novelty in Approach: The paper stitches two existing architectures together to solve confined arithmetic tasks. While this combination is innovative, it does not introduce entirely new methodologies or architectures. The novelty is more in the application than in the fundamental approach.

Inadequate Justification for Methodology: The paper does not convincingly justify why the proposed methodology is preferable over existing methods such as code execution. The evaluation is extensive, but the practical advantages in terms of performance, computational efficiency, and real-world applicability are not fully established.

Scalability and Generalization: The framework is tested on specific arithmetic tasks and reasoning problems, but its scalability to more complex, multi-step arithmetic problems in real-world applications remains uncertain. There is limited discussion on how the approach would handle more extensive and varied datasets.

Security Concerns: The claim that the approach is more secure than code execution is made without detailed analysis. It's not clear why code execution is such a concern that it warrants this new methodology. Moreover, there could be other vulnerabilities associated with using a symbolic architecture that are not addressed in the paper.

Evaluation Metrics: The evaluation focuses heavily on accuracy, but other metrics such as computational efficiency, speed, and resource utilization are not thoroughly discussed. These factors are crucial for practical deployment and comparison with other methods.

In short, I'm not convinced that OccamLLM is relevant, nor that code execution is not superior.

问题

Scalability: How does OccamLLM perform on more complex, multi-step arithmetic problems in real-world applications?

Security: Could you provide a more detailed analysis of the security benefits and potential vulnerabilities of using OccamNet for arithmetic tasks?

局限性

No ethics concerns.

作者回复

The integration of OccamNet with LLMs for single-step arithmetic is a novel approach...

We appreciate your kind remarks!

Limited Novelty in Approach: The paper stitches two existing architectures together to solve confined arithmetic tasks. While this combination is innovative, it does not introduce entirely new methodologies or architectures. The novelty is more in the application than in the fundamental approach.

Thank you for your comment. We acknowledge that the two key architectures which we combine in our model are known. However, we contend that systems combining known architectures are often still novel. For example, speculative decoding uses known generative architectures but combines them in novel ways.

Additionally, we believe our method of combining the two architectures is creative and novel. To our knowledge, our work is unique in implementing a decoder module which initializes the parameters of a second computational model at each autoregressive step. In our view, this is closest to a hypernetwork, but unlike a hypernetwork we initialize OccamNet at each autoregressive step instead of just once per input.

We believe this step is quite novel. To use an analogy, if this same idea were used in the context of linear layers (i.e. using one linear layer to initialize another), one arrives at a model similar to attention. In the same way that we believe attention is sufficiently distinct from a linear layer to be considered novel, we believe that our method is sufficiently distinct from other gated computation methods to be considered novel.

Inadequate Justification for Methodology: The paper does not convincingly justify why the proposed methodology is preferable over existing methods such as code execution. The evaluation is extensive, but the practical advantages in terms of performance, computational efficiency, and real-world applicability are not fully established.

Thank you for your comments. We argue for the following advantages of OccamLlama over code generation or finetuning:

Fine-tuning

  • No catastrophic forgetting: Fine-tuning risks catastrophic forgetting, especially since arithmetic training datasets likely contain very little text.
  • Improved arithmetic: As shown in our arithmetic and mathematical reasoning benchmarks, even larger models such as GPT 4o fall short of OccamLLM for arithmetic more challenging than three-digit multiplication, so we also expect finetuned models to fall short of OccamLLM.

Code generation

  • No catastrophic forgetting: Code generation generally requires fine-tuning an LLM to write and run code.
  • Improved generation speed and cost: In the arithmetic benchmarks, we found that GPT 4o with code required on average more than 50 completion tokens, far more than the single autoregressive step required by OccamLlama to answer each question. Additionally, our total expenses for GPT 4o with code were more than double that of GPT 4o without code, even though we made almost 5 times fewer queries to GPT 4o with code. This suggests that OccamLLM would be faster and cheaper compared to code generation.
  • Improved security: Running LLM-generated code is potentially unsafe. For example, the LLM could ask to run the Python code __import__('os').system('rm -rf /'). It is in general extremely challenging to check general-purpose code for malicious behavior. We believe that the two best solutions are 1) to run all code in an isolated virtual machine and 2) to run a restricted language that is not Turing-complete.
  • Reduced engineering: Building a system that hosts an LLM and a sandbox to safely run code requires a high engineering effort.

We will make these points clearer in the paper.

Scalability and Generalization: The framework is tested on specific arithmetic tasks and reasoning problems, but its scalability to more complex, multi-step arithmetic problems in real-world applications remains uncertain. There is limited discussion on how the approach would handle more extensive and varied datasets.

Thank you for your comment. We believe that the mathematical reasoning benchmarks provide experimental validation that OccamLlama can effectively handle compound expressions. These are multi-step math problems requiring multiple calls to OccamNet. Given OccamLlama's strong performance on these benchmarks, we hope you will agree that OccamLlama can effectively handle compound expressions. We will make this clearer in the paper.

Since submitting to NeurIPS, we identified a few minor bugs in our code and slightly reweighted the proportions of the different training datasets. These changes maintained the perfect performance on arithmetic tasks and improved performance on mathematical reasoning tasks. The updated results are shown in Figure a) of the global response PDF and the table below:

Table: Accuracy on reasoning tasks. Higher is Better.

| Dataset | OccamLlama 8B | OccamLlama 70B | Llama 3 8B Instruct | GPT 3.5 Turbo | GPT 4o | GPT 4o Code |
|---|---|---|---|---|---|---|
| AddSub | 91.6 ± 1.4 | 98.0 ± 1.4 | 93.4 ± 1.2 | 95.4 ± 1.1 | 97.0 ± 0.9 | 97.5 ± 1.1 |
| GSM8K | 73.5 ± 1.2 | 87.0 ± 3.4 | 79.8 ± 1.1 | 84.8 ± 1.0 | 96.1 ± 0.5 | 94.0 ± 1.7 |
| MultiArith | 99.2 ± 0.4 | 100.0 ± 0.0 | 99.8 ± 0.2 | 97.2 ± 0.7 | 99.7 ± 0.2 | 99.5 ± 0.5 |
| MultiArith Float | 98.2 ± 0.5 | 99.0 ± 1.0 | 57.3 ± 2.0 | 77.3 ± 1.7 | 96.2 ± 0.8 | 89.5 ± 2.2 |
| MATH401 | 85.0 ± 1.8 | 91.0 ± 2.9 | 60.3 ± 2.4 | 63.1 ± 2.4 | 76.6 ± 2.1 | 78.0 ± 2.9 |
| Single Eq | 92.9 ± 1.1 | 93.0 ± 2.6 | 96.3 ± 0.8 | 97.8 ± 0.6 | 98.0 ± 0.6 | 99.0 ± 0.7 |
| SVAMP | 88.6 ± 1.0 | 96.0 ± 2.0 | 86.3 ± 1.1 | 87.8 ± 1.0 | 96.2 ± 0.6 | 96.5 ± 1.3 |
| Average | 89.9 ± 1.1 | 94.9 ± 1.9 | 81.9 ± 1.3 | 86.2 ± 1.2 | 94.2 ± 0.8 | 93.4 ± 1.5 |

Continued below

评论

As shown above, OccamLlama outperforms even GPT 4o and GPT 4o with Code on MultiArith Float and MATH401. Additionally, the gap between OccamLlama and LLama 3 8B is substantially smaller on GSM8K. On average, OccamLlama has accuracy approximately halfway between that of GPT 3.5 and GPT 4o.

We also find that the performance of OccamLlama on reasoning benchmarks improves dramatically as we increase the base model from Llama 3 8B to Llama 3 70B (OccamLlama 70B).

To ensure we had results in time for the rebuttal deadline, we sampled 100 random questions from each dataset to determine OccamLlama 70B's performance, resulting in the relatively large error bars. We will update this response with the full results once they complete. The results are shown visually in Figure d) of the global response PDF and in the table above.

OccamLlama 70B shows consistent improvement over OccamLlama 8B and we expect that training with even larger base models would lead to further improvements. Remarkably, OccamLlama 70B outperforms GPT 4o and GPT 4o with Code on average. It also outperforms GPT 3.5 Turbo on all but Single Eq.

Security Concerns: The claim that the approach is more secure than code execution is made without detailed analysis. It's not clear why code execution is such a concern that warrants this new methodology. Moreover, there could be other vulnerabilities associated with using a symbolic architecture that are not addressed in the paper.

We note that security is only one of the reasons we believe OccamLlama is superior to code generation for arithmetic. Other reasons are provided above.

We describe the security risks of LLM code generation above. In contrast, OccamNet restricts the allowed computation to a prespecified set of options (function compositions up to a given depth). This means that, barring a preventable vulnerability in its implementation, OccamLlama is a secure system that does not allow for arbitrary code generation and the potential exploits that come with it.
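To illustrate the contrast in code, a whitelist-based evaluator in the spirit of OccamNet exposes only a fixed set of operations and can execute nothing else. The sketch below is our own illustration of the principle, not OccamNet itself:

```python
import math

# Only these operations can ever run; there is no path to arbitrary code.
ALLOWED = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "sqrt": lambda a: math.sqrt(a),
    "log": lambda a: math.log(a),
}

def safe_apply(op_name, *args):
    """Apply a whitelisted operation; anything else is rejected outright."""
    if op_name not in ALLOWED:
        raise ValueError(f"operation {op_name!r} is not in the whitelist")
    return ALLOWED[op_name](*args)

print(safe_apply("sqrt", 82))       # 9.0554...
# safe_apply("__import__", "os")    # ValueError: no escape to the interpreter
```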

Evaluation Metrics: The evaluation focuses heavily on accuracy, but other metrics such as computational efficiency, speed, and resource utilization are not thoroughly discussed. These factors are crucial for practical deployment and comparison with other methods.

Although we focus on accuracy, in the paper we also present relative error as a useful metric.

Regarding computational efficiency and speed, we demonstrate in the paper that GPT-4o with Code requires on average more than 50 times more autoregressive generations than OccamLlama to answer the single-arithmetic benchmarks. Since OccamNet and the decoders are negligibly small compared to a LLM, this demonstrates that, all else equal, OccamLlama should be significantly faster and more efficient than a LLM-code-generation-based approach to arithmetic.

Additionally, as noted above, we find that GPT 4o with Code is substantially more expensive than GPT 4o at answering the same questions. This further supports our claim that OccamLlama should be significantly faster and more efficient than a LLM-code-generation-based approach to arithmetic.

Because of how small OccamNet and the decoders are compared to a LLM, an optimized codebase should see comparable speed and resource utilization between OccamLlama and base Llama.

Due to the closed nature of GPT 3.5 and GPT 4o, it is difficult to make more direct comparisons regarding speed, efficiency, and resource utilization. We hope that the above observations are sufficient to satisfy your concerns.

In short, I'm not convinced OccamLLM is relevant and why code execution is not superior.

We respectfully disagree. Above, we have outlined how OccamLLM is a novel and performant alternative to finetuning and code generation for LLM arithmetic. We believe OccamLLM's performance and the advantages listed above are compelling and provide sufficient motivation for the use of OccamLLM in practice.

How does OccamLLM perform on more complex, multi-step arithmetic problems in real-world applications?

We believe that the mathematical reasoning benchmarks provide experimental validation that OccamLlama can effectively handle more complex, multi-step arithmetic problems in real-world applications. These problems are multi-step math problems requiring multiple calls to OccamNet to be completed successfully. They are standard benchmarks for mathematical reasoning in LLMs. Given OccamLlama's strong performance on these benchmarks, we hope you will agree that OccamLlama can effectively handle real-world mathematical problems. We will make this clearer in the paper.

Could you provide a more detailed analysis of the security benefits and potential vulnerabilities of using OccamNet for arithmetic tasks?

We describe the security risks of OccamLlama and LLM code generation above.


We are very grateful for your comments and suggestions. We hope you will agree that the additional experiments and clarifications will strengthen our paper.

评论

Dear Reviewer,

We are writing to provide updated results from our completed evaluation runs. Below, we show the final results for OccamLlama 70B. We note that OccamLlama 70B maintains its strong performance when tested on the complete datasets. It still outperforms both GPT 4o and GPT 4o with Code on average. It also improves substantially compared to the 100-datapoint performance estimate on Single Eq. OccamLlama outperforms GPT 3.5 Turbo on all datasets except Single Eq, where it is less than one percentage point behind and within error bars.

| Dataset | OccamLlama 8B | OccamLlama 70B | Llama 3 8B Instruct | Llama 3 70B Instruct | GPT 3.5 Turbo | GPT 4o | GPT 4o Code |
|---|---|---|---|---|---|---|---|
| AddSub | 91.6 ± 1.4 | 96.5 ± 0.9 | 93.4 ± 1.2 | 97.2 ± 0.8 | 95.4 ± 1.1 | 97.0 ± 0.9 | 97.5 ± 1.1 |
| GSM8K | 73.5 ± 1.2 | 90.1 ± 0.8 | 79.8 ± 1.1 | 94.8 ± 0.6 | 84.8 ± 1.0 | 96.1 ± 0.5 | 94.0 ± 1.7 |
| MultiArith | 99.2 ± 0.4 | 98.5 ± 0.5 | 99.8 ± 0.2 | 98.2 ± 0.5 | 97.2 ± 0.7 | 99.7 ± 0.2 | 99.5 ± 0.5 |
| MultiArith Float | 98.2 ± 0.5 | 97.7 ± 0.6 | 57.3 ± 2.0 | 76.3 ± 1.7 | 77.3 ± 1.7 | 96.2 ± 0.8 | 89.5 ± 2.2 |
| MATH401 | 85.0 ± 1.8 | 89.5 ± 1.5 | 60.3 ± 2.4 | 71.6 ± 2.3 | 63.1 ± 2.4 | 76.6 ± 2.1 | 78.0 ± 2.9 |
| Single Eq | 92.9 ± 1.1 | 96.9 ± 0.8 | 96.3 ± 0.8 | 97.6 ± 0.7 | 97.8 ± 0.6 | 98.0 ± 0.6 | 99.0 ± 0.7 |
| SVAMP | 88.6 ± 1.0 | 93.2 ± 0.8 | 86.3 ± 1.1 | 94.5 ± 0.7 | 87.8 ± 1.0 | 96.2 ± 0.6 | 96.5 ± 1.3 |
| Average | 89.9 ± 1.2 | 94.6 ± 0.9 | 81.9 ± 1.5 | 90.0 ± 1.2 | 86.2 ± 1.4 | 94.2 ± 1.0 | 93.4 ± 1.7 |

Above, we also include results from Llama 3 70B Instruct. OccamLlama 70B substantially outperforms Llama 3 70B on MultiArith Float and MATH401. It is roughly the same (1.3 percentage point deviation or less) as Llama 3 70B on all other datasets (ones with arithmetic simple enough for Llama to perform satisfactorily) except for GSM8K, which demonstrates OccamLlama's ability to not interfere with reasoning while still assisting with arithmetic. On average, OccamLlama 70B outperforms Llama 3 70B by nearly five percentage points, and OccamLlama 8B approximately matches Llama 3 70B's average performance.

Thank you for your consideration.

评论

Dear Reviewer,

In discussions with another reviewer, the reviewer found the following clarifications helpful, leading them to raise their score to a 7. We are sending them here in case it helps your consideration of our rebuttal.

We believe that OccamLLM provides numerous advantages relative to code generation:

  • Improved generation speed and cost: LLM Code generation requires generating tokens for code, leading to the use of potentially many more tokens than are needed to respond to a query. This can make LLMs with code generation (unlike OccamLLM) slow and expensive. For example, in the arithmetic benchmarks, we found that GPT 4o with code required on average more than 50 completion tokens, far more than the single autoregressive step required by OccamLlama to answer each question. Additionally, our total expenses for GPT 4o with code were more than double that of GPT 4o without code, even though we made almost 5 times fewer queries to GPT 4o with code, suggesting that GPT 4o with code is approximately 10 times more expensive per query than GPT 4o. Since OccamLLM does not require any additional autoregressive steps of the LLM, this suggests that OccamLLM is faster and cheaper than code generation.
  • Better results: The LLM code generation method we compared against, GPT 4o with Code Interpreter, actually performed worse than GPT 4o on average. We believe these results demonstrate that, for arithmetic, LLM code generation is not an optimal solution. OccamLlama 70B outperforms GPT 4o with code on our benchmarks, suggesting that OccamLLM provides a better solution.
  • No catastrophic forgetting: Code generation generally requires fine-tuning an LLM to write and run code. This risks catastrophic forgetting.
  • Improved security: Running LLM-generated code is potentially unsafe. For example, the LLM could ask to run the Python code `__import__('os').system('rm -rf /')`. It is in general extremely challenging to check code for malicious behavior (see the sketch after this list). OccamLLM does not run arbitrary code, so it avoids these risks.
  • Reduced engineering: For the reasons mentioned in the previous bullet, building a secure system for LLM code execution, such as by hosting an LLM and a sandbox to safely run code, requires substantial engineering effort. Users can simply download our OccamLLM models and run them; no engineering is required.
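To make the difficulty of screening generated code concrete, here is a minimal sketch (ours, not from the paper): a substring blacklist never sees the dangerous tokens, because Python's adjacent string literals reassemble them only at runtime. Nothing below executes a shell command; the final line merely prints the reassembled string.

```python
# A minimal sketch (ours): simple string obfuscation defeats substring
# blacklists. The source text a filter scans never contains the contiguous
# tokens "import" or "system"...
source_text = """call = '__imp' 'ort__("os").sys' 'tem("rm -rf /")'"""
print("import" in source_text, "system" in source_text)  # False False
# ...yet at runtime the adjacent string literals concatenate into the full call:
exec(source_text)  # only builds the string `call`; no command is executed
print(call)        # __import__("os").system("rm -rf /")
```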

As a result of OccamLlama's improved speed, cost, performance, security, and ease of use, we believe it is a compelling alternative to LLM code generation for arithmetic. We will make these advantages clearer in our paper.

One other attractive feature we clarified is that OccamLlama can handle questions where the relevant numbers are dispersed throughout the text: it determines which numbers are relevant and performs arithmetic on those.

Thank you again,

The Authors

Comment

I have read the rebuttal and the reviews of other authors.

As mentioned by other reviewers, adding a switch network is not novel. Moreover, the math benchmarks are quite contrived and already solved by a much more general solution (i.e. code generation). In order for me to consider this paper, it would have to provide something that's novel, i.e. solving math problems that are currently out of reach by current models, or expanding the switch network to many different, highly specialized, sub-networks. Regardless, I don't think this submission achieves the novelty I would expect at this conference.

If the selling point is that code generation is slow, or lacks privacy, I would need to see a much deeper experimental section on that. However, I still do not think the lack of novelty would make up for it.

My score remains the same.

Comment

> As mentioned by other reviewers, adding a switch network is not novel.

We agree with you that adding a switch network is not novel, and we do not wish to claim otherwise. We believe the novelty of our work lies in the initialization of a symbolic model using a decoder on the hidden states of a language model.

We need the switch in order to enable autoregressive generation, but the key point we want to show is that the OccamNet-decoder system can perform arithmetic correctly in a variety of contexts.
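For concreteness, here is a schematic sketch of one autoregressive step, in our own shorthand rather than the released implementation (all interface names below are illustrative):

```python
# Schematic pseudocode (ours; the interface names are illustrative and not
# the released OccamLLM code) for a single autoregressive step.
def occam_step(llm, switch, decoder, occamnet, context, numbers):
    hidden = llm.hidden_states(context)    # frozen LLM: weights never change
    if switch(hidden) > 0.5:               # switch: route this token to OccamNet?
        weights = decoder(hidden)          # decoder initializes OccamNet anew
        return occamnet(weights, numbers)  # exact arithmetic in one step
    return llm.sample_token(hidden)        # otherwise, ordinary LLM decoding
```

The switch only routes; the step we consider novel is `decoder(hidden)`, which re-initializes OccamNet's parameters at every token.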

> Moreover, the math benchmarks are quite contrived

We do not believe the benchmarks are contrived. We took all datasets except MultiArith Float from existing papers (MATH401 from https://arxiv.org/abs/2304.02015 and the remainder from https://arxiv.org/pdf/2304.01933). We created MultiArith Float before running the experiments to ensure there was a dataset with sufficiently challenging arithmetic. We emphasize that we selected these datasets before running any tests to ensure that our results were not contrived. The results of all datasets on which we tested OccamLlama are included in this paper.

> and already solved by a much more general solution (i.e. code generation).

We appreciate your concern. However, we emphasize that GPT 4o with Code performed worse than GPT 4o on average across these benchmarks. We believe this shows that code generation does not fully solve LLM arithmetic, at least as of the current generation of top foundation models.

> In order for me to consider this paper, it would have to provide something that's novel, i.e. solving math problems that are currently out of reach by current models, or expanding the switch network to many different, highly specialized, sub-networks. Regardless, I don't think this submission achieves the novelty I would expect at this conference.

We appreciate your comments. We believe our performance on the MultiArith Float and MATH401 benchmarks (2% and 11.5% higher than GPT 4o and GPT 4o + Code, respectively) demonstrates that OccamLlama can solve problems that are currently out of reach of current models. Our work focuses on arithmetic, so we focus on problems with challenging arithmetic.

Regarding expanding the switch network to many different, highly specialized sub-networks: this is something we are exploring, and we agree it is an exciting and novel direction for future work.

> If the selling point is that code generation is slow, or lacks privacy, I would need to see a much deeper experimental section on that. However, I still do not think the lack of novelty would make up for it.

We appreciate your comment. We believe that our existing results already show a convincing advantage for OccamLlama over code generation. Our results demonstrate that GPT 4o with code performs worse than OccamLlama 70B (and than GPT 4o without code) on average across reasoning benchmarks. We also show that GPT 4o with code uses more than 50 times more generation tokens than OccamLlama requires on our arithmetic benchmarks.

Regarding novelty, we do believe that the method of using an LLM to generate model initializations at each autoregressive step is novel and could be extended to other use cases.


While we understand your concerns, we hope you will consider increasing your score.

Thank you,

The Authors

Author Response

Responses to general concerns

We would like to thank all reviewers for their thoughtful comments. Below we address general concerns.

We only tried small models

To demonstrate OccamLLM's scalability, we trained OccamLlama using Llama 3 70B Instruct (OccamLlama 70B) and evaluated it in Figure d) (attached PDF).

These results demonstrate that OccamLLM is capable of efficient training and robust performance on large-scale models. OccamLlama 70B shows consistent improvement over OccamLlama 8B, outperforming GPT 4o and GPT 4o with Code on average across these benchmarks. It also outperforms GPT 3.5 Turbo on all but Single Eq.

We might not generalize to out-of-distribution data

Some reviewers were concerned about how our approach would generalize to specific unseen situations. We address these comments by showing that our method works in each of these specific cases.

We also show:

  1. That an OccamLlama model whose OccamNet decoder was trained only on numeric expressions, with no text at all, performs comparably on mathematical reasoning benchmarks (which include many word problems) to an OccamLlama model trained on both numeric expressions and word problems.

  2. That an OccamLlama model trained only on English expressions displays, on average, a smaller drop in performance when solving problems in non-English languages than Llama 3 8B does, even though Llama 3 was trained on over 750 billion non-English tokens.

We believe that these two tests demonstrate OccamLlama's robustness against out-of-distribution data.

Comparisons to other symbolic architectures besides OccamNet

We explain why OccamNet is most suitable as a symbolic model for OccamLLM, highlighting its interpretability advantage. We tested Equation Learner but found it unstable and prone to poor local minima.

Reviewer Z5Za asks that we specifically test replacing OccamNet with a transformer that generates and runs Python code as a secure alternative. We point out a security vulnerability in this approach and instead propose to train a transformer that generates prefix-notation expressions, which we believe is a safer alternative.
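As a sketch of why we consider prefix-notation generation safer (the operator table and function names below are our illustration, not the proposed model): the evaluator can only dispatch into a fixed whitelist of operations, so there is no path to arbitrary code execution.

```python
import math
import operator

# Whitelisted operations with their arities: nothing outside this table
# can ever be invoked, unlike eval()/exec() on generated Python.
OPS = {"+": (operator.add, 2), "-": (operator.sub, 2),
       "*": (operator.mul, 2), "/": (operator.truediv, 2),
       "sqrt": (math.sqrt, 1), "log": (math.log, 1)}

def eval_prefix(tokens):
    """Evaluate a prefix-notation expression, e.g. ['+', '2', '*', '3', '4']."""
    head = tokens.pop(0)
    if head in OPS:
        fn, arity = OPS[head]
        return fn(*(eval_prefix(tokens) for _ in range(arity)))
    return float(head)  # non-operator tokens must parse as numbers

print(eval_prefix("+ 2 * 3 4".split()))  # 14.0 -- no arbitrary code can run
```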

Questions about novelty

We appreciate that reviewers bzKn, GTF8, and JtJz recognize the novelty of combining OccamNet with an LLM.

We acknowledge that the two key architectures which we combine in our model are known. However, we contend that systems combining known architectures are often still novel. For example, speculative decoding uses known generative architectures but combines them in novel ways.

Additionally, we believe our method of combining the two architectures is creative and novel. To our knowledge, our work is unique in that we implement a decoder module that initializes the parameters of a second computational model at each autoregressive step. To our minds, this is closest to a hypernetwork, but unlike a hypernetwork, we initialize OccamNet at each autoregressive step instead of just once per input.

We believe this step is quite novel. To use an analogy, if this same idea were used in the context of linear layers (i.e. using one linear layer to initialize another), one arrives at a model similar to attention. In the same way that we believe attention is sufficiently distinct from a linear layer to be considered novel, we believe that our method is sufficiently distinct from other gated computation methods to be considered novel.
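To make the analogy concrete, here is a toy numerical sketch (ours, purely illustrative): when one linear map's output is used as the data-dependent weights of another, the result is an attention-like bilinear score between positions.

```python
import numpy as np

# A toy illustration (ours) of the analogy in the paragraph above.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))       # four hidden states of width 8
W_q = rng.normal(size=(8, 8))     # an ordinary fixed linear layer
W_k = rng.normal(size=(8, 8))     # this layer's *output* becomes "weights"

data_dependent = X @ W_k          # re-derived for every input, like our decoder
scores = (X @ W_q) @ data_dependent.T   # attention-style score matrix
print(scores.shape)               # (4, 4)
```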

Questions about multi-layer OccamNets

We are exploring OccamLLM with a multi-layer OccamNet as future work. In the present paper, we simply wish to demonstrate the potential promise of using multi-layer OccamNets.

We note that our existing methods can already perform arbitrarily complex arithmetic by chaining simpler calculations, as illustrated below. This is how OccamLlama (with a single-layer OccamNet) performs quite well across a wide range of mathematical reasoning benchmarks.
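For example (our illustration, not a benchmark item): to evaluate $\sqrt{3^2 + 4^2}$, the LLM narrates the decomposition while OccamNet performs one exact operation per autoregressive step.

```python
# Our illustrative decomposition: each line corresponds to one exact
# OccamNet operation selected by the switch at a single step.
step1 = 3.0 * 3.0        # multiply -> 9.0
step2 = 4.0 * 4.0        # multiply -> 16.0
step3 = step1 + step2    # add      -> 25.0
result = step3 ** 0.5    # sqrt     -> 5.0
print(result)
```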

Concerns about multi-step reasoning results

We demonstrate small training tweaks that cause OccamLlama to outperform even GPT 4o and GPT 4o with Code on MultiArith Float and MATH401 and to substantially diminish the gap with Llama 3 8B on GSM8K. On average, OccamLlama outperforms both Llama and GPT 3.5 Turbo, with accuracy approximately halfway between that of GPT 3.5 and GPT 4o.

We also find that the performance of OccamLlama on reasoning benchmarks improves dramatically as we increase the base model from Llama 3 8B to Llama 3 70B, as discussed above.

Advantages compared to code generation or finetuning

We argue for the following advantages of OccamLlama over code generation or finetuning:

Fine-tuning

  • No catastrophic forgetting.
  • Improved arithmetic. As shown in our arithmetic and mathematical reasoning benchmarks, even larger models such as GPT 4o fall short of OccamLLM for arithmetic more challenging than three-digit multiplication, so we also expect fine-tuned models to fall short of OccamLLM.

Code generation

  • No catastrophic forgetting.
  • Improved generation speed and cost. In the arithmetic benchmarks, we found that GPT 4o with code required on average more than 50 completion tokens, far more than the single autoregressive step required by OccamLlama to answer each question. Additionally, our total expenses for GPT 4o with code were more than double that of GPT 4o without code, even though we made almost 5 times fewer queries to GPT 4o with code. This suggests OccamLLM would be faster and cheaper than code generation.
  • Improved security: Running LLM-generated code is potentially unsafe. For example, the LLM could ask to run the Python code `__import__('os').system('rm -rf /')`. It is in general extremely challenging to check general-purpose code for malicious behavior.
  • Reduced engineering: Building a system that hosts an LLM and a sandbox to safely run code requires a high engineering effort.

We again thank the reviewers for their comments. We will make all of these points clearer in the paper.

Comment

Dear Area Chairs,

We thank you for your consideration of our paper. We wish to provide a few clarifications we found to be particularly relevant based on our conversations with reviewers:


A recurring point of clarification concerned the advantages of OccamLLM over LLM code generation. We list key advantages of OccamLLM over code generation below:

  • Improved generation speed and cost: LLM code generation requires generating tokens for code, often using many more tokens than are needed to respond to a query. This can make LLMs with code generation (unlike OccamLLM) slow and expensive. For example, on the arithmetic benchmarks, we found that GPT 4o with code required on average more than 50 completion tokens, far more than the single autoregressive step OccamLlama needs to answer each question. Additionally, our total expenses for GPT 4o with code were more than double those for GPT 4o without code, even though we made almost 5 times fewer queries to GPT 4o with code, suggesting that GPT 4o with code is approximately 10 times more expensive per query than GPT 4o. Since OccamLLM does not require any additional autoregressive steps of the LLM, this suggests that OccamLLM is faster and cheaper than code generation.
  • Better results: The LLM code generation method we compared against, GPT 4o with Code Interpreter, actually performed worse than GPT 4o on average. We believe these results demonstrate that, for arithmetic, LLM code generation is not an optimal solution. OccamLlama 70B outperforms GPT 4o with code on our benchmarks, suggesting that OccamLLM provides a better solution.
  • No catastrophic forgetting: Code generation generally requires fine-tuning an LLM to write and run code. This risks catastrophic forgetting.
  • Improved security: Running LLM-generated code is potentially unsafe. For example, the LLM could ask to run the Python code `__import__('os').system('rm -rf /')`. It is in general extremely challenging to check code for malicious behavior. OccamLLM does not run arbitrary code, so it avoids these risks.
  • Reduced engineering: For the reasons mentioned in the previous bullet, building a secure system for LLM code execution, such as by hosting an LLM and a sandbox to safely run code, requires substantial engineering effort. Users can simply download our OccamLLM models and run them; no engineering is required.

We wish to emphasize that, although it is commonly assumed (including among the authors prior to this work) that code generation can solve the problem of arithmetic in LLMs, our results show that GPT 4o + Code is worse on average on mathematical problem-solving benchmarks than GPT 4o. This suggests that, as of now, LLM code generation does not solve the problem of LLM arithmetic as well as one might expect. OccamLlama provides a performant alternative.


Regarding the OccamLLM switch, we agree with reviewers that adding a switch network is not novel, and we do not wish to claim otherwise. We argue that the novelty of our work lies in the initialization of a symbolic model by decoding the hidden states of a language model.

We need the switch in order to enable autoregressive generation, but the key point we want to show is that the OccamNet-decoder system can perform arithmetic correctly in a variety of contexts.


One concern of reviewers was that OccamLLM does not improve performance on math reasoning benchmarks. We agree that OccamLLM does not improve reasoning. At the same time, we would not expect a method for improving arithmetic to improve reasoning.

OccamLlama dramatically outperforms Llama across model sizes on MultiArith Float and MATH401 because these two datasets are the ones with challenging arithmetic that Llama cannot perform by itself. We would not expect OccamLlama to improve over Llama on problems where Llama can already perform the arithmetic, which is the case for the other benchmarks we tested. Instead, we see OccamLlama's competitive performance on benchmarks for which Llama already performs arithmetic well as a demonstration that OccamLlama is robust to a wide range of problems and is capable of being used in general mathematical problem-solving tasks.


We would like to thank the reviewers for their thoughtful comments. We appreciate their interest in our method for fast and exact LLM arithmetic. We believe OccamLLM is a novel development that has the potential to dramatically improve LLM applications to problems requiring extensive arithmetic. We would especially like to thank reviewer JtJz for agreeing with us and increasing their score to a 7 in response to our discussion.

Warmly,

The Authors.

Final Decision

OccamLLM augments a pre-trained LLM with an auxiliary neuro-symbolic network that can perform arithmetic operations. At each autoregressive step, a switch chooses whether to output a token from the LLM, or from the arithmetic network. Unlike other work that combines LLMs with tools, the LLM itself keeps frozen weights -- it does not need any fine-tuning. Instead, the decoder and switch are trained to observe the hidden states of a pre-trained LLM. OccamLLM is also potentially faster, since it can do arithmetic "in its head" for each token, and does not have to invoke the main LLM to generate code.

The authors demonstrate that on arithmetic-heavy math problems, OccamLLM does much better than non-augmented LLMs.

Reviewer responses to this paper were very mixed: one accept (7), two weak accepts (6), one weak reject (4), and one reject (3). Reviewers brought up three main complaints. First, several reviewers questioned the novelty of the approach, since augmenting an LLM with a switch is not new. Second, several reviewers complained that the arithmetic network was strictly less general than code generation, and did not think that the advantages (speed, simplicity) outweighed the loss of generality. Third, while OccamLLM performed much better than comparable LLMs on tasks that required heavy arithmetic, it actually performed slightly worse on tasks that did not.

After reading both the paper and the reviews, I think some of the reviewers were too harsh. The paper is very well-written, and the authors very clearly spell out the benefits of having a small arithmetic network that runs on each token, vs. the cost and complexity of full code generation. I agree that code generation is more powerful, but I do not think that alternative mechanisms should be rejected out of hand.

I also disagree with the critique that the authors' approach is not sufficiently novel. Incorporating a dedicated arithmetic unit into an LLM does seem to be a novel combination (none of the reviewers disputed this point), even if the precise mechanism for integration borrows ideas from prior work.

Unfortunately, I agree with the other reviewers that there is cause for concern, because OccamLLM does worse than competing models on tasks that are not arithmetic-heavy. The authors argue that a big advantage of OccamLLM is that it does not suffer from catastrophic forgetting due to fine-tuning. However, even without fine-tuning, the switch is still interfering with the operation of the LLM in other ways.

In total, I personally would lean towards accepting this paper, rather than rejecting it. Augmenting LLMs with tool use is currently an area of intensive research, and this paper shows both the advantages, and disadvantages, of one form of integration. The paper is well-written, the experiments are well-done, and there are no technical problems, so I think that it deserves to be published.