PaperHub
Rating: 3.5 / 10 (withdrawn) · 4 reviewers
Individual ratings: 3, 3, 3, 5 (lowest 3, highest 5, standard deviation 0.9); average 3.5
Confidence · Correctness 2.3 · Contribution 2.0 · Presentation 2.0
ICLR 2025

Unlocking the Theory Behind Scaling 1-Bit Neural Networks

OpenReview · PDF
Submitted: 2024-09-19 · Updated: 2024-12-03

Abstract

Recently, 1-bit Large Language Models (LLMs) have emerged, showcasing an impressive combination of efficiency and performance that rivals traditional LLMs. Research by Wang et al. (2023); Ma et al. (2024) indicates that the performance of these 1-bit LLMs progressively improves as the number of parameters increases, hinting at the potential existence of a *Scaling Law for 1-bit Neural Networks*. In this paper, we present the first theoretical result that rigorously establishes this scaling law for 1-bit models. We prove that, despite the constraint of weights restricted to $\{-1, +1\}$, the dynamics of model training inevitably align with kernel behavior as the network width grows. This theoretical breakthrough guarantees convergence of the 1-bit model to an arbitrarily small loss as width increases. Furthermore, we introduce the concept of the generalization difference, defined as the gap between the outputs of 1-bit networks and their full-precision counterparts, and demonstrate that this difference maintains a negligible level as network width scales. Building on the work of Kaplan et al. (2020), we conclude by examining how the training loss scales as a power-law function of the model size, dataset size, and computational resources utilized for training. Our findings underscore the promising potential of scaling 1-bit neural networks, suggesting that int1 could become the standard in future neural network precision.
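As a point of reference, the following is a minimal numpy sketch of the quantity the abstract calls the generalization difference, assuming a generic two-layer ReLU network with a small output scale kappa; the architecture and the kappa/sqrt(m) scaling are illustrative assumptions, not the paper's exact model.

```python
import numpy as np

# Minimal sketch (assumed architecture, not the paper's exact model): compare a
# two-layer ReLU network with full-precision hidden weights against the same
# network whose hidden weights are binarized to {-1, +1}. The gap between the
# two outputs is the "generalization difference" discussed in the abstract.
rng = np.random.default_rng(0)
d, m = 16, 4096                      # input dimension, hidden width
W = rng.standard_normal((m, d))      # full-precision hidden weights
a = rng.choice([-1.0, 1.0], size=m)  # fixed +/-1 output weights

def forward(weights, x, kappa=0.1):
    # kappa / sqrt(m) output scaling is an assumption, in the spirit of
    # NTK-style parameterizations
    return kappa / np.sqrt(m) * a @ np.maximum(weights @ x, 0.0)

x = rng.standard_normal(d)
full_precision = forward(W, x)
one_bit = forward(np.sign(W), x)     # weights restricted to {-1, +1}
print("generalization difference:", abs(full_precision - one_bit))
```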
Keywords
1-bit neural network, quantization, neural tangent kernel

Reviews and Discussion

Review
Rating: 3

The paper shows that a modified 1-bit soft-committee machine converges to an arbitrarily small loss as the width of the hidden layer goes to infinity. It also gives a theoretical upper bound on the rate of convergence. The authors further show that a 1-bit model can be trained to estimate complex functions with high accuracy.

Strengths

  1. The paper introduces a variation of the soft-committee machine that can be analysed analytically to show the convergence of 1-bit models.

  2. The convergence is analysed as a function of the number of model parameters and the size of the training data set.

Weaknesses

  1. Given that the paper provides numerical results, I found it problematic that the main result of the paper (the functional form of the convergence of the training loss of 1-bit models) is not verified with numerical results.

  2. The finding of a "scaling law" in Fig.1 with four data points, no comparison with a theory curve, and significant fluctuations is, in my opinion, not a valid verification of Proposition 4.3 at all.

  3. The figures are not properly described/referenced in the text.

  4. The introduction of $\kappa$ in the model seems a bit arbitrary, especially considering that it is necessary to "plug an appropriate value of $\kappa$" (Line 391) into the model to ensure learning.

  5. The abstract does not adequately explain what is being done in the paper. Instead of citing previous results in the abstract, I would like to see the model and a more precise definition of the results there.

Questions

See Weaknesses

Review
Rating: 3

This paper conducts a theoretical analysis on scaling 1-bit neural networks. The authors use the NTK framework to derive a scaling law formula for 1-bit quantized networks. Experiments on simulating target functions demonstrate the effectiveness of the proposed scaling law.

Strengths

  • This paper studies an important problem, namely scaling 1-bit neural networks. Given the increasing deployment of large foundation models, reducing their energy consumption has become an increasingly important problem. 1-bit quantization is a promising approach to study.
  • The authors conduct thorough experiments to validate the theoretical analysis.

Weaknesses

  • This paper studies the scaling law of 1-bit neural networks, but the analysis is performed on simple toy models. The original scaling law proposed by Kaplan [1] was derived on Transformers with millions/billions of parameters and huge amounts of pretraining compute. The goal of a scaling law should be to reliably predict the performance of pretraining with larger compute. The setting of this paper therefore does not fit the category of "scaling law", given the small-scale experiments conducted on toy models.
  • The experiments do not support the claim that there is a scaling law for 1-bit quantized neural networks. The authors do not demonstrate the exact scaling-law formula, e.g., what the values of the coefficients in Proposition 4.3 are when fitting the curve to the experiments in Figure 1. Obtaining the coefficients of the scaling law is of practical importance, as they can be used to predict the performance of training larger 1-bit quantized models (a minimal curve-fitting sketch follows this list).
  • What is the justification for using the NTK to study this problem? Section 3.3 is introduced rather abruptly and lacks a discussion of the motivation.
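For concreteness, a minimal sketch of the kind of coefficient fitting the reviewer asks for is shown below; the data points and the functional form L(N) = a * N^(-b) + c are placeholders, not the paper's Figure 1 measurements or the formula in Proposition 4.3.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical loss-vs-model-size measurements (NOT the paper's Figure 1 data),
# used only to show how fitted scaling-law coefficients could be reported.
N = np.array([1e6, 4e6, 1.6e7, 6.4e7])   # parameter counts (assumed)
loss = np.array([3.1, 2.6, 2.25, 2.0])   # training losses (assumed)

def power_law(N, a, b, c):
    # Generic Kaplan-style form L(N) = a * N**(-b) + c; a placeholder form,
    # not the formula in Proposition 4.3.
    return a * N ** (-b) + c

(a, b, c), _ = curve_fit(power_law, N, loss, p0=(10.0, 0.1, 1.0), maxfev=10000)
print(f"fitted coefficients: a={a:.3g}, b={b:.3g}, c={c:.3g}")
```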

Questions

Could the authors provide an evaluation on practical models, e.g., GPT-2, and validate the proposed scaling law?

Review
Rating: 3

The work theoretically studies the power of 1-bit networks using NTK theory. Specifically, the authors bound the convergence time for training a 1-bit quantized network with Quantization Aware Training, where the network is quantized in the forward pass and the gradient is computed with a straight-through estimator in the backward pass. They also relate the convergence guarantees to scaling laws for 1-bit networks and complement these scaling laws with experiments. Additionally, the authors bound the difference between the quantized and non-quantized networks at initialization and during training.
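To make the described training scheme concrete, here is a minimal PyTorch sketch of 1-bit quantization-aware training with a straight-through estimator; the layer sizes, the 1/sqrt(fan-in) scaling, and the toy regression task are illustrative assumptions rather than the paper's setup.

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Sign quantization in the forward pass, identity (straight-through)
    gradient in the backward pass."""

    @staticmethod
    def forward(ctx, w):
        return torch.sign(w)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output

class OneBitLinear(torch.nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.weight = torch.nn.Parameter(0.1 * torch.randn(d_out, d_in))
        self.scale = d_in ** -0.5  # 1/sqrt(fan-in) scaling (an assumption)

    def forward(self, x):
        # The forward pass only ever sees {-1, +1} weights; the optimizer
        # updates the latent full-precision weights via the STE gradient.
        return self.scale * x @ BinarizeSTE.apply(self.weight).t()

# Tiny regression run to show the training loop.
torch.manual_seed(0)
model = torch.nn.Sequential(OneBitLinear(8, 256), torch.nn.ReLU(), OneBitLinear(256, 1))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(64, 8), torch.randn(64, 1)
for step in range(200):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
print("final loss:", loss.item())
```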

Strengths

  • The paper studies an important topic of training quantized networks, from a theoretical point of view.
  • Showing convergence of 1-bit networks under quantization-aware training is, to my knowledge, novel.
  • Studying the relation between 1-bit networks and full-precision networks is interesting and important.

Weaknesses

  • Lemma 4.1: the results in the lemma depend on $\lambda$, which I assume is $\lambda_{\min}(H^*)$ (this should be properly stated). However, if this is the case, $\lambda$ should depend explicitly on $\kappa^2$, and the dependence on $\kappa^2$ should be tracked throughout the results. It is better to define $\lambda$ as the minimal eigenvalue of the non-scaled Gram matrix and have the explicit dependence on $\kappa^2$ in the results.
  • It is worth noting that the scaling laws derived in Proposition 4.3 take a very different form from the Kaplan scaling laws (e.g., $N$ and $D$ in the Kaplan scaling laws appear in the denominator with fractional powers, and are not exponentiated).
  • The authors claim that the function implemented by the quantized network approaches the function computed by the full-precision network as the network size grows, but I do not see how this is reflected in the results. Specifically, it is worthwhile to write down an explicit bound on the difference between the two networks that goes to zero with $m$, while keeping other parameters (e.g., dataset size and network scale) constant; an illustrative shape for such a bound is sketched after this list. I currently do not see such a bound, but maybe I am missing something.
  • Related to the above, the main results that relate the quantized and non-quantized networks (Theorems 5.1 and 5.2) rely on the scaling parameter $\kappa$ being very small. However, clearly taking $\kappa$ to zero makes both functions go to zero, and thus the difference between the two functions trivially becomes small. But this is also true for any (bounded) function multiplied by $\kappa$, so it seems that these results hold trivially. Do the results simply imply that all functions go to zero with $\kappa$ and thus become close, or can these results hold when both functions are bounded away from zero (e.g., by a constant)?
  • The functions studied in Section 6.1 are relatively low-dimensional, which may mean that they can be approximated easily in the kernel regime by any kernel. Did you try experimenting with higher-dimensional functions?
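Purely as an illustration of the statements requested above (not results taken from the paper), and assuming the scaled Gram matrix is $H^* = \kappa^2 H$, the requested clarifications could take a shape such as:

```latex
% Illustrative only: the shape of the statements requested above, assuming the
% scaled Gram matrix is H^* = \kappa^2 H; these are not the paper's results.
\[
  \lambda_{\min}(H^{*}) = \kappa^{2}\,\lambda_{\min}(H),
  \qquad
  \sup_{x \in \mathcal{X}} \bigl| f_{\mathrm{bin}}(x) - f_{\mathrm{full}}(x) \bigr|
  \le \frac{C(n, \kappa, \lambda)}{\sqrt{m}}
  \longrightarrow 0 \quad (m \to \infty).
\]
```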

Additionally, there are some unclear sentences/phrases in the paper, e.g.:

  • Lines 392-395, these two sentences are not clear.
  • Line 347
  • Line 260-261
  • Line 422: "we aimed to learn rigorous functions" - what does "rigorous function" mean here?

Questions

See above.

Review
Rating: 5

This paper provides theoretical justification for scaling 1-bit neural networks, showing that their training dynamics converge to kernel-like behavior as model width increases. The authors demonstrate that the training loss can become arbitrarily small with sufficient network width and that the generalization difference between 1-bit and full-precision networks shrinks with scale. Using the Neural Tangent Kernel (NTK) framework, the authors guarantee training convergence to minimal loss. Preliminary empirical results confirm that 1-bit models perform comparably to full-precision networks on complex tasks, with significant efficiency gains. This work suggests that scaling quantization-aware training of 1-bit neural networks is a promising direction for efficient deep learning.
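As a rough illustration of the NTK-style argument referenced here, the sketch below computes the empirical kernel and its smallest eigenvalue for a simplified two-layer ReLU network with only the hidden weights trained; this simplified setting and all constants are assumptions, not necessarily the paper's exact model.

```python
import numpy as np

# Sketch of the kernel quantity behind NTK-style convergence guarantees, under
# simplifying assumptions (two-layer ReLU network, only hidden weights trained;
# not necessarily the paper's exact setting). A strictly positive lambda_min(H)
# is what such convergence arguments rely on.
rng = np.random.default_rng(0)
n, d, m = 20, 8, 2000
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm inputs (assumed)
W = rng.standard_normal((m, d))                 # hidden weights at initialization

# For f(x) = (1/sqrt(m)) * sum_r a_r * relu(w_r . x) with a_r in {-1, +1},
# the gradient w.r.t. w_r is (1/sqrt(m)) * a_r * 1[w_r . x > 0] * x, and since
# a_r**2 = 1 the empirical NTK Gram matrix simplifies to:
act = (X @ W.T > 0).astype(float)               # n x m activation pattern
H = (X @ X.T) * (act @ act.T) / m
print("lambda_min(H) =", np.linalg.eigvalsh(H).min())
```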

Strengths

  1. The topic this paper pushes is timely, as there have been several recent works on Quantization Aware Training of low-bitwidth models, which show how these models start working better at scale.

  2. This paper takes an interesting NTK perspective to justify the training dynamics and generalization of binary neural networks. The authors also derive a scaling law for 1-bit neural networks under certain assumptions.

Weaknesses

  1. The main weakness of the paper is the lack of realistic experiments and the positioning of the paper. The authors use 1-bit LLMs to motivate the paper in the abstract and introduction, but their theory and experiments are for quantized MLPs at a very small scale. The authors should rewrite the abstract and introduction to clarify what their focus is rather than overemphasizing low-bitwidth LLMs. For example, the authors should explicitly state that their focus is on quantized MLPs for learning functions, and clarify how their results may or may not extend to large-scale LLMs.

  2. In Section 3.2, the authors mention that $f$ is a two-layer attention model; however, the equations suggest it is a two-layer MLP. Symbols like $m$ are not defined clearly. This section needs better clarity.

  3. The choice of experiments (learning rigorous functions) seems arbitrary to me, given that the motivation of the paper is 1-bit LLMs. Can the authors explain this disparity?

  4. The authors should look at related papers that empirically derive scaling laws for low-bit (binary and ternary) quantization-aware training of LLMs, such as [1], which might make the arguments in this paper stronger. That work also provides model checkpoints and losses at various scales for 1-bit and 1.58-bit LLMs. If the authors want to position their paper as validating the scaling of low-bit LLMs using their theory, they could use the existing checkpoints from this work.

[1] A. Kaushal et al. Spectra: Surprising Effectiveness of Pretraining Ternary Language Models at Scale

Questions

Please see the weaknesses. I'll ask more questions after looking at the authors' rebuttal.

Withdrawal Notice

We sincerely thank all reviewers for their constructive suggestions and helpful feedback. We have decided to withdraw this paper.