PaperHub

Average rating: 5.5 / 10 (Rejected, 4 reviewers)
Ratings: 6, 5, 3, 8 (min 3, max 8, std 1.8)
Confidence: 4.0 | Correctness: 2.5 | Contribution: 2.5 | Presentation: 2.5
ICLR 2025

BCQ: Block Clustered Quantization for 4-bit (W4A4) LLM inference

OpenReview | PDF
Submitted: 2024-09-28 | Updated: 2025-02-05
TL;DR

LO-BCQ is a PTQ algorithm that minimizes quantization MSE of both weights and activations to achieve accurate W4A4 quantization for LLM inference.

Abstract

Keywords
Post-training Quantization, Large Language Models, Codebooks, Clustering, Block Clustered Quantization

Reviews and Discussion

Official Review
Rating: 6

This paper introduces a novel block quantization method called LO-BCQ, which achieves state-of-the-art accuracy while avoiding additional memory overheads by iteratively optimizing block clustering and quantization through a locally optimal pipeline.

Strengths

  1. This paper iteratively optimizes block clustering and per-cluster quantization to minimize MSE loss for both weights and activations, achieving strong results.
  2. LO-BCQ avoids the memory overheads caused by the per-block codebooks and scaling factors of previous methods by sharing scaling factors within a group of blocks.

Weaknesses

  1. The paper neglects to describe the distinction between block quantization and group-wise quantization, because from the perspective of form and definition there seems to be no difference between them.
  2. This paper only conducts comparison experiments against block-quantization-based methods. However, there are many other widely used PTQ methods for LLMs, such as OmniQuant, GPTQ, QuIP, etc.
  3. In addition to accuracy, efficiency is an important criterion for measuring the effectiveness of a proposed method. Compared to general PTQ methods, LO-BCQ involves numerous optimization steps, which to some extent reduce its efficiency.
  4. The logical flow of the writing needs to be improved, especially in the Introduction.

Questions

  1. For weakness 1, if feasible, please provide detailed analysis and experiments comparing block quantization with group-wise quantization, e.g., against AWQ-g128.
  2. For weakness 2, to demonstrate the advanced nature of your method, please include comparisons with the mentioned PTQ methods in your experimental tables.
  3. For weakness 3, please provide the running time of LO-BCQ. If the proposed method is faster than OmniQuant, which is known as a relatively slow but effective PTQ method, it would still be within an acceptable range.
  4. The authors should check for spelling errors in the manuscript, such as "biwidth" in line 054.

Comment

We greatly appreciate the comments and questions of the reviewer. We thank the reviewer for acknowledging our locally optimal block clustering approach for quantization, which achieves the current state-of-the-art for 4-bit quantization of both weights and activations across several LLMs and datasets.

The reviewer has raised valid criticism regarding the terminology of block quantization vs. group quantization. First, we would like to apologize for this confusion and clarify that group quantization and block quantization refer to the same approach. As quantization for LLMs is a rapidly evolving field, there has not yet been a consensus on unified terminology. Further, we would like to emphasize that LO-BCQ improves over group quantization, which typically quantizes each group to a fixed number format, by allowing each group to map to one of a set of codebooks that minimizes quantization error.

To demonstrate the advanced nature of our method, we perform a detailed comparison against the weight-only quantization proposals listed by the reviewer. While the techniques in Omniquant, GPTQ, AWQ and QuiP have been demonstrated for weight quantization, we would like to highlight that LO-BCQ can be applied to both weights and activations. That is, the codebooks we calibrate through LO-BCQ are universal across operand types and models. Further, while LO-BCQ is a post-training quantization (PTQ) method, GPTQ and QuiP involve finetuning to recover the accuracy loss from quantization. We compare the Wikitext perplexity loss of LO-BCQ against these works in Figure 1 of our paper.

In the table below, we compare weight-only LO-BCQ to AWQ and GPTQ, all with group size = 128 for a fair comparison. As shown, LO-BCQ achieves comparatively lower perplexity loss. Further, we evaluate this loss on the Wikitext-103 dataset, which is much larger than the Wikitext-2 dataset used by GPTQ and AWQ.

Wikitext PPL Loss Compared to Unquantized Baseline

| Model | GPTQ | AWQ | LO-BCQ (2 Codebooks) | LO-BCQ (4 Codebooks) | LO-BCQ (8 Codebooks) |
| --- | --- | --- | --- | --- | --- |
| LLAMA2-7B | 0.22 | 0.13 | 0.14 | 0.12 | 0.09 |
| LLAMA2-70B | 0.10 | 0.09 | 0.09 | 0.07 | 0.06 |

Additionally, in the table below, we compare against Omniquant and QuiP across various datasets such as Wikitext, PIQA and Winogrande. As shown, LO-BCQ achieves a significant perplexity improvement over Omniquant and QuiP. Further, the accuracy loss on the PIQA and Winogrande datasets is kept below 1% with 8 codebooks.

Wikitext PPL Loss Compared to Unquantized Baseline

| Model | Omniquant | QuiP | LO-BCQ (2 Codebooks) | LO-BCQ (4 Codebooks) | LO-BCQ (8 Codebooks) |
| --- | --- | --- | --- | --- | --- |
| LLAMA2-7B | 0.27 | 0.19 | 0.14 | 0.12 | 0.09 |
| LLAMA2-70B | 0.15 | 0.10 | 0.09 | 0.07 | 0.06 |

PIQA Accuracy

| Model | Unquantized Baseline | Omniquant | QuiP | LO-BCQ (2 Codebooks) | LO-BCQ (4 Codebooks) | LO-BCQ (8 Codebooks) |
| --- | --- | --- | --- | --- | --- | --- |
| LLAMA2-7B | 78.07 | 77.1 | 78.4 | 77.97 | 77.97 | 77.58 |
| LLAMA2-70B | 81.56 | 80.7 | 81.4 | 81.18 | 81.34 | 81.45 |

Winogrande Accuracy

| Model | Unquantized Baseline | Omniquant | QuiP | LO-BCQ (2 Codebooks) | LO-BCQ (4 Codebooks) | LO-BCQ (8 Codebooks) |
| --- | --- | --- | --- | --- | --- | --- |
| LLAMA2-7B | 69.38 | 67.0 | 67.6 | 68.27 | 69.30 | 69.69 |
| LLAMA2-70B | 79.95 | 75.8 | 77.1 | 79.32 | 79.79 | 80.03 |

Comment

We agree with the reviewer that inference latency is an important metric for evaluating the performance of our proposal. For this rebuttal, given the limited time, we would like to qualitatively discuss the performance of LO-BCQ. As demonstrated by Keller et al. [3], with a dedicated inference accelerator for 64-wide group quantization, considerable speedup can be achieved, especially during the pre-fill phase in LLMs involving GEMM computations. In LO-BCQ, in addition to group quantization, we perform a decoding step where each 4-bit scalar is decoded to a 6-bit integer. Owing to the small number of codebooks utilized by LO-BCQ (footprint = 0.19KB), one can potentially store them in the shared memory of each SM, resulting in a small look-up cost. A dedicated hardware design can amortize the cost of this lookup through spatial and temporal re-use of operands. We would like to emphasize that, as a result of the small number of codebooks (<= 16) in our work, the cost of this lookup is considerably smaller than in other codebook-based proposals such as QuIP[4] and AQLM[5], where the number of codebooks is as large as 2^16. We would like to clarify that the clustering and calibration of codebooks is performed prior to inference (offline). We freeze the codebooks identified by the LO-BCQ algorithm across our evaluations.
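
As a back-of-the-envelope illustration of the quoted footprint (a sketch, assuming the configuration described in our responses: at most 16 codebooks, each a 16-entry table of 6-bit integer levels):

```python
# Back-of-the-envelope check of the ~0.19KB codebook footprint quoted above.
# Assumes <= 16 codebooks, each a 16-entry table of 6-bit integer levels.
num_codebooks = 16
entries_per_codebook = 16   # a 4-bit code indexes one of 16 entries
bits_per_entry = 6          # each entry is a 6-bit integer

total_bits = num_codebooks * entries_per_codebook * bits_per_entry
print(total_bits / 8, "bytes")       # 192.0 bytes
print(total_bits / 8 / 1024, "KB")   # ~0.1875 KB, i.e. the ~0.19KB figure
```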

Official Review
Rating: 5

This paper introduces a method named block clustered quantization (BCQ), which decomposes operand tensors into blocks, clusters them based on their statistics, and designs codebooks for each cluster. When weight and activation scalars are encoded in W4A4 format, BCQ achieves the current state-of-the-art, demonstrating <1% loss in inference accuracy across several LLMs.

Strengths

  1. This paper is well written, clearly described, and presents an effective methodology that significantly enhances group quantization across a wide range of tasks and models;
  2. Combining a quantization problem with a clustering problem is, I think, sufficiently novel, and the authors' algorithm is simple and effective.

Weaknesses

  1. Although the authors' group quantization achieved a significant improvement in accuracy, the paper does not report results on inference speed. From QServe we can learn that group quantization is very inefficient in terms of inference speed. Does the authors' proposed quantization scheme have any performance advantage over group quantization, other than saving some storage? Can the authors provide the corresponding latency metrics?
  2. The authors' experiments should include other model types such as LLaMA3 and Mistral, or even LLaMA3.1 and LLaMA3.2, and I am very much looking forward to the performance of the authors' method on MoE LLM models.
  3. Figure 2, Figure 3 and Figure 4 could be modified to better illustrate the steps, and/or an algorithmic description of the steps could be added.

Questions

  1. Has the authors' algorithm been compared to GPTQ, which is a very effective PTQ method for quantizing large models, and how does the authors' method compare to GPTQ in terms of effectiveness and efficiency?

Comment

Finally, we quantitatively compare LO-BCQ to GPTQ[6], which is an effective method for 4-bit weight-only PTQ. Here, GPTQ quantizes weights to 4 bits with a group size of 128 while maintaining activations in 8 bits. In contrast, LO-BCQ quantizes both weights and activations to 4 bits with a group size of 64. With >= 8 codebooks and effective bitwidths of 4.5 and 4.625, respectively, LO-BCQ achieves lower perplexity loss compared to GPTQ despite quantizing activations to 4 bits. Further, we evaluate LO-BCQ on Wikitext-103, which is a significantly larger dataset than the Wikitext-2 used to evaluate GPTQ. By demonstrating lower perplexity loss on a larger dataset, we show that LO-BCQ is a significantly more effective quantization method compared to GPTQ.

PPL Loss Compared to Unquantized Baseline

| Model | GPTQ (W4A8) | LO-BCQ (W4A4, 8 Codebooks) | LO-BCQ (W4A4, 16 Codebooks) |
| --- | --- | --- | --- |
| LLAMA2-7B | 0.22 | 0.13 | 0.12 |
| LLAMA2-70B | 0.10 | 0.09 | 0.07 |

References:
[1] Rouhani et al., "Microscaling Data Formats for Deep Learning", 2023.
[2] Rouhani et al., "With Shared Microexponents, A Little Shifting Goes a Long Way", ISCA '23.
[3] Keller et al., "A 95.6-TOPS/W Deep Learning Inference Accelerator With Per-Vector Scaled 4-bit Quantization in 5 nm", JSSC '23.
[4] Chee et al., "QuIP: 2-Bit Quantization of Large Language Models With Guarantees", 2024.
[5] Egiazarian et al., "Extreme Compression of Large Language Models via Additive Quantization", 2024.
[6] Frantar et al., "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers".

Comment

Since there has been no feedback, I keep my rating and recommend rejecting the work.

Comment

We greatly appreciate the comments and questions of the reviewer. We thank the reviewer for acknowledging our novel and effective clustering approach for quantization, which achieves the current state-of-the-art by demonstrating <1% loss in inference accuracy across several LLMs and datasets during 4-bit quantization of both weights and activations.

The reviewer's main concern is the cost of group quantization and how LO-BCQ improves over it. We agree with the reviewer that group quantization is computationally expensive compared to conventional quantization methods due to the per-group scale factor computations -- the smaller the group size, the larger this cost. However, for a given group size, LO-BCQ achieves significantly better inference accuracy than other group quantization methods since LO-BCQ can flexibly choose a codebook from a set of codebooks. As a result, LO-BCQ enables increasing the group size while minimally trading off accuracy, reducing both storage and computational costs. We demonstrate this behavior in the table below, where LO-BCQ keeps the perplexity loss on the Wikitext-103 dataset below 0.1 despite the large group size of 64, compared to other works such as MXFP4[1] and MX4[2].

LLAMA2-70B: Loss Compared to Unquantized Baseline (Lower is Better)

| Method | Group Size | Wiki-103 | PIQA | BoolQ | HellaSwag | Winogrande |
| --- | --- | --- | --- | --- | --- | --- |
| LO-BCQ (8 Codebooks) | 64 | 0.09 | 0.11 | 1.20 | 0.51 | 1.58 |
| LO-BCQ (16 Codebooks) | 64 | 0.07 | 0.06 | 0.98 | 0.23 | 0.79 |
| MXFP4[1] | 32 | 0.54 | 0.98 | 2.17 | 2.03 | 3.63 |
| MX4[2] | 16 | 0.44 | 0.98 | 2.82 | 2.03 | 3.55 |

We agree with the reviewer that inference latency is an important metric for evaluating the performance of our proposal. For this rebuttal, given the limited time, we would like to qualitatively discuss the performance of LO-BCQ. As demonstrated by Keller et al. [3], with a dedicated inference accelerator for 64-wide group quantization, considerable speedup can be achieved, especially during the pre-fill phase in LLMs involving GEMM computations. In LO-BCQ, in addition to group quantization, we perform a decoding step where each 4-bit scalar is decoded to a 6-bit integer. Owing to the small number of codebooks utilized by LO-BCQ (footprint = 0.19KB), one can potentially store them in the shared memory of each SM, resulting in a small look-up cost. A dedicated hardware design can amortize the cost of this lookup through spatial and temporal re-use of operands. We would like to emphasize that, as a result of the small number of codebooks (<= 16) in our work, the cost of this lookup is considerably smaller than in other codebook-based proposals such as QuIP[4] and AQLM[5], where the number of codebooks is as large as 2^16.

In response to the reviewer's request for additional experimental results, we provide evaluations of LO-BCQ on the Llama3.1-8B and Nemotron4-340B models on the Wikitext-103 dataset. As shown, for a given group size, LO-BCQ achieves increasingly better perplexity as the number of codebooks is increased.

Nemotron4-340B: Perplexity (PPL) on Wikitext-103 (Baseline PPL = 3.48)

| Num Codebooks | Group Size = 64 | Group Size = 32 | Group Size = 16 |
| --- | --- | --- | --- |
| 2 | 3.67 | 3.63 | 3.61 |
| 4 | 3.62 | 3.61 | 3.58 |
| 8 | 3.60 | 3.59 | 3.57 |
| 16 | 3.59 | 3.56 | 3.55 |

Llama3.1-8B: Perplexity (PPL) on Wikitext-103 (Baseline PPL = 5.57)

| Num Codebooks | Group Size = 64 | Group Size = 32 | Group Size = 16 |
| --- | --- | --- | --- |
| 2 | 6.40 | 6.18 | 6.07 |
| 4 | 6.07 | 6.03 | 5.95 |
| 8 | 5.97 | 5.96 | 5.87 |
| 16 | 5.94 | 5.88 | 5.84 |

We agree with the reviewer that additional experimental evaluations on MoE models would be helpful in fully demonstrating the benefits of our method. We will explore this as future work.

Official Review
Rating: 3

The paper introduces Block Clustered Quantization (BCQ), a novel method aimed at achieving accurate sub-8-bit quantization of weights and activations in large language models without relying on quantization-aware training (QAT). BCQ decomposes operand tensors into blocks, clusters these blocks, and designs dedicated optimal quantization codebooks for each cluster. This approach allows for tailored quantization that minimizes mean squared error (MSE) and maintains high inference accuracy.

Strengths

  • The LO-BCQ algorithm, which utilizes block clustering and dedicated codebooks, effectively minimizes mean squared error (MSE) while maintaining high accuracy.
  • The paper evaluates on several datasets, notably including MMLU, BoolQ and RACE.
  • It achieves less than 1% loss when quantizing weights and activations to 4 bits.

Weaknesses

  • The experiments show that LO-BCQ performs well, but it is built on multiple codebooks and a small block division. Current LLMs have a huge number of parameters; dividing them into 8-element blocks will lead to a large computational overhead, and it remains an open question whether the good performance can be maintained if the block size is increased.

  • LO-BCQ's fine-grained quantization does not seem to have an advantage over per-group quantization such as Atom[1], which quantizes 128 consecutive elements as a group and achieves excellent results in W4A4; results might be even better if Atom's group size were adjusted to 64 or smaller. In addition, LO-BCQ introduces operations such as clustering, indexing and updating codebooks, which incur more computational overhead. [1] Zhao, Yilong, et al. "Atom: Low-bit Quantization for Efficient and Accurate LLM Serving."

Questions

  • In the Table 4 ablation experiments, only the case where $L_b$ is less than 8 is discussed. What is the performance of the model if $L_b$ is aligned with group quantization, e.g. 128 or 64?
  • The performance improvement of LO-BCQ seems to come partly from the clustering of blocks. Activation values in LLMs have very different distributions across datasets; does using pre-calibration of activations to compute the activation codebooks have an impact on the language modeling ability of the LLM?
  • What is the impact on model inference speed of performing clustering with block size 8, indexing, and looking up codebooks during inference? Does it mask the gains from low-bit quantization?
Comment

LO-BCQ (Group Size=64): Loss compared to unquantized baseline (lower is better)

| Num Codebooks | Model | Wiki-103 | PIQA | BoolQ | HellaSwag | Winogrande |
| --- | --- | --- | --- | --- | --- | --- |
| 2 | Llama2-7B | 0.25 | 0.98 | 1.71 | 1.17 | 0.48 |
| 2 | Llama2-70B | 0.21 | 0.11 | 2.41 | 1.15 | 1.18 |
| 4 | Llama2-7B | 0.2 | 1.25 | 1.92 | 1.53 | 0.95 |
| 4 | Llama2-70B | 0.11 | 0.87 | 0.95 | 0.48 | 1.5 |
| 8 | Llama2-7B | 0.13 | 0.98 | 1.86 | 0.59 | -0.39 |
| 8 | Llama2-70B | 0.09 | 0.11 | 1.2 | 0.51 | 1.58 |
| 16 | Llama2-7B | 0.12 | 0.21 | 1.19 | 0.48 | 1.19 |
| 16 | Llama2-70B | 0.07 | 0.06 | 0.98 | 0.23 | 0.79 |

ATOM[1] (Group Size = 128): Loss compared to unquantized baseline (lower is better)

| Model | Wiki2 | PIQA | BoolQ | HellaSwag | Winogrande |
| --- | --- | --- | --- | --- | --- |
| LLAMA-7B | 0.48 | 1.09 | 3.33 | 3.18 | 3.16 |
| LLAMA-65B | 0.36 | 0.38 | 0.24 | 1.61 | 4.5 |

MX4[2] (Group Size = 16): Loss compared to unquantized baseline (lower is better)

| Model | Wiki2 | PIQA | BoolQ | HellaSwag | Winogrande |
| --- | --- | --- | --- | --- | --- |
| LLAMA-7B | 0.67 | 1.03 | 5.31 | 1.91 | 3.16 |
| LLAMA-65B | 0.44 | 0.98 | 2.82 | 2.03 | 3.55 |

MXFP4[3] (Group Size = 32): Loss compared to unquantized baseline (lower is better)

| Model | Wiki2 | PIQA | BoolQ | HellaSwag | Winogrande |
| --- | --- | --- | --- | --- | --- |
| LLAMA-7B | 0.70 | 0.54 | 5.29 | 2.88 | 1.90 |
| LLAMA-65B | 0.54 | 0.98 | 2.17 | 2.03 | 3.63 |

  3. Impact on the language-modeling task (perplexity on Wikitext-103) with universally calibrated codebooks: We calibrate our codebooks on activations from the GPT3-126M model and freeze the codebooks across our evaluations. As demonstrated by the perplexity achieved by LO-BCQ on the Wikitext-103 dataset, the pre-calibrated codebooks are effective in quantizing the activations of the various models we considered.

Wikitext-103, Group size=64 : Loss compared to unquantized baseline (lower is better)

| Num Codebooks | Llama2-7B | Llama2-70B | Nemotron4-15B | Nemotron4-340B | GPT3-22B |
| --- | --- | --- | --- | --- | --- |
| 2 | 0.25 | 0.21 | 0.43 | 0.19 | 0.20 |
| 4 | 0.20 | 0.11 | 0.33 | 0.14 | 0.10 |
| 8 | 0.13 | 0.09 | 0.26 | 0.12 | 0.08 |
| 16 | 0.12 | 0.07 | 0.21 | 0.11 | 0.09 |

To validate this further, we compare against the perplexity achieved by codebooks calibrated individually for each operand (weights and activations) in each layer of Llama2-7B. As shown, the universally calibrated codebooks perform as well as the layer-wise calibrated codebooks, especially with >= 8 codebooks.

Wikitext-103, Group size=64, Llama2-7B: Loss compared to unquantized baseline (lower is better)

| Num Codebooks | Universally Calibrated | Layerwise Calibrated |
| --- | --- | --- |
| 2 | 0.25 | 0.23 |
| 4 | 0.20 | 0.16 |
| 8 | 0.13 | 0.13 |
| 16 | 0.12 | 0.11 |

  4. Impact of the clustering, indexing and look-up steps of LO-BCQ on inference performance:

We agree with the reviewer that clustering and identifying codebooks is an expensive procedure that would impact inference performance if performed online. However, we would like to clarify that the clustering and calibration of codebooks is performed prior to inference (offline). We freeze the codebooks identified by the LO-BCQ algorithm across our evaluations.

Further, the codebook selectors are pre-determined for the weights and are computed on-the-fly for activations. The cost of determining the codebook selector is amortized over the large dot-product dimension (e.g. 8192 in Llama2-70B).

Finally, each 4-bit scalar is decoded to a 6-bit integer during GEMM/GEMV computations. Owing to the small number of codebooks utilized by LO-BCQ (footprint = 0.19KB), one can potentially store them in the shared memory of each SM, resulting in a small look-up cost. A dedicated hardware design can amortize the cost of this lookup through spatial and temporal re-use of operands.

Comment

References:
[1] Zhao, Yilong, et al., "Atom: Low-bit Quantization for Efficient and Accurate LLM Serving".
[2] Rouhani et al., "With Shared Microexponents, A Little Shifting Goes a Long Way", ISCA '23.
[3] Rouhani et al., "Microscaling Data Formats for Deep Learning", 2023.
[4] Tseng et al., "QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks", 2024.
[5] Egiazarian et al., "Extreme Compression of Large Language Models via Additive Quantization", 2024.

Comment

We thank the reviewer for their thorough comments and questions. We are glad to know that the reviewer acknowledges our extensive experimental evaluations across several LLMs and datasets. We also appreciate the reviewer emphasizing the fact that we have achieved <1% accuracy loss in several inference tasks when quantizing both weights and activations to 4-bits.

The main concern of the reviewer is regarding the group size or 'block division' used for quantization. We agree with the reviewer that a small group size such as 8 can lead to large computational overheads due to the high-precision scale factor multiplication per group. We would like to clarify that the block-array size, denoted by $L_A$ in our work, is equivalent to the group size; that is, each block array shares a scale factor. Therefore, in our work, scale factors are shared among groups of size 64, rather than 8, which has a much smaller overhead. Indeed, one of the strengths of our work is using different granularities for scaling and codebook indexing (the latter of which has size 8). Therefore, LO-BCQ leverages the benefits of finer-grained quantization while avoiding the prohibitive overhead associated with scaling factors. We clarify this in more detail in part 1 of our response.

We agree with the reviewer that Atom[1], a group-quantization-based proposal, demonstrates impressive results across several inference tasks. We have now cited this work in our paper. With LO-BCQ, we demonstrate comparatively better inference accuracy than previous group quantization proposals, including Atom[1], across several benchmarks in part 2 of our response. The superior accuracy achieved by LO-BCQ can be attributed to flexibly mapping each block to the codebook that achieves the least quantization error. In contrast, group quantization proposals associate a single number format with each group. Additionally, the reviewer has asked an important question regarding the calibration of codebooks for activations, which are typically generated on-the-fly during inference. We provide the performance of LO-BCQ on the language modelling task with universally calibrated codebooks and compare against layerwise calibration in part 3 of our response.

Finally, the reviewer has raised valid concerns regarding the cost of our clustering procedure to identify the locally optimal codebooks. We would like to clarify that these steps are performed during the calibration phase (offline) and do not impact the latency of inference. We discuss this further in part 4 of our response.

  1. Block size ($L_b$) vs block-array size ($L_A$): LO-BCQ shares a codebook per $L_b$-element block and a scale factor per $L_A$-element block array. Here, the block array is equivalent to the "group size" in previous proposals such as Atom[1]. During inference, we perform $L_A$-wide inner dot-products which are scaled by the per-block-array scale factors. Each 4-bit scalar within the block array is decoded into a 6-bit integer by looking up a 16-entry codebook. This codebook is selected by the codebook selector that is shared by each block of $L_b$ (typically 8) elements within the block array (see the sketch after this list). We would like to emphasize that, as a result of the small number of codebooks (<= 16) in our work, the cost of this lookup is considerably smaller than in other codebook-based proposals such as QuIP#[4] and AQLM[5], where the number of codebooks is as large as $2^{16}$.

  2. Comparing LO-BCQ to other group quantization proposals: We present the inference accuracy achieved by LO-BCQ with a large group size of 64 across several datasets and models. Across these benchmarks, we demonstrate significantly better inference accuracy compared to other group quantization proposals such as Atom[1], MX[2] and MXFP[3]. We find that with a group size of 64, LO-BCQ achieves significantly lower perplexity loss on the Wikitext dataset compared to other works. We share an 8-bit scale factor across 64-element block arrays, and each block of 8 shares a codebook selector. With 2, 4, 8 and 16 codebooks, the effective bit-width of LO-BCQ is 4.25, 4.375, 4.5 and 4.625, respectively. Similarly, we demonstrate significantly lower accuracy loss on datasets such as PIQA, BoolQ, HellaSwag and Winogrande. The superior performance of LO-BCQ comes from effectively designing codebooks that minimize quantization error and flexibly mapping each block to the codebook that best quantizes it.
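
To make the decode path in point 1 concrete, the snippet below is an illustrative sketch (not a production kernel) of how a single block array could be dequantized, assuming a 64-element block array with a shared scale factor, 8-element blocks each carrying a codebook-selector index, and 16-entry codebooks of 6-bit integer levels; variable names and layout are illustrative only.

```python
import numpy as np

def dequantize_block_array(codes, selectors, scale, codebooks, block_size=8):
    """Illustrative decode of one LO-BCQ-style block array (names/layout illustrative).

    codes     : (L_A,) ints in [0, 15], the stored 4-bit codes
    selectors : (L_A // block_size,) ints, one codebook index per L_b-element block
    scale     : float, scale factor shared by the whole block array
    codebooks : (N_c, 16) ints, 6-bit reconstruction levels per codebook
    """
    out = np.empty(codes.shape[0], dtype=np.float32)
    for b, sel in enumerate(selectors):
        blk = slice(b * block_size, (b + 1) * block_size)
        # Look up each 4-bit code in the block's selected 16-entry codebook
        # (yielding a 6-bit integer), then apply the shared per-array scale.
        out[blk] = codebooks[sel, codes[blk]] * scale
    return out

# Example with the {L_A, L_b} = {64, 8} configuration discussed above.
rng = np.random.default_rng(0)
codebooks = rng.integers(-32, 32, size=(8, 16))   # 8 codebooks of 6-bit levels
codes = rng.integers(0, 16, size=64)              # one 64-element block array
selectors = rng.integers(0, 8, size=8)            # one selector per 8-element block
x = dequantize_block_array(codes, selectors, scale=0.02, codebooks=codebooks)
```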

Comment

Additionally, with a group size of 128 for LO-BCQ, we would like to present a comparison to Atom. Here, LO-BCQ with 2, 4, 8 and 16 codebooks results in effective bitwidths of 4.19, 4.31, 4.44 and 4.56, respectively, for both weights and activations.

Wikitext PPL Loss compared to unquantized baseline

| Model | Atom | LO-BCQ (2 Codebooks) | LO-BCQ (4 Codebooks) | LO-BCQ (8 Codebooks) | LO-BCQ (16 Codebooks) |
| --- | --- | --- | --- | --- | --- |
| LLAMA2-7B | 0.48 | 0.32 | 0.21 | 0.15 | 0.15 |
| LLAMA2-70B | 0.36 | 0.29 | 0.15 | 0.12 | 0.10 |

Official Review
Rating: 8

The paper presents Block Clustered Quantization (BCQ), a post-training quantization (PTQ) method designed to compress both weights and activations in large language models (LLMs) to 4-bit precision. BCQ divides tensors into small blocks, clusters these blocks based on their statistical properties, and then quantizes each cluster using a dedicated codebook, reducing memory footprint and computational load while preserving model accuracy. The Locally Optimal Block Clustered Quantization (LO-BCQ) algorithm iteratively refines clusters and codebooks to minimize quantization error. Solid experiments with varying configurations on Llama2 and GPT3 models show that LO-BCQ outperforms state-of-the-art methods on W4A4 quantization.

Strengths

  1. Introducing iterative block clustering to derive an optimal quantization codebook for each block is novel, providing a robust and flexible mapping to the most accurate representation of each block. This approach, I believe, is the core factor driving the success of this method.

  2. It is a compelling finding that the codebook is universally applicable for quantizing any tensor, at any layer, across different models.

Weaknesses

I find this paper impressive, with no obvious weaknesses in its methodology or presentation. Please refer to the Questions section for some discussion points.

Questions

  1. How does this method perform on W2A2 and W8A8 quantization, and how does it compare to other methods? What are the bottlenecks when applying this method to more aggressive quantization?

  2. As shown in Fig. 6, the range of values within some codebooks is smaller than [-1, 1], which appears to compensate for the large L_A. Have you tested even smaller L_A values, and how do they perform?

  3. Line 223: fond → found.

Comment

We greatly appreciate the comments and questions of the reviewer. We thank the reviewer for acknowledging that our codebook-based approach is universally applicable for quantizing any tensor, at any layer, across different models.

We have conducted some preliminary studies with LO-BCQ for sub-4-bit quantization of weights (W2A4 quantization). In this regime, we find that quantization-aware training is needed to bridge the accuracy gap from the unquantized baseline. Further, reducing the block-array size ($L_A$) and block size ($L_b$) results in improved accuracy at the cost of increased effective bit-width. This is because at smaller $L_A$ and $L_b$, the per-block-array scale factors and per-block codebook selectors are shared among fewer scalars.

The reviewer is correct in pointing out that when $L_A$ is large, the range of values within some codebooks is smaller than [-1, 1]; these codebooks are used primarily for blocks with relatively small values. Although we have not experimented with $L_A$ smaller than 16, we would like to discuss the general trends. We find that inference accuracy improves as $L_A$ decreases, at the cost of increased bit-width overhead for storing the per-block-array scale factors. Similarly, inference accuracy improves when $N_c$ (the number of codebooks) increases, at the cost of increased bit-width overhead for storing the per-block codebook selectors. Empirically, we find that it is worthwhile to increase the number of codebooks $N_c$ rather than decrease $L_A$. For instance, the table below reports the Wikitext-103 perplexity of the Llama2-70B model for various $L_A$ and $N_c$ values with $L_b$ = 8. With $L_A$ = 64, increasing the number of codebooks to 16 results in better perplexity (lower is better) than decreasing $L_A$ to 16, while the effective bit-width in both of these configurations is 4.625.

Wikitext-103 perplexity on Llama2-70B model

| $N_c$ | 2 | 4 | 8 | 16 |
| --- | --- | --- | --- | --- |
| $L_A$ = 64 | 3.35 | 3.25 | 3.23 | 3.21 |
| $L_A$ = 32 | 3.27 | 3.24 | 3.22 | 3.20 |
| $L_A$ = 16 | 3.25 | 3.22 | 3.20 | 3.19 |

Comment

We would like to thank all the reviewers for their insightful comments and questions. We appreciate the interest in our codebook-based approach for achieving 4-bit weight and activation quantization in LLMs. We would like to emphasize that, to the best of our knowledge, we are the first to achieve <1% loss in downstream task accuracy when both LLM activations and weights are quantized to 4 bits during PTQ (no finetuning).

Key Modifications:

The main criticism of our work is regarding the group size (referred to as the block-array size in our paper) used for quantization. We agree with the reviewers that a small group size such as 8 can lead to large computational overheads due to the high-precision scale factor multiplication per group. However, we would like to clarify that the block-array size, denoted by $L_A$ in our work, is equivalent to the group size; that is, each block array shares a scale factor. Therefore, in our work, scale factors are shared among groups of size 64, rather than 8, which has a much smaller overhead.

In response to the reviewers' suggestion to compare the inference accuracy of LO-BCQ against other weight-only quantization proposals, we report the perplexity loss on the Wikitext dataset for weight-only LO-BCQ. We explore a large group size of 128 and show that LO-BCQ achieves lower perplexity loss compared to other methods. Here, the LO-BCQ configuration is denoted by the tuple $\{L_A, L_b, N_c\}$ = {length of block array, length of block, number of codebooks}. LO-BCQ with 2, 4, 8 and 16 codebooks results in effective bitwidths of 4.19, 4.31, 4.44 and 4.56, respectively.
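
The effective bitwidths quoted here and in our per-reviewer responses follow from a simple per-scalar storage count: 4 bits per code, plus a $\log_2 N_c$-bit codebook selector amortized over each $L_b$-element block, plus the scale factor (8 bits in the configurations above) amortized over each $L_A$-element block array. A small illustrative helper (the function name is ours) reproduces the quoted values:

```python
from math import log2

def effective_bitwidth(L_A, L_b, N_c, code_bits=4, scale_bits=8):
    # per-scalar cost = code + amortized codebook selector + amortized scale factor
    return code_bits + log2(N_c) / L_b + scale_bits / L_A

# {L_A, L_b, N_c} = {128, 8, N_c}: 4.19, 4.31, 4.44, 4.56 for N_c = 2, 4, 8, 16
print([round(effective_bitwidth(128, 8, n), 2) for n in (2, 4, 8, 16)])

# {64, 8, N_c}: 4.25, 4.375, 4.5, 4.625, as quoted in the per-reviewer responses
print([round(effective_bitwidth(64, 8, n), 3) for n in (2, 4, 8, 16)])
```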

Wikitext perplexity loss compared to unquantized baseline

| Method | Llama2-7B | Llama2-70B |
| --- | --- | --- |
| GPTQ | 0.22 | 0.10 |
| AWQ | 0.13 | 0.09 |
| OmniQ | 0.27 | 0.15 |
| QuiP# | 0.19 | 0.10 |
| LO-BCQ {128,8,2} | 0.14 | 0.09 |
| LO-BCQ {128,8,4} | 0.12 | 0.07 |
| LO-BCQ {128,8,8} | 0.09 | 0.06 |
| LO-BCQ {128,8,16} | 0.08 | 0.05 |

The superior performance of LO-BCQ comes from effectively designing codebooks that minimize quantization error and flexibly mapping each block to the codebook that best represents it.
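
For readers who prefer pseudocode, the snippet below is a deliberately simplified illustration of the calibration idea summarized above: it alternates between assigning each block to the codebook that quantizes it with the lowest MSE and refitting each codebook to the scalars of its assigned blocks. It is a sketch under simplifying assumptions (scale factors already applied, plain 1-D k-means updates) and not the LO-BCQ implementation evaluated in the paper.

```python
import numpy as np

def calibrate_codebooks(blocks, n_codebooks=8, n_levels=16, n_iters=10, seed=0):
    """Simplified block-clustered codebook calibration (illustrative only).

    blocks: (num_blocks, L_b) array of scale-normalized block values.
    Returns (codebooks, assignment): codebooks has shape (n_codebooks, n_levels).
    """
    rng = np.random.default_rng(seed)
    # Initialize each codebook from quantiles of a random subset of blocks.
    codebooks = np.stack([
        np.quantile(blocks[rng.choice(len(blocks), 256)],
                    np.linspace(0.0, 1.0, n_levels))
        for _ in range(n_codebooks)
    ])
    assignment = np.zeros(len(blocks), dtype=int)
    for _ in range(n_iters):
        # Assignment step: pick, per block, the codebook with lowest quantization MSE.
        err = np.stack([
            (np.abs(blocks[:, :, None] - cb[None, None, :]).min(axis=2) ** 2).mean(axis=1)
            for cb in codebooks
        ])                                  # shape (n_codebooks, num_blocks)
        assignment = err.argmin(axis=0)
        # Update step: refit each codebook with one 1-D k-means pass over its scalars.
        for c in range(n_codebooks):
            vals = blocks[assignment == c].ravel()
            if vals.size == 0:
                continue
            nearest = np.abs(vals[:, None] - codebooks[c][None, :]).argmin(axis=1)
            for k in range(n_levels):
                if np.any(nearest == k):
                    codebooks[c, k] = vals[nearest == k].mean()
    return codebooks, assignment
```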

Other modifications

  • Figure 1: We include a comparison to Atom, which is an excellent group-quantization-based proposal for quantizing both weights and activations in LLMs.
  • Table 3(b): We report weight-only quantization results for LO-BCQ with a large group size of 128 and compare it to GPTQ, AWQ, OmniQ and QuiP#.
  • Table 7: We report Wikitext-103 perplexity results with the Nemotron4-340B model and various LO-BCQ configurations.
  • We fixed the spelling errors pointed out by the reviewers.
AC Meta-Review

The paper presents a novel quantization technique, Block Clustered Quantization (BCQ), aimed at significantly reducing the storage and computational needs of large language models (LLMs) while maintaining near-state-of-the-art inference accuracy. This is achieved through the decomposition of operand tensors into blocks, clustering these blocks based on statistical characteristics, and employing tailored quantization codebooks for each cluster. The authors argue that the proposed method is innovative and addresses the critical challenge of achieving efficient sub-8-bit quantization without further training costs.

The reviewers exhibited varied perspectives on the paper’s merits. While some acknowledged the algorithm's technical novelty and its potential for improving quantization accuracy, concerns were raised regarding the lack of clarity in differentiating BCQ from existing group-wise quantization methods. Notably, one reviewer recognized the paper's innovative approach yet categorized it as marginally below the acceptance threshold, indicating mixed opinion among the reviewers.

Despite receiving a rebuttal from the authors, significant concerns remained unaddressed, particularly surrounding inference latency. While the authors asserted that their method improves inference accuracy, several reviewers questioned whether the added complexity of block clustering and codebook indexing could offset these gains, particularly in practical applications. One reviewer emphasized that the proposed method might not deliver substantial improvements in inference speed, echoing concerns that persisted beyond the authors' responses.

In conclusion, while the paper has merit and could contribute to the field of quantization in LLMs, the unresolved issues regarding inference latency and the clarity of its novelty compared to existing methods hinder unanimous support for acceptance. Based on the reviewers' insights, the paper does not currently meet the criteria for acceptance due to these outstanding concerns.


Final Decision

Reject