PaperHub
Overall score: 6.4/10 · Poster · 4 reviewers
Ratings: 4, 4, 3, 5 (min 3, max 5, std 0.7)
Confidence: 3.8
Novelty: 2.3 · Quality: 2.5 · Clarity: 2.8 · Significance: 1.8
NeurIPS 2025

CodeGEMM: A Codebook-Centric Approach to Efficient GEMM in Quantized LLMs

OpenReview · PDF
Submitted: 2025-05-09 · Updated: 2025-10-29

Abstract

Keywords
Quantization · Matrix Multiplication · Large Language Models

Reviews and Discussion

Review
Rating: 4

This paper introduces CodeGEMM, an optimized matrix multiplication kernel for codebook-based quantized large language models. Traditional codebook-based quantization methods suffer from significant latency due to the need for repeated dequantization and frequent codebook access.

CodeGEMM precomputes and stores the inner products between codebook centroids and input data in the programmable cache, eliminating the need for on-the-fly dequantization. The kernel is designed to be flexible across various quantization hyperparameters and model configurations.
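
To make the contrast concrete, the NumPy sketch below compares the two execution styles for a single codebook: dequantize-then-GEMM versus Psumbook lookup. It illustrates the idea only and is not the authors' CUDA kernel; all shapes, names, and the single-codebook setting are assumptions chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, N = 2, 64, 128           # input rows, hidden dim, output channels (toy sizes)
v, b = 4, 8                    # vector length, bits per code
G = K // v                     # weight sub-vectors per output channel

centroids = rng.standard_normal((2**b, v)).astype(np.float32)   # one codebook
codes = rng.integers(0, 2**b, size=(N, G))                      # quantized weights
X = rng.standard_normal((M, K)).astype(np.float32)

# (a) Conventional path: dequantize the weights, then run a dense GEMM.
W_hat = centroids[codes].reshape(N, K)            # (N, G, v) -> (N, K)
Y_dequant = X @ W_hat.T

# (b) Psumbook path: precompute <input sub-vector, centroid> once per input tile,
#     then each output element is just G table lookups and adds.
X_sub = X.reshape(M, G, v)
psumbook = np.einsum('mgv,cv->mgc', X_sub, centroids)    # (M, G, 2^b)
Y_psum = psumbook[:, np.arange(G), codes].sum(axis=-1)   # gather by codes, reduce over G

assert np.allclose(Y_dequant, Y_psum, atol=1e-4)
```

Part (b) builds the table once per input tile, after which every output element reduces to G gathers and adds; this is the reuse that the kernel exploits.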

Strengths and Weaknesses

Strengths

  1. The Psumbook-based approach is a practical and innovative solution to a key bottleneck in codebook-based quantization, significantly reducing both space and compute complexity.

  2. The kernel is adaptable to a wide range of quantization settings, supporting diverse codebook sizes, vector lengths, and group normalization strategies, making it practical for real deployment.

Weaknesses

  1. CodeGEMM is constrained by on-chip cache size, making very large codebooks (e.g., b=16) infeasible. The workaround (smaller codebooks, fine-grained group normalization) is reasonable, but may limit certain applications.

  2. The performance improvements are well justified empirically, but the paper could benefit from a more formal treatment of complexity, cache usage, and the limits of the Psumbook design.

  3. Figures 1–3 appear to be non-vector images. It would be better to replace them with vector graphics for improved clarity and scalability.

Questions

See Weaknesses.

Limitations

Provide a small-scale analysis or discussion of possible failure cases where Psumbook overhead could become a new bottleneck (e.g., if codebook structure is highly non-uniform or input batch sizes are extreme).

If possible, add error bars or a variance table to the main throughput/accuracy results.

Final Justification

After reading the author's rebuttal and the comments from other reviewers, I maintain my original score.

Formatting Issues

None

Author Response

Weaknesses

CodeGEMM is constrained by on-chip cache size, making very large codebooks (e.g., b=16) infeasible. The workaround (smaller codebooks, fine-grained group normalization) is reasonable, but may limit certain applications.

  • We agree, and this is indeed a known limitation of our current implementation. As noted, very large codebooks such as b = 16 can offer better accuracy but exceed the capacity of on-chip memory (e.g., GPU shared memory), making them impractical in our current design. However, as shown in Table 3, recent calibration techniques such as PV-Tuning have proven highly effective at mitigating accuracy degradation even under smaller codebooks like b = 8, which previously showed substantial loss. We are optimistic that as more advanced tuning and quantization algorithms are developed, the remaining accuracy gap will narrow further, without requiring excessively large codebooks.
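
For intuition on why b = 16 overflows on-chip memory, here is a rough footprint estimate; it assumes FP16 partial sums and ignores the kernel's actual tiling, which is not spelled out here, so the numbers are only order-of-magnitude.

```python
def psumbook_bytes(b, m=1, subvectors_resident=1, dtype_bytes=2):
    """Illustrative only: bytes of on-chip storage for all 2^b partial sums,
    per codebook, per resident input sub-vector position (FP16 assumed)."""
    return m * subvectors_resident * (2 ** b) * dtype_bytes

print(psumbook_bytes(b=8))    # 512 bytes  -> many such tables fit in shared memory
print(psumbook_bytes(b=16))   # 131072 bytes (~128 KB) for a single sub-vector,
                              # comparable to an entire A100 SM shared-memory
                              # budget (up to 164 KB), before tiling or batching
```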

The performance improvements are well justified empirically, but the paper could benefit from a more formal treatment of complexity, cache usage, and the limits of the Psumbook design.

  • Thank you for pointing this out. While Section 3 provides a high-level discussion of the complexity, cache behavior, and architectural constraints of the Psumbook design, we agree that a more rigorous and detailed analysis could enhance the clarity and completeness of the paper. Our current version emphasizes conceptual clarity and accessibility to help readers understand the motivation behind CodeGEMM. That said, your suggestion is well taken, and we plan to incorporate a more formal treatment in the final version.

Figures 1–3 appear to be non-vector images. It would be better to replace them with vector graphics for improved clarity and scalability.

  • Thank you for pointing this out. We agree that vector graphics would improve readability and scalability. While we are unable to modify the manuscript during the rebuttal phase, we will update these figures with vectorized versions in the camera-ready version if accepted.

If possible, add error bars or a variance table to the main throughput/accuracy results.

  • We appreciate this suggestion. Although we cannot revise the main figures at this stage, we have annotated all numerical results used during rebuttal with ±2σ error margins to ensure transparency and reliability. We will include proper error bars and/or variance tables in the camera-ready version.

We sincerely appreciate the insightful suggestions provided. Throughout the rebuttal period, we have made our best effort to address the reviewers' feedback and strengthen the paper accordingly. That said, we recognize that there may still be areas for improvement. If there are additional suggestions or concerns, we would be more than happy to incorporate them. Thank you again for your thoughtful and constructive comments.

Comment

After reading the author's rebuttal and the comments from other reviewers, I maintain my original score.

Comment

Thank you for taking the time to review our paper and for considering our rebuttal. We appreciate your efforts throughout the review process.

Review
Rating: 4

CodeGEMM is a GPU GEMM kernel designed for large language models that eliminates the costly dequantization step by pre-computing all inner products between input activations and centroid vectors, storing the results in a small shared-memory “Psumbook.” At inference time, the kernel merely indexes this table with the quantized codes, reducing cache traffic and computational complexity while accommodating a broad range of quantization hyperparameters (bits per code, number of codebooks, vector length, group size) without kernel rewrites. Evaluations on 2-bit Llama-3 (8 B and 70 B) show up to 2.27× kernel-level speedups and 1.8× higher end-to-end throughput over state-of-the-art codebook methods, illustrating that a codebook-centric GEMM can translate the theoretical storage gains of extreme low-bit quantization into real-world LLM inference performance.

Strengths and Weaknesses

Strengths:

  • This paper is overall well written and easy to follow. The method and results are clearly presented.
  • Motivation and the design of the kernel are visualised with side-by-side diagrams (Figure 1) and a step-wise schematic (Figure 3).
  • The authors derive both computational and space complexity, showing CodeGEMM reduces the multiply-accumulate count by a factor ≈ m/v.
  • It provides up to 2.27× speed-up compared to other codebook-based quantization methods with comparable accuracy in the 2-bit configuration.
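
To make the ≈ m/v factor in the third strength explicit, one consistent accounting is sketched below, using the hyperparameters m (codebooks), v (vector length), and b (bits per code) defined in these reviews; this is an inferred reading, not the paper's verbatim derivation.

```latex
% For an (M x K) input and an (N x K) quantized weight matrix:
\begin{align*}
\text{dense GEMM:}         \quad & \mathrm{MACs} \approx M N K \\
\text{Psumbook build:}     \quad & \mathrm{MACs} \approx M \cdot \tfrac{K}{v} \cdot m \cdot 2^{b} \cdot v = M\, m\, 2^{b} K \\
\text{lookup-accumulate:}  \quad & \mathrm{adds} \approx M N \cdot \tfrac{m K}{v}
\end{align*}
```

The accumulate term is the dense count scaled by m/v, and the one-off build term is shared across all N output channels, so it becomes negligible once N is much larger than 2^b.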

Weaknesses:

  • It only provides benchmark results of the kernel under two different model sizes. Since this work is all about kernel optimization, more benchmark results under different problem sizes are supposed to be given.
  • Implementation details such as choosing t_w = 32, t_h = 2048 are mentioned but not justified.
  • Because the Psumbook must reside in shared memory, very large codebooks (b = 16) are out of scope; the authors themselves list this as a limitation.

Questions

  1. Can you provide more benchmark results for the kernel for different problem sizes (M, N, K)?
  2. Line 175 says "each input vector interacts with a limited set of centroids". Can you provide any evidence for this assumption?
  3. If higher bits per weight were used, what would the speedup be like?

Limitations

Yes

Final Justification

My concerns were addressed by the additional benchmark data. So, I'd like to maintain my score.

Formatting Issues

No

Author Response

Weaknesses

It only provides benchmark results of the kernel under two different model sizes. Since this work is all about kernel optimization, more benchmark results under different problem sizes are supposed to be given.

Can you provide more benchmark results for the kernel for different problem sizes (M, N, K)?

  • Thank you for the suggestion, and we apologize for not including a more comprehensive set of benchmarks earlier. We have now conducted additional experiments across a wide range of (M, N, K) configurations. (Values in parentheses denote ±2σ error margins over 128 samples.)
  • Unlike cuBLAS, which utilizes Tensor Cores and maintains relatively stable latency across batch sizes, recently proposed quantized kernels, including ours, run on CUDA cores and tend to exhibit increased latency as the batch size (M) grows. Among these state-of-the-art kernels, CodeGEMM consistently demonstrates strong performance on large-scale matrix multiplications, where its compute and memory efficiency are best utilized.
(latency in µs)

| M | N | K | cuBLAS | AQLM_m1v8b16 | AQLM_m2v8 | CodeGEMM_m2v8 | CodeGEMM_m1v4 | QUIP#_e8p | QTIP |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 2048 | 2048 | 19.82 (±0.84) | 28.84 (±4.05) | 20.55 (±0.52) | 20.75 (±1.43) | 20.66 (±0.90) | 19.47 (±0.85) | 19.44 (±1.46) |
| 4 | 2048 | 2048 | 19.99 (±1.95) | 74.67 (±9.95) | 43.31 (±0.64) | 44.04 (±0.71) | 41.92 (±1.11) | 36.71 (±1.74) | 36.00 (±0.76) |
| 8 | 2048 | 2048 | 19.79 (±1.09) | 135.36 (±6.28) | 73.03 (±1.02) | 75.18 (±1.13) | 69.72 (±2.45) | 59.44 (±1.06) | 57.87 (±1.11) |
| 1 | 8192 | 2048 | 30.57 (±1.36) | 28.84 (±4.05) | 28.83 (±0.67) | 25.94 (±0.89) | 26.70 (±0.88) | 25.52 (±0.68) | 27.08 (±0.82) |
| 4 | 8192 | 2048 | 31.31 (±1.33) | 74.67 (±9.95) | 76.15 (±1.90) | 63.97 (±1.17) | 65.36 (±1.03) | 60.70 (±0.56) | 66.18 (±0.90) |
| 8 | 8192 | 2048 | 31.70 (±1.54) | 135.36 (±6.28) | 138.09 (±1.00) | 115.39 (±1.34) | 116.11 (±1.61) | 107.85 (±0.89) | 118.99 (±17.80) |
| 1 | 2048 | 8192 | 27.52 (±1.13) | 60.47 (±18.39) | 30.93 (±0.74) | 24.28 (±0.84) | 23.81 (±1.72) | 23.44 (±0.81) | 24.90 (±1.19) |
| 4 | 2048 | 8192 | 29.82 (±1.37) | 203.86 (±10.30) | 82.18 (±1.29) | 56.21 (±1.24) | 52.57 (±1.21) | 51.91 (±1.33) | 59.03 (±0.89) |
| 8 | 2048 | 8192 | 28.69 (±1.03) | 396.44 (±13.32) | 149.98 (±1.90) | 98.92 (±0.96) | 90.73 (±1.12) | 89.91 (±1.28) | 103.24 (±0.78) |
| 1 | 4096 | 4096 | 28.00 (±1.51) | 63.13 (±15.02) | 32.28 (±0.92) | 24.76 (±0.82) | 24.97 (±0.96) | 23.96 (±0.89) | 26.74 (±0.96) |
| 4 | 4096 | 4096 | 28.54 (±0.94) | 210.03 (±17.94) | 89.76 (±1.03) | 60.58 (±1.28) | 57.79 (±1.17) | 53.92 (±1.37) | 62.74 (±0.97) |
| 8 | 4096 | 4096 | 28.11 (±0.74) | 396.37 (±19.42) | 165.49 (±0.80) | 108.16 (±1.15) | 103.92 (±0.93) | 93.43 (±1.04) | 110.84 (±0.83) |
| 1 | 14336 | 4096 | 88.67 (±7.59) | 168.12 (±6.50) | 64.76 (±2.44) | 38.85 (±1.15) | 37.51 (±1.24) | 38.91 (±0.69) | 51.30 (±1.15) |
| 4 | 14336 | 4096 | 89.08 (±6.37) | 632.69 (±16.72) | 217.68 (±7.61) | 111.20 (±1.35) | 106.90 (±1.51) | 113.28 (±0.85) | 161.23 (±0.99) |
| 8 | 14336 | 4096 | 89.29 (±6.19) | 1252.55 (±17.89) | 422.89 (±10.00) | 211.37 (±2.31) | 196.68 (±2.60) | 212.55 (±0.86) | 308.37 (±0.97) |
| 1 | 4096 | 14336 | 86.31 (±6.76) | 169.31 (±4.36) | 58.70 (±0.63) | 36.15 (±1.46) | 33.92 (±1.12) | 37.27 (±1.01) | 43.85 (±0.70) |
| 4 | 4096 | 14336 | 86.51 (±6.66) | 635.74 (±9.95) | 193.41 (±0.65) | 103.15 (±2.33) | 92.61 (±2.05) | 106.63 (±0.73) | 133.36 (±1.32) |
| 8 | 4096 | 14336 | 86.49 (±7.32) | 1253.11 (±11.63) | 372.97 (±1.77) | 192.63 (±3.34) | 170.16 (±2.65) | 199.31 (±0.83) | 252.12 (±0.94) |
| 1 | 8192 | 8192 | 96.40 (±7.16) | 188.91 (±4.24) | 62.50 (±0.96) | 37.99 (±1.03) | 35.45 (±0.92) | 38.31 (±0.65) | 49.86 (±1.00) |
| 4 | 8192 | 8192 | 100.41 (±6.41) | 713.24 (±4.30) | 208.11 (±0.81) | 111.00 (±1.63) | 98.66 (±1.79) | 111.08 (±0.66) | 157.26 (±0.91) |
| 8 | 8192 | 8192 | 95.45 (±8.13) | 1408.68 (±5.48) | 402.29 (±2.42) | 207.73 (±2.37) | 184.25 (±2.18) | 208.29 (±0.91) | 299.24 (±1.55) |
| 1 | 28672 | 8192 | 297.74 (±11.93) | 625.53 (±10.30) | 181.54 (±1.83) | 86.48 (±1.13) | 76.71 (±0.75) | 101.98 (±2.07) | 134.03 (±1.10) |
| 4 | 28672 | 8192 | 303.10 (±7.94) | 2462.88 (±3.48) | 684.92 (±1.66) | 305.47 (±1.43) | 264.31 (±1.45) | 366.74 (±3.86) | 492.14 (±1.77) |
| 8 | 28672 | 8192 | 295.11 (±10.35) | 4913.52 (±6.10) | 1355.70 (±3.20) | 597.22 (±1.71) | 514.85 (±1.72) | 718.13 (±5.03) | 970.35 (±1.71) |
| 1 | 8192 | 28672 | 302.42 (±6.26) | 618.61 (±2.04) | 180.38 (±1.32) | 86.20 (±0.84) | 76.50 (±1.82) | 101.13 (±2.12) | 124.90 (±1.03) |
| 4 | 8192 | 28672 | 292.59 (±6.29) | 2437.82 (±2.30) | 679.24 (±2.01) | 305.14 (±1.48) | 263.70 (±2.13) | 361.95 (±3.70) | 455.84 (±1.19) |
| 8 | 8192 | 28672 | 293.69 (±7.43) | 4860.85 (±3.99) | 1344.49 (±1.70) | 596.63 (±2.25) | 515.12 (±3.34) | 710.94 (±4.49) | 897.41 (±1.38) |
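
For reference, below is a minimal sketch of how per-shape latencies with ±2σ margins like those above might be collected with PyTorch CUDA events. The 128-sample protocol comes from the rebuttal; the harness itself (warm-up count, timing a cuBLAS half-precision matmul as the baseline) is assumed for illustration.

```python
import torch

def time_kernel(fn, n_samples=128, n_warmup=16):
    """Return (mean_us, 2*std_us) over n_samples timed launches of fn()."""
    for _ in range(n_warmup):            # warm up caches, allocator, clocks
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    times = []
    for _ in range(n_samples):
        start.record()
        fn()
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end) * 1e3)   # ms -> us
    t = torch.tensor(times)
    return t.mean().item(), 2 * t.std().item()

# Example: the cuBLAS baseline for one (M, N, K) shape from the table above.
M, N, K = 1, 4096, 4096
x = torch.randn(M, K, dtype=torch.float16, device='cuda')
w = torch.randn(N, K, dtype=torch.float16, device='cuda')
mean_us, err_us = time_kernel(lambda: x @ w.T)
print(f'cuBLAS {M}x{N}x{K}: {mean_us:.2f} (+/-{err_us:.2f}) us')
```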

Implementation details such as choosing t_w = 32, t_h = 2048 are mentioned but not justified.

  • Thank you for pointing this out. Initially, we chose the tile dimensions t_w = 32 and t_h = 2048 based on heuristic tuning. However, prompted by your feedback, we conducted a more systematic analysis of these implementation parameters.
  • We found that t_h = 2048 consistently provided the best performance across a wide range of workloads, supporting our original choice. Interestingly, for t_w, smaller values such as t_w = 32 worked well for relatively small matrix sizes, whereas larger values like t_w = 64 yielded better performance on large-scale matrix multiplications. We believe this is because larger matrices benefit from coarser tiling, which reduces launch overhead and improves partial sum reduction at scale.
  • We will include a brief discussion of these observations in the implementation section of the final version.
(latency in µs)

| M | N | K | t_w | t_h | CodeGEMM_m2v8 | CodeGEMM_m1v4 |
|---|---|---|---|---|---|---|
| 1 | 4096 | 4096 | 32 | 2048 | 26.57 (±1.40) | 25.07 (±1.28) |
| 1 | 4096 | 4096 | 64 | 2048 | 26.76 (±0.93) | 25.40 (±1.08) |
| 1 | 4096 | 4096 | 128 | 2048 | 29.61 (±0.83) | 26.81 (±0.99) |
| 1 | 4096 | 4096 | 32 | 4096 | 28.95 (±0.88) | 27.60 (±0.86) |
| 1 | 4096 | 4096 | 64 | 4096 | 28.49 (±1.29) | 27.68 (±1.01) |
| 1 | 4096 | 4096 | 128 | 4096 | 37.58 (±1.02) | 32.87 (±0.86) |
| 1 | 8192 | 8192 | 32 | 2048 | 39.04 (±1.19) | 36.02 (±0.98) |
| 1 | 8192 | 8192 | 64 | 2048 | 37.23 (±0.99) | 35.33 (±0.83) |
| 1 | 8192 | 8192 | 128 | 2048 | 40.09 (±0.80) | 38.54 (±1.43) |
| 1 | 8192 | 8192 | 32 | 4096 | 37.78 (±0.81) | 36.17 (±1.18) |
| 1 | 8192 | 8192 | 64 | 4096 | 38.29 (±1.08) | 37.70 (±0.67) |
| 1 | 8192 | 8192 | 128 | 4096 | 45.40 (±0.70) | 42.75 (±1.20) |

Because the Psumbook must reside in shared memory, very large codebooks (b = 16) are out of scope; the authors themselves list this as a limitation.

  • We agree, and this is indeed a known limitation of our current implementation. As noted, very large codebooks such as b=16 can offer better accuracy but exceed the capacity of on-chip memory (e.g., GPU shared memory), making them impractical in our current design. However, as shown in Table 3, recent calibration techniques such as PV-Tuning have proven highly effective at mitigating accuracy degradation even under smaller codebooks like b=8, which previously showed substantial loss. We are optimistic that as more advanced tuning and quantization algorithms are developed, the remaining accuracy gap will narrow further, without requiring excessively large codebooks.

Questions

Line 175 says "each input vector interacts with a limited set of centroids". Can you provide any evidence for this assumption?

  • Thank you for pointing this out. The phrasing in Line 175 may have caused confusion. As illustrated in Figure 1 of the paper, each output activation is computed by combining the input activation with all 2^b centroids per codebook, meaning there is no restriction on which centroids are involved in the computation.
  • However, when the number of output channels N exceeds the number of centroid combinations 2^b, which is common in practice, the pigeonhole principle implies that multiple output channels will share the same centroid indices. This leads to reuse opportunities in the partial sum computation. CodeGEMM exploits this by precomputing the Psumbook, which enables efficient reuse of these dot products across multiple output channels.
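
A tiny numeric illustration of this reuse argument (illustrative numbers only, not taken from the paper):

```python
import numpy as np

b, N = 8, 4096                                    # bits per code, output channels
codes = np.random.default_rng(0).integers(0, 2**b, size=N)   # one code position
reuse = np.bincount(codes, minlength=2**b)        # how often each Psumbook entry is read
print(f"{2**b} Psumbook entries serve {N} channels; mean reuse = {reuse.mean():.1f}")
# -> mean reuse = N / 2^b = 16.0: each precomputed partial sum is read ~16 times
```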

If higher bits per weight were used, what would the speedup be like?

  • Thank you for the suggestion. We also conducted latency measurements for higher-bit quantization settings using the kernel configuration (g = 128, b = 8, t_w = 32, t_h = 2048). For reference, FP16 cuBLAS latency is also included.
  • As shown in the results, higher average bit precision generally leads to increased latency, especially when the number of codebooks (m) increases. This trend is more pronounced for larger matrix sizes, such as M = 8192. The measurements confirm that even at higher bit precisions, CodeGEMM maintains competitive performance compared to FP16 baselines, while offering a flexible trade-off between accuracy and efficiency depending on the (m, v) configuration.
| M | N | K | m | v | bit | Latency (µs) |
|---|---|---|---|---|---|---|
| 1 | 4096 | 4096 | N/A | N/A | 16.000 | 28.118 (±0.993) |
| 1 | 4096 | 4096 | 1 | 4 | 2.126 | 25.074 (±1.281) |
| 1 | 4096 | 4096 | 2 | 4 | 4.127 | 27.009 (±0.685) |
| 1 | 4096 | 4096 | 1 | 8 | 1.127 | 24.015 (±0.870) |
| 1 | 4096 | 4096 | 2 | 8 | 2.129 | 26.574 (±1.400) |
| 1 | 4096 | 4096 | 3 | 8 | 3.126 | 27.385 (±0.701) |
| 1 | 4096 | 4096 | 4 | 8 | 4.127 | 29.797 (±0.545) |
| 1 | 8192 | 8192 | N/A | N/A | 16.000 | 95.785 (±6.836) |
| 1 | 8192 | 8192 | 1 | 4 | 2.125 | 36.020 (±0.980) |
| 1 | 8192 | 8192 | 2 | 4 | 4.125 | 49.636 (±1.365) |
| 1 | 8192 | 8192 | 1 | 8 | 1.125 | 31.883 (±0.870) |
| 1 | 8192 | 8192 | 2 | 8 | 2.126 | 39.040 (±1.190) |
| 1 | 8192 | 8192 | 3 | 8 | 3.126 | 47.210 (±0.985) |
| 1 | 8192 | 8192 | 4 | 8 | 4.127 | 58.364 (±0.988) |
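
For readers reconstructing the "bit" column above: the values are consistent with counting m·b/v code bits per weight, a 16/g-bit per-group scale, and the amortized codebook storage. This accounting is inferred from the numbers in the table, not a formula stated in the rebuttal, so treat it as an approximation.

```python
def bits_per_weight(m, v, b, g=128, n=8192, k=8192,
                    centroid_bits=16, scale_bits=16):
    """Assumed accounting: code bits + per-group FP16 scale + codebook storage
    amortized over an n x k weight matrix (a reconstruction, may differ)."""
    return m * b / v + scale_bits / g + m * (2 ** b) * v * centroid_bits / (n * k)

print(round(bits_per_weight(m=1, v=4, b=8), 3))   # 2.125, cf. the table above
print(round(bits_per_weight(m=2, v=8, b=8), 3))   # 2.126
```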

We sincerely appreciate the insightful suggestions provided. If there are additional suggestions or concerns, we would be more than happy to incorporate them.

Comment

Thanks for providing more data. I am curious whether there are any restrictions on the M dimension in the benchmark. All the values of M are very small, making the problem memory-bound.

Comment
  • Thank you for your question and for pointing out the limited range of M values in our benchmarks.
  • In contrast to the prefill phase of LLM inference, where large batch sizes (that is, large M) are typical, the decode phase inherently operates on small batches. This is especially true in practical scenarios such as autoregressive generation and asynchronous serving, where tokens are produced one at a time and user requests arrive irregularly. As a result, optimizing for the small-M regime is crucial for reducing latency in real-world applications. Our benchmarks focus on this setting because the latency overhead from full weight dequantization is substantial at small batch sizes.
  • As shown below, the cost of dequantization is more than three times that of cuBLAS GEMM when the batch size is four or fewer, but becomes negligible as the batch size increases. This supports the commonly adopted strategy, as seen in vLLM's AQLM implementation (code), which applies custom quantized kernels for small batches (e.g., BS ≤ 6) and switches to dequantization + cuBLAS for larger batches.

(1) LLaMA 3.1–8B (aggregate latency of linear layers in a transformer decoder)

| BS | cuBLAS (ms) | Dequant (ms) | Overhead (Dequant/cuBLAS) |
|---|---|---|---|
| 1 | 0.32967 | 1.02742 | 3.12 |
| 4 | 0.33505 | 1.02742 | 3.07 |
| 1024 | 1.86090 | 1.02742 | 0.55 |
| 8192 | 13.03947 | 1.02742 | 0.08 |

(2) LLaMA 3.1–70B

| BS | cuBLAS (ms) | Dequant (ms) | Overhead (Dequant/cuBLAS) |
|---|---|---|---|
| 1 | 1.10958 | 3.63055 | 3.27 |
| 4 | 1.09951 | 3.63055 | 3.30 |
| 1024 | 6.43143 | 3.63055 | 0.56 |
| 8192 | 51.37813 | 3.63055 | 0.07 |
  • The performance advantage of our method in this regime stems from the characteristics of current GPU architectures. Large batches are efficiently processed by dense matrix engines such as Tensor Cores, whereas small batches are typically memory-bound and do not fully benefit from such hardware acceleration. Nonetheless, we believe that the CodeGEMM computation pattern can be extended to multi-batch inference when paired with a memory hierarchy and processing unit design tailored to its access characteristics. This presents a promising direction for future accelerator design.

  • We plan to incorporate this clarification into the camera-ready version of the manuscript, if accepted. We sincerely appreciate your time and thoughtful feedback throughout the review process. If there are any further questions or points you would like us to clarify, we would be more than happy to provide additional details.
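
The batch-size-dependent dispatch discussed above (lookup kernel for small M, dequantize + cuBLAS otherwise) can be summarized as in the sketch below. The threshold follows the BS ≤ 6 example cited from vLLM's AQLM integration, but the function names and signature are hypothetical placeholders, not vLLM's actual API.

```python
SMALL_BATCH_THRESHOLD = 6   # e.g., the BS <= 6 switch point mentioned above

def quantized_linear(x, packed_weight, codebooks, custom_kernel_fn, dequant_fn):
    """Hypothetical dispatcher: lookup-based kernel in the memory-bound decode
    regime, dequantize + dense GEMM (cuBLAS/Tensor Cores) for large batches."""
    if x.shape[0] <= SMALL_BATCH_THRESHOLD:
        return custom_kernel_fn(x, packed_weight, codebooks)
    w = dequant_fn(packed_weight, codebooks)   # materialize FP16 weights once
    return x @ w.T                             # dense GEMM path
```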

Comment

In the decoding phase, the batch size is much smaller compared to the prefill phase. But this number can still range from ten to hundreds. From the benchmark results, the codegemm kernel doesn't have performance advantages for batch sizes 4 or 8. How do you justify the results in such cases?

Comment
  • Thank you for the thoughtful question. First, we would like to clarify a small misunderstanding in the data. For a fair comparison, the cuBLAS latency should include the additional dequantization overhead. Under this accounting, CodeGEMM shows competitive performance even at batch sizes 8 and 16.

Llama-3-8B-Total (latency in ms)

| BS | cuBLAS | Dequant | cuBLAS+Dequant | AQLM_m1v8b16 | AQLM_m2v8 | QUIP#_e8p | QTIP | CodeGEMM_m2v8 | CodeGEMM_m1v4 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.33245 | 1.02742 | 1.35987 | 0.64551 | 0.25012 | 0.16263 | 0.18994 | 0.17218 | 0.15269 |
| 4 | 0.33343 | 1.02742 | 1.36085 | 2.37348 | 0.79425 | 0.44453 | 0.55039 | 0.49058 | 0.40506 |
| 8 | 0.33612 | 1.02742 | 1.36354 | 4.69532 | 1.5153 | 0.81796 | 1.03369 | 0.90903 | 0.74442 |
| 16 | 0.33953 | 1.02742 | 1.36695 | 9.26682 | 2.95881 | 1.55398 | 1.99055 | 1.74804 | 1.41579 |
  • As you correctly point out, in data center deployments, techniques such as continuous batching are often used to aggregate multiple requests and increase the effective batch size in the decoding phase, sometimes reaching tens or even hundreds. However, there are also important scenarios where throughput gains from batching are not possible. A representative example is on-device inference, where the decoding phase batch size is typically small (e.g., 1–4) [1].

  • Moreover, the batch-size sensitivity of quantized matmul on GPUs is a well-known challenge shared across many state-of-the-art methods, such as QuIP# [2] and QTIP [3]. This is not due to the ineffectiveness of these algorithms themselves, but rather due to architectural constraints of GPUs (e.g., occupancy limits, limited shared memory bandwidth). We hope that future hardware designs, such as custom ASICs, can alleviate this limitation.

  • Your question helped us identify a point that could be explained more clearly. If accepted, we will incorporate this clarification into the paper for greater transparency. We welcome any further suggestions you may have.

[1] Spector, Benjamin, and Chris Re. "Accelerating llm inference with staged speculative decoding." arXiv preprint arXiv:2308.04623, 2023.

[2] Tseng, Albert, et al. "QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks." International Conference on Machine Learning, 2024.

[3] Tseng, Albert, et al. "Qtip: Quantization with trellises and incoherence processing." Advances in Neural Information Processing Systems, 2024.

Comment

Thanks for the clarification.

Comment

We sincerely appreciate your active participation in the discussion. The points you raised prompted us to provide additional explanations and analyses, which we believe have improved the overall clarity and completeness of our submission.

Review
Rating: 3

The paper presents CodeGEMM, a new GPU kernel for weight-only, codebook-based quantization in large language models (LLMs). Instead of performing on-the-fly dequantization by loading full centroid vectors into cache, CodeGEMM precomputes all possible inner products between centroids and input subvectors, storing them in a compact Psumbook. During GEMM, it uses each code to index directly into this Psumbook, eliminating repeated dequantization and reducing both computational and space complexity. The kernel is parameterized by vector length v, number of codebooks m, bits per code b, and tile width t_w, enabling exploration of the latency-memory-accuracy trade-off.

Strengths and Weaknesses

Strengths:

  • Precomputing and caching inner products removes the need to fetch full centroids for each code, directly addressing the shared-memory capacity bottleneck in existing kernels
  • Kernel-level benchmarks demonstrate up to 2.18× improvement over FP16 cuBLAS and 1.64× over AQLM on A100 GPUs, and throughput evaluations confirm real-world speedups in the HuggingFace Llama pipeline

Weaknesses:

  • The reliance on fitting the Psumbook in shared memory forces b ≤ 8 and precludes exploration of larger codebook widths (e.g., b = 16), limiting accuracy potential in extreme low-bit regimes.
  • The focus is on latency and perplexity; energy consumption and DRAM bandwidth savings which are critical in production are unreported.
  • It remains unclear how much time is spent constructing the Psumbook versus retrieving entries during GEMM. A breakdown would clarify whether build overheads offset retrieval gains at different batch sizes or tile shapes.
  • While NeurIPS checklist items are cited, the paper omits error bars (claimed negligible variance) and details on code release versioning, which may hinder exact reproduction.

Questions

  • Can you provide detailed microbenchmarks separating the time spent building versus reading the Psumbook, especially for varying t_w and batch sizes?
  • How does CodeGEMM fare when coupled with codebooks generated by TurboQuant (https://arxiv.org/abs/2504.19874)?
  • Have you measured DRAM traffic reductions, and can you provide quantitative comparisons to dequantization-based kernels?
  • Do you envision streamed or hierarchical Psumbook designs that allow b > 8 without cache overflow, and what challenges do these impose?

Limitations

See Weaknesses

Final Justification

I thank the authors for their response and have read it carefully. While I acknowledge that some concerns may have been addressed to the satisfaction of other reviewers, my own vote is a borderline rejection.

Formatting Issues

  • Eq. 1 Caption/Table 1: The notation “g = −1 indicates row-wise group normalization” may confuse readers; consider explicit table footnote formatting.
  • In Table 3’s caption: “including MMLU (5-shot) and and 0-shot tasks such as…” — the repeated “and” should be fixed

Author Response

Weaknesses

The reliance on fitting the Psumbook in shared memory forces b ≤ 8 and precludes exploration of larger codebook widths (e.g., b = 16), limiting accuracy potential in extreme low-bit regimes.

  • We agree, and this is indeed a clear limitation of our current work. We believe that model compression serves different goals depending on the deployment scenario. If the primary objective is to minimize memory footprint with minimal accuracy degradation, then using larger codebooks such as b = 16 is certainly effective.
  • However, when practical speedup is also a goal, algorithm-hardware co-design becomes essential. To support this, we deliberately introduced constraints during the design of our compression algorithm to ensure compatibility with on-chip memory and efficient kernel execution.
  • As shown in Table 3, recent calibration methods such as PV-Tuning offer promising ways to recover much of the accuracy lost under aggressive quantization. Notably, PV-Tuning was particularly effective in the b = 8 regime, where degradation was previously much higher compared to b = 16. We expect that as more advanced tuning and compression algorithms emerge, the remaining accuracy gap will continue to shrink even under shared memory constraints.

While NeurIPS checklist items are cited, the paper omits error bars (claimed negligible variance) and details on code release versioning, which may hinder exact reproduction.

  • We apologize for the omission. To support reproducibility, we provide the exact version information of the software environment used in our experiments: python==3.12.3, CUDA==12.6, torch==2.6.0, transformers==4.46.3, lm-eval==0.4.5
  • We also fully agree that including error bars improves transparency. For this rebuttal, we have added 2-sigma error margins over 128 measurement samples to all reported numbers in our responses to the reviewers. While we are unable to modify the manuscript at this stage, we will include error bars in all relevant results in the camera-ready version.

Questions

The focus is on latency and perplexity; energy consumption and DRAM bandwidth savings which are critical in production are unreported.

Have you measured DRAM traffic reductions, and can you provide quantitative comparisons to dequantization-based kernels?

  • Thank you for the suggestion. We agree on the importance of hardware utilization and conducted a set of experiments to compare the efficiency of different kernels.
  • To evaluate DRAM traffic and power efficiency, we used nvidia-smi queries [1] sampled every 100 milliseconds for 10 seconds, then averaged the results. The results show that CodeGEMM achieves significantly better compute efficiency (FLOPS per watt) compared to dequantization-based kernels.
  • The table below summarizes the results for various methods on a matrix multiplication workload with M = 1, N = 28672, and K = 8192. All values in parentheses indicate 2-sigma error margins over 128 measurement samples.
| Method | Latency (µs) | TFLOPS | Power (W) | GFLOPS/W | GPU Util (%) | Mem Util (%) |
|---|---|---|---|---|---|---|
| cuBLAS | 297.74 (±11.93) | 1.58 | 318.55 (±6.26) | 4.95 | 96.87 (±0.73) | 96.94 (±0.48) |
| aqlm_1x16 | 625.53 (±10.30) | 0.75 | 126.54 (±0.49) | 5.93 | 99.00 (±0.00) | 6.00 (±0.00) |
| aqlm_2x8 | 181.54 (±1.83) | 2.59 | 254.20 (±2.47) | 10.18 | 92.84 (±1.58) | 19.96 (±0.39) |
| codegemm_m2v8 | 86.48 (±1.13) | 5.43 | 304.69 (±6.11) | 17.83 | 85.32 (±1.58) | 43.76 (±0.95) |
| codegemm_m1v4 | 76.71 (±0.75) | 6.12 | 316.38 (±8.37) | 19.36 | 84.47 (±2.28) | 49.80 (±1.21) |
  • These results highlight that CodeGEMM not only achieves lower latency but also delivers higher energy efficiency and better memory subsystem utilization, suggesting reduced and more structured DRAM access compared to dequantization-based approaches. We will include this analysis in the final version.

[1] https://docs.nvidia.com/deploy/nvidia-smi/index.html

Can you provide detailed microbenchmarks separating the time spent building versus reading the Psumbook, especially for varying t_w and batch sizes?

  • Thank you for the insightful suggestion. While it is challenging to measure precise cycle-level timing for each kernel phase due to concurrent execution across multiple SMs on the GPU, we performed controlled measurements by isolating execution to a single SM in order to estimate the relative cost of building and reading the Psumbook.
  • The table below reports the percentage of execution cycles spent on each phase for various tile widths t_w and batch sizes M. For a fixed t_w, increasing M shows that the ratio between building and reading remains largely stable, indicating that the Psumbook construction cost is effectively amortized across the batch. For a fixed M, increasing t_w tends to increase the proportion of time spent on building in small matrices, but decrease it in large matrices.
| M | N | K | t_w | m2v8 (Building / Reading, %) | m1v4 (Building / Reading, %) |
|---|---|---|---|---|---|
| 1 | 4096 | 4096 | 32 | 30.5 / 69.5 | 20.3 / 79.7 |
| 1 | 4096 | 4096 | 64 | 33.0 / 67.0 | 28.5 / 71.5 |
| 1 | 4096 | 4096 | 128 | 31.2 / 68.8 | 30.7 / 69.3 |
| 1 | 8192 | 8192 | 32 | 45.4 / 54.6 | 41.2 / 58.8 |
| 1 | 8192 | 8192 | 64 | 45.6 / 54.4 | 39.7 / 60.3 |
| 1 | 8192 | 8192 | 128 | 28.3 / 71.7 | 29.5 / 70.5 |
| 1 | 4096 | 4096 | 32 | 30.5 / 69.5 | 20.3 / 79.7 |
| 4 | 4096 | 4096 | 32 | 30.4 / 69.6 | 20.7 / 79.3 |
| 8 | 4096 | 4096 | 32 | 30.7 / 69.3 | 20.4 / 79.6 |
| 1 | 8192 | 8192 | 32 | 45.4 / 54.6 | 41.2 / 58.8 |
| 4 | 8192 | 8192 | 32 | 45.7 / 54.3 | 41.3 / 58.7 |
| 8 | 8192 | 8192 | 32 | 46.1 / 53.9 | 41.6 / 58.4 |

How does CodeGEMM fare when coupled with codebooks generated by TurboQuant (https://arxiv.org/abs/2504.19874)?

  • Thank you for the valuable suggestions. TurboQuant is an online vector quantization method primarily designed for KV‑cache compression. This aligns well with CodeGEMM, which dynamically operates on precomputed codebooks and codes, replacing on-the-fly dequantization with psum lookups. CodeGEMM is agnostic to how the codebooks are trained; it only requires the centroids, codes, and the associated hyperparameters (m, b, v) to construct a Psumbook for the current input tile. As such, codebooks generated by TurboQuant can be directly used with CodeGEMM, provided that the per-codebook bitwidth and vectorization choices fit within on-chip memory constraints.
  • While TurboQuant is designed for online KV‑cache quantization, CodeGEMM naturally extends to this use case. In attention matmuls with quantized activations, the Psumbook is constructed between the current query tile and the key or value centroids, followed by a lookup using the stored codes. This maintains the same memory and compute characteristics, although the Psumbook must be rebuilt for each new query tile.

Do you envision streamed or hierarchical Psumbook designs that allow b > 8 without cache overflow, and what challenges do these impose?

  • We agree that fitting the entire Psumbook in shared memory limits the maximum codebook size b. We are exploring two complementary extensions: streamed or hierarchical Psumbook designs. First, a streamed Psumbook that partitions the table along the codebook dimension and prefetches chunks from global memory with double buffering, overlapping data transfer and lookup. This avoids cache overflow but introduces additional bandwidth demand and synchronization overhead. Second, a hierarchical Psumbook that decomposes a wide codebook into coarse and residual sub-codebooks. Partial sums are built separately for each level and then aggregated, which reduces the 2^b memory blow-up at the cost of extra lookups and combination logic. The main challenges include hiding global memory latency, preventing bank conflicts under larger tiles, and managing register and shared memory usage to sustain occupancy. Additionally, if the computational overhead of Psumbook building grows beyond a certain point, the net performance gain of CodeGEMM may diminish.
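
A toy NumPy sketch of the hierarchical (coarse + residual) direction described above, for a single weight sub-vector: two 2^b-entry Psumbooks replace a single 2^(2b)-entry table, at the cost of one extra lookup and add per code. This is an illustration of the idea only, not an implemented kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
v, b = 4, 8                                   # two b=8 levels instead of one b=16 table
coarse = rng.standard_normal((2**b, v)).astype(np.float32)
resid = 0.1 * rng.standard_normal((2**b, v)).astype(np.float32)

w = rng.standard_normal(v).astype(np.float32)                 # one weight sub-vector
i1 = np.argmin(((w - coarse) ** 2).sum(axis=1))               # coarse code
i2 = np.argmin(((w - coarse[i1] - resid) ** 2).sum(axis=1))   # residual code
w_hat = coarse[i1] + resid[i2]                                # reconstructed sub-vector

x = rng.standard_normal(v).astype(np.float32)                 # matching input sub-vector
P1, P2 = coarse @ x, resid @ x       # two 2^b-entry Psumbooks, not one 2^(2b) table
assert np.isclose(P1[i1] + P2[i2], np.dot(w_hat, x), atol=1e-5)
```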

We sincerely appreciate the insightful suggestions provided. Throughout the rebuttal period, we have made our best effort to address the reviewers' feedback and strengthen the paper accordingly. That said, we recognize that there may still be areas for improvement. If there are additional suggestions or concerns, we would be more than happy to incorporate them. Thank you again for your thoughtful and constructive comments.

Comment

Dear Reviewers,

Thank you again for your thoughtful time and feedback on our submission.

We have submitted detailed responses to your comments and would greatly appreciate it if you could take a moment to review them when convenient. If there are any questions or further points needing clarification, we are more than happy to engage in discussion.

We look forward to your continued guidance.

Best regards,

The Authors

Review
Rating: 5

The paper proposes a codebook-based quantization method to represent weights using 2-bit configurations. Unlike previous approaches that stored the entire codebook in memory for dequantization, the proposed method stores the product-sum results, eliminating the need for dequantization. Hyperparameters for codebook selection such as group sizes, vector length, and number of codebooks are explored to find the best trade-off between performance accuracy and hardware complexity. The new approach achieves a 2.29× speed improvement compared to previous codebook-based quantization methods such as AQLM (2×8).

Strengths and Weaknesses

Strengths:

1- The new approach is explored across various hardware constraints such as memory footprint, latency, and throughput, and it demonstrates the benefits of product-sum codebook-based quantization compared to previous approaches.

2- The paper is well-written, and the experiments are elaborated in detail.

Weaknesses:

1- The accuracy degradation of the codebook-based approach is high compared to the FP16 approach in Table 3 (approximately 7%). The author is advised to elaborate on this accuracy degradation.

2- It is suggested to include accuracy vs. latency and accuracy vs. throughput plots corresponding to Table 3 and Table 4.

3- It is recommended that the comparison with these previous approaches [1, 2] be demonstrated in the paper.

[1] Liu, Yifei, et al. "Vptq: Extreme low-bit vector post-training quantization for large language models." arXiv preprint arXiv:2409.17066 (2024).

[2] Liu, Zechun, et al. "ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantization." arXiv preprint arXiv:2502.02631 (2025).

Questions

Is there any limitation on the size of the product-sum values stored in memory for large language models such as LLaMA 3.1 with 405B parameters?

Limitations

Yes

Final Justification

The authors provide new experiments to compare with the previous work and also addressed my question regarding the limitations of the new approach for large language models such as LLaMA 3. Therefore, I raised my score.

Formatting Issues

There are no concerns regarding the paper format.

Author Response

Weaknesses

1- The accuracy degradation of the codebook-based approach is high compared to the FP16 approach in Table 3 (approximately 7%). The author is advised to elaborate on this accuracy degradation.

  • Thank you for the insightful feedback. We agree with your observation. In the context of LLM post-training quantization, 2-bit quantization still leads to notable accuracy degradation, especially under extreme compression. While advanced techniques such as PV-Tuning help mitigate this issue to some extent, a performance gap remains. We believe that with sufficient resources, quantization-aware training could be a promising direction to further narrow this gap.
  • Our primary goal with CodeGEMM is to serve as a practical kernel-level bridge between model compression and real-world speedup. We are optimistic that as better quantization algorithms continue to emerge, CodeGEMM will enable them to realize their full latency-throughput benefits on modern hardware.

2- It is suggested to include accuracy vs. latency and accuracy vs. throughput plots corresponding to Table 3 and Table 4.

  • We agree that figures are better suited than tables for illustrating trade-offs. Although we are unable to modify the manuscript during the rebuttal phase, we will include these plots in the camera-ready version, if accepted. We believe the visual representation will clearly highlight the accuracy-efficiency Pareto front achieved by CodeGEMM under different quantization settings.

3- It is recommended that the comparison with these previous approaches [1, 2] be demonstrated in the paper.

  • Thank you for the helpful suggestion. While the reviewer recommended evaluations on both ParetoQ and VPTQ, we were only able to conduct experiments with VPTQ due to the lack of publicly available LLaMA 3 checkpoints for ParetoQ. VPTQ, which is also a vector-quantization-based method, shows comparable accuracy. However, similar to AQLM, it relies on dequantization-based kernels and therefore exhibits lower throughput than the FP16 baseline. We appreciate the suggestion, which helped enrich our study, and we plan to include this result in the camera-ready version if accepted.
| Method | q̄ (avg. bits) | tok/s | MMLU | WG | HS | ARC-E | ARC-C | Avg. |
|---|---|---|---|---|---|---|---|---|
| FP16 | 16.000 | 34.3 | 68.39 | 73.95 | 79.2 | 79.63 | 55.03 | 71.26 |
| FlexRound-q2-g128 [12] | 2.125 | 34.4 | 24.27 | 55.16 | 43.78 | 24.75 | 24.57 | 41.65 |
| AQLM-2x8 [5] | 2.005 | 20.6 | 42.29 | 58.25 | 61.40 | 46.25 | 30.89 | 47.82 |
| +PV-Tuning | 2.005 | 20.6 | 55.13 | 69.14 | 72.43 | 71.25 | 45.73 | 62.74 |
| AQLM-1x16 [5] | 2.213 | 20.8 | 58.74 | 68.75 | 70.21 | 73.99 | 46.16 | 63.57 |
| +PV-Tuning | 2.213 | 20.8 | 60.72 | 70.24 | 74.33 | 74.92 | 48.89 | 65.82 |
| CodeGEMM-m1-v4-g128 | 2.126 | 36.8 | 45.16 | 58.96 | 63.07 | 63.97 | 38.48 | 53.93 |
| +PV-Tuning | 2.126 | 36.8 | 57.42 | 69.06 | 73.85 | 73.15 | 46.33 | 63.96 |
| CodeGEMM-m2-v8-g128 | 2.127 | 36.2 | 41.53 | 55.72 | 62.08 | 65.07 | 39.51 | 52.78 |
| +PV-Tuning | 2.127 | 36.2 | 56.64 | 69.06 | 72.58 | 73.06 | 46.76 | 63.62 |
| VPTQ | 2.300 | 31.7 | 44.87 | 69.06 | 70.83 | 64.94 | 40.19 | 57.98 |

Questions

Is there any limitation on the size of the product-sum values stored in memory for large language models such as LLaMA 3.1 with 405B parameters?

  • Our approach is independent of the total model size and scales well to extremely large models such as LLaMA 3.1–405B. The only memory constraint lies in the size of on-chip memory used to store the Psumbook. The size of the Psumbook depends only on the codebook parameters (e.g., number of codebooks and vector length), not on the number of model parameters. Importantly, the performance of CodeGEMM improves with larger problem sizes. As shown in the table below, CodeGEMM achieves greater speedup over cuBLAS as the matrix dimensions grow. Each matrix shape corresponds to configurations used in LLaMA 3 models of varying sizes, including 1B, 8B, 70B, and 405B. (Values in parentheses indicate ±2σ error over 128 measurement samples.)
| M | N | K | cuBLAS (µs) | CodeGEMM_m2v8 (µs) | Speedup | CodeGEMM_m1v4 (µs) | Speedup |
|---|---|---|---|---|---|---|---|
| 1 | 2048 | 2048 | 20.1 (±0.9) | 20.8 (±1.4) | ×1.0 | 20.7 (±0.9) | ×1.0 |
| 1 | 8192 | 2048 | 31.2 (±1.3) | 25.9 (±0.9) | ×1.2 | 26.7 (±0.9) | ×1.2 |
| 1 | 2048 | 8192 | 27.6 (±1.2) | 24.3 (±0.8) | ×1.1 | 23.8 (±1.7) | ×1.2 |
| 1 | 4096 | 4096 | 27.8 (±1.5) | 24.8 (±0.8) | ×1.1 | 25.0 (±1.0) | ×1.1 |
| 1 | 14336 | 4096 | 88.8 (±6.7) | 38.8 (±1.1) | ×2.3 | 37.5 (±1.2) | ×2.4 |
| 1 | 4096 | 14336 | 86.2 (±6.9) | 36.2 (±1.5) | ×2.4 | 33.9 (±1.1) | ×2.5 |
| 1 | 8192 | 8192 | 96.4 (±7.2) | 38.0 (±1.0) | ×2.5 | 35.4 (±1.1) | ×2.7 |
| 1 | 28672 | 8192 | 297.9 (±11.5) | 86.5 (±1.1) | ×3.4 | 76.7 (±0.8) | ×3.9 |
| 1 | 8192 | 28672 | 302.0 (±6.3) | 86.2 (±0.8) | ×3.5 | 76.5 (±1.8) | ×3.9 |
| 1 | 16384 | 16384 | 340.6 (±8.4) | 97.2 (±1.3) | ×3.5 | 97.0 (±2.4) | ×3.5 |
| 1 | 53248 | 16384 | 1023.5 (±10.3) | 263.9 (±1.9) | ×3.9 | 263.7 (±2.4) | ×3.9 |
| 1 | 16384 | 53248 | 1060.0 (±10.8) | 263.9 (±1.3) | ×4.0 | 263.8 (±1.2) | ×4.0 |

We sincerely appreciate the insightful suggestions provided. Throughout the rebuttal period, we have made our best effort to address the reviewers' feedback and strengthen the paper accordingly. That said, we recognize that there may still be areas for improvement. If there are additional suggestions or concerns, we would be more than happy to incorporate them. Thank you again for your thoughtful and constructive comments.

Comment

Dear Reviewers,

Thank you again for your thoughtful time and feedback on our submission.

We have submitted detailed responses to your comments and would greatly appreciate it if you could take a moment to review them when convenient. If there are any questions or further points needing clarification, we are more than happy to engage in discussion.

We look forward to your continued guidance.

Best regards,

The Authors

Final Decision

The work introduces a matrix multiplication kernel for codebook-based quantized models. Unlike traditional methods that incur latency by repeatedly loading the codebook for dequantization, the proposed approach CodeGEMM eliminates the dequantization step by precomputing inner products into a Psumbook. It supports flexible hyperparameters to balance latency, memory, and accuracy. Experiments show that CodeGEMM achieves a speed-up over state-of-the-art codebook-based quantization methods while maintaining comparable accuracy in 2-bit settings.

This is an interesting algorithm-hardware co-design work. We decided to accept this paper and also suggest the following changes:

  • Explicitly point out the limitations of this approach in the final version, e.g., that the codebook size cannot be too large, and why CodeGEMM underperforms Tensor Core–based cuBLAS in large-batch operations.
  • Narrow down the application area of the proposed approach and adjust the claim in the abstract accordingly.
  • Include the additional experimental comparison in the final version.