PaperHub
6.8 / 10
Poster · 4 reviewers
Ratings: 4, 4, 4, 5 (min 4, max 5, std 0.4)
Confidence: 3.8
Novelty: 3.0 · Quality: 2.8 · Clarity: 3.5 · Significance: 3.0
NeurIPS 2025

Binary Quadratic Quantization: Beyond First-Order Quantization for Real-Valued Matrix Compression

OpenReview · PDF
Submitted: 2025-05-10 · Updated: 2025-10-29

Abstract

Keywords
Quantization, Compression, Deep Neural Network

Reviews and Discussion

Review (Rating: 4)

This paper presents Binary Quadratic Quantization (BQQ), a novel matrix quantization framework that compresses real-valued matrices using quadratic combinations of binary matrices. BQQ extends the established Binary Coding Quantization (BCQ) approach—which relies solely on linear combinations—by introducing binary matrix products, enabling more expressive nonlinear approximations while preserving a highly compact representation. The authors demonstrate compelling accuracy–compression trade-offs, particularly on Vision Transformer models.

Strengths and Weaknesses

Strengths A. The paper builds upon the Binary Coding Quantization framework by introducing a method that compresses model weights using quadratic combinations of binary matrices, extending beyond traditional linear formulations. B. The mathematical formulation is clear, concise, and well-explained, making the core ideas easy to follow. C. The experimental results are compelling, demonstrating that the proposed quantization approach maintains competitive accuracy compared to existing quantization techniques, while also reporting results for the vanilla (unquantized) model.

Weaknesses A. The experimental evaluation is relatively limited, focusing only on a single task and a single model family (Vision Transformers). To strengthen the empirical claims and support the generality of the method, it would be valuable to include additional results—ideally in the appendix—across a broader range of tasks or architectures. B. The paper does not address the effect of the intermediate size $l$, a new hyperparameter, on accuracy, compression, and computational complexity.

Questions

  1. Missing latency analysis: The paper does not include a discussion on how the proposed quantization impacts inference latency, which is a critical aspect of practical deployment. The quantization scheme likely affects latency in at least three ways: (a) Weight loading time — using binary representations significantly reduces the model size, alleviating the weight-loading bottleneck (see, e.g., “Memory Bandwidth and Compute Bottlenecks in LLM Inference”). (b) Low-bit computation efficiency — if supported by hardware, binary or low-bit operations can execute substantially faster. (c) Theoretical computational complexity — it would be useful to discuss the effect on the algorithmic complexity (in Big-O notation), especially compared to full-precision operations. (This also relates to comment 2.)

  2. Hyperparameter selection for intermediate dimension $l$: The dimension $l$, which governs the size of intermediate binary matrices, has a significant effect on both model accuracy and computational cost. While the paper adopts the heuristic choice $l = \mathrm{round}(mn/(m+n))$ for fair comparison, it is unclear whether this is optimal. A more detailed explanation or sensitivity analysis around this hyperparameter would strengthen the paper. It seems the authors are familiar with this issue and have noted it in the discussion. I highly advise adding a short section to address this.

  3. Relation to LoRA and related work: The proposed method shares conceptual similarities with LoRA (Low-Rank Adaptation), which also introduces low-rank structure to improve representation, enable efficient fine-tuning, and reduce memory and runtime costs. It would be beneficial to explicitly discuss this connection in the main text and cite recent works that combine LoRA with quantization, such as QA-LoRA or LQ-LoRA, to position the current approach within the broader landscape of model compression and adaptation techniques.

Limitations

yes

Final Justification

I thank the authors for their detailed reply and for addressing some of my concerns.

My score stays at 4.

The rebuttal addressed some of my concerns -- namely, the additional evaluation and other issues related to clarity (for which the paper already got the highest score). Nevertheless, my main concern regarding latency analysis is not addressed. My score reflects that this is a good paper but not an outstanding one in its originality or significance.

Finally, looking at the concerns and scores provided by my peer reviewers strengthens my decision (my score is an average of the others).

Formatting Concerns

The paper is well formatted and edited

Author Response

Response to Reviewer Comments

We thank the reviewer for their thoughtful and constructive feedback. We appreciate the recognition of the clarity of our formulation, the novelty of our proposed quantization framework, and the strength of our experimental results. Below, we respond to each of the concerns raised.


(1) On the Scope of Experimental Evaluation

We agree that broader evaluation would strengthen our claims. We focused on ViTs due to their known sensitivity to quantization, where even small weight perturbations can cause large accuracy drops. Demonstrating strong performance in such settings highlights our method’s advantage.

To further demonstrate the broader applicability of BQQ, we evaluated our method on a compact large language model (LLM) optimized for edge deployment: Qwen2.5-0.5B. This model has recently gained attention for its strong performance despite its small size. We compared BQQ against a standard quantization baseline, GPTQ, under equal bit-width settings. As shown in the table below, BQQ achieves comparable results in both perplexity and downstream task accuracy, highlighting its effectiveness even in highly resource-constrained environments. Notably, in the extremely low-bit regime (2-bit quantization), BQQ demonstrates a remarkable advantage—achieving a 5.7% average improvement in downstream accuracy and reducing perplexity by a factor of 200 compared to GPTQ. This underscores BQQ’s robustness and superior performance in challenging low-bit scenarios. If accepted, we plan to include a broader range of experimental results to further showcase the versatility and effectiveness of BQQ across diverse model architectures and tasks.

| Method | #Bits | Wikitext2 PPL | PIQA | ARC-E | ARC-C | HS | WGe | BoolQ | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| GPTQ | 3 | 26.2 | 62.2 | 48.9 | 25.4 | 35.3 | 55.9 | 51.7 | 46.6 |
| BQQ | 3 | 20.5 | 62.2 | 51.4 | 23.1 | 33.6 | 52.1 | 47.1 | 44.9 |
| GPTQ | 2 | 12523.6 | 52.7 | 25.3 | 21.4 | 25.6 | 51.9 | 43.6 | 36.8 |
| BQQ | 2 | 51.0 | 57.5 | 39.8 | 18.0 | 28.3 | 52.6 | 58.8 | 42.5 |
| Baseline | 16 | 13.0 | 70.2 | 64.6 | 29.5 | 40.6 | 56.4 | 62.4 | 54.0 |

Additionally, we emphasize that the current work proposes BQQ as a general binary quantization framework, rather than a dedicated quantization technique for neural networks. In our implementation, BQQ quantizes weights by minimizing reconstruction error in the weight space, without relying on output-based error signals that are commonly used in many neural network quantization methods. This implies that BQQ does not yet exploit task-specific loss functions or activation statistics. We therefore believe that adapting BQQ to incorporate such feedback—especially minimizing the downstream output error—could lead to significant further improvements in model performance, and this represents a highly promising avenue for future research.
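To make the distinction concrete, below is a minimal NumPy sketch (our own illustration, not the paper's code; matrix sizes and data are arbitrary) contrasting the weight-space reconstruction error that BQQ currently minimizes with an output-aware error that would additionally use activation statistics:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 128))                 # original real-valued weights
W_hat = W + 0.05 * rng.normal(size=W.shape)    # stand-in for a quantized approximation
X = rng.normal(size=(128, 256))                # hypothetical calibration activations

# Weight-space reconstruction error: the objective BQQ currently minimizes
weight_mse = np.mean((W - W_hat) ** 2)

# Output-aware error: what activation-based PTQ methods minimize instead
output_mse = np.mean((W @ X - W_hat @ X) ** 2)

print(f"weight-space MSE: {weight_mse:.4e}   output-space MSE: {output_mse:.4e}")
```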


(2) On the Effect of the Intermediate Dimension

Thank you for suggesting deeper analysis of the intermediate dimension $l$, which affects both accuracy and computation. In our main experiments, we used the heuristic $l = \mathrm{round}\left(\frac{mn}{m + n}\right)$ to match the original matrix size and adjusted overall capacity via the number of binary stacks $p$.

To explore this further, we conducted additional experiments varying $l$ and $p$ while keeping the total bit budget fixed. Results on the same benchmark matrices from Fig. 3 show that the optimal $l$ depends on data characteristics. For instance, datasets with fast-decaying singular values (e.g., "Distance", "ImageNet") benefit from smaller $l$ in low-capacity regimes. In others, our default setting remained robust.

| #stacks ($p$) | $l$-scale | Size [KB] (Random) | MSE (Random) | Size [KB] (DeiT) | MSE (DeiT) | Size [KB] (Distance) | MSE (Distance) | Size [KB] (SIFT) | MSE (SIFT) | Size [KB] (ImageNet) | MSE (ImageNet) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 2.1 | 324.3 | 18.4 | 298.1 | 1.3 | 14.6 | 2.1 | 97.8 | 6.3 | 42.7 |
| 2 | 0.5 | 2.1 | 329.9 | 18.5 | 309.4 | 1.3 | 9.2 | 2.1 | 108.1 | 6.3 | 29.7 |
| 4 | 0.25 | 2.1 | 337.1 | 18.5 | 317.3 | 1.3 | 12.2 | 2.1 | 110.8 | 6.3 | 49.6 |

| #stacks ($p$) | $l$-scale | Size [KB] (Random) | MSE (Random) | Size [KB] (DeiT) | MSE (DeiT) | Size [KB] (Distance) | MSE (Distance) | Size [KB] (SIFT) | MSE (SIFT) | Size [KB] (ImageNet) | MSE (ImageNet) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 4.1 | 106.4 | 36.9 | 101.9 | 2.5 | 7.7 | 4.1 | 46.3 | 12.6 | 20.5 |
| 2 | 1 | 4.1 | 105.3 | 36.9 | 95.8 | 2.5 | 2.3 | 4.1 | 30.0 | 12.6 | 10.9 |
| 4 | 0.5 | 4.1 | 108.4 | 36.9 | 100.8 | 2.6 | 2.5 | 4.1 | 34.4 | 12.6 | 8.9 |
| 8 | 0.25 | 4.2 | 114.2 | 37.0 | 105.6 | 2.5 | 3.5 | 4.2 | 35.9 | 12.6 | 14.1 |

These results validate our heuristic while suggesting that tuning $l$ to specific data or architectures could yield further gains.


(3) On Latency and Computational Complexity

We appreciate the reviewer’s thoughtful comment regarding the lack of latency analysis. While our primary contribution lies in proposing the Binary Quadratic Quantization (BQQ) framework and demonstrating its advantages in terms of memory compression and approximation quality, we acknowledge that inference latency is a critical factor for practical deployment. Below, we address the three aspects the reviewer mentioned:

(a) Weight Loading Time: We fully agree that binary representations greatly reduce model size, which directly alleviates I/O bottlenecks during weight loading. This is particularly relevant in scenarios where memory bandwidth is the limiting factor. In such cases, BQQ's compressed binary format allows faster transfer from storage or DRAM to compute units, offering potential latency gains in loading time. Since I/O transfer time is approximately proportional to model size, reducing the model to 1/16 its original size—as in the case of 2-bit BQQ—can theoretically lead to up to 16× faster weight loading, assuming the I/O subsystem is the primary bottleneck.

(b) Low-Bit Computation Efficiency: Although binary and low-bit operations can be highly efficient on hardware that supports them (e.g., INT cores on modern GPUs), our current implementation involves casting 1-bit weights to floating-point values before matrix operations. This is due to the lack of native mixed-precision (e.g., INT-FP) kernel support on standard GPU backends. As a result, despite memory savings, the actual compute kernel is still performed in floating point, leading to no measurable improvements in raw latency under current software-hardware stacks.
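For illustration, here is a minimal PyTorch-style sketch (our own simplification, not the actual implementation; function name and shapes are hypothetical) of such a dequantize-then-compute path, using the simplified BQQ form $\sum_i \alpha_i(\mathbf{Y}_i - 0.5)(\mathbf{Z}_i - 0.5)$ discussed in this rebuttal:

```python
import torch

def bqq_dequant_matmul(x, Ys, Zs, alphas):
    """Hypothetical forward path for a BQQ-compressed layer.

    The weight is approximated as sum_i alphas[i] * (Y_i - 0.5)(Z_i - 0.5), with
    binary factors Y_i (m x l) and Z_i (l x n) stored as bools. Because no
    INT1 x FP kernel is available, the factors are cast to float first, so the
    GEMM itself still runs in full floating-point precision."""
    m, n = Ys[0].shape[0], Zs[0].shape[1]
    W = torch.zeros(m, n)
    for Y, Z, a in zip(Ys, Zs, alphas):
        W = W + a * ((Y.float() - 0.5) @ (Z.float() - 0.5))  # dequantize on the fly
    return x @ W.t()                                          # full-precision matmul

# Toy usage with arbitrary sizes (p = 2 binary stacks)
m, n, l, p = 64, 48, 27, 2
Ys = [torch.randint(0, 2, (m, l)).bool() for _ in range(p)]
Zs = [torch.randint(0, 2, (l, n)).bool() for _ in range(p)]
x = torch.randn(8, n)
y = bqq_dequant_matmul(x, Ys, Zs, alphas=[0.6, 0.3])          # shape (8, m)
```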

However, recent work such as LUT Tensor Core (Zhiwen Mo et al., ISCA 2025) illustrates that with dedicated support for mixed-precision tensor operations (e.g., 1-bit × 16-bit), speedups over 7.5× are possible. We believe that BQQ is well positioned to benefit from such hardware innovations, and integrating with emerging low-bit inference libraries or hardware designs is a promising avenue for future research.

(c) Theoretical Computational Complexity: In terms of algorithmic complexity, the per-layer FLOPs of BQQ are on the order of $\mathcal{O}((m + n)ldp)$, where $m$ and $n$ are the dimensions of the original weight matrix, $l$ is the intermediate dimension, $p$ is the number of binary matrix stacks, and $d$ is the input dimension of the layer.

In comparison, standard scalar $p$-bit quantization requires $\mathcal{O}(mndp)$ FLOPs per layer. When $l \approx \frac{mn}{m + n}$, BQQ achieves the same asymptotic computational complexity as scalar quantization. While our current implementation does not fully leverage this due to software and hardware constraints, we emphasize that BQQ is structurally amenable to such optimizations.
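As a quick numeric illustration (our own sketch; the layer dimensions below are hypothetical), the heuristic $l = \mathrm{round}(mn/(m+n))$ indeed makes the BQQ FLOP and bit counts line up with those of scalar $p$-bit quantization:

```python
def bqq_vs_scalar_cost(m, n, d, p):
    """Compare the per-layer cost O((m+n)*l*d*p) of BQQ against O(m*n*d*p) for
    scalar p-bit quantization, using the heuristic l = round(mn / (m+n))."""
    l = round(m * n / (m + n))
    bqq_flops = (m + n) * l * d * p
    scalar_flops = m * n * d * p
    bqq_bits = (m * l + l * n) * p      # bits to store the binary factors
    scalar_bits = m * n * p             # bits for p-bit scalar quantization
    return l, bqq_flops, scalar_flops, bqq_bits, scalar_bits

# Hypothetical ViT-like layer: 768x3072 weight, 197 tokens, 2-bit equivalent
print(bqq_vs_scalar_cost(m=768, n=3072, d=197, p=2))
# l = 614; BQQ FLOP and bit counts come out within ~0.1% of the scalar case
```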

We hope the reviewer recognizes that while we have not yet developed a latency-optimized implementation, our analysis highlights the conditions under which latency gains are likely to be realized. We thank the reviewer again for highlighting this important aspect and agree it is a crucial direction for future work.


(4) On the Relation to LoRA and Related Work

Thank you for pointing out the connection to LoRA. While LoRA is primarily designed for efficient adaptation of large models by introducing lightweight low-rank updates, our method targets compression of the original real-valued matrices themselves. Despite this difference in objective, there is a structural resemblance in the use of low-rank components. We agree that clarifying this distinction—and positioning BQQ within the broader context of related methods, including LoRA-based techniques and hybrid approaches such as QA-LoRA and LQ-LoRA—will help readers better understand the contributions and novelty of our work.

If accepted, we will mention the structural similarities between our method and LoRA-based approaches, such as the use of low-rank components, to help situate our method within the broader landscape of efficient model design.


Final Remarks

Once again, we thank the reviewer for their valuable insights. Your suggestions have significantly improved both our presentation and analysis.

Comment

Thank you for reading and addressing my concerns. Your comments (1) and (2) strengthen the basis of your work, and I hope you can fit them into the paper or at least into the appendix.

I didn't fully understand how the table you present in (2) correlates to Fig. 3 in the paper. The model sizes (x axis) match in both the table and the figure (two leftmost data points). However, in Fig. 3 all the MSE values are < 1, while in the table they are >> 1. Please make sure to elaborate on this when you incorporate it into the paper.

I currently have no additional questions and plan to keep the score.

Comment

Thank you very much for your thoughtful comments and for taking the time to carefully review our responses.

We sincerely apologize for the confusion regarding (2) in the rebuttal. While both Figure 3 in the paper and the table shown in the rebuttal are based on experiments using the same dataset, in the rebuttal table we reported the MSE values multiplied by $10^3$ for better readability. We regret not clearly stating this, and we apologize for the oversight.

If the paper is accepted, we will make sure to incorporate the additional results into the main text or appendix, along with a clear explanation of the scaling and its relation to Figure 3.

If anything remains ambiguous or warrants further explanation, we would be more than willing to provide additional details.

Review (Rating: 4)

This paper proposes a novel method for quantizing a matrix to represent it using a small number of bits. The proposed method is named Binary Quadratic Quantization (BQQ). BQQ generalizes an existing approach known as Binary Coding Quantization (BCQ), which approximates a matrix as a weighted sum of binary matrices. In BQQ, the matrix is approximated by a sum of "products of two binary matrices." Conceptually, while BCQ represents a matrix using a linear expression like aX + b, BQQ approximates it with a quadratic expression such as (aX + b)(cY + d). To perform this approximation in practice, the paper introduces a framework consisting of: (1) a greedy approximation that iteratively reduces the residual error of the original matrix, (2) a formulation that becomes a PUBO problem when real-valued variables are fixed, and (3) an analytic closed-form solution when the binary variables are fixed. The proposed method is applied to both matrix approximation tasks and transformer layer approximation tasks, and it has been reported to outperform existing methods.

Strengths and Weaknesses

Strengths

The strengths of the proposed method are summarized as follows. Overall, I find the proposed approach interesting and well-motivated, and I have a favorable impression. Therefore, I assign a borderline accept score to this paper.

Straightforward Extension

The proposed method is a straightforward extension of the existing BCQ approach. While BCQ approximates a matrix using a linear form (aX + b), the proposed method expresses the matrix using a quadratic form ((aX + b)(cY + d)). This mathematical extension is elegant. It is evident that a quadratic form has greater expressive power than a linear one, and the effectiveness of this formulation is intuitively convincing.

Well-Structured Optimization

The optimization strategy described in Section 4.2 is well-structured: a greedy approach that iteratively reduces the objective function and performs optimization under two configurations: (1) fixing the real-valued variables and (2) fixing the binary variables. This approach is clear, and it makes sense that the BQQ outperforms BCQ if the optimization succeeds. The experimental results verify this point and report BQQ's superior performance over BCQ.

Extensive Experiments

The proposed method is thoroughly evaluated in the matrix approximation task and the Transformer layer approximation task. It demonstrates strong approximation performance across various settings, and I am satisfied with these results.

Weaknesses

There are no particularly strong weaknesses to point out, but I offer the following comments as suggestions to improve the manuscript further.

On Mathematical Expressions

There are two suggestions for improving the mathematical notation.

In Line 3 of Algorithm 2, the expression $\hat{\mathbf{Y}}_{\mathrm{old}} - 0.5$ appears. Here, the scalar value "0.5" is used as if it were a matrix. A more precise notation would be to typeset it in boldface as $\mathbf{0.5}$ (as done in Line 3 of Algorithm 1), or to represent it using an all-one matrix as $0.5 \cdot \mathbf{1}$. Similar expressions also appear in other places, so it is recommended that all such instances be revised consistently.

In Line 1 of Algorithm 3, the notation $\mathbf{W}_\mathrm{res} \gets \mathrm{clone}(\mathbf{W})$ is used. Personally, I find the term clone to be overly Python-specific and possibly unnecessary. The symbol "$\gets$" typically implies a deep copy unless otherwise stated in pseudocode. If clone is explicitly used here, then similar clarity should be maintained elsewhere. For instance, Line 14 of Algorithm S.1 might also need to state $\mathbf{Q} \gets \mathrm{clone}(\mathbf{Q}_c)$.

On Computational cost

The computational cost of the proposed method is not fully discussed in the main manuscript. Specifically, the discussion section refers to Appendix A.5. However, computational cost is an important factor in matrix quantization and decomposition. Therefore, it is recommended that a more explicit summary of the computational results be included directly in the main manuscript.

Questions

The proposed method reduces the approximation error greedily. That is, in the loop of Line 2 in Algorithm 3, I would expect that $r_i$ and $s_i$ take larger values when $i$ is small, and gradually become smaller as $i$ increases. Does the proposed method actually exhibit this behavior?

In the context of SVD, the Eckart-Young theorem guarantees that selecting the most significant singular values in descending order yields the best low-rank approximation. In contrast, no such intuitive theorem generally holds for tensor decompositions such as the CP decomposition.

How does this compare to the proposed method? Specifically, when processing in order of $i = 0, 1, \dots$ and terminating at some point, can we say --- empirically or theoretically --- that the resulting approximation is good?

Limitations

None

Final Justification

I have a positive impression of this paper, which remains strong even after the rebuttal. Therefore, I will maintain my original BA score.

Formatting Concerns

None

Author Response

Response to Reviewer Comments

We thank the reviewer for the insightful and constructive feedback. We are encouraged by the overall positive assessment and appreciate the valuable suggestions. Below, we address each comment in turn.


(1) On Mathematical Expressions

We appreciate the reviewer's careful reading of our algorithms and the helpful suggestions on improving notational consistency. We agree that the expression 0.5 in Algorithm 2, Line 3, should be more precisely written as a matrix, e.g., in boldface or an all-one matrix scaled by 0.5, as was done in Algorithm 1. If the paper is accepted, we will revise all such instances to improve clarity and consistency.

Regarding the use of clone, we understand the concern that it is Python-specific. If the paper is accepted, we will revise the pseudocode to either eliminate such implementation-specific terminology or annotate operations like deep copying consistently throughout, including Line 14 in Algorithm S.1.


(2) On Inference Computational Cost

The reviewer is absolutely right that the computational cost of DNN inference is an important consideration in quantization methods. While this point is currently discussed in Appendix A.5, we agree that it deserves to be presented directly in the main text. If the paper is accepted, we will incorporate a concise summary of the inference-time computational cost analysis into the main body to improve clarity and accessibility.


(3) On the Behavior of the Greedy Strategy

The reviewer correctly observes that in Algorithm 3, the scaling parameters $r_i$, $s_i$, and $t_i$ tend to have smaller magnitudes as the index $i$ increases. This is indeed the case. Since our method reduces the residual error in a greedy manner, later steps operate on smaller residuals, which naturally results in terms with smaller absolute values.

To further illustrate this behavior, we include an additional experimental result showing how the values of $r_i$, $s_i$, and $t_i$ decrease progressively during the greedy process. The table below summarizes this trend:

| $i$ | $s_i$ | $t_i$ | $u_i$ |
|---|---|---|---|
| 0 | 0.61 | -0.27 | -0.26 |
| 1 | 0.35 | -0.19 | -0.19 |
| 2 | -0.20 | 0.09 | 0.09 |
| 3 | -0.12 | 0.06 | 0.06 |
| 4 | -0.07 | 0.03 | 0.03 |
| 5 | 0.04 | -0.02 | -0.02 |

(4) On the Analogy with SVD and the Eckart–Young Theorem

We appreciate the reviewer's thoughtful question regarding the relationship between our greedy approach and the Eckart–Young theorem. As correctly observed, our greedy optimization procedure—where scaling coefficients are determined in a decreasing order of significance—bears a conceptual resemblance to the construction of low-rank approximations in SVD.

However, unlike SVD, our method does not guarantee optimality in a theoretical sense. Nevertheless, we empirically observe that it achieves a favorable trade-off between approximation error and bit-width compared to existing methods. In this sense, while heuristic, our method resonates with the spirit of the Eckart–Young theorem, suggesting that greedy, non-optimal procedures can still yield near-optimal approximations under certain conditions.

In addition, although this slightly departs from the question of greedy optimality, we have found that an upper bound on the approximation error for each subproblem in our greedy process can be derived using the Eckart–Young–Mirsky theorem and the triangle inequality:

$$\min \sqrt{L_\mathrm{sub}^{(i)}} \le \min \left\lVert \mathbf{W}^{(i)} - \mathbf{W}_\mathrm{svd}^{(i)} + \mathbf{W}_\mathrm{svd}^{(i)} - \left[ \alpha_i(\mathbf{Y}_i - 0.5 \cdot \mathbf{1}_Y)(\mathbf{Z}_i - 0.5 \cdot \mathbf{1}_Z) \right] \right\rVert \le A + \left\lVert \mathbf{W}_\mathrm{svd}^{(i)} - c \mathbf{B} \right\rVert$$

$$A := \sqrt{ \sum_{j = l+1}^{\min(m,n)} \sigma_j^2 }, \quad \mathbf{B} := \mathrm{sgn}(\mathbf{U}_\mathrm{svd}^{(i)})\, \mathrm{sgn}(\mathbf{V}_\mathrm{svd}^{(i)}), \quad c := \frac{ \langle \mathbf{W}_\mathrm{svd}^{(i)}, \mathbf{B} \rangle }{ \| \mathbf{B} \|^2 }$$

where $\sigma_j$ and $\mathbf{W}_\mathrm{svd}^{(i)}$ denote the singular values and the reconstructed matrix after applying a rank-$l$ approximation to the singular value decomposition (SVD) of $\mathbf{W}^{(i)}$, respectively. Furthermore, $\mathbf{U}_\mathrm{svd}^{(i)}$ and $\mathbf{V}_\mathrm{svd}^{(i)}$ are the singular vectors obtained from the SVD of $\mathbf{W}^{(i)}$. Note that $\mathbf{U}_\mathrm{svd}^{(i)}$ is represented as a single matrix that already incorporates the top $l$ singular values.

This bound highlights a theoretical connection between our method and classical SVD-based low-rank approximation, offering additional justification for the effectiveness of our greedy strategy.

If the paper is accepted, we plan to include this theoretical insight in the revised version, along with a more detailed discussion of its implications for total approximation quality.


Final Remarks

Once again, we sincerely thank the reviewer for the thoughtful questions, constructive suggestions, and positive evaluation. The feedback has helped us better articulate the contributions and implications of our work, and we believe it will significantly enhance the clarity and impact of the final version, should the paper be accepted.


Comment

Thank you for the rebuttal.

Regarding point (3), I have confirmed that the result is intuitive.

As for the equation in point (4), it contains the term $-\mathbf{W}_\mathrm{svd}^{(i)} + \mathbf{W}_\mathrm{svd}^{(i)}$, which appears to cancel out to zero. This description may be a typo. In any case, since this is an additional discussion after the paper submission, this result does not strongly affect my evaluation.

At this point, I have no further questions and intend to maintain my current score. I will await further interactions with other reviewers and the upcoming reviewer-AC discussion.

Comment

Thank you for your detailed reading and response.

Regarding point (4), the expression $-\mathbf{W}_\mathrm{svd}^{(i)} + \mathbf{W}_\mathrm{svd}^{(i)}$ was written intentionally to make the subsequent application of the triangle inequality more transparent. However, we understand that this may have caused confusion instead, and we sincerely apologize for that.

To clarify, the derivation could be more explicitly written as follows:

$$\min \sqrt{L_\mathrm{sub}^{(i)}} \le \min \left\| \mathbf{W}^{(i)} - \left[ \alpha_i(\mathbf{Y}_i - 0.5 \cdot \mathbf{1}_Y)(\mathbf{Z}_i - 0.5 \cdot \mathbf{1}_Z) \right] \right\| \quad (\because \text{special case: } \beta_i = 0.5\,\alpha_i,\ \gamma_i = 1,\ \delta_i = 0.5)$$

$$= \min \left\| \mathbf{W}^{(i)} - \mathbf{W}_\mathrm{svd}^{(i)} + \mathbf{W}_\mathrm{svd}^{(i)} - \left[ \alpha_i(\mathbf{Y}_i - 0.5 \cdot \mathbf{1}_Y)(\mathbf{Z}_i - 0.5 \cdot \mathbf{1}_Z) \right] \right\|$$

$$\le \min \left\| \mathbf{W}^{(i)} - \mathbf{W}_\mathrm{svd}^{(i)} \right\| + \left\| \mathbf{W}_\mathrm{svd}^{(i)} - \left[ \alpha_i(\mathbf{Y}_i - 0.5 \cdot \mathbf{1}_Y)(\mathbf{Z}_i - 0.5 \cdot \mathbf{1}_Z) \right] \right\| \quad (\because \text{triangle inequality})$$

$$\le A + \left\| \mathbf{W}_\mathrm{svd}^{(i)} - c \mathbf{B} \right\| \quad (\because \text{Eckart–Young–Mirsky theorem and a specific choice of } \alpha_i, \mathbf{Y}_i, \mathbf{Z}_i)$$

$$A := \sqrt{ \sum_{j = l+1}^{\min(m,n)} \sigma_j^2 }, \quad \mathbf{B} := \mathrm{sgn}(\mathbf{U}_\mathrm{svd}^{(i)})\, \mathrm{sgn}(\mathbf{V}_\mathrm{svd}^{(i)}), \quad c := \frac{ \langle \mathbf{W}_\mathrm{svd}^{(i)}, \mathbf{B} \rangle }{ \| \mathbf{B} \|^2 }$$


If the paper is accepted, we will include a more detailed derivation and a complete proof in the camera-ready version to ensure clarity. If there are any remaining concerns or points of confusion, we would be happy to provide further clarification.

Comment

I understand now. Thank you for the clarification.

Review (Rating: 4)

This paper introduces a new quantization method called Binary Quadratic Quantization (BQQ), designed mainly for ultra-low-bit quantization. BQQ represents real-valued matrices as linear combinations of binary matrix products, surpassing previous first-order methods, such as uniform and binary coding quantization. To get an accurate quantized binary matrix, this paper also leverages a combination of polynomial unconstrained binary optimization (PUBO) and convex quadratic programming to address the optimization problem. The authors demonstrate BQQ’s effectiveness through matrix compression benchmarks and post-training quantization (PTQ) of Vision Transformer (ViT) models, showing that BQQ consistently achieves a superior trade-off between memory efficiency and reconstruction error. On ImageNet, BQQ outperforms state-of-the-art PTQ methods by up to 2.2% and 59.1% with 2-bit equivalent quantization.

Strengths and Weaknesses

  • Strengths
    • This paper is well-written and easy to follow.
    • The proposed method achieves the SOTA low-bit quantization results in terms of accuracy and compression ratio.
  • Weakness
    • The paper shows FLOPs reduction and weight bit counts, but doesn’t provide real speedup data.
    • The experiments only evaluated the proposed method with end-to-end models on image classification benchmark, i.e. ImageNet. Since it’s applicable to various models and tasks, it would be better to evaluate on LLMs as well.
    • It lacks analysis on the parameters used in BQQ, including the low-rank dimension $l$ and the pseudo bit width $p$. Adding more analysis on these under the same compute/memory constraints would help readers understand BQQ better.

Questions

  • In Table S.1, why does higher-bit quantization take longer? Did they use the same number of optimization iterations? For lower-bit quantization, I think more iterations are needed for optimization. Is my understanding correct?

Limitations

yes

Final Justification

BQQ can also work well for LLMs, especially for extremely low-bit quantization. So I'm going to increase my rating.

Formatting Concerns

No paper formatting concern.

Author Response

Response to Reviewer Comments

We thank the reviewer for the constructive and thoughtful comments. We are encouraged that the reviewer found our paper well-written and recognized the strength of our method in achieving state-of-the-art low-bit quantization results in both accuracy and compression.

Below, we address each of the concerns one by one:


(1) Lack of Real Speedup Measurements

We agree that measuring actual runtime performance is important for practical deployment. Our primary contribution, however, lies in proposing the BQQ framework and demonstrating its theoretical and empirical benefits in terms of memory efficiency and approximation quality. While we provide FLOPs and bit-level analyses as proxies for computational cost, real-world speedup on GPUs depends on hardware specifics and implementation details beyond FLOPs alone.

Current GPU architectures include optimized cores for integer-only (INT) operations, enabling efficient low-bitwidth computations when both operands are integers. However, because mixed-precision operations involving floating-point (FP) and integer (INT) data—such as those required when 1-bit weights are cast to floating-point formats before processing—lack dedicated optimized cores, 1-bit weights must be converted to FP for computation. As a result, the actual amount of computation performed on the GPU remains effectively the same as in unquantized models, and despite the reduced memory access and data transfer, our GPU implementation did not show latency improvements.

Recent research, for example LUT Tensor Core (Zhiwen Mo et al., ISCA 2025), shows that specialized hardware supporting optimized tensor operations between 1-bit and 16-bit data types can achieve over 7.5× speedups compared to standard GPU execution. Such innovations suggest that substantial acceleration is possible with appropriate hardware support for mixed-precision computations.

At present, we have not yet developed a hardware-optimized implementation tailored to BQQ and thus cannot provide actual runtime acceleration results. We sincerely wish the reviewer to value our theoretical contribution on memory and computation cost efficiencies. Implementing and evaluating practical runtime improvements on optimized hardware is an important direction for future work.


(2) Evaluation Limited to Vision Transformer

We thank the reviewer for pointing out that the experiments were conducted only on end-to-end image classification models (i.e., ImageNet) and for suggesting evaluation on large language models (LLMs) to strengthen the generality of our method. We agree that BQQ is broadly applicable to a wide range of architectures and tasks—including not only various neural networks but also non-neural applications such as Approximate Nearest Neighbor (ANN) search, where binary representations are highly beneficial.

In this work, we focused on Vision Transformers (ViTs), which are known to be particularly fragile to quantization—especially under low-bit settings where small perturbations in weights can lead to significant degradation. Demonstrating strong performance on ViTs thus highlights the robustness of BQQ in one of the most challenging quantization scenarios. While applying BQQ to LLMs is a promising direction, the sheer size of these models makes our current optimization pipeline computationally expensive, and full-scale evaluation remains an ongoing effort.

That said, to provide the reviewer with more concrete evidence of BQQ’s broader applicability, we additionally applied our method to compact LLMs suitable for edge devices. Specifically, we evaluate our method on Qwen2.5-0.5B, an efficient and high-performing LLM designed for edge deployment. Compared with GPTQ, which is a standard quantization method, BQQ demonstrates competitive performance in terms of perplexity and 0-shot downstream task accuracy under the same bit-width, as summarized in the table below.

| Method | #Bits | Wikitext2 PPL | PIQA | ARC-E | ARC-C | HS | WGe | BoolQ | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| GPTQ | 3 | 26.2 | 62.2 | 48.9 | 25.4 | 35.3 | 55.9 | 51.7 | 46.6 |
| c-BQQ | 3 | 20.5 | 62.2 | 51.4 | 23.1 | 33.6 | 52.1 | 47.1 | 44.9 |
| GPTQ | 2 | 12523.6 | 52.7 | 25.3 | 21.4 | 25.6 | 51.9 | 43.6 | 36.8 |
| c-BQQ | 2 | 51.0 | 57.5 | 39.8 | 18.0 | 28.3 | 52.6 | 58.8 | 42.5 |
| Baseline | 16 | 13.0 | 70.2 | 64.6 | 29.5 | 40.6 | 56.4 | 62.4 | 54.0 |

However, we emphasize that the current work proposes BQQ as a general binary quantization framework, rather than a dedicated quantization technique for neural networks. In our implementation, BQQ quantizes weights by minimizing reconstruction error in the weight space, without relying on output-based error signals that are commonly used in many neural network quantization methods. This implies that BQQ does not yet exploit task-specific loss functions or activation statistics. We therefore believe that adapting BQQ to incorporate such feedback—especially minimizing the downstream output error—could lead to significant further improvements in model performance, and this represents a highly promising avenue for future research.


(3) Lack of Analysis on Intermediate Dimension and the Number of Stacks ($l$ and $p$ in Eq.(5))

We appreciate the reviewer’s insightful comment regarding the need for a more detailed analysis of BQQ’s internal parameters, such as the intermediate dimension $l$ and the number of binary matrix stacks $p$. In our main experiments, we fixed the intermediate dimension as $l = \mathrm{round}\left(\frac{mn}{m + n}\right)$ to maintain the same number of elements as the original $m \times n$ matrix, and adjusted the model size by varying the number of stacked binary matrices $p$. However, we agree that varying $l$ under fixed model size constraints can provide a deeper understanding of the trade-off between compression and reconstruction accuracy.

To address this point, we conducted additional experiments where we fixed the total model size and varied the ratio between $l$ and $p$. We used the same set of matrix datasets as in Fig. 3. The results are summarized in the following table, where the “$l$-scale” denotes the relative value of $l$ normalized by $\mathrm{round}\left(\frac{mn}{m + n}\right)$.

| #stacks ($p$) | $l$-scale | Size [KB] (Random) | MSE (Random) | Size [KB] (DeiT) | MSE (DeiT) | Size [KB] (Distance) | MSE (Distance) | Size [KB] (SIFT) | MSE (SIFT) | Size [KB] (ImageNet) | MSE (ImageNet) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 2.1 | 324.3 | 18.4 | 298.1 | 1.3 | 14.6 | 2.1 | 97.8 | 6.3 | 42.7 |
| 2 | 0.5 | 2.1 | 329.9 | 18.5 | 309.4 | 1.3 | 9.2 | 2.1 | 108.1 | 6.3 | 29.7 |
| 4 | 0.25 | 2.1 | 337.1 | 18.5 | 317.3 | 1.3 | 12.2 | 2.1 | 110.8 | 6.3 | 49.6 |

| #stacks ($p$) | $l$-scale | Size [KB] (Random) | MSE (Random) | Size [KB] (DeiT) | MSE (DeiT) | Size [KB] (Distance) | MSE (Distance) | Size [KB] (SIFT) | MSE (SIFT) | Size [KB] (ImageNet) | MSE (ImageNet) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 4.1 | 106.4 | 36.9 | 101.9 | 2.5 | 7.7 | 4.1 | 46.3 | 12.6 | 20.5 |
| 2 | 1 | 4.1 | 105.3 | 36.9 | 95.8 | 2.5 | 2.3 | 4.1 | 30.0 | 12.6 | 10.9 |
| 4 | 0.5 | 4.1 | 108.4 | 36.9 | 100.8 | 2.6 | 2.5 | 4.1 | 34.4 | 12.6 | 8.9 |
| 8 | 0.25 | 4.2 | 114.2 | 37.0 | 105.6 | 2.5 | 3.5 | 4.2 | 35.9 | 12.6 | 14.1 |

Our findings show that the optimal value of $l$ can depend on the spectral characteristics of the data. For instance, in datasets with rapidly decaying singular values such as Distance and ImageNet (see Fig. S.1), a smaller $l$ (e.g., $l = 0.5$) yielded better performance at small model sizes. On the other hand, in other cases, using $l = 1$ produced consistently strong results. These results suggest that while our default choice of $l = \mathrm{round}\left(\frac{mn}{m + n}\right)$ works well in general, adapting $l$ to the specific data characteristics could lead to improved trade-offs between reconstruction error and memory usage.

We thank the reviewer again for prompting this important analysis.


(4) Runtime and Optimization Iterations

Thank you for your insightful question. The reason why higher-bit quantization takes longer is due to our greedy optimization approach applied to each binary matrix component. In our main experiments, the intermediate dimension is fixed as $l = \mathrm{round}\left(\frac{mn}{m+n}\right)$ and the model size is controlled by the number of stacked binary matrices $p$.


Importantly:

  • When $p = 1$, only one binary matrix optimization subproblem is solved.
  • When $p = 2$, two such subproblems are solved sequentially.
  • Therefore, increasing $p$ directly increases the number of optimization steps, proportional to the number of binary matrices. As a result, higher-bit quantization leads to longer computation times.

Regarding the runtime results presented in the appendix:

  • All experiments were conducted with a fixed 50,000 optimization iterations.
  • Your understanding is correct that lower-bit quantization tends to be more sensitive, and generally benefits from a larger number of iterations.

As shown in the supplementary material:

  • For 2-bit quantization, more iterations are required to reach satisfactory accuracy.
  • For 3-bit and 4-bit quantizations, accuracy plateaus well before 50,000 iterations, indicating that fewer iterations suffice to maintain stable performance.

Final Remarks

We sincerely appreciate the reviewer’s thoughtful feedback and hope the final version of our work will address these concerns more fully.

Comment

Thanks for your detailed response. Regarding (1), I agree that theoretical contributions are also very important, but it would be better to demonstrate the value of the proposed methods in practical applications. From the results of (2), BQQ can also work well for LLMs, especially for extremely low-bit quantization.

Most of my questions have been addressed, so I raised my rating.

Comment

Thank you very much for your thoughtful response and for raising your rating.

We agree that demonstrating the value of our proposed methods in practical applications is important. We will continue working more on that on the basis of the theoretical contributions in this manuscript. We really appreciate your constructive comments throughout the discussion.

We would also like to clarify two minor issues in our rebuttal. First, the MSE values shown in (3) are scaled as $\text{MSE} \times 10^3$ to improve readability, but we unfortunately omitted this note in the rebuttal. Second, the notations $l = 0.5$ and $l = 1$ were typographical errors; they should have been written as $l\text{-scale} = 0.5$ and $l\text{-scale} = 1$, respectively. We apologize for the oversight.

Thank you again for your valuable feedback.

Review (Rating: 5)

The paper proposes a matrix quantization method called Binary Quadratic Quantization (BQQ). BQQ approximates real-valued matrices as a sum of binary matrix products, which intuitively enables expressive nonlinear approximations while maintaining a compact data format.

The experimental results demonstrate that BQQ achieves a better trade-off between compression rate and reconstruction accuracy compared to uniform quantization, binary coding quantization, and low-rank approximation using SVD.

Strengths and Weaknesses

Pros:

  • BQQ achieves a good trade-off between memory efficiency and reconstruction error; it seems to capture structural redundancy via the nonlinear approximations while maintaining a compact data format

  • BQQ outperforms several existing methods in matrix compression and PTQ tasks.

Cons:

  • BQQ requires a non-negligible amount of time and resources for the quantization process

  • BQQ seems not to have a theoretical upper bound on the approximation error.

  • BQQ uses a greedy optimization strategy with an unclear compromise

  • The choice of baselines leaves something to be desired. Specifically, it's unclear how BQQ compares to other quantization methods that cannot be applied online (e.g., k-means, Hadamard transform + quantization, lattice-based quantization).

Questions

Can the author address the mentioned weaknesses? Specifically, when considering methods that require a lot of time for quantization (such as the proposed BQQ), how does the accuracy-compression tradeoff of BQQ compare to such methods (e.g., k-means, Hadamard transform + quantization, Lattice-based quantization)? What advantages does BQQ offer over these methods?

Limitations

Yes, the paper discussed limitations (Sec. 6, "Further Potential and Limitations").

Final Justification

I have decided to increase my score from 4 to 5 based on the new experimental and theoretical results provided by the authors and their intent to include these in the paper.

Formatting Concerns

N/A.

Author Response

Response to Reviewer Comments

We sincerely appreciate the positive feedback and constructive suggestions. Below, we address the main concerns raised in the review.


(1) Comparison with Other Quantization Methods

We thank the reviewer for suggesting additional comparisons with a broader range of quantization methods. While our original experiments focused on scalar quantization baselines, we agree that extending the evaluation to a more diverse set of techniques can provide valuable insights.

In response, we conducted additional experiments that compare our method against the following approaches:

  • Vector Quantization using k-means (VQ)
  • $E_8$ Lattice Vector Quantization (LVQ)
  • Hadamard Transform (HT) + Uniform Quantization (UQ)
  • Discrete Cosine Transform (DCT) + UQ
  • JPEG (DCT + quantization + entropy coding)

To ensure fair comparisons, we clipped all benchmark matrices to the nearest lower power-of-two size. This process was necessary to enable the application of HT, which requires input dimensions to be powers of two. All methods were evaluated on these clipped matrices for consistency.

For VQ, we also evaluated a variant using quantized centroids (VQ+UQ) to account for codebook storage.

In LVQ, we used the $E_8$ lattice with 240 norm-$\sqrt{2}$ centroids, representing each 8-dimensional input as a linear combination of $n$ centroids, each stored in 8 bits, making it comparable to $n$-bit scalar quantization in memory.

For the HT and DCT methods, we experimented with two levels of energy preservation by retaining only a portion of the low-frequency components after transformation. Specifically, we evaluated settings that preserve approximately 50% and 100% of the transformed components, denoted as HT50, HT100, DCT50, and DCT100, respectively. Uniform quantization (UQ) was then applied to the retained components for compression.
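For concreteness, the following sketch shows one possible implementation of such a DCT+UQ baseline (our own reading of the setup; the exact frequency mask and quantizer details used in our experiments may differ, so this is illustrative only):

```python
import numpy as np
from scipy.fft import dctn, idctn

def dct_uq_compress(W, keep_frac=1.0, bits=2):
    """2-D DCT, keep only the lowest-frequency block of coefficients
    (keep_frac along each axis), uniformly quantize them, then invert."""
    C = dctn(W, norm="ortho")
    kr, kc = int(W.shape[0] * keep_frac), int(W.shape[1] * keep_frac)
    kept = C[:kr, :kc]

    lo, hi = kept.min(), kept.max()
    levels = 2 ** bits - 1
    q = np.round((kept - lo) / (hi - lo) * levels)   # uniform quantization
    deq = q / levels * (hi - lo) + lo

    C_hat = np.zeros_like(C)
    C_hat[:kr, :kc] = deq                            # discarded coefficients stay zero
    return idctn(C_hat, norm="ortho")

W = np.random.default_rng(0).normal(size=(256, 256))
W_hat = dct_uq_compress(W, keep_frac=1.0, bits=2)    # a "DCT100+UQ"-like setting
print("MSE:", np.mean((W - W_hat) ** 2))
```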


We summarize the performance of different methods across 5 datasets. For readability, we show results as “Size [KB] / MSE $\times 10^{3}$” pairs.

(Size ≈ 2-bit scalar quantization)

| Method | Random | DeiT | Distance | SIFT | ImageNet |
|---|---|---|---|---|---|
| BQQ | 4.1 / 105.0 | 16.4 / 91.7 | 1.1 / 4.0 | 4.1 / 30.6 | 4.1 / 5.6 |
| DCT100+UQ | 4.1 / 117.9 | 16.4 / 120.2 | 1.0 / 193.2 | 4.1 / 359.4 | 4.1 / 318.2 |
| DCT50+UQ | 4.1 / 496.8 | 16.6 / 499.9 | 1.1 / 289.0 | 4.1 / 324.9 | 4.1 / 144.7 |
| HT100+UQ | 4.1 / 119.8 | 16.4 / 123.0 | 1.0 / 131.0 | 4.1 / 209.3 | 4.1 / 244.3 |
| HT50+UQ | 4.1 / 505.2 | 16.4 / 505.1 | 1.0 / 535.5 | 4.1 / 558.1 | 4.1 / 157.3 |
| JPEG | 4.1 / 428.6 | 16.3 / 360.8 | 1.0 / 378.9 | 4.1 / 147.4 | 4.1 / 1.6 |
| LVQ | 4.6 / 85.9 | 18.4 / 109.3 | 1.2 / 74.4 | 4.6 / 80.6 | 4.6 / 17.1 |
| VQ | 3.7 / 917.5 | 15.6 / 893.9 | 0.8 / 337.1 | 3.7 / 318.6 | 3.7 / 133.0 |
| VQ+4bit UQ | 4.2 / 448.2 | 16.5 / 415.4 | 1.1 / 16.8 | 4.2 / 53.2 | 4.2 / 7.6 |

(Size ≈ 3-bit scalar quantization)

| Method | Random | DeiT | Distance | SIFT | ImageNet |
|---|---|---|---|---|---|
| BQQ | 6.2 / 33.9 | 24.6 / 29.9 | 1.6 / 1.3 | 6.2 / 10.0 | 6.2 / 1.8 |
| DCT100+UQ | 6.2 / 36.4 | 24.6 / 38.3 | 1.5 / 67.7 | 6.2 / 192.3 | 6.2 / 211.6 |
| DCT50+UQ | 6.2 / 491.7 | 24.9 / 494.5 | 1.6 / 276.9 | 6.2 / 269.0 | 6.2 / 49.5 |
| HT100+UQ | 6.2 / 37.2 | 24.6 / 39.6 | 1.5 / 43.0 | 6.2 / 72.9 | 6.2 / 142.6 |
| HT50+UQ | 6.2 / 499.7 | 24.6 / 499.6 | 1.5 / 527.3 | 6.2 / 546.9 | 6.2 / 123.9 |
| JPEG | 6.1 / 147.3 | 24.3 / 123.5 | 1.5 / 152.8 | 6.1 / 53.2 | 5.7 / 0.7 |
| LVQ | 6.9 / 25.3 | 27.7 / 40.5 | 1.7 / 21.5 | 6.9 / 23.1 | 6.9 / 5.3 |
| VQ | 5.8 / 876.7 | 23.8 / 847.2 | 1.3 / 195.3 | 5.8 / 282.2 | 5.8 / 83.8 |
| VQ+4bit UQ | 6.2 / 216.5 | 24.7 / 190.5 | 1.6 / 8.4 | 6.2 / 14.7 | 6.2 / 4.4 |

The above results on the five benchmark matrices show that BQQ outperforms most other compared methods—including VQ, LVQ, and transformation-based schemes—on the majority of datasets. The only exception is JPEG on ImageNet, where BQQ performs slightly worse. These results demonstrate the strong empirical performance of BQQ and underscore its practical advantages over a diverse and competitive set of baselines.

Methods such as the Hadamard Transform (HT) and Discrete Cosine Transform (DCT), when used with quantization, serve as preprocessing steps to decorrelate or concentrate signal energy. Their compression efficiency and accuracy depend on the subsequent quantizer. From a methodological perspective, these hybrid schemes are not standalone quantizers and thus not directly comparable to BQQ, which is itself a quantizer. Nevertheless, BQQ's ability to outperform these popular combinations is notable and underscores its effectiveness. Moreover, combining such transforms with BQQ may further improve compression, which we consider a promising direction.

We will incorporate a formal version of this table into the main paper if the paper is accepted.


(2) Upper Bound of BQQ

We appreciate the reviewer's insightful comment regarding the lack of theoretical approximation bounds. In response, we analyzed a simplified version of Equation (5) by assuming $\beta_i = -0.5\alpha_i$, $\delta_i = 1$, and $\gamma_i = -0.5$. Although this setting does not yield a closed-form optimal solution, we found that an approximate solution can be constructed by aligning the binary components with the sign patterns of the singular vectors obtained from a truncated SVD, combined with appropriate scaling. This approximation can then be used to provide an upper bound on the error of BQQ. Specifically, we found an upper bound on the residual error in the greedy error minimization process of BQQ, denoted by $L_{\text{sub}}^{(i)}$, as given below. This bound is derived based on the Eckart–Young–Mirsky theorem and the triangle inequality.


$$\min \sqrt{L_\mathrm{sub}^{(i)}} \le \min \left\lVert \mathbf{W}^{(i)} - \mathbf{W}_\mathrm{svd}^{(i)} + \mathbf{W}_\mathrm{svd}^{(i)} - \left[ \alpha_i(\mathbf{Y}_i - 0.5 \cdot \mathbf{1}_Y)(\mathbf{Z}_i - 0.5 \cdot \mathbf{1}_Z) \right] \right\rVert \le A + \left\lVert \mathbf{W}_\mathrm{svd}^{(i)} - c \mathbf{B} \right\rVert$$

$$A := \sqrt{ \sum_{j = l+1}^{\min(m,n)} \sigma_j^2 }, \quad \mathbf{B} := \mathrm{sgn}(\mathbf{U}_\mathrm{svd}^{(i)})\, \mathrm{sgn}(\mathbf{V}_\mathrm{svd}^{(i)}), \quad c := \frac{ \langle \mathbf{W}_\mathrm{svd}^{(i)}, \mathbf{B} \rangle }{ \| \mathbf{B} \|^2 }$$

where $\sigma_j$ and $\mathbf{W}_\mathrm{svd}^{(i)}$ denote the singular values and the reconstructed matrix after applying a rank-$l$ approximation to the singular value decomposition (SVD) of $\mathbf{W}^{(i)}$, respectively. Furthermore, $\mathbf{U}_\mathrm{svd}^{(i)}$ and $\mathbf{V}_\mathrm{svd}^{(i)}$ are the singular vectors obtained from the SVD of $\mathbf{W}^{(i)}$. Note that $\mathbf{U}_\mathrm{svd}^{(i)}$ is represented as a single matrix that already incorporates the top $l$ singular values.

This bound is connected to the magnitude of the discarded singular values, which explains the empirical observation (e.g., Figure 3 and Figure S.1) that BQQ performs better on matrices with rapidly decaying singular spectra, and underperforms when the low-rank approximation leaves significant residual energy.
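As an illustration, the bound can be checked numerically with a few lines of NumPy (our own sketch; we assume $\mathbf{V}_\mathrm{svd}^{(i)}$ is stored as an $l \times n$ matrix so that $\mathbf{B}$ is $m \times n$, and we use the fact that $c\mathbf{B}$ is itself a feasible BQQ term):

```python
import numpy as np

def bqq_bound_check(W, l):
    """Check that the error of the explicit construction c*B never exceeds the
    bound A + ||W_svd - c*B||_F (triangle inequality + Eckart-Young-Mirsky)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_svd = U[:, :l] * s[:l]            # left singular vectors scaled by top-l values
    V_svd = Vt[:l, :]                   # top-l right singular vectors (l x n)
    W_svd = U_svd @ V_svd               # rank-l reconstruction

    A = np.sqrt(np.sum(s[l:] ** 2))     # optimal rank-l residual (Eckart-Young-Mirsky)
    B = np.sign(U_svd) @ np.sign(V_svd)
    c = np.sum(W_svd * B) / np.sum(B ** 2)

    # c*B is a feasible BQQ term: c*B = alpha*(Y - 0.5)(Z - 0.5) with
    # Y = (sgn(U_svd)+1)/2, Z = (sgn(V_svd)+1)/2, alpha = 4c.
    feasible_err = np.linalg.norm(W - c * B)
    bound = A + np.linalg.norm(W_svd - c * B)
    return feasible_err, bound          # feasible_err <= bound must hold

W = np.random.default_rng(0).normal(size=(64, 48))
print(bqq_bound_check(W, l=12))
```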


(3) Greedy Optimization and Its Computational Cost

We acknowledge the limitation of using a greedy strategy. However, solving polynomial unconstrained binary optimization (PUBO) problems to global optimality quickly becomes intractable as problem size grows. In BQQ, joint optimization would involve $(m \times l + l \times n) \times p$ binary variables for the $m \times l$ and $l \times n$ matrices and $p$ BQQ terms, leading to significant memory and time costs even for moderate-sized matrices.

To mitigate this, we adopt a greedy residual minimization strategy inspired by Binary Coding Quantization (BCQ), which incrementally reduces the approximation error by adding one component at a time. In contrast to global optimization methods that require re-running the entire procedure from scratch when the number of stacks ($p$ in Eq. (5)) is modified, the greedy approach preserves previously optimized components. This enables efficient and flexible adjustment of model capacity without discarding past computations. Nonetheless, as the reviewer rightly noted, applying this transformation to many matrices in neural networks incurs non-trivial computational costs. Reducing this overhead remains an important direction for future work.
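Schematically, the greedy loop looks as follows (our own simplified sketch: each per-term subproblem here is solved with the SVD-sign construction from (2) above plus a closed-form scale, standing in for the actual PUBO/closed-form alternation):

```python
import numpy as np

def greedy_bqq_sketch(W, l, p):
    """Greedy BQQ skeleton: fit p terms alpha_i * (Y_i - 0.5)(Z_i - 0.5), each to
    the current residual, so previously optimized components are preserved."""
    residual = W.copy()
    terms = []
    for _ in range(p):
        U, s, Vt = np.linalg.svd(residual, full_matrices=False)
        Y = (np.sign(U[:, :l] * s[:l]) + 1) / 2        # binary factor, m x l
        Z = (np.sign(Vt[:l, :]) + 1) / 2               # binary factor, l x n
        D = (Y - 0.5) @ (Z - 0.5)
        alpha = np.sum(residual * D) / np.sum(D ** 2)  # optimal scale for this D
        residual = residual - alpha * D                # residual norm is non-increasing
        terms.append((alpha, Y, Z))                    # earlier alphas tend to be larger
        print("residual MSE:", np.mean(residual ** 2))
    return terms

W = np.random.default_rng(1).normal(size=(64, 48))
terms = greedy_bqq_sketch(W, l=round(64 * 48 / (64 + 48)), p=4)
```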

While we acknowledge that our current optimization strategy is relatively simple and could certainly be improved with more sophisticated or learned methods, we would like to emphasize that the core contribution of this work lies in the formulation of the BQQ framework itself. The novelty of BQQ as a binary matrix approximation scheme is independent of the specific optimization algorithm used, and we hope the reviewer understands that our primary goal is to establish the expressiveness and potential of the framework, rather than to claim advances in optimization methodology. We will further clarify this point and add future directions regarding optimization improvements in the final manuscript.


Final Remarks

We hope these responses adequately address the reviewer's concerns. We thank the reviewer again for the valuable feedback, which has helped strengthen the paper both theoretically and experimentally.

Comment

Thank you for the detailed response. I have decided to increase my score from 4 to 5 due to the new experimental and theoretical results and your intent to include these in the paper.

Comment

Thank you very much for your reconsideration and for raising your score. We truly appreciate your thoughtful evaluation and are glad that the additional experimental and theoretical results helped clarify our contributions. If the paper is accepted, we will make sure to incorporate these results into the final version as promised.

Comment

Dear Reviewers,

We sincerely appreciate your valuable feedback and thoughtful suggestions throughout the review process. We have done our utmost to incorporate your comments through additional experiments and clarifications, and we hope these efforts have addressed your primary concerns.

As the discussion phase is coming to an end, we would greatly appreciate your engagement in the discussion or an acknowledgement. If any concerns remain unresolved, we would be more than happy to clarify further.

Sincerely,

The Authors

Final Decision

This paper presents Binary Quadratic Quantization (BQQ), a matrix quantization framework that represents real-valued matrices as sums of binary matrix products. Empirical results indicate that BQQ achieves a better balance between compression efficiency and reconstruction accuracy relative to uniform quantization, binary coding quantization, and SVD-based low-rank approximations.

All reviewers recognize the method as original, interesting, and demonstrably stronger than existing approaches in both matrix compression and post-training quantization. During the author–reviewer discussion, the authors provided a compelling rebuttal and supplemented their work with further evidence, including broader comparisons to alternative quantization methods, evaluations on Qwen-0.5B, theoretical upper bounds, and analysis of intermediate dimensions.

In light of these contributions, I recommend acceptance of the submission. I highly encourage the authors to include the additional results in the final version. It will make the paper significantly stronger.