PaperHub
Overall rating: 6.1 / 10
Poster · 4 reviewers
Reviewer scores: 2, 4, 3, 4 (min 2, max 4, std 0.8)
ICML 2025

Mind the Gap: A Practical Attack on GGUF Quantization

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-07-24
TL;DR

Building on existing LLM quantization exploitation attacks that target naive quantization, we extend them to the popular GGUF quantization format with a simple modification.

Abstract

With the increasing size of frontier LLMs, post-training quantization has become the standard for memory-efficient deployment. Recent work has shown that basic rounding-based quantization schemes pose security risks, as they can be exploited to inject malicious behaviors into quantized models that remain hidden in full precision. However, existing attacks cannot be applied to more complex quantization methods, such as the GGUF family used in the popular ollama and llama.cpp frameworks. In this work, we address this gap by introducing the first attack on GGUF. Our key insight is that the quantization error -- the difference between the full-precision weights and their (de-)quantized version -- provides sufficient flexibility to construct malicious quantized models that appear benign in full precision. Leveraging this, we develop an attack that trains the target malicious LLM while constraining its weights based on quantization errors. We demonstrate the effectiveness of our attack on three popular LLMs across nine GGUF quantization data types on three diverse attack scenarios: insecure code generation ($\Delta$=$88.7%$), targeted content injection ($\Delta$=$85.0%$), and benign instruction refusal ($\Delta$=$30.1%$). Our attack highlights that (1) the most widely used post-training quantization method is susceptible to adversarial interferences, and (2) the complexity of quantization schemes alone is insufficient as a defense.
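To make the attack mechanics concrete, the following is a minimal PyTorch-style sketch of the general idea described above; it is not the authors' implementation. It derives a per-weight interval from the quantization error of a reference model and then fine-tunes while projecting the weights back into that interval after every step. `quantize` and `dequantize` are placeholders for the real GGUF k-quant routines, and the interval construction is only illustrative.

```python
import torch

def error_intervals(model, quantize, dequantize):
    """Per-weight intervals derived from the quantization error
    |w - dequant(quant(w))| of a reference model. Staying inside these
    intervals (approximately) preserves the quantized model while leaving
    room to change full-precision behavior. Illustrative only; the paper's
    exact construction and its heuristic expansion may differ."""
    intervals = {}
    with torch.no_grad():
        for name, p in model.named_parameters():
            err = (p - dequantize(quantize(p))).abs()
            intervals[name] = (p - err, p + err)
    return intervals

def constrained_step(model, loss_fn, batch, intervals, lr=1e-5):
    """One fine-tuning step followed by a projection into the intervals."""
    loss = loss_fn(model, batch)
    loss.backward()
    with torch.no_grad():
        for name, p in model.named_parameters():
            lo, hi = intervals[name]
            p.add_(p.grad, alpha=-lr)                # plain SGD update
            p.copy_(torch.clamp(p, min=lo, max=hi))  # stay inside the interval
            p.grad = None
    return loss.item()
```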
Keywords
quantization, large language models, security, poisoning, gguf

Reviews and Discussion

Review
Rating: 2

The paper investigates the question of whether the quantization error in an LLM can be exploited towards practical attacks, that can lead the model to output maliciously on specific inputs, while not dropping significantly on standard benchmarks. While this fact has been shown before by Egashira et al. (NeurIPS 2024) for simpler non-calibration quantization schemes, the present submission does this for the more modern and accurate "GGUF" format, which entails calibration. Thus, the submission shows that such models (which are extremely popular on repositories such as HuggingFace) could be practically exploited.

Questions for Authors

I do not have any major questions, but I am curious if:

  1. The authors looked into other approaches (e.g., specialized signatures?) for defending against such attacks?
  2. Do the authors have some intuition towards more general results showing that any quantized format would be to some extent vulnerable to exploits on an unknown slice of inputs?

post-rebuttal comment

I thank the authors for their responses. As stated before, I remain borderline on this paper (due to concerns outlined before), but will definitely not stand in the way of acceptance.

Claims and Evidence

The specific technical claims in the submission are sound and are supported by practical evidence. I will say though that the abstract claims are more generic than what is supported later in the paper. Specifically, the main abstract claim: "Our key insight is that the quantization error – the difference between the full-precision weights and their (de-)quantized version – provides sufficient flexibility to construct malicious quantized models that appear benign in full precision." is something that was first achieved by prior work for a simpler family of quantization schemes. The present work is the first one to do this for GGUF.

Methods and Evaluation Criteria

There is no standard benchmark for such methods, but the proposed evaluation criteria make sense.

Theoretical Claims

There are no real theoretical claims, the paper presents a heuristic.

Experimental Design and Analysis

The experimental design is valid (in fact, it is largely adapted from Egashira et al., which is already published work).

Supplementary Material

I have skimmed the supplementary material, in particular the detailed derivations of the heuristic and the details corresponding to the GGUF algorithm.

Relation to Existing Literature

The paper can be seen as a technical-report-style extension of the work of Egashira et al., in particular extending their approach to a more popular format.

Essential References Not Discussed

I think the related work coverage is fine.

Other Strengths and Weaknesses

Strengths:

I think the paper proposes a valid extension of the work on attacks towards quantization methods, specifically since the format they are focusing on is probably the dominant one for local deployment of ML models. The heuristic approach proposed is well-justified, and well-supported through experiments.

Weaknesses:

Essentially, the work is at the level of a well-executed technical report, as it heavily builds on prior work, notably Egashira et al.'s paper at NeurIPS. As such, the work is of interest, but only to a relatively small niche in the community. The heuristic proposed is quite specialized to the (specialized) format considered, and I don't see how this would be extensible beyond the problem considered here. The defenses considered are largely the same as Shu et al. 2023.

Other Comments or Suggestions

The paper is well-written and easy to follow.

Ethics Review Concerns

No concerns.

Author Response

First, we would like to thank the reviewer for their efforts spent reviewing our paper, understanding the strength of our work, and providing many insightful comments. We address the reviewer’s questions and comments below.

Q1: Is the claim in the abstract, “Our key insight is that the quantization error … provides sufficient flexibility to construct malicious quantized models that appear benign in full precision” aligned with the actual contribution of the paper?

Yes. Although quantization is known to be susceptible to similar attacks, existing attacks are against rounding-based quantizations, and they rely on an analytical bound within which quantization results remain unchanged. In contrast, we are the first to show that the quantization error itself is exploitable against GGUF, where such an analytical bound cannot be calculated.

Q2: Is the scope of the work wide enough and of interest to a broad community?

Yes. As model sizes continue to grow, quantization techniques are becoming increasingly widespread, making research on their security highly practical and valuable. We are the first to reveal that GGUF, arguably the most widely used algorithm, is indeed susceptible to attacks. This demonstrates that a type of attack previously considered more of a theoretical concern in the context of simpler quantization schemes has now become a strong and practical threat. This, as unanimously acknowledged by other reviewers, has a significant impact, extending far beyond what we would consider a “small niche community” (GGUF-quantized models have hundreds of millions of downloads [1], with over 77.3k stars on llama.cpp, 135k on ollama, and 100+ apps building on it; further, there are over 90k GGUF-quantized models on Hugging Face).

Q3: Can the attack be extended to other quantization algorithms?

[The reviewer dpB2 raised a similar point, so we repeat our unified response here.]

First, we would like to emphasize that, considering the overwhelming number of GGUF-quantized models and their users, demonstrating that our approach (error-based interval and heuristic expansion) successfully attacks every variant of k-quant algorithms of GGUF already provides substantial impact.

Still, as we acknowledge the importance of the extensibility of our approach, we conducted an additional experiment targeting GPTQ (data-dependent) and HQQ (data-independent), which are both integrated into Hugging Face. We obtained the following results:

===Vulnerable Code Generation===

| Model | Target Quantization | Security (Full) | Security (Quantized) | Utility (Full, HumanEval) |
|---|---|---|---|---|
| Qwen2.5-1.5b | GGUF, Q4_K_M | 89.2 | 12.5 | 41.4 |
| | GPTQ, 4bit | 96.0 | 42.6 | 40.9 |
| | HQQ, 4bit | 88.4 | 13.0 | 41.7 |

===Content Injection===

| Model | Target Quantization | ASR (Full) | ASR (Quantized) | Utility (Full, MMLU) |
|---|---|---|---|---|
| Qwen2.5-1.5b | GGUF, Q4_K_M | 0.3 | 40.2 | 59.8 |
| | GPTQ, 4bit | 0.5 | 1.1 | 59.3 |
| | HQQ, 4bit | 0.1 | 1.3 | 59.7 |

As these results indicate, our method partially extends to GPTQ / HQQ, even without being explicitly modified for them. Although the success rates of the attack are generally smaller than on GGUF, we consider this a promising result, with pushing the score further being an interesting avenue for future work to explore.

We thank the reviewer for raising an interesting direction for the discussion. We will include the results in the next revision.

Q4: Can you elaborate why only Gaussian noise is used as a defense?

[The reviewer dpB2 raised a similar point, so we repeat our unified response here.]

Certainly. We would like to clarify the following points: (i) Since our work focuses on the attack side, we believe that a rigorous investigation of defenses (including, e.g., other downstream effects of such defenses) is better suited for future work. (ii) Importantly, as there is so far no established defense in practice, the success of our attack without a defense protocol already highlights a significant real-world threat. (iii) To acknowledge the importance of such defenses, we focused on a straightforward and easily applicable approach: noising. While we do not extensively focus on defensive aspects, we note that our work provides some novel insights not discussed in prior research, such as the potential for model-specific optimized noise (regardless of the quantization type) to mitigate the attack.
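For completeness, the noising defense referred to above can be as simple as perturbing every weight with small Gaussian noise before quantization. Below is a minimal sketch; the noise scale `sigma` is a hypothetical knob, not a value taken from the paper.

```python
import torch

def gaussian_noise_defense(model, sigma=1e-3):
    """Perturb all weights with i.i.d. Gaussian noise before quantizing.

    Intuition: the attack places weights at carefully chosen points inside
    their quantization-error intervals, so even a small random perturbation
    can push them across (de-)quantization boundaries and disrupt the hidden
    behavior, at a modest utility cost. `sigma` is illustrative and would
    need per-model tuning."""
    with torch.no_grad():
        for p in model.parameters():
            p.add_(torch.randn_like(p) * sigma)
    return model
```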

References

[1] https://ollama.com/search

Reviewer Comment

I thank the authors for their response. I continue to believe that the paper's contribution is too narrow, and that it is too close to the work of Shu et al. However, I will not stand in the way of acceptance, so I will upgrade my score by one point.

Author Comment

We thank the reviewer for the response and for updating their score. We highly appreciate the reviewer’s understanding of other views on our paper’s contributions, and would only like to briefly clarify their remaining point of criticism.

Based on the reviewer’s mention of “defenses considered are largely the same as Shu et al.” and the overall review content, we believe the reviewer’s reference to Shu et al. (On the Exploitability of Instruction Tuning) may have been intended to refer to Ma et al. (Quantization Backdoors to Deep Learning Commercial Frameworks) instead. If so, we believe we have already addressed the related points in our rebuttal; to summarize:

(i) Our work targets GGUF, a fundamentally different quantization method from those studied in Ma et al. and Egashira et al. In fact, this is the first attack on popular optimization-based quantization.
(ii) The defense we study against our attack overlaps with that of Ma et al. by design—we aim to confirm that Gaussian noising still works. We will clarify this in the next revision of the paper.

Review
Rating: 4

This paper introduces a novel practical attack on GGUF quantization. It exploits the quantisation errors inherent in GGUF to hide malicious behaviour in quantised models. The malicious behaviour of the model remains hidden in full precision but is revealed when the model is quantised.

Questions for Authors

Claims and Evidence

The claims are well-supported.

  1. The attack is indeed novel.
  2. The practicality of error-based interval estimation. Empirically validated.
  3. Effectiveness across models and configs. Experimental evidence (e.g., increase in insecure code generation).

Methods and Evaluation Criteria

The proposed method is intuitively sound. The evaluations are practical and realistic. The threat model is sound.

Theoretical Claims

N/A

Experimental Design and Analysis

The experimental design is thorough. The authors use different LLMs, various quantisation types, and examine multiple attack scenarios.

Supplementary Material

Relation to Existing Literature

Essential References Not Discussed

Other Strengths and Weaknesses

Other Comments or Suggestions

Author Response

We thank the reviewer for their acknowledgement of the strength of our work. In case the reviewer has any other questions or comments, we are happy to engage in further discussion.

Review
Rating: 3

This paper presents an adversarial attack on GGUF quantization, a popular post-training quantization (PTQ) method for LLMs. The core contribution is an error-based interval estimation technique, which exploits quantization errors to enable adversarial attacks on LLMs. The authors demonstrate the attack's effectiveness across insecure code generation, targeted content injection, and benign instruction refusal. The authors also propose a heuristic interval expansion to simultaneously attack multiple quantization schemes. Finally, the paper discusses defenses such as Gaussian noise injection to mitigate the attack.

Questions for Authors

  • What computational resources are required for the attack, and is it feasible for an adversary with limited resources?
  • Can the attack generalize to other optimization-based quantization methods beyond GGUF k-quants?
  • Beyond Gaussian noise, what other defenses were considered, and why were they not explored further?

Claims and Evidence

The claims are well-supported by clear and convincing evidence.

Methods and Evaluation Criteria

  • The methodology is well-founded, leveraging quantization error analysis to construct the adversarial attack.
  • Experiments cover multiple LLMs, across different GGUF quantization types.

Theoretical Claims

The paper includes Theorem B.1 (Appendix B.3), proving that the interval-widening heuristic is upper-bounded for zero-shot quantization, but I have not verified the proofs.

Experimental Design and Analysis

  • Experiments are well-structured, covering a variety of quantization types and LLMs.
  • The Gaussian noise defense experiment (Figure 4) is valuable, demonstrating a practical countermeasure.

Supplementary Material

I have not reviewed the supplementary material.

Relation to Existing Literature

  • Builds on Egashira et al. (2024), extending adversarial quantization attacks to optimization-based quantization.
  • Discusses similarities to data poisoning attacks (Carlini et al., 2023; Shu et al., 2023) but highlights that quantization-based attacks require no trigger tokens.

Essential References Not Discussed

None.

Other Strengths and Weaknesses

Strengths:

  • New attack on GGUF quantization, a widely used PTQ technique in open-source LLM deployment.
  • Strong experimental validation across multiple models, quantization types, and adversarial scenarios.
  • Practical implications: Highlights real security risks in popular frameworks (llama.cpp, ollama).
  • Well-written with clear problem motivation, methodology, and evaluation.

Weaknesses:

  • The attack assumes the adversary knows the quantization method, which may not always be true in practice.
  • More discussion on defenses (e.g., QAT-based robustness, adversarial training) would strengthen the paper.
  • Limited discussion of practical defenses beyond Gaussian noise.

Other Comments or Suggestions

Please address weaknesses above.

Author Response

We thank the reviewer for the efforts spent reviewing our paper and the positive assessment. We address the reviewer’s questions and comments below.

Q1: Can you elaborate on whether it is reasonable to assume that the adversary has access to the quantization algorithm?

Certainly. We agree with the reviewer that this is an important aspect of our attack setting. Notably, in practice, we find that many widely used quantization schemes are open and fully accessible, particularly since their primary use case is local model deployment on commodity hardware. This level of accessibility goes hand-in-hand with the popularity of such schemes and makes them a primary target for adversaries who aim at potential real-world impact.

The focus of this work perfectly exemplifies this, as the GGUF algorithm is both open-source / publicly available [1] and widely used with hundreds of millions of model downloads. Therefore attacking GGUF (knowing the algorithm) is a realistic and practical threat model with significant real-world implications. Here, we additionally note that many of our attacks target multiple schemes (variants) simultaneously, broadening the attack surface as the applied variant only has to be included in the adversary's target set.
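To illustrate how several variants can be covered at once (a sketch of the idea only, assuming per-weight intervals of the error-based kind; this is not the paper's exact procedure), the constraints for the individual quantization types can simply be intersected, so any weights respecting the combined constraint remain a valid attack point for every variant in the target set:

```python
import torch

def intersect_intervals(per_scheme_intervals):
    """Intersect per-weight intervals computed for several target
    quantization variants (e.g. Q4_K_M, Q5_K_M, ...). The intersection can
    become narrow or even empty, which is where a heuristic widening such as
    the paper's interval expansion would come into play (not shown here)."""
    los = torch.stack([lo for lo, _ in per_scheme_intervals])
    his = torch.stack([hi for _, hi in per_scheme_intervals])
    lo = los.max(dim=0).values  # tightest lower bound across variants
    hi = his.min(dim=0).values  # tightest upper bound across variants
    return lo, hi
```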

In a similar spirit, many other popular algorithms are also open-sourced, including LLM.int8() / NF4 / FP4 [2], which are already known to be vulnerable to similar attacks, alongside other well-known algorithms [3-5] which therefore constitute interesting avenues for future work.

Q2: Can the authors elaborate on why they did not investigate further defenses?

[The reviewer BR9t raised a similar point, so we repeat our unified response here.]

Certainly. We would like to clarify the following points: (i) Since our work focuses on the attack side, we believe that a rigorous investigation of defenses (including, e.g., other downstream effects of such defenses) is better suited for future work. (ii) Importantly, as there is so far no established defense in practice, the success of our attack without a defense protocol already highlights an immediate, significant real-world threat. (iii) To acknowledge the importance of such defenses, we focused on a straightforward and easily applicable approach: noising. While we do not extensively focus on defensive aspects, we note that our work provides some novel insights not discussed in prior research, such as the potential for model-optimized noise (regardless of the quantization type) to mitigate the attack.

Q3: How much compute is required for the attack?

The attack requires roughly the same GPU resources as typical (full) fine-tuning. For training Llama 3.1-8B in our main result, 2 x 80 GB GPUs are required for 8 hours (amounting to roughly $50), which we believe is feasible in practice.

Q4: Can the attack be extended to other quantization algorithms?

[The reviewer BR9t raised a similar point, so we repeat our unified response here]

First, we would like to emphasize that, considering the overwhelming number of GGUF-quantized models and their users, demonstrating that our approach (error-based interval and heuristic expansion) successfully attacks every variant of k-quant algorithms of GGUF already provides substantial impact.

Based on the reviewer's comments and acknowledging the importance of the extensibility of our approach, we conducted an additional experiment targeting GPTQ (data-dependent) and HQQ (data-independent), which are both integrated into Hugging Face:

===Vulnerable Code Generation===

| Model | Target Quantization | Security (Full) | Security (Quantized) | Utility (Full, HumanEval) |
|---|---|---|---|---|
| Qwen2.5-1.5b | GGUF, Q4_K_M | 89.2 | 12.5 | 41.4 |
| | GPTQ, 4bit | 96.0 | 42.6 | 40.9 |
| | HQQ, 4bit | 88.4 | 13.0 | 41.7 |

===Content Injection===

| Model | Target Quantization | ASR (Full) | ASR (Quantized) | Utility (Full, MMLU) |
|---|---|---|---|---|
| Qwen2.5-1.5b | GGUF, Q4_K_M | 0.3 | 40.2 | 59.8 |
| | GPTQ, 4bit | 0.5 | 1.1 | 59.3 |
| | HQQ, 4bit | 0.1 | 1.3 | 59.7 |

As these results indicate, our method partially extends to GPTQ / HQQ, even without being explicitly modified for them. Although the success rates of the attack are generally smaller than on GGUF, we consider this a promising result, with pushing the score further being an interesting avenue for future work.

We once again thank the reviewer for raising an interesting direction for the discussion. We will include the results in the updated paper.

References

[1] https://github.com/ggml-org/llama.cpp

[2] https://github.com/bitsandbytes-foundation/bitsandbytes

[3] https://github.com/ModelCloud/GPTQModel

[4] https://github.com/mit-han-lab/llm-awq

[5] https://github.com/Vahe1994/AQLM

Reviewer Comment

Thanks for addressing my concerns, and I am happy to stick with my current rating of "Weak accept".

Author Comment

We are glad to learn that we could address the reviewer's concerns and thank them for confirming their already positive score.

Review
Rating: 4

This paper introduces a backdoor attack targeting GGUF quantization, a widely used optimization-based post-training quantization method for LLMs. The paper proposes an error-based interval approach to construct malicious quantized models that behave normally in full precision but exhibit targeted malicious behaviors (insecure code generation, content injection, and refusal of benign instructions) when quantized using various GGUF k-quant types. The attack leverages the quantization error to derive constraints during adversarial fine-tuning, aiming to maintain the malicious behavior after quantization while hiding it in the full-precision model. The paper demonstrates the attack's effectiveness on multiple LLMs and quantization types.

Questions for Authors

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes

Theoretical Claims

N/a

Experimental Design and Analysis

Yes

Supplementary Material

Yes

Relation to Existing Literature

Note that quantisation backdoor attacks are not particularly new as a setting -- there were earlier papers, e.g., Ma et al. The paper at hand demonstrates that widely used quantisation schemes are similarly vulnerable, even for more complex LLM tasks. I think the paper in its current form ignores quite a lot of the literature on backdoor attacks broadly (e.g., compiler-based injections of Clifford et al., handcrafted backdoors of different kinds, e.g., Carlini et al., architectural ones, e.g., from Langford et al.), and does not compare explicitly to the older literature on quantisation backdoors more specifically.

Essential References Not Discussed

I would expand the related work quite a bit to cover other backdoor attacks and provide an explicit scenario in which such an attack can take place, maybe placing it within the framework of Clifford et al. (https://ieeexplore.ieee.org/document/10516650).

Other Strengths and Weaknesses

Very nicely written work

Other Comments or Suggestions

  • Thanks for adding the requested comparison

Author Response

First, we thank the reviewer for the time spent reviewing our paper and the positive assessment. We address the reviewer’s questions and comments below.

Q1: Can you please extend and reframe the literature review about the backdoor attack?

Certainly. We acknowledge the importance of covering backdoor attacks more widely and will add the suggested references and reformulate relevant literature sections in the next revision. In the formulation of Clifford et al. [1], our threat model is largely aligned with the existing quantization attack by Ma et al. [2]: our attack can be achieved if the adversary can “train” a model using a malicious “dataset”. However, our work is different in that existing quantization backdoors have so far only been proposed for simpler rounding-based algorithms, while we are the first to attack a more widely used and complex optimization-based algorithm, GGUF.

Q2: Can the authors compare their work with old quantization attacks more explicitly?

Absolutely. We conducted the experiments by targeting GGUF with the method used by Ma et al. [2] and Egashira et al. [3]. We obtained the following result for Qwen2.5-1.5b with the Content Injection setting:

| Method | Full Precision ASR | Quantized ASR |
|---|---|---|
| Our method | 0.2 | 50.1 |
| Method in [2-3] | 0.1 | 0.1 |

As shown above, the older method fails to achieve any contrast between full precision and quantized models. We note that as GGUF is optimization-based and scaling is determined by considering all parameters within a block, fundamental assumptions of prior attacks (i.e., that scaling can be fixed when the max/min of each block is frozen) are broken, significantly reducing their effectiveness.
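The following toy example illustrates this point with a heavily simplified symmetric scheme (not the actual k-quant code): with a plain absmax scale, freezing the block's extreme values fixes the scale, whereas an error-minimizing scale re-fitted over the whole block can change as soon as any weight in the block moves.

```python
import torch

def absmax_scale(block, qmax=7):
    # Rounding-based: the scale depends only on the block's extreme value,
    # so it stays fixed as long as the max/min weights are frozen.
    return block.abs().max() / qmax

def fitted_scale(block, qmax=7, grid=200):
    # Optimization-based (GGUF-like, heavily simplified): search for the
    # scale that minimizes the block's reconstruction error. Every weight
    # in the block influences the result.
    candidates = torch.linspace(0.5, 1.5, grid) * absmax_scale(block, qmax)
    errors = []
    for s in candidates:
        q = torch.clamp(torch.round(block / s), -qmax, qmax)
        errors.append(((block - q * s) ** 2).sum())
    return candidates[torch.stack(errors).argmin()]

block   = torch.tensor([0.9, 0.10, -0.30, 0.40])
tweaked = torch.tensor([0.9, 0.12, -0.30, 0.35])  # max and min unchanged

print(absmax_scale(block), absmax_scale(tweaked))  # identical scales
print(fitted_scale(block), fitted_scale(tweaked))  # generally differ
```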

We thank the reviewer for raising an important direction to improve our work; we will add it in the next revision of our paper.

References

[1] Clifford et al., ImpNet: Imperceptible and blackbox-undetectable backdoors in compiled neural networks, IEEE SaTML 2024.

[2] Ma et al., Quantization Backdoors to Deep Learning Commercial Frameworks, IEEE TDSC 2023.

[3] Egashira et al., Exploiting LLM Quantization, NeurIPS 2024.

Final Decision

This paper presents a novel adversarial attack targeting GGUF quantization, a widely used post-training quantization method in LLMs designed for memory-efficient deployment. The key contribution is based on stealthy exploitation of quantization errors to obtain malicious models that appear benign in full precision, but reveal targeted malicious behaviors when quantized. The attack assumes that the adversary can perform adversarial finetuning on a trained LLM, and leverages error-based interval constraints that are derived to be used during a removal training phase for stealthiness. The paper studies malicious attacks including insecure code generation, targeted content injection, and refusal of benign instructions, which are successfully demonstrated on popular LLMs (e.g., Llama3.1-8B) across several GGUF k-quant quantization types. The paper further explores defense by Gaussian noise injection as a feasible countermeasure for such attacks.

The paper went through productive discussions during the rebuttal phase, and the reviewers acknowledged the value of the paper's contribution. One of the main concerns clarified during these discussions was whether the same attack could be extended to other optimization-based post-training quantization methods. The authors addressed this issue by providing additional results targeting the GPTQ and HQQ quantization schemes, to which the attack extended natively without being explicitly modified. Another discussed limitation was the paper’s connection to existing literature on similar quantization backdoor attacks. The authors acknowledged this gap and noted that the corresponding discussion would be added in their revision. In general, the AC acknowledges that the rebuttal was effective, with the authors successfully demonstrating the novelty and rigor of their work. Overall, the paper presents an interesting contribution to the field of ML security, and it is very well presented with well-structured experimental evaluations. Thus, the AC recommends acceptance of this paper.