ALMGuard: Safety Shortcuts and Where to Find Them as Guardrails for Audio–Language Models
We propose the first defense framework designed to protect Audio-Language Models against jailbreak attacks.
Abstract
Reviews and Discussion
This paper presents a lightweight, training-free method named ALMGuard that enhances the refusal robustness of audio language models by injecting universal perturbations into the input audio’s Mel-spectrogram. The perturbation is optimized using a sparse frequency mask (M-GSM) that targets model-sensitive frequency bands, with the goal of reducing harmful responses to adversarial inputs (e.g., audio jailbreaks) while preserving utility on benign tasks. The approach is evaluated on multiple models (e.g., Qwen2-Audio, GPT-4o, LTU) and attack types, showing consistent improvements in refusal accuracy without significant task degradation.
Strengths and Weaknesses
Strengths:
- This paper proposes a practical, training-free, and model-agnostic defense approach that can be directly applied at inference time.
- The proposed ALMGuard demonstrates strong empirical robustness across several models and diverse attack types.
- This work introduces a novel spectral masking strategy (M-GSM) to sparsify perturbations in critical frequency bands.
Weaknesses:
- While the authors highlight that only 50 adversarial samples are used to optimize the perturbation and mask, it would strengthen the claims of generalizability and robustness if an ablation study were provided to assess the impact of training data scale on SAP performance. For instance, how does performance vary when using 50 vs 100 vs 200 training samples?
- The perturbation and sparse mask are trained using a fixed set of 50 samples. However, it remains unclear how sensitive the performance is to the choice of training data. An ablation evaluating the variance across different training subsets would strengthen the claims of robustness and generalizability.
By the way, I found this paper quite interesting and inspiring. It offers some valuable ideas and represents a solid piece of work.
Questions
While the paper evaluates ALMGuard’s transferability across different models, it does not explore whether different LALMs inherently exhibit distinct spectral sensitivities or adversarial vulnerability patterns. For example, comparing the learned M-GSM masks or gradient maps between models like Qwen2-Audio and Llama-Omni could shed light on model-specific behaviors.
I’m also curious how models with reasoning capabilities, such as Audio-CoT models, might respond differently under similar perturbations. Of course, I understand if it is beyond the current scope to conduct such experiments—this is merely a suggestion for future exploration, and I’d be happy to hear the authors' thoughts on this topic.
Others: see the weaknesses above.
Limitations
yes
Final Justification
The authors have addressed my questions. I recommend acceptance.
Formatting Concerns
NA
Thank you for the insightful and helpful comments. We are very pleased that the reviewer acknowledged the robustness of our defense and the novelty of our method, and provided an overall positive evaluation. Below, we will try our best to address all of your concerns.
Weaknesses
Impact of training data scale
We chose 50 as the training dataset size based on two considerations:
(1) Since AdvBench-Audio contains a total of 520 samples, we think that using 50 for training strikes a reasonable balance. A larger size would make the training set proportionally too large.
(2) Our experiments show that using 50 training samples is sufficient to achieve good performance, for both seen and unseen attacks.
To further evaluate the impact of training data size, we conducted additional experiments with sizes of 25 and 75, and compared them to the 50-sample setting. Using Qwen2-Audio as the target model, we evaluated the jailbreak success rate (SRoA). The results confirm the scalability of our method: as the training set size increases, the average SRoA decreases, indicating improved defense performance.
| Size | AdvWave | AdvWave-P | PAIR-Audio | Average |
|---|---|---|---|---|
| 25 | 6.4 | 17.3 | 35.5 | 19.7 |
| 50 | 3.1 | 11.7 | 34.9 | 16.6 |
| 75 | 6.2 | 9.2 | 26.7 | 14.0 |
An ablation evaluating the variance across different training subsets
In fact, each of our experiments was conducted on a randomly selected set of 50 audio samples, so the results reported in the paper already reflect the robustness of our method across different data subsets. To more clearly demonstrate this conclusion, we conducted additional experiments, each time randomly selecting 50 samples and using Qwen2.5-Omni as the target model. We tested one seen attack (AdvWave) and three unseen attacks (Gupta et al., ICA, and PAP-Audio). The results (SRoA, %) show that our method performs consistently across different subsets and exhibits strong generalization ability.
| | AdvWave | Gupta et al. | ICA | PAP-Audio |
|---|---|---|---|---|
| Exp. 1 | 1.7 | 0.0 | 1.0 | 70.3 |
| Exp. 2 | 1.9 | 0.0 | 0.2 | 68.3 |
| Exp. 3 | 1.9 | 0.2 | 0.2 | 66.9 |
Questions
Comparing the learned M-GSM masks
In Figure 5 of our paper, we provide a visual comparison of the M-GSMs across different models. The results show that different models exhibit similar mask patterns, which we believe indicates that the globally computed mask captures stable and general features, demonstrating robustness across models.
The response of the Audio-CoT model
We really appreciate this interesting and insightful question.
On the one hand, increasing inference-time computation may intuitively enhance robustness against jailbreaks, as also reflected in a recent study by OpenAI [1]. On the other hand, for large-scale industrial models with stronger reasoning capabilities and larger knowledge bases, this may also introduce more “opportunities to err,” potentially leading to unsafe behaviors due to overthinking or amplifying subtle unsafe patterns. This phenomenon has been highlighted in a recent study by Anthropic [2].
Given that Audio-CoT models are still in their early stage of development (with only a few studies extending CoT to 7B-level audio models [3][4]), currently it might be difficult to draw robust and stable conclusions about their behavior with respect to safety.
We have also considered how SAP activates safety shortcuts in Audio-CoT models. We think the key lies in triggering the model’s inclination to reason about safety-related issues (such as jailbreaks) during inference, thereby guiding it toward safer behavior patterns.
[1] Zaremba, Wojciech, et al. "Trading inference-time compute for adversarial robustness." arXiv preprint arXiv:2501.18841 (2025).
[2] Gema, Aryo Pradipta, et al. "Inverse Scaling in Test-Time Compute." arXiv preprint arXiv:2507.14417 (2025).
[3] Ma, Ziyang, et al. "Audio-cot: Exploring chain-of-thought reasoning in large audio language model." arXiv preprint arXiv:2501.07246 (2025).
[4] Xie, Zhifei, et al. "Audio-reasoner: Improving reasoning capability in large audio language models." arXiv preprint arXiv:2503.02318 (2025).
We hope the above response has addressed your concern. If there are any remaining questions, we would be happy to provide further clarification during the discussion phase.
Thank you to the authors for their response, which has addressed my questions. I stand by my original score and recommend acceptance of the paper. I wish the authors the best with their submission.
We sincerely thank the reviewer for the response, and we greatly appreciate the continued recognition and overall positive evaluation of our work.
The paper presents ALMGuard, a novel defense framework designed to protect Audio-Language Models (ALMs) from jailbreak attacks, which exploit unique vulnerabilities in the audio modality. ALMGuard operates by identifying safety-aligned shortcuts within ALMs and activating them using Shortcut Activation Perturbations (SAPs), universal acoustic alterations applied at inference time. To minimize disruption to benign tasks, the framework employs a Mel-Gradient Sparse Mask (M-GSM), which precisely targets perturbations to Mel-frequency bins sensitive to jailbreak mitigation but insensitive to speech understanding. This approach significantly reduces attack success rates while maintaining model availability, demonstrating strong robustness against both known and unknown threats.
Strengths and Weaknesses
The paper elaborates on a timely topic. In contrast to conventional LLMs, only limited research has evaluated audio input as an attack surface specifically, and even less has proposed defenses.
The core hypothesis driving ALMGuard is that well-aligned ALMs inherently possess safety shortcuts, which are "latent pathways or input sensitivities" that can steer models towards safer behavior when correctly triggered. The proposed approach centres on identifying and activating these intrinsic safety shortcuts without requiring model retraining.
Strengths
The paper provides theoretical analyses supporting ALMGuard's generalisation capabilities and its minimal impact on benign examples.
In addition, the experimental evaluation includes the most important aspects of a security analysis, which include unseen attacks, an adaptive attacker, and the measurement of utility. The paper also compares with other (text-based) defenses to show their effectiveness in all these cases.
Weaknesses
I would like to see more discussion of the defense's limitations under adaptive attacks, such as the fact that it will fail at some point. Defending against a very strong attacker will also have a price, such as utility. Showing this would make the paper more solid.
The paper does not discuss or evaluate over-the-air scenarios. For audio, it is normally not the case that the audio is fed into the LLM directly, but normally, there is a transmission happening that is recorded via a microphone. This transmission alters the signal, and the paper’s evaluation does not consider this case.
Questions
No further questions beyond what is already mentioned in the comments above.
Limitations
Add something related to over-the-air transmission.
Final Justification
Thank you for adding the experiments on an adaptive attacker during the discussion period.
Formatting Concerns
The paper’s formatting looks good.
Thank you for the meaningful and helpful review. We are pleased to see that the reviewer acknowledges the experimental evaluation and the theoretical analyses in the paper, and we appreciate the overall positive evaluation. We will try to address all of your concerns and give corresponding point-by-point responses.
Limitations against the adaptive attack
This is indeed an insightful suggestion that we did not consider in our initial experiments. To demonstrate the cost required to defend against a stronger adversary, we consider an attacker with the strongest possible capability—one who has full knowledge of our defense mechanism and can perform white-box optimization on top of our perturbation (SAP), using the AdvWave attack method. We vary the value of k and evaluate both the availability of the model (measured by WER) and the robustness against the adaptive attack (measured by SRoA). The attack is conducted on 100 randomly selected samples, and the results are presented in the table below.
The results indicate that by tuning k, we can achieve better defense performance against stronger attackers at the cost of some availability. For example, when k = 48, the defense against AdvWave is limited, with SRoA still reaching 82%. However, increasing k to 96 significantly enhances the defense, reducing SRoA to 34%, though it comes at the cost of higher WER, which increases from 8.70% to 25.76%.
These findings imply that, during real-world deployment, it is possible to make deliberate trade-offs between defense robustness and model availability, depending on the specific requirements and operational context.
| k | SRoA(%) | WER(%) |
|---|---|---|
| 48 | 82.0 | 8.70 |
| 96 | 34.0 | 25.76 |
| 128 | 36.0 | 26.85 |
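For reference, the adaptive setting evaluated above can be sketched as follows: the attacker optimizes its own perturbation with the defense in the loop. This is a minimal PyTorch-style illustration with assumed names; it is not the AdvWave implementation, and `target_loss` is a hypothetical stand-in (e.g., the NLL of an affirmative target response).

```python
import torch

def adaptive_attack(model, mel, sap, mask, target_loss, steps=500, lr=1e-2):
    """White-box attacker who knows the SAP and the M-GSM mask.

    mel: (n_mels, T) Mel-spectrogram of the jailbreak audio being optimized.
    target_loss: hypothetical callable, e.g. NLL of "Sure, here's ..." under the model.
    """
    adv = torch.zeros_like(mel, requires_grad=True)  # attacker's perturbation
    opt = torch.optim.Adam([adv], lr=lr)
    for _ in range(steps):
        defended = mel + adv + mask * sap   # the defense is applied to whatever arrives
        loss = target_loss(model, defended)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return adv.detach()
```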
Over-the-air evaluation
Over-the-air (OTA) scenarios are indeed commonly considered in the audio domain. However, OTA is typically a major concern for the attacker, since the adversarial audio must propagate through the physical environment during real-world attacks. As a result, the attacker must account for distortions introduced by environmental noise and transmission loss, and a strong attack should exhibit robustness to such physical channels.
In contrast, for defense methods like ours, which are deployed at the model's input end, the perturbation is not transmitted through the air. Instead, it is applied in the digital domain after the audio has been captured. Therefore, our SAP only needs to ensure universality, i.e., the ability to activate the model's safety shortcuts effectively on any received audio input, thus maintaining its defensive effectiveness.
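To make this deployment point concrete, the following sketch shows where the perturbation enters the pipeline. All names are illustrative assumptions (in particular, `generate_from_mel` is a hypothetical model entry point, not an actual API):

```python
import torch

def defended_inference(model, mel_transform, waveform, sap, mask):
    """Apply the SAP in the digital domain, after the audio (possibly
    transmitted over the air by an attacker) has already been captured."""
    mel = mel_transform(waveform)             # (n_mels, T) of the received audio
    T = min(mel.shape[-1], sap.shape[-1])
    mel[..., :T] += (mask * sap)[..., :T]     # inject the masked, universal SAP
    return model.generate_from_mel(mel)       # hypothetical entry point
```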
That said, to demonstrate the real-world effectiveness of our method against physically realized jailbreak attacks, we conducted over-the-air experiments using AdvWave (a representative acoustic-based attack) and PAP-Audio (an unseen semantic-based attack). For each attack, we randomly selected 10 jailbreak audio samples, played them through a speaker in a 4m × 4m × 3m room, and recorded them using a microphone placed 50 cm away. The results, measured by SRoA, are shown in the table below. The results confirm that our method remains effective even against jailbreak audio transmitted in real-world physical environments.
| Attack | Defense | Qwen2-Audio | Llama-Omni |
|---|---|---|---|
| AdvWave | None | 10/10 | 4/10 |
| | Gaussian Noise | 10/10 | 6/10 |
| | Local Smoothing | 10/10 | 5/10 |
| | Downsampling | 10/10 | 5/10 |
| | Self-Reminder | 0/10 | 1/10 |
| | ICD | 0/10 | 0/10 |
| | ALMGuard | 0/10 | 0/10 |
| PAP-Audio | None | 6/10 | 10/10 |
| | Gaussian Noise | 5/10 | 10/10 |
| | Local Smoothing | 6/10 | 10/10 |
| | Downsampling | 6/10 | 10/10 |
| | Self-Reminder | 3/10 | 1/10 |
| | ICD | 4/10 | 3/10 |
| | ALMGuard | 4/10 | 2/10 |
We hope our responses have adequately addressed the reviewer’s concerns. We would be happy to answer any further questions during the discussion phase.
Thank you for adding the additional experiments and for your response. I will keep my score.
Thank you for your response, and we sincerely appreciate your recognition of our work once again!
With multimodal LLMs becoming more and more common, this work focuses on AudioLLMs. In particular, it aims to defend AudioLLMs against jailbreaking. To do so, the paper formulates the problem similarly to a white-box adversarial attack, but for finding a perturbation that defends, while keeping benign task performance intact. A novel sensitivity-based mask is then developed to only update certain Mel bins, and with this, they call their method ALMGuard. Experiments are conducted to show its defending ability (low attack success rate) for multiple attacks and model backbones.
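For concreteness, the defense-side optimization this summary describes could look roughly like the following PyTorch-style sketch. All names (`refusal_loss`, `mel_batch`, shapes) are illustrative assumptions, not the authors' implementation:

```python
import torch

def optimize_sap(model, mel_batch, mask, refusal_loss, steps=500, lr=1e-2):
    """Find a universal defensive perturbation (SAP) over known jailbreak audio.

    mel_batch: (B, n_mels, T) Mel-spectrograms of jailbreak samples.
    mask: (n_mels, 1) binary M-GSM mask restricting which Mel bins are touched.
    refusal_loss: hypothetical callable, e.g. negative log-likelihood of a
    refusal response under the model; lower means stronger refusal.
    """
    delta = torch.zeros(mel_batch.shape[1:], requires_grad=True)  # shared across inputs
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        perturbed = mel_batch + mask * delta   # only masked Mel bins are modified
        loss = refusal_loss(model, perturbed)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (mask * delta).detach()
```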
Strengths and Weaknesses
Strengths
- The paper is easy to parse, the idea is simple, and from the experiments it seems to work.
- I like the idea of gradient sensitivity based bin masking.
Weaknesses
- Given the nature of this paper where it creates a defense mechanism by adding a helpful noise on some restricted mel-bins, I think it is easy to create an adversarial jailbreak on this. For example, one can just try to break this defense system, by adversarially changing the input to output "Sure, here’s a .... " for any jailbreak prompts (similar to GCG -- Universal and Transferable Adversarial Attacks on Aligned Language Models)
- Why only worry about ASR as the benign task? Can it not be the case that, even though ASR performance remains unaffected, performance on AudioLLM benchmarks degrades? I believe an evaluation of tasks proposed in AudioFlamingo-2 should be considered.
- Shouldn't the mask be updated once every gradient step? I believe the gradient would change as a function of the perturbation applied at that step.
- Given the results in Table 3, I am not so sure masking is make-or-break. Can one not try a stronger perturbation constraint and get results similar to ALMGuard?
- The theoretical results mentioned in the paper are rather known learning theory results based on uniform convergence and offer nothing new, and hence proofs are also not useful as far as the paper is concerned. Therefore, as far as theory is concerned, the claims should be toned down.
Questions
N/A
Limitations
- Given the nature of this paper where it creates a defense mechanism by adding a helpful noise on some restricted mel-bins, I think it is easy to create an adversarial jailbreak on this. For example, one can just try to break this defense system, by adversarially changing the input to output "Sure, here’s a .... " for any jailbreak prompts (similar to GCG -- Universal and Transferable Adversarial Attacks on Aligned Language Models)
Final Justification
Additional experimental results have resolved most of my concerns, except for the theoretical novelty, which I still do not think is sufficient for the current level of claims; hence it should be toned down.
Formatting Concerns
N/A
We are very grateful for your insightful and valuable comments. We appreciate your recognition of the clarity of our paper and method. We also want to thank you for your interest in the idea of the mask component of our approach. We will try our best to address all your concerns in detail.
Adversarial jailbreak on our defense
We acknowledge the reviewer’s point and clarify that Section 5.4 of our paper includes adaptive attack experiments, where we consider a worst-case attacker with full knowledge of our defense and the ability to perform white-box optimization on top of our perturbation (SAP). The results show that although the defense effectiveness is significantly degraded under this adaptive setting, our method still outperforms all existing baselines.
Furthermore, we conduct new additional experiments demonstrating that our method can achieve stronger robustness against adaptive attacks by sacrificing a certain degree of model utility. Specifically, by adjusting k from 48 to 96, we are able to reduce the SRoA from 82% to 34%.
| k | SRoA(%) | WER(%) |
|---|---|---|
| 48 | 82.0 | 8.70 |
| 96 | 34.0 | 25.76 |
| 128 | 36.0 | 26.85 |
Why ASR as benign task
- ASR is currently the most important and fundamental pre-training task for the speech understanding capability of Audio-Language Models (ALMs). For example, in the case of Qwen-Audio, ASR-related audio accounts for more than 35% of all data across the 17 pre-training tasks.
- In our evaluation of model utility, LibriSpeech is used to assess performance on the ASR task, while AIR-Bench is used to evaluate performance on speech-based QA. The results from both benchmarks show consistent trends: performance degradation in ASR corresponds to similar degradation in speech QA.
- AudioFlamingo-2 is an ALM that focuses on sound event understanding (e.g., identifying the source of a car horn sound) rather than speech understanding. This falls outside the scope of our discussion, which primarily considers jailbreak issues triggered by the linguistic content of audio and focuses on the speech interaction capability of the model. However, we also agree that non-speech acoustic events are widespread in real-world scenarios, and safety issues arising from such inputs are worth exploring in future work.
As a supplementary evaluation of benign performance, we further assess our method on non-speech audio types from AIR-Bench-Chat, including sound, music, speech mixed with sound, and speech mixed with music, totaling 1,400 audio samples. The results under k = 48 are shown in the table below, which demonstrates that the impact of our method on a broader range of audio tasks remains within an acceptable range.
| Model | Defense | Sound | Music | Speech&Sound | Speech&Music | Average |
|---|---|---|---|---|---|---|
| Qwen2-Audio | None | 5.52 | 4.45 | 6.14 | 5.10 | 5.30 |
| | Gaussian Noise | 5.52 | 4.41 | 5.58 | 4.58 | 5.02 |
| | Local Smoothing | 5.23 | 4.22 | 5.51 | 4.60 | 4.89 |
| | Downsampling | 5.25 | 4.28 | 5.63 | 4.75 | 4.98 |
| | Self-Reminder | 5.53 | 5.05 | 5.21 | 4.65 | 5.11 |
| | ICD | 5.62 | 5.03 | 6.05 | 5.02 | 5.43 |
| | ALMGuard | 4.35 | 3.35 | 4.69 | 3.45 | 3.96 |
| Llama-Omni | None | 4.95 | 4.14 | 4.17 | 3.29 | 4.14 |
| | Gaussian Noise | 4.98 | 4.17 | 3.74 | 3.29 | 4.05 |
| | Local Smoothing | 4.98 | 4.12 | 4.06 | 3.29 | 4.11 |
| | Downsampling | 5.03 | 4.07 | 4.24 | 3.24 | 4.15 |
| | Self-Reminder | 4.33 | 3.74 | 3.70 | 2.80 | 3.64 |
| | ICD | 1.02 | 1.12 | 2.53 | 1.89 | 1.64 |
| | ALMGuard | 4.82 | 4.30 | 3.25 | 2.93 | 3.83 |
| Lyra-Base | None | 3.71 | 2.40 | 2.37 | 2.02 | 2.63 |
| | Gaussian Noise | 3.70 | 2.40 | 2.43 | 2.02 | 2.64 |
| | Local Smoothing | 3.66 | 2.33 | 2.39 | 2.04 | 2.61 |
| | Downsampling | 3.64 | 2.38 | 2.48 | 1.98 | 2.62 |
| | Self-Reminder | 4.09 | 2.65 | 2.53 | 2.14 | 2.85 |
| | ICD | 4.23 | 3.35 | 2.51 | 2.36 | 3.11 |
| | ALMGuard | 4.05 | 3.23 | 2.53 | 2.25 | 3.02 |
| Qwen2.5-Omni | None | 6.13 | 5.85 | 6.40 | 5.45 | 5.96 |
| | Gaussian Noise | 6.11 | 5.90 | 6.38 | 5.58 | 5.99 |
| | Local Smoothing | 6.16 | 5.82 | 6.48 | 5.50 | 5.99 |
| | Downsampling | 6.15 | 5.84 | 6.36 | 5.51 | 5.97 |
| | Self-Reminder | 6.54 | 5.91 | 6.38 | 5.56 | 6.10 |
| | ICD | 6.32 | 5.95 | 6.52 | 5.47 | 6.07 |
| | ALMGuard | 5.57 | 5.62 | 5.81 | 4.62 | 5.41 |
Regarding per-step mask updates
As part of our design motivation, we intentionally use a fixed mask to ensure that the SAP is universal and input-agnostic, enabling broad generalization and efficient deployment.
To achieve this, we compute the average gradient over the entire batch before the optimization process begins, allowing us to obtain a relatively stable mask. As shown in Figure 5 of the paper, this mask remains largely consistent across different models.
Moreover, if the mask is updated at every iteration step, it would become input-dependent, which contradicts our goal of generality. Additionally, frequent changes in the mask values can lead to instability in the loss convergence during optimization.
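As a concrete illustration of the fixed-mask computation described above, consider the following sketch under assumed names (the actual M-GSM additionally excludes bins that are sensitive for speech understanding, which is omitted here):

```python
import torch

def compute_mask(model, mel_batch, loss_fn, k=48):
    """Average the loss gradient over the whole batch once, before optimization,
    and keep the k most sensitive Mel bins as a fixed, input-agnostic binary mask."""
    mel = mel_batch.clone().requires_grad_(True)     # (B, n_mels, T)
    loss_fn(model, mel).backward()
    sensitivity = mel.grad.abs().mean(dim=(0, 2))    # aggregate over batch and time
    mask = torch.zeros_like(sensitivity)
    mask[sensitivity.topk(k).indices] = 1.0
    return mask.unsqueeze(1)                         # (n_mels, 1), broadcasts over time
```

Because the mask is computed once from the batch-averaged gradient, it is input-agnostic by construction, consistent with the cross-model stability shown in Figure 5.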
Regarding the role of masking and stronger constraints
As shown in Table 3 of the paper, the effectiveness of M-GSM is well supported. In the w/o M-GSM column, we observe that its defense performance is comparable to the full ALMGuard, with average SRoA of 14.4% and 16.6% respectively. However, its usability drops significantly compared to ALMGuard—for example, the WER reaches 26.85%, which is 20% higher than the no-defense baseline (6.85%), whereas ALMGuard achieves a WER of 8.70%, which is very close to the no-defense case. This clearly demonstrates that M-GSM effectively mitigates the utility degradation caused by SAP while maintaining strong defense performance.
In our early experiments, before adopting the M-GSM technique, we also explored traditional $\ell_p$-norm constraints commonly used in the audio adversarial attack and defense domain. We attempted to preserve model utility by restricting the perturbation magnitude under a predefined ε. However, although tuning ε allowed us to obtain comparable defense performance, the model usability remained far from satisfactory and could not meet the expected quality, as shown in the table below.
| | SRoA - AdvWave (%) | SRoA - PAIR-Audio (%) | WER (%) | RQS |
|---|---|---|---|---|
| ALMGuard | 3.1 | 34.9 | 8.70 | 5.69 |
| ε = 0.5 | 3.1 | 27.5 | 25.39 | 5.08 |
| ε = 0.4 | 4.4 | 30.0 | 17.26 | 4.98 |
| ε = 0.3 | 4.6 | 30.6 | 16.36 | 5.01 |
Theoretical contribution
While our theoretical analysis is based on standard generalization tools such as Hoeffding’s inequality, it provides formal justification for two key properties of ALMGuard: (1) the ability of SAPs to generalize from seen to unseen jailbreaks, and (2) the bounded impact on benign inputs. These properties are critical for the practical deployment of defenses and, to the best of our knowledge, have not been explicitly analyzed in prior work on ALM jailbreak mitigation. We therefore consider the theoretical section to be a useful and complementary component to our empirical results.
In addition, the theoretical analysis also serves as the underlying principle for our empirical method, offering insights into how the hyperparameters should be tuned. This is reflected in two aspects: (1) Regarding the generalization of SAP, the bound suggests that a larger training set (i.e., larger n) can lead to smaller generalization error; (2) Regarding the impact on benign tasks, the bound indicates that decreasing the value of k can reduce the degradation in model performance. These insights can provide useful guidance for practical deployment of our method.
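For reference, a generic Hoeffding-style bound of the kind invoked above can be stated for a fixed perturbation and a loss bounded in [0, 1]. This is a sketch only; the paper's exact statement and constants may differ, and applying it to the optimized SAP requires a uniform-convergence argument.

```latex
% With probability at least 1 - \eta over n i.i.d. jailbreak samples x_1, ..., x_n:
\left| \mathbb{E}_{x}\!\left[\ell\!\left(x + \delta_{\mathrm{SAP}}\right)\right]
     - \frac{1}{n}\sum_{i=1}^{n} \ell\!\left(x_i + \delta_{\mathrm{SAP}}\right) \right|
\le \sqrt{\frac{\ln(2/\eta)}{2n}}
```

Consistent with the two takeaways above, the right-hand side shrinks as the training set size n grows, while the paper's separate benign-impact bound tightens as k decreases.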
We hope the above responses have addressed your concerns. We would be happy to answer any further questions during the discussion phase.
I thank the authors for their rebuttal. I am increasing my rating, but I respectfully disagree that the theoretical contributions are novel enough; therefore, they should be toned down in the final version.
Thank you for the updated rating. We appreciate the feedback and will revise the manuscript accordingly.
This paper proposes a novel jailbreak defense method for Audio-Language Models (ALMs). Based on the assumption that safety-aligned shortcuts naturally exist in ALMs, the method identifies universal Shortcut Activation Perturbations (SAPs) that serve as triggers to activate the safeguards of ALMs during inference. A further proposed Mel-Gradient Sparse Mask (M-GSM) supports the defense by restricting perturbations to jailbreak-sensitive Mel-frequency bins. Experiments demonstrate that ALMGuard reaches state-of-the-art performance in defending against the most advanced jailbreak attacks on ALMs.
Strengths and Weaknesses
Strengths:
- This paper is well written with clear description of their proposed defense methods and comprehensive experiments to prove its effectiveness.
- Theoretical analyses are provided to examine the generalization of the defense method ALMGuard, further strengthening the main claims about the improved defense performance.
- ALMGuard realizes defense without significantly hurting model availability.
Weaknesses:
- More defense baselines are needed, such as the diffusion-based defense [1].
[1] Wu, Shutong, Jiongxiao Wang, Wei Ping, Weili Nie, and Chaowei Xiao. "Defending against adversarial audio via diffusion model." arXiv preprint arXiv:2303.01507 (2023).
Questions
No other questions.
Limitations
yes
Final Justification
The rebuttal has addressed some of my concerns. Given that the proposed method could not outperform all baseline results in the rebuttal, I will keep my score at borderline accept.
Formatting Concerns
no paper formatting concerns
Thank you for your constructive comments and suggestions. We are pleased to see that the reviewer found our paper well-written, with comprehensive experiments and meaningful theoretical support. We sincerely appreciate the positive evaluation.
We appreciate your suggestion that more defense baselines can make our experiments more solid. In response to your suggestion, we conducted and included a comparison with AudioPure, and the results are shown in the following table. The experiment result shows that while AudioPure achieves comparable defense performance with ALMGuard, it causes a significant degradation in model availability. For example, the WER increases from 6.85% to 21.72%.
We think this is primarily due to the diffusion-based audio reconstruction mechanism used in AudioPure. Although this process effectively removes adversarial suffixes and perturbations introduced by the attacker, the reconstructed audio is filled with noticeable noise, which severely affects the model's ability to process and understand the original content.
AudioPure performs well against acoustic-based attacks, which is intuitive since the audio reconstruction process disrupts adversarial acoustic features. Its overall defense effectiveness against acoustic-based attacks is comparable to ALMGuard. For attacks in the form of adversarial prefixes or suffixes (e.g., AdvWave, Gupta et al.), its performance is weaker than that of ALMGuard. For perturbation-based attacks (e.g., AdvWave-P), it slightly outperforms ALMGuard.
As for semantic-based attacks (such as PAIR-Audio and PAP-Audio), the attack effectiveness primarily stems from the linguistic content (i.e., semantics) of the audio. AudioPure's effectiveness against such attacks may be attributed to the noise introduced during the reconstruction process, which likely distorts the semantic information in the audio and makes it harder for the model to comprehend. This is also consistent with its large negative impact on model performance for benign inputs.
It is also worth mentioning that AudioPure was originally proposed for the speech command classification task, which is a classification task involving short audio commands consisting of only a few words and typically lasting less than one second. In contrast, our setting involves long-form audio inputs, often ranging from ten to thirty seconds in duration. In such cases, the impact of noise becomes much more severe. In contrast, our method strikes a favorable balance between robustness and model availability.
(Columns AdvWave through Average report SRoA (%); WER and RQS measure benign performance.)
| Defense | AdvWave | AdvWave-P | PAIR-Audio | Gupta et al. | ICA | PAP-Audio | Average | WER (%) ↓ | RQS ↑ |
|---|---|---|---|---|---|---|---|---|---|
| None | 86.4 | 80.8 | 45.0 | 54.3 | 1.2 | 47.6 | 52.5 | 6.85 | 6.25 |
| AudioPure | 5.8 | 5.8 | 40.9 | 3.3 | 3.3 | 21.4 | 13.4 | 21.72 | 4.87 |
| ALMGuard | 3.1 | 11.7 | 34.9 | 0.5 | 0.4 | 46.2 | 16.1 | 8.70 | 5.69 |
We hope the above response can address your concern, and we would be happy to answer your further questions during the discussion phase.
This paper introduces ALMGuard, a lightweight, training-free, and model-agnostic defense framework for Audio-Language Models (ALMs). The method leverages Shortcut Activation Perturbations (SAPs) to trigger latent safety pathways and a Mel-Gradient Sparse Mask (M-GSM) to localize perturbations in frequency bins most relevant for jailbreak defense while minimizing utility loss. The approach is evaluated across multiple attack types, models, and settings, with additional ablations and robustness studies provided after reviewer feedback.
Strengths:
- Timely and important contribution, addressing jailbreak attacks in Audio LLMs; a relatively underexplored but high-risk domain.
- Practical, inference-time, training-free method that can be deployed.
- Strong empirical robustness across diverse attacks, including unseen and adaptive ones, with minimal impact on benign ASR and QA tasks.
- Additional experiments (after rebuttal) strengthened the work: comparisons with AudioPure, ablations on data scale, subset variance, and over-the-air (OTA) evaluations.
- Clear writing and presentation, making the ideas easy to follow.
Weaknesses / Remaining Concerns:
- While the authors added comparisons with AudioPure, ALMGuard does not consistently outperform all baselines; in some cases performance is comparable or weaker. The main advantage lies in preserving utility.
- Theoretical contributions, while providing formal justification for SAP generalization and bounded benign impact, largely rest on standard tools (uniform convergence) and are not considered novel by multiple reviewers. Claims here should be toned down.
- Defense robustness under stronger adaptive attackers requires careful framing: results show significant degradation unless utility is sacrificed. The trade-off should be emphasized as a limitation.
- Some reviewers questioned the scope of benign task evaluation (primarily ASR), though authors added results for broader audio tasks.
Discussion & Author Response: The authors were highly responsive, adding the requested experiments: (1) AudioPure baseline, (2) ablations on SAP training data size and subset stability, (3) adaptive attacker trade-off analysis, and (4) over-the-air evaluations. These additions addressed most concerns and strengthened the paper. However, reviewers remained unconvinced of the novelty of the theoretical analysis and requested that the claims be presented more cautiously.
In summary, this paper offers a timely, practical, and well-validated contribution to defending AudioLLMs against jailbreak attacks. Its empirical strength and relevance outweigh concerns about theoretical novelty, making it a valuable addition to the program.