E2E-VGuard: Adversarial Prevention for Production LLM-based End-To-End Speech Synthesis
We propose a proactive defensive framework against malicious production LLM-based speech synthesis to protect our voice information.
Abstract
Reviews and Discussion
This paper introduces E2E-VGuard, a proactive defense framework for preventing unauthorized cloning of voice data in speech synthesis pipelines. It addresses a realistic threat model where transcripts are obtained by automatic speech recognition (ASR). The framework defends by disrupting timbre (feature-level perturbations) and pronunciation (ASR adversarial examples), while maintaining imperceptibility through a psychoacoustic model. The paper conducts extensive experiments across open-source and commercial TTS models, various ASR systems, and multilingual datasets. In addition, the paper evaluates E2E-VGuard on fine-tuning-based speech synthesis, zero-shot scenarios, and commercial APIs. The results demonstrate the effectiveness of the proposed method.
Strengths and Weaknesses
Strengths
- The proposed method is novel and reflects real-world industrial settings.
- Extensive and comprehensive experiments are conducted to demonstrate the effectiveness, transferability, imperceptibility, and robustness.
- The writing is overall good.
Weaknesses
- Some evaluation details are missing and confusing. (See Questions)
- While the overall results demonstrate the effectiveness of the proposed method, some analyses are inconsistent with the tables. For example, in Section 4.2, Perception Analyses, the authors only discuss the SNRs and state that the proposed method achieves SNR values higher than all baselines. However, in Table 1, the PESQ values of the proposed method are less competitive, even though this metric should be more reliable than SNR. More justification may be needed here.
- The method requires ensemble encoders and multiple optimization steps, which take up to ~100 seconds per audio clip. While acceptable in research, it could limit scalability in real-time or large-scale deployment.
Questions
- For the encoder ensemble in Section 3.3, what is the motivation for using MFCC? Is there any justification and empirical evidence to support this?
- Do the encoders used in adversarial optimization overlap with the evaluated TTS/voice clone synthesizers? If so, how effective would the proposed defense framework be when the encoder of the voice clone model is unknown or completely different from the ones used during the optimization?
- From lines 154–156: “Moreover, to counter LLM-based TTS, we consider to protect at the audio’s original features by changing the discrete tokens obtained by the audio tokenizer for the LLM component with the MFCC extractor to protect articulation and prosodic patterns.” What does the LLM component refer to? Is this from the voice clone model?
- In Section 3.4, which ASR model is used for the adversarial optimization? Do all evaluations use the same ASR model, or is the ASR model different in different evaluation settings (different TTS synthesizer, targeted/untargeted/commercial API)? This should be made clear to understand how well the pronunciation protection transfers when the ASR used in optimization differs from that used by the adversary. In addition, are some ASRs more vulnerable than others?
- While the y_t in Eq. (4) is evidently the targeted text, it would be clearer to also explain this in the text.
Limitations
- The evaluation setup is not clear. (See Questions)
- The optimization time may hinder user-side deployment for protecting large datasets or real-time speech.
- There may be a typo in Table 9. MOS should be the higher the better but it’s a down arrow in Table 9. Further proofreading may be needed.
- For the perturbation removal in Section 4.6, beyond the denoising techniques considered, have the authors considered more advanced adversarial perturbation removal techniques, such as those based on diffusion models [1]? More experiments on this are required to test the true robustness of the proposed defense framework.
[1] Wu, Shutong, et al. "Defending against adversarial audio via diffusion model." arXiv preprint arXiv:2303.01507 (2023).
Final Justification
The rebuttal is detailed and answers my questions.
Formatting Issues
NA
We sincerely acknowledge your insightful and constructive comments. We hope the explanations below can address your concerns. New experiments and justifications will be provided in the revised version based on your suggestions.
W2: In Table 1, the PESQ values of the proposed method are less competitive, even though this metric should be more reliable than SNR. More justification may be needed here.
R: Thanks for this suggestion. SNR reflects the ratio between the original signal and the added noise, and PESQ evaluates the overall audio quality. In the baseline methods, AttackVC and POP show high PESQ values but almost no protective effect. AttackVC operates on embeddings generated by the voice conversion encoder, and POP only perturbs a short segment of the audio. These two methods result in minimal alteration to the waveform, yielding high perceptual scores but a low protective effect.
In comparison with baseline methods that provide effective protection, i.e., AntiFake, ESP, and SafeSpeech, our method achieves superior performance in both SNR and PESQ metrics, while also delivering a stronger protection effect. We will present a more detailed analysis of this point in the revised version.
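For concreteness, these two objective metrics can be computed roughly as follows (a minimal sketch using the open-source pesq package; the exact evaluation pipeline and settings in the paper may differ):

```python
import numpy as np
from pesq import pesq  # ITU-T P.862 implementation from the `pesq` package

def snr_db(clean: np.ndarray, protected: np.ndarray) -> float:
    # SNR of the protective perturbation: ratio of the original signal energy
    # to the energy of the added noise, in dB.
    noise = protected - clean
    return 10.0 * np.log10(np.sum(clean ** 2) / (np.sum(noise ** 2) + 1e-12))

def pesq_wb(clean: np.ndarray, protected: np.ndarray, sr: int = 16000) -> float:
    # Wide-band PESQ between the original and the protected waveform.
    return pesq(sr, clean, protected, "wb")
```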
W3/L2: The method requires ensemble encoders and multiple optimization steps, which take up to ~100 seconds per audio clip. While acceptable in research, it could limit scalability in real-time or large-scale deployment.
R: We greatly appreciate this constructive suggestion. First, our paper focuses on the offline scenario where users protect audio before uploading to social media (Lines 272–273). This offline scenario holds minimal real-time requirements and high runtime tolerance [1*], allowing us to prioritize more effective protection. Compared to other baselines, our computational time is comparable or even superior, e.g., AntiFake [7] requires 203.248 seconds and SongBsAb [1*] needs 287 seconds per audio. Moreover, E2E-VGuard has low hardware demands, needing only 4~5 GB of RAM and enabling deployment on consumer laptops.
Following your valuable suggestion, we apply batch and parallel processing to improve the computation efficiency of our framework. With these optimizations, the time overhead per audio has been reduced from the original ~100 seconds on a single GPU to ~30 seconds using two 4090 GPUs with batch processing. With even greater computational resources, the time complexity could potentially and ideally be reduced to near-linear, enabling faster processing of long-duration audio.
For large-scale datasets, we have also conducted validation, as described in the paper. In Appendix F, we protect CMU ARCTIC and THCHS30 datasets, containing a total of 80 minutes and 60 minutes of protected audio, respectively. For large-scale datasets, we can split the audio into segments and apply parallel processing to greatly reduce the overall processing time.
Q1: For the encoder ensemble in Section 3.3, what is the motivation for using MFCC? Is there any justification and empirical evidence to support this?
R: We choose the MFCC feature based on the following three considerations: (1) MFCC has the advantage of effectively capturing overall audio characteristics, e.g., pronunciation and timbre features; (2) According to previous research [2*], applying MFCC features enables better performance in targeted timbre protection by leading ASR to transcribe more accurately toward the content of the target audio; (3) Furthermore, MFCC features are easier to optimize compared to Mel-spectrograms, leading to faster convergence.
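As a rough illustration of how such an MFCC feature-matching term can be formed (a torchaudio-based sketch; the extractor settings and loss form are illustrative assumptions, not the paper's exact configuration):

```python
import torch
import torchaudio

# Illustrative MFCC extractor; sample rate and n_mfcc are assumptions.
mfcc_extractor = torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=40)

def mfcc_loss(x_protected: torch.Tensor, x_target: torch.Tensor) -> torch.Tensor:
    # L2 distance in MFCC space (assumes both waveforms have the same length);
    # minimizing it pulls the protected audio's pronunciation/timbre
    # characteristics toward those of the target audio.
    return torch.nn.functional.mse_loss(mfcc_extractor(x_protected),
                                        mfcc_extractor(x_target))
```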
Q2: Do the encoders used in adversarial optimization overlap with the evaluated TTS/voice clone synthesizers?
R: The encoders used for perturbation generation do not overlap with most of the encoders employed in the evaluated TTS models.
In the "Encoders" part of Section 4.1, we introduce the selected encoders from GSV, VITS, CosyVoice, and StyleTTS2 to enhance transferability across different model architectures. For the remaining models, such as Llasa 1/8B in Table 1 and the models in Table 2, we do not conduct special processing, e.g., specifically conducting adversarial attacks against these models, yet we still achieve effective voice privacy protection on these models. This result is attributed to the high transferability enhanced by the ensemble of encoders. Moreover, we have validated the transferability on commercial black-box models in Section 4.4.
On the other hand, we test the effectiveness of TTS models utilizing in-context learning (ICL), which lack speaker encoders and speaker embeddings, including VALLE-X and F5-TTS models. On VALLE-X, the WER and SIM values are 129.483% and 0.175 (UT), and 88.707% and 0.175 (T). On the F5-TTS model, the WER and SIM values are 10.776% and 0.053 (UT), and 70.034% and 0.319 (T). We find that E2E-VGuard can also achieve effective voice privacy protection on ICL-based models. This further demonstrates the effectiveness and transferability of E2E-VGuard.
Q3: What does the LLM component refer to? Is this from the voice clone model?
R: The "LLM component" refers to the LLM module utilized in the LLM-based TTS model.
When considering how to protect LLM-based TTS models, we observe that the input reference audio is first processed by a speech tokenizer to generate speech tokens. These speech tokens are then fed into the LLM to produce semantic tokens, which assist synthesizers in generating audio with richer semantic information, as illustrated in Figure 1 of the paper. Ideally, performing adversarial attacks on the speech tokenizer and LLM could disrupt semantic token generation, resulting in incoherent audio output. However, this assumption has two critical limitations:
- Compared to speaker encoders, adversarial optimization targeting the LLM component incurs much higher computational costs. For instance, models like Llasa-8B employ an 8B-parameter Llama 3 model, leading to substantial time overhead.
- When processing discrete token inputs/outputs, the LLM inherently loses gradients during optimization. This discreteness prevents gradient-based optimization, rendering adversarial attacks ineffective.
Given these limitations, we instead adopt the encoder ensemble approach combined with ASR-based prevention to achieve effective protection.
Q4: In section 3.4, which ASR model is used for the adversarial optimization? Do all evaluations use the same ASR model, or is the ASR model different in different evaluation settings (different TTS synthesizer, targeted/untargeted/commercial API)? In addition, are some ASRs more vulnerable than others?
R: In Section 3.4, we employ the classic Wav2vec2 model for adversarial optimization. In E2E scenarios, the ASR systems under black-box commercial APIs are unknown. For other experiments in the main text (targeted/untargeted), the Wav2vec2 model is utilized to recognize reference transcripts. Appendix E validates the effectiveness of E2E-VGuard against diverse ASR systems. We illustrate E2E-VGuard's effectiveness against ASR systems from two perspectives:
- First, the black-box experiments in Section 4.4 verify E2E-VGuard's effect using unknown ASR models. In Appendix E, we further test widely adopted ASR systems, including Whisper (base, small, medium, large), Citrinet, and Conformer, confirming consistent effectiveness.
- Second, in the "Eliminating ASR System" part of Appendix A, we verify that E2E-VGuard maintains protection even with correct textual input. On GSV, the WER and SIM are 39.659% and 0.161 (T), and 73.784% and 0.278 (UT). This occurs because our method, leveraging MFCC features and ASR models, effectively alters the audio's latent phonetic information. Therefore, E2E-VGuard provides effective protection regardless of the ASR system evaluation.
Attacking ASR systems (or verifying ASR vulnerability) relies on loss function computation. Empirically, the Wav2vec2 model inherently calculates loss during transcription, making gradient extraction more feasible than manual CTC loss computation.
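For illustration, a targeted CTC loss can be obtained from a frozen Wav2vec2 checkpoint roughly as follows (a sketch using the HuggingFace interface; the actual checkpoint and attack configuration in the paper may differ):

```python
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
asr = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").eval()
for p in asr.parameters():
    p.requires_grad_(False)  # freeze ASR weights: gradients flow only to the audio

def asr_target_loss(waveform: torch.Tensor, target_text: str) -> torch.Tensor:
    # waveform: differentiable tensor of shape (1, T), 16 kHz, roughly normalized.
    # Wav2Vec2ForCTC returns the CTC loss directly when `labels` are provided,
    # so no manual CTC computation is needed for the targeted attack.
    labels = processor.tokenizer(target_text.upper(), return_tensors="pt").input_ids
    return asr(input_values=waveform, labels=labels).loss
```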
Q5: While the y_t in Eq. (4) is evidently the targeted text, it would be clearer to also explain it in the text.
R: We illustrate the definition of y_t in line 196. We will provide a more detailed description in the revised version.
L3: There may be a typo in Table 9. MOS should be the higher the better but it’s a down arrow in Table 9. Further proofreading may be needed.
R: Thanks for pointing it out. We will fix it.
L4: More experiments on this are required to test the true robustness of the proposed defense framework.
R: We appreciate this constructive feedback. We evaluate diffusion-based denoising techniques [3*], achieving SIM scores of 0.233 and 0.277 on GSV and VITS, respectively, demonstrating substantially preserved privacy protection compared to unprotected samples. Moreover, we test an advanced DNN-based denoising model [4*], yielding SIM values of 0.243 and 0.261 on GSV and VITS. While both approaches effectively eliminate audible noise, E2E-VGuard maintains high robustness. This is because denoising techniques, despite removing audible noise, inevitably cause partial information loss in the original audio [10], preventing adversaries from recovering the original audio.
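For reference, SIM is a cosine similarity between speaker embeddings; the sketch below uses Resemblyzer purely as an illustrative encoder (the SIM computation in the paper may rely on a different speaker encoder):

```python
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

def speaker_sim(path_a: str, path_b: str) -> float:
    # Cosine similarity between speaker embeddings of two utterances; a low
    # similarity between the original voice and the cloned/denoised output
    # indicates that timbre protection survives the removal attempt.
    e_a = encoder.embed_utterance(preprocess_wav(path_a))
    e_b = encoder.embed_utterance(preprocess_wav(path_b))
    return float(np.dot(e_a, e_b) / (np.linalg.norm(e_a) * np.linalg.norm(e_b)))
```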
[1*] Chen G, Zhang Y, et al. Songbsab: A dual prevention approach against singing voice conversion based illegal song covers. NDSS 2025.
[2*] Fang Z, Wang T, Zhao L, et al. Zero-query adversarial attack on black-box automatic speech recognition systems. CCS 2024.
[3*] Wu S, et al. "Defending against adversarial audio via diffusion model." ICLR 2023.
[4*] https://github.com/facebookresearch/denoiser
Note: [N*] represents the new reference and [N] denotes the reference in the paper.
Thanks for your rebuttal. The rebuttal is detailed and answers my questions. I am leaning towards acceptance.
We sincerely appreciate your response and recognition.
In this paper, the authors propose the E2E-VGuard method, which aims to prevent a user’s speech from being used as a prompt or training data for unknown speech synthesis systems by perturbing both timbre and textual information, while keeping perceptual differences minimal. They design various loss functions to optimize the perturbation process and conduct comprehensive experiments to evaluate the effectiveness of their method, including real-world scenarios. This topic is gaining significant attention as concerns over user data privacy grow, and as speech synthesis systems become nearly indistinguishable from real human speech. Therefore, I strongly believe that this work makes a valuable contribution to the safety and security of modern speech synthesis systems and can significantly benefit the research community.
Strengths and Weaknesses
Strengths
- First of all, the topic of this paper is highly important and is currently receiving significant attention. Therefore, I believe this work has the potential to make a meaningful contribution and impact on society.
- The authors effectively harmonize several objective functions necessary to perturb the input speech signal.
- Additionally, the authors conduct a very comprehensive and well-designed evaluation. They effectively simulate real-world scenarios by incorporating several state-of-the-art TTS systems, including commercial APIs. They also test the robustness of the perturbed signal under conditions that could realistically occur in practice. Finally, they simulate the case where speech is physically transmitted and recorded as a different signal, and evaluate robustness in that context as well. I believe this comprehensive and practical evaluation can serve as a valuable reference for future research on similar topics.
Weaknesses
- The biggest weakness is the lack of clarity in the proposed method. It is unclear what the perturbation method entails. The authors mention optimizing various objective functions and perturbing the original waveform’s content and timbre information. However, they do not specify what kind of network is being optimized. Are they optimizing the perturbation signal directly? If so, how is the perturbation generated?
- Following this, the authors optimize a network for only 500 iterations, which is significantly fewer than what is typical for neural network training. I could not find the batch size used in the training process, but considering the size of the training dataset, 500 iterations may not even be enough to cover the entire dataset for a single epoch. Is this because the network is optimized for a specific speech signal each time? In other words, does the perturbation network need to be optimized individually for every single target waveform?
- In the evaluation, the authors present results for untargeted (UT) and targeted (T) E2E-VGuard systems. I understand that UT and T refer to perturbations in timbre; however, I was unable to clearly understand the distinction in text-wise perturbation between UT and T. Based on my understanding, the authors chose to attack the ASR system with a specific target text, as indicated in line 184 (“Therefore, we choose targeted attacks against ASR systems”). This suggests that a specific target text was set, and the original signal was perturbed to match that target content. However, when I listened to the speech samples on the demo page, both the T and UT attacks seemed to retain the same text content as the original signal. Did I understand this correctly? I believe the authors should provide a more detailed explanation of their T and UT attack settings in the experimental section.
- In Section 4.6 (Perturbation Removal Experiment), the authors adopt a spectral gating denoising technique. After denoising, the performance appears to drop significantly—for example, the WER decreases from 95.7% to 51.0%. This suggests that with a more effective denoising method, the effectiveness of the proposed approach could deteriorate even further. Therefore, the authors should evaluate the impact of denoising on random noise signals as well, or test with various denoising techniques, to demonstrate that the proposed system is robust against different types of denoising methods.
Questions
- In Table 2, the WER results for each zero-shot E2E TTS system vary significantly—some systems achieve WERs below 10, while others exceed 50. What is the reason behind this discrepancy? Is it due to the differing characteristics of the zero-shot E2E TTS systems? If so, what characteristics might explain this variation, in the authors’ opinion? This variation also appears to relate to the robustness of the proposed system and warrants further discussion.
- Since the method adopts several different objective functions, I wonder how the authors adjusted the loss weights effectively. In the paper, the two weights are set to 500 and a drastically different value, respectively.
Limitations
Yes
Final Justification
The authors’ rebuttal addresses the concerns I raised. Considering the quality of the paper and the potential impact this work could have on speech privacy protection research, I would like to increase my score from 3 to 4.
Formatting Issues
- In Figure 1, particularly the section showing the loss, the graph and the text are difficult to read. I suggest increasing the size of this part or separating it into a standalone figure for better clarity.
- In Figure 1, there are two lines with different colors representing the loss. The authors should clarify what these lines indicate.
- In line 223, the reference to Section 2 appears to be a typo.
- This is a minor suggestion, but Figure 3(a) does not seem particularly important and could be omitted. Personally, I found the description in the text sufficient to understand how the real-world experiments were conducted.
We sincerely appreciate your valuable feedback. We are doing our best to address your concerns.
W1: They do not specify what kind of network is being optimized. Are they optimizing the perturbation signal directly? If so, how is the perturbation generated?
R: First and foremost, it is important to clarify that E2E-VGuard does not involve the design or training of a perturbation network. Perturbation generation relies on the derivation of the loss function, which we will illustrate in detail.
In adversarial attacks, a critical step lies in designing an effective optimization objective, based on which perturbations can be generated. Specifically, following the classical Projected Gradient Descent (PGD) [1*] algorithm in the adversarial attack domain, we compute the gradient of the loss function with respect to the optimization variable $x'$ to derive the perturbation $\delta = \alpha \cdot \mathrm{sign}(\nabla_{x'} \mathcal{L}(x'))$, where $\mathrm{sign}(\cdot)$ denotes the sign function and $\mathcal{L}$ represents the loss function. Subsequently, $\delta$ is projected onto the $\epsilon$-ball constraint, i.e., $\|\delta\|_\infty \le \epsilon$. Using $\delta$, the protected audio is updated at each step as $x'_{t+1} = \Pi_{\epsilon}(x'_t + \delta)$, as shown in Lines 13–14 of Algorithm 1.
Referring to previous methodology descriptions [7-10], our E2E-VGuard design primarily focuses on defining the optimization objective, with no network training involved in this process. The encoder and ASR models utilized in the loss function are introduced in Section 4.1. These components utilize pre-trained checkpoints to obtain speaker embeddings and content features, respectively. Moreover, these models' gradients are disabled to ensure that the gradient information of $\mathcal{L}$ depends solely on the optimization variable $x'$, enabling more effective optimization.
We will illustrate it more clearly in the revised version.
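For clarity, a minimal, self-contained sketch of this per-sample PGD-style loop is given below (the step size, budget, and the composition of loss_fn are illustrative, not the paper's exact settings):

```python
import torch

def pgd_protect(x: torch.Tensor, loss_fn, epsilon: float = 0.008,
                alpha: float = 0.001, steps: int = 500) -> torch.Tensor:
    # x       : original waveform, shape (1, T), values in [-1, 1]
    # loss_fn : callable returning a scalar objective for a candidate waveform
    #           (e.g., ensemble timbre loss + targeted ASR loss + perceptual loss)
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(x_adv)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            # Gradient-sign step (descent here; flip the sign if the objective is
            # to be maximized), then projection onto the epsilon-ball around x.
            x_adv = x_adv - alpha * grad.sign()
            x_adv = x + torch.clamp(x_adv - x, -epsilon, epsilon)
            x_adv = x_adv.clamp(-1.0, 1.0)  # keep a valid waveform range
        x_adv = x_adv.detach()
    return x_adv
```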
W2: 500 iterations seem too few for training a network, and no batch size is given. Does the perturbation network need to be optimized individually for every single target waveform?
R: Similar to W1, E2E-VGuard is not involved in the design and training of the perturbation network.
In our paper, the 500 iterations refer to optimization steps in the PGD algorithm, not epochs for training a neural network. These 500 iterations allow the loss value to converge effectively. This number aligns with related literature, e.g., SafeSpeech employs 200 iterations [10] while AntiFake utilizes 1000 [7]. Since we do not train a perturbation network but instead optimize each speech sample individually, the concept of batch size is not introduced. Our framework is implemented as a software solution [7], which takes a single audio input and outputs the protected audio. Therefore, for each sample to be protected, we optimize the objective to generate the corresponding perturbation. Parallel and batch processing can also significantly reduce the time overhead of E2E-VGuard.
W3: Distinction between untargeted (UT) and targeted (T) attacks in text-wise perturbation.
R: Targeted and untargeted attacks represent two common modes in adversarial attacks [1*]. Both timbre and pronunciation protection can be conducted in either the targeted or the untargeted mode. The distinction between untargeted and targeted timbre protection lies in whether a specific target speech is selected as the optimization objective (Section 3.3). For attacks against ASR systems, targeted attacks are generally preferred, as untargeted attacks (i.e., without a designated target text) produce random output text, which may raise adversaries' suspicion (lines 181–184). Therefore, we consider targeted attacks on ASR systems. It is essential to select an appropriate target text (lines 184–187). The choice of target text is also correlated with the timbre protection mode.
For targeted timbre protection (T), where the target speech is selected, we aim to align the protected audio with the target speech in feature space, including both timbre and textual content. Therefore, we set the ASR attack's target text to match the target speech's text, thereby enhancing the ASR attack effect and pronunciation protection (lines 187–191). For untargeted timbre protection (UT), we randomly select a text of the same length as the original speech for the ASR attack (lines 191–193). For example:
Original Text: His voice was changed as he spoke next.
Target Text (T): Please open the door. (Text from the target speech)
Target Text (UT): Her voice shifted as she began to talk. (Same-length text)
Our pronunciation prevention operates at the feature level and ensures the original audio's usability without large distortion. In the demo audio:
- For protected audio, both untargeted and targeted protection retain the original textual content and speaker identity as much as possible auditorily.
- For synthesized audio, the targeted (T) protection effectively disrupts the speech pronunciation, generating incoherent output to achieve protection.
W4: The authors should evaluate the impact of denoising on random noise signals as well, or test with various denoising techniques.
R: Thanks for the valuable suggestion. We conduct new tests with stronger denoising techniques. We utilize an industrial-level DNN-based denoising model [2*], which can remove nearly all noise audible to human ears. The test results show that after denoising (which removes the audible perturbation, hence the comparison against clean audio), the protection is still effective (GSV: WER 23.1%, SIM 0.243; VITS: WER 34.1%, SIM 0.261).
The notably low SIM values persist even after denoising because while such techniques eliminate audible noise, they concurrently cause partial loss of critical original audio information, e.g., timbre and phoneme features [10]. Furthermore, under the more advanced diffusion-based denoising method [3*], the SIM values remain low at 0.233 for GSV and 0.277 for VITS, demonstrating E2E-VGuard's robustness against advanced denoising techniques.
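For reproducibility of this robustness check, the DNN-based denoiser [2*] can be applied roughly as in its repository's usage example (checkpoint name and call pattern follow that repository and may differ slightly from our exact setup):

```python
import torch
import torchaudio
from denoiser import pretrained
from denoiser.dsp import convert_audio

model = pretrained.dns64()                   # pre-trained Demucs-based denoiser
wav, sr = torchaudio.load("protected.wav")   # E2E-VGuard-protected audio
wav = convert_audio(wav, sr, model.sample_rate, model.chin)
with torch.no_grad():
    denoised = model(wav[None])[0]           # denoised audio fed to the TTS attacker
torchaudio.save("denoised.wav", denoised, model.sample_rate)
```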
Q1: In Table 2, the WER results for each zero-shot E2E TTS system vary significantly. What is the reason behind this discrepancy?
R: This discrepancy arises from inherent features of TTS models, as our evaluation encompasses a diverse range of distinct TTS models, i.e., E2E and non-E2E scenarios. On one hand, we aim to evaluate speech synthesis in E2E scenarios (requiring reference text input). On the other hand, we also seek to evaluate emerging LLM-based TTS models. For certain TTS models like XTTS-v2, which operates without reference text input during zero-shot speech synthesis, the reference audio is solely used to extract speaker embeddings, while pronunciation information is entirely produced by the model's pre-trained knowledge and remains unaffected by errors in reference text. However, most models achieve WER values superior to baselines in Table 2.
In E2E scenarios, our approach delivers substantially stronger voice protection than baselines, as evidenced by fine-tuning-based E2E TTS models in Table 1 and zero-shot-based E2E TTS models, e.g., Step-Audio-TTS, FireRedTTS-1S, partially presented in Table 2.
We have also included non-E2E scenarios in our evaluation to comprehensively assess production LLM-based TTS models, e.g., XTTS-v2, while demonstrating the effectiveness of timbre protection across diverse architectures.
Q2: Since the method adopts several different objective functions, I wonder how did authors adjust the loss weight effectively.
R: We adjust the loss weights based on initial magnitude balance and experimental assessment. The three loss terms in Eq. (1) have initial magnitudes during optimization that differ substantially. The weighting parameter is set to 500 to amplify timbre protection, thereby aligning its optimization scale with that of the other terms. Moreover, the other weighting parameter serves to balance the perturbation perception and protection strength, and its selection is validated through experimental results. Specifically, we conduct tests using the CosyVoice model for voice cloning on small samples, selecting an appropriate value based on the balance between these two factors, as shown in Table R-1. It can be observed that when this weight is larger, although SNR values are high, there is minimal protection for timbre and pronunciation, and a lower ratio of the perturbation could also diminish the robustness [10]. Conversely, when it is smaller, protection strength does not improve obviously, and the SNR drops to 13.982, resulting in poor perceptual quality. Therefore, we select an intermediate value, which achieves a better balance among protection effectiveness, perceptual quality, and robustness.
Table R-1: Impact of the hyperparameter.
| Setting | WER | SIM | SNR |
|---|---|---|---|
| clean | 7.407 | 0.905 | - |
| smaller weight | 14.815 | 0.069 | 13.982 |
| intermediate weight (selected) | 18.519 | 0.074 | 18.182 |
| larger weight | 7.407 | 0.241 | 24.096 |
Paper Formatting Suggestion
We are grateful for your suggestions on the details of our paper, and we will follow the comments to enhance the presentation.
These additional experiments and clarifications will be included in the revised version.
[1*] Madry A, Makelov A, Schmidt L, et al. Towards deep learning models resistant to adversarial attacks. ICLR 2018.
[2*] https://github.com/facebookresearch/denoiser.
[3*] Wu S, et al. Defending against adversarial audio via diffusion model. ICLR 2023.
Note: [N*] represents the new reference and [N] denotes the reference in the paper.
Thank you for your thoughtful and sincere rebuttal. I appreciate the time and effort you put into addressing the concerns I raised. I believe this work will contribute to the advancement of speech protection research. I will increase my score from 3 to 4. Good luck with your paper.
We sincerely appreciate your response and recognition of our work. Thank you!
This paper presents E2E-VGuard, a proactive audio protection framework against unauthorized voice cloning attacks, especially targeting production-level, LLM-based, and end-to-end (E2E) TTS pipelines. The approach combines: 1. an encoder ensemble for timbre disruption, 2. adversarial attacks on ASR for pronunciation confusion, and 3. psychoacoustic masking for imperceptible perturbation. The method is validated across various open-source and commercial TTS APIs, under both fine-tuning and zero-shot settings, with extensive robustness analyses.
Strengths and Weaknesses
Strengths
The paper is well-written with clear motivation and sufficient background. The problem formulation and the E2E scenario definition are easy to follow. The paper provides empirical evaluation, including both open and commercial TTS models and a variety of attacks and augmentations. This increases its practical relevance and transferability. The proposed method, E2E-VGuard, outperforms previously proposed methods and achieves state-of-the-art performance on the task. In addition, the experiments on real-world scenarios and commercial APIs add credibility to its robustness.
Weaknesses
The defense assumes attackers will use ASR. However, motivated attackers could manually transcribe audio, circumventing the pronunciation perturbation. In addition, a deeper analysis of the protected audio is required. For instance, the WER of the protected audio is not provided, which makes it difficult to judge to what extent the perturbation is truly imperceptible to humans or ASR systems. A thorough analysis and reporting of the WER on the protected audio itself would significantly strengthen the claim regarding the imperceptibility of the perturbation, as it directly reflects whether the original content is preserved for benign users.
Questions
- Shouldn’t it be necessary to assess whether the perturbation added to the protected audio is truly imperceptible and whether the original content of the audio is preserved? What is the WER (Word Error Rate) of the protected audios?
- Recently, many zero-shot TTS models that utilize in-context learning (ICL) rather than extracting speaker embeddings have been introduced, such as VALL-E[1], F5-TTS, E2-TTS, and VoiceBox. Have you evaluated your method on these LLM and ICL-based zero-shot TTS models? If so, what are the results?
[1] Wang, Chengyi, et al. "Neural codec language models are zero-shot text to speech synthesizers." arXiv preprint arXiv:2301.02111 (2023).
[2] Chen, Yushen, et al. "F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching." arXiv preprint arXiv:2410.06885 (2024).
[3] Eskimez, Sefik Emre, et al. "E2 TTS: Embarrassingly easy fully non-autoregressive zero-shot TTS." 2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024.
[4] Le, Matthew, et al. "Voicebox: Text-guided multilingual universal speech generation at scale." Advances in Neural Information Processing Systems 36 (2023): 14005-14034.
Limitations
While E2E-VGuard presents promising results against a range of current TTS models and demonstrates robust protection under both fine-tuning and zero-shot scenarios, several important limitations remain. First, the proposed defense assumes that attackers will use ASR systems to generate transcripts; however, a motivated adversary could simply perform manual transcription of the protected audio, effectively bypassing the pronunciation perturbation and weakening the defense. Second, the analysis of the protected audio itself is insufficient. In particular, the paper does not report the WER (Word Error Rate) or perceptual metrics of the protected audio, making it difficult to assess how much of the original content is preserved and to what extent the perturbation is truly imperceptible to humans or ASR systems. Thorough evaluation and reporting of these metrics would significantly strengthen the claims regarding the imperceptibility and practical usability of the approach. Finally, the current evaluation mainly focuses on speaker embedding-based systems, while recent advances in zero-shot TTS—including models like VALL-E [1], F5-TTS [2], E2-TTS [3], and VoiceBox [4]—increasingly utilize large language models and in-context learning (ICL) instead of explicit speaker embeddings. It remains unclear whether E2E-VGuard is effective against such ICL-based systems, as their mechanisms for speaker adaptation may not rely on features targeted by the proposed method. Explicit experiments and analyses on these next-generation models are necessary to confirm the broader applicability and long-term effectiveness of the defense.
Formatting Issues
We are grateful for the reviewer's insightful and valuable comments. We will endeavor to address your concerns as much as possible.
W1: Motivated attackers could manually transcribe audio, circumventing the pronunciation perturbation.
R: Thanks for this suggestion. Providing correct text during speech synthesis still cannot circumvent E2E-VGuard, as we utilize the ASR module combined with MFCC features to conceal audio information at the phonetic level, and audio's content features have been altered (though the original audio's audible content remains unchanged).
Regarding the correctness of the reference text, we have considered this point in our paper. On one hand, in the E2E speech synthesis scenarios, the adversary may not necessarily be able to manually transcribe the text, such as when utilizing black-box commercial API models, where the reference text depends on the unknown internal ASR system of the black-box model. On the other hand, we have also validated the scenario where the adversary can manually transcribe the text, i.e., when the provided reference text is accurate. We test this case in the "Eliminating ASR System" section of Appendix A (lines 600–611). Specifically, on the LibriTTS dataset, we evaluate the voice cloning performance of the GSV model utilizing correct reference text, achieving WER and SIM scores of 73.784% and 0.278 (UT), as well as 39.659% and 0.161 (T), respectively. The WER remains considerably higher than that of unprotected speech samples.
W2: It is required to deeply analyze the protected audio. For instance, the WER of the protected audio is not provided, which makes it difficult to judge to what extent the perturbation is truly imperceptible to humans or ASR systems.
R: We appreciate the insightful comment regarding the perceptual quality of our protected audio. On the LibriTTS dataset, the protected audio achieves a WER value of 13.862%, while AntiFake-protected samples yield a WER value of 31.408%. This demonstrates that our attack on ASR systems does not greatly compromise the content integrity of the original audio.
Regarding the perturbation perception, we analyze the effectiveness of our framework compared to prior methods through both objective and subjective evaluations. For objective metrics, Section 4.2 quantifies minimal audio degradation through SNR (>20 dB in the targeted mode) and PESQ metrics. For subjective evaluation, Appendix G presents human listening perception where our method achieves a MOS value exceeding 3.0, indicating that most listeners regard the embedded perturbation as imperceptible or acceptable [7].
The strength of our perception optimization stems from two key aspects: (1) The perception optimization algorithm based on the psychoacoustic model in Section 3.5 can hide perturbations to areas inaudible to human ears. (2) We embed perturbations on waveforms instead of the audio embeddings in the latent space [1*]. By linearly adding perturbations directly onto the waveform, E2E-VGuard introduces much less audio distortion compared to embedding-based protection methods like AttackVC (which produces negative SNR values) [8] and a black-box optimization [1*] based on the autoencoder. This fundamental design choice inherently preserves better audio fidelity.
Q1: Shouldn’t it be necessary to assess whether the perturbation added to the protected audio is truly imperceptible and whether the original content of the audio is preserved? What is the WER (Word Error Rate) of the protected audios?
R: We should evaluate whether the embedded perturbations affect the normal usability of the audio and whether the original content is preserved. The WER of the protected audio is 13.862%, and we can conduct a further in-depth analysis of this issue.
In the scenario of our paper, the embedded perturbations should be "harmless" to the original audio, meaning that the original text content remains unaltered and the normal usability of the protected audio is unaffected; the goal is not perceptual indistinguishability between the protected and the original audio. "Usability" here means whether the audio can be utilized normally in daily life. We have verified through both objective (Section 4.2) and subjective (Appendix G) experiments that the perturbations we generate do not cause large changes to the original audio.
Moreover, from the robustness perspective, assuming strong adversaries can distinguish embedded perturbations, they can utilize adversarial techniques to improve the performance of the synthesized speech, causing privacy leakage. However, the robustness validated in Section 4.6 ensures that the adversary cannot effectively remove the embedded perturbation, thereby enhancing protection efficacy against speech synthesis. Therefore, even if the adversary perceives the perturbations, the robustness of E2E-VGuard ensures that privacy data is not completely leaked.
On the other hand, calculating the WER for the original audio relies on the ASR system. When considering the E2E scenario, E2E-VGuard performs an adversarial attack on the ASR system to protect audio, which has potential transferability to other ASR models, thereby affecting recognition results. The ASR model used for WER calculation is not the same as the one used for perturbation generation, and our WER is only ~10%, which is relatively low. Moreover, the audible content does not undergo large changes. In our response to W2, we also explain the reasons for the superior perceptual quality of our method.
We will present these further considerations in the revised version.
Q3: Effectiveness Evaluation of TTS models based on in-context learning (ICL), such as VALL-X, F5-TTS, E2-TTS, and VoiceBox.
R: We sincerely appreciate this valuable suggestion regarding model scope expansion. In this experiment, we evaluate the effectiveness across three ICL-based TTS models: VALLE-X, F5-TTS, and E2-TTS (note that VoiceBox is excluded due to the lack of an official open-source code). The results are shown in Table R-1.
Table R-1: Experimental results on ICL-based models.
| Method | VALLE-X WER | VALLE-X SIM | F5-TTS WER | F5-TTS SIM | E2-TTS WER | E2-TTS SIM |
|---|---|---|---|---|---|---|
| clean | 14.450 | 0.519 | 4.268 | 0.676 | 5.401 | 0.678 |
| AntiFake | 96.469 | 0.249 | 4.303 | 0.282 | 4.004 | 0.269 |
| E2E-VGuard (UT) | 129.483 | 0.175 | 10.776 | 0.053 | 7.064 | 0.138 |
| E2E-VGuard (T) | 88.707 | 0.176 | 70.034 | 0.319 | 84.913 | 0.372 |
This effectiveness stems from our speaker encoder ensemble technique, which successfully hides or modifies the timbre information of the original speaker. Consequently, the timbre prevention of the proposed E2E-VGuard does not rely on the specific speaker encoder. As evidenced in Table 1, we applied no specialized processing, e.g., extracting their encoders for model-specific attacks, to Llasa-1/8B and TTS models in Table 2. Instead, the high transferability of our method enables effective performance across diverse models.
These additional experiments for model scope expansion will be added in the revised version based on your insightful comments.
Limitations
R: Thanks for these proposed suggestions. Regarding the impact of ASR systems in E2E scenarios, perceptual analyses, content consistency metrics, and ICL-based TTS models, we have provided detailed explanations in the preceding sections. We hope these clarifications can address your concerns.
[1*] Gao J, Li H, Zhang Z, et al. Black-box adversarial defense against voice conversion using latent space perturbation. ICASSP 2025.
Note: [N*] represents the new reference and [N] denotes the reference in the paper.
I thank the authors for their detailed rebuttal and the additional experiments. I will maintain my score.
We sincerely acknowledge your response and valuable comments!
The paper proposes a defense framework against voice cloning threats, especially focusing on modern LLM-based end-to-end speech synthesis systems. Existing defenses struggle with LLM integration and end-to-end workflows that use automatic speech recognition (ASR) for text transcription. The core idea is to disrupt both timbre (through encoder ensemble and MFCC features) and pronunciation (via ASR-targeted adversarial examples), while using psychoacoustic models to keep perturbations imperceptible.
Experiments on 16 TTS models (13 open-source, 3 commercial), 7 ASR systems, and Chinese/English datasets show E2E-VGuard significantly reduces speaker similarity (SIM) and increases word error rate (WER), outperforming baselines. It also demonstrates robustness against perturbation removal and real-world over-the-air attacks.
Strengths and Weaknesses
Strengths:
- The problem addressed in the paper is highly relevant in the context of increasing security risks associated with advanced speech synthesis technology. The paper presents a novel approach to protecting speech synthesis systems by combining multiple techniques to disrupt both timbre and pronunciation, and it is the first to combine encoder ensembles, adversarial ASR attacks, and psychoacoustics in voice protection. The use of psychoacoustic models to ensure imperceptibility of perturbations is particularly innovative.
- The paper provides a thorough evaluation of the proposed framework. The authors conduct extensive experiments across various models and datasets, demonstrating the effectiveness of E2E-VGuard in protecting against unauthorized speech synthesis.
- The paper clearly explains the design and implementation of E2E-VGuard. The methodology is detailed, and the results are presented very clearly.
Weaknesses:
- The paper mentions that E2E-VGuard takes a significant amount of time (e.g., ~100 seconds per audio) to protect audio samples, which could be a limitation in real-time applications.
- The effectiveness of E2E-VGuard is evaluated against specific ASR systems, and while the results are promising, the paper could benefit from a broader evaluation across a wider range of ASR systems to demonstrate more universal applicability.
- Claims of imperceptibility rely on SNR/PESQ metrics and lack perceptual tests beyond MOS (e.g., ABX listening tests).
Questions
- How does E2E-VGuard perform on long audio sequences (e.g., >10 minutes)? Pronunciation disruption might weaken in longer contexts due to ASR cumulative errors.
- Can this framework be optimized for real-time deployment? For example, via lightweight encoders or parallel processing, especially for mobile applications.
- Have you tested E2E-VGuard in multi-speaker conversations? Defending against voice cloning in dialogues presents unique challenges (e.g., speaker diarization).
Limitations
- Computational efficiency: the time required to protect audio samples using E2E-VGuard is relatively high, which may limit its applicability in real-time scenarios. More work is needed to optimize the computational efficiency of the framework.
- ASR system dependency: the effectiveness of E2E-VGuard is demonstrated against specific ASR systems; while the results are promising, the framework's performance may vary when applied to different ASR systems. A more comprehensive evaluation across a wider range of ASR systems would be beneficial.
- Language scope: this work is primarily evaluated on Chinese and English; effectiveness in low-resource languages remains untested.
Final Justification
All my concerns are properly addressed.
Formatting Issues
None
We sincerely appreciate your constructive and helpful comments. We will do our best to address your concerns and revise the paper accordingly.
W1/Q2/L1: E2E-VGuard requires a relatively long time to protect audio (~100s per audio), which limits real-time protection. Can this framework be optimized for real-time deployment?
R: Thanks for your suggestion regarding time optimization. We will elaborate on the time requirements of our application scenarios discussed in the paper and subsequently optimize the computational time of our framework based on your valuable feedback.
First, our paper focuses on the offline scenario where users protect audio before uploading to social media (Lines 272–273). This offline scenario holds minimal real-time requirements and high runtime tolerance [1*], allowing us to prioritize more effective protection against these LLM-based and E2E threats. Compared to other baselines, our computational time is comparable or even superior, e.g., AntiFake [7] requires 203.248 seconds and SongBsAb [1*] needs 287 seconds per audio. Moreover, E2E-VGuard has low hardware demands, requiring only 4–5 GB of memory and enabling deployment on consumer laptops (e.g., those equipped with RTX 3060 or 4060 GPUs).
Furthermore, with additional computational resources, the time overhead can be further reduced through techniques like parallel and multi-GPU processing. In the paper, we design E2E-VGuard as a single-audio processing software for low-resource GPU deployment as AntiFake [7]. To enhance efficiency, we optimize it to accept batch inputs. Building on this, we optimize E2E-VGuard for multi-GPU parallelization, significantly accelerating audio protection. We experiment to demonstrate this. The initial framework takes 104.62 seconds to process a 5-second audio sample on a single 4090 GPU, and we optimize it from two aspects:
- [Batch Process] With 8 samples per batch on one GPU, the runtime is reduced to 59.88 seconds per audio.
- [Multi-GPU Parallel Process] Using two RTX 4090 GPUs (processing 4 samples per GPU in parallel), the runtime drops to 30.17 seconds per audio.
Therefore, with greater computational resources, the time overhead might potentially and ideally be reduced to near-linear. Moreover, for a single audio, we can clip it into segments for batch and parallel processing to significantly reduce the time overhead, a benefit that equally holds for multiple audio files.
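A simplified sketch of this multi-GPU batching scheme is shown below (protect_batch stands in for the per-batch E2E-VGuard optimization; file names and the two-GPU split are illustrative):

```python
import torch
import torch.multiprocessing as mp

def worker(rank: int, shards):
    # One process per GPU; each protects its shard in batches of 8 clips,
    # mirroring the two-GPU configuration reported above.
    torch.cuda.set_device(rank)
    files = shards[rank]
    for i in range(0, len(files), 8):
        protect_batch(files[i:i + 8], device=f"cuda:{rank}")  # hypothetical routine

if __name__ == "__main__":
    all_files = [f"clip_{i:03d}.wav" for i in range(64)]  # audio to protect (illustrative)
    shards = [all_files[0::2], all_files[1::2]]            # split across 2 GPUs
    mp.spawn(worker, args=(shards,), nprocs=2)
```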
For mobile deployment, we can construct a client-server architecture [10]: E2E-VGuard can operate on a backend as the trustworthy server with enough computational resources. Mobile devices transmit audio to the trustworthy server and receive protected audio within an acceptable response time, enabling practical mobile applications.
W2/L2: The paper could benefit from a broader evaluation across a wider range of ASR systems to demonstrate more universal applicability.
R: First, in Appendix E, we validate the effectiveness of commonly used ASR systems, including six models: Whisper with four sizes, Citrinet, and Conformer. These ASR systems exhibit diverse structures and are representative, as shown in Table 7. The results in Table 8 demonstrate the effectiveness and scalability of E2E-VGuard across different ASR models.
Furthermore, in Section 4.4, we conduct experiments on commercial black-box models, for which the ASR systems are entirely unknown. Our architecture is not specifically optimized for these systems, but can achieve a certain level of protection, relying on the transferability from the specific white-box ASR system.
Finally, in the "Eliminating ASR System" section of Appendix A, we discuss that our method remains highly effective even when the transcribed reference text is correct. Therefore, the effectiveness might not be largely reduced across different ASR systems. This is because our optimization from the audio content perspective partially influences the features of the original audio at the feature level (the original audible textual content remains unaltered).
W3: Claims of imperceptibility rely on SNR/PESQ metrics, lacks perceptual tests beyond MOS (e.g., ABX listening tests).
R: The MOS aligns with human auditory perception, thereby reflecting the usability of the protected audio.
We employ both objective and subjective metrics to evaluate the perception of the perturbation. SNR is selected as it can effectively reflect the magnitude of embedded perturbations, and it serves as a core metric for noise perception [2*]. PESQ represents the audio perception objectively. The MOS values reflect human auditory perception, thereby validating the usability of the protected audio. The subjective experiments in Appendix G confirm that the embedded noise remains acceptable because the MOS surpasses 3 [7].
ABX listening tests primarily assess the distinction between protected and original audio, while we aim to ensure that embedded perturbations do not compromise the original audio's usability in daily life. The MOS values confirm that our protected audio maintains usability without impairing normal utility. Furthermore, even if adversaries can distinguish protected audio from unprotected audio, the experiments in Section 4.6 verify the robustness of the E2E-VGuard to ensure adversaries cannot effectively denoise or synthesize the audio.
Q1: How does E2E-VGuard perform on long audio sequences (e.g., >10 minutes)? Pronunciation disruption might weaken in longer contexts due to ASR cumulative errors.
R: For the long audio, we can segment the audio into shorter clips (approximately 3-8 seconds). Since TTS models, whether fine-tuning or zero-shot, cannot effectively process long audio inputs, we leverage the built-in audio segmentation tool provided by the GSV model [2]. Our framework has already demonstrated protection capabilities for extended audio and large-scale datasets, such as the CMU ARCTIC corpus (total duration: 1 hour 20 minutes) and the THCHS30 dataset with 1 hour duration. Segmenting long audio into shorter clips could also mitigate ASR cumulative errors, as illustrated by our experimental results with shorter inputs, e.g., Table 1 and Table 2. Regarding time overhead, the approach in response to W1 employs batch and parallel processing, greatly reducing the computational burden associated with long audio processing.
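A minimal sketch of this segment-then-protect strategy is given below (the protect call is a placeholder for the per-clip E2E-VGuard optimization; clip length and sample rate are illustrative):

```python
import torch
import torchaudio

def protect_long_audio(path: str, clip_seconds: float = 6.0, sr: int = 16000) -> torch.Tensor:
    # Split a long recording into ~3-8 s clips, protect each clip independently
    # (optionally in parallel), then concatenate the protected clips.
    wav, orig_sr = torchaudio.load(path)
    wav = torchaudio.functional.resample(wav, orig_sr, sr)
    clip_len = int(clip_seconds * sr)
    clips = [wav[:, i:i + clip_len] for i in range(0, wav.shape[1], clip_len)]
    protected = [protect(clip) for clip in clips]  # hypothetical per-clip routine
    return torch.cat(protected, dim=1)
```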
Q3: Have you tested E2E-VGuard in multi-speaker conversations? Defending against voice cloning in dialogues presents unique challenges (e.g., speaker diarization).
R: E2E-VGuard can also achieve protection for multi-speaker conversations. First, we select an English dialogue involving two speakers and utilize a speaker diarization model [3*] to extract separate audio segments for each speaker, which are then employed for voice cloning in an unprotected scenario. We obtain a SIM score of 0.703 using the CosyVoice for TTS. Protection for this dialogue can be conducted from two perspectives: holistic protection and separated protection. Holistic protection refers to protecting the entire original audio directly without any segmentation, while separated protection involves independently protecting each speaker’s segment as identified by the speaker diarization model. After voice cloning, the SIM score under holistic protection drops to 0.188, and under separated protection, it drops to 0.185, both indicating effective protection.
This performance stems from two main reasons. Firstly, as demonstrated in Appendix F, our method is effective for multi-speaker scenarios. As long as the speaker diarization model can accurately separate the speakers, our approach achieves effective protection. Therefore, the primary challenge in protecting multi-speaker conversations lies in the accuracy of the speaker diarization task itself. Secondly, based on prior research [22], optimizing both content and identity can effectively conceal speaker information, enabling anonymization across different speakers. From this perspective, our method provides strong protection for diverse speakers, allowing the identities of participants in a dialogue to be hidden. Therefore, E2E-VGuard can achieve holistic protection without identity segmentations.
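As an illustration of the separated-protection pipeline, the diarization model [3*] can be invoked roughly as follows (the access token and the downstream per-segment protection step are placeholders):

```python
from pyannote.audio import Pipeline

# Separated protection: diarize the conversation first, then protect each
# speaker's segments independently.
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1",
                                    use_auth_token="HF_TOKEN")  # placeholder token
diarization = pipeline("conversation.wav")

for turn, _, speaker in diarization.itertracks(yield_label=True):
    # Each (speaker, segment) pair can be cropped out and protected on its own.
    print(f"{speaker}: {turn.start:.2f}s - {turn.end:.2f}s")
```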
L3: Language scope: this work is primarily evaluated on Chinese and English, effectiveness in low-resource languages remains untested.
R: We appreciate this suggestion regarding language expansion. E2E-VGuard can also achieve protection in Hindi and Polish. If the ASR system supports a particular language, E2E-VGuard can achieve protection in that language, because we conduct optimization on ASR systems. We choose Chinese and English primarily due to their widespread usage and because many TTS models support only these two languages, e.g., Index-TTS and Spark-TTS.
We conduct language-level evaluations on Hindi and Polish. For the TTS model, we select XTTS-v2 (as it supports these languages), and for the ASR model, we employ Whisper for text recognition. In the case of Hindi, the SIM score for clean sample cloning is 0.593, which drops to only 0.284 after protection, while the WER increases by 88.462%. For Polish, the SIM score decreases from 0.567 to 0.192. This evaluation confirms the scalability of E2E-VGuard across different languages.
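This cross-lingual check can be reproduced roughly as follows (openai-whisper and jiwer are used here as illustrative tooling; the exact ASR checkpoints and language settings in our evaluation may differ):

```python
import whisper
from jiwer import wer

asr = whisper.load_model("base")

def clone_wer(reference_text: str, synthesized_path: str, language: str = "hi") -> float:
    # Transcribe the cloned output and compare it against the reference
    # transcript; a high WER indicates disrupted pronunciation.
    hypothesis = asr.transcribe(synthesized_path, language=language)["text"]
    return wer(reference_text, hypothesis)
```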
[1*] Chen G, Zhang Y, et al. Songbsab: A dual prevention approach against singing voice conversion based illegal song covers. NDSS 2025.
[2*] Fang Z, Wang T, Zhao L, et al. Zero-query adversarial attack on black-box automatic speech recognition systems. CCS 2024.
[3*] https://huggingface.co/pyannote/speaker-diarization-3.1
Note: [N*] represents the new reference and [N] denotes the reference in the paper.
Thanks for the detailed response, I raise my rating to 5.
We are deeply grateful for your response and recognition of our work. Thank you!
The paper introduces E2E-VGuard, a proactive defense framework against malicious voice cloning in LLM-based end-to-end (E2E) speech synthesis pipelines. The method jointly protects timbre (via encoder ensemble and feature loss optimization) and pronunciation (via ASR-targeted adversarial perturbations), while employing a psychoacoustic model to keep perturbations imperceptible. The authors conduct an extensive evaluation across 16 TTS models (open-source and commercial), 7 ASR systems, and multiple languages, and further provide robustness analyses against denoising, transferability tests, and real-world deployment validation.
Strengths
- The problem is timely and significant, as voice cloning and fraudulent misuse of TTS pose genuine security risks.
- The defense framework is novel in combining timbre and pronunciation protection with perceptual masking, and it reflects realistic industrial settings.
- The evaluation is thorough and comprehensive, spanning different TTS architectures (fine-tuning, zero-shot, ICL-based), multiple ASR systems, and robustness against perturbation removal.
- The rebuttal addressed reviewer concerns with new experiments (WER reporting, evaluation on ICL-based TTS, low-resource language validation, and robustness to advanced denoising methods).
Weaknesses and Limitations
- Computational efficiency remains a concern: the original ~100s per audio sample protection time is high for large-scale or real-time settings, though the rebuttal showed batching and multi-GPU parallelization can reduce this to ~30s.
- While the defense assumes adversaries rely on ASR, a determined attacker could attempt manual transcription. Authors argue that timbre-level perturbations remain protective in such cases, but this assumption could be further stressed in real-world deployments.
- Some aspects of the method (targeted vs. untargeted attacks, optimization design) were initially unclear in the paper, though clarified in rebuttal.
- The MOS evaluation is acceptable, but the protected samples on the demo website are perceptually poor.
Overall, this paper makes a timely contribution to the safety of speech synthesis systems. Despite some limitations in efficiency and perceptual testing, the novelty, thorough experimentation, and societal relevance justify inclusion in the conference.