SLIM: Style-Linguistics Mismatch Model for Generalized Audio Deepfake Detection
Abstract
Reviews and Discussion
This paper proposes a new approach named Style-Linguistics Mismatch (SLIM) for generalizable audio deepfake detection. The authors claim that the dependency between linguistic and style information in real speech generalizes well to audio anti-spoofing tasks, and that the proposed method can also explain the final decision of the deep learning model. To support this claim, a proof-of-concept experiment (Table 1) shows that real audio exhibits a higher correlation between linguistics and style than deepfake audio. Based on this, the authors propose a two-stage learning framework: in the first stage, the dependency is captured by two compression modules operating on SSL features; in the second stage, a simple projection head is trained on the combination of the extracted dependency features and the SSL features.
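As a concrete illustration of the proof-of-concept experiment, here is a minimal sketch of a CCA probe of the kind summarized in Table 1. The array names (`style_emb`, `ling_emb`) are hypothetical placeholders for utterance-level style and linguistics SSL features, not the authors' actual variables, and scikit-learn's CCA is used as one reasonable implementation choice.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def mean_canonical_corr(style_emb: np.ndarray,
                        ling_emb: np.ndarray,
                        n_components: int = 8) -> float:
    """Fit CCA between style and linguistics embeddings (n_utterances x dim)
    and return the mean correlation over the canonical component pairs."""
    cca = CCA(n_components=n_components)
    s_proj, l_proj = cca.fit_transform(style_emb, ling_emb)
    corrs = [np.corrcoef(s_proj[:, i], l_proj[:, i])[0, 1]
             for i in range(n_components)]
    return float(np.mean(corrs))

# Hypothetical usage: real speech is expected to score higher than deepfakes.
# corr_real = mean_canonical_corr(style_real, ling_real)
# corr_fake = mean_canonical_corr(style_fake, ling_fake)
```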
The authors conducted sufficient experiments to demonstrate the generalization ability of the proposed method, using two in-domain and two out-of-domain evaluation datasets. Moreover, the method outperforms several SOTA baselines on these datasets, enhancing the reliability of this work. The authors also provide adequate analysis and visualization to support their hypothesis.
Overall, it is a good paper for the generalizable audio deepfake detection task.
Strengths
- Explicitly explores the dependency between style and linguistics for audio deepfake detection.
- Pays attention to interpretable and generalizable audio deepfake detection simultaneously, which is novel to the community.
- Good performance is shown in the experiments, especially the significant improvement on the out-of-domain datasets, which is crucial for anti-spoofing tasks.
Weaknesses
- The first training stage adopts an idea from anomaly detection, in which only real audio (normal data) is used for training. However, real normal data in the wild is orders of magnitude larger and more varied than the data used in the experiments; these corner cases should be considered in the work.
Questions
I am curious whether the authors applied their method to other unseen attack scenarios. For example, in [1], the authors propose a new unseen scenario in which some audio genres in the evaluation dataset are not present in the training dataset. Results on this scenario would make the proposed method more convincing.
[1] Zeng, Chang, et al. "Improving Generalization Ability of Countermeasures for New Mismatch Scenario by Combining Multiple Advanced Regularization Terms." arXiv preprint arXiv:2305.10940 (2023).
Limitations
Yes.
Thank you for the time spent reviewing our manuscript and pointing us to another dataset that we believe would be a great fit for further evaluation of the proposed method. Below is our point-by-point response:
- “magnitude of stage 1 training data”
We acknowledge that our current approach is limited to available real data. As listed in Section 2, the datasets for Stage 1 are currently limited to CommonVoice and RAVDESS. As a follow-up for more robustness, we plan to leverage larger pre-training datasets, for example, those used in learning self-supervised speech representations. We will clarify this in the Limitations section.
- “applying SLIM to other unseen attack scenarios”
Thank you for the reference; this is indeed an interesting setup, and we will extend our experiments accordingly in the future. However, our current scope of “unseen” attacks is limited to English datasets and the VC/TTS systems in public evaluation datasets. (Please also see our response to Reviewer 4aqZ regarding our problem scope.) Since the recommended dataset is in Chinese, our model would need to be retrained on Chinese data to perform the evaluation. In the future, we plan to extend our investigation to more recent TTS/VC methods as well as more varied genres and languages.
Thanks for your response. I am keeping my score as is.
The paper suggests a novel method for detecting synthesized speech. Namely, the framework introduced in the paper allows the detection of a statistically significant mismatch between the style (i.e., paralinguistic attributes) and linguistic characteristics of synthesized speech samples, which helps to differentiate them from real speech samples. The framework is based on frozen pre-trained SSL encoders and relatively small learnable parts, which makes the experiments computationally feasible for a small cluster.
The authors compare their method with several baselines, and it outperforms the SOTA model in a cross-domain setup.
Strengths
- The idea of the proposed method is non-trivial and allows us to better understand the differences between real and synthesized speech.
- The proposed method outperforms SOTA on out-of-domain data and is on par with SOTA on in-domain data.
- The evaluation is good, and the method is compared with a number of very decent baselines.
- The paper is well-written and easy to follow.
Weaknesses
- The Analysis section is somewhat limited (especially the "Interpretation of model decisions" part). The authors claim that the success of their method is connected with very particular artifacts in synthesized speech; however, this point is weakly supported since only four small cherry-picked correctly classified examples were provided (see the "Questions" part).
Questions
Questions:
- Is the confidence of your detector connected with the severity of the artifacts in the synthesized speech samples? Do the mel spectrograms of the most confidently correctly detected synthesized samples contain the most clearly visible artifacts?
- Which types of TTS models are easier to detect, and which are harder to detect by your method?
- You provided examples of correctly classified speech samples. What about incorrectly classified ones? Can you share an explanation or guess what are the peculiarities of these "complicated" samples that prevent them from being correctly classified by your method?
- Do you plan to upload your code to GitHub?
Suggestions for the current paper:
- I suggest adding the visual analysis of mel spectrograms of the incorrectly classified speech samples to the paper.
Suggestions for future research:
- Your current research makes a step forward toward more explainable synthesized speech detection. Another step in the same direction was made in Topological Data Analysis for Speech Processing by Tulchinskii et al. (2023). They showed that synthetic and real speech samples can be separated using the barcodes of the attention maps of the HuBERT model: the sum of the bars in the barcode for real speech is larger than for synthesized speech, at least on some attention heads of HuBERT. It would be interesting to investigate in more detail how the topological characteristics (i.e., barcodes or other descriptors) of speech embeddings are connected with the linguistic and paralinguistic properties of the speech.
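For readers unfamiliar with this statistic, below is an illustrative sketch (not the exact pipeline of Tulchinskii et al.) of summing bar lengths in a persistence barcode derived from an attention map. Converting attention weights into a pseudo-distance matrix is an assumption made here for illustration; `ripser` is one common TDA library.

```python
import numpy as np
from ripser import ripser

def barcode_bar_sum(attn: np.ndarray, maxdim: int = 0) -> float:
    """Sum of finite bar lengths in the persistence barcode of one
    attention head's (T x T) attention matrix."""
    sym = (attn + attn.T) / 2.0                # symmetrize the attention map
    dist = 1.0 - sym / sym.max()               # high attention -> small distance
    np.fill_diagonal(dist, 0.0)
    dgms = ripser(dist, maxdim=maxdim, distance_matrix=True)["dgms"]
    total = 0.0
    for dgm in dgms:
        finite = dgm[np.isfinite(dgm[:, 1])]   # drop the infinite H0 bar
        total += float(np.sum(finite[:, 1] - finite[:, 0]))
    return total
```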
Limitations
The authors addressed the limitations adequately.
Thank you for the time spent reviewing our manuscript and for finding our work innovative. A point-by-point response to your questions can be seen as follows.
- “Is the confidence of your detector connected with the severity of the artifacts in the synthesized speech samples? Do the mel spectrograms of the most confidently correctly detected synthesized samples contain the most clearly visible artifacts?”
We did notice that low-quality samples (i.e., those with NISQA MOS < 2) were labeled as deepfakes with high confidence by SLIM. These samples were also correctly classified with high confidence by the baseline models compared in Table 2. For samples that did not contain a significant amount of noise, we did not observe a significant correlation between detector confidence scores and the severity of artifacts. Nor did we see clearly separable patterns in the mel-spectrograms when comparing the most confident correctly detected samples with less confident ones. While mel-spectrograms are useful as a supplementary tool for studying samples, in our experience they do not fully reveal all the deepfake artifacts that are important for a model to make a decision.
We would also like to point out that the actual cues that give away a deepfake remain an open question. This is one of the reasons why recent works have begun focusing on “interpretation-driven detection,” e.g., formant analysis to report deepfakes. In our case, the “interpretation” is incorporated into our model design, where the distance between pairs of style-linguistics dependency features can be directly used to quantify the mismatch (Figure 2, Page 8). We also show that the dependency features are complementary to features that focus on deepfake artifacts. Such complementarity can be seen from Table 2 - SLIM variants (Page 7), where models using only SSL features perform better on ASVspoof2021, whereas models using dependency features outperform on MLAAD. Fusion of the two yields the best performance.
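To make the distance-based interpretation concrete, here is a hypothetical sketch of a mismatch score computed from a pair of dependency features; the tensor names are placeholders, and cosine distance is one plausible choice rather than the paper's exact metric.

```python
import torch
import torch.nn.functional as F

def mismatch_score(dep_style: torch.Tensor, dep_ling: torch.Tensor) -> torch.Tensor:
    """Cosine distance between the style-side and linguistics-side dependency
    features of each utterance; larger values indicate a stronger mismatch,
    which is expected for deepfake speech."""
    return 1.0 - F.cosine_similarity(dep_style, dep_ling, dim=-1)

# Hypothetical usage with (batch, dim) feature tensors:
# scores = mismatch_score(dep_style_batch, dep_ling_batch)
```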
- “Which types of TTS models are easier to detect, and which are harder to detect by your method?”
As some details of the TTS models used in the employed datasets are not known, we performed an analysis on the recent ASVspoof5 dataset and present a breakdown of model performance for different attacks and codecs (Tables 1 and 2 in the rebuttal PDF). In general, degradation is seen when codecs with lower bit rates are applied. We found that systems with zero-shot capability are harder to detect than other methods (e.g., YourTTS).
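For reference, such a per-attack breakdown can be produced with the standard equal error rate (EER) computation over score subsets; below is a minimal sketch, with `labels`, `scores`, and `attack_ids` as hypothetical arrays.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels: np.ndarray, scores: np.ndarray) -> float:
    """EER: the operating point where the false-acceptance and
    false-rejection rates are equal (labels: 1 = deepfake, 0 = bona fide)."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = int(np.argmin(np.abs(fpr - fnr)))
    return float((fpr[idx] + fnr[idx]) / 2.0)

# Hypothetical per-attack breakdown: score each attack against all bona fide trials.
# for attack in np.unique(attack_ids[labels == 1]):
#     mask = (attack_ids == attack) | (labels == 0)
#     print(attack, equal_error_rate(labels[mask], scores[mask]))
```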
- “You provided examples of correctly classified speech samples. What about incorrectly classified ones? Can you share an explanation or guess what are the peculiarities of these "complicated" samples that prevent them from being correctly classified by your method?”
We will add incorrectly classified samples to the paper. In general, we observed that severely degraded samples (e.g., audio that is too short or too noisy) were commonly misclassified. This could be due to the design of our model, which by nature may require a longer duration to capture the style-linguistics mismatch. This observation also aligns with the difference between style/linguistics-only features and dependency features in Table 2, where the former perform better on ASVspoof2021 and the latter perform better on MLAAD.
- “Do you plan to upload your code to GitHub?”
Since we are currently filing for IP, we do not plan to release the training code at this time. However, we provide details in the paper to facilitate easy implementation of our model. The Appendix includes a detailed description of the model architecture (Appendix A.3), training hyperparameters (Appendix A.6), the list of pre-training datasets (Appendix A.2), and PyTorch-style pseudocode of the training objective (Appendix A.4).
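To complement those details, here is a minimal PyTorch sketch of a stage-2 head of the kind described in the paper (a projection head over the concatenation of SSL and dependency features). The class name, layer sizes, and structure are assumptions for illustration, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Hypothetical stage-2 classifier: maps the concatenated SSL and
    style-linguistics dependency features to a single real/fake logit."""

    def __init__(self, ssl_dim: int = 768, dep_dim: int = 256, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ssl_dim + dep_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, ssl_feat: torch.Tensor, dep_feat: torch.Tensor) -> torch.Tensor:
        # ssl_feat: (batch, ssl_dim), dep_feat: (batch, dep_dim)
        return self.net(torch.cat([ssl_feat, dep_feat], dim=-1))
```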
"SLIM: Style-Linguistics Mismatch Model for Generalized Audio Deepfake Detection" describes a motivation and systematic approach to disentangling different components of speaking characteristics, in order to perform audio deepfake detection. This paper demonstrates a working 2-stage training pipeline, numerous ablations, and metrics over several English language focused datasets. In addition, qualitative demonstrations of the space of features learned and used by the model form a study which backs the philosophical approach of the paper with regard to disentangling stylistic characteristics of real speakers, in order to have a generalized defense against audio deepfakes.
Strengths
The overall approach taken here is useful, the problem is relevant, and the overall experiments cover a portion of the necessary ground to match the claims.
Qualitative studies, both in the main body of the paper and in the appendix, are generally interesting, and it is worth considering whether these experiments can be fit into the unified paper "flow" for this publication, especially given the writing of the final conclusion. The overall ablation study and description of methods are well done and form a high-quality "core" to the paper. Architecture figures and ablation tables are both outstanding in terms of clearly communicating the approach and the variations in the results. Use of open-source toolkits and sharing of hyperparameters should help reproducibility.
Weaknesses
The portion of the title "... Generalized Audio Deepfake Detection" claims a robust and general method for "audio deepfake detection". However, there is a broad range of prosodic styles from people with impaired speech, new learners of languages, children developing their ability to communicate, and so on.
For a paper about generalized deepfake detection, where such detection keys particularly on the prosody of at least one example, I would like to see a larger study and example of voices in-the-wild, as opposed to the current examples which seem to be reasonably fluent speakers and performers, who may have "trained" speech patterns by and large. This is despite the use of the "in-the-wild" dataset, which doesn't seem truly wild in terms of robustness testing. Defense approaches should have some example study and discussion of False Positives beyond pure metrics (though the metrics discussions here are well done) - particularly when out-of-domain speaking patterns may have heavy overlap with "deepfake" data and the features used for classification, demonstrated as part of the paper. The examples in Figure 4. along with accompanied writing are a start down this road but not sufficient.
Mozilla Common Voice has a decent amount of this type of truly-in-the-wild speech for some qualitative study, and there are existing papers which use the same dataset for few- and zero-shot TTS and voice conversion. The "in-the-wild" dataset here seems to largely focus on imitative TTS and voice conversion, and their "real" counterparts, which would generally point to celebrities, politicians, and other public figures who (very likely) do not have the types of speech patterns mentioned previously. Though dysarthric speech is mentioned briefly in the limitations section, the issues which crop up in the study of dysarthric speech are also found to some extent in many "typical" speakers, in more subtle ways, so directly addressing this with some examples would strengthen the core claim of the paper with regard to "generalized detection".
MLAAD is multilingual, but some details of the dataset construction lead to limitations in its testing (outside the scope of this paper, beyond the continued critique that broader datasets and synthetic generation methods are needed to test generalization). However, here only the EN subset appears to be used, which again reduces the claims from the title since it means the bulk of testing is on English locales. This is not a problem in terms of the experiments, but the writing and claims of the paper should be limited around this fact. Additionally, these are speech deepfakes, not the broader category of "audio" per se, so some further adjustment may be warranted, though other papers in this subarea tend to use "audio deepfake" to describe speech deepfakes.
As it stands, the examples shown do not convince me that the "attacks" used here are sufficiently high-quality to claim a generalized defense, though the developed method seems to perform well on the datasets used, and the overall scientific study (though limited) is well done.
Questions
What are the systems tested in Table 1, either by name or citation? What is the source of the speakers? If these have PII, a description of the speakers' broad categorizations is sufficient. If pulled from an existing dataset, speaker IDs would be good. As it stands, this table is largely uninformative, without any material information beyond a general design motivation for follow-on work (since CCA shows some behavior differences between methods).
Given the importance of both sample rate and noise in audio, it would be very useful to test this approach under those forms of degradation. For example, does the method scale down to data of narrow bandwidth, at a low sample rate, or under the presence of additive noise / background sound (such as music, crowd noise, applause, and so on)? The prosodic example may hold under reasonable conditions, but how many detections are relying on prosodic features versus simpler acoustic artifacts? Figure 2 hints at this to some extent, but some explicit description and study would be useful.
Generally, the data examples shown are extremely noisy, and the synthesis methods are not particularly high quality. Testing on both clean audio and higher-quality synthesis, as well as under controlled degradations, could raise my score. After all, it is plausible an attacker may use telephony as a transmission channel, especially if the degradations imposed by the channel give the attacker a further advantage.
As a general direction - it may be useful to directly answer some of the questions posed by the titles of the citations in this paper e.g. "Does audio deepfake detection generalize?" - the claim here being "yes", but demonstrations being limited to existing datasets rather than further tests with recently developed technologies / APIs and so on. "Does deepfake detection rely on artifacts?" - the claim here is also (somewhat) "yes", which hurts the counterclaim of being generalized to some degree, unless these artifacts are general across a broad swath of methods, which would be a surprising finding given existing demonstrations.
The primary concern in order to raise my score would be a more proper scoping of the generalization claims, and the domain claims around this method given limitations of the testing datasets. The conclusion also discusses a fair bit about qualitative analyses which are largely relegated to the appendix, so there is further mismatch between the chosen title and the final claim.
Larger and more diverse datasets (multi-lingual being one option, more unusual speaker styles would be another), or more particularly use of a variety of recent, high performing methods would raise my score if the writing is mostly unchanged. Some of these methods may only be available by API, which is unfortunate but perhaps necessary - additionally TortoiseTTS and spinoffs should have specific, stronger synthesis exemplars than those demonstrated, especially under the assumption an attacker may be doing manual selection given a corpus of intermediate generations to choose the best final result.
Limitations
The authors have addressed some limitations of their work, however this review is partly hinging on the gap between claims, and the effective limitations and demonstrated results. More writing on the limitations, and particularly potential harms of deploying unbalanced "defense" methods in terms of accessibility would be beneficial.
Thank you for your time spent reviewing our work and for sharing your detailed comments which helped us to revise our work. A point-by-point response to the posted questions can be seen below:
- “What are the systems tested in Table 1?...”
As the samples referred to in Table 1 are part of the ASV2019 training data, detailed PII is not available. We agree that detailed information on the generative models would be useful. Considering that each deepfake attack in ASV2019 has its own synthesis pipeline, and the relevant details are already summarized in the ASV2019 summary paper [1], we will add a reference to the appropriate section and guide readers to the summary paper for details.
- “..., it would be very useful to test this approach under those forms of degradation.”
Based on the first part of the question on robustness to degraded speech, we performed an evaluation of SLIM on 12 different codecs within the more recent ASVspoof5 dataset (released in July 2024, two months after our initial submission to NeurIPS). The results are summarized in Table 1 in the global rebuttal PDF. Note that ASVspoof5 includes the Opus codec, which is used in telephony systems. In addition, we point out that ASVspoof2021, one of our test datasets, includes different types of lossy codecs with varied bitrates, typically used for media storage.
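For readers who wish to reproduce this kind of robustness check, a simple controlled-degradation harness (band-limiting plus additive noise at a target SNR) can be assembled with torchaudio; the parameters below are illustrative assumptions, not the ASVspoof5 or ASVspoof2021 codec configurations.

```python
import torch
import torchaudio

def degrade(wave: torch.Tensor, sr: int,
            low_sr: int = 8000, snr_db: float = 10.0) -> torch.Tensor:
    """Band-limit a waveform by resampling down and back up, then add
    white noise at the requested SNR (a crude stand-in for channel effects)."""
    down = torchaudio.transforms.Resample(sr, low_sr)(wave)
    wave = torchaudio.transforms.Resample(low_sr, sr)(down)
    noise = torch.randn_like(wave)
    sig_pow = wave.pow(2).mean()
    noise_pow = noise.pow(2).mean()
    scale = torch.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
    return wave + scale * noise
```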
Regarding the second half of the question on prosodic features vs. simpler acoustic artifacts, we agree that separating the two categories of speech samples would definitely help in identifying the true source of the deepfake cues. However, when we performed listening tests and spectrogram visualizations on the deepfake samples, there were many cases where a sample manifested a combination of artifacts and style-linguistics mismatch patterns. It was therefore challenging to divide samples into two distinct categories and test them separately. To gain some insight, we performed an ablation on SLIM (Table 2 - SLIM variants), where we experimented with using only the SSL features (rows 1-3 under "SLIM variants", corresponding to the artifact cases), only the dependency features (row 4, corresponding to the mismatch cases), and the combination of the two (row 5, corresponding to leveraging both sources). The difference in performance can be used to gauge the question of the actual source of the deepfake samples. Our results show that the mismatch cases could constitute a smaller portion of ASV2021, since the dependency-alone performance was much worse than using SSL features. For the In-the-wild and MLAAD datasets, the performance of dependency features is on par with, if not better than, SSL features, indicating that these two datasets contain more mismatch cases.
- “Testing on both clean audio, and higher quality synthesis, as well as under controlled degradations could raise my score.”
We evaluated our model on the ASVspoof5 dataset released in July 2024, where the most recent generative models were used together with different codec degradations. We provide a breakdown of the model performance with regard to different types of unseen attacks, as well as unseen codecs. These results can be found in Table 1 and Table 2 in the rebuttal PDF.
- “The primary concern in order to raise my score would be a more proper scoping of the generalization claims, ...”
We agree that our current system only operates on English data and cannot yet handle all prosodic styles (as briefly discussed in Limitations). In the introduction (Page 1, line 30), we specified that the current SOTA methods lack generalizability to unseen attacks, which we aim to tackle in this study. Following your suggestion, we will revise the manuscript in various parts, including the abstract, introduction, and limitations, to scope-limit our use of the term generalization to “unseen attacks”.
Audio deepfake detection (ADD) in the ADD community indeed most commonly refers to speech deepfakes; we will clarify this in our paper. Regarding multilinguality, our current approach leverages pretrained embeddings that were trained on tasks using English data (e.g., emotion recognition), so we are currently limited by the availability of language-specific, high-quality pre-trained embeddings.
Regarding evaluation on prosodic styles, we point out that our test sets cover a decent variety of speakers (e.g., 58 celebrities in In-the-wild, and 100+ speakers in total across all test datasets). Performance across diverse test sets indicates our model's ability to do well on a variety of speaking styles. However, we acknowledge that an extensive evaluation on special prosodic styles, such as pathological speech, children's speech, or new learners of a language, has not been included in our study due to small sample sizes, limited variability of speech content, and noise issues (e.g., [2], [3]). We will acknowledge this limitation in the paper.
- “Larger and more diverse datasets (multi-lingual being one option, more unusual speaker styles would be another), or more particularly use of a variety of recent, high performing methods would raise my score if the writing is mostly unchanged.”
We performed an evaluation of SLIM on the ASVspoof5 data, where some of the recent mainstream platforms (e.g., TortoiseTTS) are used for generating the deepfake data. The results are reported in Table 1 and Table 2 in the rebuttal PDF.
References:
[1] X. Wang, et al., "ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech," Elsevier Computer Speech & Language Journal, vol. 64, 2020.
[2] Coppock, Harry, et al. "COVID-19 detection from audio: seven grains of salt." The Lancet Digital Health 3.9 (2021): e537-e538.
[3] Schu, Guilherme, et al. "On using the UA-Speech and TORGO databases to validate automatic dysarthric speech classification approaches." IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.
The additional analysis, study under various degradations, clarifications around data, and particularly addition of ASVspoof5 data cover a reasonable portion of my prior concerns. Combined with writing which more accurately scopes the claims made, this has strengthened the impact of the work and should be of interest to a larger audience, now and into the future.
This paper proposes a new method for audio deepfake detection that first employs self-supervised pre-training on real samples only and then performs real/fake classification. The proposed method achieves SOTA performance in both within-domain and cross-domain scenarios.
Strengths
- The proposed technique is sound and reasonable. Learning the correlation between style and linguistics for detecting deepfakes makes sense to me.
- The comprehensive ablation studies in Table 2 further verify the effectiveness of the proposed method.
- The analysis (Interpretation of model decisions) and visualization (Mel-spectrograms) are reasonable.
Weaknesses
- The idea of capturing the mismatch between style and linguistics is promising, but it is unclear how this mismatch correlates with deepfake samples. More intuitive illustrations and examples are needed to better validate this motivation.
- In Figure 2 (training framework for ADD), the source of supervision for style and linguistics is not apparent. How do you ensure that each encoder learns the corresponding features? Additionally, how do you achieve perfect disentanglement between the style and linguistics encoders?
- It appears that the latest dataset used is ASVspoof2021, which is quite old. Why not incorporate more recent and advanced deepfake datasets for evaluation?
Questions
See the Weaknesses part.
Limitations
I did not see any obvious limitations for this work. It is a fairly good paper but not very impressive to me.
Thank you for your time reviewing our manuscript and for acknowledging our contribution to the field. We have provided a point-by-point response as follows:
- “The idea of capturing the mismatch between style and linguistics is promising, but it's unclear how this mismatch correlates with deepfake samples.”
As the idea of style-linguistics mismatch has not been systematically investigated before, we first referred to existing linguistic studies that show concrete examples of how these two aspects are correlated in real speech (Section 2.2, Page 3, lines 111-113), e.g., the impact of emotional state on word choice. Given that mainstream TTS and VC methods model these two aspects independently, such subtle correlations in real speech may be missing in deepfakes. For example, VC systems swap the original voice identity with a new one without considering whether the new voice would match the word choices in style. To verify that the hypothesized mismatch does exist in deepfakes, we then provided preliminary CCA results (Section 3.1, Pages 3-4, lines 117-142; Table 1) showing a significantly higher correlation between the two aspects in real speech and a lower correlation in deepfake speech, which also aligns with the distance between the dependency features learned by SLIM (Page 8, Figure 2). Although it is challenging to exhaustively list all mismatch cases, we provided a spectrogram illustration in Figure 4 (top right), which shows a deepfake sample identified by SLIM with an abnormal rhythm of pauses when uttering certain words. We agree that more examples could benefit the understanding of the mismatch, which requires a more systematic and detailed investigation; we plan to pursue this in future analysis.
- “In Figure 2 (training framework for ADD), the source of supervision for style and linguistics is not apparent. How do you ensure that each encoder learns the corresponding features? Additionally, how do you achieve perfect disentanglement between the style and linguistics encoders?”
We acknowledge that a perfect disentanglement of the two aspects is a challenging task. However, based on existing works on how information propagates through self-supervised learning (SSL) model layers (references listed in Section 2.2, Page 3, lines 106-109), it is possible to obtain two representations, each carrying maximal information about one aspect while retaining minimal information about the other. In our work, we limited the entanglement by choosing and freezing pretrained embeddings fine-tuned for tasks that are likely independent of each other, i.e., ASR for linguistics and SER for style. To verify a satisfactory disentanglement of the adopted representations, we performed a correlation analysis (Appendix 1, Figure 5), where the average correlation between the two representations is close to 0. These results help ensure that the two input representations are maximally (if not perfectly) disentangled. Due to the page limit, we were not able to integrate these analysis results into the main text.
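As an illustration of the kind of check reported in Appendix 1, Figure 5, the average cross-correlation between the dimensions of two representations can be computed as below; the array names are hypothetical, and this is a sketch of one standard way to run such a check rather than the authors' exact procedure.

```python
import numpy as np

def mean_abs_cross_corr(style_emb: np.ndarray, ling_emb: np.ndarray) -> float:
    """Mean absolute Pearson correlation between every style dimension and
    every linguistics dimension; values near 0 suggest weak entanglement.
    Inputs are (n_utterances x dim) matrices."""
    d_style = style_emb.shape[1]
    full = np.corrcoef(style_emb.T, ling_emb.T)   # (d_s + d_l) x (d_s + d_l)
    cross = full[:d_style, d_style:]              # style rows vs. linguistics cols
    return float(np.mean(np.abs(cross)))
```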
- “It appears that the latest dataset used is ASVspoof2021, which is quite old. Why not incorporate more recent and advanced deepfake datasets for evaluation?”
We employed four test datasets, of which both In-the-wild and MLAAD are newer than ASVspoof2021. MLAAD was the latest at the time of writing; its most recent version was released in April 2024. While ASVspoof2021 is not the latest dataset, it has the advantage of covering a variety of attacks, summarized in Appendix 2, Table 3 (Page 16). The ASVspoof2021 dataset also includes different types of lossy codecs with varied bitrates, typically used for media storage, facilitating the evaluation of model robustness to codecs.
We thank the reviewers for their thoughtful comments and suggestions; we appreciate all reviewers’ positive feedback on our fundamental approach motivated by the style-linguistics mismatch modeling for deepfake speech detection, our experiments, and our overall paper presentation.
The reviewers’ common concerns are mainly on the scope limit of our model/claims and the extension to more diverse and multilingual data. We clarify that our current scope is limited to speech deepfake detection in English. The proposed model performs well on a variety of unseen deepfake attacks and varied types of signal degradation. We agree that it is important to extend our study to multiple languages as well as to more diverse speaking styles. Both these directions are currently on our research roadmap.
There were similar questions related to robustness to compression codecs (Reviewer 4aqZ), detailed analysis of performance under different / more recent generative models (Reviewer chQg and Reviewer 4aqZ), and how the model performs on more challenging datasets / datasets with more genres (Reviewer 4aqZ and Reviewer 1UM9). At the time of our initial submission, MLAAD was the most recent open-source dataset (version 3 released in April 2024), which we employed as one of the evaluation datasets; however, it did not incorporate different codecs. While ASVspoof2021 indeed includes codecs, its generative models were not state-of-the-art. Following the questions brought up by the reviewers, we performed an extra round of evaluation of SLIM on the ASVspoof5 dataset (released in July 2024, after our initial submission to NeurIPS), which has 10+ types of codecs and more recent TTS and VC systems. We report a breakdown of the evaluation results in the attached PDF file. Restricted by the constraints on training data in the ASVspoof5 challenge, we were not able to use the same Wav2vec-XLSR backbones and substituted them with WavLM-Base backbones. Apart from a few resulting changes in the training hyperparameters, the overall training strategy remains very much the same. To respect the anonymity rule, we confirm that the results in the PDF file were analyzed and created only for the rebuttal phase and do not overlap with any public information.
The following results can be found in the attached PDF file:
- Breakdown of SLIM’s performance under clean and 12 different codec conditions
- Breakdown of SLIM’s performance for 16 different unseen attacks
In general, we see that the proposed model generalizes well across different codecs and attack types in ASVspoof5.
Lengthy, detailed reviews and extensive discussion of a paper's content frequently result from drilling down into the weaknesses of the work and from disagreements between author and reviewer over the significance of its limitations. However, when such lengthy discussions are accompanied by Accept recommendations, they are strong evidence of reviewer enthusiasm.
The broad and deep advances of generative models motivate comparable attention to deepfake detection. Much of the discussion and limitations of the work concern what extensions can be done, specifically multilinguality, and how this approach could be evaluated on more realistic attacks. The reviewers (and this meta-reviewer) find that the authors have made a sufficient and significant effort to evaluate against the highest-quality public test sets. Of course, this is a dynamic environment, with new attacks being developed contemporaneously with defense techniques.
The central idea of using the style-linguistics signal to identify deepfakes is novel and effective. It does not solve every problem or identify every deepfake, but it is a worthwhile contribution to the topic, well described and evaluated.