PaperHub
Rating: 6.5 / 10 · Poster · 4 reviewers
Individual ratings: 6 / 7 / 5 / 8 (min 5, max 8, std 1.1)
Confidence: 4.8 · Correctness: 2.8 · Contribution: 2.8 · Presentation: 2.8
NeurIPS 2024

FINALLY: fast and universal speech enhancement with studio-like quality

Submitted: 2024-05-15 · Updated: 2024-11-06
TL;DR

2024 SoTA for speech enhancement

Abstract

Keywords
speech enhancement; generative models

Reviews and Discussion

Official Review
Rating: 6

The authors propose FINALLY, a speech enhancement algorithm based on GANs and the WavLM encoder. They evaluate different feature extractors and then show qualitative and quantitative results on speech enhancement.

Strengths

A major strength is the fidelity of the generated outputs. The results are impressive and are demonstrated across a variety of SNR levels, noise types, and accents. I was very impressed by the samples, and by the fact that they can be generated in a single pass. The generative nature of the model allows for a greater degree of restoration from low-quality signals, compared to discriminative/masking-based approaches (e.g., Demucs, Conv-TasNet, TF-GridNet) that can't restore the speech to this studio-like quality.

I also liked the rigorous comparison used to choose WavLM as the feature extractor in the network. There was proper consideration given to why you chose that encoder. However, a better choice would have been to compare WavLM against the other methods in the ablation study: WavLM does better on the clustering and SNR rules, but does that actually mean it does better as the network component?

Weaknesses

The main weakness is that the approach is very close to HiFi with only slight modifications. The other contributions stated in the paper are not as major as the authors claim. The first section, about sampling from the conditional distribution vs. taking the maximum, seems to miss the point a bit. The reason people use diffusion models, GANs, etc. for conditional sampling is that those models produce high-quality unconditional outputs and can often be used without modification, not necessarily because there is a desire to sample from the posterior distribution to allow multiple generations, for example. Therefore I don't think the analysis in Section 2 is such a big contribution.

A second major weakness is the experiment section. For speech enhancement, there are standard datasets and metrics people use to compare results. These include metrics like PSNR, PESQ, and STOI, and datasets like VCTK, WHAM!, and LibriSpeech. Using them would allow comparison against a greater number of recent methods (e.g., TF-GridNet, Demucs), which are not included.

Questions

Can you provide qualitative comparisons against the baselines like HiFi? The reviewers would appreciate hearing those as well as the generated outputs that you have already provided.

Limitations

Yes

Author Response

Dear Reviewer,

Firstly, we would like to thank you for your invaluable work. Below, we address your concerns about the paper.

W1. Diffusion Models and GANs for Conditional Sampling

We generally agree that generative models, such as GANs and diffusion models, are not necessarily used to allow multiple generations from posterior distributions but rather because of their high-quality results. However, we think that this consideration does not downplay our contribution. We provide a theoretical interpretation for the case when samples are generated using an LS-GAN generator. Theoretical interpretations are important for in-depth analysis and might have a significant impact on how the field evolves in the future. From the practical side, we argue that learning the whole posterior distribution might not be necessary for the speech enhancement problem, and therefore diffusion models might be solving an unnecessarily complex task. Our analysis reveals that GANs provide a natural remedy to regress directly toward the most probable speech reconstruction, and thus speech-enhancement GANs solve a simpler task with potentially fewer resources. Therefore, we believe that Section 2 constitutes a significant contribution to the field.
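
For reference, these are the standard LS-GAN objectives (following Mao et al.) that the discussion builds on; this is the generic formulation, not necessarily the exact conditional variant analyzed in the paper:

```latex
% Standard LS-GAN objectives (Mao et al.), with target labels b = c = 1, a = 0.
% The paper's analysis concerns the conditional case, where the generator and
% discriminator are additionally conditioned on the degraded input.
\min_D \;\tfrac{1}{2}\,\mathbb{E}_{x\sim p_{\mathrm{data}}}\big[(D(x)-1)^2\big]
      + \tfrac{1}{2}\,\mathbb{E}_{z\sim p_z}\big[D(G(z))^2\big],
\qquad
\min_G \;\tfrac{1}{2}\,\mathbb{E}_{z\sim p_z}\big[(D(G(z))-1)^2\big].
```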

W2. Standard Datasets and Metrics for Speech Enhancement

We would like to clarify the reasons behind our choice of datasets and metrics.

We chose the VoxCeleb and UNIVERSE validation sets because these data include several degradations at the same time, and the strongest baselines for universal speech enhancement, HiFi-GAN-2 and UNIVERSE, release results on this data. Other methods from the literature usually consider only one degradation (e.g., only noise or reverberation) or are significantly inferior to HiFi-GAN-2 and UNIVERSE. For instance, the popular VCTK-DEMAND speech enhancement dataset considers only additive noise as the distortion, and methods achieving good results on this dataset tend to generalize poorly to real data. Thus, comparison on this data is not particularly important from a practical point of view.

Our paper does not include similarity-based metrics such as PESQ and STOI for two reasons. First, since PESQ and STOI require ground truth reference audio, we are unable to compute them for real data such as VoxCeleb and LibriTTS, as there is no ground truth reference for such data. Second, a number of works have consistently reported low correlation of reference-based metrics with human perceptual judgment [1, 2, 3]. In particular, the study in [1] reports that no-reference metrics (including DNSMOS, which we reported in our work) correlate significantly better with human perception and therefore have higher relevance for objective comparison between methods. Furthermore, in our study, we report the MOS score, which directly reflects human judgments of restoration quality.
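
For concreteness, here is a minimal sketch of how reference-based metrics are typically computed (using the third-party `pesq` and `pystoi` Python packages; the file names and options are placeholders, not our evaluation code):

```python
# Minimal sketch: PESQ and STOI compare a degraded/enhanced signal against a
# time-aligned clean reference, which real recordings (e.g., VoxCeleb) lack.
# Assumes the third-party `soundfile`, `pesq`, and `pystoi` packages.
import soundfile as sf
from pesq import pesq      # ITU-T P.862 PESQ implementation
from pystoi import stoi    # short-time objective intelligibility

ref, fs = sf.read("clean_reference.wav")   # placeholder path: clean ground truth
deg, _ = sf.read("enhanced_output.wav")    # placeholder path: system output

print("PESQ (wideband):", pesq(fs, ref, deg, "wb"))   # 'wb' mode expects fs = 16 kHz
print("STOI:", stoi(ref, deg, fs, extended=False))
```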

However, we agree that outlining conventional metrics such as PESQ and STOI on a popular VCTK-DEMAND benchmark could facilitate comparison with prior work. One can find this comparison in the table below. Note that we will add these results in the appendix of the camera-ready paper.

Table 1. Comparison with baselines on VCTK-DEMAND dataset.

Model | MOS | UTMOS | WV-MOS | DNSMOS | PESQ | STOI | SI-SDR | WER
input | 3.18 ± 0.07 | 2.62 ± 0.16 | 2.99 ± 0.24 | 2.53 ± 0.10 | 1.98 ± 0.17 | 0.92 ± 0.01 | 8.4 ± 1.2 | 0.09 ± 0.03
DEMUCS | 3.95 ± 0.06 | 3.95 ± 0.05 | 4.37 ± 0.06 | 3.14 ± 0.04 | 3.04 ± 0.12 | 0.95 ± 0.01 | 18.5 ± 0.6 | 0.07 ± 0.03
HiFi++ | 4.08 ± 0.05 | 3.89 ± 0.06 | 4.36 ± 0.06 | 3.10 ± 0.04 | 2.90 ± 0.12 | 0.95 ± 0.01 | 17.9 ± 0.6 | 0.08 ± 0.03
HiFi-GAN-2 | 4.13 ± 0.05 | 3.99 ± 0.05 | 4.26 ± 0.05 | 3.12 ± 0.05 | 3.14 ± 0.12 | 0.95 ± 0.01 | 18.6 ± 0.6 | 0.07 ± 0.03
DB-AIAT | 4.22 ± 0.05 | 4.02 ± 0.05 | 4.38 ± 0.06 | 3.18 ± 0.04 | 3.26 ± 0.12 | 0.96 ± 0.01 | 19.3 ± 0.8 | 0.07 ± 0.03
FINALLY (16 kHz) | 4.41 ± 0.04 | 4.32 ± 0.02 | 4.87 ± 0.05 | 3.22 ± 0.04 | 2.94 ± 0.10 | 0.92 ± 0.01 | 4.6 ± 0.3 | 0.07 ± 0.03
FINALLY (48 kHz) | 4.66 ± 0.04 | 4.32 ± 0.02 | 4.87 ± 0.05 | 3.22 ± 0.04 | 2.94 ± 0.10 | 0.92 ± 0.01 | 4.6 ± 0.3 | 0.07 ± 0.03
Ground Truth (16 kHz) | 4.26 ± 0.05 | 4.07 ± 0.04 | 4.52 ± 0.04 | 3.16 ± 0.04 | - | - | - | -
Ground Truth (48 kHz) | 4.56 ± 0.03 | 4.07 ± 0.04 | 4.52 ± 0.04 | 3.16 ± 0.04 | - | - | - | -

As one can see, our model clearly outperforms all baselines in terms of the MOS metric and reference-free objective metrics, while reference-based metrics correlate poorly with human judgments. Note that our model has a slightly higher MOS than the ground truth, likely because our training data is of higher quality than the VCTK ground-truth samples. Additionally, we would like to point out that a comparison on real data with the mentioned DEMUCS model can be found in Table 2 of the paper.

Q1. Qualitative Comparisons Against the Baselines

We agree that the work could benefit from more qualitative results. We will provide more qualitative comparison results on the paper's web page upon acceptance, as we cannot modify supplementary materials during the rebuttal period according to NeurIPS rules.

[1] Manocha, Pranay, et al. "Audio Similarity is Unreliable as a Proxy for Audio Quality."

[2] Manjunath, T. "Limitations of perceptual evaluation of speech quality on VoIP systems."

[3] Andreev, Pavel, et al. (2022). "HiFi++: A Unified Framework for Bandwidth Extension and Speech Enhancement."

Official Review
Rating: 7

The paper proposes a GAN based method for universal speech enhancement (SE) demonstrating competitive experimental performance.

To justify the use of adversarial learning for SE, the authors provide a theoretical insight regarding the effectiveness of the mode-covering property of LSGAN for the SE task under ideal conditions. However, in practice several additional losses need to be employed for stable training and better performance.

The authors also propose a perceptual loss, LMOS, which combines an L1 loss between STFT magnitudes and an L2 loss between WavLM convolutional encoder features. The choice of WavLM-conv features for the perceptual loss is driven by two heuristic rules: 1) identical (content) speech sounds should be clustered together, and 2) adding noise should increase the distance from clean cluster centers in proportion to the SNR. Based on these rules, several models were evaluated, and WavLM-conv was finally chosen. Moreover, a Human Feedback (HF) loss is also utilized, provided by means of differentiable PESQ and UTMOS predictions.
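
For concreteness, a minimal sketch of a loss with this structure (an illustration of the description above, not the authors' exact implementation; the `wavlm_conv_encoder` callable, STFT settings, and weights are assumptions):

```python
import torch

def lmos_style_loss(pred, target, wavlm_conv_encoder, alpha=1.0, beta=1.0,
                    n_fft=1024, hop_length=256):
    """Illustrative LMOS-style loss: L1 between STFT magnitudes plus
    L2 between WavLM convolutional-encoder features (weights are assumed)."""
    window = torch.hann_window(n_fft, device=pred.device)
    mag_pred = torch.stft(pred, n_fft, hop_length, window=window,
                          return_complex=True).abs()
    mag_tgt = torch.stft(target, n_fft, hop_length, window=window,
                         return_complex=True).abs()
    spec_l1 = torch.nn.functional.l1_loss(mag_pred, mag_tgt)

    feat_pred = wavlm_conv_encoder(pred)     # features from WavLM's conv encoder
    feat_tgt = wavlm_conv_encoder(target)
    feat_l2 = torch.nn.functional.mse_loss(feat_pred, feat_tgt)

    return alpha * spec_l1 + beta * feat_l2
```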

For the model architecture, the authors extend the HiFi++ architecture, incorporating WavLM features along with SpectralUNet features as input to the Upsampler. Additionally, an upsampling WaveUNet is added, which increases the output sampling rate up to 48 kHz.

Strengths

  • Overall the paper is well written.
  • Strong speech enhancement performance.
  • Interesting theoretical insight showing the relevance of the mode-covering property of LSGAN for the speech enhancement task.
  • The proposed criteria for choosing the SSL feature for the perceptual loss are very interesting, and the results of the comparison with several SSL models are a very good contribution.
  • Architectural changes to the HiFi++ model are significant, as they enable the model to output 48 kHz speech and incorporate SSL features.

Weaknesses

The main weakness is in the experimental evaluations:

  • Table 2 (main experimental result):

    • Since the model architecture is based on HiFi++, a direct comparison is probably necessary. The ablation study covers it somewhat, but a comparison with a full-scale HiFi++ baseline in the main experiment would be more convincing.
    • There is no description of the model size/training data scale for the baseline methods, which makes it difficult to contextualize the results.
    • A self contained description of the evaluation dataset is missing.
  • Table 3 (ablation study):

    • The ablation study is missing the results for the 1st stage (pretraining setup, without adversarial loss); it is mentioned in Appendix E2, but the results are not in the table.

Questions

  • From the ablation study, the LMOS loss, model scaling, the Upsampler (to 48 kHz), and the HF losses show clear performance benefits; however, the advantage of incorporating the WavLM-enc is not clear, as it does not lead to a significant performance gain. Is the WavLM-enc necessary?
  • Since the goal is universal speech enhancement, it would be interesting to see the performance on more focused evaluations, such as bandwidth extension or reverberation individually.

Limitations

Limitations have been adequately addressed.

Author Response

Dear Reviewer,

Thank you very much for your high assessment of our work. We truly appreciate your valuable comments. Below, we address your concerns.

W1. Comparison with HiFi++ & Q1. Advantage of incorporating the WavLM Encoder

The original HiFi++ model was proposed for speech denoising and bandwidth extension applications separately. Therefore, a significant practical difference between our work and HiFi++ is that our model was trained to support a wide range of degradations emerging in practice, and thus our model is able to generalize to real-world data. Additionally, our model is trained with novel LMOS and HF losses, the effectiveness of which is validated by ablation studies. Therefore, our training framework is substantially different from that of HiFi++, and the importance of this difference is validated by practical observations.

However, you are absolutely correct that the generator architecture of our model is mostly based on HiFi++. The main differences in this regard are the introduction of the Upsample WaveUNet and the WavLM encoder. While the importance of the Upsample WaveUNet is clearly validated by ablation studies, the effect of the WavLM encoder appears to be somewhat marginal in terms of MOS score (although we must point out that objective metrics are considerably higher for the case with WavLM). To validate the importance of the WavLM encoder, we have conducted an additional ablation test on the UNIVERSE validation set, which contains more challenging cases than the VoxCeleb data. The results are provided in the table below.

Table 1. Ablation of WavLM encoder on UNIVERSE validation.

Variant | MOS | UTMOS | WV-MOS | DNSMOS | PhER
w/o WavLM | 3.49 ± 0.08 | 3.33 ± 0.18 | 3.80 ± 0.15 | 3.15 ± 0.09 | 0.27 ± 0.04
w/ WavLM | 3.75 ± 0.07 | 3.56 ± 0.20 | 3.99 ± 0.16 | 3.07 ± 0.08 | 0.21 ± 0.04

Thus, we conclude that the introduction of the WavLM encoder is very important to achieve high quality on more challenging data. We will provide these additional results in the camera-ready paper.

W2. Description of the Model Size/Training Data Scale for the Baseline Methods

We agree that for better contextualization of our result, detailed information on the model size and training data of the baseline models is needed. Below, we provide a table, which will be included in the camera-ready version:

Table 2. Comparison of resources with baselines.

Model | Training Data Scale (clean data) | Model Size (parameters) | RTF on V100 GPU
VoiceFixer | 44 hours (VCTK) | 112 M | 0.02
DEMUCS | 500 hours (DNS) | 61 M | 0.08
STORM | 200 hours (WSJ0 and VCTK) | 28 M | 1.05
BBED | 140 hours (WSJ0) | 65 M | 0.43
HiFi-GAN-2 | 5 hours (DAPS) | 34 M | 0.50
UNIVERSE | 1500 hours (private data) | 189 M | 0.50
FINALLY (ours) | 200 hours (LibriTTS-R and DAPS) | 454 M (including 358 M of WavLM) | 0.03

We note that, while our model has a larger number of parameters than the baselines, most of these parameters are used to process low-resolution features (e.g., the Transformer of WavLM operates on 320-times downsampled representations of the waveform, i.e., at 50 Hz). In contrast, models like HiFi-GAN-2 mostly operate at full waveform resolution (due to WaveNet). This allows our model to be more compute-efficient and thus have a much lower Real-Time Factor (RTF).
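
As a rough illustration of this point (a back-of-the-envelope sketch only: the frame rate follows from the 16 kHz input and 320x downsampling mentioned above, and the timing numbers are made-up values that simply instantiate the RTF definition):

```python
# WavLM's Transformer runs on 320x-downsampled features of 16 kHz audio:
sample_rate = 16_000
downsampling = 320
frame_rate = sample_rate / downsampling      # 50 frames per second

# Real-Time Factor (RTF) = processing time / audio duration;
# e.g., enhancing 10 s of audio in 0.3 s gives RTF = 0.03.
processing_time_s = 0.3
audio_duration_s = 10.0
rtf = processing_time_s / audio_duration_s

print(frame_rate, rtf)   # 50.0, 0.03
```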

W3. Self-contained Description of Evaluation Dataset

Since we took the evaluation data from prior work, we did not describe it in much detail. However, we agree that our paper would benefit from such a description. We will include it in the camera-ready paper. Please find the evaluation dataset details below.

VoxCeleb Data: 50 audio clips selected from VoxCeleb1 [1] to cover the Speech Transmission Index (STI) range of 0.75-0.99 uniformly and balanced across male and female speakers.

UNIVERSE Data: 100 audio clips randomly generated by the UNIVERSE [2] authors from clean utterances sampled from VCTK and Harvard sentences, together with noises/backgrounds from DEMAND and FSDnoisy18k. The data contains various artificially simulated distortions including band limiting, reverberation, codec, and transmission artifacts. Please refer to [2] for further details.

W4. The Results for the 1st Stage (Pretraining Setup)

Thank you very much for noticing this issue. Indeed, the table with these results is missing in the Appendix. We provide it below and will include it in the Appendix of the camera-ready paper.

Table 3. Results on VoxCeleb data after pretraining with different regression losses.

Loss | UTMOS | WV-MOS | DNSMOS
MS-STFT | 2.54 ± 0.10 | 2.77 ± 0.10 | 3.04 ± 0.05
RecLoss | 2.53 ± 0.10 | 2.77 ± 0.10 | 3.04 ± 0.05
LMOS | 3.44 ± 0.08 | 3.25 ± 0.03 | 3.57 ± 0.06
L1_Spec | failed to converge | - | -

Q2. More Focused Evaluation

We agree that additional results concerning individual degradations would be helpful. We will include qualitative examples on the project web page upon paper publication.

[1] Nagrani, Arsha, et al. "VoxCeleb: A Large-Scale Speaker Identification Dataset."

[2] Serrà, Joan, et al. "Universal speech enhancement with score-based diffusion."

Comment

Thank you for the detailed rebuttal and additional ablation experiments. Most of my concerns have been adequately addressed; however, I would still like to see HiFi++ in the main tables (Tab. 2 and 3), since the proposed work builds on top of it.

In the evaluation on VCTK-DEMAND (in the response to reviewer gE54), HiFi++ is included; will that table be included in the main paper? I think it will add value, and adding it to Tables 2 and 3 will also make the improvements stand out.

Comment

Dear reviewer,

Below we provide a table comparing our method with HiFi++.

Table 1. Comparison with HiFi++ on VoxCeleb data.

Model | UTMOS | WV-MOS | DNSMOS
Input | 2.72 ± 0.11 | 2.90 ± 0.16 | 2.72 ± 0.11
HiFi++ | 2.76 ± 0.13 | 2.68 ± 0.14 | 2.98 ± 0.07
FINALLY (ours) | 4.05 ± 0.07 | 3.98 ± 0.06 | 3.31 ± 0.04

As one can see, the performance of HiFi++ on real data is quite poor due to the reasons mentioned above. We agree that this comparison will make the proposed improvements stand out more clearly. Therefore, we will include the comparison with HiFi++ in the main paper, as well as results on the VCTK-DEMAND benchmark.

Comment

Thank you for your efforts, my concerns have been addressed, and I would keep my positive rating of the paper.

Official Review
Rating: 5

This paper proposes a universal speech enhancement model for real-world recording environments utilizing GANs, referred to as FINALLY. The authors theoretically analyze how using the LS-GAN loss leads to finding the point of maximum density within the conditional clean speech distribution. To stabilize the adversarial training process, a WavLM-based perceptual loss is integrated into the MS-STFT pipeline.

Strengths

  • The paper provides a meaningful analysis of LS-GAN loss in the context of speech enhancement.
  • The adoption of the WavLM neural network shows performance improvements.

Weaknesses

  • Although the paper focuses on real-world scenarios, it still needs results with objective metrics such as PESQ and STOI for a solid evaluation.
  • The three-stage training process of FINALLY is complex, and the performance gains over HiFi-GAN-2 do not justify the increased cost.
  • Reducing the model size by 90% for ablation studies is not convincing, as it can significantly affect the results.

Questions

  • I'm curious about why DNSMOS and other metrics trend differently in Table 3.
  • Could you clarify why the order of the metrics in Tables 2 and 3 is presented differently?

Limitations

  • Artifacts mentioned in the inference results can degrade perceptual quality.
  • The paper has a clear contribution, but the presentation and organization are lacking.
Author Response

Dear Reviewer,

Thank you for your time and consideration. We would like to address your concerns about the paper.

W1. It still needs objective metrics such as PESQ and STOI for a solid evaluation

Our paper does not include similarity-based metrics such as PESQ and STOI for two reasons. First, since PESQ and STOI require ground truth reference audio, we are unable to compute these metrics for real data such as VoxCeleb and LibriTTS, as there is no ground truth reference for such data. Second, there have been numerous works consistently reporting low correlation of reference-based metrics with human perceptual judgment [1, 2, 3]. In particular, study [1] reports that no-reference metrics (including DNSMOS, reported in our work) correlate significantly better with human perception and therefore have higher relevance for objective comparison between methods. Furthermore, in our study, we report the MOS score, which directly reflects human judgments of restoration quality. Therefore, we believe that our evaluation could be considered solid.

However, we agree that reporting conventional metrics could facilitate comparison with prior work. Therefore, we have measured these metrics on the popular denoising benchmark VCTK-DEMAND.

Table 1. Comparison with baselines on VCTK-DEMAND dataset with conventional metrics.

Model | MOS | UTMOS | WV-MOS | DNSMOS | PESQ | STOI | SI-SDR | WER
input | 3.18 ± 0.07 | 2.62 ± 0.16 | 2.99 ± 0.24 | 2.53 ± 0.10 | 1.98 ± 0.17 | 0.92 ± 0.01 | 8.4 ± 1.2 | 0.09 ± 0.03
MetricGAN+ | 3.75 ± 0.06 | 3.62 ± 0.09 | 3.89 ± 0.10 | 2.95 ± 0.05 | 3.14 ± 0.10 | 0.93 ± 0.01 | 8.6 ± 0.7 | 0.10 ± 0.04
DEMUCS | 3.95 ± 0.06 | 3.95 ± 0.05 | 4.37 ± 0.06 | 3.14 ± 0.04 | 3.04 ± 0.12 | 0.95 ± 0.01 | 18.5 ± 0.6 | 0.07 ± 0.03
HiFi++ | 4.08 ± 0.05 | 3.89 ± 0.06 | 4.36 ± 0.06 | 3.10 ± 0.04 | 2.90 ± 0.12 | 0.95 ± 0.01 | 17.9 ± 0.6 | 0.08 ± 0.03
HiFi-GAN-2 | 4.13 ± 0.05 | 3.99 ± 0.05 | 4.26 ± 0.05 | 3.12 ± 0.05 | 3.14 ± 0.12 | 0.95 ± 0.01 | 18.6 ± 0.6 | 0.07 ± 0.03
DB-AIAT | 4.22 ± 0.05 | 4.02 ± 0.05 | 4.38 ± 0.06 | 3.18 ± 0.04 | 3.26 ± 0.12 | 0.96 ± 0.01 | 19.3 ± 0.8 | 0.07 ± 0.03
FINALLY (16 kHz) | 4.41 ± 0.04 | 4.32 ± 0.02 | 4.87 ± 0.05 | 3.22 ± 0.04 | 2.94 ± 0.10 | 0.92 ± 0.01 | 4.6 ± 0.3 | 0.07 ± 0.03
FINALLY (48 kHz) | 4.66 ± 0.04 | 4.32 ± 0.02 | 4.87 ± 0.05 | 3.22 ± 0.04 | 2.94 ± 0.10 | 0.92 ± 0.01 | 4.6 ± 0.3 | 0.07 ± 0.03
Ground Truth (16 kHz) | 4.26 ± 0.05 | 4.07 ± 0.04 | 4.52 ± 0.04 | 3.16 ± 0.04 | - | - | - | -
Ground Truth (48 kHz) | 4.56 ± 0.03 | 4.07 ± 0.04 | 4.52 ± 0.04 | 3.16 ± 0.04 | - | - | - | -

As one can see, our model clearly outperforms all baselines in terms of the MOS metric and reference-free objective metrics, while reference-based metrics correlate poorly with human judgments. Note that our model has slightly higher MOS than the ground truth due to the fact that our training data is of higher quality than VCTK ground truth samples.

W2. The three-stage training process of FINALLY is complex, and the performance gains over HiFi-GAN-2 do not justify the increased cost

We agree that our training process is slightly more complex than that of HiFi-GAN-2. However, although intricate, our training pipeline is not considerably more complex: HiFi-GAN-2 (similarly to our model) has three stages: 1) acoustic feature prediction network, 2) WaveNet training, and 3) adversarial training. Furthermore, our final model is more than 10 times faster (0.03 RTF compared to 0.5 RTF for HiFi-GAN-2). Therefore, we believe that the complexity of the training process is well justified by the dramatic increase in the efficiency of the resulting model.

W3. Reducing the model size by 90% for ablation studies is not convincing, as it can significantly affect the results.

We follow a well-established practice in deep learning literature of conducting ablation studies on a smaller scale in order to reduce the costs of training, as many leading papers in the field have done (e.g., [4]). While this procedure may influence the results, practical considerations in a resource-intensive field such as ours remain significant. Therefore, we use a smaller model for some parts of the ablation study.

Q1. Why DNSMOS and other metrics trend differently in Table 3.

Due to the imperfections of different objective metrics, they can trend differently as some of them pay more attention to certain artifacts than others. Please consider taking into account relevant papers [1, 2, 3] in order to understand the issues with objective metrics for speech quality in more detail.

Q2. Order of metrics in Tables 2 and 3.

Thank you very much for noticing this. We will rearrange the columns to have a consistent order in these tables to improve the clarity of presentation in the camera-ready version.

[1] Manocha, Pranay, et al. "Audio Similarity is Unreliable as a Proxy for Audio Quality."

[2] Manjunath, T. "Limitations of perceptual evaluation of speech quality on VoIP systems."

[3] Andreev, Pavel, et al. "HiFi++: A Unified Framework for Bandwidth Extension and Speech Enhancement."

[4] Karras, Tero, et al. "Analyzing and improving the training dynamics of diffusion models."

Comment

Thank you for the detailed rebuttal, and particularly for providing the additional results using objective metrics. The rebuttal addressed most of my concerns.

Based on the description in the rebuttal, it seems that no additional training was performed with the VCTK-DEMAND dataset. While the performance on objective metrics falls short of that achieved by HiFi-GAN-2, the improvement in MOS scores is indeed impressive. However, after reviewing the demo samples provided, I was unable to clearly perceive an improvement over HiFi-GAN-2. Still, the significant increase in speed of the proposed algorithm compared to HiFi-GAN-2 is noteworthy.

Moreover, considering that few models have successfully addressed multiple distortions with one pass, I believe this work holds substantial value. Therefore, I will raise my score by one level.

Official Review
Rating: 8

This paper describes a new formulation of GAN-based speech enhancement. It includes an analysis of the convexity, in different feature spaces, of the distribution of TTS utterances generated from the same inputs, concluding that WavLM's convolutional encoders provide the most convex such space. This representation is then incorporated into the input of HiFi++, and a separate, unrelated upsampling stage is added at the end.

On the VoxCeleb real-data validation set and the validation data of UNIVERSE, the proposed approach outperforms other strong baselines both in quality (MOS from a subjective listening test and other objective metrics) and in real-time factor. For example, on VoxCeleb, the proposed system is rated at 4.63 MOS compared to 4.47 for the second-best system, HiFi-GAN-2, while being ~15x faster. For systems with comparable RTFs, the difference in MOS is 4.63 vs. 3.79 for DEMUCS.

Strengths

Significance:

  • The argument for the relevance of GANs to speech enhancement is thoughtful, interesting, and convincing. It brings clarity on a point that I did not previously appreciate.
  • Analysis of the different feature spaces in terms of their convexity is another valuable and interesting contribution. It provides clearly actionable findings that are shown in the experiments to make a meaningful difference to system performance. This analysis can be applied more broadly to compare different SSL representations for various tasks.

System performance:

  • Improving both the performance and the efficiency of a speech enhancement system is quite valuable and these differences appear to be large compared with strong baselines.
  • Listening to the provided audio examples in the supplementary material shows impressive performance, especially in comparison to the hallucinations of UNIVERSE in low-SNR instances, although it is not clear how these particular examples were selected.

Clarity:

  • The paper is well written and easy to follow. The figures are helpful in understanding it. The description of the loss and different training stages is particularly clear and helpful.
  • The ablations are thorough and informative and show clear benefits to each of the stages of the model/training.

Weaknesses

One point that could be added to the discussion of previous work is that of Maiti and Mandel (2019) and related work, which introduced the idea of speech enhancement by synthesis of a clean utterance that contains the same content as the original utterance.

The evaluation was conducted on a crowdsourcing platform without IRB review. This should be reviewed by ethics reviewers.

S Maiti and M Mandel (2019). Parametric resynthesis with neural vocoders. Proc. IEEE WASPAA.

Minor comments:

  • Line 187 states, "we report MOS score" but it does not state where this MOS score comes from. Please describe the experiment/measurement that generated the MOS scores. Presumably it was the human listening test described in the appendix, but this is not clear at this point in the manuscript.
  • Equation (2): please define phi
  • Line 299 calls losses based on UTMOS and PESQ "human feedback" losses, but since these are algorithms predicting human feedback, I don't think their outputs can be called human feedback itself.

Questions

N/A

Limitations

Limitations of the approach are discussed in the appendix, section D.4. The limitation of this approach being non-streaming is mentioned there and is important to highlight.

Author Response

Dear Reviewer,

We are very grateful for the high assessment of our work and your valuable suggestions. Below, we address your concerns.

W1. Work by Maiti and Mandel (2019)

Thank you very much for pointing out this work. We will add a discussion of it to the related work section in the camera-ready version.

W2. Minor comments.

  • “Presumably it was the human listening test described in the appendix, but this is not clear at this point in the manuscript.” – You are correct. We will add a reference to the relevant appendix section.
  • “Equation (2): please define phi.” – Phi denotes the WavLM-Conv feature mapping. We will add this clarification in the camera-ready version.
  • “Since these are algorithms predicting human feedback, I don't think their outputs can be called human feedback itself.” – We can call these losses “predicted human feedback losses” instead of “human feedback losses.” We believe this clarification will indeed improve clarity, and therefore we will use this naming in the camera-ready version.
Comment

I would like to thank the authors for their rebuttal. I have read all of the reviews and the rebuttals and would like to keep my rating as-is. I do think that section 2 is a strong contribution to the literature and our understanding of the problem of speech enhancement and the utility of generative models for solving it.

Author Response

Dear Reviewers,

We would like to express our sincere gratitude for your thoughtful comments and suggestions. Your appreciation of our insights into GAN training and the analysis of SSL models’ feature spaces is truly encouraging. We have worked hard to address the questions and concerns you raised during the review process. Below, we have summarized our responses to the key concerns:

Use of Conventional Metrics and Datasets

In response to the reviewers' concerns about not using PESQ, STOI, and other classic metrics, we point out their relatively weak correlation with perceptual quality. We cite several papers where this issue has been discussed in detail. Moreover, we elaborate on the difficulty of applying such metrics to real-world data, which frequently lacks ground truth samples. Therefore, we reason that subjective metrics, like MOS, provide a more accurate and fair assessment of the method’s performance. Nevertheless, we concur that the paper could benefit from benchmarking our method against other baselines using classic metrics and well-established datasets. Therefore, we have included a table comparing our method with other baselines on the VCTK-DEMAND dataset.

Complexity of the Presented Method

We agree that the level of complexity of our method in comparison to other methods is a significant concern. We address this in two ways. First, we point out that comparable SOTA algorithms (e.g., HiFi-GAN-2) also involve complex training procedures, consisting of numerous stages that must be trained consecutively. Therefore, we believe that while our method is complex, it is nevertheless on par with other existing algorithms. Second, we provide clear evidence that our algorithm is considerably faster (~15x, as also noted by reviewers), while delivering comparable or better perceptual quality. We think that the benefits of increased inference speed and sound quality outweigh the possible training complexity issues.

Comparison with HiFi++ and Importance of WavLM Encoder

The generator architecture of our model is largely based on HiFi++, with the main differences being the introduction of the Upsample WaveUNet and the WavLM encoder. While the importance of the Upsample WaveUNet is clearly validated by our ablation study, the effect of the WavLM encoder appears to be somewhat marginal in terms of MOS score (although we must point out that objective metrics are considerably higher for the w/ WavLM case). To validate the importance of the WavLM encoder, we conducted additional ablation tests on the UNIVERSE validation set, which contains more challenging cases than the VoxCeleb data. The results clearly indicate the importance of the WavLM encoder in achieving high quality on the challenging UNIVERSE validation data.

Theoretical Analysis of Conditional Generation

We agree that the key advantage of modern generative models is their ability to generate high-fidelity objects. However, it is important to note that the choice of a generative model necessarily involves trade-offs. Diffusion models can generate realistic and diverse objects but do so slowly, whereas GANs are capable of generating realistic objects quickly in one forward pass, though they may lack diversity. In our paper, we argue that the lack of diversity is not an issue for speech enhancement and provide an analysis of why GANs are likely to sample the desirable mode. Our main argument is that sacrificing diversity is not problematic for speech enhancement, but sacrificing inference speed is. Thus, GANs might be better suited than diffusion models for the speech enhancement problem.

Once again, we would like to thank all the reviewers for their efforts and time.

Sincerely,

Authors

Final Decision

The paper proposed a GAN-based method for universal speech enhancement and demonstrated impressive results across different SNRs. As agreed among the reviewers, the theoretical discussion presented in the paper of the relevance of the mode-covering property of LSGAN for speech enhancement is very insightful. The detailed ablations and the additional results addressing reviewers' concerns, such as more standard SE metrics and comparisons with HiFi, provide solid evidence. The weaknesses pointed out by the reviewers are the similarity to HiFi, the complexity of the training procedure, and the non-streaming aspect of the approach. Given these, I hence recommend an accept.

Public Comment

Dear Authors,

We greatly enjoyed your paper, "FINALLY: Fast and Universal Speech Enhancement with Studio-like Quality". Our team at Inverse AI has been developing an open-source implementation of your model, now publicly available on GitHub: https://github.com/inverse-ai/FINALLY-Speech-Enhancement. This follows our previous efforts in open-sourcing models, such as converting PyTorch implementations of Hybrid Transformer Demucs to TensorFlow: https://github.com/inverseai/Demucs-and-HTDemucs-based-denoiser-with-Tensorflow.

While implementing FINALLY, we encountered some discrepancies and ambiguities, and we hope you can provide clarifications to help align our implementation with your reported results:

  1. Parameter Mismatch:
    The paper reports 454M parameters, including 358M from WavLM. Using WavLM-Large from the provided Hugging Face link, our implementation totals 363M (315M WavLM + 48M model). Model components: SpectralUNet (4.4M), ResBlock (7.1M), Conv (0.8M), HiFi (15.5M), WaveUNet (10.8M), SpectralMaskNet (5.5M), WaveUNet Upsampler (3.9M). Could you clarify this discrepancy?

  2. Training Data Details:
    The paper mentions Stages 1 and 2 use LibriTTS-R clean dataset, but it’s unclear whether these stages use identical or different subsets.

  3. Audio File Length:
    The typical duration or range of audio samples used for training is not specified.

  4. Batch Size:
    The paper mentions a batch size of 32, but it is unclear if this applies to all stages or varies.

  5. Positional Embedding:
    The type of positional embedding (sinusoidal or learned) is not specified.

  6. Training Iterations:
    The paper mentions 100k, 30k, and 40k iterations for Stages 1, 2, and 3. Our model continues converging beyond these. Did you observe similar behavior, or is there a recommended stopping criterion?

  7. LMOS Loss:
    Using mel spectrograms in LMOS loss avoids artifacts in Stages 1 and 2 but causes issues in Stage 3, while STFT introduces artifacts in Stages 1 and 2. Did you encounter similar trade-offs, and could you share insights?

Our setup (for reference):

  • Segments: 2s audio files
  • Batch: 18, 12, 4 for Stages 1, 2, 3
  • Dataset: LibriTTS-R clean in Stages 1 & 2, DAPS with DNS noise in Stage 3 (~200 hours)
  • GPU: 2× NVIDIA 5090
  • Positional Embedding: PyTorch 2D learnable
  • Loss: LMOS (tested with mel-spectrogram and STFT)
  • Optimizer: AdamW, warmup: 10k steps
  • Overall MOS: 2.95 on 10% of training data

We would greatly appreciate your guidance on these points. Clarifications will help improve our open-source implementation and benefit the broader research community.

Public Comment

Hello, Ibrahim! Thank you for your interest in our paper and for your effort to implement it.

  1. We have a 454 M-parameter model. More precisely:
  • spectralunet: 4.5M;

  • hifi: 47M; the standard v1 HiFi generator is ~14M, plus ~32M of pre-blocks processing the tensor before HiFi (which, by our fault, were not mentioned in the paper): [[[Conv1d(512, 512, kernel_size, dilation=dilation), Conv1d(512, 512, kernel_size)] for dilation in [1, 3, 5]] for kernel_size in [3, 7, 11]] (see the sketch after this list);

  • waveunet: 10.3M;

  • spectralmasknet: 16.7M;

  • wavlm_large: 315M;

  • post-WavLM processing block: 50M (7 × Conv1d(ch_in=1536, ch_out=1536, kernel=3, dilation=1) followed by 1 × Conv1d(ch_in=1536, ch_out=512, kernel=1)); the numbers differ here because the paper did not specify the depth of this block;

  • upsampler: 15M.

  2. In both stages, we randomly sampled from LibriTTS-R.

  3. We used audio samples with a length of 4.096 seconds.

  4. Batch size 32 was used for all stages.

  5. As positional embedding, in SpectralUNet we add a dimension with value = frequency_index / 513, so [B, F, T] -> [B, 1, F, T].

  6. The main reason for this choice was the time limit, so our models had not fully converged.

  7. In our experience, the STFT loss works better than the mel-spectrogram loss, and stages 2 and 3 help to remove the artifacts typical of the STFT loss. Could you specify what kind of artifacts you observe?
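
For readers following along, here is a minimal PyTorch sketch of the two extra blocks described in point 1 above (a sketch only: padding, activations, and normalization are not specified in the reply and are assumptions; the parameter counts roughly match the numbers given there):

```python
import torch.nn as nn

def make_pre_hifi_blocks():
    """Pre-blocks before the HiFi generator, as described above:
    for each kernel size in [3, 7, 11] and each dilation in [1, 3, 5],
    a pair of Conv1d(512, 512) layers (~32 M parameters per the reply).
    'Same' padding is an assumption to keep the time dimension."""
    layers = []
    for kernel_size in [3, 7, 11]:
        for dilation in [1, 3, 5]:
            layers += [
                nn.Conv1d(512, 512, kernel_size, dilation=dilation,
                          padding=(kernel_size - 1) // 2 * dilation),
                nn.Conv1d(512, 512, kernel_size,
                          padding=(kernel_size - 1) // 2),
            ]
    return nn.Sequential(*layers)

def make_post_wavlm_block():
    """Post-WavLM processing block, as described above:
    7 x Conv1d(1536, 1536, kernel=3) followed by Conv1d(1536, 512, kernel=1)
    (~50 M parameters)."""
    layers = [nn.Conv1d(1536, 1536, 3, padding=1) for _ in range(7)]
    layers.append(nn.Conv1d(1536, 512, 1))
    return nn.Sequential(*layers)

if __name__ == "__main__":
    pre = make_pre_hifi_blocks()
    post = make_post_wavlm_block()
    print(sum(p.numel() for p in pre.parameters()) / 1e6, "M")   # ~33 M
    print(sum(p.numel() for p in post.parameters()) / 1e6, "M")  # ~50 M
```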

Public Comment

To try our demo, please use new link: https://mmacosha.github.io/finally-demo/