PaperHub
Overall rating: 6.1 / 10 · Poster · 4 reviewers
Ratings: 3, 3, 3, 4 (min 3, max 4, std 0.4)
ICML 2025

DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis

OpenReview · PDF
Submitted: 2025-01-22 · Updated: 2025-07-24
TL;DR

DMOSpeech achieves highly efficient and accurate zero-shot speech synthesis by directly optimizing a distilled diffusion model based on objective quality metrics.

Abstract

Keywords

text-to-speech, zero-shot speech synthesis, diffusion model, diffusion distillation, metric optimization

Reviews and Discussion

Official Review
Rating: 3

This paper proposes DMOSpeech, a speech synthesis method. It utilizes Connectionist Temporal Classification (CTC) loss and Speaker Verification (SV) loss to realize direct optimization of diffusion-based models. It was evaluated using subjective and objective tests and outperforms previous methods in most metrics.

Questions for Authors

What are the differences between the proposed method and the previous method StyleTTS-ZS in the objective test in Table 3?

Claims and Evidence

The authors claim that the proposed method provides a direct pathway to realize end-to-end optimization of a diffusion-based synthesis model, and that performance becomes better with this optimization than without it. The subjective and objective evaluations support this claim.

Methods and Evaluation Criteria

The evaluation methods are sound. The authors conducted many subjective tests using human annotators, and the results were informative.

Theoretical Claims

The proposed method's theoretical claim is that it has direct gradient pathways to all the model components. I believe this is sound enough.

Experimental Design and Analysis

The authors perform subjective and objective tests. They are fine, but it is not clear how the results of the two are correlated. If their results are consistent, the costly subjective tests would be meaningful only for the ablation studies.

Supplementary Material

The details of CTC loss and SV loss are in the Supplementary Material, which I reviewed.

Relation to Prior Work

The novel point is the use of CTC loss and SV loss to realize direct optimization of a diffusion-based speech synthesis model. This point is somewhat limited to the domain of speech synthesis.

Missing Important References

Most of the references are OK. However, Table 3 does not report the result of the objective test of the previous study StyleTTS-ZS. Therefore, it is not clear how much improvement the proposed method achieves.

update after rebuttal

The comparison results with StyleTTS-ZS will be added, so there will be no problem on this point.

Other Strengths and Weaknesses

It seems the control parameters of the model (such as λ) and the learning parameters are difficult to choose. Ablation studies in this regard would be needed.

update after rebuttal

Thank you for the rebuttal comments. Now I believe there is no problem.

Other Comments or Suggestions

  1. The description in Section 3.2 should be improved. Most of the contents are not new, so it is unclear which parts are novel.
  2. The MOS values of various methods in Table 1 are almost the same as the ground truth. I am not sure whether the improvement obtained by the proposed method is meaningful enough.

update after rebuttal

Regarding 2), I agree with the authors' comments in the rebuttal. They should emphasize that improving MOS is not the only contribution of the manuscript.

Overall, I raised my score from WR to WA.

Author Response

Thank you for your thoughtful review of our paper. We appreciate your feedback and address your concerns below:

Correlation Between Subjective and Objective Metrics

We have indeed analyzed the correlation between subjective and objective metrics in our paper. As shown in Figure 3 and Figure 5 (appendix), the Pearson correlation between speaker embedding similarity (SIM) and human-rated voice similarity is 0.55, while the correlation with style similarity is 0.50. Similarly, word error rate (WER) correlates with naturalness and sound quality at -0.16 for both metrics. All correlations are statistically significant ($p \ll 0.01$), demonstrating that our optimized objective metrics strongly align with human perception even at the individual utterance level.
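For readers who want to reproduce this kind of check, a minimal sketch of an utterance-level Pearson correlation is shown below; the data arrays are synthetic placeholders, not the paper's evaluation data.

```python
# Minimal sketch of an utterance-level correlation check between an objective
# metric and human ratings; the arrays are synthetic placeholders.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_utts = 200                                    # number of rated utterances (placeholder)
sim = rng.uniform(0.4, 0.9, n_utts)             # per-utterance speaker similarity (SIM)
human_sim = sim + rng.normal(0.0, 0.1, n_utts)  # human-rated voice similarity (placeholder)

r, p = pearsonr(sim, human_sim)
print(f"Pearson r = {r:.2f}, p = {p:.2e}")
```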

Applicability Beyond Speech Synthesis

While our implementation focuses on speech synthesis, the core framework of enabling direct metric optimization through distillation can be applied to other generative domains. For example:

  • In music generation, differentiable models like instrument detection models or melody extraction models could optimize text-to-music alignment or ensure generated music matches specified instruments or melodic (MIDI input) constraints.

  • In image generation, differentiable models could maximize CLIP scores between the prompt text and generated image, or verify the presence of all text-described objects using image segmentation models.

  • In video generation, similar principles could ensure temporal consistency.

The key innovation is creating a direct gradient pathway that enables end-to-end optimization with any differentiable metric within the diffusion model frameworks, which has broad applications beyond speech synthesis.
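To make the notion of a direct gradient pathway concrete, here is a minimal PyTorch-style sketch of computing CTC and SV losses on a one-step generator's output; `generator`, `speaker_encoder`, and `ctc_head` are hypothetical placeholders rather than the paper's implementation.

```python
# Hedged sketch: differentiable metric losses on the output of a one-step student.
# `generator`, `speaker_encoder`, and `ctc_head` are hypothetical modules;
# this is not the paper's actual implementation.
import torch
import torch.nn.functional as F

def metric_losses(generator, speaker_encoder, ctc_head,
                  noise, text_tokens, prompt_mel,
                  target_tokens, token_lens, frame_lens):
    # Single-step generation keeps the whole path from noise to speech differentiable.
    mel = generator(noise, text_tokens, prompt_mel)

    # CTC loss: a frozen ASR-style head scores intelligibility of the generated mel.
    log_probs = ctc_head(mel).log_softmax(-1).transpose(0, 1)  # (T, B, vocab)
    ctc = F.ctc_loss(log_probs, target_tokens, frame_lens, token_lens)

    # SV loss: cosine distance between speaker embeddings of prompt and output.
    emb_gen = speaker_encoder(mel)
    emb_ref = speaker_encoder(prompt_mel)
    sv = 1.0 - F.cosine_similarity(emb_gen, emb_ref, dim=-1).mean()

    return ctc, sv  # gradients flow back into `generator` through `mel`
```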

Comparison with StyleTTS-ZS

We have conducted objective evaluations of StyleTTS-ZS that will be included in the revised manuscript. DMOSpeech significantly outperforms StyleTTS-ZS in speaker similarity (0.69 vs. 0.56) with comparable real-time factor (0.07 vs. 0.04). While StyleTTS-ZS achieves lower WER (1.17 vs. our 1.94), our model delivers better overall performance as confirmed by subjective evaluations in Table 1. Additionally, our training pipeline is more straightforward without requiring aligner training, making it easier to scale across languages and larger datasets.

Parameter Selection Process

We have detailed our parameter selection approach in the paper. The process is intuitive rather than difficult—we observe gradient norms for each loss term and balance them accordingly. As described in Section 3.4:

" We set λadv=103\lambda_{\text{adv}} = 10^{-3} to ensure the gradient norm of adversarial loss is comparable to that of DMD loss. During early training stage, we observed that the gradient norms of SV loss and CTC loss were significantly higher than DMD loss, likely because GθG_\theta was still learning to generate intelligible speech from single step. To address this, we set λCTC=0\lambda_{\text{CTC}} = 0 for the first 5,000 iterations and λSV=0\lambda_{\text{SV}} = 0 for the first 10,000 iterations. This allows GθG_\theta to stabilize under the influence of DMD loss before integrating these additional losses. After that, both λCTC\lambda_{\text{CTC}} and λSV\lambda_{\text{SV}} are set to 1."

This approach follows established practices in the literature (Yin et al., 2024) and doesn't require extensive hyperparameter tuning.
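For concreteness, a small sketch of this warm-up schedule is given below; the thresholds simply restate the values quoted above, and the surrounding training loop is omitted.

```python
# Hedged sketch of the loss-weight schedule described in Section 3.4; the
# thresholds restate the quoted values, and the training loop itself is omitted.
def loss_weights(iteration: int) -> dict:
    return {
        "adv": 1e-3,                                # keeps adversarial gradients comparable to DMD
        "ctc": 0.0 if iteration < 5_000 else 1.0,   # CTC loss off for the first 5,000 iterations
        "sv": 0.0 if iteration < 10_000 else 1.0,   # SV loss off for the first 10,000 iterations
    }

# total_loss = dmd + w["adv"] * adv + w["ctc"] * ctc + w["sv"] * sv
w = loss_weights(7_500)  # -> {'adv': 0.001, 'ctc': 1.0, 'sv': 0.0}
```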

Regarding Other Suggestions

  1. The description in Section 3.2 should be improved. Most of the contents are not new, so it is unclear which parts are novel.

Section 3.2 provides necessary background on Distribution Matching Distillation, establishing context for our improvements. We acknowledge this is primarily background material and will clarify which aspects represent our specific contributions in the revised manuscript.

  2. The MOS values of various methods in Table 1 are almost the same as the ground truth. I am not sure whether the improvement obtained by the proposed method is meaningful enough.

While MOS values are similar to ground truth, our key contribution is achieving comprehensive improvements across multiple metrics simultaneously. Previous models like StyleTTS-ZS achieve high naturalness but lower similarity, while NaturalSpeech 3 achieves high similarity but lower naturalness. DMOSpeech uniquely excels in both dimensions while maintaining significantly faster inference speed (13.7x faster than the teacher model). This balanced performance across all metrics represents a meaningful advancement in the field. We appreciate your constructive feedback and will incorporate these clarifications in our revised manuscript.

Official Review
Rating: 3

This paper presents DMOSpeech, a distilled diffusion-based speech synthesis model that achieves true end-to-end optimization of perceptual metrics, specifically through CTC loss for intelligibility and SV loss for voice similarity. The authors integrate these loss functions into a distilled student model trained via DMD2, enabling efficient inference without sacrificing synthesis quality.

update after rebuttal

I keep my initial assessment as the rebuttal addressed my concerns.

Questions for Authors

Claims and Evidence

  • The claims are very clear that their newly introduced, plausible loss functions work in practice. I am surprised these loss functions have not yet been applied to this kind of diffusion distillation, and I support this paper's acceptance.
  • The paper backs its claims with comprehensive experiments. In particular, surpassing its teacher with 4-step synthesis using such auxiliary loss functions is impressive, as is outperforming strong baselines (e.g., NaturalSpeech 3). However, in all tables, the authors should write the number of inference steps. For example, StyleTTS-ZS is 1-step generation, and its performance seems to differ from the original paper. Can you explain this? What makes StyleTTS-ZS and DMOSpeech perform differently?
  • The teacher model already achieves real-time generation (RTF < 1). Why should we bother making generation even faster? Say the teacher is a 10B model and clearly not real-time; perhaps a use case of this algorithm is to distill that teacher into a few-step model that fits within real-time constraints. Can the authors give a good intuition for why the model sizes are not scaled to 10B-100B? Is there any insight into the model's behavior once we really scale up to this region?
  • Is there any ablation that applies CTC and SV losses individually? Will there be any discrepancy between the expected outcome and the real outcome?
  • Will the code be released?

Methods and Evaluation Criteria

Yes. They all make sense to me.

Theoretical Claims

There is no theory in this paper. One question: when the loss function is fully optimized, will the CTC loss and SV loss not hurt optimality? It is true in the image domain that using CLIP regularization with a strong weight hurts performance. How about CTC and SV? Also, would CLAP regularization work?

Experimental Design and Analysis

I have not carefully checked the validity of all details. Typically, in these GAN-based experiments, there are many hidden (or appendixed) materials that a reviewer can easily miss.

Supplementary Material

No, I haven't read the supplementary materials.

Relation to Prior Work

Very relevant to the broader community.

Missing Important References

Key references are discussed

Other Strengths and Weaknesses

One minor issue is about the contribution of this paper. It is entirely an empirical study, and I was wondering if there is any good theoretical catch in this paper. Well, I understand it is very hard to do so, but is there any interesting analysis?

Other Comments or Suggestions

Author Response

We sincerely thank the reviewer for their thorough evaluation and constructive feedback. Below, we address each point raised:

StyleTTS-ZS Comparison

The performance of StyleTTS-ZS in our evaluation aligns with what was reported in their original paper. The fundamental architectural difference is that StyleTTS-ZS employs a specialized decomposition approach, modeling specific prosody components (F0, energy, duration) separately, while DMOSpeech adopts a more holistic end-to-end generation framework.

Our approach offers several advantages:

  1. Greater generalizability across diverse speech conditions (such as non-speech vocalizations and multiple speakers in the same utterance).
  2. Support for end-to-end optimization with perceptual metrics
  3. Robustness to challenging audio conditions (e.g., background noise, processing artifacts, as we presented on our demo page)

StyleTTS-ZS achieves high efficiency through its decomposition strategy (enabling one-step generation, since generating prosodic features only is easier than generating the whole speech), but this same design introduces limitations when faced with complex or noisy audio conditions. DMOSpeech maintains comparable efficiency while providing superior generalizability and performance.

Value of Super-Efficient Generation

While the teacher model achieves real-time generation (RTF < 1) on a high-end GPU (V100), there are compelling reasons to pursue even greater efficiency:

  • Device Compatibility: Enabling deployment on resource-constrained environments (CPUs, mobile devices, edge computing)

  • Service Scalability: A 13.7× reduction in inference time translates to substantially higher throughput for cloud services supporting millions of users

  • Energy Efficiency: Reduced computation requirements lead to lower power consumption and carbon footprint

For industrial applications, these efficiency gains are critical for accessibility, scalability, and sustainability.

Regarding model scaling, while extremely large models (10B-100B parameters) are indeed being explored in production environments (e.g., by ByteDance's Seed-TTS and Amazon's Base-TTS), analyzing scaling behaviors was outside our paper's scope. Our distillation technique remains relevant regardless of teacher model size, as the efficiency benefits become even more pronounced with larger models.

Ablation of Individual Losses

As shown in Table 4, we conducted comprehensive ablation studies applying CTC and SV losses individually:

  • CTC Loss Only: Achieved superior word error rate (1.79 vs. 1.94) but significantly lower speaker similarity

  • SV Loss Only: Produced slightly higher speaker similarity (0.70 vs. 0.69) but substantially worse WER (6.62 vs. 1.94)

These results demonstrate that combining both losses achieves the optimal balance between intelligibility and speaker similarity, which aligns with human preference as shown in our subjective evaluations.

Loss Optimization and Potential Trade-offs

The reviewer raises an important point about potential conflicts between optimization objectives. In theory, overly aggressive optimization of auxiliary losses (CTC, SV) could indeed harm performance by causing distribution mismatches with the training data, especially when the loss is optimized below that achieved in ground truth data.

To mitigate this risk, we carefully balanced the gradient contributions from each loss component, ensuring the auxiliary loss gradients are comparable in magnitude to the primary DMD loss. This calibrated approach prevents any single objective from dominating and maintains distributional alignment throughout training.
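As a rough illustration of this calibration, the sketch below compares per-loss gradient norms on the generator parameters; all names are hypothetical and this is not the authors' code.

```python
# Hedged sketch: comparing per-loss gradient norms on the generator parameters,
# so no auxiliary objective dominates the DMD loss. `generator` and the loss
# tensors are hypothetical placeholders.
import torch

def grad_norm(loss: torch.Tensor, params) -> float:
    grads = torch.autograd.grad(loss, params, retain_graph=True, allow_unused=True)
    total = sum(g.norm() ** 2 for g in grads if g is not None)
    return float(total ** 0.5)

# params = [p for p in generator.parameters() if p.requires_grad]
# for name, loss in {"dmd": dmd_loss, "ctc": ctc_loss, "sv": sv_loss}.items():
#     print(name, grad_norm(loss, params))
```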

Regarding CLAP regularization, while it's an intriguing direction, we focused specifically on speech synthesis rather than general audio generation in this work. We appreciate the suggestion and will mention this as a potential future direction in our revised manuscript.

Theoretical Contributions

While our paper emphasizes empirical results, we provide some theoretical contributions, especially our Analysis of Mode Shrinkage. Our detailed examination of distributional changes during distillation (Figure 2 and Appendix A) offers novel insights into how distillation affects output diversity in conditional generation tasks. Our analysis shows that although distillation causes a loss in diversity for a fixed prompt and text input, this reduction in diversity is not necessarily negative and could even be beneficial to the performance of conditional generation, and mode coverage is not compromised when the model processes different prompt and text inputs.

These insights contribute to the theoretical understanding of both diffusion model distillation and perceptual metric optimization in generative models.

We thank the reviewer again for their thoughtful comments and positive assessment. We hope our responses address their concerns satisfactorily.

Official Review
Rating: 3

This paper introduces DMOSpeech, a distilled diffusion-based text-to-speech (TTS) model that achieves faster inference and superior performance compared to its teacher model. It has two advantages: (1) reducing sampling steps from 128 to 4 via distribution matching distillation, and (2) providing direct gradient pathways from noise input to speech output. This allows direct optimization of speaker similarity and word error rate through speaker verification (SV) and Connectionist Temporal Classification (CTC) losses. The comprehensive experiments demonstrate significant improvements across all metrics, outperforming the teacher model and other recent baselines in subjective and objective evaluations.

Questions for Authors

The questions are asked in previous sections.

Claims and Evidence

Overall, the claims are clearly and well supported.

  1. The biggest problem of this paper is the following: it combines diffusion distillation and metric optimization (by GAN loss). Since traditional diffusion/flow matching cannot generate intelligible speech at high noise levels, this paper bypasses that stage via diffusion distillation and then applies the GAN optimization. Firstly, both diffusion distillation and direct metric optimization in TTS are well studied. Secondly, FlashTTS [2], which is also a consistency model (bypassing the same challenge of unintelligible speech under high noise levels) with direct metric optimization, is trained without distillation and is much easier. So, it lacks novelty for the scope of ICML.

  2. This paper should include more discussion of closely related papers, such as DIFFUSION-GAN [1] and FlashTTS [2].

[1] DIFFUSION-GAN: TRAINING GANS WITH DIFFUSION

[2] FlashSpeech: Efficient Zero-Shot Speech Synthesis

Methods and Evaluation Criteria

Overall, the methods are written clearly. The evaluation criteria are sufficient.

Theoretical Claims

Yes.

Experimental Design and Analysis

Some questions:

  1. For experiments comparing with end-to-end systems, I recommend comparing with more baselines: F5TTS [1], MASKGCT [2].
  2. It should be compared with FlashTTS[3], which is also a strong baseline of an efficient zero-shot TTS system with few iterative steps and direct metric optimization.

[1] F5-TTS: Diffusion Transformer with ConvNeXt V2, faster trained and inference.

[2] Maskgct: Zero-shot text-to-speech with masked generative codec transformer

[3] FlashSpeech: Efficient Zero-Shot Speech Synthesis

Supplementary Material

Yes.

Relation to Prior Work

No.

Missing Important References

No.

Other Strengths and Weaknesses

The strengths and weaknesses are discussed in previous sections.

Other Comments or Suggestions

Update after Rebuttal

Thanks for the authors' replies.

Regarding the main concern about Direct Metric Optimization vs GAN, I agree that FlashSpeech does not involve direct metric optimization in adversarial training. I misunderstood the concept of adversarial training and direct metric optimization. My concerns still exist: 1. The method to distill the teacher model for fast inference via the adversarial training is not novel: works such as DMD 2 [1] and FlashSpeech have studied it. 2. The direct metric optimization is not novel either. The early work, such as StyleTTS 2 [2], also uses pretrained WavLM (a proxy for speaker similarity metric) as the direct optimization objective and achieved good results (see Table 5 w/o SLM adversarial training in the ablation study of StyleTTS 2).

The comparisons with F5TTS and MaskGCT are good.

Finally, I agree with the authors' claim: enabling direct optimization of perceptually relevant metrics through a differentiable pathway created by one-step generation via diffusion distillation is valuable, especially in the area of TTS. Considering the combination of all the proposed methods and the good results, I will update the score to 3.

[1] Improved Distribution Matching Distillation for Fast Image Synthesis [2] StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

Author Response

We appreciate the reviewer's feedback, but we believe there is a fundamental misunderstanding about our paper's contribution. Our work presents several key innovations that have not been explored in prior research, including FlashSpeech:

Clarification on Direct Metric Optimization vs. Adversarial Training

The reviewer conflates direct metric optimization with adversarial (GAN) training, which are fundamentally different approaches: FlashSpeech does not implement direct metric optimization. After careful examination of the FlashSpeech (Ye et al., 2024) paper, we confirm it uses adversarial consistency training but makes no mention of direct metric optimization of perceptual metrics such as speaker similarity or word error rate. Their adversarial training is solely for improving general speech quality.

Our direct metric optimization is novel. DMOSpeech enables true end-to-end optimization of specific perceptual metrics (SV loss for speaker similarity, CTC loss for word error rate) through differentiable pathways - not simply adversarial training. This is a significant advancement for TTS systems.

State of Direct Metric Optimization in TTS

The reviewer suggests that "both diffusion distillation and direct metric optimization in TTS are well studied." This is incorrect. As we explicitly state in our paper:

While optimizing perceptual metrics has shown promise in speech enhancement through approaches like MetricGAN for PESQ and STOI, and recent attempts have explored RLHF for improving naturalness, implementing these approaches in modern TTS systems has remained challenging. Previous attempts (e.g., YourTTS) reported minimal improvements from speaker similarity optimization due to their inability to propagate gradients through all model components. The field has struggled with this problem due to architectural limitations such as non-differentiable components or computationally prohibitive backpropagation through iterative sampling steps.

Relationship to DIFFUSION-GAN

Our work bears limited similarity to DIFFUSION-GAN. Our focus is not on GAN-based techniques but on enabling direct optimization of perceptually relevant metrics through a differentiable pathway created by one-step generation through distillation. This is a fundamentally different approach with different goals.

Novel Contributions

Our paper makes several novel contributions:

  1. We present the first distilled TTS model that consistently outperforms its teacher model (not merely matching it), while reducing inference time by over 13×.

  2. We introduce a framework enabling true end-to-end optimization of differentiable metrics in TTS, demonstrating substantial improvements in speaker similarity and word error rate.

  3. We provide comprehensive analyses establishing correlations between objective metrics and human perceptions, revealing new insights into sampling speed and diversity trade-offs.

These contributions represent significant advancements in the field of speech synthesis that are well-aligned with ICML's focus on machine learning innovations.

Regarding F5TTS and MASKGCT Comparisons

We thank the reviewer for their suggestions for more baseline comparisons. Per the reviewer's suggestion, we have trained a new DMOSpeech model using F5-TTS as the teacher on the Emilia dataset and conducted comprehensive objective evaluations comparing our new model against both F5TTS and MASKGCT models on the SeedTTS-Eval benchmark dataset:

Model                     SIM (en) ↑   WER (en) ↓   SIM (zh) ↑   CER (zh) ↓   RTF ↓
MaskGCT                   0.717        2.62         0.752        2.27         1.21
F5-TTS (teacher, N=32)    0.647        1.83         0.741        1.56         0.32
DMOSpeech (N=4)           0.687        1.78         0.757        1.43         0.06

Our model has achieved similar or better performance than both MaskGCT and F5-TTS on SeedTTS-eval in both Chinese and English test sets for both intelligibility and similarity. Moreover, our model is significantly faster than both MaskGCT and F5-TTS.

We will include the complete evaluation results in the appendix of our revised manuscript. This additional analysis provides a more comprehensive understanding of how our approach compares to current state-of-the-art methods.

Regarding FlashSpeech Comparison

We appreciate the suggestion to compare with FlashSpeech. However, despite our best efforts, we have been unable to conduct direct experimental comparisons due to the unavailability of publicly accessible pre-trained checkpoints for this model, as documented in https://github.com/zhenye234/FlashSpeech/issues/3.

In our revised manuscript, we will include a thorough discussion of FlashSpeech, addressing its approach and how it relates to our work. While we cannot provide direct experimental comparisons, this discussion will help contextualize our contributions among other efficient zero-shot TTS systems.

Official Review
Rating: 4

Diffusion models have shown strong potential in speech synthesis tasks such as text-to-speech (TTS) and voice cloning. However, their iterative denoising process is computationally expensive, and previous distillation methods have led to quality degradation. Existing TTS approaches also suffer from non-differentiable components or iterative sampling, preventing true end-to-end optimization with perceptual metrics.

To address these issues, the authors propose DMOSpeech, a distilled diffusion-based TTS model that achieves both faster inference and superior performance compared to its teacher model. A key innovation of DMOSpeech is its ability to enable direct gradient pathways to all model components, allowing for the first successful end-to-end optimization of differentiable perceptual metrics in TTS. The model incorporates Connectionist Temporal Classification (CTC) loss and Speaker Verification (SV) loss, aligning speech synthesis quality with human auditory preferences.

Extensive experiments, including human evaluations, demonstrate significant improvements in naturalness, intelligibility, and speaker similarity, while also reducing inference time by orders of magnitude. This work introduces a new framework for optimizing speech synthesis directly with perceptual metrics, setting a new standard for high-quality and efficient TTS models.

update after rebuttal

I have read the rebuttal from the authors and the comments from other reviewers. I think this is a good paper, and I vote for Accept.

Questions for Authors

no further questions

Claims and Evidence

The claims made in the submission are supported by clear and convincing evidence.

Methods and Evaluation Criteria

Methods are clear and the evaluation criteria are suitable.

Theoretical Claims

The theoretical claims are not significant or novel, but they are not wrong for this specific application.

Experimental Design and Analysis

Experiment design is clear and analysis supported the claim.

Supplementary Material

The sample webpage and supplementary material provided clear evidence of the method's performance and the experiment's details.

Relation to Prior Work

No significant contribution to the broader scientific literature.

Missing Important References

No further reference missing in discussion.

Other Strengths and Weaknesses

Strengths

  • The paper introduces a novel approach to optimizing perceptual metrics in TTS by enabling direct gradient pathways, which has not been successfully achieved in previous models.
  • By reducing sampling steps from 128 to 4 while maintaining or improving quality, DMOSpeech addresses a major efficiency bottleneck in diffusion-based TTS.
  • The paper is well-structured, with clear explanations of the model architecture and loss functions. The inclusion of human evaluation results strengthens the validity of claims.

Other Comments or Suggestions

no further comments

Author Response

We sincerely thank the reviewer for their positive recommendation for our paper. We appreciate the recognition of our model's ability to enable direct gradient pathways for end-to-end optimization and the acknowledgment of our efficiency improvements. While we agree with many points raised in the review, we would like to address two specific concerns:

On Theoretical Claims and Scientific Contribution:

We respectfully disagree with the assessment that our theoretical claims are "not significant, novel" and that there is "no significant contribution to the broader scientific literature." Our work makes several important theoretical and scientific contributions:

  • Novel Distribution Matching Framework: We present the first successful application of distribution matching distillation in speech synthesis that achieves superior quality to the teacher model. This counters the prevailing view in the field that distillation necessarily leads to quality degradation.

  • Mode Shrinkage Insight: Our analysis of mode shrinkage during distillation reveals a fundamental insight about conditional generation tasks: in strongly conditional generation, diversity reduction can be beneficial when it emphasizes high-probability regions without compromising output variation across different prompts and text inputs.

  • Unified Optimization Framework: By enabling direct metric optimization in a diffusion framework, we bridge the gap between two previously separate research areas: perceptual metric optimization and diffusion models, establishing a foundation for future research on optimizing generative models with human preferences.

These contributions extend beyond speech synthesis and offer valuable insights for any conditional generative modeling task, especially those requiring both quality and efficiency.

We believe our work advances both the theoretical understanding of generative model distillation and provides a practical framework applicable to numerous domains requiring conditional generation with perceptual quality constraints.

Thank you again for your overall positive assessment. We hope this clarification addresses your concerns regarding the broader impact and theoretical novelty of our work.

Final Decision

DMOSpeech combines DMD2 and direct metric optimization (DMO) to give a distilled model matching or exceeding recent diffusion and AR baselines. The system is well-motivated (e.g. DMD vs. progressive distillation), though the choice and weighting (over time) of direct metrics is empirical, requiring tuning (R4; the authors could cite or clarify where, e.g. in Yin et al., their heuristics come from) and extra memory (R2).

DMD (R1, R3) is not theoretically novel. I do think DMO during distillation is novel (R3 points to StyleTTS 2, but it’s not distillation, it’s adversarial, and the ablation seems to drop it entirely). “Mode shrinkage” is also not novel: even the original distillation paper (https://arxiv.org/abs/1503.02531) talks about intentionally sharpening the student via T < 1, and this effect is viewed as why self-distillation helps (https://arxiv.org/abs/1805.04770). Perhaps it’s not studied for DMD, but authors should not oversell this in a camera-ready. Would be interesting to see how well other diffusion distillations sharpen their distributions.

Regardless, their analysis of distributions and effects of each DMO loss in NAR TTS is informative. All reviewers agree the experiments (subjective and objective) are comprehensive and ultimately support acceptance.