PaperHub
Overall rating: 5.5/10
Poster · 3 reviewers (individual ratings: 2, 4, 3; min 2, max 4, std dev 0.8)
ICML 2025

BinauralFlow: A Causal and Streamable Approach for High-Quality Binaural Speech Synthesis with Flow Matching Models

OpenReview · PDF
Submitted: 2025-01-20 · Updated: 2025-07-24
TL;DR

A Causal and Streamable Approach for High-Quality Binaural Speech Synthesis with Flow Matching Models

Abstract

Keywords
Spatial Audio · Binaural Speech · Generative Model · Flow Matching

Reviews and Discussion

Review (Rating: 2)

This paper explores the streaming generation of high-quality binaural audio from monaural audio, considering the spatial positions of both the speaker and listener. Specifically, the task is approached as a generative problem, utilizing a flow matching model to generate stochastic binaural details absent in the monaural input. The paper also proposes a continuous inference pipeline to support streaming rendering. Both objective and subjective metrics demonstrate the superiority of the proposed approach.

Update after rebuttal

Thank you for your rebuttal response. I appreciate the added experimental results on inference speed and suggest incorporating them into the main paper. However, I still have reservations about the motivation of the method and its technical distinction from previous works. Therefore, I maintain my original score.

Questions for Authors

  1. The discussion on the differences between this work and simplified flow matching in Section 3.2 is somewhat perplexing, particularly regarding the inclusion of task-related conditions (such as mono audio). The current text appears to emphasize experimental results, such as the impact of noise scale variations on training stability, rather than clearly articulating the distinctions between the two methods. Could the authors further clarify these distinctions to provide a clearer understanding of their respective approaches and implementations?

Claims and Evidence

The paper claims to achieve high-quality and realistic binaural audio rendering with faster inference speed. The evidence provided includes performance comparisons on an in-house test set.

However, the evidence is not sufficiently convincing for several reasons. First, the evaluation results on a public dataset (Appendix F) show that the proposed approach performs on par with the baseline, except for the Wave L2 metric, which is significantly worse than the baseline. This suggests that the overall performance improvement may be limited. Second, necessary details about the baseline model are missing. For example, it is unclear whether the baseline model was trained on the same dataset or tested in an out-of-domain scenario. Additionally, the model size is not reported, which is essential for evaluating inference speed, as model size directly impacts computational efficiency. These details are crucial for a fair comparison and to validate the claims of improved performance and speed.

Methods and Evaluation Criteria

The enhancements in effectiveness and efficiency are valuable for the binaural audio rendering task and for practical applications, though a more thorough exploration of the motivations underlying the proposed methods would clarify the rationale and ensure the contributions are fully understood.

Theoretical Claims

This work primarily focuses on practical application rather than theoretical development. I specifically reviewed the formulas related to flow matching, and they appear to be correctly formulated.

Experimental Designs or Analyses

Some questions about the experimental design are mentioned in the “Claims and Evidence” section. Additionally, I would like to understand the rationale behind the choice of the SGMSE baseline, since its original paper may not explicitly address the binaural audio task. I am also confused by the omission of SGMSE results in Table 4. Lastly, I suggest clarifying the reasons for the inconsistent scales used in the L2 error reporting (1e-5 in Table 1 vs. 1e-3 in Table 4).

Supplementary Material

Yes. I reviewed the supplementary material and the appendix.

Relation to Broader Literature

The key contributions of this paper build on prior work by continuing the paradigm of framing this task as a generative problem. The authors focus on adapting the model architecture and some implementation details to enhance both performance and efficiency.

Essential References Not Discussed

No.

Other Strengths and Weaknesses

The paper demonstrates notable improvements in subjective testing, which underscores the effectiveness of the proposed approach. Overall, however, the technical contributions appear relatively incremental. The paper could benefit from a more thorough discussion of the motivations behind the proposed improvements. To some extent, the work reads more as a technical report, which, while valuable, may limit its broader impact.

Other Comments or Suggestions

No.

Author Response

We thank the reviewer for the constructive comments. We address each concern below and will revise our paper following the suggestions.

Performance on the public dataset (Claims And Evidence)

We agree that the proposed method performs comparably to the baseline (BinauralGrad) on some metrics, but we respectfully note that our approach outperforms the baseline on the key perceptual metrics (PESQ and MRSTFT). These perceptual metrics are often prioritized in audio synthesis tasks because they correlate better with human perception of quality and intelligibility, whereas minor waveform-level (L2) differences may not affect perceived audio quality.

Implementation details of baselines (Claims And Evidence)

For all baselines, we trained the models from scratch on our new dataset and tested them on the same dataset. We did not perform pretraining or cross-dataset evaluation. We will include additional implementation details in our revision.

Model size and speed (Claims And Evidence)

We thank the reviewer for pointing out this issue. We present the model size and inference speed for all baselines below. We test inference speed on a single 4090 GPU; the audio sampling rate is 48 kHz and the audio length is 683 ms. As shown in the table, our model achieves the fastest inference speed among the generative models and a more favorable trade-off between performance (Table 1) and inference speed than the baseline approaches.

| Methods | Type | NFE | Speed (ms) | Model Size (MB) |
| --- | --- | --- | --- | --- |
| SoundSpaces 2.0 | - | 1 | - | - |
| 2.5D Visual Sound | R | 1 | 1.18 | 2.0 |
| WaveNet | R | 1 | 21.03 | 2.7 |
| WarpNet | R | 1 | 21.93 | 2.8 |
| BinauralGrad | G | 6 | 221.15 | 2.9 |
| SGMSE | G | 30 | 770.22 | 73.6 |
| BinauralFlow (Ours) | G | 6 | 163.03 | 14.5 |

Method motivation (Methods And Evaluation Criteria, Other Strengths And Weaknesses)

Real-time, high-quality spatial audio is essential for immersive applications such as gaming, VR/AR, and cinematic experiences. These applications motivate the design of a high-quality, efficient, and streaming binaural audio generation framework.

For quality, we frame binaural audio rendering as a generative problem and introduce a high-fidelity generative model. To improve inference efficiency, we choose flow matching models over diffusion models because they render high-fidelity spatial audio in fewer inference steps, and we further reduce the step count with the midpoint solver and an early skip strategy. To satisfy the streaming requirement of real-world applications, we design a causal U-Net architecture and a continuous inference pipeline that processes audio input in chunks. Each design choice is thus motivated by a concrete real-world requirement.
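For concreteness, below is a minimal Python sketch of midpoint (RK2) sampling with an early-skip start, assuming a generic vector-field network `v_theta`; this is an illustration of the strategy described above, not the authors' implementation, and how the state is initialized at the skipped time step is left to the caller.

```python
def midpoint_sample(v_theta, z, cond, nfe=6, t_start=0.5):
    """Midpoint (RK2) integration of dz/dt = v_theta(z, t, cond) over
    [t_start, 1]. Each midpoint step costs two function evaluations,
    so nfe=6 corresponds to 3 steps. Starting at t_start = 0.5 mirrors
    the early skip strategy (ratio 0.5), which skips the first half of
    the trajectory; the initialization of z at t_start is assumed to be
    handled by the caller, following the paper.
    """
    steps = nfe // 2                  # 2 NFE per midpoint step
    h = (1.0 - t_start) / steps       # uniform step size
    t = t_start
    for _ in range(steps):
        k1 = v_theta(z, t, cond)                           # slope at t
        k2 = v_theta(z + 0.5 * h * k1, t + 0.5 * h, cond)  # slope at midpoint
        z = z + h * k2                                     # midpoint update
        t += h
    return z
```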

Use of SGMSE baseline (Experimental Designs Or Analyses)

Although SGMSE is not intended for binaural audio rendering, we included it as a baseline due to its U-Net-based diffusion model. While BinauralGrad also uses a diffusion model, it employs a WaveNet architecture. We used SGMSE to examine the impact of different architectures on the results.

Omission of SGMSE results in Table 4 (Experimental Designs Or Analyses)

Since SGMSE was not evaluated on the public dataset in the original submission, it was not included in Table 4. To expand the experiment, we have now tested SGMSE on the public dataset and report its performance below. Our model outperforms SGMSE by noticeable margins. We will include these results in the revision.

| Methods | PESQ ↑ | MRSTFT ↓ | Wave L2 ↓ | Amplitude L2 ↓ | Phase L2 ↓ |
| --- | --- | --- | --- | --- | --- |
| SGMSE | 2.256 | 1.352 | 0.230 | 0.033 | 0.983 |
| BinauralFlow (Ours) | 2.806 | 1.252 | 0.192 | 0.030 | 0.918 |

L2 error scale (Experimental Designs Or Analyses)

The difference in scale arises because the audio volume differs between the two datasets. Since the audio volume in our collected dataset is lower than in the public dataset, we used a different scaling factor when reporting the errors.

Difference between BinauralFlow and simplified flow (Questions For Authors)

The key difference between our approach and the simplified flow lies in the flow function and vector field. Our flow function is defined as $\phi_t(\mathbf{z}) = t\mathbf{y} + (1-t)\mathbf{x} + (1-t)\sigma\boldsymbol{\epsilon}$, where $\mathbf{x}$ and $\mathbf{y}$ are the mono and binaural audio, respectively, $\boldsymbol{\epsilon}$ is Gaussian noise, $t$ is the time step, and $\sigma = 0.5$ in our experiments. The simplified flow uses $\phi_t(\mathbf{z}) = t\mathbf{y} + (1-t)\mathbf{x} + \sigma\boldsymbol{\epsilon}$ with $\sigma$ near zero (e.g., $10^{-4}$), reducing to the deterministic interpolation $\phi_t(\mathbf{z}) = t\mathbf{y} + (1-t)\mathbf{x}$. Consequently, our vector field $\mathbf{y} - \mathbf{x} - \sigma\boldsymbol{\epsilon}$ introduces stochasticity, enabling variability in generative tasks, unlike the deterministic $\mathbf{y} - \mathbf{x}$ of the simplified flow.
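As a worked example of the distinction, here is a minimal PyTorch sketch of the two training targets implied by these definitions (with x, y, t, and sigma as above; an illustration, not the authors' code):

```python
import torch

def binauralflow_target(x, y, t, sigma=0.5):
    """BinauralFlow interpolant: z_t = t*y + (1-t)*x + (1-t)*sigma*eps.
    Differentiating with respect to t gives the stochastic vector-field
    target y - x - sigma*eps."""
    eps = torch.randn_like(y)
    z_t = t * y + (1 - t) * x + (1 - t) * sigma * eps
    return z_t, y - x - sigma * eps

def simplified_flow_target(x, y, t, sigma=1e-4):
    """Simplified flow: constant, near-zero noise scale, so the target
    reduces to the deterministic difference y - x."""
    eps = torch.randn_like(y)
    z_t = t * y + (1 - t) * x + sigma * eps
    return z_t, y - x
```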

Review (Rating: 4)

This paper proposes a streaming binaural speech synthesis method using a causal architecture design and flow matching models. It introduces a flow matching model to generate binaural speech from a single-channel input and adopts a causal architecture that predicts the next frames from past information for streaming inference. To efficiently estimate the vector fields, feature buffers are maintained at each sampling step so that only the current frame needs to be computed. The results show better performance than other baselines.

Update after rebuttal

I remain positive about the paper and will keep my score.

Questions for Authors

[Q1: GAN-based models] Have you compared the model with GAN-based models? Although CFM-based models can generate high-quality waveforms, GAN-based models are still effective for waveform generation. Specifically, PeriodWave-Turbo [7] fine-tuned the CFM models with adversarial feedback to improve performance and reduce the number of sampling steps.

For real-time applications, this model still requires a midpoint method with six NFEs, resulting in higher latency.

[Q2: Reshape and Linear Projection methods instead of STFT/iSTFT]

Have you tried using different input features instead of STFT components? I recommend using the waveform directly with a reshape method and linear projection, as in WaveNeXt [8]. This approach does not require reflection padding before the STFT.

[7] Lee, Sang-Hoon, Ha-Yeong Choi, and Seong-Whan Lee. "Accelerating High-Fidelity Waveform Generation via Adversarial Flow Matching Optimization." arXiv preprint arXiv:2408.08019 (2024).

[8] Okamoto, Takuma, et al. "WaveNeXt: ConvNeXt-based fast neural vocoder without iSTFT layer." 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023.

Claims and Evidence

[Flow Matching for Binaural Audio Generation]

Based on previous work [1], [2], [3], flow matching models have already been shown to generate high-quality waveform signals effectively. The related work on flow matching-based waveform generation is not discussed in the current manuscript.

[1] https://openreview.net/forum?id=tQ1PmLfPBL

[2] https://openreview.net/forum?id=gRmWtOnTLK

[3] https://openreview.net/forum?id=uxDFlPGRLX

[Future Frame Generation based on the past information]

In terms of generative tasks, this paper adopts future frame generation using flow matching based solely on past information. This approach has potential for streaming generation in real-time applications. However, the current paper does not describe the real-time capabilities; it only discusses streaming generation in terms of a smaller NFE and its causal design. Please evaluate the real-time factor for streaming generation, as it is important to demonstrate real-time streaming ability if the paper claims to support streaming generation.

[Causal Model Design with Buffer]

I like the proposed causal design for streaming generation using buffers at each sampling step. However, this structure was already proposed in CosyVoice 2, which employs a chunk-aware causal flow matching model for streaming synthesis. Furthermore, the model structure is identical to the Matcha-TTS U-Net architecture, yet this paper does not refer to Matcha-TTS. Please discuss the differences between the proposed model and those in Matcha-TTS and CosyVoice 2. CosyVoice 2 was released in Dec. 2024, so I consider it concurrent work; feel free to add a discussion of it.

[4] Mehta, Shivam, et al. "Matcha-TTS: A fast TTS architecture with conditional flow matching." ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024.

[5] Du, Zhihao, et al. "Cosyvoice 2: Scalable streaming speech synthesis with large language models." arXiv preprint arXiv:2412.10117 (2024).

[Comparison with Parallel Model]

It would be better if an ablation study with a parallel model were added.

[Early Skip]

This paper only empirically claims that the early skip strategy improves performance, and the ratio of 0.5 was chosen heuristically. I recommend trying the Sway Sampling of F5-TTS with different coefficients.

[6] Chen, Yushen, et al. "F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching." arXiv preprint arXiv:2410.06885 (2024).

Methods and Evaluation Criteria

The evaluation metrics are limited in supporting the streaming ability. Please add metrics for the real-time factor, and compare performance with and without the buffer to support the efficient causal model design.

Theoretical Claims

.

Experimental Designs or Analyses

The evaluation appears to be conducted only on a speech dataset. It would be beneficial to include results on a general sound dataset as well.

Supplementary Material

Please add a demo page. It is very difficult to listen to the audio files when they are only available in the supplementary material; an audio paper should have a demo page.

Relation to Broader Literature

.

Essential References Not Discussed

[1] https://openreview.net/forum?id=tQ1PmLfPBL

[2] https://openreview.net/forum?id=gRmWtOnTLK

[3] https://openreview.net/forum?id=uxDFlPGRLX

[4] Mehta, Shivam, et al. "Matcha-TTS: A fast TTS architecture with conditional flow matching." ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024.

[5] Du, Zhihao, et al. "Cosyvoice 2: Scalable streaming speech synthesis with large language models." arXiv preprint arXiv:2412.10117 (2024).

Other Strengths and Weaknesses

.

Other Comments or Suggestions

The concept of this paper is good to me. However, the details of real-time streaming generation and the survey of related works are limited. Please discuss the related works further and evaluate the real-time factor to substantiate the paper's central claim.

In the future, GPUs will become much faster, so iterative sampling methods for waveform generation will become more practical. Still, I hope the streaming-ability claims are not overstated in the current state; please discuss the real-time ability.

Ethics Review Concerns

.

Author Response

We appreciate your positive feedback on our work. We will address your questions in the responses below.

Related work (Claims And Evidence)

We thank the reviewer for pointing out these related works. PeriodWave designs a multi-period flow matching model for high-fidelity waveform generation. FlowDec introduces a conditional flow matching-based audio codec that reduces the postfilter DNN evaluations from 60 to 6. RFWave proposes a multi-band rectified flow approach to reconstruct high-fidelity audio waveforms. These works all demonstrate the effectiveness of flow matching models in generating high-quality waveform signals. We will include this discussion in the revised paper.

Real-time factor (Claims And Evidence, Methods And Evaluation Criteria)

We calculate the real-time factor (RTF) of our model for different numbers of function evaluations (NFE) on a single 4090 GPU. The audio sampling rate is 48 kHz, and the audio length is 0.683 seconds. As shown in the table, with NFE set to 6, the RTF is 0.239. If we sacrifice some performance for faster inference, setting NFE to 1 yields an RTF of 0.040. Our model therefore demonstrates potential for real-time streaming generation.

| NFE | Inference Time (sec) | Real-Time Factor |
| --- | --- | --- |
| 1 | 0.027 | 0.040 |
| 2 | 0.055 | 0.081 |
| 4 | 0.109 | 0.160 |
| 6 | 0.163 | 0.239 |
| 8 | 0.217 | 0.318 |
| 10 | 0.271 | 0.397 |

Discussion on Matcha-TTS and CosyVoice 2 (Claims And Evidence)

We carefully reviewed the papers and code of these two works. Matcha-TTS employs a 1D U-Net model with 1D ResNet layers and Transformer Encoder layers. Neither the ResNet layers nor the Transformer Encoder layers are causal, which means that Matcha-TTS does not achieve time causality or support streaming inference. In contrast, our model is fully causal and supports streaming inference.

CosyVoice 2 introduces a chunk-aware causal flow matching model that uses causal convolution layers and attention masks to enable causality. However, the CosyVoice 2 model does not include feature buffers for each causal convolution layer, which may result in audio interruptions and discontinuities during streaming inference in real-world scenarios.

We will include our discussion of Matcha-TTS and CosyVoice 2 in our revision.
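To make the buffer mechanism concrete, here is a minimal, hypothetical sketch of a causal convolution with a persistent feature buffer; it illustrates the general idea rather than the authors' exact layer.

```python
import torch
import torch.nn as nn

class BufferedCausalConv1d(nn.Module):
    """Causal 1D convolution that caches the trailing (kernel_size - 1)
    frames of each chunk. Prepending the cache to the next chunk makes
    chunked streaming inference match full-sequence causal inference,
    avoiding the boundary artifacts caused by per-chunk zero padding."""

    def __init__(self, channels, kernel_size):
        super().__init__()
        assert kernel_size > 1
        self.conv = nn.Conv1d(channels, channels, kernel_size)
        self.pad = kernel_size - 1
        self.buffer = None  # trailing frames of the previous chunk

    def forward(self, x):  # x: (batch, channels, frames)
        if self.buffer is None:  # first chunk: left-pad with zeros
            ctx = x.new_zeros(x.size(0), x.size(1), self.pad)
        else:
            ctx = self.buffer
        x_ext = torch.cat([ctx, x], dim=-1)
        self.buffer = x_ext[..., -self.pad:].detach()  # cache for next chunk
        return self.conv(x_ext)  # output length equals input frames
```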

Experiment of parallel model (Claims And Evidence, Methods And Evaluation Criteria)

We compare the model's performance with and without buffers to examine the effectiveness of our causal model design. The results show that BinauralFlow with buffers achieves higher quality than the model without buffers. Additionally, in Figure 7 in the main paper, we show that excluding buffers causes noticeable artifacts in the generated spectrograms.

| Methods | L2 ↓ | Mag ↓ | Phase ↓ |
| --- | --- | --- | --- |
| BinauralFlow w/ buffer | **1.00** | **0.0071** | **1.33** |
| BinauralFlow w/o buffer | 13.25 | 0.0398 | 1.34 |

Sway Sampling (Claims And Evidence)

As suggested by the reviewer, we use Sway Sampling with coefficients ranging from -1 to 1 to systematically evaluate our model. The results are shown in the table below. Changing the coefficient does not lead to significant changes in the quantitative results. However, we observe that setting the coefficient above 0, which shifts the time steps toward the second half, yields better qualitative outcomes; specifically, background noise becomes more realistic as the coefficient increases. These results support the rationale behind our early skip strategy. (A sketch of the sampling schedule follows the table.)

| Coefficient | L2 ↓ | Mag ↓ | Phase ↓ |
| --- | --- | --- | --- |
| -1.0 | 1.06 | 0.0070 | 1.29 |
| -0.8 | 1.10 | 0.0070 | 1.29 |
| -0.4 | 1.00 | 0.0069 | 1.29 |
| 0 | 1.02 | 0.0069 | 1.29 |
| 0.4 | 1.03 | 0.0070 | 1.31 |
| 0.8 | 1.04 | 0.0071 | 1.32 |
| 1.0 | 1.02 | 0.0072 | 1.33 |
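For reference, a small sketch of the Sway Sampling schedule, assuming the formulation in F5-TTS [6]:

```python
import math

def sway_sample(u, s):
    """Map a uniform timestep u in [0, 1] to a swayed timestep
    t = u + s * (cos(pi/2 * u) - 1 + u). Setting s = 0 recovers the
    uniform schedule; s > 0 pushes steps toward the second half of the
    trajectory (consistent with the early skip intuition), and s < 0
    toward the first half. Endpoints are preserved: t(0) = 0, t(1) = 1."""
    return u + s * (math.cos(math.pi / 2 * u) - 1 + u)
```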

Sound dataset (Experimental Designs Or Analyses)

Thank you for the suggestion. We plan to conduct this experiment in our future work.

Demo (Supplementary Material)

As recommended by the reviewer, we have created a demo page and will release it following the acceptance of our paper.

GAN-based models (Questions For Authors)

Utilizing GAN-based models, such as PeriodWave-Turbo, to enhance inference efficiency is a promising direction. We plan to discuss this possibility in our paper and explore it further in future work.

Reshape and linear projection (Questions For Authors)

We tried using the waveform directly, without STFT/iSTFT, but it did not lead to superior performance. We remain interested in waveform-based approaches and will explore the reshape method and linear projection proposed in WaveNeXt and discuss them in our paper.

Review (Rating: 3)

They sought to address the binaural speech synthesis task by using mono-channel audio to generate binaural speech. To support streaming and produce audio aligned with a given pose, they employed a flow-matching-based generative model with a causal structure. In the process, they introduced streaming STFT/ISTFT and a buffer bank to enable seamless streaming. Furthermore, to enhance per-chunk generation speed in the flow-matching-based approach, they adopted a solver designed to improve sampling speed and integrated a noise-skip strategy.

Update after rebuttal

I have reviewed the authors' response. They provided sufficient experimental results and explanations addressing my previous concerns. Therefore, I maintain my positive assessment of this submission.

Questions for Authors

[Q1] I am curious whether you plan to make the collected dataset publicly available.

[Q2] Could you provide any results comparing the midpoint solver with other solvers?

[Q3] Since I am not very familiar with this domain, I wonder why you use STFT-based complex spectrograms rather than mel spectrograms, which are more common in typical speech synthesis (TTS).

[Q4] Finally, I would like to know more about the latency of the streaming model and how it compares in terms of performance with the non-streaming version.

Claims and Evidence

At least, there do not appear to be any issues with the authors’ claims.

Methods and Evaluation Criteria

There also seem to be no issues with the methodology or the evaluation.

Theoretical Claims

It appears that, rather than putting forth a separate theoretical argument, they have primarily adopted existing theoretical backgrounds (e.g., flow matching, midpoint solver), which seems reasonable.

Experimental Designs or Analyses

The experimental design and analysis also appear to be reasonable.

Supplementary Material

I examined the data generation section and also listened to the accompanying audio and video samples.

Relation to Broader Literature

They appear to have achieved similar or even improved quality compared to previous studies, while also providing streaming support.

Essential References Not Discussed

I do not see anything else noteworthy in that regard.

Other Strengths and Weaknesses

[S1] The figures and explanations are well-presented, making it easy for individuals who are not familiar with binaural speech synthesis to understand.

[S2] It is also commendable that the authors collected new data for verification, and that they provide results on publicly available datasets (as mentioned in the Appendix).

[S3] Moreover, enabling streaming capability further enhances the model’s practical utility.

I will outline my questions below.

Other Comments or Suggestions

I will outline my questions below.

Author Response

We appreciate your positive feedback on our work. We will address your concerns and include your suggestions in the revision.

Dataset (Questions For Authors)

Regarding open-sourcing the dataset, we fully understand the importance of reproducibility and transparency in research. However, due to privacy constraints and participant confidentiality, we are unable to publicly release the full dataset. To ensure the reproducibility of our work, we will release all implementation code, training scripts, pretrained model weights, and a test subset that has been carefully curated to exclude any personal information.

Different solvers (Questions For Authors)

Besides the Midpoint solver, we tested the Euler and Heun solvers. The Euler solver is first-order, while the Midpoint and Heun solvers are second-order. We set the number of function evaluations (NFE) to 6 and present the results below. Although the Euler solver yields lower error values than the Midpoint solver, it fails to generate realistic background noise. An NFE of 6 is insufficient for the Heun solver, which requires 30 steps to achieve comparable error values. In conclusion, the Midpoint solver provides the best trade-off among error values, qualitative results, and inference efficiency.

| Solver Type | NFE | Quality | L2 ↓ | Mag ↓ | Phase ↓ |
| --- | --- | --- | --- | --- | --- |
| Euler | 6 | Medium | 0.90 | 0.0066 | 1.24 |
| Midpoint | 6 | High | 1.00 | 0.0071 | 1.33 |
| Heun | 6 | Low | 16.86 | 0.0499 | 1.44 |
| Heun | 30 | Medium | 1.27 | 0.0087 | 1.36 |

STFT-based complex spectrograms (Questions For Authors)

A mel spectrogram is derived by mapping the STFT-based complex spectrogram to the mel scale. In this process, only the magnitude is retained, while the phase is discarded. Spatial audio rendering relies on precise phase information to capture interaural time differences (ITD) and interaural phase differences (IPD). These phase cues are essential for accurately conveying spatial positioning in binaural audio, ensuring a realistic 3D auditory experience. Therefore, we use STFT-based complex spectrograms, which retain both magnitude and phase information.
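As a concrete illustration of the phase cues at stake, the sketch below computes the IPD from the complex STFTs of the two channels; the STFT parameters are illustrative, not the paper's settings.

```python
import torch

def interaural_phase_difference(left, right, n_fft=1024, hop=256):
    """IPD between the left and right channels, computed as the phase of
    the cross-spectrum. A mel spectrogram keeps only the magnitudes of
    L and R, so this term, which carries the ITD/IPD spatial cues,
    would be lost."""
    window = torch.hann_window(n_fft)
    L = torch.stft(left, n_fft, hop, window=window, return_complex=True)
    R = torch.stft(right, n_fft, hop, window=window, return_complex=True)
    return torch.angle(L * torch.conj(R))  # wrapped to (-pi, pi]
```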

Latency (Questions For Authors)

We test the inference latency of our streaming model on a single 4090 GPU. The audio sampling rate is 48 kHz and the audio length is 0.683 seconds. We vary the NFE from 1 to 10 and report the corresponding inference times below. We also report the real-time factor (RTF), calculated by dividing the processing time by the audio duration; if the RTF is less than 1, the system runs faster than real time. With NFE set to 6, the RTF is 0.239. Reducing the NFE to 1 improves inference speed at the cost of some performance, yielding an RTF of 0.040. These results demonstrate our model's potential for real-time streaming generation.

| NFE | Inference Time (sec) | Real-Time Factor |
| --- | --- | --- |
| 1 | 0.027 | 0.040 |
| 2 | 0.055 | 0.081 |
| 4 | 0.109 | 0.160 |
| 6 | 0.163 | 0.239 |
| 8 | 0.217 | 0.318 |
| 10 | 0.271 | 0.397 |
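For clarity, the RTF values in the table follow directly from this definition; a one-line illustration:

```python
def real_time_factor(processing_time_s, audio_duration_s):
    """RTF < 1 means faster than real time. Example: the NFE=6 row gives
    0.163 / 0.683 ≈ 0.239."""
    return processing_time_s / audio_duration_s
```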

Streaming vs non-streaming models (Questions For Authors)

For a sequence of audio chunks, the streaming model buffers intermediate features to enable seamless streaming inference, while the non-streaming model processes each audio chunk independently without buffering. We visualize the spectrograms generated by both models in Figure 7 of the main paper, where the non-streaming model produces noticeable artifacts between audio chunks. Below, we present a quantitative comparison. The streaming model achieves better audio quality than the non-streaming version, producing smoother and more continuous audio.

| Methods | L2 ↓ | Mag ↓ | Phase ↓ |
| --- | --- | --- | --- |
| Streaming | **1.00** | **0.0071** | **1.33** |
| Non-streaming | 13.25 | 0.0398 | 1.34 |
Final Decision

The paper proposes a model for generating high quality binaural speech with flow matching models allowing for streaming applications by applying a series of tricks to improve rendering continuity and speed. Even though some of the reviewers have pointed out several similarities between this work and other works, there are still specific ideas proposed here that seem to be novel (although of limited interest). This is definitely a weak accept as the work is solid but of limited interest.