Multi-band Frequency Reconstruction for Neural Psychoacoustic Coding
Abstract
Reviews and Discussion
This paper proposes multi-band frequency spectral residual vector quantization (MBS-RVQ) for quantizing latent speech across different frequency bands. Additionally, the results demonstrate the performance of zero-shot text-to-speech models using the proposed Neural Audio Codec.
Update after rebuttal
While the proposed method could enhance RVQ, the improvement is incremental, and it still requires many residual layers, which is a burden for downstream tasks. Recently, many codecs, such as LLASA, have been designed with only a single layer. I could not find any advantage in terms of efficiency. I will maintain my score as it is.
Questions For Authors
.
Claims And Evidence
Using a multi-band audio representation is not new for neural audio codecs. Specifically, Spectral Codecs [1] proposed a multi-band spectral codec that encodes disjoint mel bands separately and quantizes them using frequency-wise vector quantization. HALL-E [2] introduced a Multi-Resolution Requantization (MReQ) method to quantize the latent representation from low to high frequencies. PyramidCodec [3] quantized the latent representation hierarchically by employing RVQ on multi-scale features. Language-Codec [4] also separates the latent representation and quantizes them individually.
[1] Langman, Ryan, et al. "Spectral Codecs: Spectrogram-Based Audio Codecs for High Quality Speech Synthesis." arXiv preprint arXiv:2406.05298 (2024).
[2] Nishimura, Yuto, et al. "HALL-E: hierarchical neural codec language model for minute-long zero-shot text-to-speech synthesis." ICLR, 2025.
[3] Jianyi Chen, Zheqi Dai, Zhen Ye, Xu Tan, Qifeng Liu, Yike Guo, and Wei Xue. 2024. PyramidCodec: Hierarchical Codec for Long-form Music Generation in Audio Domain. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 4253–4263, Miami, Florida, USA. Association for Computational Linguistics
[4] Ji, Shengpeng, et al. "Language-codec: Reducing the gaps between discrete codec representation and speech language models." arXiv preprint arXiv:2402.12208 (2024).
Methods And Evaluation Criteria
The model comparison is not entirely fair because other codecs, such as EnCodec, DAC, HiFi-Codec, and Mimi, were not trained using four RVQ levels; the comparison appears to have been conducted using only the first four RVQ levels of those baselines.
Furthermore, while EnCodec and Mimi use causal convolutional layers for streaming generation, MUFFIN employs non-causal convolutional layers with a greater number of layers, which makes the comparison somewhat unfair. Please discuss these details for the other models.
Theoretical Claims
This paper utilizes residual vector quantization, a well-established method.
Experimental Design And Analysis
Please include details such as token rate, codebook size, codebook number, and frame rate, following LLASA [5].
[5] Ye, Zhen, et al. "Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis." arXiv preprint arXiv:2502.04128 (2025).
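For reference, these quantities are related by a standard identity (bitrate = frame rate × number of codebooks × bits per codebook entry); the sketch below uses hypothetical values purely for illustration, not MUFFIN's actual configuration.

```python
import math

# Hypothetical configuration for illustration only (not MUFFIN's actual numbers).
frame_rate_hz = 50      # latent frames per second
num_codebooks = 4       # RVQ levels
codebook_size = 1024    # entries per codebook

token_rate = frame_rate_hz * num_codebooks   # 200 discrete tokens per second
bits_per_token = math.log2(codebook_size)    # 10 bits per token
bitrate_bps = token_rate * bits_per_token    # 2000 bps
print(token_rate, bits_per_token, bitrate_bps)
```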
Supplementary Material
.
Relation To Broader Scientific Literature
.
Essential References Not Discussed
[1] Langman, Ryan, et al. "Spectral Codecs: Spectrogram-Based Audio Codecs for High Quality Speech Synthesis." arXiv preprint arXiv:2406.05298 (2024).
[2] Nishimura, Yuto, et al. "HALL-E: hierarchical neural codec language model for minute-long zero-shot text-to-speech synthesis." ICLR, 2025.
[3] Jianyi Chen, Zheqi Dai, Zhen Ye, Xu Tan, Qifeng Liu, Yike Guo, and Wei Xue. 2024. PyramidCodec: Hierarchical Codec for Long-form Music Generation in Audio Domain. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 4253–4263, Miami, Florida, USA. Association for Computational Linguistics
[4] Ji, Shengpeng, et al. "Language-codec: Reducing the gaps between discrete codec representation and speech language models." arXiv preprint arXiv:2402.12208 (2024).
Other Strengths And Weaknesses
.
Other Comments Or Suggestions
An ablation study for the modified snake activation function is not conducted.
Typo on line 034: "12.5 kHz" might be "12.5 Hz."
We sincerely appreciate your time and the effort you’ve put into helping us improve our presentation.
Novelty (it appears there is some misapprehension): Our quantizer fundamentally differs from plain RVQ by employing a multi-band split directly at the latent level, guided by psychoacoustic features such as content, formant articulation, and speaker characteristics. Unlike other multi-band codecs (e.g., Spectral Codec, which splits at the input level, or HALL-E and PyramidCodec, which use multi-scale down-sampling), our approach, as accurately pointed out by other reviewers, performs a spectral split in the latent space. This design allows the codec to construct psychoacoustically disentangled features, with each codebook optimized specifically for a different frequency band (supported by theory and empirical observations), clearly distinguishing it from previous work. As demonstrated in Theorem 3.1, our method not only enhances the elegance and robustness of the codec but also pushes the boundaries of performance. We believe this novel perspective is an important contribution to the community and bridges the gap with traditional codec designs such as MP3.
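To make the distinction from plain RVQ concrete, the sketch below illustrates one possible form of band-split residual quantization in the latent space. The band edges, the FFT-mask band-limiting, and the generic `vq` modules are illustrative assumptions, not the paper's exact MBS-RVQ operators.

```python
import torch
import torch.nn as nn

class BandSplitRVQSketch(nn.Module):
    """Illustrative band-split residual VQ over a latent sequence z of shape
    (batch, channels, time). Each stage quantizes only a band-limited view of
    the current residual, so each codebook specializes on one frequency range.
    Band edges and quantizer modules are placeholders, not MUFFIN's exact design."""

    def __init__(self, quantizers, band_edges=(0.25, 0.5, 0.75, 1.0)):
        super().__init__()
        self.quantizers = nn.ModuleList(quantizers)  # one VQ module per band; assumed to return (quantized, codes)
        self.band_edges = band_edges                 # normalized upper cutoffs per stage (hypothetical)

    def _low_pass(self, x, high):
        # Keep only components below the normalized cutoff `high` along the time axis.
        spec = torch.fft.rfft(x, dim=-1)
        keep = max(1, int(high * spec.shape[-1]))
        mask = torch.zeros_like(spec)
        mask[..., :keep] = 1.0
        return torch.fft.irfft(spec * mask, n=x.shape[-1], dim=-1)

    def forward(self, z):
        residual, quantized, codes = z, torch.zeros_like(z), []
        for vq, high in zip(self.quantizers, self.band_edges):
            target = self._low_pass(residual, high)  # band-limited residual for this stage
            q, idx = vq(target)                      # quantize it with this band's codebook
            quantized = quantized + q
            residual = residual - q                  # remaining error goes to the next stage
            codes.append(idx)
        return quantized, codes
```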
Fairness (it appears there was some confusion, but we are happy to clarify): Baseline alignment: We intentionally align our work with HiFi-Codec (our baseline) by matching its use of 4 codebooks. HiFi-Codec represents SOTA speech performance, making it a rigorous and relevant benchmark. Additionally, we retrained the baseline on the same dataset to ensure robust performance and to control for any potential data bias, thereby ensuring a fair comparison. Benchmarking against other work: For our comparisons with other codecs (e.g., DAC and EnCodec), we adopt the "early codebooks" approach, consistent with recent studies such as WavTokenizer, SemantiCodec, and SpeechTokenizer. MUFFIN performs well even when evaluated against results reported in other papers on the same datasets, showing consistency with the literature.
Computational fairness: RVQ (the benchmarked scheme) mathematically encourages each codebook to be as self-sufficient as possible. Each stage minimizes its own L1/L2 reconstruction error without "looking ahead" to future stages. This independent optimization ensures that the early codebooks of any model, whether trained with 4 or 32 codebooks, are directly comparable, since each stage is forced to capture as much residual information as possible.
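For contrast with the band-split sketch above, a minimal sketch of plain RVQ is shown below: each stage greedily quantizes whatever residual the previous stages left, with no band structure. The nearest-neighbour codebook lookup here is a generic illustration, not any specific baseline's implementation.

```python
import torch

def plain_rvq_sketch(z, codebooks):
    """Plain residual VQ: z is (batch, dim); codebooks is a list of (K, dim) tensors.
    Each stage does a nearest-neighbour lookup on the current residual, so early
    codebooks absorb as much of the signal as they can regardless of later stages."""
    residual, quantized, codes = z, torch.zeros_like(z), []
    for cb in codebooks:
        idx = torch.cdist(residual, cb).argmin(dim=-1)  # nearest codeword per vector
        q = cb[idx]                                     # selected codewords
        quantized = quantized + q
        residual = residual - q                         # only the leftover goes to the next stage
        codes.append(idx)
    return quantized, codes
```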
Lastly, Mimi is compared with its officially reported count of 8 codebooks to further validate our results. Overall, our comparative tables are constructed on a scientifically fair basis, allowing for meaningful insights rather than merely demonstrating superior performance or over-claiming our work.
Streaming: Our streaming capability is built on a fully CNN-based model that closely follows EnCodec's design, so it would be inaccurate to say that ours is non-causal. Specifically, our system processes audio in small window frames (3.5 seconds) that are non-causal within each frame, i.e., using the global context of that window, but causal over past windows for all streaming applications. Since CNNs are fundamentally local feature extractors, they do not inherently capture global context as transformers do. This locality is advantageous in streaming applications, as it allows for more stable and consistent performance under strict causal constraints. By contrast, self-attention models (e.g., WavTokenizer), although trained in a non-causal manner, must be adapted to causal computation during streaming, which can lead to instability. Our approach leverages the stability of CNNs in local processing, ensuring robust streaming even when constrained to causal operation.
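A rough sketch of the windowed streaming scheme described above, assuming a codec object with `encode`/`decode` methods (a hypothetical interface) and a 3.5 s window; the real system's buffering and overlap handling is not reproduced here.

```python
def stream_reconstruct(codec, audio, sample_rate, window_s=3.5):
    """Window-level causal streaming sketch: full (non-causal) context inside each
    window, but only past audio is ever visible, so the output for window k depends
    only on windows 1..k. `codec.encode`/`codec.decode` are placeholder methods."""
    hop = int(window_s * sample_rate)
    reconstructed = []
    for start in range(0, len(audio), hop):
        chunk = audio[start:start + hop]           # current window only; no future samples
        codes = codec.encode(chunk)                # non-causal convolutions within the window
        reconstructed.append(codec.decode(codes))  # emitted before the next window arrives
    return reconstructed
```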
Following LLASA: We fully agree with the reviewer's concern regarding the importance of these metrics. Due to space constraints, the detailed table was placed in Appendix F of the original submission, where we report these quantities comprehensively, along with MACs as a latency metric and model size. We will also add the references raised and fix the typo on line 034.
Ablation of snake activation: While the modified snake activation is not the core contribution of this work, we agree that reporting its ablation improves the quality of the presentation. We will include the results in Appendix C; the performance is shown below.
LibriTTS (test-clean)
| Model | STFT | MEL | PESQ | STOI | UTMOS | ViSQOL |
|---|---|---|---|---|---|---|
| Added amplitude & bias (Ours) | 1.555 | 0.692 | 2.996 | 0.954 | 4.017 | 4.516 |
| Added amplitude | 1.603 | 0.744 | 2.928 | 0.945 | 3.943 | 4.448 |
| Vanilla | 1.635 | 0.760 | 2.876 | 0.940 | 3.905 | 4.409 |
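For reference, the sketch below shows one plausible parameterization of the three activation variants in the table above (vanilla snake, added amplitude, added amplitude & bias). The specific form of the amplitude and bias terms is our assumption for illustration, not the paper's exact modification.

```python
import torch
import torch.nn as nn

class SnakeVariant(nn.Module):
    """Snake-style activation sketch for inputs of shape (batch, channels, time).
    Vanilla snake: x + (1/alpha) * sin^2(alpha * x), with per-channel learnable alpha.
    The learnable amplitude and bias extensions below are illustrative guesses at
    the ablated variants, not the paper's exact formulation."""

    def __init__(self, channels, learn_amplitude=False, learn_bias=False):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1, channels, 1))  # frequency of the periodic term
        self.amp = nn.Parameter(torch.ones(1, channels, 1)) if learn_amplitude else None
        self.bias = nn.Parameter(torch.zeros(1, channels, 1)) if learn_bias else None

    def forward(self, x):
        periodic = torch.sin(self.alpha * x) ** 2 / (self.alpha + 1e-9)
        if self.amp is not None:
            periodic = self.amp * periodic   # "added amplitude"
        out = x + periodic
        if self.bias is not None:
            out = out + self.bias            # "added bias"
        return out
```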
MUFFIN is an improved RVQ-based neural audio codec (NAC) that uses a multi-band spectral split for each RVQ sub-layer to better disentangle different frequency bands into separate RVQ sub-layer codebooks ("psychoacoustically guided"). This enables improved bitrate allocation based on psychoacoustic studies, bridging traditional codec design (MP3, Opus) and NACs toward perception-oriented architectural design.
Questions For Authors
As mentioned in the Claims And Evidence section, while most of the paper is well structured, I feel the no-MBS ablation is important to add in order to make the paper's core claim stronger.
Claims And Evidence
The claims and evidence are mostly presented adequately via quantitative and qualitative analysis, including sub-layer reconstruction and PCA analysis of each codebook. However, I would like to see an ablation study that disables the "MBS" portion of MBS-RVQ while keeping the rest of the design unchanged. In other words, a MUFFIN model trained only with plain RVQ (as in EnCodec, DAC, or HiFi-Codec), without the psychoacoustic guidance, would be a better comparison. Since the use of "MBS" is a core claim of this work, such an ablation study seems to be the most important experiment.
Methods And Evaluation Criteria
The work follows conventional metrics for codec reconstruction, which makes sense. The addition of WER is welcome, as some recent low frame-rate codecs do not perform well on it even though they score well on acoustic reconstruction metrics such as UTMOS and ViSQOL. I suggest the authors also consider SECS as a viable metric.
Theoretical Claims
The theoretical claim regarding psychoacoustic evidence of perceptual speech characteristics is based on well-established literature; it is not necessarily a new claim, but it serves as a good reference bridging existing theory into a neural model design.
Experimental Design And Analysis
Since the authors retrained HiFi-Codec with the same configuration, but not others, I think it's good to annotate it in the evaluation result tables.
Supplementary Material
I reviewed the appendix and the demo page.
Relation To Broader Scientific Literature
The findings can potentially draw the NAC community's attention to psychoacoustics, which has been well studied for past decades, bringing domain-specific knowledge into neural design rather than disconnecting from an established past.
Essential References Not Discussed
BigVGAN [ICLR'23] is the first work that introduced the Snake activation into the audio decompression domain (as a mel-spectrogram vocoder), and it is also the first study to propose a learnable scaling factor β (called SnakeBeta) in its official implementation. However, the current manuscript attributes this only to follow-up studies (DAC and Stable Audio). Since this paper introduces a further study of periodic activation function design, I suggest the authors include the above-mentioned original reference.
Other Strengths And Weaknesses
Please see Claims And Evidence and Questions section.
Other Comments Or Suggestions
None
We appreciate your dedication in carefully scrutinizing our work; it means a lot to us.
Disabling MBS: We agree that an ablation study disabling "MBS" is important to better demonstrate its contribution to reconstruction performance. Part of this analysis has already been presented in Table 5, where we compare MUFFIN with plain RVQ to evaluate WER, STOI, and the behavior of each individual codebook. To further strengthen our empirical evidence, we will include detailed reconstruction performance results in a new appendix, as shown below, focusing on speech reconstruction.
LibriTTS (test-clean)
| Model | STFT | MEL | PESQ | STOI | UTMOS | ViSQOL |
|---|---|---|---|---|---|---|
| MUFFIN | 1.555 | 0.692 | 2.996 | 0.954 | 4.017 | 4.516 |
| RVQ | 1.627 | 0.768 | 2.856 | 0.940 | 3.875 | 4.328 |
| MUFFIN (12.5 Hz) | 1.663 | 0.807 | 2.360 | 0.932 | 4.074 | 4.225 |
| RVQ (12.5 Hz) | 1.755 | 0.879 | 2.260 | 0.924 | 3.785 | 4.017 |
LibriTTS (test-other)
| Model | STFT | MEL | PESQ | STOI | UTMOS | ViSQOL |
|---|---|---|---|---|---|---|
| MUFFIN | 1.615 | 0.758 | 2.658 | 0.934 | 3.444 | 4.454 |
| RVQ | 1.683 | 0.810 | 2.544 | 0.917 | 3.318 | 4.268 |
| MUFFIN (12.5 Hz) | 1.725 | 0.875 | 2.086 | 0.904 | 3.560 | 4.129 |
| RVQ (12.5 Hz) | 1.863 | 0.963 | 1.940 | 0.815 | 3.399 | 3.993 |
IEMOCAP
| Model | STFT | MEL | PESQ | STOI | UTMOS | ViSQOL |
|---|---|---|---|---|---|---|
| MUFFIN | 1.399 | 0.675 | 2.178 | 0.806 | 1.903 | 4.000 |
| RVQ | 1.510 | 0.793 | 2.039 | 0.715 | 1.805 | 3.883 |
| MUFFIN (12.5 Hz) | 1.429 | 0.754 | 1.726 | 0.723 | 2.026 | 3.612 |
| RVQ (12.5 Hz) | 1.584 | 0.835 | 1.644 | 0.645 | 1.917 | 3.455 |
From the above tables, there is a consistent improvement from using MBS, demonstrating the effectiveness of MUFFIN (supported by our theorem).
Using SECS for evaluation: We acknowledge that our current evaluation metrics for the codec do not include human evaluations, and we agree that SECS may be a viable addition. The metrics we adopted follow previous work (e.g., Codec-SUPERB), which argued that these objective measures provide sufficient coverage.
Nevertheless, we appreciate your suggestion and have attempted it. However, given the STOI scores of the reconstruction results and the nature of our task (i.e., reconstructing existing speech rather than generating entirely new speech), it can be challenging for human evaluators to reliably distinguish subtle quality differences, as shown in the demos (especially without cherry-picking samples). A similar issue has been discussed in [1].
Therefore, we find it difficult to implement and believe that relying on objective metrics, with their more precise distance measures, is more appropriate for evaluating the codec's performance. However, we are also careful with our evaluations and have indeed used human evaluation for our TTS outputs, as shown in Table 6, where naturalness (MOS) and speaker similarity can be meaningfully assessed. We trust that this is a common and valid concern, and we will add a new appendix section to address misunderstandings around evaluating codec performance with human listeners, while also citing SECS.
[1] Varadhan, Praveen Srinivasa, et al. "Rethinking MUSHRA: Addressing Modern Challenges in Text-to-Speech Evaluation." arXiv preprint arXiv:2411.12719 (2024).
Annotation of off-the-shelf models in tables and snake activation references: It makes perfect sense and we will update the manuscript with the provided references to enhance the credibility of our results. We deeply appreciate your effort in pointing out our weaknesses.
Thank you for the rebuttal. I find the no-MBS ablation (only disabling MBS while keeping other details of MUFFIN identical) helpful for readers to understand the merit in a precise manner. The acoustic metrics seem to agree with the motivation, with consistent improvements.
Can the authors present audio reconstruction demos of this baseline using the samples from (F) Psychoacoustic Codebook Auditory Analysis in the demo? This will also help readers evaluate the disentanglement MBS is claimed to bring (vs. plain RVQ) by disabling it, and form their own opinion about its perceptual significance.
To clarify regarding SECS, I meant speaker encoder cosine similarity (also noted as SIM-o in the zero-shot TTS literature) using a speaker encoder model (WavLM-TDCNN), originally proposed in VALL-E, which has become one of the gold-standard metrics (alongside CER/WER) for measuring speaker similarity. This can be placed alongside S-MOS in Table 6. Since the authors have already conducted human evaluations, adding the objective SIM-o metric will strengthen the results of MUFFIN used as a speech LM tokenizer.
We appreciate the opportunity to engage with your feedback once again and take your valuable comments seriously.
Demos: We have included the ablation audio in section (F) and fully agree that providing such materials enhances the immersive experience for the reader. We invite you to revisit the demo page to compare the results of a plain RVQ model, which optimizes purely for residual error. In this setup, most of the information is forced into the first codebook, while the subsequent codebooks capture only minor residuals, often lacking meaningful representation. In contrast, our proposed MBS approach is inspired by psychoacoustic studies. It organizes auditory information by frequency bands, which may encourage a more natural, unsupervised separation of perceptually relevant features. This design can help the model capture semantically useful representations without relying on explicit labels, potentially easing the burden of manual annotation. Furthermore, it supports more effective neural optimization and reconstruction, in line with psychoacoustic principles exploited in traditional codec designs such as MP3.
SECS metrics: Thank you for the clarification regarding SECS and its relation to SIM-o in the zero-shot TTS literature. Following your suggestion, we have calculated SECS using Resemblyzer and updated Table 6 accordingly:
| Systems | WER | MOS | S-MOS | SECS |
|---|---|---|---|---|
| VALL-E w/ Encodec | 21.05% | 3.91 | 3.70 | 0.5914 |
| VALL-E w/ Hifi-Codec | 32.35% | 4.00 | 4.04 | 0.5874 |
| VALL-E w/ MUFFIN | 12.20% | 4.18 | 4.19 | 0.6099 |
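As a pointer for readers who wish to reproduce numbers of this kind, a minimal SECS computation with Resemblyzer is sketched below; the file paths are placeholders, and this is one possible setup rather than the authors' exact evaluation script.

```python
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()  # pretrained speaker encoder bundled with Resemblyzer

# Placeholder paths: a reference (prompt) utterance and a synthesized utterance.
ref_embed = encoder.embed_utterance(preprocess_wav("reference.wav"))
gen_embed = encoder.embed_utterance(preprocess_wav("generated.wav"))

# Speaker encoder cosine similarity (SECS) between the two embeddings.
secs = float(np.dot(ref_embed, gen_embed)
             / (np.linalg.norm(ref_embed) * np.linalg.norm(gen_embed)))
print(f"SECS = {secs:.4f}")
```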
We appreciate your suggestion regarding the inclusion of SECS as an objective metric for speaker similarity and acknowledge its increasing adoption in recent literature. While SECS can certainly provide complementary insights, we would also like readers to know that such metrics can be sensitive to factors such as the choice of speaker encoder, background noise, and linguistic content, which may introduce ambiguity in interpretation. These considerations explain why we prioritized human evaluation with S-MOS in the initial Table 6, which directly assesses perceived speaker similarity and captures aspects often overlooked by embedding-based metrics, including prosody, speaking style, and emotional nuance, as highlighted in prior studies. Nevertheless, we agree that combining both measures provides a more comprehensive and robust evaluation of speaker similarity, and we are happy to include SECS in our updated report.
We hope that your concerns have been well-addressed. If not, please let us know as we are eager to further improve our work and strengthen its potential impact on future research.
The paper introduces MUFFIN, a neural psychoacoustic codec leveraging Multi-Band Spectral Residual Vector Quantization (MBS-RVQ) and a modified snake activation function. By decomposing latent representations into psychoacoustically motivated frequency bands, MUFFIN optimizes bitrate allocation and achieves state-of-the-art audio reconstruction quality across speech, music, and environmental sounds. Extensive experiments demonstrate superior performance over existing codecs (e.g., HiFi-Codec, Encodec) in both standard and high-compression settings, with applications in zero-shot text-to-speech synthesis.
Questions For Authors
- How does MUFFIN handle non-stationary or transient sounds (e.g., percussive elements in music), given the focus on speech-centric psychoacoustics?
- Could the environmental sound performance be improved with a larger dataset, or is the current approach inherently biased toward speech/music?
- What are the practical limitations of the 12.5 Hz variant in real-time streaming, given the increased downsampling rate?
Claims And Evidence
Yes, both the experimental data provided by the author and the audio provided on the homepage demonstrate the effectiveness of their method.
Methods And Evaluation Criteria
Yes
Theoretical Claims
I haven't checked all the proofs because I'm not very familiar with this field, but I haven't found any errors so far.
Experimental Design And Analysis
I checked the main experiments in the paper and refer to the section on Other Strengths and Weaknesses for detailed opinions.
Supplementary Material
I didn't review the supplementary material, but I checked the provided homepage.
Relation To Broader Scientific Literature
The key contributions of MUFFIN are deeply rooted in and extend the broader scientific literature on neural audio coding, psychoacoustics, and multi-band signal processing. By introducing MBS-RVQ, leveraging psychoacoustic principles, and proposing novel architectural improvements (e.g., modified snake activation), MUFFIN addresses longstanding challenges in the field and sets a new standard for high-fidelity, efficient audio compression. Its applications in zero-shot TTS and potential integration with LLMs further underscore its relevance to cutting-edge research in speech and audio processing.
Essential References Not Discussed
I am not deeply versed in this field, but I believe the author has provided a fairly comprehensive citation of relevant work.
Other Strengths And Weaknesses
Strengths: MBS-RVQ effectively disentangles speech attributes (content, speaker identity) into distinct codebooks, aligning with psychoacoustic principles. This is a significant advancement in neural audio coding. The Lipschitz continuity analysis of the encoder and ablation studies (e.g., t-SNE visualizations, codebook-specific reconstructions) validate the design choices. MUFFIN outperforms baselines across metrics (PESQ, STOI, UTMOS) and datasets (LibriTTS, IEMOCAP, GTZAN), particularly at high compression rates (12.5 Hz). The codec’s efficiency (lower MACs than HiFi-Codec) and compatibility with LLMs (via tokenized representations) highlight its potential for real-time and generative applications.
Weaknesses: The ESC-50 dataset (3 hours) is small compared to speech/music datasets, raising concerns about generalizability to environmental audio. Automated MOS (UTMOS/ViSQOL) is used instead of human evaluations, which are critical for perceptual quality claims. While MACs are reduced, latency and real-time performance are not quantitatively compared to streaming-focused codecs like AudioDec. Although misuse risks (e.g., deepfakes) are acknowledged, concrete mitigation strategies are absent.
Other Comments Or Suggestions
- Include human subjective evaluations (MOS) to strengthen perceptual quality claims.
- Expand environmental sound experiments with larger datasets (e.g., AudioSet).
- Discuss latency benchmarks relative to real-time codecs (e.g., OPUS, AudioDec).
- Clarify ethical safeguards (e.g., watermarking synthesized audio) in the impact statement.
We thank you for your constructive comments and thoughtful concerns, which help to improve the impact of this work and spark further discussion.
Using a larger audio set: We agree with the valid concern regarding the size of the environmental audio dataset, and we welcome further discussion on this topic. Our findings indicate that integrating both speech and music data helps overcome the low-resource setting while achieving SOTA environmental audio performance, even with a smaller set, compared to various off-the-shelf models (DAC, EnCodec) trained on much larger collections (see Table 4). This can be attributed to our training recipe, which follows existing work and uses short 1 s segments that capture brief vocal or instrumental passages. These segments tend to be somewhat similar in their audio characteristics to some environmental audio, thereby reducing the reliance on distinctively larger audio datasets. Moreover, our interest is in the vocal domain, where psychoacoustic features such as vocal timbre and articulation are the highlight of the neural psychoacoustic codec. Thus, we do not consider a larger general-audio dataset in this work, focusing instead on speech and music (including singing vocals).
Human evaluation and latency: We agree that the absence of human evaluations is a common concern. To address this, we will add a dedicated subsection in Appendix F explaining how the objective metrics, widely adopted in the literature, correlate with human perceptual quality, thereby showing that our report is self-sufficient without human evaluation of the codec. Further detailed clarifications are provided in our response to Reviewer grYU (on using SECS for evaluation). Similarly, our latency benchmarks, presented in Appendix F, are based on MACs and model parameters, which provide a reasonable objective proxy for inference time while normalizing for factors such as GPU specifications.
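For readers who want to reproduce MACs/parameter counts for a codec, one common approach uses the thop profiler, sketched below; the stand-in model and input shape are placeholders, and this is not necessarily the tooling behind the Appendix F numbers.

```python
import torch
from thop import profile  # pip install thop; a common MACs/params profiler

# Stand-in model and input for illustration; in practice, pass the codec under test
# and an input matching its expected sample rate and duration.
model = torch.nn.Conv1d(1, 16, kernel_size=7, padding=3)
dummy_input = torch.randn(1, 1, 24000)  # 1 second of 24 kHz mono audio (assumed)

macs, params = profile(model, inputs=(dummy_input,))
print(f"MACs: {macs / 1e9:.3f} G, Params: {params / 1e6:.3f} M")
```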
Transient audio: We appreciate the reviewer’s thoughtful observation. While psychoacoustic studies have primarily focused on speech, we agree that applying similar analysis to non-stationary or transient sounds, such as those in music, is both important and intriguing. To explore this, we extended our decomposition approach to a variety of musical genres, including singing, classical, jazz, and symphonic music. Consistent with the psychoacoustic framework used in speech analysis, we observed that:
• Codebook 1: primarily captures vocal content and coarse rhythmic beats.
• Codebook 2: emphasizes vocal clarity and mid-frequency information.
• Codebook 3: encodes pitch details reflective of the singer’s unique characteristics.
We have updated our demos to include samples that support these observations. Interestingly, instrumental content does not clearly separate across Codebooks 2 and 3, suggesting that our psychoacoustically guided representation is particularly effective at disentangling vocal attributes (speech and singing), but less so for purely instrumental channels. This finding reinforces the theoretical value of psychoacoustic principles for modeling vocal properties, an area that remains underexplored in neural codecs. While applying this framework to instrumental music remains challenging, we believe this opens new research directions. Further investigations, beyond the scope of the current study, will be discussed in a future-work section in the appendix to encourage continued exploration of this promising line of research.
Practical limitations of 12.5 Hz: Achieving high compression rates in audio codecs often challenges the preservation of the full spectrum of human hearing, potentially leading to muffled sounds or perceptible artifacts (especially so for streaming, with reference to Mimi's performance). To address this, integrating psychoacoustic models can enhance reconstruction quality by optimizing compression across multiple frequency bands, focusing on perceptually significant components. However, implementing such models typically necessitates more complex and deeper neural networks to effectively quantize and encode the nuanced psychoacoustic information without significant loss. This increased complexity can lead to larger model sizes, which may offset the benefits of efficient compression by demanding more computational resources and storage capacity. Therefore, a careful balance must be struck between leveraging psychoacoustic properties for improved audio quality and managing the trade-offs related to model complexity and compression efficiency. This could spark further research in this area.
Thank you for the authors' reply. I will maintain my score.
This paper presents MUFFIN, a neural psychoacoustic codec that leverages:
- Multi-Band Spectral Residual Vector Quantization (MBS-RVQ) to quantize latent speech representations across different frequency bands.
- A modified snake activation function to more precisely capture fine-grained frequency details. In this regard, the authors should cite the BigVGAN paper, as noted by Reviewer grYU.
Both the perceptual quality showcased on the demo website and the quantitative evaluations demonstrate solid improvements.
Regarding LLASA and single-layer codecs:
- It's a bit unfair to treat LLASA as a reason for rejection, given that the ICML submission deadline was before the LLASA paper's arXiv release. However, the authors should acknowledge and discuss the related work highlighted by the reviewer in their final draft. While single-layer codecs are a compelling direction and may ultimately be ideal, they still have received mixed feedback from practitioners.