PaperHub
Score: 5.5/10
Poster · 3 reviewers (min 2, max 4, std 0.8)
Individual ratings: 4, 2, 3
ICML 2025

SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation

OpenReview · PDF
Submitted: 2025-01-11 · Updated: 2025-07-24
TL;DR

We propose SongGen, a fully open-source, single-stage auto-regressive transformer designed for controllable song generation, supporting both mixed mode and dual-track mode generation.

Abstract

Keywords
text-to-song, song generation, auto-regressive transformer

Reviews and Discussion

Review (Rating: 4)

This paper proposes a single-stage autoregressive transformer for song generation that produces vocals and accompaniment either simultaneously or in an interleaved manner. SongGen-Mixed Pro utilizes the delayed token prediction method from MusicGen, along with an auxiliary vocal token prediction to enhance vocal learning. Additionally, SongGen-Interleaving explores interleaving methods such as A-V or V-A.
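
For readers less familiar with these token layouts, the sketch below illustrates, under stated assumptions, a MusicGen-style delay pattern and the frame-level A-V / V-A interleaving referred to above; in Mixed Pro, an auxiliary cross-entropy loss over vocal tokens would additionally be applied alongside the main loss over mixed tokens. The function names, `PAD` id, and tensor shapes are hypothetical, not the authors' code.

```python
# Hedged sketch (not the authors' code) of the two token layouts summarized above:
# a MusicGen-style delay across codebooks, and frame-level interleaving of
# accompaniment (A) and vocal (V) tracks.
import torch

PAD = -1  # hypothetical padding id used only for this illustration

def delay_pattern(codes: torch.Tensor) -> torch.Tensor:
    """Shift codebook k right by k steps; codes has shape (K, T)."""
    K, T = codes.shape
    out = torch.full((K, T + K - 1), PAD, dtype=codes.dtype)
    for k in range(K):
        out[k, k:k + T] = codes[k]
    return out

def interleave_tracks(acc: torch.Tensor, voc: torch.Tensor, order: str = "A-V") -> torch.Tensor:
    """Alternate A and V frames along time; both inputs have shape (K, T)."""
    first, second = (acc, voc) if order == "A-V" else (voc, acc)
    K, T = first.shape
    out = torch.empty((K, 2 * T), dtype=first.dtype)
    out[:, 0::2] = first   # frames of the first track at even positions
    out[:, 1::2] = second  # frames of the second track at odd positions
    return out
```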

Update after rebuttal

After reading all the reviews from different reviewers and the author’s feedback, I still believe this paper presents solid work, despite the limitation of using a 16 kHz sampling rate. I strongly recommend accepting this paper for ICML 2025.

Questions for the Authors

More details about X-Codec should be provided.

Claims and Evidence

The proposed idea is very simple yet powerful. Although this paper follows the overall framework of MusicGen (specifically the delayed token prediction method), it significantly improves mixed acoustic audio generation performance by incorporating auxiliary vocal token prediction. This simple concept has the potential to enhance overall song generation performance.

Furthermore, the authors adopted an interleaving prediction method for dual-track generation. They investigate the effectiveness of different prediction orders, such as vocal-first or accompaniment-first.

I recommend adding a discussion of similar work, such as MusicGen-Stem [1], which predicts the bass token first, followed by drums and other components, to edit music. This suggests that disentangling some components is important for predicting others.

[1] Rouard, Simon, et al. "MusicGen-Stem: Multi-stem music generation and edition through autoregressive modeling." ICASSP, 2025.

Methods and Evaluation Criteria

It would be preferable to include the evaluation code after the paper is accepted, given the limited open-source implementations for music generation. This could have a significant impact on the audio generation community.

Theoretical Claims

This paper used a well-defined token prediction method.

Experimental Design and Analysis

The main concern is that SongGen utilizes a much larger dataset than the other models. Furthermore, I find it hard to believe that X-Codec could outperform other codecs, given that it was only trained on a speech dataset. Did you retrain X-Codec with an audio dataset? Please provide more details about the codec.

In addition, the model uses a sampling rate of 16 kHz. This may undermine the contributions of the paper, as many audio generation models are trained at sampling rates of 32 kHz or above. Moreover, many objective metrics are calculated at a sampling rate of 16 kHz, which might lead to unfair comparisons.

Although the authors conducted many ablation studies, these do not demonstrate the superiority of this model when using 16 kHz audio.

There is important information available at sampling rates above 16 kHz. Please move the limitation section to the main manuscript.

Supplementary Material

.

Relation to Prior Literature

.

Missing Important References

.

Other Strengths and Weaknesses

.

Other Comments or Suggestions

For interleaving methods, do you mix the vocals and accompaniment by simply adding them together? I suggest building a post-mixing parallel generation model that uses the generated vocal and accompaniment to produce a refined waveform. If possible, you could also design the model to incorporate audio super-resolution or stereo generation.

Author Response

We are grateful for the reviewer's positive response and constructive comments. Below, we address each of the concerns and suggestions in detail.

Q1: Details of X-Codec

In our work, we use the publicly released 'xcodec_hubert_general_audio' checkpoint provided by the X-Codec authors. This model is trained on a large-scale private dataset (~200,000 hours) with a distribution similar to AudioSet, which is specifically adapted for general audio tasks.

While Encodec and DAC indeed yield better perceptual quality for audio reconstruction, we observed that in song generation, both codecs resulted in higher rates of invalid outputs—such as failure to follow lyrics and generation of noise or silence. In contrast, X-Codec consistently demonstrated more stable training, faster convergence, and higher success rates in generating coherent vocals, with notably lower PER scores. We speculate that the advantage of X-Codec may stem from its pretraining on a large amount of music data. Additionally, as emphasized in the X-Codec paper, the incorporation of not only acoustic but also semantic features from SSL representations might also contribute positively to the performance.

Although Encodec and DAC have been widely adopted in prior audio generation systems across domains such as speech, sounds, and instrumental music, song generation presents a substantially higher level of semantic complexity. Unlike speech, singing involves a broader pitch range, a rich variety of expressive techniques, and highly dynamic rhythmic patterns. Moreover, songs demand the precise coordination of vocal and instrumental components to achieve harmonic and structural coherence. Despite X-Codec operating at a relatively low sampling rate of 16 kHz, we selected it as the most suitable option available at the time. To date, high-fidelity, song-specific neural codecs tailored for generative modeling remain an open challenge in the research community.

Q2: Use of 16 kHz Sampling Rate and Its Limitations

We appreciate the reviewer’s critical insight regarding the sampling rate. We acknowledge that using 16 kHz may limit audio fidelity and overlook important high-frequency content. As suggested, we will move the discussion of this limitation to the main manuscript. As discussed in Q1, high-fidelity, song-specific codecs suitable for generation are still lacking. We thank the reviewer for this important suggestion, and we are actively working on integrating audio super-resolution modules and stereo rendering to mitigate this limitation in future work.

Regarding evaluation, we followed widely used protocols to ensure consistency with prior work. However, we recognize that many objective metrics computed at a 16 kHz sampling rate may fail to capture information above this frequency range. We agree that developing higher-resolution evaluation metrics is crucial for advancing the field of audio generation. We would also like to clarify that all ablation studies reported in the main manuscript were conducted using X-Codec at a 16 kHz sampling rate, which reflects the model’s performance under this setting. We believe these results still offer valuable insights, demonstrating the effectiveness of our proposed token pattern design for jointly modeling vocal and accompaniment tracks within a single-stage autoregressive Transformer framework.

Q3: Track Mixing

Thank you for the valuable suggestion. Currently, we mix the vocal and accompaniment tracks by simple waveform addition. We agree that developing a post-mixing parallel generation model could further improve the quality of the final output, and we plan to explore this direction in future work.
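
For concreteness, here is a minimal sketch of the waveform-addition mixing described above; the peak-normalization guard is our own assumption to avoid clipping, not necessarily part of the authors' procedure.

```python
# Minimal sketch of simple waveform-addition mixing (assumption-level illustration).
import numpy as np

def mix_tracks(vocal: np.ndarray, acc: np.ndarray) -> np.ndarray:
    """Mix two mono waveforms (same sampling rate) by sample-wise addition."""
    n = min(len(vocal), len(acc))   # align lengths defensively
    mixed = vocal[:n] + acc[:n]
    peak = np.max(np.abs(mixed))
    return mixed / peak if peak > 1.0 else mixed  # avoid clipping after summation
```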

Q4: Discussion of MusicGen-Stem

Thank you for the suggestion. We agree that MusicGen-Stem provides valuable insights into the importance of disentangling different components in instrumental music generation. We will include a discussion of this work in the revised Related Work section.

Q5: Release of Evaluation Code

Thank you for the suggestion. We will release the evaluation dataset and code to support reproducibility and benefit the broader song generation research community.

Reviewer Comment

After reading all the reviews from different reviewers and the author’s feedback, I still believe this paper presents solid work, despite the limitation of using a 16 kHz sampling rate. I strongly recommend accepting this paper for ICML 2025.

Author Comment

Thank you so much for your encouraging feedback. We are truly grateful for your recognition of our work and your strong recommendation. Your thoughtful review and kind support mean a great deal to us. We sincerely appreciate the time, effort, and expertise you dedicated to evaluating our submission, and your positive endorsement greatly motivates us to further pursue this line of research.

Review (Rating: 2)

The authors propose SongGen, a pre-trained model for text-to-song generation supporting a variety of input controls (voice identity, music style description, lyrics) and two output modes (mixture, vocals + accompaniment independently). They explore numerous training configurations for modeling two streams of audio tokens with a single model. The authors will release pre-trained model weights, code, and data annotations.

Questions for the Authors

  • Why are the objective metrics missing for GT / Suno?
  • In sound examples, the reference voice contains the input lyrics - does SongGen generalize when the reference voice differs from lyrics prompt?
  • Why does “GT” sound so compressed in the sound examples page?
  • What does it mean to release “annotated data”? Does this mean releasing just the annotations or the MSD audio (which is copyrighted)?

Claims and Evidence

Key claims are not supported by adequate evidence. For example, it is claimed that “SongGen significantly outperforms Stable Audio Open, MusicGen, and Parler-tts across both subjective and objective metrics”. There are a few issues with this claim. Firstly, no precise analyses of statistical significance are presented to substantiate the claim. Secondly, it is unclear how statistical significance could be formulated for distribution-level metrics like FAD with only one training run (and therefore one sample). Thirdly, the claim is a strawman w.r.t. models like MusicGen and StableAudioOpen which do not support lyrics conditioning.
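
As one hedged illustration of the point about distribution-level metrics: with a single trained model, uncertainty is sometimes estimated by bootstrapping the evaluation samples, which captures sampling variability but not training-run variance. The `compute_metric` helper and array layout below are hypothetical.

```python
# Hypothetical sketch: bootstrap a confidence interval for a distribution-level
# metric (e.g., FAD) over the generated evaluation samples. This addresses
# sampling variability only, not variance across training runs.
import numpy as np

def bootstrap_metric_ci(real_embs: np.ndarray, fake_embs: np.ndarray,
                        compute_metric, n_boot: int = 1000, alpha: float = 0.05):
    rng = np.random.default_rng(seed=0)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(fake_embs), size=len(fake_embs))  # resample with replacement
        scores.append(compute_metric(real_embs, fake_embs[idx]))
    lo, hi = np.quantile(scores, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```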

Methods and Evaluation Criteria

The proposed methods of delay patterns are reasonable; however, there is limited evidence of broader applicability - the methods may be fairly narrowly constrained to this specific task of multitrack music modeling. The evaluation criteria are reasonable overall, though a challenge here is heterogeneity in the input modalities supported by baselines. It would have been preferable to break down the evaluation into sets of baselines with a common “type signature” using appropriate prompts, e.g.: instrumental only (MusicGen, StableAudio, proposed), and voice + lyrics (Suno, Jukebox, proposed). As it stands, the comparisons are straw men. Also, given the subjectivity of the task, I would have strongly preferred to see a pairwise subjective evaluation setup over opinion scores.

Theoretical Claims

N/A

Experimental Design and Analysis

Experimental setups are overall reasonable. Would have been nice to see a quantitative analysis of the proposed data processing pipeline. Does the lyric recognition pipeline improve performance on a small dataset of gold standard lyrics transcription? Does the proposed CLAP-based captioning filtering improve human-judged relevance overall?

Supplementary Material

Briefly reviewed all the supplementary audio material. I am fairly intrigued by the claims in section D that semantic information is helpful for generation. I would like to see this investigated in greater detail, as this observation seems to represent a relative increase in metrics that far exceeds that of the proposed methods in the rest of the paper.

Relation to Prior Literature

This paper generally relates to an increasing interest in text-to-music generation emphasizing broader forms of control and increasing quality. There are a number of recent papers that explore joint modeling of multi-stream audio, e.g., Moshi (Défossez et al. 2024), with similar or different token patterns. There is little to no discussion of these other works.

Missing Important References

Other work exploring joint modeling of multi-stream audio (both within music and outside in speech). Other papers that focus on multi-stream generation (e.g. SingSong Donahue et al. 2023, Diff-A-Riff Nistal et al. 2024)

Other Strengths and Weaknesses

Strengths: the release of the weights for this model will be helpful for the broader open weights music AI research community.

Weaknesses:

  • Results are very low fidelity (16 kHz, noisy) - is this entirely due to the codec, or is the generative model also contributing? Why not use a higher-quality codec?
  • The authors frame joint modeling of P(voice, accompaniment) as a feature over pipeline-based approaches of P(accompaniment | voice) * P(voice). However, this could just as easily be framed as a criticism, as this model does not obviously support vocals-to-accompaniment generation.
  • Overall, an impressive engineering feat, but limited interest from a research point of view.

Other Comments or Suggestions

“Accompaniment is easier to produce” in intro: unjustified, subjective, potentially misleading

Author Response

We thank the reviewer for the detailed and critical feedback. Below, we provide point-by-point responses to the concerns.

Q1: Single-stage vs. Two-stage

We respectfully disagree that joint modeling should be viewed as a criticism. For text-to-song generation, single-stage models consistently outperform the two-stage pipeline in both efficiency and generation quality. Due to space limitations, we kindly refer the reviewer to Response Q1 to Reviewer F4Th for supporting experiments and detailed discussion.

Q2: Low Fidelity (16 kHz); Codec Choice

In our experiments, we tested several codecs, including higher-fidelity options like Encodec and DAC. While these codecs perform well in speech and pure music generation, they show unsatisfactory performance in song generation. For a more detailed discussion of our codec selection, we kindly refer the reviewer to Response Q1 to Reviewer 3Stt. Although X-Codec operates at 16 kHz, we selected it as the most suitable option available at the time. Currently, high-fidelity, song-specific codecs for generative modeling remain an open problem in the community. We fully acknowledge the limitations imposed by low fidelity, and we are actively working on integrating audio super-resolution modules to improve audio quality in future work.

Q3: Clarification of “Accompaniment is easier to produce”

We apologize for the confusion caused by this phrasing, and we will revise the sentence in the manuscript to avoid ambiguity. Our intent was to highlight the learning bias observed during joint modeling of vocals and accompaniment. To illustrate this more intuitively, we include a visualization of the Content Enjoyment (CE) curves for both tracks, generated by the mixed-mode model over training steps (refer to Section C of our anonymous demo page). The figure shows that the accompaniment track improves much faster, reaching near-GT performance around 104k steps, whereas the vocal track improves more slowly and still exhibits a noticeable gap from GT performance even after 168k steps.

Q4: Evaluation Criteria and Fairness of Comparisons

At the time of submission, there were no open-source baselines specifically for text-to-song generation (except Jukebox), which limited direct comparisons. We appreciate the reviewer’s suggestion and have revised our evaluation to group baselines by input modality. We also incorporate automatic audio aesthetics metrics for a more comprehensive assessment.

For text-to-song, we compare against a two-stage pipeline and Parler-TTS*. Jukebox is excluded due to its impractical inference time (~3 hours for 20 s on a V100) (refer to Section B of our anonymous demo page).

To evaluate accompaniment quality, we separated it from the generated songs and compared it with instrumental-only models. As shown in the table below, SongGen's accompaniment achieves performance between these two baselines, despite being trained on a much smaller song dataset without pure instrumental music data.

Model             | KL↓  | CE↑  | PC↑
MusicGen          | 0.74 | 7.29 | 5.32
Stable Audio Open | 1.17 | 6.36 | 3.98
Mixed pro (ours)  | 0.88 | 6.40 | 5.21
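
As a hedged illustration of how such separation might be done (Demucs is mentioned elsewhere in the reviews for data preparation), one could run an off-the-shelf source separator on each generated song; the CLI flags, default output layout, and model directory name below are assumptions about recent Demucs releases, not the authors' pipeline.

```python
# Hedged sketch of extracting the accompaniment stem from a generated song
# with an off-the-shelf source separator; paths and flags are assumptions.
import subprocess
from pathlib import Path

def extract_accompaniment(song_path: str, out_dir: str = "separated") -> Path:
    # --two-stems=vocals keeps only the "vocals" and "no_vocals" stems.
    subprocess.run(
        ["demucs", "--two-stems=vocals", "-o", out_dir, song_path],
        check=True,
    )
    model_name = "htdemucs"  # assumed default model directory
    stem_dir = Path(out_dir) / model_name / Path(song_path).stem
    return stem_dir / "no_vocals.wav"  # the accompaniment stem
```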

Q5: Quantitative analysis of the proposed data processing pipeline

Thank you for the helpful suggestion. However, building a gold-standard lyrics dataset and conducting human evaluations are time-consuming. To provide timely feedback, we instead randomly sample 5,000 examples from the training set and evaluate the filtering strategy using recently proposed automatic metrics for audio aesthetics and text-audio alignment. The results show that edit-distance filtering improves Content Enjoyment (CE) and Production Quality (PQ), while CLAP-based filtering increases CLaMP3 scores, indicating stronger audio-text relevance.

Filter               | Sample nums | CE↑  | PQ↑
random sample        | 5000        | 6.77 | 7.15
edit distance <= 20% | 3038        | 6.97 | 7.31
edit distance <= 5%  | 1680        | 7.04 | 7.37

Filter               | Sample nums | CLaMP3↑
random sample        | 5000        | 0.135
CLAP >= 0.25         | 1648        | 0.143
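
For concreteness, below is an illustrative sketch of the two filtering rules evaluated above; the thresholds mirror the tables, while the sample schema and precomputed fields are hypothetical, and difflib's similarity ratio is used only as a stand-in for a true word-level edit distance.

```python
# Illustrative sketch of edit-distance and CLAP-based filtering (assumption-level).
import difflib

def lyric_error_rate(reference: str, transcribed: str) -> float:
    """Approximate normalized word-level mismatch between reference and ASR lyrics."""
    ref, hyp = reference.lower().split(), transcribed.lower().split()
    return 1.0 - difflib.SequenceMatcher(None, ref, hyp).ratio()

def keep_sample(sample: dict, edit_thresh: float = 0.20, clap_thresh: float = 0.25) -> bool:
    # sample = {"lyrics": ..., "asr_lyrics": ..., "clap": ...}  (placeholder schema)
    if lyric_error_rate(sample["lyrics"], sample["asr_lyrics"]) > edit_thresh:
        return False  # lyrics transcription too unreliable
    if sample["clap"] < clap_thresh:
        return False  # caption-audio relevance too low
    return True
```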

Q6: Missing Objective Metrics for GT / Suno

We noticed that prior works typically do not report objective metrics for GT, and we thus omitted them as well. We now include them for completeness: FAD 0, KL 0, CLAP 0.18, CLaMP3 0.1052, PER 21.39, SECS 76.42. As the CLAP score for GT is unexpectedly low, we also report CLaMP3 for a more robust text-audio alignment evaluation. For Suno, due to the lack of an official API, all user study samples were generated manually. With 326 samples, full objective evaluation was infeasible due to the high manual cost.

Q7: Why Does GT Sound Compressed?

In the demo page, we use X-Codec to reconstruct the GT audio for a fair comparison under consistent codec settings and to represent the model's upper bound.

Q8: Annotated Data

We only release annotations from MSD, including VAD results, aligned lyrics, and generated captions.

Reviewer Comment

Thanks for your response. A couple follow-ups:

Q1: My criticism here was not about quality. Instead, it is about that the single stage approach removes a control capability offered by two-stage models. Namely, the ability to generate an accompaniment given a pre-existing (non-generated) vocal input. Please clarify if I am still misunderstanding.

Q6: I'm confused - the paper says in "Evaluation dataset and metrics" that the test set is 326 samples for all methods. Are you using a bigger set to compute metrics like FAD? If so, can you clarify and add those details to the paper?

Author Comment

Thank you very much for your response.

Q1:

Thank you for raising this point — we now better understand your concern. Our current single-stage framework focuses on joint generation and does not support vocals-to-accompaniment generation by default. However, this control capability can be naturally integrated into our framework via fine-tuning. As a preliminary solution, we prepend the vocal track to the target audio sequence in the decoder during fine-tuning, enabling the mixed-mode model to generate a song conditioned on a pre-existing vocal input. Generation demos are provided in Section D of our anonymous demo page.
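
Below is an assumption-level sketch of the sequence construction described above: the pre-existing vocal tokens are prepended to the target song tokens in the decoder. The boundary token, shapes, and function name are illustrative, and in practice the loss would typically be masked over the vocal prefix.

```python
# Hedged sketch of building the fine-tuning decoder input (not the authors' code).
import torch

BOUNDARY = 0  # hypothetical special token id marking the end of the vocal prefix

def build_decoder_sequence(vocal_tokens: torch.Tensor,
                           target_tokens: torch.Tensor) -> torch.Tensor:
    """Concatenate [vocal prefix | boundary | target song] along the time axis."""
    K = vocal_tokens.shape[0]
    sep = torch.full((K, 1), BOUNDARY, dtype=vocal_tokens.dtype)
    return torch.cat([vocal_tokens, sep, target_tokens], dim=1)
```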

In addition to offering improved generation quality, our single-stage model is also flexible and extensible. With some adaptations, it can support track-conditioned generation (e.g., vocals, drums, bass). While our current fine-tuning strategy serves as a simple proof of concept, we plan to explore and implement more effective track-conditioning mechanisms in future work.

Q6:

We apologize for the confusion. We did not use a bigger set to compute metrics such as FAD. As there is currently no standardized public benchmark for song generation, we selected 326 song-relevant samples from the widely used MusicCaps test set to ensure transparency. All objective evaluations (including FAD) were conducted on this same 326-sample test set across all models.

Please feel free to let us know if you have any further questions or suggestions.

Review (Rating: 3)

SongGen is a single-stage autoregressive Transformer that takes lyrics, a description, and an optional reference voice as input, and generates either mixed or dual-track (vocal/accompaniment) audio. The high-level design of the conditioning methods follows recent common practice, using frozen encoders (MERT, T5, and VoiceBPE) through cross-attention, and the paper proposes several token patterns (parallel vs. interleaving for dual-track modeling) along with an auxiliary vocal training loss (Mixed Pro) in mixed mode for better vocal quality.

Questions for the Authors

Did the authors consider more controlled experiments comparing the two-stage vs. single-stage approach, since the paper posited this as one of the main motivations for the single-stage design?

Claims and Evidence

While the design proposed in SongGen is technically correct, with competitive quality against previous work on instrumental music generation, one of the core claims, "Traditional methods often rely on multi-stage processes, making pipelines inflexible and complex," would need further evidence from a comparison of a single-stage model (as in SongGen) against a two-stage approach. While I understand that there is no suitable public baseline for evaluation, judging from the setup the authors have used (for example, using Demucs to gather vocal/accompaniment pairs), the authors could design a controlled experiment by training a two-stage Transformer stack (text-to-vocal & vocal-to-accompaniment).

Methods and Evaluation Criteria

The considered methods are based on variations of existing work (such as the delayed codebook pattern in MusicGen) and are technically correct. The evaluation criteria include known objective and subjective metrics.

Theoretical Claims

This paper is mostly empirical, and I find no standout theoretical claims to evaluate.

Experimental Design and Analysis

The experimental setup employed several well-recognized objective metrics (FAD, KL, CLAP, etc.) and subjective metrics on 5 attributes. My concern is about the rigor of the statistical evaluation of the subjective metrics, where confidence intervals are lacking. I am not able to conclude whether the improvements are significant.

Supplementary Material

I reviewed the demo samples.

Relation to Prior Literature

Simplifying singing music generation pipeline into a single-stage, decoder-only autoregressive model brings practicality and ease of use to end users, potentially fostering acceleration in open community research in this area where commercial models have been dominating.

Missing Important References

None.

Other Strengths and Weaknesses

The simplicity of the single-stage Transformer design brings practicality to the end user, and achieving it required significant effort that warrants credit. As mentioned in the Claims and Evidence section, the claimed finding that the single-stage model is better than the cascaded one has room for stronger support.

Other Comments or Suggestions

No significant other comments.

Author Response

We sincerely appreciate your constructive comments, which are extremely helpful in improving our work. We are also grateful for your recognition of the technical soundness and practical value of our approach, as well as your acknowledgment of the substantial effort behind it. Below we provide detailed responses to your concerns.

Q1: Controlled Experiments - Single-Stage Outperforms Two-Stage

Thank you for your insightful suggestion to conduct a controlled comparison. We train a two-stage Transformer stack using the same architecture and training data as SongGen. Specifically, in Stage 1, given lyrics, description, and a 3-second reference voice, the first model generates the vocal track; in Stage 2, given lyrics, description, and the generated vocal (prepended into the decoder), the second model generates the accompaniment track. The final song is then mixed from the two tracks.
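
To make the pipeline concrete, here is a hedged pseudocode-style sketch of the two-stage inference described above; the model objects, generate() signatures, and mix_fn are placeholders, not a released API.

```python
# Pseudocode-style sketch of the two-stage baseline described above.
def two_stage_generate(lyrics, description, ref_voice,
                       stage1_model, stage2_model, mix_fn):
    # Stage 1: lyrics + description + 3 s reference voice -> vocal track
    vocal = stage1_model.generate(lyrics=lyrics,
                                  description=description,
                                  ref_voice=ref_voice)
    # Stage 2: lyrics + description + generated vocal (prepended in the decoder)
    #          -> accompaniment track
    accompaniment = stage2_model.generate(lyrics=lyrics,
                                          description=description,
                                          vocal_prompt=vocal)
    # Final song: mix of the two generated tracks
    return mix_fn(vocal, accompaniment)
```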

We conduct both automatic and human evaluations. For automatic evaluation, we introduce recently proposed and effective metrics to provide a more comprehensive assessment, including the audio-text alignment score CLaMP3 [1] and the content-based aesthetics metrics proposed by Meta [2], covering Content Enjoyment (CE), Content Usefulness (CU), Production Complexity (PC), and Production Quality (PQ). Additionally, we measured inference time on an A800 GPU for generating 30-second song samples. Generated audio samples are also provided in Section A of our demo page for the reviewers.

Table 1 Automatic Evaluation:

Model                     | FAD↓ | KL↓  | CLAP↑ | CLaMP3↑ | CE↑  | CU↑  | PC↑  | PQ↑  | Inference Time↓
two-stage                 | 2.18 | 0.78 | 0.29  | 0.085   | 6.39 | 6.27 | 5.90 | 6.69 | 42.85s
Mixed                     | 1.74 | 0.71 | 0.35  | 0.093   | 6.50 | 6.66 | 6.14 | 7.03 | 18.02s
Mixed pro (ours)          | 1.71 | 0.69 | 0.35  | 0.094   | 6.77 | 6.86 | 6.18 | 7.19 | 18.04s
Interleaving (A-V) (ours) | 1.87 | 0.69 | 0.35  | 0.093   | 6.67 | 6.72 | 6.11 | 7.12 | 34.5s

Table 2 Human Evaluation:

Model                     | OVL.      | REL.      | VQ.       | HAM.      | SS.
two-stage                 | 3.39±0.03 | 3.20±0.04 | 3.98±0.07 | 2.97±0.04 | 3.89±0.03
Mixed                     | 3.58±0.05 | 3.70±0.02 | 3.55±0.07 | 3.39±0.05 | 3.92±0.05
Mixed pro (ours)          | 3.96±0.04 | 3.86±0.04 | 4.07±0.06 | 4.01±0.05 | 4.04±0.05
Interleaving (A-V) (ours) | 3.95±0.03 | 3.87±0.06 | 4.15±0.05 | 3.82±0.03 | 3.93±0.04

Our results demonstrate that the single-stage model outperforms the two-stage pipeline in both efficiency and generation quality:

  • Efficiency: Compared to the single-stage approach, the two-stage pipeline requires more complex training and inference procedures. Experimental results indicate that the inference time of the two-stage model is more than twice that of the mixed-pro single-stage model.

  • Generation Quality:

    • Unlike joint modeling of P(vocal, accompaniment), the pipeline-based approach, which separately optimizes P(vocal) and P(accompaniment ∣ vocal), may fail to reach a global optimum due to error accumulation across stages. This limitation is especially problematic for song generation, where harmony between vocals and accompaniment is crucial. For instance, in genres like rap, vocal rhythm is tightly coupled with the instrumental beat. Generating vocals first without considering the underlying rhythm may result in rhythm misalignment. Conversely, in expressive genres such as ballads, where vocals typically guide the emotional flow, generating accompaniment first may constrain vocal expressiveness, resulting in rigid or disconnected performances. In both cases, pipeline approaches struggle to capture the intricate interplay between vocals and accompaniment. In contrast, joint modeling in a single-stage framework enables better coordination and global optimization, resulting in more coherent and musically aligned outputs.
    • Our results further support these observations — our single-stage model consistently outperforms the two-stage pipeline across both automatic and human evaluations, particularly on the aesthetics metrics (CE, CU, PC, PQ) and subjective scores such as Overall Quality (OVL.) and Harmony (HAM.).

Q2: Statistical Rigor of Subjective Evaluation

Thank you for pointing this out. We have updated the results to include 95% confidence intervals for each subjective metric. Furthermore, we incorporate newly proposed aesthetics metrics [2] to enhance the evaluation and better reflect the improvements brought by our approach. Due to space limitations, the updated results are provided in Section B of our anonymous demo page.
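
For reference, a minimal sketch of one common way such 95% intervals are computed for MOS-style ratings (normal approximation, mean ± 1.96 × standard error of the mean); the ratings array is a placeholder and this may not match the authors' exact procedure.

```python
# Sketch of a 95% confidence interval for MOS-style scores (normal approximation).
import numpy as np

def mos_with_ci(ratings: np.ndarray, z: float = 1.96) -> tuple:
    mean = ratings.mean()
    sem = ratings.std(ddof=1) / np.sqrt(len(ratings))  # standard error of the mean
    return mean, z * sem  # report as mean ± (z * sem), e.g. a value like 3.96 ± 0.04
```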

[1] CLaMP 3: Universal Music Information Retrieval Across Unaligned Modalities and Unseen Languages (Feb 2025)

[2] Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound (Feb 2025)

Reviewer Comment

Thank you for the rebuttal along with the additional experiments. The controlled experiment comparing against the cascaded version better illustrates the merit of this work. The qualitative case studies of the samples described in the rebuttal will also be helpful (now that there are more on the demo page). From the provided samples, the cascaded approach can be off-beat, which I think is also reflected by HAM.

Could you also specify the training details of the newly added two-stage baseline? Since it is the authors' own baseline, readers may question the fairness of the compute budget allocation and the rigor of the training recipe. Providing these details would convince readers of the soundness of the baseline and the merits of the proposed single-stage design.

Author Comment

Thank you for your comments. Regarding the newly added two-stage baseline, we provide more details to address your concerns. The resource allocation and training strategy for the two-stage pipeline are consistent with those used for SongGen mixed training Step 1. Since voice-free support is not directly related to the core comparison between the one-stage and two-stage designs, we omit this part of the two-stage model in the rebuttal phase to provide timely feedback. Specifically, the two models in the two-stage pipeline are trained separately for approximately 200K steps each, using 16 NVIDIA A100 (80GB) GPUs with a batch size of 16 per GPU. We observe that the loss begins to plateau around 60K steps for both models. For optimization, we employ the AdamW optimizer with β₁ = 0.9, β₂ = 0.99, and a weight decay of 0.0001. The learning rate is set to 0.0001, and we apply a cosine learning rate schedule throughout training. To ensure reproducibility, we will make the complete training configurations and scripts for the two-stage baseline publicly available.
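
For concreteness, here is a hedged sketch of the described optimization setup in PyTorch; the model, total step count, and absence of warmup are assumptions, while the hyperparameters mirror the reply above.

```python
# Hedged sketch of the optimizer and cosine learning-rate schedule described above.
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

def build_optimizer(model: torch.nn.Module, total_steps: int = 200_000):
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=1e-4,                 # learning rate 0.0001
        betas=(0.9, 0.99),       # β1 = 0.9, β2 = 0.99
        weight_decay=1e-4,       # weight decay 0.0001
    )
    scheduler = CosineAnnealingLR(optimizer, T_max=total_steps)  # cosine LR decay
    return optimizer, scheduler
```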

We hope this information helps address your concerns. Please feel free to let us know if you have any further questions or suggestions. We would be very grateful if you could kindly consider raising the score.

Final Decision

This paper presents a single-stage autoregressive Transformer model for song (i.e., vocal + instruments) generation. Reviewers F4Th and 3Stt both appreciate the simplicity of the approach, even though it is based on already existing components such as MusicGen and XCodec. Reviewers F4Th and z3S9 both raised a concern about statistical significance in the reported metrics, as the lack of such analysis weakens the authors' claim that the model outperforms existing models; however, the authors have shared such results in their rebuttal. An existing limitation of the current work, as pointed out by reviewer 3Stt, is the use of a 16 kHz sampling rate for music, which results in low-fidelity sound quality for music data. The authors have updated the paper to highlight this limitation accordingly.

While reviewer z3S9 raises a number of valid points regarding this work, I believe some of the criticism was not accurate. As raised by other reviewers and easily seen in the paper itself, this work has been compared to commercial models that have close to zero published details on how they work, making the publication of this work important. Criticizing the model for not being able to generate accompaniment also seems like an odd choice, since that is not something the models it was compared against do. I took that into account and reduced how much weight their score had in my decision, while still considering the valid concerns the reviewer had.