SongBloom: Coherent Song Generation via Interleaved Autoregressive Sketching and Diffusion Refinement
This paper presents SongBloom, the first unified autoregressive diffusion model for long-form song generation, achieving state-of-the-art performance compared to both commercial and non-commercial methods.
Abstract
Reviews and Discussion
SongBloom is a new framework that combines autoregressive sketching with diffusion-based refinement in an interleaved manner. SongBloom extends musical sketches and refines them from coarse to fine. It leverages the strengths of language models and diffusion models.
Strengths and Weaknesses
The paper is well-structured and easy to follow.
The experiments are comprehensive, with both objective and subjective evaluations across multiple baselines and metrics.
The audio samples are impressive.
Questions
Why is the maximum length limited compared to previous LM+Diffusion models?
Why does the full model generate 150 seconds while the small model generates only 60 seconds? How is this maximum length determined?
SongBloom currently relies on reference audio — does this limit its applicability in creative composition tasks? Can it be extended to support a fully text-based generation pipeline?
Does your method support cross-lingual inputs, such as an English reference audio paired with Chinese lyrics?
How can the semantics of sketch tokens be interpreted? What kind of information do they encode?
It’s not intuitive to me that the next token of an acoustic token is a sketch token.
Limitations
Yes
Justification for Final Rating
There are few weaknesses left unaddressed. As I have given a positive rating, I maintain my score.
Formatting Issues
The table captions are placed below the tables, which violates the formatting instructions.
We are grateful to the reviewers for their acknowledgment of the significance of our work and for the thoughtful attention they have given to our manuscript. Here are answers to the questions:
Q1 & Q2: The choice of maximum length is due to computing-resource constraints. 150 seconds is generally enough to cover both the verse and chorus sections of a song. For the ablation studies, we used a smaller model (60 seconds) to reduce training time and resource usage while still demonstrating the key performance trends.
Q3: We have trained a new version of SongBloom which receives text descriptions as conditions. We will release this model together with the one used in our paper once the paper is accepted.
Q4: Our evaluation set did not specifically contain cross-lingual pairs, but according to user feedback, the model is able to handle such cases.
Q5: According to previous papers [1] and our experiments, semantic tokens typically encode high-level musical features, such as instruments, vocal content, pitch, and so on. These tokens are more abstract compared to acoustic latents, which capture lower-level audio details. We refer to these as "sketch" tokens, as they provide a rough structure for the song. These tokens are derived from pretrained self-supervised models, and while we can extract broad semantic information, decoupling this into distinct, interpretable components is still an ongoing challenge. This will be a focus of our future work as we explore how to better interpret and control these tokens.
Q6: The design of predicting different types of tokens in one sequence is motivated by previous cross-modal LLMs (e.g., [2, 3]). In our model, the input acoustic tokens can be regarded as a special start-of-sequence token for each patch, one that carries additional acoustic information.
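To make the interleaving concrete, one such sequence could be laid out roughly as follows (the token names here are purely illustrative and are not the notation used in the paper):

```python
# Illustrative layout of one interleaved sequence (hypothetical notation, not the paper's).
# Each patch contributes P sketch tokens followed by one compressed acoustic embedding that
# summarizes the continuous VAE latents just generated for that patch; this embedding plays
# the role of a patch-level "start" token carrying acoustic context into the next patch.
sequence = [
    "<prompt>", "<lyrics>",                      # conditioning prefix
    "s1", "s2", "...", "s16", "<acoustic_1>",    # patch 1 (patch size 16)
    "s17", "s18", "...", "s32", "<acoustic_2>",  # patch 2, and so on
]
```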
Q7: We have noted the table caption formatting issue and will correct it in the revised version to ensure it adheres to the submission guidelines.
[1] A. Agostinelli, T. I. Denk, Z. Borsos, J. Engel, M. Verzetti, A. Caillon, Q. Huang, A. Jansen, A. Roberts, M. Tagliasacchi et al., “Musiclm: Generating music from text,” arXiv preprint arXiv:2301.11325, 2023.
[2] B. Wu, C. Yan, C. Hu et al., "Step-Audio 2 technical report," arXiv preprint arXiv:2507.16632, 2025.
[3] C. Zhou, L. Yu, A. Babu et al., "Transfusion: Predict the next token and diffuse images with one multi-modal model," arXiv preprint arXiv:2408.11039, 2024.
In this paper the authors propose SongBloom, a novel method for singing voice + accompaniment generation conditioned on lyrics data and example audio.
Following MusicLM [1], the task is solved by generating two sequences. The first sequence, called the sketch, contains discrete tokens obtained from an SSL encoder (e.g., MERT [2]) and carries semantic/structural information related to the lyrics content. A second sequence contains low-level acoustic details and is generated given the sketch. In such a scenario, (i) both sequences are defined in a discrete domain and processed autoregressively by transformer models, and (ii) the sketch is generated in full before the acoustic sequence is generated. The downsides of such an approach are that (i) acoustic sequences contain relevant detail that is better modelled by continuous generative models, and (ii) generating the sketch beforehand is computationally expensive and does not incorporate any acoustic information.
To solve these problems, the authors propose modelling the acoustic sequence directly in a continuous domain, employing autoregressive diffusion transformers (more specifically, modelling rectified flows) on a VAE latent space (e.g., stable-audio-vae [3]), while keeping a discrete autoregressive transformer on the sketch side. By operating over patches, they reformulate the generative procedure so that sketch tokens and acoustic latents are generated in an interleaved fashion: the sketch can be streamed rather than generated in full beforehand (improving the real-time factor), and the generation of each sketch patch depends on the previous acoustic context (compressed into a token via a simple acoustic encoder). The two models are trained simultaneously, with gradients from the diffusion transformer propagating back into the autoregressive model through a special hidden vector.
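As a reading aid, the interleaved procedure described above can be sketched as follows; the function and variable names are placeholders of mine (not the authors' code), and this is only a schematic under my reading of the paper.

```python
from typing import Callable, List, Sequence

def generate_song(
    prefix: Sequence,            # encoded lyrics + reference-audio conditioning
    sketch_lm_step: Callable,    # (context, patch_size) -> (sketch_patch, hidden_vec)
    diffusion_decode: Callable,  # (hidden_vec, sketch_patch) -> continuous latent patch
    acoustic_encoder: Callable,  # (latent_patch) -> one compressed acoustic embedding
    n_patches: int,
    patch_size: int = 16,
) -> List:
    """Schematic of SongBloom-style interleaved generation (hypothetical interface)."""
    context = list(prefix)
    latents = []
    for _ in range(n_patches):
        # 1) Extend the discrete sketch by one patch with the autoregressive LM.
        sketch_patch, hidden = sketch_lm_step(context, patch_size)
        # 2) Refine this patch in the continuous VAE-latent domain: a rectified-flow
        #    diffusion transformer conditioned on the LM hidden state.
        latent_patch = diffusion_decode(hidden, sketch_patch)
        latents.append(latent_patch)
        # 3) Compress the new latents into a single acoustic embedding and feed it back,
        #    so the next sketch patch is generated with acoustic context.
        context += list(sketch_patch) + [acoustic_encoder(latent_patch)]
    return latents  # decode with the VAE to obtain the waveform
```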
The authors perform both objective and subjective tests, comparing to state-of-the-art open models and closed-source systems like Suno and Udio, showcasing strong empirical results.
References
- A. Agostinelli, T. I. Denk, Z. Borsos, J. Engel, M. Verzetti, A. Caillon, Q. Huang, A. Jansen, A. Roberts, M. Tagliasacchi et al., “Musiclm: Generating music from text,” arXiv preprint arXiv:2301.11325, 2023.
- Y. Li, R. Yuan, G. Zhang, Y. Ma, X. Chen, H. Yin, C. Xiao, C. Lin, A. Ragni, E. Benetos et al., “Mert: Acoustic music understanding model with large-scale self-supervised training,” ICLR, 2024.
- Z. Evans, J. D. Parker, C. Carr, Z. Zukowski, J. Taylor, and J. Pons, “Stable audio open,” in ICASSP, 2025, pp. 1–5.
Strengths and Weaknesses
Strengths
Strong performance
The approach demonstrates very strong results when compared to state-of-the-art open-source models and even the best-performing closed-source platforms such as Suno and Udio. Approaching this level of quality with a model that discloses its techniques is highly relevant for the field, since the quality of the closed models had so far not been reproduced. Nevertheless, some aspects of the evaluation protocol are not entirely clear (see Weaknesses), which could affect the strength of this evaluation. The real-time factor is also very good, improving over previous models.
Clarity
The paper is very well written and does not contain typos. The diagram in Figure 1 is of great support in understanding the methodological part. Nevertheless, I have some questions / suggestions in the Questions part.
Weaknesses
Originality
The methodology is not very original, since a very similar two-stage approach (autoregressive for structure and diffusion for low-level detail) was proposed in [7]. The authors claim that a relevant difference from the speech setting is the patch size, which should be increased (Figure 2). Nevertheless, this kind of transfer happens a lot at the border between speech/environmental audio and music (e.g., MusicLM was "just" AudioLM for the music domain). Could the authors better highlight how their method differs from [7] and what their methodological contributions over that paper are?
Lack of reproducibility details
Some aspects of the evaluation are unclear. It is unclear what data has been used for evaluation and how the authors ensure this data has not been seen during training, especially by the baselines the authors compare against. There is also no explanation of what data (or at least how much) has been used as the ground-truth distribution for the FAD computation, or which embeddings are used for FAD (VGGish, CLAP?). The authors should also give more details on the autoencoder modifications and on the architecture of the acoustic encoder.
Finally, the authors say that they do not release code / checkpoints because they want to take time to curate a public dataset. However, the authors could release the code alone, without the checkpoints.
My final score is mainly dependent on this issue. If the authors would provide more context on evaluation / experimental details, I would increase my score.
References
- D. Jia, Z. Chen, J. Chen, C. Du, J. Wu, J. Cong, X. Zhuang, C. Li, Z. Wei, Y. Wang et al., “Ditar: Diffusion transformer autoregressive modeling for speech generation,” arXiv preprint arXiv:2502.03930, 2025.
Questions
- Line 73: Also, it was extremely slow to generate with Jukebox.
- Line 85: I also suggest that the authors cite https://arxiv.org/abs/2503.09573 here.
- Line 108: A citation from the flow matching literature here would be adequate (https://arxiv.org/abs/2210.02747).
- Line 311: I think the authors intended small patch size.
- Line 312: What is sketch token accuracy?
Limitations
yes
Justification for Final Rating
Given the great clarity of the work, the strong empirical performance, and the release of code / weights to the broader research community, I decided to increase my score to a full accept.
Formatting Issues
no
We greatly appreciate the reviewers' time and effort in reviewing our manuscript. Their constructive comments were instrumental in refining our work.
About originality:
While [7] proposes generating continuous VAE latents of each patch, it lacks an explicit sketch-prediction stage. After reproducing [7] in the speech domain, we encountered difficulties with convergence, especially when using a larger patch size. For the song generation task, we found that [7] failed to produce intelligible results, even when the patch size was reduced to 2 or 4, as shown in Table 4 (the ‘without sketch’ configuration is quite similar to [7], but it was time-consuming with small patch sizes).
In contrast, our method iteratively generates discrete sketch tokens and continuous VAE latents, and we train the model using both cross-entropy and flow-matching loss. This iterative process allows sketch tokens to guide VAE latent generation, while VAE latents enhance sketch generation by providing acoustic context. This approach not only avoids the convergence issues of [7], but also improves the intelligibility and efficiency of song generation.
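Schematically, the joint objective can be written as below; the symbols and the weighting $\lambda$ are shorthand for this response rather than the paper's exact notation.

```latex
% Schematic joint objective (shorthand for this response, not the paper's exact notation):
% cross-entropy on discrete sketch tokens plus a weighted rectified-flow (flow-matching)
% loss on the continuous VAE latents of each patch.
\[
\mathcal{L}
  = -\sum_{k}\sum_{i=1}^{P} \log p_\theta\!\bigl(s_{k,i} \mid \text{context}_{<k,i}\bigr)
  \;+\; \lambda\, \mathbb{E}_{t,\epsilon}
    \bigl\lVert v_\phi\bigl(z_k^{t}, t \mid h_k, s_k\bigr) - \bigl(z_k - \epsilon\bigr) \bigr\rVert^2,
\qquad z_k^{t} = (1-t)\,\epsilon + t\, z_k .
\]
```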
About the evaluation details:
To construct the evaluation set, we randomly excerpt audio prompts from Suno-generated songs across different genres (ensuring they do not overlap with our finetuning dataset). We then generate lyrics using GPT (not Suno; we have corrected this typo in the draft). By randomly combining these two elements, we ensure that each sample in the final evaluation set is completely unseen during training.
For computing the FAD score, we compare the MERT embeddings of the complete songs with those of our generated samples. Even though the two sets have different lyrics, they are expected to share similar style and genre characteristics, so we assess the distance between their embeddings to evaluate the stylistic and genre consistency.
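For reference, a minimal sketch of how such an FAD score can be computed from two sets of pooled MERT embeddings (the standard Fréchet distance between Gaussian fits; this is an illustration, not our exact evaluation script):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(emb_real: np.ndarray, emb_gen: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two embedding sets.

    emb_real, emb_gen: arrays of shape (num_clips, embedding_dim), e.g. pooled
    MERT embeddings of the reference songs and of the generated samples.
    """
    mu_r, mu_g = emb_real.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_real, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```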
About the reproducibility:
We have already released the inference code and model weights before the rebuttal phase began. While further updates and additions are still in progress, the code currently allows users to generate songs with customized lyrics and prompts based on the model described in the paper. We will include a link to the complete repository in the camera-ready version after the anonymous review stage concludes.
About the autoencoder:
Actually, we only modified the down-sampling rate of the original stable-audio-vae to ensure a 25 Hz frame rate for the VAE latents. The weights of the re-trained VAE model are also available in our repository. We acknowledge that the initial description in the paper was insufficient, and we will provide a more rigorous explanation of these modifications in the camera-ready version.
About other questions:
A1: We will add these citations properly. Thanks for your suggestion.
A2: Yes, this is a typo: 'batch size' should be 'patch size'. Thanks for pointing it out.
A3: The term “sketch token accuracy” refers to the average accuracy of predicting the sketch tokens during training, using a teacher-forcing strategy. Since we only modify the patch size, which affects the frequency of acoustic embeddings inserted in the preceding sequence, the sketch token accuracy serves as an indicator of the benefit of acoustic context in guiding the sketch generation.
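As an illustration (not our exact code), the metric is simply the fraction of teacher-forced positions where the argmax of the LM logits matches the ground-truth sketch token:

```python
import torch

def sketch_token_accuracy(logits: torch.Tensor, targets: torch.Tensor,
                          pad_id: int = -100) -> float:
    """Teacher-forcing accuracy over sketch-token positions.

    logits:  (batch, seq_len, vocab_size) LM outputs at sketch positions
    targets: (batch, seq_len) ground-truth sketch token ids (pad_id positions are ignored)
    """
    preds = logits.argmax(dim=-1)
    mask = targets != pad_id
    correct = (preds == targets) & mask
    return (correct.sum() / mask.sum().clamp(min=1)).item()
```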
I thank the authors for their detailed answer. Since they released both code and weights, I am happy to increase my score to a full accept, given the great impact such a well-performing model can have on the research community.
This paper presents SongBloom, a novel form of music generation model that jointly trains an LM and an RFM, which the authors refer to as an autoregressive diffusion model. As mainly highlighted in the paper, the joint model is defined over interleaved patches of semantic tokens and acoustic tokens, for which the authors report significant improvements over modeling the two sets of tokens separately, as in prior works. The empirical results suggest the superiority of this modeling approach over other non-commercial music generation models.
Strengths and Weaknesses
Strengths
- This is the first successful autoregressive diffusion model (joint training of LM and diffusion) in song generation supported with empirical evidence.
- The idea of interleaving semantic and acoustic tokens in the form of patch-by-patch generation is novel. From the results of the ablation study, using H+C substantially improves over using H only in the LM.
Weaknesses
- The subjective comparison against the non-commercial and commercial baselines was not rigorous enough, as only 20 samples were used for human evaluation. It is not particularly challenging to train a model that performs best in a certain genre and to pick a subset of prompts on which the model surpasses all competing models. I am not convinced by the volume and variability of the test samples, while the authors jump directly to the conclusion of outperformance.
- The ground truth used for the FAD scores is ill-defined; it could be unfair if the authors used the wav prompt as ground truth for SongBloom and compared its FAD score with that of models conditioned on text prompts.
- The hidden vector, a critical component that connects the LM and CFM in this modeling paradigm, is not clearly described in the main paper. I struggle to find an explanation of how it is obtained from the LM. Presumably, the authors may have used the last hidden-layer outputs of LLaMA, yet the use of token embeddings or logit embeddings could also make sense.
Questions
- The examined patch sizes are 4 to 24 but the frame rate is 25 Hz, meaning that each patch is less than a second. Why does the acoustic content in less than a second help with generating the semantic tokens? How does the model perform if we use a larger patch size?
- Could we examine the attention weights over the past acoustic tokens compared to those over the semantic features in the LM? It could help to better explain how the acoustic tokens affect the LM generation.
Limitations
The baselines and references seem to be limited. To my knowledge, SkyMusic/Mureka was the first commercial music generator to support audio prompts, and it has been distinguished from Suno by this feature for some time. Since the proposed SongBloom only supports audio prompts, the authors are encouraged to compare against its generated samples as well.
Formatting Issues
The source of the training dataset is vaguely described. The authors should be aware of the copyright status of the song data used for training, as this appears to be a sensitive area regarding the illegal use of copyrighted songs.
We express our sincere gratitude to the reviewers for their detailed and thoughtful feedback. We have carefully addressed all their concerns to enhance the clarity and depth of the paper.
W1: The 20 samples are randomly chosen from diverse genres, and for each sample, we further generate two items for evaluation, so the final score is the average of the 40 items. We’ll provide further details in the revised version. Additionally, the model weights have already been released, although, due to anonymous review policies, we cannot update the current submission. The released repositories allow users to provide their own prompts, which should help alleviate concerns that the model is limited to certain genres or prompts.
W2: To construct the evaluation set, we randomly excerpt audio prompts from complete songs across different genres with corresponding text descriptions (ensuring they do not overlap with our finetuning dataset). We then generate lyrics using GPT (not Suno; we have corrected this typo in the draft). Either the audio prompt or the text description is used as the condition, depending on the model. The FAD is computed between the generated samples and the complete songs, not the audio prompts. We will add an unambiguous description in the paper. This setup ensures fairness as much as possible and avoids the potential bias introduced by using the audio prompts as ground truth.
W3: The hidden vector is the last hidden-layer output of the LM at the end of each patch. Using token embeddings directly would prevent the gradient from being properly back-propagated from the CFM to the LM.
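A minimal sketch of what this means in code, with hypothetical names and assuming standard Transformer outputs: the conditioning vector is gathered from the LM's last hidden layer at each patch boundary, and no detach is applied, so the flow-matching loss back-propagates into the LM.

```python
import torch

def patch_condition_vectors(lm_hidden: torch.Tensor,
                            patch_end_idx: torch.Tensor) -> torch.Tensor:
    """Gather the LM's last-layer hidden state at the final position of each patch.

    lm_hidden:     (batch, seq_len, d_model) last-layer outputs of the LM
    patch_end_idx: (batch, n_patches) index of the last token of each patch
    Returns:       (batch, n_patches, d_model) conditioning vectors for the CFM head.
    No .detach() is applied, so gradients from the flow-matching loss reach the LM.
    """
    idx = patch_end_idx.unsqueeze(-1).expand(-1, -1, lm_hidden.size(-1))
    return torch.gather(lm_hidden, dim=1, index=idx)
```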
Q1: Actually, it is not that "the acoustic content in less than a second helps with generating the semantic tokens", but rather that "compressing acoustic content into small patches (less than a second) helps with generating the semantic tokens". All preceding acoustic content up to the current patch boundary is accessible; the patch size only affects the compression rate. Previous pure AR-diffusion papers [1,2] in the speech domain demonstrate that the model tends to converge only with a very small patch size. However, in SongBloom, we found that this conclusion no longer fully applies. It is more of a trade-off: when the patch size is small, the inference speed drops and the acoustic latents generated in each iteration become too short. When the patch size is large, the most recent acoustic content (the latents within the current patch that precede the sketch token being generated) is lost. Experiments demonstrate that a patch size of 16 strikes a good balance (as shown in Figure 2, when the patch size increases further, the overall performance degrades).
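To make the trade-off concrete, at the 25 Hz latent frame rate each patch covers:

```latex
\[
T_{\text{patch}} = \frac{P}{25\,\mathrm{Hz}}:\qquad
P = 4 \Rightarrow 0.16\,\mathrm{s},\quad
P = 16 \Rightarrow 0.64\,\mathrm{s},\quad
P = 24 \Rightarrow 0.96\,\mathrm{s}.
\]
```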
Q2: This is an excellent suggestion! We plan to visualize the attention weights over prefixed tokens (including audio prompts, lyrics, and preceding semantic and acoustic tokens). This should help to better understand the interactions between different token types and provide more transparency regarding how the acoustic tokens influence the language model generation process.
L1: We will include the subjective and objective evaluation results of Mureka-O1 in this paper. This will provide a more comprehensive comparison and address the reviewer’s concern.
| Model | PER | MCC | FAD | SER | Aesthetics Score |
|---|---|---|---|---|---|
| Mureka | 7.79 | 0.86 | 3.39 | 31.37 | 7.69/7.84/6.41/8.45 |
| Ours | 5.49 | 0.86 | 3.20 | 14.50 | 7.79/7.96/5.88/8.47 |
| Model | Mus-V | Mus-A | QLT-V | QLT-A | CRR | CST |
|---|---|---|---|---|---|---|
| Mureka | 3.91 | 3.93 | 3.85 | 3.89 | 3.38 | 3.41 |
| Ours | 3.91 | 3.92 | 3.95 | 3.93 | 3.42 | 3.45 |
[1] D. Jia, Z. Chen, J. Chen, C. Du, J. Wu, J. Cong, X. Zhuang, C. Li, Z. Wei, Y. Wang et al., “Ditar: Diffusion transformer autoregressive modeling for speech generation,” arXiv preprint arXiv:2502.03930, 2025.
[2] Z. Liu, S. Wang, S. Inoue, Q. Bai, and H. Li, “Autoregressive diffusion transformer for text-to-speech synthesis,” arXiv preprint arXiv:2406.05551, 2024.
Thanks for addressing my questions. Most of my concerns have been addressed, except for the sample size and the FAD calculation -- I recognize that the FAD is computed between the generated samples and the complete songs, not the audio prompts, but my concern was the fairness of comparing to the models conditioned on text prompts. As mentioned, the audio prompts were obtained by randomly excerpting audio from the complete songs. This means that the models conditioned on audio prompts are exposed to some common acoustic characteristics of the complete songs (e.g., bpm, texture, key, etc.). All in all, I decide to keep my score, leaning towards acceptance.
Certainly, the FAD scores of models conditioned on different prompt types cannot be directly compared. However, among models under the same prompt setting, the FAD score can serve as a reliable and fair metric that reflects both the overall quality of the synthetic samples and their similarity to real data. It is precisely based on this consideration that we included such evaluations in our paper. Thanks for your response.
The paper introduces a framework, named SongBloom, for full-length song generation. The framework aims to combine the strengths of both autoregressive and non-autoregressive methods. More specifically, it applies an interleaved paradigm of autoregressive musical sketching and diffusion-based coarse-to-fine refinement. This combines the coarse and fine stages into a single jointly optimized model. Experiments show that SongBloom achieves SOTA performance compared to commercial systems like Suno across subjective and objective metrics, while also demonstrating computational efficiency.
Strengths and Weaknesses
Strengths
- The paper has a clear problem and solution formulation. In general, the paper is easy to follow and understand.
- The proposed method achieves results comparable to SOTA commercial music generation platforms, and the music available on the demo website sounds great.
- According to Table 2, SongBloom exhibits much lower Phoneme Error Rate (PER) in comparison to other baselines.
- The integrated design of SongBloom enables good inference efficiency (lower Real Time Factor) in comparison to other autoregressive baselines.
Weaknesses
- The model is built upon and utilizes various existing, well-established components and techniques. Not many novel concepts or interesting ideas are introduced.
- While authors are collecting desensitized data for an open-source version, the current lack of publicly available code and training data (beyond the demo page) limits researchers' ability to fully reproduce and build upon the reported results directly.
- The top-performing SongBloom-full-ft model is explicitly fine-tuned on synthesized data generated by Suno. While demonstrating strong results, this suggests that achieving the absolute best performance currently relies on leveraging outputs from another state-of-the-art, proprietary system.
- The current reliance on Self-Supervised Learning (SSL) models for sketch representation lacks interpretability. This limits fine-grained control and user customization. While the authors state this will be future work, it represents a current limitation in the model's direct utility for artists seeking precise creative input.
Questions
- Could the authors clarify the motivation for having an additional fine-tuning stage on synthesized data generated by Suno? And how exactly did you generate such synthetic data?
- For future work, what are some potential solutions for making the SSL-derived sketch tokens more interpretable or modifiable for users?
Limitations
yes.
Justification for Final Rating
Overall, I think this is a solid paper that has value for the music generation community, given that both code and model weights will be available. I will be happy to see it appear in NeurIPS.
Formatting Issues
N/A
We would like to sincerely thank the reviewers for their thorough review and valuable feedback.
W1: While SongBloom integrates existing components like the SSL sketch-token extractor and the VAE, the core model and its design, particularly the interleaved generation of discrete sketch tokens and continuous VAE latents, is our own contribution. Previous autoregressive-diffusion (AR-Diff) approaches in speech generation have primarily relied on either discrete or continuous domains, but did not successfully combine both in a way that is computationally efficient and converges reliably. Additionally, this is the first attempt to migrate AR-Diff models to song generation, where sequences are much longer and more complex. We have tested simply training an AR-Diff model (e.g., DiTAR [1]) with song data; the model struggled with convergence issues, as shown in the last line of Table 4, while SongBloom maintains stability and efficiency, which demonstrates the necessity and novelty of our method. Additionally, the idea of introducing acoustic context into the semantic generation process is a simple yet crucial advancement that no previous method in song generation has tackled.
W2: We’re excited to announce that the inference code and model weights have already been released publicly after submission. Currently, users can generate songs with customized lyrics and prompts using the model as described in the paper. We continue to improve and update the repository. Please note that we are unable to share the link here due to anonymous review policies, but it will be available in the camera-ready version after the review phase.
W3: The amount of synthesized data used for fine-tuning is quite small—less than 2% of the total training data (1000 extra fine-tuning steps). The majority of our primary training data was noisy. To address this, we synthesized a small amount of clean data with more distinct structures, clearer instrumental tracks, and more precise lyrics. This fine-tuning helps the model learn to differentiate the verse and chorus and to synthesize 'cleaner' instruments, thereby improving the performance on AI-generated lyrics.
W4: Yes, as you mentioned, how to achieve interpretable and precise control is an essential yet unsolved topic for song generation. We are actively exploring solutions for this.
Q1: As discussed in W3, the primary training data is noisy and lacks consistent structure, which complicates the learning process. Additionally, the complexity of the musical accompaniment sometimes does not align with standard compositional rules, leading to erratic performance in song generation. To overcome these challenges, we fine-tuned the model for a few steps on synthesized data that had clearer instrumental tracks and more structured compositions. This fine-tuning helped the model learn more effectively, alleviating the negative influence of low-quality training data.
To generate the synthetic data, we first used Suno to generate a set of songs and lyrics, and then filtered these data based on metrics such as Phoneme Error Rate (PER) to ensure that only high-quality synthetic examples were included in the fine-tuning process.
Q2: We acknowledge that achieving fine-grained control over SSL-derived sketch tokens is a challenging problem. In our view, the main difficulty lies in decoupling the SSL embeddings and understanding which components directly influence the generation process, rather than in the generation itself. At this stage, we do not consider SSL tokens the final solution. Instead, we believe that designing a new, interpretable representation, similar to MIDI but both human-readable and machine-accessible, could be a better long-term approach. This would allow for greater control and customization while maintaining the flexibility and efficiency of machine processing.
[1] D. Jia, Z. Chen, J. Chen, C. Du, J. Wu, J. Cong, X. Zhuang, C. Li, Z. Wei, Y. Wang et al., “Ditar: Diffusion transformer autoregressive modeling for speech generation,” arXiv preprint arXiv:2502.03930, 2025.
Thanks for the response. Most of my questions have been addressed. Regarding the fine-tuning data, it would be better to define what exactly 'more distinct structures, clearer instrumental tracks, and more precise lyrics' means. My question is more about what kind of synthesized music could be considered good quality for fine-tuning purposes.
Thank you for your feedback. In our experiments, we selected high-quality samples based on the following two criteria:
- low Phoneme Error Rate (PER);
- high alignment between the structure detected by All-In-One [1] and the input structure.
In addition, the instrumental composition and chord progressions of synthetic music are generally simpler and easier to learn than those of real music, which is why we describe it as having clear instruments.
[1] Kim, Taejun, and Juhan Nam. "All-in-one metrical and functional structure analysis with neighborhood attentions on demixed audio." 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2023.
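As a simplified illustration of this filtering rule (the thresholds and function below are hypothetical placeholders, not our production pipeline):

```python
def keep_synthetic_sample(per: float, structure_match: float,
                          max_per: float = 0.05, min_match: float = 0.8) -> bool:
    """Decide whether a Suno-generated sample is clean enough for fine-tuning.

    per:             phoneme error rate of the transcribed vocals vs. the input lyrics
    structure_match: agreement between the structure detected by All-In-One and
                     the structure specified in the input lyrics
    The thresholds here are illustrative placeholders, not the values we used.
    """
    return per <= max_per and structure_match >= min_match
```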
This paper introduces SongBloom, an autoregressive diffusion-based model for full-song generation that trains jointly on acoustic and semantic tokens with an interleaving strategy.
Reviewers remarked on the strong empirical performance, including improvements even over commercial systems such as Suno. The results include lower phoneme error rates, better computational efficiency, and overall strong evidence of practical novelty and potential impact at the application level. The methodological contribution (i.e. interleaving sketch tokens with VAE latents) was viewed as a clever design choice. Reviewers also praised the breadth of the evaluation and the clarity of the paper's writing and presentation.
The main concerns regard the limited fundamental novelty: the approach is largely built from existing components, and the conceptual advance over work like DiTAR is perhaps incremental. There were also comments on the limits of the evaluation, e.g., the reliance on fine-tuning with Suno-generated data for the strongest results. In my view, most substantial concerns were addressed during the discussion. I also do not see a major issue with the human evaluation scale; the 10 x 20 = 200 observations are a small sample, but not necessarily a major issue for relatively stable domains like this one versus areas with small, noisy effects. A flagged ethical concern was raised, but ultimately it is quite minor. To me, the paper does not introduce substantial new ethical issues beyond those common in this domain. Still, the authors engaged with it appropriately in their response.
The reviews converge on the conclusion that this work represents a solid and technically sound contribution that adapts existing ideas effectively to a difficult task and achieves impressive practical results. I therefore recommend acceptance.