SongCreator: Lyrics-based Universal Song Generation
This paper presents SongCreator, a song generation system that achieves competitive performance on eight tasks, particularly in lyrics-to-song and lyrics-to-vocals.
Abstract
Reviews and Discussion
This paper introduces SongCreator, a novel song-generation system designed to create complete songs with both vocals and accompaniment from given lyrics, addressing a significant gap in music generation. The system incorporates a dual-sequence language model (DSLM) and an innovative attention mask strategy, enabling it to understand, generate, and edit songs for various music-related tasks. Extensive testing shows that SongCreator performs exceptionally well, outperforming previous models in tasks such as lyrics-to-song and lyrics-to-vocals generation, and offers the unique ability to control the acoustic characteristics of vocals and accompaniment independently.
Strengths
- The paper tackles an important topic within the field of song generation, a complex and evolving area of AI research.
- Detailed and comprehensive experiments are conducted to validate the effectiveness of the proposed models and methods.
- The work is thoroughly developed, presenting a holistic approach from theory to practical application.
Weaknesses
- The introduction lacks clear logic and motivation, making it difficult to discern the unique contributions of this work compared to existing models like Suno and Udio. The discussion about the applicability of AI-generated content across various media types feels outdated, especially given the recent advancements in music generation technologies.
- The presentation of related works in Table 1 is confusing; Jukebox is omitted from the table yet discussed extensively in the text. The paper primarily frames its contributions as extensions of Jukebox, which may understate the originality of the proposed methods.
- The exclusion of prominent music generation products like Suno and Udio from a detailed discussion is a notable oversight. This could either be addressed comprehensively within the limitations section or by providing a clearer comparison within the main text to delineate how this work differentiates from those products.
Questions
- How does the proposed model handle the arrangement task without access to traditional music sheets?
- While SongComposer operates on MIDI files, how does the proposed model manage composition tasks directly from audio files?
- Given that harmony is a subjective quality and a fundamental goal of music generation, how do the authors define and assess harmony in their evaluations, and why do references [15-25] lack this aspect?
- Does the term "universal song generation" carry specific implications within this context, and is its universality considered a significant contribution of this work?
- Is this the first model capable of generating both vocals and accompaniment from lyrics alone? What unique capabilities does your model have that distinguish it from others?
- The related work section touches on singing voice synthesis and speech editing but lacks a detailed discussion on lyric-based music generation. What is the rationale behind this selection?
- Figure 1 lacks a clear caption explaining its elements, which could lead to confusion. Clarifications on what the audio icons and the term "song" represent would be beneficial.
- On line 215, how was the decision made to allocate 80% and 20% in your methodology?
- The structure of input and output for both training and generation phases appears complex. Could you clarify whether the model supports multiple combinations of lyrics, vocals, and accompaniments during these phases?
- How is the song editing task implemented? How does the system manage edits that only modify part of the lyrics but require corresponding changes in vocals and accompaniments? It would also be beneficial to understand the robustness of the editing performance across diverse and extensive datasets.
Limitations
The most critical limitation noted is the suboptimal audio quality, which appears fragmented and affects the overall user experience with the generated music. This issue could significantly impact the practical deployment and acceptance of the proposed model.
We appreciate the reviewer's careful reading of our paper. We hope the following addresses all concerns mentioned.
Regarding the differences from Suno and Udio
We appreciate the reviewer's constructive comments and will revise the introduction to better highlight our unique contributions. Since these products do not publicly disclose their methods, detailed comparisons are challenging. However, our main innovations and contributions are the design of the DSLM and the corresponding attention masking strategies. For tasks like song generation, which require generating two temporally aligned sequences such as vocals and accompaniment, DSLM offers significant advantages over independent single-stream or dual-stream modeling and enables independent control over the generated vocals and accompaniment. We believe it is a novel solution to these tasks.
Furthermore, our model achieves diverse song generation capabilities in a unified framework through the attention mask strategy, such as accompaniment-to-song, song editing, and vocal editing within a song, which Suno and Udio cannot currently achieve. Of course, due to limitations in resources and data, there are still gaps in audio quality and control through textual descriptions compared to these products. These limitations are noted in the paper, and we will provide a clearer comparison in the final version.
Regarding the introduction of Jukebox
We are thankful for the reviewer's constructive comment. Jukebox is the first and, to our knowledge, only published work attempting song generation and should be cited in Table 1. It models vocals and accompaniment as a single entity, leading to several limitations. Our work is not an extension of Jukebox; we propose a completely different framework with the DSLM and attention mask strategy specifically designed for song generation. Our approach enhances the musicality and quality of generated songs while providing flexible control, generation, and editing capabilities. As mentioned in the first response, our model achieves diverse song generation tasks in a unified framework. Most of these tasks have not been accomplished by previous models, including Jukebox. Multi-task learning further improves the model's performance across these tasks.
Regarding the arrangement and composition tasks
As the reviewer mentioned, as an end-to-end generative model, our proposed model does not handle arrangement and composition tasks by explicitly predicting traditional music sheets or MIDI files. Instead, we train the model to directly generate natural and musical accompaniment, vocals and song based on given conditions, such as lyrics. The outputs of the model are audio, not music sheets or MIDI files. This approach enables our model to learn the knowledge for arrangement and composition and to generate songs without traditional music sheets or MIDI files.
Regarding the harmony
We define harmony as whether the vocals and accompaniment sound harmonious and pleasant together. This is crucial when generating natural-sounding songs that include both vocals and accompaniment. The works referenced in [15-25] focus on generating either vocals or accompaniment music alone, so the concept of harmony is not applicable. This perspective follows SingSong, whose goal is to generate instrumental music that can be naively mixed with the input vocals; we refer to this property as harmony.
Regarding the universal song generation
The term "universal song generation" refers to our model's ability to perform various song generation tasks beyond lyrics-to-song, including editing, continuation, and generation from a pre-determined track. This universality and flexibility are significant contributions and unique capabilities of our work. Additionally, it enables multi-task training, which further enhances the model's generation capabilities.
Regarding the unique capabilities
As mentioned in the second response, Jukebox was the first model capable of generating both vocals and accompaniment from lyrics alone. However, our model improves the musicality of generated songs and offers more diverse song generation features, as detailed in the first and second responses.
Regarding the selection of related work
Jukebox is the only published work on lyric-based music generation. We have provided a detailed introduction to Jukebox in the introduction section, so we chose not to repeat it in the related work.
Regarding the Figure 1 Caption
Thank you for the constructive comment. We will make the necessary revisions in the final version of the paper.
Regarding the audio quality and using the None strategy 20% of the time
We take these seriously and have provided a detailed explanation in the global rebuttal section at the top.
Regarding the structure of input and output
As the reviewer mentioned, our model supports multiple tasks, each with various combinations of lyrics, vocals, and accompaniments as input. This demonstrates the model's capabilities and flexibility. Specific input and output structures and corresponding tasks are detailed in Table 2. During training and generation, we set up various input and output combinations according to the tasks.
Regarding the implementation of song editing
We thank the reviewer for the question about our editing task. In this task, users provide the edited lyrics along with the start and end points of the segment to be edited. The segment following the end point is used as a prompt, while the segment preceding the start point is treated as the already generated part, with a special <EDIT> token separating the two. Since the LM is trained in an autoregressive manner, the system continues generating the edited segment based on the already generated part and then seamlessly transitions into the prompt segment. To test editing performance, we manually constructed a dataset of 30 examples, encompassing songs of different styles and performed by different singers.
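For readers less familiar with this fill-in style of autoregressive editing, a minimal sketch of how the input sequence could be assembled is shown below. The token ID, the helper function, and the commented-out generate call are illustrative assumptions, not the authors' actual interface.

```python
# Minimal sketch of assembling the input for the editing task described above.
# The EDIT_TOKEN value and `dslm.generate` are illustrative assumptions,
# not the authors' actual API.

EDIT_TOKEN = 8192  # hypothetical ID of the special <EDIT> token

def build_edit_input(song_tokens, edit_start, edit_end):
    """Arrange semantic tokens for editing the span [edit_start, edit_end).

    The segment after the edit region is placed first as a prompt, an <EDIT>
    token separates it from the segment before the edit region, and the model
    then autoregressively fills in the edited middle.
    """
    suffix_prompt = song_tokens[edit_end:]      # segment following the end point
    prefix_context = song_tokens[:edit_start]   # segment preceding the start point
    return suffix_prompt + [EDIT_TOKEN] + prefix_context

# Usage sketch: the model continues from the assembled context, so the newly
# generated tokens form the edited segment and transition into the suffix.
# edited_middle = dslm.generate(build_edit_input(tokens, s, e), lyrics=new_lyrics)
```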
Several critical issues remain after reviewing the response:
Unconvincing Performance Demonstration:
As an audio-based music generation work, this paper does not compare with models such as Suno and Udio. It is not convincing that the lack of comparison is because Suno and Udio do not publicly disclose their methods (mentioned in the response). In most cases, a user input is sufficient for Suno and Udio to generate results for comparison. I have used this method to generate music and compared it with the music generated by this paper for tasks such as lyrics-to-song, lyrics-to-vocals, accompaniment-to-song, vocals-to-song, music continuation, song continuation, vocals continuation, and accompaniment-to-song (no lyrics). The proposed model fails to generate high-quality music in terms of vocal pronunciation, fluency, and background noise. Additionally, the proposed model only generates music at the phrase level, while Suno and Udio can generate music with a complete structure of multiple sections.
Lack of novelty:
The key contribution claimed by the paper is the encoder-decoder architecture on two audios. However, the use of encoder-decoder architecture in music generation is not novel. For example, various works from different groups have reported relevant work from 2020 to 2024, as shown below:
[1] Dhariwal, Prafulla, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya Sutskever. "Jukebox: A generative model for music." arXiv preprint arXiv:2005.00341 (2020).
[2] Donahue, Chris, Antoine Caillon, Adam Roberts, Ethan Manilow, Philippe Esling, Andrea Agostinelli, Mauro Verzetti et al. "Singsong: Generating musical accompaniments from singing." arXiv preprint arXiv:2301.12662 (2023).
[3] Zhiqing, Hong, Huang Rongjie, Cheng Xize, Wang Yongqi, Li Ruiqi, You Fuming, Zhao Zhou, and Zhang Zhimeng. "Text-to-Song: Towards Controllable Music Generation Incorporating Vocals and Accompaniment." arXiv preprint arXiv:2404.09313 (2024).
Thanks for the reviewer's insightful comments. We would like to reply point by point here:
Regarding Novelty
Firstly, we would like to note that our primary focus is on improving the modelling of discrete music representations within the encoder-decoder architecture to enhance the musicality of generated songs. As mentioned in lines 59-73 of the paper, unlike the single-stream language model (LM) used in Jukebox, SingSong and Text-to-Song, our key innovation lies in introducing a novel LM, the DSLM, along with corresponding attention masking strategies. This significantly enhances the musicality of generated songs by improving dual-sequence modelling performance and completes song generation tasks of various forms within a single model.
- Improving Performance for Dual-Sequence Modelling: For tasks involving aligned sequences such as vocals and accompaniment, DSLM demonstrates superior performance. In our experiments, we have compared our model with various LM-based models:
- MusicLM (Similar to SingSong) uses LMs for semantic tokens followed by acoustic tokens.
- MusicGen (Similar to Text-To-Song) uses transformer-based models to directly predict acoustic tokens.
- GPT (Similar to Jukebox) uses autoregressive transformers to model discrete tokens.
Our experimental results indicate that DSLM outperforms these LM-based single-stream modelling approaches in most tasks. Additionally, DSLM supports independent control over generated vocals and accompaniment, which is beyond the capabilities of the mentioned previous works.
- Accomplishing Various Song Generation Tasks in a Single Model: As mentioned earlier, DSLM can perform multiple tasks within a single model, such as generation, editing and accompaniment-to-song, and multi-task learning further enhances the musicality of generated songs. These advantages are beyond the capabilities of previous works, which required training specialised models for each task. We believe our approach provides valuable insights for other dual-sequence modeling tasks.
Regarding Comparison with Suno
We are thankful for the reviewer's comment and would like to discuss this issue to address the reviewer's concerns.
Firstly, it is challenging to make a fair comparison between our proposed method and Suno. Due to constraints related to data collection costs and music copyright, our dataset is relatively small (270,000 songs compared to Jukebox's 1.2 million) and of lower quality (primarily sourced from non-professional singers online). These limitations significantly affect the fluency, background noise, and vocal pronunciation in the generated songs. As a commercial product, Suno likely has access to more extensive and higher-quality data. Nevertheless, it is noteworthy that our generated songs achieve musicality and quality close to the ground truth samples in most tasks. This indicates that DSLM effectively maximizes performance despite the data limitations. We believe that increasing the quantity and quality of data will further enhance the results.
Secondly, our focus is on better modelling music representations rather than improving the encoding and decoding processes of music. To ensure fair comparisons, we conducted all experiments using the same components (BEST-RQ, LDM, and Encodec) for audio encoding and decoding. This approach aligns with previous works, such as MusicGen and SingSong, to prevent the influence of different encoding and decoding methods. Experiments demonstrate the strong performance and flexibility of DSLM, capable of handling multiple song generation tasks with a single model and outperforming specialized baselines in most tasks. In contrast, Suno's audio encoding and decoding methods are not disclosed, making it difficult to rule out the influence of these modules. Additionally, audio encoding and decoding are areas of long-standing research, significantly impacting audio quality, noise, and clarity. We plan to explore this in future work to enhance the quality of synthesized songs.
Finally, our model supports several capabilities that Suno and previous works do not, such as song editing, vocal editing, and vocal editing in songs. As #Reviewer H7YT mentioned, these diverse editing capabilities are highly practical for music production. In accompaniment-to-song and vocals-to-song, our model also differs from Suno's. We follow the requirements of previous works to ensure that the input tracks remain unchanged in the final output, whereas Suno may alter the content and melody of the input tracks. This demonstrates that DSLM offers a broader set of capabilities and is the first attempt in music generation to integrate such diverse capabilities within a single model rather than relying on multiple specialized models.
The paper presents a novel approach for lyrics-based song generation. The method leverages language models for semantic tokens modeling and then applies latent diffusion model to generate final music. A dual-sequence language model (DSLM) is introduced to not only handle vocals and accompaniment but also integrate them together. The model is applicable to a variety of lyrics-based song generation tasks and extensive experimental results demonstrate its effectiveness.
Strengths
- SongCreator is flexible and applicable to eight different lyrics-based song generation tasks.
- The components in SongCreator are mostly open-sourced and details of training and model hyper-parameters are provided.
- Generated samples are in good quality.
Weaknesses
- While the proposed system looks promising, it requires multiple stages and models during inference. The latency is not discussed.
- Training data heavily relies on the quality of the sound separation tool (Demucs). While its quality is acceptable for two streams (vocals & accompaniment), it becomes more problematic when more streams are separated, which limits finer instrument-level control.
- Semantic tokens (either vocal or accompaniment) are a mixture of multiple features, and therefore it is challenging to support disentanglement control (e.g., tempo).
Questions
- How do you align lyrics and Voice Activity Detection (VAD) results? could you elaborate more on this?
- Does it support languages in addition to English?
Limitations
Authors discussed limitations in Section 5.
We thank the reviewer for recognizing our work. We appreciate the constructive comments the reviewer provided, which will help us further improve our paper. We are delighted to have the following discussion with the reviewer.
Regarding the latency discussion
The reviewer is correct that SongCreator comprises four components for inference, i.e., a self-supervised model (BEST-RQ), a language model (DSLM), a latent diffusion model, and a VAE decoder. However, to achieve high-quality generation, state-of-the-art music generation models such as MusicLM [1] consist of 6 components: 3 language models (semantic, coarse & fine), a self-supervised model (w2v-BERT), a prompt encoder (MuLan) and an autoencoder (SoundStream). Similarly, state-of-the-art speech synthesis models like Seed-TTS [2] consist of 4 components: a self-supervised model (speech tokenizer), a language model, a diffusion model and a vocoder. All of these have similar or even higher complexity compared to our work.
Considering that song generation is a complex task that includes vocal composition, instrumental arrangement and harmonious generation, our primary focus at this stage is on optimizing the musicality and quality of the generated songs rather than on real-time requirements. Therefore, we have not discussed latency. We hope to further simplify this process to achieve real-time song generation in the future.
Regarding the impact of Demucs quality and more instrumental-level controls
The reviewer’s argument is thought-provoking. We have provided a detailed discussion on the impact of Demucs quality in the global rebuttal section at the top, where we explain how our approach mitigates its impact on the overall quality of generated songs. While achieving more instrumental-level controls remains challenging at present, we believe our approach offers valuable insights and assistance in minimizing the influence of separation quality as much as possible.
Regarding supporting disentanglement control
The reviewer is correct that disentangling control for semantic tokens, which are mixtures of multiple features, is challenging. While we are not focused on disentanglement control, we believe that our proposed DSLM can be extended to address this problem. One possible approach is to disentangle the elements within the semantic tokenizer, as explored in previous work [3, 4]. Another approach is to introduce textual descriptions to control various attributes and different streams in the generated music (e.g., tempo and different instruments), which has been widely attempted in instrumental music generation. These methods are compatible with our proposed DSLM, giving it the potential to address disentanglement control challenges in the future.
Regarding aligning lyrics and Voice Activity Detection (VAD) result
We are sorry that we did not explain the detailed process clearly. Specifically, we employed an automatic speech recognition (ASR) model to provide timestamps for each sentence in the lyrics and a voice activity detection (VAD) model to detect silent segments. We then select appropriate silent segments to split the data into segments of no more than 30 seconds, ensuring the completeness of the sentences. We will include a detailed explanation in the final version.
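A minimal sketch of this segmentation logic is given below, assuming hypothetical (start, end) timestamps from the ASR and VAD models; the authors' actual splitting heuristics may differ.

```python
# Illustrative sketch of the segmentation logic described above (not the
# authors' code). `sentences` are (start, end) times from an ASR model,
# `silences` are (start, end) times from a VAD model; we cut only inside
# silent regions so sentences stay intact and chunks stay under 30 s.

MAX_LEN = 30.0  # seconds

def in_silence(t, silences):
    return any(s <= t <= e for s, e in silences)

def split_points(sentences, silences, max_len=MAX_LEN):
    cuts, chunk_start = [], sentences[0][0]
    for (s0, e0), (s1, e1) in zip(sentences, sentences[1:]):
        gap = 0.5 * (e0 + s1)  # candidate cut point between two sentences
        # Cut if adding the next sentence would exceed max_len and the gap is silent.
        if e1 - chunk_start > max_len and in_silence(gap, silences):
            cuts.append(gap)
            chunk_start = s1
    return cuts
```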
Regarding support for other languages
Due to the cost of data collection, our experiments in the paper were conducted only on the English datasets. However, as a generative model, it is not inherently bound to a specific language. With sufficient data for a specific language, the model can be adapted to support generation in other languages.
[1] Lam M W Y, Tian Q, Li T, et al. Efficient neural music generation[J]. Advances in Neural Information Processing Systems, 2024, 36.
[2] Anastassiou P, Chen J, Chen J, et al. Seed-TTS: A Family of High-Quality Versatile Speech Generation Models[J]. arXiv preprint arXiv:2406.02430, 2024.
[3] Zhang X, Zhang D, Li S, et al. Speechtokenizer: Unified speech tokenizer for speech large language models[J]. The Twelfth International Conference on Learning Representations, 2024.
[4] Ju Z, Wang Y, Shen K, et al. Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models[J]. arXiv preprint arXiv:2403.03100, 2024.
Thanks for the response. I appreciate the clarification for the alignment between lyrics and VAD and the possible extension for instrument/disentanglement control. However, I am not sure why the authors did not run a simple test for inference latency. I totally understand it requires multiple modules/stages to achieve good music quality, just like the SOTA methods, but since you have already implemented these baselines for comparison, why not include the latency as well? It should give readers a high-level idea of how fast it is and, if it is not fast, what the trade-off between performance and speed is. I strongly suggest the authors add such a table in the final version.
In addition, while I don't think it is critical to have comparison with SUNO for accept/reject, it would be great to include some examples to see the gap between academic research and commercial product.
Finally, given the flexibility of the proposed system and overall performance, I believe the paper is valuable to appear in NeurIPS but I will recommend authors considering the above comments. I will keep my original score.
We appreciate the reviewer's overall positive feedback and constructive comments.
Thanks very much for the suggestion to conduct a test of inference latency. We would like to supplement the evaluation by comparing the real-time factor (RTF) of SongCreator and other baselines. RTF represents the time (in seconds) required for the system to synthesize one second of waveform. The evaluation was performed on a single NVIDIA V100 GPU with a batch size of 1. We randomly selected 20 generated audio samples, each longer than 20 seconds, to conduct the evaluation. These additional results will be included in the final paper.
| model | RTF |
|---|---|
| MusicLM | 14.545 |
| MusicGen | 2.104 |
| GPT | 1.525 |
| GPT (Vocals & Song) | 3.059 |
| SongCreator | 2.793 |
The results indicate that methods utilizing a single LM module are significantly faster than MusicLM, which employs multiple LMs in a cascading manner. Taking into account the experiments corresponding to Table 3 in the paper, we observe that although GPT and MusicGen, which only model the song token sequence, are faster than GPT (Vocals & Song) and SongCreator, which predict multiple sequences, this gain in speed comes at the cost of reduced performance. In comparison to GPT (Vocals & Song), our proposed SongCreator, which leverages DSLM to simultaneously model both vocals and accompaniment, achieves not only faster speeds but also better results.
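For reference, RTF can be measured with a simple timing wrapper such as the sketch below; the `synthesize` interface and the sample rate are assumptions for illustration, not the authors' code.

```python
# Sketch of the RTF measurement described above (illustrative; `synthesize`
# and the sample rate are assumptions, not the authors' actual interface).
import time

def real_time_factor(synthesize, inputs, sample_rate=44100):
    """RTF = seconds of compute per second of generated audio (lower is faster)."""
    start = time.perf_counter()
    waveform = synthesize(inputs)          # assumed to return a 1-D array of samples
    elapsed = time.perf_counter() - start
    return elapsed / (len(waveform) / sample_rate)

# Averaging over e.g. 20 samples longer than 20 s, as in the table above:
# rtf = sum(real_time_factor(model.synthesize, x) for x in test_set) / len(test_set)
```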
Furthermore, we acknowledge the comments related to Suno and will include Suno-generated samples on the final demo page as advised.
The authors present SongCreator, a music generation system capable of simultaneous generation of vocals and accompaniment tracks. SongCreator consists of a language model generating two streams of BEST-RQ ([57] in the paper) semantic tokens, one for the vocals and the other for the musical accompaniment, a non-autoregressive transformer mixing the two streams, followed by a latent diffusion model translating the semantic tokens to VAE latents that is then decoded back to audio. The model is conditioned on lyrics and optional style audio prompts for either vocals or accompaniment.
The authors suggest using Bidirectional Cross Attention (BCA) for the two-stream language model, which is the main modeling contribution of the paper. A thorough ablation study is performed suggesting that BCA is crucial for coherent generation of songs - i.e. music with both vocals and instrumental accompaniment.
Additionally, the zero-shot voice cloning capabilities of SongCreator is demonstrated through comparison to baselines such as VALL-E ([9] in the paper), showing superiority in terms of singer voice similarity.
Finally, the benefits of multi-task training are demonstrated; in particular, in an interesting contribution, the authors show that vocal generation benefits from the dual accompaniment and vocal generation objectives seen during training.
Strengths
- Diverse lyrics editing capabilities of SongCreator are demonstrated, namely three variations: direct editing of the mixture track, editing a separate vocal track, or editing the vocals given an accompaniment track. This is highly practical for music production and demonstrates the flexibility of the proposed system.
- The design of the Bidirectional Cross Attention (BCA) between the vocal semantic decoder and the accompaniment semantic decoder is an important modeling contribution. Moreover, it is validated extensively throughout the experiments section (section 4), demonstrating a clear advantage of BCA compared to both independent two-stream modeling and single-stream modeling alternatives.
- The superiority in the SECS metric compared to a VALL-E-like architecture is an important contribution (Table 6), demonstrating effective zero-shot voice cloning capabilities of the proposed model.
- The significant increase in vocal generation quality when given an accompaniment track is an important contribution, demonstrating the effectiveness of the Dual Sequence Language Modeling (DSLM) in learning from temporally aligned auxiliary musical signals.
- The authors perform extensive experimentation in order to validate their modeling design choices. A wide range of baselines is implemented using reproductions of prior work, in addition to an ablation study on the main components of SongCreator.
Weaknesses
- The quality of the samples, as demonstrated on the demo page, is relatively low compared to prior work.
- As stated in line 361, SongCreator does not support control via global textual descriptions of genre, style or instrumentation. This is a major weakness compared to prior work.
- Line 119 - The decision to use BEST-RQ as the semantic tokenizer should be supported with an ablation study comparing it to open-source alternatives such as MusicFM [1] or MERT [2]. In addition, it is unclear how the authors validated that it indeed "encapsulates sufficient semantic and acoustic details" as stated in line 121.
- Though the baseline set is broad, none of the baselines uses an official checkpoint. All baselines are reproductions of prior work, which lowers the reliability of the comparison. The comparison in Table 15 reveals a significant gap in performance compared to the SingSong official samples.
- It is unclear which samples were used for subjective evaluation. Specifically, a crucial factor is whether source-separated data or studio-stemmed data was used. Source-separated data may bias the results towards models with audio prompts due to information leakage between the artificially extracted stems.
- No model checkpoint release is planned, which significantly reduces the reliability and reproducibility of the research.
[1] A Foundation Model for Music Informatics, Won et al. 2023
[2] MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training, Li et al. 2023
Questions
- Line 215 - What is the rationale behind using the None strategy 20% of the time?
- The sentence in lines 149-153 needs rephrasing. It is difficult to understand.
- Table 3 - How do the authors explain the opposite trend in FAD compared to the trend in subjective quality?
- Table 4 shows a significant increase in vocal quality when given an accompaniment track. Did the authors validate the diversity of such a model, in the sense that there is no information leakage from the artificially separated accompaniment track?
- In the detailed description of the dataset, in line 607, it is unclear whether "separate instrumental music and vocals" refers to stemmed data based on studio-recorded tracks, or to single-instrument / a cappella performances.
- Table 9 (music continuation study) - Why was the reproduced MusicGen omitted from this study?
- The quality of the "original song" samples from the demo page is relatively poor. In the paper, it is reported that a sampling rate of 44.1kHz is used in the work. What is the reason for the relatively low quality of the original data samples?
Limitations
The authors adequately addressed the limitations of their work, as well as the broader impact of the proposed system. In specific, potential negative usage of the voice cloning capabilities of SongCreator is discussed. Moreover, the authors decide not to publish model checkpoints due to the harmful potential of such feature.
We are grateful for the reviewer's overall positive response and the constructive comments provided. We address the concerns and questions below.
Regarding the audio quality, semantic tokenizer and using the None strategy 20% of the time
We take these concerns seriously and have provided a detailed explanation in the global rebuttal section at the top.
Regarding the control through textual descriptions
The reviewer is correct that the current model cannot control the generated songs through textual descriptions. This limitation is mainly due to the dataset. Open-source datasets only contain instrumental music with textual descriptions and lack song data with textual descriptions. Collecting and annotating song data with textual descriptions requires significant time. We look forward to addressing this issue in future work.
Regarding the baseline set
First, we would like to note that only Jukebox has implemented song generation. In section 4.2, we compared our model with Jukebox's official samples, showing that SongCreator was preferred in 60% of cases. Other baselines are SOTA models for instrumental music generation or speech synthesis, which can’t generate songs with both vocals and accompaniment. To ensure a reliable and fair comparison, we reproduced these models on our song dataset based on reliable open-source code.
Regarding the experiments corresponding to Table 15, it is important to note that the six samples provided by SingSong's official sources are (at least partially) cherry-picked. In contrast, we used non-cherry-picked samples for all experiments. Additionally, in our reproduced results, SingSong performs comparably to our model in the subjective evaluations of the vocals-to-song and accompaniment-to-song tasks. We speculate that the better performance in their official demo is mainly due to their use of a larger and higher-quality dataset.
Regarding the samples used for subjective evaluation
We thank the reviewer for reminding us to introduce the samples used for subjective evaluation. During subjective evaluation, we did not use audio prompts in any experiments except for the prompt-based experiments in Tables 5 and 6, to ensure a fair comparison.
Regarding the reliability and reproducibility of the research
Due to concerns over music copyright and the potential misuse of voice cloning capabilities, we do not plan to release the checkpoints trained on the full dataset. To better assist readers in reproducing the experiments in the paper, we have provided detailed descriptions of the model structure and hyperparameter settings and plan to open-source our code in the future.
Regarding the sentence in lines 149-153
We appreciate the reviewer’s constructive comment. We will revise this part for clarity and ease of understanding.
Regarding the opposite trend of FAD compared to subjective quality in Table 3
Thanks for the reviewer's insightful comments. We believe one possible reason for this phenomenon is the inconsistency in evaluation criteria. On one hand, subjects focus mainly on the clarity and intelligibility of the songs, specifically whether the lyrics are accurately conveyed. On the other hand, FAD evaluates the overall quality and fidelity of the audio. Additionally, prior work [1] suggests that existing objective quality metrics, including FAD, fail to reliably predict the perceptual quality of generative music. We also observed a similar phenomenon in MusicGen. Given that we are assessing the combined vocals and accompaniment, we believe the perceptual quality in the listening study is more reliable.
Regarding the increase in vocal quality in Table 4
We appreciate this comment and apologize for the confusion caused by our presentation. In the experiments presented in Table 4, we did not provide the model with an accompaniment track. The difference between SongCreator and SongCreator (Vocal Only) lies in whether BCA and the accompaniment decoder are used during inference. SongCreator (Vocal Only) uses only the vocal decoder, while SongCreator uses the same setup as in the lyrics-to-song task, generating both vocal and accompaniment tokens before using the obtained vocal tokens to generate vocals. This means the accompaniment track in this part is still generated by the model, not artificially provided.
Furthermore, to prevent information leakage from artificially separated tracks in the accompaniment-to-song and vocals-to-song tasks, we added noise to the inputs to conceal artifacts and used only semantic tokens as inputs. This approach has been validated for its effectiveness in SingSong.
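A minimal sketch of this kind of noise injection is shown below, assuming a simple fixed-SNR white-noise scheme; the actual noise type and level used by the authors (following SingSong) are not specified here.

```python
# Sketch of the anti-leakage trick described above (adding noise to separated
# inputs before tokenization, in the spirit of SingSong); the SNR is an
# illustrative assumption, not the authors' setting.
import numpy as np

def add_noise(stem, snr_db=20.0):
    # Add white noise at a fixed SNR to conceal source-separation artifacts.
    signal_power = np.mean(stem ** 2) + 1e-12
    noise_power = signal_power / (10 ** (snr_db / 10))
    return stem + np.random.randn(*stem.shape) * np.sqrt(noise_power)
```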
Regarding the “separate instrumental music and vocals”
This refers to non-vocal music data and a cappella performances rather than artificially separated tracks.
Regarding the omission of MusicGen from the music continuation study
First, we would like to note that MusicGen does not emphasize its capability for music continuation, focusing mainly on text-to-music generation. Therefore, we did not consider its music continuation ability. Moreover, as shown in MusicGen's paper and in our experiments in Table 3, MusicGen performs worse than the Flattening approach. Consequently, we chose AudioLM, which uses the Flattening approach, as the baseline for music continuation.
Regarding the reason for the relatively low quality of the original data samples
Thanks for the reviewer’s insightful comments. High-quality song data from professional singers are often strictly copyrighted, so most of our data comes from performances by non-professional singers on the internet. Although these data samples have a sampling rate of 44.1kHz, their overall quality is relatively low due to the limitations of the recording environment and equipment.
[1] Vinay A, Lerch A. Evaluating Generative Audio Systems and Their Metrics[C]. ISMIR 2022 Hybrid Conference, 2022.
I thank the authors for the clarifications. My concerns were adequately answered, and my score would remain unchanged.
The only thing that remained unclear to me is the quality of the "original song" samples on the demo page. Were these samples taken from DISCO-10M ([68] in the paper), or from the in-house datasets? In both cases, the low quality isn't fully explained by the recording environment and equipment. In case the samples were either preprocessed, or processed using one of the encoder models of SongCreator, this should be mentioned both in the paper and on the demo page.
We appreciate the reviewer's constructive comments. We would like to clarify that the "Original Song" samples on the demo page are reconstructed samples, not the original recordings. These samples have been reconstructed using BEST-RQ encoding and LDM decoding to eliminate the potential impact from the encoding and decoding processes during our experiments. Taking into account the reviewer's suggestion, we have updated the text on the demo page to accurately reflect this information. Specifically, we have changed "Original Song" to "Original Song (Reconstructed)" and added a note to explain this. We will also make the necessary revisions in the final paper to ensure that this information is clearly stated.
The authors introduce a novel system for lyrics-based song generation. It can handle various inputs (lyrics, vocal prompts, accompaniment prompts) and generate different outputs (full songs, vocals only, etc.). The paper proposes a dual-sequence language model (DSLM) that separately models vocals and accompaniment while capturing their interactions through a bidirectional cross-attention mechanism. An attention mask strategy is specifically designed to support song generation tasks of various forms. The authors present competitive performance on eight different song-related tasks.
Strengths
- The model can handle multiple song-related tasks within a single framework, including lyrics-to-song, lyrics-to-vocals, accompaniment-to-song, vocals-to-song, music continuation, and various editing tasks.
- The bidirectional cross-attention mechanism in DSLM enables the model to capture the mutual influences between vocals and accompaniment, contributing to more harmonious and coherent generation.
- The attention masking strategy enables the model to perform various tasks like generation, editing, and continuation within a single architecture.
Weaknesses
- It is not clear how the data is collected, how the audio and lyrics are processed, and what the input of the lyric encoder is. (Words or phonemes? If so, how do you obtain them from lyrics?) Will the dataset be open-sourced? If not, it is very challenging to reproduce the experiments in the paper to validate the proposed method.
- It is not clear how the vocal prompt and accompaniment prompt are obtained and how they are passed into the model. It may be better to further explain this issue.
- In Figure 2, I can get the attention mask strategies, but the figure is confusing. The authors should use a straight line with an arrow to illustrate the mask relationship between tokens in two sequences.
- Although the authors describe the effects of different mask strategies, they only list the SA mask and BCA mask strategies for each task in Table 2 without clarifying why these masks are chosen for each task. The authors need to clarify this, even if just for one of the tasks.
Questions
- Have you compared semantic tokens obtained from different models? Or have you explored other forms of intermediate representations (acoustic tokens or continuous representations)? If so, have you analyzed and compared the differences between these different types of tokens?
- From the reviewer’s experience, Demucs does not perform very well for source separation tasks. The vocal samples often have reverb, which significantly affects the quality of the synthesis. How do the authors address these issues?
- What is the training detail of the baselines in the paper? Are the models trained on the same dataset?
- What is the training strategy for the VAE? What are the components of its loss function?
- The Lyrics-to-vocals samples are quite expressive. They sound natural and exhibit some singing techniques such as trill. What do the authors think contributes to this?
Limitations
No Limitations
We are grateful for the reviewer's overall positive response. We will address the specific suggestions regarding Figure 2 and the information mentioned in the rebuttal in the final version of the paper. The other concerns and questions raised are addressed below.
Regarding the training dataset
We thank the reviewer for reading our paper carefully. Our data is collected from the internet, including part of the DISCO-10M and some in-house data. For the audio, we employed an Automatic Speech Recognition model to provide timestamps for each sentence in the lyrics and used a Voice Activity Detection model to detect silence segments. We chose appropriate silent segments to split the data into segments no longer than 30 seconds and ensure sentence integrity. The lyrics are tokenized by the tokenizer of BERT. Due to copyright issues, we can't open-source this dataset. To assist readers in reproducing the experiments, we have provided detailed descriptions of the model structure and hyperparameter settings and plan to open-source our code in the future.
Regarding the usage of vocal prompt and accompaniment prompt
We followed the setup from VALL-E. Specifically, the prompt audio is converted into tokens by BEST-RQ and then passed as a prefix to the DSLM. The model uses this prefix to sequentially predict the following token sequence. During training, our vocal and accompaniment prompts are taken from the previous sentence of the target audio. During inference, we randomly select unseen prompts from the test set.
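A minimal sketch of this prefix-based prompting is shown below; the helper names (`best_rq.tokenize`, `dslm.generate`) are hypothetical and only illustrate the data flow described above.

```python
# Minimal sketch of the prompt mechanism described above (hypothetical helper
# names; the actual interfaces are not specified in the paper).

def generate_with_prompts(dslm, best_rq, lyrics, vocal_prompt_audio, acc_prompt_audio):
    # Prompt audio is tokenized by BEST-RQ and prepended as a prefix; the DSLM
    # then autoregressively predicts the tokens that follow each prefix.
    vocal_prefix = best_rq.tokenize(vocal_prompt_audio)
    acc_prefix = best_rq.tokenize(acc_prompt_audio)
    return dslm.generate(lyrics=lyrics,
                         vocal_prefix=vocal_prefix,
                         accompaniment_prefix=acc_prefix)
```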
Regarding the choice of masking strategy for each task
We agree with the reviewer's suggestion and will provide detailed explanations in the Appendix. Briefly, for sequences that need to be generated, we use a causal mask in SA to support autoregressive generation. For a pre-determined track (e.g., accompaniment in accompaniment-to-song or vocals in vocals-to-song), we use a non-causal mask in SA to better encode the contextual representation. Regarding the BCA mask, when both vocals and accompaniment need to be generated simultaneously (e.g., in lyrics-to-song or song editing tasks), we use the BR strategy to consider the interrelationship between vocals and accompaniment. For song generation from a pre-determined track (e.g., accompaniment-to-song or vocals-to-song), we use the corresponding A2V or V2A strategy to ensure that the sequence to be generated can consider the full context of the other sequence. For independent sequence generation (e.g., music continuation or vocals editing), we use the None strategy to support independent generation. Ablation experiments and results supplemented in our response to #reviewer kmxW demonstrate that our chosen masking strategies for different tasks are reasonable.
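As a rough illustration of these choices, the sketch below builds causal vs. non-causal self-attention masks and the full-context (A2V/V2A) vs. None cross-attention masks; the exact BR mask is deliberately omitted since its precise form is defined in the paper, and the boolean convention here is our own assumption.

```python
# Illustrative mask construction for the strategies discussed above, using the
# convention that True = "may attend". This is a sketch, not the authors' code.
import torch

def sa_mask(length, causal):
    # Causal for sequences being generated; non-causal for pre-determined tracks.
    full = torch.ones(length, length, dtype=torch.bool)
    return torch.tril(full) if causal else full

def bca_mask(query_len, key_len, strategy):
    if strategy == "none":            # independent generation; in practice the
        return torch.zeros(query_len, key_len, dtype=torch.bool)  # BCA is skipped
    if strategy in ("a2v", "v2a"):    # attend to the full context of the other sequence
        return torch.ones(query_len, key_len, dtype=torch.bool)
    raise ValueError(f"unsupported strategy: {strategy}")

# e.g. accompaniment-to-song: non-causal SA over the given accompaniment,
# causal SA over the vocals being generated, and an A2V mask in the BCA.
```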
Regarding the intermediate representations
Thank you for the insightful comments. We have provided a detailed explanation about the semantic tokens in the global rebuttal section at the top. Regarding other forms of intermediate representations, we experimented with acoustic tokens extracted from the Encodec model. In the lyrics-to-song experiment in Table 3, we used the GPT and MusicGen models, which have similar hyperparameter settings and structures (autoregressive transformer decoders). However, MusicGen's prediction target is acoustic tokens, whereas GPT's prediction target is semantic tokens. The results show that GPT has an advantage in terms of musicality and quality in subjective evaluations.
Regarding the impact of the Demucs quality
Noticeably, #reviewer nDSW also shares a similar concern. We take these comments seriously and have provided a detailed explanation in the global rebuttal section at the top.
Regarding the training details of the baselines
All baselines are trained using strategies similar to those used for DSLM, including the same dataset, training resources, optimizer settings, and similar parameter scales. Each model was trained for 500K steps. Additionally, for a fair comparison, baselines with semantic tokens as the prediction target (e.g., GPT, SingSong (Diffusion)) share the same BEST-RQ and LDM modules as DSLM.
Regarding the training strategy for the VAE
To train the VAE, we first adopted the pre-trained model provided in DAC, then fine-tuned the encoder and decoder components (i.e., replacing the vector quantizers with a diagonal Gaussian re-sampler as in LDM). We retained the frequency-domain reconstruction loss, discriminators and adversarial loss from DAC and added a KL loss typically used for training VAEs. The VAE was trained on our prepared dataset of 100k hours of song data, which is the same as the one used for training BEST-RQ.
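As a rough illustration (not the authors' implementation), the overall objective could be assembled as in the sketch below, where the loss weights and helper arguments are placeholder assumptions.

```python
# Hedged sketch of the VAE objective described above: the frequency-domain
# reconstruction and adversarial losses are kept from DAC, and a KL term is
# added once the vector quantizers are replaced by a diagonal Gaussian
# re-sampler. Loss weights are placeholders, not the authors' actual values.
import torch

def vae_loss(recon_loss, adv_loss, mu, logvar, w_adv=1.0, w_kl=0.01):
    # KL divergence between the diagonal Gaussian posterior and a unit Gaussian.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + w_adv * adv_loss + w_kl * kl
```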
Regarding the expressiveness of generated vocals
The reviewer's argument is thought-provoking, and we are pleased to share our findings. We also noticed this interesting phenomenon and conducted experiments to verify it. Unlike works that only focus on vocal generation, SongCreator generates both vocal and accompaniment tokens before using the obtained vocal tokens to generate vocals. This means that even if the model only generates vocals, the relationships between vocals and accompaniment are also considered. As presented in the experimental results (see Table 4), we find that this approach significantly enhances the musicality of the generated vocals compared to SongCreator (Vocal Only), which only considers vocal generation. This indicates that taking the relationships between vocals and accompaniment into account helps generate more expressive vocals.
The paper presents SongCreator, a system for full song generation including the vocals and accompaniments. The system comprises several steps:
- First a quantizer is trained to tokenize audio, which is used to tokenize the song, vocals, and accompaniments.
- Next, a language model (DSLM) conditioned on various conditioning signals such as lyrics, vocal prompt, and accompaniment prompt is trained to predict the tokenized forms of the input conditioning signals. The DSLM consists of two transformer decoders operating on the vocal and accompaniment prompts respectively. The decoders both cross-attend to each other and the authors propose multiple masking strategies for the self- and cross-attention modules.
- The final component is a latent diffusion model which generates audio conditioned on the semantic tokens. The first and last steps are pretrained and based on existing literature, while the DSLM is trained using a multi-task setup carefully designed to account for various tasks that the model should be able to handle, such as generating songs from lyrics, generating songs from a pre-determined vocal/accompaniment track, or song editing.
Strengths
- The authors have conducted a very thorough evaluation of their method using subjective and objective metrics. They have also compared against fairly strong baselines for the various tasks.
- The results reported in the paper show that their proposed DSLM is superior to standard GPT-based LMs for the chosen tasks.
Weaknesses
- The paper lacks some clarity and could benefit from improvements in the illustrations and writing. More specifically, while Fig. 1 gives a good overview of the approach, it seems to be misaligned with what is being presented in the text. At first glance it felt like the text was saying that the model accepts text prompts for describing the vocals and accompaniments, but the figure shows those signals passing through the semantic token extractor. The figure is correct, but it took a few minutes to realize that. Similarly, it is not obvious that the attention masking strategies are different for different tasks. Some pre-conditioning in the figure/method section would be beneficial for clarifying the design for the readers.
- The audio quality of the final generations are not very convincing. It is indeed better than Jukebox though.
- The significance of the paper’s contributions seems low. The authors have themselves mentioned that the semantic tokenizer and the LDM are prior work. Furthermore, the overall design is very similar to other recent work. The main difference is in the specific tasks the authors have chosen and indeed I find that the design is useful for those tasks.
Questions
- One of the major contributions is the use of the BCA and the authors have ablated the utility of that component, however they seem to have skipped any ablation studies on the different masking strategies for BCA. This might be an interesting point of discussion in the paper.
Limitations
The authors have discussed limitations both in terms of technicality as well as societal impact of their work.
We appreciate the thorough review regarding our study. We provide detailed responses to your concerns, as summarized below.
Regarding the clarity of paper
We are thankful for the reviewer's constructive comment, which we take seriously to revise this paper for a clearer presentation. Some of the amendments are described below:
- We will emphasize in the abstract and introduction sections that the current vocal prompts and accompaniment prompts are audio rather than text.
- We will include pre-conditioning in the introduction and method sections to clarify the use of different attention masking strategies for different tasks. Additionally, we will provide a more detailed explanation of the basis for setting different attention masking strategies in the Appendix.
- We will refine the style and legend of the vocal/accompaniment prompt in Figure 1 to avoid ambiguity, and further highlight the relationship between different tasks and attention mask strategies in Figure 2.
Regarding the audio quality
We acknowledge the reviewer's comment regarding the audio quality. We take this seriously and have provided a detailed explanation in the global rebuttal section at the top.
Regarding the contribution of the proposed method
We appreciate the reviewer's feedback and hope to clarify the novelty of our proposed SongCreator. The core contribution lies in the DSLM and the corresponding attention masking strategies, a novel approach to dual-sequence modelling of data such as songs, which contain both vocals and accompaniment. In song generation, DSLM offers advantages over independent single-stream or dual-stream modeling and enables independent control over the generated vocals and accompaniment. By incorporating various attention masking strategies, our model can complete diverse song generation tasks, such as generation, editing, and understanding, while multi-task training further enhances the musicality of the generated songs. These advantages are beyond the capabilities of previous works, and we believe our approach provides valuable insights for other dual-sequence modeling tasks.
Furthermore, as the reviewer mentioned, our overall framework is a well-validated approach in the audio domain (the combination of LM and Diffusion). However, on the one hand, we are the first to apply this framework to the task of song generation, and its performance greatly exceeds that of the previous SOTA Jukebox. On the other hand, the framework is used to validate the effectiveness of our proposed DSLM and attention mask strategy. By comparing our method with other approaches within the same framework, we demonstrated that DSLM not only enables more flexible universal song generation but also achieves state-of-the-art or competitive performance across all eight tasks.
Regarding the ablation studies on the different masking strategies for BCA
Thank you for the suggestion. We have supplemented additional ablation studies on the different masking strategies for BCA. Specifically, we conducted AB preference tests for the lyrics-to-song and accompaniment-to-song tasks. In lyrics-to-song, we compared BR with A2V and V2A, and in accompaniment-to-song, we compared A2V with BR. The results are as follows:
Lyrics-to-song
| BR | A2V | V2A | None | NP |
|---|---|---|---|---|
| 76% | 20% | | | 4% |
| 71% | | 25% | | 4% |
| 85% | | | 14% | 1% |
Accompaniment-to-song
| BR | A2V | NP |
|---|---|---|
| 27% | 59% | 14% |
For lyrics-to-song, the comparison between BR and None has already been presented in the paper (see Figure 4). The results indicate that in lyrics-to-song, replacing the BR strategy with other strategies leads to a significant performance deterioration, demonstrating that the BR strategy helps the model generate harmonious vocals and accompaniment. The None strategy, which disregards the relationship between vocals and accompaniment, performed the worst. In accompaniment-to-song, participants preferred the songs generated with the A2V strategy. We believe this is because the A2V strategy provides more context about the accompaniment sequence when generating vocals.
We sincerely appreciate the detailed feedback and constructive comments from all reviewers, which are extremely helpful to us in revising this paper. We are grateful for your recognition of the comprehensiveness of our experiments, and we are also glad that our approach is recognized for its novelty, strong performance and flexibility. We first address the major concerns and issues raised by multiple reviewers in this global rebuttal, and then respond to each reviewer's specific comments individually.
Regarding the audio quality
We acknowledge that the current audio quality is limited by the semantic tokenizer and LDM modules used. However, we would like to note that the core contribution of our work lies in the proposed DSLM and attention masking strategies for universal song generation. Given that there are no open-source semantic tokenizer and LDM modules for high-quality song generation and that previous works have been limited to generating instrumental music, we retrained these well-performing modules on song datasets to validate our proposed approach.
Although the current audio quality is temporarily suboptimal due to the interference between vocals and accompaniment, our fair comparisons have demonstrated that DSLM significantly enhances the musicality and intelligibility of generated songs compared to other LM-based methods. Additionally, the proposed DSLM exhibits diverse capabilities in generating, continuing, and editing songs, and supports flexible control and input combinations — advancements that were not achievable in previous studies.
Indeed, it is noteworthy that DSLM can be paired flexibly with various semantic tokenizers and LDMs, holding the potential for high-quality, universal song generation in the future. Additionally, we are committed to ongoing research into semantic tokenizers and LDMs to enhance the audio quality.
Regarding the selection of the semantic tokenizer
Thanks for the reviewers' suggestions. Initially, we carried out preliminary validation experiments using MERT and MusicFM. We found that while these models could reproduce high-quality accompaniment after quantization, the clarity of the vocals was limited. Considering that BEST-RQ also performed well in MusicFM's experiments, we decided to train a BEST-RQ model specifically on song data, incorporating separate instrumental music and vocals to enhance vocal clarity. We believe that it encapsulates sufficient semantic and acoustic information, as evidenced by the retention of key song components (such as lyrics, vocal timbre, instruments, and melody) in the reconstructed audio after converting semantic tokens to audio via the LDM.
In response to the reviewers’ requests, we plan to incorporate additional ablation studies to compare the performance of our BEST-RQ with open-source alternatives such as MusicFM and MERT. However, due to the complexity involved in retraining multiple modules, we will include these comparative results in the final version of the manuscript.
Actually, we believe that this will not affect the novelty of DSLM. Our current choice of semantic tokenizer model was primarily to validate the effectiveness of DSLM. As mentioned above, DSLM is adaptable to other semantic tokenizers. The choice of tokenizer may affect the audio quality of the generated songs but does not impact the model’s ability to perform multiple tasks or the musicality of the generated songs.
Regarding using the None strategy 20% of the training time
We adopted the None strategy to allow the model to learn to generate the accompaniment or vocal track independently, supporting independent generation tasks such as music continuation. But obviously, training the model to capture the relationships between vocals and accompaniment through the bidirectional cross-attention (BCA) is more critical for generating songs. Therefore, we configured the model to employ the BR strategy 80% of the time and the None strategy 20% of the time. This probability setting was inspired by work related to classifier-free guidance [1, 2] to ensure it does not disrupt the training of the BCA.
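A minimal sketch of this training-time sampling, under the stated 80/20 split, is given below; it is our illustration rather than the authors' code.

```python
# Sketch of the training-time strategy sampling described above: the BR
# strategy is used 80% of the time and the None strategy 20% of the time,
# in the spirit of the random condition dropping used for classifier-free
# guidance. Illustrative only.
import random

def sample_bca_strategy(p_none=0.2):
    return "none" if random.random() < p_none else "br"
```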
Regarding the impact of the Demucs quality
As the reviewers mentioned, Demucs separated vocals can exhibit reverb, which might affect synthesis quality. However, we hope to clarify that it does not significantly impact our proposed method for several reasons.
Firstly, self-supervised models with vector quantization (such as BEST-RQ) are noise-robust, a property often leveraged in speech synthesis [3, 4]. Secondly, for most tasks where songs are the final generation target, we directly utilize the song tokens generated by the song decoder within our framework, rather than the vocal and accompaniment tokens. During the training of DSLM, target song tokens are extracted from the original songs without separation, and the training loss for vocal and accompaniment tokens primarily helps the model to learn the musicality of the accompaniment, the expressiveness of the vocals, and the relationships between them. Consequently, although Demucs may have limitations, our methodology effectively reduces its influence on the overall quality of generated songs.
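To make this concrete, a hedged sketch of such loss bookkeeping is given below; the weighting and exact formulation are assumptions, with the key point being that the song-token targets come from the unseparated mix while the separated streams only supply auxiliary targets.

```python
# Illustration of the point above (an assumption about loss bookkeeping, not
# the paper's exact formulation): song-token targets come from the unseparated
# original audio, while the vocal and accompaniment token losses act as
# auxiliary objectives over the Demucs-separated streams.
import torch.nn.functional as F

def dslm_loss(song_logits, song_targets,      # targets from the original mix
              vocal_logits, vocal_targets,    # targets from separated vocals
              acc_logits, acc_targets,        # targets from separated accompaniment
              aux_weight=1.0):
    main = F.cross_entropy(song_logits, song_targets)
    aux = F.cross_entropy(vocal_logits, vocal_targets) + \
          F.cross_entropy(acc_logits, acc_targets)
    return main + aux_weight * aux
```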
[1] Le M, Vyas A, Shi B, et al. Voicebox: Text-guided multilingual universal speech generation at scale[J]. Advances in neural information processing systems, 2024
[2] Du Z, Chen Q, Zhang S, et al. CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens[J]. arXiv preprint arXiv:2407.05407, 2024
[3] Fujita K, Sato H, Ashihara T, et al. Noise-robust zero-shot text-to-speech synthesis conditioned on self-supervised speech-representation model with adapters[C]. ICASSP, 2024
[4] Zhao X, Zhu Q, Hu Y. An Experimental Comparison of Noise-Robust Text-To-Speech Synthesis Systems Based On Self-Supervised Representation[C]. ICASSP, 2024
The authors present SongCreator, a music generation system capable of simultaneously generating vocals and accompaniment tracks. SongCreator is based on a music language model generating two streams of semantic tokens, one for the vocals and the other for the musical accompaniment. Next, a non-autoregressive transformer mixes the two streams, followed by a latent diffusion model for decoding back to the time-domain signal. Overall the paper is clearly written, and the proposed method is interesting and would be valuable to the community. The authors addressed most of the comments raised by the reviewers, hence I recommend accepting this paper. I strongly recommend the authors to include the additional results and clarifications raised by the reviewers in the final manuscript.