MuseControlLite: Multifunctional Music Generation with Lightweight Conditioners
A framework that employs rotary positional embeddings in decoupled cross-attention layers to achieve precise controllability for musical attribute conditioning, as well as audio inpainting and outpainting.
Abstract
Reviews and Discussion
MuseControlLite is an efficient adapter-based controllable text-to-music model built on Stable Audio Open, adopting decoupled cross-attention (IP-Adapter). The key finding is that, for time-varying control signals, integrating a suitable positional encoding (i.e., RoPE) into the adapter itself is crucial for achieving good results with a relatively small number of adapter parameters.
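For concreteness, the mechanism can be sketched as below: a frozen text cross-attention branch plus a trainable adapter branch whose queries and keys receive RoPE. This is a minimal single-head illustration with assumed names, shapes, and RoPE variant, not the authors' implementation.

```python
import math
import torch
import torch.nn as nn

def rope(x):
    """Rotary positional embedding over the sequence dimension of x: (B, T, D)."""
    _, T, D = x.shape
    half = D // 2
    freqs = 10000.0 ** (-torch.arange(half, dtype=x.dtype) / half)      # (half,)
    ang = torch.arange(T, dtype=x.dtype)[:, None] * freqs[None, :]      # (T, half)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * ang.cos() - x2 * ang.sin(),
                      x1 * ang.sin() + x2 * ang.cos()], dim=-1)

def attend(q, k, v):
    w = (q @ k.transpose(-2, -1)) / math.sqrt(q.shape[-1])
    return w.softmax(dim=-1) @ v

class DecoupledCrossAttention(nn.Module):
    """Frozen text branch plus a trainable adapter branch for a time-varying
    condition; the two attention outputs are simply summed (decoupled)."""
    def __init__(self, dim, cond_dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)                                       # frozen in practice
        self.k_txt, self.v_txt = nn.Linear(dim, dim), nn.Linear(dim, dim)  # frozen in practice
        self.k_ada, self.v_ada = nn.Linear(cond_dim, dim), nn.Linear(cond_dim, dim)  # trained

    def forward(self, x, text_emb, cond_emb):
        q = self.q(x)
        out_txt = attend(q, self.k_txt(text_emb), self.v_txt(text_emb))    # no positions needed
        # RoPE on the adapter branch so the frame-aligned condition carries position information.
        out_ada = attend(rope(q), rope(self.k_ada(cond_emb)), self.v_ada(cond_emb))
        return out_txt + out_ada

x = torch.randn(2, 256, 64)      # audio latent tokens
txt = torch.randn(2, 32, 64)     # text-encoder embeddings
cnd = torch.randn(2, 256, 12)    # frame-level musical attribute, e.g. a chromagram
print(DecoupledCrossAttention(64, 12)(x, txt, cnd).shape)   # torch.Size([2, 256, 64])
```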
Questions for Authors
In the authors' opinion, which time-varying musical attributes provide the best perceptual control? The ablation study in Sec. 5.2 discusses Table 4 and numerical results, but how would the differences translate to human perception of the correctness of the control?
I would like to support this important direction (controllable music generation), but the experimental rigor has room for improvement to warrant acceptance. Can the authors consider a subjective evaluation, or at the very least a qualitative study by adding baseline models to the demo, so that readers can form their own opinion? I believe having nuanced findings regarding the relation between controlled attributes and human perception would add significant academic merit, especially for this work.
Claims and Evidence
The necessity of positional embeddings for time-varying conditioning signals is intuitive, and this work constitutes a timely execution, devising a successful recipe for efficient control over several musical attributes (dynamics, melody, and rhythm) along with the claimed application of RoPE to the IP-Adapter.
Methods and Evaluation Criteria
The method is a combination of existing techniques (pre-trained Stable Audio Open backbone, IP-Adapter, and RoPE) and is technically correct. The evaluation employs commonly used metrics (FD, KL, CLAP), along with melody accuracy to measure the correctness of the chromagram-based melody condition.
Theoretical Claims
This paper is mostly empirical, and I find no standout theoretical claims to evaluate. Multi-attribute classifier-free guidance (Appendix A) has been investigated in the existing literature.
Experimental Design and Analysis
The work considered several music generation models with controllability (MusicGen-Melody and a reproduced Stable Audio Open ControlNet). While I acknowledge that this work aims at efficient adaptation, the objective metrics do not seem to be a clear win over Stable Audio Open ControlNet. Given that objective metrics often do not correlate with human perception, I believe conducting a subjective evaluation of the attributes in this work (melody, rhythm, dynamics) through mean opinion scores (MOS) would strengthen its merits.
Supplementary Material
I have reviewed the demo page.
Relation to Prior Literature
A timely contribution towards controllable music generation, which is a prominent area of research after the success of recent text-conditional music generative models.
Missing Important References
None.
Other Strengths and Weaknesses
While enabling fine-grained control over musical attributes (dynamics, melody, and rhythm) is a welcome addition, it also potentially adds complexity for users, depending on how hard it is for non-experts to obtain and manipulate the attributes. I acknowledge that these attributes are also used in previous studies, so the choice of attributes itself is not a drawback of this work.
Other Comments or Suggestions
None.
We sincerely appreciate your supportive feedback and hope that our responses below address and alleviate your major concerns.
On Melody Representation:
Thanks to your valuable comment, we have identified an oversight that caused the perceived inferiority of our model’s output on the original demo website. Specifically, our initial model (denoted as v1) adopted the melody representation from MusicGen-Melody, which offers lower pitch resolution compared to that employed by Stable-Audio ControlNet. To ensure a fair comparison with Stable-Audio ControlNet, we have retrained our model and developed a new version (v2), which now aligns with Stable-Audio ControlNet’s melody representation. This adjustment significantly enhances the perceptual quality of our generated samples.
- v1: This version employs a one-hot 12-pitch-class chromagram as the melody condition, the same as MusicGen-Melody. However, this melody representation lacks octave specificity, causing the model to misinterpret pitch information.
- v2: This version adopts a top-4 128-pitch-class CQT to represent the melody condition, as proposed by Stable-Audio ControlNet (a sketch of both representations follows below). To ensure a fair comparison with Stable-Audio ControlNet, we modified only the conditioning input, leaving the remainder of the pipeline unchanged.
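For concreteness, the two representations could be computed roughly as in the sketch below. The hop length, the 128-bin CQT layout starting at MIDI pitch 0, and the binarization are illustrative assumptions rather than our exact preprocessing.

```python
import numpy as np
import librosa

sr = 44100
t = np.linspace(0, 5.0, 5 * sr, endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 440.0 * t)          # 5 s of A4 as a stand-in for real audio

# v1-style condition: 12-pitch-class chromagram, binarized per frame (argmax),
# as in MusicGen-Melody; octave information is discarded.
chroma = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=2048)    # (12, T)
v1 = np.zeros_like(chroma)
v1[chroma.argmax(axis=0), np.arange(chroma.shape[1])] = 1.0

# v2-style condition: wide-range CQT, keeping only the top-4 strongest bins per
# frame, following Stable-Audio ControlNet's melody representation.
cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=2048,
                         fmin=librosa.midi_to_hz(0),                 # assumed pitch range
                         n_bins=128, bins_per_octave=12))            # (128, T)
top4 = np.argsort(cqt, axis=0)[-4:, :]                               # 4 strongest bins per frame
v2 = np.zeros_like(cqt)
np.put_along_axis(v2, top4, 1.0, axis=0)

print(v1.shape, v2.shape)   # (12, T) and (128, T) frame-aligned melody conditions
```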
As shown in Table B below, MuseControlLite v2 outperforms v1 in FD and KL. While v2 exhibits a lower Mel acc. than v1, we recognize that this metric does not fully capture perceptual melody alignment. Thanks to the review comments, we conducted a listening test, detailed below, which confirms that v2 surpasses v1 in melody control performance.
Table B
| Model | Trainable Params | Total Params | FD | KL | CLAP | Mel Acc. |
|---|---|---|---|---|---|---|
| MusicGen-stereo-melody-large | 3.3B | 3.3B | 187.0 | 0.47 | 0.36 | 43.7% |
| Stable-audio ControlNet | 572M | 1.9B | 97.7 | 0.27 | 0.40 | 56.6% |
| v1 | 85M | 1.4B | 135.5 | 0.38 | 0.40 | 70.9% |
| v2 | 85M | 1.4B | 82.2 | 0.25 | 0.38 | 61.4% |
On Missing Listening Test:
Initially, we excluded a subjective evaluation due to the unavailability of the code and weights for Stable-Audio ControlNet. However, in response to multiple reviewers’ requests, we have conducted a listening test using examples from the Stable-Audio ControlNet project website (https://stable-audio-control.github.io/web/), despite the possibility that these samples may have been cherry-picked. For this evaluation, we recruited 34 participants and utilized the same text and melody conditions as those demonstrated on their website. We generated music using both our model and MusicGen-Melody, then compared these outputs with the samples retrieved from their demo page.
As shown in Table C below (mean opinion scores ∈ [1, 5]), our v2 model performs favorably against Stable-Audio ControlNet, despite requiring only about 1/6 of the trainable parameters. Moreover, we note that we used only the MTG-Jamendo dataset for training, while Stable-Audio ControlNet used four training datasets (MTG-Jamendo, FMA, MTT, Wikimute).
Table C
| Model | Text adherence | Melody similarity | Overall preference |
|---|---|---|---|
| MusicGen-stereo-melody-large | 3.12±0.25 | 2.67±0.23 | 3.06±0.23 |
| Stable-audio ControlNet | 3.69±0.28 | 4.17±0.23 | 3.65±0.25 |
| Ours v1 | 3.34±0.27 | 3.62±0.27 | 2.93±0.25 |
| Ours v2 | 3.58±0.20 | 4.21±0.20 | 3.63±0.22 |
We provide samples in the “Updated Melody-conditioned Comparison” section of the demo page to showcase the audio generated with the new melody condition. These samples are the same ones used in our subjective evaluation, with no cherry-picking at all.
On Usability:
We agree that creating complex melody or rhythm conditions can sometimes be challenging. Our solution is to provide a reference audio sample that contains the desired condition. For example, a user could record themselves humming or clapping, and we can post-process that audio to extract both the melody and rhythm conditions. This approach should also work for dynamics; alternatively, users can simply draw a dynamics curve, which our model will accept.
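As an illustration of this kind of post-processing, the sketch below derives rough rhythm and dynamics conditions with librosa; the synthetic click train (standing in for a clapping recording) and the specific extractors are assumptions, not our exact pipeline.

```python
import numpy as np
import librosa

# A clapping-like click train stands in for a user's reference recording.
sr = 44100
y = librosa.clicks(times=np.arange(0.5, 5.0, 0.5), sr=sr,
                   click_duration=0.05, length=5 * sr)

# Rhythm condition: frame-level onset strength (could also be binarized beats).
onset_env = librosa.onset.onset_strength(y=y, sr=sr, hop_length=512)
rhythm_cond = onset_env / (onset_env.max() + 1e-8)                   # (T,)

# Dynamics condition: smoothed RMS loudness curve in dB; a user could instead
# simply draw such a curve by hand.
rms = librosa.feature.rms(y=y, hop_length=512)[0]                    # (T,)
dyn_cond = np.convolve(librosa.amplitude_to_db(rms, ref=np.max),
                       np.ones(9) / 9, mode="same")

print(rhythm_cond.shape, dyn_cond.shape)
```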
On Questions for Authors:
Regarding which time-varying musical attributes provide the best perceptual control: for the melody condition, ours (v2) offers the best perceptual control in our opinion, since it provides more comprehensive conditional information. The misalignment between the melody conditions in v1 and v2 is due to the v1 version being limited to only one octave of information. For rhythm and dynamics, on the other hand, we consider the objective metrics to be well aligned with human perception.
Thank you for the rebuttal. I think the added experiments along with the subjective evaluation make this work more convincing. I also appreciate the v2 addition that fixes the melody representation with further improvements. With that said, I would like to point out that v2 has been a late addition, which makes consistent evaluation of the work a bit difficult for the reviewers.
Having seen the matching/improved results now with v2, can the authors discuss a bit more the motivation for being "lite" in attaching the adapters? While it is obvious that a smaller adapter would be preferred in general, readers may also wonder about the scalability of the method. For example, can it (especially now with v2) beat the baseline further if the user scales the size to a similar regime (e.g., 500M)? Or is the improvement capped at the presented size (85M)? I acknowledge the timeframe is limited for preparing the full result, so sharing the authors' preliminary observations would still be valuable at this point.
We sincerely thank the reviewer for the detailed feedback. Our motivation for making the model "lite" is to increase accessibility for users with limited computational resources for training and inference. Our approach is significantly more lightweight than ControlNet, while still offering similar fine-grained control capabilities.
Although we have already integrated decoupled cross-attention layers into every transformer block, there remains room to increase the number of trainable parameters by employing deeper neural architectures, rather than relying solely on single linear layers for the key and value projections.
Due to time constraints, we were unable to retrain the model with the scaled-up adapters before the April 8 deadline. However, we did evaluate the inference speed of our model:
- Original Stable-audio: 4.92 steps/second
- Ours with 85M trainable parameters: 4.88 steps/second
- Ours with 500M trainable parameters: 3.95 steps/second
In this test, we naively scaled up the key and value projections in the decoupled cross-attention layers using multiple linear layers and activation functions. All models were evaluated using fp32 precision during inference.
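For illustration, the kind of scale-up described above could look like the sketch below; the hidden sizes and the SiLU activation are assumptions, not the actual model dimensions.

```python
import torch.nn as nn

dim, cond_dim = 1536, 512      # illustrative sizes, not the model's actual dimensions

# Current adapter: a single linear projection per key/value.
proj_single = nn.Linear(cond_dim, dim)

# Naively scaled adapter: a deeper projection per key/value.
proj_scaled = nn.Sequential(
    nn.Linear(cond_dim, 4 * dim),
    nn.SiLU(),
    nn.Linear(4 * dim, dim),
)

def n_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

print(n_params(proj_single), n_params(proj_scaled))   # per-projection parameter counts
```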
We appreciate the reviewer’s suggestion and see this as a promising direction for future work. We plan to explore this further prior to the open-source release, and we will include training results with the scaled adapters in the camera-ready version.
Within the domain of raw audio music generation, motivated by the need for (1) lighter alternatives for fine-tuning, together with the need for (2) better control accuracy (i.e., for the user), the authors propose MuseControlLite, a system for time-varying condition control for music generation. MuseControlLite reduces parameter count relative to, e.g., Music ControlNet (Wu et al 2024), by using decoupled cross-attention layers (Ye et al 2023) in a diffusion transformer, and then, to get this to work well, they incorporate a modification of rotary positional embedding (Su et al 2024).
The resulting model supports joint attribute and audio control, and they train a separate set of adapters for audio conditioning to allow inpainting and outpainting. Results are presented on the public Song Describer benchmark (Manco et al 2023) and they report a 14% improvement in melody accuracy. Example generated audio files are provided on a demonstration website.
Questions for Authors
If a misunderstanding is leading to my expectations being unreasonable for the quality of audio samples (see my comment above in the "Claims and Evidence" section), then I would be glad to try to identify the source of the misunderstanding so that it may be corrected. Otherwise, I would like to see a discussion about the generated audio, as I explained earlier.
This is an impressive system, and my score is a placeholder; if the above issue can be resolved, I will raise it.
[EDIT APRIL 7 -- Raised Score]
Claims and Evidence
Yes, the claims are overall well supported.
One problematic claim is the somewhat implicit claim that the model handles multiple conditioning signals well (e.g. “These results suggest that the model effectively learns to respond to multiple controls simultaneously, despite the added complexity.” in Section 5.2; and “a lightweight training method that [..] enables precise control of music generation under specified musical attribute conditions” in Section 6, etc). However, listening to the demonstration audio files on the provided website (https://musecontrollite.github.io/web/) indicates that while some certainly do sound good (impressive!), there are quite a few that do not sound good, and/or do not effectively achieve what seems to be the intention. To list a few examples:
- In “Dynamics Control”, the “recording of a melodic piano solo” neither sounds like a piano solo (it has other instruments) nor is it melodic; it’s almost purely textural.
- Also in “Dynamics Control”, the jazz band has the right instrumentation, but is musically incoherent (and I appreciate free jazz, but that is not the issue in this case :)
- In “Melody Control”, the jazz band version of the Chopin Eb nocturne (Op 9 No 2) does not reflect the melody other than a few seconds here and there; even an experienced musician who knows that Nocturne would likely be unable to guess that that is where the melody is coming from. (See also my comment in the next section on metrics)
- In “Rhythm Control”, for the Mozart (“Eine Kleine…”) / cello quartet combination, the simple solution would have been to simply return the same basic piece but with a more legato sound, as requested in the text—it is practically already almost in a harmonized string quartet arrangement, whereas the generated example doesn’t sound “harmonized” as requested, and sounds more just like repeated notes (which is fine but not really addressing the text prompt).
- These are just a few examples; there are others.
I still think that this is an impressive system, and the quality of the audio output, overall, is good! So I believe that the above issue could easily be addressed by (a) distinguishing between quantitative results (which seem to be relatively good) and perceptual quality (which seems to be variable), and (b) adjusting the language/tone in a few places to be more aligned with/reflective of the actual audio outputs, and (c) providing some qualitative discussion about all this, with pointers to some of the examples.
Methods and Evaluation Criteria
Relative to the conventions within the music generation community, yes, the evaluation criteria make sense. The metrics include Melody Accuracy, Dynamics Correlation (i.e. the correlation between the dynamics curve of the generated audio with the ground truth), Rhythm F1 (a fairly standard, if somewhat problematic, way to evaluate beat alignment), and self-similarity-matrix-based Novelty Value (Muller 2015). All of these are reasonable choices.
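For reference, the sketch below gives simplified, assumed definitions of two of these control metrics (frame-wise chroma agreement for melody accuracy, and Pearson correlation between loudness curves for dynamics); the paper's exact evaluation code may differ.

```python
import numpy as np
import librosa

def melody_accuracy(y_ref, y_gen, sr=44100, hop=512):
    """Fraction of frames whose strongest chroma bin matches between reference and generation."""
    c_ref = librosa.feature.chroma_cqt(y=y_ref, sr=sr, hop_length=hop)
    c_gen = librosa.feature.chroma_cqt(y=y_gen, sr=sr, hop_length=hop)
    n = min(c_ref.shape[1], c_gen.shape[1])
    return float(np.mean(c_ref[:, :n].argmax(axis=0) == c_gen[:, :n].argmax(axis=0)))

def dynamics_correlation(y_ref, y_gen, hop=512):
    """Pearson correlation between the two frame-level loudness (RMS, in dB) curves."""
    d_ref = librosa.amplitude_to_db(librosa.feature.rms(y=y_ref, hop_length=hop)[0])
    d_gen = librosa.amplitude_to_db(librosa.feature.rms(y=y_gen, hop_length=hop)[0])
    n = min(len(d_ref), len(d_gen))
    return float(np.corrcoef(d_ref[:n], d_gen[:n])[0, 1])
```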
However, my comments in the “claims/evidence” section above point to an example (the Chopin Nocturne (Op9 No2) / Jazz band in “Melody Control”) where the pitch chroma might be somewhat matched (i.e. perhaps decent “melodic accuracy”), but perceptually speaking, the generated melody is effectively unrecognizable. Evaluating generative models is difficult and there are no great solutions at the moment. So, while these metrics are reasonable in context of prior work and available tools, they are very limited. These limitations—both of the system and also of the evaluation metrics—simply need to be acknowledged.
(It is possible that there is something I am fundamentally misunderstanding about the control process, such that my expectations are incorrect; if so I would be glad to be corrected. I think this is slightly unlikely, because there are other examples that do sound good and “as I expected”.)
In terms of baseline comparisons, there appears to be no other system that accepts the combination of controls that MuseControlLite accepts, and therefore there would be no direct comparison in any case, and some other related systems are infeasible for comparisons for yet other reasons, so in that context, the baselines chosen (i.e. MusicGen, Stable Audio Open ControlNet, and a simple baseline implemented by the authors using Stable Audio Open) also do make sense.
Critically, the authors also provide a fairly extensive set of examples to listen to. (It would be nice if they were marked as “cherry picked” vs “random”, e.g. as done for the ControlNet paper). This is a valuable and important part of a paper on generative methods for audio.
Theoretical Claims
N/A
Experimental Design and Analysis
I read the experimental descriptions and analyses and overall did not notice any issues.
I did notice that perhaps some of the audio examples that “didn’t sound good” were cases where it was just a particularly challenging task, e.g. the Chopin Nocturne / jazz band is not a simple request. Again, it is not clear how to analyze/evaluate/quantify this, but it’s possible that if one were able to explore some of this systematically, then ultimately it might work “in favour” of the proposed system, e.g. the jingle bells-xylophone pairing (in the “Rhythm Control” section) is inherently easier and indeed works reasonably well; perhaps the system does better on the easier ones, which would be entirely fair.
Supplementary Material
I did not read the appendix carefully (providing details on the separated guidance scale formulation). I did listen to many/most of the accompanying demonstration audio clips on the provided website.
Relation to Prior Literature
I believe the authors do a good job of relating the paper’s contributions to the broader literature, e.g. the need for more/better/local/time-varying control, and the relationship to a variety of other audio music generation models, as well as to a few relevant diffusion models more generally.
Missing Important References
N/A
Other Strengths and Weaknesses
This paper demonstrates an impressive level of engineering combined with some creative innovation and insight in how (and why) to put together many complicated moving parts. In particular, the use of rotary positional embeddings for getting the decoupled cross-attention to work well was very nice.
Other Comments or Suggestions
N/A
We sincerely thank the reviewer for the insightful and valuable feedback, which has inspired us to make several significant updates to our work, as detailed below. We hope the reviewer will agree that these revisions greatly enhance the scientific quality of the paper.
On Melody Representation:
Thanks to your valuable comment, we have identified an oversight that caused the perceived inferiority of our model’s output on the original demo website. Specifically, our initial model (denoted as v1) adopted the melody representation from MusicGen-Melody, which offers lower pitch resolution compared to that employed by Stable-Audio ControlNet. To ensure a fair comparison with Stable-Audio ControlNet, we have retrained our model and developed a new version (v2), which now aligns with Stable-Audio ControlNet’s melody representation. This adjustment significantly enhances the perceptual quality of our generated samples.
- v1: This version employs a one-hot 12-pitch-class chromagram as the melody condition, the same as MusicGen-Melody. However, this melody representation lacks octave specificity, causing the model to misinterpret pitch information.
- v2: This version adopts a top-4 128-pitch-class CQT to represent the melody condition, as proposed by Stable-Audio ControlNet. To ensure a fair comparison with Stable-Audio ControlNet, we modified only the conditioning input, leaving the remainder of the pipeline unchanged.
As shown in Table B below, MuseControlLite v2 outperforms v1 in FD and KL. While v2 exhibits a lower Mel acc. than v1, we recognize that this metric does not fully capture perceptual melody alignment. Thanks to the review comments, we conducted a listening test, detailed below, which confirms that v2 surpasses v1 in melody control performance.
Table B
| Model | Trainable Params | Total Params | FD | KL | CLAP | Mel Acc. |
|---|---|---|---|---|---|---|
| MusicGen-stereo-melody-large | 3.3B | 3.3B | 187.0 | 0.47 | 0.36 | 43.7% |
| Stable-audio ControlNet | 572M | 1.9B | 97.7 | 0.27 | 0.40 | 56.6% |
| v1 | 85M | 1.4B | 135.5 | 0.38 | 0.40 | 70.9% |
| v2 | 85M | 1.4B | 82.2 | 0.25 | 0.38 | 61.4% |
On Missing Listening Test:
Initially, we excluded a subjective evaluation due to the unavailability of the code and weights for Stable-Audio ControlNet. However, in response to multiple reviewers’ requests, we have conducted a listening test using examples from the Stable-Audio ControlNet project website (https://stable-audio-control.github.io/web/), despite the possibility that these samples may have been cherry-picked. For this evaluation, we recruited 34 participants and utilized the same text and melody conditions as those demonstrated on their website. We generated music using both our model and MusicGen-Melody, then compared these outputs with the samples retrieved from their demo page.
As shown in Table C below (mean opinion scores ∈ [1, 5]), our v2 model performs favorably against Stable-Audio ControlNet, despite requiring only about 1/6 of the trainable parameters. Moreover, we note that we used only the MTG-Jamendo dataset for training, while Stable-Audio ControlNet used four training datasets (MTG-Jamendo, FMA, MTT, Wikimute).
Table C
| Model | Text adherence | Melody similarity | Overall preference |
|---|---|---|---|
| MusicGen-stereo-melody-large | 3.12±0.25 | 2.67±0.23 | 3.06±0.23 |
| Stable-audio ControlNet | 3.69±0.28 | 4.17±0.23 | 3.65±0.25 |
| Ours v1 | 3.34±0.27 | 3.62±0.27 | 2.93±0.25 |
| Ours v2 | 3.58±0.20 | 4.21±0.20 | 3.63±0.22 |
We provide samples in the “Updated Melody-conditioned Comparison” section of the demo page to showcase the audio generated with the new melody condition. These samples are the same ones used in our subjective evaluation, with no cherry-picking at all.
On Variable Quality of the Examples on the Demo Page:
The audio samples on our initial demo page were selected at random, resulting in variable quality. Regarding the dynamics and rhythm control samples specifically noted by the reviewer, we wish to clarify that the observed text adherence challenges stem primarily from the characteristics of the training data utilized during both the pre-training and fine-tuning phases.
- Fine-Tuning Dataset: The Jamendo dataset, employed for MuseControlLite, includes limited representation of classical instruments, which constrains the model’s ability to capture such timbres effectively.
- Pretraining Dataset Text Descriptions: The pretrained Stable Audio model was trained on data with limited musical specificity. Consequently, it struggles to interpret nuanced musical terms such as "melodic," "legato," "harmonized," and other concepts rooted in jazz theory.
We have updated our demo website in the “Highlighted Audio” section to better showcase the inherent limits of the pretrained model (Stable Audio) in terms of text adherence.
I thank the authors for their detailed rebuttals, and for the update to (v2) and associated explanations, and experiments!
(Indeed it sounded like chroma, but not necessarily octaves, were previously being matched, so this all made sense.)
v2 clearly matches melodies better than v1! Nice!
I do have a few notes and questions:
- in the "Updated Melody-conditioned Comparison", example 3 (starts with a low solo plucked banjo-like sound with a bluegrassy band coming in at ~0:09.5), (v2) is better than (v1), but it still completely fails to get the "piano" sound. I understand the issue with musical-text descriptions that the authors mention in their rebuttals, but it's interesting that "solo piano" is so hard. It's also interesting that all the models break down on this one in one way or another. This is not a critical issue, I just think it's good to highlight where things don't currently work well.
- in the same section, example 2 (piano solo in minor key), one prompt says "tabla used for percussion in the middle". I hear almost no percussion at all (or am I missing something?), and definitely no clear tabla anywhere. For the same example, another prompt says "string ensemble", but the percussive onset of the piano is still very clearly there: the model knows what a "string ensemble" is (the sustain sounds like strings), but it has issues removing the piano attack. Again, all this is not surprising, but I think it's important to highlight where things don't work well.
- For the same section, example 5 (starts with a solo viola-ish sound with more layers added after a couple of seconds), one prompt says "cheerful piano performance", and the piano sound is kind of there, but the string sound is also still kind of there, and at around 0:13-0:17 the model struggles with the string dynamics in a way that reminds me a bit of the struggles it has with modifying the Beethoven symphony in the "Melody, Rhythm & Dynamics Control" section. Again, it's OK, but just acknowledge where things don't work well.
- I could list many more examples that I hear in the audio, but at some point I believe it is the authors' responsibility and role to acknowledge and discuss such hard-to-measure but perceptually salient observations.
I really appreciate that the authors provided extensive demo materials to allow the readers this kind of observation in the first place! My original review comment still holds: "I still think that this is an impressive system, and the quality of the audio output, overall, is good! So I believe that the above issue could easily be addressed by (a) distinguishing between quantitative results (which seem to be relatively good) and perceptual quality (which seems to be variable [though now Improved with v2!!]), and (b) adjusting the language/tone in a few places to be more aligned with/reflective of the actual audio outputs, and (c) providing some qualitative discussion about all this, with pointers to some of the examples." (unless the authors addressed this somewhere and I missed it? if so, I apologize, and please point me to it.)
- To double check: (v2)-generated samples were added to the demo page only in that first section where they are explicitly labelled as such, and previous samples were created with (v1) and left as is, is that correct? I ask because I am still curious whether using (v2) would help with the Chopin nocturne example (Ex 3 in "Melody Control"), which I think is still the previous version? and/or any of the other examples that I listed in my review? Or is the issue something else? (Or is there some reason I missed that v2 cannot be used in this context?)
I am still inclined to raise my score, because I think this is good work; at this point I would simply need to see these framing issues addressed.
We sincerely thank the reviewer for the detailed feedback.
In the camera-ready version, we will include a qualitative discussion and elaborate on the common failure modes of our model, as outlined below:
Text Adherence and Training Data Alignment
- We have found that using text prompts closely aligned with the training data (including both the Stable-Audio pretraining and our fine-tuning data) improves the text adherence of the generated music.
- However, please note that the samples in the Updated Melody-conditioned Comparison section on the demo page originate from the Stable-Audio ControlNet demo and may have been generated with an LLM, so their style may differ from that of the pretraining and fine-tuning text.
Instrumental Residuals in Melody Conditioning
- We observed that, although not frequently, certain instruments exhibit distinct patterns in the melody condition that the model recognizes. As a result, even if an instrument is not explicitly mentioned, the model may still render its sound.
- For example:
- Example 3 (Pop solo piano…): The string contour was not completely eliminated.
- Example 2 (A string ensemble…): The piano attack remains.
- These observations suggest that the melody condition may sometimes include timbre information.
Potential Remedies
We believe that this issue could be mitigated by:
- Increasing the guidance for text, or decreasing the guidance for musical attributes.
- Reducing the percentage of dropped text conditions during fine-tuning.
- Originally, we dropped the text condition 50% of the time during fine-tuning; increasing the fraction of training steps that see the text condition may improve text adherence.
Hard-to-Measure but Perceptually Salient Observations
We have noticed that the instrumentation in the generated audio does not always precisely align with the text prompt. It appears that the CLAP score may have limitations in distinguishing between multiple instruments. Moreover, if the melody condition retains timbre information from the reference audio, the final output can sometimes reflect a fusion of timbres from both the text prompt and the reference audio.
Clarifications on demo page examples
- All melody-conditioned samples were generated using version (v1) when not explicitly labeled.
- The Chopin nocturne example has been updated in the Highlighted Audio section.
We hope these responses address the reviewer’s comments effectively.
The paper introduces MuseControlLite, a lightweight fine-tuning mechanism for text-to-music generation that extends previous control work.
Its main contributions include a new adapter design using decoupled cross-attention with positional embeddings for time-varying musical attributes.
The model claims to control melody, rhythm, and dynamics—and supports both inpainting and outpainting tasks—with significantly fewer trainable parameters compared to some existing methods.
Questions for Authors
N/A
Claims and Evidence
I personally like this work, which shows great results, but I need to point out that the claims need more evidence and a wider, fairer comparison.
While the paper presents experimental results that show improvements in control accuracy—particularly in melody control—the evidence for some claims is not entirely convincing.
For instance, the parameter efficiency claim is undermined by an unfair comparison: one baseline (coco-mulla) reportedly uses only 4% of parameters (about 60M), yet is dismissed by the authors as unsuitable for comparison.
This raises questions about whether the reported gains in control precision are solely attributable to the proposed design.
Methods and Evaluation Criteria
The proposed method of integrating positional embeddings into decoupled cross-attention layers appears reasonable for managing time-varying conditions in music generation.
The evaluation criteria (including melody accuracy, rhythm F1 score, and audio realism metrics) are standard and appropriate for this domain.
However, the method largely extends existing approaches rather than introducing fundamentally new ideas, which somewhat limits its novelty.
Theoretical Claims
The paper does not provide deep theoretical proofs or rigorous analyses to substantiate its claims.
While the discussion on the importance of positional embeddings is interesting, the theoretical foundation remains somewhat informal.
No detailed proof is provided for the improvements claimed, so the correctness of any theoretical claims is not thoroughly validated.
Experimental Design and Analysis
The experiments are comprehensive, addressing multiple control aspects (melody, rhythm, dynamics) and tasks (inpainting and outpainting).
However, the experimental design could benefit from a more balanced comparison against baselines—especially regarding the tuning parameter counts.
Additionally, the paper does not explicitly discuss its limitations, which makes it difficult to assess the potential trade-offs and areas where the method may fall short.
Supplementary Material
N/A
Relation to Prior Literature
N/A
Missing Important References
N/A
Other Strengths and Weaknesses
N/A
Other Comments or Suggestions
Overall, while the work is interesting and provides a useful baseline—especially if the code is open-sourced—the contribution in terms of theoretical innovation is limited. The paper might be better suited for an applications-focused venue (e.g., ISMIR) rather than a flagship conference like ICML.
We sincerely appreciate your supportive feedback and hope that our responses below address and alleviate your major concerns.
On the Need of More Empirical Evidence:
As elaborated in our response to Reviewer c3ce, we conducted a user study to bolster the empirical rigor of our paper. Initially, we excluded a subjective evaluation due to the unavailability of the code and weights for the key baseline, Stable-Audio ControlNet. However, in response to multiple reviewers’ requests, we have conducted a listening test by incorporating examples from the Stable-Audio ControlNet project website (https://stable-audio-control.github.io/web/), despite the possibility that these samples may have been selectively chosen by the authors. For this evaluation, we recruited 34 participants and utilized the same text and melody conditions as those demonstrated on their website. We generated music using both our model and MusicGen-Melody, then compared these outputs with the samples retrieved from their demo page. We put the detailed results of this listening test in our response to Reviewer 5zZP as Table C. The results demonstrate that, when equipped with the same melody representation as Stable-Audio ControlNet, our model performs favorably, despite requiring significantly fewer trainable parameters. We believe this strengthens the robustness of our findings.
On Missing Comparison with Coco-Mulla:
Coco Mulla employs parameter-efficient fine-tuning (PEFT) to enhance the pretrained MusicGen model, enabling control over chord, rhythm, and piano roll features. It would have served as a great baseline for our study, facilitating comparisons with both larger adapters (e.g., ControlNet) and models with varying numbers of trainable parameters (e.g., Coco Mulla). However, a direct comparison is confounded by differences in rhythm representation between our approach and Coco Mulla’s, as well as by the distinct pretrained backbone models utilized. Additionally, Coco Mulla’s prefix-based conditioning strategy is optimized for language models (LMs) rather than diffusion models, rendering it incompatible with the diffusion-based architecture of Stable Audio Open. Consequently, we have excluded this empirical comparison from our paper.
Nevertheless, we concur with the reviewer’s observation that Coco Mulla’s parameter-efficient fine-tuning approach for text-to-music generation merits recognition. We will revise the final version of the paper to acknowledge this contribution appropriately.
On Missing Discussion on Limitations:
We concur that such discussions are essential for evaluating potential trade-offs and identifying limitations in our approach. Accordingly, we intend to incorporate the following discussions of the weaknesses of our model into the final version of the paper:
- The fine-tuning approach, which employs decoupled cross-attention along with rotary positional embedding and zero convolution, becomes unnecessary when training from scratch is feasible.
- The generated distribution of our model is largely influenced by the training dataset used for the pretrained backbone.
- Using multiple classifier-free guidance terms requires passing multiple batches during inference, which slightly reduces inference speed (a generic sketch of such a guidance combination follows below).
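For clarity, the sketch below shows one common way to combine multiple classifier-free guidance terms; the exact separated-guidance formulation in Appendix A may differ, and `denoise` together with the guidance weights is a hypothetical stand-in for the diffusion backbone.

```python
def guided_noise(denoise, x_t, t, text, attrs, w_text=7.0, w_attr=2.0):
    """denoise(x_t, t, text, attrs) -> predicted noise; None means the condition is dropped.
    The three evaluations are typically run as a single batched forward pass of size 3."""
    e_uncond = denoise(x_t, t, None, None)
    e_text = denoise(x_t, t, text, None)
    e_full = denoise(x_t, t, text, attrs)
    return (e_uncond
            + w_text * (e_text - e_uncond)   # guidance toward the text prompt
            + w_attr * (e_full - e_text))    # separate guidance toward the musical attributes

# Toy check with scalars standing in for noise predictions: 0 + 7*(0.5-0) + 2*(1-0.5) = 4.5
def toy(x_t, t, text, attrs):
    return 1.0 if attrs else (0.5 if text else 0.0)

print(guided_noise(toy, 0.0, 0, "text prompt", "melody"))   # 4.5
```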
Thank you for your response. While I appreciate your efforts to address my concerns, I must admit that I remain unconvinced on several key points.
Regarding the subjective evaluation, acknowledging the limitations of using potentially biased samples from the Stable-Audio ControlNet website does not fully alleviate my concern about the lack of consistent improvement shown by your model. The fact that your model doesn't consistently outperform baselines across all metrics raises questions about the actual effectiveness of your proposed method. Simply stating the inherent challenges of subjective evaluation doesn't negate the data presented, which suggests the improvements are not as clear-cut as initially claimed.
On the matter of comparing with Coco-Mulla/AIR-Gen, while I understand the technical differences you highlight, your initial dismissal of this comparison felt inadequate. The argument about architectural incompatibility with this model feels somewhat weak, especially since you did compare with the original MusicGen, which also has its own architectural nuances. Coco-Mulla (or AIR-Gen) represents a readily accessible and relevant baseline for parameter-efficient conditional music generation, and I am okay if you can find some alternative baselines for a more thorough comparison. While a qualitative discussion is better than nothing, it doesn't provide the necessary quantitative grounding to truly assess the parameter efficiency claim in a fair context. Your work shares a similar goal, and therefore a more direct comparison, even with its challenges, would have been significantly more informative.
Therefore, while I acknowledge your willingness to include a more detailed discussion in the revised paper, my core concerns about the lack of robust empirical evidence and fair comparisons, particularly concerning parameter efficiency, are not fully addressed by this rebuttal. My initial recommendation of weak reject still stands.
Thank you for your response.
Regarding the subjective evaluation:
- The subjective evaluation for Stable-audio was conducted using samples from their official demo website. These may be cherry-picked, which raises concerns about their generalizability. In contrast, our evaluations, as well as those of the other models, are conducted in the wild, ensuring a fairer and more realistic performance assessment.
- Our model is trained using significantly fewer resources:
- We use <1/6 of the trainable parameters compared to Stable-audio ControlNet.
- We rely solely on the MTG-Jamendo dataset, whereas Stable-audio ControlNet uses a combination of MTG-Jamendo, FMA, MTT, and Wikimute.
Despite these constraints, our model achieves comparable performance in both subjective and objective metrics. This demonstrates that our method is highly effective, especially considering the resource disadvantage. We do not claim to outperform all baselines on every metric, but our results clearly indicate strong performance under fair and realistic conditions.
Regarding Coco-Mulla
Coco-mulla differs from our method, MuseControlLite, in several key aspects:
Conditioning Method
- Coco-mulla uses a quantized codec representation from drum tracks (separated by Demucs), which limits its applicability to audio that doesn't include drums.
- In contrast, MuseControlLite extracts rhythmic features directly from the audio, making it more flexible and broadly applicable.
Architecture
- We employ decoupled cross-attention across all transformer blocks in our adaptation of Stable-audio, which amounts to about 6% of the pretrained backbone. It is possible to employ decoupled cross-attention in only a few blocks, as done in Coco-mulla, which would achieve fewer trainable parameters, but due to time constraints, we were not able to explore this configuration in our current experiments.
Parameter & Inference Efficiency
- Although Coco-mulla uses less than 4% of trainable parameters, its inference speed is significantly slower due to its auto-regressive architecture and prefix conditioning method.
Using the official implementation and MusicGen-Large as the backbone, we benchmarked 20-second audio generation on a single RTX 3090:
- MusicGen-Large: 53.95 seconds
- Coco-mulla: 101.03 seconds
➝ Coco-mulla is about 87% slower than its own backbone.
In comparison, MuseControlLite introduces minimal slowdown:
- Original Stable-audio: 4.92 steps/sec
- MuseControlLite (85M trainable params): 4.88 steps/sec
➝ Only about 1% slower.
Additional Evaluation on Song Describer Clips
To ensure a fair evaluation, we manually selected 30 clips from the Song Describer dataset, ensuring each clip contained drums (as coco-mulla requires). Evaluation results are shown below:
| Model | Trainable Params | Total Params | FD | KL | CLAP | Rhythm F1 |
|---|---|---|---|---|---|---|
| coco-mulla | *132M | 3.3B | 217.94 | 0.47 | 0.36 | 0.63 |
| MuseControlLite | 85M | 1.4B | 216.27 | 0.48 | 0.39 | 0.87 |
*Note: While the reviewer mentioned that coco-mulla only uses 60M trainable parameters, based on their official code and model specs (including a hidden size of 2048), we estimate the correct figure to be around 132M, consistent with MusicGen-Large.
FD, KL, and CLAP are computed using the Stable-audio evaluation metrics. Rhythm F1 is computed using madmom, consistent with both Coco-mulla and our evaluation. Our model outperforms Coco-mulla on FD, CLAP, and Rhythm F1, while achieving comparable KL divergence, showcasing strong and balanced performance. We believe our work demonstrates robustness, efficiency, and flexibility under realistic constraints. We have been mindful to conduct comprehensive and fair experiments, and we appreciate the opportunity to present our findings.
The paper introduces MuseControlLite, a parameter-efficient methodology for aligning a pre-trained, DiT-based text-to-music model, to both symbolic and audio controls. The authors demonstrate the capability of MuseControlLite to extend the controllability of a pre-trained StableAudio-Open model, from text prompts, to conditioning on melodies, dynamics, rhythm and audio excerpts for inpainting and outpainting, with light-weight zero-convolution additive adapters operating on decoupled cross-attention layers.
Questions for Authors
n/a
Claims and Evidence
The claims made in the submission are mostly clear and supported by evidence. However, I have the following concerns:
1. The authors present the incorporation of both symbolic and audio controls in music generation as a core contribution of the paper. This claim isn’t aligned with prior work. JASCO, e.g., a prior work cited by the authors, is a trainable text-to-music model that combines text, symbolic, and audio controls.
2. The experimental section lacks a subjective evaluation measuring the performance of MuseControlLite in terms of control adherence and, more importantly, in terms of perceptual quality compared to the baselines. Listening to samples from the demo page, it is apparent that MuseControlLite is likely inferior in terms of audio quality and musicality compared to the baselines. This may be misaligned with the trend implied by the objective evaluation, in which MuseControlLite is on par with the baselines in terms of quality and musicality.
3. In Section 5.2 ("Ablation Study for Musical Attribute Conditions"), the authors present the reduction in FD, KL, and CLAP obtained by adding musical controls as evidence of improved audio quality and semantic alignment with the reference dataset. While this argument is partially sound, the authors didn’t address the uncertainty of potential information leakage when conditioning on musical attributes that are a function of the ground-truth audio. Specifically, the significant reduction in FD obtained by introducing melody conditioning might instead be attributed to additional information about the GT samples rather than to a quality improvement. To make the argument more valid, the authors should experiment with style-transferred samples, e.g., the original melody with an out-of-genre text prompt or vice versa, and validate the trends with a human study.
Methods and Evaluation Criteria
In general, the methodology and the evaluation criteria make sense for the problem and for supporting the proposed approach. However, the lack of subjective evaluation significantly reduces my confidence in the effectiveness of the proposed technique.
Theoretical Claims
I briefly checked the derivation of the multi-source classifier free guidance in appendix A, and I didn’t find any issues.
Experimental Design and Analysis
I checked the design of the following experiments:
- Baseline comparison in terms of quality and melody adherence.
- Ablation on the influence of the different musical attributes on MuseControlLite performance.
- Comparison to baselines in terms of inpainting and outpainting.
I didn’t find major issues beyond what was previously mentioned in the “Claims and Evidence” section.
Supplementary Material
I thoroughly reviewed the audio samples provided by the authors in the paper demo page.
Relation to Prior Literature
The paper expands prior work on temporally controlled music generation by introducing an adaptor-based method that is significantly more efficient in terms of having a lightweight set of trained parameters compared to prior ControlNet-based approaches, such as MusicControlNet. As opposed to MusicControlNet’s control adaptation technique, which requires a set of new weights on the same order of magnitude as the original model, here the adaptor consists of a parameter set that is an order of magnitude lighter than the baseline StableAudio-Open model. In terms of the set of musical controls, the paper provides a set of controls that is more diverse than MusicControlNet’s, but very similar to the control set proposed by DITTO.
Missing Important References
n/a
Other Strengths and Weaknesses
Weaknesses:
- Fixed mask positioning, at least in the demo page, limiting the applicability of the system.
- Qualitative inpainting results expose a significant degradation of the audio quality in the inpainted areas. The transitions between original and inpainted regions do not sound smooth.
Strengths:
- The ablation on the effect of introducing RoPE into decoupled cross-attention adapters demonstrates a clear and convincing trend.
- Planned open-sourcing of code and model checkpoints.
Other Comments or Suggestions
- Typo in figure 1, “farword” instead of “forward”.
- Line 257 - “Since the segments controlled by c_audio are more rigid, we propose to use musical attribute conditions to flexibly control the masked audio segments.” - this is unclear
- Line 295 - “as we found that latent space length has only a minor influence on audio quality.” - Please add an explanation in addition to the empirical observations.
- Baselines - section 4.4 - “or not generating a relatively short audio” should be “or generating a relatively short audio”
We sincerely thank the reviewer for the insightful and valuable feedback. In response to your comments, we have made several plans to update to our submission, as outlined below.
On Claims & Evidence:
Firstly, you are right that JASCO has effectively integrated symbolic and audio controls. We will revise the paper to reflect this accurately. For example, in the second contribution outlined at the end of Section 1, we will update the phrasing from "first trainable model" to "first trainable lightweight adapter".
For the reviewer's information: JASCO differs from ours in that it is trained from scratch and generates 10-second audio. Moreover, JASCO's audio condition is quantized to facilitate style transfer, while we use full-resolution audio for in/out-painting.
Secondly, we initially omitted a subjective evaluation due to the unavailability of the code and weights for the key baseline, Stable-Audio ControlNet. However, in response to multiple reviewers’ requests, we have conducted a listening test by incorporating examples from the Stable-Audio ControlNet project website, despite the possibility that these samples may have been cherrypicked. We recruited 34 participants and used the same text and melody conditions as those demonstrated on their website. We generated music using both our model and MusicGen-Melody, then compared these outputs with the samples retrieved from their demo page. Due to space limit, we present the detailed results in our response to Reviewer 5zZP as Table C. The results show that, when equipped with the same melody representation as Stable-Audio ControlNet, our model performs favorably, despite requiring significantly fewer trainable parameters. We appreciate the reviewer’s suggestion, which prompted this valuable addition.
We wish to highlight that the perceived inferiority of our model’s output on our original demo site stemmed from our use of a different melody representation than that of Stable-Audio ControlNet. This oversight was identified thanks to a comment from Reviewer 5zZP. Specifically, our initial model (denoted as v1) adopted the melody representation from MusicGen-Melody, which offers lower pitch resolution compared to that employed by Stable-Audio ControlNet. For fair comparison, we have retrained our model and developed a new version (v2), which now aligns with Stable-Audio ControlNet’s melody representation. This change significantly enhances the perceptual quality of our generated samples. We have updated our demo website accordingly, and we invite the reviewer to explore the improved results.
Finally, we share the reviewer’s intrigue regarding potential information leakage and have addressed this concern by implementing the suggested style transfer experiment. Using the Song Describer Dataset, we divided it into two disjoint subsets. We generated samples by pairing text from the first subset with attributes extracted from the second subset, ensuring that the musical attributes are independent of the ground-truth audio. The generated samples were then evaluated against the first subset as the reference set. As presented in Table A below, the results align with our prior findings: using more conditions improves the FD and KL scores. We will replace Table 4 of our paper with this new Table A.
In addition, since the text and melody conditions in the aforementioned user study were from distinct music clips, we have also extended this "style transfer" approach to a human evaluation.
Table A
| Group | Melody | Rhythm | Dynamics | FD | KL | CLAP | Mel Acc. | Rhythm F1 | Dyn. Corr. |
|---|---|---|---|---|---|---|---|---|---|
| None | – | – | – | 185.48 | 0.67 | 0.36 | 0.10 | 0.22 | 0.08 |
| Single | ✓ | – | – | 139.89 | 0.49 | 0.37 | 0.69 | 0.43 | 0.17 |
| Single | – | ✓ | – | 158.64 | 0.62 | 0.34 | 0.10 | 0.85 | 0.47 |
| Single | – | – | ✓ | 189.05 | 0.64 | 0.34 | 0.10 | 0.49 | 0.93 |
| Double | ✓ | ✓ | – | 124.38 | 0.49 | 0.34 | 0.70 | 0.87 | 0.52 |
| Double | ✓ | – | ✓ | 145.00 | 0.42 | 0.38 | 0.69 | 0.67 | 0.94 |
| Double | – | ✓ | ✓ | 173.93 | 0.56 | 0.32 | 0.10 | 0.88 | 0.95 |
| All | ✓ | ✓ | ✓ | 138.18 | 0.47 | 0.35 | 0.70 | 0.86 | 0.95 |
On Other Weaknesses:
On our demo page, we used fixed mask positioning mainly for simplicity. However, during training, we applied random masking ranging from 10% to 90% of the audio condition. This enables arbitrary mask sizes at inference time, ensuring applicability.
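As an illustration, random span masking over latent frames could be implemented as in the sketch below; the contiguous-span assumption and the frame granularity are illustrative and may differ in detail from our exact training scheme.

```python
import torch

def random_audio_mask(n_frames: int, min_frac: float = 0.1, max_frac: float = 0.9) -> torch.Tensor:
    """Boolean mask over latent frames; True marks the contiguous span to be generated."""
    frac = float(torch.empty(1).uniform_(min_frac, max_frac))
    span = max(1, int(frac * n_frames))
    start = int(torch.randint(0, n_frames - span + 1, (1,)))
    mask = torch.zeros(n_frames, dtype=torch.bool)
    mask[start:start + span] = True
    return mask

print(random_audio_mask(1024).float().mean())   # masked fraction, roughly in [0.1, 0.9]
```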
The suboptimal audio quality observed in the inpainted examples likely stems from an overly challenging configuration in our initial setup. Specifically, we masked the middle 20s (2/3 of the total audio duration) while providing only the first 5s and last 5s as context. In contrast, existing models (e.g., DITTO) typically adopt a less demanding setting, such as masking the middle 1/3 of the audio. We acknowledge this oversight and plan to rerun the experiment using this more standard configuration to see whether it yields improved audio quality.
I thank the authors for the clarifications and for the additional experiments, adequately answering the main concerns reflected in my review. I raised the score to 3.
Dear Reviewer, I appreciate your recognition of the clarifications and additional experiments we provided. We are grateful for the time and consideration you devoted to reviewing our work.
Sincerely,
This paper introduces MuseControlLite, a lightweight methodology to finetune a DiT-based text-to-music model for precise music control. During the author response period, comprehensive experiments were conducted and the requested revisions were submitted by the authors. As a result, most reviewers increased their scores, acknowledging that "the authors have indeed made some attempt to incorporate/respond all of my requests" and "I think the added experiments along with the subjective evaluation would make this work more convincing". Although one reviewer still has concerns about the empirical evidence for the baseline comparison, the authors have tried their best to alleviate the issues. One reviewer even highlights that "I think this paper fits well as a mixed combination of Application-Driven Machine Learning together with Deep Learning (generative models)". Overall, it is an adequate paper for ICML. Therefore, I recommend acceptance of this paper.