Efficient Fine-Grained Guidance for Diffusion Model Based Symbolic Music Generation
We introduce Fine-Grained Guidance for AI-driven diffusion-based symbolic music generation, enabling precise, real-time, and interactive control over pitch and harmony with strong theoretical and empirical results.
Abstract
Reviews and Discussion
This paper proposes a method called Fine-Grained Guidance (FGG) to improve precision and controllability in diffusion-based symbolic music generation, especially in the domain of tonal music, where out-of-key notes can be perceived as mistakes.
update after rebuttal
I raised the score to 4 as the rebuttal addressed my concerns.
Questions for the Authors
Claims and Evidence
- Tables give evidence that out-of-key notes are never generated. However, composers sometimes intentionally use out-of-key notes to add tension to the music. Using FGG is of course beneficial in most cases, but it "eliminates" the generation of such tension-imposing music. How can the authors justify this?
- I'm not sure if Proposition 2 is tight enough. If it's not tight, then the ground of Section 4 is largely weakened.
- Could the authors provide more intuition for Eq. (8)? If (l,h) is in w_K(l), then epsilon will be the maximum value, but I could not easily grasp the intuition behind this. If you plug the value into Eq. (6), then the predicted X_0 will be 0.5. Could you explain this and the general intuition behind it in more detail?
- Can you apply this method to non-tonal, avant-garde, or jazz style music? While I appreciate the strong technical contributions of this work, I would like to raise a broader discussion point regarding modality-specific research directions, particularly in the audio domain. Unlike other generative domains such as vision or language, where models are typically expected to generalize across all possible natural images or text, audio — and especially symbolic music — remains heavily siloed. We often see separate models for speech, music, and even for specific genres like pop, classical, or jazz. This raises the question: why is music treated as fundamentally different in its modeling approach compared to image or video generation? Should we continue to develop highly specialized generative models for narrow musical domains, or should the field aim toward more unified, generalized audio models? The current direction in symbolic music seems to serve a small subset of expert users, and it’s unclear whether this granularity is justified in terms of broader impact or scalability. I fully acknowledge that each modality has unique properties — but from a reviewer’s perspective, this fragmentation affects how we measure and compare contributions across fields. I do not see this as a flaw of the paper per se, but I encourage the authors and the community to consider this broader question of generality, user impact, and the future direction of audio generation research.
Methods and Evaluation Criteria
I did not carefully read the experiments.
Theoretical Claims
I didn't read proofs.
Experimental Design and Analysis
The experimental section looks solid enough.
Supplementary Material
I read parts of the Appendix when necessary.
Relation to Existing Literature
Relevant to the general public in this community who are particularly interested in the music domain (and to AI for science, given the shared interest in generating samples that satisfy a given constraint).
Essential References Not Discussed
The references are sufficient.
Other Strengths and Weaknesses
Other Comments or Suggestions
We deeply appreciate the reviewer's valuable comments. Please allow us to provide responses as follows:
1. Using this FGG is of course beneficial for most of the cases, but it "eliminates" the generation of such tension-imposed music. How can the authors justify this?
We agree that some composers intentionally use out-of-key notes to add “creativity”. However, we observe that generative models often fail to create an accommodating context for such “creativity”. In other words, most of the out-of-key notes generated by models do not add creativity, but instead disrupt harmony. We think the main reason is that out-of-key notes are relatively rare in most datasets, which makes it difficult for the model to learn how to “correctly” use them. Given such a limitation in the data, the benefit of avoiding out-of-key notes could outweigh the loss of potential creativity.
2. I'm not sure if Proposition 2 is tight enough. If it's not tight, then the ground of Section 4 is largely weaken.
We also cannot theoretically verify whether it is tight. We would like to clarify the following: if Proposition 2 is not tight—meaning the error probability decays even more slowly than stated—then the probability of generating "unresolved wrong notes" is even higher. Therefore, the lack of tightness in Proposition 2 does not undermine the main argument in Section 4.
3. Could the authors provide more intuition of Eq. (8)?
Eq. (8) ensures that if (l,h) is not an out-of-key position, the predicted value is unchanged, and if (l,h) is out-of-key and the predicted value exceeds 0.5, it is reset to 0.5. This acts as a projection method, enforcing that out-of-key notes cannot be present. The threshold of 0.5 is chosen because, in piano roll quantization, values <= 0.5 mean no note, while values > 0.5 indicate a note. Instead of correcting only at the final step, we apply this constraint throughout the sampling process to maintain the flexibility of diffusion models while ensuring harmonic coherence.
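To make this concrete, here is a minimal sketch of what the per-step edit could look like (the function name, array shapes, and mask construction are illustrative assumptions on our part, not the exact implementation):

```python
import numpy as np

def project_out_of_key(x0_pred, out_of_key_mask, threshold=0.5):
    """Cap predicted x_0 at out-of-key positions so they cannot become notes.

    x0_pred         : (time, pitch) array of predicted x_0 values in [0, 1]
    out_of_key_mask : boolean array of the same shape, True where a note
                      would be out of key under the current constraint
    """
    projected = x0_pred.copy()
    # In-key positions stay unchanged; out-of-key positions are capped at 0.5,
    # the boundary between "no note" (<= 0.5) and "note" (> 0.5).
    projected[out_of_key_mask] = np.minimum(projected[out_of_key_mask], threshold)
    return projected

# Toy usage: forbid pitch class 1 (C#) over a 4-step, 12-pitch segment.
x0_pred = np.random.rand(4, 12)
mask = np.zeros((4, 12), dtype=bool)
mask[:, 1] = True
x0_proj = project_out_of_key(x0_pred, mask)
assert (x0_proj[mask] <= 0.5).all()
```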
4. Why is music treated as fundamentally different compared to image or video generation?
Music generation follows two main approaches: treating music as an image (e.g., piano roll representation) or as language (e.g., sequences of pitch, duration, and timing tokens). However, music generation faces the following challenges:
- Precision Sensitivity – In image generation, slightly changing a single pixel usually has minimal impact, whereas one misplaced note in music can significantly disrupt harmony.
- Temporal Overlaps – Unlike text, where words follow a sequential order, musical notes often overlap, which leads to challenges when conducting music generation in the “next token predictor” manner.
- Genre-Specific Rules – Music follows complex, style-dependent structures, such as tonal harmony in classical music, or intricate rhythms in jazz.
These factors make it difficult to develop a unified, generalized model for music generation. For now, the fragmentation in music models is partially due to the complexity of musical structure and the precision required to generate high-quality music. But as generative AI matures, we should aim for models that can adapt dynamically to different musical styles rather than being strictly bound to a single one. In fact, our control method can also be applied to stylized music generation (see the following paragraphs).
5. Does our method have broader impact or scalability?
We think that our proposed method (especially the sampling control part) is generalizable. In our paper, we enforce a "no out-of-key notes" constraint within the sampling process. As shown in the demo page, this method can be applied to generate stylized music with special scales (e.g., Chinese pentatonic scale).
More broadly, our method can be adapted to various musical constraints. Many genres and styles can be defined through rules (e.g., harmonic, rhythmic, or structural constraints), which our sampling method can also incorporate. For example, rhythmic control could be implemented by ensuring that certain time positions must (or must not) contain notes. We can then incorporate this requirement in the sampling process as follows: in each step, we edit the predicted values by projecting them onto the domain where the rhythmic constraint is satisfied. Such a rhythmic control method could be used for generating jazz music, which has special rhythmic requirements.
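As a rough illustration of how such a rhythmic constraint could be projected in the same way, consider the sketch below (the thresholds and the way a "required" position is filled are our assumptions, not a method from the paper):

```python
import numpy as np

def project_rhythm(x0_pred, must_have_note, must_be_silent, threshold=0.5):
    """Project predicted x_0 so that forbidden time steps contain no notes and
    required time steps contain at least one candidate note.

    x0_pred        : (time, pitch) array of predicted x_0 values in [0, 1]
    must_have_note : boolean (time,) array, True where a note is required
    must_be_silent : boolean (time,) array, True where no note is allowed
    """
    projected = x0_pred.copy()
    # Forbidden positions: cap every pitch at the "no note" threshold.
    projected[must_be_silent, :] = np.minimum(projected[must_be_silent, :], threshold)
    # Required positions: push the most likely pitch just above the threshold.
    for t in np.flatnonzero(must_have_note):
        best = projected[t].argmax()
        projected[t, best] = max(projected[t, best], threshold + 1e-3)
    return projected
```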
This paper presents a fine-grained guidance (FGG) mechanism for improving symbolic music generation using diffusion models. The proposed approach incorporates strict harmonic control by integrating domain knowledge into the generative process, ensuring that generated musical sequences adhere to predefined chord progressions and key signatures. The paper introduces a conditional generation setup, where fine-grained harmonic constraints guide the sampling process, leading to a controlled and structured output. Theoretical justifications are provided, demonstrating the necessity of such structured control in diffusion-based symbolic music generation. To validate the approach, the authors provide both theoretical and empirical evidence, showing that FGG prevents out-of-key generation and enhances structural coherence in generated compositions. The paper also presents an interactive demonstration, showcasing generation capabilities.
Questions for the Authors
- Why do you use negative values for the condition piano roll when no rhythm control is given?
- How do you justify your choice of evaluation metrics? Could you give motivation for what they intend to capture and how the scores demonstrate that? Could you supply additional, more interpretable evaluation metrics?
- Could you elaborate on how prior work in this field addressed the gaps you highlight in the introduction? What gaps remain?
Claims and Evidence
The paper makes three primary claims:
- The proposed fine-grained guidance mechanism ensures strict in-key generation.
- Theoretical justifications demonstrate the necessity and effectiveness of FGG.
- Empirical evaluation supports the effectiveness of FGG in improving symbolic music generation.
Assessment of Claims and Supporting Evidence
- Fine-Grained Guidance (FGG) for Strict Harmonic Control
- Supported Evidence: The methodology section details how fine-grained constraints are integrated into the generative process, ensuring adherence to predefined harmonic structures. The 0.0% off-key note generation result confirms this strict control.
- Evaluation: Well-supported. The claim is effectively demonstrated, as the model enforces harmonic rules explicitly, leading to an expected perfect in-key generation.
- Theoretical Justification of FGG’s Necessity and Effectiveness
- Supported Evidence: Theoretical formulations, including Proposition 1 and Proposition 2, argue the necessity of structured control in diffusion-based symbolic music generation.
- Evaluation: Partially supported. The paper is not self-contained, requiring extensive prior knowledge in statistical methods and diffusion models. Several key derivations rely on unstated assumptions, making it difficult for non-expert readers to follow.
- Recommended Improvement: The presentation should be more self-contained, with additional intermediate steps, particularly in the appendix, to improve accessibility. Re-formalizing the problem setup and explicitly deriving key constraints, such as the first constraint in Appendix A.1, would make the proofs much more understandable.
- Empirical Evaluation Demonstrating Effectiveness
- Supported Evidence: The authors conduct experiments using cosine similarity via a VAE latent representation and overlapping area (OA) measures to assess generation quality.
- Evaluation: Not convincingly supported. Aside from strict off-key detection, the evaluation metrics lack clear intuition regarding what they measure. The rationale behind using VAE-based cosine similarity and OA instead of direct accuracy or intersection-over-union (IoU) is unclear.
- Recommended Improvement: The authors should either justify their choice of evaluation metrics or adopt more interpretable alternatives that directly measure harmonic consistency and generation quality.
Methods and Evaluation Criteria
The methodological approach of integrating harmonic constraints in diffusion-based symbolic music generation is novel and well-motivated. However, the evaluation criteria lack clarity and justification. The authors use VAE-based cosine similarity and overlapping area (OA) measures to assess performance, but these metrics are not well explained, and their relevance to assessing generation quality is uncertain. Alternative evaluation methods, such as direct accuracy, intersection-over-union (IoU), or out-of-key note ratios beyond 0.0%, could provide more interpretable and robust validation. Moreover, justification for some design choices is lacking, e.g., why are negative values used for the condition piano roll when no rhythm control is given?
Theoretical Claims
See claims and evidence.
Experimental Design and Analysis
The experimental setup demonstrates the technical feasibility of the approach but fails to convincingly validate the claimed improvements. The 0.0% off-key rate is expected due to strict constraints, and the other evaluation metrics do not provide clear insight into how well the method improves generative quality.
The paper would benefit from:
- Justifying the choice of VAE-based cosine similarity and OA as evaluation metrics.
- Introducing alternative, more interpretable accuracy metrics.
- Expanding empirical analysis beyond off-key note prevention.
Supplementary Material
I have tried the Huggingface Space to play with the model, which quite consistently showed very nice generations!
Relation to Existing Literature
The progression made prior to this work is insufficiently described in the opening sections. The related work section engages in name-dropping rather than giving context: what did recent works do to tackle the challenges you specify in the introduction? What is the progression from one to the other? What key contribution did each introduce? How do they aim to address the gaps you point out in the introduction? You should draw out the progression for the reader to better understand the context before diving into the specifics in the next sections. The current related work serves only to specify some differences of the suggested approach from some prior works, without addressing any of the above.
Essential References Not Discussed
- I believe that the related work section needs to be revised as a whole and to draw the progression made in recent years, point out the gaps and how other people tried to address them.
- Theoretical derivations could benefit from expanding derivation steps, see comment in Claims And Evidence.
- Justification for Empirical evaluation metrics are required.
Other Strengths and Weaknesses
Strengths:
- Novel structured control mechanism for enforcing harmonic accuracy.
- Theoretical rigor in proving the necessity and effectiveness of the approach.
- Well-illustrated methodology and results.
- Practical value through an interactive demo.
Weaknesses:
- Writing clarity issues: The writing requires significant effort to follow due to structural issues and ambiguous wording. Examples:
- The contribution summary should be clearer and more structured. Instead of long paragraphs, each contribution should be presented in a concise sentence.
- Related work section lacks coherence, with frequent references to the authors' own work disrupting the logical flow. Example: "We adopt classifier-free guidance..." appears prematurely in Related Work rather than in Methods.
- Justification for evaluation metrics and design choices:
- The authors do not justify why negative values are used for the condition piano roll during training when no rhythm control is given.
- The choice of evaluation metrics (VAE-based cosine similarity, OA) lacks clear reasoning.
- Lack of self-containment:
- Key derivations require extensive prior knowledge and are not well explained.
- Theoretical justifications, especially in Appendix A.1, should explicitly present intermediate steps and provide clearer intuition behind their design choices.
Other Comments or Suggestions
Overall I think that this work is interesting and shows great performance by imposing domain-knowledge constraints on the generative process, and the demo is very fun to play with and shows very nice generations. That being said, I strongly suggest a careful revision to improve clarity, as I found it difficult to follow key claims throughout the paper. You should be more structured and well defined, and avoid vague wording. Specifically, the related work, background, and method sections should be revised. Make sure your claims are plainly stated, your design choices are justified, and the problem setup is established before moving on. Although I overall like this work, due to the lack of clarity throughout the paper I find it difficult to confidently give a high ranking at this stage. Please address my concerns and revise your writing, and I shall increase my score.
We deeply appreciate the reviewer's suggestions on revising! Due to this year's policy, we are not allowed to upload a revised paper, external links can only contain figures/tables, and the rebuttal has a 5000-character length limit. Please allow us to describe a revision plan below; we promise to follow through on it in the next phase.
Responses
- To train a model that can handle both rhythm+chord and chord-only conditions, we must distinguish when rhythmic conditions are provided. We achieve this by using negative values to indicate the absence of rhythmic conditions, preventing the model from misinterpreting 0s and 1s as rhythmic constraints applied everywhere. This design choice was determined through empirical experimentation.
- Chord progressions are the harmonic structure of music, and the latent representation better accounts for degrees of similarity between chords. For features like pitch, duration, and note density, we evaluate the overlapping area between their distributions in generated and ground-truth segments, to assess how well the generative model captures the statistical properties inherent in the original compositions. We also added three metrics to directly measure generation quality: Direct Chord Accuracy, IoU of Chord, and IoU of Piano Roll (a brief sketch of the piano-roll IoU follows this list). New results are shown in https://drive.google.com/file/d/1IAcAqK4qK4AiQVKWriFSJ91QNhrg5at-/view.
- See "related literature" in the revisions below.
Revisions in writing
1. Contribution:
Motivation: We theoretically and empirically characterize the challenge of precision in symbolic music generation
Methodology: We incorporate fine-grained harmonic and rhythmic guidance to symbolic music generation with diffusion models.
Functionality: The developed model is capable of generating music with high accuracy in pitch and consistent rhythmic patterns that align closely with the user’s intent.
Effectiveness: We provide both theoretical and empirical evidence supporting the effectiveness of our approach.
2. Related Work:
Thank you for your suggestions, we really appreciate the structure line of organizing this paragraph that you have provided us. We will integrate it as:
To leverage well-developed generative models for symbolic music, Huang et al. (2018) introduced a Transformer-based model with a novel relative attention mechanism designed for symbolic music generation. Subsequent works have enhanced the controllability of symbolic music generation by incorporating input conditions. For instance, Huang and Yang (2020) integrated metrical structures to enhance rhythmic coherence, Ren et al. (2020) conditioned on melody and chord progressions for harmonically guided compositions, and Choi et al. (2020) encoded musical style to achieve nuanced harmonic control. These advancements have contributed to more interpretable and user-directed music generation control.
To better capture spatio-temporal harmonic structures in music, researchers have adopted diffusion models with various control mechanisms. Min et al. (2023) incorporated control signals tailored to diffusion inputs, enabling control over melody, chords, and texture. Wang et al. (2024) extended this by integrating hierarchical control for full-song generation. To further enhance control, Zhang et al. (2023) and Huang et al. (2024) leveraged the gradual denoising process to refine sampling. Building on these approaches, our work addresses the remaining challenge of precise control in real-time generation.
3. Theory
We will make the theoretical proofs in the appendix more detailed by (1) expanding more steps and (2) restating relevant formulations from the main text directly in the appendix. We will also simplify Proposition 2 by removing the conditional probability argument; the details are in our response to reviewer uD9t.
4. Sections 2 and 3.
We will add subtitles to section 2, namely “Data representation of the piano roll” and "Formulation of the diffusion model", at corresponding places.
We will delete the third paragraph “The FGG method improves…” of section 3.
We will revise the first paragraph of section 3.2 to:
We first provide a rough idea of the harmonic sampling control. To integrate harmonic constraints into our model, we employ temporary tonic key signatures to establish the tonal center. Our sampling control mechanism guides the gradual denoising process to ensure that the final generated notes remain within a specified set of pitch classes. This control mechanism removes or replaces harmonically conflicting notes, maintaining alignment with the temporary tonic key.
We will add subtitles or topic sentences “mathematical formulation of the harmonic sampling control”, "preliminaries", "Edit intermediate-step outputs of the sampling process" and “theoretical property of the sampling control” to corresponding places of section 3.2, and do further refinement.
Thank you for taking the time to address my concerns.
The added IoU-based evaluation metrics significantly improve the interpretability of the evaluation and help support the model's harmonic precision and structural accuracy. These additions directly address some of my earlier concerns, and I find them convincing.
That being said, I still believe the explanation for the Overlapping Area (OA) metric requires further clarification. It remains unclear what an increase in OA actually indicates. Why is a higher OA necessarily desirable? Is the intention to show that the generated segments better match the distribution of observed features in the test set? And is deviation from the test set distribution necessarily a negative outcome? I'm not fully convinced this is always the case. I don’t expect the authors to replace the metric, but it would be helpful to clearly articulate what OA is meant to capture, what an increase in its value signifies in practice, and in what sense it reflects an improvement.
I also appreciate the clarification regarding the use of negative values for missing rhythmic conditions. However, since the rebuttal mentions that this design was motivated by empirical findings, I believe the paper would be strengthened by briefly including this evidence — for example, noting that removing this distinction results in a drop of X% in chord accuracy or a similar concrete metric.
Overall, the revision plan seems promising and addresses many of the concerns I raised. I’d be happy to revisit my score and would feel more comfortable doing so if the revised content becomes available — even via an external link, as was done with the evaluation tables.
Edit Apr 6th.
I have updated my grade, as I think that you did good work in this rebuttal, and the work should be accepted. That being said, I still think that the OA is a vague metric, which does not serve to compare your suggested approach to the observed baselines. As you mentioned yourself, higher is not necessarily better—hence I think this actually serves to confuse the reader rather than emphasize a point you wish to make. I would suggest you consider dropping this metric and using the IoU and accuracy measures alone. Best of luck with your submission.
Thank you very much for your additional comments and suggestions. Please allow us to provide response as follows:
1. Explanation about OA
Thank you for your insightful comment. The OA metric is designed to measure the degree of overlap between the distribution of key features in the generated outputs and those in the ground truth. A higher OA suggests that the structural patterns of the generated accompaniments better align with the patterns found in human-composed ground-truth accompaniments. In other words, it indicates that the model is more capable of producing a similar range and distribution of musical structures, rather than collapsing into a narrow subset of possibilities or generating unrealistic patterns.
We agree with the reviewer that deviation from the ground truth distribution is not always a negative outcome, especially in creative domains where novelty and innovation are highly valued. We do not claim that maximizing OA is always equivalent to improving artistic quality. Instead, we treat the ground truth as a "reference," and use OA as a complementary evaluation to assess whether the model maintains coverage of plausible structural features — an important aspect of generation quality alongside harmonic precision. We will revise the manuscript to better clarify the purpose, interpretation, and limitations of the OA metric.
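For concreteness, the overlapping area between two empirical feature distributions can be computed roughly as follows (a simple histogram-based sketch; the binning is our assumption, not the paper's exact implementation):

```python
import numpy as np

def overlapping_area(gen_values, ref_values, bins=50):
    """Overlap (in [0, 1]) between the empirical distributions of a feature,
    e.g. pitch or note density, in generated vs. ground-truth segments."""
    gen_values = np.asarray(gen_values, dtype=float)
    ref_values = np.asarray(ref_values, dtype=float)
    # Shared bin edges so both histograms are directly comparable.
    edges = np.histogram_bin_edges(np.concatenate([gen_values, ref_values]), bins=bins)
    gen_hist, _ = np.histogram(gen_values, bins=edges, density=True)
    ref_hist, _ = np.histogram(ref_values, bins=edges, density=True)
    widths = np.diff(edges)
    return float(np.sum(np.minimum(gen_hist, ref_hist) * widths))
```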
2. The use of negative values for missing rhythmic conditions
Thank you for this helpful suggestion. We agree that providing empirical evidence would strengthen the paper.
In our experiments, we found that removing the distinction for missing rhythmic conditions—that is, not using negative values—led to an 8%–15% decrease in chord accuracy across different evaluation settings (for example, direct chord accuracy drops from 0.485 to 0.421, and chord similarity drops from 0.767 to 0.705). This drop highlights the importance of explicitly encoding missing rhythmic information, as it helps the model better distinguish between different musical contexts.
We will include this empirical observation, adding a note to make the motivation for this design choice clearer to readers.
Regarding revision plan and revised content
We much appreciate the reviewer’s recognition of our revision plan and the willingness to spend time reviewing the revised content. We would be happy to submit a revised paper, but unfortunately this year’s policy only permits the external link to contain tables and figures. We consulted the AC about the possibility of sharing a revised paper and were advised that it is not allowed at this stage. Nevertheless, we want to assure you that we will carefully implement the planned revisions in the next phase of the process.
Thank you again for the valuable comments.
The two main motivators of this work are (1) the importance of providing user control for symbolic music generation, and (2) specifically the importance of exact pitch control, and the particular challenge of achieving this when using an image-based representation (i.e. piano roll). The authors solve this by proposing a conditioning-based approach that allows some harmonic and some rhythmic control. They introduce a piano roll representation of the conditioning signal, and they use these signals both during train time of a conditional diffusion model, and also as a mask-like representation of the constraints to provide further guidance during the sampling process. Essentially—if I have understood correctly—if the constraint says “don’t use note X at time T”, then at each step of denoising, the sample value of note X at time T is pushed toward zero. They provide a theoretical bound on how much this adjustment can affect the overall resulting distribution. The authors also provide a theoretical argument to try to explain why harmonic precision is hard for uncontrolled (i.e. unconditional) models.
They provide experimental results using the POP909 dataset, with piano rolls of size 2x64x128, corresponding to 4 bars of 4/4 time quantized into 16th notes, with one channel for onset and one channel for sustain. They measure both objective (e.g. percentage of out-of-key notes) and subjective quantities. They present an ablation study where (1) the control signal is provided as a conditioning input, but “wrong” notes are removed only after completing the reverse-diffusion process; (2) one study where the control signal is again provided as input but no notes are removed; and (3) one study where an unconditional model is used and no control occurs during the sampling process.
They find that providing the conditioning signal (i.e. training a conditional model) helps, and providing both conditioning signal together with interventions during the sampling process helps even more.
Questions for the Authors
(See any questions above). Also:
Q1. I am a bit confused about the Dorian mode generation. If harmonic constraints are implemented as “sets” of eligible notes, then how does this allow generation in specific modes that is any different from a major key? For example, G major consists of exactly the same set of notes as A dorian, B phrygian, etc. So in the demo page, which shows A dorian, how is this different from generating in G major? I do agree that this example has a bit of a dorian sound to it, but what is causing that? Presumably sometimes you would have just got something that clearly sounds in G major, right?
Q2. Essentially, this problem appears to be an instance of inpainting where parts of the image are known and other parts need to be filled-in. This has been studied extensively. Why ensure Roll[time,pitch] <=0.5 rather than ensure Roll[time,pitch] == 0? I.e. why not explore an inpainting method to fill in the remainder of the piano-roll by grounding the Roll[time,pitch]==0 wherever required?
Q3. Out of curiosity: In general, the data-distribution is normalized to [-1,1] before training a diffusion model. So, I was just wondering if the equations use [0,1] data for notational convenience or did the authors actually use [0,1] data for training?
Claims and Evidence
NOTE: Since this question requires me to “convert” the standard contributions-based representation into a claim-based representation (some of which are implicit in my reading of the paper), I invite the authors to restate their claims explicitly if I have misrepresented anything here.
Claim: The authors provide statistical theory evidence to characterize “the precision challenge” in symbolic music generation. (e.g. see Section 4).
Assessment: I think I like the provided reasoning/argument, in as much as I followed it. However, I found parts of the argument hard to follow (i.e. unclear), perhaps unnecessarily so (see “strengths/weaknesses” below)).
Claim: The proposed model allows future users precise control over harmonic and rhythmic components of symbolic music generation. (e.g. see Section 3, first few paragraphs).
Assessment: This claim fails to highlight that of course it can only do this insofar as the conditioning representation can support the desired guidance. I think the conditioning representation is reasonable and good (it makes sense to represent constraints in a piano roll format!) but it is also quite limited. For example, it does not allow probabilistic constraints (as in the demo link earlier). As other examples, one could certainly imagine other kinds of rhythmic constraints as well, and there are also different ways of specifying harmonic constraints. But one paper cannot provide all possible forms of control!
Easy Fix: This could be fixed with a clear and explicit discussion on limitations, which I believe are significant.
Claim: The proposed controlled diffusion model is capable of generating music with high accuracy and consistency with respect to harmonic and rhythmic guidance, despite limited training data. (e.g. see Section 1, ‘methodology’ and ‘effectiveness’ paragraphs).
Assessment: I believe this claim is basically supported by clear evidence.
Claim: The model supports this control even when the controls push the music towards an out-of-sample style. (e.g. see Section 1, ‘methodology’ paragraph).
Assessment: This depends on what one means by “out-of-sample”. If one thinks of the training+sampling procedures as teaching+enforcing where to place or not place notes, based on the conditioning signal, then I think that providing a slightly different subset of notes (i.e. a different scale or chord from what is used in the training signal) is not necessarily an out-of-sample task. In particular, the adjusted sampling procedure will absolutely guarantee that only the “allowed” notes will be used (i.e. the others will be removed) so if the set of “allowed” notes is very far from anything seen in the training data, then the generated samples will still satisfy the constraints; the only remaining question is whether the generated samples sound good (and perhaps subquestions such as: how much did the conditional training alone help in those particular cases, versus how much was the sample-editing procedure required?).
Fix: Either justify exactly what is meant by out-of-distribution, and answer my questions above, or remove/carefully-qualify this claim (at no great cost to the paper, in my opinion).
Claim: “We have published a demo page to showcase performances, as one of the first in the symbolic music literature’s demo pages that enables real-time interactive generation.” [L 32-34].
Assessment:
(1) “interactive”: The demo allows the user to switch between 4 presets (and regenerate multiple times for each one). This is good (far, far better than no demo!), but when I read about an interactive demo, I was excited to try inputting more complex chords and melodies myself, and melodies that include, e.g. less harmonically obvious notes, and see how the system sounds. I do think it’s OK to call it an interactive demo, though.
(2) “one of the first in the symbolic music literature’s demo pages”: That reads to me like a significant exaggeration. For example, there are over 15 interactive musical demos on this page alone (https://magenta.tensorflow.org/demos/web/ , which includes [1]) many of which allow the user to input, e.g. melodies and constraints (i.e. more than just from a dropdown of 4 choices). An effective demo where the user provided melodic contour constraints in real-time was presented in [2]. There are many other online MIDI generation demos as well, e.g. a transformer model here: https://huggingface.co/spaces/skytnt/midi-composer, and many others are available as well. Incidentally, OpenAI’s MuseNet was available interactively as well a few years ago, although it’s not available anymore, so it’s understandable if the authors had not come across it.
That said, I think that providing any kind of user interaction is important and commendable and in fact should generally be expected for generative models. The issue here is simply that the claim is incorrect and can be easily fixed.
[1] Roberts, A. and Engel, J. and Hawthorne, C. and Simon, I. and Waite, E. and Oore, S. and Jaques, N. and Resnick, C. and Eck, D., “Interactive Musical Improvisation with Magenta”, NeurIPS 2016 (Demo).
[2] Donahue, C, and Simon, I., and Dieleman, S., “Piano Genie”, IUI 2019 (see https://www.i-am.ai/piano-genie.html for an online demo)
Methods and Evaluation Criteria
Yes, the proposed methods and evaluations do make sense, generally speaking, especially in relation to the conventions in this field.
Also: No, the conventional evaluations do not necessarily make sense, but the current authors are not responsible for that. However, it would be helpful to mention potential limitations associated with the evaluation criteria and dataset.
I found the demo page to be very helpful. I would like to see more examples and information on that page (and/or in the appendix, whichever is easier):
- to help make sure I understand how the accompaniment was generated for the sample melodies, I would like to see the constraint matrices (i.e. in “piano roll format”) that were used to generate those accompaniments.
- it would be interesting and helpful to see some samples generated for each of the ablative conditions
- for the sample melodies in the demo page, how were these melodies obtained? In particular, one of them sounds almost as though it was taken from an existing piece (from the dataset?), and the melody was “extracted” by hand — is that the case, or is that just a coincidence?
Baselines. The authors have essentially posed/re-framed the symbolic-control problem as an image in-painting problem, where certain pixels have constrained values. The in-filling literature for images is extensive, so I would expect there to be at least one baseline from that literature that would be reasonable to apply here. Could the authors please respond to this?
Theoretical Claims
No; I looked at parts of the proofs, but I did not check them line-by-line.
Experimental Design and Analysis
Yes, the experimental designs seemed fairly reasonable. For the ablations, the authors mentioned a comparison to a “simple rule-based post-sample editing”, but I don’t think I saw this in Table 2. Am I missing something? (unless this refers to the case where the conditional-trained model is still used, but the editing only happens once at the end, i.e. “training control edit after sampling”). I assumed “simple rule-based post-sample editing” meant an unconditional model, but with an edit applied after sampling. Now I’m thinking maybe it means a conditional model after all– clarification on this would be welcome.
In 5.1.5, regarding the ablations, the authors write “In contrast, the latter employs a brute-force editing approach that disrupts the generated samples, affecting local melodic lines and rhythmic patterns. The numerical results further validate this analysis.” However, as far as I can tell, the numbers in Row 1 and Row 2 of Table 2 look nearly identical. E.g. OA(pitch) is 0.628+/- 0.005 versus 0.624 +/- 0.005. I do believe that the brute-force editing might disrupt the generated samples, affecting melodic lines, but to me, this absolutely demands qualitative listening-samples because (if I’ve understood correctly) the numerical results do not validate this analysis. This also points to shortcomings in the standard approaches for evaluating this kind of work (see my earlier comment in the Methods & Evaluations section about limitations of evaluations).
An ablation or comparison that would have been interesting is to simply zero out—at every step of the sampling process—those pixels/cells of the piano roll where there is not “supposed” to be a note. I wonder if this would capture the main benefits of the current proposed method, in that it would reduce the possibility of “wrong notes” directly at every iteration of the sampling process. I am not requesting that the authors do this, but if it’s feasible, then I would certainly be very interested to see (and hear) the results of such a modified sampling process. I realize it might be theoretically less sound, but would still be an interesting comparison point. (Or, again, am I misunderstanding something?) Related, see Q2 below.
Incidentally, another related comparison could be some equivalent of the 'MIDI scale' function that Ableton provides, i.e. just "round notes up/down" to the nearest allowable note that satisfies the constraints.
Supplementary Material
I reviewed Section B.3 (including Algorithm 2 for DDPM sampling with fine-grained textural guidance), along with Fig 4, Sections C, D, E, H. I also skimmed through all the other sections of the supplementary material.
Relation to Existing Literature
Some parts of the paper's motivation relate to challenges that are specific to diffusion on piano rolls, not to symbolic music generation in general. For example, try the demo at https://magenta.tensorflow.org/demos/performance_rnn/ to see that it is possible to use a simple language model to enforce certain precise harmonic controls easily and effectively (i.e. probability distribution over pitch classes). Of course language models have their own challenges that diffusion models don’t face. For a more detailed discussion and evaluation of conditioning MIDI-based language models with a variety of controls, see for example [3].
[3] Nicholas Meade, Nicholas Barreyre, Scott C Lowe, Sageev Oore, “Exploring Conditioning for Generative Music Systems with Human-Interpretable Controls”, Int’l Conf on Computational Creativity (ICCC) 2019
Essential References Not Discussed
Could optionally consider including any of the references mentioned above.
Image inpainting. A significant area of related work is image inpainting/infilling, since a premise of this paper is to convert symbolic control into an infilling task. (See also my comment on Baselines in the section above on Experimental Designs.) Some of these papers could be discussed explicitly.
Other Strengths and Weaknesses
I appreciated occasional well-articulated observations throughout the paper (e.g. in the introduction regarding common limitations, precision demands for music generation; also at the end of the appendix, etc). As one example: (Sec3): "One challenge of symbolic music generation involves the high-precision requirement in harmony. Unlike image generation, where a slightly misplaced pixel may not significantly affect the overall image quality, an `inaccurately' generated musical note can drastically disrupt the harmony, affecting the quality of a piece." Absolutely!
In Section 4, the authors write: “We provide an intuitive explanation under the statistical convergence framework.” Personally, I found the explanation, including Proposition 2, highly unintuitive (or at least unclear). Once I spent time parsing it, then I did appreciate the argument (assuming I understood it correctly). Some intuition and clarity would help.
Quantizing to 16th notes and {1,0} note indicators is both ignoring velocity (i.e. dynamics, rhythmic accents), and also ignoring a large class of rhythms (e.g. triplets, swing, other groupings). Also, if I understood correctly, it requires the data to be “beat-aligned”, e.g. would this representation allow ingesting data such as any sophisticated but unquantized performance data which does not have barline/beat information (e.g. MAESTRO dataset)? Quantization is OK in the sense that simplifications need to be made to get ML systems to work, but it can also be a significant limitation that needs to be addressed as such. How complicated would it be to “scale” up to incorporate some of these aspects? How much is lost by not incorporating these? Again, I do understand the need to simplify; it is simply important to be clear and thoughtful about the extent and impact of the simplification.
Lack of Limitations. This paper is missing almost any discussion of limitations whatsoever. I would like to see such a discussion added, either in one place and/or throughout the paper as appropriate.
Overall: This is an interesting paper and I look forward to seeing the authors' response and any discussion.
Other Comments or Suggestions
My current score is a placeholder. If my concerns are addressed then I will consider raising my mark.
We deeply appreciate the reviewer's suggestions on revising! Due to this year's policy, we are not allowed to upload a revised paper, external links can only contain figures/tables, and the rebuttal has a 5000-character length limit. Please allow us to describe a revision plan below; we promise to follow through on it in the next phase.
Responses
- To enforce Dorian, we restrict pitch classes to the Dorian scale, and additionally use the Am-Em-C-D chord progression for shaping. The combination of the two makes it different from G major.
- We added a comparison with an inpainting baseline (see 4. Additional Experiments and the demo page below). We did not initially frame our methodology from an inpainting perspective. Generating with minor corrections seems more efficient to us, since a well-trained model aligns with chords, yielding only 2% incorrect notes. In contrast, inpainting adds complexity to the model, while much of the Roll[time,pitch]=0 in the input might be redundant.
- We actually use [0,1] data for training.
- "Simple rule-based post-sample editing" is “training control, edit after sampling”.
Revisions
1. Additional related work. We will add two sections, one on precise control over symbolic music generation with other generative models, the other on image inpainting.
2. Modified claims.
- We will explain that the user-specified control refers specifically to user-designed control within the constrained piano roll format.
- We will remove the misleading word “out-of-sample”, and replace with the statement “our method can shape the output towards a specific tonal quality.”
- We will describe the demo page as “a demo page to showcase performances, which enables real-time generation”, removing "one of the first".
3. Theory. We have decided to remove the argument regarding the conditional probability, which seems too intricate. Further, the introduction to Proposition 2 (after the discussion of out-of-key notes and resolutions, and before the proposition itself) will read:
We provide an explanation using statistical reasoning. Consider a piano roll segment, represented as a random variable. Suppose we are interested in whether this segment contains an out-of-key note and whether that note is eventually resolved within the segment. In our training data, almost every out-of-key note is resolved, meaning the probability of an unresolved out-of-key note is close to 0.
Now, we examine this probability in the generated music. The key question then is whether the generative model also learns to keep it small. The following Proposition 2... (same as in manuscript).
4. Additional Experiments and Samples.
In the numerical experiments, we have added a baseline named inpainting, in which we treat the pixels where there should not be a note as known (their value should be 0) and let the model inpaint the remaining parts. To do this, we added a mask to the model inputs and trained an inpainting model. We also added rounding of out-of-key notes to the nearest allowable note in our ablation study, along with more interpretable metrics. Results are shown in https://drive.google.com/file/d/1IAcAqK4qK4AiQVKWriFSJ91QNhrg5at-/view.
Samples generated from ablation conditions are added to Section 3 of demo page. Across all ablations, we observed occasional occurrences of excessively high-pitched notes and overly dense note clusters.
5. Discussion of limitation:
The 16th-note quantization follows prior work (Wang et al., 2024), which admittedly reduces rhythmic flexibility and cannot ingest data without beat information. A potential improvement is integrating our pitch class control method with (Huang et al., 2024), which adds a dynamics dimension and uses 10ms time quantization for greater rhythmic flexibility. Another key limitation is the control format. Our method supports pitch class and rhythmic control in the piano roll representation, but does not accommodate more abstract forms or probabilistic control. Additionally, the evaluation methods and datasets present challenges in accurately assessing the quality of generated music. Since music evaluation is inherently detailed and partly subjective, the metrics used in this work have fundamental limitations in measuring quality improvement.
6. More explanations of the demo page
In section 2 of demo page, the chord conditions are converted to condition matrices exactly following Figure 2 of the paper, the melody conditions are provided in an additional channel, also in the form of a piano roll, and we did not use rhythmic constraint in the generation. We will add a section in appendix to provide description of the matrix. Sample melodies are either randomly picked from the test set of POP909, or extracted by hand from some of our favorite existing pieces.
I thank the authors for their thorough rebuttal and for the additional experiments they have run and presented, and their discussion of limitations. I appreciate this effort, I find it helpful, and as I indicate below, I feel it improves the paper.
Based on the direction of their rebuttal so far, and assuming continued responsiveness, I am raising my score to a 3.
I am also assuming that the promised revisions will be made (unless I explicitly indicate otherwise, e.g.see point (4) below).
My additional questions/comments are below:
- Dorian explanation: OK this almost made sense, thank you, but I have a followup question:
"restrict pitch classes to Dorian" --> this makes sense.
"additionally use the Am-Em-C-D chord progression for shaping" --> is chord progression 'shaping' implemented differently from pitch class restrictions? Is this done with some additional conditioning? Or do you mean that you gave the Am/Em/C/D chord progression as restrictions in the piano roll, and that this naturally also limits the pitches to the Dorian, since each of those chords is inside (i.e. a subset of) the A dorian scale?
I think all of this relates to the discussion in Sec 3.2, footnote (4), appendix D.1, etc. I had thought I understood fairly well what was happening, but now I am slightly puzzled. To help me understand, could you answer the following: note that a C-major triad can be the chord of C major, or the chord of F major, and might be played differently in each of those cases. But specifying the scale is not as "specific" as specifying the chord. Exactly how do you specify both the scale (which implies the "wrong" notes) and the chord (which implies the "important" notes)? Please make sure that in the final version, you clarify all of this somewhere (could be the appendix), and refer to it appropriately, e.g. when describing the results on Dorian-mode control.
Last-minute edit: Re-reading parts of the paper again, I think (?) I finally understood: The training process allows chord-conditioning at the input level (e.g. "focus on these 4 notes"), whereas the sampling process allows scale-constraints applied at the output (i.e. "set these out-of-key notes closer to zero at each reverse diffusion step").
[2nd EDIT after posting: And critically, the harmony constraint matrix used as conditioning input does not need to be the same as the out-of-key constraint matrix used at sampling! and this is what footnote(4) was about?] Are these [two] edits correct? If so, the paper will be stronger if this simple (but effective!) concept is presented more clearly.
- Comparisons. If there are indeed distinct chord- and scale-conditioning mechanisms (that I didn't previously understand), then are the baseline comparisons still "fair"? They might be; I just would like to hear the authors' view on this. E.g. if inpainting is a baseline for scale-conditioning, then shouldn't the inpainting also allow chord conditioning? (or maybe it already does?) I should reiterate: I really appreciate that the authors added the in-filling baseline in the first place. Even though this baseline turns out to be relatively strong, to me this still strengthens the paper.
- Simplification? Did the authors ever try simply zeroing out the constrained notes at every sample step, rather than gradually "reducing" them (i.e. using eq (8) in eq (2))? My guess would be that it would work almost exactly as well, and be simpler. Proposition 1 would risk seeming slightly less relevant, but it would still provide an interesting approximate theoretical justification.
- Regarding probability of out-of-key notes: I want to clarify that I leave it completely up to the authors to decide what/how to include or not include on this point. Their proposed revision is good too. Like I said in the review: I did (truly) appreciate the argument, once I parsed it. I just got the sense there might be simpler ways to explain it. I.e. I don't want to risk weakening the paper by insisting that the authors remove an intuitive argument that they feel is a contribution in itself. No response needed on this point: I leave this entirely up to the authors in their final version.
Thank you very much for your comments and suggestions. Please allow us to provide response as follows:
- Dorian explanation. Yes, your two edits are absolutely right — the chord input conditioning matrix is distinct from the out-of-key sampling constraint matrix. The input conditioning matrix specifies the intended chord for each measure. Specifically, the chord is encoded into a matrix that highlights the chord tones — the pitch classes that constitute the given chord. In contrast, the sampling constraint matrix is designed in parallel with the conditioning matrix to help regulate the output. Let’s illustrate this with an example:
- Suppose we are generating two measures, the first in C major and the second in C minor. In the input conditioning matrix, the time span corresponding to the first measure [0, T/2) will highlight the chord tones of C major: C, E, and G. The second measure [T/2, T] will then highlight the chord tones of C minor: C, E♭, and G.
- For one plausible version of sampling constraint matrix, the first measure allows all pitch classes in the C major scale (C, D, E, F, G, A, B), while suppressing pitches outside the scale such as C♯, D♯, F♯, G♯, and A♯. The second measure, being in C minor, allows pitches in the C natural minor scale (C, D, E♭, F, G, A♭, B♭) and suppresses out-of-scale tones such as C♯, E, F♯, G♯, and A♯.
The natural question that arises is: how should the sampling constraint matrix be derived from the conditioning matrix? This remains a very open design decision and can be chosen by the user depending on the musical goals. In our demonstration and experiments (except for the Dorian and Chinese style samples), we restrict the harmonic vocabulary to major, minor, and dominant seventh chords. The constraint matrix is then aligned as follows:
- A major chord is associated with the corresponding major scale,
- A minor chord with the natural minor scale,
- A dominant seventh chord with the major scale plus the minor seventh (e.g., for C7: C, D, E, F, G, A, B♭).
To explain, in this correspondence we take the “key” (from the term "out-of-key") to be the “temporary tonic key” implied by the current chord. It would be interesting to try inferring key constraints from consecutive chords!
As for the mode-specific samples, the sampling constraint matrix would be the intersection of the notes allowed by the temporary tonic key and the notes allowed in the style-specific scale. If the chord is A minor and the scale is A-Dorian, we allow pitches in the intersection of A minor scale (A-B-C-D-E-F-G♯) and A-Dorian scale (A-B-C-D-E-F♯-G), which is A-B-C-D-E.
Thank you again for highlighting these points of confusion. We will revise the text to better distinguish between the chord conditioning matrix and the out-of-key sampling constraint matrix. Additionally, we will add an appendix section detailing how the Dorian samples were generated. Specifically, we will provide the chord conditioning matrix, and the sampling constraint matrix.
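To make the Dorian example above concrete, the pitch-class intersection could be computed as in the following sketch (the scale spellings match the example above; the names and dictionary layout are illustrative assumptions):

```python
# Pitch classes: C=0, C#=1, D=2, ..., B=11.
SCALES = {
    "A_minor":  {9, 11, 0, 2, 4, 5, 8},   # A B C D E F G# (as in the example above)
    "A_dorian": {9, 11, 0, 2, 4, 6, 7},   # A B C D E F# G
}

def allowed_pitch_classes(chord_scale, style_scale=None):
    """Allowed pitch classes = scale of the temporary tonic key implied by the
    chord, intersected with the style-specific scale when one is given."""
    allowed = set(SCALES[chord_scale])
    if style_scale is not None:
        allowed &= SCALES[style_scale]
    return allowed

# A-minor chord under the A-Dorian style constraint -> {A, B, C, D, E}.
print(sorted(allowed_pitch_classes("A_minor", "A_dorian")))  # [0, 2, 4, 9, 11]
```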
- Comparisons. The chord condition is also allowed for all the baselines (including WholeSongGen, GETMusic, the inpainting method, as well as the ablation studies with training control), so the comparison is relevant and “fair”. Specifically, for the inpainting baseline, we provide the model with both the chord condition and the scale condition (which serves as the mask for inpainting) as input.
- Simplification. Thank you very much for your advice. We have now added an additional experiment where we zero out the constrained notes. Specifically, in each sampling step, we reset the predicted values at out-of-key positions to 0. The results are shown in https://drive.google.com/file/d/1xMHxW0bNQivPocYgwf84aQOER0-wOSOc/view?usp=sharing. The results are close to those of our original method. Although this method does not reduce the computational cost much, we do agree that it is simpler in terms of formulation.
In fact, “zeroing out” the constrained notes is also theoretically compatible with our framework, grounded in “projection”. To explain, zeroing out can be viewed as projecting the predicted values at out-of-key positions onto the set {0}, while our method projects them onto the set of values at or below the 0.5 threshold. We will add a discussion regarding this to our manuscript.
- Probability of out-of-key notes. Thank you very much for your thoughtful consideration and explanation. We will organize the content according to space constraints and the overall readability of the paper. It might also be logically more coherent to first introduce Proposition 2, and then enhance our argument by discussing the conditional probability.
This paper introduces a Fine-Grained Guidance (FGG) approach for diffusion-based symbolic music generation, addressing precision and controllability challenges.
FGG incorporates harmonic and rhythmic constraints during both training and sampling, ensuring generated music aligns with user intent.
Theoretical analysis bounds the impact of guidance on learned distributions, while experiments demonstrate reduced out-of-key notes, improved chord similarity, and subjective quality.
A demo showcases real-time interactive generation, highlighting practical applicability.
Questions for the Authors
N/A
Claims and Evidence
Yes
Methods and Evaluation Criteria
Yes
Theoretical Claims
Basically yes. The hypotheses seem a little strong, but the empirical results and ablations support them.
Experimental Design and Analysis
The experiments are sufficient for this paper. I am satisfied with the chosen baselines (GETMusic, WholeSongGen), where WholeSongGen is a strong baseline.
The analysis stays firmly around its hypothesis. It highlights the chord, pitch, and rhythm attributes, showing significant results.
I would be happy if more experiments could be conducted on OOD datasets, showing the model's generalisation ability in the wild.
Note that a subjective study is provided in the appendix.
Supplementary Material
I checked the supplementary material, mainly the subjective results, which supports the hypothesis well.
Relation to Existing Literature
N/A
Essential References Not Discussed
I believe current references are enough under the topic of symbolic diffusion-based piano music generation.
Other Strengths and Weaknesses
N/A
Other Comments or Suggestions
N/A
Ethics Review Concerns
N/A
We deeply appreciate the reviewer’s recognition of our work's theory, experiments, and the practicality shown by the interactive demo.
Regarding OOD datasets, we unfortunately have not yet found a high-quality dataset other than the one we used. Alternatively, in the numerical experiments, the songs of the POP909 dataset were split upfront into training and testing sets, to prevent data leakage in the training process.
Meanwhile, although there is no appropriate OOD dataset on which to run a thorough experiment, we used our model to generate stylized music pieces in section 2 of our demo page. Those styles do not appear in the POP909 dataset, but the generated results seem satisfying, which hopefully serves as an indication of our model's generalization ability.
We would love to thank Reviewer 2Mem again for the comments and suggestions, which are very helpful to us.
Thanks. I really acknowledge your continuous efforts on this paper. Good luck.
The paper proposes fine-grained guidance to control harmonic characteristics in diffusion-based symbolic music generation on piano-roll format.
Strengths:
- Reviewers all acknowledged the necessity of precise control of harmonic quality in symbolic music generation.
- The proposed method has a theoretical justification and empirical evidence to support its effectiveness.
- The generated results are of high quality, both in subjective listening tests and interactive demos.
Weaknesses:
- The writing and structure of the paper have clear room for improvement. The reviewers suggested several ideas for the revision, and the authors provided a revision plan to address this issue.
The proposed method clearly achieves its goal. It would benefit symbolic music research by providing powerful controllability, and the broader ML community through potential applications of guiding diffusion with specific fine-grained conditions.