PaperHub
6.0 / 10
Poster · 3 reviewers
Ratings: 6 / 7 / 5 (min 5, max 7, std 0.8)
Confidence: 2.3
Correctness: 2.7
Contribution: 2.7
Presentation: 2.3
NeurIPS 2024

Differentiable Modal Synthesis for Physical Modeling of Planar String Sound and Motion Simulation

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2024-11-06
TL;DR

We propose a model that simulates a musical string instrument from the physical properties.

Abstract

Keywords

Differentiable Audio Signal Processing · Physical Modeling · Musical Sound Synthesis · Physical Simulation

Reviews and Discussion

Official Review

Rating: 6

The paper presents a differentiable model that can synthesize musical string sound and simulate motion based on physical properties. The method uses a finite-difference time-domain (FDTD) solver to obtain numerical solutions and takes them as ground truth. A differentiable pipeline with neural network components is then used to map physical properties to the output waveform. Experimental results show that it achieves better performance than the Modal synthesis and DDSPish baselines.
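For readers unfamiliar with FDTD schemes, here is a minimal sketch of an explicit update for the ideal 1D wave equation with fixed endpoints. This is a drastic simplification of the stiff, nonlinear string model the paper actually solves; all names and constants below are illustrative, not taken from the paper.

```python
import numpy as np

def fdtd_string_step(u_prev, u_curr, c, dt, dx):
    """One explicit FDTD update for the ideal 1D wave equation
    u_tt = c^2 u_xx with fixed (Dirichlet) endpoints."""
    lam2 = (c * dt / dx) ** 2  # squared Courant number; must be <= 1 for stability
    u_next = np.zeros_like(u_curr)
    u_next[1:-1] = (2 * u_curr[1:-1] - u_prev[1:-1]
                    + lam2 * (u_curr[2:] - 2 * u_curr[1:-1] + u_curr[:-2]))
    return u_next

# Usage: a plucked string (triangular initial displacement) marched for 100 steps.
N = 101
x = np.linspace(0.0, 1.0, N)
u0 = np.minimum(x / 0.3, (1 - x) / 0.7)  # pluck at x = 0.3, zero initial velocity
u_prev, u_curr = u0.copy(), u0.copy()
for _ in range(100):
    u_prev, u_curr = u_curr, fdtd_string_step(u_prev, u_curr, c=1.0, dt=0.009, dx=0.01)
```

The solver output (displacement at every grid point and time step) is exactly the kind of data the paper treats as ground truth for training the differentiable pipeline.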

Strengths

  • To the best of my knowledge, it is the first differentiable method that can generate musical string sound and motion from physical properties.
  • The proposed method is more efficient and accurate than the baselines.

Weaknesses

  • The proposed method relies on FDTD solutions as ground truth, and I am wondering what the gap will be between the simulated solutions and real-world audio data.
  • Following the previous question, there might not be much real-world data available for learning, and material properties are usually unavailable. Is it possible to infer these properties through the differentiable pipeline?
  • Ablation studies are needed for the two MLPs used as modulation layers, as well as for the MLP used as the mode estimator.
  • There are reference typos in L135 and L229.

Questions

See the questions in the weaknesses above.

Limitations

Authors mentioned limitations in the main text.

Author Response

We thank reviewer Hh9k for the extensive review. Below are our responses to your concerns. Each item in the Weaknesses is labeled W1, W2, ... from the top.

  • W1 (The gap between the simulation and the real-world audio data)

    • We summarized some of the main differences between the simulated data (using StringFDTD-Torch [1] or DMSP) and the real-world audio data as below.

      | | Simulated | Recorded |
      | --- | :---: | :---: |
      | Can obtain sound | ✓ | ✓ |
      | Can obtain displacement (i.e. movement) | ✓ | ✗ |
      | Free from measurement errors | ✓ | ✗ |
      | Free from modeling errors | ✗ | ✓ |
      | Close to the 'sounds' of everyday life* | ✗ | ✓ |

      *This is not true for all FDTD simulations, but it is true for the simulations covered in this manuscript.

      The first four rows illustrate the main differences in system modeling.

      • Simulated data has the advantage that the displacements of the strings are directly obtained as solutions of the PDE, giving access to the motion information at all positions with little measurement error. However, since the string system is represented by a parametric PDE, the modeling itself can introduce errors; many validation studies have shown that these modeling errors are small enough [2-3].
      • On the other hand, recorded audio data can be said to have no modeling error, since it drives the actual string as it is, but it suffers a large loss of information because a specific receiver sensor is used to record the audio. The audio data records the vibration of the measurement equipment (e.g. a microphone membrane), transmitted from the vibration of the string to the receiver at a specific location through the medium of the measurement environment. So the displacement of the string is unknown, and measurement errors (such as microphone coloration or room reverberation) are inevitably introduced.

      The last row of the table concerns a slightly different aspect of this systematic difference: the value as a musical instrument.

      • As described in the manuscript, and as the systematic difference suggests, simulated audio is significantly different from real recorded audio, since we are in effect “listening to the displacement” picked up at a particular location on the string. The system in which the vibration of a string propagates through the air and is recorded at the receiver, as we hear in our daily lives, is another active area of research [4]. Based on these studies and ours, we believe that in the future, we will be able to further bridge the gap between simulated data and actual recorded data.

      [1] Lee, J. W., Choi, M. J., & Lee, K. (2024, April). String Sound Synthesizer On GPU-Accelerated Finite Difference Scheme. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1491-1495). IEEE.

      [2] Bensa, J., Bilbao, S., Kronland-Martinet, R., & Smith III, J. O. (2003). The simulation of piano string vibration: From physical models to finite difference schemes and digital waveguides. The Journal of the Acoustical Society of America, 114(2), 1095-1107.

      [3] Ducceschi, M., & Bilbao, S. (2022). Simulation of the geometrically exact nonlinear string via energy quadratisation. Journal of Sound and Vibration, 534, 117021.

      [4] Bilbao, S., & Ahrens, J. (2020). Modeling continuous source distributions in wave-based virtual acoustics. The Journal of the Acoustical Society of America, 148(6), 3951-3962.

  • W2 (Possibility of inferring the material properties through the proposed method)

    • In response to this question, which offers good insight for further research, we share the results of an informal experiment. From what we have tried, we believe that it is not impossible, but it requires some tricks and a well-controlled experimental setup to reach meaningful conclusions. For example, the problem of estimating a string's material properties and initial conditions from single-channel audio recordings is ill-posed, and it is well known [5] that similar problems can have multiple solutions. However, with better DMSP models (as mentioned in the answer to the previous question) and tricks to better address the ill-posedness of the properties, we are optimistic about solving this problem (e.g., inferring the material properties and the initial conditions from the sound).

      [5] Kac, M. (1966). Can one hear the shape of a drum? The American Mathematical Monthly, 73(4P2), 1-23.
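The inference idea discussed in W2 can be illustrated with a toy sketch: gradient descent on a single damping parameter through a differentiable "synthesizer". This is not the DMSP pipeline; the damped-sinusoid model, parameter values, and learning rate below are all illustrative assumptions, chosen to keep the one-parameter problem well-posed.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 1000)
f0 = 110.0  # assume the pitch is known; only the damping is inferred

def synth(decay):
    # Toy differentiable "physical model": an exponentially damped sinusoid.
    return np.exp(-decay * t) * np.sin(2 * np.pi * f0 * t)

target = synth(3.0)  # stand-in for an observed recording

decay = 1.0          # initial guess for the material-dependent damping
lr = 5.0
for _ in range(500):
    y = synth(decay)
    resid = y - target
    grad = np.mean(2 * resid * (-t * y))  # analytic d(MSE loss)/d(decay)
    decay -= lr * grad                    # gradient descent recovers decay ~= 3.0
```

With a single parameter and known pitch the descent converges; the ill-posedness mentioned in the response appears once multiple physical parameters and unknown initial conditions must be inferred from one recording.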

  • W3 (Ablations with the number of MLP layers)

    • In response to the reviewer's question: we found that the difference in performance based on the number of MLP layers is insignificant, which is why we did not report it separately in the manuscript. This is similar to what we said in the text about using GRUs instead of MLPs: there is a small difference in performance, but not enough to change the ranking between models. Modifying the details of each module did not have a significant impact on performance, which is consistent with conventional knowledge: stacking more layers generally performs better, but the gains taper off, so we chose a modest number of layers, taking GPU memory and batch size into account. The most important factor for model performance is how the mode information is utilized, and we found that the decoder structure that modulates a sinusoidal oscillator through AM and FM performs best for motion synthesis.
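The AM/FM-modulated sinusoidal oscillator mentioned above can be sketched as follows. The function and parameter names are ours, not the paper's; in DMSP the envelopes would be predicted by the decoder rather than hand-written.

```python
import numpy as np

def am_fm_oscillator(f_mode, am, fm, sr=16000):
    """Sinusoidal mode oscillator whose instantaneous amplitude (AM) and
    frequency deviation (FM) are modulated per sample (illustrative sketch)."""
    inst_freq = f_mode + fm                       # FM: additive frequency deviation (Hz)
    phase = 2 * np.pi * np.cumsum(inst_freq) / sr # integrate frequency to phase
    return am * np.sin(phase)                     # AM: amplitude envelope

# Usage: a 220 Hz mode with an exponential decay (AM) and a slow 5 Hz vibrato (FM).
sr, n = 16000, 16000
t = np.arange(n) / sr
y = am_fm_oscillator(220.0, am=np.exp(-3 * t), fm=2.0 * np.sin(2 * np.pi * 5 * t), sr=sr)
```

Summing one such oscillator per estimated mode yields the modal part of the synthesized motion.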
  • W4 (Typos and missing references)

    • Thank you so much for pointing this out. Please see the global response (Author Rebuttal by Authors).
Comment

Dear Reviewer Hh9k,

As the discussion between the reviewers and authors is coming to an end, could you please respond to the authors to confirm whether your concerns have been addressed?

Thanks!

AC

Comment

Thanks for the response. I appreciate the clarification and additional ablation studies. My concerns are mostly addressed and I am happy to raise my score.

Official Review

Rating: 7

A computational framework that can approximate the motion of nonlinear strings is proposed. The implementation is differentiable, so one can then train neural nets in the framework as usual.

Strengths

  • The paper is written nicely such that I could follow the basic discussions even if I am not familiar with the topics of speech and audio. It seems to include all the information required in this kind of paper, such as technical background, model description, loss description, and experiments.
  • The empirical performance of the method is clearly superior to the reasonable baselines.

Weaknesses

I have only minor comments as follows. Basically no need to respond to them.

  • In Table 1, the computational complexity part does not consider the computation needed for learning unknown parameters. The same kind of complexity analysis may be difficult for that part, but at least, the fact that such a training procedure happens for some of the methods should be noted somewhere close.

  • Lines 137 and 138: there seems to be confusion about the definition of the initial condition and/or the operator $\mathcal{S}$. First, $u_0 \in \mathcal{U}$ sounds strange if $\mathcal{U}$ is the space of functions $\Omega \times [0,\infty) \to \mathbb{R}$, because $u_0$ does not take the time $t \in [0,\infty)$ as an argument. Second, the domain of $\mathcal{S}$ is the product of $\mathcal{P}$ and the space of initial conditions, isn't it?

  • The role of the noise decoder is unclear. Is it important in the given experiments where the data are purely from the simulator?

Questions

I don't have major questions.

Limitations

Limitations are simply stated in the conclusion section, and they sound reasonable.

Author Response

We thank reviewer gawP for the constructive review. We also thank you for appreciating the novelties of the differentiable string motion synthesis. Below are our responses to some of your comments. Each item in the Weaknesses is labeled W1, W2, ... from the top.

  • W1 (Specifying that the training procedure happens)

    • The authors agree that it's a good idea to clarify that differentiable methods need to be trained in advance. We will mention this in the caption, as we are concerned that adding it as one of the columns in the table could mislead some readers regarding the "Pre-computation" column, which refers to applying the least-squares method to fit the modes to each specific initial condition (IC). As the reviewer recognizes, there are differences between pre-computation and training:

      | | Training | Pre-computation |
      | --- | ---: | ---: |
      | Performed before inference | ✓ | ✓ |
      | Performed every time for new ICs | ✗ | ✓ |
      | Performed every time for new materials | ✗ | ✓ |
      | Typical wall-time required | 0 days | 00 secs |

      In the manuscript, Table 1 only refers to the inference scenario, and we believe that it is difficult to make a rigorous comparison of the computational complexity of training along the same lines. However, we agree with the reviewer that it would be good to note that neural networks require parameter optimization before inference. Following the reviewer's suggestion, we included this in the caption of the table, albeit briefly.
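The least-squares pre-computation step described above can be sketched as follows, assuming ideal-string sine mode shapes. This is illustrative only; the paper's actual mode shapes depend on stiffness and nonlinearity, and the function names here are ours.

```python
import numpy as np

def fit_modal_weights(u0, n_modes):
    """Pre-computation for classical modal synthesis: fit mode weights to a
    given initial displacement by least squares (ideal-string sine basis)."""
    x = np.linspace(0.0, 1.0, len(u0))
    basis = np.stack([np.sin((m + 1) * np.pi * x) for m in range(n_modes)], axis=1)
    weights, *_ = np.linalg.lstsq(basis, u0, rcond=None)
    return weights

# Usage: every new pluck shape (IC) requires re-running this fit, unlike a
# trained network, which amortizes that cost across all ICs.
x = np.linspace(0.0, 1.0, 201)
u0 = np.minimum(x / 0.3, (1 - x) / 0.7)  # triangular pluck at x = 0.3
w = fit_modal_weights(u0, n_modes=40)
```

This is the sense in which pre-computation is "performed every time for new ICs" in the table above, while training is not.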

  • W2 (Problem statement)

    • Please see the global response (Author Rebuttal by Authors).
  • W3 (Role of noise decoder)

    • The role of the noise decoder is not dominant, but it does help model the simulated nonlinear strings. As the nonlinear strings considered herein are modeled as the coupled system between the transverse and the longitudinal vibrations, the motion along the longitudinal axis contains strong near-noisy timbre along with (in)harmonic pitch skeletons (please see Figure 8.10 in [1]). This coupling leads to the appearance of timbres that can be approximated as filtered noise, especially in the transient region of the highly nonlinear strings with large pluck amplitudes. The noise decoder was designed to model such characteristics.

    [1] Bilbao, S. (2009). Numerical sound synthesis: finite difference schemes and simulation in musical acoustics. John Wiley & Sons.
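A minimal sketch of filtered-noise synthesis in the spirit described above, with a hand-picked magnitude response standing in for what the noise decoder would predict (all names and shapes here are illustrative assumptions):

```python
import numpy as np

def filtered_noise(freq_response, n_samples, seed=0):
    """Shape white noise with a magnitude response in the frequency domain,
    a common way to model noise-like transient timbre (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(n_samples)
    spec = np.fft.rfft(noise)
    # freq_response: desired magnitude per rfft bin (length n_samples // 2 + 1)
    return np.fft.irfft(spec * freq_response, n=n_samples)

# Usage: emphasize the high-frequency, noise-like energy of a pluck transient.
n = 2048
H = np.linspace(0.1, 1.0, n // 2 + 1)  # gentle high-pass magnitude curve
y = filtered_noise(H, n)
```

In DMSP the magnitude response would come from the noise decoder; adding this component on top of the modal oscillators captures the near-noisy longitudinal timbre discussed above.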

Per the reviewer's advice, we will add them to the camera-ready version. We plan to add each of the points mentioned in W1, W2, and W3 to the caption of Table 1, Section 3.1 of the main text, and the Appendix. Again, we thank the reviewer for the constructive advice.

Comment

Thanks for the authors' rebuttal, it further clarified the discussion. I would maintain my originally positive score.

Official Review

Rating: 5

This paper proposes Differentiable Modal Synthesis for Physical Modeling (DMSP), a neural-model-based method to predict the vibration of nonlinear strings. DMSP takes the physical parameters of the differential equation as input, predicts the mode frequencies and AM/FM effects, and finally outputs the waveform of the vibration.

Strengths

S1: This paper is among the first attempts to study neural audio synthesis using physical properties, which makes a meaningful contribution in this sub-area.

S2: Section 2 provides some background which is helpful for understanding the context of the proposed method.

Weaknesses

W1: Some method details are not introduced with sufficient clarity. In the loss section, the detailed mathematical forms of the pitch loss and the mode frequency loss are not explained. Is the pitch loss an L2 loss, an L1 loss, or in some other form? What type of regularization is used for the mode frequency loss? Also, since there are typically many mode frequencies for a linear/nonlinear system, it is unclear how many mode frequencies the mode estimator predicts. Is it a fixed number or a variable number? If it is variable, how does the neural model output a variable number of mode frequencies?

W2: As the authors have also discussed, the mode predictor turns out to be not functioning appropriately based on the results in Table 2 (the big gap between DMSP and DMSP-N). Since the mode predictor is a major component of the proposed system, the design of the proposed pipeline is not well-justified.

W3: It is generally unclear what advantages the proposed physics-driven sound synthesis paradigm has over neural-based generative models, such as AudioLM or diffusion models. The latter has been shown to synthesize high-quality sounds, music, and speech. Does the proposed paradigm have higher quality, lower computational overhead, greater generalizability, or better in other aspects than the latter paradigm?

W4: (Minor) There are many empty references (e.g., line 135, line 229) and wrong references (e.g., in line 243, 'Table 1' should probably be 'Table 2'). Also, line 243 says 'the ablation of the mode information is studied..'. Shouldn't this table be the main result? Calling it an ablation result rather than the main result may cause some confusion, especially when Table 3 shows another ablation study result.

W5: (Minor) The problem statement in 3.1 is a bit confusing to me. For example, it is unclear to me why the initial condition $u_0$ is an element of $\mathcal{U}$. $u_0$ is a function on $\Omega \times \{0\}$, whereas $\mathcal{U}$ is a set of $(x, t)$ pairs. Also, in line 138, it seems that the mapping $\mathcal{S}$ is not only a function of $\mathcal{P}$, but also a function of initial conditions. Why is the latter omitted?

Questions

Please see ‘weaknesses’.

Limitations

Limitations are minimally discussed in Section 5. It would be nice to extend the discussion to include the failure of the mode predictor, the computational complexity of DMSP-N, etc.

Author Response

We appreciate reviewer ygxN’s effort in reviewing our paper. Below are the responses to your concerns.

  • W1 (Clarity in methodological details)

    • We clarify the mathematical definition of the loss based on the reviewer's comments. We used the $\ell^1$ distance $\mathcal{L}_{f_0} = \|\hat{f_0} - f_0\|_1$ as the metric over all mode frequencies; however, we found no significant deviation in training results according to the $p$ value of the $\ell^p$-norms.
    • As specified in Appendix D, we used 40 modes, considering the fundamental frequency and the Nyquist limit set by the temporal sampling frequency. This mode count was determined as an integer such that that integer multiple of the maximum fundamental frequency in the dataset is less than or equal to the Nyquist frequency; as the reviewer notes, a more detailed synthesis would be possible if this mode count could be assigned dynamically according to $f_0$, $\kappa$, and $\alpha$.
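The mode-count rule described above can be sketched as follows. The sample rate and maximum fundamental below are example values we chose for illustration, not the paper's dataset settings.

```python
import numpy as np

def max_mode_count(f0_max, sr):
    """Largest integer m such that m * f0_max <= sr / 2 (the Nyquist limit),
    mirroring the counting rule described in the rebuttal."""
    return int(np.floor((sr / 2) / f0_max))

# e.g. with a 16 kHz sample rate and a 200 Hz maximum fundamental:
n_modes = max_mode_count(200.0, 16000)  # 8000 / 200 = 40
```

Any mode above this count would alias, which is why the count is tied to the highest fundamental in the dataset.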
  • W2 (Role of the mode predictor)

    • With all due respect, we disagree with the reviewer’s point for two reasons:

      • The Modal synthesis is pretty much just a solution in itself for linear strings (the difference between Modal and FDTD comes only from the frequency-dependent damping term; compare Eqns. 1 and 3 for $\alpha=1$), so it is perfectly natural for other models, including DMSP, to lag behind Modal, as the training data does not have $\alpha$ exactly equal to $1$, unlike the Linear ($\alpha=1$) test set. Thus, the discrepancy in DMSP's Linear string test result compared to Modal's does not imply that DMSP's mode estimator doesn't work at all.
      • As the title of this paper suggests, our main interest is on planar (thus nonlinear; see Section 2.1) strings. As the Nonlinear string test result suggests, the mode predictor is not a major component: DMSP still outperforms the rest of the baselines and ranks second, suggesting that even with errors in the mode estimator, synthesis results can be better.

      Our statement in L255 means that "errors can occur in estimating the mode through a neural network", and nowhere does the text state that the mode estimator is the main module. However, when it comes to the wording of this sentence, we agree that it could be potentially misleading to readers, and we will refine it to be clearer about what we are claiming.

    • The following additional experimental results further demonstrate how trivial the error in the mode predictor is in the overall proposal.

      | | SI-SDR | SDR | MSS | Pitch | SI-SDR | SDR | MSS | Pitch |
      | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
      | Modal | –3.191 | 0.681 | 18.449 | 0.420 | –16.611 | –1.900 | 17.254 | 2.316 |
      | DDSPish | –39.478 | –2.598 | 11.047 | 5.518 | –25.951 | –2.102 | 9.745 | 3.306 |
      | DDSPish-woFM | –46.609 | –2.257 | 10.911 | 11.304 | –46.858 | –2.272 | 10.299 | 14.013 |
      | DMSP-N | –2.844 | 1.496 | 12.525 | 0.792 | 15.670 | 16.455 | 4.772 | 1.027 |
      | DMSP | –22.298 | –2.000 | 12.504 | 1.717 | –10.315 | 0.221 | 5.656 | 1.437 |

      (Left four columns: Linear ($\alpha=1$); right four columns: Nonlinear ($\alpha>1$).)

      For this additional experiment, we increased the training data and the training time (about a week using a single 2080), while keeping all other network structural details the same. For the Linear results, DMSP still lags behind DMSP-N and Modal, suffering from its disadvantage in mode estimation accuracy, but it clearly outperforms the baselines on nonlinear strings. This emphasizes that the critical part of the overall pipeline is not the error of the mode estimator but rather the decoder.

  • W3 (Comparison with generative models)

    • While we fully understand the reviewer's curiosity, we believe that providing a formal analysis of this point in the text would detract from the main argument:

      1. As we believe the reviewer also agrees with (as noted in Strength S1 of the review), there is no prior work on generative models based on neural networks (such as AudioLM or diffusion) for synthesizing string motion, i.e., the time-dependent displacement of a string represented by the solution of a PDE, as in this study.
      2. The main point of the paper is 'making one of the physical modeling methods (modal synthesis) differentiable facilitates efficient and effective nonlinear string synthesis', and it is considered somewhat out of context to add claims such as 'the proposed model structure and training method is even better/worse than those trained with the diffusion framework and/or LLM training method'.

      Nevertheless, we tried various architectures (such as Transformer and WaveNet) and also trained conditional generation on top of a diffusion framework (like DiffWave [1] or Music Spectrogram Diffusion [2]). Please see the PDF file attached to the global response for a comparison under different architectures. Most attempts failed at motion synthesis, i.e., at modeling the physical correlation of displacement with position; DMSP performs well because it directly utilizes mode information to represent this physical correlation.

      [1] Kong, Z., Ping, W., Huang, J., Zhao, K., & Catanzaro, B. DiffWave: A Versatile Diffusion Model for Audio Synthesis. In International Conference on Learning Representations, 2021.

      [2] Hawthorne, C., Simon, I., Roberts, A., Zeghidour, N., Gardner, J., Manilow, E., & Engel, J. (2022, December). Multi-instrument Music Synthesis with Spectrogram Diffusion. In ISMIR 2022 Hybrid Conference.

  • W4 (Typos and missing references)

    • We thank the reviewers for pointing this out. Please see the global response (Author Rebuttal by Authors).
  • W5 (Problem statement)

    • Please see the global response.
Comment

Thanks for the response. Most of my concerns are addressed and I am happy to raise the score.

Comment

Dear Reviewer ygxN,

As the discussion between the reviewers and authors is coming to an end, could you please respond to the authors to confirm whether your concerns have been addressed?

Thanks!

AC

Author Response

We would like to thank all the reviewers for taking the time to review the paper and for their efforts to improve the quality of the manuscript with their constructive comments. We responded to each reviewer's comments and concerns individually, and the responses below address common points made by reviewers.

  • Missing references, typos, and misleading expressions are revised as follows.
    • L135: "as specified in Equation 4,"
    • L229: "linear wave solution as in Equation 1,"
    • L243: "The efficacy of DMSP is studied as shown in Table 2."
  • Problem statement is revised as follows.
    • We assume that the solution $u : \Omega \times [0, \infty) \to \mathbb{R}$ resides within the Banach space $\mathcal{U}$. For a given PDE parameter $\rho \in \mathcal{P}$ and initial condition $u_0 \in \mathcal{U}_0 \subset \mathcal{U}$ with $u_0 : \Omega \times \{0\} \to \mathbb{R}$, let $\mathcal{S} : \mathcal{P} \times \mathcal{U}_0 \to \mathcal{U}$ denote a nonlinear map, specifically the FDTD numerical solver tailored to the context of this study.

      Assume that we are provided with observations $\{\rho^{(i)}, u_0^{(i)}, u^{(i)}\}_{i=1}^N$, where $\rho^{(i)}$ and $u_0^{(i)}$ are i.i.d. samples drawn from probability measures supported on $\mathcal{P}$ and $\mathcal{U}_0$ respectively, and $u^{(i)} = \mathcal{S}(\rho^{(i)}, u_0^{(i)})$ potentially contains noise. Our goal is to construct an approximation of $\mathcal{S}$, denoted $\mathcal{S}_\theta : \mathcal{P} \times \mathcal{U}_0 \to \mathcal{U}$, and select parameters $\theta^* \in \mathbb{R}^{N_\theta}$ such that

      $$\min_\theta \mathbb{E}_{\rho \sim \mu_{\mathrm{pa}},\, u_0 \sim \mu_{\mathrm{ic}}} \left\| \mathcal{S}(\rho, u_0) - \mathcal{S}_\theta(\rho, u_0) \right\|_{\mathcal{U}} \approx \min_\theta \frac{1}{N} \sum_{i=1}^{N} \left\| \mathcal{S}(\rho^{(i)}, u_0^{(i)}) - \mathcal{S}_\theta(\rho^{(i)}, u_0^{(i)}) \right\|_{\mathcal{U}}$$

      where $\rho^{(i)} \sim \mu_{\mathrm{pa}}$ and $u_0^{(i)} \sim \mu_{\mathrm{ic}}$. Leveraging $\mathcal{S}_\theta$, one can compute the solution $\hat{u} = \mathcal{S}_\theta(\rho, u_0)$ corresponding to a new parameter $\rho \in \mathcal{P}$ and a new initial condition $u_0 \in \mathcal{U}_0$. By specifying values for $x$ and $t$, one can then either synthesize the sound of the string picked up (also referred to as read-out) at a specific location $x_0$ as $\hat{u}(x_0, t)$, or simulate the motion of the string by concatenating $\hat{u}(x, t)$ across all $x \in \Omega$.
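As a concrete reading of the empirical average in the objective, here is a minimal sketch. The solver and model below are toy stand-in callables, not the actual FDTD solver or DMSP.

```python
import numpy as np

def empirical_risk(S, S_theta, rhos, u0s):
    """Monte-Carlo estimate of E||S(rho, u0) - S_theta(rho, u0)|| over observed
    (parameter, initial condition) pairs; S and S_theta are stand-ins here."""
    errs = [np.linalg.norm(S(r, u) - S_theta(r, u)) for r, u in zip(rhos, u0s)]
    return float(np.mean(errs))

# Usage with toy stand-ins: a "solver" that scales the IC and a biased surrogate.
rng = np.random.default_rng(0)
rhos = rng.uniform(0.5, 2.0, size=8)
u0s = [rng.standard_normal(16) for _ in range(8)]
S = lambda rho, u0: rho * u0
S_theta = lambda rho, u0: (rho + 0.1) * u0
risk = empirical_risk(S, S_theta, rhos, u0s)
```

Training $\mathcal{S}_\theta$ amounts to driving this average discrepancy toward zero over the sampled parameters and initial conditions.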

In addition to this answer, the attached PDF file includes the following:

  • a table of synthesis results for the improved model from additional experiments,
  • a table of ablation study for the improved model,
  • a comparison table of training results for various neural network architectures such as Transformer and WaveNet,
  • a scatter plot of the improved scores, and
  • a figure showing the motion synthesis results over time for different pluck positions.
Final Decision

The paper introduces a novel model for simulating the spatio-temporal motion of nonlinear strings by integrating modal synthesis and spectral modeling within a neural network framework.

It is highlighted as the first differentiable method capable of generating musical string sound and motion from physical properties. The empirical performance of the proposed method is noted to be superior to reasonable baselines. The reviewers appreciate that the paper is well-written and easy to follow, even for those not familiar with speech and audio topics. It includes all necessary information such as technical background, model description, loss description, and experiments.

Most of the reviewers' concerns have been well addressed during the rebuttal period.