DiffSound: Differentiable Modal Sound Simulation for Inverse Reasoning
We propose a differentiable sound simulation framework for physically based modal sound synthesis and inverse problems
Abstract
Reviews and Discussion
This paper presents an end-to-end framework for inferring the geometry and material properties of objects from the frequency-domain representation of the sound they make. To overcome challenges such as the sparsity of the spectrogram representation, the authors propose a hybrid loss that first uses optimal transport to reach an approximate solution and then uses the L1 loss to refine it. Experiments test material, geometry, and impact-position inference independently.
Strengths
- The paper tackles an important and challenging problem, which is especially relevant given the growing interest in AR/VR applications.
- The writing is good, and the problem and solution are easy to follow, even for a non-expert.
Weaknesses
- I accept that the problem being tackled here is challenging, but the experimentation seems very limited. The paper shows anecdotal examples rather than presenting summary statistics for a larger test set, with examples for illustration.
- How does the method fare in terms of accuracy for different materials, different object sizes, and so on?
Questions
How sensitive is the approach to the placement of the sensor? If the microphone is too far away, do environmental effects influence the results (e.g., reverberation or other material properties that might affect reflectance)?
How much does the complexity of the shape influence reconstruction? For example, what about an intricate, non-convex shape that impedes the direct path to the sensor?
For Equation (10), why are both terms required? One is just a compressed form of the other?
Details of Ethics Concerns
N/A
- Indeed, if the microphone is placed too far away, the model may be susceptible to noise. However, reverberation and reflectance should not affect the results, since our model relies exclusively on the frequency information in the sound, and these phenomena do not alter the mode frequencies.
- In the context of shape reconstruction, we specifically leverage the eigenmodes of objects, making the sensor placement immaterial to the results. Notably, as objects of different shapes can produce remarkably similar sounds, a challenge arises when numerous shapes yield similar sounds to the ground truth, potentially leading to optimization convergence towards an incorrect shape.
- Without the logarithmic transformation, the spectrogram magnitudes decay rapidly, so the loss is dominated by the first few frames. Log spectrograms help capture damping effects over time. However, for modes with large damping factors, relying solely on log spectrograms may overlook their contribution. Hence, a combination of both terms is preferable.
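The trade-off described in the last response can be illustrated with a small sketch. Eq. (10) is not reproduced on this page, so the form below (an L1 term on linear magnitudes plus an L1 term on log magnitudes) is an assumption, and the names `spectrogram_l1`, `eps`, and `log_weight` are hypothetical.

```python
import numpy as np

def spectrogram_l1(pred, target, eps=1e-8, log_weight=1.0):
    """Assumed combined loss: L1 on linear magnitudes plus L1 on log magnitudes."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    linear_term = np.mean(np.abs(pred - target))
    log_term = np.mean(np.abs(np.log(pred + eps) - np.log(target + eps)))
    return linear_term + log_weight * log_term

# Toy example: the magnitude envelope of one mode over 50 frames.
frames = np.arange(50)
target = np.exp(-0.2 * frames)       # reference damping
pred_slow = np.exp(-0.1 * frames)    # prediction with too little damping

# Per-frame errors: the linear error peaks early and then vanishes,
# while the log error keeps growing with the frame index.
linear_error = np.abs(pred_slow - target)
log_error = np.abs(np.log(pred_slow) - np.log(target))

print("frame with largest linear error:", int(linear_error.argmax()))
print("frame with largest log error:", int(log_error.argmax()))
print("combined loss:", spectrogram_l1(pred_slow, target))
```

In this toy case the linear term only "sees" the first few frames, whereas the log term is sensitive to the decay rate over the whole signal, which is consistent with the argument for keeping both.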
Thanks for providing the clarifications -- these make sense.
This paper presents the DiffSound framework, which connects the material parameters of a solid body to the acoustic features of that body in a differentiable manner. Using this model, one can construct a neural network that simulates the audio signal produced by impacting the object, or that infers the object's shape from auditory information.
Strengths
The differentiable simulation is carefully derived from the relevant literature on tetrahedral meshes, generalized eigenvalue decomposition, and superimposed sinusoidal signals.
Weaknesses
My major concern is that I am not convinced of the importance of reasoning about shape geometry from audio signals. I think the audio modality is not informative enough to recover the shape of objects. Indeed, Figure 6 shows a smoothed mesh surface. The shape may be distorted without sufficient voxel constraints.
Is there any application scenario? Perhaps this model may be applied to non-invasive examination of solid structures like impacting the surface and observing the responding signals. However, we cannot see such usages from the current set of experimental results.
Questions
- Eq. (5) to (6)
I could follow the derivation of Eq. (6) from (5). Do you use or some transformation of ?
- What do you mean by hybrid loss?
Is it hybrid because the linear and logarithmic errors are combined as in Eq. (10)? Or does it refer to the use of both the L1 loss and the OT-based loss?
- Using ground truth
In the result of Table 1, how is estimation of critical to reduce the error in the spectrogram? Does the error reduce if ground truth Poisson's ratio is given?
- In the general eigendecomposition definition, .
- The term "hybrid" denotes the use of both the L1 loss and the OT-based loss.
- Indeed, if the ground truth vector is provided, the error will diminish. Conversely, if is poorly estimated, the error will be substantial.
The paper proposes a differentiable sound simulation framework called DIFFSOUND, containing three components. The first component is a differentiable tetrahedral representation, which uses implicit neural representation to encode SDF values and convert the encoded SDF into an explicit tetrahedral mesh. The second component uses a high-order finite element method to optimize material properties and shape parameters. In the end, an additive audio synthesizer synthesizes the sound.
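The last stage of the pipeline, the additive synthesizer, can be sketched as a sum of exponentially damped sinusoids, one per mode. This is a generic modal-synthesis illustration rather than the paper's implementation; the function name `synthesize_modes` and the mode parameters below are made up for the example.

```python
import numpy as np

def synthesize_modes(freqs_hz, dampings, amplitudes, duration=1.0, sample_rate=16000):
    """Sum of damped sinusoids: sum_i a_i * exp(-d_i * t) * sin(2 * pi * f_i * t)."""
    t = np.arange(int(duration * sample_rate)) / sample_rate
    freqs = np.asarray(freqs_hz, dtype=float)[:, None]   # shape (modes, 1)
    damps = np.asarray(dampings, dtype=float)[:, None]
    amps = np.asarray(amplitudes, dtype=float)[:, None]
    modes = amps * np.exp(-damps * t) * np.sin(2.0 * np.pi * freqs * t)
    return modes.sum(axis=0)                              # shape (samples,)

# Two hypothetical modes: a strong, slowly decaying one and a weaker, faster one.
audio = synthesize_modes([440.0, 1230.0], [6.0, 15.0], [1.0, 0.5])

# With a 1 s signal the FFT bins are 1 Hz apart, so the peak bin index
# can be read directly as a frequency in Hz.
peak_hz = int(np.argmax(np.abs(np.fft.rfft(audio))))
print("samples:", audio.shape[0], "dominant frequency (Hz):", peak_hz)
```

Because every operation here is a smooth function of the frequencies, dampings, and amplitudes, the same expression written in an autodiff framework would let gradients flow from an audio loss back to those parameters.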
Strengths
- The idea of building a differentiable sound simulation pipeline is very interesting. While the task is challenging, I am glad that the authors have come up with a solution that will be useful for various applications.
- The components introduced in this work are highly interpretable. To the best of my knowledge, physical properties such as Young's modulus and Poisson's ratio were not modeled in previous audio synthesizers.
- Three inverse problems are studied, and the results look reasonable.
- The supplementary includes the code, which will be useful for reproduction.
Weaknesses
- One main concern is that the writing is very rough. For example, in 3.1 Differentiable tetrahedral representation, there is no formal mathematical definition of the input and output, the INR, the tetrahedral mesh, or the transformation function. The description in 3.1 is high-level and not informative. In 3.3, the loss equations 7 and 10 use the same notation, but they mean totally different things. Section 4 is an odd combination of ablation studies and experiments on three inverse problems. I believe the inverse problems should have their own section, because they are among the main contributions of this work. In tables and figures, labels such as baselines 1, 2, and 3 could be confusing, since there is no corresponding description in the captions. While each of these items could be a minor issue, the overall reading experience is poor.
- I am interested in how fast the optimization can be done for each object, but the paper gives no indication. While it is acceptable that the current approach cannot support real-time applications, the paper should contain an analysis of the optimization time.
- One thing that confuses me is the ground truth eigenvalues. How do you obtain the ground truth eigenvalue? Without supervision on the eigenvalues, the optimization problem becomes much more challenging. Is it possible to optimize with only the audio loss?
- In my own experience, the Wasserstein distance is indeed helpful in bridging the ground-truth and predicted spectrograms. However, the timing of the switch between the Wasserstein loss and the L1/L2 spectrogram loss is undefined. How do you determine 'sufficient convergence'?
Questions
See my questions in the Weaknesses section.
- The experiments concerning shape geometry reasoning are conducted on a synthetic dataset, where ground truth eigenvalues are computed using the standard modal analysis process. For material parameters reasoning, real-world datasets are employed, relying solely on audio loss for optimization.
- Currently, we arbitrarily define 1000 epochs as a point of 'sufficient convergence.'
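The two-stage schedule discussed here can be sketched as follows. The paper's actual OT loss is not specified on this page, so a closed-form 1D Wasserstein-1 distance between normalized spectra is used as a stand-in; `wasserstein_1d`, `hybrid_loss`, and `switch_epoch` are hypothetical names, with the default of 1000 taken from the response above.

```python
import numpy as np

def wasserstein_1d(p, q, bin_width=1.0):
    """Wasserstein-1 distance between two 1D histograms via their CDFs."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(np.abs(np.cumsum(p) - np.cumsum(q))) * bin_width)

def l1(p, q):
    """Plain bin-wise L1 distance."""
    return float(np.sum(np.abs(np.asarray(p, dtype=float) - np.asarray(q, dtype=float))))

def hybrid_loss(pred, target, epoch, switch_epoch=1000):
    """OT-style loss before the switch epoch, L1 refinement afterwards."""
    if epoch < switch_epoch:
        return wasserstein_1d(pred, target)
    return l1(pred, target)

def peak(n_bins, index):
    """Spectrum with all of its energy in a single frequency bin."""
    spectrum = np.zeros(n_bins)
    spectrum[index] = 1.0
    return spectrum

target = peak(100, 50)
near, far = peak(100, 45), peak(100, 10)

# L1 cannot tell a near miss from a far miss when the peaks do not overlap;
# the transport distance grows with the frequency offset.
print("L1:", l1(near, target), l1(far, target))
print("W1:", wasserstein_1d(near, target), wasserstein_1d(far, target))
```

This toy case shows why a transport-based loss is a reasonable first stage for sparse spectra: it provides a signal proportional to how far a mode is from its target, while L1 stays flat until the peaks overlap. A fixed epoch count is the simplest switching rule; a criterion based on the loss plateauing would be one possible alternative.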
The paper describes a differentiable simulation framework for synthesizing the impact sounds of physical objects. The framework is a pipeline that employs a NeRF-like MLP to reconstruct the signed distance function and translate it into the shape of the object. The result is then used by the finite element method to recover the object shape and by an additive synthesizer to generate sound, which is optimized by minimizing the loss between the predicted and ground-truth spectrograms. Experiments are performed on the ObjectFolder-Real dataset, using sounds from 100 objects.
Strengths
- The work adds an MLP-based step that recovers the signed distance function, which assists with object shape recovery and with synthesis of the object's impact sound.
- The synthesized results appear to correspond to the objects and their expected sounds.
- The paper is well written.
Weaknesses
- The choice of baselines is unclear, as is whether these are the strongest possible baselines.
- The experiments are done on 100 objects only.
- The train/validation/test split is not specified, and a thorough quantitative accuracy evaluation on these splits is not presented.
- The technical contribution is limited, since the components of the pipeline are standard. Ablations with extensions of the components are needed to examine whether these are optimal for sound synthesis.
Questions
- The current pipeline is split between a neural network approach and an FEM simulator. Could both steps be modeled with neural networks?
- How would the work compare with impact sound generation from videos through a diffusion model? Su, Kun, et al. "Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
- What is the computational complexity of the pipeline?
- While it is possible to model both steps using neural networks, achieving comparable accuracy to a classic solver is challenging.
- Our paper specifically addresses the inverse problem of inferring physical parameters from impact sound. In contrast, the referenced paper by Su, Kun, et al. focuses on predicting sound from visual features.
- The computational complexity of the pipeline can be approximated as O(MN^2), where N represents the number of vertices and M is the number of modes. This estimation accounts for the most time-consuming aspect, which is the eigen decomposition process.
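For context on the eigen decomposition step, modal analysis solves the generalized eigenproblem K u = λ M u for stiffness matrix K and mass matrix M, and each eigenvalue gives a mode frequency sqrt(λ) / (2π). The sketch below uses a small dense solve on a toy mass-spring chain, which is an illustration only: it is not the authors' solver, and a dense solve does not have the O(MN^2) cost quoted above. The name `modal_frequencies` and the chain parameters are assumptions.

```python
import numpy as np

def modal_frequencies(K, M, num_modes):
    """Lowest mode frequencies (Hz) and mode shapes of K u = lambda M u."""
    L = np.linalg.cholesky(M)                  # M = L L^T (M must be SPD)
    L_inv = np.linalg.inv(L)
    A = L_inv @ K @ L_inv.T                    # equivalent symmetric standard problem
    eigvals, eigvecs = np.linalg.eigh(A)       # eigenvalues in ascending order
    shapes = L_inv.T @ eigvecs                 # map back to generalized eigenvectors
    eigvals = np.clip(eigvals[:num_modes], 0.0, None)
    return np.sqrt(eigvals) / (2.0 * np.pi), shapes[:, :num_modes]

# Toy system: a chain of 5 equal masses joined by equal springs, fixed at both ends.
n, stiffness, mass = 5, 1000.0, 0.01
K = stiffness * (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1))
M = mass * np.eye(n)

freqs, shapes = modal_frequencies(K, M, num_modes=3)

# Analytic frequencies of this chain, used as a sanity check.
j = np.arange(1, 4)
expected = 2.0 * np.sqrt(stiffness / mass) * np.sin(j * np.pi / (2 * (n + 1))) / (2.0 * np.pi)
print("numerical:", freqs)
print("analytic: ", expected)
```

For large FEM meshes, K and M are sparse and only the first M modes are needed, which is why iterative sparse eigensolvers are typically preferred over a dense decomposition like this one.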
The submission simulates and estimates the sound properties of objects. It has received 4 negative reviews, which were critical of the following aspects of the paper:
- Weak experiments on a small dataset,
- Lack of details of the experimental setup,
- Choice of baselines,
- Limited novelty,
- Lack of clarity in presentation and writing.
The authors provided some answers, but only extremely sparingly. Some easy questions have been ignored, for instance the one about the train/validation/test split. In general, the answers were not sufficient.
The AC agrees with the reviewers and judges that the paper is not suitable for publication at ICLR 2024.
Why Not a Higher Score
Why Not a Lower Score
Reject