PaperHub
Rating: 6.5 / 10 (Poster, 4 reviewers)
Individual scores: 6, 7, 6, 7 (min 6, max 7, std 0.5)
Confidence: 4.0 · Correctness: 3.3 · Contribution: 3.0 · Presentation: 3.3
NeurIPS 2024

Learning Spatially-Aware Language and Audio Embeddings

Submitted: 2024-05-16 · Updated: 2024-11-06
TL;DR

We train a model that aligns 3D Spatial Audio with open vocabulary captions.

Abstract

Keywords: multimodal embeddings, spatial audio, contrastive learning

Reviews and Discussion

Official Review (Rating: 6)

This paper presents an approach to learning spatially aware language and audio representations. The authors propose a contrastive representation model that integrates spatial context into language representations, aiming to enhance the performance of tasks that require spatial reasoning. The model combines audio and textual data to create embeddings that are sensitive to spatial relationships. The contributions include an architecture of the proposed model, extensive experimental results demonstrating improved performance on spatial reasoning benchmarks, and an analysis of the model's ability to generalize across different spatial contexts.

Strengths

  1. The paper makes a strong contribution to spatial reasoning. The integration of spatial context into text-to-audio generation models is an important and underexplored area, and this work offers a novel and effective solution.
  2. The experimental setup is rigorous, with well-designed experiments that effectively validate the model's performance.
  3. The paper is well-written, with clear explanations of the methodology and results.
  4. The findings have significant implications for improving spatial language understanding in various applications.

Weaknesses

  1. The reliance on synthetic datasets may limit the generalizability of the findings. The authors could explore ways to train the model on in-the-wild data.
  2. The current interpretation experiments (Sec. 5.4) only study a four-class classification ("left," "right," "up," "down"), which is insufficient for real-life scenarios. For instance, spatial audio applications often require more nuanced classifications, such as distance perception (e.g., strong/weak reverb in indoor/outdoor settings), which are critical for capturing and representing spatial information. The authors should consider extending the experiment to handle a wider range of spatial attributes to enhance its applicability in diverse settings. For example, the authors should consider using prompts like "xxx is making a sound in the distance" and "xxx is making a sound nearby" to figure out if the results are different.
  3. The paper could benefit from a more detailed error analysis, identifying common failure cases and understanding why the model fails in certain scenarios. This analysis would provide insights for further improvement and refinement of the model.
  4. While the model performs well on tasks like retrieval and source localization, its ability to generalize to spatial text-to-audio generation remains to be seen.

Questions

See weakness.

Limitations

The authors have addressed the limitations of their work by discussing the datasets used and the model's computational requirements. Additional limitations can be found in the Weaknesses section.

Author Response

Thank you for taking the time to provide comments and feedback on our submission. We address the questions and concerns below. Please let us know if further clarification is required.

The reliance on synthetic datasets may limit the generalizability of the findings. The authors could explore ways to train the model on in-the-wild data.

The scarcity of publicly available, open vocabulary, labeled spatial audio datasets presents a challenge. While modalities like text, image, and video benefit from vast amounts of internet-sourced data, that is not the case for spatial audio. Collecting and annotating high-quality spatial audio data is considerably more complex, requiring specialized equipment (suitable microphones, headphones, and room measuring equipment) and human expertise for accurate spatial perception and labeling. We acknowledge the importance of exploring in-the-wild data and consider it a valuable future step in our research.

The current interpretation experiments (Sec. 5.4) only study a four-class classification ("left," "right," "up," "down"), which is insufficient for real-life scenarios. For instance, spatial audio applications often require more nuanced classifications, such as distance perception (e.g., strong/weak reverb in indoor/outdoor settings), which are critical for capturing and representing spatial information. The authors should consider extending the experiment to handle a wider range of spatial attributes to enhance its applicability in diverse settings. For example, the authors should consider using prompts like "xxx is making a sound in the distance" and "xxx is making a sound nearby" to figure out if the results are different.

Experiments in Section 5.4 were designed to better understand the embedding space of the audio and text. The experiment referred to in question (paragraph 1 of Section 5.4) does directional classification using a shallow (one hidden and one output layer) MLP on the audio embeddings, and uses the MLP to then classify corresponding text embeddings. For completeness, we have also run these experiments for distance and elevation, and obtained accuracies of 76.5% and 55.1% respectively.
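For concreteness, a probing setup of the kind described above could look roughly like the sketch below (PyTorch). The 512-dimensional embedding size and the four direction classes are taken from the surrounding discussion; the optimizer, epochs, and exact architecture are illustrative assumptions rather than the authors' code.

```python
# Minimal sketch of a cross-modal probe: train a shallow MLP on audio
# embeddings to predict a spatial class, then reuse it on text embeddings.
import torch
import torch.nn as nn

probe = nn.Sequential(
    nn.Linear(512, 64),   # one hidden layer
    nn.ReLU(),
    nn.Linear(64, 4),     # one output layer: "left", "right", "up", "down"
)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_probe(audio_emb, labels, epochs=20):
    # audio_emb: (N, 512) tensor, labels: (N,) long tensor of class ids
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(audio_emb), labels)
        loss.backward()
        opt.step()

@torch.no_grad()
def cross_modal_accuracy(text_emb, labels):
    # classify the *text* embeddings with the audio-trained probe
    preds = probe(text_emb).argmax(dim=-1)
    return (preds == labels).float().mean().item()
```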

We are a little unsure how best to address the suggestion about natural prompts (xxx is making a sound in the distance....). Table 2 actually presents results for zero-shot classification accuracy using diverse prompts of the suggested form. For every spatial attribute, we take language descriptors (as detailed in Table A.3 of the appendix) and insert them into a prompt (Sound coming from <>). We then run classification using cosine similarity of the text and audio embeddings and report the accuracies. Please let us know if we have misunderstood your question; we would be happy to clarify further or run any additional experiments.

The paper could benefit from a more detailed error analysis, identifying common failure cases and understanding why the model fails in certain scenarios. This analysis would provide insights for further improvement and refinement of the model.

Thank you for raising this issue. As discussed in the global response (Exp. B.), we have carried out this analysis for the direction-of-arrival error according to azimuth, elevation, distance, floor area, mean T30, and TUT Sound Events 2018 semantic classes. Results show little variance across all those conditions, though errors are higher at the extrema of the conditions.

While the model performs well on tasks like retrieval and source localization, its ability to generalize to spatial text-to-audio generation remains to be seen.

This study is an initial investigation into aligned representations of spatial-audio and spatial-captions. To measure the success of ELSA, we chose perception-based tasks involving retrieval and source localization because these are well-understood and have clear ways of measuring performance objectively. We agree that spatial text-to-audio generation is an important task, however the success of the generation requires either user-studies or the development of new metrics to measure the quality of spatial audio generation. We are not aware of any established benchmarks for reporting generation quality of spatial audio. For these reasons we opted for tasks for which we can directly measure performance, and leave generative tasks for future work.

Comment

Thank you for addressing most of my concerns. I appreciate your effort on the error analysis, and please make sure to include this in a revision. I’m happy to increase my score to WA.

Comment

Error analysis will be included in the final iteration. We appreciate your thoughtful review of ELSA and your valuable insights. Thank you!

Official Review (Rating: 7)

This paper describes a method for learning to represent spatial audio (and text). The proposed model is trained on synthetically spatialized audio data with corresponding text prompts. The authors evaluate the system on audio captioning/retrieval and localization tasks, showing that the proposed model effectively represents both semantics and locality of audio.

Strengths

The paper is well written, clearly organized, and easy to follow. The proposed method makes intuitive sense and appears to be effective. The empirical evaluation is reasonably thorough and the choice of tasks and baseline models seem appropriate. Spatial audio is a growing area (in the context of machine learning / event detection / representation learning), and I think the work described here does fill an unaddressed area of the literature in an interesting way. Overall I think they did a great job here.

Weaknesses

I don't have much to fault here, but there are a few points that I think could be expanded to improve clarity and help readers understand the contribution here. I'll elaborate on these points in the questions section, but the high-level gloss is:

  • While the spatial representation part of the work (ie FOA-derived input) is explained well, there is almost no explanation of how the spatialization was implemented.
  • There is little (if any) qualitative analysis of the results, only aggregate scores reported in tables.

Questions

  • How was the spatialization implemented? I expect this was done via standard methods (ISM implemented by pyroomacoustics? something else?), but there is no mention of this in either the main text or the appendix. Additionally, I think some of the details from the appendix (Table A.2) should be mentioned directly in the main text, such as the range of room sizes, angle distributions, etc.; these details are important, and do not take much space. (If you do need to sacrifice anything, I don't think the definition of log-mel transformation is as critical to include since it is standard.)
  • Since TUT2018 is a finite vocabulary dataset, it would be informative (and entirely possible) to see a per-class and per-environment breakdown of the evaluations reported in table 1. This would be informative because it's not necessarily a given that your spatialization is equally effective across categories (or rooms / room sizes). If the model does turn out to perform consistently across categories - great! If not, it may suggest a weakness in either the spatial rendering or prompt generation. (If you do compute these results, it may or may not make sense to store in the appendix, depending on how interesting the results are.)

Limitations

The limitations section seems sufficient to me.

One potential caveat here is that the authors do not explicitly mention any limitations imposed by the accuracy of the spatialization process, e.g., whether it will only work well for simulated closed environments (shoebox model) or if it can accurately capture open environments. This I think would be easy to resolve with a bit more information about the process and an extra line in the limitations (if necessary).

Author Response

Thank you for your time reviewing the paper and providing valuable feedback and suggestions. We will do our best to clarify the points that you have raised below. Please let us know if there are further questions.

While the spatial representation part of the work (ie FOA-derived input) is explained well, there is almost no explanation of how the spatialization was implemented. How was the spatialization implemented? I expect this was done via standard methods (ISM implemented by pyroomacoustics? something else?), but there is no mention of this in either the main text or the appendix. Additionally, I think some of the details from the appendix (Table A.2) should be mentioned directly in the main text, such as the range of room sizes, angle distributions, etc.; these details are important, and do not take much space. (If you do need to sacrifice anything, I don't think the definition of log-mel transformation is as critical to include since it is standard.)

Thank you for the suggestion regarding Table A.2. We will move it over to the main paper in the camera ready version. As per your suggestion we will also include more details on the simulation algorithm. Please let us know if there are any other suggestions as well.

The augmentation pipeline mirrors that of Spatial LibriSpeech [1], employing a combination of geometrical- and boundary-element acoustical simulation. This allows us to physically model the way that sources radiate sound, as well as its propagation in both enclosed and open spaces.

[1] Sarabia et al. (2023) Spatial LibriSpeech: An Augmented Dataset for Spatial Audio Learning. InterSpeech.

There is little (if any) qualitative analysis of the results, only aggregate scores reported in tables. + Since TUT2018 is a finite vocabulary dataset, it would be informative (and entirely possible) to see a per-class and per-environment breakdown of the evaluations reported in table 1. This would be informative because it's not necessarily a given that your spatialization is equally effective across categories (or rooms / room sizes). If the model does turn out to perform consistently across categories - great! If not, it may suggest a weakness in either the spatial rendering or prompt generation. (If you do compute these results, it may or may not make sense to store in the appendix, depending on how interesting the results are.)

Thank you for this suggestion. Please refer to the general response and attached PDF for a detailed error analysis (Exp. B.). Briefly, we find little variation across different attributes, though the errors are higher at the extrema of the ranges. We will update the manuscript appendix with this analysis.

Official Review (Rating: 6)

The paper presents ELSA (Embeddings for Language and Spatial Audio), a novel model designed to learn spatially-aware audio and text embeddings using multimodal contrastive learning. The primary aim is to address the limitations of existing audio foundation models, which lack spatial awareness, and sound event localization and detection models, which are constrained to a fixed number of classes and absolute positional descriptions. The authors spatially augment several classical open-source audio datasets in order to train ELSA. Results show that ELSA is able to capture spatial attributes and semantic meaning of the audio.

Strengths

  • The focus of this paper is on learning spatial audio embeddings associated with natural language description, which is a very interesting and rewarding problem for which there is a lack of models.
  • The authors synthesize large amounts of spatial audio from non-spatial audio data under various spatial configurations, which is a valuable contribution to the field of spatial audio understanding.

Weaknesses

  • For this paper, my biggest concern is the generalizability of the model to real scenarios. While the synthetic dataset is extensive, there is a risk that the model might not generalize well to real-world scenarios due to potential biases in simulated environments. To demonstrate generalization to real scenarios, experiments on only a small real-world dataset appear too thin. Would it be possible to test ELSA in other real scenarios, for example, on some of the tasks in the latest DCASE competition, e.g., Sound Event Localization?
  • In terms of writing, too much important information is placed in the appendices, such as the architecture figure of the whole model. Perhaps the layout could be adjusted to make the paper easier to read.
  • The citation format of the article is highly problematic and needs to be standardized.

Questions

The experiments in Table 2 confuse me a lot. In Sec. 5.2, the authors mentioned that "The ELSA text embeddings for such captions are extracted from the pre-trained encoder and compared in a zero-shot fashion with ELSA audio embeddings for samples from the test set using cosine similarity. We classify the match as correct if the spatial attribute in the closest audio sample matches the spatial attribute of the query caption". However, the number of classes of a spatial attribute is very limited (for instance, there are only two classes, "far" and "near", for the "distance" attribute), which means only two captions will be used for the "distance" attribute? Wouldn't there be very few captions used for testing in total? Hopefully, the authors can explain the experimental configuration a bit more.

  • To train ELSA on single-channel audio, the authors repeat the single channel 4 times to fake a 4-channel FOA audio and compute Intensity Vectors. However, the way IV is calculated possibly doesn't make sense for this kind of faked 4-channel audio. Why is it designed this way? Why not try to design a separate feature extractor for single-channel audio?
  • It is natural to understand computing bearing information from spatial audio, which is essentially a bit similar to calculating the "Time Difference of Arrival" based on different channels. But how to understand that the model can get distance information from spatial audio? In other words, where does the information about distance come from?

Limitations

  • As the authors mentioned, for creating a spatial audio caption dataset, using LLM to rewrite the caption might lead to hallucinations.
  • Model performance in real scenarios is yet to be verified.
  • ELSA looks very suitable to be used as a spatial audio encoder for a caption model to conduct spatial audio captioning, but unfortunately, the authors did not show this kind of capability in the paper.
Author Response

Thank you for taking the time to provide comments and suggestions. We address the points raised below.

Would it be possible to test ELSA in other real scenarios, for example, in some of the tasks in the latest DCase competition, e.g. Sound Event Localization?

Model performance in real scenarios is yet to be verified.

Table 1 reports localization performance on the TUT sounds dataset, which is a dataset of real-world room impulse responses encoded as FOA and convolved with clean recordings. We use this dataset specifically because it has properties similar to the synthetic samples we have used for model training. In particular, each sample has a single and stationary point source, which allows us to focus specifically on performance differences due to the sim-to-real gap. Other datasets, such as STARSS23 (latest dataset in the DCASE - SELD task), contain overlapping and moving sound sources (features we will support in future).

Tables 2 and A.6 present the spatial and semantic performance of our model on the spatial-RWD dataset (our real world dataset). We acknowledge that this dataset is small in size. However, it is important to note that its data distribution is significantly different from the training data used for our model. As such, the performance metrics on this dataset offer valuable insights into the model's generalization beyond simulated data.

ELSA looks very suitable to be used as a spatial audio encoder for a caption model to conduct spatial audio captioning .....

We are thankful for this suggestion. As mentioned in the global response we trained a prefix encoder to perform spatial audio captioning. Please refer to the global response (Exp. C.) for implementation details and results.

For paper writing, too much important information is put in appendices, such as the structure figure of the whole model......

We apologize for the inconvenience due to the paper structure. We will move the model architecture to the main paper in the camera-ready version. Please let us know if there are any other suggestions.

The citation format of the article is highly problematic and needs to be standardized.

Apologies. We have standardized the references and ensured everything is consistent for the camera-ready version of the paper.

The experiments in Table 2 confuse me a lot. In Sec. 5.2, ... . However, the number of classes of a spatial attribute is very limited (For instance, there are only two classes "far" and "near" for the "distance" attribute), which means there are only two captions that will be used for the "distance" attribute? Wouldn't there be very few captions being used for testing totally? Hopefully, the authors can explain the experimental configuration a bit more.

This task measures accuracy and alignment of spatial attributes encoded in the audio and text embeddings. For each spatial attribute, we have a finite number of values (e.g., left, right, front, and back for the attribute direction; the full list of attribute/value pairs is detailed in Table A.3 of the appendix). We then generate a text caption for each spatial attribute value (e.g., for direction, it would be “Sound from the left”, “Sound from the right”, and so on). Finally, the class assigned to a test sample is the class of the text embedding with the highest cosine similarity to the audio embedding. This evaluation is conducted over the entire test set. You are correct that there are a limited number of classes per spatial attribute for this task. However, we also report open vocabulary spatial caption retrieval performance in Table A.6 of the appendix. This table reports retrieval accuracy over a much larger (~1000 sample) set of complex spatial captions (e.g., "Sound of an alarm coming from the far left of a large room.") for Spatial-Clotho and Spatial-AudioCaps.
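A minimal sketch of this zero-shot procedure, assuming hypothetical `encode_audio` and `encode_text` functions for the ELSA encoders (the actual prompt wording and encoder API may differ):

```python
# Zero-shot spatial attribute classification via cosine similarity between
# audio embeddings and one caption embedding per attribute value.
import torch
import torch.nn.functional as F

def zero_shot_accuracy(audio_clips, true_values, attribute_values,
                       encode_audio, encode_text):
    # attribute_values: e.g. ["left", "right", "front", "back"] for "direction"
    prompts = [f"Sound coming from the {v}" for v in attribute_values]
    text_emb = F.normalize(encode_text(prompts), dim=-1)        # (V, D)
    audio_emb = F.normalize(encode_audio(audio_clips), dim=-1)  # (N, D)
    sims = audio_emb @ text_emb.T                               # cosine similarities (N, V)
    preds = sims.argmax(dim=-1)                                 # best-matching caption per clip
    true_idx = torch.tensor([attribute_values.index(v) for v in true_values])
    return (preds == true_idx).float().mean().item()
```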

To train ELSA on single-channel audio, the authors repeat the single channel 4 times to fake a 4-channel FOA audio and compute Intensity Vectors. However, the way IV is calculated possibly doesn't make sense for this kind of faked 4-channel audio. Why is it designed this way? Why not try to design a separate feature extractor for single-channel audio?

This is a good question. The duplicated channels result in identical intensity vectors for the x, y, and z dimensions. The spatial attributes branch learns to associate this condition with lack of spatial bias. We tried alternative methods: using zeros for the other three channels, using random values for the other three channels, as well as selectively using the spatial attributes branch only for spatial audio samples. Empirically, our current design produced the best results. We will update the manuscript with this information.
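The effect of channel duplication can be illustrated with a short sketch. The intensity-vector definition below (real part of the conjugate W spectrum multiplied by the X/Y/Z spectra) is a common formulation and an assumption about the exact feature used; the point it demonstrates is that tiling a mono signal to four identical channels yields identical x, y, and z intensities, i.e. the "no spatial bias" condition mentioned above.

```python
# Short-time intensity vectors from first-order ambisonics (W, X, Y, Z).
import torch

def intensity_vectors(foa, n_fft=1024, hop=512):
    # foa: (4, T) waveform ordered as [W, X, Y, Z]
    spec = torch.stft(foa, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    w, x, y, z = spec[0], spec[1], spec[2], spec[3]
    # I_d = Re{ conj(W) * D } for D in {X, Y, Z}
    return torch.stack([(w.conj() * c).real for c in (x, y, z)])  # (3, F, frames)

mono = torch.randn(1, 16000)
fake_foa = mono.repeat(4, 1)              # tile the single channel four times
iv = intensity_vectors(fake_foa)
print(torch.allclose(iv[0], iv[1]), torch.allclose(iv[1], iv[2]))  # True True
```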

It is natural to understand computing bearing information from spatial audio, which is essentially a bit similar to calculating the "Time Difference of Arrival" based on different channels. But how to understand that the model can get distance information from spatial audio?....

Distance is primarily encoded in the DRR (direct-to-reverberant ratio) of the soundfield, as well as the absolute level of the sound. Since our data synthesis is based on physical simulation of soundfields, these biases are inherently present in the resulting training examples, and may be learned by both branches of the audio encoder.

Comment

I thank the authors for their detailed replies, which address many of my concerns. The new experiments are very valuable. Please add them to the revised paper if possible. Consequently, I have raised my score to "Weak Accept".

Comment

The new results will be duly incorporated into the revised paper. We extend our sincere gratitude for your review, suggestions and experiment ideas. Thank you!

Official Review (Rating: 7)

The paper presents ELSA (Embeddings for Language and Spatial Audio), a spatially-aware audio and text embedding model. The training data is created by synthesizing spatial audio in ambisonic format and augmenting text captions with spatial information. A small real-world dataset is also collected for evaluations. The model training itself largely follows standard CLIP/CLAP training by using contrastive losses. Additional losses for direction and distance are added for the spatial part. Evaluations are done on both semantic retrieval tasks and spatial tasks.
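As a rough illustration of the kind of objective this summary describes, a symmetric CLIP/CLAP-style contrastive loss combined with auxiliary direction and distance terms might look like the sketch below. The loss weights, temperature, and mean-squared-error form of the auxiliary terms are illustrative assumptions, not the paper's exact formulation.

```python
# Symmetric contrastive (InfoNCE) loss between paired audio/text embeddings,
# plus auxiliary regression losses for direction and distance.
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature                 # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

def total_loss(audio_emb, text_emb, pred_direction, true_direction,
               pred_distance, true_distance, w_dir=1.0, w_dist=1.0):
    return (contrastive_loss(audio_emb, text_emb)
            + w_dir * F.mse_loss(pred_direction, true_direction)
            + w_dist * F.mse_loss(pred_distance, true_distance))
```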

Strengths

– The paper addresses a key part of multimodal contrastive embeddings. Sounds contain a significant amount of spatial information, and humans naturally rely on directional information from sounds. Considering this, it is natural to expect embeddings that capture spatial information. The paper is a good step in the right direction.

– For the most part, the paper is well done. Spatial audio can present several challenges with respect to data (more so in multimodal settings) and training approach. Considering the challenges around learning from spatial audio, the paper presents a good approach for learning spatially-aware language embeddings. The experiments are also reasonably good.

– The paper is also well written and mostly clear.


Score increased after rebuttal.

Weaknesses

There are a few weaknesses which are worth addressing in the paper.

– For table 2, I would be curious to see what CLAP on its own can achieve. It would be good to contrast this zero-shot classification on the spatial task.

– How were the non-spatial audio-text pairs used in training (as shown in Table 3, last row) ?

– Using non-spatial audio-text seems crucial for good semantic retrieval. This is evidenced by A.6 as well, where the models trained on just spatial audio-text pairs do not do well on the semantic retrieval task. This is a bit surprising. The CLIP loss is still present in training, and the semantics are also intact in spatial audio-text pairs. Why should there be a performance drop in that case? It would be good to provide a discussion and justification.

– In Table A.7, the performance of the model trained on spatial Clotho and Audiocaps is better on RWD data than even on Clotho and Audiocaps itself. That is a bit surprising. We would expect the model to be better in its own domain. The difference is also pretty big.

– The discussion in Section 5.4 is a bit ad hoc. I would suggest not referring to anecdotal observations. The experiments could be better designed.

– Several of the classification experiments end up using 3-4 layer MLPs. I think a shallower model (maybe even just a linear classifier) would provide a better confirmation of what information the embeddings store. Otherwise, such deeper networks are able to push the numbers on their own, and it's not clear how good the embeddings are.

– Some form of clustering and distance visualization would be good. It has been incorporated in some form in Table 2, but it would be good to explicitly show how the distances between embeddings represent the spatial information.

– All the spatial mappings to language are very discrete (A.2). The ranges for distance, direction, etc. can appear a bit arbitrary and forced. While this is perhaps a good first attempt, a more continuous form of "spatial language" is desirable. Alternatively, a perception-driven approach could be taken, where the boundaries are decided by what people generally perceive as left or right w.r.t. sound direction.

Questions

Please address the weaknesses above

Limitations

Please add some limitations.

Author Response

Thank you for your suggestions and comments, which we address here. Please let us know if anything remains unclear.

For table 2, I would be curious to see what CLAP on its own can achieve.

Performance using a pre-trained CLAP checkpoint is close to random for all tasks, which is expected as CLAP is not trained with spatial audio or captions that describe spatial features of audio.

| Attribute | S-Clotho | S-Audiocaps | S-RWD |
|---|---|---|---|
| Distance | 48 | 54.3 | 53 |
| Direction | 28.2 | 29.3 | 27.3 |
| Elevation | 56.7 | 51.3 | 59.4 |
| Room Area | 46.3 | 66.5 | N/A |
| Reverb | 57.3 | 52.5 | N/A |

How were the non-spatial audio-text pairs used in training (as shown in Table 3, last row) ?

We merge the spatial and non-spatial datasets, and use all samples from this mixed dataset in each epoch to learn ELSA. Samples are input to both the spatial attributes encoder and HTSAT (semantic encoder). For the HTSAT encoder, the non-spatial audio is used as is and we pass just the first channel for the spatial audio. For the spatial attributes encoder, which expects four-channel FOA, we use the spatial audio as is and we repeat the single channel non-spatial audio four times. The loss is weighted the same for both forms of input. We will revise Section 4.1 and Figure A.1 for clarity in the camera-ready version.
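A minimal sketch of this input routing (the shapes and function name are illustrative, not the authors' code):

```python
# Route a waveform to the two audio branches: first channel to the semantic
# (HTSAT) branch, and four channels to the spatial attributes branch, tiling
# mono audio when needed.
import torch

def prepare_branch_inputs(wav):
    # wav: (C, T) with C == 1 (non-spatial) or C == 4 (FOA spatial audio)
    semantic_in = wav[:1]                                        # first channel only
    spatial_in = wav if wav.shape[0] == 4 else wav.repeat(4, 1)  # tile mono to 4 channels
    return semantic_in, spatial_in
```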

Using non-spatial audio-text seems crucial for good semantic retrieval....

This is a great question and something we investigated during the architecture design for ELSA. Table A.6 evaluates retrieval on spatial captions while Table 3 (main paper) and the table below evaluate retrieval on semantic captions. It is worth noting that the task in Table A.6 is harder than Table 3 as the model must retrieve a caption with both semantic and spatially correct elements. We will be clearer about this distinction in the camera-ready version of the paper.

| Model | Training Data (Audio) | Training Data (Captions) | AudioCaps T→A R@1/R@5/R@10 | AudioCaps A→T R@1/R@5/R@10 | Clotho T→A R@1/R@5/R@10 | Clotho A→T R@1/R@5/R@10 |
|---|---|---|---|---|---|---|
| CLAP | Non-spatial | Non-spatial | 32.7 / 68.8 / 81.5 | 40.7 / 74 / 84.7 | 14.4 / 37.6 / 50.7 | 18.3 / 40.5 / 55.1 |
| ELSA | Spatial | Non-spatial | 27.1 / 62.7 / 76.1 | 36.6 / 68.7 / 78.4 | 11.3 / 32.6 / 44.4 | 12.4 / 28.3 / 50 |
| ELSA | Spatial | Spatial | 25.3 / 59.3 / 72.5 | 34.8 / 64.5 / 75.2 | 9.9 / 31 / 39.8 | 12.1 / 35.3 / 47.3 |
| ELSA | Mixed | Mixed | 33.2 / 68.2 / 81 | 40.9 / 74.4 / 86.1 | 15 / 36.7 / 50.8 | 20.1 / 43.2 / 55.4 |

We remark that changing from non-spatial audio only to spatial audio only results in drops of 5.6% and 3.1% in T→A R@1 for Audiocaps and Clotho respectively. Similarly, training only with spatial captions results in further drops of 1.8% and 1.4% in T→A R@1 for Audiocaps and Clotho respectively. Mixing both captions and audio restores the performance to the original levels.
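For reference, text-to-audio R@k over paired embeddings can be computed from a cosine-similarity matrix as in the following sketch; this is an assumption about the evaluation code, which is not shown in the paper.

```python
# Compute text-to-audio recall@k for a paired set of embeddings,
# where row i of the text embeddings corresponds to audio clip i.
import torch
import torch.nn.functional as F

def recall_at_k(text_emb, audio_emb, ks=(1, 5, 10)):
    t = F.normalize(text_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    sims = t @ a.T                                        # (N, N) cosine similarities
    ranks = sims.argsort(dim=-1, descending=True)         # audio indices sorted by similarity
    target = torch.arange(t.size(0)).unsqueeze(1)
    hit_rank = (ranks == target).float().argmax(dim=-1)   # rank of the true audio per query
    return {k: (hit_rank < k).float().mean().item() for k in ks}
```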

The discussion in Section 5.4 is a bit adhoc. I would suggest not referring to anecdotal observations.

Apologies, we should not have used the word anecdotal. The experiments are indeed formal. We will rephrase for the camera-ready version of the paper.

We have since run experiments for the remaining attributes to understand the alignment between captions and audio (extension of the experiment in paragraph 1, section 5.4) and obtained accuracies of 76.5% for distance and 55.1% for elevation. We have also performed clustering visualizations on the embeddings (Exp. A. in global response). We are happy to incorporate further changes or experiments that would strengthen this section.

Several of the classification experiments end up using 3-4 layers MLP...

We will correct Section 4.3: all MLPs are two-layer, with a hidden layer of 64 dimensions and an output layer. We will add the parameter counts (33,345 parameters, or only 0.02% of ELSA).

Some form of clustering and distance visualization would be good.

We address this question in the global response (Exp. A.).

All the spatial mapping in terms of the language is very discrete (A.2). The range for distance, direction etc. can appear a bit arbitrary and forced....

We selected our mappings to correspond to general terms commonly used for broad spatial descriptions. We wholeheartedly agree that incorporating more natural spatial language is desirable. The idea of perceptually motivated boundaries is also excellent. Ideally, both of these would be derived by running user studies: firstly, to align with the perception of spatial attributes (e.g., at what azimuth a human would consider a sound to be coming from the left), and secondly, to understand the distribution of language people use to refer to spatial attributes. These user studies are not something that we would be able to set up and have approved within this rebuttal period, but they are something we will strongly consider for future work.

Comment

We made a small mistake when addressing the following point:

Several of the classification experiments end up using 3-4 layers MLP. I think a more shallower model (maybe even just linear classifier) would provide a better confirmation of what information the embeddings store.

In our response, we only mention Section 4.3, but we meant to refer to both Sections 4.3 and 5.4. That is, all the downstream classifiers and regressors we reported in the paper had one hidden layer of 64 dimensions. We will make sure to update the manuscript accordingly.

Comment

Thanks for providing a detailed rebuttal to all the reviews. Overall, it is a good paper on learning spatial audio-text embeddings with certain limitations, including the use of only synthetic data, the performance drop in semantic similarity, spatiality on the language side, etc.
I would recommend that the authors release their trained models and code for reproducibility. I also strongly recommend that the authors incorporate the additional results and clarifications they provided. Assuming all of this is done, I am willing to increase the score.

Comment

We are committed to releasing the relevant model checkpoints, code, and data. Additionally, we will incorporate all experimental findings and discussions from the rebuttal into the main paper. We sincerely appreciate the time and effort you have invested in reviewing our work. Your feedback has significantly contributed to improving our paper, thank you!

Author Response (Global Response)

We are pleased to see such a strong positive sentiment from the reviewers about our work. Reviewers highlighted how interesting and rewarding (SQBX), strong and significant (rKMo), and how rigorous (rKMo) our work is. Reviewers also mentioned how well-written (JMmV, USXb) and intuitive (USXb) the proposed approach is. We agree that the research area is under-addressed (USXb) and under-explored (rKMo), and appreciate that reviewers feel ELSA addresses this gap in a novel and effective (rKMo) way. We want to thank all reviewers for their time and thoughtful feedback.

In response to the reviewers' suggestions, we have conducted the following additional experiments that we believe will be of interest to all:

Exp. A. Visualization of ELSA Embeddings

Figure R.1 in the rebuttal document shows a UMAP projection of the ELSA embeddings from the test sets of Spatial-AudioCaps and Spatial-Clotho. Note that the UMAP projection was guided with the embeddings and labels of the training sets of both datasets. Additionally, we computed the Wasserstein distances between the 512-dimensional embeddings of both test sets:

| | left | right | front | back |
|---|---|---|---|---|
| left | 0.00 | 1.04 | 0.94 | 0.98 |
| right | 1.04 | 0.00 | 0.92 | 0.97 |
| front | 0.94 | 0.92 | 0.00 | 0.81 |
| back | 0.98 | 0.97 | 0.81 | 0.00 |

Overall, both results show that the data clusters well with the direction labels, though there is some degree of confusion between back and front. We carried out the same analysis for distance and floor area, and obtained similar results. We are happy to share with the reviewers if they are interested. We will add these results to the appendix of our paper.
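A sketch of how such an analysis could be reproduced, assuming the umap-learn and POT packages. The authors do not state their exact Wasserstein computation, so the sliced Wasserstein distance is used here as a practical approximation for high-dimensional embedding sets; all settings are illustrative.

```python
# Supervised UMAP projection of embeddings plus pairwise (sliced) Wasserstein
# distances between per-class embedding sets.
import numpy as np
import umap                      # pip install umap-learn
import ot                        # pip install POT

def project_and_compare(train_emb, train_labels, test_emb, test_emb_by_class):
    # train_labels: integer class ids used to guide the 2-D projection
    reducer = umap.UMAP(n_components=2).fit(train_emb, y=train_labels)
    test_2d = reducer.transform(test_emb)            # coordinates for the scatter plot

    classes = list(test_emb_by_class)                # e.g. {"left": (N, 512) array, ...}
    dists = np.zeros((len(classes), len(classes)))
    for i, ci in enumerate(classes):
        for j, cj in enumerate(classes):
            dists[i, j] = ot.sliced_wasserstein_distance(
                test_emb_by_class[ci], test_emb_by_class[cj], n_projections=200)
    return test_2d, classes, dists
```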

Exp. B. Fine-grained direction-of-arrival error analysis

We analyzed the errors of a two-layer MLP trained to regress the direction-of-arrival (same setting as Table 1, last column in the paper). We observe how the errors vary along the following dimensions: source azimuth, source elevation, source distance, room floor area, room mean T30, and TUT Sound Events 2018 semantic classes. Results are rendered as boxplots in figure R.2 in the attached PDF.

Overall, we find there is little variability in the direction-of-arrival error across the studied dimensions. However, we note the errors tend to be higher at the extrema of the dimensions. Regarding the TUT Sound Events 2018 semantic classes, we again find few differences between the different sound classes, with drawer having the lowest error and phone having the highest one.
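A minimal sketch of this kind of breakdown, assuming the direction-of-arrival error is the angle in radians between predicted and ground-truth unit direction vectors (consistent with the error ranges reported below, but still an assumption about the exact metric):

```python
# Angular direction-of-arrival error and a per-bin summary along a condition
# such as source azimuth or distance.
import numpy as np

def angular_error(pred_xyz, true_xyz):
    pred = pred_xyz / np.linalg.norm(pred_xyz, axis=-1, keepdims=True)
    true = true_xyz / np.linalg.norm(true_xyz, axis=-1, keepdims=True)
    cos = np.clip(np.sum(pred * true, axis=-1), -1.0, 1.0)
    return np.arccos(cos)                      # radians, in [0, pi]

def summarize_by_bin(errors, condition, bins=10):
    edges = np.linspace(condition.min(), condition.max(), bins + 1)
    idx = np.digitize(condition, edges[1:-1])
    rows = []
    for b in range(bins):
        e = errors[idx == b]
        if len(e) == 0:
            continue
        rows.append(dict(bin=(edges[b], edges[b + 1]), mean=e.mean(),
                         median=np.median(e), n=len(e)))
    return rows
```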

Exp. C. Spatial Audio-Caption Generation

Inspired by reviewer SQBX’s suggestion, we trained a spatial audio caption generator. Decoding multimodal embeddings into natural language can be achieved by prefixing an autoregressive causal language model [1-4], where the prefix is constructed from a projection of the multimodal embeddings. To facilitate audio captioning using ELSA, we fine-tune a GPT model with 12 attention layers, each having 12 heads. The ELSA embeddings are projected onto the prefix using a single dense layer. With the ELSA encoder frozen, we train the GPT model on the audio embeddings of Spatial AudioCaps, and perform early stopping to avoid overfitting. The following results are obtained with fine-tuning on only 150K embedding-caption pairs from the train splits of Spatial Clotho and Spatial AudioCaps.
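A minimal sketch of this prefix-captioning setup, using Hugging Face's GPT-2 base model (12 layers, 12 heads) as a stand-in decoder; the prefix length, projection shape, and loss masking are illustrative assumptions rather than the authors' implementation.

```python
# Prefix captioning: a frozen audio encoder yields an ELSA embedding, a single
# dense layer maps it to prefix token embeddings, and a GPT-style decoder is
# fine-tuned on (prefix, caption) pairs.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel

class PrefixCaptioner(nn.Module):
    def __init__(self, emb_dim=512, prefix_len=10):
        super().__init__()
        self.gpt = GPT2LMHeadModel.from_pretrained("gpt2")   # 12 layers, 12 heads
        self.prefix_len = prefix_len
        d_model = self.gpt.config.n_embd
        self.project = nn.Linear(emb_dim, prefix_len * d_model)  # single dense projection

    def forward(self, elsa_emb, caption_ids):
        # elsa_emb: (B, emb_dim) frozen audio embeddings; caption_ids: (B, L)
        prefix = self.project(elsa_emb).view(elsa_emb.size(0), self.prefix_len, -1)
        tok_emb = self.gpt.transformer.wte(caption_ids)
        inputs = torch.cat([prefix, tok_emb], dim=1)
        # ignore the loss on prefix positions (-100 is the ignore index)
        labels = torch.cat(
            [torch.full_like(caption_ids[:, :1].repeat(1, self.prefix_len), -100),
             caption_ids], dim=1)
        return self.gpt(inputs_embeds=inputs, labels=labels).loss
```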

We evaluate results on the test splits of both Spatial AudioCaps and Spatial Clotho. We report metrics from the Audio Captioning task of the DCASE Challenge [5] (the metric descriptions are taken verbatim from its associated GitHub page).

| Metric | Short description | Range | Spatial Clotho | Spatial AudioCaps |
|---|---|---|---|---|
| SPIDEr | Mean of CIDEr-D and SPICE | [0, 5.5] | 0.19 | 0.34 |
| FENSE | Combines SBERT-sim (cosine similarity of Sentence-BERT embeddings) and Fluency Error rate (fluency errors in sentences with a pretrained model) | [-1, 1] | 0.59 | 0.68 |
| Vocab | Number of unique words in candidates | [0, ∞] | 1103 | 1258 |

Additionally, we present examples of generated samples from both datasets:

(1) Generated caption

In a medium-sized room located at the far back, an electric motor is emitting a high-pitched whine, accompanied by a whirring noise. In the background, adult male voice can be heard speaking.

(1) Ground truth caption from test set:

From deep within a medium-sized room, the noise of a robust industrial engine can be heard whirring loudly.

(2) Generated caption:

The sound of water flowing and splashing is emanating from the front of a room.

(2) Ground truth caption from test set:

The sound of gentle rowing and paddling in the water is emanating from the vicinity of a medium-sized room.

(3) Generated caption:

The sound of cheering coming from a crowd is heard near the medium-sized room.

(3) Ground truth caption from test set:

The sound of applause, indicating that people are praising the musicians after their performance, is emanating from the medium-sized room.

[1] Mokady et al. (2021) ClipCap: CLIP Prefix for Image Captioning. ArXiV preprint.

[2] Gu et al. (2023) I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision. CVPR.

[3] Kim et al. (2023) Prefix tuning for automated audio captioning. ICASSP.

[4] Deshmukh et al. (2024) Training audio captioning models without audio. ICASSP.

[5] https://dcase.community/challenge2024/task-automated-audio-captioning

Comment

Due to lack of space, we could not include the following tables for Exp. B. in the general response.

  1. Direction-of-arrival error breakdown for azimuth ranges evaluated on test sets of Spatial Clotho and Spatial AudioCaps.
[-3.13, -2.51]: mean: 0.36, stddev: 0.44, median: 0.24, iqr: 0.21, min: 0.00, max: 3.02, n: 272
[-2.51, -1.88]: mean: 0.29, stddev: 0.46, median: 0.16, iqr: 0.17, min: 0.01, max: 3.06, n: 251
[-1.88, -1.25]: mean: 0.23, stddev: 0.43, median: 0.12, iqr: 0.16, min: 0.00, max: 3.09, n: 258
[-1.25, -0.63]: mean: 0.24, stddev: 0.37, median: 0.16, iqr: 0.17, min: 0.02, max: 3.03, n: 245
[-0.63, 0.00]: mean: 0.27, stddev: 0.31, median: 0.19, iqr: 0.16, min: 0.01, max: 2.33, n: 215
[0.00, 0.63]: mean: 0.23, stddev: 0.25, median: 0.18, iqr: 0.15, min: 0.01, max: 1.99, n: 224
[0.63, 1.26]: mean: 0.23, stddev: 0.24, median: 0.16, iqr: 0.15, min: 0.00, max: 1.70, n: 270
[1.26, 1.88]: mean: 0.19, stddev: 0.23, median: 0.12, iqr: 0.18, min: 0.00, max: 1.67, n: 220
[1.88, 2.51]: mean: 0.22, stddev: 0.22, median: 0.16, iqr: 0.16, min: 0.01, max: 1.81, n: 259
[2.51, 3.14]: mean: 0.24, stddev: 0.29, median: 0.18, iqr: 0.14, min: 0.01, max: 2.46, n: 244
  2. Direction-of-arrival error breakdown for semantic classes evaluated on the test set of TUT Sound Events 2018.
[Drawer]: mean: 0.12, stddev: 0.16, median: 0.07, iqr: 0.15, min: 0.00, max: 0.98, n: 97
[Laughter]: mean: 0.16, stddev: 0.22, median: 0.11, iqr: 0.17, min: 0.00, max: 1.95, n: 95
[Cough]: mean: 0.16, stddev: 0.16, median: 0.09, iqr: 0.20, min: 0.00, max: 0.83, n: 87
[Clearthroat]: mean: 0.17, stddev: 0.29, median: 0.10, iqr: 0.14, min: 0.00, max: 1.95, n: 115
[Keyboard]: mean: 0.16, stddev: 0.15, median: 0.15, iqr: 0.19, min: 0.00, max: 1.19, n: 97
[Speech]: mean: 0.14, stddev: 0.17, median: 0.06, iqr: 0.13, min: 0.00, max: 0.88, n: 105
[Phone]: mean: 0.24, stddev: 0.30, median: 0.18, iqr: 0.18, min: 0.01, max: 2.21, n: 117
[Pageturn]: mean: 0.12, stddev: 0.15, median: 0.09, iqr: 0.12, min: 0.00, max: 1.32, n: 115
[Knock]: mean: 0.15, stddev: 0.17, median: 0.11, iqr: 0.13, min: 0.00, max: 1.10, n: 112
[Doorslam]: mean: 0.17, stddev: 0.17, median: 0.10, iqr: 0.18, min: 0.00, max: 0.85, n: 101
[Keysdrop]: mean: 0.15, stddev: 0.21, median: 0.09, iqr: 0.15, min: 0.00, max: 1.45, n: 111
Final Decision

The paper presents ELSA (Embeddings for Language and Spatial Audio), a novel model designed to learn spatially-aware audio and text embeddings using multimodal contrastive learning. The primary aim is to "fuse" existing audio foundation models, which lack spatial awareness, and sound event localization and detection models, which are constrained to a fixed number of classes and absolute positional descriptions. The authors synthesize training data by spatially augmenting several classical open-source audio datasets in order to train ELSA largely following a standard CLIP/CLAP approach with contrastive loss. Additional losses for direction and distances are added for the spatial part. Evaluations are done on both semantic retrieval tasks and spatial tasks. Results show that ELSA is able to capture spatial attributes and semantic meaning of the audio.

Reasons to accept:

  • The paper creates for the first time spatially aware audio embeddings in an LLM framework (with natural language descriptions) -- good novelty. Sounds contain a significant amount of spatial information and humans naturally rely on directional information from sounds. The paper is a good step in adding this capability to machine hearing.
  • For the most part, the paper is well written and the experiments are well executed. Considering the challenges around learning from spatial audio, the paper presents a good approach for learning spatially-aware language embeddings, even though simulated data may not be ideal.
  • The integration of spatial context into text-to-audio generation models is an important and underexplored area, and this work offers a novel and effective approach. The findings have significant implications for improving spatial language understanding in various applications.

  • Authors are committed to releasing the relevant model checkpoints, code, and data

Reasons to reject:

  • The paper uses only synthetic data, and the generalizability/ performance on "real" data remains to be seen, even though early experiments show that the model has potential. There is also a performance drop in semantic similarity, spatiality on the language side, etc
  • While the spatial representation part of the work (ie FOA-derived input) is explained well, there is almost no explanation of how the spatialization was implemented (can probably be addressed in the camera ready) and only four classes (left, right, up, down) are being implemented and tested.
  • A lot of relevant information is in the appendices, and there are some issues in the writing (these can be fixed in the camera-ready).