PaperHub
Overall rating: 4.8 / 10 (withdrawn; 4 reviewers; min 3, max 6, std. dev. 1.1)
Individual ratings: 5, 6, 5, 3
Confidence: 4.3 · Correctness: 2.3 · Contribution: 2.3 · Presentation: 2.3
ICLR 2025

AVSET-10M: An Open Large-Scale Audio-Visual Dataset with High Correspondence

Submitted: 2024-09-27 · Updated: 2024-11-14
TL;DR

An Open Large-Scale Audio-Visual Dataset with High Correspondence

Abstract

Keywords

audio-visual corresponding dataset, sound separation, audio-video retrieval

Reviews and Discussion

Official Review (Rating: 5)

The authors propose a multi-stage filtering approach to improve audio-visual correspondence and remove voice-overs from two public YouTube-based datasets (AudioSet and Panda-70M), yielding a new subset. They further show that this subset serves as effective pre-training data for two audio-visual tasks: retrieval and separation.

Strengths

  • This paper is clearly written and the narrative is straightforward to follow. The proposed dataset is a good contribution to multimodal and audio-visual research.
  • The proposed filtering stages are interesting and relevant. In stage 2, the data-driven approach to analyzing the distribution of similarities is informative. In stage 3, voice-over removal is a necessary step.
  • It is encouraging that the authors provide ongoing mechanisms to update their dataset in order to help preserve privacy.

Weaknesses

  • The proposed mechanisms all rely on pre-trained models, for filtering (ImageBind) and for annotation (PANNs). These pre-trained models can propagate errors through the pipeline, and it would help to understand to what extent such errors are inherited by the processing pipeline. One suggestion is to put a human in the loop to examine a manageable subset.
  • For the benchmark experiment on audio-video retrieval in Section 4.1, it is stated that image features are extracted with ImageBind and video features with InternVid; however, it is not clear how the audio features are obtained. It would be worth stating explicitly, in one or two sentences, whether the ImageBind or FreeBind architecture is used and pre-trained on the proposed dataset. Similarly, it would be worth adding a sentence or two describing CLIPSep so that the reader can follow more easily.
  • It would be worth adding another downstream benchmark for audio-only tasks; several benchmarks, such as HEAR [1], could be used. Even including a subset of these tasks would provide a more thorough view of the proposed dataset.
  • In Section 4, "separation" is missing from the heading "(2) Vision-queried sound".

[1] Turian, J., Shier, J., Khan, H. R., Raj, B., Schuller, B. W., Steinmetz, C. J., ... & Bisk, Y. (2022, July). Hear: Holistic evaluation of audio representations. In NeurIPS 2021 Competitions and Demonstrations Track (pp. 125-145). PMLR.

Questions

  • For stage 4 of the filtering process, what is the performance of the source separation model? This information would help the reader understand the potential errors propagated through the process.
  • In both Table 5 and Table 6, in addition to the two-stage results of pre-training on one dataset and fine-tuning on the other, it would be worth adding an extra row that combines the two datasets in the pre-training phase.
  • A suggestion: since the proposed subset is drawn from two publicly available lists, it would be worth releasing the IDs removed at each stage. This could enable various interesting research directions.
Official Review (Rating: 6)

The authors introduce a new dataset, AVSET-10M, for audio-visual correspondence. The dataset is large-scale, with 10M samples covering the 527 audio categories of AudioSet. The authors take videos from AudioSet and Panda-70M and apply audio-visual correspondence filtering using ImageBind (cosine similarity between image frames and the audio representation) to select videos with high audio-visual correspondence. They filter out videos whose background music or speech narration does not correspond to the visuals; the PANNs network is used to detect musical and speech content. They also recycle some samples by removing speech content with a speech separation model and rechecking for correspondence without the speech. The authors perform audio-visual retrieval and vision-queried sound separation tasks to show the value of the dataset and compare it with existing datasets.
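
For readers unfamiliar with this kind of correspondence scoring, the following is a minimal sketch of frame-audio cosine similarity using the publicly available ImageBind package. The frame sampling, file paths, and averaging over frame embeddings are illustrative assumptions, not details taken from the paper, and the import layout assumes the pip-installable version of the official ImageBind repository.

```python
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda" if torch.cuda.is_available() else "cpu"
model = imagebind_model.imagebind_huge(pretrained=True).eval().to(device)

# Hypothetical inputs: a few frames extracted from one clip plus its audio track.
frame_paths = ["clip_frame_0.jpg", "clip_frame_1.jpg", "clip_frame_2.jpg"]
audio_paths = ["clip_audio.wav"]

inputs = {
    ModalityType.VISION: data.load_and_transform_vision_data(frame_paths, device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
}
with torch.no_grad():
    emb = model(inputs)

# Average the frame embeddings and score the clip by cosine similarity against
# the audio embedding (one possible way to aggregate frames, assumed here).
vision_emb = emb[ModalityType.VISION].mean(dim=0, keepdim=True)
audio_emb = emb[ModalityType.AUDIO]
similarity = torch.nn.functional.cosine_similarity(vision_emb, audio_emb).item()
print(f"audio-visual correspondence score: {similarity:.3f}")
```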

Strengths

  1. The paper is well-presented, clearly written, and easy to follow.
  2. The main contribution for me is the scale of the dataset. With 10M audio-visual corresponding samples, the dataset can be used to develop large multimodal models.
  3. The audio-visual correspondence of such scale can help in building strong audio-visual foundation models.
  4. The correspondence filtering technique used can encourage similar efforts in other audio-visual fields for dataset creation.

Weaknesses

  1. The dataset itself is a large-scale version of an existing audio-visual correspondence dataset, VGGSound.
  2. To a large extent, the quality of the dataset is overly dependent on the performance of the ImageBind model.

Questions

  1. The authors claim that their dataset includes samples where the audio is subtle or silent but still correlates with the visuals. It is not fully clear how their approach ensures this. How can using a fixed number of frames from the video to create a visual representation efficiently identify such a relationship between the audio and visual modalities, where temporal resolution is critical?

  2. While narrations and music are considered non-corresponding elements, they often convey emotional information or describe the subject of the scene. Currently, the authors remove all such samples from the dataset (or provide a recycled version with those elements removed). It could be a good idea to open-source these samples as a separate subset, as non-corresponding samples can also be significant to the community.

  3. This work relies heavily on ImageBind. The authors of ImageBind mention in their paper that "the embeddings may create unintentional associations" as they try to align many modalities. They also note that their model leverages pre-trained image-text embeddings. Here, the authors claim that models trained with image-text embeddings are not ideal for AVC. Although ImageBind does align the other modalities, its basis appears to be image-text. In that case, why was ImageBind chosen for filtering over other audio-visual correspondence models?

Ethics Concerns Details

No real concerns. The video files used are taken from existing open-source datasets, and the authors have said they will support deletion requests.

Official Review (Rating: 5)

This paper introduces AVSET-10M, a large-scale dataset containing 10 million audio-visual samples designed to enhance research in tasks requiring strong audio-visual correspondence. The authors built this dataset by integrating and filtering data from established sources like AudioSet and Panda-70M, emphasizing semantic and temporal alignment between audio and visual components. The dataset spans 527 audio categories and aims to address a long-standing need for larger, more reliable datasets in this field. The dataset construction involved a multi-step process: data collection, audio-visual correspondence filtering, voice-over filtering to exclude irrelevant sounds, and sound separation for recovering useful samples. AVSET-10M’s utility was demonstrated through benchmarks on audio-video retrieval and vision-queried sound separation, showing moderate performance gains. The authors also took care to address ethical concerns by releasing only metadata and setting up mechanisms for data removal requests.
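
As a rough illustration of the voice-over filtering step described above, the sketch below tags a clip with a pre-trained PANNs model and flags it when speech or music dominates. It assumes the panns_inference package; the 0.5 threshold, the CPU device, and the file name are placeholders rather than values from the paper.

```python
import librosa
from panns_inference import AudioTagging, labels

# PANNs expects 32 kHz mono audio with shape (batch, samples).
waveform, _ = librosa.load("clip_audio.wav", sr=32000, mono=True)
waveform = waveform[None, :]

tagger = AudioTagging(checkpoint_path=None, device="cpu")  # downloads a default checkpoint
clipwise_probs, _ = tagger.inference(waveform)             # (1, 527) AudioSet class probabilities

speech_prob = clipwise_probs[0, labels.index("Speech")]
music_prob = clipwise_probs[0, labels.index("Music")]

# Placeholder decision rule: treat high speech/music probability as likely voice-over
# or background music and route the clip to a separation/recycling stage.
VOICE_OVER_THRESHOLD = 0.5
needs_recycling = max(speech_prob, music_prob) > VOICE_OVER_THRESHOLD
print(f"speech={speech_prob:.2f}, music={music_prob:.2f}, recycle={needs_recycling}")
```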

Strengths

  1. AVSET-10M is among the largest publicly available datasets in the audio-visual field, offering a vast array of samples across 527 unique categories. This scale could indeed be useful for training models that require large datasets to capture nuanced audio-visual patterns.

  2. The dataset emphasizes strong alignment between audio and visual components, an important requirement for research involving temporal and semantic precision.

  3. The authors employed a comprehensive filtering process, addressing common data quality issues such as misalignment and irrelevant audio, resulting in a dataset with stronger audio-visual alignment than previous resources.

  4. The authors incorporated privacy protections, including user data removal mechanisms and proactive synchronization with upstream datasets, ensuring that sensitive content can be addressed responsibly.

Weaknesses

  1. While AVSET-10M is a large dataset, the paper lacks methodological novelty, as it primarily uses established filtering techniques on existing datasets rather than introducing new models or technical innovations. For a conference like ICLR, which emphasizes technical contributions, this may be seen as a limitation.

  2. While AVSET-10M showed some gains in performance, the improvements were relatively modest and could perhaps have been achieved through preprocessing alone. This raises questions about whether the dataset fundamentally advances the field or simply provides an incremental improvement.

  3. By filtering out complex audio-visual scenes in favor of pure audio classes, the dataset may not capture the full richness of real-world environments, potentially limiting its relevance for generation tasks or applications where more contextual data is necessary.

  4. The dataset primarily involves filtering existing datasets without much original data collection or annotation. This limited workload may raise questions about the overall contribution, especially given the high standards expected at top conferences like ICLR.

Questions

Given the dataset’s focus, I’m curious about its performance in tasks beyond audio-video retrieval and sound separation. Have the authors explored its applicability in areas such as audio-visual event localization, parsing, or cross-modal generation tasks? Without detailed annotations, the dataset may be limited for generation tasks. Are there plans to include textual descriptions or leverage multimodal annotations that could enhance its utility for tasks needing contextual information?

Official Review (Rating: 3)

This paper introduces AVSET-10M, an audio-visual dataset containing 10 million samples. The authors benchmarked two key tasks on their dataset, AVSET-10M: audio-visual retrieval and vision-queried sound separation. These tasks highlight the importance of precise audio-visual correspondence in advancing audio-visual research.

Strengths

  • The article is easy to follow.
  • The writing is coherent, with illustrative figures that aid understanding.

Weaknesses

  1. Why use Panda-70M? Since Panda-70M consists of high-quality video-caption pairs that only ensure semantic consistency, how do the authors guarantee that their audio-visual semantics are also consistent?

  2. Given that Panda-70M is used as a large base for the audio-visual correspondence dataset, no experiments or data regarding Panda-70M's audio-visual consistency are provided. Is it really appropriate to use Panda-70M?

  3. If the goal is to achieve audio-visual consistency, why not choose many high-quality audio-visual datasets as sources for AVSET-700k, such as VGGSound, Valor, Vast, UnAV-100M, and others?

  4. Regarding AudioSet: it mainly consists of weakly labeled audio events, and many clips contain audio-visually unrelated content. The authors mention a recent paper, Meerkat, that uses this dataset; however, it only utilizes a small, strongly labeled subset (96.5K + 24.1K) out of the 2M videos in AudioSet. Having used AudioSet since 2017, I can confirm that it is very noisy, because it was initially collected with only audio events in mind. This is why the research community began using the VGGSound dataset [1]. Therefore, I am concerned about the data quality used in this study. The validity of the CLIP+CLAP-based selection is questionable without sufficient justification from human annotators.

  5. Why are D_corresponding and D_random each 7,500 samples, while D_non-corresponding is 70,000?

  6. The novelty of the method is insufficient and the motivation is lacking. There are now more fine-grained datasets available, whether with temporally fine-grained annotations or more detailed captions. This makes the proposal of such a dataset unnecessary.

  7. Using ImageBind may not be appropriate. ImageBind is trained on AudioSet for audio-visual alignment, yet the authors use the VGGSound dataset to compute their normal distribution and audio-visual consistency without fine-tuning the model on VGGSound. This leads to unreasonable results.

  8. The setting of μ + 3σ is not explained in detail and no ablation studies have been conducted. (A minimal numeric sketch of this thresholding rule is given after this list.)

  9. Regarding Figure 1, VGGSound has a threshold of 0.18 here. Is this ratio too large and is this threshold setting reasonable?

  10. In Figure 1, 65% of AudioSet samples fall below the μ + 3σ threshold of the non-corresponding distribution N_non-corresponding. This ratio is too high, which suggests that the threshold may not be appropriate. Conversely, if the threshold is suitable, then the source of your "high audio-visual consistency" dataset AVSET is the "low audio-visual consistency", noisy AudioSet. This seems contradictory.

  11. Point 10 notes that 65% of AudioSet samples fall below the μ + 3σ threshold of the non-corresponding distribution N_non-corresponding, whereas for InternVL_IB++ it is 35%. Does the choice of model really make this large a difference? The model choice is crucial for the dataset's consistency: the same samples may exhibit different audio-visual consistency under different models, and the results in Figure 1 and Figure 4 demonstrate these model-dependent differences. However, the authors do not explain why they chose ImageBind among the many audio-visual alignment models, such as OnePeace and CAV-MAE, nor do they conduct ablation studies. Therefore, I find both the principle and the results insufficiently reliable.

  12. The experiments are lacking. There are no ablation studies for the threshold or the model selection, and there are no experiments on how the dataset behaves across different models (compare the experiments in the Panda-70M paper).

  13. In Table 1, why is only the AVE dataset compared? AVE is just one of many audio-visual downstream tasks, and its data volume is not large. Other downstream tasks, such as AVS and AVSS, have many datasets with high audio-visual consistency.
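
To make the μ + 3σ question in point 8 concrete, here is a minimal sketch of the kind of rule the review is asking about: fit a Gaussian to similarity scores of deliberately mismatched (non-corresponding) pairs and keep candidate clips scoring above μ + 3σ. The similarity distributions below are synthetic placeholders, not statistics from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for cosine similarities (placeholders, not paper data):
# 70,000 deliberately mismatched (non-corresponding) pairs and some candidate clips.
sims_non_corresponding = rng.normal(loc=0.05, scale=0.04, size=70_000)
sims_candidates = rng.normal(loc=0.20, scale=0.10, size=10_000)

mu = sims_non_corresponding.mean()
sigma = sims_non_corresponding.std()
threshold = mu + 3.0 * sigma  # the mu + 3*sigma rule questioned in points 8-11

keep = sims_candidates > threshold
print(f"threshold = {threshold:.3f}, retained = {keep.mean():.1%} of candidates")
```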

Questions

See Weaknesses.

Withdrawal Notice

I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.