PaperHub
Overall rating: 5.3/10
Poster · 4 reviewers
Individual ratings: 5, 5, 7, 4 (min 4, max 7, std 1.1)
Confidence: 4.0
Soundness: 2.3 · Contribution: 2.3 · Presentation: 2.5
NeurIPS 2024

Extending Multi-modal Contrastive Representations

OpenReview · PDF
Submitted: 2024-05-13 · Updated: 2024-11-06

Abstract

Multi-modal contrastive representation (MCR) of more than three modalities is critical in multi-modal learning. Although recent methods showcase impressive achievements, the high dependence on large-scale, high-quality paired data and the expensive training costs limit their further development. Inspired by recent C-MCR, this paper proposes $Ex$tending $M$ultimodal $C$ontrastive $R$epresentation (Ex-MCR), a training-efficient and paired-data-free method to build unified contrastive representation for many modalities. Since C-MCR is designed to learn a new latent space for the two non-overlapping modalities and projects them onto this space, a significant amount of information from their original spaces is lost in the projection process. To address this issue, Ex-MCR proposes to extend one modality's space into the other's, rather than mapping both modalities onto a completely new space. This method effectively preserves semantic alignment in the original space. Experimentally, we extend pre-trained audio-text and 3D-image representations to the existing vision-text space. Without using paired data, Ex-MCR achieves comparable performance to advanced methods on a series of audio-image-text and 3D-image-text tasks and achieves superior performance when used in parallel with data-driven methods. Moreover, semantic alignment also emerges between the extended modalities (e.g., audio and 3D).
Keywords
Multi-modal learning; Representation Learning

Reviews and Discussion

Review
Rating: 5

This paper introduces the Extending Multi-modal Contrastive Representation (Ex-MCR), an efficient method for learning multi-modal contrastive representations without relying on paired modality data. Unlike the previous C-MCR scheme, which connects two pre-trained modalities to a new embedding space, Ex-MCR extends one modality (leaf) to another (base), while the base modality remains frozen. This approach has the merit of preserving the alignment of pre-trained modalities in their original space while facilitating a unified modality space via an overlapping modality (e.g., using text to map audio-text to text-image). Experimentally, pre-trained audio-text and image-3D representations are aligned within the image-text modality space without requiring paired modality data, achieving superior performance over other methods.

Strengths

  • It effectively addresses the limitations of previous work, which struggled to preserve multi-modal alignment in the original representation spaces. This is validated by experiments showing that the trained Ex-MCR method also yields promising results in audio-text and image-text retrieval tasks, as compared to C-MCR, detailed in Table 1.

  • Such improvements are possible even without paired multi-modal data and without requiring large computational resources for training, demonstrating its efficiency. This method also shows positive potential for aligning audio to 3D modalities.

  • The idea of treating intra-modal alignment and inter-modal alignment differently is interesting, with linear layers and MLP layers used respectively in light of the similarity of the representation spaces. This design choice is validated by the experiments in Tables 5 and 6.

Weaknesses

  • My major concern lies in the overall clarity of this paper. Although Section 3.1, which explains the underlying motivation, was clear, I found Fig. 1(b) and Section 3.2.1 difficult to understand despite several rounds of reading.

    • For Figure 1(b), it is difficult to interpret what is happening in each modality. It is suggested that more illustrative figures be depicted and supplemented with detailed captions.

    • For Section 3.2.1, overall clarity is limited. It seems to describe the construction of pseudo-modality pairs by retrieving similar embeddings from other modalities one by one, incorporating various enhancing techniques. In Equation 1, the notations are inconsistent between the first and second lines: for consistent audio embeddings, T^A is used, while for consistent image embeddings, V^I is used. In the formulation for retrieving similar audio samples given text as a query, should it be A^A instead of T^A in the first line? In a similar context, it would be desirable to verify the notations in Equation 2. Moreover, the clarity of the process and the general description of the modality-centric data pool could be further enhanced.

  • No clear evidence can be found that Ex-MCR improves the alignment of inter-modalities over C-MCR. For audio-image and 3D-image tasks in Tables 1 and 2, Ex-MCR conveys only slight improvements over C-MCR.

  • Sensitivity analysis on the hyper-parameters such as $\lambda$ and $\tau$ is not provided.

Questions

  • From [L303-304], the terms 'audio-visual' and 'audio-image' seem to have duplicative meanings.

  • As an oracle, how can the proposed Ex-MCR scheme boost performances when trained on paired modality data?

Limitations

It adequately addresses the limitations.

Author Response

W1: Statements and Figure in the “Various Modality-centric Data” part

Thanks for pointing out the typos and providing suggestions about the figures; we have fixed them in the new version.

1.1 Unclear Figure 1(b) and its caption.

We have uploaded Figure 1(b) with a detailed caption in the newly submitted PDF.

  • We replace "softmax" with "A->B similarity" in the figure to emphasize that, when using Audio or Image as the query, the shared element in the text aggregation process is the similarity weights, not the softmax operation itself.
  • We add a more detailed caption to better illustrate the overall pipeline of pseudo pair construction.

1.2 Typos in Eq(1)

We sincerely apologize for the misunderstanding caused by the typo. The correct formula (1) is as follows:

$$\tilde{\mathbf{t}}_i^A = \mathbf{t}_i^A; \quad \tilde{\mathbf{t}}_i^I = \mathbf{t}_i^I; \quad \tilde{\mathbf{a}}_i^{A} = \operatorname{softmax}\big((\tilde{\mathbf{t}}_i^A \cdot \mathbf{T}^{A})/\tau_1\big) \cdot (\mathbf{A}^{A})^{T}; \quad \tilde{\mathbf{v}}_i^{I} = \operatorname{softmax}\big((\tilde{\mathbf{t}}_i^I \cdot \mathbf{V}^{I})/\tau_1\big) \cdot (\mathbf{V}^{I})^{T}$$

Eq. (2) is correct. When audios serve as queries to construct pairs, the first step is to obtain the relevant text embeddings through the inter-modality alignment of CLAP. Note that $\tilde{\mathbf{t}}_i^A$ and $\tilde{\mathbf{t}}_i^I$ always share the similarity weights retrieved by CLAP (the softmax term in the formula) during data aggregation. Next, we use $\tilde{\mathbf{t}}_i^I$ and the inter-modality alignment of CLIP to obtain the relevant $\tilde{\mathbf{v}}_i^I$.
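For concreteness, the text-centric aggregation in the corrected Eq. (1) can be sketched as follows. This is a minimal NumPy sketch; the array layouts, variable names, and the value of tau1 are illustrative assumptions, not the released implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Assumed shapes (illustrative only):
# t_A: (d,)   CLAP text embedding of one caption
# T_A: (N, d) CLAP text embeddings of the audio-side memory
# A_A: (N, d) CLAP audio embeddings paired row-wise with T_A
# t_I: (d,)   CLIP text embedding of the same caption
# V_I: (M, d) CLIP image embeddings of the image memory
def aggregate_pseudo_pair(t_A, T_A, A_A, t_I, V_I, tau1=0.05):
    # similarity weights over the audio-side memory, computed in CLAP space
    w_audio = softmax(t_A @ T_A.T / tau1)   # (N,)
    a_tilde = w_audio @ A_A                 # aggregated pseudo audio embedding, (d,)
    # similarity weights over the image memory, computed in CLIP space
    w_image = softmax(t_I @ V_I.T / tau1)   # (M,)
    v_tilde = w_image @ V_I                 # aggregated pseudo image embedding, (d,)
    # one pseudo (text_CLAP, text_CLIP, audio, image) tuple
    return t_A, t_I, a_tilde, v_tilde
```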

W2: Explanation of the inter-modality performance improvement compared to C-MCR

To avoid misunderstanding, we further clarify our experimental settings in the response to all reviewers (Part 1). Benefiting from the fully enhanced alignment learning pipeline described in Section 3.2, Ex-MCR makes significant progress in preserving the existing inter-modality alignments in the leaf space (Audio-Text and 3D-Image) and steadily improves the newly learned inter-modality alignments (Audio-Image and 3D-Text), compared to C-MCR.

Most importantly, Ex-MCR fully inherits the inter-modality alignments in the base space, significantly outperforming C-MCR. The fully preserved base space brings strong modality scalability: with image or text serving as the overlapping modality, various multi-modal contrastive learning spaces can be projected into a unified Ex-MCR space.

W3: Analysis of hyperparameter choices

Please refer to the response to all reviewers (Part 3).

Q1: Inconsistent wording

Thanks for pointing out the inconsistencies in wording. "audio-vision" and "audio-image" have the same meaning in the paper, and we have modified these two expressions into "audio-image".

Q2: trained on paired modality data

We finetuned Ex-MCR-huge on paired Audio-Text data (the training set of Audiocaps). The results are as follows:

| Model | Flickr R@5 | AVE R@5 | VGGSS R@5 | ACaps R@5 |
|---|---|---|---|---|
| Ex-MCR-huge | 6.16 | 7.36 | 11.77 | 59.60 |
| Ex-MCR-huge finetuned on paired data | 6.46 | 7.48 | 11.94 | 60.43 |

Using paired data in training not only improves the alignment performance between the corresponding two modalities (Audio-Text) but also helps the model obtain better alignment between other modalities (Audio-Image).

Comment

Thank you for the response, which addressed many of my initial concerns.

To further clarify my previous question, I believe it is crucial to present the oracle performance to validate the uni-modal training scheme. Specifically, how well does the model perform when trained on paired audio-image datasets? In addition, how much does the proposed Ex-MCR model benefit from a small amount of such paired audio-image training data after fine-tuning?

Comment

Dear Reviewer V8oT,

Thanks for your comments. To address your concerns, we conducted experiments on Ex-MCR-huge using paired audio and images from Audioset. As you mentioned in your review, we performed training from scratch and fine-tuning (1000 steps, less than 15 minutes) on the "standard" Ex-MCR-huge model in the paper, using these paired data. Please note that the relevant text embeddings are still constructed through retrieval and aggregation. The experimental results are as follows:

| Model | Flickr R@5 | AVE R@5 | VGGSS R@5 | ACaps R@5 |
|---|---|---|---|---|
| Ex-MCR-huge | 6.16 | 7.36 | 11.77 | 59.60 |
| Ex-MCR-huge trained on A-V paired data | 8.63 | 12.44 | 18.05 | 62.63 |
| Ex-MCR-huge finetuned on A-V paired data | 9.37 | 13.09 | 19.84 | 63.18 |

All retrieval metrics in both experimental settings outperform the "standard" Ex-MCR-huge, especially the Audio-Image metrics. Similar to the results from fine-tuning on Audio-Text paired data, Ex-MCR's alignment on both Audio-Image and Audio-Text tasks benefits from the Audio-Image pairs.

Furthermore, the experimental results show that fine-tuning the "standard" Ex-MCR-huge outperforms training from scratch. We believe that the pseudo pairs aggregated by the Ex-MCR method and the real data pairs provide complementary knowledge.

We would be happy to discuss our paper in detail if you have additional comments.

Best regards, Authors

Comment

Thank you for presenting the additional experiments.

Despite using audio-image paired datasets for training, what could be the reason for the low performance (R@5) in audio-image retrieval tasks such as Flickr and AVE? Could you provide any insights into the reasons behind this?

Comment

Dear Reviewer V8oT,

Ex-MCR utilizes a pre-trained CLAP audio representation extractor, where audio is inherently aligned with text rather than images. Audio representations aligned with text and audio representations aligned with images contain complementary information. Therefore, to maintain the alignment with text, we aggregate pseudo-text embeddings for the ground-truth Audio-Image pairs from Audioset and limit the fine-tuning steps. As the results below show, over-fitting to the Audio-Image pairs can affect the Audio-Text alignment.

| Fine-tuning steps | Avg. Audio-Image R@5 | Avg. Audio-Text R@5 |
|---|---|---|
| 0 | 8.43 | 59.60 |
| 500 | 13.24 | 62.11 |
| 1000 | 14.10 | 63.18 |
| 2000 | 15.81 | 62.74 |
| 5000 | 16.06 | 61.68 |
| 8000 | 16.23 | 59.27 |

On the other hand, averaging the audio representations that are respectively aligned with text and with images in the same space is a training-free, more economical, and effective way to balance Audio-Text and Audio-Image alignment than conducting large-scale training on paired data. The experimental results are as follows (a minimal sketch of this averaging is given after the table):

| Model | Flickr R@5 | AVE R@5 | VGGSS R@5 | ACaps R@5 |
|---|---|---|---|---|
| ImageBind | 20.78 | 40.11 | 35.67 | 27.47 |
| IB+Ex-MCR-huge | 21.26 (+0.48) | 38.95 (-1.16) | 37.55 (+1.88) | 47.44 (+19.97) |
| IB+Ex-MCR-huge finetuned on A-V paired data | 21.85 (+1.07) | 40.15 (+0.04) | 38.55 (+2.88) | 48.18 (+20.71) |
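The train-free averaging used for the "IB+Ex-MCR-huge" rows above amounts to something like the following sketch. This is only an illustration of the idea; the function name and the choice to L2-normalize before and after averaging are our assumptions.

```python
import torch.nn.functional as F

def fused_audio_embedding(imagebind_audio_emb, exmcr_audio_emb):
    """Average an ImageBind audio embedding with the Ex-MCR-projected CLAP
    audio embedding, assuming both already live in the same base space."""
    a1 = F.normalize(imagebind_audio_emb, dim=-1)
    a2 = F.normalize(exmcr_audio_emb, dim=-1)
    return F.normalize((a1 + a2) / 2, dim=-1)
```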

We regret that we could not conduct experiments at the same scale as ImageBind due to limited time. However, the results above clearly demonstrate that Ex-MCR, by integrating representations from different sources of knowledge, achieves significantly better overall performance than ImageBind at a much lower additional computational cost.

Thank you again for your comments and we hope the experimental results and analysis provided above can address your concerns. If you have any other questions, please feel free to discuss them with us.

Best regards, Authors

Comment

Thank you for the additional experiments and further clarifications.

I believe the clarity of the paper has improved based on the authors' responses. I will increase the score to 5.

Review
Rating: 5

This paper focuses on multi-modal contrastive representation of more than three modalities in multi-modal learning. To address the flaws of existing works, such as the dependency on large-scale, high-quality paired data and the expensive training cost, this paper introduces Ex-MCR, a training-efficient and paired-data-free method to build unified contrastive representations for multiple modalities. The key idea of the proposed method is to extend one modality's space into the other's. Experiments extend pre-trained audio-text and 3D-image representations to the existing vision-text space.

Strengths

  1. The idea of preserving one modality's space is straightforward and interesting.

  2. The writing is easy to follow and the figures clearly show the technical designs.

Weaknesses

  1. The abstract's claim that 'a significant amount of information from their original spaces is lost in the projection process' and the statement 'C-MCR mainly focuses on learning a new space for the two non-overlapping modalities, while the modality alignments in powerful original pre-trained spaces are forgotten' should be supported by explicit evidence, experimental results, or theoretical analysis.

  2. There are various methods in the multi-modal learning area, and projection/projector-based methods are not the only option for binding different modalities. Importantly, there is overlapping information across different modalities, such as semantics, as pointed out by [1], and experiments in [1] have proved this.

    [1] UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All, CVPR 2024.

  3. The claim in the contribution part, lines 48-51, 'Such as simple yet effective approach maximizes the preservation of modality alignment within base space, demonstrating great potential for augmenting existing unified space and integrating more pre-trained spaces,' should be supported by examples conducted by the authors to validate this claim.

  4. The proposed method also integrates projectors, i.e., the decoupled projector. What are the differences between the proposed one and existing ones?

  5. Many related works are not mentioned, not only in the related work section but also in the experimental comparison. These include but are not limited to:

    [2] Imagebind: One embedding space to bind them all, CVPR 2023.

    [3] Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following, arXiv 2023.

    [4] Onellm: One framework to align all modalities with language, CVPR 2024.

    [5] Meta-transformer: A unified framework for multimodal learning, arXiv 2023.

Questions

  1. Evidence for Information Loss:

    • Can the author provide experimental results or theoretical analysis to support the claim of information loss during the projection process?
  2. Support for Modality Alignment Forgetting:

    • Are there specific examples or experiments that demonstrate the forgetting of modality alignments in original pre-trained spaces?
  3. Comparison with Non-Projection-Based Methods:

    • How does the proposed method compare with non-projection-based methods, especially those highlighting overlapping semantic information as discussed in UniBind?
  4. Validation of Contribution Claims:

    • Can the author provide examples or experiments that validate the claim about preserving modality alignment and integrating more pre-trained spaces?
  5. Differences in Decoupled Projector:

    • What are the key differences between the proposed decoupled projector and existing projectors?
  6. Inclusion of Related Works:

    • Can the author include comparisons and discussions involving the following works to provide a more comprehensive view?
      • Imagebind: One embedding space to bind them all, CVPR 2023.
      • Point-bind & point-llm: Aligning point cloud with multi-modality for 3D understanding, generation, and instruction following, arXiv 2023.
      • Onellm: One framework to align all modalities with language, CVPR 2024.
      • Meta-transformer: A unified framework for multimodal learning, arXiv 2023.

Limitations

N/A

Author Response

W1 & Q1 & Q2: Evidence for Information Loss & Support for Modality Alignment Forgetting

Taking the reprojection of CLIP representations into the C-MCR space as an example, the following table (a sub-table of Table 1) shows the performance differences before and after the projection:

| | COCO R@1 | COCO R@5 |
|---|---|---|
| Before reprojecting | 40.24 | 64.78 |
| After reprojecting | 16.67 | 37.04 |

In the reprojected CLIP representations, the Image-Text retrieval performance is significantly reduced, which indicates that the modality alignment of the original CLIP space is forgotten during the reprojection.

W2 & Q3: Comparison with Non-Projection-Based Methods

We compared Ex-MCR with non-projection-based methods: LanguageBind [1], UniBind [2], and PointBind [3] (note that PointBind shares the same Audio-Image-Text models as ImageBind). For the Audio-Image-Text experiments, we conduct the comparison under the same Image-Text performance.

The results are as follows:

Results of Audio-Image-Text experiments. The best results are bolded.

| Model | Flickr R@1 | Flickr R@5 | AVE R@1 | AVE R@5 | VGGSS R@1 | VGGSS R@5 | Audiocaps R@1 | Audiocaps R@5 | COCO R@1 | COCO R@5 |
|---|---|---|---|---|---|---|---|---|---|---|
| AudioCLIP | 1.37 | 4.91 | 0.61 | 2.65 | 1.25 | 3.94 | 3.53 | 11.30 | 17.51 | 37.50 |
| WAV2CLIP | 0.82 | 3.41 | 0.95 | 4.24 | 2.51 | 10.47 | 0.88 | 4.22 | 40.24 | 64.78 |
| C-MCR | 1.39 | 5.69 | 1.25 | 4.49 | 1.94 | 7.69 | 15.76 | 41.37 | 16.67 | 37.04 |
| ImageBind (PointBind) | 7.68 | 20.78 | **18.00** | **40.11** | 14.82 | 35.67 | 9.24 | 27.47 | **57.28** | **79.54** |
| LanguageBind | 1.52 | 6.36 | 1.49 | 5.96 | 2.55 | 9.86 | 12.42 | 36.7 | 53.24 | 76.48 |
| UniBind | 7.74 | 20.87 | 17.78 | 39.96 | 14.14 | 35.74 | 9.24 | 27.47 | **57.28** | **79.54** |
| Ex-MCR-huge+Imagebind | **7.92** | **21.26** | 17.11 | 38.95 | **15.49** | **37.55** | **18.34** | **47.44** | **57.28** | **79.54** |

Ex-MCR achieved leading performance in 4 of the 6 Audio-Image retrieval metrics and a significant advantage in Audio-Text semantic alignment.

For the 3D-Image-Text experiments, to explore the upper bound of performance in advanced pre-trained spaces, we use Ex-MCR to fuse EVA-CLIP-18B (Image-Text) and Uni3D-G (3D-Image-Text) (please refer to the response to all reviewers). The newly constructed 3D-Image-Text space achieves state-of-the-art performance in all three domains (3D-Image, 3D-Text, Image-Text), compared with non-projection-based methods.

Moreover, since Ex-MCR is extremely flexible, its performance can easily be boosted by, and benefit from, breakthroughs in advanced pre-trained spaces.

[1] LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment, ICLR 2024

[2] UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All, CVPR 2024

[3] Point-Bind & Point-LLM: Aligning 3D with Multi-modality, arXiv 2023

W3 & Q4: Validation of Contribution Claims

(1) Claim about preserving modality alignment:

Ex-MCR preserves modality alignment in two ways:

On the one hand, by extending the modalities in one space to another space rather than constructing a new representation space, Ex-MCR preserves the base space. For example, all Image-Text modality alignment of CLIP is fully inherited in Ex-MCR.

On the other hand, by redesigning the data aggregation method, the structure of the projector, and the loss functions, Ex-MCR preserves better cross-modal alignment from the leaf space (Audio-Text and 3D-Image alignment), compared to C-MCR.

(2) Claim about integrating more pre-trained spaces:

In this unified space, we preserve advanced intermodal alignment (see Table 1, Table 2 in the paper), and obtain emergent 3D-Audio alignment (see Figure 2 in the paper and more diverse examples provided in PDF). This result proves that Ex-MCR is a unified representation space that can accommodate multiple modalities.

Moreover, the result also indicates that both image and text (modalities in the base space) can serve as the overlapping modality to extend the space.

W4 & Q5: Differences between decoupled projector and existing projector

The decoupled projector reduces interference among different optimization objectives, which helps the model learn modality alignment better.
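As a rough illustration of what "decoupled" means here, the following PyTorch sketch shows a projector split into a linear part f_l for intra-space alignment and an MLP part f_m for mapping into the base space. The layer sizes, depth, and activation are illustrative assumptions, not the exact implementation.

```python
import torch.nn as nn

class DecoupledProjector(nn.Module):
    """Sketch of a decoupled projector: a linear map (f_l) handles
    intra-space alignment, and an MLP (f_m) maps into the base space.
    Dimensions and depth here are assumptions for illustration."""
    def __init__(self, leaf_dim=512, base_dim=768, hidden_dim=1024):
        super().__init__()
        self.f_l = nn.Linear(leaf_dim, leaf_dim)   # linear: intra-space alignment
        self.f_m = nn.Sequential(                  # MLP: inter-space mapping
            nn.Linear(leaf_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, base_dim),
        )

    def forward(self, x):
        x = self.f_l(x)       # aligned within the leaf space
        return self.f_m(x)    # projected into the base (CLIP) space
```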

We performed ablation experiments on Ex-MCR-huge, and the results are as follows:

| Projector type | Flickr R@5 | AVE R@5 | VGGSS R@5 | ACaps R@5 |
|---|---|---|---|---|
| Non-decoupled projector | 7.22 | 6.53 | 11.28 | 58.36 |
| Decoupled projector | 6.16 | 7.36 | 11.77 | 59.60 |

The above results show that the decoupled projector achieves better overall performance.

W5 & Q6: Inclusion of Related Works

Thank you for providing more related works. We will cite these methods in the revised paper. Please refer to the response to W2 & Q3 for comparison results.

Comment

Thanks for the authors' response. I will keep my initial rating.

Comment

Reviewer 1W4Z:

Any chance you can provide more details in your response? This will help the authors. Please do it immediately as this is the last day for author-reviewer discussion.

-AC

Review
Rating: 7

The paper introduces Extending Multi-modal Contrastive Representations (Ex-MCR), a novel method for learning unified contrastive representations across multiple modalities without the need for paired data. Ex-MCR extends one modality's representation space into another, preserving semantic alignment and reducing training costs. The method enhances the learning pipeline with modality-centric data, a decoupled projector, and a dense alignment objective. Experiments show that Ex-MCR achieves state-of-the-art performance on multi-modal tasks, highlighting its efficiency in representation learning.

Strengths

  • This paper explores an interesting problem. In practice, as the number of modalities increases, the costs of data preparation and model training for learning a contrastive representation space rise significantly. Therefore, how to pursue a training-efficient and paired-data-free representation is critical but challenging.

  • This paper is well-written and the motivation is stated clearly.

Weaknesses

  • The effectiveness of the proposed method heavily depends on the choice of the hyperparameter in Eq. (7). A sensitivity analysis of this hyperparameter is necessary, and discussing insights into how to choose it is essential.

  • The performance improvements of the proposed method in the 3D-image-text experiments are relatively limited. In particular, Ex-MCR falls far behind ULIP-v2 in the 3D-image domain. The authors should explain this clearly.

  • Ex-MCR requires additional training cost and inference time compared to the previous method C-MCR. The authors should give more analysis.

Questions

Please refer to the weaknesses above.

Limitations

None

Author Response

W1: Analysis of hyperparameter choices

Please refer to the response to all reviewers (Part 3: Analysis of hyperparameter choices).

W2: 3D-Image-Text performance improvements are relatively limited.

For the 3D-Image-Text experiments, we think the main reason for the performance loss is the limited training data. In our latest experiments (referring to the official comment to all reviewers), we use Ex-MCR to fuse more powerful models, EVA-CLIP-18B (Image-Text) and Uni3D-G (3D-Image-Text). Ex-MCR achieves leading performance in all three domains (3D-Image, 3D-Text, Image-Text).

For the explanation of the Ex-MCR-base's 3D performance, please refer to the response to reviewer n4Yo's Q2.

W3: Ex-MCR requires additional training cost and inference time compared to C-MCR

Taking the extension of CLAP to CLIP as an example:

For training, both Ex-MCR and C-MCR can be trained for less than 4 hours on a single RTX 4090, and the cost difference in reaching convergence can be ignored.

For inference, Ex-MCR only projects a single modality in one space, while C-MCR needs to project three modalities in two spaces; therefore, C-MCR has a higher inference overhead than Ex-MCR. Detailed statistics for the inference overhead are shown in the following tables:

Parameters for projection (M):

| | Audio | Image | Text | Total |
|---|---|---|---|---|
| C-MCR | 1.0532 | 1.0532 | 1.0532 | 3.1596 |
| Ex-MCR | 2.3690 | 0 | 0 | 2.3690 |

Computation overhead for projection (GFLOPs):

| | Audio | Image | Text | Total |
|---|---|---|---|---|
| C-MCR | 0.0011 | 0.0011 | 0.0011 | 0.0033 |
| Ex-MCR | 0.0024 | 0 | 0 | 0.0024 |
Comment

Dear Reviewer KwrG,

We are very thankful for your valuable feedback and comments! Please let us know if you have further questions regarding the rebuttal response or any other questions related to the paper. We look forward to your further feedback!

Thanks,

Authors of Paper Submission 6854

Comment

Thanks for your efforts during the rebuttal phase. Most of my concerns have been addressed. I will raise my score accordingly. Additionally, some influential previous works are not discussed, such as UMT/UME [1], QMF [2], MMPareto [3], and ReconBoost [4]. I encourage the authors to discuss them in the revised version.

[1] On Uni-Modal Feature Learning in Supervised Multi-Modal Learning. ICML 2023.
[2] Provable Dynamic Fusion for Low-Quality Multimodal Data. ICML 2023.
[3] MMPareto: Boosting Multimodal Learning with Innocent Unimodal Assistance. ICML 2024.
[4] ReconBoost: Boosting Can Achieve Modality Reconcilement. ICML 2024.

Comment

Dear reviewer KwrG,

Thank you for your constructive comments. We are glad we could address your concerns regarding the paper. We will discuss the related works you provided in the revised version.

Once again, thank you for your efforts in the review.

Best regards, Paper 6854 authors

Review
Rating: 4

This paper proposes Extending Multi-modal Contrastive Representation (Ex-MCR), a training-efficient and paired-data-free method to build unified contrastive representations for many modalities. Without using paired data, Ex-MCR achieves comparable performance to advanced methods on a series of audio-image-text and 3D-image-text tasks and achieves superior performance when used in parallel with data-driven methods.

Strengths

  1. The authors propose Extending Multi-modal Contrastive Representations (Ex-MCR), a novel training-efficient and paired-data-free representation learning method for more than three modalities.
  2. The authors comprehensively augment the entire space alignment learning pipeline from the perspectives of training data, architecture, and learning objectives.

Weaknesses

  1. The contribution of this paper is the multimodal contrastive learning method. The multimodal contrastive learning method proposed by the authors is not very novel, because contrastive learning is widely used in the field of multimodal retrieval.
  2. The method proposed by the authors is too simple; it is recommended that the method be developed further.

Questions

  1. In Figures 2 and 3, which images are retrieved correctly? The authors should mark the images that are retrieved correctly and the images that are retrieved incorrectly.
  2. In Tables 1 and 2, the authors' method is lower than the compared methods on some metrics. How can the advantages of the authors' method be explained?

Limitations

See the weaknesses section

Author Response

W1 & W2: Lack of novelty in the article & the method is too simple

The main contribution of Ex-MCR is not to "apply contrastive learning to multimodal retrieval tasks", but to efficiently integrate multiple pre-trained contrastive multimodal representation spaces into a new and more comprehensive one.

As a novel learning paradigm, Ex-MCR is extremely flexible and able to integrate multiple pre-trained contrastive learning spaces with different modalities to achieve state-of-the-art performance.

Moreover, the paper also redesigned the data aggregation methods, the structure of the projector, and the alignment objective function, all of which make Ex-MCR more powerful. We illustrate the motivation of these designs in Section 3.2 Enhancing Alignment Learning Pipeline. Our ablation studies in Section 4.5 Enhancing Alignment Learning Pipeline fully demonstrate the effectiveness of these novel designs. In the response to reviewer 1W4Z's W4&Q5, we further show the superiority of the decoupled projectors.

Q1: Why the 3D-Audio retrieval results are not marked as correct or incorrect

There is no existing benchmark for Audio-3D retrieval, so there is no objective definition of right or wrong for Audio-3D pairs. To demonstrate the emergent semantic alignment between Audio and 3D, we directly show some top-5 retrieval results in our paper.

We also provided more randomly selected results for audio->3D and 3D->audio retrieval in the newly uploaded PDF.

Q2: Ex-MCR’s performance is lower than the compared method in some parts

For Audio-Image-Text experiments, Ex-MCR reconstructs the alignment between CLAP's audio encoder and the CLIP image encoder and text encoder. Ex-MCR outperformed ImageBind in 4 of the 6 performance metrics of Audio-Image while having much better Audio-Text alignment and the same Image-Text alignment. Therefore, comprehensively, Ex-MCR reached state-of-the-art Audio-Image-Text performance.

For the 3D-Image-Text experiments, in our latest experiments (referring to the official comment to all reviewers), we use Ex-MCR to fuse more powerful models, EVA-CLIP-18B (Image-Text) and Uni3D-G (3D-Image-Text). Ex-MCR achieves leading performance in all three inter-modality alignments (3D-Image, 3D-Text, Image-Text).

Additionally, as requested by the reviewers, we provide further clarification on the 3D performance of the Ex-MCR-base. Firstly, the Ex-MCR-base demonstrates stronger Image-Text alignment and achieves a comparable level of 3D-Text alignment to ULIPv2. Therefore, overall, Ex-MCR and ULIPv2 exhibit similar alignment capabilities in 3D-Image-Text space.

We think the main reason for the performance loss in 3D-Image is the limited training data. The pseudo data for training are mainly derived from the inherent alignment of the leaf space and the base space. Therefore, this extending process can be regarded as distilling the leaf space's alignment knowledge to construct a new alignment for the base space. The features and ability of the distillation teacher can affect the distillation result, and when we use a stronger pre-trained space, Ex-MCR outperforms the teacher.

Comment

Dear Reviewer n4Yo,

We are very thankful for your valuable feedback and comments. Please let us know if you have further questions regarding the rebuttal response or any other questions related to the paper. We look forward to your further feedback!

Thanks,

Authors of Paper Submission 6854

Comment

Reviewer n4Yo:

Please let us know whether the authors have addressed your concerns.

Thanks.

-AC

Author Response

Response to all reviewers:

1 Further clarification of the experimental setup

The final unified Ex-MCR representations are composed of CLIP's image and text representations, the projected CLAP audio representations, and the projected ULIP 3D representations (i.e., $\mathbf{t}_i^I$, $\mathbf{v}_i^I$, $f_m^A(f_l^A(\mathbf{a}_i^A))$, and $f_m^U(f_l^U(\mathbf{p}_i^U))$ form the final audio-text-image-3D unified representation). Ex-MCR only uses the aligned 3D-image representation of ULIP and extends it to a different image-text model (i.e., CLIP) in a paired-data-free way.
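A minimal sketch of how these pieces are assembled at inference time is given below. The encoder and projector names (`clip_image`, `clip_text`, `clap_audio`, `ulip_3d`, `audio_projector`, `pc_projector`) are illustrative stand-ins for the frozen pre-trained encoders and the trained projectors, not the actual API.

```python
import torch.nn.functional as F

def unified_embeddings(image, text, audio, points,
                       clip_image, clip_text, clap_audio, ulip_3d,
                       audio_projector, pc_projector):
    """Assemble the unified audio-text-image-3D representations described above.
    CLIP image/text embeddings are kept as-is; CLAP audio and ULIP 3D embeddings
    are passed through their trained projectors f_m(f_l(.))."""
    v = clip_image(image)                   # v_i^I stays in the base (CLIP) space
    t = clip_text(text)                     # t_i^I stays in the base (CLIP) space
    a = audio_projector(clap_audio(audio))  # f_m^A(f_l^A(a_i^A))
    p = pc_projector(ulip_3d(points))       # f_m^U(f_l^U(p_i^U))
    # L2-normalize so that retrieval reduces to cosine similarity
    return [F.normalize(x, dim=-1) for x in (v, t, a, p)]
```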

2 3D-Image-Text experiments on advanced pre-trained spaces

We use Ex-MCR to fuse advanced pre-trained models, EVA-CLIP-18B (Image-Text) [1] and Uni3D-G (3D-Image-Text) [2].

To fully demonstrate the performance, we tested 3D-text alignment using three 3D classification tasks: Modelnet40 [3], Objaverse-lvis [4], and Scanobjnn [5].

The details of benchmarks are shown in the following table:

| Benchmark | Items | Num. of classes |
|---|---|---|
| Modelnet40 | 2468 | 40 |
| Objaverse-lvis | 46832 | 1156 |
| Scanobjnn | 2890 | 15 |

As in the paper, we use Objaverse-lvis and COCO to test the alignment between 3D-Image and Image-Text, respectively.

We compared Ex-MCR with other non-projector-based methods such as PointBind [6] and UniBind [7]:

Results of 3D-Image-Text experiments.

| Model | ModelNet40 R@1 | ModelNet40 R@5 | Objaverse-lvis R@1 | Objaverse-lvis R@5 | Scanobjnn R@1 | Scanobjnn R@5 | Objaverse 3D-image R@1 | Objaverse 3D-image R@5 | COCO R@1 | COCO R@5 |
|---|---|---|---|---|---|---|---|---|---|---|
| Uni3D-g | 87.56 | 99.27 | 53.13 | 81.59 | 64.12 | 91.63 | 43.57 | 67.61 | 59.51 | 81.45 |
| ULIPv2 | 73.06 | 91.50 | 30.26 | 55.01 | 46.75 | 76.33 | 7.54 | 18.80 | 22.92 | 46.33 |
| PointBind | 76.18 | 97.04 | 13.83 | 30.34 | 55.05 | 86.89 | 5.86 | 14.59 | 57.28 | 79.54 |
| UniBind | 63.21 | 89.59 | 8.54 | 20.29 | 41.73 | 74.39 | 3.51 | 9.28 | 57.27 | 79.53 |
| Ex-MCR (EVA-CLIP-18B & Uni3D) | 88.09 | 99.31 | 53.59 | 82.25 | 63.94 | 91.97 | 46.33 | 70.04 | 60.61 | 82.08 |

Ex-MCR achieves leading performance in all three inter-modality alignments, indicating that its performance can readily benefit from breakthroughs in advanced pre-trained spaces.

[1] EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters, arXiv 2024

[2] Uni3D: Exploring unified 3D representation at scale, ICLR 2024

[3] 3D ShapeNets: A deep representation for volumetric shapes, CVPR 2015

[4] Objaverse: A universe of annotated 3D objects, CVPR 2023

[5] Revisiting point cloud classification: A new benchmark dataset and classification model on real-world data, ICCV 2019

[6] Point-Bind & Point-LLM: Aligning 3D with Multi-modality, arXiv 2023

[7] UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All, CVPR 2024

3 Analysis of hyperparameter choices

We conducted ablation experiments on Ex-MCR-huge for the training temperature $\tau_2$ in Eq. (6) and the L2 loss weight $\lambda$ in Eq. (7), and the results are as follows:

Ablation results for temperature $\tau_2$ in Eq. (6).

| $\tau_2$ | Flickr R@5 | AVE R@5 | VGGSS R@5 | ACaps R@5 |
|---|---|---|---|---|
| 0.01 | 4.35 | 5.41 | 9.02 | 61.34 |
| 0.02 | 4.82 | 5.31 | 9.91 | 62.73 |
| 0.03 | 5.46 | 6.06 | 10.79 | 62.02 |
| 0.04 | 5.83 | 6.51 | 11.14 | 61.12 |
| 0.05 | 6.16 | 7.36 | 11.77 | 59.60 |
| 0.06 | 6.10 | 6.47 | 11.22 | 57.66 |
| 0.07 | 6.11 | 6.67 | 10.89 | 56.27 |
| 0.08 | 6.21 | 6.31 | 10.60 | 55.10 |
| 0.09 | 6.05 | 6.51 | 10.44 | 55.09 |
| 0.10 | 5.93 | 6.40 | 10.41 | 53.68 |

Ablation results for the L2 loss factor $\lambda$ in Eq. (7).

| $\lambda$ | Flickr R@5 | AVE R@5 | VGGSS R@5 | ACaps R@5 |
|---|---|---|---|---|
| 0.00 | 6.02 | 5.81 | 10.46 | 58.59 |
| 0.01 | 6.19 | 6.57 | 11.42 | 60.88 |
| 0.03 | 6.24 | 6.47 | 11.36 | 59.93 |
| 0.05 | 6.19 | 6.35 | 11.10 | 59.41 |
| 0.10 | 6.16 | 7.36 | 11.77 | 59.60 |
| 0.15 | 5.74 | 6.43 | 11.23 | 59.34 |
| 0.20 | 5.92 | 6.31 | 11.17 | 58.08 |
| 0.25 | 5.84 | 6.14 | 11.23 | 58.15 |
| 0.30 | 5.79 | 6.19 | 11.15 | 57.59 |
| 0.35 | 5.67 | 6.36 | 10.93 | 56.97 |

The results of the ablation experiments show that the performance is insensitive to $\tau_2$, and the commonly used value of 0.05 achieves good performance.

The $\lambda$ we picked was chosen only to balance the absolute values of the different loss terms, so we did not search all values extensively. Meanwhile, the experimental results show that a further hyperparameter search could reach even better performance.

Final Decision

The current ratings are 5, 7, 5, 5 (one of the reviewers expressed an intention to increase their rating but did not actually do so in the system). The AC notes that perhaps the only remaining concern is that the 3D-Image-Text performance remains low. The authors provided some verbal explanations, but the AC remains somewhat unconvinced on that point.

However, in view of the ratings, and since all but one of the concerns appear to have been addressed, this paper looks acceptable. The AC urges the authors to address all the points raised in the discussion in the final manuscript.