PaperHub
Overall rating: 4.3/10 (Rejected, 4 reviewers)
Individual ratings: 3, 3, 5, 6 (min 3, max 6, std. dev. 1.3)
Average confidence: 3.8
Correctness: 2.5 | Contribution: 2.5 | Presentation: 1.8
ICLR 2025

TangentBind: Unlocking the Potential of Emergent Alignment in Multimodal Model

OpenReview · PDF
Submitted: 2024-09-27 · Updated: 2025-02-05
TL;DR

We propose TangentBind to enhance the emergent alignment ability between indirectly aligned modalities while retaining alignment with the core modality.

Abstract

Keywords
TangentBind, Multi-modal Alignment, Optimization

Reviews and Discussion

Review 1 (Rating: 3)

The paper introduces TangentBind, a novel paradigm for training a multimodal unified representation space. By training a generation model capable of producing embeddings-to-embeddings, TangentBind establishes a new alignment anchor when additional modalities are involved. The article elaborates the training methodology of this generation model in detail and demonstrates through experiments that utilizing this additional representation for alignment results in superior performance compared to scenarios where it is not used.

Strengths

  • The paper astutely identifies a critical shortcoming in the current multimodal unified representation spaces, namely the non-unified distribution across different modalities induced by the InfoNCE loss.
  • The paper addresses the limitations of the single-term InfoNCE loss in characterizing multimodal representation spaces by incorporating a new alignment component provided by a generative model, which enhances the overall alignment quality (the standard InfoNCE form is recalled below).
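For reference, the InfoNCE objective both points refer to is the contrastive loss of Oord et al. (2018). In generic notation (a standard form recalled here for the reader, not the paper's exact formulation):

```latex
% InfoNCE (Oord et al., 2018), generic notation: z_i^+ is the positive
% match for z_i, sim(.,.) a similarity score, tau a temperature.
\mathcal{L}_{\text{InfoNCE}}
  = -\,\mathbb{E}\left[
      \log \frac{\exp\big(\mathrm{sim}(z_i, z_i^{+})/\tau\big)}
                {\sum_{j=1}^{N} \exp\big(\mathrm{sim}(z_i, z_j)/\tau\big)}
    \right]
```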

Weaknesses

Weakness 1: Regarding Formatting and Writing

  • The positions of Table 1 and Table 2 on page 7 are inverted.
  • Numerous small tables occupy excessive space. The authors might consider integrating several tables and restructuring the layout.
  • Section 5.4 contains large portions of repeated content.

Weakness 2: Regarding Table 1 and Figure 1

  • In Table 1, the R@10 results are largely missing. If the paper has indeed conducted reproducibility experiments, both R@1 and R@10 results should have been obtainable; it is peculiar that the paper presents the R@1 results but not the R@10 results.
  • There are boldface marking errors: although C-MCR exhibits significantly better performance on AudioCaps, the paper still highlights its own method in bold.
  • I am particularly curious why the CLAP results were not included in Table 1 for reference; similarly, in Figure 1, CLAP's performance is not fully displayed. The performance on Clotho and AudioCaps should not be shown as zero.

Weakness 3: Regarding Figure 6 and Section 5.4

Given that the loss function optimized by the generative network is the mean-squared error, and the authors mention in Section 3.1 that the representations are normalized to a hypersphere, optimizing the squared error between two unit-norm vectors essentially optimizes their cosine similarity. The insight that Figure 6 aims to demonstrate is therefore self-evident: the figure merely confirms that the generative network is trained correctly. Although the authors place this experiment in the ablation study, in my opinion the baselines really worth comparing against are other methods of finding alignment anchors, such as directly generating images from captions and then extracting embeddings, or employing retrieval methods as in C-MCR, rather than simply Gaussian noise.
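The equivalence invoked here is the standard identity for unit-norm vectors (a well-known fact, stated for completeness):

```latex
% For unit-norm u and v:
\|u - v\|_2^2 = \|u\|_2^2 + \|v\|_2^2 - 2\langle u, v \rangle
              = 2 - 2\cos\theta(u, v),
```

so minimizing the mean-squared error between normalized embeddings is equivalent to maximizing their cosine similarity.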

Questions

Please refer to weaknesses.

Comment

Thank you so much for your insightful comments and suggestions. Below, we address your concerns.

Weakness 1: Regarding Formatting and Writing

  1. We have corrected the positions of Table 1 and Table 2 on page 7, as suggested. The tables are now presented in their correct order, ensuring that the flow of information aligns with the discussion in the text.
  2. Responding to the concerns about the excessive space occupied by numerous small tables, we have consolidated the content of Table 5 and Table 6 into a comprehensive table. Additionally, Table 4 has been expanded to include new experimental results based on feedback from various reviewers. We have minimized the whitespace around it to enhance layout efficiency.
  3. We have revised Section 5.4, removing repetitive content and integrating descriptions of the new experimental results.

Weakness 2: Regarding Table 1 and Figure 1

  1. We apologize for the oversight in not including the R@10 data in Table 2 (previously Table 1). We have now updated the table to include both R@1 and R@10 results for all relevant methods.
  2. We initially used boldface exclusively for our method to draw reader attention, especially since TangentBind faces inherent challenges in **emergent** tasks, where direct modality-pair training is not utilized, unlike other models that benefit from such training. We have added a note on the meaning of the bolded entries to the table caption.
  3. The primary focus of Table 2 (previously Table 1) and Figure 1 in our manuscript is to showcase the **zero-shot** capabilities of the proposed models, meaning that the models have **not** been trained on the datasets used for testing. However, as indicated in reference [1], the training datasets for the CLAP model include Clotho and AudioCaps, which are among the datasets we used for the zero-shot performance evaluation. Therefore, comparing CLAP's performance with other models in Table 2 and Figure 1 would be unfair under the zero-shot conditions we are evaluating. Besides, it is important to clarify that the central values depicted in Figure 1 are not zero; they simply represent the absence of displayed data for CLAP in those particular scenarios. We hope this explanation clarifies the rationale behind our choices and aligns with the scope of our study focusing on zero-shot performance evaluation.

[1] Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, and Huaming Wang. Clap learning audio concepts from natural language supervision. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE, 2023.

Weakness 3: Regarding Figure 6 and Section 5.4

We have made significant modifications to Figure 6 to more clearly demonstrate the value and efficacy of our generative network's ability to align embeddings with real data. The CDF curves allow a preliminary evaluation of each embedding-generation method before step 3. We introduced six new CDF curves comparing the similarity between the embeddings predicted by ResNet, VAE, and C-MCR and the actual paired text/image embeddings from the data. When the CDF curve approaches 1.0, it indicates the generated embeddings are similar to the actual data embeddings. As illustrated in Figure 6, the embeddings produced by the diffusion model most closely resemble the actual embeddings.
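As an illustration of how such similarity CDF curves can be computed, here is a minimal sketch; `similarity_cdf` is a hypothetical helper written for this note, not code from the paper:

```python
import numpy as np

def similarity_cdf(gen_emb, real_emb, grid):
    """Empirical CDF of cosine similarities between generated embeddings
    and their paired real embeddings, in the spirit of Figure 6."""
    gen = gen_emb / np.linalg.norm(gen_emb, axis=1, keepdims=True)
    real = real_emb / np.linalg.norm(real_emb, axis=1, keepdims=True)
    sims = (gen * real).sum(axis=1)        # per-pair cosine similarity
    return np.array([(sims <= t).mean() for t in grid])
```

Here `grid` is the set of similarity thresholds at which the CDF is evaluated.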

The results in Figure 6 are consistent with those in Table 4, which demonstrate better performance of the diffusion model. Our analysis confirms that diffusion models not only perform exceptionally well in terms of alignment but also excel in downstream tasks, often outperforming direct data generation and embedding extraction methods.

For the ablation study: pipeline methods, such as directly generating images from captions and then extracting embeddings, depend on resource-intensive end-to-end generative models. These methods therefore do not fit our motivation, and we do not adopt them as baselines. C-MCR, by contrast, is compatible with TangentBind and is used as a baseline.

We appreciate the opportunity to clarify these aspects and hope this response addresses your concerns effectively.

Review 2 (Rating: 3)

This paper proposes a multimodal pre-training method called TangentBind, which aims to enhance the potential alignment capability of multimodal models. Unlike previous methods, TangentBind first aligns all modalities to a core modality (such as image or text), and then introduces a generative network to generate the embeddings of the second modality based on the embeddings of the core modality. The authors also introduce the Tangent Term, which is a loss function that can prevent the generated embeddings from negatively impacting the alignment with the core modality. The experimental results show that TangentBind significantly outperforms previous binding models in terms of potential capability across various datasets and tasks.
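As a reading aid only, the pipeline this summary describes can be sketched schematically; the exact objective in the paper may differ from this assumed form:

```latex
% Schematic of the described setup (an assumption for illustration, not
% the paper's exact loss): a generative network G produces a second-
% modality anchor from the core embedding, and a new modality m is
% aligned against both, with the Tangent Term guarding core alignment.
\hat{z} = G_\theta(z_{\text{core}}), \qquad
\mathcal{L} = \mathcal{L}_{\text{InfoNCE}}(z_m, z_{\text{core}})
            + \mathcal{L}_{\text{InfoNCE}}(z_m, \hat{z})
            + \mathcal{L}_{\text{Tangent}}
```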

Strengths

  1. TangentBind introduces a novel approach to enhance the emergent alignment capabilities between indirectly aligned modalities while maintaining alignment with the core modality.
  2. The method's effectiveness is validated through experiments using VISION and TEXT as core modalities and including other modalities such as AUDIO, DEPTH, and INFRARED.
  3. The method proposed in the paper does not need additional datasets where all modalities are present simultaneously, nor does it rely on large amounts of generated data across different modalities, making it more efficient and practical.

Weaknesses

  1. Little benefit from the proposal: the authors propose two key components in this system, but the idea of the generative network stems from DALLE-2 and its motivation is unclear, and the Tangent Term yields relatively little benefit on the final results.
  2. Limitations on the significance of the task: in comparison, LanguageBind still performs better than TangentBind on some tasks.
  3. Lack of experiments on the structure of the generative network: the generative network is compared only against noisy embeddings; the authors could try other structures in its place.
  4. Lack of ablation study: I suggest removing the tangent term and the generative network together, then adding each back one at a time to show the benefit of either component more clearly.

Questions

  1. The proposed method trains a Generative Net at Step 2, which generates reconstructed image embeddings conditioned on text. The motivation for using a generative network to align different modalities seems unclear, since training a generative network consumes much more time and resources than other ways of fusing different modalities.
  2. It would help if the authors could provide more details on the training datasets and resource consumption, as the quality of the core modalities, VISION and TEXT, could greatly influence the performance of the successive modality encoders.
  3. In the experiments, the authors give test results on various benchmarks against existing methods like ImageBind and LanguageBind, but the paper could be strengthened by more comprehensive comparisons with methods trained on modality pairs, e.g., (image, text) and (audio, text), such as CLIP and AudioCLIP.

Ethics Concerns

No

Comment

Thank you so much for your insightful questions and suggestions. We would like to address your questions and concerns below.

Weakness 1: Little benefit from the proposal: the authors propose two key components, but the idea of the generative network stems from DALLE-2 and its motivation is unclear, and the Tangent Term yields relatively little benefit on the final results.

Our experimental results clearly demonstrate a substantial improvement in **emergent zero-shot** performance, which we believe validates the effectiveness of our approach and is a significant extension beyond the scope of the initial methods. Besides, the experimental results presented in Section 5 highlight the emergent zero-shot improvements achieved by the Tangent Term without compromising alignment with the core modality. This benefit underscores the novel contribution of the Tangent Term compared to traditional objectives like InfoNCE.

The motivation behind employing a generative network is primarily that it does not necessitate the simultaneous presence of all modalities in additional datasets, nor does it depend on generating massive amounts of data across various modalities. This methodology significantly streamlines the training process while efficiently handling missing data. The decision to employ a diffusion model was made after extensive ablation studies detailed in Section 5.4 and Appendix C. These studies empirically demonstrate that diffusion models provide superior performance in generating embeddings that effectively close the modality gap [1], compared to other baselines.

[1] Fuest, M., Ma, P., Gui, M., Fischer, J.S., Hu, V.T., & Ommer, B. (2024). Diffusion Models and Representation Learning: A Survey. ArXiv, abs/2407.00783.
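To make "generating embeddings rather than images" concrete, below is a minimal, hypothetical sketch of a DALLE-2-style diffusion prior that denoises an image embedding conditioned on a text embedding. The architecture, dimensions, and noise schedule are illustrative assumptions; the authors' actual design is described in their Appendix C.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingDenoiser(nn.Module):
    """Toy denoiser over embedding vectors (illustrative only)."""
    def __init__(self, dim=768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim * 2 + 1, 1024),
            nn.GELU(),
            nn.Linear(1024, dim),
        )

    def forward(self, noisy_img_emb, text_emb, t):
        # Condition on the text embedding and the normalized timestep.
        x = torch.cat([noisy_img_emb, text_emb, t], dim=-1)
        return self.net(x)  # predicted noise

def diffusion_loss(model, img_emb, text_emb):
    """One DDPM-style training step: add noise, predict it, regress with MSE."""
    t = torch.rand(img_emb.size(0), 1)               # timestep in [0, 1)
    noise = torch.randn_like(img_emb)
    alpha = torch.cos(t * torch.pi / 2) ** 2         # simple cosine schedule
    noisy = alpha.sqrt() * img_emb + (1 - alpha).sqrt() * noise
    return F.mse_loss(model(noisy, text_emb, t), noise)
```

Working in embedding space keeps the network small (a few MLP layers here) compared to a pixel-space generator, which is the efficiency argument made above.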

Weakness 2: Limitations on the significance of the task: in comparison, LanguageBind still performs better than TangentBind on some tasks.

As discussed in our response to Weakness 1, TangentBind is specifically designed to significantly enhance the **emergent zero-shot** capabilities of models, and our experimental results robustly support this claim, showing marked improvements in tasks where image and text serve as core modalities.

It is important to note that one of TangentBind's key advantages is its ability to boost emergent zero-shot performance without compromising alignment capabilities with the core modality. This functionality is a significant contribution of our approach. The instances where LanguageBind outperforms TangentBind occur in **non-emergent** zero-shot scenarios, and the marginal performance differences observed there are acceptable considering the improvement in emergent zero-shot performance.

Our method provides a novel approach that not only preserves but also enhances the model's capacity to adapt to new, unseen modalities while maintaining robust performance across standard tasks. This balance is crucial for real-world applications where adaptability and robustness are essential. We appreciate the opportunity to clarify these aspects and hope that the distinction between the specific enhancements offered by TangentBind and the broader performance metrics is now more apparent.

Weakness 3: Lack of experiments on the structure of the generative network: the generative network is compared only against noisy embeddings; the authors could try other structures in its place.

We acknowledge the importance of a comprehensive assessment of the generative network structure to substantiate the robustness and efficiency of our proposed methodology.

In response to your comments, we have expanded our experimental framework to include a detailed ablation study on the structure of the generative network and other methods, as outlined in Section 5.4 and detailed further in Appendix C. Besides, we have conducted additional experiments in which the diffusion model originally used to generate embeddings is replaced with various alternatives such as VAE, ResNet50, and C-MCR. The results show that the diffusion model is the best option for our specific application needs.

Weakness 4: Lack of ablation study: I suggest removing the tangent term and the generative network together, then adding each back one at a time to show the benefit of either component more clearly.

It is important to note that when both the tangent term and the generative network are removed from TangentBind, the model effectively reverts to the baseline models, ImageBind and LanguageBind. We have provided results for ImageBind and LanguageBind as baseline models in our experiments.

We appreciate your feedback, which has been instrumental in enhancing the depth and rigour of our study.

Comment

Thank you so much for your insightful questions and suggestions. Below, we address your questions and concerns.

Question 1: The proposed method trains a Generative Net at Step 2, which generates reconstructed image embeddings conditioned on text. The motivation for using a generative network to align different modalities seems unclear, since training a generative network consumes much more time and resources than other ways of fusing different modalities.

The motivation for using a generative network has been discussed in our reply to Weakness 1. Besides, it is crucial to note that our generative network operates within the latent space, considerably reducing the computational overhead compared to direct image generation. Extensive ablation studies presented in Section 5.4 and Appendix C corroborate that employing a diffusion model for generating embeddings is more effective than the other methods we tested.

By leveraging the generative network in this manner, we improve the **emergent zero-shot** performance of modality alignment without direct pairings while maintaining a manageable level of computational demand. We believe these points adequately address the concerns raised and underscore the value of our methodological choices.

Question 2: It would help if the authors could provide more details on the training datasets and resource consumption, as the quality of the core modalities, VISION and TEXT, could greatly influence the performance of the successive modality encoders.

We have expanded the details in our manuscript. A comprehensive description of the training datasets can be found in Appendices B and C, where we outline the specific datasets used, their characteristics, and how they align with the objectives of our study to ensure the quality and relevance of the core modalities.

For models with image (text) as the core modality, the text (image) modality corresponds to the modality $\mathcal{M}_a$ in Step 1 of Section 3. As noted in Section 5.1, using ImageBind (LanguageBind) as initialization provides pre-aligned image and text encoders; thus we do not train the image and text encoders further in subsequent steps.

Question 3: In the experiments, the authors give test results on various benchmarks against existing methods like ImageBind and LanguageBind, but the paper could be strengthened by more comprehensive comparisons with methods trained on modality pairs, e.g., (image, text) and (audio, text), such as CLIP and AudioCLIP.

As you pointed out, CLIP is specifically designed for aligning the image and text modalities; hence its experiments are inherently tied to <RGB, Text> pairs. In our study, as detailed in Appendix C, TangentBind does not alter the parameters of the image and text encoders during training. Therefore, comparing TangentBind directly with CLIP is equivalent to comparing our baseline methods, ImageBind or LanguageBind, with CLIP, and this comparison is unnecessary for our work. To address your point about audio-text modality pairs, we have included results from AudioCLIP in Table 1 (previously Table 2).

We appreciate the opportunity to clarify these aspects and hope this response addresses your concerns effectively.

Review 3 (Rating: 5)

This paper introduces TangentBind, a multimodal pre-training method based on latent-space generation, and the Tangent Term, an improvement of InfoNCE that aligns the generated modality embeddings. It also shows that the proposed Tangent Term not only enhances the emergent capabilities of the multimodal alignment model but also preserves alignment with the core modality.

Strengths

  • This paper revises the InfoNCE loss by adding the Tangent Term, whose functionality is mapping the embedding to the space tangent to the image vector and scaling it to the unit hypersphere (sketched after this list). This is novel, and the Tangent Term actually improves the overall performance of multimodal models.

  • This paper clearly shows the effectiveness of TangentBind through comprehensive experiments and strong results on various downstream tasks involving RGB images, depth images, infrared images, and audio.
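Read literally, the mapping described in the first bullet corresponds to projecting an embedding onto the tangent space at the (unit) image vector and renormalizing. A minimal sketch under that reading; the paper's exact Tangent Term may differ:

```python
import numpy as np

def tangent_map(v, u):
    """Map embedding v to the space tangent to the image vector u,
    then rescale to the unit hypersphere (sketch of the reviewer's
    description, not the paper's exact formulation)."""
    u = u / np.linalg.norm(u)             # image anchor on the unit sphere
    v_tan = v - np.dot(v, u) * u          # drop the component along u
    return v_tan / np.linalg.norm(v_tan)  # rescale to unit norm
```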

Weaknesses

  • While the whole paper talks about the InfoNCE loss and develops it to propose the Tangent Term, the actual InfoNCE paper [1] does not seem to be referenced. This is critical - the authors MUST reference the original InfoNCE paper.

  • While the proposed TangentBind is novel and supported by exhaustive experiments, minor typos, a missing reference, inconsistent line spacing, misplaced figures and tables, etc. (please refer to Additional Comments) diminish the overall quality of the paper, making it seem incomplete. I would strongly encourage the authors to take a careful look at the entire paper to refine its quality.

[1] Oord, Aaron van den, Yazhe Li, and Oriol Vinyals. "Representation learning with contrastive predictive coding." arXiv preprint arXiv:1807.03748 (2018).

Questions

Additional Comments

  • Please add a line mentioning that the details of the dataset are located in the Appendix at the beginning of Section 5: Experiments and Results.
  • Typo in line 315: TangenBind -> TangentBind
  • Typo in line 316: i4mage -> image
  • I might have missed other typos in the paper. Please take a careful look to improve the paper's overall quality. I would be willing to raise my score if the authors make the paper more reader-friendly.

Comment

Thank you so much for your insightful questions and suggestions. We would like to address your questions and concerns below.

Weakness 1: While the whole paper talks about the InfoNCE loss and develops it to propose the Tangent Term, the actual InfoNCE paper [1] does not seem to be referenced. This is critical - the authors MUST reference the original InfoNCE paper.

Thank you for your careful reading of our manuscript and for pointing out the omission of the reference to the original InfoNCE paper, which plays a fundamental role in the theoretical underpinnings of our work. To rectify this, we have now included the reference to the original InfoNCE paper in our manuscript, cited appropriately within our discussion of related work on contrastive learning (Section 2.2). We appreciate your guidance in improving our manuscript and hope that this correction meets your expectations for scholarly thoroughness and accuracy.

[1] Oord, Aaron van den, Yazhe Li, and Oriol Vinyals. "Representation learning with contrastive predictive coding." arXiv preprint arXiv:1807.03748 (2018).

Weakness 2: While the proposed TangentBind is novel and supported by exhaustive experiments, minor typos, a missing reference, inconsistent line spacing, misplaced figures and tables, etc. (please refer to Additional Comments) diminish the overall quality of the paper, making it seem incomplete. I would strongly encourage the authors to take a careful look at the entire paper to refine its quality.

Thank you for your constructive feedback and encouragement to refine our manuscript. We appreciate your detailed observations regarding minor typos, missing references, line spacing inconsistencies, and the placement of figures and tables. Based on your valuable input, we have thoroughly reviewed and revised the entire manuscript. Specific changes include correcting all typos and grammatical errors, adding the missing reference, and adjusting line spacing for consistency. Additionally, we have optimized the placement of figures and tables to align closely with the related text, significantly enhancing the flow and coherence of the content. Notably, we have reduced the whitespace around tables to improve layout efficiency and added citations for all non-bind methods referenced in Table 1 and Table 2, ensuring thorough documentation of sources. These improvements have significantly bolstered the quality and completeness of our paper. Thank you once again for your invaluable insights.

Questions: Additional Comments: • Please add a line mentioning that the details of the dataset are located in the Appendix at the beginning of Section 5: Experiments and Results. • Typo in line 315: TangenBind -> TangentBind • Typo in line 316: i4mage -> image • I might have missed other typos in the paper. Please take a careful look to improve the paper's overall quality. I would be willing to raise my score if the authors make the paper more reader-friendly.

Thank you for your valuable feedback and comments to improve the clarity and readability of our manuscript. In response to your comments, we have made the following revisions:

  1. At the beginning of Section 5: Experiments and Results, we added the sentence: 'For the dataset and experimental implementation details, please refer to Appendices B and C.'

  2. We have corrected the typographical errors you noted: 'TangenBind' has been corrected to 'TangentBind', and 'i4mage' has been corrected to 'image'.

  3. Additionally, we thoroughly reviewed the entire manuscript to identify and correct other typographical and grammatical errors, improving the overall readability.

We hope these revisions address your concerns and enhance the quality of the manuscript.

Thank you again for your insights, and we appreciate the opportunity to strengthen our work based on your recommendations.

Review 4 (Rating: 6)

TangentBind introduces a novel multi-modal alignment approach using generative networks and a Tangent term to enhance cross-modal performance while maintaining core modality alignment. The method shows significant improvements over baselines across various tasks and datasets, with innovative theoretical contributions in generative network modeling and parameter optimization strategies.

Strengths

  • Theoretical Innovation. TangentBind introduces new concepts through its Tangent term and generative network, backed by rigorous mathematical analysis and proofs.

  • Experimental Validation. The method demonstrates superior performance across multiple modalities (audio, depth, infrared) and diverse datasets (AudioSet, VGG-S, NYU-D), with comprehensive ablation studies validating each component.

  • Practical Robustness. TangentBind exhibits consistent performance across different core modalities (image and text), with minimal degradation in core alignment capabilities. The method's stability is evidenced through detailed parameter sensitivity studies and direct comparisons with alternative approaches like InfoNCE, demonstrating its readiness for real-world deployment.

Weaknesses

  • Details of Generative Network. There is no detailed description of the generative network architecture, the learning target used, and the training process.

  • The Value of Lambda. In line 515, it is shown that the best performance is achieved when lambda is set to 1.25. However, since this value is larger than 1, it seems to contradict the theoretical analysis.

  • Writing to be improved. For instance, missing full stop in eq. 5, and incomplete sentence in line 277.

Questions

Please see weaknesses.

Ethics Concerns

NA

Comment

Thank you so much for your insightful questions and suggestions. We would like to address your questions and concerns below.

Weakness1: Details of Generative Network. There is no detailed description of the generative network architecture, the learning target used, and the training process.

In response to your comments, we have added comprehensive details on these aspects in Appendix C of our manuscript. This includes a full breakdown of the network architecture, the specific learning targets we employed, and a step-by-step description of our training methodology. We believe this additional information will provide clarity and enhance the understanding of our generative model's setup and operational framework for all readers.

Weakness 2: The Value of Lambda. In line 515, it is shown that the best performance is achieved when $\lambda$ is set to 1.25. However, since this value is larger than 1, it seems to contradict the theoretical analysis.

In line 515, where it is noted that the best performance is achieved when $\lambda$ is set to 1.25, this seems to contradict the theoretical analysis, which initially suggests that $\lambda \leq 1$ prevents an increase in $\mathcal{L}^{\text{align}}$ (where a smaller $\mathcal{L}^{\text{align}}$ indicates higher similarity between modality pairs). It is important to clarify that Theorem 1 only states that $\mathcal{L}^{\text{align}}$ does not increase when $\lambda \leq 1$; it is still possible for $\mathcal{L}^{\text{align}}$ not to increase when $\lambda = 1.25$. Thus the model's performance at $\lambda = 1.25$ can be better than its performance at $\lambda \leq 1$. Detailed analyses of the relationship between $\mathcal{L}^{\text{align}}$ and model performance on downstream tasks are discussed in [1].

[1] Victor Weixin Liang, Yuhui Zhang, Yongchan Kwon, Serena Yeung, and James Y Zou. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. Advances in Neural Information Processing Systems, 35:17612–17625, 2022.
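Stated logically, the clarification above is that Theorem 1 gives only a one-way implication:

```latex
% Theorem 1, as paraphrased in the response, is one-directional:
\lambda \leq 1 \;\Longrightarrow\; \mathcal{L}^{\text{align}} \text{ does not increase}.
```

The converse is not claimed, so $\mathcal{L}^{\text{align}}$ may also fail to increase at some $\lambda > 1$, which is consistent with the best empirical performance occurring at $\lambda = 1.25$.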

Weakness3: Writing to be improved. For instance, missing full stop in eq. 5, and incomplete sentence in line 277.

Thank you for your careful review and valuable feedback. We apologize for the oversight and appreciate your pointing out the missing punctuation and incomplete sentences in our manuscript.

  1. Regarding the missing full stop at the end of equation 5, we have now added it to ensure the text conforms to proper grammatical standards. This change can be found on page 5.

  2. Regarding the incomplete sentence on line 277, we have revised it to ensure it clearly conveys the intended information without ambiguity. The revised sentence now reads: 'we refer to the method as TangentBind.' This modification can also be viewed on page 6, line 272 of the updated document.

In addition to these specific corrections, we have thoroughly reviewed the entire manuscript to refine the overall language quality and rectify any other typographical errors found throughout the text. These adjustments aim to improve readability and ensure the precision of the content, enhancing the manuscript's overall clarity and presentation. We hope these revisions address your concerns adequately. Thank you again for your insights which help in improving the quality and clarity of the paper. We look forward to your further suggestions.

We appreciate your guidance in making these improvements to our paper.

AC Meta-Review

The paper introduces TangentBind, an approach to enhancing modality alignment in multimodal models. TangentBind first aligns all modalities to a selected core modality (e.g., image or text), then employs a generative network to produce embeddings for secondary modalities based on this core. This setup allows other modalities, like audio, to be aligned with both the core and generated embeddings, thereby enhancing emergent capabilities while maintaining alignment. The training process incorporates InfoNCE and introduces a Tangent Term to improve the accuracy of generated embeddings.

The decision to Reject is based on several weaknesses identified by reviewers.

  • Lack of Detail and Clarity: The generative network architecture, learning targets, and training processes were inadequately described in the original submission, leading to confusion about implementation and effectiveness. Additionally, there are numerous writing and formatting issues, such as missing references, typos, and structural inconsistencies, which diminish the overall quality.

  • Experimental Rigor: The experiments lack comprehensiveness, including the absence of key performance metrics (R@10) and too narrow a range of comparisons for the generative network. Reviewers also cited missing baseline comparisons beyond Gaussian noise.

  • Limited Impact: Reviewers expressed concerns regarding the utility of the proposed TangentTerm, suggesting it provides minimal improvements over existing methods (like LanguageBind).

Reviewers were unanimous in their appreciation of the theoretical motivations behind TangentBind and noted that the proposed approach improves indirect alignment in multimodal models while maintaining alignment between core modalities. The consensus is that this paper has some very interesting ideas; however, problems with technical clarity and experimental validation limit appreciation of its potential impact.

Additional Comments on Reviewer Discussion

Reviewer concerns focused on technical clarity and experimental evaluation. In the rebuttal, the authors provided clarifications concentrating on the improvement of emergent zero-shot capabilities of multimodal models. The authors performed extensive revisions of the manuscript during the rebuttal to improve clarity, but these were insufficient to overcome the initial reviewer concerns.

Final Decision

Reject