PaperHub
ICLR 2025 · Poster
Overall rating: 6.3/10 (4 reviewers: 6, 8, 3, 8; min 3, max 8, std. dev. 2.0)
Confidence: 4.5 · Correctness: 3.0 · Contribution: 3.3 · Presentation: 3.3

Trusted Multi-View Classification via Evolutionary Multi-View Fusion

OpenReview · PDF
Submitted: 2024-09-25 · Updated: 2025-03-31

Abstract

Keywords
Trusted multi-view classification · NAS-based multi-view classification · evolutionary multi-view fusion · multi-view learning

Reviews & Discussion

Official Review (Rating: 6)

The paper addresses the issues present in pseudo-views generated by previous methods and proposes a solution. It introduces an evolutionary multi-view architecture search approach to generate high-quality fusion architectures to serve as pseudo-views, thus enabling adaptive selection of views and fusion operators.

Strengths

The overall writing of the paper is smooth and easy to understand.

The experimental results are extensive and solid.

Weaknesses

I'm not sure if the proposed pseudo-view generation method is innovative compared to previous neural architecture search (NAS) methods.

Additionally, I'm uncertain whether the proposed method can be applied to large-scale end-to-end multimodal datasets. This is because the method might be inefficient, especially for high-dimensional and complex multimodal data, and the number of operators in the NAS process may be insufficient to cover all necessary functions. Increasing the number of operators might also lead to a significant increase in computational cost. I am open to increasing the score upon receiving a well-justified explanation.

Questions

Could you please describe in detail the differences from other multi-modal NAS methods, such as the DC-NAS (Divide-and-Conquer Neural Architecture Search for Multi-Modal Classification) mentioned in the manuscript?

At the methodological level, what is the significance of concatenating pseudo-views with the original views? In principle, it seems that the pseudo-view already contains the information of the original view.

Comment

Q2: At the methodological level, what is the significance of concatenating pseudo-views with the original views? In principle, it seems that the pseudo-view already contains the information of the original view?

Re: The pseudo-view does already contain the information of the original views. However, its quality is still limited by the view imbalance problem. Moreover, the searched pseudo-view is optimized together with the other views for trusted fusion. In this process, the imbalanced multi-view learning problem is exacerbated, because the searched pseudo-view contains a disproportionate amount of information compared to the individual views. Hence, we enhance each view within the fusion architecture by concatenating the fusion architecture's decision output with its respective view.

The strategy is inspired by the cortico-thalamocortical circuit, where the sensory processing mechanisms of different modalities modulate one another via the non-lemniscal sensory thalamus. It can also be motivated by the complementary characteristics of early and late fusion: multi-view features are typically integrated through neural networks to obtain late-fusion representations, but this may lose some raw information within individual views. Early fusion methods, on the other hand, combine features at an earlier stage but face challenges such as feature heterogeneity and high sample complexity. To address the limitations of both approaches, we extract late-fusion information from the fusion architecture discovered in the first stage. This information is then incorporated into the second stage, where it is concatenated with the original views to generate pseudo-views. This strategy bridges the gaps in both early and late fusion, leading to improved results.
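The enhancement described above, concatenating the fusion architecture's decision output onto each original view, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the view dimensions and the stand-in `fusion_decision` are hypothetical; only the concatenation step reflects the strategy being discussed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 4 samples, two views of different widths,
# and a fusion architecture whose decision output has `n_classes` entries.
n_samples, n_classes = 4, 3
views = [rng.normal(size=(n_samples, 8)), rng.normal(size=(n_samples, 5))]

def fusion_decision(views):
    """Stand-in for the searched fusion architecture's decision output:
    here, simply a softmax over tiled per-view mean features."""
    pooled = sum(v.mean(axis=1, keepdims=True) for v in views)   # (n, 1)
    logits = np.tile(pooled, (1, n_classes))
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)                      # (n, n_classes)

decision = fusion_decision(views)
# The enhancement step: widen every original view with the decision output.
enhanced = [np.concatenate([v, decision], axis=1) for v in views]
```

Each enhanced view thus carries the globally fused (late-fusion) signal alongside its own raw features, which is the mechanism claimed to counteract the imbalance introduced by the pseudo-view.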

Based on the above analysis, we concatenate pseudo-views with the original views. The experimental results (see Table 3, which is part of Table 5 in the original manuscript) show that the strategy is very effective. TEF_2 directly uses the pseudo-view for trustworthy fusion without concatenating it with the original views.

Table 3

| Methods | AWA | NUA | Reuters5 | Reuters3 | VoxCeleb | YoutubeFace |
|---|---|---|---|---|---|---|
| TEF_2 | 89.81±0.42 | 73.93±0.52 | 80.11±0.51 | 84.79±0.46 | 62.74±0.20 | 74.47±2.18 |
| TEF | 93.26±1.25 | 75.12±0.57 | 82.26±0.23 | 86.49±0.10 | 92.41±0.14 | 86.02±0.41 |

Thank you again for your professional comments. Please let us know if you have any follow-up questions or areas needing further clarification. Your insights are valuable to us, and we stand ready to provide any additional information that could be helpful.

Comment

Thank you for your professional comments. We have done our best to address your questions and have revised our paper following the suggestions from all reviewers.

W1 and Q1: Could you please describe in detail the differences from other multi-modal NAS methods, such as the DC-NAS (Divide-and-Conquer Neural Architecture Search for Multi-Modal Classification) mentioned in the manuscript?

Re: This is a very good comment. The main difference between TEF and other multi-modal NAS methods is that the pseudo-view searched by TEF is enhanced by concatenating the fusion architecture's decision output with each view within the fusion architecture, whereas the pseudo-views searched by other methods are not. This strategy alleviates the view imbalance problem, allowing us to obtain a higher-quality pseudo-view than other multi-modal NAS methods. This is verified by the empirical results: TEF armed with our pseudo-view achieves better performance than the others. The results are shown in Tables 1 and 2, where TEF and TEF_1 denote TEF armed with our pseudo-view and with one induced by other NAS methods, respectively. That is, our pseudo-view is enhanced by concatenating it with the fusion architecture's decision output, while the compared one is not. Table 1 is part of Table 5 in the original manuscript; Table 2 is provided by conducting additional experiments.

Table 1

| Methods | AWA | NUA | Reuters5 | Reuters3 | VoxCeleb | YoutubeFace |
|---|---|---|---|---|---|---|
| TEF_1 | 91.60±0.20 | 73.71±0.48 | 79.93±0.54 | 85.33±0.41 | 91.44±0.14 | 76.68±1.44 |
| TEF | 93.26±1.25 | 75.12±0.57 | 82.26±0.23 | 86.49±0.10 | 92.41±0.14 | 86.02±0.41 |

Table 2

| Methods | PIE | HandWritten | Scene15 | Caltech-101 | CUB | Animal | NUS |
|---|---|---|---|---|---|---|---|
| TEF_1 | 95.81±0.78 | 98.75±0.24 | 75.74±0.65 | 95.11±0.51 | 94.17±0.56 | 89.16±0.28 | 45.18±0.23 |
| TEF | 97.57±0.78 | 99.65±0.13 | 78.01±0.48 | 96.04±0.32 | 95.92±0.62 | 90.18±0.08 | 47.80±0.30 |

Additionally, we want to clarify that the paper aims to further unlock the potential of trusted multi-modal learning (TMML) methods by introducing a high-quality pseudo-view into TMML and breaking its late-fusion limitation. We formulate pseudo-view generation as a population-based multi-view neural architecture search problem within the TMML framework. Naturally, other multi-modal NAS methods can easily be coupled into TEF. Indeed, we do not focus on the NAS design itself; however, the pseudo-view enhancement is novel and original.

W2: Additionally, I'm uncertain whether the proposed method can be applied to large-scale … I am open to increasing the score upon receiving a well-justified explanation.

Re: As you mentioned, evaluating the fitness of all NN chromosome vectors indeed requires considerable computing time at each generation. In fact, this is still an open problem in population-based search methods. Fortunately, evolutionary multi-view learning has provided many acceleration strategies, such as fitness caching (FC) and search guided by core structures (CS). In this paper, we use these strategies to accelerate TEF, and we find that the version of TEF armed with FC and CS is highly efficient. Naturally, new acceleration techniques from the evolutionary computation community can be readily integrated into TEF to yield even more efficient implementations. The results are shown in the table below.

| Methods | FC | CS | Time |
|---|---|---|---|
| TEF_w/o | False | False | 13.21h |
| TEF | True | True | 2.46h |
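The fitness-caching (FC) idea mentioned above can be sketched as a simple memoization layer over the expensive fitness evaluation. This is an illustrative sketch, not the paper's implementation: the `FitnessCache` class, the JSON-based key, and the toy `evaluate` callable are all assumptions.

```python
import hashlib
import json

class FitnessCache:
    """Memoize fitness evaluations keyed on the chromosome encoding, so
    identical architectures recurring across generations are evaluated
    only once. `evaluate` stands in for the costly train-and-validate step."""
    def __init__(self, evaluate):
        self._evaluate = evaluate
        self._cache = {}
        self.hits = 0

    def _key(self, chromosome):
        # Chromosomes are assumed JSON-serializable (lists/dicts of choices).
        return hashlib.sha1(
            json.dumps(chromosome, sort_keys=True).encode()
        ).hexdigest()

    def fitness(self, chromosome):
        k = self._key(chromosome)
        if k in self._cache:
            self.hits += 1
        else:
            self._cache[k] = self._evaluate(chromosome)
        return self._cache[k]

calls = []
cache = FitnessCache(lambda c: calls.append(c) or sum(c))  # toy fitness
assert cache.fitness([1, 2, 3]) == 6
assert cache.fitness([1, 2, 3]) == 6   # second call served from the cache
assert len(calls) == 1 and cache.hits == 1
```

Since population-based search re-visits many chromosomes (elites survive across generations unchanged), such a cache can remove a large fraction of the per-generation training cost.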

Additionally, it is hard to afford the overhead of also searching each modality's backbone network. Researchers therefore trade off effectiveness against efficiency by fixing the backbone networks and only searching the fusion strategies, as in [1-5]. In this paper, we follow this manner for dealing with large-scale end-to-end multimodal datasets. Moreover, in the era of large models, training time does not seem to be a limitation for algorithm deployment: an algorithm can be considered good if its inference time is within the user's tolerance range. TEF is this kind of algorithm.

Lastly, we would like to explain why we chose evolutionary NAS. NAS methods fall roughly into three categories: gradient-based NAS (GNAS), reinforcement learning-based NAS (RLNAS), and evolutionary NAS (ENAS). GNAS requires a predefined search space and substantial memory, and RLNAS relies on extensive computational resources, whereas ENAS offers global search capability, flexibility, and parallelization. This makes it particularly suitable for handling complex multi-view tasks and large search spaces.
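As a concrete illustration of the evolutionary-NAS style referred to above, the following is a minimal, self-contained sketch in which a chromosome encodes one fusion operator plus a binary view-selection mask. Everything here is a hypothetical stand-in: the operator names, the mask encoding, and the toy surrogate fitness do not reflect the paper's actual search space or its train-and-validate fitness.

```python
import random

random.seed(0)

OPERATORS = ["concat", "sum", "attention", "max"]
N_VIEWS = 3

def random_chromosome():
    # One fusion-operator choice plus a binary mask selecting views.
    mask = [random.randint(0, 1) for _ in range(N_VIEWS)]
    if not any(mask):
        mask[0] = 1            # keep at least one view selected
    return {"op": random.choice(OPERATORS), "mask": mask}

def fitness(ch):
    # Toy surrogate: prefers "attention" and using more views.
    return (2.0 if ch["op"] == "attention" else 1.0) + sum(ch["mask"])

def mutate(ch):
    child = {"op": ch["op"], "mask": list(ch["mask"])}
    if random.random() < 0.5:
        child["op"] = random.choice(OPERATORS)
    i = random.randrange(N_VIEWS)
    child["mask"][i] ^= 1      # flip one view-selection bit
    if not any(child["mask"]):
        child["mask"][i] = 1
    return child

def evolve(pop_size=8, generations=20):
    pop = [random_chromosome() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elite = pop[: pop_size // 2]           # elitist selection
        pop = elite + [mutate(random.choice(elite)) for _ in elite]
    return max(pop, key=fitness)

best = evolve()
```

The global-search and parallelization advantages cited above come from the fact that each generation's fitness evaluations are independent and the population explores many architectures at once.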

[1] Core-structures-guided multi-modal classification neural architecture search, IJCAI,2024

[2] DC-NAS: Divide-and-conquer neural architecture search for multi-modal classification, AAAI,2024

[3] BM-NAS: Bilevel multimodal neural architecture search, AAAI,2022

[4] Deep multimodal neural architecture search, ACMMM,2020

[5] MFAS: Multimodal fusion architecture search, CVPR,2019

Comment

Dear Reviewer a8gx,

We thank you for your thorough review of our paper and for providing constructive feedback that has significantly contributed to its improvement. Your insights have been invaluable in helping us refine our work.

We sincerely hope that our responses have sufficiently addressed the issues you highlighted in your review and follow-up comments. As the author-reviewer discussion period approaches its end, please do not hesitate to let us know if there is anything further we could do to improve your impression and final rating of our work.

Comment

Dear Reviewer a8gx,

We are pleased to have addressed your concerns, and thank you very much for raising the score.

Best regards,

The authors.

Official Review (Rating: 8)

This research is significant in the field of trustworthy fusion and addresses two main challenges in current methods. First, many approaches overlook feature-level interactions, resulting in suboptimal performance. Second, high-quality pseudo-views exacerbate multi-view imbalance. The authors propose the TEF method, which utilizes evolutionary multi-view architecture search to generate high-quality pseudo-views, enabling adaptive selection of viewpoints and fusion operations. Experimental results demonstrate that this strategy significantly enhances TEF's performance on complex multi-view datasets, particularly in cases with three or more viewpoints. Evaluation results confirm the superiority of the proposed method compared to existing techniques.

Strengths

  1. The authors first review previous work and identify its limitations.
  2. They innovatively introduce neural architecture search to generate high-quality pseudo-view architectures, providing a detailed description that effectively addresses the issue of insufficient feature interaction. Although this technique is time-consuming, it offers an effective solution.
  3. The use of concatenation operations to solve the imbalance between multi-views is actually quite interesting, appearing simple yet effective.
  4. Extensive experiments also demonstrate the effectiveness of TEF.

Weaknesses

See the questions below.

Questions

  1. The paper indicates that introducing a single pseudo-view is effective; have there been any attempts to use multiple pseudo-views to further enhance the results?
  2. Is this method a general concept? Can other trustworthy fusion methods also adopt similar strategies to improve model performance?
  3. What do you believe is the fundamental reason behind achieving state-of-the-art results?
  4. Regarding the time-consuming issue caused by using the NAS algorithm, while the paper suggests some solutions, I would still like to know if the authors have plans for further research in the future to tackle more complex data or deeper network issues.
Comment

Thank you for your professional comments. We have done our best to address your questions and have revised our paper following the suggestions from all reviewers.

Q1: The paper indicates that introducing a single pseudo-view is effective; have there been any attempts to use multiple pseudo-views to further enhance the results?

Re: Yes, we have considered this and conducted experiments on the NUS dataset. The table below shows the differences between not introducing any pseudo-views and introducing 1, 2, 3, or 4 pseudo-views. Introducing a single pseudo-view typically results in better performance, while introducing two or more pseudo-views does not significantly improve outcomes. This is because introducing one optimal pseudo-view has already enabled sufficient feature interaction, and additional pseudo-views may lead to conflicts or feature redundancy.

| Number of Pseudo-Views | Accuracy (%) |
|---|---|
| 0 | 72.73 |
| 1 | 75.12 |
| 2 | 75.24 |
| 3 | 75.19 |
| 4 | 75.09 |

Q2: Is this method a general concept? Can other trustworthy fusion methods also adopt similar strategies to improve model performance?

Re: This method can be viewed as a plug-and-play module suitable for conventional trusted multi-view classification methods. By introducing a pseudo-view generation framework, other methods can effectively address a common issue of traditional multi-view methods: they mostly use late-fusion techniques, tend to merge the local features of each view, and overlook the global feature interactions between views. The introduction of pseudo-views not only compensates for this deficiency but also significantly enhances overall model performance by integrating global features. Therefore, other trustworthy fusion methods could adopt a similar strategy to achieve comparable improvements. We integrated this method into the classic TMC framework, and the results in the table below illustrate its effectiveness. Compared to the original TMC, the enhanced TMC_TEF model demonstrates significant performance improvements across all seven datasets; specifically, the improvements on the PIE, Scene15, and NUS_1 datasets are 4.18%, 12.7%, and 10.43%, respectively.

| Method | PIE | HandWritten | Scene15 | Caltech101 | CUB | Animal | NUS_1 |
|---|---|---|---|---|---|---|---|
| TMC [1] | 91.85±0.23 | 98.51±0.13 | 67.71±0.30 | 92.80±0.50 | 90.57±2.96 | 79.31±0.43 | 35.18±1.55 |
| TMC_TEF | 96.03±1.35 | 98.86±0.13 | 80.41±0.51 | 94.71±0.57 | 95.25±0.79 | 89.36±0.19 | 46.51±0.27 |

[1] Trusted Multi-View Classification, ICLR 2021

Q3: What do you believe is the fundamental reason behind achieving state-of-the-art results?

Re: The main reasons are as follows. First, most trustworthy fusion methods employ a late-fusion strategy, which limits the interaction of information between views and consequently leads to suboptimal utilization of multi-view data. To overcome this, we adopt an evolutionary multi-view architecture search method, creating a high-quality fusion architecture to serve as a pseudo-view. This architecture abandons the traditional constraints of late fusion, allowing features to interact fully and to effectively aggregate global features rather than just local ones. Second, the introduction of pseudo-views may exacerbate the imbalance issue in multi-view learning, since pseudo-views contain more information than individual views, which could weaken model performance. To mitigate this, we enhance each view within the fusion architecture by concatenating the decision outputs of the fusion architecture with the corresponding views, thereby strengthening the efficacy of the pseudo-views and further optimizing the overall performance of the model.

Comment

The authors have addressed my concerns by conducting more experiments and providing more explanations. I would like to increase my score.

Comment

Thank you for raising the score. We are glad to hear that our revisions and feedback have successfully addressed your initial concerns.

Official Review (Rating: 3)

The paper presents TEF (Trusted multi-view classification via Evolutionary Fusion), aiming to address challenges in multi-view classification such as limited view interaction and learning imbalance. TEF utilizes evolutionary neural architecture search to create pseudo-views and applies a balanced fusion strategy. The method shows potential improvements over existing approaches in certain comparisons.

Strengths

  1. Combines evolutionary NAS with multi-view classification, offering a novel approach.
  2. The methodology is clearly explained with effective visuals.

Weaknesses

  1. The emphasis on learning imbalance within pseudo-view-guided multi-view classification may not be as critical as the authors suggest. As long as the overall contributions of different views are balanced, focusing on the specific imbalance of the pseudo-view might be unnecessary. More evidence or justification would be helpful to show that this imbalance significantly impacts performance.

  2. The experimental comparisons may not ensure fairness, as it seems that the benchmarks used in this study do not align with those originally used in previous TMVC methods (e.g. ECML [1]). This raises concerns about consistency and fairness. Clarifying whether all models were evaluated under similar conditions would improve the reliability of the results.

  3. The claimed robustness of TEF in handling noisy or uncertain real-world data is not well-demonstrated. More experiments simulating practical conditions are needed to validate this aspect.

[1] Reliable conflictive multi-view learning. AAAI 2024

Questions

  1. Can you provide more justification or empirical evidence showing that learning imbalance within pseudo-view-guided multi-view classification significantly impacts model performance?

  2. The benchmarks used in the paper seem different from those used in prior TMVC studies. Were the TMVC methods evaluated under the same conditions as your proposed TEF? Clarifying this would address concerns about fairness in experimental comparisons.

  3. Can you show additional experiments or real-world examples demonstrating TEF's effectiveness in handling noisy or uncertain data?

  4. Actually, the method seems to rely on aligning pseudo-view generation with target domain performance using validation labels. However, when samples are noisy, there appears to be no specific strategy to mitigate this noise. How does your approach show advancements over existing methods, especially given that Evidential Deep Learning has known limitations in fusion frameworks that could risk performance degradation?

Comment

Thank you for your professional comments. We have done our best to address your questions and have revised our paper following the suggestions from all reviewers.

W1,Q1: Can you provide more justification or empirical evidence showing that learning imbalance within pseudo-view-guided multi-view classification significantly impacts model performance?

Re: One of the main topics in multi-view learning is how to effectively integrate heterogeneous information from different views. However, although multi-view learning aids in comprehensively understanding the world by integrating information from various senses, most models often fail to satisfactorily achieve multi-view collaboration and do not fully utilize all views. Additionally, while it is anticipated that multiple input views would enhance model performance, we actually find that although multi-view models outperform single-modality models, their potential is still not fully realized [1] [2] [3]. Particularly, the introduction of pseudo-views can exacerbate this imbalance issue, where easily trainable views may suppress the potential of pseudo-views, leading the entire model to suboptimal performance. In Figure 3 of the paper, we demonstrate the performance difference before and after addressing the imbalance issue with pseudo-views, showing noticeable changes. Furthermore, we have added results from seven datasets commonly used in TMVC, leading to a consistent conclusion that the learning imbalance in pseudo-view guided multi-view classification has a significant impact on model performance.

| Dataset | PIE | HandWritten | Scene15 | Caltech101 | CUB | Animal | NUS_1 | YouTube | Reuter5 | Reuter3 | AWA | NUS_2 | VoxCeleb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Imbalance | 95.81±0.77 | 98.75±0.23 | 75.74±0.65 | 95.11±0.51 | 94.16±0.56 | 89.16±0.27 | 45.00±0.23 | 76.68±1.44 | 79.93±0.54 | 85.33±0.41 | 91.33±0.41 | 73.71±0.2 | 91.44±0.14 |
| Balance | 97.57±0.78 | 99.64±0.13 | 78.00±0.48 | 96.04±0.32 | 95.92±0.62 | 90.18±0.08 | 47.52±0.30 | 86.02±0.41 | 82.49±0.23 | 86.49±0.10 | 93.28±1.25 | 75.12±0.57 | 92.41±0.12 |

W2,Q2: The benchmarks used in the paper seem different from those used in prior TMVC studies. Were the TMVC methods evaluated under the same conditions as your proposed TEF? Clarifying this would address concerns about fairness in experimental comparisons.

Re: We have clearly stated in Appendix A.4 that the other TMVC methods and TEF are evaluated under the same conditions. Specifically, these methods utilize the same data processing workflows, dataset partitioning strategies, and consistently employ a 128-dimensional view dimension. Additionally, we have meticulously adjusted the parameter settings of the other TMVC methods to ensure they achieve better performance. Based on these fair comparison conditions, the results ensure the fairness of the evaluation.

To fully demonstrate the advantages of our method and further verify its fairness, we also conducted evaluations on seven benchmark datasets commonly used by other TMVC methods. Using the same data partitioning strategy as Xu et al. (2024), we repeated the experiments ten times randomly and calculated the average results and standard deviations. The results show that our method exhibits significant advantages across all seven datasets.

| Method | PIE | HandWritten | Scene15 | Caltech101 | CUB | Animal | NUS_1 |
|---|---|---|---|---|---|---|---|
| EDL [4] | 86.25±0.89 | 96.90±0.16 | 52.76±0.54 | 73.35±1.73 | 86.22±0.36 | 84.30±1.76 | 22.33±0.64 |
| DCCAE [5] | 81.96±1.04 | 95.45±0.35 | 74.62±1.52 | 89.56±0.41 | 85.39±1.36 | 82.72±1.38 | 35.75±0.48 |
| CPM-Nets [6] | 88.53±1.23 | 94.55±1.36 | 67.29±1.01 | 90.35±2.12 | 89.32±0.38 | 87.40±1.12 | 35.37±1.05 |
| DUA-Nets [7] | 90.56±0.47 | 98.10±0.32 | 68.23±0.11 | 93.43±0.34 | 80.13±1.67 | 78.65±0.55 | 33.98±0.34 |
| TMC [8] | 91.85±0.23 | 98.51±0.13 | 67.71±0.30 | 92.80±0.50 | 90.57±2.96 | 79.31±0.43 | 35.18±1.55 |
| TMDL-OA [9] | 92.33±0.36 | 99.25±0.45 | 75.57±0.02 | 94.63±0.04 | 95.43±0.20 | 87.05±0.28 | 34.39±0.44 |
| RCML [10] | 94.71±0.02 | 99.40±0.00 | 76.19±0.12 | 95.36±0.38 | 94.50±1.13 | 84.01±6.3 | 34.04±0.27 |
| RMVC [11] | 91.18±0.24 | 98.51±0.04 | 73.05±0.24 | 88.73±0.60 | 93.18±0.47 | 87.67±0.17 | 34.68±0.32 |
| Ours | 97.57±0.78 | 99.64±0.13 | 78.00±0.48 | 96.04±0.32 | 95.92±0.62 | 90.18±0.08 | 47.52±0.30 |
Comment

I cannot find the replies to questions 3 and 4.

Comment

Dear Reviewer jfaY,

We replied to questions 1–2 and questions 3–4 in two separate windows. However, we find that only one window is shown when the page is read on a mobile phone; both windows are shown when it is read on a computer.

We are sorry for this issue. For your convenience, we also show the replies to questions 3 and 4 below.

W3, Q3: Can you show additional experiments or real-world examples demonstrating TEF's effectiveness in handling noisy or uncertain data?

Re: Thank you for your insightful query, which enables us to more comprehensively demonstrate the advantages of the TEF framework. Concerning the seven conflicting datasets, they have been structured in accordance with Xu et al., 2024 [10]. TEF was rigorously tested on each dataset through ten iterations to ensure statistical robustness, with both mean values and standard deviations reported. These results substantiate TEF's effectiveness in managing datasets with inherent noise or uncertainty. As demonstrated, TEF consistently delivers superior performance, even under challenging conditions involving noisy or uncertain data. Notably, on the NUS dataset, TEF achieved a performance that was 12.08% higher than the second-best method, and on the Scene15 dataset, it surpassed the next best by 15.3%.

| Method | PIE | HandWritten | Scene15 | Caltech-101 | CUB | Animal | NUS_1 |
|---|---|---|---|---|---|---|---|
| EDL [4] | 21.76±0.67 | 57.25 | 14.28±0.24 | 55.74±0.12 | 53.75±0.42 | 30.71±0.27 | 18.07±0.28 |
| DCCAE [5] | 26.89±1.10 | 82.85±0.38 | 25.97±2.86 | 60.90±2.32 | 63.57±1.28 | 64.30±2.11 | 32.12±0.52 |
| CPM-Nets [6] | 53.19±1.17 | 83.34±1.07 | 29.63±1.12 | 66.54±2.89 | 68.82±0.17 | 64.83±0.35 | 29.20±0.81 |
| DUA-Nets [7] | 56.45±1.75 | 87.16±0.34 | 26.18±1.31 | 75.19±2.34 | 60.53±1.17 | 62.46±1.12 | 31.82±0.43 |
| TMC [8] | 61.65±1.03 | 92.76±0.15 | 42.27±1.61 | 90.16±2.40 | 73.37±2.16 | 64.85±1.19 | 33.76±2.16 |
| TMDL-OA [9] | 68.16±0.34 | 93.05±0.45 | 48.42±1.02 | 90.63±2.35 | 74.43±0.36 | 64.62±0.15 | 32.44±0.26 |
| RCML [10] | 84.00±0.14 | 94.40±0.05 | 56.97±0.52 | 92.36±1.48 | 76.50±1.15 | 62.67±0.81 | 31.19±0.22 |
| RMVC [11] | 76.47±3.43 | 94.75±0.75 | 49.83±2.23 | 80.56±0.71 | 72.78±0.42 | 66.00±0.59 | 24.62±3.19 |
| Ours | 86.76±0.49 | 98.70±0.31 | 72.27±0.43 | 93.42±0.58 | 77.41±0.47 | 70.61±0.12 | 45.84±0.31 |

Q4: Actually, the method seems to rely on aligning pseudo-view generation with target domain performance using validation labels. However, when samples are noisy, there appears to be no specific strategy to mitigate this noise. How does your approach show advancements over existing methods, especially given that Evidential Deep Learning has known limitations in fusion frameworks that could risk performance degradation?

Re: It is important to clarify that all methods were configured using 80% of the data as the training set and 20% as the test set. Considering the potential data-leakage issues associated with neural architecture search (NAS), we further divided the training set into a training set and a validation set to prevent leakage. After completing the search of the fusion architecture, we merged the training and validation sets into a new training set before proceeding to the trusted-fusion stage. This practice is consistent with all trusted multi-view classification (TMVC) methods and does not rely on validation labels to align pseudo-view generation with target-domain performance. On the datasets R3 and R5 and the seven additional noise-inclusive datasets, TEF shows significant improvements over other methods, thanks to the introduction of pseudo-views. Even when multi-view data contain noise, since the views still represent different descriptions of the same object, we can enhance performance through early fusion and feature interaction, as is evident from the experimental results. Additionally, pseudo-views are generated through a search process that effectively excludes some highly noisy views. Regarding the known limitations of Evidential Deep Learning in fusion frameworks, which could potentially degrade performance, the core issue is that such frameworks employ late-fusion methods that overlook early feature interactions. Introducing a pseudo-view that has undergone sufficient feature interaction can significantly address these shortcomings.
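The splitting protocol described in this reply (80/20 train/test, the training set subdivided for the search stage, then re-merged before the trusted-fusion stage) can be sketched as follows. The 80/20 proportion is from the reply; the validation fraction, seed, and function name are illustrative assumptions.

```python
import random

def split_for_nas(n_samples, val_frac=0.25, seed=0):
    """Sketch: 80/20 train/test split; the training set is subdivided into
    search-train/validation for the NAS stage, and the two parts are
    re-merged into one training set for the trusted-fusion stage."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    cut = int(0.8 * n_samples)
    train, test = idx[:cut], idx[cut:]
    v = int(len(train) * val_frac)
    search_train, search_val = train[v:], train[:v]
    fusion_train = search_train + search_val   # re-merged after the search
    return search_train, search_val, fusion_train, test

s_tr, s_val, f_tr, te = split_for_nas(100)
```

The key property is that the test indices never influence the search stage, so architecture selection cannot leak test information.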

Thank you again for your professional comments. Please let us know if you have any follow-up questions or areas needing further clarification. Your insights are valuable to us, and we stand ready to provide any additional information that could be helpful.

Comment

Dear Reviewer jfaY,

We thank you for your thorough review of our paper and for providing constructive feedback that has significantly contributed to its improvement. Your insights have been invaluable in helping us refine our work.

We sincerely hope that our responses have sufficiently addressed the issues you highlighted in your review and follow-up comments. As the author-reviewer discussion period approaches its end, please do not hesitate to let us know if there is anything further we could do to improve your impression and final rating of our work.

Best regards,

The authors.

Comment

Thanks for your reply. I am sorry for the late response.

Firstly, I found your response somewhat confusing due to the inadequate citation of sources. It was difficult for me to trace each reference to its corresponding article, which caused considerable difficulty in understanding your points.

To better understand the developments in this field, I carefully reviewed the literature related to [1]. You just mentioned early feature interactions, so I would like to ask how the method presented in [2] differs from yours. I was also surprised to find that you cited experimental data from this work but did not provide a direct comparison with it. Additionally, there are other relevant works, such as [3], which seems to bear some relation to your study. Given its publication timeline, I am unsure whether this work was available before your submission. For now, I believe it would be sufficient for you to clarify the similarities and differences between [2] and your approach.

I was quite perplexed by your mention of the evolutionary optimization pseudo-view method. The fitness function is directly determined by downstream performance (classification accuracy), which means the search is entirely supervised. This raises a tricky question: could there be a more detailed division of the training set, or further "evolution" across multiple dimensions? In any case, since the evolution is ultimately supervised, this work seems quite incremental. If the pseudo-view evolution were based on self-supervised or unsupervised metrics, I believe it would offer more insightful contributions.

[1] Xu et al. Reliable Conflictive Multi-view Learning, AAAI 2024

[2] Huang H et al. Trusted Unified Feature-Neighborhood Dynamics for Multi-view Classification, arXiv'24

[3] Fu et al. Core-Structures-Guided Multi-Modal Classification Neural Architecture Search, IJCAI'24

Official Review (Rating: 8)

This paper illustrates two main issues in existing trustworthy fusion methods: the lack of feature interaction in late fusion and the insufficient attention given to multi-view imbalance, which leads to inadequate training and suboptimal performance. To address this, the authors innovatively propose an evolutionary computation adaptive method that introduces high-quality pseudo-views, significantly enhancing the performance of trustworthy fusion methods on complex multi-view datasets. Furthermore, the authors present an effective solution for view imbalance and conduct extensive experiments to validate its effectiveness.

Strengths

  1. The paper presents the motivation for the research in detail through illustrations, and the structure of the paper is logical and well-written.
  2. The paper demonstrates considerable innovation, as the authors first identify two key challenges in trustworthy multi-view fusion and provide effective solutions.
  3. The authors validate the effectiveness of their method through extensive experiments, with results showing significant advantages over existing trustworthy and non-trustworthy fusion methods across five evaluation metrics on six datasets.

Weaknesses

  1. I am interested in your concatenation operation. Why is the late-stage fusion information only concatenated with the pseudo-views and not with the original views? What effects would concatenation have?
  2. Why did you choose evolutionary computation for generating pseudo-views? Have you considered gradient neural architecture search or reinforcement learning?
  3. Can early fusion features not assist in training? Why was the final choice made to use late-stage features to address view imbalance?

Questions

See the weaknesses above.

Comment

Thank you for your professional comments. We have done our best to address your questions and have revised our paper following the suggestions from all reviewers.

W1: I am interested in your concatenation operation. Why is the late-stage fusion information only concatenated with the pseudo-views and not with the original views? What effects would concatenation have?

Re: In the first stage, evolutionary neural architecture search generates an optimal feature-interaction fusion architecture. In the second stage, the method produces a high-quality pseudo-view. However, the difference in training difficulty between the single-view training architecture and the feature-interaction fusion architecture exacerbates the imbalance problem, leading to insufficient training of the fusion architecture. The auxiliary late-stage features generated in the first stage can assist the training in the second stage to achieve the desired performance. As the table below shows, whether feature concatenation is used and where it is placed significantly impact the performance of the TEF architecture. TEF^0 introduces no pseudo-view, TEF^1 introduces pseudo-views without concatenation, TEF^2 performs concatenation before the classification layer, and TEF^3 concatenates with the original views. Comparing TEF^3 and TEF^1, TEF^3 shows a significant performance improvement on all datasets, including a 9.44% gain on YouTubeFace. Furthermore, TEF^3 surpasses TEF^2: concatenating at the original-view stage, rather than after view fusion but before classification, better preserves and utilizes multi-view information.

| Methods | AWA | NUS | Reuter5 | Reuter3 | VoxCeleb | YoutubeFace |
| --- | --- | --- | --- | --- | --- | --- |
| TEF^0 | 88.59±0.25 | 72.73±0.30 | 79.60±0.56 | 84.23±0.35 | 73.13±0.15 | 71.18±2.27 |
| TEF^1 | 91.60±0.20 | 73.71±0.48 | 79.93±0.54 | 85.33±0.41 | 91.44±0.14 | 76.68±1.44 |
| TEF^2 | 93.16±1.21 | 75.06±0.66 | 81.14±0.63 | 86.45±0.18 | 92.06±0.14 | 85.35±0.62 |
| TEF^3 | 93.26±1.25 | 75.12±0.57 | 82.26±0.23 | 86.49±0.10 | 92.41±0.12 | 86.02±0.41 |
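To make the two placement options concrete, here is a minimal sketch contrasting TEF^2-style concatenation before the classifier with TEF^3-style concatenation at the original views. This is not the authors' implementation: the mean fusion operator, feature dimensions, and function names are illustrative assumptions.

```python
import numpy as np

def fuse_views(views):
    # stand-in "feature interaction fusion": element-wise mean of view features
    return np.mean(np.stack(views, axis=0), axis=0)

def concat_before_classifier(views, late_feat):
    # TEF^2-style: build the pseudo-view first, then append the
    # late-stage fusion feature just before the classification layer
    pseudo_view = fuse_views(views)
    return np.concatenate([pseudo_view, late_feat], axis=-1)

def concat_at_original_views(views, late_feat):
    # TEF^3-style: append the late-stage fusion feature to each original
    # view, so the fusion operator itself sees the enhanced views
    enhanced = [np.concatenate([v, late_feat], axis=-1) for v in views]
    return fuse_views(enhanced)
```

The output dimensions coincide in this toy setup; the difference is that in the TEF^3-style variant the fusion operator consumes the enhanced views, which matches the rebuttal's claim that enhancement at the original-view stage better preserves multi-view information.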

W2: Why did you choose evolutionary computation for generating pseudo-views? Have you considered gradient neural architecture search or reinforcement learning?

Re: We comprehensively considered three families of neural architecture search (NAS) methods and ultimately chose evolutionary NAS (eNAS) for the following reasons: compared to gradient-based NAS, which requires a predefined search space and substantial memory, and reinforcement-learning-based NAS, which relies on extensive computational resources, evolutionary NAS offers global search capability, flexibility, and parallelization (Liang et al., 2021). This makes it particularly suitable for handling complex multi-view tasks and large search spaces.
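A minimal elitist truncation-selection loop in the spirit of eNAS might look like the following. The operator names, toy fitness, and all hyperparameters are illustrative assumptions, not the paper's actual search space or evaluation procedure.

```python
import random

# hypothetical fusion operators; the paper's actual search space differs
OPERATORS = ["sum", "concat", "attention", "bilinear"]

def random_arch(n_nodes):
    return [random.choice(OPERATORS) for _ in range(n_nodes)]

def mutate(arch, rate=0.3):
    # point mutation: resample each operator with probability `rate`
    return [random.choice(OPERATORS) if random.random() < rate else op
            for op in arch]

def evolve(fitness, n_nodes=5, pop_size=20, generations=30, seed=0):
    random.seed(seed)
    pop = [random_arch(n_nodes) for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        parents = scored[: pop_size // 2]   # elitist truncation selection
        children = [mutate(random.choice(parents))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=fitness)

# toy fitness standing in for validation accuracy of the fused model
def toy_fitness(arch):
    return arch.count("attention")

best = evolve(toy_fitness)
```

In a real eNAS run, `fitness` would train or fine-tune the candidate fusion architecture and return its validation accuracy, and candidates in a generation can be evaluated in parallel, which is one of the parallelization advantages mentioned above.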

W3: Can early fusion features not assist in training? Why was the final choice made to use late-stage features to address view imbalance?

Re: We analyzed the impact of extracting features from different layers for view enhancement, using the Reuters3 dataset. Specifically, we concatenated features from the last four layers of the fusion architecture and assessed their final effect in TEF. Assuming the fusion architecture has n layers, we extracted features from the n-th, (n-1)-th, (n-2)-th, and (n-3)-th layers, and also set a baseline that uses no such features for comparison. The experimental results, shown in Figure 6, indicate that the closer the layer is to the final one, the better the performance; for instance, concatenating features from the n-th layer improved accuracy by 0.59% compared to the (n-3)-th layer. This is because the deepest layers of the fusion architecture hold the model's most complex and advanced representations, which capture deeper patterns and semantic information in the data. Enhancing the original views with post-fusion multi-view representations improves their expressiveness and prevents information loss during fusion. This approach helps the multi-view fusion architecture achieve better outcomes and alleviates the multi-view imbalance issue to some extent.

| Metric | None | n-3 | n-2 | n-1 |
| --- | --- | --- | --- | --- |
| Accuracy | 84.02 | 84.47 | 84.68 | 85.14 |
| Recall | 83.55 | 83.85 | 84.67 | 85.27 |
| Precision | 84.73 | 84.83 | 85.15 | 85.69 |
| F1 | 84.14 | 84.34 | 84.91 | 85.48 |
| Kappa | 83.44 | 83.59 | 83.85 | 84.41 |
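The layer-selection procedure described above can be sketched as follows: run the fusion network once while recording every layer's activation, then concatenate a chosen late-layer feature onto an original view. This is a minimal illustration, not the paper's code; the tanh network, weight shapes, and function names are assumptions.

```python
import numpy as np

def forward_with_activations(x, weights):
    # run a small fusion network and keep every layer's activation,
    # so late-stage features can be reused for view enhancement
    activations = []
    h = x
    for W in weights:
        h = np.tanh(h @ W)
        activations.append(h)
    return activations

def enhance_view(view, activations, layers_from_end=1):
    # concatenate a chosen late-layer feature onto an original view;
    # layers_from_end=1 picks the n-th (last) layer, 2 the (n-1)-th, etc.
    feat = activations[-layers_from_end]
    return np.concatenate([view, feat], axis=-1)
```

Sweeping `layers_from_end` over 1 to 4 would reproduce the kind of comparison reported in the table, with deeper (later) layers expected to yield the richer enhancement features.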

Thank you again for your professional comments. Please let us know if you have any follow-up questions or if anything needs further clarification. Your insights are valuable to us, and we are ready to provide any additional information that may be helpful.

Comment

I would like to thank the authors for the detailed answers, which have addressed my concerns. After carefully reviewing the other reviews, I will maintain my current ratings.

Comment

Dear Reviewer por9,

Thank you for recognizing the contribution of our work and rating it 8: accept, good paper.

We are glad to hear that our revisions and feedback have successfully addressed your initial concerns.

Comment

Dear Reviewers,

Thank you again for your valuable feedback. We have carefully addressed your comments in the rebuttal and revised the manuscript accordingly.

We kindly invite you to review the updates, and please let us know if you have any further questions or suggestions. Your time and insights are greatly appreciated.

Best regards,

Authors

AC Meta-Review

This paper introduces the Enhancing Trusted Multi-View Classification via Evolutionary Multi-View Fusion (TEF) approach. During the rebuttal phase, the authors effectively addressed key concerns related to experimental fairness, robustness in noisy scenarios, computational efficiency, and generalizability. As a result, three out of four reviewers provided strong ratings of 8, 6, and 6, demonstrating the overall positive reception of the work.

The first-stage pseudo-view generation approach is regarded as sound, as it ensures no leakage of testing samples, thereby maintaining the integrity and independence of the evaluation process. The core contribution of this paper lies in enhancing trusted multi-view classification by advancing feature-level interaction between views—a less-explored but critical aspect in this domain.

Thus, acceptance is recommended.

Additional Comments from the Reviewer Discussion

This paper improves feature-level interactions in trusted multi-view classification and merits acceptance.

Final Decision

Accept (Poster)