PROTOCOL: Partial Optimal Transport-enhanced Contrastive Learning for Imbalanced Multi-view Clustering
Abstract
This paper addresses the critical challenge of class imbalance in multi-view clustering by formulating it as a partial optimal transport problem and introducing class-rebalanced contrastive learning.
Reviews and Discussion
The paper addresses the class-imbalance issue in multi-view clustering by combining UOT and POT to perceive class imbalance, and uses POT-enhanced class rebalancing to mitigate the representation degradation of minority samples in contrastive learning. Through comparisons across multiple datasets and multi-view clustering algorithms, the paper demonstrates the superiority of the proposed method under different imbalance ratios.
Questions for Authors
[1] Why do the constraint conditions in Eq. (12) reflect the distribution pattern of class imbalance?
Claims and Evidence
No. The paper proposes achieving adaptive perception of class imbalance through the adjustment of λ. However, the explanation does not give readers a clear understanding of how λ is adaptively adjusted.
Methods and Evaluation Criteria
The comparative experimental results and related analysis demonstrate the superiority of the proposed method in addressing the class imbalance problem. However, the explanation of the proposed method is difficult to understand.
Theoretical Claims
The main text does not include proofs for the theoretical claims.
Experimental Design and Analysis
The study conducts experimental comparisons based on five commonly used multi-view datasets. However, the experimental setup does not clearly explain how the class-imbalanced datasets were constructed from these five datasets for the comparative experiments in Table 2. Additionally, the experimental procedures for Figures 1 and 2 are not clearly described.
Supplementary Material
Yes. Appendices A and B.
Relation to Prior Literature
The primary contribution of this study lies in addressing the class imbalance problem in multi-view clustering by optimizing it with partial optimal transport. Optimal transport has demonstrated strong capability in handling class imbalance problems in previous studies. Based on this, the authors introduce it into multi-view clustering, aiming to mitigate class imbalance within this context.
Missing Important References
The core of the paper lies in introducing Optimal Transport into multi-view clustering. Although the paper provides some transitions and background, the process is not clearly articulated, making it difficult to understand.
Other Strengths and Weaknesses
Strengths:
- The authors keenly identified the class imbalance problem in multi-view clustering and addressed it by introducing Optimal Transport, tackling the issue from two perspectives: perceiving class imbalance and mitigating representation degradation of minority samples.

Weaknesses:
- The paper does not provide a clear and intuitive explanation of how Optimal Transport addresses class imbalance, making it difficult for readers to comprehend the construction of the proposed method.
- The authors conducted experiments with different imbalance ratios on common multi-view datasets, but the specific preprocessing steps were not detailed.
- Furthermore, the experimental procedures for Tables 1 and 2 were not provided.
Other Comments or Suggestions
[1] Given that the paper does not provide an easily understandable explanation of the proposed method, and that data preprocessing is required to evaluate its clustering performance on class-imbalanced multi-view data, this process should be explicitly presented. Anonymized open-source code should therefore be provided to facilitate a comprehensive understanding and verification of the proposed method's effectiveness.
We sincerely appreciate your recognition of our work's motivation and method. We are also deeply grateful for your thorough review and valuable suggestions.
Q1: An intuitive explanation of how Optimal Transport addresses class imbalance.
A1: Thanks for your suggestion. We would like to draw your attention to Subsection 4.1.3. Specifically, our method perceives class imbalance distributions through the following process:
Given the model's label predictions and the prior distribution constraints, PROTOCOL dynamically adjusts the transport mass (via λ) to gradually assign samples to imbalanced clusters, yielding the POT label matrix:
(1) It first assigns high-confidence samples (those with low transport costs);
(2) it then gradually incorporates lower-confidence samples as the transport mass increases;
(3) the resulting labels naturally reflect the true class distribution.
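For illustration, a minimal sketch of such a progressive mass schedule; the linear schedule, bounds, and names here are our assumptions for exposition, not the paper's exact settings:

```python
def transport_mass(epoch: int, total_epochs: int,
                   lam_min: float = 0.3, lam_max: float = 1.0) -> float:
    """Transport mass grows over training: early epochs let POT assign only
    high-confidence (low-cost) samples; later epochs admit the remaining ones."""
    t = min(epoch / max(total_epochs, 1), 1.0)
    return lam_min + (lam_max - lam_min) * t
```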
PROTOCOL then combines these imbalance-aware labels with class-rebalanced learning (Section 4.2) to address minority-class representation.
We will enhance the logical connections between modules in the Methodology section of our revised version.
Q2: Specific preprocessing.
A2: Thank you for your helpful suggestion. The data preprocessing steps are as follows:
Step 1: We start with the original class-balanced datasets.
Step 2: Based on the imbalance ratio defined in Eq. (7), we compute the sample size for each class in descending order. For instance, for the Hdigit dataset at an imbalance ratio of 0.1, class 1 retains all 1000 samples while class 10 keeps 100; the intermediate class sizes decay geometrically between these extremes: {1000, 774, 599, 464, 359, 278, 215, 167, 129, 100}.
Step 3: For each class, the same sample indices are maintained across all views.
Step 4: To ensure reproducibility, we use fixed random seeds during sample selection.
This preprocessing transforms the dataset into an imbalanced class distribution. The code will be made publicly available upon acceptance.
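A minimal sketch of these four steps (function and variable names are our assumptions, not the released code); for Hdigit with ratio 0.1 and 1000 samples in the largest class, `class_sizes` reproduces {1000, 774, 599, 464, 359, 278, 215, 167, 129, 100}:

```python
import numpy as np

def class_sizes(n_max: int, ratio: float, n_classes: int) -> list[int]:
    # Step 2: sizes decay geometrically from n_max down to n_max * ratio
    return [round(n_max * ratio ** (k / (n_classes - 1))) for k in range(n_classes)]

def subsample_indices(labels: np.ndarray, sizes: list[int], seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)                 # Step 4: fixed seed
    keep = []
    for c, n_c in enumerate(sizes):
        idx = np.where(labels == c)[0]
        keep.append(rng.choice(idx, size=n_c, replace=False))
    # Step 3: the returned indices are reused identically for every view
    return np.concatenate(keep)
```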
Q3: The experimental procedures for Tables 1 and 2.
A3: Thank you for your suggestion. Since you also mentioned Figures 1 and 2, we explain them together.
- Tables 1 and 2: PROTOCOL's implementation is given in Appendix B; dataset details and the preprocessing steps are provided in Table 4 and in A2 above, respectively.
- Figures 1 and 2:
- Figure 1: The model is trained on imbalanced training sets and evaluated on balanced test sets to assess its perception of different classes. The results demonstrate PROTOCOL's robustness.
- Figure 2: We categorize all classes into three groups: Head, Medium, and Tail. The distribution varies by class size:
- For 10-class: Head (first 3 classes), Medium (middle 4 classes), Tail (last 3 classes).
- For 7-class: Head (2), Medium (3), Tail (2).
- For 5-class: Head (1), Medium (3), Tail (1).

This categorization aligns with the Head-Medium-Tail definition, as majority samples fall into head classes while minority samples belong to tail classes (a small helper reproducing the rule is sketched below). The results demonstrate our method's superiority, validating its effectiveness in perceiving actual class distributions.
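A small helper reproducing the grouping rule above (10-class: 3/4/3, 7-class: 2/3/2, 5-class: 1/3/1), assuming classes are ordered from largest (head) to smallest (tail):

```python
def head_medium_tail(n_classes: int):
    splits = {10: (3, 4, 3), 7: (2, 3, 2), 5: (1, 3, 1)}  # rule stated above
    h, m, _ = splits[n_classes]
    classes = list(range(n_classes))
    return classes[:h], classes[h:h + m], classes[h + m:]
```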
Q4: An accessible explanation of the proposed method, the data preprocessing, and anonymized code for validation.
A4: Thank you for your suggestion.
- An accessible explanation of PROTOCOL: see A1.
- Data preprocessing: see A2.
- Anonymous code: we have provided our code through an anonymous link (https://zenodo.org/records/15119555).
- For testing: we provide a pre-trained model on the Hdigit dataset with an imbalance ratio of 0.1 to help verify our method's effectiveness.
- For training: we have released network.py and train.py, which demonstrate the training pipeline and facilitate understanding of our framework.
- Environment setup: please create a virtual environment following our instructions for smooth execution of the code.
- The complete source code will be made publicly available upon paper acceptance.
Q5: Why Eq. (12)'s constraints reflect the class-imbalance distribution.
A5: Thank you for your suggestion. Eq. (12) captures the class imbalance distribution pattern through two key mechanisms:
(1) The relaxed marginal constraints in Eq. (12) implement a soft label-assignment mechanism, enabling flexible sample-to-class assignments. This allows samples to have varying degrees of association with different classes.
(2) The total-mass constraint introduces an adaptive parameter λ to regulate the maximum mass assigned to each class. As λ is dynamically adjusted, higher-confidence samples (majority classes) receive larger mass assignments, while lower-confidence samples (minority classes) receive smaller ones.
Then, we transform Eq. (12) into Eq. (15), enabling the model to adaptively capture the inherent patterns of imbalanced class distributions.
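For intuition, a generic entropic partial-OT template consistent with the two mechanisms above (a standard formulation, not necessarily the paper's exact Eq. (12)):

$$
\min_{Q \ge 0}\ \langle Q, C \rangle - \varepsilon H(Q)
\quad \text{s.t.} \quad
Q \mathbf{1}_K \le \mathbf{a}, \qquad
Q^\top \mathbf{1}_N \le \mathbf{b}, \qquad
\mathbf{1}_N^\top Q \mathbf{1}_K = \lambda,
$$

where $Q$ is the soft label-assignment plan, $C$ the transport-cost matrix, $H$ the entropy, and $\lambda$ the total transported mass. Relaxing the marginals to inequalities yields the soft assignments of mechanism (1), while growing $\lambda$ realizes the progressive mass allocation of mechanism (2).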
In this paper, a novel Partial Optimal Transport (POT)-enhanced contrastive learning framework, PROTOCOL, is proposed to address the class-imbalance challenge in multi-view clustering. A two-level rebalancing strategy achieves balanced feature learning as well as consistency between view-specific and view-shared assignments.
Questions for Authors
- How are view-specific representations aggregated into the consensus representation U?
- In Figure 1, why is there still an imbalance ratio on the balanced test set?
- It may be clearer for the authors to show a comparison of different methods' visualizations on a synthetic, extremely imbalanced dataset. Since the visualization results in Fig. 3 show that most methods can identify those small clusters, the current demonstration may not achieve the authors' original intention.
Claims and Evidence
Yes
Methods and Evaluation Criteria
Yes
Theoretical Claims
Yes
Experimental Design and Analysis
Yes
Supplementary Material
Only part B, experimental supplements.
Relation to Prior Literature
Unbalanced multi-view data is very common in real-world scenarios, but has not been explored much. This paper achieves balanced learning by modifying the paradigm of contrastive learning to make the model more sensitive to minority samples.
Missing Important References
No
Other Strengths and Weaknesses
Strengths:
- Good presentation.
- Comprehensive experiments: The superiority and robustness of the method are verified on multiple datasets, covering different imbalance ratios, and ablation studies and visualization analysis are performed.
Weaknesses:
- Is w(p+) set empirically? Why manually specify such a ratio between view-specific and consensus class alignment? Why not let the model learn it adaptively?
Other Comments or Suggestions
No.
We sincerely appreciate your recognition of both the novelty of our method and the practical value of our motivation, as well as your positive feedback on our paper's presentation and experiments. We are also deeply grateful for your thorough review and valuable suggestions.
Q1: The empirical setting of w(p+) and the possibility of adaptive learning by the model.
A1: Thank you for your helpful suggestion.
Yes, we empirically set the view-specific and consensus alignment weights to 0.8 and 0.2, respectively, based on experimental validation.
Following your suggestion, we implemented adaptive learning for these weights and validated it on three datasets with an imbalance ratio of 0.1. The results show performance improvements of 0.3%~0.5% over the fixed parameters. Notably, the learned weights (ranging over 0.664~0.805 and 0.195~0.336) align well with our empirical values (0.8/0.2), validating the empirical setting. We implemented this by randomly initializing the two weights in [0, 1], with their sum constrained to 1 (a minimal sketch follows the table).
| Dataset | ACC (Fixed: 0.8/0.2) | ACC (Adaptive) | Learned Weights |
|---|---|---|---|
| Caltech | 0.791 | 0.796 (↑0.5%) | 0.805/0.195 |
| Hdigit | 0.892 | 0.895 (↑0.3%) | 0.664/0.336 |
| CIFAR10 | 0.861 | 0.864 (↑0.3%) | 0.782/0.218 |
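A sketch of the adaptive variant described above; the softmax reparameterization keeps the two weights positive and summing to 1, and all names are illustrative rather than the paper's implementation:

```python
import torch

logits = torch.rand(2, requires_grad=True)  # random initialization, as described

def alignment_weights() -> torch.Tensor:
    # w[0] + w[1] = 1 by construction; optimized jointly with the network
    return torch.softmax(logits, dim=0)

# inside the training loop (hypothetical loss names):
# w = alignment_weights()
# loss = w[0] * view_specific_alignment_loss + w[1] * consensus_alignment_loss
```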
Given the improved performance and greater flexibility of adaptive learning, we will adopt this improvement in the revised version. We again thank you for the constructive suggestion.
Q2: Aggregation of view-specific representations into consensus representation.
A2: Thank you for your comment. The consensus representation is obtained through the following steps:
Step 1: View-specific representations are learned from the original data through autoencoders.
Step 2: Inter-sample structural relationships are captured in a relationship matrix through a Transformer-based self-attention mechanism.
Step 3: Structure-aware representations are computed for each view by applying the relationship matrix to the view-specific representations.
Step 4: View weights are learned through a view-weight learning module.
Step 5: The final consensus representation is obtained by weighted fusion of the structure-aware representations (a minimal sketch is given below).
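A minimal sketch of Steps 2-5 under stated assumptions: the scaled dot-product attention and the norm-based view scoring are simple stand-ins for the paper's learned relationship and view-weight modules.

```python
import torch

def consensus_fusion(view_reps: list[torch.Tensor]) -> torch.Tensor:
    """view_reps: list of (N, d) view-specific representations from Step 1."""
    structured = []
    for Z in view_reps:
        # Steps 2-3: self-attention relationship matrix R, structure-aware reps R @ Z
        R = torch.softmax(Z @ Z.t() / Z.shape[1] ** 0.5, dim=1)
        structured.append(R @ Z)
    # Step 4: view weights (a stand-in for the learned view-weight module)
    scores = torch.stack([S.norm(dim=1).mean() for S in structured])
    w = torch.softmax(scores, dim=0)
    # Step 5: weighted fusion into the consensus representation U
    return sum(wi * Si for wi, Si in zip(w, structured))
```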
Q3: Regarding the imbalance ratio in Figure 1.
A3: Thank you for your comment. The imbalance ratio only applies to the training set, while the test set remains balanced. PROTOCOL maintains superior performance across different training imbalance ratios, demonstrating its effectiveness and robustness in handling class-imbalanced multi-view data.
Q4: Adding visualization results for more extreme imbalance data.
A4: Thank you for your insightful suggestion. Following your recommendation, we conducted tests on the Hdigit dataset with an even more extreme imbalance ratio of 0.05, with visualization results shown in Figure A3 of the PDF file provided in the anonymous link (https://zenodo.org/records/15117646). The results demonstrate that, compared to baseline methods, PROTOCOL effectively identifies smaller clusters and clearly distinguishes cluster structures of varying scales, validating our method's effectiveness and robustness under extreme imbalance ratios.
To more intuitively demonstrate PROTOCOL's ability to perceive imbalanced data distributions, we conducted a quantitative analysis of the clustering results from Figure 3 in the original paper and Figure A3, where we calculated the number of samples in each class from the test results and computed the actual imbalance ratios.
As shown in the table below, at a preset imbalance ratio of 0.1, the other methods produced actual imbalance ratios between 0.26 and 0.39, while PROTOCOL achieved an actual imbalance ratio of only 0.14. Similarly, at a preset ratio of 0.05, the other methods produced actual ratios between 0.23 and 0.37, while PROTOCOL achieved only 0.12. This indicates that our method more accurately perceives and maintains the class-distribution characteristics of the original data.
| Actual imbalance ratio | MFLVC | CSOT | GCFAggMVC | SEM | PROTOCOL |
|---|---|---|---|---|---|
| Preset ratio 0.1 | 0.26 | 0.28 | 0.39 | 0.38 | 0.14 |
| Preset ratio 0.05 | 0.28 | 0.25 | 0.23 | 0.37 | 0.12 |
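The actual imbalance ratio above is simply the size ratio of the smallest to the largest predicted cluster; a one-liner for that computation:

```python
import numpy as np

def actual_imbalance_ratio(pred_labels: np.ndarray, n_clusters: int) -> float:
    counts = np.bincount(pred_labels, minlength=n_clusters)
    return counts.min() / counts.max()  # e.g. Hdigit ground truth: 100/1000 = 0.1
```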
I appreciate the answers and clarification. I have no concerns about the work and hence keep the rating.
Thank you for your positive assessment of our work. We sincerely appreciate your time and effort.
The paper introduces PROTOCOL, a new method for imbalanced multi-view clustering. It combines partial optimal transport (POT) with contrastive learning. The approach solves two main problems: perceiving class imbalance distributions through POT-based label assignment and reducing the representation degradation of minority samples using rebalancing strategies at the feature and class levels. Tests on multiple datasets show that PROTOCOL performs better, especially when data is highly imbalanced.
Questions for Authors
- Theoretical Justification: Could the authors elaborate on the theoretical foundation of the POT scaling algorithm, such as its convergence properties? This would significantly enhance the paper's theoretical contributions.
- Runtime Analysis: How does the computational cost of PROTOCOL scale with dataset size?
Claims and Evidence
The claims presented in the paper are robustly supported by the experimental evidence. (1) The claims about POT's effectiveness for imbalanced MVC are supported by ablation studies and t-SNE visualizations, which show clearer cluster boundaries. (2) The superiority over baselines is validated across all datasets and metrics.
Methods and Evaluation Criteria
Yes
Theoretical Claims
Yes. The paper is based on theoretical concepts of OT and contrastive learning.
Experimental Design and Analysis
The work performs several experiments on five datasets with three different imbalance ratios.
Supplementary Material
Yes.
Relation to Prior Literature
The paper introduces a new method for imbalanced multi-view clustering by combining partial optimal transport (POT) with contrastive learning.
Missing Important References
The literature review in the paper is quite substantial, but some related papers are still not mentioned. Including recent single-view and multi-view class-imbalance approaches (e.g., [1]) would enhance the paper's comprehensiveness and contextualize its contributions more effectively. [1] Zhou Q., Sun B. Adaptive K-means clustering based under-sampling methods to solve the class imbalance problem. Data and Information Management, 2024, 8(3): 100064.
Other Strengths and Weaknesses
Strengths:
- Originality: The novel integration of POT and contrastive learning offers a fresh approach to imbalanced multi-view clustering.
- Practical Value: Addressing real-world imbalanced data challenges highlights the paper’s potential impact in applications like ecological monitoring.
- Thorough Evaluation: Rigorous experiments across datasets and imbalance scenarios demonstrate the framework's performance.

Weaknesses:
- The computational cost of POT might pose challenges for large-scale applications, which could be a direction for future optimization.
- The paper's structure could be clarified to better highlight the logical connections between components. A more explicit explanation of how each module addresses specific challenges would help readers appreciate the framework's coherence.
Other Comments or Suggestions
Refer to the weaknesses.
We sincerely appreciate your recognition of our work's novelty and its potential impact in enhancing multi-view clustering for real-world imbalanced scenarios, as well as your positive feedback on our experimental results. We are also deeply grateful for your thorough review and helpful suggestions.
Q1: The computational cost of POT for large-scale applications could be a direction for future optimization.
A1: Thank you for your insightful suggestion. As shown in A4 below, we validated PROTOCOL's computational cost across varying data scales (from 5K to 40K data points). The results demonstrate near-linear growth, showing PROTOCOL's potential for large-scale applications. In future work, we will continue to investigate and analyze POT's computational efficiency on datasets of even larger scale.
Q2: Strengthen the logical connections between components.
A2: Good suggestion! We will enhance the logical structure of Section 4 (Methodology) as follows:
4.1 Motivation. First, we will clearly articulate the two major challenges in imbalanced multi-view clustering, helping readers better understand the correspondence between challenges and their respective solutions.
4.2 Multi-view POT Label Allocation (originally 4.1). We will explicitly state at the beginning: "We propose a multi-view POT label allocation method that learns imbalanced class distribution of multi-view data through multi-view representation learning and a POT-based self-labeling mechanism." Additional logical connections will be added between subsubsections to strengthen coherence. At the end, we will add a transition paragraph: "Through the learning of these components, PROTOCOL can effectively perceive the imbalanced distribution of multi-view data. This leads to the next challenge: how to mitigate representation degradation of minority samples, which we will address in next subsection."
4.3 Multi-view Class-rebalanced Contrastive Learning (originally 4.2). We will first analyze the fundamental causes of representation degradation in minority samples, then introduce our solution.
At the end of the Methodology, we will summarize how PROTOCOL systematically addresses the two challenges. Specifically, we will add the following description: "Imbalanced multi-view data is a more realistic application setting. PROTOCOL addresses the two key challenges of imbalanced multi-view data through POT self-label allocation and class-rebalanced contrastive learning."
These modifications will make the logical connections between components more prominent.
Q3: Convergence theory analysis.
A3: Thank you for your constructive suggestion. Due to space limitations, we provide a brief convergence analysis here; full theoretical foundations will be given in the revised version.
Our POT scaling algorithm extends the Sinkhorn-Knopp iteration by incorporating partial optimal transport with weighted KL-divergence constraints. The algorithm achieves optimal label assignment through an efficient dual-form scaling iteration. Based on [1], we prove that, under suitable conditions on the regularization parameters, the algorithm converges linearly to a unique solution. The convergence rate depends on the entropic regularization parameter, the weighted KL-divergence weight, and the condition number of the cost matrix. Our method introduces a dynamic mass parameter for a smooth transition from high-confidence samples to the global optimal solution. Moreover, experimental results validate both the efficiency and effectiveness of PROTOCOL, demonstrating its stability on imbalanced multi-view clustering.
[1] Scaling Algorithms for Unbalanced Optimal Transport Problems (Mathematics of Computation 2018)
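For concreteness, a minimal dual-scaling sketch in the spirit of [1] for entropic OT with KL-relaxed marginals; the parameter values and marginals are illustrative, and this is a generic template rather than the paper's exact algorithm:

```python
import torch

def uot_scaling(C: torch.Tensor, a: torch.Tensor, b: torch.Tensor,
                eps: float = 0.05, rho: float = 1.0, iters: int = 200) -> torch.Tensor:
    """Dual scaling iterations for entropic OT with KL-penalized marginals a, b."""
    K = torch.exp(-C / eps)             # Gibbs kernel of the cost matrix
    u, v = torch.ones_like(a), torch.ones_like(b)
    f = rho / (rho + eps)               # exponent induced by the KL relaxation
    for _ in range(iters):
        u = (a / (K @ v)) ** f          # row (sample) scaling
        v = (b / (K.t() @ u)) ** f      # column (class) scaling
    return u[:, None] * K * v[None, :]  # transport plan / soft label assignment
```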
Q4: About PROTOCOL's computational cost scaling with dataset size.
A4: Thank you for your valuable suggestion. Per your suggestion, we evaluated PROTOCOL's computational cost across four different data scales (5K to 40K samples) on the CIFAR10 dataset. As shown in Figure A2 of the PDF file provided in the anonymous link (https://zenodo.org/records/15119555), the results demonstrate that PROTOCOL's computational cost scales nearly linearly with the number of samples. All experiments were conducted on an NVIDIA GeForce RTX 3090 GPU.
Q5: About recent works on imbalanced multi-view clustering and suggested Ref [1].
A5: Thank you for your suggestion. To the best of our knowledge, we are the first to systematically study the imbalanced multi-view clustering problem. Although [1] addresses class imbalance in the single-view setting, which differs from our multi-view approach, we will discuss it in the revised version.
[1] Adaptive K-means clustering based under-sampling methods to solve the class imbalance problem (Data and Information Management 2024)
This paper proposes the first systematic study of the common class imbalance problem in multi-view clustering and develops a new framework called PROTOCOL. The method reformulates imbalanced clustering as a partial optimal transport problem by mapping multi-view features into a consensus space, and introduces progressive mass constraints and a weighted KL divergence to perceive class imbalance. Meanwhile, POT-enhanced class-rebalanced contrastive learning is applied at the feature and class levels, combined with logit adjustment and class-sensitive learning, to alleviate the representation degradation of minority samples.
Questions for Authors
NA
Claims and Evidence
Yes
Methods and Evaluation Criteria
Yes
Theoretical Claims
Yes. The methodology of the paper has been reviewed.
Experimental Design and Analysis
Yes. The experimental setting and results have been reviewed.
Supplementary Material
Yes. The ‘Details of Experiments’ and ‘Algorithm’ have been reviewed.
Relation to Prior Literature
The key contributions of the paper are related to the broader scientific literature on Multi-View Clustering.
Missing Important References
The paper comprehensively reviews the most relevant literature in the fields of Multi-View Clustering.
Other Strengths and Weaknesses
Strengths:
- The paper innovatively integrates partial optimal transport with contrastive learning, utilizing progressive mass constraints and a weighted KL divergence to effectively perceive and model imbalanced distributions, while simultaneously enhancing the representation of minority samples at multiple levels.
- Extensive experiments conducted on five datasets convincingly demonstrate the method’s superior performance in handling imbalanced multi-view data, providing robust empirical support for the proposed approach.
Weaknesses:
- Although the paper targets imbalanced clustering, it does not clearly describe the specific operations involved nor adequately articulate the inherent challenges of imbalanced clustering in the motivation section.
- The experiments are limited to datasets with a maximum scale of only 50,000 samples; the authors should consider validating their approach on larger-scale datasets.
- In regard to Equation 28, which introduces the common semantic loss, the paper should provide a clearer explanation of its advantages and its specific impact on imbalanced clustering scenarios.
Other Comments or Suggestions
NA
We sincerely appreciate your recognition of our work's novelty as the first to identify and systematically study the class imbalance problem in multi-view clustering, as well as your positive feedback on our method's effectiveness and robustness. Furthermore, we are deeply grateful for your thorough review and constructive suggestions on our manuscript.
Q1: The inherent challenges of imbalanced clustering.
A1: Thank you for your constructive suggestion. We will further clarify the two main challenges of imbalanced multi-view clustering (see lines 43 and 61) in the Motivation subsection of the revised version.
We will add a Motivation subsection in the Methodology section to clarify the two key challenges and explicitly indicate how each module of our method addresses these challenges:
(1) How to perceive class-imbalanced distributions. The challenge lies in detecting imbalanced distributions without labeled data in unsupervised settings. Existing methods, which assume uniform class distributions, often fail to handle imbalanced data effectively. This challenge is addressed in the Multi-view POT Label Allocation subsection.
(2) How to mitigate representation degradation of minority samples. Minority samples, due to their scarcity, often receive insufficient attention during learning, resulting in poor feature representations that inadequately characterize their classes. This challenge is addressed in the Multi-view Class-rebalanced Contrastive Learning subsection.
This creates a more coherent flow from challenges to solutions, helping readers better understand both our motivation and technical approach.
Q2: Validate our method on larger-scale datasets.
A2: Thank you for your insightful suggestion. Per your suggestion, we validated our method on a larger-scale dataset (CIFAR100, 60,000 samples). As shown in Figure A1 of the PDF file provided in the anonymous link (https://zenodo.org/records/15119555), PROTOCOL achieves the best performance compared to other methods, demonstrating its effectiveness in handling class imbalance (imbalance ratio 0.1) on large-scale datasets.
CIFAR100 is considered a large-scale dataset among those commonly used in multi-view clustering [1-4]. In future work, we will continue to explore PROTOCOL's potential on even larger-scale datasets.
[1] A Comprehensive Survey on Multi-View Clustering (TKDE 2023)
[2] Representation Learning in Multi‑view Clustering: A Literature Review (Data Sci. Eng 2022)
[3] Differentiable Hierarchical Optimal Transport for Robust Multi-View Learning (TPAMI 2023)
[4] Adversarially Robust Deep Multi-View Clustering: A Novel Attack and Defense Framework (ICML 2024)
Q3: The advantages of Eq. (28) in imbalanced multi-view clustering scenarios.
A3: Good suggestion! Eq. (28) employs contrastive learning to maintain semantic consistency across views for the same class. The denominator term measures negative pair similarities, ensuring comprehensive discrimination between classes. This design helps distinguish minority from majority classes while preserving cross-view semantic consistency, thereby enhancing representation learning for minority classes.
In imbalanced multi-view clustering, cross-view semantic alignment is crucial due to minority classes' lower error tolerance. Unlike balanced scenarios where abundant samples can help correct semantic bias, minority classes have limited samples to rely on. Cross-view semantic alignment enables different views to complement each other, effectively reducing representation bias for minority classes.
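For illustration, an InfoNCE-style cross-view semantic loss consistent with this description; it is a generic formulation with hypothetical names, not the paper's exact Eq. (28):

```python
import torch
import torch.nn.functional as F

def cross_view_semantic_loss(P1: torch.Tensor, P2: torch.Tensor, tau: float = 0.5):
    """P1, P2: (K, N) class-semantic (soft-label) matrices from two views."""
    P1, P2 = F.normalize(P1, dim=1), F.normalize(P2, dim=1)
    sim = P1 @ P2.t() / tau                  # (K, K) class-to-class similarities
    pos = sim.diag()                         # same class across views: positive pairs
    # denominator aggregates negative-pair similarities over all classes
    return -(pos - torch.logsumexp(sim, dim=1)).mean()
```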
We will add a clear explanation of Eq. (28) in the revised version to highlight its advantages in imbalanced multi-view clustering scenarios.
Thanks to the authors for the response; I have decided to keep my rating unchanged.
We appreciate your recognition of our work and thank you for your time and effort.
The paper proposes PROTOCOL, a new framework for imbalanced multi-view clustering. All reviewers acknowledge the novelty of this work and the thoroughness of its experimental analysis, unanimously recommending a Weak Accept (WA) rating. Based on this consensus, I recommend accepting the paper.