PaperHub

Overall rating: 7.3/10 · Spotlight · 4 reviewers (lowest 4, highest 5, standard deviation 0.5)
Reviewer ratings: 5, 5, 4, 4 · Confidence: 4.5
Novelty: 3.5 · Quality: 3.3 · Clarity: 2.8 · Significance: 3.0

NeurIPS 2025

Scalable Cross-View Sample Alignment for Multi-View Clustering with View Structure Similarity

Submitted: 2025-05-06 · Updated: 2025-10-29

Abstract

Keywords
Multi-view Clustering; K-means Clustering; Similarity Graph Learning; Representation Learning

Reviews and Discussion

Review
Rating: 5

A scalable multi-view clustering algorithm based on sample alignment is proposed in this paper, where an alignment matrix is learned and incorporated into a late-fusion framework to facilitate clustering in the absence of aligned samples.

Strengths and Weaknesses

a. The manuscript is well-written and introduces a novel perspective on addressing sample misalignment in multi-view clustering.

b. Experimental results on eight benchmark datasets demonstrate the effectiveness of the proposed method.

c. It remains unclear whether cross-view similarity between unaligned representations can reliably capture semantic correspondence under strong view heterogeneity.

d. The scalability of the method to large-scale data is not thoroughly discussed, particularly regarding the computational implications and potential optimizations.

e. While the use of a k-nearest neighbor graph is intuitive, the sensitivity of the model to the choice of k warrants further analysis.

Questions

a. Without explicit correspondence, what principle underlies the computation of cross-view similarity between non-aligned samples?

b. What is the rationale for enforcing orthogonality on the alignment matrix M? A clear justification is currently lacking and should be provided.

Limitations

Yes

Justification for Final Rating

The authors have addressed my concerns well, and I have decided to raise my rating.

Formatting Issues

None

Author Response

Q1: It remains unclear whether cross-view similarity between unaligned representations can reliably capture semantic correspondence under strong view heterogeneity.
A1: Thanks for your careful review. We would like to clarify that in our framework, the cross-view similarity between unaligned representations is not intended to directly capture semantic correspondence or high-level semantic consistency across views. Instead, it mainly estimates sample-level correspondence between views under unaligned scenarios. Specifically, our approach constructs cross-view similarity graphs by leveraging the view-specific graph structures, i.e., the similarity between the aligned samples and the unaligned samples for each view. These similarity measures are designed to identify samples that are likely to be counterparts across views, even in the presence of view heterogeneity. They are not responsible for learning semantic representations, but rather provide a relational criterion to guide the sample alignment process.

Q2: The scalability of the method to large-scale data is not thoroughly discussed, particularly regarding the computational implications and potential optimizations.
A2: Thanks for your constructive comments. As detailed in Section 4.3, the primary sources of computational cost in our proposed method include cross-sample similarity learning, cross-view similarity construction, and the late fusion stage. Among these, the largest computational cost is the generation of the partition matrix $\mathbf{F}^v$ during the late fusion process, which leads to an overall complexity of $\mathcal{O}(n^2)$. Fortunately, this cost can be significantly reduced in practice. In particular, when more efficient or approximate strategies are adopted for generating $\mathbf{F}^v$, such as sparse graph construction, anchor-based clustering, or randomized low-rank approximations, the computational complexity can be reduced to nearly linear in the number of samples. Moreover, other components of the proposed method, such as CLM-based baseline view selection and cross-view similarity learning, scale linearly or near-linearly with respect to the number of samples. In our experimental results, the proposed method has already demonstrated the ability to handle relatively large datasets within a reasonable time.
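As a rough illustration of the sparse-graph strategy mentioned above (our own numpy sketch, not the authors' implementation): storing only each sample's $k$ strongest affinities keeps the graph at $O(nk)$ entries instead of the $O(n^2)$ dense graph.

```python
import numpy as np

def sparse_knn_affinities(X, k):
    """Keep only each sample's k nearest neighbours, so the affinity
    graph needs O(n*k) storage instead of O(n^2) for the dense graph."""
    sq = np.sum(X ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * X @ X.T    # squared Euclidean distances
    np.fill_diagonal(D, np.inf)                      # exclude self-loops
    idx = np.argpartition(D, k, axis=1)[:, :k]       # k nearest neighbours per row
    w = np.exp(-np.take_along_axis(D, idx, axis=1))  # Gaussian-style edge weights
    return idx, w

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))
idx, w = sparse_knn_affinities(X, k=5)
# 200 * 5 = 1000 stored edges, versus 200 * 200 = 40000 for a dense graph
```

For truly large n, the dense distance matrix inside the helper would itself be replaced by a blocked or approximate neighbour search; it is kept dense here only for brevity.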

Q3: While the use of a k-nearest neighbor graph is intuitive, the sensitivity of the model to the choice of k warrants further analysis.
A3: Thank you for your valuable comment. We agree that the choice of the $k$-nearest neighbors parameter may impact the performance of the proposed method, and we therefore conduct additional experiments to analyze the proposed method's sensitivity to different values of $k$ under 50% sample alignment. As shown in the following experimental results, we evaluated clustering performance in terms of ACC, NMI, ARI, and F1score across a range of $k$ values ($k$=5, 9, 15, 19, 25) on six benchmark datasets. The results demonstrate that while minor performance fluctuations exist across different $k$ values, the overall performance remains relatively stable. For instance, on the HW and 100leaves datasets, the ACC and NMI remain consistently high regardless of the choice of $k$, suggesting that our method is robust to the $k$ setting. Although some datasets exhibit moderate variations, these changes do not significantly affect the general ranking of results. Moreover, the best performance does not concentrate on a single $k$ value, indicating that the model is not sensitive to a specific value. In summary, these observations confirm that our method maintains stable and competitive performance across a reasonable range of $k$ values, suggesting that it is not highly sensitive to the neighborhood size. This robustness enhances the practicality of our method, especially when the optimal $k$ is not known a priori.
Effect of $k$-nearest neighbors on ACC under 50% sample alignment.

| $k$-nn | Yale | 3sources | MSRCV | 100leaves | HW | Scene |
|---|---|---|---|---|---|---|
| $k$=5 | 63.33±2.99 | 62.49±0.95 | 85.76±0.15 | 70.85±1.29 | 95.29±3.48 | 35.63±1.49 |
| $k$=9 | 64.79±2.48 | 61.39±0.50 | 84.45±0.32 | 70.65±1.36 | 96.07±0.04 | 36.17±0.58 |
| $k$=15 | 64.24±3.62 | 64.44±1.29 | 83.52±0.39 | 70.63±1.29 | 96.55±0.00 | 35.91±0.27 |
| $k$=19 | 64.52±3.09 | 62.57±0.33 | 83.38±0.60 | 70.76±0.98 | 95.84±2.94 | 36.60±0.75 |
| $k$=25 | 64.09±2.40 | 64.56±0.18 | 83.31±0.57 | 70.68±0.91 | 95.28±3.34 | 36.11±1.03 |

Effect of $k$-nearest neighbors on NMI under 50% sample alignment.

| $k$-nn | Yale | 3sources | MSRCV | 100leaves | HW | Scene |
|---|---|---|---|---|---|---|
| $k$=5 | 68.18±1.41 | 57.97±0.55 | 72.96±0.32 | 84.55±0.33 | 91.43±1.33 | 30.81±0.44 |
| $k$=9 | 69.34±1.16 | 57.67±0.13 | 71.66±0.45 | 84.65±0.52 | 91.42±0.07 | 29.85±0.55 |
| $k$=15 | 69.31±1.32 | 61.20±0.83 | 70.28±0.55 | 85.34±0.42 | 92.09±0.00 | 30.27±0.20 |
| $k$=19 | 69.26±1.82 | 58.81±0.31 | 70.93±0.37 | 84.50±0.41 | 91.75±1.09 | 29.53±0.68 |
| $k$=25 | 68.91±1.51 | 62.05±0.06 | 70.87±0.38 | 84.61±0.38 | 91.41±1.15 | 30.16±0.29 |

Effect of $k$-nearest neighbors on ARI under 50% sample alignment.

| $k$-nn | Yale | 3sources | MSRCV | 100leaves | HW | Scene |
|---|---|---|---|---|---|---|
| $k$=5 | 47.26±2.43 | 43.06±1.24 | 70.11±0.31 | 58.95±0.98 | 91.12±3.36 | 17.45±0.71 |
| $k$=9 | 49.23±1.67 | 41.67±0.65 | 67.97±0.58 | 59.34±1.49 | 91.48±0.08 | 16.78±0.50 |
| $k$=15 | 48.91±2.65 | 45.98±1.61 | 66.34±0.66 | 59.97±0.95 | 92.49±0.00 | 17.15±0.32 |
| $k$=19 | 49.24±2.82 | 43.88±0.32 | 66.45±0.80 | 58.94±1.14 | 91.80±2.68 | 16.91±0.63 |
| $k$=25 | 48.46±2.11 | 47.10±0.18 | 66.34±0.78 | 59.09±1.00 | 91.10±3.06 | 17.66±0.56 |

Effect of $k$-nearest neighbors on F1score under 50% sample alignment.

| $k$-nn | Yale | 3sources | MSRCV | 100leaves | HW | Scene |
|---|---|---|---|---|---|---|
| $k$=5 | 50.59±2.25 | 54.20±1.03 | 74.27±0.27 | 59.36±0.96 | 92.02±2.98 | 23.24±0.65 |
| $k$=9 | 52.44±1.56 | 53.10±0.52 | 72.43±0.50 | 59.74±1.47 | 92.33±0.07 | 22.64±0.44 |
| $k$=15 | 52.17±2.43 | 56.59±1.34 | 71.03±0.57 | 60.37±0.94 | 93.24±0.00 | 22.94±0.29 |
| $k$=19 | 52.47±2.61 | 54.82±0.27 | 71.14±0.68 | 59.34±1.12 | 92.63±2.39 | 22.78±0.56 |
| $k$=25 | 51.72±1.98 | 57.44±0.15 | 71.04±0.67 | 59.49±0.99 | 92.00±2.72 | 23.41±0.50 |

Q4: Without explicit correspondence, what principle underlies the computation of cross-view similarity between non-aligned samples?
A4: Thanks for your insightful comments. Our proposed method addresses this challenge through a carefully designed two-step strategy. First, within each view, every unaligned sample is represented by its affinity relative to the aligned samples in the same view, effectively embedding the unaligned data into a consistent reference criterion defined by these aligned samples. Since the aligned samples are assumed to correspond across views, cross-view similarity between unaligned samples can then be evaluated indirectly by comparing their affinities over the common aligned samples. This approach builds on the fact that non-aligned samples displaying similar affinity patterns with a common set of aligned samples across their respective views are likely to share semantic consistency, even without explicit correspondence. Furthermore, the experimental results demonstrate the effectiveness of the proposed strategies.
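The two-step principle described above can be sketched in a few lines of numpy (a hypothetical illustration with made-up helper names, not the paper's actual construction): each unaligned sample is described by its normalized affinity to the aligned anchors of its own view, and cross-view similarity is the cosine similarity between those affinity profiles, which is meaningful because the anchors correspond across views.

```python
import numpy as np

def affinity_profile(U, A):
    """Describe each unaligned sample (rows of U) by its Gaussian affinity
    to the aligned anchor samples (rows of A) of the same view."""
    D = ((U[:, None, :] - A[None, :, :]) ** 2).sum(-1)
    S = np.exp(-D / D.mean())
    return S / np.linalg.norm(S, axis=1, keepdims=True)   # unit rows for cosine

def cross_view_similarity(Uv, Av, Uw, Aw):
    """Anchors correspond across views, so comparing affinity profiles gives
    a similarity between non-aligned samples of different views."""
    return affinity_profile(Uv, Av) @ affinity_profile(Uw, Aw).T

# Toy check: two views are low-noise copies of a shared latent space.
rng = np.random.default_rng(1)
Za = rng.standard_normal((12, 2)) * 3   # latent positions of aligned anchors
Zu = rng.standard_normal((6, 2)) * 3    # latent positions of unaligned samples
noise = lambda shape: 0.01 * rng.standard_normal(shape)
S = cross_view_similarity(Zu + noise(Zu.shape), Za + noise(Za.shape),
                          Zu + noise(Zu.shape), Za + noise(Za.shape))
# true counterparts should be each row's most similar column
```

In the heterogeneous-view case the two views would live in different feature spaces; only the affinity profiles over the shared anchors, not the raw features, need to be comparable.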

Q5: What is the rationale for enforcing orthogonality on the alignment matrix M? A clear justification is currently lacking and should be provided.
A5: We appreciate your insightful question. In our framework, we aim to establish soft correspondences between unaligned samples by reconstructing them based on structurally relevant information within their respective subspaces, rather than relying on hard 0-1 alignment. The orthogonality constraint is introduced to preserve the intrinsic geometric structure of the data during this reconstruction process. Specifically, orthogonality helps ensure that the transformation encoded by $\mathbf{M}$ behaves in a structure-preserving manner, avoiding distortion of sample relationships that could arise from degenerate or overly flexible alignments. This is particularly important in the presence of noise and semantic inconsistencies across views, where maintaining relative distances and semantic separability becomes challenging. Additionally, orthogonality serves as a form of regularization that stabilizes the optimization process and prevents trivial or collapsed solutions.
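As a small illustration of why orthogonality is structure-preserving (our sketch, not the paper's optimization scheme): an orthogonal $\mathbf{M}$ satisfies $\mathbf{M}\mathbf{M}^\top = \mathbf{I}$, so it leaves all pairwise distances unchanged, and an unconstrained update can be projected back onto the orthogonal group via an SVD.

```python
import numpy as np

def nearest_orthogonal(A):
    """Project a square matrix onto the orthogonal group (Procrustes-style):
    drop the singular values, keep the rotation factors."""
    U, _, Vt = np.linalg.svd(A)
    return U @ Vt

def pairwise_dists(X):
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

rng = np.random.default_rng(0)
M = nearest_orthogonal(rng.standard_normal((4, 4)))  # orthogonal by construction
X = rng.standard_normal((10, 4))

# An orthogonal map distorts no sample relationships:
assert np.allclose(M @ M.T, np.eye(4))
assert np.allclose(pairwise_dists(X), pairwise_dists(X @ M))
```

An unconstrained alignment matrix has no such guarantee: it can shrink or collapse directions, which is exactly the degenerate behavior the constraint rules out.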

Comment

I have read the rebuttal, and my concerns have been well addressed by the authors, especially regarding the sensitivity of the model to the choice of k in the k-nearest neighbor graph. Therefore, I tend to give a positive vote for this paper.

Comment

Thank you very much for your positive feedback. We sincerely appreciate your time and effort in reviewing our work. We are glad to know that our responses have addressed your concerns. Your comments have been invaluable in improving the quality of our manuscript.

Best regards,
Authors

Review
Rating: 5

A novel multi-view clustering method based on cross-view sample alignment is proposed in this manuscript, and its effectiveness is demonstrated through experiments on eight multi-view datasets.

Strengths and Weaknesses

Strengths:
(1) The paper addresses the challenge of multi-view clustering with unaligned samples in a principled and practical manner. The proposed framework is conceptually clear and yields competitive results across multiple benchmarks.
(2) The use of cluster-label matching (CLM) for benchmark view selection is a noteworthy design choice. This mechanism appears to improve alignment stability, particularly in the presence of inconsistent or noisy views.
(3) The introduction of an alignment graph to connect aligned and unaligned samples enhances representational flexibility and contributes to performance gains under partial correspondence settings.

Points for Consideration:
(1) It would be helpful for the authors to clarify the specific motivations behind adopting the CLM mechanism for benchmark view selection. A brief discussion comparing it to alternative strategies (e.g., view consensus or uncertainty-based methods) could further strengthen the justification.
(2) As the alignment process is anchored on a single benchmark view, there is a potential risk that misrepresentative or noisy data within this view could influence the quality of alignment across other views. Some discussion on the mitigation of such risks would be valuable.
(3) Constructing similarity graphs between non-aligned and aligned samples is central to the proposed method. However, the paper could benefit from further analysis on how initial clustering inaccuracies might affect the propagation of alignment errors.
(4) While the alignment matrix plays a key role in sample correspondence, its robustness under noisy benchmark views remains unclear. Additional empirical or theoretical insight into the stability of alignment in such cases would enhance the completeness of the work.

Questions

While the proposed late-fusion strategy elegantly avoids the need for early alignment, it raises an interesting question: to what extent does this design truly exploit the structural and semantic complementarities between views? One might wonder whether certain types of cross-view information—particularly those that only emerge through joint modeling—are inadvertently underutilized in the alignment process.

Another intriguing aspect is the reliance on cluster-label matching for selecting a benchmark view. This appears to work well in practice, yet it remains somewhat unclear how robust this step is to imperfections in the initial clustering. Given that early-stage clustering often involves noise or instability, especially in high-dimensional or heterogeneous views, it would be worth exploring how such variance might influence the view selection outcome—and, by extension, the final alignment quality.

Limitations

Yes.

Justification for Final Rating

Most of my concerns have been resolved. I would like to recommend acceptance.

Formatting Issues

N/A

Author Response

Q1: It would be helpful for the authors to clarify the specific motivations behind adopting the CLM mechanism for benchmark view selection. A brief discussion comparing it to alternative strategies could further strengthen the justification.
A1: Thanks for your valuable suggestion. The motivation for adopting the CLM-based mechanism lies in its ability to provide a view-independent evaluation of clustering quality. Unlike many existing methods that either randomly select a baseline view or rely on view consensus, our approach aims to directly assess the intrinsic structural information of each view's clustering. Consensus-based strategies typically identify a baseline view by measuring its agreement with other views. However, such methods may favor views that are consistent yet collectively poor in quality, leading to confirmation bias across views. Similarly, uncertainty-based methods rely on the distributional confidence, but they often lack explicit consideration of clustering structure, especially in the absence of ground truth labels. In contrast, the CLM mechanism jointly evaluates intra-cluster compactness and inter-cluster separability in a scale, shift, and class cardinality-invariant manner. This allows for fair and consistent comparison across heterogeneous views. By selecting the view with the highest CLM score, the most structurally reliable view is used in the proposed method, thereby enhancing the overall robustness of cross-view alignment.
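The selection logic itself is simple to sketch. The code below uses a generic compactness/separation ratio as a stand-in for the CLM score, whose actual definition is given in Eqs. (4)-(5) of the paper; only the "score every view's clustering, keep the best view" mechanism is the point here, and all names are our own.

```python
import numpy as np

def structure_score(X, labels):
    """Illustrative internal clustering-quality score (NOT the CLM score):
    ratio of the largest inter-center distance to mean within-cluster spread."""
    ids = np.unique(labels)
    centers = np.stack([X[labels == c].mean(0) for c in ids])
    within = np.mean([np.linalg.norm(X[labels == c] - centers[i], axis=1).mean()
                      for i, c in enumerate(ids)])
    between = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1).max()
    return between / (within + 1e-12)   # higher = better-separated clusters

def select_benchmark_view(views, labelings):
    """Score each view's clustering and pick the structurally best view."""
    return int(np.argmax([structure_score(X, y) for X, y in zip(views, labelings)]))

# Toy check: one view has clean clusters, the other is noisy.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
clean = np.where(y[:, None] == 0, 0.0, 10.0) + rng.standard_normal((100, 2))
noisy = np.where(y[:, None] == 0, 0.0, 1.0) + rng.standard_normal((100, 2)) * 3
best = select_benchmark_view([clean, noisy], [y, y])
```

The clean view wins the selection, mirroring the rebuttal's argument that scoring structural quality avoids anchoring the alignment on a poorly clustered view.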

Q2: As the alignment process is anchored on a single benchmark view, there is a potential risk that misrepresentative or noisy data within this view could influence the quality of alignment across other views.
A2: Thanks for your insightful comments. We fully agree that anchoring the alignment process on a misrepresentative or noisy baseline view could propagate errors to other views and negatively affect overall performance. To mitigate this risk, our framework employs a carefully designed selection mechanism based on the CLM score, which evaluates each view's clustering structure through a combination of compactness and robustness to data scale, shift, and class imbalance. By computing the CLM score for each view, we explicitly avoid relying on random selection. Instead, we select the view with the highest structural quality as the baseline view, ensuring that it reflects well-formed clusters and minimizing the chance of incorporating noise from poorly separated structures. Furthermore, since CLM is constructed to be scale- and class-cardinality-invariant, it is less sensitive to outliers or imbalanced cluster sizes, making it more robust in practice. Additionally, the extensive experimental results in Section 5 show that it consistently achieves satisfactory alignment performance across diverse multi-view datasets.

Q3: The paper could benefit from further analysis on how initial clustering inaccuracies might affect the propagation of alignment errors.
A3: We thank you for your insightful comments. We agree that inaccuracies in the initial clustering stages may affect the construction of similarity graphs between aligned and misaligned samples, thereby influencing the propagation of alignment errors. To mitigate this, we integrate both feature-level and structural-level information to construct the consensus representation. This dual-perspective design ensures that the model does not rely solely on any single level of information, and enables correction and compensation for localized clustering inaccuracies during the optimization of the consensus representation. Furthermore, the late fusion clustering framework decouples the final clustering decision from the initial alignment assumptions, providing an additional layer of robustness. Essentially, even in the presence of some local misalignment, the global fusion step can improve the clustering results by weighted integration of diverse information from multiple views, without being overly affected by any single misaligned view.

Q4: While the alignment matrix plays a key role in sample correspondence, its robustness under noisy benchmark views remains unclear.
A4: We appreciate your valuable comment regarding the robustness of the learned alignment matrix under noisy baseline views. From a theoretical perspective, our alignment matrix learning framework is designed to incorporate global structural information from multiple views rather than relying solely on local pairwise similarities. This global view enables the alignment matrix to be less sensitive to localized noise or misalignment in any single baseline view. Furthermore, the alignment objective jointly optimizes for consistency across views, implicitly enforcing robustness through structural constraints and regularization. These design choices ensure that the learned correspondences capture the intrinsic relations among samples, improving stability despite noise. Empirically, we validate this robustness by conducting experiments where the alignment matrix learned by our method is applied to align clustering results from several different algorithms. Compared to the classical Hungarian algorithm, which is primarily a local matching approach, our alignment matrix consistently achieves better clustering accuracy across diverse datasets and algorithms. The detailed experimental results are presented below, and detailed descriptions can be found in Section 5.7. This demonstrates that our approach can generalize beyond the specific baseline view used during alignment learning and effectively mitigate the influence of noisy views. In summary, the theoretical design and empirical evidence support the conclusion that our learned alignment matrix is stable and robust under noisy baseline views.
Results of competitors on 100leaves under a sample alignment ratio of $\rho=50\%$.

| Metrics | Setting | DealMVC | MVCAN | EBMGC | DCMVC | LMTC | TMSL | DSTL |
|---|---|---|---|---|---|---|---|---|
| ACC | Unaligned | 7.69±0.00 | 49.51±1.24 | 33.94±0.00 | 48.83±0.83 | 35.58±0.94 | 47.47±1.40 | 36.87±1.42 |
| | Aligned+Ours | 12.42±0.52 | 48.81±1.28 | 43.06±0.00 | 53.75±0.89 | 40.82±1.31 | 48.18±1.37 | 35.60±0.90 |
| NMI | Unaligned | 25.34±0.44 | 69.50±0.95 | 58.22±0.00 | 67.15±0.42 | 58.75±0.55 | 69.33±0.61 | 60.53±0.55 |
| | Aligned+Ours | 38.01±0.46 | 69.95±0.59 | 64.52±0.00 | 72.10±0.49 | 64.58±0.80 | 70.54±0.57 | 61.09±0.53 |
| ARI | Unaligned | 1.89±0.25 | 30.37±1.53 | 15.36±0.00 | 29.23±0.91 | 17.24±0.81 | 31.41±1.15 | 18.49±0.77 |
| | Aligned+Ours | 5.16±0.12 | 30.58±0.99 | 24.62±0.00 | 36.50±0.87 | 24.46±1.24 | 32.67±1.17 | 19.38±0.68 |
| F1score | Unaligned | 6.68±0.15 | 37.49±1.29 | 16.16±0.00 | 35.08±0.61 | 18.07±0.81 | 32.09±1.14 | 19.35±0.75 |
| | Aligned+Ours | 11.47±0.14 | 37.74±0.93 | 25.33±0.00 | 41.38±0.85 | 25.21±1.23 | 33.34±1.15 | 20.23±0.67 |
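For reference, the Hungarian baseline in this comparison permutes one clustering's labels to best agree with another. A brute-force sketch of that label matching (equivalent to the Hungarian algorithm for small cluster counts; in practice one would use an efficient assignment solver such as scipy's `linear_sum_assignment`) looks like:

```python
import numpy as np
from itertools import permutations

def align_labels(ref, pred):
    """Relabel `pred` by the permutation of cluster ids that maximizes
    agreement with `ref` (brute force; Hungarian matching scales better)."""
    ids = np.unique(np.concatenate([ref, pred]))
    best, best_hits = pred, -1
    for perm in permutations(ids):
        mapping = dict(zip(ids, perm))
        mapped = np.array([mapping[p] for p in pred])
        hits = int((mapped == ref).sum())
        if hits > best_hits:
            best, best_hits = mapped, hits
    return best

ref = np.array([0, 0, 1, 1, 2, 2])
pred = np.array([2, 2, 0, 0, 1, 1])   # the same partition, labels permuted
aligned = align_labels(ref, pred)
```

Because this matching is a hard one-to-one relabeling, it has no way to express the soft, structure-aware correspondences that the rebuttal credits for the gains in the table above.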

Q5: To what extent does this design truly exploit the structural and semantic complementarities between views?
A5: Thanks for your constructive comments. Indeed, it is critical to ensure that the late-fusion strategy fully leverages the structural and semantic information among views. To address this, we designed our framework with the following considerations. Firstly, while the fusion process is conducted at a later stage, we do not rely solely on independent view-wise clustering results. Instead, the alignment phase incorporates both feature-level information and graph-structural similarities, which are used to learn a unified latent representation space. This design allows the alignment process to indirectly capture inter-view dependencies before fusion. Secondly, to address the potential underutilization of cross-view information that only emerges through joint modeling, we introduce a weighted consensus mechanism in the fusion step. Specifically, each view contributes to the final clustering based on its alignment consistency and structural reliability, allowing high-quality views to reinforce low-quality ones and flexibly preserving multi-view complementary information. In summary, although our method does not perform explicit joint modeling in the traditional sense, it captures structural and semantic information through a combination of unified representation learning and adaptive weighted fusion, and the experimental results demonstrate the effectiveness of our proposed method.

Q6: Another intriguing aspect is the reliance on cluster-label matching for selecting a benchmark view.
A6: We sincerely appreciate your insightful comments. Unlike methods that assess each view independently, CLM measures the relative consistency of each view with respect to all other views through a pairwise, normalized label-matching function. By aggregating over all $\binom{k}{2}$ cluster pairs, CLM captures global structural similarity in a redundant and smoothed manner, which effectively reduces the impact of local noise or instability in individual views. Furthermore, this strategy enables CLM to select the view that is most structurally aligned with the consensus of the remaining views, even when all views contain some level of noise. As such, the view selection process becomes inherently more robust than strategies that either randomly choose a baseline view or rely on the absolute clustering performance of a single view. Furthermore, as observed in Section 5.7, using our alignment matrix to align the clustering results of other algorithms achieves superior performance compared to the traditional Hungarian matching approach, which demonstrates the robustness and generalization of the selected baseline view.

Comment

Thank you for your detailed response. Most of my concerns have been thoroughly addressed in the authors' response, including clarifications on CLM selection, robustness analysis, and view complementarity. So I decided to raise my rating.

Comment

We sincerely appreciate your thoughtful comments and your decision to raise the score. Your recognition is truly encouraging and valuable to us.

Best regards,
Authors

Review
Rating: 4

The proposed method addresses unaligned samples in multi-view clustering by aligning views through a benchmark-based cluster-label matching strategy and constructing cross-view similarity graphs, enabling effective clustering without requiring sample correspondence.

Strengths and Weaknesses

The strengths of this manuscript are listed as follows:

(1) The proposed method effectively handles unaligned multi-view data by eliminating the need for strict sample correspondence across views. (2) The use of cluster-label matching to select a benchmark view ensures more reliable and consistent alignment across different views. (3) By integrating the alignment criterion into a late-fusion framework, SSA-MVC maintains scalability and flexibility across diverse datasets.

The weaknesses of this manuscript are listed as follows:

(1) Can the proposed method be effectively applied to cases where samples are entirely unaligned across views? (2) The same approach is used for both unaligned sample representation and cross-view similarity: are these the only feasible choices, or can alternative methods be employed? (3) As alignment performance depends on the benchmark view, how is its reliability ensured? (4) While the datasets are standard in multi-view clustering, could the authors clarify how they were preprocessed to simulate sample misalignment?

Questions

  1. Can the proposed method be effectively applied to cases where samples are entirely unaligned across views?

  2. The same approach is used for both unaligned sample representation and cross-view similarity—are these the only feasible choices, or can alternative methods be employed?

  3. As alignment performance depends on the benchmark view, how is its reliability ensured?

  4. While the datasets are standard in multi-view clustering, could the authors clarify how they were preprocessed to simulate sample misalignment?

  5. The proposed method belongs to MVC methods with shallow models, so it is recommended to compare it with more SOTA shallow MVC methods, such as tensor-based or subspace-based methods.

Limitations

Yes.

Justification for Final Rating

Thanks for the authors' response. This paper proposes an interesting method with promising results, and I have no further questions.

Formatting Issues

N/A.

Author Response

Q1: Can the proposed method be effectively applied to cases where samples are entirely unaligned across views?
A1: Thanks for your valuable comment. Our method remains effective even in the absence of any sample-level correspondence across views. Rather than depending on explicit alignment, it captures the shared clustering structure implicitly present across different modalities. This is achieved by projecting data into a latent space where structural patterns are preserved through orthogonality constraints. These constraints ensure that the alignment process respects the intrinsic geometry of each view, enabling the model to discover coherent groupings without relying on direct sample matching. In fully unaligned scenarios (i.e., alignment ratio $\rho=0$), we adapt our strategy by discarding alignment-based similarity measures and instead computing cross-view relationships via the affinity between view-specific partition matrices. This allows the model to emphasize clustering-level consistency, which serves as a reliable proxy for alignment under severe view discrepancy. To assess the practical effectiveness of our method in such settings, we conduct comparisons with two representative models designed for unaligned multi-view clustering: VSC_mH and OpVuC. Extensive experiments on six benchmark datasets demonstrate that our method consistently achieves higher performance across most metrics. These results confirm that even in the most challenging case of complete sample misalignment, our framework successfully exploits latent structural information to produce robust clustering results.
Clustering results under fully unaligned views

| Metric | Method | Yale | 3sources | MSRCV | 100leaves | HW | Scene |
|---|---|---|---|---|---|---|---|
| ACC | VSC_mH | 36.36±0.00 | 40.24±0.00 | 39.05±0.00 | 13.94±0.00 | 13.50±0.00 | 29.45±0.00 |
| | OpVuC | 24.24±0.00 | 24.26±0.00 | 24.29±0.00 | 8.75±0.00 | 12.90±0.00 | 10.90±0.00 |
| | Ours | 59.36±2.40 | 55.77±0.42 | 68.31±0.66 | 54.93±1.37 | 19.25±0.95 | 19.88±0.65 |
| NMI | VSC_mH | 41.79±0.00 | 11.27±0.00 | 21.78±0.00 | 44.50±0.00 | 3.42±0.00 | 27.85±0.00 |
| | OpVuC | 28.96±0.00 | 4.53±0.00 | 5.97±0.00 | 32.91±0.00 | 0.93±0.00 | 3.52±0.00 |
| | Ours | 63.97±1.26 | 50.21±0.97 | 49.65±0.93 | 73.11±0.86 | 5.11±0.36 | 9.98±0.63 |
| ARI | VSC_mH | 14.61±0.00 | 11.07±0.00 | 12.71±0.00 | 4.82±0.00 | 0.10±0.00 | 14.39±0.00 |
| | OpVuC | 2.57±0.00 | 0.27±0.00 | 1.59±0.00 | 1.08±0.00 | 0.02±0.00 | 0.56±0.00 |
| | Ours | 40.97±1.94 | 37.35±0.57 | 43.51±1.02 | 39.29±1.28 | 2.25±0.23 | 4.78±0.34 |
| F1score | VSC_mH | 20.49±0.00 | 31.99±0.00 | 25.05±0.00 | 6.09±0.00 | 15.99±0.00 | 21.10±0.00 |
| | OpVuC | 8.74±0.00 | 19.24±0.00 | 17.57±0.00 | 2.53±0.00 | 10.97±0.00 | 9.06±0.00 |
| | Ours | 44.75±1.80 | 49.59±0.46 | 51.44±0.88 | 39.90±1.26 | 12.06±0.21 | 11.34±0.32 |

Q2: The same approach is used for both unaligned sample representation and cross-view similarity—are these the only feasible choices, or can alternative methods be employed?
A2: Thank you for your insightful comment. In our method, we adopt adaptive neighbor graph learning for both unaligned sample representation learning and cross-view similarity learning. This approach provides a unified and effective way to capture local geometric structures and inter-view relationships. We acknowledge that this is not the only feasible solution: alternative strategies, such as metric learning or multiple kernel learning, could also be employed. We chose the current method due to its flexibility and effectiveness in our preliminary studies. Future work could explore the impact of these alternative methods.

Q3: As alignment performance depends on the benchmark view, how is its reliability ensured?
A3: Thanks for your constructive comments. In existing alignment-based multi-view clustering methods, the baseline view is either randomly selected or chosen heuristically without considering the intrinsic clustering quality of each view. This randomness can lead to sub-optimal alignment, especially when the selected view exhibits poor cluster structure or noise. To address this, we introduced a baseline view selection mechanism based on the CLM algorithm (see Section 3.2 for details). As presented in Eqs. (4)-(5), the measures assess the separation and compactness of cluster structures in a scale-invariant, shift-invariant, and class-size-invariant manner, enabling a fair comparison of clustering quality across unaligned views. Specifically, we compute the CLM score for each view and select the one with the highest score as the baseline view, thereby ensuring that alignment is guided by the most reliable structural information. Compared with existing baseline view selection methods, the proposed method therefore mitigates the limitations of random selection and enhances the stability and performance of the alignment process.

Q4: While the datasets are standard in multi-view clustering, could the authors clarify how they were preprocessed to simulate sample misalignment?
A4: Thank you for your valuable comments. As existing multi-view datasets are originally aligned across views, we simulate the sample misalignment scenario by artificially disrupting the arrangement of the samples. Specifically, we first apply the CLM algorithm to determine the baseline view. Then, we select a fixed proportion of samples with the same indices across views to serve as aligned samples for each view. The remaining samples in each view are considered as unaligned by randomly shuffling their order. Additionally, in the baseline view, we preserve the original sample order to maintain consistency with the ground-truth labels. Through this process, we construct multi-view datasets with varying degrees of sample misalignment, which allow us to effectively evaluate the robustness of our proposed method under sample misalignment scenarios.
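A minimal numpy sketch of this simulation protocol (our reconstruction with hypothetical names, not the authors' script): a fraction `rho` of samples keeps its original indices across views, and the remaining samples are randomly shuffled in every non-benchmark view.

```python
import numpy as np

def simulate_misalignment(views, rho, benchmark=0, seed=0):
    """Keep the first rho fraction of samples aligned across views and
    randomly shuffle the rest in every view except the benchmark view,
    whose original order (and hence label correspondence) is preserved."""
    rng = np.random.default_rng(seed)
    n = views[0].shape[0]
    n_aligned = int(round(rho * n))
    out = []
    for i, X in enumerate(views):
        X = X.copy()
        if i != benchmark:                        # benchmark view keeps its order
            tail = np.arange(n_aligned, n)
            X[tail] = X[rng.permutation(tail)]    # shuffle the unaligned part
        out.append(X)
    return out

rng = np.random.default_rng(1)
v1, v2 = rng.standard_normal((10, 3)), rng.standard_normal((10, 4))
a1, a2 = simulate_misalignment([v1, v2], rho=0.5, benchmark=0)
# the first 50% of samples stay aligned; the benchmark view is untouched
```

Varying `rho` then produces datasets with different degrees of misalignment, matching the evaluation protocol described above.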

Q5: It is recommended to compare more SOTA shallow MVC methods like tensor-based or subspace-based methods.
A5: Thank you for your insightful suggestion. In our experimental comparison, we have added two representative shallow multi-view clustering methods, i.e., TPCH [1] and ESTMC [2], and we report results under a 50% sample alignment ratio, with and without Hungarian alignment. The results demonstrate that our proposed method consistently outperforms them across all datasets in four clustering metrics. Notably, applying Hungarian alignment to these baselines does not necessarily lead to performance gains and sometimes even degrades the results, indicating that rigid post-processing alignment may be unreliable when sample correspondences are uncertain. In contrast, our method achieves significantly better and more stable performance by jointly modeling soft sample alignment and cross-view structural consistency, further demonstrating its effectiveness in handling partially aligned multi-view data.
Clustering performance under a sample alignment ratio $\rho=50\%$.

| Metrics | Methods | Yale | 3sources | MSRCV | 100leaves | HW | Scene |
|---|---|---|---|---|---|---|---|
| ACC | TPCH | 27.88±0.00 | 30.77±0.00 | 33.81±0.00 | 32.38±0.00 | 25.20±0.00 | 19.15±0.00 |
| | TPCH+Hungarian | 33.94±0.00 | 30.18±0.00 | 27.62±0.00 | 16.25±0.00 | 17.95±0.00 | 15.43±0.00 |
| | ESTMC | 51.42±3.74 | 49.59±0.91 | 54.95±2.22 | 35.53±1.04 | 69.63±4.64 | 23.81±0.92 |
| | ESTMC+Hungarian | 32.94±1.71 | 36.92±0.56 | 45.21±1.96 | 35.54±1.18 | 45.94±0.81 | 20.77±0.50 |
| | Ours | 64.24±3.62 | 64.44±1.29 | 83.52±0.39 | 70.63±1.29 | 96.55±0.00 | 35.91±0.27 |
| NMI | TPCH | 32.65±0.00 | 7.03±0.00 | 14.31±0.00 | 56.35±0.00 | 9.51±0.00 | 12.65±0.00 |
| | TPCH+Hungarian | 35.95±0.00 | 8.31±0.00 | 8.18±0.00 | 45.32±0.00 | 3.17±0.00 | 5.27±0.00 |
| | ESTMC | 56.44±3.05 | 34.70±1.56 | 45.57±2.96 | 58.76±0.63 | 61.21±2.06 | 17.37±0.78 |
| | ESTMC+Hungarian | 35.61±1.59 | 19.24±0.73 | 23.58±1.52 | 58.84±0.73 | 23.72±0.55 | 14.11±0.31 |
| | Ours | 69.31±1.32 | 61.20±0.83 | 70.28±0.55 | 85.34±0.42 | 92.09±0.00 | 30.27±0.20 |
| ARI | TPCH | 5.33±0.00 | 2.54±0.00 | 6.72±0.00 | 12.62±0.00 | 5.14±0.00 | 5.78±0.00 |
| | TPCH+Hungarian | 8.84±0.00 | 3.41±0.00 | 2.08±0.00 | 3.10±0.00 | 1.30±0.00 | 2.93±0.00 |
| | ESTMC | 30.82±4.14 | 25.26±1.08 | 32.04±2.42 | 17.15±0.91 | 52.12±3.56 | 7.97±0.41 |
| | ESTMC+Hungarian | 8.19±1.26 | 9.98±1.00 | 15.50±1.07 | 17.24±1.03 | 18.60±0.48 | 6.49±0.23 |
| | Ours | 48.91±2.65 | 45.98±1.61 | 66.34±0.66 | 59.97±0.95 | 92.49±0.00 | 17.15±0.32 |
| F1score | TPCH | 11.39±0.00 | 22.24±0.00 | 19.87±0.00 | 13.66±0.00 | 14.74±0.00 | 12.33±0.00 |
| | TPCH+Hungarian | 14.60±0.00 | 21.95±0.00 | 16.32±0.00 | 4.12±0.00 | 12.73±0.00 | 10.53±0.00 |
| | ESTMC | 35.18±3.89 | 39.76±0.87 | 41.53±2.05 | 17.98±0.90 | 56.94±3.18 | 14.34±0.39 |
| | ESTMC+Hungarian | 13.96±1.18 | 27.77±0.65 | 27.33±0.86 | 18.08±1.02 | 26.85±0.44 | 12.91±0.22 |
| | Ours | 52.17±2.43 | 56.59±1.34 | 71.03±0.57 | 60.37±0.94 | 93.24±0.00 | 22.94±0.29 |

[1] Wang Z, Li X, Sun Y, et al. TPCH: tensor-interacted projection and cooperative hashing for multi-view clustering, AAAI, 2025.
[2] Ji J, Feng S. Anchors crash tensor: efficient and scalable tensorial multi-view subspace clustering, IEEE TPAMI, 2025.
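For context, the "+Hungarian" baselines above first re-align samples across views by optimal one-to-one matching before clustering. The sketch below illustrates such post-processing under the simplifying assumption that both views share a comparable feature space; the exhaustive search is a tiny stand-in for the Hungarian algorithm (at realistic scales one would use `scipy.optimize.linear_sum_assignment`):

```python
import itertools
import numpy as np

def hungarian_realign(X_base, X_view):
    """Re-order X_view's rows to best match X_base's rows.

    Cosine similarity serves as the matching profit; the optimal
    one-to-one assignment is found by exhaustive search here as a
    small-scale stand-in for the Hungarian algorithm. This rigid
    post-hoc matching is what the +Hungarian baselines rely on.
    """
    A = X_base / np.linalg.norm(X_base, axis=1, keepdims=True)
    B = X_view / np.linalg.norm(X_view, axis=1, keepdims=True)
    sim = A @ B.T                                  # (n, n) similarity
    n = sim.shape[0]
    best, best_score = None, -np.inf
    for perm in itertools.permutations(range(n)):  # O(n!) -- demo only
        score = sim[np.arange(n), perm].sum()
        if score > best_score:
            best, best_score = np.array(perm), score
    return X_view[best]

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
X_shuffled = X[rng.permutation(6)]
X_realigned = hungarian_realign(X, X_shuffled)
```

When cross-view similarities are ambiguous, the recovered permutation can be wrong, which is consistent with the degraded "+Hungarian" results reported above.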

Comment

Thanks for the authors' response. This paper proposes an interesting method with promising results, and I have no further questions.

Comment

Thank you for your positive response. We appreciate your time and valuable feedback, which have greatly helped improve our work.

Best regards,
Authors

Review
4

This paper proposes a scalable sample-alignment-based multi-view clustering method. The proposed approach first employs a cluster-label matching algorithm to select the view, and then constructs representations of non-aligned samples by computing their similarities with aligned samples. Extensive experiments demonstrate the effectiveness of the proposed method.

Strengths and Weaknesses

Strengths: (1) This paper proposes a novel alignment strategy to address the issue with partially aligned views. Extensive experiments demonstrate the effectiveness of the proposed method.

Weaknesses: (1) The title of this paper is 'Scalable Cross-View Sample Alignment for Multi-View Clustering with View Structure Similarity'. However, it lacks the most important experiments with fully misaligned views. This raises a question: Is the proposed method unable to address the issue of fully misaligned views?

(2) This paper considers the issue of view misalignment. Therefore, performance validation should be compared more with the SOTA methods of view misalignment instead of traditional multi-view clustering methods.

(3) The proposed method uses a feature rotation matrix to align the feature space. Is the rotation matrix effective for the sample, i.e., can the learned rotation matrix be used to output aligned samples?

(4) The writing of this paper can be further improved. For example, the roles of each component in Eqs. (4), (5) and (7) should be specifically explained.

Questions

(1) The title of this paper is 'Scalable Cross-View Sample Alignment for Multi-View Clustering with View Structure Similarity'. However, it lacks the most important experiments with fully misaligned views. This raises a question: Is the proposed method unable to address the issue of fully misaligned views?

(2) This paper considers the issue of view misalignment. Therefore, performance validation should be compared more with the SOTA methods of view misalignment instead of traditional multi-view clustering methods.

(3) The proposed method uses a feature rotation matrix to align the feature space. Is the rotation matrix effective for the sample, i.e., can the learned rotation matrix be used to output aligned samples?

(4) The presentation of the paper can be further improved. For example, the roles of each component in Eqs. (4), (5) and (7) should be specifically explained.

Limitations

Yes

Final Justification

After reading the other reviewers' comments and the authors' response, most of my concerns have been addressed. So I raise the rating from 3 to 4.

Formatting Concerns

No

Author Response

Q1: Is the proposed method unable to address the issue of fully misaligned views?
A1: Thanks for your valuable comment. Although our method is primarily designed for partially aligned multi-view scenarios, it is also theoretically and practically applicable to fully unaligned views. Theoretically, our framework is built on the assumption that all views share a common underlying clustering structure. Even in the absence of explicit sample correspondence, samples belonging to the same cluster across different views exhibit implicit structural consistency. To capture this, we employ an orthogonality-constrained alignment matrix $\mathbf{M}$ that aligns samples based on their structural behavior in the latent space, rather than their correspondence. In practice, under fully unaligned scenarios, i.e., alignment ratio $\rho = 0$, we adapt our strategy by discarding the alignment-based similarity construction. Instead, we compute cross-view similarity from the affinity between the partition matrices $\mathbf{F}^v$ of different views. This allows us to preserve and exploit high-level clustering consistency without relying on direct sample-wise alignment; the learned alignment relationship thus enables indirect alignment by capturing common clustering structures. Furthermore, to demonstrate the effectiveness of our method in fully unaligned scenarios, we compare it with Vsc_mH and OpVuC, which are among the few existing methods capable of handling fully unaligned multi-view data. The experimental results show that our method outperforms the competitors on most of the datasets, confirming its adaptability to full sample misalignment by leveraging shared structural information at the partition level.
Clustering results of compared methods under fully unaligned views (alignment ratio $\rho = 0$)

| Metric | Method | Yale | 3sources | MSRCV | 100leaves | HW | Scene |
|---|---|---|---|---|---|---|---|
| ACC | Vsc_mH | 36.36±0.00 | 40.24±0.00 | 39.05±0.00 | 13.94±0.00 | 13.50±0.00 | 29.45±0.00 |
| ACC | OpVuC | 24.24±0.00 | 24.26±0.00 | 24.29±0.00 | 8.75±0.00 | 12.90±0.00 | 10.90±0.00 |
| ACC | Ours | 59.36±2.40 | 55.77±0.42 | 68.31±0.66 | 54.93±1.37 | 19.25±0.95 | 19.88±0.65 |
| NMI | Vsc_mH | 41.79±0.00 | 11.27±0.00 | 21.78±0.00 | 44.50±0.00 | 3.42±0.00 | 27.85±0.00 |
| NMI | OpVuC | 28.96±0.00 | 4.53±0.00 | 5.97±0.00 | 32.91±0.00 | 0.93±0.00 | 3.52±0.00 |
| NMI | Ours | 63.97±1.26 | 50.21±0.97 | 49.65±0.93 | 73.11±0.86 | 5.11±0.36 | 9.98±0.63 |
| ARI | Vsc_mH | 14.61±0.00 | 11.07±0.00 | 12.71±0.00 | 4.82±0.00 | 0.10±0.00 | 14.39±0.00 |
| ARI | OpVuC | 2.57±0.00 | 0.27±0.00 | 1.59±0.00 | 1.08±0.00 | 0.02±0.00 | 0.56±0.00 |
| ARI | Ours | 40.97±1.94 | 37.35±0.57 | 43.51±1.02 | 39.29±1.28 | 2.25±0.23 | 4.78±0.34 |
| F1-score | Vsc_mH | 20.49±0.00 | 31.99±0.00 | 25.05±0.00 | 6.09±0.00 | 15.99±0.00 | 21.10±0.00 |
| F1-score | OpVuC | 8.74±0.00 | 19.24±0.00 | 17.57±0.00 | 2.53±0.00 | 10.97±0.00 | 9.06±0.00 |
| F1-score | Ours | 44.75±1.80 | 49.59±0.46 | 51.44±0.88 | 39.90±1.26 | 12.06±0.21 | 11.34±0.32 |

Q2: Should be compared more with the SOTA methods of view misalignment instead of traditional multi-view clustering methods.
A2: Thank you for your valuable suggestion. To better evaluate the effectiveness of our method in handling view misalignment, we have included two representative state-of-the-art methods specifically designed for unaligned multi-view clustering, i.e., TUMCR [1] and DAGF [2]. The corresponding results are presented below, where we report ACC under two alignment ratios: 0% (fully unaligned) and 50% (partially aligned). As shown, our method consistently achieves superior performance across six benchmark datasets in both settings. For example, under the fully unaligned scenario, our method outperforms TUMCR and DAGF by large margins on datasets such as MSRCV and 100leaves, and similar trends can be observed on the other datasets. These results demonstrate the adaptability of our method in dealing with both fully and partially misaligned scenarios.
ACC comparison with view-misalignment multi-view clustering methods under different alignment ratios

| Alignment Ratio | Method | Yale | 3sources | MSRCV | 100leaves | HW | Scene |
|---|---|---|---|---|---|---|---|
| 0% | TUMCR | 29.09±0.00 | 36.09±0.00 | 26.67±0.00 | 47.44±0.00 | 15.70±0.00 | 9.59±0.00 |
| 0% | DAGF | 17.33±0.24 | 33.31±0.00 | 26.05±1.45 | 8.61±0.08 | 18.02±2.34 | 11.83±0.56 |
| 0% | Ours | 59.36±2.40 | 55.77±0.42 | 68.31±0.66 | 54.93±1.37 | 19.25±0.95 | 19.88±0.65 |
| 50% | TUMCR | 35.76±0.00 | 42.60±0.00 | 37.14±0.00 | 41.75±0.00 | 41.35±0.00 | 14.51±0.00 |
| 50% | DAGF | 18.36±0.36 | 33.90±0.01 | 27.90±1.37 | 9.46±0.11 | 22.37±2.53 | 13.44±0.57 |
| 50% | Ours | 64.24±3.62 | 64.44±1.29 | 83.52±0.39 | 70.63±1.29 | 96.55±0.00 | 35.91±0.27 |

[1] Ji J, Feng S, Li Y. Tensorized unaligned multi-view clustering with multi-scale representation learning, KDD, 2024.
[2] Jiang H, Tao H, Jiang Z, et al. Unaligned multi-view clustering via diversified anchor graph fusion, Pattern Recognition, 2025.

Q3: Can the learned rotation matrix be used to output aligned samples?
A3: Thanks for your constructive comments. In our method, the feature rotation matrix $\mathbf{R}$ is learned in the latent feature space to align the view-specific representations with the unified representation, rather than to directly realign the original samples. Specifically, $\mathbf{R}$ operates on the extracted features of samples, mapping them into a shared feature space. Therefore, while the learned rotation matrix can produce aligned feature representations, it does not yield aligned samples.
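As a point of reference, a standard closed-form way to obtain such an orthogonal feature-space rotation is the Procrustes solution; the sketch below is a generic illustration of feature-level (not sample-level) alignment, not the paper's exact update rule, and the names `F_view`, `F_star` are illustrative:

```python
import numpy as np

def procrustes_rotation(F_view, F_star):
    """Orthogonal R minimizing ||F_view @ R - F_star||_F.

    Closed form: R = U @ Vt, with U, Vt from the SVD of
    F_view.T @ F_star. R rotates feature axes only; the row
    order (i.e., sample correspondence) is left untouched.
    """
    U, _, Vt = np.linalg.svd(F_view.T @ F_star)
    return U @ Vt

rng = np.random.default_rng(0)
F_star = rng.normal(size=(8, 3))                   # target representation
R_true, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # random orthogonal matrix
F_view = F_star @ R_true.T                         # rotated copy of target
R = procrustes_rotation(F_view, F_star)            # recovers the rotation
```

Note that R only rotates feature axes: the rows of `F_view`, and hence the sample ordering, are unchanged, which is why a rotation matrix alone cannot output aligned samples.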

Q4: The roles of each component in Eqs. (4), (5), and (7) should be specifically explained.
A4: Thanks for your advice. The metric in Eq. (4) acts as an enhanced separation-compactness ratio that evaluates the structural quality of a clustering $\mathbf{Y}$ with respect to the entire dataset $\mathbf{X}$. Specifically, H integrates scale-invariant normalization through $\sigma_{d^2}$, exponential smoothing to mitigate the effects of data shifts, and a balanced weighting of inter-cluster and intra-cluster distances. As a result, H provides a robust and comparable measure of cluster quality across different views, even when the data distributions and cluster cardinalities vary. Building on Eq. (4), Eq. (5) defines the final clustering quality measure: it averages pairwise cluster scores and applies a logistic transformation to normalize the range to [0, 1]. This design not only ensures class-cardinality invariance but also yields a directly interpretable and bounded quality score. Therefore, with Eqs. (4)-(5), we can quantify both intra-view compactness and inter-view separability in a scale-free manner and consistently identify the view that best preserves the intrinsic structure of the data, regardless of sample size, feature scales, or the number of clusters. In this way, we avoid biases caused by poorly aligned or low-quality views in the subsequent cross-view alignment.
After selecting the base view according to Eqs. (4)-(5), the final clustering objective can be formulated as in Eq. (7). The first part of Eq. (7) is a sample-alignment-based late fusion clustering framework, where the sample alignment matrices $\mathbf{M}^v$ and the feature rotation matrices $\mathbf{R}^v$ are jointly learned to map each view-specific clustering result $\mathbf{F}^v$ into a shared space, enabling the consensus partition $\mathbf{F}^*$ to integrate multi-view information effectively. The second part introduces a structure-guided alignment term, where $\mathbf{S}^v$ encodes the similarity between the non-aligned samples of each view and the baseline view. This regularization term drives $\mathbf{M}^v$ to capture the correct correspondence by preserving cross-view structural relationships.

Comment

I thank the authors for providing a comprehensive response to my comments. The response addresses most of my concerns. After reading the rebuttal and the other reviewers' comments, I would like to raise my rating.

Comment

Thank you for your kind response. We are delighted to hear that our rebuttal has addressed your concerns. We sincerely appreciate your thoughtful comments and the time you devoted to reviewing our manuscript. Your constructive feedback has greatly helped us improve the quality of our work. We are also grateful for your positive evaluation and the final recommendation to raise the score.

Best regards,

The Authors

Final Decision

This paper proposes a scalable sample-alignment-based multi-view clustering (MVC) method to address the challenge of partially aligned views. The key innovation lies in a cluster-label matching (CLM) strategy to select a benchmark view, followed by constructing cross-view similarity graphs to represent non-aligned samples. The method integrates alignment into a late-fusion framework, enabling clustering without strict sample correspondence. Experiments on eight benchmarks demonstrate its effectiveness, particularly in partial alignment scenarios. Reviewers highlight the novelty of the alignment strategy, its scalability, and competitive performance against traditional MVC methods. Given the reviewers' consensus recommendation to accept after the rebuttal, the paper is recommended for acceptance.