Persistent Homology for High-dimensional Data Based on Spectral Methods
Traditional persistent homology and its standard extensions fail on high-dimensional data, but persistent homology with spectral distances works well.
摘要
评审与讨论
The author(s) propose a metric that is based on combining nearest-neighbor graphs and spectral methods to analyze point clouds of high ambient dimension using Vietoris-Rips filtrations and persistent homology. They show experimentally that the proposed distance better detects the significant topological features. For that, they use both toy examples as a proof of concept as well as RNA-sequencing datasets.
优点
The paper deals with a relevant problem as high-dimensional data (with low intrinsic dimension) indeed is a frequent application case in topological data analysis (or rather: there is potential for such applications if the right tools were available).
The experimental section is also well-written, taking a wealth of other methods into account, and applying the methodology on synthetic and real data.
缺点
I miss an explanation of why the loop detection score (sometimes also called hole-detection score) is the right measurement to compare the approaches. I understand it has been used before but I am not quite satisfied with that resoning. Also, there seems to be a score s_m for each integer m, and I fail to see which m is used.
As a minor weakness, the distance measure is defined in Section 5, but there is not too much intuition given of why this should be working better than other methods. When going through the appendix, it seems like there is more explanation given there but I cannot go through the appendix for the lack of time. I wish there would be a brief intuitive explanation of why this works so well.
问题
I do have one question about the claim that the representatives are sometimes not plausible (Fig 9). How is this determined in detail? Homology generators can sometimes be linear combinations of the "natural" representatives that one "sees" in low-dimensional pictures, and as far as I know, there is not much control about what ripser or any other persistence software will give you (except you do some post-processing to compute a "good" representative subsequently)
局限性
As this section is mandatory to fill, I will say that the work is certainly a proof of concept, as only datasets are studied for which the number of loops and voids is clear from the start. That makes sense, however, since how could one verify the results otherwise?
Dear reviewer QsUc,
many thanks for your review. We are happy that you consider the problem we tackle relevant and our experimental section well-written. We will address your concerns in the following.
Hole detection score:
Our hole detection score captures the size of the gap after the -th most persistent feature. A clear gap in a persistence diagram is an indication that the features above the gap are important, while the ones below the gap are likely noise. If we know that a dataset has holes, it is therefore reasonable to require the gap after the -th most persistent feature to be large. This is what our metric measures.
For each dataset in our experiments the value of simply corresponds to the ground-truth number of holes in each dimension. For instance, the interlinked circles have 1D holes (i.e. two loops), while the sphere has 2D hole (i.e. one void). Note that we only use datasets with known ground-truth throughout the paper.
We call our metric hole detection score, when the dimensionality of the hole is not specified. When speaking of loop detection score, we mean that we apply the metric to 1D holes, i.e., loops.
The area chair (AC) suggested an alternative metric based on the widest gap in the persistence diagram, rather than the gap after the -th most persistent feature. We discuss it in the reply to the AC's comment, see also rebuttal figure R1. If you have another suggestion for a metric, we would be happy to try it.
During the rebuttal, we were able to prove continuity properties of our hole detection score, see general comment. Continuity is clearly a desirable property for a metric.
Intuition for the success of spectral methods:
The key idea is that spectral methods aggregate information over all edges in the graph. This makes them focus on dense connections and robust against single stray edges possibly formed due to the high-dimensional noise. They can therefore be viewed as a denoising technique. In contrast, the geodesic distance is brittle as it can change drastically in the presence of a single short-cut edge. See lines 153-159 in the paper for more details.
As an additional illustration, we included rebuttal figure R3. It contrasts the brittle behavior of the geodesic distance in the presence of artificially added noise edges with the robust behavior of effective resistance in a 2D toy setting. We will add this figure to the revised version of the manuscript. For more intuition, we visualized all distances in Figures S6, S7 in the appendix.
Analysis of representatives:
Indeed, the interpretation of representatives is difficult, as we acknowledged in lines 319-322. We only flagged results as dubious when the representatives of the most persistent features did not at all align with additional metadata describing the topology of the single-cell datasets (Figure S10). Here, is the number of ground-truth loops in the dataset. Since for all our scRNAseq datasets apart from the Malaria, there is no possibility of non-trivial linear combinations. The Malaria dataset has a "figure 8" shape, with two loops. We took care of linear combinations and considered both the case where the representatives followed the two small loops and the case where one representative followed the outer loop and the other one inner loop as correct. We did not post-process the representatives.
Limitations:
We have extended the limitations section, following reviewer aHMb's and the AC's suggestion, see the general comment. Inspired by your remark, we also added that our work is limited to validation of topology detection methods, and that topology inference on datasets without ground-truth remains our future work.
Dear Reviewer QsUc, could you please respond to the authors' rebuttal? Thank you
Thank you for the clarifications - I stick to my score
This paper attempts to mitigate the inadequate performance of persistent homology in high dimensional settings, in particular in the presence of noise. The authors investigate a number of distances and propose to use two kinds of spectral distances, such as diffusion distance and effective resistance, instead of Euclidean distance, and demonstrate that the proposed approach performs well in a number of synthetic and single-cell datasets.
优点
(S1) The paper addresses a relevant problem
(S2) Diverse experimental setup (both synthetic and very interesting real world datasets are used)
(S3) The paper is comprehensive and well-written
缺点
(W1) Lack of theoretical results explaining why the suggested distances work so well
(W2) The advantages over the other methods in the literature are not very clear
问题
(Q1) In Figure 5, it seems that t-SNE and UMAP perform better than effective resistance in the presence of a lot of noise – why are they then not recommended? Could you please add an explanation?
(Q2) Your recommended distances (mostly diffusion) fail to detect the 1 void and 2 loops in a Torus, could you please explain this?
(Q3) Can there be a theoretical result demonstrating that your PH with your proposed spectral distances is capable of capturing loops in higher dimensions in the presence of noise?
(Q4) Apart from the single-cell RNA-sequencing datasets, can you please suggest other datasets from other domains where your method could be used?
(Q5) Have you considered using multiparameter persistence in order to avoid the noise issue, and how well do you expect it to work in the higher dimensional setting?
局限性
Yes, all limitations are addressed adequately by the authors in the paper
Dear reviewer C1nu,
we thank you for your review and appreciate that you find the problem we tackle relevant, our experimental setup diverse, and the paper well-written. We will address your concerns in the following.
W1 & Q3: Theoretical results on spectral methods for persistent homology:
The lack of a theoretical guarantee on the performance of spectral methods with persistent homology in high dimensionality is a limitation of our work and likely difficult to address. For this reason, we focused on an extensive empirical validation. We will state this more clearly in a revised Limitations section (see general comment), and plan to address it in future work taking inspiration from the stability results in [10, 36, 66] and, more generally, spectral perturbation theory. The key steps will be that highly persistent loops are a low-frequency feature of a (graph representing a) point cloud and that diffusion distances and effective resistance are denoising / low-pass filters [18, Szlam et al. (2008), Ramakrishna et al. (2020)].
Q1: Performance of UMAP and t-SNE:
Computing the topology of a high-dimensional dataset in a 2D t-SNE or UMAP plot can indeed be very effective (for loops), which we acknowledged, e.g., in lines 300-301 and 325-326. Nevertheless, across the large set of our experiments, diffusion distance and effective resistance outperformed UMAP and t-SNE.
We summarized our reasons for not endorsing UMAP or t-SNE more strongly in the Discussion (lines 326-330). Of particular importance is the choice of the embedding dimension. Clearly, the embedding dimension limits the dimension of topological features that can be found. t-SNE and UMAP are primarily visualization methods, and are mostly used for 2D embeddings, making it impossible to detect any voids. In contrast, using diffusion distances and effective resistance in some sense performs an implicit, data-dependent, soft choice of the embedding dimension (Section 6), and directly arrives at a denoised distance matrix, without intermediate low-dimensional embedding. We devoted Appendix L to exploring the performance of UMAP with higher embedding dimension, where we found it to struggle with the surface of the 3D toy datasets and the void detection of the torus (Fig S14 b, c). Both t-SNE and UMAP sometimes fail for topology detection even in the noiseless setting (Figs. S14 a, S18).
More generally, t-SNE and UMAP have been severely criticized in recent literature, especially in computational biology, e.g., by Chari and Pachter [11] and Wang et al. [74], because they can strongly distort the data (see also the AC's comment about this). In fact, Wang et al. explicitly recommend persistent homology instead of UMAP or t-SNE embeddings for single-cell data analysis. In this work, we wanted to offer alternatives to t-SNE and UMAP, to separate "visualization" from "topological analysis".
Q2: Failure of diffusion distances on the torus:
First, we would like to emphasize that effective resistance performs on par with the Euclidean distance on the torus. Second, prompted by your question, we investigated the failure of diffusion distances and found that with they failed to find the smaller loop of the torus with high persistence. This was because the eigenvectors that encode this loop contribute little to the distance for (Section 6). We ran an additional experiment with , so that these eigenvectors contribute more. Now, diffusion distances performed on par with the effective resistance. This aligns with our recommendation of effective resistance over diffusion distances because it does not have the hyperparameter (lines 315-317).
Third, we described in lines 257-258 that diffusion distances with fail on the torus due to a sampling issue. On a more densely sampled torus both the effective resistance and the diffusion distance clearly outperform the competitors (Fig. S27). The reason is that the eigenvalues of the eigenvectors that encode the small loop are now more similar to those encoding the large loop, so the diffusion distance with does not overly decay them.
Q4: Potential application domains:
High-dimensional data and thus application areas for our improved topology detection pipeline are becoming ubiquitous. Within biology, we see possible applications for our method in other single-cell omics modalities, population genomics, or neural activity data [26, 34]. Beyond biology, we believe that our approach can improve the topological analysis of artificial neural network activations [52], and in general be used to detect topology of any high-dimensional data, e.g. in the climate sciences, in astronomical measurements, or wearable sensor data. We are happy to add this outlook to the discussion section.
Q5: Multiparameter persistence:
Multiparameter persistence is an interesting method and might offer benefits in the high-dimensional setting. It has been employed to spatial transcriptomics data by Benjamin, Katherine, et al. (2022).
However, its output is not as readily interpretable as the persistence diagrams in single-parameter persistent homology, which is why we limited out study to this more established method.
References:
Szlam, Maggioni, & Coifman (2008). Regularization on graphs with function-adapted diffusion processes.
Ramakrishna, Wai & Scaglione. (2020). A user guide to low-pass graph signal processing and its applications: Tools and applications.
Benjamin, Katherine, et al. (2022) Multiscale topology classifies and quantifies cell types in subcellular spatial transcriptomics.
Dear Reviewer C1nu, could you please respond to the authors' rebuttal? Thank you
I thank the authors for their efforts and the response. I have read the response and would like to stick with my score.
This work studies a well-known phenomenon in persistent homology, which is that it performs poorly in the presence of noise in the settings of a high-dimensional ambient space. Spectral and diffusion approaches are proposed as a workaround to this problem and shown in an extensive numerical study to perform well. A new theoretical result is also provided alongside on effective resistance.
优点
Thorough and detailed experimental setup on synthetic as well as real data.
Overall good presentation, with few errors and typos.
缺点
A major weakness is the lack of discussion on the limitations of the work. The main paper only highlights the strengths and superiority over the proposed method to other existing ones. While the work appears to have done a thorough and careful experimental treatment, I did not re-run the experiments myself to validate the findings, but I find it hard to believe that there is no limitation whatsoever of the work compared to existing methods to a well-known problem.
Additionally, it seems like not all approaches to this problem were studied. Certainly some important recent approaches were not mentioned in the section on related work, which is another weakness of the work. Please see the questions below.
I also feel like a comprehensive discussion on spectral methods in persistent homology are lacking, especially when a lot of work has been done in this area, especially recently (see, for example, work by Mémoli and Sanchez-Garcia). In general, spectral and diffusion approaches are not new to the field and so the approach proposed is not very surprising. Additionally, the consideration of other distances for persistent homology is not surprising at all. The theoretical contribution is nice, but it seems to be an accompanying theoretical result to the work rather than really adding any meaningful contribution to the problem that would be very useful to the TDA community, let alone the community of the conference.
问题
The authors refer quite a lot to a very recent paper on the curse of dimensionality in persistent homology [1], which recently appears and which I am quite familiar with. It seems to me that the submission, while providing a thorough experimental analysis, theoretically is only contributing yet another method for quite a well-known problem, which seems to me to have been more comprehensively and convincingly studied and solved in [1], especially in a context that would be a better fit for the audience of the conference. How would the authors justify the importance of their contribution in relation to [1]?
How does the authors' approach compare to those of other approaches for cycle detection [2] that was not mentioned at all?
[1] Curse of dimensionality on persistence diagrams, Hiraoka et al., arXiv April 2024. [2] Cycle registration in persistent homology with applications in topological bootstrap, Reani & Bobrowski, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
局限性
I did not find any discussion of limitations of their work at all in the main body of the paper. All discussions in the paper were devoted to superiority of the method, though presumably after such an extensive numerical experimental study, there must have been more to it, which may have been relegated to the appendix that I did not study in detail but given the requirement by the conference to be forthcoming with the limitations, these should have been stated clearly in the main paper. This, together with my question above on how does this submission fit in with the curse of dimensionality paper, is a major weakness of the submission, as mentioned above.
Dear reviewer aHMb,
thank you for your review! We are pleased that you find our experimental setup "thorough and detailed". As the main weaknesses, you listed our discussion of Hiraoka et al. (2024) and our discussion of limitations. We will address these in detail below as well your other concerns.
Relation with Hiraoka et al. (2024):
As you mention, this relevant paper is "very recent" and it did not influence our work in any way. In fact, it appeared on arXiv about one month before the NeurIPS deadline (on 28/04/2024) so that the NeurIPS rule on contemporaneous work applies:
Contemporaneous Work: For the purpose of the reviewing process, papers that appeared online within two months of a submission will generally be considered "contemporaneous" in the sense that the submission will not be rejected on the basis of the comparison to contemporaneous work. [...]
That said, we are happy to discuss Hiraoka et al. (2024) in more detail. First, their treatment of the curse of dimensionality of persistent homology is mostly theoretical, while ours has an empirical focus. The two works are thus complementary to each other.
Second, our own theoretical treatment in Appendix B aligns with Hiraoka et al. chapter 3 in the key steps: High-dimensional noise leads to distance concentration (our Prop. B.4, their Prop 3.1) which makes persistent homology uninformative (our Cor. B.7, summarized in Thm 3.19 in Hiraoka et al.). Hiraoka et al.'s theoretical treatment is more general, covering non-Gaussian noise, the Čech complex, and describing the effects on persistent homology in more detail.
Third, practically, Hiraoka et al. propose normalized PCA. However, this approach assumes the true dimensionality of the data to be known, which is not realistic in real-world applications. Moreover, the strongest property of normalized PCA that Hiraoka et al. show is that the Hausdorff distance between the persistence diagram of the unperturbed data and that of the normalized PCA is eventually bounded in probability (Thm 4.20). While achieving this formal statement is impressive, it does not guarantee that any information of the original persistence diagram is preserved. In fact, the Hausdorff distance between any two non-empty, deterministic persistence diagrams is eventually bounded in probability by simply using their Hausdorff distance as bound.
We ran experiments with both the normalized and non-normalized PCA on the 1D toy datasets, see rebuttal figure R2. We observed that knowing the dimensionality of the data is crucial, as too few PCs can miss important structure (Fig R2 f,j), while an excessive amount of PCs is also detrimental, and that neither the PCA version outperforms the effective resistance.
Together, this shows that normalized PCA does not "solve" the curse of dimensionality for persistent homology. Hiraoka et al. seem to acknowledge this: "[...] normalized PCA still can not eliminate the curse of dimensionality on persistence diagrams completely" (p. 41).
Our extensive empirical investigation offers crucial complementary insights to Hiraoka et al's theoretical work. First, it provides clear guidance for practitioners. Second, it shows empirically that the curse of dimensionality appears not only for Euclidean distances, but also for many derived distances, not covered by Hiraoka et al.. Third, Hiraoka et al.'s negative results are asymptotic statements. Our experiments provide concrete insight on how early the curse of dimensionality sets in: in tens of dimensions, see Figs. 7, 9.
We will add a condensed version of this discussion and the PCA results in the revised manuscript.
Limitation section:
Thank you for pushing for more transparency regarding limitations. However, we did discuss limitations of our method in the main paper. In particular, we acknowledged that
- our methods need hyperparameter choices (line 214),
- spectral methods only outperform the Euclidean distance on a more densely sampled torus (lines 254-256),
- interpreting cycle representatives is difficult (line 321-322),
- persistent homology can fail to distinguish non-isometric point clouds (line 325),
- and has a high run time (lines 334-336).
Nevertheless, we are happy to create a dedicated Limitations subsection in the revision, also mentioning additional limitations. Please see the general comment for its text.
Comparison with Reani & Bobrowski (2023):
Their method for distinguishing topological signal from noise does not fit well into our benchmark (which already compares 13 different methods!). First, they need to resample data, using either the true data distribution or kernel density estimation; none of this is suitable in a high-dimensional exploratory context (see line 52). Second, their approach uses the coupled alpha complex, while our setup deals with the more common Vietoris-Rips complex. Third, our study compares different distances while Reani & Bobrowski pursue the conceptually very different approach of cycle matching. We will include this reference in the revision.
Spectral methods and persistent homology:
In the related work section, we did acknowledge that spectral methods have been applied in the context of persistent homology (lines 56-59), also citing Mémoli et al. [47]'s work on Persistent Laplacians. We are happy to add a reference to the follow-up work of Davies, Wan, and Sanchez-Garcia (2023). However, none of these works either deals with the correction of effective resistance [71,72], crucial for good performance (Fig. S16), or addresses the high-dimensional setting. Pointing out the curse of dimensionality for persistent homology and combatting it with spectral methods is therefore our novel and original contribution.
References:
Davies, Wan, & Sanchez-Garcia(2023). The persistent Laplacian for data science: Evaluating higher-order persistent spectral representations of data.
Thank you to the authors for their thoughtful and extensive rebuttals. After reading the authors' replies and other reviewers' reports and the authors' replies to them, I may be inclined to raise my score.
Despite the Hiraoka paper being recent, I think it is important to include this reference and the discussion and additional experiments that were run in the revision, as the authors have agreed to. Thank you for making the extra efforts.
Before considering raising my score, though, I have further questions for the authors: Could the authors elaborate further on why they believe that resampling is not appropriate in high dimensional settings? [1] seems to do precisely this, while in response to the authors' argument that Reani & Bobrowski use alpha complexes, it appears that [2] extends the problem to the Vietoris-Rips setting. Moreover, the authors of [2] claim to have proposed a fast method to do this, so it seems that the approach should be implementable and comparable to the authors' work. I acknowledge that both of these papers seem to be quite recent, and [2] falls in the "contemporaneous work" category that the authors mentioned but it appears that a version has been on the arXiv for some time prior.
[1] Clarté, L., Vandenbroucque, A., Dalle, G., Loureiro, B., Krzakala, F., & Zdeborová, L. (2024). Analysis of bootstrap and subsampling in high-dimensional regularized regression. arXiv preprint arXiv:2402.13622 [2] García-Redondo, I., Monod, A., & Song, A. (2024). Fast topological signal identification and persistent cohomological cycle matching. Journal of Applied and Computational Topology, 1-32.
Dear reviewer aHMb,
many thanks for your reply. We are glad that you found our rebuttal "thoughtful and extensive" and that you consider raising your score.
Hiraoka et al:
We will make sure to expand the discussion of Hiraoka et al. in the revised version and will also include the additional experiments. Thank you for pushing for treating this relevant related work in more detail.
Cycle matching:
We would like to reiterate that cycle matching is conceptually very different from our benchmark. We focused on comparing different distances as input to persistent homology, whereas cycle matching is an alternative approach that could be used with any distance (see below). Below we will elaborate how resampling, required for identifying true features with cycle matching, is infeasible in our setting. Moreover, we will discuss how cycle matching performed in additional experiments we conducted.
Resampling:
We may have been overly brief in the initial response due to the character limit. Our point was that Reani & Bobrowski either resample from the original data density, which is not available in any real-world exploratory setting; or they resample from a kernel density estimation (KDE) of the point cloud. It is a classical result that KDE becomes problematic in high-dimensional settings (see e.g. Wasserman (2006), chapter 6.5). The problem is that one requires a prohibitively large number of samples for a high-quality KDE. In addition, a KDE requires an additional hyperparameter, the bandwidth.
Clarté et al. [1] explore other resampling approaches beyond KDE (bootstrap, subsampling, etc) in the context of high-dimensional regularized regression and also find that "resampling methods are fraught with problems in high dimensions" (abstract). Moreover, Roycraft et al. (2023) caution against using naive bootstrapping in general for persistent homology computations.
García-Redondo et al.
Many thanks for pointing us to this interesting work. García-Redond et al. provide an easy-to-use code base and we ran some proof-of-concept experiments indicating that cycle matching with the Euclidean distances does not perform well in the setting of high-dimensional noise. Since we cannot update the page with additional figures, we will summarize our findings in the following.
For this proof-of-concept, we simply resorted to resampling from the original data distribution, even though it is not available in an exploratory context.
Despite speeding up the work of Reani & Bobrowski, García-Redondo et al.'s approach is still very slow. They report a run time of about mins per cycle matching computation for a dataset with points (their Tab 3, last line). We observed an even higher run time on our hardware. This is about times slower than our approach of simply computing the persistent homology with a different distance (Tab S4, lines 1,2).
Due to the high run time, we only computed cycle matchings for three resamples of the circle dataset with points in dimensions. Moreover, we had to limit ourselves to the noise levels .
Cycle matching tries to match each cycle in the persistence diagram of the original data with cycles in the diagrams of the resampled data. Each successfully matched cycle gets a prevalence score in . The other cycles get a prevalence of . When used for identifying real structure from noise, the hope is that only the prevalence for the true feature is high, when averaged across multiple resamples.
The following table shows the maximal prevalence scores for the Euclidean distance (and the diffusion distance, see below).
| Euclidean | |||||
| Diffusion |
Results in brackets indicate that the cycle with maximal prevalence did not correspond to the ground truth feature (by visual inspection of the representative). This was the case in all experiments with the Euclidean distance save for . For , we also checked the prevalences of the ground truth cycle (for higher there are lots of noise cycles with higher persistence than the ground truth cycle, making its identification difficult). For the ground truth cycle was not matched for any of the three resamples, leading to a prevalence of . For the ground truth cycle had an average prevalence of , much lower than the maximal prevalence cycle for .
One might hope that the cycles that get matched for all resamples are true topological features, even if their prevalences are low. Indeed, for only the correct cycle was matched for all three resamples. However this heuristic also does not work because for no cycle was matched for all three resamples, while for there was a cycle matched for all three resamples, but it did not encode the ground truth feature (and its prevalence was very low ()).
The fact that most cycles were not matched for all resamples led to the high standard deviations of the prevalences. The reason for the high mean prevalences for is that some noise cycles get matched with very high prevalence (sometimes ) for one or two of the resamples. For the same reason, the prevalence of the ground truth cycle for was not much higher than the second highest prevalence (). Averaging across more resamples might help with this, but becomes even more computationally expensive.
Overall, we find that cycle matching is not only very slow, but also does not alleviate the curse of dimensionality for persistent homology. In contrast, our spectral methods produced a near-perfect loop detection score until (Fig 5).
Cycle matching as alternative performance score
Cycle matching does not require the Euclidean distance, but works in principle for any two finite metric spaces. While the rebuttal time was too short to amend García-Redondo et al.'s implementation accordingly, we did explore the diffusion distance. This was possible because the diffusion distance is a Euclidean distance between certain point configurations (Section 6), allowing us to use García-Redondo et al.'s code unchanged. We included the results for the diffusion distance in the above table. We found that
- a) the maximal prevalences were larger than for the Euclidean distance for ,
- b) the cycle with maximal prevalence always represented the ground truth loop, and
- c) the prevalence of the correct cycle was a clear outlier among all prevalences for ,
- d) the cycle representing the ground truth feature was always matched for all three resamples and was the only one matched for all three resamples for .
This underlines the superior performance of spectral methods for detecting topology in high-dimensional settings.
Using cycle matching not for identifying topologically significant cycles but to assess the performance of different distances is an interesting complementary performance score. We can use this validation criterion only on the synthetic datasets where we have access to the data distribution for resampling. However, due to the long run time, we cannot apply it to all of our synthetic experiments.
We will add our experiments with cycle matching to the appendix in the revision. Thank you for nudging towards this approach.
References
Wasserman, L. (2006). All of nonparametric statistics. Springer Science & Business Media.
Roycraft, B., Krebs, J., & Polonik, W. (2023). Bootstrapping persistent Betti numbers and other stabilizing statistics. The Annals of Statistics, 51(4), 1484-1509.
Thank you to the authors for carefully considering my suggestions and running additional experiments as well as providing additional clarifications. I think that these observations and additional experiments would be useful to include in the revision for completeness because they are recent contributions in the area of topological data analysis that have gained quite a lot of traction given their recent appearances and they are relevant to the use case that the authors study with their method.
I have also read the other reviews and the authors' responses to them as well, which I think were generally handled well. I have raised my score from 3 to 5.
Thank you for the fruitful discussion! We agree that the paper got more well-rounded through your suggestions, which we will definitely incorporate in the revision. We very much appreciate you raising your score!
Dear Reviewer aHMb, could you please respond to the authors' rebuttal? Thank you
Summary. The paper proposes a new spectral type distance that (through the pipeline of persistent homology) better captures hidden cycles in high-dimensional point clouds than standard distances such as Euclidean.
Strengths. The authors should be highly praised for a rigorous approach to challenging data in high dimensions. The paper and appendices contain several theoretical results with detailed proofs.
Weaknesses. A few statements can be made more exact. Lines 29-30 say that "persistent homology based on the Euclidean distance between noisy points can fail to find the correct loop (Figure 1)". The actual diagram in Figure 1 contains all birth-death pairs, which can be visualized by a Homologically Persistent Skeleton (https://www.sciencedirect.com/science/article/abs/pii/S0031320321000893) with theoretical guarantees. The authors probably meant that it would be great to see only one highly persistent pair that is well-separated from all others.
Any dimensionality reduction (including UMAP, t-SNE, PCA, MDS and all others) is either discontinuous (makes close points distant) or collapses an unbounded region to a single point (loses an infinite amount of data), which has a simple and rigorous proof in the paper at https://www.tandfonline.com/doi/abs/10.4169/amer.math.monthly.123.4.392.
Questions. What is the asymptotic complexity (for a fixed dimension, in the number of points) of the effective resistance distance?
The "hole detection score" measuring the ratio (p_1-p_2)/p_1 can be unstable if p_1 is (close to) 0, which happens for generic families of non-isometric point clouds. The authors probably know about these examples by citing reference [64] whose published version is at https://link.springer.com/article/10.1007/s41468-024-00177-6.
Why not to use the simpler widest gap between dots in the persistence diagram, which can separate several genuine cycles (not only one) from noise and has theoretical guarantees, which were proved in the paper at https://www.sciencedirect.com/science/article/abs/pii/S0031320321000893 and would substantially improve this paper?
Limitations. The key limitation of persistent homology is its weakness as an isometry invariant [64] in addition to a cubic computational time in the size of a complex, which is usually much larger than the cloud size. This limitation has been addressed 20 years ago by much simpler, faster and stronger invariants (https://www.sciencedirect.com/science/article/pii/S0196885803001015) and more recently by complete invariants (https://match.pmf.kg.ac.rs/issues/m91n1/m91n1_79-108.html, https://openaccess.thecvf.com/content/CVPR2023/html/Widdowson_Recognizing_Rigid_Patterns_of_Unlabeled_Point_Clouds_by_Complete_and_CVPR_2023_paper.html).
The authors and reviewers are encouraged to consult the guidance at https://neurips.cc/Conferences/2024/ReviewerGuidelines and focus on justified comments, questions, and answers, without putting weight on numerical scores because scientific decisions should be made not by popular votes but through rigorous arguments.
Dear area chair,
thank you very much for the additional feedback! We are happy you appreciate the rigor of our approach and the theoretical contributions. We respond to your comments below.
Precision:
We will rephrase as
persistent homology [...] can fail to identify the correct loop as a clear outlier in the persistence diagram
Dimensionality reduction:
We fully agree that dimensionality reduction introduces artifacts. We mentioned this for t-SNE and UMAP in line 329 which we will strengthen with the reference to Landweber et al. that you provided. It also applies to the normalized PCA approach from [36] suggested by reviewer aHMb.
Neither effective resistance nor diffusion distance perform dimensionality reduction. Both can be exactly expressed only in dimensions, where is the number of points. Instead, they should rather be considered denoising methods. Therefore, our recommendations align with your criticism of dimensionality reduction.
Hole detection score:
Stability
Indeed, our hole detection score can be unstable for small . While this is mitigated by the numerator also getting small for small , we did observe some instability if there are only few points in the diagram. As described in lines 223-226, we handled this by setting in case all death-to-birth ratios are small (compare also Figs S19, S21). During the rebuttal, we proved a desirable continuity property for our hole detection score, see general comment.
Some normalization of the gap width is necessary to compare different distances, because their values are on different scales. Our hole detection scores always lies in .
Widest gap metric
We use our hole detection score to validate if a method identifies the correct number of ground-truth holes, . It always considers the gap in the persistence diagram that needs to be large so that the correct number of holes appear as outliers.
Building a metric based on the widest gap in the persistence diagram is an interesting suggestion. We now implemented and tested the binary metric which is 1 if the number of features above the widest gap equals the number of ground-truth features and 0 otherwise ("widest gap metric").
Note that our metric can be lower than 1 even if the persistence diagram clearly shows the correct number of outliers (e.g. for in Fig. 3b). The widest gap metric would assign a score of 1 in this case.
On the other hand, the binary nature of the widest gap metric leads to instability in the case of nearly equisized gaps. For instance in Fig. 8d, a small perturbation could make the gap after the most persistent feature the widest, dropping the widest gap metric from 1 to 0. In contrast, our correctly reflects the two outliers.
The binary nature of the widest gap metric leads to high-variance results even though we increased the number of random seeds to 10 (rebuttal Fig R1). Spectral methods still performed better than Euclidean, Fermat, and DTM distances. The widest gap metric can lead to false positives. E.g. for the Fermat distance and in 9 of 10 trials on the circle the widest gap appeared after the first feature. However, all 9 persistence diagrams looked like noise clouds and in 8 cases the most persistent feature's representative was clearly a noise feature. We plot this for one seed in Fig R1 d-e. Conversely, for the effective resistance there was always a clear outlier in the diagram and the representative did follow the ground-truth loop (Fig R1 f-g). Although the Fermat distance failed in this setting and the effective resistance performed great, their widest gap metrics were both very high.
For these reasons, we prefer to stick to our hole detection score. We will add the above discussion including Fig R1 to the appendix. Moreover, we will cite Smith and Kurlin (2021) as additional inspiration for our hole detection score in line 222.
Limitations:
Runtime complexity
Following reviewer aHMb, created a dedicated Limitations section (see general comment). It contains our existing discussion of the high complexity of persistent homology (line 331), adding the cubic complexity in the number of points of effective resistance.
Isometry invariant
We already mentioned that persistent homology can fail to distinguish non-isometric point clouds (line 324). We will add the references on finer invariants you provided (see general comment). That said, we believe that persistent homology being a coarser invariant is not a grave disadvantage for exploratory data analysis. The topology of a point cloud provides an intuitive, high-level description of the data.
References:
Landweber et al (2016). On fiber diameters of continuous maps.
Smith & Kurlin (2021). Skeletonisation algorithms with theoretical guarantees for unorganised point clouds with high levels of noise.
Please use the word "metric" a bit more carefully, because a metric is a distance satisfying the metric axioms, https://en.wikipedia.org/wiki/Metric_space. The word "distance" is not a replacement for a metric but has a wider meaning as in the book at https://link.springer.com/book/10.1007/978-3-662-52844-0. If talking about a "feature" or a "descriptor", these words are more suitable than a "metric". Thank you.
Thank you for your attention to details! We used the word metric in the expression widest gap metric in the sense of a performance metric (like mean squared error, , precision, recall, score) since it measures the performance of different approaches. This is a standard use of the word metric in the machine learning community. Nevertheless, to avoid any confusion with the quantities we use as input to the Vietoris-Rips complex, we are happy to change our wording to widest gap score.
Please note that we mention in Appendix F / Prop C3, which of the investigated distances/dissimilarities satisfy the axioms of a metric (in the sense of metric space).
Dear reviewers and area chair,
we cordially thank you for the effort invested in assessing our manuscript. We are glad that you found the tackled problem "relevant" (C1nu, QsUc), liked our presentation (C1nu, aHMb), and appreciated both the theoretical contributions (AC) and our experimental setup (C1nu, aHMb, QsUc).
Rebuttal figures:
Attached to this general comment, please find a page with three additional figures.
Dedicated Limitations section:
Following the comments of reviewers aHMb, QsUc, and the AC, we will include a dedicated Limitations section, see below. It consists of passages formerly located in the Discussion section plus some additional content, highlighted below in italics. The remaining part of the Discussion section will be retitled Conclusions.
Limitations and future work:
In the real-world applications, it was important to look at representatives of detected holes as some holes were persistent, but arguably incorrect. That said, each hole homology class has many different representative cycles, making interpretation difficult. Given ground-truth cycles, an automatic procedure for evaluating cycle correctness remains an interesting research question.
Persistent homology can only detect topology, which is often a useful global level of abstraction. However, it may therefore fail to distinguish some non-isometric point clouds [64]. There exist dedicated measures for detecting isometry [Boutin et al. 2004, Widdowson et al. 2023, Kurlin 2024].
Using effective resistance or diffusion distances is easy in practice as their computation time ()) is dwarfed by that of the persistent homology (Table S4), which scales as for n points and topological holes of dimension [51]. This high complexity of persistent homology aggravates other problems of high-dimensional datasets as dense sampling in high-dimensional space would require a prohibitively large sample size and since spectral methods seem to need a high sampling density on some datasets like the torus for good performance. Combining persistent homology with non-Euclidean distance measures could mitigate this problem via the approach of Bendich et al. [6], who performed subsampling after computation of the distance matrix. This is a particularly attractive avenue for future research.
Both effective resistance and diffusion distances require the choice of hyperparameters. However, effective resistance only needs a single hyperparameter: the number of NN neighbors. For this reason and due to its greater outlier resistance (Appendix K), we tend to recommend effective resistance over diffusion distances, but a principled criterion when to use which of the two is still missing.
Moreover, we do not have a theoretical proof that spectral distances mitigate the curse of dimensionality. Such a proof may be achieved in the future taking inspiration from the stability results in [10, 36, 66] and, more generally, spectral perturbation theory.
Our empirical results focus on benchmarking which distances identify the correct topology in the presence of high-dimensional noise. Therefore, we only considered datasets with known ground-truth topology. The next step will be to use spectral distances to detect non-trivial topology in real-world exploratory contexts.
Continuity of our hole detection score:
Reviewer QsUc and the area chair had questions about our evaluation metric. During the rebuttal, we were able to prove the following desirable continuity result for our hole detection score , which we will include in the revision. It states that on persistence diagrams with sufficiently many points our hole detection metric is continuous. This makes our metric robust against small perturbations in the persistence diagram, a property expected for a reliable metric.
In the typical case, where the persistence diagram contains a noise cloud of many points close to the diagonal and, optionally, some outliers, the requirement on the number of points is satisfied. Only in the case where we expect a large gap after the -th feature, but the diagram does not even have features, can there be a sudden jump in our hole detection score.
Proposition
Let be the space of all persistence diagrams with finitely many points off the diagonal, endowed with the topology induced by the bottleneck distance . Then the map is not continuous for any . Here are the non-negative real numbers.
Let and define as the subset of with at least points off the diagonal. Then the map is continuous.
Proof sketch
The discontinuity of happens when adding the -th point to a diagram.
For the continuity part, we split into two parts. The map that assigns its -th persistence to a persistence diagram is continuous (details omitted for brevity). The map is continuous. Furthermore, we have . Finally, factors as , showing its continuity. Q.E.D.
References:
Boutin & Kemper (2004). On reconstructing n-point configurations from the distribution of distances or areas.
Widdowson & Kurlin (2023). Recognizing rigid patterns of unlabeled point clouds by complete and continuous isometry invariants with no false negatives and no false positives.
Kurlin (2024). Polynomial-time algorithms for continuous metrics on atomic clouds of unordered points.
This paper rigorously investigated the shortcomings of several past methods (dimensionality reduction and persistent homology), which are important to know for a broad audience in data science.
All reviewers gave positive scores and sticked to them after the author's detailed rebuttal, though no reviewer provided decisive arguments.
Reviewer QsUc, who gave score 7, didn't participate in the discussion after acknowledging the rebuttal by saying "Thank you for the clarifications - I stick to my score".
Reviewer C1nu, who gave score 6, was asked even before the rebuttal to clarify their comments with more details but didn't react and later only wrote that "I thank the authors for their efforts and the response. I have read the response and would like to stick with my score."
Reviewer aHMb, who gave score 5, acknowledged that "I may have been harsh in my initial evaluation and while my overall impression has increased, I still maintain that is a borderline paper that could be accepted".
The area chair champions this paper for the following reasons.
The key strength is the rigorous approach with theoretical results, e.g. Corollaries 6.1-6.2 proved in Appendices C-D.
The main argument for accepting this paper is the practical demonstration of the curse of dimensionality for persistent homology, which wasn't previously known.
The authors understood the major limitation of any dimensionality reduction, which is either discontinuous (makes close points distant) or collapses an unbounded region to a single point (loses an infinite amount of data) as proved in the paper at https://www.tandfonline.com/doi/abs/10.4169/amer.math.monthly.123.4.392.
This crucial limitation applies to all modern methods such as t-SNE and UMAP, whose optimization has no guarantees of a global optimum, as well as to classical methods such as PCA whose coordinates are analytically computable but can discontinuously change under perturbations of given points when some eigenvalues become equal.
Any justified analysis of high-dimensional data should be done in the original space, while low-dimensional projections to explicitly defined coordinates can help only to visualize, not to make rigorous conclusions.