PaperHub

Overall rating: 5.4 / 10 (Poster; 5 reviewers; min 5, max 6, std 0.5)
Individual ratings: 6, 5, 5, 6, 5
Confidence: 3.4 · Correctness: 3.0 · Contribution: 2.4 · Presentation: 3.2

NeurIPS 2024

Expected Probabilistic Hierarchies

Links: OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2025-01-16
TL;DR

A probabilistic model that learns hierarchies in data by optimizing expected metrics via gradient descent, outperforming several baselines.

Keywords
hierarchical clustering, graph clustering, clustering, unsupervised learning, probabilistic models

Reviews and Discussion

Review
Rating: 6

This paper considers the problem of hierarchical clustering. This problem is typically handled by discrete optimization approaches, which define a hierarchical clustering quality score (e.g., Dasgupta and TSD) and optimize it on a discrete search space. More recent approaches consider this problem from a probabilistic perspective. They soften the hierarchical clustering scores and optimize them on a continuous space. In this paper, the authors propose an extension of the prior work on Flexible Probabilistic Hierarchy (FPH) [Zügner et al., 2021]. Specifically, they introduce Expected Probabilistic Hierarchies (EPH), a new probabilistic hierarchical clustering objective for continuous optimization. The consistency and advantages of EPH are supported by theoretical analysis. The experimental results on synthetic and real-world datasets show that EPH can outperform existing approaches.

Strengths

  • Overall, this paper is well-organized and easy to follow.
  • The contributions are clearly described and well-supported by theoretical analysis and experiments.
  • Extensive experimental results and visualizations are provided.

Weaknesses

  • While I appreciate the theoretical results for supporting the advantages of EPH over FPH, there is a lack of substantial technical improvements. Therefore, the novelty is limited.

Questions

  • The motivation behind adopting the Dasgupta and TSD is not clearly stated. While the authors provide definitions and references for them, it is still hard to understand why these two particular scores are used and how they are in contrast to each other. Could the authors elaborate on this point?

Limitations

The authors have discussed the limitations of their work.

Author Response

We thank the reviewer for the comprehensive feedback. In the following, we address their comments.

Comment: While I appreciate the theoretical results for supporting the advantages of EPH over FPH, there is a lack of substantial technical improvements. Therefore, the novelty is limited.
Response: We appreciate the reviewer's recognition of the theoretical results supporting the advantages of EPH over FPH. We want to emphasize that EPH introduces several significant advancements that contribute to its novelty and practical impact.

First, EPH demonstrates substantial improvements in performance, with gains of over 20% compared to the original FPH on certain datasets (see Table 10 in the appendix). These improvements are not isolated but consistent across a wide range of datasets, underscoring the effectiveness of our approach.

Moreover, our work introduces a novel unbiased subgraph sampling technique that significantly enhances the scalability of EPH, particularly on dense vector datasets. This scalability is a crucial advancement, as it allows EPH to be applied to larger and denser datasets that FPH cannot handle. EPH thus opens new possibilities for application in domains where data similarities are dense.

In addition to performance and scalability improvements, EPH provides theoretical justification for its framework. Finally, we are hopeful that future work will be able to adapt our algorithm to different use cases, as its only requirement is a differentiable metric.

Comment: The motivation behind adopting the Dasgupta and TSD is not clearly stated. While the authors provide definitions and references for them, it is still hard to understand why these two particular scores are used and how they are in contrast to each other. Could the authors elaborate on this point?
Response: We focused on the Dasgupta cost and TSD because they are well-studied clustering metrics with intuitive meanings. The Dasgupta cost is a widely used metric that measures the quality of a hierarchical clustering by evaluating how close similar leaves are in the hierarchy. Specifically, it quantifies the expected number of leaves that descend from the lowest common ancestor of a randomly sampled pair of leaves. A lower Dasgupta cost indicates that similar items are grouped together in the hierarchy, reflecting a good clustering structure.

The Tree-Sampling Divergence (TSD), on the other hand, quantifies how well a hierarchy can reconstruct its graph. More specifically, it is defined as the KL divergence between the node and the edge distributions. Its advantage over the Dasgupta cost is that it does not favor binary hierarchies.
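For readers who want the formulas behind this discussion, here are the standard definitions from the literature, in our own notation (the thread itself does not fix notation): for a hierarchy T over leaves with pairwise similarities w_ij, where i ∨ j denotes the lowest common ancestor (LCA) of leaves i and j:

```latex
% Dasgupta cost: size of the subtree rooted at the LCA, weighted by similarity
\mathrm{Das}(T) = \sum_{i < j} w_{ij} \,\bigl|\mathrm{leaves}\bigl(T[i \vee j]\bigr)\bigr|

% Tree-Sampling Divergence: KL divergence between p (distribution of LCAs of
% similarity-sampled pairs) and q (distribution of LCAs of independently
% sampled nodes)
\mathrm{TSD}(T) = \sum_{z \in \mathrm{internal}(T)} p(z) \log \frac{p(z)}{q(z)}
```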

Both metrics are unsupervised and differentiable, which is necessary for the gradient-based optimization in our approach. The unsupervised nature of these metrics allows us to evaluate the quality of the hierarchies without the need for labeled data, making them applicable to a wide range of applications.

We again thank the reviewer for their feedback and are happy to address any upcoming questions.

Comment

Thank you for the detailed response. After reading the rebuttal and other reviewers' comments, I am happy to keep my score unchanged (weak accept).

Comment

We thank the reviewer for their response and appreciate any further suggestions.

Review
Rating: 5

This paper proposes a new hierarchical clustering method, Expected Probabilistic Hierarchies (EPH), developed from the Flexible Probabilistic Hierarchy (FPH) method. Unlike FPH, which uses Soft-Das and Soft-TSD, EPH provides an unbiased estimate of two new objectives called Exp-Das and Exp-TSD. EPH addresses the alignment issue between continuous and discrete targets, which is supposed to be the reason why it outperforms other continuous methods. Soft-Das is proved to be a lower bound on Exp-Das (Proposition 4.2), which is the main theoretical contribution. A learnable approach based on differentiable hierarchical sampling is proposed, in which the tree sampling procedure and the Gumbel-Softmax estimator are the main techniques used to approximate the discrete choices of parent nodes. Experimental results verify the effectiveness of the new method.
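For context, here is a minimal sketch of the straight-through Gumbel-Softmax estimator mentioned in this summary, applied to sampling one parent per node from learned logits; the shapes and the loss are illustrative placeholders, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

# Hypothetical setup: 12 nodes each choose a parent among 4 candidates.
logits = torch.randn(12, 4, requires_grad=True)

# Straight-through Gumbel-Softmax: the forward pass yields discrete one-hot
# parent choices, while the backward pass uses the differentiable relaxation.
parents = F.gumbel_softmax(logits, tau=1.0, hard=True)   # (12, 4) one-hot

loss = parents.pow(2).sum()   # stand-in for an expected hierarchy cost
loss.backward()               # gradients reach the parent logits
```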

Strengths

  1. The authors propose a new hierarchical clustering method based on a sampling-based differentiable metric.

  2. The authors conduct comprehensive experiments to demonstrate that their method outperforms the baselines on both synthetic (HSBM) and real (including graph and vector) datasets.

Weaknesses

The technical contribution is limited since the framework of the algorithm and the main techniques come almost entirely from (Zügner et al., 2022). I do understand that EPH has its own scores rather than soft scores, but I don't think the novelty is strong enough for the NeurIPS criterion.

Questions

  1. It seems that the Dasgupta cost is robust against the numbers of hierarchies and internal nodes, whose variation may lead to quite different hierarchical structures. However, a real system is believed to have its own intrinsic structure with relatively fixed such numbers. Does this mean that EPH, or even the Das and TSD costs, are not suitable for finding the intrinsic hierarchical structure of real systems? I think that finding the intrinsic numbers is quite a difficult task, and it does not seem a good idea to treat them as hyper-parameters.

  2. How do you restrict the number of hierarchies when the number of internal nodes is given?

Limitations

Yes, the authors have addressed the limitations in their discussions.

Author Response

We thank the reviewer for the valuable feedback and address their comments below.

Comment: The technical contribution is limited since the framework of the algorithm and main techniques come from (Zügner et al., 2022).
Response: While our work builds upon the probabilistic hierarchies introduced by Zügner et al. (2022), it addresses key limitations of their approach and introduces several significant contributions and advancements.

Firstly, we prove that the Soft-Das cost is a lower bound of the discrete Dasgupta cost (Sec. 4.2), and we provide a concrete example where it fails to identify the optimal hierarchy. Building on this insight, we propose the Exp-Das and Exp-TSD metrics (Sec. 4.1) and show that their optimal hierarchies align with their discrete counterparts. Furthermore, we introduce an unbiased subgraph sampling algorithm, allowing us to apply both FPH and EPH to vector datasets.
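The subgraph sampling algorithm itself is specified in the paper, not in this thread; as a hedged illustration of what "unbiased" means here, the sketch below uses inverse-probability (Horvitz-Thompson) reweighting, so a loss computed on uniformly sampled pairs matches the full loss in expectation. All names are placeholders:

```python
import numpy as np

def full_objective(w, contrib):
    """Exact objective: similarity-weighted sum over all leaf pairs."""
    n = w.shape[0]
    return sum(w[i, j] * contrib(i, j)
               for i in range(n) for j in range(i + 1, n))

def sampled_objective(w, contrib, num_samples, rng):
    """Unbiased estimate: sample pairs uniformly and reweight by the inverse
    inclusion probability, so E[estimate] equals full_objective."""
    n = w.shape[0]
    num_pairs = n * (n - 1) // 2
    total = 0.0
    for _ in range(num_samples):
        i, j = sorted(rng.choice(n, size=2, replace=False))
        total += w[i, j] * contrib(i, j)
    return total * num_pairs / num_samples
```

Because each pair is drawn with probability 1/num_pairs per sample, the num_pairs/num_samples factor makes the estimator exact in expectation.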

Our empirical evaluation further demonstrates these advancements, showing substantial performance improvements, with gains of over 20% on certain datasets compared to the original FPH (see Table 10 in the appendix). These improvements are consistently observed across graph and vector datasets, underscoring the practical impact and novelty of our contributions.

Comment: It seems that the Dasgupta cost is robust against the numbers of hierarchies and internal nodes, whose variation may lead to quite different hierarchical structures. However, a real system is believed to have its own intrinsic structure with relatively fixed such numbers. Does it mean that EPH or even Das and TSD costs are not suitable for finding the intrinsic hierarchical structure of real systems? I think that finding the intrinsic numbers is quite a difficult task, and it seems not a good way to treat them as hyper-parameters.
Response: The Dasgupta cost and TSD metrics tend to improve with an increasing number of internal nodes, but they eventually converge when the hierarchies become expressive enough to capture the intrinsic structure of the data. In an unsupervised setting, the true number of internal nodes is unknown. To address this, we conducted experiments to determine when most of the information is captured, which we found to occur at n' = 512 (see Figure 14 and Figure 15). While the choice of internal nodes can depend on the specific application, a higher number is generally preferable as it captures more information and can be pruned later. We do not expect significant structural differences in hierarchies of different sizes, as the metrics maintain consistent and intuitive meanings across configurations.

Moreover, we validated our approach on synthetic HSBMs, where the number of internal nodes is predefined (see Figure 4 and Table 4). The results show that the inferred hierarchies closely align with the ground truth, indicating that EPH, along with the Dasgupta and TSD costs, is suitable for identifying the intrinsic hierarchical structure when the exact number of internal nodes is used to condition the hierarchies.

Comment: How do you restrict the number of hierarchies when the number of internal nodes is given?
Response: The number of hierarchies is implicitly constrained by the sizes of A and B. Specifically, selecting n' effectively limits the number of internal nodes, thereby restricting the possible hierarchies that can be inferred.
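As a hedged illustration of this point (the shapes follow the FPH-style parameterization as we understand it; the masking that guarantees a valid tree is omitted): fixing n' fixes the sizes of A and B, and with them the set of representable hierarchies.

```python
import torch

n, n_prime = 100, 16   # number of leaves and of internal nodes (n')

# Row-stochastic parent matrices: each row is a distribution over parents.
A = torch.softmax(torch.randn(n, n_prime), dim=-1)            # leaf -> internal
B = torch.softmax(torch.randn(n_prime - 1, n_prime), dim=-1)  # internal -> internal

# Any discrete hierarchy sampled from (A, B) has at most n' internal nodes,
# so choosing n' implicitly restricts the hierarchies that can be inferred.
```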

We again thank the reviewer and are happy to address any remaining concerns.

Comment

I thank the authors for their response. My major concern is about the technical contributions. I keep my score.

Comment

We thank the reviewer for their response.

Review
Rating: 5

This paper proposes a method for hierarchical clustering by optimizing expected clustering scores (DAS/TSD) over the distribution of hierarchies. They show theoretically that their proposed optimization objective is consistent with their discrete counterparts: that is, the solution for their proposed optimization problem is equivalent to the solution of the intractable discrete optimization problem (this is in contrast with other continuous relaxations such as soft-DAS that do not have this property). They show that this problem can be solved well in practice in an end-to-end manner despite not being convex. The authors show that their method empirically outperforms other known methods.

Strengths

The paper is extremely well written and technically correct/solid. The problem background is explained well in a way that is accessible to a broader audience as well.

Weaknesses

The main issue with this paper is the significance of the contribution. Overall, the work seems like a marginal improvement over [1]. In my view, the differentiable sampling techniques used are very standard and the main novel contribution is their theoretical result about consistency of their optimization objective with that of the discrete objective. The experimental results also only show a marginal improvement over the work that it builds on [1]. Furthermore, based on the runtimes from Table 18, it seems that this marginal improvement comes at a great computational overhead compared to FPH [1]. Thus, in the context of the whole paper, while the theoretical result is certainly interesting, this seems like an incremental contribution.

[1] Zügner et al., End-to-End Learning of Probabilistic Hierarchies on Graphs.

Questions

  1. In what cases would a practitioner prefer to use your method (EPH) over FPH? It seems that the former is much more computationally expensive. Are there applications that demand such precise clustering? (If I'm understanding correctly, because of the non-convexity, EPH cannot guarantee an optimal solution despite consistency anyway.)

  2. Alternatively, are there reasons to believe future work can significantly speed up EPH to make it competitive (in the runtime sense) with other baselines?

Limitations

Yes, the authors addressed the limitations.

Author Response

We thank the reviewer for their valuable feedback and address their concerns in the following.

Comment: The experimental results also only show a marginal improvement over the work that it builds on [1]. Based on the runtimes from Table 18, it seems that this marginal improvement comes at a great computational overhead compared to FPH [1]. [...]
Response: We appreciate the reviewer's feedback and would like to clarify the significance of our contribution. Our proposed method, EPH, introduces several crucial modifications to FPH, which result in notable improvements over FPH itself across multiple datasets, as shown in Table 10. For instance, on the Brain dataset, EPH achieves an over 20% improvement compared to the original FPH, and even after tuning FPH, EPH still provides up to a 12% improvement on the OpenFlight dataset. Furthermore, it is important to note that FPH can be applied to large vector datasets only thanks to our proposed subgraph sampling algorithm. EPH demonstrates superior scores to many baselines and a lower runtime than the other continuous method, HypHC. We believe these advancements elevate our contribution beyond incremental improvements and hope the reviewer recognizes the broader applicability and potential impact provided by EPH.

Comment: In what cases would a practitioner prefer to use your method (EPH) over FPH? [...] Are there applications that demand such precise clustering? EPH cannot guarantee optimal solutions despite consistency.
Response: While it is true that EPH cannot guarantee convergence to the global optimum, its consistency provides several advantages that make it preferable in specific scenarios. EPH ensures that any reduction in the objective function results in a probabilistic hierarchy where discrete hierarchies are encoded with consistent costs. Unlike FPH, which can diverge from an optimum, EPH's probabilistic approach ensures that the resulting optima contain discrete hierarchies with equivalent costs, allowing for easier sampling of discrete hierarchies.

EPH also provides a valuable trade-off between quality and runtime through adjustable hyperparameters, allowing practitioners to prioritize high-quality hierarchies when computational resources are secondary. In applications where precise clustering is crucial, EPH can uncover hierarchies that deterministic methods like FPH may miss, irrespective of the computational resources available.

A common example where precise clustering is crucial is predicting the clinical prognosis of cancer from genomics data [2, 3].

Comment: Alternatively, are there reasons to believe future work can significantly speed up EPH to make it competitive with other baselines?
Response: Our primary focus was on improving the quality of inferred hierarchies rather than optimizing for runtime. However, we see several opportunities for future work to significantly reduce the runtime of EPH. These include:

  1. Parallelizing computations of different Gumbel samples, which in our current implementation are computed sequentially. By parallelizing, we can achieve a substantial reduction in runtime (see the sketch after this list).

  2. Reducing the number of Gumbel samples as a trade-off. EPH allows for flexibility in the number of samples needed to approximate the loss, with fewer samples leading to faster computations.

  3. Skipping validation steps that are necessary for FPH but not for EPH, as EPH's loss is consistent with the discrete Dasgupta cost.
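A minimal sketch of item 1 (names hypothetical, not the authors' code): drawing all S Gumbel samples as one batched tensor replaces the sequential Python loop with a single vectorized call.

```python
import torch
import torch.nn.functional as F

S = 32                                           # number of Gumbel samples
logits = torch.randn(10, 4, requires_grad=True)  # parent logits

# Sequential: one relaxed hierarchy sample per loop iteration.
samples_seq = [F.gumbel_softmax(logits, tau=1.0, hard=True) for _ in range(S)]

# Parallel: broadcast the logits along a leading sample dimension and draw
# all S samples at once; downstream cost terms then vectorize over dim 0.
samples_par = F.gumbel_softmax(logits.expand(S, -1, -1), tau=1.0, hard=True)

loss = samples_par.pow(2).sum(dim=(1, 2)).mean()  # Monte Carlo average
loss.backward()
```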

To demonstrate the feasibility of these improvements, we implemented two modified versions of EPH on the graph datasets: EPH (parallelized) and EPH (minimized). EPH (parallelized) maintains the same numerical results while reducing runtime by up to 80%. EPH (minimized), which reduces the number of Gumbel samples and skips validation steps, achieves a runtime reduction of up to 97.5%, even outperforming FPH in speed, as shown in the following table.

| Method | Polblogs | Brain | Citeseer | Genes | Cora-ML | OpenFlight | WikiPhysics |
|---|---|---|---|---|---|---|---|
| FPH | 452 | 547 | 345 | 373 | 644 | 592 | 667 |
| EPH | 3834 | 3402 | 2609 | 2848 | 3322 | 4419 | 6389 |
| EPH (parallelized) | 1496 | 1066 | 749 | 521 | 1196 | 1654 | 2325 |
| EPH (minimized) | 111 | 100 | 72 | 79 | 85 | 116 | 157 |

While these modifications offer a significant speed improvement, they still achieve state-of-the-art results, as shown in the following table.

| Model | Polblogs | Brain | Citeseer | Genes | Cora-ML | OpenFlight | WikiPhysics |
|---|---|---|---|---|---|---|---|
| FPH | 238.65 | 425.70 | 76.03 | 182.91 | 257.42 | 355.61 | 482.40 |
| EPH (minimized) | 236.86 | 404.30 | 72.81 | 173.56 | 238.11 | 309.33 | 463.63 |

We are optimistic that future research, particularly in sampling approximation methods [4] and GPU-accelerated alias sampling [5], will continue to reduce EPH's runtime, allowing it to scale to even larger datasets. Finally, it is worth noting that EPH's flexibility allows it to be applied to any differentiable metric, enabling future adaptations and applications across a wide range of fields.

We again thank the reviewer for their feedback and hope that we have addressed their concerns satisfactorily.

[1] Zügner et al. "End-to-end learning of probabilistic hierarchies on graphs." In International Conference on Learning Representations. 2021.

[2] Van't Veer et al. "Gene expression profiling predicts clinical outcome of breast cancer." Nature 415, no. 6871 (2002): 530-536.

[3] Macaluso et al. "Cluster trellis: Data structures & algorithms for exact inference in hierarchical clustering." In International Conference on Artificial Intelligence and Statistics, pp. 2467-2475. PMLR, 2021.

[4] Paulus et al. "Rao-Blackwellizing the straight-through Gumbel-Softmax gradient estimator." arXiv preprint arXiv:2010.04838 (2020).

[5] Wang et al. "Skywalker: Efficient alias-method-based graph sampling and random walk on gpus." In 2021 30th International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 304-317. IEEE, 2021.

Comment

Thank you for your time and effort in providing a detailed response! I really appreciate the new experiments that optimize the runtime of EPH to make it competitive with existing methods like FPH. They successfully demonstrate the practical applicability of the method.

I have modified my score to recommend acceptance.

Comment

We thank the reviewer for their response and for adjusting the score.

Review
Rating: 6

This paper studies gradient-based methods for hierarchical clustering, presenting an interesting approach, EPH, which is both scalable and accurate. The approach uses subgraph sampling for scalability and is interesting in its optimization of expected hierarchical clustering costs.

Strengths

This is a very well written paper with:

  • Interesting methodological contributions
  • Thorough empirical comparisons
  • Clear writing and presentation of ideas
  • A presentation of complexity
  • Empirical comparisons on a wide variety of datasets

Weaknesses

Weaknesses of the paper include:

  • Perhaps the methodological ideas are a bit on the simple side; however, this may be an impression given by a well written paper.
  • It is a bit difficult for me to understand when a practitioner would choose this method over some algorithmic alternatives.

Questions

Can you add / discuss more about the implications of binary vs non-binary hierarchies?

Limitations

Adequately addressed.

Author Response

We thank the reviewer for their valuable feedback. In the following, we address their comments.

Comment: It is a bit difficult for me to understand when a practitioner would choose this method over some algorithmic alternatives.
Response: EPH is particularly advantageous over alternatives in scenarios where the quality of the hierarchy is crucial and outweighs other factors. Unlike algorithmic alternatives, EPH has the unique ability to uncover hierarchies that deterministic methods are not able to reach, regardless of their computational resources. This makes EPH an ideal choice when the primary objective is to capture intricate data structures that are critical for the analysis.

Moreover, EPH offers flexibility through its adjustable hyperparameters, allowing practitioners to balance the trade-off between clustering precision and computational efficiency. This adaptability enables users to tailor the method to their specific needs, whether they require high-quality clustering or need to optimize for runtime.

Comment: Can you add / discuss more about the implications of binary vs non-binary hierarchies?
Response: Binary hierarchies offer fine-grained clustering and are often easier to analyze with certain theoretical models, e.g., to derive lower bounds for costs [3]. However, they can be computationally intensive. Non-binary hierarchies, in contrast, provide more flexibility by allowing nodes to have more than two children, which often aligns better with real-world clustering tasks. This flexibility can simplify the hierarchy and make it easier to interpret at higher levels. However, it is worth noting that binary hierarchies are required in certain tasks, such as jet physics and cancer genomics [2].

It is important to note that while the Dasgupta cost tends to favor binary branches [1], as seen in Figure 4, our approach is not restricted to binary hierarchies. EPH allows hierarchies to have any number of children, including binary hierarchies, if that structure is indeed optimal for the data.

We thank the reviewer again for their feedback and will discuss these points in the updated manuscript. We are happy to address any remaining concerns.

[1] Dasgupta, Sanjoy. "A cost function for similarity-based hierarchical clustering." In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pp. 118-127. 2016.

[2] Macaluso, Sebastian, Craig Greenberg, Nicholas Monath, Ji Ah Lee, Patrick Flaherty, Kyle Cranmer, Andrew McGregor, and Andrew McCallum. "Cluster trellis: Data structures & algorithms for exact inference in hierarchical clustering." In International Conference on Artificial Intelligence and Statistics, pp. 2467-2475. PMLR, 2021.

[3] Chami, Ines, Albert Gu, Vaggos Chatziafratis, and Christopher Ré. "From trees to continuous embeddings and back: Hyperbolic hierarchical clustering." Advances in Neural Information Processing Systems 33 (2020): 15065-15076.

Review
Rating: 5

The paper builds on a previous work [40] for probabilistic hierarchical clustering. Instead of using the “soft” version of Dasgupta cost and Tree-Sampling Divergence in the objective function, the paper proposes to use the expected value of the two cost functions and develops a sampling method for calculating the expected value of the cost functions. The proposed cost functions have been shown to have some interesting theoretical properties and better empirical clustering performance compared to the previous work.

Strengths

  • The paper includes theoretical analysis of the proposed Exp-Das and Exp-TSD. It shows the optimal value of those cost functions for the probabilistic hierarchy is equivalent to the optimal value of their counterparts (i.e., Das and TSD) for the discrete hierarchy. It further shows that Soft-Das is only a lower bound of Exp-Das and thus less ideal for use to find the hierarchy with optimal Das.
  • The experiments include a reasonable number of baseline methods and datasets. The proposed method EPH is shown to be empirically better than the FPH proposed in the previous work [40].
  • The presentation is satisfactory and the paper is easy to follow.
  • The code of the proposed method is/will be publicly available.

Weaknesses

  • The presentation of the paper and the proposed hierarchical clustering solutions closely resemble a previous work [40] that proposes the generalisation of probabilistic hierarchy and the cost functions Soft-Das and Soft-TSD. The current paper proposes Exp-Das and Exp-TSD to replace the Soft-Das and Soft-TSD. The contribution looks incremental in this sense, even with some interesting and nicer theoretical properties over the previous work.
  • The proposed method EPH looks only slightly better than the previous method FPH in the experiments. The improvement of EPH over FPH is not as significant as that over other baseline methods.
  • The proposed solution may not converge to the global optimum of Exp-Das or Exp-TSD. This makes the theoretical results less relevant as it is not guaranteed to reach the global optimum anyway. This may also explain why the empirical performance is not markedly better than the previous work.
  • Only the best score of the hierarchies resulting from five random initialisations is reported in the experiments. It is hard to know how much variance there would be for the performance due to the random initialisation and whether the proposed method EPH is significantly better than the previous method FPH.
  • The qualitative evaluation does not provide very meaningful insights of the performance. I suppose similar results can be observed by other clustering methods provided that suitable features are available to compute the similarity between images. I think it would be more interesting if the hierarchy of clusters can be visualised instead.

Questions

  • Can we find the discrete hierarchy that gives the optimal Das or TSD after getting the probabilistic hierarchy with the optimal Exp-Das or Exp-TSD?
  • Section 4.3 says that "we can use a closed-form expression", but then it says "no known solution exists". Do you have a closed-form expression of Exp-Das and Exp-TSD? If so, why don't you use it instead of the sampling method?

Limitations

Yes

Author Response

We thank the reviewer for their comprehensive feedback. In the following, we address their questions and remarks.

Comment: The contribution looks incremental in this sense, even with some interesting and nicer theoretical properties over the previous work.
Response: While our work builds upon the probabilistic hierarchies introduced in [1], it has several contributions. First, we present new theoretical insights into the limitations of Soft-Das, providing a concrete example where FPH fails to find the optimal hierarchy. To address this issue, we propose expected scores, along with a framework for optimizing these scores. We further justify our approach by demonstrating that their optimal hierarchies align closely with their discrete counterparts. Additionally, we introduce an unbiased subgraph sampling algorithm, extending the applicability of both FPH and EPH to vector datasets. Finally, through a thorough empirical evaluation, we demonstrate that our method consistently outperforms existing baselines across both graph and vector datasets, highlighting the practical impact of our contributions.

Comment: The improvement of EPH over FPH is not as significant as that over other baseline methods.
Response: We would like to emphasize that we made various modifications to FPH, improving its results as shown in Table 10 in the appendix. For example, EPH has a >20% improvement on the Brain dataset over the original FPH. Even after tuning FPH, EPH reaches improvements of up to >12% (OpenFlight).

Comment: The proposed solution may not converge to the global optimum of Exp-Das or Exp-TSD.
Response: While EPH, like many complex optimization algorithms, may not always converge to the global optimum, its consistency ensures that any probabilistic hierarchy found encodes corresponding discrete hierarchies. The theoretical results remain relevant because they demonstrate that even suboptimal probabilistic hierarchies correspond to discrete hierarchies with consistent scores. This consistency is a valuable property not guaranteed by FPH or other baselines.

Comment: Only the best score of the hierarchies resulting from five random initialisations is reported in the experiments.
Response: The reason for reporting the best score is that non-deterministic methods can improve their results across multiple runs. To address the concern about variance, we have reported the standard deviations in Tables 16 and 17, which are less than 2% and are substantially lower than the standard deviations of other methods. Additionally, we provide the following tables comparing the mean performance of EPH with FPH, demonstrating that even the average performance of EPH is superior:

Dasgupta costs:

| Model | Polblogs | Brain | Citeseer | Genes | Cora-ML | OpenFlight | WikiPhysics | DBLP | Zoo | Iris | Glass |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FPH | 238.65 | 425.70 | 76.03 | 182.91 | 257.42 | 355.61 | 482.40 | 31,687 | 56.13 | 69.13 | 122.00 |
| EPH mean | 235.89 | 402.30 | 74.11 | 179.64 | 239.21 | 315.10 | 459.51 | 30,637 | 55.80 | 69.10 | 120.97 |

TSD scores:

| Model | Polblogs | Brain | Citeseer | Genes | Cora-ML | OpenFlight | WikiPhysics | DBLP |
|---|---|---|---|---|---|---|---|---|
| FPH | 31.37 | 32.75 | 69.38 | 67.78 | 59.55 | 57.58 | 49.87 | 41.62 |
| EPH | 31.60 | 33.87 | 69.34 | 67.67 | 59.38 | 57.63 | 50.04 | 42.69 |

These results further demonstrate the effectiveness of EPH and its robustness against randomness.

Comment: It would be interesting to visualise the hierarchy of the clusters.
Response: We thank the reviewer for their suggestion. We would like to highlight that we have already visualized hierarchies for the HSBMs in Figure 4 of the main paper and for the OpenFlight dataset in Figure 12 of the appendix.

In addition to these, we have now also visualized the hierarchies for the vector datasets, as described in the general comment.

Comment: Can we find the discrete hierarchy that gives the optimal Das or TSD after getting the probabilistic hierarchy with the optimal Exp-Das or Exp-TSD?
Response: Yes, the optimal discrete hierarchy can be easily obtained from the optimal probabilistic hierarchy. Proposition 4.1 in our paper proves that the optimal expected scores and discrete scores align. Since the expected cost can be expressed as a convex combination of discrete scores, any sampled discrete hierarchy from the optimal probabilistic hierarchy will be optimal. This means that once we obtain the optimal probabilistic hierarchy, we can sample any discrete hierarchy from it, knowing that it will have the optimal Das or TSD score.
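Written out in our notation (a sketch of the argument, not the paper's exact statement):

```latex
% Exp-Das is a convex combination of discrete Dasgupta costs:
\mathrm{ExpDas}(\theta)
  = \mathbb{E}_{T \sim p_\theta}\bigl[\mathrm{Das}(T)\bigr]
  = \sum_{T} p_\theta(T)\,\mathrm{Das}(T)
% At a minimizer, every hierarchy T with p_\theta(T) > 0 must itself attain
% the minimal Das value, so any sampled discrete hierarchy is optimal.
```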

Comment: Section 4.3 says that "we can use a closed-form expression", but then it says "no known solution exists".
Response: We apologize for the confusion caused by the wording. The intention was to emphasize that while closed-form expressions might exist in theory, tractable solutions have yet to be discovered. The phrase "we can use a closed-form expression" should have highlighted the theoretical possibility rather than the practical availability. Currently, there are no known closed-form solutions for Exp-Das and Exp-TSD. Consequently, we use the sampling method.
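Concretely, the sampling method estimates the expectation from S hierarchies drawn from the current probabilistic hierarchy (a sketch in our notation):

```latex
% Monte Carlo estimate of the expected cost, with T_1, ..., T_S ~ p_\theta:
\mathrm{ExpDas}(\theta) \approx \frac{1}{S} \sum_{s=1}^{S} \mathrm{Das}(T_s)
% The estimator is unbiased; its variance shrinks as S grows, which is the
% quality/runtime trade-off mentioned elsewhere in the rebuttal.
```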

We again thank the reviewer for their feedback and hope that we have satisfactorily addressed their concerns. If there are any remaining remarks, we are happy to address them.

[1] Zügner et al. "End-to-end learning of probabilistic hierarchies on graphs." In International Conference on Learning Representations. 2021.

Comment

Thank you for your response. I concur with some of the other reviewers that the contribution appears incremental, given its similarity to the original method [1]. While the novelty of the method (though incremental) might be sufficient for NeurIPS, I believe the paper needs to better articulate the strengths of its method before it can be accepted. Therefore, I am maintaining my score as a borderline accept.

Comment

We thank the reviewer for their response and are happy to address any remaining concerns.

Author Response

We want to thank the reviewers for their valuable feedback and for acknowledging the clear writing (xKdT, AAtX, f9v5, YnBf), our extensive experiments (xKdT, f9v5, KKzQ, YnBf), and our methodological contribution including our theoretical analysis (xKdT, f9v5, YnBf).

We have addressed all comments and concerns in the individual responses provided under each review. In this general response, we want to summarize the new results and experiments we have conducted.

Additional Visualizations
In response to reviewer xKdT's suggestion, we have included additional visualizations by applying the t-SNE algorithm to the vector datasets. These visualizations display the ground-truth labels, the clusters inferred from the hierarchies, and their corresponding dendrograms. To enhance clarity, we aligned the inferred clusters with the ground-truth labels using the Hungarian algorithm, allowing for an intuitive comparison. These visualizations are available in the supplementary PDF.
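A hedged sketch of this visualization pipeline using standard tooling; the data, the cluster labels read off the hierarchy, and all names are placeholders rather than the authors' code:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.manifold import TSNE
from sklearn.metrics import confusion_matrix

def align_clusters(y_true, y_pred):
    """Relabel predicted clusters to best match ground truth (Hungarian)."""
    cm = confusion_matrix(y_true, y_pred)
    rows, cols = linear_sum_assignment(-cm)   # maximize total agreement
    mapping = {c: r for r, c in zip(rows, cols)}
    return np.array([mapping[c] for c in y_pred])

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 16))            # placeholder feature vectors
y_true = rng.integers(0, 3, size=300)     # ground-truth labels
y_hier = rng.integers(0, 3, size=300)     # clusters read off a hierarchy

embedding = TSNE(n_components=2, random_state=0).fit_transform(X)
y_aligned = align_clusters(y_true, y_hier)
# `embedding` can now be scattered twice, colored by y_true and y_aligned,
# for the side-by-side comparison described above.
```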

The additional qualitative evaluation demonstrates a strong alignment between the ground-truth labels and the clusters inferred by EPH. Furthermore, we observe that similarities in the t-SNE space are well-preserved within the dendrograms, indicating the effectiveness of our method.

Runtime Improvements
Following the recommendation of reviewer AAtX, we improved the runtime of EPH by parallelizing the computations of the different Gumbel samples. This optimization does not alter the results but substantially reduces the runtime. The following table shows the improved runtime (in seconds) on the graph datasets:

| Method | Polblogs | Brain | Citeseer | Genes | Cora-ML | OpenFlight | WikiPhysics |
|---|---|---|---|---|---|---|---|
| EPH | 3834 | 3402 | 2609 | 2848 | 3322 | 4419 | 6389 |
| EPH (parallelized) | 1496 | 1066 | 749 | 521 | 1196 | 1654 | 2325 |

We again thank all the reviewers and are confident that the feedback will help improve the revised manuscript. We are happy to address any remaining or upcoming concerns.

Final Decision

The reviewers generally felt positive about this paper -- the main concern is that the paper is in some sense incremental over prior work on FPH (citation [40] in the paper). The main idea is the same as in this prior work, but the approach looks at the expected cost over the distribution of hierarchies rather than the cost of the probabilistic hierarchy itself. This leads to some nicer theory, in which the global optima of the probabilistic and the discrete optimization tasks correspond. The reviewers were ultimately swayed to accept due to the potential practical relevance of the method, including the demonstration of improved runtimes included in the rebuttal. We hope that these additional results will be included in the final version of the paper.