Distortion-free and GPU-compatible Tree Embeddings in Hyperbolic Space
In this paper we propose a method for embedding trees in hyperbolic space by optimizing hyperspherical point separation and using floating point expansion arithmetic for maintaining GPU-compatibility.
Abstract
Reviews and Discussion
This paper presents a new hyperbolic embedding algorithm for tree data. The authors propose Delaunay tree embeddings with maximum separation (MS-DTE), in which the child nodes of each node are placed with maximal separation during construction, directly resulting in lower embedding distortion. To address the floating-point precision problem near the boundary of the Poincaré ball, a GPU-compatible multi-precision algorithm is proposed.
Strengths
1. The motivation of this article is very natural; tree-structure embedding in hyperbolic space does have the two problems mentioned.
2. The paper is well written and easy to understand. The theory of the article is very solid, and the precision problem is explained very well.
Weaknesses
1. This approach is aimed at hyperbolic embedding of tree structures, but does not seem able to handle general data (where there is no explicit tree structure, though often an underlying one).
2. The authors lack a discussion of algorithm complexity, especially whether the precision handling incurs a greater amount of computation.
Questions
- Can this precision-processing method be applied to general hyperbolic neural networks? As is well known, hyperbolic machine learning has two problems: precision error and difficult optimization.
- See other questions in Weaknesses.
Details of Ethics Concerns
No
We thank the reviewer for their positive feedback on the motivation, writing, and theory. The reviewer raises three points for discussion: embeddings beyond trees, algorithm complexity, and application to general hyperbolic networks. We address all three points below.
Embedding of trees and beyond. Indeed, our method is aimed at embedding trees into hyperbolic space. However, the approach generalizes to general graph data by first embedding the graph into a tree; this is also how graphs are embedded using the construction in (Sala et al., 2018). In Table 12 of the appendix, we provide a comparison on trees generated from graph data. We note that, when combining graph-to-tree and tree-to-hyperbolic algorithms, the overall distortion is a combination of the distortions incurred by each individual algorithm, making it difficult to assess the contribution of each separate algorithm to the overall error. Therefore, our experiments focus on embeddings of trees, to show how our method performs against the baselines without clouding the results.
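As a minimal illustration of this two-step pipeline, the sketch below uses a BFS spanning tree as a naive stand-in for a dedicated graph-to-tree method; the graph, library calls and names are illustrative only and not the configuration used in our experiments.

```python
import networkx as nx

# Naive graph-to-tree step: a BFS spanning tree preserves distances from the
# root but can distort other pairwise distances; dedicated constructions
# offer much stronger guarantees.
G = nx.karate_club_graph()       # any (ideally tree-like) graph
T = nx.bfs_tree(G, source=0)     # spanning tree rooted at node 0

# T can now be embedded with any tree-to-hyperbolic construction (e.g.
# MS-DTE); the distortions of the two steps compound.
print(T.number_of_nodes(), T.number_of_edges())
```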
Algorithm complexity. There are two components to discuss regarding the complexity. [1] The first is the added complexity that results from separating hyperspherical points using MHS. This is briefly discussed at the end of Section 3, but we will expand upon it here. Each time hyperspherical points have to be generated in step 6 of Algorithm 1, we check whether the degree of the node has been encountered before. If not, we run the optimization and cache the output; if it has, we simply reuse the cached hyperspherical points. The optimization itself is cheap to perform and the worst-case number of optimizations is bounded as shown in Theorem 1. Moreover, in practice we find that the actual number of optimizations is even lower, as shown by the short analysis in our response to Reviewer fcXk under "How frequently are caches applied?". As a result, the optimizations take a fraction of the total computation time. [2] The second is the added complexity due to the use of floating point expansion arithmetic. A complete analysis of both the accuracy and the complexity of this framework and its operations is given by (Popescu, 2017) and is far too extensive to include here. To summarize their results: the complexity is similar to that of mixed precision arithmetic with respect to the number of bits, while also resulting in similar accuracy. As such, the main difference between the two frameworks is the compatibility of FPEs with GPUs.
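To illustrate, here is a minimal sketch of the caching scheme, where `optimize_mhs` is a hypothetical stand-in for the MHS optimization of step 6 rather than our actual implementation:

```python
# Cache keyed on node degree: the optimization runs at most once per unique
# degree in the tree (the bound of Theorem 1) and is reused everywhere else.
cache = {}

def child_directions(degree, dim, optimize_mhs):
    """Return maximally separated unit vectors for a node of this degree."""
    if degree not in cache:
        cache[degree] = optimize_mhs(n_points=degree, dim=dim)
    return cache[degree]
```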
Broader applicability. Our approach has the potential for broader integration with hyperbolic networks. This does require the implementation of additional operations for FPEs, beyond the ones presented in the paper. We consider this a very interesting direction for future research and we have included it in the conclusions of the paper.
We hope our answers have addressed the reviewer's concerns and, given the positive feedback, that we have convinced the reviewer to raise their rating.
Thanks for your response! All my doubts have been resolved. I'm looking forward to your ability to apply this accuracy algorithm to a wider range of hyperbolic machine learning, which will help push the field forward. Hence, I have raised my score.
Embedding tree structures in hyperbolic space enhances knowledge representation, especially for hierarchies and ontologies. This paper addresses two key challenges: poor point separation and limited hardware compatibility. It proposes maximally separated Delaunay tree embeddings (MS-DTE) and floating-point expansion arithmetic, achieving lower distortion and efficient GPU use, which improves embedding quality in deep learning applications.
Strengths
- The paper is well-structured.
- The proposed method is well-justified.
- The method demonstrates strong performance improvements.
Weaknesses
- The experimental evaluation lacks comprehensiveness.
Questions
- In the abstract, "directly leads to lower embedding distortion" is mentioned; if the method only reduces distortion, what justifies the use of "distortion-free" in the title?
- The relationship between "DISTORTION-FREE," "GPU-COMPATIBLE," and performance in downstream tasks remains unclear.
- The impact of finding maximally separated points on a hypersphere for downstream tasks needs clarification.
- Experimental details are incomplete, as MHS requires training.
- The paper should include an analysis of computational complexity and overhead.
- Parameter analysis for the scaling factor is needed, as different values are used across tasks.
- Why is the MAP metric omitted from Table 3 and Table 4?
We thank the reviewer for their positive feedback regarding the structure, the motivation and the performance of the method. We have performed additional experiments as discussed below, which we hope addresses the concern raised by the reviewer. Moreover, we have answered their questions below.
More comprehensive evaluations. We have performed additional experiments where we study the effect of the embedding dimension. We have added results for varying embedding dimension in our response to reviewer fcXk and find that our method outperforms the baselines for those other embedding dimensions as well. These results have also been added to the appendix. Lastly, we also want to point out the comparisons on graph data in Table 12 of the appendix.
The use of "distortion-free". We agree with the reviewer and have changed the wording to "low-distortion", akin to (Sarkar, 2011). Only in the limit does this construction lead to completely distortion-free embeddings.
The relationship between low-distortion, GPU-compatible and downstream performance. The importance of low-distortion embeddings for downstream tasks depends on how the embeddings are incorporated into the downstream method. Most recent methods (Peng et al., 2021; Mettes et al., 2024; Long et al., 2020; Kolyvakis et al., 2020; Tifrea et al., 2018) make use of a prototypical deep learning approach. This is still an active area of research, but what these methods typically have in common is that they require low-distortion embeddings. This is motivated by the fact that, when using a tree, all of the information about how labels are related is stored in the edges (and their weights). If the embeddings suffer from high distortion, then much of this information is lost, rendering the embeddings effectively useless. The importance of GPU-compatibility stems from the deep learning on which all of these methods rely. When using mixed precision arithmetic for the embeddings, we end up with a representation that is incompatible with GPUs; as a result, we cannot use these directly for downstream tasks if we plan on using a deep learning method. The relation between low distortion, GPU-compatibility and performance lies in the scaling factor. As the scaling factor increases, the construction leads to smaller distortions, at the cost of placing embeddings further away from the origin. When we move away from the origin, we need more bits to represent points. So, to decrease the distortion of our embeddings, we need to increase the scaling factor, which in turn means that we need to use more bits. Increasing the number of bits while retaining GPU-compatibility is currently only possible by using FPEs.
The impact of maximally separated points on downstream tasks. In the original 2-dimensional construction, Sarkar (2011) has shown the relationship between several important variables: the scaling factor, the node degrees and the minimal separation between points in the hypersphere step. Those results are further generalized to higher dimensions by (Sala et al., 2018). A rigorous exposition of these results is perhaps excessive for this discussion, but the complete results can be found in Theorem 6 and Algorithm 1 of (Sarkar, 2011) and Appendix D of (Sala et al., 2018). In short, these state that the distortion can be decreased by increasing the scaling factor or by increasing the separation between hypersphere points in step 6 of Algorithm 1 as given in our paper. Therefore, if we achieve better hyperspherical separation, then the distortion will go down.
MHS optimization details. The experimental details that we use for optimization with MHS are given towards the end of Section 3. We have added a header to this paragraph to indicate that it describes the implementation details for optimizing MHS.
Computational complexity. There are two components to discuss regarding the complexity. [1] The first is the added complexity that results from separating hyperspherical points using MHS. This is briefly discussed at the end of Section 3, but we will expand upon it here. Each time hyperspherical points have to be generated in step 6 of Algorithm 1, we check whether the degree of the node has been encountered before. If not, we run the optimization and cache the output; if it has, we simply reuse the cached hyperspherical points. The optimization itself is cheap to perform and the worst-case number of optimizations is bounded as shown in Theorem 1. Moreover, in practice we find that the actual number of optimizations is even lower, as shown by the short analysis in our response to Reviewer fcXk under "How frequently are caches applied?". As a result, the optimizations take a fraction of the total computation time. [2] The second is the added complexity due to the use of floating point expansion arithmetic. A complete analysis of both the accuracy and the complexity of this framework and its operations is given by (Popescu, 2017) and is far too extensive to include here. To summarize their results: the complexity is similar to that of mixed precision arithmetic with respect to the number of bits, while also resulting in similar accuracy. As such, the main difference between the two frameworks is the compatibility of FPEs with GPUs.
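For intuition on why FPE operations map well to GPU hardware, the sketch below shows the classic two-sum error-free transformation, one of the primitives underlying expansion arithmetic (Popescu, 2017); this is a minimal PyTorch illustration, not our library implementation.

```python
import torch

def two_sum(a, b):
    # Knuth's error-free transformation: s + e equals a + b exactly, where
    # s = fl(a + b) and e is the rounding error. Every step is an ordinary
    # floating point operation, so it runs as standard (batched) GPU kernels.
    s = a + b
    bv = s - a
    e = (a - (s - bv)) + (b - bv)
    return s, e

a = torch.tensor(1e16, dtype=torch.float64)
b = torch.tensor(1.0, dtype=torch.float64)
s, e = two_sum(a, b)
print(s.item(), e.item())  # 1e16 and 1.0: the pair (s, e) holds the exact sum
```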
The scaling factor. Based on the theoretical results by (Sarkar, 2011; Sala et al., 2018), the scaling factor should always be chosen as large as possible without embedded points ending up outside the effective radius. Therefore, the scaling factor is essentially determined by the number of bits of precision and the longest path in the tree being embedded. For example, the x-axis of Figure 1b shows the number of bits of precision, but it could equivalently have shown the effect of the scaling factor, since the number of bits and the maximal admissible scaling factor are linearly related.
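As a back-of-the-envelope sketch of this linear relationship (with illustrative constants, not the exact derivation from the paper): a point at Euclidean norm $r$ in the Poincaré ball lies at hyperbolic distance $2\,\mathrm{artanh}(r)$ from the origin, and the largest norm distinguishable from 1 with $p$ mantissa bits is roughly $1 - 2^{-p}$, giving an effective radius of about $2\,\mathrm{artanh}(1 - 2^{-p}) \approx (p+1)\ln 2$.

```python
import math

def max_scaling_factor(mantissa_bits, longest_path_length):
    # Effective radius: hyperbolic distance from the origin up to which
    # points remain representable; it grows linearly in the number of bits.
    eps = 2.0 ** -mantissa_bits
    effective_radius = 2.0 * math.atanh(1.0 - eps)
    # The deepest node lands at distance ~ scale * longest_path_length,
    # which caps the admissible scaling factor.
    return effective_radius / longest_path_length

print(max_scaling_factor(53, 20.0))  # float64 mantissa, longest path 20
```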
Absence of the MAP metric in Tables 3 and 4. The carnivora and lichen trees are weighted, and the MAP is not properly defined for such trees. To make Table 3 more consistent, we decided to leave the MAP out for the other two trees as well, especially since the MAP for the constructive methods with FPEs is 1 for both trees, making the metric less interesting to show. The trees generated from the graphs in Table 4 (now 12) are also weighted trees, so here the MAP is undefined as well.
We hope that the additional experiments involving the embedding dimension resolve the reviewer's concerns and that our answers have provided more clarity.
The authors introduce maximally separated Delaunay tree embeddings to construct tree embeddings in hyperbolic spaces, particularly in the Poincaré ball model. Empirically, they show that their method improves upon existing methods. Additionally, they present a floating-point expansion arithmetic for tensors, allowing for increased calculation precision without losing the benefits of hardware acceleration.
Strengths
The authors present a series of interesting theoretical results with clear and well-written proofs. These results serve as the fundamental backbone of the work, making it well-written and cohesive.
Weaknesses
- Line 855 is not clearly understandable; there is likely a typo.
- Theorems 3 and 5 seem more straightforward than presented. It would be better to state them as propositions and briefly comment on their proofs.
Questions
- Could it be possible to elaborate a bit more on the third limitation (line 236)? I may have missed something, but it doesn't seem entirely clear based on the current text.
- Can you use isometries between hyperbolic spaces to study other manifolds? I think some properties may be preserved.
- Can you derive an extension of Theorem 1 by changing MHS? The proof may be similar.
We thank the reviewer for their positive comments on the theory and writing of the paper and for spotting some improvements that can be made. Below, we have answered the questions posed by the reviewer and we have made the proposed changes in the manuscript.
Typo in line 855. There was indeed a typo. We thank the reviewer for catching the mistake and we have updated the sentence.
On Theorems 3 and 5. We agree with the reviewer that these statements would be more appropriate in the form of a proposition and we have changed this in the manuscript accordingly.
Elaboration on third limitation of baseline (line 236). When placing a point on a vertex of an $n$-dimensional hypercube, there are $2^n$ options, so we can represent each option by a binary sequence of length $n$. For example, on a hypercube where each vertex has coordinates in $\{-1, 1\}$, each vertex is of the form $v = (\pm 1, \ldots, \pm 1)$, so we can represent $v$ as some binary sequence $s(v)$. The distance between two such vertices can then be expressed in terms of the Hamming distance $d_H$ between the corresponding sequences as $\|v - w\| = 2\sqrt{d_H(s(v), s(w))}$, which shows that points placed on vertices of a hypercube are maximally separated if this Hamming distance is maximized. This forms an interesting and well-studied problem in coding theory, where the objective is to find a set of binary sequences of length $n$ which have maximal pairwise Hamming distances. For some specific combinations of the number of sequences and the length $n$, optimal solutions are known, such as the Hadamard code. However, for most combinations, the solution is still an open problem (MacWilliams and Sloane, 1977). Therefore, properly placing points on the vertices of a hypercube currently relies on the solution to an unsolved problem, making it difficult to do in practice. We have added this discussion to the appendix.
MacWilliams, F. J., and N. J. A. Sloane. The Theory of Error-Correcting Codes. North-Holland Publishing Company, 1977.
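A quick numerical check of the distance identity above (an illustrative NumPy snippet):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
v = rng.choice([-1.0, 1.0], size=n)   # two random hypercube vertices
w = rng.choice([-1.0, 1.0], size=n)
hamming = np.sum(v != w)              # Hamming distance of the sign patterns
print(np.linalg.norm(v - w), 2 * np.sqrt(hamming))  # identical values
```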
Generalizing to other manifolds. Yes, it is certainly possible to use isometries between hyperbolic spaces to study other manifolds. Our method operates in the Poincaré ball model of hyperbolic space and the resulting embeddings can be mapped to any other model of hyperbolic space through the usual isometries. This maintains the pairwise distances of the embeddings, meaning that the quality of the embedding is preserved. Therefore, our approach can be used to generate embeddings for any model of hyperbolic space by simply applying an isometry at the end. However, it should be noted that some isometries may rely on operations for which new FPE routines need to be implemented. Therefore, we focus on the Poincaré ball throughout the paper.
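As a concrete example, below is a sketch of the standard isometry from the Poincaré ball to the hyperboloid (Lorentz) model; this is the textbook map, not code from our library.

```python
import numpy as np

def poincare_to_hyperboloid(p):
    # Maps p in the open unit ball to (t, x) with t^2 - ||x||^2 = 1;
    # hyperbolic distances between embedded points are preserved.
    sq = np.sum(p * p, axis=-1, keepdims=True)
    t = (1 + sq) / (1 - sq)   # time-like coordinate
    x = 2 * p / (1 - sq)      # space-like coordinates
    return np.concatenate([t, x], axis=-1)

p = np.array([[0.5, 0.0]])
h = poincare_to_hyperboloid(p)
print(h, h[0, 0] ** 2 - np.sum(h[0, 1:] ** 2))  # second value is 1.0
```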
Extension of Theorem 1. We can indeed extend Theorem 1 by changing MHS. In fact, the proof is independent of the optimization objective that is being used. The statement of the theorem would have been more appropriate without the mention of MHS. We have changed this accordingly. Thank you for pointing this out.
We again thank the reviewer for their feedback and we hope to have addressed their questions.
Thank you for the reply! My concerns are addressed.
This paper tackles two key challenges in embedding tree-structured data using hyperbolic geometry, a mathematical concept known for effectively capturing hierarchical relationships. Traditional combinatorial methods struggle with finding maximally separated points on a hypersphere, leading to poor separation and high distortion in embeddings. The authors introduce maximally separated Delaunay tree embeddings (MS-DTE), which optimize child node placement to reduce distortion. Additionally, they address the precision requirements for low-distortion embeddings, replacing multiple-precision arithmetic with floating-point expansions to ensure compatibility with GPU acceleration. MS-DTE offers a more accurate and hardware-compatible approach for hyperbolic embeddings, facilitating their use in deep learning.
Strengths
This paper tries to improve the hyperbolic embedding method for tree-like data in two aspects: lower distortion and higher precision. It demonstrates effectiveness through several experiments.
Weaknesses
My primary concern with this paper lies in its limited exploration of embedding dimensions, as all experiments are confined to fixed dimensions of 8 or 10. Hyperbolic spaces are indeed well-suited for embedding tree-like structures in low dimensions with minimal distortion, as shown by prior work such as Sala et al. (2018), which evaluated dimensions from 2 to 200. The restricted range of dimensions examined here leaves questions about the method's robustness across different dimensional settings and its performance at lower dimensions, which could highlight the embedding quality and distortion more clearly.
Moreover, in the specific experiment detailed in Table 1, the authors embed an 8-depth binary tree in a 10-dimensional space. Given the remark in lines 232-233 about point generation limitations in low-dimensional spaces, this experiment does not seem sufficient to validate these claims, as a binary tree should be well-represented in 10 dimensions without encountering major separation limitations. Additionally, as the node degree is 2 in this case, the proposed MHS in Equation 14 appears equivalent to Liu et al.’s (2018) approach in Equation 13, raising concerns about the distinct advantage claimed for this setting.
To strengthen the paper, I recommend conducting experiments across a wider range of dimensions, particularly in low dimensions, which would not only enable visualization but also demonstrate the effectiveness of the proposed GPU-compatible floating-point expansion approach. This expanded experimentation would provide a more comprehensive evaluation of the proposed method’s advantages and limitations.
Questions
- What is $s$ in Eq. 13?
- In lines 250-254, you discuss the limitation of Eq. 13. Is this a mere observation or a conclusion drawn from empirical analysis? Have you experimented with using Equation 13 as an objective function, and if so, what were the results?
- In Equation 14, is the objective minimizing the absolute angle value? Note this is equivalent to minimizing the geodesic distance between vectors. Why not instead minimize the cosine value between the vectors?
- You note that the effective number of nodes is lower in practice due to the high frequency of low-degree nodes, allowing cached hyperspherical points to be reused (lines 271-272). Could you provide more context (statistics on the datasets) on how frequently these caches are applied and their impact on computational efficiency?
- While the paper focuses on the embedding method itself, have you evaluated the utility of these embeddings in downstream tasks? For example, Nickel and Kiela (2017) demonstrated the effectiveness of their embeddings on link prediction. Any insights on potential downstream improvements would be helpful.
- Have you evaluated the proposed model on WordNet?
Additional questions
The parameter $s$ in Eq. 13. Thank you for pointing this out. The value $s$ is a nonnegative integer introduced in (Liu et al., 2018) which parameterizes the hyperspherical energy functions, so each value of $s$ yields a different hyperspherical energy function. Higher values of $s$ lead to higher penalties on highly similar points, akin to how the mean squared error and the mean absolute error relate to one another. We have added this information below the equation.
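For reference, the hyperspherical energy family of (Liu et al., 2018), written out for unit vectors $v_1, \ldots, v_N$ (up to normalization constants; this notation is ours):

$$
E_s(v_1, \ldots, v_N) = \sum_{i \neq j}
\begin{cases}
\lVert v_i - v_j \rVert^{-s}, & s > 0,\\
\log \lVert v_i - v_j \rVert^{-1}, & s = 0,
\end{cases}
$$

so larger $s$ weights the closest, most similar pairs more heavily.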
Limitations of Eq. 13. The statements on lines 250-254 are based on the results in Figure 1a and Table 1, where we see that the objectives in Equation 13 lead to lower minimum pairwise angles and that these objectives lead to higher distortion when compared to MHS. We have added a mention of the experiment to this discussion in the paper.
Minimizing cosine similarity between points instead of maximizing absolute angles. Based on the suggestion by the reviewer, we have recreated Figure 1b with the proposed objective added, where we minimize the maximum value of the cosine similarity instead of maximizing the minimum angle. This leads to mostly similar results, although we find that MHS is slightly more stable and, on average, leads to slightly better separation than the cosine-based approach. Moreover, as can be seen in the table for the binary tree above, the two objectives lead to similar distortions, with MHS leading to slightly lower worst-case distortion in two cases. We have included the ablation study in the appendix of the paper. Thank you for the suggestion.
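To make the comparison concrete, here is a minimal PyTorch sketch of hyperspherical separation by projected gradient descent under both objectives; the hyperparameters and names are illustrative and do not reflect our exact optimization setup. Since $\arccos$ is monotonically decreasing, the two objectives share the same optima and differ only in their optimization dynamics.

```python
import torch

def separate(n_points, dim, steps=2000, lr=0.05, use_angle=True):
    x = torch.randn(n_points, dim)
    x = (x / x.norm(dim=1, keepdim=True)).requires_grad_(True)
    opt = torch.optim.SGD([x], lr=lr)
    for _ in range(steps):
        cos = (x @ x.T) - 2.0 * torch.eye(n_points)  # mask self-similarity
        worst = cos.max()                            # most similar pair
        if use_angle:  # MHS-style: maximize the minimum pairwise angle
            loss = -torch.acos(worst.clamp(-1 + 1e-7, 1 - 1e-7))
        else:          # cosine variant: minimize the maximum cosine
            loss = worst
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():                        # project back to sphere
            x /= x.norm(dim=1, keepdim=True)
    return x.detach()

points = separate(8, 3)  # e.g. 8 child directions in 3 dimensions
```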
How frequently are caches applied? We can use a cached result each time we encounter a node with a degree that we have seen before. Therefore, the number of times that we cannot use a cached result and have to perform the optimization equals the number of unique degrees occurring in the tree (ignoring the leaf nodes). The table below shows, for each of the trees used in the experiments, this number of optimizations compared to the theoretical worst-case number of optimizations given by Theorem 1, alongside some other statistics. As can be seen, the number of optimizations that we perform is generally quite a bit lower than this upper bound. We have included this analysis in the appendix of the paper.
| Tree | Nodes | Unique degrees | Theoretical worst-case | Max degree | Longest path length |
|---|---|---|---|---|---|
| $n$-ary trees | Varying | 2 | Varying | Varying | Varying |
| Mosses | 344 | 11 | 38 | 16 | 51 |
| Weevils | 195 | 5 | 29 | 8 | 29 |
| Carnivora | 548 | 3 | 45 | 4 | 192.4 |
| Lichen | 481 | 3 | 48 | 4 | 0.972 |
Utility of the embeddings. The use of tree embeddings for downstream tasks is an active area of research, with most methods taking a deep learning approach using prototype learning (Peng et al., 2021; Mettes et al., 2024; Long et al., 2020; Kolyvakis et al., 2020; Tifrea et al., 2018). In order to apply our method, with FPEs, in such a setting, we need additional FPE routines beyond the ones presented in the paper, and an altered neural network architecture that allows for predicting FPEs. We consider this a very interesting research direction, but due to the additional theory required, we deem it outside the scope of the current paper. Experiments such as the link prediction experiment by (Nickel and Kiela, 2017) are aimed at graphs and are inappropriate for trees, since predicting an additional edge in a tree would break the tree structure.
WordNet. WordNet is not a tree, so in order to embed it using our approach, it must first be embedded into a tree, similar to (Sala et al., 2018). We attempted this first step using the approach by (Abraham et al., 2007), which led to a tree that was already heavily distorted with respect to the original WordNet graph. In other words, WordNet is not well described by a tree. Moreover, the graph-to-tree embedding method and the choice of root node for the tree heavily influence the distortion of the embedding. As such, it is difficult to compare our approach to, for example, (Sala et al., 2018) without access to the exact graph-to-tree embedding configuration, since any difference in distortion is a combination of both embedding steps. We therefore focus on tree-like data in our experiments.
We hope to have addressed the reviewer's primary concern and to have answered their questions. If there are any additional experiments of interest or if there are any further questions or concerns, please let us know.
Thank you for your responses.
There are two minor questions I would like the authors to address:
- It appears that the proposed method achieves almost the same results across different dimensions. Could you clarify why this is the case?
- The proposed method is applicable only to trees, not to other tree-like or hierarchical data structures (e.g., WordNet). This limitation should be explicitly stated.
Most of my other concerns have been addressed. I will raise my score later.
We thank the reviewer for their response and we are glad that the reviewer is willing to increase their score. We will address the two questions below:
Same results across different dimensions. This can be explained by the fact that, theoretically, past a certain dimension, a further increase in the dimension does not lead to a significant reduction in distortion. This is shown as part of the proof of Proposition 3.1 in (Sala et al., 2018). More specifically, they find that dimensions greater than the maximum node degree do not help reduce distortion. This can be observed in our experiments as well, for instance in the worst-case distortions for the $n$-ary trees (Table 4 of the previous response, or Table 6 in the manuscript). The 3-tree, 5-tree and 7-tree theoretically benefit from increasing the dimension up to 4, 6 and 8, respectively. Indeed, the worst-case distortion achieved by MS-DTE improves as expected for each of these trees and remains constant afterwards. The distortion remains constant because we find well-separated hypersphere points using MHS even in higher dimensions, whereas other methods, such as (Sala et al., 2018), struggle with separating points in higher dimensions, leading to an increase in distortion.
The applicability of the method to tree-like data. We agree with the reviewer that our approach is designed for tree data. We note that, similar to (Sala et al., 2018), our method can be used in conjunction with a graph-to-tree embedding method, such as (Abraham et al., 2007), to embed graphs into hyperbolic space. We show results on the trees that originate from embedding graphs into trees in Table 12 in the pdf, where our approach outperforms the baselines. In such cases, the distortions of the two embedding methods will be compounded, with the initial embedding determining a lower bound on the distortion that can be achieved in the second step. Therefore, the results will only be useful if the original graph is highly tree-like, allowing for a low-distortion embedding of the graph into a tree. We have updated the manuscript to state our focus and the option to generalize.
Thank you for your responses, I have raised my score accordingly.
The fifth and sixth tables contain the results for the phylogenetic trees for the methods using FPEs, akin to the last 3 rows of Table 3 in the paper. Here, we again observe that HypFPE + MS-DTE outperforms each method in nearly every setting. Moreover, we observe that for the carnivora and lichen trees, the results with the precomputed points (HypFPE + Sala et al. (2018)*) become worse as the dimension increases. This can be explained by the fact that these trees have a low maximum degree (shown in the table corresponding to the answer to question 4), meaning that an increase in dimension is not as beneficial for embedding these trees, while this increase in dimension does make the task of separating points on a hypersphere more difficult. It seems that the precomputed points are not as well separated for increasing dimension. On the other hand, MHS still leads to proper separation, even in higher dimensions, as evidenced by the results.
| | Mosses | | | | Weevils | | | | Carnivora | | | | Lichen | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dimension | 4 | 7 | 10 | 20 | 4 | 7 | 10 | 20 | 4 | 7 | 10 | 20 | 4 | 7 | 10 | 20 |
| HypFPE + Sala et al. (2018) | - | - | - | 0.09 | - | - | 0.07 | 0.07 | 0.04 | 0.04 | 0.04 | 0.04 | 0.12 | 0.12 | 0.12 | 0.12 |
| HypFPE + Sala et al. (2018)* | 0.06 | 0.10 | 0.08 | 0.10 | 0.03 | 0.05 | 0.05 | 0.10 | 0.01 | 0.03 | 0.03 | 0.06 | 0.05 | 0.10 | 0.11 | 0.19 |
| HypFPE + MS-DTE | 0.04 | 0.04 | 0.04 | 0.04 | 0.03 | 0.03 | 0.03 | 0.03 | 0.02 | 0.02 | 0.03 | 0.02 | 0.06 | 0.06 | 0.05 | 0.05 |
| | Mosses | | | | Weevils | | | | Carnivora | | | | Lichen | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dimension | 4 | 7 | 10 | 20 | 4 | 7 | 10 | 20 | 4 | 7 | 10 | 20 | 4 | 7 | 10 | 20 |
| HypFPE + Sala et al. (2018) | - | - | - | 1.10 | - | - | 1.09 | 1.09 | 6.76 | 6.76 | 6.76 | 6.76 | 43.4 | 43.4 | 43.4 | 43.4 |
| HypFPE + Sala et al. (2018)* | 1.36 | 1.21 | 1.14 | 1.16 | 1.25 | 1.12 | 1.11 | 1.13 | 3.50 | 4.06 | 4.87 | 13.0 | 4.73 | 5.44 | 6.42 | 36.0 |
| HypFPE + MS-DTE | 1.09 | 1.07 | 1.06 | 1.07 | 1.05 | 1.05 | 1.04 | 1.04 | 2.46 | 2.45 | 2.03 | 2.35 | 4.07 | 4.63 | 3.30 | 7.17 |
We conclude that our approach provides the best hyperbolic embeddings across all dimensionalities.
We thank the reviewer for their feedback. Below we show how our approach handles varying dimensionalities and answer the individual questions. We agree with the reviewer on the need for additional ablations on the embedding dimensionality and we believe this analysis further strengthens our contributions.
Additional ablations on dimensionality. In our submission, we chose 10 dimensions since the Hadamard baseline of (Sala et al., 2018) is often too restricted on smaller dimensions. Our approach can handle any choice of dimensionality. As pointed out by the reviewer, hyperbolic embeddings are well-suited in low-dimensional settings. To showcase our robustness, we replicated our experiments in Tables 1, 2 and 3 with 4, 7 and 20 embedding dimensions. The results are shown in the tables below, which have also been added to the appendix of the paper.
The first two tables show the average distortion and worst-case distortion, respectively, of the embedding of a binary tree of depth 8, akin to Table 1 of the paper. The cosine similarity method has been added to the tables in response to question 3 regarding the objective function, which is addressed below. We note that MHS has the best performance on both metrics for all configurations except two: with embedding dimension 4, one of the hyperspherical energy objectives achieves a slightly lower worst-case distortion, and with embedding dimension 20, the cosine similarity achieves a slightly better result. For lower embedding dimensions, the hyperspherical separation is slightly easier to perform, meaning that MHS's benefits compared to the hyperspherical energy objectives (Eq. 13) are slightly less important. However, overall we find MHS to lead to the best results. The MAP metric has not been reported because the hyperspherical energy objectives, the cosine similarity objective, and MHS already lead to a perfect MAP for each dimension.
| Dimension | 4 | 7 | 10 | 20 |
|---|---|---|---|---|
| Sala et al. (2018) | 0.734 | 0.734 | 0.734 | 0.734 |
| Sala et al. (2018)* | 0.235 | 0.502 | 0.361 | 0.726 |
| Eq. 13, setting 1 | 0.192 | 0.188 | 0.219 | 0.189 |
| Eq. 13, setting 2 | 0.190 | 0.196 | 0.204 | 0.190 |
| Eq. 13, setting 3 | 0.194 | 0.198 | 0.190 | 0.198 |
| Cosine similarity | 0.189 | 0.189 | 0.188 | 0.188 |
| MHS | 0.188 | 0.188 | 0.188 | 0.189 |
| Dimension | 4 | 7 | 10 | 20 |
|---|---|---|---|---|
| Sala et al. (2018) | 1143 | 1143 | 1143 | 1143 |
| Sala et al. (2018)* | 10.51 | 132 | 18.42 | 280.5 |
| Eq. 13, setting 1 | 1.655 | 1.625 | 1.670 | 1.640 |
| Eq. 13, setting 2 | 1.619 | 1.664 | 1.686 | 1.698 |
| Eq. 13, setting 3 | 1.666 | 1.687 | 1.642 | 1.680 |
| Cosine similarity | 1.636 | 1.637 | 1.635 | 1.633 |
| MHS | 1.632 | 1.623 | 1.635 | 1.631 |
In the third and fourth tables, we present the average distortion and the worst-case distortion, respectively, of the embedding of a 3-tree, 5-tree and 7-tree, akin to Table 2 of the paper. Here, we see that MS-DTE outperforms all methods on both metrics. The only exception is h-MDS, which attains a lower average distortion in some settings. This approach is however not useful, as it leads to infinite worst-case distortion due to leaf node collapse and, therefore, also to low MAP scores, as highlighted in Figure 2 of the main paper. The Hadamard code (Sala et al. (2018)‡) can only generate 4 points for dimensions 4 and 7, due to which it cannot be used to embed the 5-tree and 7-tree in those dimensions. The MAP metric is again not reported because all methods except h-MDS achieve a perfect MAP in all settings.
| | 3-tree | | | | 5-tree | | | | 7-tree | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dimension | 4 | 7 | 10 | 20 | 4 | 7 | 10 | 20 | 4 | 7 | 10 | 20 |
| Sala et al. (2018) | 0.09 | 0.07 | 0.03 | 0.01 | 0.18 | 0.05 | 0.04 | 0.03 | 0.16 | 0.13 | 0.03 | 0.02 |
| Sala et al. (2018)‡ | 0.11 | 0.11 | 0.11 | 0.11 | - | - | 0.12 | 0.12 | - | - | 0.12 | 0.12 |
| Sala et al. (2018)* | 0.08 | 0.08 | 0.09 | 0.14 | 0.10 | 0.12 | 0.13 | 0.18 | 0.12 | 0.12 | 0.13 | 0.17 |
| MS-DTE | 0.06 | 0.06 | 0.06 | 0.06 | 0.09 | 0.09 | 0.09 | 0.09 | 0.10 | 0.10 | 0.10 | 0.10 |
| | 3-tree | | | | 5-tree | | | | 7-tree | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dimension | 4 | 7 | 10 | 20 | 4 | 7 | 10 | 20 | 4 | 7 | 10 | 20 |
| Sala et al. (2018) | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Sala et al. (2018)‡ | 1.14 | 1.14 | 1.14 | 1.14 | - | - | 1.14 | 1.14 | - | - | 1.14 | 1.14 |
| Sala et al. (2018)* | 1.32 | 1.22 | 1.18 | 1.23 | 1.28 | 1.30 | 1.30 | 1.34 | 1.53 | 1.25 | 1.31 | 1.26 |
| MS-DTE | 1.07 | 1.07 | 1.07 | 1.07 | 1.14 | 1.10 | 1.10 | 1.10 | 1.14 | 1.13 | 1.12 | 1.12 |
This paper refines an algorithm for tree embeddings via projected stochastic gradient descent and improved floating point arithmetic. The authors identify an issue with an algorithm for uniformly distributing points on hyperspheres in the Poincaré ball, which is a crucial part of a classical tree embedding algorithm. They then compare their approach to other well-known methods and obtain consistently improved performance.
Strengths
- Fast implementation using projected gradient descent.
- Empirical efficacy
- The solution to the floating point arithmetic degeneracy is crucial due to the divisions of small numbers involved. I find it to be very interesting.
Weaknesses
- There are no empirical results for downstream tasks, despite the introduction claiming the importance of tree embeddings for downstream tasks.
- Code isn't available.
Questions
- Can you explain Figure 1 (b)?
We thank the reviewer for their positive comments on the method, its implementation and its efficacy. Below we have addressed the questions and concerns posed in the review.
Downstream tasks. Indeed, our empirical evaluation focuses on the tree embeddings themselves, not on downstream tasks. Works that focus on downstream tasks build hyperbolic embeddings on top of deep networks and have shown that better tree embeddings lead to better (hierarchical) classification (Peng et al., 2021; Mettes et al., 2024; Long et al., 2020; Kolyvakis et al., 2020; Tifrea et al., 2018). Our approach leads to lower distortion, while our floating point expansions (FPEs) ensure that we can still use the standard floating point operators required by modern deep learning libraries. Making the step towards downstream tasks does require further study, however, as current solutions do not use FPEs. We consider this step a fruitful direction for future research. To avoid confusion about our claims, we have made changes to the introduction and conclusions accordingly.
Code will be made public. All code will be made available. We will release two software libraries, one for arbitrarily-dimensional hyperbolic tree embeddings (where all current embedding algorithms are implemented under one umbrella for direct comparisons) and one for GPU-compatible floating point expansions. The code will be released on GitHub with the links shared in the final manuscript.
Figure 1(b). As we increase the number of bits used for our representation, we increase the effective radius of the Poincaré ball, i.e. the distance from the origin up to which we can represent points in hyperbolic space without numerical problems. This radius, in terms of the hyperbolic distance of points to the origin, grows linearly in the number of bits. The scaling factor that we use stretches the edges, which leads to lower distortion of the embeddings as discussed by (Sarkar, 2011; Sala et al., 2018), while also causing the embeddings to end up further and further from the origin as it grows. This means that we need more bits if we want to use larger scaling factors, and this relationship is linear. In Figure 1(b) we simply set the scaling factor to a value for which the embeddings end up close to this effective radius, so we could equivalently have used the scaling factor on the x-axis. As expected, as the number of bits, and therefore the scaling factor, increases, the distortion decreases. The elbow-shaped curve is due to the fact that subtree embeddings can overlap for smaller scaling factors, which leads to very large distortion. Once the scaling factor becomes large enough, these subtrees get disentangled, after which the distortion decreases more slowly.
We hope to have addressed the questions raised by the reviewer and we thank you for the feedback.
Thanks for the detailed response. I have maintained my supportive score.
We are glad to see the positive feedback regarding the method, theory, results, and writing of the paper. We want to thank the reviewers for their constructive feedback that will help strengthen the paper. The points raised by each reviewer are addressed below in the individual responses. Particularly, the ablations on dimensionality requested by reviewer fcXk highlight that our approach is robust across embedding dimensions. We have uploaded the improved manuscript with changes in the main paper shown in blue along with the rebuttal.
This paper introduces Maximally Separated Delaunay Tree Embeddings (MS-DTE) and a floating-point expansion arithmetic (FPE) technique for hyperbolic tree embeddings. MS-DTE optimizes child node placement to be maximally separated which leads to a lower embedding distortion. FPE ensures compatibility with GPUs, overcoming the precision challenges inherent in hyperbolic embeddings while maintaining computational efficiency.
The paper receives overall positive ratings and I agree with the reviewers that the FPE makes a nice contribution with great practical relevance, robust theoretical underpinnings, and strong performance improvement.
However, I found limited novelty in the other main contribution, namely the maximal hyperspherical separation in Eq. (14). Here, the goal is to "generate uniformly distributed points on a hypersphere" by "maximizing this minimal angle", which corresponds to the well-established Tammes problem [1] that has been extensively studied for decades [2] with numerous applications in machine learning [3, 4]. The absence of any discussion of this connection not only reduces the novelty of the submission, but also compromises its quality.
For example,
- Eq. (13) is a generalization of the Tammes problem (as discussed in [5]), so the authors' critique of (13) in favor of (14) becomes unjustified.
- The validity of the method comes with a requirement in Eq. (11), which could have been substantiated using known theoretical results on Tammes problem [6].
- A set of vectors drawn at random from the unit sphere may already be pairwise separable and satisfy Eq. (11) with a high probability (e.g. by using well-known bounds on the area of spherical cap). This may serve as a naive baseline for the proposed method.
Given the entire absence of discussion about the Tammes problem, the paper’s novelty and quality are significantly compromised. While the FPE component is still of great importance, I would place this paper in the borderline rejection category and strongly encourage the authors to revise their submission by addressing these limitations.
[1] https://en.wikipedia.org/wiki/Tammes_problem
[2] https://cohn.mit.edu/spherical-codes/
[3] Mettes, Pascal, et al. "Hyperspherical Prototype Networks." NeurIPS 2019.
[4] Liu, Weiyang, et al. "Learning with Hyperspherical Uniformity." AISTATS, PMLR, 2021.
[5] Liu, Weiyang, et al. "Learning towards Minimum Hyperspherical Energy." NeurIPS 2018.
[6] Jiang, Jiachen, et al. "Generalized Neural Collapse for a Large Number of Classes."
Additional Comments on Reviewer Discussion
Overall, the author rebuttal satisfactorily addressed major concerns from the reviewers. A few points are summarized below.
Concern: The method is designed specifically for tree structures, limiting its applicability to non-tree or hierarchical data like WordNet.
Response: The authors explained that MS-DTE could generalize to graph-like data using graph-to-tree embedding techniques, though experiments were not included.
Concern: Computational complexity of MS-DTE and FPE.
Response: The authors provided a partial analysis, stating that FPE's complexity is comparable to mixed-precision arithmetic. However, reviewers noted the need for more detailed analysis.
Concern: Lack of downstream task evaluation.
Response: The authors acknowledged this gap, arguing that low-distortion embeddings are beneficial for downstream tasks but left experimental validation for future work.
Concern: Limited exploration of embedding dimensions.
Response: The authors added ablations demonstrating robust performance across different dimensional settings.
Concern: Misleading title regarding "distortion-free."
Response: The authors revised the title to "low-distortion," addressing concerns about overstatements.
Reject