PaperHub
Overall: 5.0/10 · Poster · 4 reviewers (min 4, max 6, std. dev. 0.7)
Individual ratings: 6, 4, 5, 5
Confidence: 3.0 · Correctness: 3.0 · Contribution: 3.0 · Presentation: 2.5
NeurIPS 2024

Hyperbolic Embeddings of Supervised Models

OpenReview · PDF
Submitted: 2024-05-13 · Updated: 2024-11-06
TL;DR

A full-fledged solution to embed supervised *models* in hyperbolic geometry, and more

Keywords

Hyperbolic geometry, supervised model embedding, decision trees, boosting

Reviews and Discussion

Review (Rating: 6)

Summary: This paper proposes a novel approach to embed supervised models, specifically decision trees and their ensembles, in hyperbolic space using the Poincaré disk model. The paper contributes three main advancements: linking loss functions for class probability estimation to hyperbolic embeddings, providing a method for unambiguous embedding of decision trees by extracting their monotonic subtrees, and introducing a tempered integral to improve the visualization and numerical accuracy of embeddings near the border of the Poincaré disk.


Strengths

  1. The approach to embedding supervised models, particularly decision trees, in hyperbolic space is novel. The focus on supervised models rather than just the data is a significant contribution.
  2. This paper provides a robust theoretical framework linking class probability estimation losses with hyperbolic embeddings and introducing the concept of monotonic decision trees (MDTs) for clean embeddings.
  3. The introduction of the tempered integral to handle numerical issues near the border of the Poincaré disk demonstrates a deep understanding of the challenges in hyperbolic geometry and offers a principled solution.
  4. This paper is well-structured, with clear definitions, thorough explanations of the proposed methods, and detailed experimental results. The figures and tables effectively support the textual descriptions.

Weaknesses and comments

  1. The theoretical aspects, particularly the tempered integral and its application, might be challenging for practitioners to fully grasp and implement without further guidance or examples.
  2. While the paper demonstrates the applicability to decision trees and their ensembles, it would benefit from a discussion on how the methods could be adapted or extended to other types of supervised models.
  3. While the experiments are thorough, including more diverse use cases, such as different types of datasets or real-world applications, would strengthen the demonstration of the method's versatility and robustness.
  4. The related work section is comprehensive, covering both hyperbolic embeddings and supervised learning, which helps to contextualize the contributions.
  5. The methodology sections are detailed and well-explained. The separation into three main contributions helps in understanding the different components of the proposed approach.

Questions

N/A

Limitations

N/A

Author Response

We thank the reviewer for a review whose strengths (1-3) perfectly capture our key contributions, in particular the fact that we embed models, not data.

We would like to highlight a point common to all three weaknesses mentioned (1-3): they all come down to fitting substantial material into a space that was extremely constrained at submission time. Fortunately, there will be a +1 page in the camera-ready. We believe we could at least partially cover (3), on more diverse use cases, assuming our proposals [t8YB-B][t8YB-C] can be labeled as such (see above).

Weakness (2) (extension to other supervised models) could be partially addressed in conclusion: for example, one can use our technique to embed branching programs without any modification (apart from a few tweaks in the Sarkar part, because we replace binary trees by directed acyclic graphs).

Weakness (1) would require a lot more space to deliver the reviewer’s suggestions, we are afraid. We would propose to use part of the +1 space to slightly expand Section 5, making a better connection with Figure 3 (right).

Comment

"I agree with the other reviewer's comments about the evaluation of contributions. The datasets used are indeed not very large. I am maintaining a positive score

Comment

We thank the reviewer for those comments and the decision to maintain their positive score.

We would just like to emphasize that we believe one of our key contributions (as the reviewer rightfully pointed out in point 3 of the Strengths) goes past our application and lies in the material introduced in Section 5.

We paraphrase here the last comments we lodged with reviewer Ry4X on the fact that it is important to judge our contribution past the scope of its application, as what we develop in Section 5 can work universally for any application in the Poincaré disk. But there is more, since our solution relies on a new core tool, extending the Leibniz-Newton fundamental theorem of calculus, which can also be applied to other problems. In hyperbolic geometry first: our appendix grounds a proof that it can also be used in the Lorentz model (C.V). The claim that our tool can be applied more broadly comes from observations that it leads to natural extensions of many properties known for the Riemann integral (C.III).

Regarding dataset size, we would like to emphasize that changing the size of the data (e.g., using much bigger domains) would not bring anything substantial, because (i) our algorithm getMDT depends only on the model learned and not on the data used to learn it (a DT with X nodes learned from 1K or 1M examples leads to the same embedding complexity), and (ii) our representation likewise does not depend on the training size (a specific DT obtained from a 1K or 1M sample is embedded in the same way).

Review (Rating: 4)

This paper introduces a method to embed decision trees into hyperbolic space, linking class probability estimation with hyperbolic distances. It proposes extracting monotonic subtrees and using a "tempered" integral for better visualization. The approach aims to enhance model interpretability and analysis without changing model performance.

Strengths

The paper presents an approach to embedding decision trees in hyperbolic space, offering a novel perspective on model visualization and interpretation. By leveraging the properties of hyperbolic geometry, particularly the Poincaré disk model, the authors provide a mathematically grounded method for representing complex tree structures and confidence levels in a unified, visually intuitive manner, potentially enhancing our ability to analyze and understand sophisticated machine learning models.

Weaknesses

Limited empirical validation: The authors present their method primarily on small-scale UCI datasets, which may not fully demonstrate the scalability and effectiveness of their approach on large, real-world machine learning problems. The lack of comparison with existing tree visualization techniques, such as those in popular packages like scikit-learn or XGBoost, makes it difficult to gauge the practical advantages of this method.

Potential computational overhead: The paper does not adequately address the computational complexity of transforming decision trees into their hyperbolic representations. For large ensembles or deep trees, this transformation could potentially introduce significant overhead, which might limit its applicability in time-sensitive or resource-constrained environments.

Ambiguity in interpretability gains: While the authors claim improved interpretability, they do not provide a quantitative measure or user study to substantiate this claim. It remains unclear whether domain experts or data scientists would find these hyperbolic representations more intuitive or actionable than traditional tree visualizations.

Questions

How does the method perform on regression trees as opposed to classification trees? Are there any significant differences in the representation or interpretation?

How does this hyperbolic embedding method perform on highly imbalanced datasets, particularly in visualizing minority class decision boundaries?

Limitations

NA

Author Response

We thank the reviewer for providing a uniform “good” rating on all three paper metrics, highlighting that our approach is novel, our method is mathematically grounded and visualization is intuitive.

Weaknesses

The authors present their method primarily on small-scale UCI datasets [...] lack of comparison with existing tree visualization techniques, [...] scikit-learn or XGBoost,

[Ry4X-A] We believe there is a misunderstanding here; see [ALL-2] above. The software packages suggested work in the same way as the one we quote in our footnote "*" (bottom of page 2). This is not a criticism: all packages proceed in the same way – embed a graph (a rooted binary tree), then superimpose the classification information. The only differences between packages are "cosmetic" details in how the full models are represented: charts, pies or boxes for local class probabilities, added text / color, etc. We write "cosmetic" because it all leads to the same conclusion once one realizes what we write in [ALL-1]: it is largely impossible to represent a tree with 200+ nodes (as we use in our experiments) and get any meaningful information about how classification changes as we descend the tree, how nodes / leaves in different parts of the tree compare in terms of confidence, etc. Monotonic Decision Trees (which we introduce) do not solve the problem, because the issue is not in the "cosmetic" details (that is why we use the term "cosmetic"): the issue is that the backbone comes from the graph-theory representation, which is suboptimal for grasping models that big when applied to ML. We believe the hyperbolic embedding is very helpful in addressing the issue. We refer the reviewer to the attached file above to see for themselves.

[...] Potential computational overhead: The paper does not adequately address the computational complexity of transforming decision trees into their hyperbolic representations. [...]

[Ry4X-B] We would be happy to add in the camera-ready (using the +1 page) the fact that the complexity of computing the MDT is linear in the size of the tree (DT: see algorithm getMDT in our paper; class method PROCESS_TREE_GRAPHS in file MonotonicTreeGraph.java), and that our variant of Sarkar's algorithm is as cheap as Sarkar's, so there is no computational overhead compared to a standard visualization of a DT.

[...] Ambiguity in interpretability gains: While the authors claim improved interpretability, they do not provide a quantitative measure or user study [...]

[Ry4X-C] While we hope that our argument in [ALL-2] makes the case for “not small” models at least, we emphasize that the whole trend of hyperbolic geometry in ML has been built from the mathematical properties and benefits of tree/hierarchical representations in the model. All the papers that we cite (either published in ICML or NeurIPS) make this case. We quote, in order of our reference numbering, the first ones:

[6]: “[...] hyperbolic embeddings [...] are more aligned with the geometry of trees than standard Euclidean embeddings [...]”;

[7]: “[...] Learning representations of data in hyperbolic spaces has recently attracted important interest in Machine Learning (ML) due to their ability to represent hierarchical data with high fidelity in low dimensions [...]”

[8]: “[...] The adoption of hyperbolic geometry for graph embeddings has sparked a vibrant and rapidly growing body of machine learning research [...] driven by the compelling advantages offered by hyperbolic spaces, particularly in capturing hierarchical and tree-like structures [...]”

(And so on.) The motivation for our approach is no different from theirs.

Questions

[...] How does the method perform on regression trees as opposed to classification trees? [...]

One would have to change the first step (the node embedding) of our approach because the losses we use are fundamentally for supervised classification. Then, the rest could be just about the same because of the similar tree structure (getMDT, modified Sarkar).

[...] How does this hyperbolic embedding method perform on highly imbalanced datasets, particularly in visualizing minority class decision boundaries? [...]

We suspect there is a misunderstanding here: we do not represent data but a model; therefore there are no such boundaries, because no "training point cloud" is mapped. There is, however, an effect of class imbalance, which (predictably) moves the root of the tree far from the center of the disk. Thus class imbalance can be spotted "at lightning speed"; see L273-L274.

Comment

Thank you for your response. I find the linearity of your algorithm unclear, and the experiments seem limited. Although you claim improved interpretability, there is no quantitative measure or case study.

Comment

We thank the reviewer for these last comments and are happy to reply to them:

  • Linearity of our algorithm: it comes from the fact that each node is processed exactly once (steps 1, 8) and each processing takes O(1) time (since it amounts, in the worst case, to tagging 1 node, adding 1 node, adding 1 arc, and updating 1 interval (I)); see the sketch after this list.

  • Re: interpretability. We maintain that our models are easily interpretable because the position of nodes in the disk immediately indicates whether they are good for classification. We also have a measure of how accurate the overall tree display is (vs. having to read all confidences displayed as reals in a classical graph-based display): the error ρ shows that, in all our experiments, the display is very faithful to the "hidden" numbers.
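As promised above, here is a minimal sketch of the complexity argument (generic Python; this is not the authors' getMDT, whose actual per-node bookkeeping is described in the paper and implemented in MonotonicTreeGraph.java). A single pass visits each node exactly once and does constant work per visit, hence linear total time:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def count_single_pass(root: Node) -> int:
    """Each node is pushed and popped exactly once; the per-node work is
    O(1) (for getMDT: tag a node, add a node, add an arc, update an
    interval), so the whole pass runs in O(|tree|) time."""
    stack, visits = [root], 0
    while stack:
        node = stack.pop()
        visits += 1  # the O(1) per-node bookkeeping would happen here
        if node.left is not None:
            stack.append(node.left)
        if node.right is not None:
            stack.append(node.right)
    return visits

# A three-node tree is processed in exactly three O(1) steps.
assert count_single_pass(Node(Node(), Node())) == 3
```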

We would also like the reviewer to factor into their judgement the fact that our theory in Section 5 approaches a longstanding problem which, alone, has motivated papers [19]. Our solution can work universally for any application in the Poincaré disk (we discuss this in our reply to reviewer j9dG). To support the claim that this theory can indeed be broadly useful, our appendix grounds the proof that it can also be used in the Lorentz model (C.V), and Sections C.III and C.IV drill even deeper to show that our core tool, extending the Leibniz-Newton fundamental theorem of calculus, leads to natural extensions of many properties known for the Riemann integral.

Review (Rating: 5)

This work proposes a hyperbolic implementation of decision tree models, providing a link between class probability estimation and hyperbolic distances. In addition, the authors identify key objectives/criteria for enabling "clean" hyperbolic embeddings, achieved by keeping the monotonic part of the tree and thus avoiding shared root nodes at the origin of the Poincaré ball. The authors also address the inherent collapse of embeddings to the boundary of the ball by introducing a t-self method, which also improves the qualitative visualisations of embeddings on the Poincaré ball.

Strengths

  • This work addresses many fundamental issues in hyperbolic learning presenting some original concepts linking class probability estimation and hyperbolic distances and addressing boundary collapse, thus presenting an interesting direction and findings for the community.
  • The proposed t-self method addresses a key problem across hyperbolic distance metrics that seemingly helps avoid collapse to the boundary of the ball.
  • Significant theoretical proofs are provided to support many of the claims and assumptions of the work.

Weaknesses

  • The claim that supervised models have not been embedded in hyperbolic space is not clear to me. Some examples of fully hyperbolic classification models include [1,2]. I may be misunderstanding; if so, correct me. In addition, to clarify this point, the authors could provide some evidence to back up this claim, as it is currently weak. Moreover, supervised hyperbolic geometry is well established, contrary to what the authors claim, and hyperbolic decision trees have previously been proposed in [2].
  • The presentation and writing overall are generally poor and hard to follow, notably the bold headings in Section 4. The use of mathematical notation is poorly integrated in some settings and largely impedes the readability of the work.
  • The experimental results are demonstrated purely as visual comparisons, and no quantitative measures are presented.
  • Arguably, the t-self method proposed is a distance metric that helps avoid collapse to the boundary of the ball. While it helps visualisation, it should not be positioned in the contributions as a method for visualisation.
  • Experimental/implementation settings are omitted from the work, limiting the ability to replicate it.

Minor:

  • The presentation is poor in many places, notably the experimental analysis, and the style changes significantly throughout the work.
  • The related work is missing key works and is generally vague in relation to the problem statement.

General grammar and writing:

  • Line 61
  • Line 106: is this incomplete?
  • d^* and d_* are used interchangeably.

[1] Doorenbos, L., Márquez-Neila, P., Sznitman, R. and Mettes, P., 2023. Hyperbolic Random Forests. arXiv preprint arXiv:2308.13279.

[2] Chlenski, P., Turok, E., Moretti, A. and Pe'er, I., 2023. Fast hyperboloid decision tree algorithms. arXiv preprint arXiv:2310.13841.

Questions

  • On Line 60 you insist that these models do not represent models in hyperbolic space; why? This needs elaborating.
  • How does the proposed method compare to Euclidean models? What is the practical benefit of employing your method?
  • You suggest the t-self may be useful as a standard encoding in hyperbolic space; have you made any empirical findings to support this?
  • The limitations of only embedding the monotonic DT are not discussed. It may assist clean embeddings, but what are the drawbacks?

Limitations

Some limitations are questioned above, while broader impacts of the work are addressed.

Author Response

We really appreciate the comment that our method “[...] addresses many fundamental issues in hyperbolic learning presenting [...]”. This is a strong selling point for our method, which unfortunately comes with a challenging presentation problem. We hope our rebuttal helps in clarifying views and questions.

We also really appreciate the comment that one part of our contribution (the t-self) indeed addresses a general, “[...] key problem across hyperbolic distance metrics that seemingly helps avoid collapse to the boundary of the ball [...]” (indeed, it avoids collapse).

Weaknesses

“[...] It is not clear [...] Some examples of fully hyperbolic classification models include [1,2]. I may be misunderstanding [...]”

[j9dG-A] Notations and naming can be misleading. Indeed, there is a misunderstanding: please have a look at Figure 4 in [1] and Figure 2 in [2]. The legend makes it clear that it is data which is indeed embedded. Note that some details of the model can be embedded as well, like decision boundaries (Figure 2 in [2]), but at the end of the day, the hyperbolic space is the data’s in those papers, while it is the model’s in ours. We also refer to [t8YB-A] for a limitation of [1][2] in the data space.

“[...] The presentation and writing overall [,,,] impedes the readability of the work. [...]”

We apologize for the inconvenience. We believe that we can make use of the +1 page in the camera ready to expand some of the technical material’s explanation in the main file.

“[...] The experimental results are purely demonstrated as visual comparisons, and no quantitive measures are presented. [...]”

There are actually quantitative measures of the quality of the embeddings displayed on each figure (the error ρ), but we do understand the point made: to perhaps instead (or in addition) provide a separate quantitative analysis. We have made such a proposal, which would be straightforward to implement using our code, in [t8YB-B][t8YB-C].

“[...] Arguably the t-self method you propose to improve visualisation [...] should not be positioned in the contributions as a method for visualisation.[...]”

That is right: it is a transformation that can be used in any application of the Poincaré disk model. We were hoping that our remarks in L305-L308 would help convey this. To improve this part's readability, we would be happy to include part of it in the introduction as well, using part of the +1 camera-ready page.

“[...]Experimental/implementation settings are omitted from the work limiting the ability of replication[...]”

We strongly disagree on this point: the code we provide contains a UCI dataset and the resource file to run experiments exactly as we did for this dataset (everything is explained in the README). Using another dataset requires changing just one line in the resource file ("@PREFIX,abalone").

On the minor comments, we appreciate the suggestions and are sorry that the small space allocated probably caused potential misunderstandings (e.g., L106 goes with L113 for the boldface statement); we are confident that the +1 page will allow us to improve readability.

Questions

“[...]On Line 60 you state insist that these models do not represent models in hyperbolic space[...]”

This is related to [j9dG-A] above and we refer to it. In short, in all those papers, the hyperbolic space is the data’s in those papers, while it is the model’s in ours.

“[...]How does the proposed method compare to Euclidean models? [...]”

The substantial benefits come from the properties of hyperbolic space and are common to all papers using such geometric models. See for example [t8YB-C] for a concrete example, which was also part of a paper [29], albeit in a different context. There is, however, an additional benefit in our case, which comes from the singular property that there is a relationship between the link function of the log loss and the Poincaré distance, which makes a tree node embedding not just natural: it provides a direct visualization of classification confidences in the disk! (Though simple, we have not seen this connection elsewhere.)
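To make this kind of connection concrete, here is a small numerical illustration (our own sketch: the placement rule r = 2p - 1 is a hypothetical choice for illustration, not necessarily the paper's exact construction). Under this rule, the Poincaré distance from the origin to a node predicting class probability p >= 1/2 equals the log-odds log(p/(1-p)), i.e., the log-loss link, so radial position directly reads off classification confidence; it also shows why a heavily imbalanced root (p far from 1/2) lands far from the center, as noted in the reply to reviewer Ry4X.

```python
import numpy as np

def dist_to_origin(r: float) -> float:
    # Poincare-disk (curvature -1) distance from the origin to a point
    # at Euclidean radius r: d_B(0, z) = log((1 + r) / (1 - r)).
    return float(np.log((1.0 + r) / (1.0 - r)))

# Hypothetical placement rule, for illustration only: a node predicting
# class probability p >= 1/2 sits at Euclidean radius r = 2p - 1; its
# hyperbolic distance to the origin then equals the log-odds of p.
for p in (0.5, 0.75, 0.9, 0.99):
    r = 2.0 * p - 1.0
    print(f"p={p}: d_B(0, z)={dist_to_origin(r):.4f}, "
          f"log-odds={np.log(p / (1.0 - p)):.4f}")
```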

“[...]You suggest the t-self may be useful as a standard encoding in hyperbolic space, have you made any empirical findings to support this? [...]”

Excellent question: at the moment, with our paper cramming in so much content, the sole observation we have is experimental in our context (the remark appears in L292-L293), and its usefulness hopefully becomes clear in Table 1. We use the conditional (might) instead of may because we suspect a much more thorough analysis can be carried out, which would require digging not just into experiments but also into various encodings and additional geometric properties of the Poincaré disk model, to establish the potential for a broader standard. This is out of the scope (and size!) of our paper.

“[...]The limitations of only embedding the monotonic DT are not discussed? It may assist for clean embeddings but what are the drawbacks?[...]”

As we understand the question, we see three interpretations of "limitations" that specialize into three different questions:

  • Are there limitations in terms of accuracy for models (MDT vs DT)? We kindly refer the reviewer to Appendix Section D.II, Line 689 and Table II, where the effects of reducing a DT to its monotonic form are explored. We find that in most cases the accuracy / behaviour of the MDTs is similar to that of the original DTs. Note that we suspect there is a formal rationale for this observation, but it is out of the scope of this paper.

  • Are there limitations in terms of computational cost for creating the MDT? The answer is clearly no, see [Ry4X-B].

  • Are there limitations for the user, any pricey "entry ticket" for using MDTs (vs DTs)? This question is about the ease of using MDTs vs DTs. Knowing DTs, the entry ticket to MDTs would arguably be minimal compared, e.g., to that of kernels.

Comment

I thank the authors for their time and effort in their rebuttal. I appreciate the clarity on some points and the answering of questions.

Regarding the understanding of prior works not embedding the model in hyperbolic space, can you explain further, as I may be misunderstanding your point. From my understanding a fully hyperbolic network operates with fully hyperbolic operators, parameters lie in hyperbolic space, and thus the learned transformations and decision boundaries are hyperbolic. How would such a model not be defined as hyperbolic? If I have misunderstood something please do correct me.

Thank you for the clarity on quantitative measures, I would emphasise that these be made more clear in any revised manuscript.

My apologies for not seeing the hyperparameter details presented in the code; I accept that this weakness is no longer valid as part of the review. I would, however, like to see some further implementation details presented in the appendix of a revised manuscript for cohesion.

Thank you for answering questions raised, the responses appropriately address my points.

For now I am maintaining my score until further clarity can be provided for my above question. I am happy to engage further with the authors and other reviewers, and if satisfied will be happy to alter my score at a later time.

Comment

We thank the reviewer for this positive official comment. We quote the question raised therein:

Regarding the understanding of prior works not embedding the model in hyperbolic space, can you explain further, as I may be misunderstanding your point. From my understanding a fully hyperbolic network operates with fully hyperbolic operators, parameters lie in hyperbolic space, and thus the learned transformations and decision boundaries are hyperbolic. How would such a model not be defined as hyperbolic?

The reviewer's initial questioning was related to tree-based classification [j9dG-A]. Here, the question is specific to hyperbolic networks, and from its formulation it seems to revolve around the ideas of the paper "Hyperbolic Neural Networks" by Ganea et al., NeurIPS 2018 (our ref. [15]). Indeed, this work replaces the Euclidean operator "primitives" with hyperbolic ones inside the model. The new model is indeed hyperbolic and thus "natively" performs a hyperbolic embedding of the data. There is, however, no such thing as a hyperbolic embedding of the neural network (model/architecture) in that paper (for example, their Figure 2 does not represent a neural network: it represents a separation in a data space, in the same way as Figure 2 of the reviewer's ref. [2] does for trees). What we do is embed models. An "equivalent" for neural networks would embed the network's architecture in the Poincaré disk (though we cannot tell how this would be achieved).

We hope this answers the question and contributes to showing the originality / novelty of our work. We are, of course, open for further discussions.

Comment

Thank you for the clarification.

Given that some of my weaknesses have been addressed and my questions answered, I have raised my score. However, in line with the other reviewers, there are considerable empirical evaluation and presentation concerns left to be addressed by the authors in any future revision.

Review (Rating: 5)

This paper presents an approach to embed models (not data) into hyperbolic spaces. Hyperbolic spaces are especially suited to embeddings of hierarchies, but recent works have so far focused on embeddings of data. The paper presents a framework consisting of measures to estimate the confidence of a model's predictions and a scheme to embed the decision tree model into hyperbolic space, together with a postprocessing procedure. The authors evaluate the proposed approach and observe that more confident and accurate predictions tend to be closer to the border of the Poincaré ball.

Strengths

  • The paper is well written.
  • The approach appears very original to me; I am not familiar with methods similar to the presented work. I also find the angle of embedding the models (and not the data) interesting and non-trivial.
  • The approach is well motivated and backed up by theoretical explanations provided by authors. The mathematical justifications are rigorously presented in Appendix.
  • In my opinion, methods for embedding discrete data into hyperbolic space are less covered in the literature, so I believe that adding more methods covering this topic would benefit the community. The authors show how exactly their approach differs from the baseline by Sarkar.

Weaknesses

  • Some visualizations are hard to interpret. Please also see a remark in 'Questions'.
  • In my opinion, the paper would have benefited from a clearer presentation of quantitative results. E.g., there is a discussion of the embedding error in Section 6, but the numbers listed in the text do not seem to cover all the datasets considered in the analysis. Another table that seems to provide quantitative results is Table II in the Appendix, showing the difference between a DT and an MDT (its reduced hyperbolic counterpart). Again, I find it challenging to interpret.

Questions

  • A small remark: most of the plots use red/green colours, which can be challenging to distinguish for colour-blind people. If possible, I would encourage the authors to change the colour scheme, or at least keep this in mind for future submissions.
  • The sentence in line 697 is unclear.

My main concern regarding the paper relates to the difficulty of evaluating the performance of the approach, and it mainly justifies my rating. However, I am open to discussions with the authors.

Limitations

The authors have adequately addressed the limitations. I see no potential negative societal impact of their work.

Author Response

We thank the reviewer for noting that our approach is “[...] very original [...]” – it is indeed the first of its kind and we believe the problem we address is of broad importance. We also appreciate the comments on the formal part of our approach.

Strengths

In my opinion, methods for embedding discrete data into hyperbolic space are less covered in the literature, [...] adding more methods covering this topic [...]

[t8YB-A] This is an excellent observation that we had not thought about, and that surely makes our contribution stronger: the approaches [8][11] rely on real-valued data embeddings, making it tricky to embed discrete or categorical data. Because we embed models, which naturally handle any kind of numerical / nominal data, our technique encompasses any kind of such data.

Weaknesses

“[...] In my opinion, the paper would have benefited from a more clear presentation of quantitative results [...]”

[t8YB-B] A fair comment, due to the fact that we mainly focused on the "visualization part"; fortunately, this can easily be addressed, as each visualization also includes an error term: we propose to summarize (mean ± std. dev.) the errors (ρ) made on each dataset in a separate table, which would hopefully fit in the +1 page allocated for the camera-ready.

[t8YB-C] Reading this comment from the reviewer, we came up with another idea to test one of the benefits of hyperbolic embeddings in our setting, which would be trivial to code and run: it is well known that, as points move closer to the border, the half-distance d_B(z, z')/2 approaches the average (d_B(z, 0) + d_B(z', 0))/2 (see [29], Figure 1). For our application, the benefit is substantial: this average is indeed the average of the absolute confidences of the two nodes mapped to z and z' (the nodes' quality), and what the property says is that it can be approximated by just looking locally at (half) the distance between the two points (in terms of visualization, think of zooming in on a part of the disk, or of a hovering tool providing this distance "on the fly"). We would be happy to compute the discrepancy between the two terms above as a function of the localization of the nodes on the disk.
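A minimal numerical check of this property (our own sketch; the formula below is the standard curvature -1 Poincaré-disk distance, and the angular separation of the two points is an arbitrary choice):

```python
import numpy as np

def poincare_dist(z: complex, w: complex) -> float:
    # Standard Poincare-disk (curvature -1) distance:
    # d_B(z, w) = arccosh(1 + 2|z - w|^2 / ((1 - |z|^2)(1 - |w|^2))).
    num = 2.0 * abs(z - w) ** 2
    den = (1.0 - abs(z) ** 2) * (1.0 - abs(w) ** 2)
    return float(np.arccosh(1.0 + num / den))

# As the two points approach the border, d_B(z, z')/2 and the average
# (d_B(z, 0) + d_B(z', 0))/2 grow together and their gap becomes
# negligible relative to their size.
for r in (0.9, 0.99, 0.999):
    z, w = r + 0j, r * np.exp(2j)  # same radius, about 2 radians apart
    half = poincare_dist(z, w) / 2.0
    avg = (poincare_dist(z, 0j) + poincare_dist(w, 0j)) / 2.0
    print(f"r={r}: half-distance={half:.3f}, average={avg:.3f}")
```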

Questions

“[...] A small remark: the most part of plots use red/green colours which can be challenging to distinguish for colour blind people. [...]”

The reviewer is absolutely right: we propose, in addition to the colour, to differentiate using shapes (e.g., circles for red, squares for green, for all nodes). This would be a one-line change in the code.

“[...] The sentence in line 697 is unclear. [...]”

Our experiments involve 10-fold stratified cross-validation (L672) to compute the average errors ρ. In our visualizations, we can only show one of the ten models. In the spirit of fairness, we display the model learned on the first fold (instead of, e.g., picking the best model out of the ten). This is what is summarized in L697, and we would be happy to be more specific.
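For concreteness, the protocol can be sketched in Python with scikit-learn as follows (the authors' actual pipeline is in Java; the dataset and learner below are placeholders for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Average the error over ten stratified folds, then keep the first fold's
# model for display rather than cherry-picking the best of the ten.
X, y = load_breast_cancer(return_X_y=True)
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
models, errors = [], []
for train_idx, test_idx in folds.split(X, y):
    clf = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    models.append(clf)
    errors.append(1.0 - clf.score(X[test_idx], y[test_idx]))
print(f"mean error over 10 folds: {sum(errors) / len(errors):.4f}")
model_to_display = models[0]  # first fold, per the fairness protocol above
```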

“[...] My main concern regarding the paper is related to the difficulties while evaluating the performance of the approach and it mainly justifies my rating. However, I am open for discussions with the authors.[...]”

We are glad the reviewer signals their openness to discussion. Regarding evaluation, we hope that the suggestions we make in [t8YB-B][t8YB-C] will lead to a purely quantitative evaluation of our embeddings that satisfies the reviewer.

Comment

I would like to thank the authors for addressing my comments and questions. After reading the other reviews and answers, I have decided to maintain my rating.

Author Response

To all reviewers

[ALL-1] We would like to thank all reviewers for granting unanimous approval of our paper from the "soundness" ("good" or "excellent") and "contribution" (all "good") standpoints. We understand that presentation is the bottleneck of our paper, which we attribute to the fact that we are the first to tackle model embedding (a new problem) and that a full-fledged solution requires several technical steps (mathematical background). We hope the reviewers concur that the +1 page for the camera-ready can be used to address those concerns (details in each rebuttal).

[ALL-2] We have had requests to compare graphical representations with other software, such as scikit-learn [Ry4X-A]. There is one crucial thing to note: all those packages follow the same blueprint as the one we mention in our footnote "*" (bottom of page 2): they adopt a graph-based representation. For our purpose, this would just map nodes' confidences onto the vertices of a (rooted binary tree) graph. The issue relevant to our paper is not whether the DT learned is accurate, but whether the representation easily conveys meaning. We attach to this general rebuttal a pdf file showing, for the domain online_shoppers_intentions, three representations of one tree learned during one CV fold (not necessarily the same tree for each representation):

  • We first show (left) a decision tree using an ASCII-"art" representation of the rooted binary tree. We chose this because the software we submitted can output these at no cost, and we believe the picture is telling and highlights the same issue as would be observed with all public packages: it is useless to represent a tree (graph) with 200+ nodes in order to grasp how classification is achieved / changes throughout the tree, to compare node predictions in different parts of the tree, etc. (with our ASCII representation, the full graph would have 200+ lines; in scikit-learn, it would have 200+ boxes for nodes, etc.).
  • We then show (middle) a monotonic DT using the same ASCII-art formalism (this representation is not included in our software but is easy to implement). The lesson? The graph is much smaller than for the DT, but it is still hard to grasp when it can have dozens of nodes and as many prediction confidences (real numbers) to mix into the interpretation.
  • We finally present (right) our hyperbolic embedding of the MDT. We believe the reviewers will concur that the "global picture" of the MDT is a clean, interpretable "snapshot" in a way that would not have been possible with either of the other two representations. Should our paper be accepted, we would be happy to make this point very clear in the camera-ready by using a slick version of our attached file, using part of the +1 page.

Tags are spread throughout rebuttals for easy search / referencing (e.g. [ALL-1]).

Final Decision

This paper introduces a method to embed decision tree models into hyperbolic spaces, particularly using the Poincaré ball model. The approach links class probability estimation with hyperbolic distances, focusing on enhancing interpretability without altering model performance. Key techniques include extracting monotonic subtrees and introducing a "tempered" integral to improve visualization and prevent embeddings from collapsing to the ball's boundary. The authors demonstrate that more confident predictions tend to be closer to the Poincaré ball's border, providing a novel way to analyze and visualize model decisions within hyperbolic geometry.

The reviewers agree that the approach is both interesting and useful. From an algorithmic perspective, this work is novel and would make a strong contribution to NeurIPS. However, one of the main concerns is the experimental evaluation. While the paper includes numerous experiments on benchmark datasets, it lacks experiments with more realistic data. Incorporating real-world datasets would greatly enhance the paper’s impact, and I strongly encourage the authors to add such experiments. Despite this, I find the algorithm to be innovative and compelling, and I recommend the paper for acceptance.