Sparse hyperbolic representation learning
We propose regularization and optimization methods for sparse learning in hyperbolic space.
Abstract
Reviews and Discussion
This paper studies sparse representation learning in hyperbolic space. It defines the sparsity of a point geometrically in a Cartan-Hadamard manifold and thereby avoids the difficulty of defining a unique coordinate system on a general manifold. To induce sparsity, the paper introduces a novel sparse regularization term, the hyperbolic 1-norm, for continuous optimization. Further, since the resulting optimization problem becomes non-smooth because of the norm, the existing Riemannian gradient descent method would oscillate around sparse solutions, while the proposed hyperbolic iterative shrinkage-thresholding algorithm (HISTA) avoids this oscillation issue. Finally, numerical experiments show that the proposed HISTA is effective and outperforms existing hyperbolic-space-based representation learning methods with respect to space complexity and sparse representation quality.
Strengths
(1) This paper defines the sparsity of a point geometrically in a Cartan-Hadamard manifold, which is non-trivial in hyperbolic space.
(2) This paper introduces a novel sparse regularization term, the hyperbolic 1-norm, on a Cartan-Hadamard manifold equipped with an origin and orthonormal bases, which is non-trivial to define in a coordinate system of hyperbolic space.
(3) This paper proposes HISTA, which avoids the oscillation issue that occurs with the existing Riemannian gradient descent method.
(4) Experimental results indicate that the proposed algorithm outperforms the existing method in terms of avoiding oscillation, space complexity, and sparse representation quality.
Weaknesses
(1) The proposed method only works for Cartan-Hadamard manifolds.
(2) There is no convergence analysis for the proposed HISTA.
(3) Currently the applications of the proposed sparse learning method are kind of limited.
Some typos:
an CHMOO -> a CHMOO;
Page 9, -> ;
Page 9, delete "sufficient" in Section 8 Limitation.
Questions
See Weaknesses.
Thank you for your comments acknowledging our paper's non-triviality and that the proposed algorithm outperforms the existing method in terms of avoiding oscillation, space complexity, and sparse representation quality.
(1): The proposed method only works for Cartan-Hadamard manifolds.
Response: First, we emphasize that considering Cartan-Hadamard manifolds suffices for our paper's motivation, sparse learning in hyperbolic space, because hyperbolic space is a Cartan-Hadamard manifold. In fact, as mentioned in Section 8, we can extend our framework to other manifolds, such as spheres and tori. For example, suppose that we consider a two-dimensional sphere, like our planet, and we put the origin at the point with 0 longitude and 0 latitude without loss of generality. Then, the sparse hyperplanes are 0-degree longitude lines and 0-degree latitude lines. The signed distance from those hyperplanes can easily be calculated on the sphere. This discussion applies to higher-dimensional spheres and tori. However, we omitted those cases since our motivation lies not there but in hyperbolic space, and including those types of spaces would involve unnecessary technical preliminaries.
Just in case, we remark that we did not limit the discussion to hyperbolic space but included general Cartan-Hadamard manifolds because this helps readers compare the hyperbolic case to the Euclidean case. This purpose does not require us to consider other types of manifolds either.
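To make the sphere example above concrete, here is a minimal sketch (our own illustration with hypothetical function names, not code from the paper) that computes the signed geodesic distances of a point on the unit 2-sphere from the two sparse hyperplanes mentioned above, namely the 0-degree latitude line (equator) and the 0-degree longitude great circle.

```python
import numpy as np

def sphere_point(lon_deg, lat_deg):
    """Unit-sphere point from longitude/latitude in degrees."""
    lon, lat = np.radians(lon_deg), np.radians(lat_deg)
    return np.array([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)])

def signed_distance_to_great_circle(p, normal):
    """Signed geodesic distance from p to the great circle with unit normal `normal`."""
    return np.arcsin(np.clip(np.dot(p, normal), -1.0, 1.0))

p = sphere_point(lon_deg=30.0, lat_deg=0.0)       # a point on the equator
equator_normal = np.array([0.0, 0.0, 1.0])        # 0-degree latitude line (equator)
meridian_normal = np.array([0.0, 1.0, 0.0])       # 0-degree longitude great circle

d_equator = signed_distance_to_great_circle(p, equator_normal)    # 0.0: lies on the equator
d_meridian = signed_distance_to_great_circle(p, meridian_normal)  # ~0.524 rad (30 degrees)
print(d_equator, d_meridian)
# The point has a zero "coordinate" with respect to the equator but not the meridian.
```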
(2): There is no convergence analysis for the proposed HISTA.
Response: This point is what we have mentioned in Section 8 as a limitation. As stated there, we cannot use the same technique as in Euclidean space. However, we claim that this should not be regarded as an incompleteness of our paper. One reason is that our paper has provided a framework for sparse learning in hyperbolic space, so there is no competitor against which we can compare the convergence rate of our method. Another reason is that the convergence analysis of optimizing an objective function on a general Riemannian manifold with a non-smooth regularization has not yet been established. In fact, to the best of our knowledge, there has been no convergence analysis of closed-form algorithms with non-smooth regularizations on a general Riemannian manifold. Of course, if we regard the regularization term as a part of the loss function, we can obtain a convergence rate for Riemannian stochastic gradient descent variants, but in this case the rate is degraded by the non-smoothness of the regularization term, and it does not solve the oscillation issue. This is why ISTA was proposed in Euclidean space and why our method is inspired by it. ISTA achieves a good convergence rate in Euclidean space precisely by not regarding the regularization term as a part of the loss function. However, to the best of our knowledge, no research has succeeded in extending the convergence rate of ISTA to a closed-form algorithm on a general Riemannian manifold. The paper (Huang & Wei, 2022) is the only related one, but unfortunately its algorithm is not given in a closed form in hyperbolic space. To wrap up, if we succeeded in providing a convergence analysis for our algorithm, it could constitute another independent breakthrough paper. Considering the above background, we believe that the lack of convergence analysis is not an obstacle to publishing our paper at the moment.
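For reference, the following is a minimal sketch of the Euclidean ISTA our discussion refers to (a textbook version, not the paper's HISTA): the regularization term is handled by its closed-form proximal operator, soft-thresholding, instead of being folded into the loss, which is what produces exactly sparse iterates without oscillation.

```python
import numpy as np

def soft_threshold(z, tau):
    """Closed-form proximal operator of tau * ||.||_1 (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def ista(grad_f, x0, step, lam, num_iters=500):
    """Euclidean ISTA for min_x f(x) + lam * ||x||_1.

    grad_f : gradient of the smooth loss f
    step   : step size (e.g. 1/L for an L-Lipschitz grad_f)
    """
    x = x0.copy()
    for _ in range(num_iters):
        x = soft_threshold(x - step * grad_f(x), step * lam)
    return x

# Toy example: sparse least squares  min_x 0.5*||A x - b||^2 + lam*||x||_1
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 20))
x_true = np.zeros(20)
x_true[:3] = [2.0, -1.5, 1.0]
b = A @ x_true
L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
x_hat = ista(lambda x: A.T @ (A @ x - b), np.zeros(20), 1.0 / L, lam=0.1)
print(np.round(x_hat, 2))              # exactly zero on the non-support coordinates
```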
(3) Currently the applications of the proposed sparse learning method are kind of limited.
Response: What we have proposed is a regularization-based sparse learning framework in hyperbolic space. Hence, it applies to any method with a loss function over parameters in hyperbolic space. In this sense, we claim that the applications of our sparse learning method are not limited. The reason we chose graph embedding as the example application in the numerical experiments in the Appendix is that the simplicity of the task helps us discuss the effect of our regularization. Indeed, we successfully discuss the relation between our method's performance and the graph structure there. This does not mean it cannot be applied to other tasks. However, if we had applied it to more complicated tasks, such as regression using a multi-layer hyperbolic neural network, the evaluation of our sparse learning framework would not have been as insightful as our experiments.
Thank you for your feedback. I will keep my score.
Thank you for your reply and for taking the time to read my comment.
This paper innovatively proposes a sparse learning scheme for hyperbolic representation learning. By defining sparseness and a 1-norm on the Cartan-Hadamard manifold, this paper derives a novel hyperbolic iterative shrinkage-thresholding algorithm (HISTA) to obtain sparse representations. With theoretical analysis and experimental verification, HISTA successfully achieves geometrically uniform sparsity and avoids oscillation issues.
Strengths
- Ideas presented in this paper are novel. Existing hyperbolic methods lack a complete treatment of sparse learning.
- Theoretical analysis is organized and clear. Adequate definitions and remarks enhance the credibility of this paper.
- Limitations and future studies are given from different perspectives, showing the fundamental contributions of this paper in sparse hyperbolic learning.
Weaknesses
- The main body lacks a graphical presentation of experimental results. Please optimize the article structure for better readability.
- Some symbols in this paper are ambiguous. For example, the italic upper-case letter T represents the tangent space in the preliminary section but the final iteration number in Algorithm 1. Please check the uniqueness of notations.
- Several mistakes. a) Sections 2, 3, 7 and 8 are missing at the end of the introduction. b) Why do Definition 2 and Example 2 start on new lines? They should be consistent with the others. c) SHP mentioned in Definition 3 should be italic-bolded. d) Poincaré is misspelled at the end of Section 6.
Questions
Please refer to the weaknesses part.
Thank you for your insightful comments pointing out that our paper is novel and that the theoretical analysis is organized and clear.
1.: The main body lacks a graphical presentation of experimental results. Please optimize the article structure for better readability.
Response: Thank you for your constructive suggestion. Our priority was to provide a framework for sparse learning in hyperbolic space, and we focused on its theoretical foundation in the main body. However, we agree that including a graphical presentation of experimental results would help readers' understanding if we had more space. Although an additional page is not granted during the rebuttal period, we will move the contents of Section F of the current version to the main body in the camera-ready version using the additional page.
2.: Some symbols in this paper are ambiguous. For example, the italic upper-case letter T represents the tangent space in the preliminary section but the final iteration number in Algorithm 1. Please check the uniqueness of notations.
Response: Thank you for pointing out the notation issue. We have updated the draft. In the revised version, a new symbol, instead of T, is used to denote the tangent space, and T is now used only to represent the final iteration number.
3.: Several mistakes. a) Sections 2, 3, 7 and 8 are missing at the end of the introduction. b) Why do Definition 2 and Example 2 start on new lines? They should be consistent with the others. c) SHP mentioned in Definition 3 should be italic-bolded. d) Poincaré is misspelled at the end of Section 6.
Response: Thank you for pointing out the format issues. We have updated the draft. In the revised version,
- a) Sections 2, 3, 7, and 8 are now included in the overview of the remaining sections at the end of the Introduction.
- b) Definition 2 and Example 2 no longer start on new lines.
- c) SHP mentioned in Definition 3 is now italic-bolded.
- d) The spelling of Poincaré at the end of Section 6 has now been corrected.
Thank you for your feedback. Compared to Section F, some experiments in Section G could better demonstrate the superiority of this method in real applications. I will maintain my original score.
Thank you for your reply and for taking the time to read our comments.
Here, we address your additional concern.
- Compared to Section F, some experiments in Section G could better demonstrate the superiority of this method in real applications.
Response: We stress again that the experiments in Section G exist to discuss both when our method works well and when it does not, and to confirm both in real applications. This discussion is essential in a machine learning paper since no method can consistently outperform others, according to the no free lunch theorem. In our case, as explained in the dataset description in Section G, if the data are a mixture of a tree-like structure and a non-hierarchical structure, our method is expected to work well. Conversely, if the data have a purely tree-like structure, no regularization for hyperbolic space is needed and our method is not expected to work as well as the no-regularization method. Our numerical experiments clearly show that these expectations are correct. Specifically, our method outperformed others on ENRON-EMAIL, which mixes a tree-like structure with a non-hierarchical structure, and did not outperform others on CORA, which is almost tree-like. To wrap up, our experiments clearly demonstrated both the superiority and the inferiority of our method in real applications, together with a general lesson about when our method is superior. We remark that showing the inferiority of our method, as well as its superiority, should be understood as full disclosure of limitations, which is required by the ICLR Code of Ethics ("Researchers should provide full disclosure of all pertinent system capabilities, limitations, and potential problems to the appropriate parties, including any party that may deploy the system."). Hence, it is not a weakness of our paper.
The paper extends the sparsity regularization framework of Euclidean representations to more general manifolds of nonpositive curvature called Cartan-Hadamard manifolds (CHMs). CHMs include Euclidean and hyperbolic geometries. The main idea is to study the orthonormal basis of the tangent space at some point of the manifold and apply l0- or l1-norm regularization via differential geometry tools such as the logarithmic map. Since l1-norm regularization might not converge to sparse solutions in practice, the paper proposes to extend the iterative shrinkage-thresholding algorithm to CHMs to solve this practical issue.
Strengths
Sparsity is an important tool in machine learning to avoid overfitting but has mainly been studied in the Euclidean context. Hyperbolic representations have been shown to outperform Euclidean representations in low-dimensional space. This paper presents an elegant regularization framework to promote sparsity in CHMs in general.
The paper is well-written and its motivation is clear. Since the Euclidean space is a CHM, the connection and extension to CHMs are well explained.
Weaknesses
Although the paper proposes a way to promote sparsity in CHMs, the family of considered Riemannian manifolds is still limited (i.e., complete Riemannian manifolds of nonpositive curvature). The paper illustrates a nice example where l1-norm regularization can be extended to some Riemannian manifolds, but it assumes the existence of exponential and logarithmic maps whose closed forms are often unknown in practice.
The practical use of the proposed framework in real world applications is also unclear. In particular, Figure 6 in the Appendix shows that no regularization at all is competitive with the proposed method in terms of accuracy (sometimes even better). This limits the practicality of the proposed approach.
Another limitation is that the paper considers nonparametric embeddings, not neural networks. Although sparsity has been studied for embeddings in the signal processing community, the sparsity criterion in machine learning often focuses on the output representation of linear or nonlinear models such as Support Vector Machines and Multiple Kernel Learning.
Questions
Could you explain the practical relevance of the approach?
- Your comment: Another limitation is that the paper considers nonparametric embeddings, not neural networks. Although sparsity has been studied for embeddings in the signal processing community, the sparsity criterion in machine learning often focuses on the output representation of linear or nonlinear models such as Support Vector Machines and Multiple Kernel Learning.
Response: Thank you for your comment. We reiterate that our motivation is representation learning in hyperbolic space. Hence, we claim that not covering Support Vector Machines and Multiple Kernel Learning, which consider points in a reproducing kernel Hilbert space rather than in hyperbolic space, should not be understood as a weakness of the paper.
You pointed out that our paper considers nonparametric embeddings. It is correct that graph embedding, which is evaluated in Appendix G, is a variant of nonparametric embedding. One reason for selecting graph embedding is that the simplicity of the task makes it easy to discuss the effect of our regularization. Another reason is that graph embedding is one of the most typical areas where machine learning using hyperbolic space has succeeded, as represented by (Nickel and Kiela, 2017).
Your comment also mentions neural networks. In fact, we can apply our sparse learning method to any machine learning method with a loss function over parameters in hyperbolic space, including hyperbolic neural networks, because our method is regularization-based. Nevertheless, if we evaluated our sparse learning scheme on hyperbolic neural networks, the evaluation would be far less intuitive than our experiments in Appendix G, since the meaning of the parameters there is not trivial.
- Your question: Could you explain the practical relevance of the approach?
Please see our response to your comment "The practical use..." below.
- Your comment: The practical use of the proposed framework in real world applications is also unclear. In particular, Figure 6 in the Appendix shows that no regularization at all is competitive with the proposed method in terms of accuracy (sometimes even better). This limits the practicality of the proposed approach.
Response: We appreciate your paying attention to the results in the Appendix. However, we suspect that you may have misunderstood the objective of the experiments. What the experiments tried to show, as in any machine learning paper, is when the method outperforms others and, equally, when it does NOT. Confirming the latter is also essential in machine learning since no method can always outperform others, according to the no free lunch theorem. To see the former case, we selected ENRON-EMAIL, and to see the latter case, we selected CORA.
Specifically, as explained in the dataset description in Appendix G,
- Since ENRON-EMAIL is a mixture of a tree-like structure and a non-tree-like structure, we expect our sparse learning scheme to work better than the no-regularization method on it.
- Since CORA is highly tree-like, we expect it to be more effective to apply a (non-sparse) hyperbolic embedding method in low-dimensional space than to apply our sparse learning scheme in high-dimensional space. Hence, we do NOT expect our sparse learning scheme to work as effectively on CORA as on ENRON-EMAIL.
Hence, it is not strange that our method did not outperform others on CORA, and this result does not imply that the practicality of the proposed method is limited. It even shows the soundness of the experiments in the sense that they do not violate the no free lunch theorem.
The results were as expected above, as we can see in Figure 6. Just in case, we remark again how to interpret Figure 6. In Figure 6, the closer a curve is to the upper-left corner, the better, since we need to see the trade-off between accuracy and space complexity. Just comparing accuracy is not meaningful since we can easily increase it by decreasing the regularization hyperparameter or increasing the embedding dimension. To say that method A is better than method B, we need to find a pair of a point on method A's curve and a point on method B's curve such that method A's point is located to the upper left of method B's point. For a pair with any other relative position, we conclude they are tied. Therefore, we can say that our proposed method outperforms the no-regularization method on ENRON-EMAIL in the left half of the figure, while they are tied in the right half of the figure. Conversely, the no-regularization method outperforms our regularization method on CORA.
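To make this comparison rule concrete, here is a small hypothetical helper (the function names and the numbers are ours, not taken from Figure 6) that checks whether one method outperforms another on (space complexity, accuracy) pairs in the sense described above.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (space_complexity, accuracy)

def dominates(a: Point, b: Point) -> bool:
    """a is at the upper left of b: less space complexity and higher accuracy."""
    return a[0] < b[0] and a[1] > b[1]

def outperforms(curve_a: List[Point], curve_b: List[Point]) -> bool:
    """Method A outperforms method B if some point of A dominates some point of B."""
    return any(dominates(a, b) for a in curve_a for b in curve_b)

# Hypothetical (space complexity, accuracy) curves, not the paper's numbers.
proposed = [(0.2, 0.81), (0.4, 0.86), (0.8, 0.90)]
no_reg   = [(0.5, 0.80), (0.9, 0.90)]

print(outperforms(proposed, no_reg))   # True: (0.4, 0.86) is upper left of (0.5, 0.80)
print(outperforms(no_reg, proposed))   # False; if both were False, the methods would be tied
```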
To wrap up, what we can conclude from numerical experiments is that
- our sparse learning method works well on data with a mixture of a tree-like structure and a non-tree-like structure, and
- our sparse learning method does not work as well as the no-regularization method on data with a purely tree-like structure.
This is a natural conclusion since any tree can be embedded in the two-dimensional hyperbolic space with arbitrarily small distortion (Sarkar, 2011). Data with a purely tree-like structure is a domain where the proposed method cannot outperform others, but every machine learning method has such a domain, according to the no free lunch theorem. Showing results in such a domain, as well as results where the proposed method outperforms others, should be understood as full disclosure of limitations, which is required by the ICLR Code of Ethics ("Researchers should provide full disclosure of all pertinent system capabilities, limitations, and potential problems to the appropriate parties, including any party that may deploy the system."). Hence, this point is not a weakness of the paper, as long as the proposed method has a clear domain where it can outperform others, which in this case is data with a mixture of a tree-like structure and a non-tree-like structure. Since it is clear when the proposed method is effective, the practicality of the proposed method is guaranteed.
Thank you for your comments acknowledging that our paper is well-written and its motivation is clear.
- Your comment: Although the paper proposes a way to promote sparsity in CHMs, the family of considered Riemannian manifolds is still limited (i.e., complete Riemannian manifolds of nonpositive curvature).
Response: Thank you for raising an important point related to our discussion in Section 8. First, we emphasize that the CHM is a sufficiently large class for our motivation, sparse learning in hyperbolic space, since hyperbolic space is a CHM. Also, as you pointed out, considering a CHM helps readers understand the paper since the well-known Euclidean space is also a CHM. However, this does not mean the discussion has to be limited to CHMs. In fact, there are important cases where we can introduce sparse learning in the same way, such as spheres and tori, as mentioned in Section 8. However, since they are outside our motivation, we omitted the details from the paper for the sake of a simple discussion.
- Your comment: The paper illustrates a nice example where l1-norm regularization can be extended to some Riemannian manifolds, but it assumes the existence of exponential and logarithmic maps whose closed forms are often unknown in practice.
Response: First, we stress that this is not problematic at all for our motivation, sparse learning in hyperbolic space, because the closed forms of the exponential and logarithmic maps are explicitly available and computationally efficient in hyperbolic space. Hence, it is not a weakness of the paper. Nevertheless, in the following, we discuss the general Riemannian manifold case for the reviewer's interest.
It is correct that our framework needs a closed-form formula, whether exact or approximate, for the exponential and logarithmic maps. However, this is rarely a practical issue in the machine learning community, to which we believe ICLR belongs. This is because, in most cases where a Riemannian manifold is used in machine learning, an exponential map formula is needed in the end to optimize the loss function. Hence, if no exponential map formula is available on a Riemannian manifold, the manifold is not very welcome in machine learning in the first place. We are aware that for some manifolds no exact formula for the exponential map exists and an approximation, often called a retraction, is used instead. Even in this case, if the retraction works well as an approximation, then our sparse learning scheme would work well with the approximation; conversely, if it does not, the machine learning model (without our sparse learning scheme) would not work well in the first place since the optimization is difficult. This is why we do not regard requiring the closed forms of the exponential map and its inverse, the logarithmic map, as a practical issue.
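To illustrate this point, the exponential and logarithmic maps of hyperbolic space have simple, well-known closed forms. The sketch below uses the hyperboloid (Lorentz) model; the function names are ours and the code is not taken from the paper.

```python
import numpy as np

def minkowski_dot(x, y):
    """Lorentzian inner product <x, y>_L = -x0*y0 + x1*y1 + ... + xd*yd."""
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def exp_map(x, v):
    """Exponential map at x on the hyperboloid {z : <z, z>_L = -1, z0 > 0}.

    v must be tangent at x, i.e. <x, v>_L = 0.
    """
    n = np.sqrt(max(minkowski_dot(v, v), 0.0))
    if n < 1e-12:
        return x
    return np.cosh(n) * x + np.sinh(n) * (v / n)

def log_map(x, y):
    """Logarithmic map at x (inverse of exp_map), also in closed form."""
    alpha = -minkowski_dot(x, y)                # alpha = cosh(d(x, y)) >= 1
    d = np.arccosh(np.clip(alpha, 1.0, None))   # geodesic distance
    u = y - alpha * x                           # unnormalized tangent direction
    un = np.sqrt(max(minkowski_dot(u, u), 0.0))
    return np.zeros_like(x) if un < 1e-12 else (d / un) * u

# Round trip: log_map(x, exp_map(x, v)) should recover v.
x = np.array([1.0, 0.0, 0.0])                   # origin of the 2D hyperboloid
v = np.array([0.0, 0.3, -0.2])                  # tangent at x, since <x, v>_L = 0
print(log_map(x, exp_map(x, v)))                # approximately [0.  0.3 -0.2]
```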
This paper proposes to extend sparsity regularization from the Euclidean setting to the setting of Riemannian manifolds of nonpositive curvature (Cartan-Hadamard manifolds, CHMs), along with an iterative shrinkage-thresholding algorithm (CHISTA), in particular in hyperbolic space (HISTA).
Reviewers generally agree that sparsity on manifolds is a topic of interest for machine learning. However, there are issues with the current formulation, both theoretically and empirically:
- It is not clear whether the CH-1 norm is independent of the coordinate system in general, as claimed in Remark 5. This needs to be proved explicitly. The SHP (sparse hyperplane) is defined based on a fixed origin and orthonormal basis of the corresponding tangent space, and so is the signed SHP distance map (SSDM).
- The convergence analysis of CHISTA and HISTA is lacking, as acknowledged by the authors themselves as a limitation of the current framework.
- The experimental validation is weak (Reviewers kjo2 and mRWy) and is only provided in the Supplementary Material, showing either no or very limited improvement over other methods, including no regularization.
Why not a higher score
There are issues with the theoretical formulations and the experimental validation is weak.
Why not a lower score
N/A
Reject