RMLR: Extending Multinomial Logistic Regression into General Geometries
We propose a general framework for building intrinsic Riemannian classifiers over general geometries, and showcase it on the SPD manifold and the special orthogonal group.
Abstract
Reviews and Discussion
This paper extends multiclass logistic regression to general Riemannian spaces, contributing to the field of Riemannian deep learning. Starting from the concept of Riemannian hyperplanes, the present work constructs the distance from Riemannian points to Riemannian hyperplanes and derives Riemannian Multinomial Logistic Regression (RMLR) on Riemannian manifolds. The RMLR framework is then showcased under five geometries on the SPD manifold and on SO(n). Extensive experiments on different Riemannian backbone networks, including Riemannian feedforward, Riemannian residual, and Riemannian graph neural networks, validate the effectiveness of the proposed RMLR. In particular, the results in Tab. 9 on direct classification (LogEig vs. SPD MLR) show a clear advantage of the proposed classifiers (an improvement of up to 18.34).
Strengths
- The proposed RMLR framework (Thm. 3.3) can be easily implemented in different geometries. For a specific geometry, one only needs to put the involved operators into Eq. 11.
- The proposed RMLR not only generalizes the Euclidean MLR but also incorporates several previous MLRs, such as the gyro SPD MLR, gyro SPSD MLR, and flat SPD MLR (Tab. 1). Besides, it can further handle geometries that are non-flat or lack a gyro structure.
- A complete study of 5 families of deformed SPD metrics is presented in Tab. 2 and Fig. 1.
- Five SPD MLRs and one Lie MLR are specifically implemented. The experiments on different network backbones, including Riemannian feedforward, Riemannian residual, and Riemannian graph neural networks, validate the effectiveness of the proposed RMLR.
- The presentation is clear, such as SPD and Lie MLRs in Thms. 4.2 and 5.2.
Weaknesses
- More details on the optimization for learning the parameters of the MLR should be presented.
- For the SPD manifold, there are at most three kinds of hyperparameters (the deformation factor and the two metric-tensor parameters). Although these indicate the generality of the proposed framework, how to select them should also be discussed from a practical point of view.
Questions
- What are the complexities (memory and time) of different metrics, theoretically and experimentally?
- For the SPD metrics, how can the researcher select the involved hyper-parameters in practice?
Limitations
N/A
We thank the Reviewer for the encouraging feedback and valuable comments. Below, we address the comments in detail. 😄
1. Details on learning the parameters of the MLR.
Due to page limits, the optimization details are discussed in Apps. G.1.3 and G.2.3. Generally, our RMLR (Eq. (11)) requires Riemannian optimization for each manifold-valued parameter. For the SPD MLR, we use geoopt [a] to optimize the SPD parameters. For the Lie MLR, the manifold-valued parameter is a rotation matrix, whose Riemannian computation is not supported by geoopt. We therefore extend the geometries in geoopt to include rotation matrices in order to update our Lie MLR.
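For illustration, below is a minimal sketch of how the manifold-valued parameters of an SPD MLR could be registered with geoopt. This is illustrative code, not our released implementation, and it assumes a geoopt version that provides the `SymmetricPositiveDefinite` manifold; all variable names are ours.

```python
# Minimal sketch (illustrative, not the released code): registering the SPD-valued
# parameters of an SPD MLR with geoopt's Riemannian optimizers.
import torch
import geoopt

n, num_classes = 20, 10
spd = geoopt.manifolds.SymmetricPositiveDefinite()

# One SPD parameter per class, initialized at the identity, plus an ordinary
# Euclidean parameter per class (optimized in the usual Euclidean way).
P = geoopt.ManifoldParameter(torch.eye(n).repeat(num_classes, 1, 1), manifold=spd)
A = torch.nn.Parameter(0.01 * torch.randn(num_classes, n, n))

# RiemannianAdam applies retraction-based Riemannian updates to ManifoldParameters
# (keeping each P_k on the SPD manifold) and plain Euclidean updates to A.
optimizer = geoopt.optim.RiemannianAdam([P, A], lr=1e-2)
```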
2. Hyperparameters in SPD MLRs and their order of importance.
The hyperparameter selection has been discussed in App. G.1.4. For a specific SPD MLR, there are at most three kinds of hyperparameters: the deformation factor and the two parameters of the metric tensor. In general, the deformation factor is the most important, followed by the metric-tensor parameters.
- The most significant parameter is the deformation factor. As discussed in Sec. 4.1, it interpolates between different types of metrics. Therefore, one can select its value around the deformation boundary, which has been systematically presented in Fig. 1. Generally speaking, we provide recommended candidate values for the deformation factor under AIM, PEM, and LCM, and separate ones under BWM (cf. Fig. 1 and App. G.1.4).
- The less important parameters are those of the metric tensor. Recalling Tab. 12, they only affect the Riemannian metric tensor, i.e., the inner products over tangent spaces. For our SPD MLRs in Thm. 4.2, they therefore only affect the inner products and should have a smaller effect. Our experiments indicate that performance is saturated with respect to these parameters in most cases.
3. Model complexities.
As the logic of RMLR is similar under different geometries, we focus on the SPD MLR below for simplicity.
- Memory complexity: Recalling Thm. 4.2, each class requires an SPD parameter and a Euclidean parameter in the SPD MLR. The memory cost depends on the number of classes and the matrix dimension. Suppose there are $C$ classes and the SPD matrices are of size $n \times n$. The SPD MLR then needs to store $2C$ matrix parameters of size $n \times n$, i.e., $2Cn^2$ entries. On the other hand, the classifier on the tangent space at the identity (LogEig MLR) requires $C \cdot n(n+1)/2$ parameters, where $n(n+1)/2$ is the dimension of the tangent space. Although our SPD MLR requires more parameters than the vanilla LogEig MLR, we achieve much better performance across different network architectures (a small worked parameter count is given below Tab. B).
- Computational complexity: The efficiency of our SPD MLRs against the vanilla non-intrinsic LogEig MLR has been discussed in App. G.1.5. The key factor in the computational complexity of SPD MLRs under different metrics is the number of matrix functions, such as the matrix power, matrix logarithm, Lyapunov operator, and Cholesky decomposition. These matrix functions fall into two categories: those based on eigendecomposition and the Cholesky decomposition, with the latter being more efficient. Tab. A summarizes the number of matrix functions in each SPD MLR. With deformation, the general efficiency of the SPD MLRs should be LCM>EM>LEM>AIM>BWM, while without deformation, the order should be EM>LCM>LEM>AIM>BWM. Tab. B below comes from Tab. 16 in App. G.1.5 and reports the average training time (in seconds) per epoch of each classifier; please refer to App. G.1.5 for more details. Tab. B shows that the EM-, LEM-, and LCM-based SPD MLRs are more efficient than the AIM- and BWM-based ones. Notably, our LCM-based MLR even achieves efficiency comparable to the vanilla LogEig MLR on the Radar and Hinss2021 datasets (a short timing sketch illustrating the eig-vs-Cholesky gap is also given below Tab. B).
Table A: Number of matrix functions for each class in different SPD MLRs. a (b) means the number of matrix functions in the SPD MLR under the deformed (standard) metric.
| Metric | Eig-based Matrix Functions | Cholesky Decomposition | In Total |
|---|---|---|---|
| LEM | 1 (1) | 0 (0) | 1 (1) |
| AIM | 3 (2) | 0 (0) | 3 (2) |
| EM | 1 (0) | 0 (0) | 1 (0) |
| LCM | 1 (0) | 1 (1) | 2 (1) |
| BWM | 5 (4) | 1 (1) | 6 (5) |
Table B: Training efficiency (s/epoch) of different classifiers.
| Methods | Radar | HDM05 | Hinss2021 Inter-session | Hinss2021 Inter-subject |
|---|---|---|---|---|
| LogEig MLR | 1.36 | 1.95 | 0.18 | 8.31 |
| AIM-MLR | 1.75 | 31.64 | 0.38 | 13.3 |
| EM-MLR | 1.34 | 3.91 | 0.19 | 8.23 |
| LEM-MLR | 1.5 | 4.7 | 0.24 | 10.13 |
| BWM-MLR | 1.75 | 33.14 | 0.38 | 13.84 |
| LCM-MLR | 1.35 | 3.29 | 0.18 | 8.35 |
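For intuition, two small illustrations with hypothetical sizes (not taken from the paper's experiments). First, a worked parameter count with $C = 10$ classes and $n = 20$: the SPD MLR stores $2Cn^2 = 2 \cdot 10 \cdot 20^2 = 8000$ entries, whereas the LogEig MLR stores $C \cdot n(n+1)/2 = 10 \cdot 210 = 2100$ entries. Second, a minimal timing sketch (ours, illustrative only) comparing the two primitive decompositions that dominate the cost of the SPD MLRs:

```python
# Minimal sketch: rough CPU timing of eigendecomposition vs. Cholesky
# decomposition on a batch of SPD matrices (illustrative sizes only).
import time
import torch

def random_spd(batch, n):
    a = torch.randn(batch, n, n)
    # a a^T + n I is symmetric positive definite and well conditioned.
    return a @ a.transpose(-1, -2) + n * torch.eye(n)

def bench(fn, x, reps=20):
    fn(x)  # warm-up
    t0 = time.perf_counter()
    for _ in range(reps):
        fn(x)
    return (time.perf_counter() - t0) / reps

x = random_spd(batch=256, n=64)
print("eigh     :", bench(torch.linalg.eigh, x))      # eig-based matrix functions
print("cholesky :", bench(torch.linalg.cholesky, x))  # Cholesky-based functions
```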
Reference
[a] G. Bécigneul and O.-E. Ganea. Riemannian adaptive optimization methods. ICLR, 2019.
Thanks for the reply.
1. Interesting. I hope the SO(n) computation package will be released. This will facilitate building networks on Lie groups.
2-3. Interesting and worth reading, thanks!
Generally, Riemannian deep learning can benefit from the proposed RMLR framework. Apart from the geometries discussed in this paper, the proposed RMLR has the potential to be implemented in other geometries, facilitating Riemannian neural networks.
I have no further concerns and have raised my score to 8. Good luck.
Thanks for the encouraging feedback! We will release the code, including the SO(n) computation package. 😄
- Instead of adopting complex approaches that extend MLR to Riemannian manifolds via general geometric structures such as gyro structures and a generalized law of sines, this study generalizes MLR to Riemannian manifolds using a simple approach based on the logarithm map.
- The authors show several experimental results on various types of datasets.
Strengths
- This study generalizes to Riemannian manifolds using a simple approach based on the logarithm map, avoiding complex machinery such as gyro structures and a generalized law of sines.
- The geometric aspects of this paper are well-founded. Conducting geometrically valid computations using the logarithm map and parallel transport is a standard method for tangent space analysis.
- Evaluation is conducted on various types of datasets.
Weaknesses
- I find this paper somewhat confusing to read. What does the claim "our framework only requires the explicit expression of the Riemannian logarithm" in the introduction mean? For example, parallel transport requires explicit expressions for each type of Riemannian manifold or metric (Table 12). Is there a contradiction with the authors' claim?
Questions
- Does Eq. 8 satisfy the axioms of distance?
- The authors adopt parallel transport to determine $\tilde{A}_k$ in Eq. 11, but projecting a point in Euclidean space onto the tangent space might be simpler. Parallel transport can be computationally intensive and needs to be defined according to the type of manifold. Why did the authors choose parallel transport? Also, regarding Weakness 1, if parallel transport needs to be determined individually, is there a contradiction with the authors' claim that only the logarithm map is needed?
Limitations
- Discussed in Appendix A.
We thank the Reviewer for the constructive suggestions and insightful comments! In the following, we respond to the concerns in detail. 😄
1. Parallel transport is not the only option for determining $\tilde{A}_k$ in the RMLR (Eq. (11)).
- The ways to determine $\tilde{A}_k$ are flexible. As discussed in lines 142-150, since the manifold parameter $P_k$ varies, the tangent space containing $\tilde{A}_k$ also varies, so we cannot directly optimize $\tilde{A}_k$ by Euclidean optimization. The key is to use a map that generates $\tilde{A}_k$ from a fixed tangent space, and the choice of this map is quite flexible. In lines 142-150, we discussed four instances: parallel transport, vector transport [a] (an approximation of parallel transport), the differential of Lie group translation, and the differential of gyro group translation. Apart from these, we believe there are other suitable maps. For a specific manifold, the appropriate choice depends on factors such as the geometry of the manifold, simplicity, and numerical stability.
- We mainly focus on parallel transport due to its advantageous properties and theoretical convenience. Parallel transport has many nice properties, such as preserving inner products across tangent spaces [b, Ch. 3.3]. Furthermore, in our SPD and Lie MLRs involving parallel transport, the parallel transport cancels out and the MLR expression can be further simplified. For instance, although the parallel transport under AIM can be complex (as shown in Tab. 12), it cancels out and thus simplifies the final expression of the AIM-based SPD MLR (an illustration is given below this response). Please check the SPD MLRs in Thm. 4.2 (except the one under BWM) and the Lie MLR in Thm. 5.2, as well as their proofs, for more details.
- Furthermore, we also use the differential of Lie group translation for better numerical stability. As detailed in App. F.2.1, the parallel transport under BWM is backpropagation-unfriendly. Therefore, we use the differential of Lie group translation to determine $\tilde{A}_k$ for the BWM-based SPD MLR as a more stable alternative.
- Finally, the projection map might omit crucial aspects of the latent geometry. We assume the projection map you mentioned is the orthogonal projection used for Riemannian submanifolds or quotient manifolds [a, Ch. 3.6]. It is widely used in Riemannian optimization, as it is equivalent to transforming the Euclidean gradient into its Riemannian counterpart. However, the projection map only captures the horizontal part and discards the vertical part, leading to information loss or redundancy. For instance, it fails to preserve the inner product. Besides, it is not bijective, potentially introducing many-to-one redundancy in $\tilde{A}_k$. Therefore, although it could be used numerically to determine $\tilde{A}_k$, we did not adopt this map. Nevertheless, we will explore other maps that properly map Euclidean vectors into tangent spaces for determining $\tilde{A}_k$.
In summary, parallel transport is not a necessary ingredient for determining $\tilde{A}_k$, and there are several alternative choices. The specific choice depends on theoretical rationality as well as computational simplicity and stability.
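As a simple illustration of the cancellation under the standard (undeformed) AIM (the deformed expressions in Thm. 4.2 may differ in details): with the metric $\langle U, V\rangle_P = \operatorname{tr}(P^{-1} U P^{-1} V)$, the logarithm $\operatorname{Log}_P S = P^{1/2}\log(P^{-1/2} S P^{-1/2})P^{1/2}$, and parallel transport from the identity $I$ to $P$ given by $V \mapsto P^{1/2} V P^{1/2}$, transporting a tangent vector $A$ at $I$ and taking the inner product with the logarithm gives
$$
\big\langle \operatorname{Log}_P S,\ P^{1/2} A P^{1/2} \big\rangle_P = \operatorname{tr}\!\big(\log(P^{-1/2} S P^{-1/2})\, A\big),
$$
so no explicit parallel transport remains in the final expression.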
2. Eq. (8) measures the distance from a point to a hyperplane, not from point to point; therefore, the axioms of a metric space do not apply to our distance.
- The Riemannian margin distance in Eq. (8) differs from the distance in a metric space. The distance in a metric space is a map $\mathcal{M} \times \mathcal{M} \rightarrow \mathbb{R}$. In contrast, the distance in Eq. (8) is a map $\mathcal{M} \times \{\tilde{H}_{\tilde{A},P}\} \rightarrow \mathbb{R}$, where $\{\tilde{H}_{\tilde{A},P}\}$ denotes the set of all Riemannian hyperplanes. Therefore, the axioms of a metric space cannot be applied to our Eq. (8). Nevertheless, Eq. (10) indicates the positivity of our Riemannian margin distance.
- Besides, our Riemannian margin distance is a direct generalization of the Euclidean margin distance. When the latent manifold is the Euclidean space, Eq. (8) reduces to the familiar Euclidean margin distance (Eq. (7)).
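For completeness, recall the Euclidean case: with $\operatorname{Log}_p x = x - p$, the margin distance from a point $x$ to the hyperplane $H_{a,p} = \{x : \langle a, x - p\rangle = 0\}$ is
$$
d(x, H_{a,p}) = \frac{|\langle a,\, x - p\rangle|}{\lVert a\rVert},
$$
which is the quantity that Eq. (8) generalizes via the Riemannian logarithm and geodesic distance.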
References
[a] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008.
[b] M. P. do Carmo. Riemannian Geometry. Translated by F. Flaherty. Birkhäuser, Boston, 1992.
I thank the authors for their detailed responses. All my concerns were solved. Therefore, I decided to raise the score.
Thanks for the encouraging reply and the time you took during the review and discussion! We will add these clarifications to the main paper for better readability. 😄
The authors extend multinomial logistic regression to spaces where only a logarithmic map is required. They do so in order to accomplish tasks such as classification.
Strengths
The paper is well organized and written. There is a good balance of theoretical results and practical applications. It is nice to see a thorough exposition of the SPD manifold with all the commonly used metrics.
Weaknesses
I don't understand why lines 29-31 seem to look down on the use of things such as tangent spaces and coordinate systems, since the proposed method relies on the log map, which itself effectively requires tangent spaces and coordinate systems.
The experiment section could use more thorough explanations, which are currently found only in the appendix.
Questions
Can the authors elaborate on how their work is a distinct contribution compared to the SOTA methods which are mentioned? I understand this is more general, as is shown in Table 1, but I fail to see the entire picture.
Limitations
NA
We thank the Reviewer for the careful review and helpful comments. Below, we address the comments in detail. 😄
1. Linearization by a single fixed tangent space or coordinate system fails to capture the global geometry; our RMLR adopts distinct, dynamic tangent spaces for each class, respecting the Riemannian geometry.
Lines 29-31 summarize several previous classifiers in Riemannian neural networks, which linearize the whole manifold into a single fixed Euclidean space, such as a fixed tangent space (at the identity matrix) or a fixed coordinate system. In contrast, our Riemannian Multinomial Logistic Regression (RMLR) does not identify the manifold with a flat Euclidean space. Instead, it directly builds the classifier based on the Riemannian geometry.
- Theoretically, our RMLR extends the Euclidean MLR to manifolds via Riemannian geometry. Def. 3.1 extends the Euclidean distance to a hyperplane in Eq. (7) to manifolds via Riemannian trigonometry and the geodesic distance. Thm. 3.2 offers a general solution for this hyperplane distance. Putting this solution into Eq. (4) yields the RMLR in Eq. (11). Although Eq. (11) uses the Riemannian logarithm, it is derived from distances to Riemannian hyperplanes, respecting the underlying Riemannian geometry.
- Numerically, our RMLR in Eq. (11) adopts distinct, dynamic tangent spaces for each class:
$$
p(y = k \mid S \in \mathcal{M}) \propto \exp\big(\langle \operatorname{Log}_{P_k} S,\ \tilde{A}_k \rangle_{P_k}\big),
$$
where $P_k \in \mathcal{M}$ and $\tilde{A}_k \in T_{P_k}\mathcal{M}$. Each class involves a distinct calculation of $\operatorname{Log}_{P_k} S$ in the tangent space at $P_k$, i.e., $T_{P_k}\mathcal{M}$. More importantly, each tangent space is dynamically learned, as each $P_k$ is a learnable parameter. Besides, each decision boundary respects the Riemannian hyperplane (Eq. (5)), which generalizes the Euclidean decision hyperplane (Eq. (3)). As illustrated in Figs. 2-3, the Riemannian hyperplane for each class is a curved surface.
2. Core technical contribution: we derive the Riemannian margin distance via Riemannian trigonometry and the geodesic distance, which only requires the Riemannian logarithm.
As discussed in lines 104-118, the core challenge in building Riemannian MLRs is reformulating the Euclidean margin distance in Eq. (7) into its Riemannian counterpart. Previous Riemannian MLRs mainly resort to gyro structures or metrics pulled back from the Euclidean space. These approaches fail to handle general geometries due to the strong requirements they place on the metrics.
- As the core technical contribution, we re-interpret the Riemannian margin distance using Riemannian trigonometry and the geodesic distance, which only requires the Riemannian logarithm. As discussed in lines 120-121, the Riemannian logarithm is the minimal requirement for building Riemannian MLRs, and it exists for the most popular geometries in the machine learning community. Based on this, our Thm. 3.3 provides a solution for RMLR under general geometries. We thus argue that our RMLR is easy to use and can handle general geometries. Further, as shown in Tab. 1, many previous MLRs are special instantiations of our RMLR under different metrics. We will emphasize this core contribution in the revised introduction.
- Further remark: as we claimed, given a Riemannian metric, our RMLR can be implemented in a plug-in manner, by simply putting the required Riemannian logarithm into Eq. (11) (see the sketch below). Therefore, our RMLR can also be implemented on other manifolds, such as correlation matrices [a] and special Euclidean groups [b], as the required Riemannian operators have explicit expressions. We will explore these in the future.
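To make the plug-in nature concrete, below is a minimal sketch (illustrative, not the released implementation; names such as `RMLRSketch`, `log_map`, and `metric` are ours). One supplies the Riemannian logarithm and the metric tensor of the chosen geometry, and the layer returns the per-class logits $\langle \operatorname{Log}_{P_k} S, \tilde{A}_k\rangle_{P_k}$; the Euclidean geometry serves as a sanity check, where the layer reduces to an ordinary affine classifier.

```python
# Minimal sketch of a "plug-in" RMLR layer (illustrative, not the authors' code).
import torch
import torch.nn as nn

class RMLRSketch(nn.Module):
    def __init__(self, num_classes, n, log_map, metric):
        super().__init__()
        self.log_map = log_map  # log_map(P, S): Riemannian logarithm at P
        self.metric = metric    # metric(P, U, V): Riemannian inner product at P
        # P_k should in general be optimized on the manifold (e.g., via geoopt);
        # here it is a plain parameter initialized at the identity for brevity.
        self.P = nn.Parameter(torch.eye(n).repeat(num_classes, 1, 1))
        self.A = nn.Parameter(0.01 * torch.randn(num_classes, n, n))

    def forward(self, S):                        # S: (batch, n, n)
        S = S.unsqueeze(1)                       # (batch, 1, n, n)
        U = self.log_map(self.P, S)              # (batch, C, n, n)
        return self.metric(self.P, U, self.A)    # logits: (batch, C)

# Euclidean plug-in as a sanity check: Log_P(S) = S - P and the Frobenius inner
# product recover an ordinary affine classifier <A_k, S - P_k>.
euclidean_log = lambda P, S: S - P
frobenius = lambda P, U, V: (U * V).sum(dim=(-2, -1))

layer = RMLRSketch(num_classes=10, n=20, log_map=euclidean_log, metric=frobenius)
logits = layer(torch.randn(4, 20, 20))  # shape (4, 10)
```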
3. Experimental explanations in the appendix will be briefly summarized in the main paper.
Due to page limits, we left implementation details and RMLR efficiency in the appendix. In the final version, we will briefly summarize our findings in the main paper and provide proper references to the appendix.
References
[a] Y. Thanwerdas. Permutation-invariant log-Euclidean geometries on full-rank correlation matrices. SIAM Journal on Matrix Analysis and Applications, 2024.
[b] R. M. Murray, Z. Li, and S. S. Sastry. A Mathematical Introduction to Robotic Manipulation. CRC Press, 2017.
I thank the authors for their thoughtful rebuttal. I have increased my score by one point to reflect this.
Thanks for the reply! We appreciate the time that you have taken during the review and discussion. 😄
The paper generalizes multiclass logistic regression to Riemannian manifolds. All reviewers agree that the contribution to the community is valuable and that the approach is novel. The discussion phase largely clarified minor points and worked on an improved presentation.