6.0

/10

Poster3 位审稿人

最低6最高6标准差0.0

3.7

置信度

正确性3.0

贡献度2.3

表达3.0

NeurIPS 2024

OT4P: Unlocking Effective Orthogonal Group Path for Permutation Relaxation

Yaming Guo,Chen Zhu,Hengshu Zhu,Tieru Wu

OpenReview PDF

提交: 2024-05-08更新: 2024-11-06

TL;DR

We present a novel differentiable transformation for relaxing permutation matrices onto the orthogonal group, which enables gradient-based (stochastic) optimization of problems involving permutation matrices.

摘要

关键词

permutation matrixorthogonal groupdifferentiable transformationstochastic optimization

评审与讨论

审稿意见

评分: 6置信度: 42024-07-01

This paper proposes a new method for relaxing permutation matrices onto the group of orthogonal matrices. A temperature controlled differentiable transform maps the permutations onto O(n) and this allows for adjusting the strength of regularity vs the problem difficulty. With this relaxation, the paper employs continuous optimization methods to solve various problems including permutation synchronization, a notoriously challenging non-convex problem. The experiments demonstrate a consistent advantage over (Gumbel-) Sinkhorn.

优点

First of all, optimization over permutations is a very important problem, remaining yet to be solved. To this end, this paper introduces an awesome idea as well as a sensible, well-thought approach to implement it. I especially like the fact that, similar to entropic-OT, the temperature parameter allows for an interpretable control over being a generic orthogonal matrix vs. being a permutation. I also checked the code and it seems reasonable.

缺点

There are several clarifications and evidence needed to make the paper fully convincing:

Permutations are the only strictly positive orthogonal matrices. Therefore, all relaxed matrices in this paper will inherently include some negativity. While acceptable for optimization, this poses a challenge when rounding results back to permutations. This raises two questions:
1. How do the authors implement rounding, precisely? In the experimental results, it would be beneficial to see the 'permutationness' of the final matrix, possibly in comparison to accuracy or loss. This insight would inform practitioners about important design choices.
2. Why constrain ourselves to the orthogonal group and not introduce temperature directly in $\mathbb{R}_{+}(n)$ ? What makes orthogonal matrices special? This point is only partially explained.
I'm uncertain about the transformation of odd permutations. $O(n)$ has a disconnected topology, unless it's $SO(n)$ (considering reflections), and transforming to the interior of $\mathcal{U}$ can distort and change geodesic distances, crucial for optimization. The paper appears to use $O(n)$ and $SO(n)$ interchangeably, which might not be good practice. The consideration of reflections is essential due to potential distortions caused by mapping some matrices into the interior of $\mathcal{U}$ . Are we assuming a connected topology? These points confuse me.
Permutation synchronization, a new experiment mentioned only in the supplementary materials, deserves mention in the main text. Also note that, Birdal and Simsekli [5] use the Riemannian geometry of the doubly stochastic matrices to solve permutation synchronization. They also control the strength by analytically deriving the prior induced by the manifold-objective. This work should be discussed, and hopefully compared.
The paper lacks another simple baseline: relax permutations onto orthogonal matrices (classical) and use a regularization term to enforce permutation-ness, controlled by coefficient $c$ . As $c \to 0$ , optimization on $O(n)$ is naive, while as $c \to \infty$ , strict permutation-ness makes the problem challenging. This approach could be included in comparisons.
In the proposed approach, $T \to 0$ can cause numerical issues. At what point does this become problematic and unfeasible? Intuitively, why is this approach preferable to entropy-regularized optimal transport, which seems similar in spirit? The paper should discuss or at least cite: Cuturi, Marco. "Sinkhorn distances: Lightspeed computation of optimal transport." Advances in neural information processing systems 26 (2013).
Can we see results from different optimizers? Depending on their behavior, advantages might be amplified or diminished. For instance, [5] used Riemannian LBFGS. Note that Adam might not be the most suitable optimizer for such problems.

Minor issues:

Ln. 223: "is stay" -> "stays"
The link for CMU house seems to be broken. Where did authors obtain the dataset?

问题

Please see the weaknesses section for most of my questions. In particular, I would be happy if the authors can address:

Distinguishing SO(n) from O(n) and stating the impact of the choice, especially in context of the proposed transformation.
Some evidence/evaluation/justification that the transformation does not pose a problem for geodesic distance computation.
Comparisons and discussions to [5] and the simple regularization baseline (see weaknesses).

局限性

The paper discusses the limitations in Appendix A. I would appreciate if some of those would be moved to the main paper. Especially the boundary issues can be discussed within the main text. An appropriate broader impact statement is added to the paper (appendix). I see no concerns with that.

作者回复

2024-08-07

We sincerely thank Reviewer k6w2 for your insightful and constructive comments. We strongly recommend reviewing our global rebuttal first to clarify common concerns. Following are our responses to each individual comment.

Q1: how do the authors implement rounding

To implement rounding from the orthogonal matrix $O$ , we first eliminate negative elements by subtracting the minimum element found within $O$ , i.e., $O−\min(O)$ . This approach is justified by the fact that $\arg\max_P ⟨P,O⟩=\arg\max_P⟨P,O−\min(O)⟩$ . Subsequent to this adjustment, we employ the Hungarian algorithm, available in existing libraries, to round $O−\min(O)$ to the closest permutation matrix. We will provide additional details on this implementation in the paper.

Q2: it would be beneficial to see the 'permutationness'

Based on your suggestion, in the newly added experiments, we have employed $\ell_1$ Distance to measure the "permutationness" of the final matrix. For more details, please refer to the Additional experiments in the global rebuttal.

Q3: why constrain ourselves to the orthogonal group

As discussed in the paper, the orthogonal group possesses the advantages of offering lower-dimensional search space and preserving inner products. If Step II is replaced with linear interpolation, we would lose these potential benefits. Our experiments have also verified that remaining within the orthogonal group can lead to performance improvements in certain scenarios.

Q4: I'm uncertain about the transformation of odd permutations

We would like to clarify that the computation of geodesics is exclusively conducted within the simply connected $\mathrm{SO}(n)$ , without alternating between $\mathrm{SO}(n)$ and $\mathrm{O}(n)$ . For a more comprehensive explanation, please refer to Q4 in the global rebuttal. Importantly, we do not directly transfer orthogonal matrices near the odd permutation into the interior of $\mathcal{U}$ . Instead, we first map these matrices to $\mathrm{SO}(n)$ using an isometry, and subsequently to $\mathcal{U}$ through a differentiable homeomorphism, thus avoiding any potential distortions.

Q5: permutation synchronization

The global rebuttal presents additional experiments conducted on the WILLOW-ObjectClass dataset [1], incorporating the two baselines you mentioned. Furthermore, we provide results using various optimizers. Please note that the code in [5] is not publicly available, so we utilize Riemannian gradient descent at Birkhoff Polytope as an alternative [2].

Q5: $\tau\to 0$ can cause numerical issues

Please refer to Q3 in the global rebuttal.

Q6: why is this approach preferable to entropy-regularized optimal transport

Our approach is similar in spirit to methods based on Sinkhorn's algorithm, as both aim to relax and subsequently anneal solutions toward permutation matrices. However, unlike these works, we opt for relaxation over the orthogonal group, which offers potential advantages such as a lower-dimensional search space and the preservation of inner products. Practically, OT4P employs a single well-defined hyperparameter, whereas the Sinkhorn-based approaches typically involve two interdependent hyperparameters: the number of iterations and the regularization strength. These distinctions can make OT4P more straightforward to configure and potentially more robust in practice.

We will correct the typographical errors and cite the paper you mentioned. The CMU dataset can be found in the open-source project of [3]. Due to page constraints, discussions on permutation synchronization and boundary issues are currently placed in the appendix. We will consider relocating these sections to the main paper to improve clarity and structural coherence. Thank you once again for your insightful feedback, which provides valuable insights for further refinement of our work.

References

[1] Minsu Cho, Karteek Alahari, and Jean Ponce. Learning graphs to match. ICCV, pages 25–32, 2013.

[2] Gary Becigneul and Octavian-Eugen Ganea. Riemannian adaptive optimization methods. ICLR, 2019.

[3] Florian Bernard, Daniel Cremers, and Johan Thunberg. Sparse quadratic optimisation over the stiefel manifold with application to permutation synchronisation. NeurIPS, 34:25256–25266, 2021. Project at https://github.com/fbernardpi/SparseStiefelOpt

2024-08-08

I thank the authors for the responses and will maintain my recommendation of acceptance. It would be nice to see the new comparisons in the main paper.

评论- Thank you for your positive comments

2024-08-08

We appreciate your efforts in reviewing our paper and are pleased to know that our responses have addressed your concerns. Permutation synchronization is an important issue, as you have noted, so these comparisons will be included as a subsection in the experiments section of the main paper.

审稿意见

评分: 6置信度: 32024-07-15

The paper proposes a parameterization of $n\times n$ permutation matrices by $n(n-1)/2$ unconstrained numbers. This parameterization eases the training of neural networks and is applicable to tasks involving optimization over permutations.

优点

The paper is well-written and easy to follow.
The initial idea is very simple: Replace the doubly stochastic relaxation with the orthogonal relaxation. Following this idea, the paper develops good intuition and solid theorems in terms of injective/surjectivity of the parametrization.
The experimental results look great, presenting the advantages of the proposed method in some aspects.

缺点

Some design choices and technical development do not seem to be very motivated;
- What's the role of the temperature parameter $\tau\in(0,1]$ ? What's wrong if we just set it to $0$ and we just want the closest permutations? In fact, the experiments show that smaller $\tau$ seem to be better (see Table 5). If $\tau$ is not needed (and if I understand this correctly), then there would not be the problem of odd permutations and there would be no need to introduce $D$ and the remedy.
  The reader may be unclear about why special orthogonal groups are used for parameterization. Why not just use orthogonal groups? It would be nice to have this explained in the paper.
Can authors provide results for the case "Known 0%" in Table 2? This is important as it gives the reader a better understanding of the importance of having prior information.
The running time of the method is unclear and the scalability claim does not seem to be true.
- First, the two costly steps of the proposed method are eigenvalue decomposition and solving the linear assignment problem, and, in my experience, the latter is often much slower for various reasons, e.g., eigenvalue decomposition has high-quality implementations and is more amenable to parallel execution. Also, the method of Jonker & Volgenant (1987) [1] is much faster than the Hungarian algorithm. Both steps should be fast for matrices of size 1000x1000. Could the authors comment on why their implementation is slow for $n>1000$ (as mentioned in Appendix A.2)? Furthermore, matrices of such sizes are not "very large"; hence, I think the claim that the proposed method is scalable is an overstatement.
- Could the authors compare the proposed method's running time (time complexity) with prior works?

[1] A shortest augmenting path algorithm for dense and sparse linear assignment problems

Some typos or confusion:
- In the definition of the domain $U$ of Theorem 1, what is Im? Does it mean the imaginary part of the eigenvalues? It would be nice to define them.
- In Eq. 14, the optimization problem is not defined clearly. What are the optimization variables, and is $\theta$ the parameter of $f$ as well?

问题

Besides the above, one extra question:

Eq. 7 is a linear assignment problem. But since the data matrix is orthogonal, is it possible to have an algorithm faster than cubic time?

局限性

See the above.

作者回复

2024-08-07

We extend our heartfelt thanks to Reviewer FuEB for your thorough and thoughtful review of our manuscript. We strongly recommend reviewing our global rebuttal first to clarify common concerns. We have carefully considered your feedback and responded to it below.

Q1: what's the role of the temperature parameter $\tau$ ?

Please refer to Q2 in the global rebuttal.

Q2: what's wrong if we just set it to 0 and we just want the closest permutations?

During the evaluation phase, we can set $\tau=0$ to obtain the permutation matrix closest to the original orthogonal matrix. However, during the training phase, setting $\tau=0$ impedes gradient-based optimization. For a detailed explanation, please see Q3 in the global rebuttal.

Q3: the experiments show that smaller seem to be better

We would like to clarify that a smaller $\tau$ indeed brings the relaxed problem closer to the original problem, potentially yielding more accurate solutions. However, this also tends to increase the difficulty of the optimization process, particularly when considering the extreme case of $\tau=0$ .

Q4: there would not be the problem of odd permutationsd

The issue with odd permutations arises because the special orthogonal group does not include permutation matrices $P$ with $\det(P) =-1$ . We address this problem using Lie group theory; for more details, please refer to Q4 in the global response.

Q5: the reader may be unclear about why special orthogonal group

The full group of orthogonal matrices, $\mathrm{O}(n)$ , is not connected and consists of two connected components. When an initial point is set, Riemannian optimization typically cannot reach the other component. Consequently, most work prefers to utilize the special orthogonal group, $\mathrm{SO}(n)$ , which includes the identity matrix, for parameterization purposes.

Q6: can authors provide results for the case "Known 0%"

We conducte experiments under the Known $0\%$ setting, and the results are as follows.

Known $0\%$	Naive	Gumbel-Sinkhorn	OT4P ( $\tau=0.3$ )	OT4P ( $\tau=0.5$ )	OT4P ( $\tau=0.7$ )
$\log p(Y\mid P)$	<-3000	<-3000	<-3000	<-3000	<-3000
Precision ( $\\%$ )	2.08	1.76	1.36	1.60	1.60

We found that all methods failed. A possible reason is that prior information (constraint matrix) provides an initial point, which is crucial for gradient-based optimization. This experiment highlights the importance of prior information in accurately inferring neuron identities.

Q7: the running time of the method is unclear

As you correctly pointed out, the primary computational costs of OT4P lie in solving the linear assignment problem and performing eigendecomposition, both typically having a time complexity of $\mathcal{O}(n^3)$ . Numerous efforts have been made to accelerate these computations through parallel implementations on GPUs. We have employed existing implementations, specifically torch-linear-assignment and torch.linalg.eig, and found that eigendecomposition tends to be slower. These findings are replicated in the table below.

size	50	100	500	1000	5000
linear assignment problem (s)	0.001	0.002	0.011	0.051	2.384
eigendecomposition (s)	0.028	0.053	0.246	0.798	16.729

While exploring more efficient methods for solving the linear assignment problem and performing eigendecomposition would be valuable, such investigations are beyond the scope of this paper. Nevertheless, we will include a time complexity analysis of OT4P in the manuscript and compare it with previous work.

Q8: the scalability claim does not seem to be true

Although OT4P lacks scalability for matrices larger than 1000, we argue that this can be further improved by implementing a more efficient parallel version of eigendecomposition. Additionally, OT4P has demonstrated potential for stochastic optimization over latent permutations, as discussed in Section 3.2, representing an important extension for probabilistic tasks.

Q9: the optimization problem is not defined clearly

In Equation 14, the function $f(P)$ is defined with respect to the permutation matrix $P$ , which is treated as a random variable drawn from the distribution $P\sim q(P;\theta)$ parameterized by $\theta$ . Here, $\theta$ serves as the optimization variable. Evaluating and differentiating Equation 14 presents challenges because calculating the expectation involves a sum of $n!$ terms. We address this problem efficiently using OT4P combined with the re-parameterization technique.

Q10: is it possible to have an algorithm faster than cubic time

This is an insightful question. In the early stages of developing our idea, we experimented with randomized rounding, which derives permutations based on the action of the orthogonal matrix on random vectors [1]. We also tested a greedy strategy that sequentially sets the largest available element to $1$ in each column. However, preliminary experiments demonstrated the unreliability of these methods, prompting us to abandon them. In conclusion, OT4P would certainly benefit from a faster algorithm specifically designed to solve the linear assignment problem with an orthogonal matrix.

Thank you once again for your time and effort. These valuable discussions will be incorporated into our paper to enhance its quality.

References

[1] Alexander Barvinok. Approximating orthogonal matrices by permutation matrices. arXiv preprint math/0510612, 2005.

评论- Reply

2024-08-12

Dear authors, thank you for your reply to my comments and questions. It has nicely addressed my concerns and has improved my understanding of this paper and linear assignment problems. I have thus increased my score by 1.

评论- Thank you for your reply

2024-08-12

We are pleased that our response has resolved your concerns. We will strengthen the explanations in our paper as per your suggestions to enhance its clarity and quality. Thank you once again for the efforts you have put into reviewing our manuscript and for raising your score.

审稿意见

评分: 6置信度: 42024-07-29

This paper proposes OT4P, a differentiable transformation that relaxes permutations to the orthogonal group. Based on OT4P, the authors propose novel frameworks for deterministic and stochastic optimization on permutation matrices. Numerical experiments demonstrate its efficiency and scalability in permutation matrix optimization tasks.

优点

The paper is generally well written despite a few confusing sentences. The authors highlight the contributions and make a clear comparison with previously known results, and the main idea is interesting. Mathematical ideas are hard to clarify, but I find the paper easy to follow. The numerical experiment supports the claim that OT4P enjoys flexibility, simplicity and scalability on selected tasks.

缺点

Experiments are insufficient and do not fully convince me of the superiority of OT4P. It's hard to convince me that OT4P is valuable in real settings, therefore the impact of this work in practice is questionable.

The experiments in the main text are both synthetic. Why are they important?
Permutation synchronization is only tested on CMU house dataset, and it is only tested to demonstrate the runtime and memory efficiency. You can try more challenging multi-object matching datasets such as Willow[1]. It would be interesting to see if OT4P avoids unreliable local minima and achieves a better matching result, compared to Birkhoff polytope-based methods.
For permutation synchronization, convex relaxation methods are known to be slow, but you are using parallel computation to accelerate. A detailed explanation on this is necessary.

There are a few confusing sentences and some room of improvement in writing.

Line 223: 'is stay' should be 'stays'.
Not sure how to read figure 4. It would be nice to have some explanation.
The manifold \mathcal{M} is an important mathematical object in the paper that relates directly to parameterization in optimization tasks. It is helpful to have more explanation on how it changes with the temperature parameter.

[1] M. Cho, K. Alahari, and J. Ponce. Learning graphs to match. In Proceedings of the IEEE Interational Conference on Computer Vision, 2013.

问题

Do you have any guidance on how to choose the hyperparameter \tau? It seems that \tau=0.5 is optimal for all tasks in the experiment section. Should we just fix \tau=0.5 for practical use?

局限性

Despite several limitations discussed in the appendix, other limitations of this work include:

In real world computer vision problems, people usually have partial permutations instead of permutations, since a keypoint may not be observed in some images. It would be more interesting if this work can be extended to optimization on partial permutations.

作者回复

2024-08-07

We are deeply grateful to Reviewer 3CCZ for the detailed and constructive feedback on our work. We strongly recommend reviewing our global rebuttal first to clarify common concerns. We address your specific questions below.

Q1: the experiments in the main text are both synthetic

We recognize the importance of demonstrating the practical impact of the proposed OT4P method. Indeed, the experiments presented in the main text address new, practical problems involving permutation matrices in the field of machine learning, ranging from deterministic to stochastic optimization. Our OT4P showcases an effective way of tackling these real-world challenges.

The first reason for using synthetic data is the absence of a publicly available and widely used real-world dataset, as most are privately created in specific scenarios. The second reason is that the primary purpose of experiments is to test the theoretical properties of OT4P, including its ability to approximate permutation matrices and its effectiveness in gradient-based optimization. A controlled synthetic setting allows us to isolate the impact of our method from other potential confounding factors found in real-world environments. In conclusion, using synthetic data provides the most rigorous and efficient means of validating the proposed theory.

As you have noted, to further validate the practical applicability of our theory, we have included experiments with real-world datasets in the permutation synchronization task, which are detailed in the Additional experiments of the global rebuttal.

Q2: permutation synchronization is only tested on CMU house dataset

Following your suggestion, we have conducted additional experiments on the WILLOW-ObjectClass dataset [1], incorporating four new baselines. Please see the Additional experiments in the global rebuttal for details.

Q3: convex relaxation methods are known to be slow

The primary computational costs of the proposed OT4P arise from solving the linear assignment problem and performing eigendecomposition, both of which typically scale with $\mathcal{O}(n^3)$ . Given the widespread application of these tools in machine learning, numerous studies have developed their parallel versions, especially optimized for GPUs that are well suited for dense matrix computations [2]. Benefiting from these advancements, OT4P can efficiently handle relaxation matrices at significantly accelerated speeds. We will provide further explanations in the paper.

Q4: not sure how to read figure 4

Figure 4 illustrates the orthogonal matrices obtained in Step II of OT4P. The leftmost image ( $\tau=1$ ) represents the original orthogonal matrices $O$ from Step I, and the rightmost image ( $\tau=0$ ) is the permutation matrix $P$ that is closest to $O$ . The temperature parameter $\tau$ controls how closely the resulting orthogonal matrices $\widetilde{O}$ , obtained in Step II, approach $P$ . As $\tau\to 0$ , the resulting orthogonal matrices $\widetilde{O}$ increasingly converge to $P$ .

Q5: the manifold $\mathcal{M}$ is an important mathematical object

Given the temperature parameter $\tau$ , the manifold $\mathcal{M}:=\\{\mathcal{S}\_{P}^{\prime}\subset \mathrm{O}(n)\mid P\in\mathcal{P}\_n \\}$ consists of submanifolds $\mathcal{S}_{P}^{\prime}$ , where each $\mathcal{S}\_{P}^{\prime}$ encompasses a permutation matrix $P$ . As $\tau$ approaches $0$ , the $\mathcal{S}\_{P}^{\prime}$ contract towards the permutation matrix $P$ , culminating in the degeneration of $\mathcal{S}\_{P}^{\prime}$ to a singleton set $\{P\}$ when $\tau=0$ .

Q6: do you have any guidance on how to choose the hyperparameter $\tau$

The selection of the hyperparameter $\tau$ involves balancing the relaxation of the solution and the difficulty of optimization. A large $\tau$ may lead to overly relaxed solutions, while a small $\tau$ might cause optimization challenges. Setting $\tau=0.5$ is a practical and balanced choice, as this setting positions the resulting orthogonal matrix $\widetilde{O}$ midway between the original orthogonal matrix $O$ and its closest permutation matrix $P$ , in a certain sense.

Q7: people usually have partial permutations instead of permutations

This is indeed a valuable and interesting extension. One possible approach is to consider $n\times k\ (k<n)$ partial permutations as a projection of $n\times n$ total permutations. In this way, we could use OT4P to relax the $n\times n$ permutation matrices and then take the first $k$ columns of the resulting orthogonal matrix as the final output. Essentially, this method involves relaxing partial permutations onto the Stiefel manifold with dimension $n\times k$ . We plan to explore this idea more thoroughly in future work.

We will conduct a thorough review and correction of any typographical errors to avoid potential confusion. We once again thank the reviewers for their constructive feedback, which will significantly enhance the quality and clarity of our paper.

References

[1] Minsu Cho, Karteek Alahari, and Jean Ponce. Learning graphs to match. ICCV, pages 25–32, 2013.

[2] Ketan Date and Rakesh Nagi. Gpu-accelerated hungarian algorithms for the linear assignment problem. Parallel Computing, 57:52–72, 2016.

2024-08-11

I would like to thank the authors for addressing my questions and concerns. It would be good to see them included in the final paper. I am especially satisfied with the results on Willow. Given the outstanding results I raise my score to 6.

评论- Thank you for raising your score

2024-08-11

We are pleased to know that we have addressed your concerns. We will include these valuable discussions and experiments in the final version of the paper to enhance its quality. Thank you once again for the efforts you have put into reviewing our manuscript and for raising your score.

2024-08-13

Could please you update your score? Thanks.

Your AC.

2024-08-13

Just updated the score. Thanks for letting me know.

作者回复

2024-08-07

We sincerely appreciate the reviewers for dedicating their valuable time and effort to thoroughly reviewing our manuscript. Here, we address the common concerns raised and introduce the additional experiments conducted.

Common concerns

Q1: What is the intuition behind Step II of the proposed OT4P?

Given an orthogonal matrix $O$ obtained from Step I, Step II aims to move $O$ toward its closest permutation matrix $P$ to approach $P$ . Analogous to transitioning a point $A$ towards another point $B$ along a straight line $AB$ in Euclidean space, OT4P moves $O$ towards $P$ along a geodesic $OP$ —the generalization of straight lines to the manifold context.

Q2: What's the role of the temperature parameter $\tau$ ?

The temperature parameter $\tau$ may be interpreted as the parameter of the geodesic (inversely), controlling how closely the resulting orthogonal matrix $\widetilde{O}$ , obtained in Step II, approaches the permutation matrix $P$ , with $\widetilde{O}\to P$ as $\tau\to0$ . Specifically, when $\tau=1$ , $\widetilde{O}$ remains equal to $O$ , and when $\tau=0$ , $\widetilde{O}$ becomes $P$ .

Q3: What's wrong if we set $\tau = 0?$

Setting $\tau=0$ is an impractical choice, as it would cause all orthogonal matrices near the permutation matrix $P$ to be mapped directly to $P$ . In such a setting, gradient-based optimization becomes unfeasible because the mapping $\psi_{\tau}(\cdot)$ (corresponding to Step II) turns into a piecewise constant function, whose derivatives are almost everywhere zero.

Q4: How to deal with odd permutations?

To deal with odd permutations, we identify an agent $\widehat{P}$ of the odd permutation $P$ , and utilize Lie group theory to establish an isometry between the neighborhoods of $P$ and $\widehat{P}$ . This approach allows us to move orthogonal matrices along geodesics within $\mathrm{SO}(n)$ towards $\widehat{P}$ , and then restore the results equivalently into the neighborhood of $P$ to approximate $P$ . Importantly, all geodesics calculated in this process remain within $\mathrm{SO}(n)$ , ensuring there is no distortion in the transformation.

Additional experiments

Based on reviewers' feedback, we conducted permutation synchronization experiments on the more challenging WILLOW-ObjectClass dataset, incorporating four new baselines.

The WILLOW-ObjectClass dataset [1] comprises images of five object classes, each containing $10$ equal keypoints of at least $40$ images. For each image, we extract interpolated features from the relu4_2 and relu5_1 layers through a pre-trained VGG16 model on ImageNet. The initial pairwise correspondences are established by applying the Hungarian algorithm to the distance matrices of features.

We select the following algorithms as baselines:

Reg: Optimizes in Euclidean space with a regularization term $\sum_j (\sum_j P_{i,j} - 1)^2$ that encourages each column to sum to $1$ .
OrthReg [2]: Optimizes over the (special) orthogonal group, using a regularization term $\frac{2}{3}\mathrm{trace}(P^{T}(P-P\circ P))$ ( $\circ$ is element-wise product) to force the orthogonal matrix to converge to permutation matrices.
RiemanBirk [3]: Optimizes on Birkhoff polytope utilizing Riemannian gradient descent.
Sinkhorn [4]: Optimizes on the Birkhoff polytope, using the Sinkhorn operator to adjust positive matrices into doubly stochastic matrices.

Unless otherwise stated, all algorithms employ the Adam optimizer for $100$ iterations, with RiemanBirk utilizing Riemannian Adam. The initial learning rates are tuned within the set $\\{0.1, 0.01, 0.001, 0.0001\\}$ .

We generate problem instances of varying sizes and report the average results from five runs in Figure 1 (see PDF). RiemanBirk and Sinkhorn demonstrate poorer performance. A primary reason is that both methods relax permutations in the Birkhoff polytope, leading to unreliable local minima and preventing optimal solutions. Benefiting from the potential advantages offered by the orthogonal group, OrthReg generally produces competitive results. However, due to the instability of its regularization term, OrthReg sometimes underperforms, which may necessitate careful adjustment of the regularization coefficient for each class. In contrast, our proposed OT4P consistently outperforms other methods and demonstrates robustness to variations in hyperparameters $\tau$ .

We take the $\ell_1$ Distance to assess the "permutationness" of the final matrix. Specifically, we round the matrix $O$ , returned by the algorithms, to its closest permutation matrix $P$ , and then calculate the $\ell_1$ Distance between $O$ and $P$ . Table 1 (see PDF) lists the results for the problem instances corresponding to the largest size (multiples of $5$ ) in each object class. We observe that the relaxation extent of Sinkhorn is unstable. Unlike them, ОТ4Р consistently maintains smaller distances in almost all cases and exhibits a positive correlation with changes in the hyperparameter $\tau$ .

We compare the results of different optimizers in Table 2 (see PDF), selecting the largest problem instances (multiples of $5$ ) for each object class. Methods based on the Birkhoff polytope show notable performance improvements on most datasets when using (Riemannian) SGD. For our proposed OT4P, the choice of optimizer appears to be less critical, as it consistently outperforms other methods regardless.

References

[1] Minsu Cho, Karteek Alahari, and Jean Ponce. Learning graphs to match. ICCV, pages 25–32, 2013.

[2] Michael M Zavlanos and George J Pappas. A dynamical systems approach to weighted graph matching. Automatica, 44(11):2817–2824, 2008.

[3] Gary Becigneul and Octavian-Eugen Ganea. Riemannian adaptive optimization methods. ICLR, 2019.

[4] Gonzalo Mena, David Belanger, Scott Linderman, and Jasper Snoek. Learning latent permutations with gumbel-sinkhorn networks. ICLR, 2018.

2024-08-08

I thank the authors for the clarifications and additional experiments including comparisons with the baselines. I like to suggest one correction: I could find nothing in [3] regarding either the Birkhoff polytope or the permutation synchronization problem. Moreover, the Birkhoff polytope implementation in geoopt is due to Birdal & Simsekli CVPR'19, which addresses the permutation synchronization problem by optimizing over the Birkhoff polytope using its Riemannian characterization: https://github.com/geoopt/geoopt/blob/master/geoopt/manifolds/birkhoff_polytope.py.

Hence, I think the correct citation for RiemannBirk is: Birdal, Tolga, and Umut Simsekli. "Probabilistic permutation synchronization using the riemannian structure of the birkhoff polytope." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.

I will keep my initial recommendation of acceptance.

评论- Thank you for the correction

2024-08-08

I apologize for any confusion caused. The RiemanBirk we used is indeed from https://github.com/geoopt/geoopt/blob/master/geoopt/manifolds/birkhoff_polytope.py, which is the method proposed and implemented by Birdal & Simsekli in CVPR'19. We will make the correct citation in our paper.

2024-08-11

Dear reviewers,

Could you please respond to the rebuttal, discuss with authors and finalize your score?

最终决定Accept (poster)

2024-09-25

The paper introduces OT4P, an innovative approach that relaxes permutation matrices onto the orthogonal group, enabling continuous optimization for permutation-related tasks. Reviewers found the concept both novel and well-executed, offering a promising new direction in this challenging area. While some concerns were raised about the clarity of certain design choices and the need for more extensive experimental validation, the method's potential impact and the soundness of its theoretical foundation make it a valuable contribution. Therefore, I recommend accepting this paper.

OT4P: Unlocking Effective Orthogonal Group Path for Permutation Relaxation

摘要

评审与讨论

优点

缺点

问题

局限性

Q1: how do the authors implement rounding

Q2: it would be beneficial to see the 'permutationness'

Q3: why constrain ourselves to the orthogonal group

Q4: I'm uncertain about the transformation of odd permutations

Q5: permutation synchronization

Q5: τ→0\tau\to 0τ→0 can cause numerical issues

Q6: why is this approach preferable to entropy-regularized optimal transport

References

优点

缺点

问题

局限性

Q1: what's the role of the temperature parameter τ\tauτ?

Q2: what's wrong if we just set it to 0 and we just want the closest permutations?

Q3: the experiments show that smaller seem to be better

Q4: there would not be the problem of odd permutationsd

Q5: the reader may be unclear about why special orthogonal group

Q6: can authors provide results for the case "Known 0%"

Q7: the running time of the method is unclear

Q8: the scalability claim does not seem to be true

Q9: the optimization problem is not defined clearly

Q10: is it possible to have an algorithm faster than cubic time

References

优点

缺点

问题

局限性

Q1: the experiments in the main text are both synthetic

Q2: permutation synchronization is only tested on CMU house dataset

Q3: convex relaxation methods are known to be slow

Q4: not sure how to read figure 4

Q5: the manifold M\mathcal{M}M is an important mathematical object

Q6: do you have any guidance on how to choose the hyperparameter τ\tauτ

Q7: people usually have partial permutations instead of permutations

References

Common concerns

Q1: What is the intuition behind Step II of the proposed OT4P?

Q2: What's the role of the temperature parameter τ\tauτ?

Q3: What's wrong if we set τ=0?\tau = 0?τ=0?

Q4: How to deal with odd permutations?

Additional experiments

References

Q5: $\tau\to 0$ can cause numerical issues

Q1: what's the role of the temperature parameter $\tau$ ?

Q5: the manifold $\mathcal{M}$ is an important mathematical object

Q6: do you have any guidance on how to choose the hyperparameter $\tau$

Q2: What's the role of the temperature parameter $\tau$ ?

Q3: What's wrong if we set $\tau = 0?$