$SE(3)$ Equivariant Ray Embeddings for Implicit Multi-View Depth Estimation
Abstract: We propose an $SE(3)$ equivariant model with spherical harmonics ray embeddings and demonstrate its effectiveness in the task of generalized stereo depth estimation.
Reviews and Discussion
This paper introduces an SE(3)-equivariant multi-view depth estimation model based on the Perceiver IO framework. Specifically, each feature ray is treated as a token, and the feature vector of each ray is concatenated with an equivariant positional embedding. To achieve equivariance, the authors propose using spherical harmonics to encode the ray poses. Ray features are treated as type-0 (rotation-invariant) irreps. These equivariant ray encodings are processed through several equivariant self-attention layers and aggregated into global features and a canonical reference frame. The camera pose encoding is first inverse-transformed into this inferred canonical frame, resulting in an SE(3)-invariant query. A series of cross-attention layers between the encoded global features and the query features is then used to predict per-pixel depth. The authors demonstrate the effectiveness of the proposed approach on the ScanNet and DeMoN datasets.
Strengths
- To the best of the reviewer's knowledge, this is the first paper to address SE(3)-equivariant positional embeddings in the transformer/Perceiver IO framework for multi-view applications. While Fuchs et al. [1] and Liao et al. [2,3] have addressed SE(3)-equivariant attention for GNNs, their methods are more complex and less computationally efficient than the proposed approach.
- The proposed method shows competitive benchmark results compared to state-of-the-art methods across multiple datasets. The ablation study convincingly demonstrates the significance of the equivariant embedding.
- In the appendix, the authors put significant effort into making the concepts accessible to beginners, including detailed visualizations of how the computations are done. This contrasts with typical papers on SE(3)-equivariance, which often include difficult equations that can be a barrier to entry for newcomers.
[1] Fuchs et al., “SE(3)-Transformers: 3D Roto-Translation Equivariant Attention Networks,” NeurIPS'20
[2] Liao et al., “Equiformer: Equivariant graph attention transformer for 3d atomistic graphs,” ICLR’23
[3] Liao et al., “Equiformerv2: Improved equivariant transformer for scaling to higher-degree representations,” ICLR’24
Weaknesses
- The authors introduced a new equivariant nonlinearity inspired by [4], but the motivation and benefits are not clearly demonstrated. What is the distinctive advantage of this new nonlinearity, compared to existing SE(3)-equivariant nonlinearities?
- The number of parameters was not fixed during the ablation experiments regarding the maximum spherical harmonics degree. A recent study [5] claimed that the reported increase in performance from incorporating higher-type irreps in various works could actually be due to the increased number of parameters. It is essential to control the number of parameters to be similar between the ablated models and the proposed model.
[4] Deng et al., "Vector Neurons: A General Framework for SO(3)-Equivariant Networks,” ICCV’21
[5] Wang et al., “Rethinking the Benefits of Steerable Features in 3D Equivariant Graph Neural Networks,” ICLR’24
Questions
- According to the equations in Appendix F, different types of irreps do not mix in the self-attention layer. They also do not mix in the proposed nonlinearities in Appendix F. It seems like, in the proposed method, each of the irreps (except for type-0) can only indirectly modulate other irreps of different types via attention. Am I correct?
- Subtracting the centroid of the camera centers is not stable under the addition or removal of cameras. Is it possible to use relative positional encoding, similar to rotary embeddings, to achieve translational equivariance without relying on centroid subtraction?
Limitations
- Spherical harmonics can only encode the orientation of a vector, not its length. Therefore, typical SE(3)-equivariant networks address this by incorporating an additional length encoding. However, in this paper, the distance information is discarded.
- Using irreps features inevitably introduces a band-limit. Increasing this band-limit is difficult because the feature dimension increases quadratically, which is also discussed by the authors.
- The authors also mentioned that higher-degree spherical harmonics caused instability in training. However, this might be due to the choice of equivariant nonlinearities. Liao et al. [3] reported that certain nonlinearities cause instability in training.
W1. The authors introduced a new equivariant nonlinearity inspired by [4], but the motivation and benefits are not clearly demonstrated. What is the distinctive advantage of this new nonlinearity, compared to existing SE(3)-equivariant nonlinearities?
Unlike norm and gate nonlinearities, our nonlinearity can change the direction of a feature by reducing or keeping its projection onto an equivariant subspace. Meanwhile, compared with nonlinearities that change feature direction by applying Fourier and inverse Fourier transforms [1, 2], ours is more computationally efficient.
[1] Weiler and Cesa, "General E(2)-Equivariant Steerable CNNs," NeurIPS 2019
[2] Poulenard and Guibas, "A Functional Approach to Rotation Equivariant Non-linearities for Tensor Field Networks," CVPR 2021
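For concreteness, below is a minimal sketch (our own illustration, not the paper's exact implementation) of a direction-changing nonlinearity in the spirit of Vector Neurons [4], operating on type-1 (vector) features; the learned direction map and all shapes are assumptions made for the example:

```python
import torch
import torch.nn as nn

class VNStyleNonlinearity(nn.Module):
    """Sketch of a Vector-Neurons-style nonlinearity [4]: it clips the
    component of each vector feature along a learned direction, so the
    feature's direction can change while SO(3) equivariance is preserved."""
    def __init__(self, channels: int):
        super().__init__()
        # mixing acts only on the channel axis, so it commutes with rotations
        self.dir = nn.Linear(channels, channels, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., C, 3) type-1 features; rotations act on the last axis
        d = self.dir(x.transpose(-1, -2)).transpose(-1, -2)  # learned directions
        dot = (x * d).sum(-1, keepdim=True)                  # rotation-invariant
        d_sq = (d * d).sum(-1, keepdim=True).clamp(min=1e-8)
        # keep x where it agrees with d; otherwise remove the negative projection
        return torch.where(dot >= 0, x, x - (dot / d_sq) * d)
```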
W2. The number of parameters was not fixed during the ablation experiments regarding the maximum spherical harmonics degrees. A recent study [5] claimed that the reported increase in performance due to incorporating higher-type irreps in various works could actually be due to the increased number of parameters. It is essential to control the number of parameters to be similar between the ablated models and the proposed model.
Thank you for the great suggestion. We agree that this is an essential variable to control in these experiments. Thus, we include here additional ablation experiments on the ScanNet benchmark that vary the highest order of the spherical harmonic embeddings while increasing the number of parameters of the ablated models to match the parameter count of our model. These were conducted using a short training schedule of 100K training steps, and the results are shown in Table 1 of the rebuttal PDF. We observe a similar trend as reported in our submission (Table 3), showing that these improvements do not come from increased model complexity. We will include these additional results in the camera-ready version of the paper.
Q1. According to the equations in Appendix F, different types of irreps do not mix in the self-attention layer. They also do not mix in the proposed nonlinearities in Appendix F. It seems like, in the proposed method, each of the irreps (except for type-0) can only indirectly modulate other irreps of different types via attention. Am I correct?
You are correct. When generating key, query, and value features, we do not mix different feature types. When calculating the attention matrix, as shown in Appendix F, we use all types of features, which lets them indirectly modulate irreps of other types.
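A minimal sketch of this type-wise attention pattern (simplified and hypothetical, not the exact equations of Appendix F): the Q/K/V projections act per irrep type, and only the shared, invariant attention logits couple the types.

```python
import torch
import torch.nn as nn

class TypewiseAttention(nn.Module):
    """Simplified sketch: Q/K/V projections never mix irrep types; the
    invariant attention logits sum contributions from all types, so types
    modulate one another only through the shared attention weights."""
    def __init__(self, dims):  # dims[l] = number of channels of the type-l block
        super().__init__()
        self.q = nn.ModuleList(nn.Linear(c, c, bias=False) for c in dims)
        self.k = nn.ModuleList(nn.Linear(c, c, bias=False) for c in dims)
        self.v = nn.ModuleList(nn.Linear(c, c, bias=False) for c in dims)

    def _proj(self, lin, x):
        # channel-mixing linear map on (N, c, 2l+1) features; equivariant
        # because it leaves the irrep (last) axis untouched
        return lin(x.transpose(-1, -2)).transpose(-1, -2)

    def forward(self, feats):
        # feats[l]: (N, c_l, 2l+1) type-l features for N tokens
        logits = sum(
            torch.einsum('icm,jcm->ij',
                         self._proj(self.q[l], x), self._proj(self.k[l], x))
            for l, x in enumerate(feats))     # invariant: Wigner-D is orthogonal
        attn = logits.softmax(dim=-1)         # one attention matrix, shared by all types
        return [torch.einsum('ij,jcm->icm', attn, self._proj(self.v[l], x))
                for l, x in enumerate(feats)]
```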
Q2. Subtracting the mean of the center is not stable under the addition or removal of camera points. Is it possible to use relative positional encoding, similar to rotary embedding, to achieve translational equivariance without relying on centroid subtraction?
We thank the reviewer for pointing that out. When there are two cameras (the primary setting explored in our experiments), the relative position is exactly twice our current translation (the translation after subtracting the centroid). We use spherical harmonics for our current translational positional embedding, which is equivariant to rotations, while rotary embeddings are not. For multiple cameras, introducing a relative positional encoding would be a good idea; however, this approach introduces the problem of permutation equivariance of the camera order in the Perceiver IO architecture (the index of the camera impacts the index of the relative translation). We believe there is a more elegant solution to the translation equivariance problem, and hope to look into it in future work.
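To make the two-camera observation concrete, a tiny numeric check (our own illustration) that the relative translation is exactly twice the centered translation:

```python
import torch

# With two cameras, centering gives t1' = (t1 - t2)/2 and t2' = (t2 - t1)/2,
# so the relative position t1 - t2 is exactly twice the centered translation.
t = torch.randn(2, 3)                  # camera centers t1, t2
centered = t - t.mean(dim=0)           # translation-invariant representation
assert torch.allclose(t[0] - t[1], 2 * centered[0])
```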
L1. Spherical harmonics can only encode the orientation of a vector, not its length. Therefore, typical SE(3)-equivariant networks address this by incorporating an additional length encoding. However, in this paper, the distance information is discarded.
As stated in Section 3.3.1, we incorporate an additional radial component into the original order-$l$ spherical harmonics of the corresponding degree. Despite this adaptation, these functions retain their fundamental characteristics and are still referred to as spherical harmonics. We provide an introduction and discussion of spherical harmonics in Appendix A.4, as well as the design for incorporating the invariant length.
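As an illustration of the general idea (a sketch under our own assumptions, not the paper's exact radial design), degree-1 spherical harmonics scaled by rotation-invariant radial functions preserve length information while remaining equivariant:

```python
import torch

def radial_sph_embed(x: torch.Tensor, num_freqs: int = 4, eps: float = 1e-8):
    """Illustrative sketch: scale the degree-1 real spherical harmonics
    (the unit direction, up to a constant) by invariant radial functions,
    so the length of x is not discarded."""
    r = x.norm(dim=-1, keepdim=True)                      # invariant length
    y1 = x / r.clamp(min=eps)                             # degree-1 SH direction
    radial = torch.cat([torch.sin(2.0**k * r) for k in range(num_freqs)], dim=-1)
    # outer product: each invariant radial channel scales the equivariant direction
    return radial.unsqueeze(-1) * y1.unsqueeze(-2)        # (..., num_freqs, 3)
```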
L2. Using irreps features inevitably introduces a band-limit. Increasing this band-limit is difficult because the feature dimension increases quadratically, which is also discussed by the authors.
Thank you for the insightful comment. Yes, the dimensionality of the spherical harmonics is a limitation of our approach. That is why we address it in the decoder by predicting an equivariant frame followed by a conventional decoder, which enables higher-frequency embeddings for the query.
L3. The authors also mentioned that higher-degree spherical harmonics caused instability in training. However, this might be due to the choice of equivariant nonlinearities. Liao et al. [3] reported that certain nonlinearities cause instability in training.
Thank you for the insightful remark. We have indeed observed the same behavior stated in [3] regarding the instability of the $S^2$ activation, which treats the features as Fourier coefficients of functions on $S^2$ and processes scalars and higher-order features together. They addressed this behavior by separating the scalars from the higher-order features. On the other hand, our approach separates the different orders of features and processes them independently (as mentioned in Q1). Replacing our nonlinearity with the norm nonlinearity did not stabilize training, suggesting that this instability is not solely due to the choice of nonlinearity.
Thank you for the clarification. I have raised my score as the major concerns have been addressed.
[Summary of Concerns Resolved by the Rebuttal]
- Lack of motivation for the new nonlinearity => Addressed with explanations of computational efficiency and training stability.
- Number of parameters not controlled in the ablation study => New experiments in the rebuttal PDF control the number of parameters.
This paper introduces a ray embedding representation with rotational and translational equivariance, integrating the existing Perceiver IO architecture to achieve robust multi-view implicit depth estimation. The paper first uses mean subtraction and spherical harmonics to achieve translational equivariance, then builds on this to use spherical harmonics for a rotationally equivariant representation, ultimately combining the two into an equivariant three-dimensional transformation embedding. By further designing equivariant encoders and decoders, the paper achieves robust depth estimation from new viewpoints. Experiments on the ScanNet and DeMoN datasets demonstrate the effectiveness of the proposed method.
Strengths
- The motivation is clear, the algorithm design makes sense, and the experimental results are complete.
Weaknesses
- Ablation study: Since the equivariance consists of two parts, namely translation and rotation, what would be the qualitative and quantitative impact of removing these two parts respectively?
Questions
- The task setting of implicit depth estimation seems very compatible with existing sparse-view NeRF/GS methods. Although the focus of the two differs, with NeRF/GS focusing more on rendering images while DeFiNe and EPIO mainly focus on geometry, there is a possibility of mutual exchange between the two. Can you report comparative results against such methods, for example ENeRF?
Lin, et al. "Efficient Neural Radiance Fields for Interactive Free-viewpoint Video", SIGGRAPH-ASIA 2022.
- DeFiNe can synthesize novel view images; can EPIO do the same? What are the results like?
Limitations
See Questions.
W1. Since the equivariance consists of two parts, namely translation and rotation, what would be the quantitative impact of removing these two parts respectively?
Thank you for the valuable suggestion. We conducted an ablation study in which we individually integrated only rotation equivariance or only translation equivariance into the model. These experiments used a short training schedule of 100K training steps due to time constraints. The results, displayed in Table 1 of the rebuttal PDF, show that models without translation or rotation equivariance perform worse than our complete model.
Q1/2. The task setting of implicit depth estimation seems to be very compatible with the existing sparse view NeRF/GS methods. DeFiNe can synthesize novel view images, can EPIO do the same?
Thank you for pointing that out; this is a great direction that we would like to explore in follow-up work. EPIO and DeFiNe were designed for "single-query" predictions, without relying on volumetric rendering (like NeRF) or explicit 3D structures (like 3DGS). Single-query novel view synthesis is challenging, especially in the generalizable setting, and DeFiNe itself does not report results on this task, but rather shows that jointly learning novel view synthesis and depth estimation can marginally improve depth estimation. Our proposed EPIO architecture could be extended to novel view synthesis, especially because it accepts traditional non-equivariant decoders (i.e., from DeFiNe), and we indeed show results on a toy novel depth synthesis experiment in Appendix O.2. Unfortunately, extending EPIO to an additional task and retraining the model is impractical within the rebuttal period, but we aim to provide initial results (similar to the novel depth synthesis experiments) in the camera-ready version.
Having said that, there is a follow-up work to DeFiNe (DeLiRa) [1] which explores the use of Perceiver IO for volumetric rendering, focusing on novel view synthesis with depth estimation guidance. We believe EPIO could be used in this setting as well, and that would be a very interesting extension.
[1] Guizilini et al. “DeLiRa: Self-Supervised Depth, Light, and Radiance Fields”, ICCV 2023.
Thanks for the response.
This paper presents an SE(3) rotationally and translationally equivariant variant of Perceiver IO for multi-view depth estimation with known camera poses. The authors first encode both the pixel-wise ray direction and the camera translation using spherical harmonics as the positional encoding; then, to maintain equivariance under global transformations through the network's forward pass, they modify several components, including the linear projection, the latent array construction, and the output decoding. To demonstrate the effectiveness of the proposed method, the authors conducted experiments on several RGB-D datasets, including ScanNet, SUN3D, TUM-RGBD, and Scenes11, and achieved better performance than existing implicit multi-view depth estimation methods, such as DeFiNe, and multi-view stereo (MVS) models, such as DPSNet.
Strengths
- The authors introduce the problem well, explaining the importance of equivariance to the task of multi-view depth estimation effectively. They also provide a brief yet sufficient review of existing works, clearly positioning this work within the field.
- The authors have carefully designed several novel equivariant components:
- An SE(3) equivariant positional encoding, where besides rotation, the authors smartly encode the camera translation also using spherical harmonics.
- An equivariant linear projection layer where the linear projection is applied to each group of features that corresponds to position embedding derived from the spherical harmonics of a specific order.
- Equivariant latent array construction and the reversal of the rotation from the latent array before being cross-attended to the output queries.
These designs, along with the adoption of existing equivariant components throughout the Perceiver IO pipeline, ensure good performance and can be inspiring for other tasks that require equivariance.
- The experiments are sufficient and demonstrate the equivariance of the output and the overall accuracy.
Weaknesses
The major weakness of this paper lies in its presentation and organization, which makes the paper difficult to read:
- Many important details from Sections 3.4 to 3.6 are placed in the appendix, making the main paper not self-contained. For instance, details in Appendices A.3 and E would be better suited to the main paper.
- Sections 3.4 to 3.6 are organized into fragmented components, where the holistic process of the Perceiver IO is missing. Specifically, the authors should introduce each modification in the order of the Perceiver IO pipeline.
- The descriptions of individual components are also confusing:
- It would be better to only briefly discuss components that are inherently equivariant, such as attention, and to discuss only how the input to the attention is made equivariant, such as the latent array in Section 3.5.1. Otherwise, it might be misleading to suggest that the attention modules themselves are new equivariant modules.
- Why is only rotation sampled and encoded when constructing the latent array in Section 3.5.2 and Figure 4, while the inputs have the encoded camera translation?
- Similarly, in Section 3.6, only reverse rotation is applied to the latents after several self-attention transformation blocks, while the translation is omitted.
- Lines 261-262: "which allows us to leverage higher frequency information beyond the dimensional constraints of SPH." The authors indicate that the Fourier encoding is not equivariant but use it for the output query; therefore, they should elaborate more on the insight behind this choice and provide sufficient proof to support this design.
- Many illustrations (Figures 8-11) in the appendix are confusing and do not help to clarify the equations.
Questions
Please refer to the weaknesses.
Limitations
The limitations have been sufficiently discussed in Appendix Q.
W1. Many important details from Sections 3.4 to 3.6 are placed in the appendix, making the main paper not self-contained.
We appreciate and thank the reviewer for the valuable feedback. We organized the paper this way not only due to limited space but also because we want readers, especially those without a prior background in equivariance, to have a clear understanding of our main proposed components. Thus, we leave standard components (e.g., the equivariant linear and normalization layers in Appendix A.3) in the appendix, where readers unfamiliar with the field can gain a better understanding through visualizations and more detailed analysis and proofs.
Having said that, we agree that a more balanced structure might be better, and we will move more technical details and visualizations from Appendix E to the main paper for the camera-ready version.
W2. Sections 3.4 to 3.6 are organized into fragmented components, where the holistic process of the Perceiver IO is missing.
Thank you for the suggestion. We introduce the fundamental attention layers in Section 3.4 before discussing the encoder in Section 3.5 and the decoder in Section 3.6 because attention mechanisms are essential components of both modules. By first explaining these fundamental operations, we aim to help readers build an understanding from the basics up to the overall, more complex structures.
If this approach is causing confusion, we propose to instead first give a brief overview of our holistic method using the illustration in Figure 2, then present the input design of the encoder, followed by an introduction of the encoder, the "canonicalization" of the encoder output to serve as input to the decoder, the design of the decoder, and finally the prediction process. We believe this new order would create a better flow that more closely approximates the holistic nature of the Perceiver IO architecture, from input to output.
W3. It is better to only briefly discuss components that are equivariant itself, such as attention, and discuss only how they made the input to the attention equivariant.
Thank you for the suggestion. The reason we provide an overview of the attention mechanism and compare our module structure with previous work is to emphasize the equivariant operation of the proposed module and to help readers without a relevant background better understand why the conventional operation is not equivariant while ours is.
However, to avoid any potential confusion, we will move more details from Section 3.4 (especially the descriptions of previous works) to the appendix for reference, so the focus is on our contributions. We will also move Appendix I to Section 3.5.1 to provide more information about the input construction for the encoder.
W4. Why is only rotation sampled and encoded when constructing the latent array in Section 3.5.2 and Figure 4, while the inputs have the encoded camera translation?
The translations of the two cameras are opposites because we subtract the mean of the translations to achieve translational invariance. As discussed in Section 3.5.2, we need to average the PEs (positional encodings) of the two cameras to obtain the latent array. Since averaging the PEs of two opposite translations can produce zero values, we discard the translation PE in the latent array construction. Note that we keep the spherical harmonics PE for translation in the inputs, as a way to preserve translational information.
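A minimal numeric illustration (our own, for the degree-1 case) of why averaging the translation PEs of two centered cameras degenerates:

```python
import torch

t = torch.randn(3)
# After centroid subtraction the two camera translations are t and -t.
# Degree-1 spherical harmonics are odd: Y1(-t) = -Y1(t), so averaging the
# two embeddings yields exactly zero and the PE carries no information.
y1 = lambda v: v / v.norm()
mean_pe = 0.5 * (y1(t) + y1(-t))
assert torch.allclose(mean_pe, torch.zeros(3))
```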
W5. In Section 3.6, only reverse rotation is applied to the latents after several self-attention transformation blocks, while the translation is omitted.
As stated in Section 3.3.2, after we subtract the cameras' central position, our model becomes translationally invariant. Because the hidden features are now rotationally equivariant and translationally invariant, we apply the reverse rotation to the latent array to make it rotationally invariant. Note that, when we provide query cameras to the decoder, their positions also have the centroid subtracted, which keeps the whole model translationally invariant.
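For intuition, a toy sketch (type-1 features only, with hypothetical shapes) of why applying the inverse of an equivariantly predicted frame yields invariant latents:

```python
import torch

# If a latent v and a predicted frame R_can are both equivariant
# (v -> R v, R_can -> R R_can under a global rotation R), then
# R_can^T v is invariant and can feed a conventional decoder.
def canonicalize(v, R_can):
    return R_can.transpose(-1, -2) @ v

R = torch.linalg.qr(torch.randn(3, 3)).Q          # a random global rotation
v, R_can = torch.randn(3, 1), torch.linalg.qr(torch.randn(3, 3)).Q
assert torch.allclose(canonicalize(R @ v, R @ R_can),
                      canonicalize(v, R_can), atol=1e-5)
```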
W6. The authors indicate that the Fourier encoding is not equivariant but use it for the output query.
As stated in L334-339, a limitation of the use of spherical harmonics (SH) is that the dimensionality grows linearly (2x) with increasing orders, which constrains the highest frequency that can be used in the positional encodings. Therefore, we instead learn an equivariant frame of reference designed to make the input to the decoder invariant, which enables us to use traditional decoders without SH, which can then reach higher frequencies. It is important to note that, even though we use traditional Fourier positional encodings for the decoder, the model is still equivariant, since the input to the decoder is invariant by design. The theoretical proof that guarantees an invariant input to the decoder is provided in Appendix J. There is also an ablation study in Table 3, where "EquiDecoder" indicates the use of an equivariant decoder with equivariant positional encoding for the query; our design outperforms the equivariant decoder, which supports the effectiveness of this choice.
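A sketch of this decoder input construction (function names and shapes are hypothetical): the query is first mapped into the predicted canonical frame, after which conventional Fourier features can be used without breaking equivariance.

```python
import torch

def fourier_pe(x: torch.Tensor, num_freqs: int = 8) -> torch.Tensor:
    # conventional (non-equivariant) Fourier features; safe to use here
    # because the input has already been mapped into the canonical frame
    freqs = 2.0 ** torch.arange(num_freqs)
    ang = x.unsqueeze(-1) * freqs                 # (..., D, F)
    return torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(-2)

def invariant_query(x: torch.Tensor, R_can: torch.Tensor, t_can: torch.Tensor):
    # R_can, t_can: frame predicted by the encoder (equivariant outputs), so
    # under a global motion (R, t): R_can -> R R_can, t_can -> R t_can + t
    x_canon = R_can.T @ (x - t_can)               # invariant by construction
    return fourier_pe(x_canon)
```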
W7. Many illustrations (Figures 8-11) in the appendix are confusing.
Thank you for pointing that out. We have enhanced these figures and included the updated versions in the rebuttal PDF (Figures 1, 2, 3, and 4). Any additional feedback would be highly appreciated, so we can further improve the quality of our submission.
Dear authors, thank you for the rebuttal, which clearly addressed my concerns.
Thank you for the reply, and for taking the time to consider and analyze our rebuttal. If it has clearly addressed your concerns, would you mind raising your score accordingly? We are also happy to answer any other questions you might have in the meantime.
Dear Reviewer KQ34,
Thank you for reviewing this work. Would you mind checking the authors' feedback to see whether it resolves your concerns, or whether you have further comments?
Best wishes,
AC
We sincerely thank the reviewers for the valuable comments and positive feedback regarding our submission.
As mentioned by reviewer xRYu, we are the “first to address SE(3)-equivariance in the transformer-based Perceiver IO architecture for multi-view applications.” They also praise the significant effort put into our appendix, in an attempt to make the complex topic of neural network equivariance accessible to beginners. Reviewer KQ34 states that our proposed method “can be inspiring for other tasks that require equivariance”, and that we “introduce the problem well, effectively explaining the importance of equivariance to the task of multi-view depth estimation”. Reviewer QCgu mentions that our general multi-view architecture can be extended beyond depth estimation to also improve novel view synthesis, which is an exciting direction for future work that we will explore. All reviewers equally praise the motivation behind our work, our algorithmic and design contributions, which include several novel equivariance components, and our state-of-the-art results in multi-view depth estimation, evaluated in multiple benchmarks, as well as convincing ablation studies.
We address each point raised by the reviewers in their respective replies and will include all proposed modifications and additional experiments in the revised version of our manuscript. In particular, reviewer KQ34 mentioned that our appendix contains important information that should be included in the main paper to improve reader experience. We are committed to balancing the main paper and appendix in order to give readers a fluent understanding of our motivation and contributions, as well as improving figures to help those not familiar with fundamental equivariance concepts. We also would like to emphasize that reviewer QCgu awarded us with an Excellent score for presentation.
This paper introduces a ray embedding representation with rotational and translational equivariance, integrating the existing Perceiver IO architecture to achieve robust multi-view implicit depth estimation. The main concerns were about the motivation and the ablation studies, and the authors resolved them well in the rebuttal. The AC recommends acceptance of this paper.