PaperHub
7.3 / 10
Poster · 4 reviewers
Scores: 5, 4, 4, 5 (min 4, max 5, std 0.5, mean 4.0)
Confidence
Novelty 2.3 · Quality 2.8 · Clarity 2.8 · Significance 3.0
NeurIPS 2025

Bridging Equivariant GNNs and Spherical CNNs for Structured Physical Domains

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29
TL;DR

This paper introduces and evaluates G2Sphere, a general method for mapping object geometries to spherical signals using equivariant neural networks and the Fourier Transform.

Abstract

Keywords
Equivariance · Geometric · Fourier · Spherical Signals · SO(3) · Radar · Policy Learning

Reviews and Discussion

Official Review
Rating: 5

The paper proposes G2Sphere, a method to build efficient equivariant architectures able to output very high-resolution spherical signals. This is achieved by 1) gradually increasing the spherical frequency of the features in the decoder and 2) densely sampling the resulting features on the sphere and applying an MLP pointwise on each location. The paper demonstrates the benefit and the efficiency of the method on a variety of tasks.

Strengths and Weaknesses

The manuscript is well written. While the main idea is in some sense incremental, it is well motivated and supported by many experiments. Overall, the proposed method seems solid.

I only have two small doubts. First, it is not entirely clear to me why the output signal is defined over a spherical surface in the Drag Prediction dataset. Second, I feel the zero-shot super-resolution argument is not so different from interpolation. See my questions below for more details. I would appreciate it if the authors could comment on these points.

Questions

It's not very clear to me how the 3D shapes are mapped to the radar signal in the Radar dataset. Similarly, why is the output signal defined over a sphere rather than over the mesh surface in the Drag Prediction dataset?

Line 249: why use global average pooling? While I understand why this works, this choice still feels like it might discard a lot of important spatial information. Isn't there a better choice? Maybe something attention-based, or perhaps preserving a low-resolution spatial dimension?

Lines 285-290: by training on lower-resolution data, isn't the model simply going to learn only the low-frequency components of the signals? In this sense, the zero-shot super-resolution is, in principle, not much more interesting than interpolation, right?

Fig. 4 is a bit hard to read. You could consider increasing the quality of the figure (e.g., the font sizes and the resolution) and including a more informative caption.

Tab 2: could you report the standard deviation over the 50 initializations?

Tab. 3: why is the NE-G2S much faster than G2S?

TSNL: the output of the pointwise MLP applied on the spherical signal can have unbounded maximum frequency, as stated in the paper. The authors then compute the Fourier Transform of the resulting signal, which was densely sampled on the sphere. However, because the resulting signal has unbounded frequency, a Fourier Transform with finite samples cannot be exact and will always introduce some artifacts, thereby breaking the exact equivariance of this layer. This is in contrast with what the authors claim in lines 727 and 730. Could the authors comment on this?

Limitations

Yes

Justification for Final Rating

The authors addressed all my concerns so I increased my score accordingly.

Formatting Concerns

N/A

Author Response

Thank you for your review. We are working on a revision to incorporate your feedback, but we address your comments and questions here as well.


“It's not very clear to me how the 3D shapes are mapped to the radar signal in the Radar dataset. Similarly, why is the output signal defined over a sphere rather than over the mesh surface in the Drag Prediction dataset?”

For the radar dataset, the 3D shape is processed as a point cloud and encoded using a GNN. This is then decoded to a spherical radar signal.

To clarify, for all three applications, we are not projecting the mesh onto the sphere. Rather, we are modeling the directional dependency of each output (radar signal, drag prediction, control policy) as a spherical (angle-dependent) function. These are inherently global functions defined over the sphere of directions, not over the local mesh surface. In the radar setting, the radar signature depends on the sensor's viewing angle on the object; in the drag prediction task, the drag value depends on the direction of the airflow impacting the object; and in the control policy learning task, the policy value depends on the agent's motion angle. If our explanation is still unclear, we would be happy to elaborate.


“Why use global average pooling? … this choice still feels like it might discard a lot of important spatial information. Isn't there some better choice?”

We use global pooling to aggregate spatial features across the mesh, enabling the model to produce outputs that depend on global geometry and direction. This relates to our answer to the previous question: the spherical output signal is a global signal, and thus pooling across the mesh is appropriate. While global average pooling is effective in this context, we also experimented with other pooling strategies, such as top-k pooling, and found similar results. That said, we agree that more expressive alternatives like attention-based pooling could better preserve spatial structure, and we are interested in exploring such methods in future work.
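In numpy terms, the pooling step described above can be sketched as follows. This is an illustrative stand-in, not the authors' implementation; `global_topk_pool` is our own name for the top-k variant mentioned:

```python
import numpy as np

def global_average_pool(node_feats):
    """Collapse per-node mesh features of shape (N, C) into one global descriptor (C,)."""
    return node_feats.mean(axis=0)

def global_topk_pool(node_feats, k):
    """Alternative mentioned in the reply: average the k largest activations per channel."""
    top = np.sort(node_feats, axis=0)[-k:]   # k largest values in each channel
    return top.mean(axis=0)

feats = np.random.default_rng(0).standard_normal((500, 64))  # 500 mesh nodes, 64 channels
g_avg = global_average_pool(feats)    # shape (64,)
g_top = global_topk_pool(feats, 32)   # shape (64,)
```

Both pooled descriptors are invariant to the ordering of mesh nodes, which is consistent with decoding a single global, direction-dependent output from them.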


“By training on lower-resolution data, isn't the model simply going to learn only the low-frequency components of the signals? In this sense, the zero-shot super-resolution is not much more interesting than interpolation …”

While training on lower-resolution data does limit the signal to lower-frequency components, our model's ability to perform zero-shot super-resolution stems from more than just interpolation. Because the spatial sample locations are randomized across the different meshes (due to the different mesh geometries and orientations), achieving good performance requires the model to learn the correlation between low- and high-frequency components across the dataset. This is only possible when there is consistent structure in the signals that the model can exploit, such as structure arising from similar geometry or alignment across samples.

We additionally note that this has been observed in prior works like Fourier Neural Operators [1], where models trained on coarse data (PDE solutions, in their case) can generalize to finer resolutions by learning global patterns. We also agree that this could have been made clearer and will revise the text to emphasize these details in the paper.

[1] Z. Li et al., Fourier Neural Operator for Parametric PDEs, ICLR 2021.
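The distinction between a coefficient-based representation and a fixed-grid one can be made concrete with a 1D analogue (our own illustrative sketch, not the paper's code): fit truncated Fourier coefficients from irregular, coarse samples, then evaluate the same coefficients at arbitrary resolution. The prediction cost of the representation is decoupled from the number of output points.

```python
import numpy as np

rng = np.random.default_rng(1)
L = 4                                    # model bandwidth
k = np.arange(-L, L + 1)

def design(x):
    """Fourier design matrix: one column per frequency in [-L, L]."""
    return np.exp(1j * np.outer(x, k))

# ground-truth band-limited signal
c_true = rng.standard_normal(2 * L + 1) + 1j * rng.standard_normal(2 * L + 1)

# "training": irregular, coarse samples (more samples than coefficients)
x_coarse = rng.uniform(0, 2 * np.pi, 32)
y_coarse = design(x_coarse) @ c_true
c_hat, *_ = np.linalg.lstsq(design(x_coarse), y_coarse, rcond=None)

# "zero-shot super-resolution": evaluate the same coefficients on a dense grid
x_fine = np.linspace(0, 2 * np.pi, 1000)
err = np.max(np.abs(design(x_fine) @ (c_hat - c_true)))
```

Here the recovery is exact because the test signal is band-limited; the rebuttal's point is that, across a dataset, learned correlations between bands can carry beyond what any single coarse sample pins down.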


“Fig 4 is a bit hard to read.”

Thank you for the suggestion. We will improve the readability of Fig. 4 by increasing the size in the final version. We will also revise the caption to provide more context and better highlight the key takeaways from the figure.


“Tab 2: could you report the standard deviation over the 50 initializations?”

Instead of reporting the standard deviation across the 50 initializations, we follow the common practice in the policy learning literature of reporting the mean performance of the 50 initializations over the final 10 evaluation rollouts. This reflects the stability of the final policy and aligns with established conventions in the field. We acknowledge that additional variance metrics (e.g., standard deviation) could provide further insight and will consider including them in future work.


“Tab. 3: why is the NE-G2S much faster than G2S?”

The difference in runtime between G2S and NE-G2S in Table 3 is primarily due to the increased computational overhead of equivariant architectures. Specifically, SO(3)-equivariant models like G2S rely on spectral convolutions and spherical harmonics, which are inherently more expensive than the standard operations used in non-equivariant models such as NE-G2S. That being said, we find that the increased performance of the equivariant G2S outweighs the minute (6 ms) increase in inference time.


“TSNL: … because the resulting signal has unbounded frequency, a Fourier Transform with finite samples can not be exact and will always introduce some artifacts, thereby breaking the exact equivariance of this layer. This is in contrast with what the authors claim in lines 727 and 730.”

Thank you for pointing this out. You are correct that applying a Fourier transform with finite samples can introduce minor artifacts, which in turn lead to a small equivariance error in the TSNL layer. While this is typically negligible in practice, we agree that the claims in lines 727 and 730 should be updated accordingly. We will revise the text to clarify this point and will additionally quantify the equivariance error introduced by TSNL in our revised version.

Additionally, we note that this approach is common in many equivariant architectures such as the Spherical Fourier Neural Operator [1], Spherical CNN [2], and Gauge Equivariant Mesh CNNs [3]. Specifically, [2] measures this equivariance error and finds that it is < 2e-6 (see Sec. 5.1 of [2]).

[1] Bonev, Boris, et al. "Spherical fourier neural operators: Learning stable dynamics on the sphere." International conference on machine learning. PMLR, 2023.

[2] Cohen, Taco S., et al. "Spherical CNNs." International Conference on Learning Representations. 2018.

[3] De Haan, Pim, et al. "Gauge Equivariant Mesh CNNs: Anisotropic convolutions on geometric graphs." International Conference on Learning Representations, 2021.
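The sampling-error behaviour under discussion can be reproduced in a 1D (circle) analogue. The sketch below is ours, not the TSNL implementation, and uses a cubic pointwise nonlinearity whose output bandwidth is exactly 3L: with enough samples the finite Fourier transform commutes with rotation (circular shift) to machine precision, while undersampling introduces aliasing that breaks equivariance.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 5                                  # input bandwidth
a = rng.standard_normal(L + 1)
b = rng.standard_normal(L + 1)
shift = 0.7                            # a "rotation" of the circle

def band_limited(x):
    """Band-limited test signal with maximum frequency L."""
    k = np.arange(L + 1)
    return np.cos(np.outer(x, k)) @ a + np.sin(np.outer(x, k)) @ b

def equiv_error(n_samples):
    """Max mismatch, in Fourier space, between rotate-then-layer and layer-then-rotate."""
    x = 2 * np.pi * np.arange(n_samples) / n_samples
    layer = lambda f: f + 0.1 * f**3   # pointwise nonlinearity: bandwidth grows to 3L
    coeff = np.fft.fft(layer(band_limited(x))) / n_samples
    coeff_rot = np.fft.fft(layer(band_limited(x + shift))) / n_samples
    k = np.fft.fftfreq(n_samples, d=1.0 / n_samples)  # signed integer frequencies
    # rotating the input should multiply coefficient k by exp(i * k * shift)
    return np.max(np.abs(coeff_rot - coeff * np.exp(1j * k * shift)))
```

With 64 samples (above the Nyquist rate for bandwidth 3L = 15) the error is at floating-point level; with 16 samples it is orders of magnitude larger, mirroring the dense-sampling requirement discussed above.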

Comment

Thanks for the detailed answer. I believe the authors have addressed all my concerns, so I will raise my score accordingly.

Official Review
Rating: 4

This paper addresses the problem of mapping local 3D geometric information to global spherical signals. The proposed method, G2Sphere, encodes the geometric structures into latent Fourier features using equivariant networks.

The encoder processes 3D input data via graph convolutions based on Equiformer v2 [39] (Sec. 4.1), while the decoder leverages spherical convolutions [15] to project features onto the sphere (Sec. 4.2).

The main contribution is the frequency upsampling module (L203–212), inspired by [7, 26], which integrates a Trainable Spherical Non-Linearity (TSNL) to increase the SH frequency from 5 to 40. This allows modeling of higher-frequency signals than previous equivariant GNN baselines [20, 26].

G2Sphere demonstrates improved performance across several structured physical domains, including radar/drag modeling, and policy learning. The authors argue that equivariance and Fourier features reduce prediction errors in these three physics-based CV benchmarks.

Strengths and Weaknesses

Strengths

  • S1. Technical Soundness: The combination of an equivariant GNN encoder [39] and spherical CNN decoder [15] is well-founded for SO(3)-equivariant modeling. The proposed frequency upsampling via a point-wise MLP allows increasing SH frequency levels within an iFT, but see the concerns under W2 below.
  • S2. Interesting Applications: The chosen evaluation domains (radar/drag modeling and policy learning) effectively demonstrate the utility of the SO(3)-equivariant framework in structured physics-based scenarios.
  • S3. Performance Gains: On synthetic datasets (Radar/Drag) and a policy learning dataset, G2Sphere outperforms baseline methods, including both equivariant and non-equivariant models.

Weaknesses

  • W1. Missing Prior Work: A closely related architectural baseline is missing from the discussion: [A] uses a spherical encoder and SO(3)-equivariant CNNs [15], with a similar motivation of operating in the frequency domain rather than the spatial domain. Please cite [A] and compare both architectural and conceptual similarities and differences.

[A] 3D Equivariant Pose Regression via Direct Wigner-D Harmonics Prediction (Lee & Cho, NeurIPS 2024)

  • W2. Maintaining Equivariance of TSNL: The use of point-wise MLPs in TSNL raises questions about whether end-to-end SO(3)-equivariance is preserved. Please clarify this point (L46–47), potentially by quantifying equivariance error module-wise or via ablation.

  • W3. Motivation for Spherical Representation & Frequency Domain: While the use of SO(3)-equivariant models is well-motivated, the rationale for the spherical representation and frequency-domain approach remains vague. Please provide justification or an empirical comparison with spatial-domain approaches.

  • W4. Frequency-level Choice: Although the SH frequency is increased to 40, the paper does not isolate how this impacts precision. A direct analysis of the trade-off between SH frequency and performance would strengthen the claim. Is there an accuracy-efficiency trade-off analysis for different frequency levels?

  • W5. Sample Efficiency Claims: The paper mentions sample efficiency in the introduction, but does not provide supporting experiments. Clarify whether the zero-shot (L285–290) or generalization tests (L291–306) demonstrate this, and if so, explain how.

  • W6. Redundant Writing: Section 5 and Appendix G contain overlapping content and overly verbose explanations, which at times read like LLM-generated text. A more concise and well-structured narrative would greatly enhance clarity. In addition, the manuscript would benefit from better organization around the following:

  1. Clear separation between reused modules and newly proposed components,
  2. Stronger logical linkage between the stated motivation and the experimental design,
  3. More cohesive flow from methodology to results and analysis. A refined and tightly written presentation would make the technical contributions more accessible and compelling to the reader.
  • W7. Lack of HEALPix Comparison: The paper critiques HEALPix spatial grids (L29–30) for overfitting to sample locations. Is there an empirical comparison with HEALPix to substantiate this claim?

Questions

Please see the Weaknesses section above.

Limitations

yes

Justification for Final Rating

NeurIPS’25 Post-Rebuttal Review

I apologize for not being able to actively participate in the reviewer discussion period. However, to make up for this and ensure thorough feedback, I have put more effort into preparing this post-rebuttal review.

Please see my comments below:

W1: Thank you for explaining the comparison with [A]. I now clearly understand the differences from G2S. Please include this comparison in the final manuscript.

W2: I understand that SO(3)-equivariance is preserved; however, I am concerned about potential precision loss due to the truncation of the Fourier transform. It would be helpful if you could quantitatively show whether the equivariance error reported in [2] also applies to this work.

W3: Demonstrating that spherical representations and frequency-domain approaches are effective in improving performance is reasonable. However, an ablation study on these two factors would further strengthen the paper’s evidence.

W4: I have confirmed the effect of the maximum frequency level L_{\text{max}} in Table 4. Thank you.

W5: The results shown in Figures 2 and 4 provide some evidence of sample efficiency, but it would be better to show results across a wider range of sample sizes. As mentioned in the rebuttal, the authors should add experiments with various settings.

W6: I expect the authors to follow through on their promise to carry out the revisions clearly.

W7: I expect the authors to follow through on their promise to perform the comparison with HEALPix.

In conclusion, I consider my concerns W1, W4, and W6 resolved, and my concerns W2, W3, W5, and W7 partially resolved.

I appreciate the authors’ rebuttal and will raise my score by one level (4: Borderline accept).

Formatting Concerns

No paper formatting concerns

Author Response

Thank you for your review. We are working on a revision to incorporate your feedback, but we address your comments and questions here as well.


“A closely related architectural baseline is missing: [A] uses a spherical encoder and SO(3)-equivariant CNNs [15], with a similar motivation of operating in the frequency domain rather than the spatial domain.”

Thank you for bringing this work to our attention; we will add it to our related work section and explain its relationship to our method. Indeed, [A] shares key architectural similarities with G2S, including the use of spherical representations and SO(3)-equivariant convolutions in the frequency domain, which support continuity, equivariance, and data efficiency. However, there are two important distinctions. First, [A] begins from 2D image inputs, using a ResNet backbone followed by projection onto the sphere, whereas G2S is designed for 3D geometric inputs and operates directly in the spherical frequency domain from the outset. Second, [A] targets 3D pose estimation and thus operates effectively at a fixed low bandwidth (L_{max} = 5), while G2S must adapt L_{max} to the demands of each domain, ranging from 5 in our policy learning tasks to 40 in radar prediction. Therefore, we employ various techniques such as a frequency upsampling strategy to scale output resolution efficiently.


“The use of point-wise MLPs in TSNL raises questions about whether end-to-end SO(3)-equivariance is preserved.”

The TSNL is equivariant up to the sampling error introduced by the inverse Fourier transform, which we use to convert frequency-domain features to real-space signals on a fixed spherical grid. Because we apply this MLP pointwise to each of these points, it is SO(3)-equivariant by construction. While it is true that performing a Fourier transform with finite samples can introduce minor artifacts, these result in negligible equivariance error in practice. We note that this approach is common in many equivariant architectures such as the Spherical Fourier Neural Operator [1], Spherical CNN [2], and Gauge Equivariant Mesh CNNs [3]. Specifically, [2] measures this equivariance error and finds that it is < 2e-6 (see Sec. 5.1 of [2]).

[1] Bonev, Boris, et al. "Spherical fourier neural operators: Learning stable dynamics on the sphere." International conference on machine learning. PMLR, 2023.

[2] Cohen, Taco S., et al. "Spherical CNNs." International Conference on Learning Representations. 2018.

[3] De Haan, Pim, et al. "Gauge Equivariant Mesh CNNs: Anisotropic convolutions on geometric graphs." International Conference on Learning Representations, 2021.


“While the use of SO(3)-equivariant models is well-motivated, the rationale for spherical representation and frequency-domain approach remains vague.”

We appreciate this comment and agree that the rationale for using spherical representations and the frequency domain deserves further clarification. Many of the signals we consider (e.g., radar and drag) are global in nature but direction-dependent. Representing them as functions on the sphere allows us to model this directional dependence explicitly, without tying the representation to the surface geometry of the 3D object. Figure 5c illustrates how G2Sphere models these directional signals by learning Fourier coefficients enabling continuous signal reconstruction via spherical harmonics. This approach is more efficient than spatial methods that either predict dense grids (5a) or require repeated coordinate queries (5b), as the cost of prediction is decoupled from the number of output points. Modeling in the frequency domain also provides a strong structural prior that helps avoid overfitting to noisy or irregular signals. We observe this in the drag prediction experiments (Fig. 4), where spatial baselines tend to overfit, while G2S is biased toward smoother solutions due to its harmonic representation. Additionally, it provides a clean and scalable way to maintain SO(3) equivariance, resulting in an efficient end-to-end equivariant architecture. To demonstrate this, we do include competitive spatial-domain baselines such as transformers in the mesh-to-sphere experiments, and IBC and diffusion models in the policy learning tasks.


“Although SH frequency is increased to 40, the paper does not isolate how this impacts precision.”

We do include a study in Appendix H on the effect of frequency on precision. As shown in Table 4, increasing L_{max} leads to a clear reduction in error (0.26 MSE reduction from L = 5 to L = 40), while Figure 6 provides qualitative evidence of improved predictions, e.g., critical specular radar effects are completely missing at lower L but become observable as L increases. However, as noted in Figure 7, higher frequencies can lead to overfitting in low-data regimes. Additionally, we observe that compute and memory usage increase substantially with larger L_{max}. We agree that analyzing the trade-off between SH frequency and performance is critical to understanding the role of L_{max}, and we will move this study to the main body.
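The growth of compute and memory with L_{max} mentioned above can be sanity-checked with simple arithmetic (our own calculation, not a figure from the paper): the number of spherical-harmonic coefficients per channel up to degree L_max is (L_max + 1)^2, so going from L_max = 5 to L_max = 40 multiplies the representation size by roughly 47x.

```python
def sh_coeff_count(l_max: int) -> int:
    """Number of spherical-harmonic coefficients up to degree l_max: sum over l of (2l + 1)."""
    return (l_max + 1) ** 2

# representation size per channel at the bandwidths used in the paper's domains
sizes = {l: sh_coeff_count(l) for l in (5, 10, 20, 40)}
```

The quadratic growth of the representation (before any layer-cost factors on top) is one concrete reason high-L_{max} processing is confined to the decoder.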


“The paper mentions sample efficiency in the introduction, but does not provide supporting experiments.”

The mention of sample efficiency in the introduction is supported by the policy learning experiments (Table 2) and the generalization tests on the drag domain (Fig. 4). All of our policy learning experiments are in relatively low-data regimes, with only 100 demos for each dataset. Therefore, when we see both the equivariant and non-equivariant versions of G2S outperforming the baselines, the primary reason is their increased sample efficiency. For example, in the push-T task, the non-equivariant G2S achieves a success rate of 0.97 and the equivariant version reaches 1.0, while the diffusion model scores 0.95. The differences are even more pronounced in the random push-T variant, where non-equivariant G2S achieves 0.74, equivariant G2S reaches 0.92, and diffusion drops to 0.71. These gains in sample efficiency are primarily due to the combination of equivariance, which allows the model to better generalize under rotations, and the frequency modeling, which provides regularization and stability and avoids overfitting. This overfitting in particular can be seen in the generalization results on the drag domain (Fig. 4). That being said, we acknowledge that additional ablations on dataset sizes would further strengthen this claim, and we plan to include them to provide a more comprehensive analysis of sample efficiency.


“Sections 5 and Appendix G contain overlapping content and overly verbose explanations, which at times read like LLM-generated text. A more concise and well-structured narrative would greatly enhance clarity.“

We thank the reviewer for the valuable feedback. While Section 5 contains some overlap with Appendix G, this redundancy was intentional to ensure the reproducibility of our experiments. Section 5 provides a high-level overview of key model parameters and important details, while Appendix G includes a more comprehensive set of specifications and additional information necessary for fully replicating the experiments. This design ensures that readers have both a concise summary in the main text and an exhaustive reference in the appendix. However, we acknowledge that there may have been too much overlap and will take care to streamline the content. Additionally, we will do the following to address the other points raised:

  • Separation of reused and new modules: We will revise Section 4 to more clearly differentiate between components adopted from prior work and those introduced as novel contributions in this paper.
  • Linking motivation and experiments: We will strengthen the connection between the motivation and the experiments by revising both the introduction and experimental sections to clearly articulate the purpose of each experiment and how it addresses our core research questions.
  • Cohesive flow: We will revise the transition from the methodology section to the experiments to ensure a more cohesive flow. In particular, we will add forward references from Section 4 to the relevant experiments in Section 5. For example, highlighting how the frequency upsampling technique introduced in Section 4.2 directly contributes to the improved high-resolution outputs shown in Figure 3. This will help clarify how specific design choices lead to observed performance gains.

“The paper critiques HEALPix spatial grids (L29–30) for overfitting to sample locations.“

While HEALPix is cited as an example of a possible grid for spherical output signals, we do not explicitly compare to it in this work. However, we evaluate several representative gridded methods that similarly predict or rely on regularly sampled points over the sphere. For example, our transformer baseline in the radar domain predicts a fixed grid of uniformly spaced outputs. In the drag prediction and policy learning tasks, implicit models such as IBC and the implicit transformer are conditioned on coordinate queries sampled from a fixed grid, which the model must learn to reconstruct.

As we stated in the introduction, using fixed spatial grids, whether for direct prediction or coordinate conditioning, commonly leads to overfitting to the sample locations. This behavior is clearly visible in Figure 4 (drag prediction), where the implicit transformer baseline overfits, and in Figure 8 (policy learning), where IBC shows severe overfitting. We will revise the paper to clarify that this overfitting tendency is characteristic of the broader class of gridded methods, of which HEALPix, our transformer baseline, and coordinate-conditioned implicit models like IBC are all representative.

Comment

Hi, just a quick note to say we’ve done our best to respond to your comments in the rebuttal, and we’d really appreciate any further thoughts or discussion. We’re happy to clarify anything or continue the conversation if there are remaining concerns. Thanks again for your time and feedback!

Official Review
Rating: 4

This paper focuses on using the spherical Fourier transform and equivariant networks to handle geometric data such as meshes. Specifically, the encoder uses SO(3)-equivariant networks to map the 3D mesh data points into Fourier frequencies represented by irreducible representations. The decoder then uses these irreps as coefficients for predefined spherical harmonic basis functions to reconstruct the output from the mesh input.

Strengths and Weaknesses

Strengths:

  1. This work extends the usage of SO(3)-equivariant networks, which were initially applied to molecular and material modeling, to mesh data and mesh problems.

  2. The architecture provides a framework that brings new insights into how to build autoencoders for mesh data.

Drawbacks:

  1. It might be valuable to provide a discussion on the efficiency of the method. In particular, it would be helpful to clarify the maximum frequency (L_{max}) applied in this work. As is well known in the literature, when L_{max} is high, SO(3)-equivariant networks can become computationally demanding, and it would be insightful to understand how this is handled in your setting.

  2. It could further strengthen the paper to include a discussion or comparison with widely recognized baselines for mesh or point cloud problems, such as PointNet++ [1] and RepSurf [2]. Including these would help better position the contributions of your work in the broader context of mesh-based (point-cloud-based) learning methods.

  3. As a suggestion, the paper could emphasize the generated datasets, which are also part of the contribution of this paper. It may be helpful to emphasize the importance of these generated datasets and how they can contribute to real-world applications.

[1] PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space.

[2] Surface Representation for Point Clouds.

Questions

Question:

  1. How did you handle the distances in the 3D meshes when performing the Fourier transformation? I understand that applying the Fourier transformation on the S2 space is natural. However, when working with mesh data, defining the origin of the system requires transforming the mesh points by considering their distances to the origin. In this case, I believe applying spherical harmonics alone is not sufficient. How did you address the handling of these distances? In Figure 2, it is stated that a "grid of pre-computed harmonic basis functions" will be used. Does this mean that each channel will correspond to a certain fixed distance?

Limitations

N/A.

Justification for Final Rating

This architecture offers a framework that provides new insights into designing autoencoders for mesh data. Experiments against point-cloud networks (e.g., PointNet++) and equivariant models (e.g., Equiformer) demonstrate its effectiveness. Overall, the paper presents a good step toward autoencoding 3D mesh data.

Formatting Concerns

N/A.

Author Response

Thank you for your review. We are working on a revision to incorporate your feedback, but we address your comments and questions here as well.


“It might be valuable to provide a discussion on the efficiency of the method. In particular, it would be helpful to clarify the maximum frequency (L_{max}) applied in this work. As is well-known in the literature, when L_{max} is high, the SO(3) equivariant networks can become computationally demanding, and it would be insightful to understand how this is handled in your setting.”

We agree that the computational cost of SO(3)-equivariant networks with high L_{max} is an important consideration. This is especially true of equivariant GNNs, where maintaining high L_{max} across a large number of nodes quickly becomes computationally prohibitive. In contrast, this is less of an issue with spherical models, as we only have a single node, i.e., the sphere itself. Therefore, in our GNN encoder we use a smaller L_{max}; for example, in our mesh-to-sphere experiments we use L_{max} = 5 for all layers, before raising L_{max} in the spherical decoder layers. By combining this lower-L_{max} encoder architecture with the frequency upsampling technique (Sec. 4.2) in our decoder, we avoid incurring the large computational costs that make high-L GNNs intractable.

As a rough rule of thumb, we found frequency upsampling to be necessary when L_{max} > 10. This typically corresponds to domains requiring high-fidelity predictions, such as mesh-to-radar, where we push L_{max} to the upper limits of what our compute allows. We include ablations showing the relationship between L_{max} and output fidelity in Appendix H. In lower-complexity domains such as policy learning, we instead use small fixed values (L_{max} ≤ 5) to encourage generalization and avoid overfitting.
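A 1D analogue of the frequency upsampling idea can be sketched as follows (our own illustration under stated assumptions, not the paper's code): a low-bandwidth spectrum is zero-padded to a larger bandwidth without changing the signal it represents; a subsequent pointwise nonlinearity in real space can then populate the new frequency bands.

```python
import numpy as np

def freq_upsample(coeffs, l_new):
    """Zero-pad a centered spectrum c_{-L..L} to c_{-l_new..l_new} (1D analogue)."""
    l_old = (len(coeffs) - 1) // 2
    pad = np.zeros(l_new - l_old, dtype=coeffs.dtype)
    return np.concatenate([pad, coeffs, pad])

rng = np.random.default_rng(0)
l_old, l_new = 5, 40
c = rng.standard_normal(2 * l_old + 1) + 1j * rng.standard_normal(2 * l_old + 1)
c_up = freq_upsample(c, l_new)           # 81 coefficients, new bands all zero

# padding does not change the signal the coefficients represent
x = np.linspace(0, 2 * np.pi, 7)
f_old = np.exp(1j * np.outer(x, np.arange(-l_old, l_old + 1))) @ c
f_new = np.exp(1j * np.outer(x, np.arange(-l_new, l_new + 1))) @ c_up
```

The padding itself is free of information; it only makes room so that the nonlinearity's higher-frequency output can be retained rather than truncated.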


“How does your approach compare to standard, strong point cloud or mesh baselines, e.g. PointNet++ and RepSurf, that are not SO(3)-equivariant?”

Thanks for bringing these stronger mesh and point cloud baseline encoders to our attention. While we were unable to run these baselines within the current rebuttal timeline, our paper does include competitive non-equivariant baselines (e.g., transformers in the supervised learning settings and diffusion models in the policy learning domain) that serve as points of comparison. These models serve as reasonable baselines for our domains and help highlight the benefits of incorporating equivariance and frequency-based modeling. If additional results become available before the end of the discussion period, we will report and include them.


“...the paper could emphasize the generated dataset which is also part of the contribution of this paper. And it may be helpful to emphasize the importance of these generated datasets and how they can contribute to real-world applications.”

We agree that the generated datasets represent a valuable contribution of this work and are planning on releasing them alongside our code. The domains we consider in this work are generally lacking in publicly available datasets and we hope that our data can lay the foundations of a standardized and reusable testbed for future methods targeting spherical signal learning from 3D data. Additionally, the scale of these datasets enables the broader application of ML models in this space which has historically lagged behind other more mainstream applications such as text, image, and audio.


“How did you handle the distance in the 3D meshes when performing the Fourier transformation? I understand that applying the Fourier transformation on the S2 space is natural. However, when working with mesh data, defining the origin of the system requires transforming the mesh points by considering their distances to the origin. In this case, I believe applying spherical harmonics alone is not sufficient. How did you address the handling of these distances?”

You’re absolutely correct that the SH alone are not enough to encode this type of 3D geometric data as they only capture the angular (S2S^2) information. To incorporate the radial (distance) information, we use a radial network which transforms pairwise distances into feature embeddings. This design parallels approaches like Equiformer [1], which similarly uses Gaussian-based radial basis functions transformed via learnable MLPs to encode edge distance information in their equivariant attention layers.

[1] Liao, Yi-Lun, and Tess Smidt. "Equiformer: Equivariant Graph Attention Transformer for 3D Atomistic Graphs." The Eleventh International Conference on Learning Representations.
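A minimal sketch of such a Gaussian radial-basis expansion of pairwise distances (hypothetical function name and shapes; in practice a learnable MLP would map these features to edge embeddings, as in Equiformer):

```python
import numpy as np

def gaussian_rbf_embed(distances, num_basis=16, cutoff=5.0):
    """Expand distances into Gaussian radial basis features.

    Each distance d is compared against num_basis evenly spaced
    centers mu_k in [0, cutoff]; the k-th feature is
    exp(-gamma * (d - mu_k)^2), with gamma set from the center spacing.
    """
    centers = np.linspace(0.0, cutoff, num_basis)    # (num_basis,)
    gamma = 1.0 / (centers[1] - centers[0]) ** 2     # width from spacing
    d = np.asarray(distances, dtype=float)[..., None]  # (..., 1)
    return np.exp(-gamma * (d - centers) ** 2)       # (..., num_basis)

# Example: embed three pairwise distances into 16-dimensional features.
feats = gaussian_rbf_embed([0.0, 2.5, 5.0], num_basis=16, cutoff=5.0)
```

Concatenating (or multiplying) these radial features with the angular spherical-harmonic features gives the network access to the full 3D geometry, not just directions.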


“In Figure 2, it is said it will use "grid of pre-computed harmonic basis functions". Does this mean that each channel will be corresponding to a certain fixed distance?”

This question relates to the decoder stage of our pipeline, where we transform features from Fourier space back to real space. The phrase “grid of precomputed harmonic basis functions” refers to our use of a fixed sampling grid on the sphere, over which we evaluate the inverse spherical Fourier transform to recover real-space signals. These basis functions are precomputed purely for efficiency, e.g. to avoid recomputing SH values during every forward pass, and are not related to encoding different radial distances. To clarify: no, the channels do not correspond to fixed distances; they represent frequency components on the sphere, not spatial or radial positions.

For example, suppose $L_{\max} = 10$. Then there are $\sum_{l=0}^{10} (2l + 1) = 121$ total frequency modes; due to conjugate symmetry for real-valued signals, only the $\sum_{l=0}^{10} (l + 1) = 66$ complex coefficients with $m \geq 0$ need to be stored. If we are modeling a function $f: S^2 \rightarrow \mathbb{R}^{50}$, the frequency-domain representation would be a tensor of shape [B, 66, 50], where B is the batch size. After applying the inverse Fourier transform on a set of, say, 10,000 sampled points on the sphere, the real-space signal is recovered as a tensor of shape [B, 10,000, 50], representing the function values at each of those 10K spherical coordinates.
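The shape bookkeeping above can be walked through with a single matrix product; this sketch uses a random stand-in for the precomputed spherical-harmonic basis and counts all $(L_{\max}+1)^2$ real-basis modes for simplicity:

```python
import numpy as np

# Decoder inverse transform as one matmul against a precomputed basis grid:
# frequency features [B, K, C] x basis grid [P, K] -> real-space signal [B, P, C].
B, C, P = 2, 50, 10_000               # batch, channels, grid points
K = (10 + 1) ** 2                     # 121 real SH modes up to L_max = 10

rng = np.random.default_rng(0)
coeffs = rng.normal(size=(B, K, C))   # frequency-domain features
basis = rng.normal(size=(P, K))       # Y[p, k]: k-th basis function at grid point p

# f(x_p) = sum_k Y_k(x_p) * a_k, for every batch element and channel
signal = np.einsum('pk,bkc->bpc', basis, coeffs)
```

Since `basis` depends only on the fixed grid, it is computed once and reused on every forward pass; evaluating on a denser grid just means a taller `basis` matrix.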

Comment

Following up on the earlier discussion regarding strong non-equivariant baselines, we’ve now run PointNet++ on our radar domain. The results are summarized in the table below. Interestingly, PointNet++ performs the worst across all baselines in both mesh types. This further highlights the advantages of incorporating symmetry and frequency-based structure, as seen in the stronger performance of our G2S models.

| Domain | Mesh Type | PointNet++ | Transformer | Equiformer | Spherical CNN | G2S | G2S+TSNL |
|---|---|---|---|---|---|---|---|
| Radar | Frusta | 0.5650 | 0.201 | 0.271 | 0.438 | 0.221 | 0.195 |
| Radar | Asym | 0.5917 | 0.179 | 0.257 | 0.496 | 0.123 | 0.128 |
Comment

Thank you for your effort in providing the rebuttal and the new experiments regarding PointNet++. These address my main concerns.

Review
5

This paper proposes G2Sphere, which maps object geometries to spherical signals by operating entirely in Fourier space, encoding geometric structure into latent Fourier features using equivariant neural networks and outputting the Fourier coefficients of the continuous target signal at any resolution. The ideas are reasonable and natural. The algorithm is feasible and well demonstrated by the experiments.

Strengths and Weaknesses

Strengths: It is quite significant that this work bridges equivariant GNNs and spherical CNNs for modeling tasks in structured physical domains. Moreover, the theoretical foundation is investigated and the complete framework G2Sphere is proposed. The experiments are carefully designed for the tasks of radar response modeling, aerodynamic drag prediction, and policy learning for manipulation and navigation, and demonstrate the effectiveness of G2Sphere.

Weaknesses: The proposed method consists of certain existing algorithms and is short of a systematic theory. There are some English presentation and grammar errors in the paper.

Questions

There are several hyperparameters for implementing the proposed method; what are the best values for a task given the dataset?

Limitations

Yes.

Final Justification

The authors addressed all my concerns and problems so I have increased my score accordingly.

Formatting Issues

No.

Author Response

Thank you for your review. We are working on a revision to incorporate your feedback, but we address your comments and questions here as well.


“The proposed method consists of certain existing algorithms and is short of a systematic theory.”

Our primary goal in this work is not to create a systematic theory but to address a lack of methods for mapping 3D geometric inputs (e.g. point clouds and meshes) to spherical signals, despite the fact that this type of mapping is found across many different domains. Our proposed method is general enough to demonstrate strong performance across a range of diverse domains.


“There are some English presentation and grammar errors in the paper.”

Thank you for pointing this out. We will carefully revise the paper to improve clarity and grammatical correctness.


“There are several hyperparameters for implementing the proposed method, what are the best values for a task given the dataset?”

G2Sphere was trained using the Adam optimizer with a learning rate of 1e-4 and a decay rate of 0.95. In general, we use the largest batch size available to us given the compute required for the domain. For example, when using large meshes, such as in the radar domain, we use a smaller batch size of 4. In the policy learning domains, where the geometric input is smaller, we can use a larger batch size of 64. Additional hyperparameters, including the number of layers in both the encoder and decoder, the layer-wise feature dimensions, and the choice of maximum SH frequency $L_{\max}$, are detailed in Appendix G. For completeness, the appendix also provides the hyperparameter settings used for all baseline methods.
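The reported training setup can be summarized as a small config sketch (the per-epoch granularity of the decay schedule is an assumption; the stated values are lr = 1e-4 and decay rate 0.95):

```python
def lr_at_epoch(epoch, base_lr=1e-4, decay=0.95):
    """Exponential learning-rate decay: lr = base_lr * decay**epoch."""
    return base_lr * decay ** epoch

# Illustrative summary of the settings described in the rebuttal.
config = {
    "optimizer": "Adam",
    "base_lr": 1e-4,
    "lr_decay": 0.95,
    # Batch size scales with the memory footprint of the geometric input:
    "batch_size": {"radar": 4, "policy_learning": 64},
}
```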

Comment

Hi, just a quick note to say we’ve done our best to respond to your comments in the rebuttal, and we’d really appreciate any further thoughts or discussion. We’re happy to clarify anything or continue the conversation if there are remaining concerns. Thanks again for your time and feedback!

Comment

Hi Reviewer gYBD, Reviewer pmav, and Reviewer anCc,

The authors have made considerable efforts to address your comments in their rebuttal.

As the discussion period is ending soon, could you please take a moment to review their response and indicate whether your concerns have been fully addressed? Note based on the PCs' instructions, submitting “Mandatory Acknowledgement” without posting a single sentence to the authors in discussions is not permitted this year.

Your prompt response would be greatly appreciated.

Thanks,

AC

Final Decision

This paper introduces G2Sphere, a method that connects equivariant GNNs and spherical CNNs for physical domains. In the initial comments, reviewers raised concerns about missing baselines, clarity of the theory, and limited experiments. In their rebuttal, the authors added stronger baselines (such as PointNet++), provided clearer explanations of the theory, and included more experiments and ablation studies. Reviewers agreed that these additions resolved their main concerns, and some raised their scores. Overall, the idea is well-motivated, the paper is now clear and thorough, and the results are convincing. I recommend acceptance.