Point Cloud Synthesis Using Inner Product Transforms
We develop a differentiable, invertible geometrical-topological encoding of point clouds based on inner products.
Abstract
Reviews and Discussion
The work proposes Inner Product Transforms (IPT) to represent point-clouds as images. Basically, by combining a sufficient number of cleverly positioned views of point-cloud data, one may reconstruct the full point-cloud using the inner products of coordinates. The work proves that, using n+1 independent directions (planes), one may reconstruct point-clouds in R^n. The results indicate that the IPT is comparable to current state-of-the-art methods regarding reconstruction in quantitative and qualitative aspects. Additionally, the authors claim that the method is orders of magnitude faster for training and inference. Finally, IPT may be used in conjunction with generative methods to create novel point clouds from a pretrained latent-space encoder-decoder model.
Strengths and Weaknesses
Strengths
The work is well grounded and well motivated. The essential concepts necessary to understand IPT are presented in a simple and effective manner, so that even unfamiliar readers will understand the main contributions and how they were achieved using only the main text. Additional details on the appendix help readers looking for a deeper dive into the subject.
Additionally, I enjoyed the limitations discussion, where the authors were frank about the disconnect between theory and practice concerning the number of planes needed to reconstruct a point cloud. Such discussions open avenues for future work, and I appreciate this.
Finally, I liked that the authors submitted source code in addition to the manuscript.
Weaknesses
I found no description of the hardware configuration used for training the models, limiting the assessment of performance gains compared to other models. The authors mention that the model is trained on "commodity hardware" (Pg. 1, L. 32), although the actual hardware used is not mentioned anywhere in the text.
Furthermore, the authors mention the compression capabilities of IPT without arguably performing compression experiments. Upsampling after downsampling is not a compression experiment. I would suggest removing the term "compression" in this case, since no metrics, such as Rate-Distortion (RD) curves, were presented. Additionally, this experiment was performed using a single configuration, downsizing the point clouds to 256 points and upsampling them back to the original size. More extreme configurations (16, 32, 64, ... points) would make the authors' argument stronger.
Minor comments
- Define the 1-NNA acronym in the text.
- L. 270 "As a comparison partner, we use...". The term "partner" is odd here.
References
[1] Park, J. J., Florence, P., Straub, J., Newcombe, R., & Lovegrove, S. (2019). DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 165–174. https://doi.org/10.1109/CVPR.2019.00025
Questions
Is the method capable of performing point-cloud completion from partial data? An example of this can be seen in DeepSDF [1] (see the references above).
Why is Table 3 mentioned after Table 4 in the text? Additionally, in Table 3, the authors only list the device for ShapeGF and IPT-VAE. What is the device used for the other models?
Limitations
Yes, the authors have adequately addressed the limitations.
Justification for Final Rating
The authors answered my concerns, and, from my point of view, the questions posed by my fellow reviewers. Given that, I don't see a reason to reject the work, so I'm increasing my score.
Formatting Issues
The "Impact Statement" section is outside of the 9 page limit. However, it is not a numbered section, as in the text body, nor an appendix. Therefore, I'm not sure if it should count towards the 9 page limit or not.
We thank the reviewer for their positive view concerning our work! The fact that we perform comparably to state-of-the-art methods is indeed striking given the simplicity of our architecture. Representing point clouds as images has allowed us to simplify the pipeline for generation considerably. For future work, we believe that this allows state-of-the-art image generative models to be used for the generation or segmentation of point clouds.
We also appreciate that the reviewer likes our discussion on the practical considerations. The theoretical results typically provide no insight into the practical feasibility. Hence, to show that our method works in practice, we perform the backpropagation experiment in Section 4.4 and Table C.1 in the appendix. The table evaluates how well a point cloud can be reconstructed if the resolution of the IPT is progressively reduced. We will emphasize this perspective more.
Summary
Here's a brief summary of our changes:
New experiments (with preliminary results):
- Varying point cloud sizes: Already retrained for 32–512 points; quantitative results included, qualitative visuals will be added. Reconstruction quality remains high even in the low-cardinality regime.
- Point cloud completion: Preliminary results for cars and chairs show strong performance; will extend to more categories and compare to DeepSDF.
- Compression: We will work on providing a rate–distortion analysis to Section 4.4.
Clarifications and revisions:
- Add full hardware configuration (RTX 4070, i7-13700K, 32GB RAM) and training times across tables.
- Highlight the practical perspective of the backpropagation experiments.
- Define 1-NNA explicitly; replace “comparison partner.”
- Fix table order (Table 3/4) and unify device reporting.
- Adjust or remove the Impact Statement as per formatting rules.
Details
Please find detailed responses to your additional questions and remarks below.
the authors claim that the method is orders of magnitude faster for training and inference … found no description of hardware configuration used for training the models, limiting the assessment regarding performance gains compared to other models.
This was indeed an oversight on our part, which we will rectify in our revision. Our hardware consists of an NVIDIA GeForce RTX 4070 with 12GB VRAM and a 13th Gen Intel(R) Core(TM) i7-13700K with 32GB RAM. Since our model is small, training our Encoder and VAE takes only approximately 45 minutes combined. In contrast, generating the samples with PVD alone takes at least an hour and training takes much longer, which constitutes a major advantage of our method.
Furthermore, the authors mention the compression capabilities of the IPT without arguably performing compression experiments.
We were not familiar with the metric mentioned by the reviewer and are very thankful that it was pointed out. In addition to rephrasing our manuscript, we believe that it may be possible to add an RD curve to our backpropagation experiments (Section 4.4), which assess quality versus resolution. This is a great suggestion and will provide an additional quantification of our experiment!
Additionally, this experiment was performed using a single configuration downsizing the point clouds to 256 points and upsampling it back to the original size.
This is a great point and we have already retrained our model to include various point cloud sizes, ranging from 32 to 512 points (please see the table below). Visually, the smaller point clouds look good and capture the key characteristics of the shape. Unfortunately, rebuttal restrictions prevent us from showing these images, but we observe that the low-cardinality point clouds still capture the core features, and we will add such visualizations in our revised manuscript. Notice that the table demonstrates that our model can upsample the shape to a high-quality reconstruction, even when using a small number of key points.
| Number of points | Airplane, MMD-CD | Airplane, MMD-EMD | Car, MMD-CD | Car, MMD-EMD | Chair, MMD-CD | Chair, MMD-EMD |
|---|---|---|---|---|---|---|
| 32 | 1.398 | 3.376 | 9.993 | 6.047 | 18.251 | 11.615 |
| 64 | 1.197 | 2.683 | 7.572 | 4.969 | 14.336 | 9.788 |
| 128 | 1.146 | 2.312 | 6.644 | 4.510 | 12.070 | 8.878 |
| 256 | 1.129 | 2.169 | 6.289 | 4.233 | 11.591 | 8.530 |
| 512 | 1.124 | 2.091 | 6.316 | 4.361 | 11.548 | 8.336 |
Is the method capable of performing point-cloud completion from partial data? An example of this can be seen in DeepSDF [1] (see the references above).
Thanks for this suggestion! We attempted this experiment for the car and chair category and our preliminary experiments point towards its feasibility. Specifically, we used point cloud samples from depth maps and applied our Encoder to complete the point cloud given the IPT of the partial point cloud. We did not train our model on the partial point clouds but only on the “full” point clouds, making the task arguably more challenging. We are currently working on additional experiments and aim to extend the manuscript accordingly.
As the preliminary results show (other methods/numbers are quoted from "Point Voxel Diffusion, arxiv:2104.03670"), our method appears to perform very well in this setting! We will extend these experiments in more challenging settings and compare to DeepSDF as well. We appreciate this suggestion by the reviewer!
| Category | Model | MMD-CD | MMD-EMD |
|---|---|---|---|
| Chair | SoftFlow | 2.786 | 3.295 |
| Chair | PointFlow | 2.707 | 3.649 |
| Chair | DPF-Net | 2.763 | 3.320 |
| Chair | PVD | 3.211 | 2.939 |
| Chair | IPT (ours) | 1.103 | 1.238 |
| Car | SoftFlow | 1.850 | 2.789 |
| Car | PointFlow | 1.803 | 2.851 |
| Car | DPF-Net | 1.396 | 2.318 |
| Car | PVD | 1.774 | 2.146 |
| Car | IPT (ours) | 0.707 | 0.912 |
Define the 1-NNA acronym in the text and Line 270 "As a comparison partner, we use...". The term "partner" is odd here.
Thanks for these suggestions! The 1-NNA metric refers to the 1-Nearest-Neighbour-Accuracy, first defined by Lopez-Paz and Oquab [1]. We will clarify this terminology and update our use of “partner”.
[1] Revisiting Classifier Two-Sample Tests, Lopez-Paz and Oquab, ICLR 2017
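For completeness, here is a minimal sketch of how 1-NNA is typically computed as a leave-one-out 1-nearest-neighbour two-sample test. The distance matrix would, in our setting, contain Chamfer or EMD distances between point clouds; the function and variable names below are illustrative only and are not taken from our code.

```python
# Minimal sketch (not our evaluation code): 1-NNA as a leave-one-out
# 1-nearest-neighbour two-sample test. `dist` is any pairwise distance matrix
# between all samples (reference + generated); in the point cloud setting this
# would hold Chamfer or EMD distances.
import numpy as np

def one_nna(dist: np.ndarray, n_ref: int) -> float:
    """dist: (n, n) symmetric distance matrix; the first n_ref samples are reference samples."""
    n = dist.shape[0]
    labels = np.zeros(n, dtype=bool)
    labels[n_ref:] = True                      # False = reference, True = generated
    d = dist.copy()
    np.fill_diagonal(d, np.inf)                # leave-one-out: ignore self-distance
    nn = d.argmin(axis=1)                      # index of each sample's nearest neighbour
    correct = labels[nn] == labels             # 1-NN classifier predicts the neighbour's set
    return correct.mean()                      # 0.5 is ideal (sets are indistinguishable)

# Toy usage with random feature vectors standing in for point clouds.
rng = np.random.default_rng(0)
ref, gen = rng.normal(size=(100, 8)), rng.normal(size=(100, 8))
all_samples = np.vstack([ref, gen])
dist = np.linalg.norm(all_samples[:, None] - all_samples[None, :], axis=-1)
print(f"1-NNA: {one_nna(dist, n_ref=100):.3f}")  # close to 0.5 for identical distributions
```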
Why is Table 3 mentioned after Table 4 in the text? Additionally, in Table 3, the authors only list the device for ShapeGF and IPT-VAE. What is the device used for the other models?
Thanks for spotting this! This seems to be due to LaTeX placing tables differently; we will fix this and apologize for the confusion. For Table 3, we aimed to use as little repetition as possible and used the GPU for all models except where explicitly mentioned (second-to-last row). In response to your comments and those by other reviewers, we will add a more extensive description of the hardware used. For the table you mentioned, our hardware consists of an NVIDIA GeForce RTX 4070 with 12GB VRAM and a 13th Gen Intel(R) Core(TM) i7-13700K with 32GB RAM.
The "Impact Statement" section is outside of the 9 page limit. However, it is not a numbered section, as in the text body, nor an appendix. Therefore, I'm not sure if it should count towards the 9 page limit or not.
Thanks for your diligence! While the paper FAQ states that such a statement is not required, we decided to add it to point out some potential benefits of our method. To our knowledge, it does not count towards the limit, but we are happy to remove it for a revision.
We hope to have addressed your concerns and questions and look forward to updating the manuscript accordingly. Please let us know if you have any additional issues, concerns, or remarks!
Thank you very much for the clarifications! I ask the authors to add these clarifications to the text (either main text or supplementary material). And good luck!
Thanks for your support! We will of course include these revisions. Please let us know if you have any further questions!
This paper introduces a novel framework for point cloud synthesis based on a two-stage process: (1) generating a compact and injective inner-product-based representation (IPT) of the point cloud, and (2) reconstructing the full 3D point cloud from the IPT using a learned encoder. The IPT encodes multi-directional height filtrations as 2D images, inspired by the Euler Characteristic Transform but simplified and adapted to point clouds. The authors prove injectivity of the representation in theory, and show that its approximation retains strong empirical properties. This framework enables extremely fast and lightweight generation and reconstruction pipelines, using simple convolutional architectures and supporting general tasks such as reconstruction, generation, interpolation, and upsampling.
Strengths and Weaknesses
Strengths
The core idea of using inner product filtrations to encode point clouds as images is both elegant and theoretically grounded, leading to a highly compressed and expressive representation. The two-step pipeline—separating image generation and reconstruction—reduces architectural complexity and training cost, while maintaining high generation quality. The paper further demonstrates strong generalization in out-of-distribution settings, thanks to the stability and structure of the IPT latent space.
Weaknesses
- The paper proves that the Inner Product Transform (IPT) is injective in the continuous case, but the practical implementation relies on a finite number of directions. Could the authors provide a brief empirical analysis or visualization showing how reconstruction quality varies with the number of sampled directions? This would help clarify how well the theoretical guarantee holds under practical constraints.
- The method claims to support tasks like interpolation and upsampling, but these parts are not described in much detail. It would be helpful to include a few more examples or clarifications explaining how interpolation is implemented in the latent space, and what mechanisms ensure geometric plausibility in the interpolated outputs.
- There are minor writing issues that slightly affect readability, such as “exhibits performance on a par with” or “we found the IPT space to be smooth”. These could be rephrased more formally or precisely to improve presentation.
Questions
see weaknesses
Limitations
yes
Formatting Issues
I did not notice any major formatting issues. The paper appears to follow the NeurIPS 2025 formatting guidelines.
We really appreciate your thoughtful and positive review, which summarizes our contributions most admirably in the way we intended! Our paradigm of separating the generative pipeline into generation and reconstruction is similar to latent diffusion: we perform point-cloud-to-image autoencoding, allowing an image generative model to generate point cloud representations. The crucial difference is that our pipeline is cross-domain; we hope that this idea will be applied to different data modalities as well, the main premise being that by transforming a “difficult” data modality into an image, we simplify the generation pipeline.
Summary
Here's a brief summary of our changes.
New experiments and additions:
- Number of directions (see below for preliminary results): Provide empirical analysis and visualizations showing reconstruction quality vs. number of directions; clarify in the main text (Table 1, Appendix C.1).
- Interpolation: Add extended examples and varied results; clarify pixel-wise linear interpolation of IPTs; expand intuitive discussion with theoretical justification (Theorem 2).
Clarifications and revisions:
- Clarify how backpropagation experiments confirm expressivity and reconstruction quality.
- Improve writing precision (phrasing such as “on a par with”, “smooth IPT space”).
- Strengthen discussion of geometric plausibility in interpolated outputs.
Details
Please find our detailed responses below.
The paper proves that the Inner Product Transform (IPT) is injective in the continuous case, but the practical implementation relies on a finite number of directions. Could the authors provide a brief empirical analysis or visualization showing how reconstruction quality varies with the number of sampled directions?
As the reviewer correctly points out, a mathematical proof of invertibility often provides no practical guidance on how to carry out the inversion in practice or whether it can perform well. Thus, in order to assess the ability to invert the IPT, we performed backpropagation experiments to check that gradients can, in fact, be correctly backpropagated. This shows that the IPT is expressive and can reconstruct the point cloud with high quality, given sufficient directions. The current version of the paper already shows some of these results in Section 4.4 and Table C.1 of the Appendix. Since the table is admittedly somewhat dense, we will (a) further clarify this in the main text in our revision, while (b) also adding additional visualizations and tables that show the behavior of our method as the number of directions is varied. We appreciate this suggestion!
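To make the setup concrete, the following is a minimal, self-contained sketch of such a backpropagation experiment. It uses a sigmoid-smoothed cumulative count of inner products as a differentiable stand-in for the IPT (the exact transform, resolution, and hyperparameters in the paper differ) and optimizes free point coordinates to match a target IPT.

```python
# Minimal sketch (assumption: a soft, differentiable stand-in for the IPT built from
# sigmoid-smoothed cumulative counts of inner products; the paper's exact transform
# may differ). Illustrates inverting an IPT by backpropagating into point coordinates.
import torch

def soft_ipt(points, directions, thresholds, temperature=0.05):
    """points: (N, 3), directions: (D, 3) unit vectors, thresholds: (T,).
    Returns a (D, T) image of smoothed counts of points with <x, v> <= t."""
    heights = points @ directions.T                          # (N, D) inner products
    below = torch.sigmoid((thresholds[None, None, :] - heights[:, :, None]) / temperature)
    return below.sum(dim=0)                                  # (D, T)

torch.manual_seed(0)
directions = torch.nn.functional.normalize(torch.randn(64, 3), dim=1)
thresholds = torch.linspace(-1.5, 1.5, 64)

target_points = torch.rand(256, 3) * 2 - 1                   # "unknown" point cloud
target = soft_ipt(target_points, directions, thresholds)     # IPT we try to invert

points = (torch.randn(256, 3) * 0.01).requires_grad_()       # free parameters
optimizer = torch.optim.Adam([points], lr=1e-2)
for step in range(1000):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(soft_ipt(points, directions, thresholds), target)
    loss.backward()
    optimizer.step()
print(f"final IPT loss: {loss.item():.6f}")
```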
As a preliminary experiment and preview, please refer to the table below, showing the reconstruction quality for the "Airplane" class as a function of the number of directions (cf. Table 1 in the main text). Our method is very stable and maintains high quality even for a small number of directions. In the future, we aim to further analyze and elucidate this point. Thanks for this great suggestion!
| Number of directions | MMD-CD | MMD-EMD |
|---|---|---|
| 4 | 1.287 | 2.047 |
| 8 | 1.177 | 1.882 |
| 16 | 1.090 | 1.789 |
| 32 | 1.070 | 1.685 |
| 64 | 1.033 | 1.559 |
The method claims to support tasks like interpolation and upsampling, but these parts are not described in much detail. It would be helpful to include a few more examples or clarifications explaining how interpolation is implemented in the latent space, and what mechanisms ensure geometric plausibility in the interpolated outputs.
Thanks for this suggestion! We will extend the examples and their discussion in our revision, while also adding a more varied set of interpolation results that clarify how the interpolation is performed. The gist of interpolation is that we perform a pixel-wise linear interpolation of the IPT, viewed strictly as an image. For two IPTs $I_0$ and $I_1$, the interpolant is $I_t = (1 - t)\,I_0 + t\,I_1$ for $t \in [0, 1]$. To approximate the interpolated point cloud at a particular level of $t$, we use our encoder model.
The theoretical justification why such a construction works is that the IPT is linear with respect to adding or removing points (we prove this in the paper in Theorem 2). That is to say that computing the pixelwise sum of two IPTs is the same as computing the IPT of the two point clouds simultaneously. Heuristically speaking, the interpolant is thus a kind of "weighted virtual point cloud" the model aims to reconstruct. Theorem 2 makes this precise; and to further elucidate this construction, we will extend the manuscript with a more intuitive discussion.
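As an illustration of this step, here is a minimal sketch of the pixel-wise interpolation; the toy arrays stand in for two IPT images, and the decoding step via the trained IPT-Encoder is only indicated as a comment, since the encoder itself is not shown here.

```python
# Minimal sketch (assumptions: `ipt_a` and `ipt_b` are IPT images of identical shape;
# `encoder` would be the trained IPT-Encoder, which is not shown here).
import numpy as np

def interpolate_ipt(ipt_a: np.ndarray, ipt_b: np.ndarray, t: float) -> np.ndarray:
    """Pixel-wise linear interpolation of two IPT images, 0 <= t <= 1."""
    return (1.0 - t) * ipt_a + t * ipt_b

# Toy IPT images standing in for the transforms of two point clouds.
rng = np.random.default_rng(0)
ipt_a, ipt_b = rng.random((64, 64)), rng.random((64, 64))

for t in np.linspace(0.0, 1.0, 5):
    ipt_t = interpolate_ipt(ipt_a, ipt_b, t)
    # points_t = encoder(ipt_t)   # decode the interpolant with the trained IPT-Encoder
    print(f"t = {t:.2f}, mean pixel value = {ipt_t.mean():.3f}")
```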
There are minor writing issues that slightly affect readability, such as “exhibits performance on a par with” or “we found the IPT space to be smooth”.
Thanks for your diligent read! We will rectify these formulations in our revision and go over the paper again to ensure precise formulations.
We would like to thank the reviewer again for their engagement and positive review! If you believe we have adequately addressed your considerations, we would like to kindly ask you to consider increasing your score, and we would be more than happy to respond to any further questions or remarks.
We would like to thank the reviewer for their extensive review and appreciate their feedback on our manuscript. With the rebuttal period coming to a close, we would like to know whether there are any more questions and whether our response so far has addressed the reviewer's concerns.
After reading the authors’ response and considering other reviewers’ comments, I tend to maintain my original score. The authors’ clarifications have adequately addressed my concerns, and I recommend acceptance of the paper.
Thank you very much for the support of our work and the very valuable feedback from the reviewer!
The paper presents the Inner Product Transform (IPT), a novel method for point cloud representation using inner products, converting point clouds into 2D images that capture their geometrical and topological features. The core contributions include a new image-to-point-cloud pipeline that reduces training and inference times while preserving generation quality, and the demonstration of IPT's injective property allowing point cloud reconstruction from its descriptor. The authors also highlight the stable latent space provided by IPT, which supports high-quality interpolation and out-of-distribution tasks without additional retraining. This work advances point cloud processing by offering a more efficient and accessible approach, potentially broadening the use of machine learning algorithms for point cloud tasks. The theoretical foundation and versatile applications of IPT, including reconstruction, generation, compression, and upsampling of point clouds in arbitrary dimensions, further enhance its value.
Strengths and Weaknesses
Strengths: Its main advantage is the introduction of a novel and efficient point cloud representation method. The IPT shows promise in improving computational efficiency for point cloud tasks. The authors provide a solid theoretical foundation for their method, including proofs of key properties like injectivity, which adds credibility to their approach.
Weaknesses:
- The experimental results, while respectable, do not demonstrate a significant improvement over existing methods in terms of reconstruction and generation metrics.
- It is not very clear how the authors organize the IPT as an image. How do they order the direction vectors?
- The current formulation of IPT is not invariant to rotations.
- The injectivity results are only valid in the case of an infinite number of directions
Questions
- How do you order the direction vectors? Does different ordering affect reconstruction and generation results?
- Why do you use VAE as the generative model? Have you tried more advanced diffusion models?
Limitations
Yes
Justification for Final Rating
The authors addressed most of my concerns in the rebuttal. This work is significant to the academic community as it advances point cloud processing by offering a more efficient and accessible approach, potentially broadening the use of machine learning algorithms for point cloud tasks. I raise my rating to Borderline accept.
Formatting Issues
I do not notice any major formatting issues.
We thank the reviewer for their thoughtful review, which recognizes our core contributions. The faithfulness of the IPT has a significant premise, namely that the generation of point clouds can be viewed as a pure image generation task. Moreover, we see a significant number of avenues for future research. We indeed believe that the computational efficiency and stability are strong advantages that make our method very suitable in cases where hardware is restricted. Moreover, we consider similar approaches to be an excellent starting point for graph generation or higher-order data generation.
Summary
Here's a summary of our changes:
New experiments and additions:
- Rotation invariance: Add existing results with random rotation (data augmentation); clarify IPT equivariance properties; discuss possible extensions via spherical harmonics.
- Injectivity: Clarify that injectivity also holds for a finite number of directions (Appendix F, Theorem 1); highlight backpropagation experiments (Sec. 4.4, Table C.1) confirming practical invertibility.
- Generative model choice: Expand discussion on limitations of VAEs; motivate future directions with diffusion models, transformers, or general token-based formulations.
Clarifications and revisions:
- Ordering of directions: Clarify use of randomly sampled but fixed directions; explain treatment as unordered multi-channel 1D signal; mention positional encodings as possible improvement.
- Main contribution framing: Strengthen emphasis that IPT is a natural and expressive representation for point clouds, even with lightweight architectures.
- Future work: Explicitly outline potential extensions (diffusion, transformer-based IPT, invariance mechanisms).
Details
Please find our detailed responses below.
The experimental results, while respectable, do not demonstrate a significant improvement over existing methods in terms of reconstruction and generation metrics.
While we agree that if we had trained a large diffusion model with superior results our claims would be somewhat “easier” to sell, in doing so we would have overlooked the subtle but critical claim we aim to make in this paper: we find the IPT to be a natural and suitable representation for point clouds! Had we trained a large model, the additional capacity could compensate for a lack of expressivity. With our tiny models, that is simply not an option; therefore, our inductive biases have to be task-adequate. This supports our claim and makes it possible to translate tasks in the point cloud domain into tasks in the image domain.
We would also like to point out that our architecture was deliberately chosen to be lightweight; for image classification tasks, for instance, SOTA architectures would nowadays also have to be larger. To further improve reconstruction quality, we are considering exchanging the VAE for a (conditional) diffusion model; we believe that, in general, inspirations from latent diffusion can lead to substantial improvements to our generative model and the IPT-Encoder.
The primary focus of this work is to demonstrate the general suitability of the IPT framework for such tasks; we believe that we have accomplished this goal. For our revised version, we are scaling up the framework to include diffusion models, but within the restricted time of the rebuttal, we do not yet have finished experiments to show.
It is not very clear how the authors organize the IPT as an image. How do they order the direction vectors? How do you order the direction vectors? Does different ordering affect reconstruction and generation results?
Thanks for raising this great point! For two dimensions, the IPT admits a natural parametrisation of the directions with the angle $\theta$. However, in 3D (and higher dimensions) this is no longer possible, since we would need at least two parameters ($\theta$ and $\phi$) for a parametrisation of the unit sphere. Instead, we use a set of randomly sampled directions and stack them in an unordered but fixed fashion. Therefore, the only spatial structure in the IPT image is along the columns, which forms the motivation to view the signal as a multi-channel 1D signal. Our approach implicitly encodes the directions, and an explicit encoding of the directions through positional encoding could further improve the model. We will clarify these aspects in our revision.
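To illustrate the stacking, here is a minimal sketch of an IPT-like image built from randomly sampled but fixed directions. We use a simple binned count of inner products per direction for illustration; the transform in the paper may use a different (e.g., cumulative) formulation and normalization.

```python
# Minimal sketch (assumption: a binned-count variant of the IPT; the paper's exact
# definition may differ). Each randomly sampled but fixed direction yields one row
# of the image; columns discretize the inner-product (height) values.
import numpy as np

def ipt_image(points: np.ndarray, directions: np.ndarray, n_bins: int = 64,
              lo: float = -1.5, hi: float = 1.5) -> np.ndarray:
    """points: (N, d), directions: (D, d) unit vectors -> (D, n_bins) image."""
    heights = points @ directions.T                                  # (N, D)
    image = np.stack([np.histogram(heights[:, j], bins=n_bins, range=(lo, hi))[0]
                      for j in range(directions.shape[0])])          # (D, n_bins)
    return image.astype(np.float32)

rng = np.random.default_rng(0)
directions = rng.normal(size=(64, 3))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)      # fixed, unordered set
points = rng.uniform(-1, 1, size=(2048, 3))
print(ipt_image(points, directions).shape)                           # (64, 64)
```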
The current formulation of IPT is not invariant to rotations.
This is a great observation! Equivariance and invariance are indeed natural properties for us to consider. We already performed extensive experiments in 2D and 3D to see how well the model can handle such cases and found that simple data augmentation in the form of random rotations enables the model to learn the equivariance, despite the architecture not being intrinsically equivariant. We will add this experiment to the appendix. In addition, we will investigate improved formulations of the IPT, drawing, for instance, upon spherical harmonics to obtain "built-in" invariance. Notice that the IPT is equivariant with respect to the orthogonal group (or covariant in the strictest sense); we will clarify this in our revision.
The injectivity results are only valid in the case of an infinite number of directions
We thank the reviewer for raising this point and hope to clarify it. From a theoretical perspective, the IPT is injective for an infinite number of directions, but this result also holds for a finite number of directions, as we prove in Appendix F, Theorem 1. While this shows injectivity in theory, it does not show whether an IPT can actually be inverted in practice. To show that one can invert it in a practical and computationally efficient manner, we perform the backpropagation experiment in Section 4.4 and Table C.1 in the Appendix. This shows that the IPT can be inverted both in theory and in practice, and ours is the first paper to do so for general data.
Why do you use VAE as the generative model? Have you tried more advanced diffusion models?
Considering other types of models for generative modeling is certainly possible and something we wish to explore in future work. Expanding on the previous point, one could also consider the IPT as a bag of tokens with an additional direction coordinate. This could potentially lead to transformer architectures for the reconstruction (and generation) of IPTs and point clouds and is (most likely) a very fruitful future line of work. For added context, we also wish to mention that our work is the very first to provide a practical method to invert a topological transform and show its efficacy as an expressive representation for point clouds. We are excited about the abundance of future directions and are already working on follow-up work to explore some of these perspectives. Including all of the directions (some pun intended!) in our work may be beyond the scope of a single paper and would make it harder to understand the primary claims. We will, however, investigate which additional models can be shown as part of this work without losing the focus on representational efficiency.
We thank the reviewer again for their time and engagement with our work. We hope that our clarifications and proposed revisions address your questions and would kindly ask you to consider increasing your score if they do.
Thanks to the authors for the detailed response. The authors addressed my concerns on the order the direction vectors, invariance to rotations, injectivity of IPT. I will raise my rating. I still hope to see results of more advanced generative models such as diffusion models or auto-regressive generative models in the revised manuscript.
Thank you very much for the valuable feedback and for raising the score! We will incorporate all proposed experiments in the revised manuscript, as they are very valuable! If you have any additional questions, please let us know.
This paper proposes a novel representation for point clouds, called the Inner Product Transform (IPT), which encodes a point cloud as a multi-directional height histogram (essentially a 2D “image”). The key insight is that this representation is provably injective (under certain conditions), and simple CNNs can effectively learn to invert it. The authors use a lightweight CNN-based encoder and VAE to generate point clouds by first generating IPTs and then inverting them. The method achieves competitive reconstruction and generation results on ShapeNet while being significantly faster than flow- and diffusion-based baselines.
Strengths and Weaknesses
The paper presents a simple and elegant idea, backed by solid theory and clean implementation. The injectivity of IPT is well-motivated, and the authors convincingly demonstrate that their learned inverse can effectively reconstruct or upsample shapes from low-resolution IPTs back to 2048-point clouds. I also appreciate the clarity of presentation and the attention to efficiency; this is one of the few papers in the space that seriously benchmarks speed.
That said, there are some essential experimental gaps. First, everything is trained and evaluated on 2048-point clouds — and there’s no indication the method scales to higher-resolution shapes (e.g., 4K or 128K points). This is particularly limiting because the value of efficient generation becomes most critical in real-world applications that involve extensive or dense point clouds, such as LiDAR scans or detailed shape modeling. Second, although they highlight the stability of the latent space, there’s no real out-of-distribution or cross-category generalization shown. The one case they call “out-of-sample” stems from resolution downsampling, which is not the same thing. Lastly, while the method is efficient, the timing comparisons (Table 3) lack consistent hardware reporting, which makes it hard to judge the absolute magnitude of the speedup. They also don’t compare against at least one relevant recent method focused on efficient generation (e.g., Point Straight Flows, SPVD).
Overall, I like the direction and believe the IPT idea could have a broader impact, but I’d like to see how the authors respond to the experimental limitations.
Questions
1. Can your method scale to higher-resolution point clouds? A reconstruction or generation experiment at 4096 or more points would really help here.
2. You claim IPT encodes a stable latent space — but did you try testing it on unseen shape categories (e.g., train on chairs/planes, test on cars)?
3. For your speed claims (Table 3), could you confirm what GPU each baseline was measured on? Or alternatively, report FLOPs to standardize the comparison?
4. Why use 1D convs for the encoder but 2D convs for the VAE? This asymmetry isn’t clearly explained, and it seems like IPTs could support both.
5. Finally, have you observed how IPT behaves when two very similar shapes are encoded at low resolution? Some study of approximate collisions would support your injectivity claims in practice.
Limitations
yes.
Formatting Issues
no.
We thank the reviewer for their thorough positive review and appreciate their extensive deliberations, which show a great attention to detail! Indeed we propose an effective representation of point clouds that allows a simplification of the generative pipeline. In some sense our method is comparable to latent diffusion. Where latent diffusion aims to represent a large image with a smaller one, we propose to represent a point cloud with an image. Our paper aims to show that this representation is expressive, flexible and allows for a much simplified architecture. This perspective opens up a plethora of opportunities for other data modalities as well. Since we are the very first to attempt such an approach, we had to carefully scope our experiment section to provide strong evidence that our method is indeed an expressive and suitable representation. We agree that there are many opportunities for future research and extensions to the capabilities of our method, which we find rather exciting!
Summary
Here's a brief summary of our changes:
New experiments (with preliminary results):
- Efficiency: Reported FLOPS for IPT and baselines; will also add timing comparisons with Point Straight Flows, SPVD, and hardware details.
- Scalability: Added 4K-point generation experiments showing favorable scaling; extend discussion on hierarchical extensions (e.g., Point-E).
- Generalization: Conduct ShapeNet-13 and 3D MNIST multi-category experiments; preliminary results suggest competitive performance.
- Cross-domain: Preliminary reconstruction experiments across categories (Airplane→Car/Chair); add discussion of limitations; challenging tasks for all models.
- Approximate collisions: Add more interpolation/collision examples to support injectivity claims.
Clarifications and revisions:
- Clarify motivation for minimal convolutional architecture and note potential transformer-based improvements.
- Correct typo regarding 1D vs. 2D convolutions.
- Emphasize simplicity + flexibility of IPT and discuss broader opportunities.
- Extend discussion of limitations (scaling, cross-category generation) and future directions.
Details
Please find our detailed responses below.
The key insight is that this representation is provably injective (under certain conditions), and simple CNNs can effectively learn to invert it
Very much so! Our motivation to use a convolutional architecture is to show the IPT to be an expressive representation. That said, there are opportunities to improve our model. For instance, our encoder model only implicitly encodes the directions, and a transformer-based architecture viewing the IPT as a bag of tokens could further improve model performance. The current model is therefore not the best model per se, but the most minimal model that can invert the IPT effectively for distributions of shapes.
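For illustration only, the following is a minimal sketch of the kind of convolutional encoder described here, treating the directions of the IPT as channels of a 1D signal over thresholds and regressing point coordinates; it is not our actual architecture, and all layer sizes are placeholders.

```python
# Minimal sketch (not the paper's architecture): a small 1D-convolutional encoder that
# treats the D directions of an IPT as channels of a 1D signal over thresholds and
# regresses N point coordinates. Layer sizes are illustrative only.
import torch
import torch.nn as nn

class ToyIPTEncoder(nn.Module):
    def __init__(self, n_directions=64, n_bins=64, n_points=2048):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_directions, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(128 * n_bins, n_points * 3)
        self.n_points = n_points

    def forward(self, ipt):                      # ipt: (B, n_directions, n_bins)
        h = self.conv(ipt).flatten(start_dim=1)  # (B, 128 * n_bins)
        return self.head(h).view(-1, self.n_points, 3)

model = ToyIPTEncoder()
points = model(torch.randn(4, 64, 64))           # (4, 2048, 3)
print(points.shape)
```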
I also appreciate the clarity of presentation and the attention to efficiency … Lastly, while the method is efficient, the timing comparisons (Table 3) lack consistent hardware reporting …recent method focused on efficient generation (e.g., Point Straight Flows, SPVD)
Thanks for your positive feedback! We were not only excited by the simplicity of our method, but are also very pleased by its computational performance. We thus very much appreciate the suggestion to report FLOPs and find that a standard ResNet-18 image classifier requires approximately 6 times more FLOPs for a forward pass than our IPT-Encoder model. For our revised manuscript, we will also include the FLOPs for PointFlow and other methods, if available, as a comparison.
In terms of timing and hardware, Point Straight Flows reports its sampling time on an NVIDIA RTX 3090 GPU, which would be on a par with SetVAE (measured on a GTX 1080 Ti GPU). We used an NVIDIA GeForce RTX 4070 with 12GB VRAM and a 13th Gen Intel(R) Core(TM) i7-13700K with 32GB RAM. Where possible, we will also report the hardware used by other methods.
As for the comparison to SPVD, we find that the generation of ~600 “car” point clouds takes between 15–45 minutes, whereas our model only takes a couple of seconds on the same hardware (see above). Our method can additionally handle larger batch sizes, leading to further performance gains. We will add this comparison to the revised manuscript.
there’s no indication the method scales to higher-resolution shapes (e.g., 4K or 128K points)
Thanks for this suggestion! To show that our method also scales to larger point clouds, we have retrained our Encoder to generate 4K points, observing that computational performance is not significantly impacted, whereas generation quality was slightly improved (cf. Table 1 in our manuscript; this is not surprising given the fact that we are using more points, but it points towards favorable overall scalability).
Please find the encoding quality numbers for 4K points below:
| Dataset | MMD-CD | MMD-EMD |
|---|---|---|
| Airplane | 0.867208 | 1.60767 |
| Chair | 8.16931 | 6.32711 |
| Car | 5.13266 | 3.27385 |
Scaling this to ultra-large point clouds is going to require additional changes in the architecture (which we kept simple on purpose for this paper), and we aim to address this in future work. We will extend the discussions and limitations accordingly. The standard approach in the literature is to use a hierarchical approach with an upsampler model as is done with Point-E (Point-E: A System for Generating 3D Point Clouds from Complex Prompts, arXiv:2212.08751), for instance; we believe this to be a feasible strategy for the IPT as well, which we are excited to pursue in the future.
(Please also refer to the response for reviewer Ff9T; effectively, we ran additional experiments that demonstrate that our method also preserves relevant features as we further decrease the cardinality of the respective point clouds.)
there’s no real out-of-distribution or cross-category generalization shown / unseen shape categories
We agree that true cross-category generation is a very relevant (but challenging) task; we are preparing out-of-distribution experiments and will add them to the revised manuscript. In the meantime, to show that our model can handle more challenging shape distributions, we have conducted a new experiment and trained our generation pipeline on 13 classes of ShapeNet (following the setup of LION). Preliminary results indicate that our method performs on a par with existing methods on a subset of the test set (see table below). Since the evaluation over the full test set takes over a week to complete, we have included the evaluation results on a subset of the data for now; we will add the final results in our revision.
| Model | 1-NNA CD | 1-NNA EMD |
|---|---|---|
| PVD | 58.65 | 57.85 |
| PointFlow | 63.25 | 66.05 |
| LION | 51.85 | 48.95 |
| IPT (ours, 176 samples) | 54.95 | 48.96 |
| IPT (ours, 352 samples) | 63.92 | 50.28 |
We will also add an additional experiment for the generation of 3D MNIST point clouds to show that our method works well in a multi-category setting. For a comparison to other methods in a multi-category setting, we will evaluate our performance on the full ShapeNet-13 dataset, as proposed by LION (LION: Latent Point Diffusion Models for 3D Shape Generation, arXiv:2210.06978).
Finally, as an experiment on the performance with respect to unseen shapes (that is, cross-domain reconstruction), we ran preliminary experiments using an IPT-Encoder model trained on the "Airplanes" dataset and applying it to reconstruct the IPT of "Cars" and "Chairs" shapes, respectively. Since we are considering different datasets, this task is extremely challenging for any model and unsurprisingly impacts performance considerably. On the other hand, when training the IPT-Encoder on the joint dataset, performance improves considerably again. The latent space of a trained encoder is still stable with respect to modifications of the shape (i.e., modifications within the shape class), but the changes from an airplane (with prominent features like wings/tail/...) to a car or chair are too complicated to be captured without access to additional training samples. We will expand on this behaviour in the revised manuscript.
| Airplane to | MMD-CD | MMD-EMD |
|---|---|---|
| Car | 22.676 | 9.363 |
| Chair | 203.688 | 52.287 |
Why use 1D convs for the encoder but 2D convs for the VAE? This asymmetry isn’t clearly explained, and it seems like IPTs could support both
We experimented with both variants and found the 1D variant to work only slightly better. This was a typo on our side and it will be rectified in the updated manuscript. Thanks for the diligent read!
Some study of approximate collisions would support your injectivity claims in practice
This is a great point! In the interpolation experiment, we used two rather similar airplanes, where only the engines and tail were different. In our experience, our method handles such cases well, and we will add more examples of this kind to the manuscript.
We thank the reviewer again for their time and engagement with our work. We hope that our clarifications and proposed revisions address your questions and would kindly ask you to consider increasing your score if they do.
We would like to extend our gratitude to the reviewer and appreciate their feedback. Only a short time is left before the rebuttal period closes, and we would like to know whether our response has addressed the reviewer's concerns. If there are any more questions, we would be happy to answer them!
We thoroughly appreciate the very productive discussion so far! We have noticed that some reviewers have yet to engage in the rebuttal discussion, and we would highly appreciate their engagement, given the value it brings in strengthening our work. If there are any additional comments and/or suggestions by the reviewers, we would love to hear them!
Please find a summary of our proposed additional experiments, clarifications and proposed revisions for the updated manuscript.
Additional experiments
Based on the reviews, we aim to add the following set of experiments, for which most results have already been shown to the reviewers.
- Improved generation with a diffusion model.
- Show rotation equivariance for the encoder and VAE.
- Multi-class generation for the MNIST and ShapeNet-13 datasets.
- More examples for interpolation.
- Ablation with respect to the number of directions.
- Show approximate collisions.
- Cross-domain generalization for the IPT-Encoder.
- Rate distortion analysis for our backpropagation experiment.
- Point cloud completion with the GenRe dataset.
- Extended upsampling results with a range from 32 to 512 points.
Additional updates for the manuscript
We understand from the reviews that the following revisions will significantly strengthen our work.
- Add full hardware configuration (RTX 4070, i7-13700K, 32GB RAM) and training times across tables.
- Clarify motivation for the architecture.
- Clarify how backpropagation experiments confirm expressivity and reconstruction quality.
- Emphasize simplicity + flexibility of IPT and discuss broader opportunities.
- Ordering of directions: Clarify use of randomly sampled but fixed directions; explain treatment as unordered multi-channel 1D signal; mention positional encodings as possible improvement.
- Main contribution framing: Strengthen emphasis that IPT is a natural and expressive representation for point clouds, even with lightweight architectures.
- Future work: Explicitly outline potential extensions (diffusion, transformer-based IPT, invariance mechanisms).
Thank you very much for the summary of experiments and further updates to the manuscript. This will be very helpful in the discussion period.
The final ratings are 3 borderline accepts and 1 accept. The AC has read the reviews and rebuttal, and discussed the submission with the reviewers. The reviewers raised a number of points during the review phase, including limited experiments and analysis, clarity of methodology, and substantiating claims of compression. The authors were able to address these points during the rebuttal and discussion phases, and the reviewers reached a positive consensus. The AC recommends that the authors incorporate the feedback and suggestions provided by the reviewers, and the materials presented in the rebuttal, which would improve the next revision of the manuscript.