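Average rating: 5.4 / 10

Poster; 5 reviewers

Lowest rating: 3, highest: 7, standard deviation: 1.4

Average confidence: 4.0

Soundness: 3.0, Contribution: 3.0, Presentation: 2.6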

NeurIPS 2024

Assembly Fuzzy Representation on Hypergraph for Open-Set 3D Object Retrieval

Yang Xu, Yifan Feng, Jun Zhang, Jun-Hai Yong, Yue Gao

Submitted: 2024-04-28; updated: 2024-11-06

Abstract

Keywords

Hypergraph, 3D Object Retrieval, Open-Set Learning, 3D Part Assembly, Fuzzy Representation

Reviews and Discussion

Review

Rating: 3, Confidence: 3, 2024-06-24

This paper presents a novel 3D object retrieval method. First, to facilitate this task, the authors build 3 datasets for training and evaluation, which may significantly benefit the community. Then the paper proposes the Isomorphic Assembly Embedding (IAE) and the Structured Fuzzy Reconstruction (SFR) modules, which are designed to generate assembly embeddings with geometric-semantic consistency and overcome the distribution skew of unseen categories. In addition, HIConv is proposed to capture high-order correlations within and among objects. Extensive experiments show that the method achieves state-of-the-art performance.

Strengths

This paper builds 3 datasets for the task, which may facilitate future research.
The paper proposes several novel modules to capture the part-level and inter-object features for object retrieval.
The task itself is important in shape understanding.

Weaknesses

No visualization results.
The presentation is hard to understand. There are quite a few complex equations, such as Eq. 2 and Eq. 4. Please briefly explain what they mean and how they work.
In Fig. 1, intra-object features are extracted before inter-category features. But in Fig. 2, I only see inter-object features. It is hard for me to match them up.
I still don't understand the input. So you need a dense point cloud with ground-truth 3D part segmentation as input, right? If the segmentation is not perfect, will the method collapse? If the point cloud undergoes an SE(3) transformation, will the method collapse? Can this method handle partial point cloud input, such as a point cloud back-projected from a depth map?

Questions

See weaknesses.

Limitations

The authors did not discuss the limitations.

Author Response

2024-08-07

Visualization (Weakness 1) We apologize for the lack of sufficient visualization results. We provide some visualized examples of the retrieval results in Fig. R3 of the rebuttal PDF and we will provide more.
Equations (Weakness 2) We have revised the expression and explanation of these two equations. Eq. 2 states the goal of the open-set retrieval task (minimizing the expected risk), a widely accepted form of task definition [1-3] in the retrieval field. Eq. 4 defines the HIConv layer and follows the commonly used form for graph-based convolution [4-6].

[1] View-based 3-D object retrieval[M]. Morgan Kaufmann, 2014.
[2] Hypergraph-based multi-modal representation for open-set 3D object retrieval. TPAMI, 2023.
[3] SHREC’22 track: Open-set 3D object retrieval. 2022.
[4] How powerful are graph neural networks? ICLR, 2019.
[5] Hypergraph neural networks. AAAI, 2019.
[6] Hgnn+: General hypergraph neural networks. IEEE TPAMI, 2022.
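For readers unfamiliar with the cited convolutions, the standard layer forms from [4] (GIN) and [5] (HGNN) are reproduced below as background. This is not the paper's Eq. 4, whose exact form is not given in this thread; the symbols follow the cited works rather than the submission.

```latex
% GIN layer [4]: sum aggregation over the neighbors N(v) of vertex v
h_v^{(k)} = \mathrm{MLP}^{(k)}\Big( \big(1 + \epsilon^{(k)}\big)\, h_v^{(k-1)} + \sum_{u \in \mathcal{N}(v)} h_u^{(k-1)} \Big)

% HGNN layer [5]: H is the vertex-hyperedge incidence matrix, W the diagonal
% hyperedge weight matrix, D_v and D_e the vertex and hyperedge degree matrices,
% \Theta^{(l)} the learnable parameters, and \sigma a nonlinearity
X^{(l+1)} = \sigma\Big( D_v^{-1/2}\, H\, W\, D_e^{-1}\, H^{\top} D_v^{-1/2}\, X^{(l)}\, \Theta^{(l)} \Big)
```

Both layers share the "aggregate neighbors, then transform" pattern that the response refers to as the commonly used form.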
Framework (Weakness 3): Thanks for your valuable suggestion. We apologize for the typo in Figure 2: "inter-object" should be corrected to "intra-object". We have restructured the presentation of the proposed framework, which consists of two sequentially connected modules: IAE and SFR. a) The IAE module takes basic part features (as explained in Answer 4) as input. This module employs a structure-aware convolution layer and a set of auto-encoders to achieve assembly fusion of different parts within an object. In this module, the structure-aware convolution layer is implemented by constructing an isomorphism hypergraph and a hypergraph isomorphism convolution function (as explained for Eq. 4 in Answer 2).
b) The SFR module takes assembly embedding for each object as input, utilizing structure-aware feature smoothing and distillation through hypergraph convolution and memory bank reconstruction, respectively. Finally, this module generates the final features (fuzzy embeddings) for similar object matching based on feature distance, thereby enabling retrieval.
Input (Weakness 4) Thanks for your valuable suggestion. The inputs for our framework are the part features rather than dense point clouds, and we do not need ground truth 3D part segmentation as input (as described in lines 259-263 and lines 476-478 of the submission). Instead of extracting features from the segmented parts of the point cloud, we use a segmentation network to obtain point-wise features for each point and then average these point-wise features to obtain part features. As shown in Fig. R2 of the rebuttal PDF, the steps are as follows:

a) Input the point clouds of an object.
b) For each point, obtain its point-wise feature and part labels through a pre-trained point cloud part segmentation network.
c) Select the points belonging to the top-$n$ most frequent part categories and then average the point-wise features with the same top-$n$ part label. Then we calculate the average feature of the other points. In this way, we obtain $n+1$ part features for each object as the input for our HAFR framework.
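Steps a) to c) can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' code: the function name `part_features`, the array shapes, and the zero-padding for objects with fewer than $n$ predicted labels or no leftover points are all assumptions.

```python
import numpy as np

def part_features(point_feats, part_labels, n=3):
    """Pool per-point features into n + 1 part features.

    point_feats: (P, D) point-wise features from a segmentation network.
    part_labels: (P,) predicted part label for each point.
    """
    labels, counts = np.unique(part_labels, return_counts=True)
    # Top-n most frequent predicted labels (stable sort keeps ties deterministic).
    top = labels[np.argsort(-counts, kind="stable")[:n]]
    dim = point_feats.shape[1]
    # One averaged feature per top-n label.
    feats = [point_feats[part_labels == lab].mean(axis=0) for lab in top]
    # Assumption: pad with zeros when fewer than n labels are predicted.
    while len(feats) < n:
        feats.append(np.zeros(dim))
    # One extra feature averaging all remaining points.
    rest = ~np.isin(part_labels, top)
    # Assumption: zero vector when no points are left over.
    feats.append(point_feats[rest].mean(axis=0) if rest.any() else np.zeros(dim))
    return np.stack(feats)
```

With $n=3$ this returns a 4-row array per object, matching the 4-part setting described in the response.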

We need neither dense point clouds nor segmentation ground truth as input. As shown in Tab. R1 of the rebuttal PDF, compared with the SOTA point cloud segmentation methods [7][8], the features obtained using PointNet did not significantly affect the results. Therefore, we believe our method will not collapse if the segmentation is not perfect. Besides, our framework is a feature-driven method (as described in lines 157-158) and does not directly process raw data (dense point clouds). Thus, its robustness to SE(3) transformations and partial back-projected point clouds is equivalent to that of the point feature extraction network. These backbone networks offer rotation equivariance and adaptability to partial data, as shown in well-known works such as [7-11]. Therefore, we believe the method will not collapse when confronted with SE(3) transformations and partial data.
[7] Pointnext: Revisiting pointnet++ with improved training and scaling strategies. NIPS, 2022.
[8] Segment any point cloud sequences by distilling vision foundation models. NIPS, 2024.
[9] Pointnet: Deep learning on point sets for 3d classification and segmentation. CVPR, 2017.
[10] Pointnet++: Deep hierarchical feature learning on point sets in a metric space. NIPS, 2017.
[11] Uni3D: Exploring Unified 3D Representation at Scale. ICLR, 2024.
Limitations (Limitations) We have provided a brief discussion in lines 330-332.

a) Experiments with varying numbers of parts. The manuscript contains only limited experiments on the influence of the number of parts. We conducted more experiments with the number of input part features set to 3 ($n=2$) and 5 ($n=4$). As shown in Tab. R1 of the rebuttal PDF, both the 4-part (ours) and 5-part settings show performance improvements over the 3-part setting, indicating that more detailed segmentation provides richer information for assembly-based retrieval. However, our 4-part setting currently performs better than the 5-part setting. We infer that once the number of parts reaches a certain level, the utilization of part information becomes saturated.

b) Discussion of societal impacts. As shown in lines 158-159 and Appendix C Algorithm 1, the proposed HAFR framework is feature-driven and relies exclusively on basic features as input, rather than processing raw data end to end. This feature-driven representation approach preserves extensibility to other common multimedia data such as text, audio, video, and their pieces. We believe this paper can provide a general theoretical foundation and methodological reference for the application of multimedia retrieval in practical real-world scenarios. We will release the datasets and code immediately after the anonymous review period.

2024-08-11

Dear Reviewer,

We would greatly appreciate any updates or feedback you might have regarding our responses to your initial comments. Your insights are valuable to us as we work to improve our paper.

If you need any additional information or clarification from our side, please don't hesitate to let us know.

Thank you for your time and consideration.

2024-08-13

We sincerely appreciate your positive feedback and professional comments on our work. Your valuable suggestions have been crucial in improving the quality of our paper. We also appreciate the rating improvement, and will carefully revise the manuscript according to your review comments and ensure the rigor of the experimental results and references.

Looking forward to academic discussions with you after the anonymous period of NeurIPS 24, if possible! We are willing to share all our experiences, datasets, and codes of this work.

Comment: Final Rating

2024-08-13

Thanks for your rebuttal. I will increase my rating to borderline accept.

Review

Rating: 7, Confidence: 4, 2024-07-11

This paper proposes to utilize a part-assembly representation to mitigate the distribution skew of unseen categories, enhancing the generalization performance of open-set 3D object retrieval. Compared to previous methods, this paper benefits from part-level rather than object-level representation learning, obtaining good generalization on unseen categories. To utilize the part-level representation, this paper introduces the Isomorphic Assembly Embedding (IAE) and Structured Fuzzy Reconstruction (SFR) modules. The former generates the assembly embedding isomorphically for each object, and the latter generates the fuzzy representation, thus overcoming the distribution skew of unseen categories.

Strengths

The problem is well-motivated and the solution seems to work well. The results are good. The paper also contributes three 3D point cloud datasets with multiple part annotations for benchmarking. Extensive experiments on the three benchmarks demonstrate the superiority of the proposed method over current state-of-the-art 3D object retrieval methods.

Weaknesses

The datasets OP-INTRA and OP-COSEG mentioned in the paper may have limitations in category diversity, number of parts, and dataset size, which may affect the generalization ability of the model.
The framework comprises many sub-architectures, such as the HIConv layer, multiple auto-encoders, fuzzy embeddings, and a memory bank, so it seems relatively complex. However, this paper does not explicitly discuss the computational efficiency of the model, including training and inference time and computational cost.
Though the paper proposes a solution to the open set problem, the datasets are all virtual. Its generalization ability to unseen categories in real-world applications still needs further verification.
The ablation studies show the effect of the HIConv layer. However, only comparisons with MLP and GIN are performed, but no comparisons with other neural layers such as KAN, nor is the number of HIConv layers ablated.
The experiments are only conducted on the proposed datasets. The generalization ability of the model on a wider data distribution requires more verification. It would be better to add some experiments on previous public datasets or datasets without open-set settings to demonstrate generalization capabilities.

Questions

The quantitative performance comparisons in Table 2 show the superiority of the proposed method. However, this paper only surpassed the second place by a little bit in some metrics, and there is no sufficient statistical information to prove the significance of the results, such as p-values.

Limitations

The paper should add discussions on limitations and possibly show some failure cases.

Author Response

2024-08-07

Response for Reviewer ZtnM

We sincerely thank you for the valuable comments and advice, which provided important guidance for us to enhance the rigor and coherence of our paper and directed the focus of our future work.

About the generalization ability of the model (Answer for Weakness 1 and Weakness 3):
Thanks for your valuable suggestion. As an early exploration of open-set learning in a fine-grained way, we believe this paper should focus on exploring the new assembly-based retrieval task and designing a novel open-set learning paradigm based on both inter-object and intra-object correlations. Therefore, we selected three 3D object datasets with typical geometrical structures for experiments, which are representative of a fine-grained approach in an open-set environment. Experimental results demonstrate the necessity and effectiveness of the assembly-based paradigm and framework. However, further improving the generalization ability of our framework and extending it to encompass more intertwined factors in complex open-set environments is one of the key directions for future work. Specifically, we have preliminarily experimented with an adaptive-scale isomorphism computation method for generalization across datasets and domains, inspired by [1] and [2]. Additionally, we have developed a hypergraph-based dynamic system approach to manage the increasing number of parts and labels, inspired by [3].
[1] Feng Y, et al. Hypergraph isomorphism computation[J]. IEEE TPAMI, 2024.
[2] Zhou J, et al. Uni3D: Exploring Unified 3D Representation at Scale. ICLR, 2024.
[3] Yan J, et al. Hypergraph dynamic system[C]. ICLR, 2024.
About module-wise computational requirements (Answer for Weakness 2):
Our experiments are conducted on a computing server with one Tesla V100-32G GPU and one Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz. We provide a detailed comparison of model parameters, training time, and inference time for the two stages in Tab. R2 of the rebuttal PDF.
About more ablation studies (Answer for Weakness 4):
Thanks for your valuable suggestion. We conducted more ablation studies, especially on HIConv and other neural layers. Our method employs only a single HIConv layer, so we explored the impact of additional HIConv layers. In addition, we replaced the HGNN with a KAN network [4] in the SFR module. As shown in Tab. R1 of the rebuttal PDF, the full version of our framework yields the best performance; these results indicate the effectiveness of our design in assembly-based open-set 3D object retrieval.
About the public datasets (Answer for Weakness 5):
Thanks for your valuable suggestion. The proposed datasets are constructed from public datasets without open-set settings, making them suitable for open-set retrieval experiments. As mentioned in Answer 1, we will continue to explore assembly-based retrieval research in our extended version by constructing more datasets and conducting more experiments.
About the significance of the proposed method (Answer for Questions):
Thanks for your valuable suggestion. We provide statistics for our framework (HAFR) and the second-best method (HGM$^2$R). As shown in Tab. R3 of the rebuttal PDF, this statistical information supports the significance of the results.
About failure cases and limitations (Answer for Limitations):
Thanks for your valuable suggestion. We provide some failure cases in Fig. R3 of the rebuttal PDF. In these failure cases, the query objects (rocket, pistol) and the wrongly matched target objects (car, motorbike) share a certain similarity in their part segmentation. Although the significant performance improvement of the HAFR framework demonstrates the necessity of assembly-based research in open-set learning, these corner cases also indicate the need for an equilibrium between different levels of labels, which requires balancing global semantic and local geometric information. This issue is the same as the generalization ability mentioned in Weakness 1. However, this paper focuses more on the fundamental challenge brought by part assembly for open-set retrieval. Therefore, we only consider typical segmentations in this study. As mentioned in Answer 2 above, we are currently conducting research to address these more complex environments. Thank you for your keen observations and academic insights.

Thank you again for your valuable suggestions, especially your professional advice on future work in assembly-based open-set learning.

Review

Rating: 6, Confidence: 4, 2024-07-11

The manuscript introduces a framework (HAFR) for addressing the challenge of open-set 3D object retrieval. The authors propose a bottom-up approach focusing on part assembly, leveraging both geometric and semantic information of object parts to enhance retrieval performance across categories, including those unseen during training.

The HAFR framework consists of two main modules: Isomorphic Assembly Embedding (IAE) and Structured Fuzzy Reconstruction (SFR). The IAE module utilizes Hypergraph Isomorphism Convolution (HIConv) and assembly auto-encoders to generate embeddings with geometric-semantic consistency. The SFR module tackles distribution skew in open-set retrieval by constructing a leveraged hypergraph based on local and global correlations and employs a memory bank for fuzzy-aware reconstruction.

The authors have created three datasets, OP-SHNP, OP-INTRA, and OP-COSEG, to benchmark their approach. Extensive experiments demonstrate the superiority of HAFR over current state-of-the-art methods in open-set 3D object retrieval tasks.

Strengths

The paper presents a method for open-set 3D object retrieval that cleverly integrates part-level information using hypergraphs, which is a unique and promising direction in the field. The HAFR framework is well-thought-out, with clearly defined modules (IAE and SFR) that address different aspects of the retrieval task, from assembly isomorphism to distribution skew mitigation.
The construction of three new datasets with part-level annotations provides a valuable resource for the research community and supports the validation of the proposed method.
The methodology is clearly described, and the algorithms are well-structured, making it relatively easy for readers to follow the technical contributions.
The paper is well-written and easy to follow.

Weaknesses

The paper does not address scenarios with varying numbers of parts per object. Expanding the framework to handle flexibility in the number of parts could improve its applicability.
The manuscript could benefit from a discussion on the computational complexity and efficiency of the proposed methods, especially when scaling to larger datasets or higher-dimensional part features.
Why not evaluate on PartNet (https://partnet.cs.stanford.edu/)?
Although the paper claims state-of-the-art performance, the method does not achieve the best result on every metric (SDML is the best on OP-COSEG for the NDCG metric). What is the reason?
Some implementation details, such as network architecture specifics and hyperparameter settings, could be better elaborated to ensure reproducibility.
The paper mentions that data and code will be made available upon acceptance, which is good practice. However, providing this information upfront or during the review process could enhance transparency and reproducibility. For the three datasets, the detailed construction is missing, and I encourage the authors to publicize the data to benefit the community.
The limitations and failure cases should be discussed comprehensively.

Questions

The manuscript presents a contribution to the field of 3D object retrieval, particularly in the open-set scenario. The proposed HAFR framework is innovative and has been demonstrated to be effective through rigorous experimentation. However, there are areas where the manuscript could be improved, particularly in terms of computational efficiency, limitations on various parts, and other minor issues. Addressing these points would likely enhance the manuscript's impact and applicability in the field.

Limitations

Author Response

2024-08-07

Varying numbers of parts (Weakness 1)
HAFR takes 4 part features as input for each object in this manuscript for now. As shown in Fig. R2 of the rebuttal PDF, the steps for part features generation are as follows:
a) Input the point clouds of an object.
b) For each point, obtain its point-wise feature and part labels through a pre-trained point cloud part segmentation network.
c) Select the points belonging to the top-$n$ most frequent part categories and then average the point-wise features with the same top-$n$ part label. Then we calculate the average feature of the other points. In this way, we obtain $n+1$ part features for each object as the input for our HAFR framework ($n=3$ in this paper).
We conducted more experiments with the number of input part features set to 3 ($n=2$) and 5 ($n=4$). As shown in Tab. R1 of the rebuttal PDF, both the 4-part (ours) and 5-part settings show certain performance improvements over the 3-part setting, indicating that more detailed segmentation provides richer information for assembly-based retrieval. However, our 4-part setting currently exhibits the best performance (better than the 5-part setting). We infer that once the number of parts reaches a certain level, the utilization of part information becomes saturated.
Computational complexity and efficiency (Weakness 2)
Our experiments are conducted on a computing server with one Tesla V100-32G GPU and one Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz. We provide a detailed comparison of model parameters, training time, and inference time for the two stages in Tab. R2 of the rebuttal PDF. The SFR module occupies a very small parameter space (less than 3%) and is directly affected by the size of datasets. The IAE module only consumes up to 34%, which determines the effectiveness with high-dimensional features. We believe that our method can remain effective when scaling to larger datasets or higher-dimensional part features.
PartNet (Weakness 3)
The ShapeNet Part [1] dataset we used and the PartNet [2] dataset are both fine-grained annotated subsets of ShapeNet, with PartNet having a greater variance in the number of parts per object. As an early exploration of open-set learning in a fine-grained way, we believe this paper should focus on exploring the new assembly-based retrieval task and designing a novel open-set learning paradigm based on both inter-object and intra-object correlations. Therefore, we selected ShapeNet Part, whose parts are more evenly distributed, rather than PartNet, since it reflects the core challenges of a fine-grained approach in an open-set environment. We have preliminarily experimented with an adaptive-scale isomorphism computation method for better generalization, inspired by [3] and [4]. Additionally, we have developed a hypergraph-based dynamic system approach to manage the increasing number of parts and labels, inspired by [5], for future work.
[1] A scalable active framework for region annotation in 3d shape collections[J]. ACM ToG, 2016.
[2] Partnet: A large-scale benchmark for fine-grained and hierarchical part-level 3d object understanding[C]. CVPR, 2019.
[3] Hypergraph isomorphism computation[J]. IEEE TPAMI, 2024.
[4] Uni3D: Exploring Unified 3D Representation at Scale. ICLR, 2024.
[5] Hypergraph dynamic system[C]. ICLR, 2024.
SDML on OP-COSEG (Weakness 4)
SDML proposes a scalable multimodal learning paradigm for retrieval by predefining a common subspace, aiming to minimize the intra-class difference. In the context of assembly-based retrieval, the SDML method projects all parts into a class-related common space to achieve unification. However, this approach results in an overall feature shift, leading to the loss of the unique geometrical information of parts and the correlations among them. As shown in Fig. R3 of the rebuttal PDF, compared to the OP-SHNP and OP-INTRA datasets, the objects in the OP-COSEG dataset exhibit stronger symmetry, so this biased unification has less influence. Furthermore, NDCG is a metric that considers the global perspective, so SDML achieves a slight advantage (0.11%) in NDCG on OP-COSEG through this globally biased unification. However, the significantly worse results on the other two datasets and on other commonly used metrics demonstrate the limitation of this method.
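To make the "global perspective" remark concrete, here is a minimal sketch of NDCG over a full ranked list with binary relevance (1 if a retrieved object shares the query's class, else 0). The exact NDCG variant and cutoff used in the paper are not stated in this thread, so the full-list, log2-discount form below is an assumption.

```python
import math

def ndcg(ranked_relevance):
    """NDCG over a full ranked list of relevance values (binary or graded)."""
    def dcg(rels):
        # Position i (0-based) is discounted by log2(i + 2), so rank 1 has weight 1.
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

    ideal = dcg(sorted(ranked_relevance, reverse=True))
    # Convention chosen here: a query with no relevant targets scores 0.
    return dcg(ranked_relevance) / ideal if ideal > 0 else 0.0
```

Because every position in the list contributes (with a slowly decaying discount), a method that orders the long tail of targets slightly better can gain a small NDCG edge even when its top-ranked results are worse.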
Reproducibility and open access (Weakness 5 and 6)
We have provided a brief description of implementation details and dataset generation in the Appendix. Besides, we are well prepared and will release the datasets, code, configurations, and pre-trained models immediately after the anonymous review period of NeurIPS 24. We are willing to share our experiences on this (OpenReview) or other open-source platforms.
Failure cases (Weakness 7)
We provide some failure cases in Fig. R4 of the rebuttal PDF. In these failure cases, the query objects (rocket, pistol) and the wrongly matched target objects (car, motorbike) share a certain similarity in their part segmentation. Although the significant performance improvement of the HAFR framework demonstrates the necessity of assembly-based research in open-set learning, these corner cases also indicate the need for an equilibrium between different levels of labels, which requires balancing global semantic and local geometric information. This issue is the same as the generalization ability mentioned in Weakness 1. However, this paper focuses more on the fundamental challenge brought by part assembly for open-set retrieval. Therefore, we only consider typical segmentations in this study. As mentioned in Answers 1 and 3 above, we are currently conducting research to address these more complex environments.
Writing (Limitations):
We apologize for the typos and writing issues in this manuscript. We will conduct a thorough review and revision of the entire paper to ensure the clarity and rigor.

Comment: Reply

2024-08-13

Thanks for your great efforts! After reading the response, some major issues have been addressed well, so I still lean towards positive for the submission. I encourage the author to add these clarifications to the main paper. Thanks!

2024-08-13

We sincerely appreciate your positive feedback and professional comments on our work. Your valuable suggestions have been crucial in improving the quality of our paper. We will carefully revise the manuscript according to your review comments and ensure the rigor of the experimental results and references.

Review

Rating: 5, Confidence: 5, 2024-07-12

This paper presents a method for finding similar samples from a set of 3D objects given query objects in an open setting, where objects can belong to both already seen and new categories. This method is based on considering 3D objects as hypergraphs consisting of individual geometric and semantic parts of objects. The hypergraph is used to form Isomorphic Assembly Embedding. The second part of the proposed HAFR framework is the Structured Fuzzy Representation module that constructs a hypergraph based on local certainty and global uncertainty correlation to enable transfer from seen to unseen categories. The authors propose a new layer, HIConv, which improves the quality of the generated representation. The authors demonstrate the effectiveness of their approach on three datasets that they constructed for this task.

Strengths

The idea that one can understand the whole object shape from its parts sounds interesting and reasonable.
The description of Isomorphic Assembly Embedding and Structured Fuzzy Reconstruction is formal and rather clear.
The authors conduct extensive ablation studies of their method.

Weaknesses

Based on the provided experiments, it is unclear if HAFR can generalize well to an unseen domain. Are the results in Table 2 provided for the same suite of model weights?
The literature review does not include existing methods for open-set 3d object retrieval and recent methods for closed-set 3d object retrieval.
When comparing with other methods, the authors use their own modification of existing multimodal methods. A comparison with modern methods for open-set 3d object retrieval, such as [1], is necessary to demonstrate the effectiveness of this particular method of object representation.
The method's description lacks an explanation of how the resulting fuzzy embeddings are used to find similar objects. Additionally, the description contains undefined concepts like isomorphism loss and integration function. If these concepts are not introduced by the authors, please include references to articles where they are defined.

[1] Zhou, J., Wang, J., Ma, B., Liu, Y. S., Huang, T., & Wang, X. (2023). Uni3d: Exploring unified 3d representation at scale. arXiv preprint arXiv:2310.06773.

Questions

How are fuzzy embeddings used to find similar objects in the target set?
What is the size of the memory anchors bank? Will the method remain effective if the dataset contains more than 16 categories?
The method is described as open-set, but it requires GT segmentation of the object into parts. How do you see its applicability in real-world scenarios where GT segmentation might not be available for any object? How much would the quality metrics decrease if we used a neural network model for part segmentation?

Limitations

The authors discuss limitations in the conclusion regarding the use of the assembly fuzzy representation for a varying number of object parts. In my opinion, another limitation is the need to segment the point cloud into parts to use this method.

Author Response

2024-08-07

Generalization ability and comparison (Weakness 1)
All categories of the testing set are unseen during training (the widely accepted setting for open-set retrieval [1-2]), so the retrieval results in this paper are obtained on unseen categories. The comparisons between different methods are conducted under the same settings and training weights.
We have explained this open-set setting in Section 5.1 (lines 254-255) and Appendix A (Table 4). We will provide a more detailed explanation in the revised version of this paper. Quantitative and qualitative results show that HAFR achieves significant improvements over existing methods. Under this open-set setting, this improvement demonstrates the superiority of HAFR in terms of generalization capability.
[1] Open-environment machine learning. National Science Review, 2022.
[2] Hypergraph-based multi-modal representation for open-set 3d object retrieval. TPAMI, 2023.
Related works (Weakness 2)
In our submission, we have provided a brief review of closed-set 3D object retrieval methods in lines 78-88, and a summary of recent open-set 3D object retrieval methods in lines 95-97.
More comparisons (Weakness 3)
We provide more comparisons with modern methods [3][4] for open-set 3D object retrieval. As shown in Tab. R1 of the rebuttal PDF, the results indicate that these two methods perform similarly to the existing multimodal method (HGM$^2$R). Our HAFR framework shows significant improvements over all these SOTA open-set retrieval methods. This improvement highlights the limitations of existing retrieval paradigms in open-set environments and demonstrates the superiority of our assembly-based open-set retrieval paradigm.
[3] Uni3D: Exploring Unified 3D Representation at Scale. ICLR, 2024.
[4] Openshape: Scaling up 3d shape representation towards open-world understanding. NIPS, 2023.
The approach for finding similar objects (Weakness 4 and Question 1)
After obtaining the fuzzy embeddings (feature vectors) for all objects, we follow the common distance-based approach in the multimedia retrieval field to find similar objects, which has been a widely accepted practice in recent decades [5-7]. Given a feature vector (fuzzy embedding) of a query object:
a) Calculate the Euclidean distance between the query feature vector and all target object feature vectors.
b) Sort these distances in ascending order to get the top-$n$ nearest objects.
c) Determine whether the class labels of the top-$n$ nearest objects are the same as the query object label and calculate metrics such as mAP, NDCG, ANMRR, and the PR curve, where $n$ denotes the hyperparameter of the evaluation metrics and can be chosen based on the specific scenario.
[5] A survey of content-based image retrieval with high-level semantics. PR, 2007.
[6] 3-D object retrieval and recognition with hypergraph analysis. TIP, 2012.
[7] Triplet-center loss for multi-view 3d object retrieval. CVPR, 2018.
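For illustration only, steps a) to c) above can be sketched as follows. This is a minimal NumPy sketch rather than the evaluation code used for the paper: the function names are hypothetical, only mAP is shown, and normalizing average precision by the number of hits inside the top-$n$ list is one common convention that we assume here.

```python
import numpy as np

def rank_targets(query, targets):
    """Steps a) and b): indices of targets by ascending Euclidean distance."""
    dists = np.linalg.norm(np.asarray(targets, dtype=float) - query, axis=1)
    return np.argsort(dists, kind="stable")

def average_precision(query_label, ranked_labels, n=None):
    """Step c): AP over the top-n ranked targets (all targets if n is None)."""
    ranked_labels = np.asarray(ranked_labels)[:n]
    hits = (ranked_labels == query_label).astype(float)
    if hits.sum() == 0:
        return 0.0
    # Precision at each rank, counted only at ranks where a hit occurs.
    precision_at_rank = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float((precision_at_rank * hits).sum() / hits.sum())

def mean_average_precision(query_feats, query_labels,
                           target_feats, target_labels, n=None):
    """Average the per-query AP over all queries."""
    target_labels = np.asarray(target_labels)
    aps = []
    for feat, label in zip(query_feats, query_labels):
        order = rank_targets(np.asarray(feat, dtype=float), target_feats)
        aps.append(average_precision(label, target_labels[order], n))
    return float(np.mean(aps))

# Toy example with 2-D embeddings (made-up values).
targets = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.1, 0.0]])
target_labels = np.array([0, 1, 1, 0])
order = rank_targets(np.array([0.0, 0.0]), targets)
```

Because the ranking depends only on the embeddings, the same routine applies to seen and unseen categories; labels are used only when scoring the ranked list.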
More ablation on the Memory Bank (For Question 2)
The memory bank we used has 128 anchors and 512 dimensions. The memory bank is a commonly used knowledge distillation method in deep learning. Specifically, it constructs several anchors (feature vectors) and then learns the activation scores of the target embeddings relative to all anchors. The memory bank is independent of the classification layer, and its size usually does not affect the performance of the network when the number of categories changes. This has been validated by methods in multiple fields [8]. We have conducted more ablation studies on the memory bank. As shown in Tab. R1 of the rebuttal PDF, changes in the memory size have almost no impact. We believe HAFR remains effective even if the dataset contains more than 16 categories.
[8] Semi-supervised semantic segmentation with pixel-level contrastive learning from a class-wise memory bank. CVPR, 2021.
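As a rough illustration of the anchor-and-activation idea described above, the sketch below computes activation scores of embeddings against a bank of 128 anchors with 512 dimensions. The softmax over scaled dot products and the weighted readout are our assumptions for this example, not necessarily the formulation in the paper; in practice the anchors would be learnable parameters, whereas here they are random placeholders.

```python
import numpy as np

NUM_ANCHORS, DIM = 128, 512  # sizes stated in the response above

def memory_activation(embeddings, anchors):
    """Activation scores of each embedding relative to all anchors.

    Assumed form: softmax over scaled dot products (hypothetical choice).
    """
    logits = embeddings @ anchors.T / np.sqrt(anchors.shape[1])
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

def memory_readout(embeddings, anchors):
    """Re-express each embedding as a score-weighted mix of the anchors."""
    return memory_activation(embeddings, anchors) @ anchors

rng = np.random.default_rng(0)
anchors = rng.standard_normal((NUM_ANCHORS, DIM))   # placeholder anchors
embeddings = rng.standard_normal((4, DIM))          # placeholder embeddings
scores = memory_activation(embeddings, anchors)
readout = memory_readout(embeddings, anchors)
```

Note that the shapes involved depend only on the number of anchors and the embedding dimension, not on the number of categories, which is consistent with the claim that the memory bank is independent of the classification layer.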
Methods for part feature extraction (For Question 3)
We have adopted a neural network model for part segmentation to obtain part features (as described in lines 259-263 and lines 476-478 of the submission). However, instead of extracting features from the segmented parts of the point cloud, we use a segmentation network to obtain point-wise features for each point and then average these point-wise features to obtain part features. As shown in Fig. R2 of the rebuttal PDF, the steps are as follows:
a) Input the point cloud of an object.
b) For each point, obtain its point-wise feature and part label through a pre-trained point cloud part segmentation network.
c) Select the points belonging to the top-$n$ most frequent part categories and average the point-wise features that share the same top-$n$ part label. Then calculate the average feature of all remaining points.
In this way, we obtain $n+1$ part features for each object as the input to our HAFR framework, where $n$ is a hyper-parameter.
Therefore, in our framework, we do not need to know how many parts each object should be segmented into. We have conducted more comparisons on the segmentation method. Using features from SOTA point cloud segmentation methods [9][10] instead of PointNet did not significantly change the results. Therefore, we believe the method remains applicable in real-world scenarios and is not influenced by the choice of the segmentation network.
[9] Pointnext: Revisiting pointnet++ with improved training and scaling strategies. NIPS, 2022.
[10] Segment any point cloud sequences by distilling vision foundation models. NIPS, 2024.
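The part feature extraction in steps a) to c) can be sketched as below. This is a minimal illustration under stated assumptions, not the released implementation: the function name `part_features` is hypothetical, the point-wise features and part labels are assumed to come from some pre-trained segmentation network, and filling missing slots with zero vectors (when an object has fewer than $n$ distinct parts or no remaining points) is our own choice.

```python
import numpy as np

def part_features(point_feats, part_labels, n):
    """Pool point-wise features into n + 1 part features.

    point_feats: (N, D) per-point features from a segmentation network.
    part_labels: (N,) predicted part label of each point.
    Returns an (n + 1, D) array: the top-n most frequent parts, followed by
    one row that averages all remaining points.
    """
    point_feats = np.asarray(point_feats, dtype=float)
    part_labels = np.asarray(part_labels)
    labels, counts = np.unique(part_labels, return_counts=True)
    # Most frequent labels first; the stable sort keeps ties in label order.
    top = labels[np.argsort(-counts, kind="stable")][:n]
    out = np.zeros((n + 1, point_feats.shape[1]))  # zero rows if a slot is empty
    for i, label in enumerate(top):
        out[i] = point_feats[part_labels == label].mean(axis=0)
    rest = ~np.isin(part_labels, top)
    if rest.any():
        out[n] = point_feats[rest].mean(axis=0)
    return out

# Toy example: 6 points, 2-D features, 3 predicted parts (made-up values).
feats = np.array([[1, 1], [3, 3], [10, 10], [20, 20], [30, 30], [7, 7]])
labels = np.array([0, 0, 1, 1, 1, 2])
parts = part_features(feats, labels, n=2)
```

The output always has $n+1$ rows regardless of how many parts the segmentation network predicts, which is why the number of parts per object does not need to be known in advance.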
Undefined concepts (Weakness 4)
We apologize for the lack of sufficient visualization results. The "isomorphism loss" mentioned in line 189 is a typo; it refers to the loss function of the IAE module (Section 4.2.3). The integration function in line 189 is an averaging function over multiple features, following [2].

2024-08-11

The authors have generally responded to all comments, thank you! In this regard, I have changed the rating to 'borderline accept'.

2024-08-13

Review

Rating: 6 | Confidence: 4 | 2024-07-17

This paper proposes a framework for open-set 3D object retrieval, called the Hypergraph-Based Assembly Fuzzy Representation (HAFR) framework. This model leverages an Isomorphic Assembly Embedding (IAE) to integrate geometric and semantic consistency. Furthermore, a Structured Fuzzy Reconstruction (SFR) is used to overcome the distribution skew of unseen categories. On three point cloud datasets constructed by the authors, this model outperforms the state-of-the-art.

Strengths

The motivation for this work is well-established.
The idea of using hypergraph structures to achieve high-order correlations both within and between objects is novel.
Sufficient quantitative and qualitative comparisons verify the effectiveness of the proposed model.

Weaknesses

In structured fuzzy reconstruction, the value of k in the k-nearest neighbors seems to determine the global uncertainty hyperedge. However, the paper lacks an explanation or experiments to justify the choice of k.
While HGM2R [1] employs a multimodal approach, the IAE component appears to be similar to the Multi-Modal 3D Object Embedding in HGM2R. What are the differences and unique contributions of IAE compared to the embedding technique used in HGM2R?

In Table 2, although HGM2R also utilizes hypergraphs, it shows only slight improvements over previous methods in most metrics. For example, the mAP scores on three datasets are only about 0.1 higher. However, the method proposed in this paper demonstrates a significant improvement over HGM2R on the OP-COSEG dataset, with an increase of nearly 0.6. How can this result be explained?

[1] Hypergraph-Based Multi-Modal Representation for Open-Set 3D Object Retrieval. TPAMI 2023.

Questions

Please refer to paper weaknesses.

Limitations

The authors discussed the limitations in the conclusion section, but I did not find any discussion of societal impacts there.

Author Response

2024-08-07

Response for Reviewer 3mhr

We sincerely thank you for the valuable comments and advice, which provided important guidance for us to enhance the rigor and coherence of our paper and directed the focus of our future work.

About the ablation study on the $K$-value (Answer for Weakness 1): We further conducted ablation studies on the hyper-parameter $K$ to validate the influence of the uncertainty hyperedges in the leveraged hypergraph. As shown in Fig. R1 of the rebuttal PDF, as $K$ varies, the performance of the proposed method remains stable and outperforms the compared method over most of the range. However, performance stops improving and may even decrease slightly beyond a certain value, indicating that the extraction and utilization of high-order correlations have saturated. In addition, different datasets have different peak values of $K$. A larger peak $K$ indicates that the dataset requires more extensive and deeper capture of high-order correlations. Your suggestions have inspired our future work, and we will focus on balancing the performance and complexity of the structure-aware network to achieve optimal relationship modeling. We will also include this experiment and analysis in the paper.
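To make the role of $K$ concrete, the sketch below builds k-nearest-neighbor hyperedges from a set of embeddings. It is a generic construction, assumed here for illustration: each object spawns one hyperedge containing itself and its $K$ nearest neighbors under Euclidean distance. The function name is hypothetical, and the exact construction of the uncertainty hyperedges in the paper may differ.

```python
import numpy as np

def knn_hyperedges(feats, k):
    """Incidence matrix H (vertices x hyperedges) of a k-NN hypergraph.

    Hyperedge j connects vertex j with its k nearest neighbors, so every
    hyperedge contains k + 1 vertices (assumes k < number of vertices).
    """
    feats = np.asarray(feats, dtype=float)
    dists = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=2)
    np.fill_diagonal(dists, -1.0)  # guarantee the centre vertex is ranked first
    nearest = np.argsort(dists, axis=1, kind="stable")[:, :k + 1]
    H = np.zeros((len(feats), len(feats)))
    for edge, members in enumerate(nearest):
        H[members, edge] = 1.0
    return H

# Toy example: four 1-D embeddings (made-up values), K = 1.
points = np.array([[0.0], [1.0], [3.0], [10.0]])
H = knn_hyperedges(points, k=1)
```

With a larger $K$, each hyperedge spans more objects, so more distant embeddings are smoothed together; this is one plausible reading of why performance saturates or drops slightly once $K$ grows past a dataset-dependent value.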
About the unique contributions of IAE (Answer for Weakness 2): The IAE module is designed to obtain assembly embeddings from multiple part features. Compared to the multimodal embedding method in HGM2R, which generates averaged embeddings of different modalities from a global perspective, the IAE module aims at assembled embeddings with geometric-semantic consistency for all parts from a global-local collaborative perspective. Specifically:

a) The IAE module constructs a structure (the Isomorphism Hypergraph) to capture geometric correlations, such as the order and quantity among the input parts, and then utilizes them for structure-aware fusion. However, HGM2R only uses an averaged-guided fusion method with simple auto-encoders.

b) Guided by the isomorphism hypergraph, the IAE module designs the Hypergraph Isomorphism Convolution (HIConv) layer that combines geometric and semantic information to generate embeddings collaboratively. However, HGM2R uses a naive MLP for feature mapping and fusion, which loses the high-order information within the input.

Although HGM2R designs a satisfactory approach for multimodal embedding, its semantic-only averaged embedding paradigm is not suitable for part assembly, which requires the collaborative use of information from different domains. Building on HGM2R, the IAE module of our method introduces the Isomorphism Hypergraph and Hypergraph Isomorphism Convolution to achieve part assembly with geometric-semantic consistency. Experimental results in both the paper and the rebuttal PDF demonstrate the necessity and superiority of the IAE module.
About the improvement on OP-COSEG (Answer for the last Weakness): Although HGM2R uses a hypergraph structure, it only constructs an inter-object hypergraph at the feature smoothing stage (stage 2), without considering the intra-object correlations between different parts of an object, which directly determine the accuracy of the embeddings. Our method constructs hypergraphs that capture both intra-object and inter-object correlations, as mentioned in Answer 2 above. We provide examples from the three datasets in Fig. R4. Compared to the OP-SHNP and OP-INTRA datasets, the objects in the OP-COSEG dataset exhibit stronger symmetry, meaning that the isomorphism between parts is more significant. Ignoring geometric information during assembly embedding may lead to inconsistencies within the same category. These challenges explain why HGM2R achieves only slight improvements over non-hypergraph methods. In contrast, our method tackles the challenge of assembly isomorphism and unification, achieving part assembly with geometric-semantic consistency, which leads to significant performance improvements on the OP-COSEG dataset.
About societal impacts (Answer for Limitations): As shown in lines 158-159 and Appendix C Algorithm 1, the proposed HAFR framework is feature-driven and relies exclusively on basic features as input, rather than processing raw data end to end. This feature-driven representation approach preserves extensibility to other common multimedia data, such as text, audio, video, and their pieces. We believe this paper can provide a general theoretical foundation and methodological reference for the application of multimedia retrieval in practical real-world scenarios. We will release the datasets, code, configs, and pre-trained models immediately after the anonymous review period of NeurIPS 2024. We also look forward to engaging and collaborating with more researchers on both theoretical and applied studies of semi-open learning across different fields. Additionally, we are willing to share our experiences on this platform (OpenReview) or on other open-source platforms.

Thank you again for your valuable suggestions, especially your professional advice on future work in assembly-based open-set learning.

2024-08-14

Thanks for the reply. I keep my initial rating.

Author Response

2024-08-07

We thank all reviewers for your insightful feedback and for your valuable time and effort. We try to answer all the questions and weaknesses of each reviewer in the rebuttal section below. The attached PDF contains our additional experimental results and figures.

Comment - Summary of the Discussion

2024-08-14

Dear Chairs and Reviewers,

We hope this message finds you well.
With the discussion period closing, we present a brief summary of our discussion with the reviewers as an overview for reference. First of all, we thank all the reviewers for their insightful comments and suggestions. We are encouraged that the reviewers found the following strengths in our paper:

Reviewer 3mhr: the motivation is well-established, the idea is novel, sufficient quantitative and qualitative comparisons verify the effectiveness
Reviewer Kfog: the idea sounds interesting and reasonable, the description is formal and rather clear, extensive ablation studies
Reviewer ZtnM: the problem is well-motivated and the solution seems working well, the results are good
Reviewer zjgH: a unique and promising direction in the field, the HAFR framework is well-thought-out, with clearly defined modules, provides a valuable resource for the research community, methodology is clearly described, and the algorithms are well-structured, well-written and easy to follow
Reviewer 7vQy: facilitate future research, proposes several novel modules, important in shape understanding

We have carefully read all the comments and responded to them in detail. All of those will be addressed in the final version.

We summarize the main concerns of the reviewers with the corresponding response as follows:

About more ablation studies: We further conduct more ablation studies on the k-value, memory bank, layers of HIConv, and part number. Experimental results indicate that the selected values and structures in our implementation yield the best result within the current framework. However, these comments have also inspired our future work to explore the setting of hyper-parameters and their theoretical foundations, aiming to balance the performance and complexity of this assembly-based framework.
About input and part feature extraction: We provide a more detailed description of the framework input and part feature extraction. Instead of extracting features from the segmented parts of the point cloud, we use a segmentation network to obtain point-wise features for each point and then average these point-wise features to obtain part features. Therefore, our framework needs neither dense point clouds nor prior knowledge of how many parts each object should be segmented into. We conducted more comparisons on the segmentation method. Experimental results show that different point cloud segmentation methods did not significantly affect retrieval performance, further demonstrating the generalization ability of the method. We believe the method remains applicable in real-world scenarios and is not influenced by the choice of the segmentation network.
About the computational requirements: We further conducted experiments and comparisons on the computational requirements. Since the proposed HAFR is a feature-driven method, it requires significantly less training time (less than 100 s) and inference time (less than 20 ms) than existing methods. We believe that the proposed framework has the potential to be a foundation framework for open-set learning in real-world scenarios.

Based on the discussion with the reviewers, we also present a brief summary of our paper as follows:

Observation: The lack of object-level labels presents a significant challenge for 3D object retrieval in the open-set environment. However, part-level shapes of objects often share commonalities across categories but remain underexploited in existing retrieval methods.
Solution: We explore a method to navigate the intricacies of open-set 3D object retrieval (3DOR) through a bottom-up lens of Part Assembly, and we propose the HAFR framework for assembly-based open-set 3DOR.
Results: We construct three new datasets for this task, and experimental results demonstrate that our method outperforms state-of-the-art retrieval methods.
Highlights: Building on the 3D object retrieval task, our work has the following highlights:
- Assembly-based Retrieval: a new paradigm for open-set representation and learning through a bottom-up lens of Part Assembly.
- HAFR: an early exploration of a framework for fine-grained open-set learning.
- Hypergraph Isomorphism Convolution and Leveraged Hypergraph: a flexible high-order structure and the corresponding structure-aware convolution approach for assembly-based representation.
Social Impacts: The proposed HAFR framework is feature-driven and relies exclusively on basic features as input. This feature-driven representation approach preserves extensibility to other common multimedia data, such as text, audio, video, and their pieces/parts. We believe this paper can serve as a foundation framework for the application of multimedia retrieval in practical environments.

Thanks again for your efforts in reviewing and discussion. We appreciate all the valuable feedback that helped us to improve our submission.

Sincerely
Authors of Submission 991

Final Decision: Accept (poster)

2024-09-25

Congratulations! After reading the rebuttal, the reviewers' concerns were alleviated. Note that the reviewer "7vQy" changed the rating from reject to borderline accept during the discussion (the rating has not been updated in the form). Thus, given support for acceptance from all the reviewers, the AC recommends acceptance. The AC strongly encourages the authors to include the ablation study, clarifications and the rest of the rebuttal/discussion material in the final version of the paper.