Semi-Open 3D Object Retrieval via Hierarchical Equilibrium on Hypergraph
Abstract
Reviews and Discussion
This paper extends the open-set 3D object retrieval problem to a semi-open situation where hierarchical semantic labels are considered. The authors leverage the multi-level category information with a proposed hypergraph-based Hierarchical Equilibrium Representation (HERT) framework. This framework consists of a Hierarchical Retrace Embedding module for encapsulating hierarchical semantic information and a Structured Equilibrium Tuning module for learning generalizable features according to the constructed superposed hypergraph from both local coherent and global entangled correlations.
Strengths
The proposed semi-open 3D object retrieval setting is reasonable and more practical for real-world applications. The idea of introducing a hierarchical semantic graph suits this task, and the proposed framework is technically sound. The experimental results demonstrate the effectiveness of the proposed framework, and the ablation study comprehensively shows the role of the coarse labels.
Weaknesses
1. The mathematical notation in this paper is chaotic. Some examples:
1.1. The notations of basic features in Line 152 () and Line 155 () are inconsistent. Does the aggregation function in Line 156 have the same meaning as in Line 155?
1.2. The description in Lines 157-163 does not match the illustrated pipeline in Figure 2, which shows that takes as a side input (with a concatenate operation? This is also confusing. I think is concatenated with before inputting into ). How exactly can the Retrace Encoding be obtained? Where are the reconstruction features mentioned in Line 163 used? Does the actually the in Figure 2? However, has been utilized to represent the basic feature in Line 155. I suggest adding more essential symbols and legends in Figure 2 to align with the descriptions. Similar issues occur in Algorithm 1 of Appendix C.
1.3. Besides, it is not recommended to over-define ''spaces'' that are not used in the rest of the paper, such as ''retrace space '' (Line 159 retracte -> retrace? Line 160 -> ?) and ''mixed space ''. These may introduce typos and do not help in understanding the paper.
Questions
- In Section 3 of the problem setup, the traditional 3D Object Retrieval setting contains no multi-level labels, right? Bringing this concept with the proposed multi-level labels without explanation is unsuitable, which may lead to confusion.
- The authors construct four datasets for evaluating their framework in their proposed task. However, as listed in Appendix B, each dataset is equipped with only 3 coarse categories according to the shape of the objects, which increases my concerns about the efficiency of the proposed hierarchical semantic graph in more complex situations. Are there any possible coarse categories to be included in the real scenes?
- As the proposed framework is trained in a two-stage scheme, it would be better to include a comparison of model complexity and training/inference times.
- Will the proposed datasets be released?
Limitations
The authors have mentioned the potential limitations.
Response for Reviewer vaz9
We sincerely thank you for the valuable comments and advice, which provided important guidance for the presentation of this paper and clarified the direction for future work.
1. About the writing and symbols (Answer for Weakness 1):
We apologize for these typos. We will conduct a thorough review and revision of the entire paper to ensure the clarity and rigor of the writing.
1.1 (For Weakness 1.1) The typos in line 152 and in line 155 should be correctly written as and , which denote the basic features of -th modality and aggregation function, respectively.
1.2 (For Weakness 1.2) We have restructured the data flow in the HRE module for each object:
where is an encoding of the coarse label, which is a learnable vector of the same dimension as . Specifically, the retrace encoding is implemented with nn.Parameter for each coarse category, and objects with the same coarse label share the same retrace encoding. During computation, it is element-wise added to and then input to . is used solely for loss calculation in the HRE module, using coarse labels for supervision to guide the accuracy of the retrace embedding representation. It does not participate in the computations of the SET module.
Based on your suggestions, we have revised the pipeline diagram and added more symbols, as shown in Fig. R2 of the rebuttal PDF.
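As a rough illustration of the retrace-encoding step described in 1.2, the following minimal sketch keeps one vector per coarse category and adds it element-wise to an object's unified embedding before the result would be passed on to the retrace encoder. The rebuttal mentions nn.Parameter; to stay dependency-free this sketch uses plain Python lists instead, so nothing here is trained. The class name RetraceEncodingTable, the dimension, and the initial values are all hypothetical.

```python
import random


class RetraceEncodingTable:
    """One vector per coarse category (a stand-in for nn.Parameter; hypothetical name)."""

    def __init__(self, coarse_labels, dim, seed=0):
        rng = random.Random(seed)
        # Objects with the same coarse label share the same encoding vector.
        self.table = {
            label: [rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for label in coarse_labels
        }
        self.dim = dim

    def add_to(self, unified_embedding, coarse_label):
        """Element-wise add the coarse-label encoding to a unified embedding."""
        if len(unified_embedding) != self.dim:
            raise ValueError("encoding and embedding must have the same dimension")
        encoding = self.table[coarse_label]
        return [u + e for u, e in zip(unified_embedding, encoding)]


coarse_labels = ["Rectangular-Cubic Prism", "Solids of Revolution", "Miscellaneous"]
encodings = RetraceEncodingTable(coarse_labels, dim=4)

# Two objects with different unified embeddings but the same coarse label
# receive the same additive encoding.
chair = encodings.add_to([1.0, 0.0, 0.0, 0.0], "Rectangular-Cubic Prism")
sofa = encodings.add_to([0.0, 1.0, 0.0, 0.0], "Rectangular-Cubic Prism")
```

In a real implementation the per-category vectors would be learnable and updated by the coarse-label supervision loss; here they are fixed random values.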
1.3 (For Weakness 1.3) We removed the presentation of space in lines 157-163 to enhance the readability of the paper. Specifically, we revise this paragraph as follows:
...then compresses the unified embedding aligned with into retrace embedding and does the reverse reconstruction for supervision...
2. About the traditional 3D object retrieval method (Answer for Question 1):
Traditional 3D object retrieval methods, whether closed-set [1-3] or open-set [4][5], consider only single-layer labels of objects. Besides, traditional open-set methods strictly assume no overlap between the label spaces of the training and testing sets [5][6]. However, in practical real-world scenarios, objects are typically described by multiple hierarchical labels, and the training set and testing set often share a partial space of coarse labels. As shown in Tab. R1 of the rebuttal PDF, we expand the number of label levels in the semi-open learning task, where testing categories are unseen at one level but seen at other levels. The label spaces are disjoint at only one level and have some overlap at other levels.
[1] Gao Y, et al. 3-D object retrieval and recognition with hypergraph analysis[J]. IEEE TIP, 2012.
[2] He X, et al. Triplet-center loss for multi-view 3d object retrieval[C]. IEEE CVPR, 2018.
[3] Collins J, et al. Abo: Dataset and benchmarks for real-world 3d object understanding[C]. IEEE CVPR, 2022.
[4] Zhou Z. Open-environment machine learning[J]. National Science Review, 2022.
[5] Feng Y, et al. Hypergraph-based multi-modal representation for open-set 3d object retrieval[J]. IEEE TPAMI, 2023.
[6] Parmar J, et al. Open-world machine learning: applications, challenges, and opportunities[J]. ACM Computing Surveys, 2023.
3. About the efficiency of graphs (Answer for Question 2):
This paper is an early exploration of semi-open learning. Therefore, we selected three typical categories for the geometry-based coarse labels, which are significantly different from the semantic-based fine categories. These three coarse categories and two levels of hierarchical labels are representative of a semi-open environment, helping us focus on exploring the new semi-open learning task and designing a novel collaborative learning paradigm based on hierarchical correlations. Specifically, when the number of coarse categories increases, a natural implementation is to construct a hypergraph structure with more hyperedges to capture more complex correlations. This involves addressing challenges such as complex network representation associated with multiple labels while balancing complexity, efficiency, and performance. To tackle these challenges, we have preliminarily experimented with a hypergraph-based isomorphism computation method for structure compression and complexity reduction, inspired by [7]. Besides, we have also developed a hypergraph-based dynamic system approach to manage the dynamic increase of categories and label levels, inspired by [8].
[7] Feng Y, et al. Hypergraph isomorphism computation[J]. IEEE TPAMI, 2024.
[8] Yan J, et al. Hypergraph dynamic system[C]. ICLR, 2024.
4. About the computational requirements comparison (Answer for Question 3):
Our experiments are conducted on a computing server with one Tesla V100-32G GPU and one Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz. We provide a detailed comparison of model parameters, training time, and inference time for the two stages in Tab. R2 of the rebuttal PDF.
5. About the open access to datasets (Answer for Question 4):
Thanks for your interest in our work. We are well prepared and will release the datasets, code, configurations, and pre-trained models immediately after the anonymous review period of NeurIPS 24. We also look forward to engaging and collaborating with more researchers on both theoretical and applied studies of semi-open learning across different fields. Additionally, we are willing to share our experiences on this (OpenReview) or other open-source platforms.
Thank you again for your valuable suggestions, especially your professional advice on future work in semi-open learning.
I'm glad to see the authors' efforts in their rebuttal. Considering most of my concerns have been addressed, I'd like to increase my score to 8. Good luck.
We sincerely appreciate your positive feedback and professional comments on our work. Your valuable suggestions have been crucial in improving the quality of our paper. We will carefully revise the manuscript according to your review comments and ensure the rigor of the experimental results and references.
The paper introduces a novel framework called the Hypergraph-Based Hierarchical Equilibrium Representation (HERT) for semi-open 3D object retrieval. The proposed framework addresses the practical scenario of semi-open environments where the training and testing sets share a partial label space for coarse categories but are completely disjoint for fine categories. The HERT framework comprises two main modules: Hierarchical Retrace Embedding (HRE) and Structured Equilibrium Tuning (SET). The authors also generate four semi-open datasets to benchmark their approach and demonstrate its effectiveness through extensive experiments.
Strengths
- Novel Framework: The introduction of the HERT framework for semi-open 3D object retrieval is a novel contribution that fills a gap in the current literature.
- Hierarchical Approach: The use of hierarchical labels to better capture the multi-level semantics of 3D objects is innovative and aligns well with real-world scenarios.
- Comprehensive Experiments: The authors conducted extensive experiments on four newly generated datasets, providing strong empirical evidence of the effectiveness of their approach.
- Clear Problem Definition: The paper clearly defines the semi-open environment and distinguishes it from traditional open-set and closed-set scenarios.
Weaknesses
- Complexity: The proposed framework is quite complex, which might make it difficult for practitioners to implement and extend.
- Lack of Baseline Comparisons: While the paper compares HERT against state-of-the-art methods, more diverse baseline comparisons, including simpler methods, could provide a clearer picture of the improvements.
- Citation: Please add citations for the methods you compared against in the tables (Tab. 1, 2, 3).
Questions
- How does the HERT framework perform in scenarios with more than three levels of hierarchical labels?
- What are the computational requirements for training and deploying the HERT framework, especially in terms of time and resources?
- Can the proposed method be extended to other domains beyond 3D object retrieval, such as text or image retrieval?
Limitations
The authors have provided discussions about the limitations.
Response for Reviewer oLzP
We sincerely thank you for the valuable comments and advice, which provided important guidance for the presentation of this paper and clarified the direction for future work.
1. About the framework (Answer for Weakness 1):
Based on your suggestions, we have restructured the presentation of the proposed framework. Specifically, the proposed framework consists of two sequentially connected modules: HRE and SET.
a) The HRE module takes basic features of different modalities as input. This module employs two sets of auto-encoders sequentially to achieve modality fusion and category space retrace, and generates unified embeddings and retrace embeddings for each object.
b) The SET module takes the two types of embeddings from the previous module as input, applying structure-aware feature smoothing and distillation through hypergraph convolution and memory bank reconstruction, respectively. Finally, this module generates the final features used to match similar objects based on feature distance, thereby enabling retrieval.
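The last step described above, matching similar objects by feature distance, can be sketched as follows. The rebuttal only says "feature distance"; the Euclidean metric, the function name retrieve, and the toy feature values are assumptions made for illustration.

```python
import math


def retrieve(query_feature, target_features, top_k=3):
    """Rank targets by Euclidean distance to the query (smaller means more similar)."""
    ranked = sorted(
        target_features.items(),
        key=lambda item: math.dist(query_feature, item[1]),
    )
    return [name for name, _ in ranked[:top_k]]


# Toy final features (made-up values, only for illustration).
targets = {
    "bench": [0.9, 0.1],
    "vase": [0.1, 0.9],
    "sofa": [0.8, 0.3],
}
print(retrieve([1.0, 0.0], targets, top_k=2))  # ['bench', 'sofa']
```

Any other distance (for example cosine distance) could be substituted in the sort key without changing the rest of the sketch.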
2. About more comparisons (Answer for Weakness 2):
Inspired by your suggestions, we added three simpler methods as additional baselines to make our experiments more comprehensive. We provide the additional experimental results in Tab. R3 of the rebuttal PDF. The low performance of simpler methods such as MLP and GCN demonstrates the complexity of the semi-open environment and the necessity of research in semi-open learning. The significant improvement achieved by our method also demonstrates its effectiveness. We will include these results and analyses in the revised version of the paper.
3. About citation and writing (Answer for Weakness 3):
We have added citations for the compared methods as shown in Tab. R3 of the rebuttal PDF, and we have made similar revisions for the tables in the paper.
4. About the level of labels (Answer for Question 1):
This paper is an early exploration of semi-open learning. Therefore, we selected the two typical levels of labels based on different criteria: geometry-based coarse shape category and semantic-based fine category, which is representative of a semi-open environment. This two-layer framework is also a typical implementation of the collaborative learning paradigm based on hierarchical correlations. When the levels of hierarchical labels increase, a natural implementation is to construct more Retrace Auto-Encoders, and each auto-encoder is designed to retrace one level of the category. This involves addressing challenges such as domain adaptation associated with multiple levels while balancing complexity, efficiency, and performance. Specifically, we have preliminarily experimented with a hypergraph-based isomorphism computation method to address the increase in parameters brought by higher levels, inspired by [1]. Additionally, we have developed a hypergraph-based dynamic system approach to manage the increasing number of labels at each layer inspired by [2].
[1] Feng Y, et al. Hypergraph isomorphism computation[J]. IEEE TPAMI, 2023.
[2] Yan J, et al. Hypergraph dynamic system[C]. ICLR, 2024.
5. About the computational requirements (Answer for Question 2):
Our experiments are conducted on a computing server with one Tesla V100-32G GPU and one Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz. We provide a detailed comparison of model parameters, training time, and inference time for the two stages in Tab. R2 of the rebuttal PDF.
6. About the data extension (Answer for Question 3):
As shown in #line 152-153 and Appendix C Algorithm 1, the proposed HERT framework is feature-driven and relies exclusively on basic features as input, rather than processing raw data end to end. This feature-driven representation approach preserves extensibility to other common multimedia data such as text, audio, video, and 3D data. We believe this paper can provide a general theoretical foundation and methodological reference for the application of multimedia retrieval in practical real-world scenarios. We will release the datasets, code, configs, and pre-trained models immediately after the anonymous review period of NeurIPS 24. We also look forward to engaging and collaborating with more researchers on both theoretical and applied studies of semi-open learning across different fields. Additionally, we are willing to share our experiences on this (OpenReview) or other open-source platforms.
Thank you again for your valuable suggestions, especially your professional advice on future work in semi-open learning.
Thanks to the authors for answering my questions and for their efforts. Most of my concerns are addressed well. I raise my score to 6. Good luck :)
We sincerely appreciate your positive feedback and professional comments on our work. Your valuable suggestions have been crucial in improving the quality of our paper. We will carefully revise the manuscript according to your review comments and ensure the rigor of the experimental results and references.
This paper introduces a more practical Semi-Open Environment setting for open-set 3D object retrieval with hierarchical labels, in which the training and testing sets share a partial label space for coarse categories but are completely disjoint for fine categories. A novel framework, HERT, is proposed for this task. The HRE module is designed to overcome the global disequilibrium of unseen categories. Besides, the SET module is designed to utilize more equilibrial correlations among objects and generalize to unseen categories. Furthermore, four semi-open 3DOR datasets are generated with multi-level labels for benchmarking. The proposed method achieves good performance.
Strengths
This paper is easy to read. This paper targets an interesting problem, open-set 3D object retrieval. The proposed method is also interesting.
Weaknesses
The proposed method is a little simple, making the technical contribution unclear. Maybe the writing should be improved to highlight the technical insights.
This paper introduces a new task. It would be better to discuss the difference between this new task and existing ones in the technical aspect. It also helps if some promising research directions for this new task could be provided.
Questions
Please clarify the technical insights.
Limitations
The writing can be improved.
Response for Reviewer YDem
We sincerely thank you for the valuable comments and advice, which provided important guidance for the presentation of this paper and clarified the direction for future work.
1. About the technical contribution (Answer for Weakness 1 and the Questions):
a) A new general task for practical real-world machine learning. We demonstrated the limitations of naive open-set learning tasks and methods through experiments and proposed a more practical semi-open learning task. Additionally, we constructed four semi-open datasets for benchmarking.
b) An early paradigm and framework for semi-open learning. Specifically, we proposed the Hypergraph-Based Hierarchical Equilibrium Representation (HERT) framework, including the Hierarchical Retrace Embedding (HRE) and the Structured Equilibrium Tuning (SET) modules, which are designed to overcome the distribution disequilibrium and confusion of unseen categories in semi-open 3D object retrieval.
c) A flexible high-order structure for semi-open learning. We propose a superposed hypergraph structure to capture high-order correlations among objects, under the guidance of local coherent correlations and global entangled correlations from hierarchical category information.
d) Extensive experiments and analysis. Experimental results on the four datasets demonstrate that our method can outperform state-of-the-art retrieval methods in the semi-open environment.
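To make item c) more concrete, here is a minimal sketch of what "superposing" two groups of hyperedges into one hypergraph can look like at the data-structure level. This thread does not describe how the paper actually constructs its hyperedges, so everything below is an assumption: the local hyperedges are simply given, the global hyperedges group objects by coarse label, and the function names incidence_matrix and group_by_label are hypothetical.

```python
def incidence_matrix(num_vertices, hyperedges):
    """Return H with H[v][e] = 1 if vertex v belongs to hyperedge e, else 0."""
    H = [[0] * len(hyperedges) for _ in range(num_vertices)]
    for e, members in enumerate(hyperedges):
        for v in members:
            H[v][e] = 1
    return H


def group_by_label(labels):
    """One hyperedge per distinct label, connecting all objects that carry it."""
    groups = {}
    for v, label in enumerate(labels):
        groups.setdefault(label, []).append(v)
    return list(groups.values())


# Hypothetical toy data: four objects with coarse labels.
coarse = ["prism", "prism", "revolution", "revolution"]
local_edges = [[0, 1], [1, 2], [2, 3]]   # assumed local (neighbourhood) groups
global_edges = group_by_label(coarse)    # assumed global (coarse-label) groups

# Superposition: keep both hyperedge sets side by side in a single hypergraph.
H = incidence_matrix(len(coarse), local_edges + global_edges)
```

Unlike an ordinary graph edge, each column of H may connect any number of objects, which is what lets a single hyperedge represent a whole coarse category.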
2. About the task comparison (Answer for Weakness 2):
We discuss the difference between this new task (semi-open learning) and existing ones (open-set learning) in Tab. R1 of the rebuttal PDF. Existing open-set learning methods consider only the single-layer labels of objects and strictly assume no overlap between the training and testing sets [1][2]. However, in practical real-world scenarios, objects are typically described by multiple hierarchical labels, and the training set and testing set often share a partial space of coarse labels. We expand the number of label levels in the semi-open learning task, where testing categories are unseen at one level but seen at other levels. The label spaces are disjoint at only one level and have some overlap at other levels.
[1] Zhou Z. Open-environment machine learning[J]. National Science Review, 2022.
[2] Parmar J, et al. Open-world machine learning: applications, challenges, and opportunities[J]. ACM Computing Surveys, 2023.
Thank you again for your valuable suggestions, especially your professional advice on presentation and future work in semi-open learning.
This paper introduces a Semi-Open Environment setting for open-set 3D object retrieval, addressing the limitation of existing methods that only consider single-layer labels and assume no overlap between training and testing sets. The authors propose the Hypergraph-Based Hierarchical Equilibrium Representation (HERT) framework, which includes the Hierarchical Retrace Embedding (HRE) module to balance representations across multi-level categories and the Structured Equilibrium Tuning (SET) module to handle feature overlap and class confusion through high-order correlations in a superposed hypergraph. They also create four semi-open 3D object retrieval datasets with hierarchical labels to benchmark their approach. Experimental results show that their method effectively generates and generalizes hierarchical embeddings of 3D objects, outperforming current state-of-the-art retrieval methods in semi-open environments.
Strengths
I generally appreciate the study angle around the fine-grained structure across different 3D object categories, which could bring more insights for related 3D research. Overall, the proposed architecture is composed of reasonable components along with reasonable loss functions.
Motivated by potentially contradictory optimization in open-set learning, the authors propose the semi-open 3D object retrieval task. To examine performance, they also design 4 datasets based on existing 3D object datasets. The proposed framework outperforms all the baselines.
Weaknesses
I have several major questions and concerns:
- What exactly do the coarse labels in Figure 1 mean? I fail to see the relationship among solids of revolution, rectangular-cubic prisms, and helicopters. I doubt the meaningfulness of the proposed multi-level labels for 3D objects: they are too coarse to serve as an intermediate level. Can the authors show qualitative visualizations of several 3D objects that have different fine-grained labels but share the same coarse label? I'd like to check results randomly sampled from SO-ESB, SO-NTU, SO-MN40, and SO-ABO, respectively. In general, for each dataset, 5 randomly sampled sets of objects with different fine-grained labels but the same coarse label would be very helpful.
- The proposed HERT framework is a bit ad hoc: if multiple levels (more than 2) are shared across a large number of 3D objects, more levels of auto-encoders will be needed under the current design logic.
- As mentioned in lines 448-449, some of the 3D objects are thrown away due to the improper design of the coarse labels. I fail to see how the use of basic geometric shapes fits current 3D vision research. For example, Objaverse(-XL) contains many composite objects consisting of multiple different basic geometric shapes. Those "complex" objects may be exactly the ones we are interested in retrieving.
Here are some minor points:
- How do you determine the coarse label if one object actually contains multiple separate simple shapes?
- 3DOR is first mentioned in #line15 without its full form. This is unfriendly to readers who are not familiar with this direction.
- In Figure 6 (b), some fine-grained classes overlap. Could you please show some of these objects? For example, the brown group on the left is mixed with yellow and light-purple dots, and the pink group at the top contains a green dot.
Questions
Please address the concerns raised above.
To be specific, I'd like to see a total of 25 randomly sampled hierarchy examples from the proposed SO-ESB, SO-NTU, SO-MN40, and SO-ABO.
Limitations
The current version discusses limitations in #line 311~314.
Response for Reviewer nF63
We sincerely thank you for the valuable comments and advice, which provided important guidance for the presentation of this paper and clarified the direction for future work.
1. About coarse labels and datasets (Answer for Weakness Major 1, 3, Minor Q1, and the Questions):
The coarse labels in Figure 1 denote the basic shape of the object as a whole. As shown in Fig. R1 of the rebuttal PDF, we provide examples from the four datasets, each with three coarse classes and five fine classes. We annotate the coarse labels according to the geometry-based shape of each object as a whole (#line 242-243, #line 447-449), while ignoring the part assembly relationships within an object in this paper. This paper is an early exploration of semi-open learning. Therefore, we selected two typical levels of labels based on different criteria: geometry-based coarse shape category and semantic-based fine category, which are representative of a semi-open environment. As shown on the right side of Fig. R1, the objects that were removed during dataset construction are those that consist of multiple separated parts and cannot be considered as a whole. We believe that using these corner-case samples would entangle other issues such as graphics and foundation models, and would not reflect the key problem of contradictory hierarchical labels in semi-open learning. Therefore, we decided to exclude them from this early exploration and will address these more complex cases with multiple levels of labels in future work. Additionally, we provide the splits of some datasets in Answers 5-7.
2. About multiple levels (Answer for Weakness Major Q2):
As an early exploration of semi-open learning, we believe this paper should focus on exploring the new semi-open learning task and designing a novel collaborative learning paradigm based on hierarchical correlations. Based on this paradigm, the HERT framework is implemented with two layers temporarily in this paper, aiming to use two of the most representative layers to validate the necessity and performance of machine learning research in typical semi-open environments. However, one of our future work directions is to extend the HERT framework to encompass more intertwined factors in complex semi-open environments. This involves addressing challenges such as domain adaptation associated with multiple levels while balancing complexity, efficiency, and performance. Specifically, we have preliminarily experimented with a hypergraph-based isomorphism computation method to address the increase in parameters brought by higher levels, inspired by [1]. Additionally, we have developed a hypergraph-based dynamic system approach to manage the increasing number of labels at each layer inspired by [2].
[1] Feng Y, et al. Hypergraph isomorphism computation[J]. IEEE TPAMI, 2023.
[2] Yan J, et al. Hypergraph dynamic system[C]. ICLR, 2024.
3. About failure cases (Answer for Weakness Minor Q3):
We provide the visualization of failure cases in Fig. R3 of the rebuttal PDF. In these failure cases, the query objects (bench, TV stand, bookshelf) and the wrongly matched target objects (mantel, laptop) share a certain similarity in their shapes and belong to the same coarse category (Rectangular-Cubic Prism). Although the significant performance improvement of the HERT framework demonstrates the necessity of research in semi-open learning and the effectiveness of our method, these corner cases also indicate the necessity of utilizing finer-level information such as the part-assembly of objects. This issue is the same as the multiple-levels issue mentioned in Weakness Major Q2. However, this paper focuses more on the fundamental differences brought by semi-open hierarchical labels. Therefore, we only consider two layers of labels in this study. As mentioned in Answer 2 above, we are currently conducting research to address these more complex environments. Thank you for your keen observations and academic insights.
4. About the writing (For Weakness Minor Q2):
Thanks for your thorough review and suggestions. We will conduct a comprehensive review of the entire paper, especially focusing on the use of abbreviations.
5. Splits of SO-NTU:
5.1 Train
Rectangular-Cubic Prism: headstone, table square
Solids of Revolution: ball, ballon, cannon, watch
Miscellaneous: book, plant with pot, cold weapon stick, frame, gun pistol, plane delta wing, plant leaf
5.2 Test
Rectangular-Cubic Prism: Bed, truck, chair common, sofa, man, chair, computer, container, tank
Solids of Revolution: fish, bottle, knife, cup, helmet, hydrant, insect fly, insect polypod, missle, orchestral, pen, plane backswept wing, plane forwardswept wing, tree, weed, ring, table round, hammer, screwdriver, wheel, zeppelin
Miscellaneous: dinosaur, dog, duck, tetrapods, bird, car common, chair swivel, chess, chip, clock, cold weapon long, sword, cycle bike, cycle moto, door, gun musket, gun submachine, human stand, floorlamp, table lamp, giant, helicopter, straight wing, flower, galleon, ship modern
6. Splits of SO-MN40:
6.1 Train
Rectangular-Cubic Prism: table, night stand, sink, monitor
Solids of Revolution: Glass box, flower pot
Miscellaneous: Keyboard, airplane
6.2 Test
Rectangular-Cubic Prism: Mantel, tv stand, desk, sofa, bed, bookshelf, chair, bathtub, wardrobe, dresser, radio, piano, bench, xbox, range hood
Solids of Revolution: Bottle, bowl, cup, stool, vase, cone, tent
Miscellaneous: Toilet, curtain, car, guitar, stairs, door, person, laptop, plant, lamp
7. Splits of SO-ABO:
7.1 Train
Rectangular-Cubic Prism: table
Solids of Revolution: tent
Miscellaneous: Mirror, Plant or flower pot
7.2 Test
Rectangular-Cubic Prism: chair, cart, shelf, cabinet, dresser, bed, bench, ladder, sofa
Solids of Revolution: exercise weight, container or basket, vase, ottoman, pillow
Miscellaneous: picture frame or painting, lamp, fan
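The semi-open property described in this rebuttal (fine categories disjoint between training and testing, coarse categories at least partly shared) can be checked mechanically on a split. The sketch below does so for the SO-ABO split listed above; the label names are lowercased copies of that list, while the helper names fine_labels and is_semi_open are hypothetical.

```python
# SO-ABO split as listed above: coarse label -> set of fine labels.
train = {
    "Rectangular-Cubic Prism": {"table"},
    "Solids of Revolution": {"tent"},
    "Miscellaneous": {"mirror", "plant or flower pot"},
}
test = {
    "Rectangular-Cubic Prism": {
        "chair", "cart", "shelf", "cabinet", "dresser",
        "bed", "bench", "ladder", "sofa",
    },
    "Solids of Revolution": {
        "exercise weight", "container or basket", "vase", "ottoman", "pillow",
    },
    "Miscellaneous": {"picture frame or painting", "lamp", "fan"},
}


def fine_labels(split):
    """Collect every fine label that appears in a split."""
    return set().union(*split.values())


def is_semi_open(train_split, test_split):
    """Fine labels must be disjoint; at least one coarse label must be shared."""
    fine_disjoint = fine_labels(train_split).isdisjoint(fine_labels(test_split))
    coarse_shared = bool(set(train_split) & set(test_split))
    return fine_disjoint and coarse_shared


print(is_semi_open(train, test))  # True
```

A fully open-set split would fail the coarse_shared condition, and a closed-set split would fail fine_disjoint.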
Thank you again for your valuable suggestions, especially your professional advice on future work in semi-open learning.
Dear Reviewer,
We would greatly appreciate any updates or feedback you might have regarding our responses to your initial comments. Your insights are valuable to us as we work to improve our paper.
If you need any additional information or clarification from our side, please don't hesitate to let us know.
Thank you for your time and consideration.
I greatly appreciate the efforts made by the authors. The additional qualitative results and failure cases will help readers better understand the manuscript.
I have carefully reviewed all of your responses and the rebuttal. Since the authors emphasized the contribution of the new semi-open learning task, I would respectfully ask the AC to evaluate this contribution, while my comments will primarily focus on the technical contributions and any factual errors:
1. I still believe the proposed coarse-label partition is too coarse to be practical.
  - For example, in SO-ABO, an ottoman is categorized under Solids of Revolution; however, many ottomans are cube-shaped.
  - Plants with pots are currently categorized under Miscellaneous, but many pots have a Solids of Revolution-like shape.
  - Several basic object categories, such as airplanes, chairs, and toilets, are placed under Miscellaneous. As the authors did not respond to the Objaverse cases, I guess that many objects in Objaverse would be categorized as Miscellaneous, which makes no sense.
2. The proposed HERT framework appears to be ad hoc in its current design of two-level coarse labels. The authors mentioned that they (1) experimented with a hypergraph-based isomorphism computation method to address the increase in parameters associated with higher levels and (2) developed a hypergraph-based dynamic system approach to manage the increasing number of labels at each layer; however, these details are not included in the rebuttal files.
Thanks again for your valuable suggestions. We will respond to your comments separately in the following two text boxes due to space limitations.
- About ottoman in SO-ABO and plants with pots in SO-NTU
-
For the ottoman: Since the shapes of the ottomans are typically composed of curved surfaces and are closer to ellipsoids, we removed the few cube-shaped samples from the ottoman category in the ABO dataset when constructing the SO-ABO dataset, and kept only the ellipsoid-like samples for the experiment.
-
For the plants with pots: The samples in this category are not pots but rather a variety of plants with diverse and unusual shapes (such as Lavender, Snake Plant, Jasmine, Spider Plant, Aloe Vera, etc.), along with their pots, which also have different shapes. Therefore, we classified these objects under Miscellaneous in this paper.
We will provide more examples and explanations, especially for these categories, in the revised version.
-
About the Objaverse and Miscellaneous Category
Thank you for the reminder, and we apologize for omitting the necessary emphasis and analysis for Objaverse in our first round of response. Objaverse is an outstanding work, and we believe that it is one of the most important and practical datasets in the 3D vision field in recent decades. This dataset provides a richer quantity of objects, finer categories, and diverse domains, with a greater diversity of 3D shapes. It is an essential resource for further exploring more complex and practical semi-open learning. We will provide a detailed analysis comparing our datasets with Objaverse, a discussion of Objaverse's irreplaceable role in semi-open learning, and necessary references [1] in the revised version of this paper.
[1] Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, Eli VanderBilt, Aniruddha Kembhavi, Carl Vondrick, Georgia Gkioxari, Kiana Ehsani, Ludwig Schmidt, Ali Farhadi. Objaverse-XL: A universe of 10M+ 3D objects. Annual Conference on Neural Information Processing Systems (NeurIPS), 2023.
However, our work represents an early exploration of the semi-open environment. To explore this objectively existing environment, we have made a preliminary construction of four datasets (SO-ESB, SO-NTU, SO-MN40, and SO-ABO). Within the context of these datasets, we used coarse labels such as Rectangular-Cubic Prism and Solids of Revolution to describe two categories of objects with typical coarse shapes. Other objects in the datasets that lack these typical shape features are temporarily classified under the Miscellaneous category. The experimental results demonstrate that, even with these simple coarse labels, the semi-open environment exists objectively and our method achieves significant improvements. We believe that this paper needs to focus on modeling the typical case of this new setting and analyzing its key challenges, while filtering out the effects of atypical samples and random noise. We acknowledge that conducting a comprehensive study of an entirely new field may be difficult within the scope of a conference paper.
Inspired by your comment, we will focus our future work on more hierarchical levels of labels in semi-open learning. We plan to dedicate more effort to exploring Objaverse and other complex datasets, investigating various shape, domain, and semantic labels beyond the shape-based coarse labels proposed in this paper. This will allow us to better adapt to more complex objects and advance the practical applications of semi-open learning. We are well prepared and will release the datasets, code, configurations, and pre-trained models immediately after the anonymous review period of NeurIPS 24. We also look forward to engaging and collaborating with more researchers on both theoretical and applied studies of semi-open learning across different fields. Looking forward to academic discussions with you after the anonymous period of NeurIPS 24, if possible! We are willing to share all our experiences, dataset, and code of this work.
- About the Framework
As mentioned in the answer above, the two-layer HERT is designed based on the existing typical semi-open assumption, serving as an early exploration setting of semi-open learning. The hypergraph-based isomorphism computation and dynamic system are extension modules of the initial HERT framework for practical applications. We will provide more detailed experimental results and analysis in the revised version of this paper. Here, we provide some results and analysis:
-
For the isomorphism computation
To handle more levels of labels, we simplified the hypergraph construction process using isomorphism computation following [2]. Specifically, we detect and merge hyperedges with similar structures or edge embeddings within the proposed HERT framework, thereby reducing the complexity of the hypergraph. To evaluate the efficiency of this approach, we conducted comparison experiments between frameworks with and without isomorphism computation on the SO-MN40 dataset. As shown in the table below, retrieval accuracy (mAP) improved as the number of layers increased, but both training and inference times also increased. Isomorphism computation significantly improved the efficiency of training and inference, with more notable gains in efficiency and accuracy observed as the number of layers increased.
Table R4: Ablation studies of isomorphism computation on the SO-MN40 dataset
| Number of layers | 2 | 3 | 4 |
| --- | --- | --- | --- |
| mAP w/o IC | 0.6336 | 0.6441 | 0.6583 |
| mAP with IC | 0.6359 | 0.6483 | 0.6674 |
| mAP Improvement | 0.36% | 0.65% | 1.38% |
| Training Time (s) w/o IC | 91.16 | 95.79 | 99.73 |
| Training Time (s) with IC | 89.31 | 92.49 | 94.97 |
| Training Efficiency Improvement | 2.51% | 3.44% | 4.77% |
| Inference Time (ms) w/o IC | 18.42 | 19.51 | 20.76 |
| Inference Time (ms) with IC | 17.37 | 17.74 | 18.31 |
| Inference Efficiency Improvement | 5.70% | 9.07% | 11.80% |
w/o denotes without; IC denotes the isomorphism computation module.
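To make the hyperedge-merging idea above concrete, here is a minimal Python sketch. It is not the authors' implementation and does not reproduce the method of [2]: it assumes the hypergraph is given as a binary incidence matrix `H` (vertices by hyperedges) and treats two hyperedges as "similar" when the Jaccard overlap of their vertex sets reaches a hypothetical `threshold`. The function name and parameters are illustrative only.

```python
import numpy as np

def merge_similar_hyperedges(H, threshold=0.8):
    """Greedily merge hyperedges (columns of H) whose vertex sets overlap strongly.

    H is a binary incidence matrix of shape (num_vertices, num_hyperedges).
    Similarity is the Jaccard index of two vertex sets (an assumption made
    for this sketch, not the criterion used in the rebuttal).
    """
    H = np.asarray(H, dtype=bool)
    kept = []  # representative hyperedges found so far
    for j in range(H.shape[1]):
        edge = H[:, j]
        for k, rep in enumerate(kept):
            union = np.logical_or(edge, rep).sum()
            inter = np.logical_and(edge, rep).sum()
            if union > 0 and inter / union >= threshold:
                # Fold this hyperedge into the existing representative.
                kept[k] = np.logical_or(edge, rep)
                break
        else:
            kept.append(edge.copy())
    if not kept:
        return np.zeros((H.shape[0], 0), dtype=int)
    return np.stack(kept, axis=1).astype(int)

# Toy example: hyperedges 0 and 1 are identical, hyperedge 2 is different.
H_toy = np.array([[1, 1, 0],
                  [1, 1, 0],
                  [1, 1, 1],
                  [0, 0, 1]])
H_merged = merge_similar_hyperedges(H_toy)
```

Fewer hyperedges mean a smaller incidence matrix in every subsequent hypergraph convolution, which is one plausible source of the time savings reported in Table R4.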
-
For the dynamic system
To handle the increasing number of labels, we employed a hypergraph-based dynamic system to incrementally construct the hypergraph following [3]. Specifically, we constructed hyperedges for the new vertices of new samples and new labels and updated the hypergraph structure within the proposed HERT framework. To evaluate the efficiency of this approach, we conducted comparison experiments between frameworks with and without the dynamic system on the SO-MN40 dataset. As shown in the table below, retrieval accuracy (mAP) improved as the number of coarse categories increased, while an increase in the number of fine categories slightly decreased retrieval accuracy. Increases in both coarse and fine categories led to longer training and inference times. The dynamic system significantly improved the efficiency of training and inference, with more notable gains in efficiency and accuracy observed as the number of categories increased.
Table R5: Ablation studies of the dynamic system on the SO-MN40 dataset
| Number of Categories | Original HERT | Coarse: 3->4 | Coarse: 3->5 | Fine: 32->40 | Fine: 32->49 |
| --- | --- | --- | --- | --- | --- |
| mAP w/o DS | 0.6336 | 0.6395 | 0.6427 | 0.6295 | 0.6253 |
| mAP with DS | - | 0.6431 | 0.6478 | 0.6331 | 0.6327 |
| mAP Improvement | - | 0.56% | 0.79% | 0.57% | 1.18% |
| Training Time (s) w/o DS | 91.60 | 93.83 | 95.31 | 93.98 | 95.57 |
| Training Time (s) with DS | - | 91.45 | 91.61 | 91.37 | 91.59 |
| Training Efficiency Improvement | - | 2.53% | 3.88% | 2.78% | 4.17% |
| Inference Time (ms) w/o DS | 18.42 | 19.74 | 20.01 | 19.93 | 21.36 |
| Inference Time (ms) with DS | - | 18.45 | 18.39 | 18.57 | 18.43 |
| Inference Efficiency Improvement | - | 6.50% | 8.10% | 6.82% | 13.72% |
w/o denotes without; DS denotes the dynamic system module; Original HERT means the original framework designed for 3 coarse categories and 32 fine categories; '->' denotes the increase in the number of categories.
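The incremental construction described above can be illustrated with a small Python sketch. This is an assumption-laden toy, not the dynamic system of [3] or the authors' code: it assumes one hyperedge per (level, label) pair, and the class name `IncrementalHypergraph` and its methods are hypothetical. The point it shows is that a new sample or a new label only touches the hyperedges it belongs to, so the existing structure is never rebuilt.

```python
from collections import defaultdict

class IncrementalHypergraph:
    """Toy label-driven hypergraph that grows one sample at a time."""

    def __init__(self):
        # (level, label) -> set of vertex ids connected by that hyperedge
        self.edges = defaultdict(set)
        self.num_vertices = 0

    def add_sample(self, labels):
        """Add one vertex; `labels` maps a level name to the sample's label.

        A label that has not been seen before creates a new hyperedge;
        a known label simply extends the existing hyperedge.
        """
        vertex = self.num_vertices
        self.num_vertices += 1
        for level, label in labels.items():
            self.edges[(level, label)].add(vertex)
        return vertex

# Two samples sharing a coarse label but differing in their fine label.
graph = IncrementalHypergraph()
graph.add_sample({"coarse": "Miscellaneous", "fine": "chair"})
graph.add_sample({"coarse": "Miscellaneous", "fine": "airplane"})
```

After these two calls the graph holds three hyperedges: one shared coarse hyperedge and one fine hyperedge per sample.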
-
Conclusion
The experimental results above demonstrate that isomorphism computation and the dynamic system can effectively enhance the efficiency of the HERT framework, and that they have the potential to advance the practical application of semi-open learning methods. We will provide more detailed results and analysis in the revised version of this paper.
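For readers checking the tables, the efficiency rows appear to be relative reductions in time, (t_without - t_with) / t_without. That reading is an assumption rather than something stated in the rebuttal, but the short Python sketch below shows that it reproduces the inference-time row of Table R4. The helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(without, with_module):
    """Percentage by which `with_module` is lower than `without`."""
    return (without - with_module) / without * 100.0

# Inference times (ms) from Table R4 for 2, 3, and 4 layers.
inference_without_ic = [18.42, 19.51, 20.76]
inference_with_ic = [17.37, 17.74, 18.31]

inference_gain = [
    round(relative_reduction(a, b), 2)
    for a, b in zip(inference_without_ic, inference_with_ic)
]
print(inference_gain)
```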
Thank you again for your valuable suggestions, especially your professional advice on practical applications in semi-open learning.
We thank all reviewers for your insightful feedback and for your valuable time and effort. We try to answer all the questions and weaknesses of each reviewer in the rebuttal section below. The attached PDF contains our additional experimental results and figures.
Dear Chairs and Reviewers,
Hope this message finds you well.
With the closing of the discussion period, we present a brief summary of our discussion with the reviewers as an overview for reference. First of all, we thank all the reviewers for their insightful comments and suggestions. We are encouraged that the reviewers found our paper to be:
- Reviewer nF63: could bring more insights for related 3D research, composed of reasonable components along with reasonable loss functions
- Reviewer YDem: easy to read, targets an interesting problem, the proposed method is also interesting
- Reviewer oLzP: clearly defined, a novel contribution that fills a gap in the current literature, innovative and well aligned with real-world scenarios, providing strong empirical evidence of effectiveness through four newly generated datasets
- Reviewer vaz9: reasonable and more practical for real-world applications
We have carefully read all the comments and responded to them in detail. All of those will be addressed in the final version.
We summarize the main concerns of the reviewers with the corresponding response as follows:
- The comparison between Semi-Open and traditional Open-Set Learning. We clarify the difference between this new task (semi-open learning) and existing ones (open-set learning). Existing open-set learning methods consider only the single-layer labels of objects and strictly assume no overlap between the training and testing sets. However, in practical real-world scenarios, objects are typically described by multiple hierarchical labels, and the training set and testing set often share a partial space of coarse labels. For the new semi-open learning task, we expand the number of label levels in the semi-open learning task, where testing categories are unseen at one level but seen at other levels. The label spaces are disjoint at only one level and have some overlap at other levels.
- The coarse label generation. We clarify the generation for coarse labels in this paper and provide examples for all coarse categories of all datasets. Specifically, we annotate the coarse labels according to the geometry-based shape of each object as a whole, while ignoring the part assembly relationships within an object in this paper. Since this paper is an early exploration of semi-open learning for task definition and basic framework evaluation, the two levels of labels we selected are based on two typical different criteria: geometry-based coarse shape category and semantic-based fine category, which is representative of a semi-open environment.
- The efficiency with increasing levels and categories. We further conduct experiments on the computational requirements. Since the proposed HERT is a feature-driven method, it requires significantly less training (less than 100s) and inference (less than 20ms) time compared to existing methods. As shown in the results, when the number of levels or classes increases, the training and inference times grow only slightly. Moreover, increasing the number of levels or classes has almost no effect on the efficiency of using a general pre-trained model to extract basic features.
Based on the discussion with the reviewers, we also present a brief summary of our paper as follows:
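The semi-open condition described in the first point above (label spaces disjoint at exactly one level and overlapping at the others) can be written as a small check. The Python sketch below is only an illustration under that reading: the function name `is_semi_open` is hypothetical, and the train/test split shown is invented for the example, not taken from the paper's datasets.

```python
def is_semi_open(train_labels, test_labels):
    """Return True if train and test label spaces are disjoint at exactly one level.

    Both arguments map a level name (e.g. "coarse", "fine") to a set of labels.
    At every other level the two sets must share at least one label.
    """
    if len(train_labels) < 2 or train_labels.keys() != test_labels.keys():
        return False
    disjoint_levels = [
        level for level in train_labels
        if not (train_labels[level] & test_labels[level])
    ]
    return len(disjoint_levels) == 1

# Hypothetical split: fine categories are unseen at test time,
# while the coarse shape categories are partly shared.
train = {
    "coarse": {"Solids of Revolution", "Miscellaneous"},
    "fine": {"chair", "toilet"},
}
test = {
    "coarse": {"Miscellaneous", "Rectangular-Cubic Prism"},
    "fine": {"airplane", "ottoman"},
}
semi_open = is_semi_open(train, test)

# A conventional open-set split has a single label level, so it fails the check.
open_set = is_semi_open({"fine": {"chair"}}, {"fine": {"airplane"}})
```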
- Observation: Existing open-set learning methods consider only the single-layer labels of objects and strictly assume no overlap between the training and testing sets, leading to contradictory optimization for superposed categories.
- Solution: We explore and define a more practical Semi-Open learning task, and we propose a HERT framework for open-set 3D object retrieval with hierarchical labels.
- Results: We construct four new datasets for this task and experimental results demonstrate that our method can outperform state-of-the-art retrieval methods in the semi-open environment.
- Highlights: Building on the 3D object retrieval task, our work has the following highlights,
- Semi-Open Learning: A new general machine learning task for practical applications beyond open-set learning
- HERT: An early explored paradigm and framework for semi-open learning
- Superposed Hypergraph: A flexible high-order structure for semi-open learning
- Social Impacts: The HERT framework is a feature-driven framework that relies exclusively on basic features as input. This feature-driven representation approach preserves extensibility to other common multimedia data such as text, audio, video, and 3D. We believe this paper has the potential to serve as a foundation framework for the application of multimedia retrieval in practical real-world scenarios.
Thanks again for your efforts in the reviewing and discussion. We appreciate all the valuable feedback that helped us to improve our submission.
Sincerely
Authors of Submission 17410
This paper receives 1 strong accept, 2 weak accepts, and 1 borderline reject. Most reviewers agree that the proposed semi-open setting could be interesting to future research in the field as objects become more complicated. Some reviewers raised concerns regarding the complexity of the proposed method and the lack of strong baseline comparisons, but these concerns were addressed after the rebuttal. Reviewer nF63 still holds the concern that the coarse-label partition is too coarse to be practical, even after thoughtful discussion. After reading the paper and the discussion, the AC finds that the proposed new setup and datasets are still quite beneficial to the field. The authors are encouraged to make the above-mentioned necessary changes to the best of their ability. We congratulate the authors on the acceptance of their paper!