LangHOPS: Language Grounded Hierarchical Open-Vocabulary Part Segmentation
We propose LangHOPS, the first Multimodal Large Language Model (MLLM)-based framework for open-vocabulary object–part instance segmentation.
Abstract
Reviews and Discussion
The authors propose a new method for Open-Vocabulary Part Instance Segmentation (OVPIS), called LangHOPS, which uses hierarchical modeling in language space to encode object parts. LangHOPS performs better than existing state-of-the-art methods across multiple benchmarks.
Strengths and Weaknesses
Strengths:
- LangHOPS integrates MLLMs for OVPIS and grounds object-part hierarchies in language space. This is novel and superior to prior methods that struggle with modeling hierarchical relationships or handling part granularity variations.
- The method achieves strong empirical results over baselines.
Weaknesses:
- The object part granularity can be at different levels (as also seen in supplementary Figure 4). How are such ambiguities accounted for during evaluation? For example, if the face is a sub-part of a human, what happens if the model predicts parts at a higher granularity, such as eyes, nose, mouth, etc.?
- The paper's mathematical notation is formatted incorrectly and inconsistently throughout. This hampers readability, and I strongly encourage the authors to correct these notation errors.
Questions
Please see weaknesses
Limitations
Yes
Justification for Final Rating
I thank the authors for their response. The authors have sufficiently addressed my concerns, hence I maintain my positive rating.
Formatting Issues
None that I could find.
We appreciate the reviewer’s positive feedback on our work, especially the recognition of the novelty of integrating MLLMs deeply into the OVPIS task and of the resulting handling of varying part granularity. Furthermore, our core contribution of grounding object-part hierarchies in the language space, as well as the performance gains over prior baselines, is well acknowledged.
We address the reviewer's concerns in the following sections.
1. Granularity Variation
Part granularity is guided by text and can be flexibly specified by the user or by the dataset annotations used for training and evaluation in our approach.
- For example, during evaluation, if the ground truth annotation specifies “face” as the part of interest for a “human” object, our model constructs the corresponding part query by concatenating the text embedding of “face” with that of “human”, resulting in a semantically aligned object-part query (e.g., “human-face”).
- This design allows the model to be explicitly guided to the appropriate level of granularity, depending on the dataset’s supervision or user input. Moreover, our cross-dataset evaluation (Tables 1 and 2 main paper) is specifically intended to test this capability: different datasets may annotate parts at different levels of detail, and our model’s language-grounded hierarchy allows it to adapt accordingly.
- This granularity awareness is a key reason why our method consistently outperforms baseline models in the cross-dataset setting.
We will include more visualization of the model’s predictions on the same image but under different specified part granularities to better illustrate the model’s flexibility in handling varying granularity levels.
2. Notation
We appreciate the reviewer’s feedback and agree that the mathematical notations should be clear and consistent. We have carefully revised all equations and these corrections will be reflected in the updated version.
I thank the authors for their response. The authors have sufficiently addressed my concerns, hence I maintain my positive rating.
Dear Reviewer,
Thank you for your dedicated efforts in reviewing this paper. We are currently in the reviewer-author discussion phase, but we have not yet seen your engagement.
This year's Responsible Reviewing initiative requires all reviewers to communicate with authors during this period, emphasizing that ghosting is not acceptable. We kindly ask that you reply and engage with the authors. Please note that participation in discussions with authors is mandatory before submitting "Mandatory Acknowledgement," as submitting it without any engagement is not permitted in this review cycle.
Best, AC
Dear Reviewer,
We would like to express again our sincere appreciation for your positive review of our work. We hope that we have addressed your concerns and remain committed to further improving the paper as promised. Please feel free to reach out with any additional questions or suggestions.
- This paper presents LangHOPS, a new framework for Open-Vocabulary Part Instance Segmentation (OVPIS).
- LangHOPS explicitly embeds object–part hierarchies in language space, providing a structured representation of object-part relationships.
- The method integrates a Multimodal Large Language Model (MLLM) to enable context-aware and granularity-adaptive object-part parsing.
- Compared to previous works based on heuristics or constrained CLIP embedding space reasoning, LangHOPS introduces language-grounded part queries for accurate and scalable instance-level segmentation.
Strengths and Weaknesses
Strengths
- The paper introduces the first integration of Multimodal Large Language Models (MLLMs) for the task of object-part instance segmentation, demonstrating originality.
- By explicitly modeling language-grounded object–part hierarchies, LangHOPS improves part query representation, enabling enhanced multi-granularity reasoning and context awareness compared to prior CLIP-based methods.
- The experimental evaluation covers major benchmarks, including PartImageNet, Pascal-Part-116, and ADE20K, and demonstrates scalability, with further performance gains observed when incorporating additional training data.
Weaknesses
- The proposed approach appears to be an extension of PartGLEE with an integrated MLLM, but the architectural or training-level novelties beyond this integration are not clearly articulated. A more precise explanation of additional contributions, if any, would strengthen the claim of originality.
- Since LangHOPS relies on an MLLM (specifically, PaliGemma2), it would strengthen the work to provide quantitative comparisons regarding computational resources, such as model parameters and inference costs, relative to baseline methods.
- In Section 3.2 Method Overview, the sub-section title "Object Segmentation" is redundantly repeated in Lines 130 and 133, which should be clarified. Furthermore, inconsistencies in terminology between Figure 2, Section 3.2, and subsequent subsections (e.g., Object-Part Parsing, Part Segmentation) could lead to confusion and should be aligned.
- As noted in the NeurIPS Paper Checklist [7. Experiment statistical significance], the experimental results lack information on error bars or measures of variability, limiting the assessment of result consistency and robustness.
- The NeurIPS Paper Checklist indicates "[5. Open access to data and code] Yes," but I could not find a clear link or reference for accessing the released data and code at the time of reviewing.
- Minor typos present in the manuscript:
- Line 50: produing → producing
- Line 131: featrues → features
- Line 243: dateset → dataset
- Table 1: artImageNet → PartImageNet
Questions
- In Table 1 and Table 2, experiments are conducted using Pascal-Part-116, which, to my understanding, only provides semantic part segmentation annotations (e.g., "bus's headlight"). Could the authors clarify how part instance segmentation (e.g., "bus 1's headlight 1") is performed using this dataset? Specifically, how was the OVPIS task implemented on PPS-116, which lacks instance-level part annotations?
- In Figure 1, the second example (two people walking) appears to demonstrate semantic segmentation, rather than instance-level segmentation, particularly for the parts. Could the authors clarify whether this example illustrates part instance segmentation or standard semantic segmentation?
- In Line 192, the paper provides an example of the structured prompt input for the MLLM-based object-part parsing. Could the authors elaborate on what specific outputs are produced by the MLLM at this stage?
Limitations
yes
Justification for Final Rating
In my original review, I rated the paper as a borderline reject (score 3) due to concerns about clarity, architectural novelty beyond MLLM integration, and presentation inconsistencies.
The rebuttal and subsequent revisions have directly addressed these points.
The authors clarified how their language-grounded object–part hierarchy differs fundamentally from prior approaches, provided instance-specific representation examples, and reorganized the method section for clearer structure and terminology.
They also added statistical robustness analyses, clarified computational costs, and fixed typographical and notation issues.
The method remains the first to integrate an MLLM into OVPIS with consistent SOTA results across benchmarks, and the strengthened presentation makes the contribution clearer and more convincing.
While some improvements could still be made in presentation polish, the technical merit and demonstrated effectiveness now outweigh earlier reservations.
I am therefore raising my rating from weak reject (3) to 4 (weak accept).
Formatting Issues
N/A
We appreciate the reviewer highlighting the originality of our work in introducing language-grounded object-part hierarchies and MLLMs into the object-part instance segmentation task, as well as acknowledging, in line with the other reviewers, the experiments that verify our approach and show its superior performance.
1. Novelty Clarification
Our work is not an extension of PartGLEE with an integrated MLLM; rather, it proposes novel intermediate representations that enable language-grounded reasoning, together with the corresponding training strategy. In detail, our additional contributions are as follows:
- Our method introduces language-grounded modeling throughout the whole architecture, closely coupling object and part decoding as well as query generation in the same space. Furthermore, the language-grounded object-part hierarchy modeling replaces the brute-force hierarchical representation using a fixed set of parts in the Q-Former of PartGLEE's architecture. Overall, this design enables adaptive part hierarchies in our framework and OVS adaptation to varying part granularity beyond the training data.
- We explore the recipes of both one-stage and two-stage training strategies. We provide an empirical study (Table 7 in the submission) showing that the one-stage training strategy leads to better in-domain performance while the two-stage one leads to better cross-dataset performance. This empirical finding suggests that decoupling coarse-to-fine supervision helps improve generalization in tasks involving hierarchical modeling, offering practical insights for similar structured perception problems.
- Architectural Verification: We conducted experiments to verify and derive our architecture against a direct extension of PartGLEE, PSALM, and PartCatSeg. Most importantly, our experiments (Table 4 main paper) show that our design improves performance far beyond only utilizing the additional capacity and pretraining of the MLLM, but rather builds a framework around it to effectively utilize its generalist capabilities. We add these verification experiments to the final version of our paper.
Together, these contributions show that our method is not merely an MLLM extension of PartGLEE, but introduces important innovations in representation design and training methodology that contribute to its effectiveness.
2. Parameters and Inference Cost
We now report the footprint of GPU hours, carbon cost, inference cost, and model size of PartGLEE, PSALM, and LangHOPS in Table R-9. The GPU hours and inference time are reported for an NVIDIA H200 GPU. The spec power (700 W) of the H200 and the world-average carbon intensity of electricity (~0.475 kg CO₂e / kWh) are used for calculating the footprint.
| Method | Model Size | Training GPU Hours | Training Footprint (kg CO₂e) | Inference Time (ms) |
|---|---|---|---|---|
| PSALM† | 1.5B | 92 | 30.6 | 628 |
| PartGLEE | 1B | 40 | 13.3 | 240 |
| LangHOPS | 4B | 72 | 23.9 | 396 |
Table R-9: Specs of LangHOPS and the baselines. PPS116 + INS + PART -> PartImageNet.
The Table shows that LangHOPS has the largest model size, mainly due to the usage of MLLM (Paligemma2-3B). PSALM† has the longest training time and carbon footprint since it trains the LLM instead of using LoRA, and needs to process all candidate category names, which leads to long input prompts to the LLM. LangHOPS achieves the best performance with reasonable training and inference cost compared to the baselines.
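For transparency, the footprint numbers above follow directly from the stated assumptions (700 W board power, 0.475 kg CO₂e/kWh world-average carbon intensity); a minimal sketch of the calculation:

```python
# Minimal sketch of the footprint calculation behind Table R-9, assuming the
# stated 700 W H200 spec power and 0.475 kg CO2e/kWh world-average intensity.
GPU_POWER_KW = 0.7
CARBON_INTENSITY_KG_PER_KWH = 0.475

def training_footprint(gpu_hours: float) -> float:
    """Estimated training footprint in kg CO2e from GPU hours."""
    return gpu_hours * GPU_POWER_KW * CARBON_INTENSITY_KG_PER_KWH

for method, hours in [("PSALM", 92), ("PartGLEE", 40), ("LangHOPS", 72)]:
    print(f"{method}: {training_footprint(hours):.1f} kg CO2e")
    # -> PSALM: 30.6, PartGLEE: 13.3, LangHOPS: 23.9 (matching Table R-9)
```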
3. Terminology Inconsistency
We appreciate the reviewer's correction. The second "Object Segmentation" in L133 is indeed redundant, and that paragraph is part of object-part parsing. We will address them in the revised version of the paper. "Language-Space Object-Part Representation" and "MLLM-based Object-Part Parsing" are actually both part of "Object-Part Parsing" in Figure 2. We understand that it can cause confusion, and we will clarify this.
4. Error Bars
We have now included mean ± standard deviation results computed over three random seeds (20, 43, and 4337) for both LangHOPS and PartGlee, the strongest baseline.
These results are reported in Tables R-5 and R-6. As shown, the averaged AP values closely align with the originally reported numbers, slightly increasing in most experiments for our method. The standard deviations have a low magnitude, indicating that our results are stable and show a significant improvement over the PartGlee baseline.
| Method | PPS-116 | - | - | +INS | - | - | +INS+PART | - | - | PartImageNet | - | - |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | obj | part | AP | obj | part | AP | obj | part | AP | obj | part | AP |
| PartGlee ± | 38.4±0.5 | 8.61±0.46 | 15.2±0.5 | 57.6±1.8 | 11.3±0.2 | 21.5±0.6 | 60.1±0.9 | 10.5±0.7 | 21.5±0.7 | 82.0±0.5 | 42.5±0.8 | 51.3±0.7 |
| LangHOPS± | 48.7±3.3 | 8.89±0.24 | 17.7±0.9 | 60.9±0.7 | 12.1±1.0 | 22.9±1.0 | 63.7±1.4 | 16.6±0.3 | 27.0±0.5 | 84.1±1.0 | 48.2±0.8 | 56.1±0.8 |
Table R-5: PPS116→PartImageNet with standard deviation (±) of the proposed method LangHOPS± and the strongest baseline PartGlee±.
| Method | PartImageNet | - | - | +INS | - | - | +INS+PART | - | - | PPS-116 | - | - |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | obj | part | AP | obj | part | AP | obj | part | AP | obj | part | AP |
| PartGlee ± | 8.53±0.52 | 2.05±0.09 | 3.04±0.16 | 23.3±0.1 | 3.10±0.16 | 6.17±0.16 | 22.5±0.4 | 3.60±0.24 | 6.46±0.26 | 53.6±1.5 | 13.8±0.6 | 19.9±0.7 |
| LangHOPS ± | 11.0±1.0 | 2.17±0.03 | 3.50±0.18 | 22.9±0.7 | 3.68±0.11 | 6.59±0.19 | 23.2±0.5 | 4.51±0.25 | 7.34±0.28 | 55.0±1.1 | 15.0±0.2 | 21.1±0.3 |
Table R-6: PartImageNet→PPS116 with standard deviation..
5. Code Release
We will make the code public upon the paper’s release. Due to IP restrictions, it was not possible to release the code during the review period.
6. Typos
We thank the reviewer for reporting the typos. We will address them and go through the paper carefully, correcting them.
7. Instance-level Annotation
The reviewer is correct that the original Pascal-Part-116 [42] dataset only provides semantic part segmentation annotations. However, as detailed in Appendix B of PartGLEE [20] and its official git repository (assets/DATA.md), PartGLEE further processes these annotations to produce instance-level part segmentation labels, which support the OVPIS task. Our experiments in Table 1 and Table 2 use this PartGLEE-provided version of the PPS-116 dataset, which includes part-instance-level annotations. We will clarify this in the revised version of the paper to avoid confusion.
8. Visualization Clarification
The visualizations in Figure 1 are part instance segmentation outputs. For readability, we display all parts of the same type in the same color, allowing the reader to re-identify objects, and show only a single label per part class to avoid clutter. We thank the reviewer for pointing out this potentially distracting visualization and will provide a more conventional visualization showing the separation of instances in the final paper.
9. MLLM Output
The MLLM outputs are embeddings of MLLM-processed prompt tokens, and we extract part queries from those embeddings. Specifically, we feed a structured language prompt—containing object-part hierarchical information—into the MLLM. Unlike typical generative usage, the MLLM here returns token-level embeddings for the entire input sequence, with the output length matching the input. We then extract the embeddings at the positions corresponding to the initial part query tokens. These embeddings are used as the refined, language-informed part queries, which are passed to the visual part decoder.
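To make this concrete, below is a minimal sketch of the extraction step, assuming a transformers-style MLLM whose forward pass returns per-token hidden states; the helper names and tensor shapes are illustrative assumptions rather than the exact implementation:

```python
import torch

def refine_part_queries(mllm, prompt_embeds: torch.Tensor,
                        part_query_positions: list[int]) -> torch.Tensor:
    """Sketch of the non-generative MLLM usage described above.

    prompt_embeds        : (seq_len, dim) embeddings of the structured prompt,
                           i.e. object queries interleaved with initial part queries.
    part_query_positions : indices of the initial part-query tokens in the prompt.
    """
    # The MLLM returns one embedding per input token (output length == input length).
    token_embeds = mllm(inputs_embeds=prompt_embeds.unsqueeze(0)).last_hidden_state[0]
    # Keep only the positions of the initial part-query tokens; these become the
    # refined, language-informed part queries passed to the visual part decoder.
    return token_embeds[part_query_positions]
```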
Thank you for your detailed and thoughtful responses to my concerns.
The clarifications regarding novelty, parameter and inference cost (2.), and error bars (4.) were especially helpful.
However, there remain a few points for which I would appreciate further clarification or offer additional comments below.
1. Novelty Clarification
1-1.
In your response, you mention:
“the language-grounded object-part hierarchy modeling replaces the brute-force hierarchical representation using a fixed set of parts in the Q-Former of PartGLEE's architecture.”
In this context, could you clarify, with an explicit example, whether bus1’s wheel1, bus1’s wheel2, and bus2’s wheel1 are represented identically or differently in your language space E?
Further, in the open-vocabulary setting described in your problem definition, the object-part categories and candidate part classes for each object are assumed to be predefined.
In such a scenario, how does your approach for part representation (e.g., as in Section 3.4) fundamentally differ from simply concatenating the CLIP text embeddings of the object and part name (i.e., E(C_bus) + E(C_window))?
A concrete example or further clarification on this point would be very helpful.
1-2.
The analysis on training strategies provided in the supplementary material is indeed meaningful.
However, since this aspect is not thoroughly analyzed or discussed in the main text, I believe it does not stand out as a main contribution of the work.
3. Clarity and Consistency of Presentation
As you noted, the additional “Object Segmentation” subsection title is a typographical error.
Beyond that, however, there are broader issues with terminology and consistency throughout the manuscript.
For example, the use of “Object Segmentation” vs. “Object-level Segmentation” in subsection titles and the main figure appears inconsistent and reduces the paper’s overall clarity.
Another example is Section 3.4 “Language-Space Object-Part Representation”, which in the main figure is labeled as “Hierarchical Object-Part Representation”.
The correspondence between subsection titles and figure components is structurally ambiguous, and terminology is not used consistently across the manuscript and figures, which may confuse readers (for example, as previously noted, Sections 3.4 and 3.5 are in fact subcomponents of “Object-Part Parsing,” but this hierarchy is not made explicit in the main text).
For instance, in Figure 2, if “MLLM-based Parsing” inside “Object-Part Parsing” refers to Section 3.5, and “Hierarchical Object-Part Representation” refers to Section 3.4, the hierarchy and terminology between text and figure are not consistent.
As Reviewer Ku9B also mentioned, notations lack consistency throughout the paper, which further undermines the clarity and completeness of the presentation. (for example, C_{} sometimes being italicized and sometimes not)
Taken together, the structural inconsistencies between the main text and figures, the lack of terminological coherence, the frequent typographical errors, and the misused subsection titles all contribute to a sense of incompleteness in the manuscript.
6. Additional Typos and Inconsistencies Noted
- Figure 2: “obj-art” should be “obj-part”
- Figure 2: Capitalization of “C_{obj-part}” is inconsistent with the main text
- Figure 2: “Object Segmentation” vs. “Object-level Segmentation” should be unified
- Figure 2: “Hierarchical Obj-Part Rep” should be inside the bold pink box for consistency with “MLLM-based Parsing”
- Table 1 and 2: “PartGlee” → “PartGLEE”
- Table 3: “PartCatSeg” → “PartCATSeg”
- “PaliGemma2” → “PaliGemma 2”
While I recognize some of the novelty of the proposed approach and its demonstrated effectiveness, I find that the issues in overall flow and consistency, particularly the structural ambiguities and lack of clarity in connecting sections and figures, significantly limit the impact of the contribution at this stage.
Therefore, despite its strengths, I must still regard this as a borderline paper.
Thank you again for your detailed response and engagement.
Dear Reviewer,
Thank you for your dedicated efforts in reviewing this paper. We are currently in the reviewer-author discussion phase, but we have not yet seen your engagement.
This year's Responsible Reviewing initiative requires all reviewers to communicate with authors during this period, emphasizing that ghosting is not acceptable. We kindly ask that you reply and engage with the authors. Please note that participation in discussions with authors is mandatory before submitting "Mandatory Acknowledgement," as submitting it without any engagement is not permitted in this review cycle.
Best, AC
We appreciate the reviewer's further constructive comments and suggestions. We agree that clarifying the raised questions is important for improving the quality and clarity of our work. Below, we provide our detailed responses.
1. Novelty Clarification
Architecture Novelty Clarification
In our model, the embeddings of bus1’s wheel1 and bus2’s wheel1 are distinct in the language space E, while the embeddings of bus1’s wheel1 and bus1’s wheel2 are identical. Specifically, the part embedding is constructed by combining the object query and the CLIP text embedding of the part name “wheel”. Importantly, the object query is not just a category-level embedding; it encodes instance-specific visual features, as it is used to generate the segmentation mask for that specific bus1 instance.
- Consequently, E(bus1’s wheel1) ≠ E(bus2’s wheel1), while E(bus1’s wheel1) = E(bus1’s wheel2).
This differs fundamentally from the alternative approach of simply concatenating the CLIP embeddings of the object and part names (e.g., CLIP(“bus”) + CLIP(“wheel”)), which would not capture instance-level visual features and variation and would lead to ambiguity, especially in scenes with multiple instances of the same object type.
Following the reviewer’s helpful suggestion, we have clarified this point explicitly in Section 3.4 of the revised paper.
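To make the construction concrete, below is a minimal sketch of how instance-specific part queries can be formed from an object query and CLIP part-name embeddings; the helper `clip_text_embed` and the tensor shapes are illustrative assumptions, not the exact implementation.

```python
import torch

def build_part_queries(object_query: torch.Tensor,
                       part_names: list[str],
                       clip_text_embed) -> torch.Tensor:
    """Sketch: initial part queries for ONE detected object instance.

    `object_query` is the instance-specific embedding produced by the object
    decoder (e.g. the query that segments bus1), so the queries built for bus1
    and bus2 differ, while two wheels of the same bus share one initial query.
    """
    # Text embeddings of the candidate part names, e.g. ["wheel", "window", ...].
    text_embeds = torch.stack([clip_text_embed(name) for name in part_names])
    # Repeat the object query once per candidate part.
    obj = object_query.unsqueeze(0).expand(len(part_names), -1)
    # Concatenate instance-specific object features with part-name text features.
    return torch.cat([obj, text_embeds], dim=-1)  # (num_parts, d_obj + d_text)
```

These initial queries are then refined by the MLLM (Section 3.5) before being passed to the part decoder.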
Training Strategy
We agree that the training strategy analysis provides useful insight. In response to the reviewer’s comment, we have moved the relevant analysis to Section 4.3 in the main paper from the supplementary and explicitly discussed its impact on performance. While it may not be the central novelty of our method, we believe it adds valuable context and guidance for future work, and we are grateful for the reviewer’s encouragement to better highlight it.
3. Clarity and Consistency of Presentation
We appreciate the reviewer's suggestions for improving the presentation of the paper. In the revised paper, we have
- corrected the duplicate “Object Segmentation” subsection title;
- ensured consistency and clarity of the method components by
- (a) replacing “object-level segmentation” with “object segmentation” in Sections 3 and 4 for consistency;
- (b) unifying the name of Section 3.4 in Figure 2 to “Language-Space Object-Part Representation”;
- (c) clarifying the hierarchy between “Object-Part Parsing” and Sections 3.4 and 3.5 in the method overview;
- (d) adding tags of the subsection numbers to the corresponding modules in Figure 2;
- ensured consistency of the mathematical notation and terminology.
6. Additional Typos and Inconsistencies Noted
We have addressed all the typos and inconsistencies noted by the reviewers. In addition, we have carefully proofread the paper to ensure its clarity and coherence.
We sincerely appreciate the reviewer’s meticulous suggestions and thoughtful feedback, which have significantly improved the quality and presentation of the paper. In the revised version, we have made significant efforts to address these issues with
- a clarified model design and an analysis of training strategies in the main paper;
- a refined structure of the method section and clear figure–text alignment;
- standardized notation and consistent terminology;
- corrected typos and polished phrasing.
We hope these changes effectively address the reviewer’s concerns and make the contributions of the paper more accessible and impactful. We deeply appreciate the reviewer’s acknowledgment of the strengths and novelty of our approach, and we respectfully ask for reconsideration of the overall rating in light of the improvements.
Thank you for the detailed revisions, including clearer architectural novelty, explicit instance-specific representation examples, improved training strategy presentation, resolved terminology and figure alignment issues, and added robustness and resource analyses.
These effectively address my earlier concerns, so I am raising my recommendation from weak reject (3) to weak accept (4).
The paper introduces LangHOPS, the first framework that tackles open-vocabulary object–part instance segmentation (OVPIS) by (i) explicitly encoding object–part hierarchies in CLIP language space and (ii) passing these language-grounded queries through a lightweight MLLM (PaliGemma-2) to refine part queries before decoding. Experiments on three benchmarks (PartImageNet, PascalPart-116, ADE20K) show consistent state-of-the-art gains: +5.5 AP (in-domain) and +4.8 AP (cross-dataset) on PartImageNet, and +2.5 mIoU on unseen ADE20K parts. Ablations confirm that both the language-grounded hierarchy and the MLLM parser are indispensable and that joint training yields positive object–part synergy.
Strengths and Weaknesses
Strengths:
1). Solid empirical evidence: thorough in-domain, cross-dataset, zero-shot, scalability and ablation studies.
2). Clear Implementation details: two-stage training, clear loss formulations and hyper-parameters.
3). Significance and originality: First to integrate an MLLM for OVPIS; shows that language-grounded hierarchies boost both part and object quality; could inspire similar hierarchical designs in other dense tasks. Combines hierarchical language modelling with MLLM-driven query refinement; introduces explicit synergy analysis.
Weaknesses
1). Error bars / statistical significance are missing; some ablations (e.g., prompt sensitivity, MLLM size) are limited; compute cost for full training set (INS + PART) not fully quantified.
2). Impact depends on availability of strong MLLMs; benefit over strong vision-only baselines (e.g., clipped Q-Former) is moderate on ADE20K (49.5 hIoU vs 50.0 of PartCatSeg).
3). Repetition/typos (e.g., duplicated “Object Segmentation” heading), table captions compressed, and some notation (Np, H) introduced late.
Questions
1). Statistical robustness – Please report confidence intervals (e.g., ± std over three seeds) for main tables. A ≥ 3 pt drop might change my rating.
2). Prompt sensitivity – How stable is performance to wording or ordering of the structured prompt? A study on a held-out set would strengthen claims of robustness.
3). Compute footprint – Provide GPU hours and carbon cost for each training stage and for the large INS + PART run; if substantially larger than baselines, discuss trade-offs.
4). Hierarchy construction – The method assumes a clean object→part taxonomy. How does LangHOPS behave when candidate lists are noisy or overlapping? An experiment with automatically mined candidates would be helpful.
5). Synergy mechanism – The paper shows gains but not why gradients from parts help objects. Visualizing attention maps before/after joint training could clarify this; stronger evidence could raise my score.
Limitations
The paper openly notes that synergy is implicit and that the role of language-guided hierarchies on object quality is not fully characterized (Sec. 4.4). However, potential societal impacts (e.g., misuse of fine-grained segmentation for surveillance) are not discussed; consider adding a Broader-Impacts paragraph.
Justification for Final Rating
The rebuttal resolved my main robustness and cost doubts and provided new evidence of hierarchy tolerance and part-object synergy. The method remains the first to integrate an MLLM into OVPIS and delivers consistent SOTA across three datasets with solid ablations. Remaining weaknesses are either incremental (ADE20K gap) or addressable in the final version (impact statement). Overall, reasons to accept now clearly outweigh reasons to reject.
Formatting Issues
1). Minor duplicated subsection titles (Sec. 3.3/3.4). 2). Table 1 caption typo: “→artImageNet”. 3). Occasional spacing errors around “%” and subscripts.
We thank the reviewer for the highly constructive comments and appreciate the recognition of the significance and originality of our work, together with the highlighting of solid empirical evidence across different settings and clear implementation details. We answer the reviewer's comments as follows.
1. Error Bars
We report mean ± standard deviation results over three random seeds (20, 43, and 4337) for both LangHOPS and PartGlee, the strongest baseline.
As shown in Tables R-5 and R-6, the averaged AP closely align with the originally reported numbers, slightly increasing in most experiments for our method. The standard deviations have a low magnitude, indicating that our results are stable and show a significant improvement over the PartGlee baseline.
| Method | PPS-116 | - | - | +INS | - | - | +INS+PART | - | - | PartImageNet | - | - |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | obj | part | AP | obj | part | AP | obj | part | AP | obj | part | AP |
| PartGlee ± | 38.4±0.5 | 8.61±0.46 | 15.2±0.5 | 57.6±1.8 | 11.3±0.2 | 21.5±0.6 | 60.1±0.9 | 10.5±0.7 | 21.5±0.7 | 82.0±0.5 | 42.5±0.8 | 51.3±0.7 |
| LangHOPS± | 48.7±3.3 | 8.89±0.24 | 17.7±0.9 | 60.9±0.7 | 12.1±1.0 | 22.9±1.0 | 63.7±1.4 | 16.6±0.3 | 27.0±0.5 | 84.1±1.0 | 48.2±0.8 | 56.1±0.8 |
Table R-5: PPS116→PartImageNet with standard deviation (±) of the proposed method LangHOPS± and the strongest baseline PartGlee±.
| Method | PartImageNet | - | - | +INS | - | - | +INS+PART | - | - | PPS-116 | - | - |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | obj | part | AP | obj | part | AP | obj | part | AP | obj | part | AP |
| PartGlee ± | 8.53±0.52 | 2.05±0.09 | 3.04±0.16 | 23.3±0.1 | 3.10±0.16 | 6.17±0.16 | 22.5±0.4 | 3.60±0.24 | 6.46±0.26 | 53.6±1.5 | 13.8±0.6 | 19.9±0.7 |
| LangHOPS ± | 11.0±1.0 | 2.17±0.03 | 3.50±0.18 | 22.9±0.7 | 3.68±0.11 | 6.59±0.19 | 23.2±0.5 | 4.51±0.25 | 7.34±0.28 | 55.0±1.1 | 15.0±0.2 | 21.1±0.3 |
Table R-6: PartImageNet→PPS116 with standard deviation.
2. Prompt Sensitivity
We conducted two ablation studies on the ordering and wording of the structured input prompts to assess the robustness of our method to prompt formulation.
Robustness to prompt ordering: We randomly shuffled (i) the order of object queries, and (ii) the order of part queries within each object, multiple times during inference. For instance, object 3 may appear before object 1, or part queries within an object may be permuted (e.g., "part 9, part 4, part 6"). As shown in Table R-7, our method remains highly stable across these permutations, with minimal performance degradation, demonstrating robustness to input ordering.
| Method | PPS-116 | - | - | +INS | - | - | +INS+PART | - | - | PartImageNet | - | - |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | obj | part | AP | obj | part | AP | obj | part | AP | obj | part | AP |
| Shuffling of Object | 47.8±2.7 | 8.57±0.36 | 17.1±0.9 | 61.1±0.8 | 11.7±0.8 | 22.6±0.8 | 65.1±0.9 | 15.8±0.3 | 26.7±0.4 | 82.1±1.6 | 47.7±1.2 | 55.3±1.3 |
| Shuffling of Part | 46.9±2.4 | 9.08±0.33 | 17.5±0.8 | 58.8±1.1 | 13.6±0.9 | 23.6±0.9 | 64.2±1.6 | 16.9±0.3 | 27.4±0.6 | 81.7±1.0 | 46.2±0.6 | 54.1±0.7 |
| No shuffling | 48.7±3.3 | 8.89±0.24 | 17.7±0.9 | 60.9±0.7 | 12.1±1.0 | 22.9±0.97 | 63.7±1.4 | 16.6±0.3 | 27.0±0.5 | 84.1±1.0 | 48.2±0.8 | 56.1±0.8 |
Table R-7: Ablations on the ordering of the object and part queries - PPS116->PartImageNet.
Robustness to wording: We further test the model's robustness to unseen part names by replacing a subset (from 0% to 100%) of the original part category names with GPT-4o-generated synonyms (e.g., "foot" → "leg"). As shown in Table R-8, LangHOPS significantly outperforms PartGLEE under increasing synonym replacement ratios, indicating strong generalization to semantically similar but unseen phrasing. Note that synonym substitutions may introduce granularity mismatches with the dataset’s ground-truth annotations (e.g., "leg" may exclude "paw" in the ground truth for “foot”), which partially explains the observed performance drop.
| Method | 0% | - | 25% | - | 50% | - | 75% | - | 100% | - |
|---|---|---|---|---|---|---|---|---|---|---|
| | part | AP | part | AP | part | AP | part | AP | part | AP |
| PartGLEE | 11.2 | 21.8 | 9.3 | 20.3 | 8.6 | 19.7 | 6.6 | 18.2 | 5.1 | 17.0 |
| LangHOPS | 17.0 | 27.1 | 16.2 | 26.5 | 16.5 | 26.7 | 14.6 | 25.3 | 12.7 | 23.8 |
Table R-8: Ablation on the robustness to input part category names. PPS116 + INS + PART -> PartImageNet. Different percentages of part category names replaced with GPT-4o generated synonyms.
Together, these studies demonstrate that LangHOPS is robust to both the structure and vocabulary of the input prompt.
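For reference, a sketch of the synonym-replacement protocol behind Table R-8; the function names and the synonym dictionary are illustrative assumptions (in the actual experiment the synonyms were generated with GPT-4o).

```python
import random

def replace_with_synonyms(part_names: list[str],
                          synonyms: dict[str, str],
                          ratio: float,
                          seed: int = 0) -> list[str]:
    """Replace a fraction `ratio` of part category names with synonyms
    (e.g. "foot" -> "leg"), leaving the remaining names unchanged."""
    rng = random.Random(seed)
    num_replaced = round(len(part_names) * ratio)
    replaced = set(rng.sample(range(len(part_names)), num_replaced))
    return [synonyms.get(name, name) if i in replaced else name
            for i, name in enumerate(part_names)]

# Example: evaluate the model with the perturbed vocabulary at each ratio.
# for ratio in (0.0, 0.25, 0.5, 0.75, 1.0):
#     names = replace_with_synonyms(original_part_names, gpt4o_synonyms, ratio)
#     evaluate(model, names)
```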
3. Compute Footprint
We now report the footprint of GPU hours, carbon cost, inference cost, and model size of PartGLEE, PSALM, and LangHOPS in Table R-9. The GPU hours and inference time are reported for an NVIDIA H200 GPU. The spec power (700 W) of the H200 and the world-average carbon intensity of electricity (0.475 kg CO₂e / kWh) are used for calculating the footprint.
| Method | Model Size | Training GPU Hours | Training Footprint (kg CO₂e) | Inference Time (ms) |
|---|---|---|---|---|
| PSALM† | 1.5B | 92 | 30.6 | 628 |
| PartGLEE | 1B | 40 | 13.3 | 240 |
| LangHOPS | 4B | 72 | 23.9 | 396 |
Table R-9: Specs of LangHOPS and the baselines. PPS116 + INS + PART -> PartImageNet.
The Table shows that LangHOPS has the largest model size, mainly due to the usage of MLLM (Paligemma2-3B). PSALM† has the longest training time and carbon footprint since it trains the LLM instead of using LoRA, and needs to process all candidate category names, which leads to long input prompts to the LLM. LangHOPS achieves the best performance with reasonable training and inference cost compared to the baselines.
4. Robustness to Noisy Hierarchy
We test on the common OVS setting using clean object-part hierarchies, but believe in the value of closing the gap towards noisy real-world deployment. To evaluate the robustness of LangHOPS to noisy or automatically mined hierarchies, we replace a portion of the clean object-part taxonomy with GPT-4o-generated object-part hierarchies. These auto-mined hierarchies are constructed solely from the object category names and may introduce ambiguity, inconsistency, or irrelevant parts. In Table R-10, we report performance under varying noise levels: at a given noise level, that percentage of the object categories use noisy hierarchies, while the remaining categories use the clean dataset annotations. We observe that:
- LangHOPS consistently outperforms PartGlee across all noise levels;
- LangHOPS degrades more gracefully as noise increases, maintaining reasonable AP even when all hierarchies are noisy;
- The performance gap widens especially at high noise levels, demonstrating LangHOPS's stronger resilience to imperfect or automatically mined hierarchies.
Please note that the auto-generated hierarchies are often inconsistent with the ground truth annotations in the dataset, leading to lower evaluation metrics. Overall, we agree with the reviewer that developing evaluation protocols for adaptive, task-specific hierarchies remains an open problem and a promising direction for future benchmark design.
| Method | 0% | - | 25% | - | 50% | - | 75% | - | 100% | - |
|---|---|---|---|---|---|---|---|---|---|---|
| | part | AP | part | AP | part | AP | part | AP | part | AP |
| PartGLEE | 11.2 | 21.8 | 10.3 | 21.1 | 9.8 | 20.7 | 8.2 | 19.4 | 3.6 | 15.8 |
| LangHOPS | 17.0 | 27.1 | 13.1 | 24.1 | 12.4 | 23.5 | 8.8 | 20.7 | 6.7 | 19.1 |
Table R-10: Ablations on the noisy hierarchy construction. Different percentages of obj-part hierarchies from the dataset are replaced with GPT-4o generated ones.
5. Synergy Mechanism - Attention Score
We further support the object-part synergy mechanism by reporting the average attention score, since attention maps cannot be provided in the rebuttal format. The average attention score is calculated by summing the attention scores of true-positive predictions inside the ground-truth masks, divided by the area of the masks. The attention is the normalized cosine similarity between object queries and the dense features of the final layer of the object/part decoder. The score reflects the amount of attention correctly assigned by the model to the ground-truth area and is mapped to the range [0, 1].
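A minimal sketch of this score, under the stated assumptions (cosine similarity between queries and dense decoder features; the mapping to [0, 1] via (cos + 1)/2 is one possible choice and is an assumption here):

```python
import torch
import torch.nn.functional as F

def avg_attention_score(queries: torch.Tensor,      # (N, D) queries of true-positive predictions
                        dense_feats: torch.Tensor,  # (D, H, W) final-layer decoder features
                        gt_masks: torch.Tensor) -> float:  # (N, H, W) boolean ground-truth masks
    """Average attention correctly assigned inside the ground-truth masks."""
    d, h, w = dense_feats.shape
    feats = F.normalize(dense_feats.reshape(d, -1), dim=0)   # (D, H*W)
    q = F.normalize(queries, dim=-1)                         # (N, D)
    attn = (q @ feats).reshape(-1, h, w)                     # cosine similarity in [-1, 1]
    attn = (attn + 1) / 2                                    # map to [0, 1]
    per_mask = (attn * gt_masks).sum(dim=(1, 2)) / gt_masks.sum(dim=(1, 2)).clamp(min=1)
    return per_mask.mean().item()
```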
In the setting of PPS116+INS+PART -> PartImageNet, as shown in Table R-11, compared to the "detached object-part seg." setting, the synergized object-part segmentation leads to higher attention scores for both object and part segmentation, providing strong evidence of the synergy between the two segmentation tasks.
| Setting | Detached Obj-Part Seg | Obj-Part Seg in Synergy |
|---|---|---|
| obj attn. score | 0.76 | 0.82 |
| part attn. score | 0.58 | 0.67 |
Table R-11: Attention score of the object and part segmentation.
We will further show the attention maps and the attention scores in more experimental settings in the revision.
6. Comparison to PartCatSeg
We agree that the improvement over PartCatSeg in OVPS is moderate on one specific dataset (ADE20K). However, it is worth noting that our method is developed and trained for object-part-level instance segmentation. Directly evaluating our method on semantic segmentation (Table 3 in the main paper) still leads to superior performance (hIoU) on the PPS-116 and PartImageNet datasets, showing the strong potential of the proposed method.
I thank the authors for the detailed rebuttal, which addresses the issues I raised. Given the demonstrated statistical stability, robustness analyses, and clarified resource footprint, I am raising my overall score from 4 (borderline accept) to 5 (accept). I believe the paper makes a meaningful and well-supported contribution to open-vocabulary part segmentation community.
This paper introduces an object-level instance and object-specific part-level instance segmentation method. The results on several datasets surpass previous methods.
Strengths and Weaknesses
Strengths:
This paper provides a detailed definition of the problem associated with this "new" task. The experiments conducted on different benchmarks verify the effectiveness of the proposed method, particularly in the cross-dataset experiment.
Weaknesses:
However, the motivation for this task is not explained well. The object-specific part-level instance segmentation is more like a combination of the part-segmentation and instance-segmentation tasks. It is hard for readers to grasp the motivation at the beginning.
In addition, the proposed method includes several components, but the connection between them is not strong. The part segmentation module and the object segmentation module overlap for a specific object category. For instance, both can learn the information for "bus", yet no relation over this information is established when building the structure.
Another critical issue is why the authors refer to the method as language-grounded. The processing of the textual part is limited: the prompts for both object segmentation and part segmentation are simply passed through a frozen text encoder, followed by a concatenation. The additional information plays a limited role as "language grounding".
The paper requires careful proofreading, particularly in image captions and Section 3.2, 4.3.
Questions
- Line 42-47, can you explain the limitations of previous methods? I do not entirely agree with your points.
- Why do the authors train the visual and pixel backbones, which is very rare?
- What is the difference between the proposed method and a set of learnable, universal queries in PartGLEE in Section 3.4?
Limitations
Some parts are insufficiently explained in section 3.2. The improvements in Tables 2 and 3 are not significant.
Justification for Final Rating
The authors provide a rebuttal to address some concerns from the first round. However, some key points (e.g., motivation, the distinction from object/part segmentation) require more thorough explanations and experiments. Therefore, I keep my initial rating of borderline reject.
Formatting Issues
None.
We appreciate the reviewer's comments and recognition of performance gains achieved by our method on different benchmarks, especially on open vocabulary cross-dataset tasks, which are a central target of our method.
1. Motivation Clarification
We do not perform object-specific part segmentation, but rather segment parts of all requested objects simultaneously, following the broadly accepted open-vocabulary part segmentation (OVPS) framework.
For the relationship between part- and instance-segmentation, in fact, prior work such as PartGLEE follows the approach mentioned by the reviewer — simply combining object and part segmentation modules via a Q-Former.
In contrast, our method models the object-part hierarchy in language space, leading to significant performance gains over PartGLEE. Addressing the following challenge is important:
While object-level instance segmentation is well-defined, part-instance segmentation is task-dependent and inherently requires open-vocabulary capabilities.
For example:
- Using a laptop may require segmenting the lid to open it.
- Repairing it may require finer segmentation of screws or hinges.
We solve it by:
- Incorporating open-vocabulary prompts into object segmentation;
- Modeling hierarchical object-part relationships with structured prompts;
- Enforcing a one-to-one mapping between object parts and part segmentation queries;
- Using language-grounded queries in the part decoder that leverage the reasoning capabilities of an MLLM.
We will clarify this in the introduction.
2. Connection Between Modules
We believe this is a misunderstanding.
Our object and part segmentation modules are tightly coupled via the hierarchical object-part representation (mentioned in Line 140, 174, and Figure 2) and aim at utilizing the joint information between parts and objects through the following information flow:
- For each segmented object, a set of part queries is constructed by a concatenation of the object embedding generated in the object decoder and the language-encoded part description.
- These queries are processed by the MLLM, generating queries that are capable of open-vocabulary part segmentation.
- The processed queries are used in the part decoder to segment each requested part.
Our ablation study in l269ff and Table 5 shows a comparison with detached object-part segmentation that demonstrates that the joint training with information flow from the part decoder to the object decoder is efficient and improves performance.
We will add this paragraph to the architecture overview.
3. Language Grounding
We respectfully disagree and refer to our method as being language-grounded since we design the whole object-part representation and processing to remain in a language-aligned space. Our motivation to follow this approach is to utilize the generalist language representations learned by the MLLM.
In detail the language-grounding is enabled in the following way:
- Both object and part queries generated in the segmentation modules are aligned to the language space and compared against text embeddings of category labels for classification.
- More critically, language processing in our method goes beyond the frozen text encoder. We combine the hierarchical part structure in language space with semantic object embeddings produced by the object segmentation, explicitly modeling object-part hierarchies in language space.
- We leverage these text-aligned embeddings in a pretrained MLLM, which enriches the queries with contextual and semantic knowledge from the vision-language domain. Similarly, the output of the MLLM resides in a text-informed feature space, allowing the part decoder to extract visual features in a way that is conditioned on open-vocabulary semantics.
This design enables the open-set generalization capability and allows the model to process novel categories and part-object relationships.
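As a concrete illustration of how queries are compared against the text embeddings of candidate category labels for open-vocabulary classification, here is a minimal sketch; the temperature value and function names are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn.functional as F

def classify_queries(queries: torch.Tensor,       # (N, D) language-aligned object/part queries
                     label_embeds: torch.Tensor,  # (C, D) text embeddings of candidate labels
                     temperature: float = 0.07) -> torch.Tensor:
    """Open-vocabulary classification by cosine similarity to category text embeddings."""
    q = F.normalize(queries, dim=-1)
    t = F.normalize(label_embeds, dim=-1)
    logits = q @ t.T / temperature   # (N, C) similarity logits
    return logits.softmax(dim=-1)    # probabilities over the open vocabulary
```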
4. Limitation of Previous Methods
We clarify the limitations of the existing methods. It is recognized that CLIP’s embeddings have limited capacity for compositional understanding and for modeling object-part hierarchies [1, 38, 48]. As a result, CLIP-based methods such as OV-Part [42] and PartCLIPSeg [7] tend to show suboptimal performance in fine-grained part segmentation. To address this, PartCatSeg augments the model with DINOv2 for structural guidance. However, the approach still lacks an explicit mechanism for modeling object-part hierarchies grounded in language, which is critical for compositional generalization.
PartGLEE [20], another existing work, has an inherent limitation by design in handling part-granularity variation across datasets. It parses objects into parts through a Q-Former learned from scratch, which cannot handle object-part granularities unseen during training. For example, trained with fine-grained annotations such as "eye", "nose", and "ear" for the cat category on Pascal-Part-116, PartGLEE performs poorly when evaluated on PartImageNet, where the cat is annotated only with coarser parts ("head", "body", "foot", and "tail").
We will revise the part mentioned by the reviewer to make it clear.
5. Training of Visual Backbones and Pixel Decoder
It is indeed necessary to train the visual backbone and pixel decoder. The visual backbone is from Mask2Former (CVPR'22), pretrained only on the object-level segmentation task and thus limited in segmenting fine-grained, part-level instances. Therefore, we train the backbone on part-level datasets to grant it part-aware segmentation ability. This practice is actually not uncommon in the literature on part-level segmentation (e.g., [20, 40]).
We further conducted an ablation study on the freezing/training of the visual backbone and the pixel decoder. As shown in Table R-1 and R-2, training the visual backbone and the pixel decoders leads to better performance in part-level segmentation and higher overall AP.
| Method | PPS-116 | - | - | +INS | - | - | +INS+PART | - | - | PartImageNet | - | - |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | obj | part | AP | obj | part | AP | obj | part | AP | obj | part | AP |
| Frozen Bk+Pd | 48.2 | 6.99 | 16.1 | 64.1 | 8.85 | 21.1 | 66.1 | 9.34 | 22.0 | 80.6 | 30.1 | 41.3 |
| Frozen Bk | 47.6 | 7.36 | 16.3 | 63.0 | 6.98 | 19.4 | 63.4 | 12.4 | 23.8 | 83.2 | 34.7 | 45.4 |
| LangHOPS | 49.1 | 8.62 | 17.6 | 61.8 | 13.6 | 24.3 | 62.7 | 17.0 | 27.1 | 85.5 | 47.9 | 55.8 |
Table R-1: Ablations on freezing backbone (Bk) and pixel decoder (Pd). Cross-dataset and in-domain evaluation, PPS-116 -> PartImageNet (with Random seed 4337).
| Method | PartImageNet | - | - | +INS | - | - | +INS+PART | - | - | PPS-116 | - | - |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | obj | part | AP | obj | part | AP | obj | part | AP | obj | part | AP |
| Frozen Bk+Pd | 10.5 | 1.71 | 3.05 | 23.4 | 2.57 | 5.73 | 23.2 | 2.76 | 5.86 | 53.3 | 7.48 | 17.7 |
| Frozen Bk | 11.8 | 1.95 | 3.45 | 23.3 | 2.92 | 6.01 | 23.0 | 3.34 | 6.32 | 44.2 | 7.65 | 15.8 |
| LangHOPS | 11.3 | 2.17 | 3.47 | 21.9 | 3.82 | 6.55 | 23.8 | 4.16 | 7.13 | 56.4 | 15.3 | 21.4 |
Table R-2: Ablations on freezing backbone and pixel decoder. PartImageNet -> PPS-116.
6. Differentiation from PartGLEE
Language-Space Object-Part Representation, a core innovation of our method, distinguishes ours from PartGLEE. Unlike PartGLEE, which uses a fixed set of learnable part queries processed by a Q-Former, our method dynamically constructs part queries in language space, conditioned on both the object context and a user-defined list of candidate part categories.
For each object query from the object segmentor, we iterate over candidate part labels and concatenate each text embedding with the object query to form context-aware initial part queries. These are then further refined by a multimodal LLM in Section 3.5. To address the reviewer's concern, we conduct an ablation study by replacing the language-grounded part query initialization with fixed learnable tokens as in PartGLEE and use them as the input to the MLLM, which drops the cross-dataset performance significantly compared to the proposed method:
| Method | PPS-116 | - | - | +INS | - | - | +INS+PART | - | - | PartImageNet | - | - |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | obj | part | AP | obj | part | AP | obj | part | AP | obj | part | AP |
| w. LQ | 46.8 | 8.10 | 16.7 | 58.2 | 11.8 | 22.1 | 58.9 | 13.9 | 23.9 | 81.3 | 46.7 | 54.4 |
| LangHOPS | 49.1 | 8.62 | 17.6 | 61.8 | 13.6 | 24.3 | 62.7 | 17.0 | 27.1 | 85.5 | 47.9 | 55.8 |
Table R-3: Ablations on Learnable Query. Cross-dataset and in-domain evaluation, PPS-116 -> PartImageNet (with Random seed 4337). LangHOPS w. LQ means that the "Language-Space Object-Part Representation" is replaced by the PartGLEE-like learnable queries concatenated with object queries.
| Method | PartImageNet | - | - | +INS | - | - | +INS+PART | - | - | PPS-116 | - | - |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | obj | part | AP | obj | part | AP | obj | part | AP | obj | part | AP |
| w. LQ | 11.0 | 1.95 | 3.32 | 28.6 | 3.02 | 6.54 | 22.7 | 4.26 | 7.05 | 54.6 | 9.65 | 19.6 |
| LangHOPS | 11.3 | 2.17 | 3.47 | 21.9 | 3.82 | 6.55 | 23.8 | 4.16 | 7.13 | 56.4 | 15.3 | 21.4 |
Table R-4: Ablations on Learnable Query. PartImageNet -> PPS-116.
Thank you for your detailed responses. Some of my previous concerns have been addressed, but I still find some points unconvincing. Regarding the motivation, the authors should carefully state the limitations of current research, particularly for the "new" task; a simple combination of A and B does not answer why the authors introduce part- and object-level segmentation. Regarding the connection, I do not agree that the MLLM-processed queries link objects and parts for the part decoder. Regarding the language grounding, the authors claim that the text embeddings obtained from the prompts provide the linguistic alignment with the visual parts, yet this concept does not fit this case well. Therefore, I keep my original rating.
We appreciate the reviewer's comments, which have helped us better understand and address the remaining concerns.
Motivation Clarification
We agree it is important to clearly articulate the motivation behind the proposed task and method. As stated in the main paper (line 19-25, 71-72) and illustrated in the rebuttal examples (1. Motivation Clarification), the motivation of OVPIS lies not only in the novel hierarchical design for addressing object-part segmentation challenges but also in its potential impact on downstream applications such as image editing and robotic manipulation at the part level. Following the reviewer’s suggestion, we have further emphasized this motivation in the third paragraph of the introduction in the revised paper.
MLLM for the link between multi-granularity concepts
We appreciate the request for clarification. The proposed MLLM receives the object query followed by part queries (as described in line 193: e.g., object_1 with part_1, part_2, ... ) and outputs refined part queries that integrate both object- and part-level information. This design not only enables the MLLM to leverage object-level context to infer part semantics, but also allows bidirectional information flow - from parts back to the object - during training (see Section 4.3, Object-Part Synergy).
We have clarified this mechanism in Section 3.5 of the revised version.
Language-grounded Representation Clarification
We understand the reviewer’s concern and would like to clarify the language grounding process:
- In our method, language grounding is not limited to pure text embeddings. Indeed, both the object query and part query incorporate visual information and are aligned with category text embeddings for classification and segmentation tasks.
- The statement “text embeddings are the linguistic alignment with visual parts” was imprecise. In fact, the initial part queries are formed by concatenating object queries with text embeddings and then refined via the MLLM, resulting in queries enriched with both language semantics and the visual information of specific objects.
- A more accurate description would be: language-grounded object-part representations are applied as additional semantic information to form visual-language queries to facilitate open-vocabulary segmentation.
We have incorporated this clarification into the Introduction and Section 3.4 of the revised paper.
We sincerely appreciate the reviewer’s comments and suggestions, which allow us to further improve the motivation of the paper and the clarity of the method design with
- a further highlighted motivation with downstream applications in the introduction,
- a clarified MLLM-based model design,
- and a detailed description of the language-grounded object-part representation.
We hope these improvements address the remaining concerns and clarify the novelty and contributions of our work. We respectfully ask the reviewer to reconsider the overall rating in light of the revision.
Dear Reviewer,
Thank you for your dedicated efforts in reviewing this paper. We are currently in the reviewer-author discussion phase, but we have not yet seen your engagement.
This year's Responsible Reviewing initiative requires all reviewers to communicate with authors during this period, emphasizing that ghosting is not acceptable. We kindly ask that you reply and engage with the authors. Please note that participation in discussions with authors is mandatory before submitting "Mandatory Acknowledgement," as submitting it without any engagement is not permitted in this review cycle.
Best, AC
Dear Reviewer,
We would like to express again our sincere appreciation for your review of our work. We hope that we have addressed your concerns and remain committed to further improving the paper as promised. Please feel free to reach out with any additional questions or suggestions.
The paper ultimately received three positive reviews (two Borderline Accepts and one Accept) and a single negative review (Borderline Reject).
The positive reviewers found that their concerns were fully addressed during the post-rebuttal discussion, leading two of them to raise their scores. While one reviewer remained explicitly negative, the authors provided thorough and convincing responses that the other reviewers found to sufficiently address the concerns regarding novelty, motivation, and technical contributions.
Overall, the AC believes the merits of the paper outweigh its flaws and is therefore recommending acceptance.