LangHOPS: Language Grounded Hierarchical Open-Vocabulary Part Segmentation
We propose LangHOPS, the first Multimodal Large Language Model (MLLM)-based framework for open-vocabulary object–part instance segmentation.
Abstract
Reviews and Discussion
The authors propose a new method for Open-Vocabulary Part Instance Segmentation (OVPIS), called LangHOPS, which uses hierarchical modeling in language space to encode object parts. LangHOPS performs better than existing state-of-the-art methods across multiple benchmarks.
Strengths and Weaknesses
Strengths:
- LangHOPS integrates MLLMs for OVPIS and grounds object-part hierarchies in language space. This is novel and superior to prior methods that struggle with modeling hierarchical relationships or handling part granularity variations.
- The method achieves strong empirical results over baselines.
Weaknesses:
- The object part granularity can be at different levels (as also seen in supplementary Figure 4). How are such ambiguities accounted for during evaluation? For example, if the face is a sub-part of a human, what happens if the model predicts parts at a higher granularity, such as eyes, nose, mouth, etc.?
- The paper's mathematical notation is formatted incorrectly and inconsistently throughout. This hampers readability, and I strongly encourage the authors to correct these notation errors.
Questions
Please see weaknesses
Limitations
Yes
Justification for Final Rating
I thank the authors for their response. The authors have sufficiently addressed my concerns, hence I maintain my positive rating.
Formatting Issues
None that I could find.
We appreciate the reviewer’s positive feedback on our work, especially the recognition of the novelty of integrating MLLMs deeply into the OVPIS task and of the resulting handling of varying part granularity. Furthermore, our core contribution of grounding object-part hierarchies in the language space, as well as the performance gains over prior baselines, is well acknowledged.
We address the reviewer's concerns in the following sections.
1. Granularity Variation
Part granularity is guided by text and can be flexibly specified by the user or by the dataset annotations used for training and evaluation in our approach.
- For example, during evaluation, if the ground truth annotation specifies “face” as the part of interest for a “human” object, our model constructs the corresponding part query by concatenating the text embedding of “face” with that of “human”, resulting in a semantically aligned object-part query (e.g., “human-face”).
- This design allows the model to be explicitly guided to the appropriate level of granularity, depending on the dataset’s supervision or user input. Moreover, our cross-dataset evaluation (Tables 1 and 2 main paper) is specifically intended to test this capability: different datasets may annotate parts at different levels of detail, and our model’s language-grounded hierarchy allows it to adapt accordingly.
- This granularity awareness is a key reason why our method consistently outperforms baseline models in the cross-dataset setting.
We will include more visualization of the model’s predictions on the same image but under different specified part granularities to better illustrate the model’s flexibility in handling varying granularity levels.
2. Notation
We appreciate the reviewer’s feedback and agree that the mathematical notations should be clear and consistent. We have carefully revised all equations and these corrections will be reflected in the updated version.
I thank the authors for their response. The authors have sufficiently addressed my concerns, hence I maintain my positive rating.
Dear Reviewer,
Thank you for your dedicated efforts in reviewing this paper. We are currently in the reviewer-author discussion phase, but we have not yet seen your engagement.
This year's Responsible Reviewing initiative requires all reviewers to communicate with authors during this period, emphasizing that ghosting is not acceptable. We kindly ask that you reply and engage with the authors. Please note that participation in discussions with authors is mandatory before submitting "Mandatory Acknowledgement," as submitting it without any engagement is not permitted in this review cycle.
Best, AC
Dear Reviewer,
We would like to express again our sincere appreciation for your positive review of our work. We hope that we have addressed your concerns and remain committed to further improving the paper as promised. Please feel free to reach out with any additional questions or suggestions.
- This paper presents LangHOPS, a new framework for Open-Vocabulary Part Instance Segmentation (OVPIS).
- LangHOPS explicitly embeds object–part hierarchies in language space, providing a structured representation of object-part relationships.
- The method integrates a Multimodal Large Language Model (MLLM) to enable context-aware and granularity-adaptive object-part parsing.
- Compared to previous works based on heuristics or constrained CLIP embedding space reasoning, LangHOPS introduces language-grounded part queries for accurate and scalable instance-level segmentation.
Strengths and Weaknesses
Strengths
- The paper introduces the first integration of Multimodal Large Language Models (MLLMs) for the task of object-part instance segmentation, demonstrating originality.
- By explicitly modeling language-grounded object–part hierarchies, LangHOPS improves part query representation, enabling enhanced multi-granularity reasoning and context awareness compared to prior CLIP-based methods.
- The experimental evaluation covers major benchmarks, including PartImageNet, Pascal-Part-116, and ADE20K, and demonstrates scalability, with further performance gains observed when incorporating additional training data.
Weaknesses
- The proposed approach appears to be an extension of PartGLEE with an integrated MLLM, but the architectural or training-level novelties beyond this integration are not clearly articulated. A more precise explanation of additional contributions, if any, would strengthen the claim of originality.
- Since LangHOPS relies on an MLLM (specifically, PaliGemma2), it would strengthen the work to provide quantitative comparisons regarding computational resources, such as model parameters and inference costs, relative to baseline methods.
- In Section 3.2 Method Overview, the sub-section title "Object Segmentation" is redundantly repeated in Lines 130 and 133, which should be clarified. Furthermore, inconsistencies in terminology between Figure 2, Section 3.2, and subsequent subsections (e.g., Object-Part Parsing, Part Segmentation) could lead to confusion and should be aligned.
- As noted in the NeurIPS Paper Checklist [7. Experiment statistical significance], the experimental results lack information on error bars or measures of variability, limiting the assessment of result consistency and robustness.
- The NeurIPS Paper Checklist indicates "[5. Open access to data and code] Yes," but I could not find a clear link or reference for accessing the released data and code at the time of reviewing.
- Minor typos present in the manuscript:
- Line 50: produing → producing
- Line 131: featrues → features
- Line 243: dateset → dataset
- Table 1: artImageNet → PartImageNet
Questions
- In Table 1 and Table 2, experiments are conducted using Pascal-Part-116, which, to my understanding, only provides semantic part segmentation annotations (e.g., "bus's headlight"). Could the authors clarify how part instance segmentation (e.g., "bus 1's headlight 1") is performed using this dataset? Specifically, how was the OVPIS task implemented on PPS-116, which lacks instance-level part annotations?
- In Figure 1, the second example (two people walking) appears to demonstrate semantic segmentation, rather than instance-level segmentation, particularly for the parts. Could the authors clarify whether this example illustrates part instance segmentation or standard semantic segmentation?
- In Line 192, the paper provides an example of the structured prompt input for the MLLM-based object-part parsing. Could the authors elaborate on what specific outputs are produced by the MLLM at this stage?
Limitations
yes
Justification for Final Rating
In my original review, I rated the paper as a borderline reject (score 3) due to concerns about clarity, architectural novelty beyond MLLM integration, and presentation inconsistencies.
The rebuttal and subsequent revisions have directly addressed these points.
The authors clarified how their language-grounded object–part hierarchy differs fundamentally from prior approaches, provided instance-specific representation examples, and reorganized the method section for clearer structure and terminology.
They also added statistical robustness analyses, clarified computational costs, and fixed typographical and notation issues.
The method remains the first to integrate an MLLM into OVPIS with consistent SOTA results across benchmarks, and the strengthened presentation makes the contribution clearer and more convincing.
While some improvements could still be made in presentation polish, the technical merit and demonstrated effectiveness now outweigh earlier reservations.
I am therefore raising my rating from weak reject (3) to 4 (weak accept).
Formatting Issues
N/A
We appreciate the reviewer highlighting the originality of our work in introducing language-grounded object-part hierarchies and MLLMs into the object-part instance segmentation task, as well as acknowledging, in line with the other reviewers, the experiments that verify our approach and show its superior performance.
1. Novelty Clarification
Our work is not an extension of PartGLEE with an integrated MLLM; rather, it proposes novel intermediate representations that enable language-grounded reasoning, together with the corresponding training strategy. In detail, our additional contributions are as follows:
- Our method introduces language-grounded modeling throughout the whole architecture, closely coupling object and part decoding as well as query generation in the same space. Furthermore, the language-grounded object-part hierarchy modeling replaces the brute-force hierarchical representation using a fixed set of parts in the Q-Former of PartGLEE's architecture. Overall, this design enables adaptive part hierarchies in our framework and OVS adaptation to varying part granularity beyond the training data.
- We explore the recipes of both one-stage and two-stage training strategies. We provide an empirical study (Table 7 in the submission) showing that the one-stage training strategy leads to better in-domain performance while the two-stage one leads to better cross-dataset performance. This empirical finding suggests that decoupling coarse-to-fine supervision helps improve generalization in tasks involving hierarchical modeling, offering practical insights for similar structured perception problems.
- Architectural Verification: We conducted experiments to verify and derive our architecture against a direct extension of PartGLEE, PSALM, and PartCatSeg. Most importantly, our experiments (Table 4 main paper) show that our design improves performance far beyond only utilizing the additional capacity and pretraining of the MLLM, but rather builds a framework around it to effectively utilize its generalist capabilities. We add these verification experiments to the final version of our paper.
Together, these contributions show that our method is not merely an MLLM extension of PartGLEE, but introduces important innovations in representation design and training methodology that contribute to its effectiveness.
2. Parameters and Inference Cost
We now report the footprint of GPU hours, carbon cost, inference cost, and model size of PartGLEE, PSALM, and LangHOPS in Table R-9. The GPU hours and inference time are reported for an NVIDIA H200 GPU. The spec power (700 W) of the H200 and the world-average carbon intensity of electricity (~0.475 kg CO₂e / kWh) are used for calculating the footprint.
| Method | Model Size | Training GPU Hours | Training Footprint (kg CO₂e) | Inference Time (ms) |
|---|---|---|---|---|
| PSALM† | 1.5B | 92 | 30.6 | 628 |
| PartGLEE | 1B | 40 | 13.3 | 240 |
| LangHOPS | 4B | 72 | 23.9 | 396 |
Table R-9: Specs of LangHOPS and the baselines. PPS116 + INS + PART -> PartImageNet.
The Table shows that LangHOPS has the largest model size, mainly due to the usage of MLLM (Paligemma2-3B). PSALM† has the longest training time and carbon footprint since it trains the LLM instead of using LoRA, and needs to process all candidate category names, which leads to long input prompts to the LLM. LangHOPS achieves the best performance with reasonable training and inference cost compared to the baselines.
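For transparency, the footprint numbers above follow directly from the stated assumptions (700 W board power, 0.475 kg CO₂e/kWh world-average carbon intensity); a minimal sketch of the calculation:

```python
# Minimal sketch of the footprint calculation behind Table R-9, assuming the
# stated 700 W H200 spec power and 0.475 kg CO2e/kWh world-average intensity.
GPU_POWER_KW = 0.7
CARBON_INTENSITY_KG_PER_KWH = 0.475

def training_footprint(gpu_hours: float) -> float:
    """Estimated training footprint in kg CO2e from GPU hours."""
    return gpu_hours * GPU_POWER_KW * CARBON_INTENSITY_KG_PER_KWH

for method, hours in [("PSALM", 92), ("PartGLEE", 40), ("LangHOPS", 72)]:
    print(f"{method}: {training_footprint(hours):.1f} kg CO2e")
    # -> PSALM: 30.6, PartGLEE: 13.3, LangHOPS: 23.9 (matching Table R-9)
```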
3. Terminology Inconsistency
We appreciate the reviewer's correction. The second "Object Segmentation" in L133 is indeed redundant, and that paragraph is part of object-part parsing. We will address them in the revised version of the paper. "Language-Space Object-Part Representation" and "MLLM-based Object-Part Parsing" are actually both part of "Object-Part Parsing" in Figure 2. We understand that it can cause confusion, and we will clarify this.
4. Error Bars
We have now included mean ± standard deviation results computed over three random seeds (20, 43, and 4337) for both LangHOPS and PartGlee, the strongest baseline.
These results are reported in Tables R-5 and R-6. As shown, the averaged AP values closely align with the originally reported numbers, slightly increasing in most experiments for our method. The standard deviations have a low magnitude, indicating that our results are stable and show a significant improvement over the PartGlee baseline.
| Method | PPS-116 | - | - | +INS | - | - | +INS+PART | - | - | PartImageNet | - | - |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | obj | part | AP | obj | part | AP | obj | part | AP | obj | part | AP |
| PartGlee ± | 38.4±0.5 | 8.61±0.46 | 15.2±0.5 | 57.6±1.8 | 11.3±0.2 | 21.5±0.6 | 60.1±0.9 | 10.5±0.7 | 21.5±0.7 | 82.0±0.5 | 42.5±0.8 | 51.3±0.7 |
| LangHOPS± | 48.7±3.3 | 8.89±0.24 | 17.7±0.9 | 60.9±0.7 | 12.1±1.0 | 22.9±1.0 | 63.7±1.4 | 16.6±0.3 | 27.0±0.5 | 84.1±1.0 | 48.2±0.8 | 56.1±0.8 |
Table R-5: PPS116→PartImageNet with standard deviation (±) of the proposed method LangHOPS± and the strongest baseline PartGlee±.
| Method | PartImageNet | - | - | +INS | - | - | +INS+PART | - | - | PPS-116 | - | - |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | obj | part | AP | obj | part | AP | obj | part | AP | obj | part | AP |
| PartGlee ± | 8.53±0.52 | 2.05±0.09 | 3.04±0.16 | 23.3±0.1 | 3.10±0.16 | 6.17±0.16 | 22.5±0.4 | 3.60±0.24 | 6.46±0.26 | 53.6±1.5 | 13.8±0.6 | 19.9±0.7 |
| LangHOPS ± | 11.0±1.0 | 2.17±0.03 | 3.50±0.18 | 22.9±0.7 | 3.68±0.11 | 6.59±0.19 | 23.2±0.5 | 4.51±0.25 | 7.34±0.28 | 55.0±1.1 | 15.0±0.2 | 21.1±0.3 |
Table R-6: PartImageNet→PPS116 with standard deviation..
5. Code Release
We will make the code public upon the paper’s release. Due to IP restrictions, it was not possible to release the code during the review period.
6. Typos
We thank the reviewer for reporting the typos. We will address them and go through the paper carefully, correcting them.
7. Instance-level Annotation
The reviewer is correct that the original Pascal-Part-116 [42] dataset only provides semantic part segmentation annotations. However, as detailed in Appendix B of PartGLEE [20] and its official git repository (assets/DATA.md), PartGLEE further processes these annotations to produce instance-level part segmentation labels, which support the OVPIS task. Our experiments in Table 1 and Table 2 use this PartGLEE-provided version of the PPS-116 dataset, which includes part-instance-level annotations. We will clarify this in the revised version of the paper to avoid confusion.
8. Visualization Clarification
The visualizations in Figure 1 are part instance segmentation outputs. For readability, we display all parts of the same type in the same color, allowing the reader to re-identify objects, and show only a single label per part class to avoid clutter. We thank the reviewer for pointing out this potentially distracting visualization and will provide a more conventional visualization showing the separation of instances in the final paper.
9. MLLM Output
The MLLM outputs are embeddings of MLLM-processed prompt tokens, and we extract part queries from those embeddings. Specifically, we feed a structured language prompt—containing object-part hierarchical information—into the MLLM. Unlike typical generative usage, the MLLM here returns token-level embeddings for the entire input sequence, with the output length matching the input. We then extract the embeddings at the positions corresponding to the initial part query tokens. These embeddings are used as the refined, language-informed part queries, which are passed to the visual part decoder.
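To make this concrete, below is a minimal sketch of the extraction step, assuming a transformers-style MLLM whose forward pass returns per-token hidden states; the helper names and tensor shapes are illustrative assumptions rather than the exact implementation:

```python
import torch

def refine_part_queries(mllm, prompt_embeds: torch.Tensor,
                        part_query_positions: list[int]) -> torch.Tensor:
    """Sketch of the non-generative MLLM usage described above.

    prompt_embeds        : (seq_len, dim) embeddings of the structured prompt,
                           i.e. object queries interleaved with initial part queries.
    part_query_positions : indices of the initial part-query tokens in the prompt.
    """
    # The MLLM returns one embedding per input token (output length == input length).
    token_embeds = mllm(inputs_embeds=prompt_embeds.unsqueeze(0)).last_hidden_state[0]
    # Keep only the positions of the initial part-query tokens; these become the
    # refined, language-informed part queries passed to the visual part decoder.
    return token_embeds[part_query_positions]
```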
Thank you for your detailed and thoughtful responses to my concerns.
The clarifications regarding novelty, parameter and inference cost (2.), and error bars (4.) were especially helpful.
However, there remain a few points for which I would appreciate further clarification or offer additional comments below.
1. Novelty Clarification
1-1.
In your response, you mention:
“the language-grounded object-part hierarchy modeling replaces the brute-force hierarchical representation using a fixed set of parts in the Q-Former of PartGLEE's architecture.”
In this context, could you clarify, with an explicit example, whether bus1’s wheel1, bus1’s wheel2, and bus2’s wheel1 are represented identically or differently in your language space E?
Further, in the open-vocabulary setting described in your problem definition, the object-part categories and candidate part classes for each object are assumed to be predefined.
In such a scenario, how does your approach for part representation (e.g., as in Section 3.4) fundamentally differ from simply concatenating the CLIP text embeddings of the object and part name (i.e., E(C_bus) + E(C_window))?
A concrete example or further clarification on this point would be very helpful.
1-2.
The analysis on training strategies provided in the supplementary material is indeed meaningful.
However, since this aspect is not thoroughly analyzed or discussed in the main text, I believe it does not stand out as a main contribution of the work.
3. Clarity and Consistency of Presentation
As you noted, the additional “Object Segmentation” subsection title is a typographical error.
Beyond that, however, there are broader issues with terminology and consistency throughout the manuscript.
For example, the use of “Object Segmentation” vs. “Object-level Segmentation” in subsection titles and the main figure appears inconsistent and reduces the paper’s overall clarity.
Another example is Section 3.4 “Language-Space Object-Part Representation”, which in the main figure is labeled as “Hierarchical Object-Part Representation”.
The correspondence between subsection titles and figure components is structurally ambiguous, and terminology is not used consistently across the manuscript and figures, which may confuse readers (for example, as previously noted, Sections 3.4 and 3.5 are in fact subcomponents of “Object-Part Parsing,” but this hierarchy is not made explicit in the main text).
For instance, in Figure 2, if “MLLM-based Parsing” inside “Object-Part Parsing” refers to Section 3.5, and “Hierarchical Object-Part Representation” refers to Section 3.4, the hierarchy and terminology between text and figure are not consistent.
As Reviewer Ku9B also mentioned, notations lack consistency throughout the paper, which further undermines the clarity and completeness of the presentation. (for example, C_{} sometimes being italicized and sometimes not)
Taken together, the structural inconsistencies between the main text and figures, the lack of terminological coherence, the frequent typographical errors, and the misused subsection titles all contribute to a sense of incompleteness in the manuscript.
6. Additional Typos and Inconsistencies Noted
- Figure 2: “obj-art” should be “obj-part”
- Figure 2: Capitalization of “C_{obj-part}” is inconsistent with the main text
- Figure 2: “Object Segmentation” vs. “Object-level Segmentation” should be unified
- Figure 2: “Hierarchical Obj-Part Rep” should be inside the bold pink box for consistency with “MLLM-based Parsing”
- Table 1 and 2: “PartGlee” → “PartGLEE”
- Table 3: “PartCatSeg” → “PartCATSeg”
- “PaliGemma2” → “PaliGemma 2”
While I recognize some of the novelty of the proposed approach and its demonstrated effectiveness, I find that the issues in overall flow and consistency, particularly the structural ambiguities and lack of clarity in connecting sections and figures, significantly limit the impact of the contribution at this stage.
Therefore, despite its strengths, I must still regard this as a borderline paper.
Thank you again for your detailed response and engagement.
Dear Reviewer,
Thank you for your dedicated efforts in reviewing this paper. We are currently in the reviewer-author discussion phase, but we have not yet seen your engagement.
This year's Responsible Reviewing initiative requires all reviewers to communicate with authors during this period, emphasizing that ghosting is not acceptable. We kindly ask that you reply and engage with the authors. Please note that participation in discussions with authors is mandatory before submitting "Mandatory Acknowledgement," as submitting it without any engagement is not permitted in this review cycle.
Best, AC
We appreciate the reviewer's further constructive comments and suggestions. We agree that clarifying the raised questions is important for improving the quality and clarity of our work. Below, we provide our detailed responses.
1. Novelty Clarification
Architecture Novelty Clarification
In our model, the embeddings of bus1’s wheel1 and bus2’s wheel1 are distinct in the language space E, while the embeddings of bus1’s wheel1 and bus1’s wheel2 are identical. Specifically, the part embedding is constructed by combining the object query and the CLIP text embedding of the part name “wheel”. Importantly, the object query is not just a category-level embedding; it encodes instance-specific visual features, as it is used to generate the segmentation mask for that specific bus1 instance.
- Consequently, E(bus1’s wheel1) ≠ E(bus2’s wheel1), while E(bus1’s wheel1) = E(bus1’s wheel2).
This differs fundamentally from the alternative approach of simply concatenating the CLIP embeddings of the object and part names (e.g., CLIP(“bus”) + CLIP(“wheel”)), which would not capture instance-level visual features and variation and would lead to ambiguity, especially in scenes with multiple instances of the same object type.
Following the reviewer’s helpful suggestion, we have clarified this point explicitly in Section 3.4 of the revised paper.
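To make the construction concrete, below is a minimal sketch of how instance-specific part queries can be formed from an object query and CLIP part-name embeddings; the helper `clip_text_embed` and the tensor shapes are illustrative assumptions, not the exact implementation.

```python
import torch

def build_part_queries(object_query: torch.Tensor,
                       part_names: list[str],
                       clip_text_embed) -> torch.Tensor:
    """Sketch: initial part queries for ONE detected object instance.

    `object_query` is the instance-specific embedding produced by the object
    decoder (e.g. the query that segments bus1), so the queries built for bus1
    and bus2 differ, while two wheels of the same bus share one initial query.
    """
    # Text embeddings of the candidate part names, e.g. ["wheel", "window", ...].
    text_embeds = torch.stack([clip_text_embed(name) for name in part_names])
    # Repeat the object query once per candidate part.
    obj = object_query.unsqueeze(0).expand(len(part_names), -1)
    # Concatenate instance-specific object features with part-name text features.
    return torch.cat([obj, text_embeds], dim=-1)  # (num_parts, d_obj + d_text)
```

These initial queries are then refined by the MLLM (Section 3.5) before being passed to the part decoder.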
Training Strategy
We agree that the training strategy analysis provides useful insight. In response to the reviewer’s comment, we have moved the relevant analysis to Section 4.3 in the main paper from the supplementary and explicitly discussed its impact on performance. While it may not be the central novelty of our method, we believe it adds valuable context and guidance for future work, and we are grateful for the reviewer’s encouragement to better highlight it.
3. Clarity and Consistency of Presentation
We appreciate the reviewer's suggestions for improving the presentation of the paper. In the revised paper, we have
- corrected the duplicate “Object Segmentation” subsection title;
- ensured consistency and clarity of the method components by
- (a) replacing “object-level segmentation” with “object segmentation” in Sections 3 and 4 for consistency;
- (b) unifying the name of Section 3.4 in Figure 2 to “Language-Space Object-Part Representation”;
- (c) clarifying the hierarchy between “Object-Part Parsing” and Sections 3.4 and 3.5 in the method overview;
- (d) adding tags of the subsection numbers to the corresponding modules in Figure 2;
- ensured consistency of the mathematical notation and terminology.
6. Additional Typos and Inconsistencies Noted
We have addressed all the typos and inconsistencies noted by the reviewers. In addition, we have carefully proofread the paper to ensure its clarity and coherence.
We sincerely appreciate the reviewer’s meticulous suggestions and thoughtful feedback, which have significantly improved the quality and presentation of the paper. In the revised version, we have made significant efforts to address these issues with
- a clarified model design and an analysis of training strategies in the main paper;
- a refined structure of the method section and clear figure–text alignment;
- standardized notation and consistent terminology;
- corrected typos and polished phrasing.
We hope these changes effectively address the reviewer’s concerns and make the contributions of the paper more accessible and impactful. We deeply appreciate the reviewer’s acknowledgment of the strengths and novelty of our approach, and we respectfully ask for reconsideration of the overall rating in light of the improvements.
Thank you for the detailed revisions, including clearer architectural novelty, explicit instance-specific representation examples, improved training strategy presentation, resolved terminology and figure alignment issues, and added robustness and resource analyses.
These effectively address my earlier concerns, so I am raising my recommendation from weak reject (3) to weak accept (4).
The paper introduces LangHOPS, the first framework that tackles open-vocabulary object–part instance segmentation (OVPIS) by (i) explicitly encoding object–part hierarchies in CLIP language space and (ii) passing these language-grounded queries through a lightweight MLLM (PaliGemma-2) to refine part queries before decoding. Experiments on three benchmarks (PartImageNet, PascalPart-116, ADE20K) show consistent state-of-the-art gains: +5.5 AP (in-domain) and +4.8 AP (cross-dataset) on PartImageNet, and +2.5 mIoU on unseen ADE20K parts. Ablations confirm that both the language-grounded hierarchy and the MLLM parser are indispensable and that joint training yields positive object–part synergy.
Strengths and Weaknesses
Strengths:
1). Solid empirical evidence: thorough in-domain, cross-dataset, zero-shot, scalability and ablation studies.
2). Clear Implementation details: two-stage training, clear loss formulations and hyper-parameters.
3). Significance and originality: First to integrate an MLLM for OVPIS; shows that language-grounded hierarchies boost both part and object quality; could inspire similar hierarchical designs in other dense tasks. Combines hierarchical language modelling with MLLM-driven query refinement; introduces explicit synergy analysis.
Weaknesses
1). Error bars / statistical significance are missing; some ablations (e.g., prompt sensitivity, MLLM size) are limited; compute cost for full training set (INS + PART) not fully quantified.
2). Impact depends on availability of strong MLLMs; benefit over strong vision-only baselines (e.g., clipped Q-Former) is moderate on ADE20K (49.5 hIoU vs 50.0 of PartCatSeg).
3). Repetition/typos (e.g., duplicated “Object Segmentation” heading), table captions compressed, and some notation (Np, H) introduced late.
Questions
1). Statistical robustness – Please report confidence intervals (e.g., ± std over three seeds) for main tables. A ≥ 3 pt drop might change my rating.
2). Prompt sensitivity – How stable is performance to wording or ordering of the structured prompt? A study on a held-out set would strengthen claims of robustness.
3). Compute footprint – Provide GPU hours and carbon cost for each training stage and for the large INS + PART run; if substantially larger than baselines, discuss trade-offs.
4). Hierarchy construction – The method assumes a clean object→part taxonomy. How does LangHOPS behave when candidate lists are noisy or overlapping? An experiment with automatically mined candidates would be helpful.
5). Synergy mechanism – The paper shows gains but not why gradients from parts help objects. Visualizing attention maps before/after joint training could clarify this; stronger evidence could raise my score.
Limitations
The paper openly notes that synergy is implicit and that the role of language-guided hierarchies on object quality is not fully characterized (Sec. 4.4). However, potential societal impacts (e.g., misuse of fine-grained segmentation for surveillance) are not discussed; consider adding a Broader-Impacts paragraph.
Justification for Final Rating
The rebuttal resolved my main robustness and cost doubts and provided new evidence of hierarchy tolerance and part-object synergy. The method remains the first to integrate an MLLM into OVPIS and delivers consistent SOTA across three datasets with solid ablations. Remaining weaknesses are either incremental (ADE20K gap) or addressable in the final version (impact statement). Overall, reasons to accept now clearly outweigh reasons to reject.
Formatting Issues
1). Minor duplicated subsection titles (Sec. 3.3/3.4). 2). Table 1 caption typo: “→artImageNet”. 3). Occasional spacing errors around “%” and subscripts.
We thank the reviewer for the highly constructive comments and appreciate the recognition of the significance and originality of our work, together with the highlighting of solid empirical evidence across different settings and clear implementation details. We answer the reviewer's comments as follows.
1. Error Bars
We report mean ± standard deviation results over three random seeds (20, 43, and 4337) for both LangHOPS and PartGlee, the strongest baseline.
As shown in Tables R-5 and R-6, the averaged AP closely align with the originally reported numbers, slightly increasing in most experiments for our method. The standard deviations have a low magnitude, indicating that our results are stable and show a significant improvement over the PartGlee baseline.
| Method | PPS-116 | - | - | +INS | - | - | +INS+PART | - | - | PartImageNet | - | - |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | obj | part | AP | obj | part | AP | obj | part | AP | obj | part | AP |
| PartGlee ± | 38.4±0.5 | 8.61±0.46 | 15.2±0.5 | 57.6±1.8 | 11.3±0.2 | 21.5±0.6 | 60.1±0.9 | 10.5±0.7 | 21.5±0.7 | 82.0±0.5 | 42.5±0.8 | 51.3±0.7 |
| LangHOPS± | 48.7±3.3 | 8.89±0.24 | 17.7±0.9 | 60.9±0.7 | 12.1±1.0 | 22.9±1.0 | 63.7±1.4 | 16.6±0.3 | 27.0±0.5 | 84.1±1.0 | 48.2±0.8 | 56.1±0.8 |
Table R-5: PPS116→PartImageNet with standard deviation (±) of the proposed method LangHOPS± and the strongest baseline PartGlee±.
| Method | PartImageNet | - | - | +INS | - | - | +INS+PART | - | - | PPS-116 | - | - |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | obj | part | AP | obj | part | AP | obj | part | AP | obj | part | AP |
| PartGlee ± | 8.53±0.52 | 2.05±0.09 | 3.04±0.16 | 23.3±0.1 | 3.10±0.16 | 6.17±0.16 | 22.5±0.4 | 3.60±0.24 | 6.46±0.26 | 53.6±1.5 | 13.8±0.6 | 19.9±0.7 |
| LangHOPS ± | 11.0±1.0 | 2.17±0.03 | 3.50±0.18 | 22.9±0.7 | 3.68±0.11 | 6.59±0.19 | 23.2±0.5 | 4.51±0.25 | 7.34±0.28 | 55.0±1.1 | 15.0±0.2 | 21.1±0.3 |
Table R-6: PartImageNet→PPS116 with standard deviation.
2. Prompt Sensitivity
We conducted two ablation studies on the ordering and wording of the structured input prompts to assess the robustness of our method to prompt formulation.
Robustness to prompt ordering: We randomly shuffled (i) the order of object queries, and (ii) the order of part queries within each object, multiple times during inference. For instance, object 3 may appear before object 1, or part queries within an object may be permuted (e.g., "part 9, part 4, part 6"). As shown in Table R-7, our method remains highly stable across these permutations, with minimal performance degradation, demonstrating robustness to input ordering.
| Method | PPS-116 | - | - | +INS | - | - | +INS+PART | - | - | PartImageNet | - | - |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | obj | part | AP | obj | part | AP | obj | part | AP | obj | part | AP |
| Shuffling of Object | 47.8±2.7 | 8.57±0.36 | 17.1±0.9 | 61.1±0.8 | 11.7±0.8 | 22.6±0.8 | 65.1±0.9 | 15.8±0.3 | 26.7±0.4 | 82.1±1.6 | 47.7±1.2 | 55.3±1.3 |
| Shuffling of Part | 46.9±2.4 | 9.08±0.33 | 17.5±0.8 | 58.8±1.1 | 13.6±0.9 | 23.6±0.9 | 64.2±1.6 | 16.9±0.3 | 27.4±0.6 | 81.7±1.0 | 46.2±0.6 | 54.1±0.7 |
| No shuffling | 48.7±3.3 | 8.89±0.24 | 17.7±0.9 | 60.9±0.7 | 12.1±1.0 | 22.9±0.97 | 63.7±1.4 | 16.6±0.3 | 27.0±0.5 | 84.1±1.0 | 48.2±0.8 | 56.1±0.8 |
Table R-7: Ablations on the ordering of the object and part queries - PPS116->PartImageNet.
Robustness to wording: We further test the model's robustness to unseen part names by replacing a subset (from 0% to 100%) of the original part category names with GPT-4o-generated synonyms (e.g., "foot" → "leg"). As shown in Table R-8, LangHOPS significantly outperforms PartGLEE under increasing synonym replacement ratios, indicating strong generalization to semantically similar but unseen phrasing. Note that synonym substitutions may introduce granularity mismatches with the dataset’s ground-truth annotations (e.g., "leg" may exclude "paw" in the ground truth for “foot”), which partially explains the observed performance drop.
| Method | 0% | - | 25% | - | 50% | - | 75% | - | 100% | - |
|---|---|---|---|---|---|---|---|---|---|---|
| | part | AP | part | AP | part | AP | part | AP | part | AP |
| PartGLEE | 11.2 | 21.8 | 9.3 | 20.3 | 8.6 | 19.7 | 6.6 | 18.2 | 5.1 | 17.0 |
| LangHOPS | 17.0 | 27.1 | 16.2 | 26.5 | 16.5 | 26.7 | 14.6 | 25.3 | 12.7 | 23.8 |
Table R-8: Ablation on the robustness to input part category names. PPS116 + INS + PART -> PartImageNet. Different percentages of part category names replaced with GPT-4o generated synonyms.
Together, these studies demonstrate that LangHOPS is robust to both the structure and vocabulary of the input prompt.
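For reference, a sketch of the synonym-replacement protocol behind Table R-8; the function names and the synonym dictionary are illustrative assumptions (in the actual experiment the synonyms were generated with GPT-4o).

```python
import random

def replace_with_synonyms(part_names: list[str],
                          synonyms: dict[str, str],
                          ratio: float,
                          seed: int = 0) -> list[str]:
    """Replace a fraction `ratio` of part category names with synonyms
    (e.g. "foot" -> "leg"), leaving the remaining names unchanged."""
    rng = random.Random(seed)
    num_replaced = round(len(part_names) * ratio)
    replaced = set(rng.sample(range(len(part_names)), num_replaced))
    return [synonyms.get(name, name) if i in replaced else name
            for i, name in enumerate(part_names)]

# Example: evaluate the model with the perturbed vocabulary at each ratio.
# for ratio in (0.0, 0.25, 0.5, 0.75, 1.0):
#     names = replace_with_synonyms(original_part_names, gpt4o_synonyms, ratio)
#     evaluate(model, names)
```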
3. Compute Footprint
We now report the footprint of GPU hours, carbon cost, inference cost, and model size of PartGLEE, PSALM, and LangHOPS in Table R-9. The GPU hours and inference time are reported for an NVIDIA H200 GPU. The spec power (700 W) of the H200 and the world-average carbon intensity of electricity (0.475 kg CO₂e / kWh) are used for calculating the footprint.
| Method | Model Size | Training GPU Hours | Training Footprint (kg CO₂e) | Inference Time (ms) |
|---|---|---|---|---|
| PSALM† | 1.5B | 92 | 30.6 | 628 |
| PartGLEE | 1B | 40 | 13.3 | 240 |
| LangHOPS | 4B | 72 | 23.9 | 396 |
Table R-9: Specs of LangHOPS and the baselines. PPS116 + INS + PART -> PartImageNet.
The Table shows that LangHOPS has the largest model size, mainly due to the usage of MLLM (Paligemma2-3B). PSALM† has the longest training time and carbon footprint since it trains the LLM instead of using LoRA, and needs to process all candidate category names, which leads to long input prompts to the LLM. LangHOPS achieves the best performance with reasonable training and inference cost compared to the baselines.
4. Robustness to Noisy Hierarchy
We test on the common OVS setting using clean object-part hierarchies, but believe in the value of closing the gap towards noisy real-world deployment. To evaluate the robustness of LangHOPS to noisy or automatically mined hierarchies, we replace a portion of the clean object-part taxonomy with GPT-4o-generated object-part hierarchies. These auto-mined hierarchies are constructed solely from the object category names and may introduce ambiguity, inconsistency, or irrelevant parts. In Table R-10, we report performance under varying noise levels: at a given noise level, that percentage of the object categories use noisy hierarchies, while the remaining categories use the clean dataset annotations. We observe that:
- LangHOPS consistently outperforms PartGlee across all noise levels;
- LangHOPS degrades more gracefully as noise increases, maintaining reasonable AP even when all hierarchies are noisy;
- The performance gap widens especially at high noise levels, demonstrating LangHOPS's stronger resilience to imperfect or automatically mined hierarchies.
Please note that the auto-generated hierarchies are often inconsistent with the ground truth annotations in the dataset, leading to lower evaluation metrics. Overall, we agree with the reviewer that developing evaluation protocols for adaptive, task-specific hierarchies remains an open problem and a promising direction for future benchmark design.
| Method | 0% | - | 25% | - | 50% | - | 75% | - | 100% | - |
|---|---|---|---|---|---|---|---|---|---|---|
| | part | AP | part | AP | part | AP | part | AP | part | AP |
| PartGLEE | 11.2 | 21.8 | 10.3 | 21.1 | 9.8 | 20.7 | 8.2 | 19.4 | 3.6 | 15.8 |
| LangHOPS | 17.0 | 27.1 | 13.1 | 24.1 | 12.4 | 23.5 | 8.8 | 20.7 | 6.7 | 19.1 |
Table R-10: Ablations on the noisy hierarchy construction. Different percentages of obj-part hierarchies from the dataset are replaced with GPT-4o generated ones.
5. Synergy Mechanism - Attention Score
We further support the object-part synergy mechanism by reporting the average attention score, since attention maps cannot be provided in the rebuttal format. The average attention score is calculated by summing the attention scores of true-positive predictions inside the ground-truth masks, divided by the area of the masks. The attention is the normalized cosine similarity between object queries and the dense features of the final layer of the object/part decoder. The score reflects the amount of attention correctly assigned by the model to the ground-truth area and is mapped to the range [0, 1].
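A minimal sketch of this score, under the stated assumptions (cosine similarity between queries and dense decoder features; the mapping to [0, 1] via (cos + 1)/2 is one possible choice and is an assumption here):

```python
import torch
import torch.nn.functional as F

def avg_attention_score(queries: torch.Tensor,      # (N, D) queries of true-positive predictions
                        dense_feats: torch.Tensor,  # (D, H, W) final-layer decoder features
                        gt_masks: torch.Tensor) -> float:  # (N, H, W) boolean ground-truth masks
    """Average attention correctly assigned inside the ground-truth masks."""
    d, h, w = dense_feats.shape
    feats = F.normalize(dense_feats.reshape(d, -1), dim=0)   # (D, H*W)
    q = F.normalize(queries, dim=-1)                         # (N, D)
    attn = (q @ feats).reshape(-1, h, w)                     # cosine similarity in [-1, 1]
    attn = (attn + 1) / 2                                    # map to [0, 1]
    per_mask = (attn * gt_masks).sum(dim=(1, 2)) / gt_masks.sum(dim=(1, 2)).clamp(min=1)
    return per_mask.mean().item()
```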
In the setting of PPS116+INS+PART -> PartImageNet, as shown in Table R-11, compared to the "detached object-part seg." setting, the synergized object-part segmentation leads to higher attention scores for both object and part segmentation, providing strong evidence of the synergy between the two segmentation tasks.
| Setting | Detached Obj-Part Seg | Obj-Part Seg in Synergy |
|---|---|---|
| obj attn. score | 0.76 | 0.82 |
| part attn. score | 0.58 | 0.67 |
Table R-11: Attention score of the object and part segmentation.
We will further show the attention maps and the attention scores in more experimental settings in the revision.
6. Comparison to PartCatSeg
We agree that the improvement over PartCatSeg in OVPS is moderate on one specific dataset (ADE20K). However, it is worth noting that our method is developed and trained for object-part-level instance segmentation. Directly evaluating our method on semantic segmentation (Table 3 in the main paper) still leads to superior performance (hIoU) on the PPS-116 and PartImageNet datasets, showing the strong potential of the proposed method.
I thank the authors for the detailed rebuttal, which addresses the issues I raised. Given the demonstrated statistical stability, robustness analyses, and clarified resource footprint, I am raising my overall score from 4 (borderline accept) to 5 (accept). I believe the paper makes a meaningful and well-supported contribution to open-vocabulary part segmentation community.
This paper introduces an object-level instance and object-specific part-level instance segmentation method. The results on several datasets surpass previous methods.
Strengths and Weaknesses
Strengths:
This paper provides a detailed definition of the problem associated with this "new" task. The experiments conducted on different benchmarks verify the effectiveness of the proposed method, particularly in the cross-dataset experiment.
Weaknesses:
However, the motivation for this task is not explained well. The object-specific part-level instance segmentation is more like a combination of the part-segmentation and instance-segmentation tasks. It is hard for readers to grasp the motivation at the beginning.
In addition, the proposed method includes several components, but the connection between them is not strong. The part segmentation module and the object segmentation module overlap for a specific object category. For instance, both can learn the information for "bus", yet no relation over this information is established when building the structure.
Another critical issue is why the authors refer to the method as language-grounded. The processing of the textual part is limited: the prompts for both object segmentation and part segmentation are simply passed through a frozen text encoder, followed by a concatenation. The additional information plays a limited role as "language grounding".
The paper requires careful proofreading, particularly in image captions and Section 3.2, 4.3.
Questions
- Line 42-47, can you explain the limitations of previous methods? I do not entirely agree with your points.
- Why do the authors train the visual and pixel backbones, which is very rare?
- What is the difference between the proposed method and a set of learnable, universal queries in PartGLEE in Section 3.4?
Limitations
Some parts are insufficiently explained in section 3.2. The improvements in Tables 2 and 3 are not significant.
Justification for Final Rating
The authors provide a rebuttal to address some concerns from the first round. However, some key points (e.g., motivation, the distinction from object/part segmentation) require more thorough explanations and experiments. Therefore, I keep my initial rating of borderline reject.
Formatting Issues
None.
We appreciate the reviewer's comments and recognition of performance gains achieved by our method on different benchmarks, especially on open vocabulary cross-dataset tasks, which are a central target of our method.
1. Motivation Clarification
We do not perform object-specific part segmentation, but rather segment parts of all requested objects simultaneously, following the broadly accepted open-vocabulary part segmentation (OVPS) framework.
For the relationship between part- and instance-segmentation, in fact, prior work such as PartGLEE follows the approach mentioned by the reviewer — simply combining object and part segmentation modules via a Q-Former.
In contrast, our method models the object-part hierarchy in language space, leading to significant performance gains over PartGLEE. Addressing the following challenge is important:
While object-level instance segmentation is well-defined, part-instance segmentation is task-dependent and inherently requires open-vocabulary capabilities.
For example:
- Using a laptop may require segmenting the lid to open it.
- Repairing it may require finer segmentation of screws or hinges.
We solve it by:
- Incorporating open-vocabulary prompts into object segmentation;
- Modeling hierarchical object-part relationships with structured prompts;
- Enforcing a one-to-one mapping between object parts and part segmentation queries;
- Using language-grounded queries in the part decoder that leverage the reasoning capabilities of an MLLM.
We will clarify this in the introduction.
2. Connection Between Modules
We believe this is a misunderstanding.
Our object and part segmentation modules are tightly coupled via the hierarchical object-part representation (mentioned in Line 140, 174, and Figure 2) and aim at utilizing the joint information between parts and objects through the following information flow:
- For each segmented object, a set of part queries is constructed by a concatenation of the object embedding generated in the object decoder and the language-encoded part description.
- These queries are processed by the MLLM, generating queries that are capable of open-vocabulary part segmentation.
- The processed queries are used in the part decoder to segment each requested part.
Our ablation study in l269ff and Table 5 shows a comparison with detached object-part segmentation that demonstrates that the joint training with information flow from the part decoder to the object decoder is efficient and improves performance.
We will add this paragraph to the architecture overview.
3. Language Grounding
We respectfully disagree and refer to our method as being language-grounded since we design the whole object-part representation and processing to remain in a language-aligned space. Our motivation to follow this approach is to utilize the generalist language representations learned by the MLLM.
In detail the language-grounding is enabled in the following way:
- Both object and part queries generated in the segmentation modules are aligned to the language space and compared against text embeddings of category labels for classification.
- More critically, language processing in our method goes beyond the frozen text encoder. We combine the hierarchical part structure in language space with semantic object embeddings produced by the object segmentation, explicitly modeling object-part hierarchies in language space.
- We leverage these text-aligned embeddings in a pretrained MLLM, which enriches the queries with contextual and semantic knowledge from the vision-language domain. Similarly, the output of the MLLM resides in a text-informed feature space, allowing the part decoder to extract visual features in a way that is conditioned on open-vocabulary semantics.
This design enables the open-set generalization capability and allows the model to process novel categories and part-object relationships.
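As a concrete illustration of how queries are compared against the text embeddings of candidate category labels for open-vocabulary classification, here is a minimal sketch; the temperature value and function names are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn.functional as F

def classify_queries(queries: torch.Tensor,       # (N, D) language-aligned object/part queries
                     label_embeds: torch.Tensor,  # (C, D) text embeddings of candidate labels
                     temperature: float = 0.07) -> torch.Tensor:
    """Open-vocabulary classification by cosine similarity to category text embeddings."""
    q = F.normalize(queries, dim=-1)
    t = F.normalize(label_embeds, dim=-1)
    logits = q @ t.T / temperature   # (N, C) similarity logits
    return logits.softmax(dim=-1)    # probabilities over the open vocabulary
```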
4. Limitation of Previous Methods
We clarify the limitations of the existing methods. It is recognized that CLIP’s embeddings have limited capacity for compositional understanding and for modeling object-part hierarchies [1, 38, 48]. As a result, CLIP-based methods such as OV-Part [42] and PartCLIPSeg [7] tend to show suboptimal performance in fine-grained part segmentation. To address this, PartCatSeg augments the model with DINOv2 for structural guidance. However, the approach still lacks an explicit mechanism for modeling object-part hierarchies grounded in language, which is critical for compositional generalization.
PartGLEE [20], another existing work, has an inherent limitation by design in handling part-granularity variation across datasets. It parses objects into parts through a Q-Former learned from scratch, which cannot handle object-part granularities unseen during training. For example, trained with fine-grained annotations such as "eye", "nose", and "ear" for the cat category on Pascal-Part-116, PartGLEE performs poorly when evaluated on PartImageNet, where the cat is annotated only with coarser parts ("head", "body", "foot", and "tail").
We will revise the part mentioned by the reviewer to make it clear.
5. Training of Visual Backbones and Pixel Decoder
It is indeed necessary to train the visual backbone and pixel decoder. The visual backbone is from Mask2Former (CVPR'22), pretrained only on the object-level segmentation task and thus limited in segmenting fine-grained, part-level instances. Therefore, we train the backbone on part-level datasets to grant it part-aware segmentation ability. This practice is actually not uncommon in the literature on part-level segmentation (e.g., [20, 40]).
We further conducted an ablation study on the freezing/training of the visual backbone and the pixel decoder. As shown in Table R-1 and R-2, training the visual backbone and the pixel decoders leads to better performance in part-level segmentation and higher overall AP.
| Method | PPS-116 | - | - | +INS | - | - | +INS+PART | - | - | PartImageNet | - | - |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | obj | part | AP | obj | part | AP | obj | part | AP | obj | part | AP |
| Frozen Bk+Pd | 48.2 | 6.99 | 16.1 | 64.1 | 8.85 | 21.1 | 66.1 | 9.34 | 22.0 | 80.6 | 30.1 | 41.3 |
| Frozen Bk | 47.6 | 7.36 | 16.3 | 63.0 | 6.98 | 19.4 | 63.4 | 12.4 | 23.8 | 83.2 | 34.7 | 45.4 |
| LangHOPS | 49.1 | 8.62 | 17.6 | 61.8 | 13.6 | 24.3 | 62.7 | 17.0 | 27.1 | 85.5 | 47.9 | 55.8 |
Table R-1: Ablations on freezing backbone (Bk) and pixel decoder (Pd). Cross-dataset and in-domain evaluation, PPS-116 -> PartImageNet (with Random seed 4337).
| Method | PartImageNet | - | - | +INS | - | - | +INS+PART | - | - | PPS-116 | - | - |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | obj | part | AP | obj | part | AP | obj | part | AP | obj | part | AP |
| Frozen Bk+Pd | 10.5 | 1.71 | 3.05 | 23.4 | 2.57 | 5.73 | 23.2 | 2.76 | 5.86 | 53.3 | 7.48 | 17.7 |
| Frozen Bk | 11.8 | 1.95 | 3.45 | 23.3 | 2.92 | 6.01 | 23.0 | 3.34 | 6.32 | 44.2 | 7.65 | 15.8 |
| LangHOPS | 11.3 | 2.17 | 3.47 | 21.9 | 3.82 | 6.55 | 23.8 | 4.16 | 7.13 | 56.4 | 15.3 | 21.4 |
Table R-2: Ablations on freezing backbone and pixel decoder. PartImageNet -> PPS-116.
6. Differentiation from PartGLEE
Language-Space Object-Part Representation, a core innovation of our method, distinguishes ours from PartGLEE. Unlike PartGLEE, which uses a fixed set of learnable part queries processed by a Q-Former, our method dynamically constructs part queries in language space, conditioned on both the object context and a user-defined list of candidate part categories.
For each object query from the object segmentor, we iterate over candidate part labels and concatenate each text embedding with the object query to form context-aware initial part queries. These are then further refined by a multimodal LLM in Section 3.5. To address the reviewer's concern, we conduct an ablation study by replacing the language-grounded part query initialization with fixed learnable tokens as in PartGLEE and use them as the input to the MLLM, which drops the cross-dataset performance significantly compared to the proposed method:
| Method | PPS-116 | - | - | +INS | - | - | +INS+PART | - | - | PartImageNet | - | - |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | obj | part | AP | obj | part | AP | obj | part | AP | obj | part | AP |
| w. LQ | 46.8 | 8.10 | 16.7 | 58.2 | 11.8 | 22.1 | 58.9 | 13.9 | 23.9 | 81.3 | 46.7 | 54.4 |
| LangHOPS | 49.1 | 8.62 | 17.6 | 61.8 | 13.6 | 24.3 | 62.7 | 17.0 | 27.1 | 85.5 | 47.9 | 55.8 |
Table R-3: Ablations on Learnable Query. Cross-dataset and in-domain evaluation, PPS-116 -> PartImageNet (with Random seed 4337). LangHOPS w. LQ means that the "Language-Space Object-Part Representation" is replaced by the PartGLEE-like learnable queries concatenated with object queries.
| Method | PartImageNet | - | - | +INS | - | - | +INS+PART | - | - | PPS-116 | - | - |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | obj | part | AP | obj | part | AP | obj | part | AP | obj | part | AP |
| w. LQ | 11.0 | 1.95 | 3.32 | 28.6 | 3.02 | 6.54 | 22.7 | 4.26 | 7.05 | 54.6 | 9.65 | 19.6 |
| LangHOPS | 11.3 | 2.17 | 3.47 | 21.9 | 3.82 | 6.55 | 23.8 | 4.16 | 7.13 | 56.4 | 15.3 | 21.4 |
Table R-4: Ablations on Learnable Query. PartImageNet -> PPS-116.
Thank you for your detailed responses. Some of my previous concerns have been addressed, but I still find some points unconvincing. Regarding the motivation, the authors should carefully state the limitations of current research, particularly for the "new" task; a simple combination of A and B does not answer why the authors introduce part- and object-level segmentation. Regarding the connection, I do not agree that the MLLM-processed queries link objects and parts for the part decoder. Regarding the language grounding, the authors claim that the text embeddings obtained from the prompts provide the linguistic alignment with the visual parts, yet this concept does not fit this case well. Therefore, I keep my original rating.
We appreciate the reviewer's comments, which have helped us better understand and address the remaining concerns.
Motivation Clarification
We agree it is important to clearly articulate the motivation behind the proposed task and method. As stated in the main paper (line 19-25, 71-72) and illustrated in the rebuttal examples (1. Motivation Clarification), the motivation of OVPIS lies not only in the novel hierarchical design for addressing object-part segmentation challenges but also in its potential impact on downstream applications such as image editing and robotic manipulation at the part level. Following the reviewer’s suggestion, we have further emphasized this motivation in the third paragraph of the introduction in the revised paper.
MLLM for the link between multi-granularity concepts
We appreciate the request for clarification. The proposed MLLM receives the object query followed by part queries (as described in line 193: e.g., object_1 with part_1, part_2, ... ) and outputs refined part queries that integrate both object- and part-level information. This design not only enables the MLLM to leverage object-level context to infer part semantics, but also allows bidirectional information flow - from parts back to the object - during training (see Section 4.3, Object-Part Synergy).
We have clarified this mechanism in Section 3.5 of the revised version.
Language-grounded Representation Clarification
We understand the reviewer’s concern and would like to clarify the language grounding process:
- In our method, language grounding is not limited to pure text embeddings. Indeed, both the object query and part query incorporate visual information and are aligned with category text embeddings for classification and segmentation tasks.
- The statement “text embeddings are the linguistic alignment with visual parts” was imprecise. In fact, the initial part queries are formed by concatenating object queries with text embeddings and then refined via the MLLM, resulting in queries enriched with both language semantics and the visual information of specific objects.
- A more accurate description would be: language-grounded object-part representations are applied as additional semantic information to form visual-language queries to facilitate open-vocabulary segmentation.
We have incorporated this clarification into the Introduction and Section 3.4 of the revised paper.
We sincerely appreciate the reviewer’s comments and suggestions, which allow us to further improve the motivation of the paper and the clarity of the method design with
- a further highlighted motivation with downstream applications in the introduction,
- a clarified MLLM-based model design,
- and a detailed description of the language-grounded object-part representation.
We hope these improvements address the remaining concerns and clarify the novelty and contributions of our work. We respectfully ask the reviewer to reconsider the overall rating in light of the revision.
Dear Reviewer,
Thank you for your dedicated efforts in reviewing this paper. We are currently in the reviewer-author discussion phase, but we have not yet seen your engagement.
This year's Responsible Reviewing initiative requires all reviewers to communicate with authors during this period, emphasizing that ghosting is not acceptable. We kindly ask that you reply and engage with the authors. Please note that participation in discussions with authors is mandatory before submitting "Mandatory Acknowledgement," as submitting it without any engagement is not permitted in this review cycle.
Best, AC
Dear Reviewer,
We would like to express again our sincere appreciation for your review of our work. We hope that we have addressed your concerns and remain committed to further improving the paper as promised. Please feel free to reach out with any additional questions or suggestions.
The paper ultimately received three positive reviews (two Borderline Accepts and one Accept) and a single negative review (Borderline Reject).
The positive reviewers found that their concerns were fully addressed during the post-rebuttal discussion, leading two of them to raise their scores. While one reviewer remained explicitly negative, the authors provided thorough and convincing responses that the other reviewers found to sufficiently address the concerns regarding novelty, motivation, and technical contributions.
Overall, the AC believes the merits of the paper outweigh its flaws and is therefore recommending acceptance.