RESAnything: Attribute Prompting for Arbitrary Referring Segmentation
Summary
Reviews and Discussion
The paper presents RESAnything, a training-free framework for arbitrary referring-expression segmentation. It first generates SAM mask proposals, then uses a multimodal LLM to write a rich “reference” description for the query and “candidate” descriptions for each proposal. Four cross-modal similarity signals—two from the LLM and two from CLIP—are combined to select or fuse the best mask. Without any fine-tuning, the method more than doubles prior zero-shot accuracy on RefCOCO and outperforms state-of-the-art fine-tuned models on ReasonSeg and the new ABO-Image-ARES benchmark. The paper also contributes that benchmark and detailed ablations of prompts and backbones.
Strengths and Weaknesses
Strengths
- The approach is training-free yet delivers large zero-shot performance gains, demonstrating strong generalization. Its strong performance on part-level RES makes it useful for many downstream tasks and research applications.
- It shows that careful prompt engineering plus proposal recycling can close most of the zero-shot gap to supervised models.
- It introduces a new benchmark for evaluating part-level RES.
- The experiments are comprehensive, and the paper presentation is clean and organized.
Weaknesses
- The method design is systematic and well-thought-out, but it is not elegant for the following reasons:
- Its performance relies on multiple underlying foundation models and, most importantly, on good segmentation proposals from SAM. Any change in prediction performance, sensitivity, or granularity of the underlying foundation models will render the proposed method either insufficient (bad proposals) or redundant (when foundational MLLMs get much better).
- This method relies on handcrafted text prompts (as shown in Supp. Sec. 7.1), which are also potentially sensitive to weight updates or swapping of the MLLM model.
- The inference runtime overhead is substantial (as reported in Table 11 in Supp. Sec. 9.4).
- It would be super interesting to see these directions in future or modified work:
- How to generate new proposals when current proposals are not good enough.
- Investigating the upper bound and bottlenecks of training-free methods by conducting a comprehensive qualitative study on samples that are predicted correctly by training-based methods but incorrectly by your method.
Questions
Please refer to Weaknesses.
Limitations
Yes. The authors discuss these limitations fairly thoroughly.
Justification of Final Rating
Thanks for the response. I have also read the other reviewers' comments and the authors' rebuttal, and I am happy to keep my score. I hope the authors will integrate feedback from all reviewers into their final manuscript.
Formatting Issues
None.
Dear reviewer tcpr, thank you for your encouraging feedback and recognition of our training-free approach's substantial zero-shot performance gains, strong generalization capabilities, and comprehensive experimental validation. We particularly appreciate your thorough review of our supplementary materials. Please find our responses to the identified weaknesses below.
Robustness and usage of foundation models
We acknowledge these limitations regarding foundation model dependency, prompt sensitivity, and runtime overhead. Our ablation studies demonstrate robust performance across diverse MLLMs (Pixtral 12B, Claude 3.5, and Qwen 2) using consistent prompts, and we provide a comprehensive prompt analysis in the supplementary materials to verify the effectiveness of attribute-based prompting. More importantly, our focus is on solving referring expression challenges that previous methods fundamentally cannot handle. While current powerful MLLMs have extensively learned concepts from standard benchmarks like the RefCOCO series through their training data, the critical research challenge lies in handling the specialized, domain-specific, and explicit part-level queries in the ReasonSeg and proposed ABO-ARES benchmarks. While future foundational MLLMs may eventually be able to handle these queries through scaling, our work offers a viable pathway to obtain such annotated data for training and explores the possibility of learning rich properties, attributes, and other sparse information about target regions of queries in images without requiring extensive fine-tuning.
Our current method has not been optimized for inference speed. By parallelizing candidate mask generation from SAM, reference text generation, candidate text generation, and the similarity comparison between the target and candidates, RESAnything's runtime can be reduced to about 3s (on multiple 4090 GPUs), as shown below: max(SAM, reference text generation) (2s) + candidate text generation (0.5s) + max(MLLM similarity computation, CLIP similarity computation) (0.5s)
This optimization leverages the fact that candidate mask generation and reference text generation can be executed in parallel, and similarity computations using MLLM and CLIP can also run concurrently. Furthermore, inference speed could be improved even more by replacing SAM with FastSAM for mask generation.
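The following runnable sketch illustrates this parallel schedule; the stage functions are dummy stand-ins (time.sleep with the rough per-stage costs above) rather than the real SAM/MLLM/CLIP calls, and the equal-weight score fusion at the end is only for illustration.

```python
# Sketch of the parallel schedule; stage functions are dummy placeholders, not the real pipeline.
import time
from concurrent.futures import ThreadPoolExecutor

def sam_proposals(image):        time.sleep(2.0); return ["mask_1", "mask_2"]
def reference_text(expression):  time.sleep(2.0); return "reference description"
def candidate_texts(masks):      time.sleep(0.5); return [f"desc of {m}" for m in masks]
def mllm_similarity(ref, cands): time.sleep(0.5); return [0.9, 0.4]
def clip_similarity(ref, masks): time.sleep(0.5); return [0.8, 0.5]

def segment_parallel(image, expression):
    with ThreadPoolExecutor() as pool:
        # Stage 1: SAM proposals and reference text are independent -> max(2s, 2s), not 4s.
        props_f = pool.submit(sam_proposals, image)
        ref_f = pool.submit(reference_text, expression)
        masks, ref = props_f.result(), ref_f.result()
        # Stage 2: candidate descriptions for all proposals (~0.5s).
        cands = candidate_texts(masks)
        # Stage 3: MLLM- and CLIP-based scoring run concurrently -> max(0.5s, 0.5s).
        mllm_f = pool.submit(mllm_similarity, ref, cands)
        clip_f = pool.submit(clip_similarity, ref, masks)
        # Equal weighting is an illustrative assumption, not the paper's fusion rule.
        scores = [0.5 * a + 0.5 * b for a, b in zip(mllm_f.result(), clip_f.result())]
    return masks[max(range(len(scores)), key=scores.__getitem__)]

start = time.time()
print(segment_parallel("image.jpg", "the black chair"), f"{time.time() - start:.1f}s")  # ~3s total
```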
Future work
Thank you for these insightful suggestions for future research directions. We agree that investigating proposal generation mechanisms when current SAM proposals are insufficient represents a valuable extension, as this directly addresses a key bottleneck in our approach. Our grouping and selection algorithm handles targets consisting of multiple proposals, though we acknowledge this mechanism is not always reliable. As shown in Section 10 of our supplementary materials, we included failure cases that are primarily attributed to poor mask proposals from SAM. This analysis confirms that proposal quality represents the primary bottleneck of our method, making your suggested research direction particularly relevant for advancing training-free referring expression segmentation.
First of all, I highly suggest that the authors provide point-by-point responses during future rebuttals; this will help the reviewer and yourselves get your points across.
Secondly, I don't think the authors address these two points of mine directly:
How to generate new proposals when current proposals are not good enough.
I understand the limitation, but how to address it?
Investigating the upper bound and bottlenecks of training-free methods by conducting a comprehensive qualitative study on samples that are predicted correctly by training-based methods but incorrectly by your method.
Do such examples exist? Have the authors taken a look at them? Do you have any discoveries?
Thank you for the feedback regarding rebuttal structure.
How to generate new proposals when current proposals are not good enough?
Inadequate proposals are indeed a limitation of our method. They occur in two scenarios:
- Multi-part/multi-object targets requiring combination: SAM generates individual object/part masks that need grouping. Our current grouping and selection algorithm can handle this but is not always reliable, as shown in supplementary materials, Figure 18: “crib” and “middle bird.”
- Completely missing ground truth masks: The target mask is entirely absent from SAM proposals, typically for very small objects or those with complicated parts. See examples in supplementary materials Figure 17: “red bike”; Figure 20: “middle set of skis”; Figure 24: “chair with bird.”
To address these limitations, future work may develop adaptive region-specific segmentation that focuses on target-containing regions with appropriate granularity, as discussed in Section 5. Additionally, iterative proposal refinement mechanisms could be implemented to generate additional candidates when initial proposals are insufficient.
Investigating upper bound and bottlenecks through qualitative analysis
Yes, such examples do exist and we have conducted this analysis. As mentioned above, our primary bottleneck is mask proposal quality from SAM. We have identified specific failure cases in our supplementary materials (Figures 17, 18, 20, 24), where training-based methods succeed but our method may fail due to inadequate proposals.
Our key discovery is that this bottleneck represents a fundamental limitation of training-free approaches that rely on general-purpose segmentation models, suggesting that proposal generation quality remains the critical performance determinant.
However, this also highlights the adaptability advantage of our framework. As more powerful segmentation models (beyond SAM) and MLLMs become available, they can be seamlessly integrated into our pipeline to achieve better performance without requiring retraining or architectural modifications.
Thanks for the response. I have also read the other reviewers' comments and the authors' rebuttal, and I am happy to keep my score.
Thank you so much for taking the time to read our paper and the responses. We are grateful for your recommendation.
This paper proposes RESAnything, a novel open-vocabulary zero-shot method for arbitrary referring expression segmentation, which ingeniously integrates MLLMs, CLIP, and SAM. By generating detailed object/part attribute descriptions via MLLMs and leveraging mask proposals from SAM, RESAnything enables training-free identification of referring expressions encompassing object/part labels, implicit attributes, and logos, achieving remarkable performance. The paper also introduces a new benchmark dataset, ABO-Image-ARES, which bridges a critical gap in RES benchmarks by offering challenging expressions with implicit attribute contexts.
Strengths and Weaknesses
Strengths:
1. This paper ingeniously integrates MLLM, CLIP, and SAM, enabling full exploitation of each component’s zero-shot capabilities for the RES task. The approach achieves performance comparable to existing task-specific models while requiring no training.
2. The newly introduced benchmark dataset, ABO-Image-ARES, addresses a critical gap in RES benchmarks by offering challenging expressions with implicit attribute contexts.
3. The proposed Attribute Prompting approach ingeniously repurposes MLLMs’ captioning capabilities for grounding tasks, offering a novel paradigm for extending their applicability to broader task domains.
Weaknesses:
1. Although the performance improvements are significant, direct comparison between RESAnything and fully training-free methods (e.g., GLCLIP) may appear inequitable. Unlike GLCLIP’s SAM+CLIP pipeline, RESAnything incorporates an MLLM component that has been pre-trained on Referring Expression Comprehension (REC)—a twin task of RES—endowing it with strong REC capabilities.
2. Accordingly, MLLMs such as QWEN2.5-VL have demonstrated remarkable performance in grounding tasks on the RefCOCO (+/g) dataset. Given that RES primarily involves segmentation within detected bounding boxes, the interplay between REC capabilities and RES performance warrants further investigation to validate RESAnything’s effectiveness.
3. Considering that Attribute Prompting relies more heavily on the MLLM's understanding capability, it seems necessary to perform more cross-model comparisons using MLLMs with varying levels of understanding and grounding performance, in order to illustrate the extent to which RESAnything depends on the MLLM’s pre-trained knowledge.
4. Regarding the paper’s emphasis on "Chain-of-Thought" (CoT), the corresponding implementation is absent. Neither the algorithm flow nor attribute prompting demonstrates explicit application of CoT for MLLM reasoning.
5. As a core contribution, the authors claim attributes are pivotal to attribute prompts, yet no ablation experiments on different attribute types are provided.
6. Minor issues:
   - The set of fully-supervised baselines should include more recent methods, as ReLA (GRES) is somewhat outdated.
   - Algorithm 1 may have an indentation error: the condition for checking the total number of elements in set C should be positioned outside rather than inside the loop.
Questions
As mentioned above, several concerns have been raised in the Weaknesses section.
Limitations
Yes.
Justification of Final Rating
The authors have provided a detailed reply to my concerns and questions. I raise my rating to 4.
Formatting Issues
No formatting concerns observed.
Dear reviewer R2q5, thank you for your positive feedback and recognition of our ingenious integration of different tools to achieve training-free performance comparable to task-specific models. We appreciate your acknowledgment of our ABO-Image-ARES benchmark and novel attribute prompting paradigm that extends MLLMs' applicability to broader grounding tasks. We address your concerns regarding the identified weaknesses below.
REC capabilities of current MLLMs and RESAnything
We appreciate the reviewer's thoughtful observation about the relationship between REC and RES capabilities. While we acknowledge that some MLLMs like Qwen2.5-VL demonstrate strong performance on RefCOCO datasets, we respectfully note several important considerations. First, the RefCOCO series, published nine years ago, represents a relatively saturated benchmark on which both supervised methods and foundation models have been extensively optimized, making strong performance on this dataset less indicative of genuine REC capability for novel, complex expressions. As demonstrated in Section 6.1 and Figure 9 of our supplementary materials, when MLLMs are directly asked to provide comprehensive descriptions of target queries in images, they generally fail to give accurate responses, suggesting that they may not have robust REC understanding. More importantly, our method addresses precisely the gap left by existing approaches, including well-performing MLLMs, which cannot handle complex referring expressions involving reasoning and explicit attributes (see the quantitative comparisons on ReasonSeg and ABO-ARES). These scenarios are absent from traditional benchmarks like the RefCOCO series. While the interplay between REC and RES capabilities is indeed worth investigating, our focus is on extending both capabilities to challenging scenarios that represent genuine research frontiers rather than optimizing performance on well-established benchmarks. We believe this distinction is crucial for advancing the field toward handling real-world complexity in referring expression tasks.
Ablation on MLLMs and attributes
We have indeed conducted comprehensive ablation studies across MLLMs with varying capabilities and scales, as shown in Table 5 of our paper. Our evaluation includes Claude 3.5 Sonnet (a closed-source commercial model with strong reasoning capabilities), Pixtral 12B (a relatively large open-source model), and Qwen2-VL 7B (a general-scale model), representing a diverse spectrum of MLLM understanding and grounding performance levels.
To validate the effectiveness of attribute prompting, we conducted ablation studies showing significant advantages over conventional prompting on ReasonSeg:
| Method | gIoU | cIoU |
|---|---|---|
| standard prompt | 50.8 | 49.3 |
| attribute prompt | 74.6 | 72.5 |
We show details of standard prompt, attribute prompt, and corresponding generated descriptions in supplementary materials, Section 6.1 and Figure 9. Both quantitative and qualitative results highlight the contribution of our attribute prompting mechanism.
Usage of CoT
We appreciate the clarification regarding Chain-of-Thought implementation. To better contextualize our approach, RESAnything employs a step-by-step decomposition strategy that breaks down complex referring expression segmentation into manageable stages: (1) attribute decomposition of the referring expression, (2) mask proposal generation, (3) multi-modal similarity assessment, and (4) optimal selection and grouping. This systematic approach transforms the complex task into interpretable intermediate steps, enabling our method to handle sophisticated referring expressions that challenge end-to-end approaches. Our attribute prompting guides the MLLM through structured analysis at each stage, facilitating robust understanding of spatial and semantic relationships before reaching the final segmentation decision. We are happy to reword this if the reviewer finds it confusing.
Others
We will update the quantitative comparison with the latest models in the camera-ready version and correct any formatting errors.
Dear reviewer R2q5,
Thank you for your thoughtful review of “RESAnything: Attribute Prompting for Arbitrary Referring Segmentation”. We truly appreciate the time and effort you invested, and we welcome further discussion of your feedback. We have submitted a detailed rebuttal addressing your concerns, including the REC capability of current MLLMs, cross-model MLLM & attribute ablations, and the Chain-of-Thought design.
We welcome any follow-up questions or suggestions to ensure we’ve fully resolved your concerns ahead of the final decision. Your early feedback would be invaluable to us.
Best,
Authors
Thank you for your reply. I hope the authors can clearly indicate which comment each response addresses.
Below are my detailed responses:
- I do not understand why the authors claim that RESAnything's focus on novel scenes does not require considering the impact of the MLLM's inherent REC capability. Regardless of whether the scenes are novel or old, I believe a quantitative analysis of the MLLM's built-in REC ability is necessary, as RESAnything clearly relies heavily on this capability. The authors should demonstrate that the RESAnything paradigm indeed effectively overcomes the inherent limitations of MLLMs in terms of RES, possibly by leveraging tools like SAM or R-CNN to transform REC into RES.
- I think the authors are somewhat conflating the REC ability of MLLMs with their CAPTION capability. Section 6.1 and Figure 9 are not sufficient to illustrate the limitations of MLLMs in REC. They only show that MLLMs can perform reasonably well in captioning tasks when provided with good prompts.
- Merely relying on verbal descriptions to compare the performance of different MLLMs is limited. I hope the authors can provide a quantitative or detailed comparison of their CAPTION or RES capabilities. This, together with Table 3 (right), would better illustrate the dependence of RESAnything on the various capabilities of MLLMs, and would help guide the selection of MLLMs in future use of RESAnything.
- The ablation study on attribute prompting has addressed my concerns. I believe this experiment is important and should be included in the paper.
- The authors' response regarding CoT has resolved my question. However, although CoT does fairly well in representing the staged paradigm of RESAnything, I think it should be named slightly differently from CoT as used in the LLM field to reduce potential confusion. The authors may consider this point in light of other reviewers' feedback.
Response to point 1
Thank you for the question, but we must first clarify that we never made the said claim. As a training-free method, RESAnything must rely on capacities of foundational models, including MLLM’s inherent REC capabilities. However, as the following results demonstrate, while SOTA MLLMs that are capable of REC perform well on RefCOCO benchmarks, they are still quite limited in handling complex referring expressions in ReasonSeg.
| Model | RefCOCO val | RefCOCO+ val | RefCOCOg val | ReasonSeg REC |
|---|---|---|---|---|
| Qwen-VL 7B | 89.36 | 83.12 | 85.58 | 56.93 |
| Qwen-VL 2.5 3B | 89.1 | 82.4 | 85.2 | 56.51 |
This performance gap highlights the fundamental challenge that existing MLLMs face with complex reasoning queries. Even if these MLLMs do possess the underlying knowledge about complex referring expressions, we still need a structured framework to consistently elicit and utilize such knowledge effectively. Our attribute prompting and multi-criteria comparison approach provides that structured framework.
Response to point 2
Thank you for raising this point. We apologize for any confusion caused by our previous wording. Section 6.1 and Figure 9 are not intended to illustrate "limitations of MLLMs in REC," but rather demonstrate the opposite: MLLMs do have inherent capacity for understanding complex referring expressions, but we need an effective framework to systematically utilize such capacity, which is our contribution.
Our previous statement that MLLMs "may not have robust REC understanding" was misleading and we apologize for that confusion. What we meant is that while MLLMs possess the underlying knowledge, they require structured prompting to consistently access this knowledge for complex referring expression tasks.
Response to point 3
Thanks for the suggestion. In fact, most MLLMs do not support RES tasks directly; they generally require further fine-tuning on RES datasets built from RefCOCO-series data, or a special design connecting MLLMs with SAM for RES.
Additionally, most proprietary models do not support REC with bounding box coordinates or spatial points (x, y) as output. In the following table, we list the performance of MLLMs that support the REC task:
| Model | RefCOCO val | RefCOCO+ val | RefCOCOg val |
|---|---|---|---|
| Gemini 1.5 Pro | 73.2 | 62.5 | 75.2 |
| Qwen2.5-VL 3B | 89.1 | 82.4 | 85.2 |
| Qwen2.5-VL 7B | 90.0 | 84.2 | 87.2 |
| Qwen2.5-VL 72B | 92.7 | 88.9 | 89.9 |
| InternVL2.5 78B | 93.7 | 90.4 | 92.7 |
Gemini 1.5 Pro shows limited performance on REC tasks, while Qwen2.5 VL series and InternVL2.5 demonstrate strong performance. We can also observe that larger models consistently outperform smaller variants. In the future selection of MLLMs for RESAnything, we may consider larger models with stronger grounding capabilities to guide better performance outcomes.
Response to points 4 & 5
Thank you for providing detailed feedback. We will include all experiments from the rebuttal in the final version and rephrase any misleading concepts.
Thank you for the response. The authors have provided a detailed reply to my concerns and questions. I raise my rating to 4.
Thank you so much for taking the time to read our paper and the responses. We are grateful for the improved score.
This paper proposes RESAnything, a training-free, open-vocabulary framework for zero-shot referring expression segmentation (RES). Unlike existing RES methods that focus on object-level semantic labels, RESAnything handles arbitrary referring expressions—including part-level, functional, stylistic, and even graphical (e.g., logos)—by leveraging multi-modal large language models (MLLMs) and Segment Anything Model (SAM). The method introduces attribute prompting to describe both the target expression and each mask proposal, and evaluates their alignment using a multi-metric framework involving MLLMs and CLIP. Additionally, the paper contributes a new benchmark dataset, ABO-Image-ARES, focusing on complex, attribute-rich expressions.
Strengths and Weaknesses
Strengths:
- The method requires no fine-tuning and works zero-shot, making it broadly applicable across domains.
- RESAnything supports a wide spectrum of expressions, including implicit references and attributes like material, style, function, etc.
- The ABO-Image-ARES dataset is a valuable contribution, providing a resource for evaluating reasoning-intensive and part-level RES tasks.
- The method achieves state-of-the-art zero-shot performance on multiple datasets.
Weaknesses:
- The method is largely a system-level integration of existing tools (SAM, CLIP, MLLMs), with the main innovation being the design of prompts and evaluation metrics.
- While the title claims to cover "anything," the method does not address affordance-based or action-grounded queries. See TOIST: Task-oriented Instance Segmentation Transformer, NeurIPS 2022.
- The pipeline includes multiple calls to SAM and MLLMs for proposal generation, attribute description, and evaluation. This results in high computational cost and limits applicability in real-time or large-scale deployment settings.
Questions
- How does RESAnything handle cases where multiple objects share similar attributes (e.g., “the black chair” when there are two black chairs)?
- Have the authors considered evaluating the method’s generalization to embodied references involving human pose? For example, in Understanding Embodied Reference with Touch-Line Transformer (ICLR 2023).
Limitations
- As acknowledged by the authors (Line 308), the method lacks an explicit mechanism to handle varying object scales. This results in performance degradation on large-object segmentation tasks.
Justification of Final Rating
For a foundation-model-style system, empirical results are the most important indicator of value, and the new SOTA performance on affordance-based reference tasks convincingly demonstrates the practical utility of this work. I remain unconvinced that the prompting strategy itself is a core scientific contribution, but this is less critical in the current NLP landscape where such design is often the norm. Overall, the strong results justify acceptance.
Formatting Issues
None
Dear reviewer DVej, thank you for your detailed review and insightful questions! We appreciate your recognition that the proposed zero-shot method is broadly applicable across different domains with support for a wide spectrum of expressions. We are also encouraged by your positive assessment of our proposed benchmark and the outstanding performance demonstrated by our method. Regarding the weaknesses and questions, please find our responses below.
Novelty
While our method integrates existing foundation models, the key contribution lies in the novel attribute prompting framework that enables systematic reasoning about part-level attributes and implicit queries—capabilities that current methods cannot achieve. Our novelty is not prompt engineering, but the systematic multi-stage reasoning framework that decomposes complex spatial and semantic relationships into interpretable steps, addressing the "last mile" challenge in RES. To validate the effectiveness of attribute prompting, we conducted ablation studies showing significant advantages over conventional prompting on ReasonSeg:
| Method | gIoU | cIoU |
|---|---|---|
| standard prompt | 50.8 | 49.3 |
| attribute prompt | 74.6 | 72.5 |
We show details of the standard prompt, attribute prompt, and corresponding generated descriptions in supplementary materials, Section 6.1 and Figure 9. In addition, we also contribute a challenging new dataset whose characteristics deviate significantly from RefCOCO, a dataset most prior methods may have overfit to. Our new dataset is also catered to real-world application scenarios, e.g., eCommerce.
Affordance-based or action-grounded expressions involving human pose
Although we did not explicitly evaluate on any affordance-based or action-grounded benchmarks such as TOIST and the mentioned ICLR paper, both ReasonSeg and our proposed ABO-ARES benchmark contain substantial affordance-based and action-grounded queries that demonstrate our method's capability in this domain. Specifically, ReasonSeg includes expressions such as "people cross over obstacles," "object used by people to sit down," "where people wash their hands," and "people catch," which directly involve human-object interactions and spatial affordances. Similarly, ABO-ARES contains functionality-oriented queries like "easy pouring" (in Figure 8), "foldable design," and "cushion support" that explicitly reference object affordances and their intended human interactions. The superior performance of our method on both datasets demonstrates effective handling of these scenarios, validating our approach's applicability to affordance-based reasoning. Our training-free framework can be naturally extended to affordance benchmarks by incorporating embodied attributes such as graspability, stability, or manipulation properties into the attribute prompting mechanism, allowing seamless adaptation to action-grounded scenarios without further fine-tuning. We will include these related papers in the final version of the paper.
Efficiency and cost
Our current inference speed (12.1s) is not significantly behind the fastest alternative CaR (about 5.3s), which suffers from low accuracy (35.2% vs. ours 74.6% on ReasonSeg). There is an obvious tradeoff between speed and accuracy. A gap of 6.8s is not significant enough to warrant using a more error-prone method. Ultimately, the errors must be corrected in an application and the time it would take to correct a segmentation mask most likely far exceeds 6.8s on average.
Finally, our current method has not been optimized for inference speed. By parallelizing candidate mask generation from SAM, reference text generation, candidate text generation, and the similarity comparison between the target and candidates, RESAnything's runtime can be reduced to about 3s (on multiple 4090 GPUs), as shown below: max(SAM, reference text generation) (2s) + candidate text generation (0.5s) + max(MLLM similarity computation, CLIP similarity computation) (0.5s)
This optimization leverages the fact that candidate mask generation and reference text generation can be executed in parallel, and similarity computations using MLLM and CLIP can also run concurrently. Furthermore, inference speed could be improved even more by replacing SAM with FastSAM for mask generation.
Handling special queries
“The black chair” is a good example. This actually reflects a general misunderstanding of referring expressions in this task. In linguistics, a referring expression is any noun phrase whose function in discourse is to identify some individual object. When an expression like "the black chair" is ambiguous and refers to multiple valid targets, both objects can legitimately be segmented, as the ambiguity lies in the expression itself rather than representing a failure case. However, for referring expressions that intentionally target multiple objects or parts (e.g., "standing people", "all people", or “all persons except the kid in white”), our proposed grouping and selection mechanism effectively handles these scenarios. We demonstrate this capability through evaluation on the GRES benchmark in our supplementary materials, Section 9.3, which specifically includes such multi-target cases and validates our method's robust performance on these linguistically complex expressions. Figure 16 in the supplementary materials shows several multi-part target examples; specifically, changing the expression from 'lady middle pink' to 'all players' correctly segments both male players in the image, and similarly, using 'two kids' instead of 'the mom' appropriately selects both kids. Our method's flexibility allows for different grouping settings and configurations, and we acknowledge that future work can further optimize the grouping mechanism to enhance multi-object performance.
I appreciate the authors’ effort to highlight the novelty of the proposed attribute prompting framework. However, as someone who is not deeply familiar with NLP, I would be grateful if the authors could help me better understand the distinction between prompt engineering and a systematic prompting framework. To me, both seem to involve designing ways to query large language models. If the authors could more clearly articulate why their framework goes beyond prompt design, it would greatly help me appreciate the contribution in terms of novelty.
Since the method is named RESAnything, it sets an expectation of broad capability, similar to how Segment Anything or Depth Anything have demonstrated performance across a wide range of downstream tasks. I appreciate the authors pointing out that their method is capable of handling affordance-based and multi-target queries, but I believe this capability would be better supported through quantitative results on such tasks or benchmarks (e.g., TOIST and embodied references). Even if the method does not outperform existing specialized models, seeing how close a generalist system can get is valuable information for the community. Generalist models often trade off some performance for flexibility, and that’s perfectly understandable.
Thank you for your feedback!
Let us first present quantitative results on affordance-based RES tasks as requested. Then we shall come back to your question on clarification of systematic prompting framework, since our answer to this question refers to the quantitative results.
We have conducted evaluation on the COCO-Tasks benchmark to validate our method's performance on affordance-based and action-grounded scenarios.
| Method | mask mIoU@0.5 (14 tasks average) |
|---|---|
| GGNN (baseline of the benchmark) [1] | 32.4 |
| TOIST (with distillation) | 44.1 |
| Taskclip [2] | 50.3 |
| CoTDet [3] | 56.9 |
| RESAnything | 51.2 |
| RESAnything with prompt optimization | 54.6 |
All compared methods are fully-supervised and fine-tuned on COCO-Tasks training data, while RESAnything operates in a zero-shot manner. As seen above, our method surpasses TOIST and TaskCLIP, demonstrating its strong generalization capability across affordance-based scenarios.
Why our framework goes beyond prompt design
Our core contribution is the attribute prompting and the comprehensive pipeline designed around it. This goes beyond simply designing prompts to encompass a complete multimodal framework for multi-metric mask proposal selection (Section 3.2), including comparison between reference and candidates (text-text similarity, text-mask similarity, grouping and selection). We refer to this integrated pipeline as our systematic prompting framework, which provides a generalizable architecture that operates consistently across diverse referring expression types, from simple object queries to complex part-level and affordance-based scenarios. Our ablation studies between standard prompts and attribute prompts demonstrate that the framework itself is the critical component. In addition, this framework is orthogonal to prompt design: when we incorporate task-specific prompt optimization, e.g., adding “human-object interaction” attributes, the performance improves from 51.2 to 54.6, approaching CoTDet. This demonstrates how prompt engineering can enhance the performance of our framework.
We are happy to refine our wording to better reflect our scope and contribution.
[1] Sawatzky, J., Souri, Y., Grund, C., & Gall, J. (2019). What object should i use?-task driven object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7605-7614).
[2] Chen, H., Huang, W., Ni, Y., Yun, S., Liu, Y., Wen, F., ... & Imani, M. (2024, September). Taskclip: Extend large vision-language model for task oriented object detection. In European Conference on Computer Vision (pp. 401-418). Cham: Springer Nature Switzerland.
[3] Tang, J., Zheng, G., Yu, J., & Yang, S. (2023). Cotdet: Affordance knowledge prompting for task driven object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 3068-3078).
Thank you for the detailed and thoughtful follow-up. The newly provided results on COCO-Tasks are particularly compelling — demonstrating that RESAnything achieves zero-shot performance that outperforms TOIST and TaskCLIP, and even approaches CoTDet, all without task-specific fine-tuning. This clearly shows strong generalization and practical utility in affordance-based and action-grounded scenarios, which I initially raised as a concern.
Given this evidence, I now believe this work has the potential to be highly impactful, especially as a generalist system in a field dominated by specialized models. I have updated my recommendation to 5: Accept. I am an expert on reference understanding and I care about this field and I argue for the acceptance of this paper.
Thanks again for the clarifications and additional experiments — they greatly helped clarify both the novelty and significance of the contribution.
Thank you so much for taking the time to read our paper and the responses. We are grateful for the improved score.
This paper introduces a zero-shot and training-free approach for referring expression segmentation by addressing the challenge of handling part-level and implicit expressions. The method leverages “attribute prompting” with an MLLM to generate detailed text descriptions for the referring expression and image segmentation proposals generated by SAM. The most appropriate segmentation mask is selected by a multi-stage mechanism that combines text-to-text and text-to-image comparisons using both MLLM-based binary decisions and CLIP scores as criteria. Experimental results show that the proposed method outperforms existing zero-shot methods on benchmark datasets and the introduced dataset focusing on part-level expressions.
Strengths and Weaknesses
[Strengths]
- The paper is overall well-written and easy to follow.
- The method is training-free, eliminating the requirement of extensive annotation costs.
- Experimental results show the superiority of the proposed method over existing zero-shot approaches.
[Weaknesses]
- The technical novelty and contribution of the paper are limited. The method primarily combines existing models like SAM and MLLM. The integration of these components is straightforward, and the major effort seems to be prompt engineering, which does not introduce significant conceptual novelty.
- Performance of the method highly depends on meticulously engineered prompts. Performance could vary significantly with different MLLMs or slight prompt modifications, limiting robustness and generalizability beyond the tested configurations.
- The proposal selection mechanism may struggle with targets comprising multiple objects. While the paper mentions combining masks for cases where a single semantic entity spans multiple segments, the overall selection process prioritizes a single mask with the highest CLIP score. Further clarification might be needed on how robustly it handles inherently multi-object targets that are not necessarily parts of a single semantic entity.
- Although superior to other zero-shot methods on standard RES benchmarks, the proposed method still shows a performance gap compared to recent supervised methods. This suggests that it might not be the top choice for object-level RES tasks where large objects are prevalent.
- The two-stage approach is computationally inefficient. Runtime of the method is much higher than existing approaches, as reported in the supplementary material.
- Details of mask filtering are missing. How are duplicate proposals filtered?
Questions
Please refer to the weaknesses section.
Limitations
Yes
Justification of Final Rating
I have read other reviewers’ comments and the authors’ rebuttal. While the rebuttal addresses some of my concerns, I still have reservations regarding the novelty and technical contribution of the work.
Formatting Issues
N/A
Dear reviewer T2sg, thank you for your thorough review and helpful suggestions! We appreciate you mentioning that our paper is well-written, the proposed training-free method saves the annotation cost for this task, and the experiments show promising results. We address your concerns regarding the identified weaknesses below.
Novelty, Robustness and Generalizability
For most problems covering diverse test cases, the biggest challenge is tackling the “last mile,” and RES is no exception. Our method is designed for, and excels at, tackling such a challenge under complex cases involving implicit queries and complex part-level relations. To achieve this, we propose the novel idea of attribute prompting, with the emphasis on reasoning about part attributes. Combining SAM and MLLM is not our novel claim; attribute prompting is! To further validate its effectiveness, we conducted an ablation comparing it against conventional prompting; the quantitative results on the ReasonSeg test set below highlight the contribution of our attribute prompting mechanism:
| Method | gIoU | cIoU |
|---|---|---|
| standard prompt | 50.8 | 49.3 |
| attribute prompt | 74.6 | 72.5 |
We show details of the standard prompt, attribute prompt, and corresponding generated descriptions in supplementary materials, Section 6.1 and Figure 9. In addition, we also contribute a challenging new dataset whose characteristics deviate significantly from RefCOCO, a dataset most prior methods may have overfit to. Our new dataset is also catered to real-world application scenarios, e.g., eCommerce.
Regarding robustness and generalizability, Table 5 demonstrates consistent performance across diverse MLLM backbones (Pixtral 12B, Claude 3.5, Qwen2-VL), validating cross-model reliability. While prompt sensitivity exists, this characteristic actually provides valuable flexibility rather than being a limitation: users can adapt the attribute specifications within prompts to optimize performance for specific domains or categories when building datasets from unlabeled data or deploying the method in applications. In addition, our training-free approach allows seamless integration of newer, more powerful MLLMs as they become available, providing immediate performance improvements without retraining.
For multi-object scenarios, our selection mechanism employs comprehensive decision-making analyzing both text-text and text-image similarity rather than single CLIP scores (detailed in supplementary Section 6.2). Section 6.2 and Table 6 in the supplementary further show ablation studies on the effectiveness of this multi-metric mechanism in improving robustness. In addition, we extensively evaluated multi-object performance on the GRES task (targets containing multiple objects), as shown in the supplementary materials, Section 9.3. The proposed grouping method actually demonstrates promising results among pretrained/zero-shot approaches, empirically confirming robust handling of multi-object scenarios. Figure 16 in the supplementary materials illustrates our capability for grouping and combining multiple parts of the same object. Additionally, our method successfully adapts to different expression granularities. For example, changing the expression from 'lady middle pink' to 'all players' correctly segments both male players in the image, and similarly, using 'two kids' instead of 'the mom' appropriately selects both kids. Our method's flexibility allows for different grouping settings and configurations, and we acknowledge that future work can further optimize the grouping mechanism to enhance multi-object performance.
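To illustrate the flavor of this multi-metric selection, the minimal sketch below fuses two MLLM-based verdicts with two CLIP-based similarities to rank candidate masks; the signal definitions, gating rule, and weights are illustrative assumptions rather than the exact formulation in Section 3.2, and the separate grouping step is omitted.

```python
# Illustrative sketch of multi-criteria mask selection; not the paper's exact scoring rule.
from dataclasses import dataclass

@dataclass
class Candidate:
    mask_id: int
    mllm_text_ok: bool      # assumed signal: MLLM judges candidate description vs. reference text
    mllm_region_ok: bool    # assumed signal: MLLM judges masked region vs. reference text
    clip_text_sim: float    # assumed signal: CLIP similarity, reference text vs. candidate text
    clip_region_sim: float  # assumed signal: CLIP similarity, reference text vs. masked image crop

def fused_score(c: Candidate, w_text: float = 0.5, w_region: float = 0.5) -> float:
    # Assumed fusion: hard MLLM verdicts gate the soft CLIP similarities.
    gate = 1.0 if (c.mllm_text_ok or c.mllm_region_ok) else 0.25
    return gate * (w_text * c.clip_text_sim + w_region * c.clip_region_sim)

def select_best(candidates: list[Candidate]) -> int:
    # Pick the single highest-scoring mask; multi-proposal grouping happens in a separate step.
    return max(candidates, key=fused_score).mask_id
```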
Performance gap with supervised methods
Regarding the performance gap with supervised methods, we acknowledge this limitation on the RefCOCO series, which represents a fundamental and well-established benchmark in the field. We recognize the continued importance of RefCOCO for evaluating core RES capabilities. However, we would like to highlight that our zero-shot approach is specifically designed to excel in scenarios where supervised methods face limitations, in particular when labeled training data is scarce or unavailable, or when dealing with complex referring expressions that extend beyond traditional benchmark coverage. While RefCOCO provides essential evaluation for common referring expressions involving humans and daily objects, our method demonstrates particular strength in handling sophisticated expressions involving explicit attributes, part-level descriptions, materials, and functionality. These areas represent emerging challenges in real-world applications. Notably, our zero-shot approach achieves state-of-the-art performance on ReasonSeg and our proposed ABO-ARES benchmark, demonstrating superior performance on the sophisticated referring expressions that represent genuine research challenges beyond the scope of traditional benchmarks. We believe the value of our contribution lies in providing an effective solution for domains and applications where extensive supervised training data is unavailable, while simultaneously advancing the field's capability to handle the increasingly complex referring expressions encountered in real-world scenarios beyond traditional benchmark settings.
Efficiency
Our current inference speed (12.1s) is not significantly behind the fastest alternative CaR (about 5.3s), which suffers from low accuracy (35.2% vs. ours 74.6% on ReasonSeg). There is an obvious tradeoff between speed and accuracy. A gap of 6.8s is not significant enough to warrant using a more error-prone method. Ultimately, the errors must be corrected in an application and the time it would take to correct a segmentation mask most likely far exceeds 6.8s on average.
Finally, our current method has not been optimized for inference speed. By parallelizing candidate mask generation from SAM, reference text generation, candidate text generation, and the similarity comparison between the target and candidates, RESAnything's runtime can be reduced to about 3s (on multiple 4090 GPUs), as shown below: max(SAM, reference text generation) (2s) + candidate text generation (0.5s) + max(MLLM similarity computation, CLIP similarity computation) (0.5s).
This optimization leverages the fact that candidate mask generation and reference text generation can be executed in parallel, and similarity computations using MLLM and CLIP can also run concurrently. Furthermore, inference speed could be improved even more by replacing SAM with FastSAM for mask generation.
Others
Regarding the mask filtering, we set a minimum mask region area threshold of 0.4% of the total image area to remove tiny boundary masks commonly produced by SAM. In addition, we apply a prediction confidence threshold of 0.9 to filter out low-confidence duplicate part segmentation masks, as SAM typically generates masks for both whole objects and their constituent parts with varying confidence levels.
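A minimal sketch of this filtering step is shown below, assuming the masks come from SAM's automatic mask generator as a list of dicts with "area" and "predicted_iou" fields (these field names follow the public SAM release and are treated here as assumptions, not the exact fields in the pipeline).

```python
# Sketch only: drops tiny boundary masks and low-confidence duplicate part masks.
def filter_proposals(masks, image_height, image_width,
                     min_area_frac=0.004, min_confidence=0.9):
    image_area = image_height * image_width
    kept = []
    for m in masks:  # each m is a dict in SAM's automatic mask generator format (assumed)
        if m["area"] < min_area_frac * image_area:
            continue  # smaller than 0.4% of the image: likely a tiny boundary fragment
        if m["predicted_iou"] < min_confidence:
            continue  # low-confidence mask: likely a duplicate part segmentation
        kept.append(m)
    return kept
```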
Dear reviewer T2sg,
Thank you for your thoughtful review of “RESAnything: Attribute Prompting for Arbitrary Referring Segmentation”. We truly appreciate the time and effort you invested, and we welcome further discussion of your feedback. We have submitted a detailed rebuttal addressing your six concerns, including technical novelty, robustness & generalizability, multi-object handling, the performance gap vs. supervised methods, computational efficiency, and mask filtering details.
We welcome any follow-up questions or suggestions to ensure we’ve fully resolved your concerns ahead of the final decision. Your early feedback would be invaluable to us.
Best,
Authors
I appreciate the authors’ detailed response. While the rebuttal addresses some of my concerns, I still have reservations regarding the novelty and technical contribution of the work.
In the ablation study on prompting, what exactly constitutes the “standard prompt”? The “attribute prompt” provided in the supplementary material appears to be carefully engineered. Was the same level of effort applied to designing the standard prompt for a fair comparison?
Although the authors present results using different MLLMs, I remain concerned about the robustness to slight modifications in prompts. While the authors frame this as flexibility and highlight the ability of users to adapt prompts for specific domains, such adaptation may require substantial trial-and-error. This reliance on manual prompt tuning could limit the method’s usability and generalizability in practice.
Thank you for this important clarification question.
For the standard prompt, we keep all other prompt engineering components from the attribute prompt but replace attribute-related instructions with general ones, asking the MLLM to directly identify the target in the image. Figure 9 in the supplementary material shows an example of replacing the attribute-related prompt with the standard prompt for reference text generation, where the rest of the prompt is kept the same. This modification is applied to all prompts used in our system. We can provide the full prompts in the revision.
As shown in Figure 9, when attribute prompting is not used, the MLLM fails to accurately identify and reason about input expression attributes. From the quantitative comparison we provided in the rebuttal, the significant performance gap (74.6% vs 50.8% gIoU) demonstrates that our structured attribute decomposition fundamentally improves the model's reasoning capability for complex referring expressions.
We understand the reviewer’s concern about robustness. While it is true that task-specific prompt optimization will lead to performance improvement, our method can be applied to different RES tasks without modification, yet still achieve competitive performance. We use the same set of prompts across all datasets evaluated in the paper. We also conducted additional experiments on a task that we did not pick ourselves (requested by DVej): affordance-based RES (COCO-Tasks):
| Method | mask mIoU@0.5 (14 tasks average) |
|---|---|
| GGNN (baseline of the benchmark) [1] | 32.4 |
| TOIST (with distillation) [2] | 44.1 |
| Taskclip [3] | 50.3 |
| CoTDet [4] | 56.9 |
| RESAnything | 51.2 |
| RESAnything with prompt optimization | 54.6 |
Even without any text or visual prompt optimization, our method still achieves competitive performance on this new task. When we incorporate task-specific prompt optimization, performance improves from 51.2 to 54.6, approaching the fully-supervised CoTDet method. This demonstrates that while our method is robust enough to perform well without domain-specific tuning, prompt adaptation can boost performance further. The attribute-centric design of our system also makes adaptation simple: for this task, we simply add "human-object interaction" attributes to both the reference text generation and similarity analysis prompts. We agree that further reducing reliance on manual prompt tuning is an important future research direction, and we will conduct systematic studies to provide clearer guidelines for task-specific prompt adaptation.
[1] Sawatzky, J., Souri, Y., Grund, C., & Gall, J. (2019). What object should i use?-task driven object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7605-7614).
[2] Li, P., Tian, B., Shi, Y., Chen, X., Zhao, H., Zhou, G., & Zhang, Y. Q. (2022). Toist: Task oriented instance segmentation transformer with noun-pronoun distillation. Advances in Neural Information Processing Systems, 35, 17597-17611.
[3] Chen, H., Huang, W., Ni, Y., Yun, S., Liu, Y., Wen, F., ... & Imani, M. (2024, September). Taskclip: Extend large vision-language model for task oriented object detection. In European Conference on Computer Vision (pp. 401-418). Cham: Springer Nature Switzerland.
[4] Tang, J., Zheng, G., Yu, J., & Yang, S. (2023). Cotdet: Affordance knowledge prompting for task driven object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 3068-3078).
Thank you for the clarifications and additional results. My concern remains that the ablation may not fully isolate the effect of attribute prompting from the broader engineered prompt structure, and that performance could still be sensitive to small, unintended prompt variations in practice. I encourage the authors to explore more controlled baselines and systematic robustness evaluations in future work.
Thank you for the feedback. We agree that exploring prompt robustness would be valuable future work. We still hope that our effort and results in addressing your concerns could tip the balance in your overall rating of our work for NeurIPS. Thanks again for all the attention to details!
Thank you to all reviewers for your thorough evaluations and constructive feedback throughout the discussion period. We greatly appreciate the time and effort invested in reviewing our work and engaging in the detailed discussions that have strengthened our paper.
We are encouraged that reviewers recognized our training-free approach's substantial zero-shot performance gains, novel attribute prompting framework, valuable ABO-Image-ARES benchmark contribution, and comprehensive experimental validation.
Regarding the weaknesses, reviewers raised important questions about technical novelty, robustness, and computational efficiency. We addressed most concerns by demonstrating through ablation studies that our attribute prompting framework achieves significant performance improvements (74.6% vs 50.8% gIoU) over standard approaches, validating consistent performance across diverse MLLM backbones and on affordance-based tasks, and showing that parallelization optimizations can significantly reduce inference time. We also clarified our multi-object handling capabilities and mask filtering mechanisms.
We commit to incorporating all experiments, clarifications, and technical details discussed in this rebuttal into the final version. We are excited to pursue the valuable future research directions suggested by reviewers, particularly investigating adaptive proposal generation mechanisms, conducting systematic robustness evaluations, and exploring more controlled baselines for prompt robustness.
Best regards,
Authors
This paper proposes RESAnything, a zero-shot and open-vocabulary framework for arbitrary referring expression segmentation (RES). Unlike prior work that primarily addresses object-level queries, the method extends to part-level labels and implicit references such as function, design, style, or material. The key idea is to employ Chain-of-Thought (CoT) reasoning with attribute prompting, where an LLM generates detailed attribute descriptions to refine segment proposals from a foundational segmentation model.
The paper is clearly written and addresses an important problem, pushing RES beyond conventional settings. Results show state-of-the-art performance among zero-shot methods and strong improvements on challenging cases involving implicit queries and complex part-level relations. In addition, the authors release a new benchmark dataset, which is a meaningful contribution for future research.
Nonetheless, reviewers may question the degree of technical novelty, as the method relies heavily on pretrained LLMs and segmentation backbones, with the primary contribution being the prompting strategy. The robustness of attribute prompting also remains uncertain, as performance could be sensitive to prompt design or small variations.
Overall, the strong empirical results, the broader problem formulation, and the dataset contribution outweigh the concerns about novelty. Therefore, the AC recommends Accept.