PaperHub
Overall score: 6.8/10
Poster · 4 reviewers
Ratings: 5, 4, 4, 4 (lowest 4, highest 5, average 4.0, std. dev. 0.4)
Confidence: –
Novelty: 3.0 · Quality: 3.0 · Clarity: 3.0 · Significance: 3.0
NeurIPS 2025

UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning

OpenReview · PDF
Submitted: 2025-04-05 · Updated: 2025-10-29

Abstract

Keywords
Large Multimodal Models · Pixel-Level Reasoning

Reviews and Discussion

Review

Rating: 5

This paper proposes a model called UniPixel, which is designed for pixel-level visual reasoning tasks. The model has the ability to flexibly comprehend visual prompt inputs (such as points, boxes, and masks) and generate mask-grounded responses. The motivation is clear. The writing is good. Extensive experiments are conducted to validate the effectiveness of the proposed method.

Strengths and Weaknesses

[Strengths]

  • UniPixel introduces an object memory bank, which effectively integrates the internal representations of referred and segmented objects. This allows the model to perform object referring and segmentation tasks more effectively than previous methods.
  • The authors introduce a new PixelQA task that requires joint object-centric referring, segmentation, and QA in videos.

[Weaknesses]

  • The main contribution of this paper is constructing a model that can perform both referring and segmentation tasks. However, I think the authors overstate their claim here. VisionLLM v2 can perform both referring and segmentation tasks. It also has a region encoder, which can encode the input prompt.
  [1] Wu J, Zhong M, Xing S, et al. VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks. Advances in Neural Information Processing Systems, 2024, 37: 69925-69975.
  • In the Object Memory Bank, memory pre-filling requires estimating the object mask. Therefore, two forward passes are needed for one question (one for mask prediction and one for answer generation), which increases the computational complexity. In addition, how is the model trained with this mechanism?

  • The authors compare many methods in the experiments. It is important to point out the base model used by each method. For example, LISA uses LLaMA, while the proposed method uses Qwen2, which is a more powerful model.

  • What is the performance of the proposed model on the general VQA tasks, such as VQAv2, MME?

Questions

Please refer to [Weaknesses].

Limitations

Please refer to [Weaknesses].

Final Justification

The performance is good. The method is well-motivated.

Formatting Issues

N/A

Author Response

Thank you for your careful review and for the insightful feedback! We now provide detailed responses to address your concerns.

Q1: VisionLLM v2 [1] can also perform both referring and segmentation tasks.

Thanks for highlighting this important baseline! We clarify that our statement of "Unified" refers to the unification of (1) referring & segmentation and (2) processing images & videos. Although VisionLLM v2 [1] can also jointly perform object referring and segmentation, it is an image-only model and thus cannot achieve pixel-level video understanding. We will carefully revise the relevant statements in our paper to ensure no overclaim. Below, we also provide performance comparisons between VisionLLM v2 and UniPixel.

Comparison on Referring Expression Segmentation (RES) and Reasoning Segmentation tasks:

| Method | Base Model | RefCOCO val | RefCOCO testA | RefCOCO testB | RefCOCO+ val | RefCOCO+ testA | RefCOCO+ testB | RefCOCOg val(U) | RefCOCOg test(U) | ReasonSeg gIoU |
|---|---|---|---|---|---|---|---|---|---|---|
| VisionLLM v2 [1] | Vicuna-7B | 76.6 | 79.3 | 74.3 | 64.5 | 69.8 | 61.5 | 70.7 | 71.2 | 51.0 |
| UniPixel | Qwen2-VL-2B | 76.1 | 79.2 | 72.3 | 67.5 | 72.2 | 60.6 | 70.6 | 71.4 | 59.6 |
| UniPixel | Qwen2-VL-7B | 79.5 | 81.1 | 76.2 | 72.5 | 77.0 | 66.6 | 74.4 | 75.3 | 63.0 |

Comparison on Referring Expression Comprehension (REC) task:

| Method | Base Model | RefCOCO val | RefCOCO testA | RefCOCO testB | RefCOCO+ val | RefCOCO+ testA | RefCOCO+ testB | RefCOCOg val(U) | RefCOCOg test(U) |
|---|---|---|---|---|---|---|---|---|---|
| VisionLLM v2 [1] | Vicuna-7B | 87.9 | 91.2 | 84.3 | 77.6 | 83.8 | 70.2 | 82.9 | 84.1 |
| UniPixel | Qwen2-VL-2B | 89.0 | 92.1 | 86.8 | 81.9 | 86.9 | 76.5 | 85.3 | 85.7 |
| UniPixel | Qwen2-VL-7B | 91.5 | 93.7 | 87.9 | 85.9 | 90.9 | 80.8 | 87.7 | 88.6 |

Results in both tables demonstrate that UniPixel significantly outperforms VisionLLM v2 in RES, ReasonSeg, and REC tasks. We will add the above comparisons into Table 3 and Appendix Table 4, and include more detailed discussions in Sec. 2 (Related Work).

Q2: The two-stage inference strategy increases the computational complexity.

Thanks for mentioning this important point! We agree that the object memory bank design inevitably introduces additional latency. However, we clarify that:

  • This design affects efficiency only when the inputs contain points or boxes. Scenarios involving masks as visual prompts (e.g., region captioning) or segmentation-only tasks (e.g., referring video object segmentation) remain unaffected.
  • Predicting masks for point and box prompts before providing responses can help better interpret the model’s underlying reasoning process.
  • This is a reasonable trade-off considering the improved region-level understanding performance. Moreover, the extra computational overhead can be effectively minimized using proper inference techniques.

Below we compare the efficiency and performance of different strategies on PixelQA (mixed). The inference speed was tested on a single RTX 6000 Ada GPU with Flash Attention 2.

| ID | Object Memory Bank | Vision Encoder Cache | Inference Time (s/video) | J | F | J&F | Acc |
|---|---|---|---|---|---|---|---|
| A | ✗ | – | 0.39 | 45.8 | 50.6 | 48.2 | 67.4 |
| B | ✓ | ✗ | 0.81 | 46.4 | 51.7 | 49.0 | 68.5 |
| C | ✓ | ✓ | 0.52 | 46.4 | 51.7 | 49.0 | 68.5 |
  • A vs. B: Adopting the object memory bank can effectively enhance both segmentation and regional understanding capabilities. This is due to the disentanglement of object localization and regional understanding, allowing both to benefit from training on related tasks (e.g., referring object segmentation and region-based QA).
  • B vs. C: Given that the two inference stages share the same visual inputs, the vision encoder outputs and the KV cache for these visual tokens can be cached and reused across both stages. This technique improves the inference speed by 56%, bringing it very close to that of single-pass inference.
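To make the caching idea concrete, here is a rough sketch of two-stage inference with a shared visual context. The interface (`encode_video`, `prefill_masks`, `generate_answer`, and the cache objects) is a hypothetical placeholder, not UniPixel's actual API.

```python
from typing import Any, Protocol


class PixelModel(Protocol):
    """Assumed interface of a two-stage referring/segmentation model."""
    def encode_video(self, video: Any) -> tuple[Any, Any]: ...
    def prefill_masks(self, question: str, vis: Any, past_kv: Any) -> Any: ...
    def generate_answer(self, question: str, vis: Any, masks: Any, past_kv: Any) -> str: ...


def two_stage_inference(model: PixelModel, video: Any, question: str) -> tuple[Any, str]:
    # Encode the video once; both stages see the same visual tokens, so the
    # vision-encoder outputs and their KV cache are computed a single time.
    visual_tokens, visual_kv_cache = model.encode_video(video)

    # Stage 1 (memory pre-filling): predict masks for the visual prompts.
    masks = model.prefill_masks(question, visual_tokens, past_kv=visual_kv_cache)

    # Stage 2 (memory injection): answer the question, reusing the cached
    # visual tokens and KV states instead of re-encoding the video.
    answer = model.generate_answer(question, visual_tokens, masks, past_kv=visual_kv_cache)
    return masks, answer
```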

In addition, how is the model trained with this mechanism?

We adopt specialized training samples and design a rule-based pipeline to achieve this mechanism:

  • During training, as shown in Appendix Table 1, we repurposed video QA samples from VideoRefer-700K [2] and Inst-IT [3] as memory pre-filling samples, in which the model is expected to predict masks for a question with visual prompts. An example is shown below.
    Input: What are [1] <REF> and [2] <REF> doing in the video?
    Output: The relevant regions in the question are [1] <SEG> [2] <SEG>.
  • During inference, when <REF> tokens are detected in the input prompt, we manually construct the corresponding model response using the output template above, ensuring it contains the same number of <SEG> tokens. We then apply teacher forcing to guide the model in predicting masks for these visual prompts. The predicted masks are subsequently used to construct <MEM> tokens to enable regional understanding.
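For concreteness, below is a minimal sketch of this rule-based template construction at inference time. The function name `build_prefill_response` and the token strings are illustrative assumptions, not the released implementation.

```python
import re

REF_TOKEN, SEG_TOKEN = "<REF>", "<SEG>"


def build_prefill_response(prompt: str) -> str:
    """Build the teacher-forced response used for memory pre-filling.

    One <SEG> token is emitted per <REF> token detected in the input prompt,
    mirroring the output template shown above.
    """
    num_refs = len(re.findall(re.escape(REF_TOKEN), prompt))
    if num_refs == 0:
        return ""  # no visual prompts, so no pre-filling pass is needed
    slots = " ".join(f"[{i + 1}] {SEG_TOKEN}" for i in range(num_refs))
    return f"The relevant regions in the question are {slots}."


# Example
print(build_prefill_response("What are [1] <REF> and [2] <REF> doing in the video?"))
# -> The relevant regions in the question are [1] <SEG> [2] <SEG>.
```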

We will carefully revise Sec. 3.2 to provide a clearer explanation of how the object memory bank works.

Q3: It is important to point out the base model used by the method.

Good suggestion! We will revise the Size column to Base Model in all the tables to clearly indicate the underlying LLMs/MLLMs of different methods.

Q4: What is the performance of the proposed model on the general VQA tasks, such as VQAv2, MME?

We compare UniPixel with a strong counterpart Sa2VA and its baseline Qwen2-VL on these benchmarks below. The evaluations were conducted using lmms-eval.

| Method | Base Model | VQAv2 testdev | MME (perception/cognition) |
|---|---|---|---|
| Sa2VA | InternVL2-4B | – | 1553/540 |
| Qwen2-VL-2B | – | 61.7 | 1495/366 |
| Qwen2-VL-7B | – | 78.0 | 1688/628 |
| UniPixel | Qwen2-VL-2B | 61.9 | 1564/580 |
| UniPixel | Qwen2-VL-7B | 77.5 | 1695/643 |

The comparison shows that UniPixel outperforms Sa2VA while remaining comparable to Qwen2-VL on general image QA tasks, demonstrating the effectiveness of the proposed fine-grained understanding designs.

We hope our responses above can address your concerns. More discussions are welcome if you have any further questions. Thank you!

References:

[1] VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks, NeurIPS 2024.

[2] VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM, CVPR 2025.

[3] Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning, arXiv 2024.

Comment

Thanks for the reply. My concerns are solved. I highly recommend that the authors revise their paper carefully in the final version.

Comment

Thank you for your quick feedback! We are glad to hear that your concerns have been solved. We will revise the paper accordingly and include more in-depth discussions in the final version. Thanks again for spending time reviewing our paper!

Review

Rating: 4

Large multimodal models (LMMs) have achieved remarkable success as general-purpose multimodal assistants. However, the authors point out that existing LMMs have not yet been extended to pixel-level understanding. While some recent works have applied LMMs to tasks such as region-level captioning and referring expression segmentation, these tasks do not fully incorporate fine-grained perceptual capabilities.

To address this gap, the authors propose UniPixel, a unified large multimodal model that seamlessly integrates pixel-level perception with general visual understanding. The model demonstrates strong performance not only across multiple baselines but also on a newly introduced benchmark task, PixelQA, highlighting its ability to perform detailed and comprehensive multimodal reasoning.

Strengths and Weaknesses

Strengths

  1. This work is well-structured and easy to follow.

  2. UniPixel can unify the internal representations through the designed object memory bank. The effectiveness of the model is validated through extensive ablation studies.

  3. The authors demonstrate the superiority of the proposed model across numerous benchmark datasets, as well as on the newly introduced task, PixelQA.

Weaknesses

  1. Most of the LMMs compared in this work are 2B or 7B in scale. While the proposed model demonstrates superior performance compared to similarly sized models, it remains unclear whether it can consistently outperform larger-scale LMMs. Comparing against such larger models would strengthen the credibility of the claimed superiority and provide a more comprehensive evaluation of the proposed approach.

  2. The authors could consider including the GPU requirements for training the proposed model. This information would be valuable for researchers who wish to reproduce the work and better understand the computational resources needed.

  3. The authors have not released the code or the pretrained weights for community use.

Questions

Please address the weaknesses.

Limitations

yes

Final Justification

The authors' response has addressed my major concerns. I still maintain a positive attitude toward this paper.

Formatting Issues

No

Author Response

Thank you for recognizing the value of our work and for providing the constructive feedback! Below we provide detailed responses to your concerns and suggestions.

Q1: Whether the proposed model can consistently outperform larger-scale LMMs?

Thanks for the insight! We agree that comparisons with larger models enable a more comprehensive evaluation of the proposed method. In our work, we have already included models larger than 7B such as LISA-13B [1], VISA-13B [2], M2SA-13B [3], and InternVL-26B [4]. Below, we supplement our analysis with additional larger models.

| Method | LLM Size | ReVOS val | MeViS val(u) | Ref-DAVIS17 val | VideoRefer^Q | PixelQA mixed |
|---|---|---|---|---|---|---|
| GPT-4o | – | – | – | – | 71.3 | 69.3 |
| Qwen2-VL [5] | 72B | – | – | – | 70.8 | 69.1 |
| Sa2VA [6] | 26B | 58.4 | 57.3 | 77.0 | – | – |
| UniPixel | 2B | 60.2 | 59.8 | 74.5 | 72.5 | 68.5 |
| UniPixel | 7B | 63.8 | 59.9 | 76.1 | 76.4 | 69.9 |

Compared with larger models such as GPT-4o, Qwen2-VL-72B, and Sa2VA-26B, the segmentation and referring capabilities of UniPixel-7B are still competitive, as demonstrated by the SOTA performance on ReVOS, MeViS, VideoRefer^Q, and PixelQA datasets.

Q2: The authors could consider including the GPU requirements for training the proposed model.

Thanks for the suggestion! We trained our model with 8 RTX 6000 Ada (48GB) GPUs. Training the 2B and 7B variants took around 2.5 days and 4 days, respectively. This information will be included in our revision.

Q3: The authors have not released the code or the pretrained weights for community use.

We clarify that all the code, model checkpoints, training logs, and online demos for this project will be fully open-sourced upon acceptance of the paper. We also welcome and will support the community to develop stronger unified referring and segmentation models based on our codebase.

We hope our responses above can address your concerns. More discussions are welcome if you have any further questions. Thank you!

References:

[1] LISA: Reasoning Segmentation via Large Language Model, CVPR 2024.

[2] VISA: Reasoning Video Object Segmentation via Large Language Models, ECCV 2024.

[3] MMR: A Large-scale Benchmark Dataset for Multi-target and Multi-granularity Reasoning Segmentation, ICLR 2025.

[4] InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks, CVPR 2024.

[5] Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution, arXiv 2024.

[6] Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos, arXiv 2025.

Comment

Dear Reviewer NW7Z,

Thank you again for your insightful review!

Since the rebuttal period is halfway through, we would greatly appreciate it if you could kindly review our response to ensure it adequately addresses your concerns.

Thank you for your time and consideration.

Sincerely,

Authors of Paper 394

Comment

I appreciate the responses from the authors. Most of my concerns have been addressed. However, I still have one question regarding the authors’ statement that all code will be released after the paper is officially published. Shouldn’t we release the code as soon as the work is completed to help accelerate progress in this field? Isn’t that, after all, one of the main purposes of arXiv? Why must we wait until official publication? The field of large language models evolves extremely rapidly; by the time the code is released, the research may no longer be state-of-the-art and thus may provide limited value to the community.

Comment

Dear Reviewer NW7Z,

Thank you for the question regarding code release.

We agree that it is better to upload the arXiv paper and release the code soon, rather than waiting until official publication. We postponed this process because we have been upgrading the base model from Qwen2-VL to Qwen2.5-VL and training two stronger versions (3B and 7B) with much more data (e.g., GranDf, VideoGCG, Osprey, DAM). The training of our new 7B model was just completed last night, and we will make it publicly available very soon.

As MLLM researchers, we totally understand how rapidly the field is evolving, and that is why we tried to offer stronger models to the community. Aside from the code and models, we are also preparing the multi-source training datasets and organizing them into a single Hugging Face repo to help followers prepare the data faster.

We appreciate your thoughtfulness toward the community at large, which reflects our commitment as well. Thank you for your understanding!

Sincerely,

Authors of Paper 394

Review

Rating: 4

This paper proposes UniPixel, a large multimodal model (MLLM) designed to address the limitations of existing models in achieving pixel-level fine-grained understanding when processing images and videos. The authors present the UniPixel framework: a unified model capable of flexibly interpreting various visual prompts (points, boxes, masks), which inherently combines referring, segmentation, and reasoning capabilities. Through extensive experiments conducted across 10 different public benchmark datasets, UniPixel achieves state-of-the-art performance on seven distinct tasks, including referring/video object segmentation and region-based question answering.

Strengths and Weaknesses

Strengths:

  1. The Object Memory Bank is a novel and effective innovation. Its two-stage "pre-fill → inject" reasoning paradigm decouples segmentation/localization from high-level reasoning while ensuring the latter fully leverages the former’s fine-grained outputs.

  2. The work successfully integrates referring object segmentation, video QA, and region-based reasoning into a unified, end-to-end model. Experiments demonstrate state-of-the-art (SOTA) performance across multiple tasks. The paper is well-structured, logically coherent, and highly readable.

Weaknesses:

  1. UniPixel’s strong performance may largely stem from its powerful base model (e.g., Qwen2-VL). The paper inadequately discusses the contribution of this foundational model. For example, whether comparable results could be achieved using only Qwen2-LLM.

  2. UniPixel’s design requires two forward passes per inference, introducing notable latency. The authors neither justify this design choice nor analyze its impact on inference efficiency.

Questions

  1. The authors propose the Object Memory Bank, but it requires a two-stage inference process (first extracting mask features, then semantic information). What are the advantages of this design compared to simply plugging in an external segmentation model? Given that the two-stage approach inevitably increases computational overhead and inference latency, how did you balance this trade-off? Were there attempts to achieve effective single-pass inference?

  2. At a high level, UniPixel’s framework resembles Draw-and-Understand [1], with three encoders and three token types for modeling. What are the key architectural innovations or distinctions here?

  3. For referring segmentation and referring understanding tasks, is the observed mutual improvement during training solely attributable to UniPixel’s decoupled design? Would this conclusion hold with a single one-turn inference model? (Intuitively, these are fundamentally distinct tasks. The finding that they can mutually enhance each other is intriguing)

[1] Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want, ICLR 2025

Limitations

yes

Final Justification

The authors have adequately addressed most of my concerns. I recommend acceptance, and they should incorporate the related discussion in the final version.

Formatting Issues

No

Author Response

Many thanks for your careful review and insightful comments! We are encouraged by your recognition of our technical novelty, experiments, and writing. Below we provide responses to your concerns in detail.

Q1: More Ablation Studies on the Base Model

Thanks for your suggestion! We provide the following ablation study to clarify the attribution of UniPixel's performance gains. The evaluation metrics are J&F (RVOS datasets) and Acc (VideoRefer^Q).

| ID | Method | Base Model | Mask Head | ReVOS val | MeViS val(u) | Ref-DAVIS17 val | VideoRefer^Q |
|---|---|---|---|---|---|---|---|
| A | Sa2VA | InternVL2.5-4B | SAM2 (Large) | – | 55.9 | 73.7 | – |
| B | UniPixel | InternVL2.5-4B | SAM2 (Large) | 60.9 | 61.2 | 74.6 | 72.5 |
| C | UniPixel | SigLIP-SO-400M + Qwen2-1.5B | SAM2.1 (Base+) | 58.6 | 57.5 | 73.5 | 71.1 |
| D | UniPixel | Qwen2-VL-2B | SAM2.1 (Base+) | 60.2 | 59.8 | 74.5 | 72.5 |
| E | UniPixel | Qwen2.5-VL-3B | SAM2.1 (Base+) | 61.4 | 59.8 | 74.1 | 72.8 |

Several conclusions can be drawn from the above results:

  • A vs. B: When using the same base model InternVL2.5-4B and mask head SAM2 (Large), UniPixel clearly outperforms Sa2VA on MeViS (+9.4%) and Ref-DAVIS17 (+1.2%), suggesting that the performance gains of UniPixel largely come from architectural innovations and learning strategies rather than simple backbone/module upgrades.
  • C vs. D: Following your suggestion (whether comparable results could be achieved using only Qwen2-LLM), we trained an MLLM with SigLIP-SO-400M + Qwen2-1.5B with the same training recipe as VideoGPT+ [2], and adopted it as the base model. The results show that slightly lower but comparable results could still be achieved by a weaker base model. Notably, this variant remains competitive with Sa2VA (InternVL2.5-4B) on MeViS (+2.9%) and Ref-DAVIS17 (-0.3%).
  • D vs. E: We also tried to upgrade the base model from Qwen2-VL-2B to Qwen2.5-VL-3B. It improves the performance on ReVOS and VideoRefer^Q, but slightly downgrades the results on Ref-DAVIS17.

More detailed discussions on the contributions of base models will be included in our revision.

Q2: Inference Efficiency when Using Object Memory Bank

Thanks for bringing up this important point! We agree that the object memory bank design inevitably introduces additional latency. However, we clarify that:

  • This design affects efficiency only when the inputs contain points or boxes. Scenarios involving masks as visual prompts (e.g., region captioning) or segmentation-only tasks (e.g., referring video object segmentation) remain unaffected.
  • Predicting masks for point and box prompts before providing responses can help better interpret the model’s underlying reasoning process.
  • This is a reasonable trade-off considering the improved region-level understanding performance. Moreover, the extra computational overhead can be effectively minimized using proper inference techniques.

Below, we compare the efficiency and performance of different strategies on PixelQA (mixed). The inference speed was tested on a single RTX 6000 Ada GPU with Flash Attention 2.

| ID | Object Memory Bank | Vision Encoder Cache | Inference Time (s/video) | J | F | J&F | Acc |
|---|---|---|---|---|---|---|---|
| A | ✗ | – | 0.39 | 45.8 | 50.6 | 48.2 | 67.4 |
| B | ✓ | ✗ | 0.81 | 46.4 | 51.7 | 49.0 | 68.5 |
| C | ✓ | ✓ | 0.52 | 46.4 | 51.7 | 49.0 | 68.5 |

What are the advantages of this design compared to simply plugging in an external segmentation model?

Adopting the object memory bank forces the model to predict masks (and leverage them to crop visual features) before actual reasoning. This enables the disentanglement of object localization and regional understanding, allowing both to benefit from training on related tasks (e.g., referring object segmentation and region-based QA). The comparison between A and B in the above table demonstrates that this design can effectively enhance both segmentation and regional understanding capabilities.

Given that the two-stage approach inevitably increases computational overhead and inference latency, how did you balance this trade-off?

The extra computational overhead can be minimized using proper inference techniques. For example, given that the two inference stages share the same visual inputs, the vision encoder outputs and the KV cache for these visual tokens can be cached and reused across both stages. As shown in the above table (B vs. C), this technique improves the inference speed by 56%, bringing it very close to that of single-pass inference.

Were there attempts to achieve effective single-pass inference?

Yes. In fact, A in the above table represents our simplified implementation of segmentation + QA. This is achieved by manually appending a <SEG> token after each visual prompt token. During inference, we decode these <SEG> tokens using the mask head to obtain object masks corresponding to the visual prompts, while treating the LLM outputs as textual responses. However, this paradigm does not allow the model to "re-watch" key objects by mask-pooling important regions, leading to sub-optimal performance in both segmentation and QA.

Q3: Key Architectural Innovations or Distinctions Compared to Draw-and-Understand [1]

Thanks for highlighting this highly related work! We acknowledge its contribution to visual prompt understanding, and will include more discussions in Sec. 2 (Related Work). However, UniPixel is significantly different from their proposed VP-MLLM [1] in several aspects.

| Method | Image Input | Video Input | Visual Prompt Input | Mask Output |
|---|---|---|---|---|
| VP-MLLM [1] | ✓ | ✗ | ✓ (Single Frame) | ✗ |
| UniPixel | ✓ | ✓ | ✓ (At Any Frames) | ✓ |
  • UniPixel supports both image and video inputs, whereas VP-MLLM can only perceive images.
  • Through joint positional and temporal encoding, UniPixel can accept visual prompt inputs at any frames, while VP-MLLM is limited to prompting on only one frame.
  • UniPixel integrates referring and segmentation into a unified framework, while VP-MLLM is a referring-only model and cannot perform segmentation tasks.

Conceptually, UniPixel can be regarded as a superset of VP-MLLM, as it supports unified object referring and segmentation in both images and videos.

Q4: The Mutual Improvement of Referring and Segmentation Tasks

Good question! We also thought that these are fundamentally different tasks, but the experimental results in Table 7(a) demonstrate that they can be mutually improved through multi-task co-training. We hypothesize that this is because training on both tasks helps the model focus more on fine-grained semantics rather than solely on global information. Further investigation into this phenomenon is left for our future work.

Is the observed mutual improvement during training solely attributable to UniPixel's decoupled design?

No. It is also affected by the ratio of the two dataset types. During model training, the proportions of samples for regional understanding and object segmentation were 32.3% and 33.9%, respectively, which is approximately a 1:1 ratio. We observe that the mutual improvement effect disappears when the data distribution is imbalanced.

Would this conclusion hold with a single one-turn inference model?

Yes. According to Table 7(a) (the 3rd line), training on both tasks without the object memory bank can also jointly improve segmentation and regional QA performance.

We hope our responses above can address your concerns. More discussions are welcome if you have any further questions. Thank you!

References:

[1] Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want, ICLR 2025.

[2] VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding, arXiv 2024.

Comment

Dear Reviewer D9pB,

Thank you again for your valuable comments!

Since the rebuttal period is halfway through, we would greatly appreciate it if you could kindly review our response to ensure it adequately addresses your concerns.

Thank you for your time and consideration.

Sincerely,

Authors of Paper 394

Comment

Since the authors have resolved most of my questions, I've decided to retain my current score.

Comment

Dear Reviewer D9pB,

Thank you for the feedback! We are pleased to know that most of your questions have been resolved.

Could you please kindly indicate the remaining points that require further clarification, or share any further suggestions for improving our paper? We will be more than happy to assist further and hope that, once you are satisfied, you might consider raising your score. Many thanks!

Sincerely,

Authors of Paper 394

Review

Rating: 4

This paper proposes UniPixel, a unified multi-modal model that jointly handles object referring and segmentation across images and videos. A key innovation is the Object Memory Bank, which enables fine-grained, region-specific reasoning by storing spatial-temporal object features. The model supports diverse visual prompts and introduces a new PixelQA task combining segmentation and question answering. Built on Qwen2-VL with tailored encoders and a SAM-based decoder, UniPixel achieves state-of-the-art results on 10 benchmarks, outperforming larger models.

Strengths and Weaknesses

Overall, the work is novel, well-motivated, and advances pixel-level vision-language understanding in a unified and efficient manner.

Major Weakness:

I have some concerns regarding the reported performance gains. It appears that UniPixel may largely build upon incremental improvements over Sa2VA, particularly due to two architecture upgrades: the transition from SAM2 to SAM2.1 and from InternVL2.5 to Qwen2-VL. These changes alone could contribute significantly to the performance boost. Furthermore, it is unclear why Qwen2.5-VL, a more recent and potentially stronger backbone, was not adopted. While I acknowledge the importance of engineering innovations, it is essential to disentangle how much of the performance gain comes from the proposed architectural design versus backbone/module upgrades. Unfortunately, the current version of the paper does not provide sufficient ablations or analysis to clarify this.

Minor Weakness:

  1. In Table 3, under the RefCOCO benchmark, the comparison with non-LLM-based methods is somewhat limited. I suggest including more recent and competitive baselines such as C3VG [1], which represents a strong non-LLM approach to multi-task visual grounding.
  2. Similarly, in Appendix Table 5, which compares REC-related methods, it would strengthen the empirical evaluation to include recent non-LLM SOTA methods like OneRef [2] and SimVG [3], both published at NeurIPS 2024. These additions would provide a more comprehensive comparison and better contextualize the performance of UniPixel.

[1] Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints, AAAI 2025.

[2] OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling, NeurIPS 2024.

[3] SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion, NeurIPS 2024.

Questions

  1. In Lines 152–153, for objects [1]–[4], are the <MEM> tokens constructed using a single <SEG> token decoded by SAM2 to obtain masks across all frames? How is the multi-frame memory for each object built from a single <SEG> token? Additionally, how does the model determine that <4><MEM> does not exist in the last frame?

Limitations

yes

Final Justification

I appreciate the authors’ detailed responses. After reviewing the other reviewers’ suggestions, I concur with Reviewer D9pB’s concern regarding the time consumption of the two-stage inference process, which also aligns with my own reservations.

Additionally, I believe the description of UniPixel’s memory bank mechanism requires further elaboration to enhance clarity.

At this stage, I maintain my initial assessment as a borderline accept.

Formatting Issues

Not Involved

Author Response

We sincerely thank you for the detailed and constructive comments. Below we provide point-by-point responses to address your concerns.

Q1: Attribution of Performance Gains

We highlight the differences between Sa2VA and UniPixel in terms of architectural design, supported input/output formats, and frame sampling strategy below.

| Method | Base Model | Mask Head | Temporal Prompt Encoding | Object Memory Bank | Point (in) | Box (in) | Mask (in) | Mask (out) | Frame Sampling Strategy |
|---|---|---|---|---|---|---|---|---|---|
| Sa2VA | InternVL2.5 | SAM2 (Large) | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | First 5 Frames |
| UniPixel | Qwen2-VL | SAM2.1 (Base+) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Uniform or Fixed FPS |
  • Architectural Design: Sa2VA uses SAM2-Large (224M) while UniPixel utilizes SAM2.1-Base+ (81M). These mask heads exhibit similar segmentation capability. Ablation studies on both the base model and mask head will be presented below. Compared with Sa2VA, our key architectural innovations are the Visual Prompt Encoder with Joint Positional & Temporal Encoding (Sec. 3.1) and the Object Memory Bank (Sec. 3.2), both of which effectively enable unified object referring and segmentation.
  • Supported Input/Output Formats: Our method supports flexible visual prompt inputs (i.e., points, boxes, and masks) at any frame(s). This is achieved by our visual prompt encoder that jointly encodes the spatial position and temporal frame index of each visual prompt. Although Sa2VA's preprint also claims support for point and box prompts, their actual code, data, and experiments do not reflect this (Sa2VA GitHub Issue #19). Only mask-pooling-style region caption experiments are presented in their Table 15 and Figure 7.
  • Frame Sampling Strategy: Sa2VA samples the first 5 video frames as keyframe inputs to the MLLM, after which SAM2 is employed to track the object in the remaining frames. This is sub-optimal as the MLLM cannot perceive the entire video. Instead, UniPixel leverages more flexible uniform or fixed FPS sampling strategies, allowing the LLM to access more temporal context and thereby enhancing overall performance.
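To illustrate the two sampling strategies named above, below is a minimal sketch; `sample_frame_indices` and its parameters are illustrative assumptions rather than UniPixel's released code.

```python
from typing import Optional

import numpy as np


def sample_frame_indices(num_frames: int, video_fps: float,
                         max_frames: int, target_fps: Optional[float] = None) -> np.ndarray:
    """Uniform or fixed-FPS frame sampling (illustrative sketch).

    With `target_fps` set, frames are taken at that fixed rate (falling back to
    uniform sampling if that would exceed `max_frames`); otherwise `max_frames`
    indices are spread uniformly over the whole video.
    """
    if target_fps is not None:
        idx = np.arange(0, num_frames, video_fps / target_fps).round().astype(int)
        if len(idx) <= max_frames:
            return idx
    return np.linspace(0, num_frames - 1, max_frames).round().astype(int)


# Example: a 10 s clip at 30 fps, sampled at 2 fps with at most 32 frames
print(sample_frame_indices(300, 30.0, max_frames=32, target_fps=2.0))
```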

We also provide ablation studies to clarify the attribution of performance gains. The metrics are J&F (RVOS datasets) and Acc (VideoRefer^Q).

| ID | Method | Base Model | Mask Head | ReVOS val | MeViS val(u) | Ref-DAVIS17 val | VideoRefer^Q |
|---|---|---|---|---|---|---|---|
| A | Sa2VA | InternVL2-4B | SAM2 (Large) | 53.2 | 52.1 | 73.8 | – |
| B | Sa2VA | InternVL2.5-4B | SAM2 (Large) | – | 55.9 | 73.7 | – |
| C | Sa2VA | Qwen2-VL-2B | SAM2 (Large) | 40.0 | 49.4 | 72.0 | – |
| D | UniPixel | Qwen2-VL-2B | SAM2.1 (Base+) | 60.2 | 59.8 | 74.5 | 72.5 |
| E | UniPixel | Qwen2-VL-2B | SAM2 (Large) | 60.4 | 59.6 | 73.9 | 72.6 |
| F | UniPixel | InternVL2.5-4B | SAM2 (Large) | 60.9 | 61.2 | 74.6 | 72.5 |
| G | UniPixel | Qwen2.5-VL-3B | SAM2.1 (Base+) | 61.4 | 59.8 | 74.1 | 72.8 |
  • A, B, C: Sa2VA's comparison presents the ranking of base MLLMs for RVOS tasks: InternVL2.5-4B > InternVL2-4B > Qwen2-VL-2B. This is aligned with our ablation study (E vs. F).
  • D, E: Changing from SAM2.1 (Base+) to SAM2 (Large) has minimal effect on model performance.
  • B, F: When using the same base model InternVL2.5-4B and mask head SAM2 (Large), UniPixel clearly outperforms Sa2VA on MeViS (+9.4%) and Ref-DAVIS17 (+1.2%).
  • D, G: Upgrading from Qwen2-VL-2B to Qwen2.5-VL-3B improves the performance on ReVOS and VideoRefer^Q, while slightly downgrading the results on Ref-DAVIS17.

The above ablation study clarifies that the performance gains of UniPixel largely come from architectural innovations and learning strategies, rather than simple backbone/module upgrades. We will include more detailed discussions and the table above in our revision. Qwen2.5-VL based UniPixel will also be open-sourced.

Q2: In Table 3, add C3VG [1] as a competitive non-LLM-based baseline.

Thank you for highlighting this strong and important baseline! We have incorporated its evaluation results on the RES task in both Table 3 and Appendix Table 4, and included detailed discussions in Sec. 2 (Related Work). Comparisons between C3VG and UniPixel-7B are presented below. Note that the cIoU (cumulative IoU) metric used in our work is equivalent to the oIoU in C3VG's paper. We compare results under both mIoU and oIoU metrics for comprehensive evaluations.
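For reference, a minimal sketch of how the two metrics differ (standard definitions on binary masks, not the authors' evaluation code):

```python
import numpy as np


def miou_and_oiou(preds: list[np.ndarray], gts: list[np.ndarray]) -> tuple[float, float]:
    """Compute mIoU and oIoU (cumulative IoU, i.e. cIoU) over binary masks.

    mIoU averages the per-sample IoU; oIoU sums intersections and unions over
    the whole dataset before dividing, so large objects carry more weight.
    """
    ious, inter_total, union_total = [], 0, 0
    for p, g in zip(preds, gts):
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        ious.append(inter / union if union > 0 else 1.0)
        inter_total += inter
        union_total += union
    return float(np.mean(ious)), float(inter_total / union_total)
```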

Using mIoU as metric:

| Method | LLM | FT | RefCOCO val | RefCOCO testA | RefCOCO testB | RefCOCO+ val | RefCOCO+ testA | RefCOCO+ testB | RefCOCOg val(U) | RefCOCOg test(U) |
|---|---|---|---|---|---|---|---|---|---|---|
| C3VG [1] | – | – | 81.4 | 82.9 | 79.1 | 77.1 | 79.6 | 72.4 | 76.3 | 77.1 |
| UniPixel | 7B | ✗ | 80.5 | 81.9 | 77.8 | 75.1 | 79.0 | 70.3 | 75.4 | 76.3 |
| UniPixel | 7B | ✓ | 81.9 | 83.4 | 80.0 | 76.8 | 80.7 | 72.6 | 77.4 | 78.6 |

Using oIoU as metric:

| Method | LLM | FT | RefCOCO val | RefCOCO testA | RefCOCO testB | RefCOCO+ val | RefCOCO+ testA | RefCOCO+ testB | RefCOCOg val(U) | RefCOCOg test(U) |
|---|---|---|---|---|---|---|---|---|---|---|
| C3VG [1] | – | – | 80.9 | 83.2 | 77.9 | 74.7 | 78.0 | 69.0 | 74.4 | 76.4 |
| UniPixel | 7B | ✗ | 79.5 | 81.1 | 76.2 | 72.5 | 77.0 | 66.6 | 74.4 | 75.3 |
| UniPixel | 7B | ✓ | 81.2 | 82.5 | 78.1 | 74.5 | 79.0 | 68.3 | 76.3 | 77.5 |

Here, FT denotes fine-tuning on RefCOCO/+/g datasets after multi-task co-training. Without such fine-tuning, our model outperforms all LLM-based methods but slightly lags behind the non-LLM-based SOTA (C3VG [1]). Such a gap can be effectively closed by fine-tuning on these downstream datasets, as can be seen from the competitive post-FT results. The comparisons suggest that UniPixel can achieve better or comparable results to non-LLM-based SOTAs like C3VG on RES.

Q3: In Appendix Table 5, add OneRef [2] and SimVG [3] as strong non-LLM SOTA methods.

Thanks for this valuable insight! We have added the REC performance of OneRef-L (BEiT3-L) [2] and SimVG-TB (ViT-L/32) [3] in Appendix Table 5. The comparison between these methods and UniPixel is shown below.

| Method | LLM | RefCOCO val | RefCOCO testA | RefCOCO testB | RefCOCO+ val | RefCOCO+ testA | RefCOCO+ testB | RefCOCOg val(U) | RefCOCOg test(U) |
|---|---|---|---|---|---|---|---|---|---|
| OneRef [2] | – | 93.2 | 95.4 | 90.1 | 88.4 | 92.1 | 82.7 | 87.8 | 88.8 |
| SimVG [3] | – | 90.6 | 92.5 | 87.7 | 85.4 | 89.6 | 79.7 | 79.3 | 86.0 |
| UniPixel | 2B | 89.0 | 92.1 | 86.8 | 81.9 | 86.9 | 76.5 | 85.3 | 85.7 |
| UniPixel | 7B | 91.5 | 93.7 | 87.9 | 85.9 | 90.9 | 80.8 | 87.7 | 88.6 |

Our 7B model slightly outperforms SimVG, but falls short of surpassing OneRef. We attribute this to the sub-optimal bounding box prediction, where the boxes are derived from masks rather than regressed by task-specific heads. This could potentially be improved by introducing dedicated bounding box heads following [1, 3], suggesting a promising direction for our future work. Nevertheless, to the best of our knowledge, UniPixel is currently the best-performing LLM-based method (7B-level) on REC.
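For reference, the mask-to-box conversion mentioned above amounts to a simple bounding operation; the sketch below is a generic version of this step, not the authors' code.

```python
from typing import Optional

import numpy as np


def box_from_mask(mask: np.ndarray) -> Optional[tuple[int, int, int, int]]:
    """Derive an (x1, y1, x2, y2) box from a binary mask.

    This standard conversion illustrates why mask-derived boxes can trail
    boxes regressed by dedicated heads; it returns None for empty masks.
    """
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```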

Q4: Details about the Object Memory Bank.

It seems that OpenReview failed to render some parts of your questions. Based on the available content, we have attempted to reconstruct them as follows. Please kindly correct us if our interpretation is inaccurate.

In Lines 152–153, for objects [1]–[4], are the <MEM> tokens constructed using a single <SEG> token decoded by SAM2 to obtain masks across all frames?

Yes. For each object, one <SEG> token is decoded by SAM2 and used to construct multiple <MEM> tokens. In this case, there are four objects, thus four <SEG> tokens are generated during memory pre-filling.

How is the multi-frame memory for each object built from a single <SEG> token?

During memory pre-filling, the model predicts one <SEG> token for each object. This token is passed to SAM2 and serves as a "hidden prompt" on the first frame. SAM2 then predicts the object mask for the first frame and propagates it across the remaining frames. The output from SAM2 is a multi-frame mask (a binary mask with shape [T, H, W]), which is saved into the object memory bank.

In the memory injection stage, the saved multi-frame mask is split into T single-frame masks (each with shape [H, W]). We then construct the memory-enhanced prompt according to these masks. For example, for a video with 4 frames, if the mask for object [1] at frame <2> is all-zero (i.e., the object is not visible in that frame), the prompt would be [1]: <1> <MEM> <3> <MEM> <4> <MEM> (<2> is omitted since the mask is empty). In this case, only the masks at frames <1>, <3>, and <4> are valid and are used to crop the corresponding visual features. The cropped 2D features are then average-pooled into 1D vectors, projected, and used to replace the <MEM> tokens in the prompt.
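A minimal sketch of this injection step is given below, assuming per-frame visual features of shape [T, H, W, C]; `build_memory_tokens` and `proj` are illustrative names, not the authors' implementation.

```python
import torch


def build_memory_tokens(frame_feats: torch.Tensor,   # [T, H, W, C] visual features
                        obj_mask: torch.Tensor,      # [T, H, W] binary mask from the mask head
                        proj: torch.nn.Module) -> tuple[str, torch.Tensor]:
    """Illustrative memory-injection step for one object.

    Frames whose mask is all-zero are skipped; for the remaining frames the
    masked features are average-pooled into one vector per frame, projected,
    and later substituted for the <MEM> placeholders in the prompt.
    """
    prompt_parts, mem_vectors = [], []
    for t in range(obj_mask.shape[0]):
        m = obj_mask[t].bool()
        if not m.any():                          # object absent in this frame
            continue
        pooled = frame_feats[t][m].mean(dim=0)   # mask-pool the region: [C]
        mem_vectors.append(proj(pooled))
        prompt_parts.append(f"<{t + 1}> <MEM>")
    return " ".join(prompt_parts), torch.stack(mem_vectors)


# Example: a 4-frame video where the object is missing in frame <2>
T, H, W, C = 4, 8, 8, 16
feats = torch.randn(T, H, W, C)
mask = torch.ones(T, H, W)
mask[1] = 0
prompt, mems = build_memory_tokens(feats, mask, torch.nn.Linear(C, C))
print(prompt)       # -> "<1> <MEM> <3> <MEM> <4> <MEM>"
print(mems.shape)   # -> torch.Size([3, 16])
```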

How does the model determine that <4> <MEM> does not exist in the last frame?

This is determined by the mask head. If the predicted mask in a frame is all-zero, it means that the object does not exist in that frame.

We will carefully revise Sec. 3.2 to provide a clearer explanation of how the object memory bank works.

We hope our responses above can address your concerns. More discussions are welcome if you have any further questions. Thank you!

References:

[1] Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints, AAAI 2025.

[2] OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling, NeurIPS 2024.

[3] SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion, NeurIPS 2024.

Comment

Dear Reviewer HVUw,

Thank you again for your constructive feedback!

As the rebuttal period is halfway through, we would greatly appreciate it if you could kindly review our response to ensure it adequately addresses your concerns.

Thank you for your time and consideration.

Sincerely,

Authors of Paper 394

Comment

Dear Reviewer HVUw,

As the reviewer-author discussion period will end very soon, we would like to kindly follow up to confirm whether our rebuttal has addressed all your concerns. Please feel free to let us know if any parts require further clarification, and we will be happy to address them accordingly.

Thank you very much for your time and consideration!

Sincerely,

Authors of Paper 394

Comment

Dear Reviewers D9pB, HVUw, and NW7Z,

The authors have submitted their rebuttal.

Please take time to carefully review all other reviews and the authors’ responses, and engage in an open exchange with the authors.

Kindly post your initial response as early as possible within the discussion window to allow sufficient time for interaction.

Your AC

Comment

Dear AC and Reviewers,

We sincerely thank you for your thoughtful review and constructive feedback, which have significantly strengthened our submission.

We are encouraged by the unanimously positive scores and your recognition that:

  • HVUw: The work is novel, well-motivated, and advances pixel-level vision-language understanding in a unified and efficient manner.
  • D9pB: The Object Memory Bank is a novel and effective innovation. The work successfully integrates multiple tasks into a unified, end-to-end model. The paper is well-structured, logically coherent, and highly readable.
  • NW7Z: This work is well-structured and easy to follow. The effectiveness of the model is validated through extensive ablation studies.
  • EraF: The motivation is clear. The writing is good. The object memory bank allows the model to perform object referring and segmentation more effectively than previous methods.

Following your constructive comments and insightful suggestions, we have made the following improvements:

  • Ablation Studies on the Base LLMs and Modules: In response to reviewer HVUw, we highlighted the architectural and task differences between Sa2VA and our model, and conducted extensive ablation studies to demonstrate the attribution of performance gains—from our innovations rather than simple backbone/module upgrades.
  • More Baseline Comparisons: As suggested by reviewer HVUw and EraF, we have incorporated C3VG, OneRef, SimVG, and VisionLLM v2 as important baselines in our experiments. We've also discussed the differences between our UniPixel (unified referring & segmentation) and Draw-and-Understand (referring-only), mentioned by reviewer D9pB. Larger-scale models such as Qwen2-VL-72B and Sa2VA-26B are also compared according to reviewer NW7Z's advice.
  • Discussion on Inference Efficiency of Object Memory Bank: Following reviewer D9pB and EraF's suggestions, we conducted efficiency tests on the proposed object memory bank, and clarified that the extra computational overhead can be effectively minimized (56% faster) using proper inference techniques (i.e., visual tokens cache and KV cache).
  • Clarification on Mutual Improvement of Referring and Segmentation: In response to reviewer D9pB, we clarify that the joint improvement of referring and segmentation capabilities can be achieved through multi-task co-training with around 1:1 sample ratios.
  • Performance on General VQA Tasks: As suggested by reviewer EraF, we evaluated our UniPixel on VQAv2 and MME benchmarks, where the results demonstrate that our unified pixel-level understanding design can also improve general visual-language understanding capabilities.
  • Other Details (GPU Requirement, Code Release, Presentations): Thanks to the suggestions from reviewer NW7Z and EraF, we've included GPU requirements and training time in the paper, and revised the experiment tables to indicate the base model of each baseline. We are also actively cleaning up the code and models, and we commit to releasing them very soon.

We greatly appreciate your efforts in engaging in the discussions. Your feedback has significantly enhanced the quality of our submission. Any further comments/suggestions are welcome to help us continuously improve our work. Thank you!

Sincerely,

Authors of Paper 394

Final Decision

This paper introduces UniPixel, a unified multimodal model for pixel-level visual reasoning that integrates object referring, segmentation, and reasoning across images and videos. Its key contribution is the Object Memory Bank, enabling region-specific reasoning and supporting the new PixelQA task. The work is clearly written, well-motivated, and demonstrates strong results across 10 benchmarks. Reviewers appreciated the novelty, unified framework, and extensive evaluation. Main concerns centered on whether performance gains are primarily due to backbone upgrades, and the efficiency trade-offs of the two-stage inference. The rebuttal provided detailed ablations, clarified efficiency mitigations, and expanded comparisons to stronger baselines and larger LMMs. Overall, while some reservations remain, the consensus is that the paper is technically solid and novel, with scores of 4, 4, 4, and 5. It makes a valuable contribution to unified pixel-level vision-language reasoning. The authors are encouraged to provide more detailed explanations of the memory bank mechanism and include additional ablation studies in the camera-ready version.