Referring Expression Matters: Multi-referring Feature Aggregation for Referring Video Object Segmentation

Yan Li, Qiong Wang, Jianwei Zheng, Cong Bai, Lu Zhang

OpenReview PDF

Submitted: 2023-09-24 | Updated: 2024-03-26

Abstract

Keywords

Referring Video Object Segmentation, Referring expression segmentation, Multimodal representation learning

Reviews and Discussion

Review

Rating: 1 | Confidence: 3 | 2023-10-30

A Referring Video Object Segmentation method is proposed. However, the motivation is not clear, and the details of most components of the method are not explained.

Strengths

The figure of the proposed model architecture is clear.

Weaknesses

The writing is poor. The details of the Multi-modal Fusion are not explained clearly. No references are cited for Deformable Transformer or Instance Sequence Segmentation. The paper is quite hard to understand.

Questions

How is Multi-modal Fusion performed? What is the structure of the Deformable Transformer? What is the Cross-Modal Feature Pyramid Network (CM-FPN)?

Details of Ethics Concerns

Review

Rating: 5 | Confidence: 4 | 2023-10-30

The paper focuses on the task of Referring Video Object Segmentation (RVOS) and proposes integrating multiple referring expressions to boost performance. A neural expression generation module is proposed to create complementary features from these expressions, which not only improves object identification accuracy but also accelerates training convergence. Experimental results on popular RVOS datasets are presented.

Strengths

(1) The paper explores the effect of multiple referring expressions for RVOS, which is interesting.

(2) This paper is well-written and easy to follow.

Weaknesses

(1) Although the authors present an interesting motivation, suggesting that adjusting referring expressions could enhance segmentation performance, the proposed method does not fully align with this motivation. After reading the introduction, the reviewer expected to find how the unclear parts within referring expressions are identified and improved. However, the authors merely concatenate multiple referring expressions.

(2) The paper's contribution mainly involves adding an MLP to ReferFormer to merge multiple referring expressions. However, this incremental addition lacks in-depth analysis, e.g., which scenarios need multiple inputs, and how the extent of overlap and divergence between referring expressions affects final performance. Consequently, the contribution of the paper is limited.

(3) The experimental comparisons are unfair. While the proposed method uses multiple referring expressions as input, the compared methods use only one expression. To truly demonstrate the impact of integrating multiple referring expressions (MRE), a more comprehensive comparison should merge the results obtained from different expressions in the other methods. This would show how much of the performance gain actually comes from exploring relationships within referring expressions.
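
One possible reading of the baseline suggested in (3) is a late-fusion scheme: run a single-expression method once per expression, then merge the predicted masks. The sketch below is only an illustration under that assumption. The function name `fuse_masks`, the averaging rule, and the 0.5 threshold are hypothetical choices, not something specified by the reviewer or the paper.

```python
from typing import List

# Per-pixel foreground probabilities for one frame (rows of floats in [0, 1]).
ProbMap = List[List[float]]


def fuse_masks(per_expression_probs: List[ProbMap], threshold: float = 0.5) -> List[List[int]]:
    """Average the probability maps predicted for each expression, then threshold.

    Hypothetical late-fusion baseline: each map is assumed to come from an
    independent run of a single-expression model on the same frame.
    """
    if not per_expression_probs:
        raise ValueError("need at least one probability map")
    n = len(per_expression_probs)
    height = len(per_expression_probs[0])
    width = len(per_expression_probs[0][0])
    fused = []
    for y in range(height):
        row = []
        for x in range(width):
            # Mean foreground probability across all expressions at this pixel.
            mean = sum(p[y][x] for p in per_expression_probs) / n
            row.append(1 if mean >= threshold else 0)
        fused.append(row)
    return fused


# Three expressions, one frame of 1 x 2 pixels.
fused = fuse_masks([[[0.9, 0.2]], [[0.7, 0.4]], [[0.2, 0.3]]])
```

Comparing such a fused baseline against the proposed feature-level aggregation would help separate the benefit of simply having more expressions from the benefit of the aggregation module itself.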

Questions

See weaknesses.

Review

Rating: 3 | Confidence: 2 | 2023-11-01

This paper proposes a referring video object segmentation method via a multi-referring feature aggregation mechanism. This mechanism can effectively obtain complementary features with less redundancy, which is not only helpful in identifying the referred object, but also speeds up the training convergence. Experimental results show the effectiveness and superiority of the proposed method.

Strengths

Multiple referring expressions can yield a complete and concise linguistic feature, and the experimental results show the effectiveness of the proposed strategy.
The proposed method achieves better training convergence.
The proposed method achieves a new SOTA and outperforms the second-best method by a large margin.

Weaknesses

The novelty of the proposed method is somewhat limited. The main contribution is the neural expression generation (NEG) from multiple referring expressions. This aggregation strategy seems simple and lacks insight.
The authors state that they propose different sampling strategies in cross-modal attention for pre-training and fine-tuning to boost model performance. However, the description of this sampling strategy is unclear, as are its differences from existing sampling strategies. There are also no experimental results to support this claim.
In Eq. 3, the authors use the concat operation, but in Table 2(c) the proposed NEG is listed as different from MRE Concat. The reason for this is unclear.
The authors do not show training convergence for the pre-training strategy, so it is hard to claim that the proposed method achieves faster convergence when this is verified only in the fine-tuning stage.
I think the comparison is somewhat unfair. The batch sizes differ, which mainly influences training convergence and may even affect performance.

Questions

Please see the weaknesses.