PaperHub
Overall score: 5.7/10
Poster · 3 reviewers (min 5, max 6, std 0.5)
Ratings: 6, 6, 5
Confidence: 4.0 · Correctness: 3.0 · Contribution: 2.7 · Presentation: 3.0
NeurIPS 2024

A Surprisingly Simple Approach to Generalized Few-Shot Semantic Segmentation

OpenReview · PDF
Submitted: 2024-05-13 · Updated: 2025-01-08
TL;DR

A simple approach that does not resort to, e.g., complicated modules or meta-learning improves GFSS performance.

Abstract

The goal of *generalized* few-shot semantic segmentation (GFSS) is to recognize *novel-class* objects through training with a few annotated examples and the *base-class* model that learned the knowledge about the base classes. Unlike the classic few-shot semantic segmentation, GFSS aims to classify pixels into both base and novel classes, meaning it is a more practical setting. Current GFSS methods rely on several techniques such as using combinations of customized modules, carefully designed loss functions, meta-learning, and transductive learning. However, we found that a simple rule and standard supervised learning substantially improve the GFSS performance. In this paper, we propose a simple yet effective method for GFSS that does not use the techniques mentioned above. Also, we theoretically show that our method perfectly maintains the segmentation performance of the base-class model over most of the base classes. Through numerical experiments, we demonstrated the effectiveness of our method. It improved novel-class segmentation performance in the $1$-shot scenario by $6.1$% on the PASCAL-$5^i$ dataset, $4.7$% on the PASCAL-$10^i$ dataset, and $1.0$% on the COCO-$20^i$ dataset. Our code is publicly available at https://github.com/IBM/BCM.
Keywords
few-shot learning, semantic segmentation, catastrophic forgetting

Reviews and Discussion

Review
Rating: 6

The authors propose a simple yet effective method, termed base-class mining (BCM), for GFSS that does not employ the techniques used in existing methods. Experiments show some improvements on COCO-20i, PASCAL-5i, and PASCAL-10i.

Strengths

  1. The article is well-written.
  2. The proposed method is simple yet effective to some extent.
  3. The computational cost is low, as the approach only requires updating several final linear layers.

Weaknesses

  1. The overall novelty is relatively limited, as the idea that "a novel class is classified as the background or a similar base class by the base-class model" has already been explored in continual semantic segmentation [1], where novel class "lake" is classified as base class "water" before learning the new class lake.
  2. The improvement across different settings is relatively marginal.
  3. The computational cost is about 1.5 times that of CAPL, as shown in Fig. 7. From my view, the base model is shared among the different g_beta. The only computational increase should be the added final linear layers, which is roughly negligible. Therefore, I expect more explanation for the increase.

[1] Kim, Beomyoung, Joonsang Yu, and Sung Ju Hwang. "ECLIPSE: Efficient Continual Learning in Panoptic Segmentation with Visual Prompt Tuning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

Questions

  1. When s > 1, each novel class may have multiple mapped base classes. In this case, how to combine the final predictions for the novel classes is not elaborated.

Limitations

No negative societal impact is expected.

Author Response

Thank you for your review. Please find our answers to your questions below.

The overall novelty is relatively limited, as the idea that "a novel class is classified as the background or a similar base class by the base-class model" has already been explored in continual semantic segmentation [1], where novel class "lake" is classified as base class "water" before learning the new class lake.

Thank you for pointing out the paper [1]. In [1], a novel class is classified as a base class, and the misclassification is then corrected by post-processing, so-called logit manipulation. In contrast, we explicitly integrate the idea into the model architecture, which vastly differs from [1]. The architecture allows us to see which novel class is related to which base class through the BNM, as illustrated in Figure 2(c). Moreover, we showed that our method prevents catastrophic forgetting in theory (Proposition 4.1) and improves performance, particularly in the 1-shot setting. Although the two settings (continual learning in panoptic segmentation [1] and GFSS) have some overlap, achieving better performance with [1] in GFSS would not be straightforward, since the number of training samples in GFSS is quite limited, particularly in the 1-shot setting. We believe that our paper contains various contributions to the overall novelty.

When writing our paper, we were not aware of [1], since it does not consider GFSS and was published very recently at CVPR 2024, appearing on arXiv roughly a month before the submission deadline. We will cite [1] in the final version.

The improvement across different settings is relatively marginal.

The improvement in the Novel score on 5-shot PASCAL-$5^i$ is marginal, but apart from that, our method improved the Novel score by at least 1% and up to 6%. From the viewpoint of inference time, our method is much faster than DIaM, as shown in Figure 7. Considering these points, the experimental results should be sufficient to show the effectiveness of the proposed method against the existing methods.

The computational cost is about 1.5 times that of CAPL, as shown in Fig. 7. From my view, the base model is shared among the different g_beta. The only computational increase should be the added final linear layers, which is roughly negligible. Therefore, I expect more explanation for the increase.

Our method has $|\mathcal{B}|$ final linear layers for the novel classes, leading to a subtle slowdown when switching between the layers, unlike the end-to-end computation of CAPL. Also, our current implementation runs the final linear layers on CPUs, as scikit-learn is used, unlike CAPL on GPU. This device difference might be another reason for the slowdown. Besides, other miscellaneous factors may account for about six milliseconds of the slowdown. Nevertheless, the current implementation would suffice to show the usefulness of our method. In the final version, we will mention the above points and explain that a more sophisticated implementation will shorten the gap between CAPL and our method.
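
To make the cost structure concrete, here is a minimal sketch of the setup described above, assuming hypothetical names and toy data (this is not the authors' implementation): one scikit-learn linear classifier per base class, trained over frozen features on CPU.

```python
# Minimal sketch of the cost structure described above; all names and data
# are hypothetical toys, not the authors' implementation.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 16))     # frozen backbone features (on CPU)
labels = rng.integers(0, 3, size=200)  # toy labels: base class beta or a novel class
base_classes = range(5)                # toy stand-in for the set B

# |B| separate final linear layers (one g_beta per base class). Switching
# between them at inference, plus moving GPU features to CPU for
# scikit-learn, accounts for the slowdown relative to CAPL's end-to-end pass.
g = {beta: LogisticRegression(max_iter=200).fit(feats, labels)
     for beta in base_classes}
```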

When s > 1, each novel class may have multiple mapped base classes. In this case, how to combine the final predictions for the novel classes is not elaborated.

Even when s > 1, the inference procedure works as is, since in the BNM a single base class is mapped to multiple novel classes.

We illustrate the case where s > 1 with a tiny example: the base classes are 0 and 1, and the novel class is 2. Suppose that we have the following BNM when s = 2:

| Base class | Set of novel classes |
| --- | --- |
| 0 | {2} |
| 1 | {2} |

This table shows the case where the novel class 2 is mapped to two base classes, 0 and 1. In this case, we have two models: $g_{\beta=0}$ returns 0 or 2, and $g_{\beta=1}$ returns 1 or 2. For each pixel, the base-class model outputs either 0 or 1. We then compute the prediction of the corresponding model $g_\beta$ and overwrite the base prediction with it. Since our method never overwrites the same pixel multiple times, we can straightforwardly combine the predictions of the $g_\beta$ into the final prediction. To improve the clarity of our paper, we will add the above explanations in the final version. We hope that this answer resolves your concerns.
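
A toy rendering of this combination rule (ours, with a made-up $g_\beta$ decision rule standing in for the trained models) shows that each pixel is overwritten exactly once, by the $g_\beta$ selected by its base prediction:

```python
# Toy illustration of the example above: base classes {0, 1}, novel class 2,
# BNM = {0: {2}, 1: {2}}. The g_beta rule below is an arbitrary stand-in.
import numpy as np

base_pred = np.array([0, 1, 0, 1])        # base model's top-1 label per pixel

def g(beta, pixel):
    """Stand-in for g_beta: returns beta itself or the novel class 2."""
    return 2 if pixel % 2 == 0 else beta  # made-up decision rule

# Each pixel is overwritten once, by the g_beta matching its base prediction,
# so predictions from g_{beta=0} and g_{beta=1} combine without conflicts.
final = np.array([g(beta, i) for i, beta in enumerate(base_pred)])
print(final)                              # -> [2 1 2 1]
```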

Comment

Regarding s > 1: only the top-1 prediction from the base model will be the index used to select the corresponding $g_\beta$ to output the prediction, which will be used to overwrite the base prediction. Is that correct?

Comment

I would like to see a comparison with the few-shot fine-tuning techniques proposed in FSOD [1-2], i.e., fine-tuning the last linear layers. Adding these two simple methods to the baseline would make the experimental part more convincing.

Moreover, I suggest the authors briefly discuss the limitations of continual semantic segmentation (CSS) [3-5] when it comes to GFSS in the related works.

I will finalize my rating based on the authors' responses.

[1] Wang, Xin, et al. "Frustratingly simple few-shot object detection." arXiv preprint arXiv:2003.06957 (2020).

[2] Yang, Ze, et al. "Efficient few-shot object detection via knowledge inheritance." IEEE Transactions on Image Processing 32 (2022): 321-334.

[3] Cermelli, Fabio, et al. "Modeling the background for incremental learning in semantic segmentation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.

[4] Yang, Ze, et al. "Label-guided knowledge distillation for continual semantic segmentation on 2d images and 3d point clouds." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

[5] Kim, Beomyoung, Joonsang Yu, and Sung Ju Hwang. "ECLIPSE: Efficient Continual Learning in Panoptic Segmentation with Visual Prompt Tuning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

Comment

I would like to see a comparison with the few-shot fine-tuning techniques proposed in FSOD [1-2], i.e., fine-tuning the last linear layers. Adding these two simple methods to the baseline would make the experimental part more convincing.

DIaM fine-tunes the last linear layers and can be regarded as this simple baseline. In Figure 1(a) of the DIaM paper, they compare against a method that fine-tunes the final linear layer with the cross-entropy loss, a procedure close to the few-shot object detection method [1]. The results show that DIaM outperforms this simple baseline, suggesting that our method will also outperform such a baseline.
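
For reference, a rough sketch of such a "fine-tune only the final linear layer" baseline in the spirit of [1]; the architecture, shapes, and hyperparameters below are hypothetical toys, not DIaM's or [1]'s actual code.

```python
# Toy sketch of the "fine-tune only the final linear layer" baseline;
# architecture, shapes, and hyperparameters are hypothetical.
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
head = nn.Conv2d(16, 21, kernel_size=1)   # per-pixel "linear" classifier

for p in backbone.parameters():           # freeze everything but the head
    p.requires_grad = False

opt = torch.optim.SGD(head.parameters(), lr=1e-2)
ce = nn.CrossEntropyLoss(ignore_index=255)

x = torch.randn(2, 3, 64, 64)             # toy support images
y = torch.randint(0, 21, (2, 64, 64))     # toy dense labels (21 classes)

for _ in range(50):                       # few-shot fine-tuning loop
    opt.zero_grad()
    loss = ce(head(backbone(x)), y)       # cross-entropy on the head only
    loss.backward()
    opt.step()
```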

In the final version, we will discuss [1-2] in the related work.

Moreover, I suggest the authors briefly discuss the limitations of continual semantic segmentation (CSS) [3-5] when it comes to GFSS in the related works.

Thank you for your suggestion. We will discuss it in the related work.

Comment

After the rebuttal, the authors' response has addressed some of my concerns. I believe this work will have a positive influence on the few-shot semantic segmentation community. Consequently, I have decided to raise my rating.

Comment

Regarding s > 1: only the top-1 prediction from the base model will be the index used to select the corresponding $g_\beta$ to output the prediction, which will be used to overwrite the base prediction. Is that correct?

Yes. The top-1 prediction of the base model is used. The symbol $s$ refers to the top-$s$ strategy in our method. We will clarify this point.

Review
Rating: 6

This work introduces an interesting method for generalized few-shot segmentation. Unlike previous methods that mainly focus on meta-learning, the proposed method maintains the performance of base classes while achieving decent performance on novel classes. Feature pre-processing and model ensembling techniques are used to further enhance performance. Experiments on PASCAL-5i, PASCAL-10i, PASCAL-20i, and COCO-20i show performance better than or comparable to SOTA methods.

Strengths

The proposed method is simple but effective. Experimental results are generally good. This work provides a new perspective on maintaining base class performance in GFSS. The performance for base classes that are less relevant to novel classes is exactly maintained by design.

Weaknesses

  1. It should be clarified whether the co-occurrence matrix and BNM are calculated per dataset or per batch.
  2. Since the proposed method uses feature pre-processing and model ensembling techniques, the comparison may be somewhat unfair.
  3. The written language can be improved.

Questions

Typo: line 43, "beneficial, especially."

Limitations

The authors have already discussed their limitations in the paper.

Author Response

Thank you for your feedback. Please find our answers to your questions below.

It should be clarified whether the co-occurrence matrix and BNM are calculated per dataset or per batch.

The current implementation computes BNM per dataset. We will clarify this point in the final version.
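
As a concrete reading of "per dataset", here is a minimal sketch (hypothetical names, shapes, and read-out; not the authors' code) that accumulates co-occurrence counts between base-model predictions and novel-class ground truth over the whole support set before deriving a top-s mapping:

```python
# Minimal sketch of a per-dataset co-occurrence computation; all names,
# shapes, and the top-s read-out are hypothetical, not the authors' code.
import numpy as np

n_base, n_novel = 5, 2
cooc = np.zeros((n_base, n_novel), dtype=np.int64)

def accumulate(base_pred, novel_gt):
    """Add one image's counts; called for every support image in the dataset."""
    for b in range(n_base):
        for n in range(n_novel):
            cooc[b, n] += np.sum((base_pred == b) & (novel_gt == n))

rng = np.random.default_rng(0)
for _ in range(3):                                   # toy support set
    accumulate(rng.integers(0, n_base, (32, 32)),    # base-model predictions
               rng.integers(0, n_novel, (32, 32)))   # novel-class ground truth

# For each novel class, the top-s most co-occurring base classes; inverting
# this gives a base -> set-of-novel-classes table like the one in the rebuttal.
s = 2
top_s = {n: np.argsort(cooc[:, n])[::-1][:s].tolist() for n in range(n_novel)}
print(top_s)
```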

Since the proposed method uses feature pre-processing and model ensembling techniques, the comparison may be somewhat unfair.

We presented an ablation study showing that, even without the feature pre-processing and ensemble learning techniques, our method achieved better novel-class segmentation performance in the 1-shot setting.

The existing methods often use meta-learning, the information-maximization principle, and transductive learning in addition to simple supervised learning, e.g., minimization of the cross-entropy loss. Each existing method uses various techniques to further improve its performance. Given that, our comparison is reasonable from the viewpoint of standard practice.

From another perspective, an advantage of our method is that we can use feature pre-processing and model ensemble learning techniques at low cost, as shown in Figure 6 (training time). Applying pre-processing to feature vectors rather than to input images would degrade the performance of the existing methods. Ensemble learning with the existing methods causes computational issues. For instance, ensembling further slows down existing methods based on transductive learning, such as DIaM.
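
To illustrate why ensembling stays cheap in this setting, here is a hypothetical sketch (ours, not the authors' code): only the small linear heads are duplicated and their class probabilities averaged, while the frozen backbone features are computed once.

```python
# Hypothetical sketch: ensembling only the cheap final linear layers over
# frozen backbone features computed once; not the authors' code.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
feats = rng.normal(size=(300, 16))     # frozen features, one backbone pass
labels = rng.integers(0, 3, size=300)  # toy labels

# Train a small ensemble of linear heads on bootstrap resamples.
heads = []
for _ in range(5):
    idx = rng.integers(0, len(feats), size=len(feats))
    heads.append(LogisticRegression(max_iter=200).fit(feats[idx], labels[idx]))

# Ensembling = averaging class probabilities from the heads; no extra
# backbone passes, unlike ensembling full transductive pipelines.
probs = np.mean([h.predict_proba(feats) for h in heads], axis=0)
pred = probs.argmax(axis=1)
```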

The written language can be improved.

We will revise the written language.

Typo: line 43, "beneficial, especially."

Thank you very much. We fixed the typo.

Comment

Thanks for your response. My concerns have been addressed.

Review
Rating: 5

The paper presents a new and efficient BCM technique aimed at tackling the issue of generalized few-shot semantic segmentation. It identifies how base and novel classes relate to each other by examining the overlap between the base model's predictions and the true labels of the novel images. Utilizing these insights, the approach trains new models for the novel classes, allowing them to better differentiate from similar base classes, which in turn enhances the segmentation results.

Strengths

The writing in the paper is easy to understand and direct.

Besides, the performance is commendable, and the method also maintains a pleasing level of efficiency.

The approach appears to be general and independent of the model's architecture.

Weaknesses

Given that the base-novel mapping (BNM) facilitates a transition from base to novel classes, there's no necessity to further refine the base-class model and its feature extractor, nor to adjust their weights.

The BCM employs inductive learning instead of transductive learning, which results in swift inference and also ensures that the training process of the BCM is both quick and user-friendly.

The quantitative assessment appears to indicate that the performance of the BCM is state-of-the-art when juxtaposed with traditional methods.

Questions

Please refer to the weaknesses section.

Limitations

n/a

Author Response

Thank you for your feedback. It appears you have listed some of the strengths in the weaknesses section. If you forgot to copy and paste questions from your memo, e.g., questions about explanations in our paper, please let us know during the discussion. We would like to improve the presentation of our paper through your comments.

Comment

Dear Reviewers,

Thanks for serving as a reviewer for NeurIPS. The rebuttal deadline has just passed.

The authors have provided their rebuttal. Could you check the rebuttal and your fellow reviewers' comments, hold any necessary discussions, and update your final recommendation as soon as possible?

Thank you very much.

Best,

Area Chair

Final Decision

This paper was reviewed by three experts in the field. The recommendations are Weak Accept (×2) and Borderline Accept. Based on the reviewers' feedback, the decision is to recommend acceptance of the paper. The reviewers did raise some valuable concerns (especially the unclear paper writing and presentation by Reviewer NeFB, and the technical novelty claim and comparisons with previous literature by Reviewer MDAS) that should be addressed in the final camera-ready version of the paper. The authors are encouraged to make the necessary changes to the best of their ability.