Learning to Count without Annotations

Lukas Knobel, Tengda Han, Yuki M Asano

OpenReview PDF

Submitted: 2023-09-22 | Updated: 2024-03-26

TL;DR

We propose a method that allows training counting models using only self-supervised learning signals.

Abstract

Keywords

computer vision, self-supervision, visual counting

Reviews and Discussion

Review

Rating: 5 | Confidence: 5 | 2023-10-31

This paper focuses on the unsupervised object-counting task that does not require any manual annotations. To this end, the authors construct “SelfCollages”, images with various pasted objects as training samples, that provide a rich learning signal covering arbitrary object types and counts. Experiments on the counting dataset demonstrate the effectiveness of the proposed method.

Strengths

Unsupervised counting is a challenging task, and it is appealing to see the authors propose a practical approach.

The proposed method even outperforms simple baselines and generic models such as FasterRCNN and DETR.

Weaknesses

The experiments are not convincing. Two pioneering works, CrowdCLIP [1] and CSS-CCNN [2], also address unsupervised counting, yet the authors neither discuss nor compare against them. I would like to see a comprehensive comparison.
The FSC-147 dataset used for evaluation is not very challenging. I suggest the authors run experiments on crowd datasets, which are usually dense and challenging. Comparing with CrowdCLIP [1] and CSS-CCNN [2] would make the paper more solid.
It would be better to add a subsection discussing weakly/semi-supervised counting methods, which also reduce the annotation cost.
The motivation for pasting different images on top of a background is not clear.
Did the authors try clustering algorithms other than K-means?

[1] CrowdCLIP: Unsupervised Crowd Counting via Vision-Language Model. CVPR 2023.
[2] Completely Self-Supervised Crowd Counting via Distribution Matching. ECCV 2022.

Questions

See weaknesses.

Review

Rating: 3 | Confidence: 5 | 2023-10-31

This manuscript targets training few-shot counting models without annotations. Specifically, the authors build a synthetic dataset to provide a supervision signal for the object counter, and use an architecture based on DINO and a Vision Transformer to predict density maps.

Strengths

The manuscript is sound in that it adequately explains the results and the experimental analysis.
The writing quality is sufficient to convey the main points.

Weaknesses

The motivation of this work is weak. It remains unclear why one should build such a synthetic dataset from other data to obtain a supervision signal for training an unsupervised counter.
From the data perspective, the generated data are without doubt full of artifacts, and the solution in this manuscript is simply copy-paste, whose contribution is limited.
The counting model used in this manuscript is not entirely original; it appears to be DINO + ViT.
The authors evidently omitted some unsupervised counting methods (Completely Self-Supervised Crowd Counting via Distribution Matching, ECCV 2022) and foundation-model-based methods (Can SAM Count Anything? An Empirical Study on SAM Counting; Training-free Object Counting with Prompts).

Questions

None

Review

Rating: 3 | Confidence: 4 | 2023-11-02

This paper introduces a method for counting objects without annotations. It leverages DINO and N-cut to extract object patches and then randomly places them into a background image, allowing for the acquisition of localization labels without annotation. Additionally, it trains a counting model based on CounTR but with the DINO backbone to count objects.
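The pasting step summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `make_collage`, the uniform random placement, the absence of resizing or blending, and the unit-impulse density target are all assumptions made for illustration, and `mask` merely stands in for a segmentation that a DINO/N-cut pipeline might produce.

```python
import numpy as np

def make_collage(background, patch, mask, n_objects, rng):
    """Paste `patch` n_objects times onto a copy of `background`.

    Hypothetical helper: returns the collage and a density map holding
    one unit of mass at the centre of each pasted copy, so the map sums
    to the object count without any manual annotation.
    """
    H, W = background.shape[:2]
    h, w = patch.shape[:2]
    image = background.copy()
    density = np.zeros((H, W), dtype=np.float32)
    for _ in range(n_objects):
        # Assumption: positions are drawn uniformly; overlaps are allowed.
        top = int(rng.integers(0, H - h + 1))
        left = int(rng.integers(0, W - w + 1))
        region = image[top:top + h, left:left + w]  # view into `image`
        region[mask] = patch[mask]                  # copy object pixels only
        density[top + h // 2, left + w // 2] += 1.0
    return image, density
```

Because the pasted positions are known by construction, the count and the localization target come for free; later copies can cover earlier ones, which is one simple way occlusion arises in such collages.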

Strengths

The authors propose a method to generate synthetic data for object counting and implement an unsupervised object counting approach.
They utilize the DINO backbone to create a counting model similar to CounTR.

Weaknesses

The approach of creating synthetic data by copying segmentation results from one image to another is a well-known technique in segmentation [1]. However, this paper applies it to object counting.
The trained model's performance is not satisfactory, particularly in FSC-147 high, which is the primary objective of counting dense and small objects.
The motivation for the counting task is to abstract information from dense scenes that detection models struggle with, particularly partial and occluded objects. However, the proposed method does not effectively handle partial or occluded objects, which contradicts the motivation of the counting task.

[1] Ghiasi, Golnaz, et al. "Simple copy-paste is a strong data augmentation method for instance segmentation." CVPR, 2021.

Questions

How does the model perform if fine-tuned on FSC-147?
How does the model perform on a specific dataset without retraining? For instance, previous methods have conducted adaptation on the CARPK dataset.
Why is $n_{max}$ set to a very small value ($n_{max} = 20$)? A higher count might be more suitable for a counting model, since powerful detection models can already handle sparse scenes.
Although FSC-147 is split into low/medium/high in the experiments, the overall performance should also be reported in the corresponding tables and sections.
Table 5 seems unfair. $n_{max}$ in UnCo is 20, making the trained model more suited to FSC-147 low (8-16 objects). CounTR is trained on the entire FSC-147 dataset, with counts ranging from 7 to 3731. The domains are different, and for a fair comparison, CounTR should be trained using only samples from FSC-147-low.
The average baseline in Table 1 seems unfair, particularly for FSC-147 low. Specifically, if the estimated density is 0, the MAE equals the average GT count in FSC-147 low (8-16), which is much smaller than 37.
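The arithmetic behind the last point can be checked with a short sketch. The ground-truth counts below are hypothetical values inside the 8-16 range and are not taken from FSC-147; `mae` is an illustrative helper.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between predicted and ground-truth counts."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return float(np.mean(np.abs(pred - gt)))

# Hypothetical ground-truth counts, all within the 8-16 range.
gt_counts = np.array([8, 10, 12, 14, 16])

# Predicting zero everywhere gives an MAE equal to the mean count.
zero_mae = mae(np.zeros_like(gt_counts), gt_counts)  # 12.0
```

Since every count in this split is at most 16, the all-zero predictor can never have an MAE above 16, which is the reviewer's reason for questioning a baseline value of 37.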

Comment: Appreciation of Reviews and Withdrawal Decision

2023-11-16

We appreciate your review of our paper. After careful consideration, we have decided to incorporate your feedback to further improve our paper and withdraw our submission. Thank you for your comments.