PseDet: Revisiting the Power of Pseudo Label in Incremental Object Detection

Official Review

Rating: 6 | Confidence: 5 | 2024-11-02

Incremental Object Detection (IOD) expands object detectors without forgetting prior knowledge. Existing methods rely on knowledge distillation but underutilize the teacher model's insights. This paper identifies three problems in pseudo-labeling: limited pseudo-label quality, fixed label filtering thresholds, and weak alignment of confidence scores with localization quality. To address these problems, this paper proposes PseDet, which includes a spatio-temporal enhancement module to handle noisy data, a Categorical Adaptive Label Selector for dynamic class-specific thresholds, and confidence score calibration, which better aligns label confidence scores with localization quality.

Strengths

The proposed modules look useful for incremental object detection, including the spatio-temporal enhancement module, the categorical adaptive label selector, and confidence score calibration supervision.
The motivation of the categorical adaptive label selector is clear.
The experimental performance of the proposed PseDet looks good compared to the SOTAs.
According to the ablation study in Table 3, the proposed modules are effective.

Weaknesses

It is unclear why the proposed spatio-temporal enhancement module is effective. Given the good empirical results, please provide more insight into why it works.
The algorithm of the categorical adaptive label selector is provided. Please add a caption and formulas to describe it more clearly.
In the categorical adaptive label selector, why does k-means on the confidence scores give a good threshold?
What is the difference between the proposed confidence score calibration supervision and the IoU-aware classification score in VFNet [A]?

Overall, the proposed modules are effective but lack in-depth explanation. I will give a rating of 5. If more in-depth explanation is provided, I will give a higher rating.

[A] VarifocalNet: An IoU-aware Dense Object Detector.

Questions

See the weaknesses.

Comment - Responses to Official Review by Reviewer qjDZ: Part[1/3]

2024-11-20

Dear Reviewer qjDZ,

Thanks for the valuable feedback. We have tried our best to address all concerns in the last few days. Please see our responses below one by one:

W1: In-depth explanation of the spatio-temporal enhancement module.

The spatio-temporal enhancement module aims to strengthen continual learning by addressing both the temporal and spatial domains. Our efforts primarily target two key challenges: improving the quality of pseudo-labels and enhancing the ability to learn effectively from these pseudo-labels.

(1) Temporal Enhancement: As continual learning progresses, the model forgets knowledge related to old categories. In Figure 3 of the manuscript, forgetting occurs between the initial step and the second step across all categories. The model's best performance on a category is observed in the initial step of learning this category. Therefore, for pseudo-labels of a specific category, we select the one closest to the initial step on the timeline, as the quality of the pseudo-labels is highest at this point. Temporal Enhancement is used to strengthen the model's ability to resist forgetting in multi-step scenarios. As shown in Figures 5(a) and (b), with temporal enhancement, the FPP on the initial categories decreases to 0.3 $AP$ and 0.5 $AP$ in the two-step and four-step settings, respectively.

(2) Spatio Enhancement: The spatial augmentations include horizontal flipping and scaling.

Models often exhibit a certain spatial bias, where features at specific positions are captured more easily by the model. Horizontal flipping can mitigate this bias. When we evaluate the GFL model, the performance is 41.5 AP under vanilla settings. After applying horizontal flipping, the performance is 30.8 AP. However, if we use NMS to fuse these two sets of predictions, the result improves to 42.2 AP.
Scaling gives the model a more flexible perspective, as the size of objects in an image is crucial for model predictions. Scaling down the image helps in recognizing objects that occupy a larger area in the image: the performance improves to 46.8 $AP_l$ after scaling down. Scaling up the image helps in identifying smaller objects: the performance improves to 18.6 $AP_s$ after scaling up. After integrating all the transformations, the performance improves to 43.1 $AP$ . The table below shows the performance after various spatial transformations, where $AP_s$ , $AP_m$ , and $AP_l$ respectively represent the AP for small, medium, and large objects. The best result is marked with bold.

Method	$AP$	$AP_{50}$	$AP_{75}$	$AP_{s}$	$AP_{m}$	$AP_{l}$
vanilla	41.5	66.2	43.8	16.5	29.3	45.6
scale down	40.1	63.6	42.6	8.3	24.3	46.8
scale up	33.1	58.1	32.8	18.6	28.8	34.3
Spatio Enhancement	43.1	66.8	44.1	19.2	29.7	47.3
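
The fusion step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes single-class boxes in `(x1, y1, x2, y2)` format, a plain greedy NMS, and a hypothetical IoU threshold of 0.5. The helper names (`unflip`, `unscale`, `fuse_views`) are invented for the example.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_thr=0.5):
    """Greedy NMS: keep the highest-scoring box, drop overlapping ones, repeat."""
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        if rest.size == 0:
            break
        order = rest[iou_one_to_many(boxes[i], boxes[rest]) <= iou_thr]
    return keep

def unflip(boxes, image_width):
    """Map boxes predicted on a horizontally flipped image back to the original frame."""
    out = boxes.copy()
    out[:, 0] = image_width - boxes[:, 2]
    out[:, 2] = image_width - boxes[:, 0]
    return out

def unscale(boxes, factor):
    """Map boxes predicted on an image resized by `factor` back to the original frame."""
    return boxes / factor

def fuse_views(views, iou_thr=0.5):
    """Concatenate predictions from all views (already in the original frame) and run NMS."""
    boxes = np.concatenate([b for b, _ in views], axis=0)
    scores = np.concatenate([s for _, s in views], axis=0)
    keep = nms(boxes, scores, iou_thr)
    return boxes[keep], scores[keep]

# Toy example on a 100-pixel-wide image (all numbers are made up).
vanilla = (np.array([[10.0, 10.0, 50.0, 50.0]]), np.array([0.9]))
flipped = (unflip(np.array([[50.0, 10.0, 90.0, 50.0]]), 100.0), np.array([0.8]))
scaled_up = (unscale(np.array([[120.0, 120.0, 160.0, 160.0]]), 2.0), np.array([0.6]))
fused_boxes, fused_scores = fuse_views([vanilla, flipped, scaled_up])
```

In the toy example the flipped view re-detects the same object as the vanilla view, so NMS keeps only the higher-scoring copy, while the scaled-up view contributes a box that the other views missed.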

Comment - Responses to Official Review by Reviewer qjDZ: Part[3/3]

2024-11-20

W4: The difference between the Confidence Score Calibration Supervision and the IoU-aware Classification Score in VFNet.

The differences will be explained in the following two parts:

(1) Differences in Usage:

In VFNet, IACS (IoU-aware Classification Score) reflects the comprehensive score of the samples predicted by the model during training, replacing the role of the confidence scores used in other models. In Varifocal Loss, IACS replaces the model's prediction score in focal loss. IACS is also used as a basis for ranking and selecting samples in modules such as the assigner.
In our method, the scores of the pseudo labels reflect their quality, and the pseudo labels are used for supervision. We use the calibrated score $q$ to represent the quality of the GT labels, and together with the IoU $\tau$ (between the generated bbox and its GT bbox) it defines the label $\hat{y} = \tau \cdot q$ .

In our method, the calibrated score is used to generate continuous GT labels; in VFNet, IACS is a comprehensive score used for assigners, ranking, calculating loss, etc.
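
As a rough illustration of the soft label $\hat{y} = \tau \cdot q$ described above, the sketch below computes the IoU between a predicted box and its matched pseudo-label box and scales it by the calibrated score. The function names and the `(x1, y1, x2, y2)` box format are assumptions made for this example, not details taken from the paper.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def soft_label(pred_box, pseudo_box, calibrated_score):
    """Continuous classification target: y_hat = tau * q."""
    tau = box_iou(pred_box, pseudo_box)  # localization quality of the prediction
    return tau * calibrated_score        # q is the calibrated pseudo-label score

# A prediction that covers twice the area of its pseudo-label box has IoU 0.5.
y_hat = soft_label((0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 5.0), calibrated_score=0.9)
```

A perfectly localized prediction receives the calibrated score itself as its target, and the target shrinks as the overlap with the pseudo-label box decreases.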

(2) Differences in Essence:

In VFNet, IACS is generated by merging the object's predicted confidence and localization accuracy into a single composite score. The score is based on the prediction of a sample's classification and bbox. It is obtained through learning.
In our method, the score is derived from a mathematical mapping of the model's confidence score, with the aim of making the score more relevant to the quality of pseudo-labels.

In summary, Confidence Score Calibration Supervision is specifically designed for pseudo-label based tasks, making it more suitable for pseudo-labeling methods. It can be effectively applied to other tasks and models, demonstrating superior performance in continual learning tasks. However, IACS is specifically tailored for object detectors with classification and regression branches. It varies across different models and tasks, making it unsuitable for pseudo-labeling tasks. Additionally, it is difficult to extend it to other pseudo-label based tasks such as segmentation or classification.

Comment - Responses to Official Review by Reviewer qjDZ: Part[2/3]

2024-11-20

W2: Give a more detailed explanation of the algorithm of the categorical adaptive label selector.

We will add more details of Algorithm 1 in the revision. Further explanations and proofs regarding Algorithm 1 are also provided in the response to W3. Thanks for this suggestion!

W3: The reason why k-means of the confidence scores can serve as a good threshold in label selector.

To start with, we infer pseudo-labels represented by $(c, s, b)$ using the old model $\theta_{old}$ , where $c$ denotes the class, $s$ indicates the confidence score, and $b$ represents the position coordinates of the bounding box.
Based on careful observation, we have noted that, for each class $i$ , the model's prediction confidence exhibits a distinct bimodal distribution, comprising both high-confidence regions (positive samples) and low-confidence regions (noise). The data in these regions respectively display a "Gaussian bell" shape centered around specific means $\mu$ :

High-confidence peak: This peak indicates that the model is highly confident in its detection results, representing high-quality pseudo-labels. These samples are characterized by clear features and high accuracy, clustering around $\mu_T$ .
Low-confidence peak: Typically, this peak reflects the model's uncertain predictions or samples that are difficult to distinguish. In this region of distributions, the pseudo-labels are mostly inaccurate noise, clustering around $\mu_F$ .

The confidence score follows a bimodal distribution, and its probability density can be expressed as:

$p(s) = \pi_T \mathcal{N}(s \mid \mu_T, \sigma_T^2) + \pi_F \mathcal{N}(s \mid \mu_F, \sigma_F^2)\quad[1]$

Let $S_i \in \mathbb{R}^n$ represent the set of confidence scores for the $i$-th class. In the E-step of the Expectation-Maximization (EM) algorithm, we assign each data point to the cluster for which it has the highest posterior. The posterior is:

$p(c_j=k \mid s_j) = \frac {\pi_k \mathcal{N}(s_j \mid \mu_k, \sigma_k^2) } {\pi_T \mathcal{N}(s_j \mid \mu_T, \sigma_T^2) + \pi_F \mathcal{N}(s_j \mid \mu_F, \sigma_F^2)}\quad[2]$

For convenience of analysis, we take the reciprocal of the posterior of the true cluster:

$p(c_j = True \mid s_j)^{-1} = 1 + \frac{\pi_F \mathcal{N}(s_j \mid \mu_F, \sigma_F^2)}{\pi_T \mathcal{N}(s_j \mid \mu_T, \sigma_T^2)} = 1 + \frac{\pi_F \sigma_T}{\pi_T \sigma_F} \exp\left( \frac{1}{2\sigma_T^2}(s_j-\mu_T)^2 - \frac{1}{2\sigma_F^2}(s_j-\mu_F)^2 \right)\quad[3]$

Here, K-means aims to partition data into two clusters by minimizing the Within-Cluster Sum of Squares:

$J = \sum_{k \in \{True, False\}} \sum_{j \in C_k} \Vert s_j - \mu_k \Vert^2 \quad[4]$

In this distribution, samples can be considered as having soft assignments. However, K-means uses hard assignments, placing sample $j$ in the cluster $k$ ( $k \in \{True, False\}$ ) whose mean is nearest:

$c_j = \arg\min_k \| s_j - \mu_k \|^2\quad[5]$

By combining Equations (3) and (5), we obtain:

$\| s_j - \mu_k \|^2 \propto p(c_j=k \mid s_j)^{-1}\quad[6]$

When the two components have comparable variances and mixing weights, maximizing the posterior probability $p(c_j=k \mid s_j)$ is equivalent to minimizing $\| s_j - \mu_k \|^2$ .

K-Means can be considered a special case of the Expectation-Maximization (EM) algorithm with hard assignments, aimed at modeling the bimodal distribution of scores for true and false clusters.
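
As a worked special case (an illustrative assumption, not a claim made in the response itself), take equal variances $\sigma_T = \sigma_F = \sigma$ and equal weights $\pi_T = \pi_F$ in the mixture of Equation [1]. The reciprocal posterior of the true component then reduces to:

$p(c_j = True \mid s_j)^{-1} = 1 + \exp\left( \frac{(s_j-\mu_T)^2 - (s_j-\mu_F)^2}{2\sigma^2} \right)$

so $p(c_j = True \mid s_j) > \frac{1}{2}$ exactly when $(s_j-\mu_T)^2 < (s_j-\mu_F)^2$ . With $\mu_T > \mu_F$ this is the condition $s_j > \frac{\mu_T + \mu_F}{2}$ : the hard assignment of K-means and the maximum-posterior assignment agree, and the implied threshold is the midpoint of the two cluster means.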

Finally, we distinguish between the true positive cluster $C_T$ and the false positive cluster $C_F$ based on the criterion $\mu_T > \mu_F$ . In addition, K-means clustering is often highly sensitive to the choice of initial centroids. Therefore, we repeat the clustering process multiple times.
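
A minimal sketch of this per-class selection is given below, assuming a plain two-cluster Lloyd iteration on the 1-D confidence scores of a single class. The function name `split_scores`, the number of restarts, and the toy scores are hypothetical; the response does not specify how the best run is chosen, so here the run with the lowest within-cluster sum of squares is kept.

```python
import numpy as np

def split_scores(scores, n_restarts=10, n_iters=100, seed=0):
    """Split 1-D confidence scores of one class into a high and a low cluster.

    Returns (kept, dropped, threshold), where the threshold is the midpoint
    between the two cluster means.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    best = None
    for _restart in range(n_restarts):
        # Random initial centroids drawn from the data.
        centers = rng.choice(scores, size=2, replace=False)
        for _step in range(n_iters):
            # Hard assignment: each score goes to the nearest centroid.
            assign = np.abs(scores[:, None] - centers[None, :]).argmin(axis=1)
            new_centers = centers.copy()
            for k in range(2):
                if np.any(assign == k):
                    new_centers[k] = scores[assign == k].mean()
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        assign = np.abs(scores[:, None] - centers[None, :]).argmin(axis=1)
        wcss = float(((scores - centers[assign]) ** 2).sum())
        if best is None or wcss < best[0]:
            best = (wcss, centers.copy(), assign.copy())
    _, centers, assign = best
    high = int(np.argmax(centers))  # the cluster with the larger mean is treated as true positives
    kept = scores[assign == high]
    dropped = scores[assign != high]
    return kept, dropped, float(centers.mean())

# Toy scores for one class: four low-confidence and four high-confidence detections.
kept, dropped, threshold = split_scores([0.05, 0.10, 0.12, 0.08, 0.85, 0.90, 0.92, 0.88])
```

Because the split is recomputed from each class's own score distribution, a class whose detections are generally less confident ends up with a lower effective threshold than a class with confident detections, which is the behavior a fixed global threshold cannot provide.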

Official Review

Rating: 6 | Confidence: 3 | 2024-11-04

The paper addresses the challenge of Incremental Object Detection (IOD), where object detectors need to learn new classes without forgetting previously learned ones. The authors propose PseDet, a framework that improves pseudo-labeling in three key ways:

Spatio-Temporal Enhancement Module: Improves pseudo-label quality by combining predictions from different spatial transformations and temporal steps.
Categorical Adaptive Label Selector: Automatically determines class-specific thresholds for filtering pseudo-labels using K-means clustering, replacing fixed thresholds.
Confidence Score Calibration: Aligns confidence scores with localization quality through non-linear mapping for better supervision.

The approach achieves state-of-the-art performance on the COCO dataset, with 43.5+/41.2+ mAP under 1/4-step incremental settings.

Strengths

The paper identifies and addresses three critical issues in current pseudo-labeling approaches.
The proposed solutions are practical and well-justified.
The framework is modular and can potentially be integrated with other detection systems.
PseDet achieves large performance improvements on several benchmarks.

Weaknesses

Algorithm 1 is not clear. What do D^c_T and D^c_F mean? Why are D^c_T and D^c_F the output of K-means? I suggest the authors clarify these points.
The non-linear mapping function for confidence calibration seems empirically determined without strong theoretical justification. Can PseDet generalize well to other datasets?

Questions

See the weaknesses above.

Comment - Responses to Official Review by Reviewer T1Rk: Part[3/3]

2024-11-20

W2.2: Implementing the method on other datasets.

Thank you for your highly valuable questions, which have been instrumental in enhancing our work. Over the past few days, we have implemented our method on VOC 2007, designing various settings including one-step (10+10, 15+5, and 19+1) and multi-step (10+5+5 and 5+5+5+5). The basic settings and pipeline are the same as those presented in our manuscript, and we choose GFL [ref1] as the detector. We replicated ERD [ref2] using the official release repository. Experimental results demonstrate that our method continues to exhibit strong performance and robustness. The source code is released in an anonymous code repository.

The table below shows incremental results ( $mAP$ , %) under the one-step setting. AbsGap (lower is better) and RelGap (lower is better) represent the absolute gap and the relative gap to the upper bound.

Scenarios	Method	$AP\uparrow$	$AP_{50}\uparrow$	$AP_{75}\uparrow$	$AbsGap\downarrow$	$RelGap\downarrow$
Upper Bound	-	41.5	66.2	43.8	-	-
10 + 10	ERD	34.2	60.1	37.2	7.3	17.6%
	PseDet	37.4	61.2	39.1	4.1	9.8%
15 + 5	ERD	36.4	58.6	38.7	5.1	12.3%
	PseDet	39.8	64.7	41.3	1.7	4.1%
19 + 1	ERD	37.3	59.2	38.3	4.2	10.1%
	PseDet	40.0	65.0	41.1	1.5	3.6%
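
The gap columns can be reproduced with a few lines of arithmetic. The definitions below (absolute gap as upper bound minus $AP$, relative gap as that difference divided by the upper bound) are inferred from the table values and are an assumption rather than something stated explicitly in the response.

```python
def gaps(ap, upper_bound):
    """Return (AbsGap, RelGap in %) of a method's AP relative to the upper bound."""
    abs_gap = upper_bound - ap
    rel_gap = abs_gap / upper_bound * 100.0
    return round(abs_gap, 1), round(rel_gap, 1)

UPPER_BOUND_AP = 41.5  # "Upper Bound" row of the table

erd_10_10 = gaps(34.2, UPPER_BOUND_AP)    # ERD, 10 + 10
erd_15_5 = gaps(36.4, UPPER_BOUND_AP)     # ERD, 15 + 5
psedet_19_1 = gaps(40.0, UPPER_BOUND_AP)  # PseDet, 19 + 1
```

Under these definitions the ERD 15 + 5 row evaluates to a relative gap of 12.3, which matches the value in the table read as a percentage.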

The table below shows the results ( $mAP$ , %) under the scenarios of 10+5+5.

Method	10-15	15-20
ERD	34.7	28.6
PseDet	36.2	34.2

The table below shows the results ( $mAP$ , %) under the scenarios of 5+5+5+5.

Method	5-10	10-15	15-20
ERD	30.1	27.9	23.8
PseDet	32.6	31.5	30.2

Notably, in the multi-step setting, we observe that ERD exhibits rapid forgetting after the second step, whereas our method forgets much more slowly. This phenomenon is also related to the anti-noise methods we have designed for the knowledge transfer process.

[ref1] Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection. NeurIPS 2020
[ref2] Overcoming Catastrophic Forgetting in Incremental Object Detection via Elastic Response Distillation. CVPR 2022

Comment - Responses to Official Review by Reviewer T1Rk: Part[2/3]

2024-11-20

W2.1: About the non-linear mapping function for confidence calibration.

Below, we offer some deeper insights in the hope of addressing your question. The non-linear mapping function is the solution we propose based on our observations.

The model's predicted confidence scores lie within the interval [0,1]. However, a higher confidence score does not always indicate a better-quality pseudo-label, and vice versa. If we use the IoU between the pseudo-label and the GT to represent the quality of the pseudo-label, we observe that confidence scores and quality do not exhibit a strong linear relationship, as shown in Figure 4(a).

Typically, a pseudo-label with a score of 0.7 is very close to the ground truth, while one with a score of 0.3 differs significantly from it. In soft-label supervision, we prefer that high-quality labels be adjusted to 0.9 or higher, and low-quality labels or noise to 0.1 or lower. This non-linear mapping aims to reduce noise interference and enhance the learning of valuable information. The reason we do not simply remove low-scoring labels right away is that, after applying the Adaptive Label Selector, we can still get some helpful information from them.

For this purpose, we carried out many experiments using different non-linear functions. We selected sigmoid, tanh, and logarithmic functions, and we performed mathematical transformations on these functions to better suit the calibration of scores.

Method	Overall Performance
w/o calibration	34.1 49.2 37.2
$tanh(\alpha \cdot (s - \delta) )$	38.1 55.1 41.7
$\ln (\alpha \cdot (s - \delta) ) + \beta$	35.7 50.4 37.9
$sigmoid(\alpha \cdot (s - \delta) )$	38.5 54.9 41.9

As indicated by the experimental results, we require an S-shaped function, one that is concave in high-score regions and convex in low-score regions, which is why we have chosen the sigmoid function. As we expected, a function is effective if it boosts the scores of high-quality samples and lowers those of low-quality ones, increasing the gap between them under supervision. However, the hyperparameters of the selected function need to be tuned to attain optimal performance, similar to adjusting the learning rate across different datasets. Nevertheless, in our extended experiments, we employed the same hyperparameters on the VOC and COCO datasets and achieved good performance on both.

Comment - Responses to Official Review by Reviewer T1Rk: Part[1/3]
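
The sigmoid mapping $sigmoid(\alpha \cdot (s - \delta))$ from the table can be sketched as below. The values `alpha=12.0` and `delta=0.5` are hypothetical, chosen only so that the example reproduces the behavior described in the response (a score of 0.7 mapped above 0.9, a score of 0.3 mapped below 0.1); the hyperparameters actually used are not given here.

```python
import math

def calibrate(score, alpha=12.0, delta=0.5):
    """Map a raw confidence score to a calibrated one with a shifted, scaled sigmoid.

    alpha controls how sharply scores are pushed toward 0 or 1;
    delta is the raw score that maps to 0.5. Both defaults are illustrative.
    """
    return 1.0 / (1.0 + math.exp(-alpha * (score - delta)))

high = calibrate(0.7)  # a confident pseudo-label is pushed toward 1
low = calibrate(0.3)   # a likely-noisy pseudo-label is pushed toward 0
mid = calibrate(0.5)   # the pivot score stays at 0.5
```

The mapping is monotonic, so it preserves the ranking of pseudo-labels while widening the gap between confident and doubtful ones in the soft-label targets.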

2024-11-20

Dear Reviewer T1Rk,

Thank you for the valuable feedback. Over the past few days, we have worked diligently to address all of your concerns. Please see our detailed responses to each point below:

W1: Algorithm 1 is not clear.

(1) $D^c_T$ represents the set of true positive samples for category $c$ and serves as the pseudo-label set used for training. $D^c_F$ represents the set of false positive samples for category $c$ , which are regarded as noise and discarded. $D^c_T$ and $D^c_F$ are obtained by running k-means clustering on the set of confidence scores for category $c$ .
(2) Theoretical proof: To start with, we infer pseudo-labels represented by $(c, s, b)$ using the old model $\theta_{old}$ , where $c$ denotes the class, $s$ indicates the confidence score, and $b$ represents the position coordinates of the bounding box.
Based on careful observation, we have noted that, for each class $i$ , the model's prediction confidence exhibits a distinct bimodal distribution, comprising both high-confidence regions (positive samples) and low-confidence regions (noise). The data in these regions respectively display a "Gaussian bell" shape centered around specific means $\mu$ :

High-confidence peak: This peak indicates that the model is highly confident in its detection results, representing high-quality pseudo-labels. These samples are characterized by clear features and high accuracy, clustering around $\mu_T$ .
Low-confidence peak: Typically, this peak reflects the model's uncertain predictions or samples that are difficult to distinguish. In this region of distributions, the pseudo-labels are mostly inaccurate noise, clustering around $\mu_F$ .