Average rating: 6.7 / 10 (3 reviewers)
Ratings: 6 / 8 / 6 (min 6, max 8, std 0.9)
Average confidence: 4.0
ICLR 2024 (Poster)

MixSup: Mixed-grained Supervision for Label-efficient LiDAR-based 3D Object Detection

Submitted: 2023-09-15 · Updated: 2024-03-06
TL;DR

We propose a practical and universal paradigm utilizing mixed-grained supervision for label-efficient LiDAR-based 3D object detection.

Abstract

Keywords
Label-efficient, LiDAR-based 3D object detection

Reviews and Discussion

Official Review (Rating: 6)

This paper introduces MixSup, a 3D object detection method that allows for mixed-grained supervision. The paper describes the phenomenon that a general point cloud detector relies heavily on coarse labels for semantic recognition and requires only a few precise labels for object geometry. The method has been adapted to mainstream detectors and tested on various datasets, achieving nearly 97.31% of the fully supervised performance with reduced annotation.

Strengths

The author's illustrations are well-done, and the article is written in a relatively coherent manner.

Weaknesses

  1. The illustration of the distinct properties of point clouds compared to images in the introduction section seems not very relevant to the motivation and the concept of the study (the paper is not about multi-modality fusion).
  2. The details of the pilot study are not quite clear. It seems that the well-classified dataset contains only region-level point clouds. Training an spconv-based detector on this dataset is not the same as training on point cloud scenes, since scene-level point clouds can provide more context information.
  3. The motivation "a good detector needs massive semantic labels for difficult semantic learning but only a few accurate labels for geometry estimation" is simply another way of expressing a common phenomenon in object detection, namely that high recall often contributes more to high average precision (AP).
  4. The authors propose to create cluster-level supervision as the parallelogram formed by three clicks. If one can generate massive pseudo-boxes from clicks, why not just use a region-based network (e.g., the second-stage PointNet in LiDAR-RCNN) to refine these boxes into precise box-level supervision?
  5. From the authors' experiments in Table 4, previous methods like MV-JAR achieve performance very close to the authors' with 10% annotations. Considering that the authors use massive cluster-level semantic supervision, does this indicate that massive semantic labels are not that important?

Questions

  1. What is the difference between the settings of 10% boxes annotation and 10% frames annotation?
  2. The authors use the cluster center to substitute for the object center in semantic learning. I think this will introduce training instability, since the clusters contain noise as they are generated from clicks / the SAM model. I suggest the authors perform another experiment by training the semantic head with voxel-to-point interpolation (e.g., as used in PVCNN, SASSD) and imposing point-level supervision.
Comment

We sincerely appreciate your generous suggestions and the thoughtful questions you've provided. We highly value your feedback and will provide detailed explanations to address your concerns.

W1: The illustration seems not very relevant to the motivation and the concept of the study.

We sincerely apologize for any potential confusion caused by the introduction. In the upcoming revised version, we will present it in a clearer and more precise manner. It's important to clarify that the images in the introduction are intended to offer readers a more vivid and intuitive depiction of the distinctive properties of point clouds.

W2: The detail of pilot study is not quite clear.

Firstly, we sincerely apologize for any lack of clarity in the pilot study. We indeed trained an spconv-based detector on well-classified regions using varying amounts of data to validate our claim. The purpose of using well-classified regions for training is to factor out the impact on semantic learning, while only focusing on the impact on geometry estimation. Certainly, as you rightfully noted, compared to scene-level point clouds, region-level data does contain less contextual information. However, we believe such a difference does not weaken our conclusion in the pilot study for the following two reasons.

  • Geometry estimation (regression) does not necessitate too much spatial context. The second-stage detector LiDAR-RCNN [R1] employs region-level point clouds to predict accurate object poses. It only incorporates context by enlarging the proposal dimensions by 1.5 meters on each side. Its success demonstrates that precise pose estimation does not necessitate an excessive amount of contextual information. Other two-stage detectors, such as FSD [R2], also leverage points inside proposals to make accurate pose refinements with a small proposal enlargement (e.g., 0.5 m). Therefore, our experiments based on region-level data to assess the impact of data amount on regression performance are reasonable.
  • Our region-based network also utilizes the necessary context information. We adopted a proposal enlargement strategy similar to LiDAR-RCNN's, providing sufficient contextual information for regression (a tiny sketch of this enlargement is given after this list).
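As noted above, here is a tiny sketch of the proposal enlargement (our own illustration, not code from LiDAR-RCNN or the paper); the `[x, y, z, l, w, h, heading]` box layout and the BEV-only enlargement are assumptions:

```python
def enlarge_proposal(box, margin=1.5):
    """Enlarge a 3D proposal [x, y, z, l, w, h, heading] by `margin` meters on
    each side in the BEV plane, so the second-stage crop gathers nearby context
    points around the object."""
    x, y, z, l, w, h, yaw = box
    return [x, y, z, l + 2.0 * margin, w + 2.0 * margin, h, yaw]
```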

W3: The motivation is simply another way of expressing a common phenomenon in object detection, namely that high recall often contributes more to high average precision (AP).

We acknowledge that this phenomenon is reasonable in 2D object detection, and we truly thank the reviewer for pointing it out. But it's crucial to note that 3D detection is more sensitive to regression quality compared to its 2D counterpart, due to the following two reasons:

  • In the 3-dimensional space, the additional spatial dimension makes achieving high 3D IoU more challenging than achieving high 2D IoU.
  • 3D object detection requires the consideration of orientation angles, which introduces an additional freedom in the matching process between predictions and GTs.

Taking the widely used autonomous driving dataset Waymo as an example, Waymo sets the official IoU threshold as 0.7 for vehicle, and proposes the metric APH that incorporates heading as a weighted factor, which is a strict metric for evaluating a 3D object detector.

To vividly illustrate the rigorous evaluation of geometry attributes in 3D object detection, we conduct experiments with a pre-trained CenterPoint model. Specifically, we introduce different noises to the predicted boxes.

| Noise in location (m) | Noise in dimension (m) | Noise in heading (degree) | Vehicle L1 AP | Vehicle L1 APH | Vehicle L2 AP | Vehicle L2 APH |
|---|---|---|---|---|---|---|
| - | - | - | 73.43 | 72.88 | 65.42 | 64.92 |
| 0.05 | 0.05 | 5 | 61.39 | 59.62 | 54.13 | 52.57 |
| 0.1 | 0.1 | 7.5 | 23.82 | 22.84 | 20.66 | 19.81 |

The introduced noise (e.g., 0.05m) is minor because vehicles are usually more than 4 meters long. Experimental results demonstrate that even minor geometry attribute deviations can lead to significant performance drops under the strict metric. This highlights the need for a good 3D detector with strong geometry estimation capabilities, as high recall alone may not guarantee good performance.
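For reference, a minimal sketch (our own illustration, not the evaluation code behind the table above) of how such box perturbations can be injected before evaluation; the `[x, y, z, l, w, h, heading]` layout and the uniform noise distribution are assumptions:

```python
import numpy as np

def perturb_boxes(boxes, loc_noise=0.05, dim_noise=0.05, heading_noise_deg=5.0, seed=0):
    """Add uniform noise to predicted boxes of shape (N, 7) = [x, y, z, l, w, h, heading]
    before running the detection metric, mimicking the sensitivity study above."""
    rng = np.random.default_rng(seed)
    noisy = np.asarray(boxes, dtype=np.float64).copy()
    n = len(noisy)
    noisy[:, 0:3] += rng.uniform(-loc_noise, loc_noise, size=(n, 3))   # location (m)
    noisy[:, 3:6] += rng.uniform(-dim_noise, dim_noise, size=(n, 3))   # dimensions (m)
    noisy[:, 6] += rng.uniform(-np.deg2rad(heading_noise_deg),
                               np.deg2rad(heading_noise_deg), size=n)  # heading (rad)
    return noisy
```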

Comment

References

  • [R1] Li, Zhichao, et al. "Lidar r-cnn: An efficient and universal 3d object detector." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
  • [R2] Fan, Lue, et al. "Fully sparse 3d object detection." Advances in Neural Information Processing Systems 35 (2022): 351-363.
  • [R3] Meng, Qinghao, et al. "Weakly supervised 3d object detection from lidar point cloud." European Conference on computer vision. Cham: Springer International Publishing, 2020.
  • [R4] Meng, Qinghao, et al. "Towards a weakly supervised framework for 3d point cloud object detection and annotation." IEEE Transactions on Pattern Analysis and Machine Intelligence 44.8 (2021): 4454-4468.
  • [R5] Shi, Shaoshuai, et al. "Pv-rcnn: Point-voxel feature set abstraction for 3d object detection." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020.
Comment

W4: Why not just using a region-based network (e.g. the second stage PointNet in LiDAR-RCNN) to refine these boxes as precise box-level supervision?

We greatly appreciate this insightful suggestion, but there are some potential drawbacks:

  • Pretrained region-based detectors cannot be directly utilized. There are domain gaps between human-labeled cluster labels and the input regions of a region-based detector. For example, the input of LiDAR-RCNN comes from the output proposals of other detectors after being enlarged, which is quite different from our cluster labels. So it is infeasible to directly utilize an off-the-shelf pretrained region-based detector. Moreover, the diverse data distributions of different datasets make it even less feasible to use pretrained region-based detectors from other datasets. How to improve the transfer performance of region-based detectors remains open.

  • Transforming coarse labels to precise boxes for supervision introduces inherent errors, reducing the performance upper bound. This is because the predicted precise boxes cannot be as good as human annotations. In contrast, although our coarse cluster labels do not have accurate geometry information, the semantic label for each cluster remains precise. Furthermore, the regression learning in MixSup only relies on accurate box-level labels. Thus, although we adopt coarse labels, the supervision in our method does not have such inherent errors and potentially offers a higher performance upper bound.

Although there are these drawbacks, your suggestion is indeed very insightful. We will try to make it more feasible in the future, for example by improving the transferability of region-based detectors.

W5: MV-JAR can achieve very close performance to author's with 10% annotations.

Thank you for expressing your concerns, and we appreciate the opportunity to clarify this point. It's essential to note that in Table 4, MixSup adopts SECOND as the base detector rather than SST, which is a better detector than SECOND. Therefore, the marginal improvement of MixSup over MV-JAR in Table 4 is reasonable.

To further address your concerns, we also take SST as the base detector to compare with MV-JAR fairly. As shown in the following table, MixSup outperforms MV-JAR with the same base detector, showing that using extra coarse labels leads to improvement.

| Detector | Label-efficient method | Annotation | Mean | Vehicle | Pedestrian | Cyclist |
|---|---|---|---|---|---|---|
| SST | MV-JAR | 10% annotation cost | 54.06 | 58.00 | 54.66 | 49.52 |
| SECOND | MixSup | 10% annotation cost | 54.23 | 55.02 | 49.61 | 58.06 |
| SST | MixSup | 10% annotation cost | 60.74 | 59.10 | 60.00 | 63.13 |

Q1: What is the difference between the settings of 10% boxes annotation and 10% frames annotation?

“10% frames annotation” indicates using 10% of all annotated frames, where all objects in each used frame are labeled. “10% boxes annotation" means that the annotated boxes account for 10% of all boxes. It is noteworthy that, due to random sampling and the large scale of driving datasets (e.g., Waymo contains millions of annotated objects), 10% boxes annotation and 10% frames annotation involve almost the same annotation cost.
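To make the two settings concrete, here is a minimal sketch (a hypothetical helper, not the authors' data pipeline) of how each subset could be drawn:

```python
import random

def sample_annotations(frames, ratio=0.10, mode="frames", seed=0):
    """frames: list of per-frame box lists.
    'frames' mode keeps every box in a random 10% of frames;
    'boxes' mode keeps a random 10% of all boxes across the training set."""
    rng = random.Random(seed)
    if mode == "frames":
        kept_ids = rng.sample(range(len(frames)), k=max(1, int(ratio * len(frames))))
        return {i: frames[i] for i in kept_ids}  # frame_id -> all boxes in that frame
    all_boxes = [(i, box) for i, boxes in enumerate(frames) for box in boxes]
    return rng.sample(all_boxes, k=max(1, int(ratio * len(all_boxes))))  # (frame_id, box) pairs
```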

Q2: I suggest author can perform another experiment by training the semantic head by leveraging voxel-to-point interpolation (eg. used in PVCNN, SASSD) and imposing point-level supervision.

We sincerely appreciate your insightful suggestion. Your proposal is indeed quite reasonable.

In fact, our method has already employed such supervision in certain situations, such as PV-RCNN [R5] (tested on the KITTI validation split with moderate difficulty). Specifically, PV-RCNN generates features for key points through the voxel set abstraction module, which is very similar to the voxel-to-point interpolation from neighboring voxels that you pointed out.

Moreover, we supplement a point-based method, FSD [R2], which adopts a segmentation network in the first stage to obtain foreground points with point-level supervision, and test it on the Waymo validation split with L2 APH.

Indeed, as you pointed out, MixSup incorporating point-level supervision has shown excellent performance. We sincerely appreciate your valuable suggestion!

| Detector | Car | Pedestrian | Cyclist |
|---|---|---|---|
| PV-RCNN (100% frames) | 83.61 | 57.90 | 70.47 |
| PV-RCNN (MixSup)* | 76.09 (91.01%) | 54.33 (93.83%) | 65.67 (93.19%) |

| Detector | Mean | Vehicle | Pedestrian | Cyclist |
|---|---|---|---|---|
| FSD (100% frames) | 71.27 | 70.09 | 69.79 | 73.93 |
| FSD (MixSup)* | 68.57 (96.21%) | 66.08 (94.28%) | 66.53 (95.33%) | 73.09 (98.86%) |

*: Using coarse cluster-level labels and 10% box-level labels. The percentage in parentheses indicates the performance ratio to the fully supervised counterpart.

Comment

If 10% boxes annotation and 10% frames annotation involve almost the same annotation cost, is the comparison unfair since the cluster labels are involved?

Comment

Thanks for your follow-up question, and truly sorry for causing such a misunderstanding. We clarify our settings and answer this question in the following aspects.

  • We list some results in the following table, where the experiments adopt the same "annotation cost". Here the “annotation cost” is a comprehensive metric taking both accurate box annotation and coarse cluster annotation into account. Its details are also listed below.
| Detector | Label-efficient method | Annotation | Mean L2 APH (Waymo val) |
|---|---|---|---|
| SECOND | - | all frames | 57.23 |
| SECOND | - | 10% frames | 49.11 |
| SECOND | FixMatch [R6] | 10% frames | 51.45 |
| SECOND | ProficientTeacher [R7] | 10% frames | 54.16 |
| SECOND | MixSup (ours) | 10% annotation cost | 54.23 |
| SST | - | all frames | 65.54 |
| SST | - | 10% frames | 50.46 |
| SST | MV-JAR [R8] | 10% frames | 54.06 |
| SST | MixSup (ours) | 10% annotation cost | 60.74 |
  • We further explain the details of “10% annotation cost”, which consists of the cost of box labels and coarse cluster labels. The following table shows the statistics.
| #. used box labels | #. used cluster labels | #. all instances in training set |
|---|---|---|
| 312,269 | 2,898,813 | 7,061,433 |

As illustrated in Sec 5.3, we conducted a human labeling study by asking three experienced annotators to label 100 frames, containing thousands of objects. Based on the statistics gathered from the annotators, the average time cost of a coarse cluster label is 14% of an accurate box label. In this case, the overall annotation cost is evaluated using the following equation.

$$\mathrm{cost}=\frac{312{,}269 + 2{,}898{,}813\times 0.14}{7{,}061{,}433}\times 100\% \approx 4.42\% + 5.75\% \approx 10.17\%$$

Thus, at the same annotation cost, MixSup still shows its superiority (a short script reproducing this arithmetic is given after this list).

  • Finally, we would like to clarify that there are two reasons why we put “10% boxes annotations + coarse labels” on the table in our paper.
    • The first reason is that we want to imply that the proposed MixSup could utilize “additional” coarse labels, which is the core feature of MixSup.
    • The second reason is that we would like to offer readers a more comprehensive understanding of our performance.
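For completeness, the promised script reproducing the annotation-cost arithmetic above (the only inputs are the counts from the table and the 14% per-label time ratio):

```python
num_box_labels = 312_269          # accurate box-level labels used
num_cluster_labels = 2_898_813    # coarse cluster-level labels used
num_all_instances = 7_061_433     # all instances in the training set
cluster_time_ratio = 0.14         # a coarse label costs ~14% of a box label

cost = (num_box_labels + num_cluster_labels * cluster_time_ratio) / num_all_instances
print(f"box part:     {num_box_labels / num_all_instances:.2%}")                            # ~4.42%
print(f"cluster part: {num_cluster_labels * cluster_time_ratio / num_all_instances:.2%}")   # ~5.75%
print(f"total cost:   {cost:.2%}")                                                          # ~10.17%
```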

We hope our explanation could resolve your concern. If not, we are open to any further discussion. If it does, we would greatly appreciate it if you could kindly offer us a positive rating. Many thanks in advance.

References

[R6] Sohn, Kihyuk, et al. "Fixmatch: Simplifying semi-supervised learning with consistency and confidence." Advances in neural information processing systems 33 (2020): 596-608.

[R7] Yin, Junbo, et al. "Semi-supervised 3D object detection with proficient teachers." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022.

[R8] Xu, Runsen, et al. "MV-JAR: Masked Voxel Jigsaw and Reconstruction for LiDAR-Based Self-Supervised Pre-Training." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

Comment

After carefully reading the author's rebuttal and the issues raised by other reviewers, I am inclined to raise my score to 5. My concern still lies in the lack of clarity in the pilot study and experimental settings. I hope the author can incorporate the additional experiments from the rebuttal into the revised manuscript.

Comment

Thank you so much for the feedback and rating. We have provided clarifications about the pilot study and experiments in the uploaded revision, including but not limited to

  • A more in-depth analysis of the pilot study and more accurate expressions. [Section 3: Pilot Study; Appendix A.2 Exploration on Semantic Learning in Pilot Study]
  • Quantitative comparison with more methods. [Section 5.4 Main Results, Table 4; Appendix B Discussion on Weakly-supervised Learning]
  • Quantitative comparison under the exact same annotation cost. [Section 5.4 Main Results, Table 5]
  • Experiments with human-level coarse labels. [Appendix D.2 Noise in Human-annotated Coarse Label]
  • Robustness to inaccurate calibration in automated coarse labeling. [Appendix C.3 Resistance of SAR to Inaccurate Calibration]

Thus, we would like to ask if you could kindly let us know your specific concerns on “pilot study and experimental settings”? We will definitely try our best to address them.
Moreover, we promise to release the full code containing the detailed experiment settings.

Official Review (Rating: 8)

This work aims to improve the 3D object detection accuracy using cheap annotations. The authors are motivated by three distinct properties of LiDAR point clouds when designing their method, i.e., texture absence, scale invariance, and geometric richness. The following assumption is then made: “A good detector needs massive semantic labels for difficult semantic learning but only a few accurate labels for geometry estimation”.

In their endeavor to substantiate this assumption, an insightful pilot study is put forth. The study's essence is to underscore that extant 3D object detectors excel at deducing geometric information. This is highlighted by the observation that “detector performance remains relatively stable across data sizes ranging from a mere 5% to a comprehensive 100%”.

This work then proposes MixSup for conducting mixed-grained supervision on LiDAR-based 3D object detectors. MixSup utilizes massive cheaper cluster labels and a few more accurate box labels for label-efficient learning, where the cluster labels can be retrieved via three coarse clicks around the object corners. To further reduce human annotation effort, the authors use SAM to generate coarse instance labels from the camera images and then refine and map these labels to the point cloud.

Experiments on nuScenes, Waymo Open, and KITTI datasets verified the effectiveness of the proposed approach.

Strengths

  • This research is firmly rooted in a well-articulated motivation and logical progression.
  • The proposed approach is intuitive and efficient in handling LiDAR-based 3D object detection.
  • This paper is overall very well written and pleasant to read. Meanwhile, the authors made a promise of making their code open-source, which could facilitate future research in this task.

Weaknesses

  • Several design decisions appear to be largely heuristic in nature. A number of assertions lack rigorous justification or analytical backing.
  • The paper omits certain pivotal implementation specifics. This omission hampers a clear assessment of the proposed method's true efficacy.

Questions

Questions to the Authors

  • Q1: The pilot study's design and outcomes could gain depth with a more granular breakdown of the experimental setup and subsequent insights. Furthermore, could you elucidate the term “data size”? Is it suggestive of the detectors' adaptability to varying “object/instance sizes”?

  • Q2: The mechanics of the “clicking annotation” remain somewhat elusive. How were the annotators guided in executing this task? Was there a specific strategy adopted for different instance types? Was this manual labeling extended to all point clouds during training? If not, how were the crucial clicking points discerned? This reviewer believes a more thorough exposition on this subject would be beneficial in the rebuttal phase.

  • Q3: The acronym “SAR” makes its debut in Sec. 4.3 without any contextual introduction. The authors are suggested to revise this.

  • Q4: The narrative on connected components labeling (CCL) in Sec. 4.3 is quite succinct, leaving readers wanting more details on its practical implementation.

  • Q5: It remains unclear which pre-trained semantic segmentation model was harnessed to create the 2D semantic masks. The possible implications of such a model leading to inadvertent data leakage need to be addressed. This information is indispensable to ensure that the experimental comparisons are grounded in validity.

  • Q6: Several contemporary studies, including [R1], [R2], and [R3], harness SAM for segmenting objects from LiDAR point clouds. What distinct advantages does ACL offer when juxtaposed against these existing methodologies?

  • Q7: Camera-LiDAR calibrations might not always achieve perfection. In scenarios where calibration discrepancies exist, there's potential for the SAR-derived coarse instance masks to be tainted with errors. The authors are suggested to consider conducting auxiliary experiments to probe whether calibration inaccuracies compromise the efficacy of the proposed method.

  • Q8: This paper could be enriched with comparative analyses against some of the latest entrants in the domain of weakly and semi-supervised 3D object detectors.

  • Q9: A minor formatting suggestion: for improved readability, perhaps anchor tables such as Table 2, Table 4, Table 5, Table 7, and Table 10 to the top of their respective pages.

  • Q10: This paper could benefit from having a standalone paragraph discussing the limitations and potential negative social impact of this work.


Justification of the Rating

  • The paper is overall well written with good insights. I am leaning toward upgrading the rating if the authors can resolve the concerns raised above.

References

  • [R1] D. Zhang, et al. “SAM3D: Zero-Shot 3D Object Detection via Segment Anything Model.” arXiv preprint arXiv:2306.02245.

  • [R2] Y. Liu, et al. “Segment Any Point Cloud Sequences by Distilling Vision Foundation Models.” arXiv preprint arXiv:2306.09347.

  • [R3] R. Chen, et al. “Towards Label-free Scene Understanding by Vision Foundation Models.” arXiv preprint arXiv:2306.03899.

Ethics Concerns

No ethics concern observed.

Comment

References

  • [R1] Zhang, Dingyuan, et al. "SAM3D: Zero-Shot 3D Object Detection via Segment Anything Model." arXiv preprint arXiv:2306.02245 (2023).
  • [R2] Liu, Youquan, et al. "Segment Any Point Cloud Sequences by Distilling Vision Foundation Models." arXiv preprint arXiv:2306.09347 (2023).
  • [R3] Chen, Runnan, et al. "Towards label-free scene understanding by vision foundation models." Thirty-seventh Conference on Neural Information Processing Systems. 2023.
  • [R4] Meng, Qinghao, et al. "Weakly supervised 3d object detection from lidar point cloud." European Conference on computer vision. Cham: Springer International Publishing, 2020.
  • [R5] Meng, Qinghao, et al. "Towards a weakly supervised framework for 3d point cloud object detection and annotation." IEEE Transactions on Pattern Analysis and Machine Intelligence 44.8 (2021): 4454-4468.
  • [R6] Chen, Kai, et al. "Hybrid task cascade for instance segmentation." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019.
  • [R7] Xia, Qiming, et al. "CoIn: Contrastive Instance Feature Mining for Outdoor 3D Object Detection with Very Limited Annotations." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
Comment

We sincerely appreciate your questions and suggestions for our work. It requires considerable effort to meticulously read the paper and provide insightful and detailed questions and feedback. We believe these suggestions hold significant value for our work.

W1: A number of assertions lack rigorous justification or analytical backing.

Thanks for your insightful suggestion regarding the need for rigorous justification and analytical backing. In response, we will enhance the manuscript to ensure a more comprehensive and transparent presentation of the underlying rationale. The revision will be reflected in both the responses to Questions and the upcoming edited version of the manuscript.

W2: The paper omits certain pivotal implementation specifics.

Due to the page limit, we couldn't elaborate on all implementation details. However, we appreciate your suggestion, and we will carefully include detailed implementation of crucial steps. In addition, we commit to open-sourcing the code for reference in the future.

Q1: Elucidation about the term “data size” in pilot study.

We sincerely apologize for the confusion caused by the term “data size”. We will replace it with “data amount” to convey that it refers to the amount of cropped regions used in training the SparseConv Detector, rather than the size of objects or instances. By changing the amount of training data, we demonstrate that geometry estimation does not necessitate a large amount of data.

Moreover, we supplement experiments to enhance the claim that “massive data is necessary for semantic learning”. More details can be found in Response to Reviewer yY5M (Part 1) W2.

Q2: The mechanics of the “clicking annotation”.

We will address your questions regarding the mechanics of the “clicking annotation” as follows.

  • “Clicking annotation” for label-efficient 3D detection is first proposed by Meng et al. [R4], [R5]. They require annotators to click object centers in the BEV perspective, which are used to supervise the center prediction.
  • In the process of labeling, there is no difference made for different instance types.
  • Regarding the last question, "Was manual labeling extended to all point clouds and how were the crucial clicking points discerned?": we are not sure we understand the reviewer's question accurately. Our understanding is that the reviewer wants to know whether "click annotation" covers all the objects in the training dataset. If we understand correctly, the answer is no. Using their implementation as an example, they randomly choose 500 frames from KITTI, and the centers of all instances are labeled in each chosen frame. If our understanding is not accurate, we are open to further discussion at any time.

Below we attach a comparison with Meng et al. [R4], [R5] to showcase our superiority: evaluation results on the KITTI val set (Car) with 534 box-level labels along with weak annotations.

| Label-efficient method | Annotation | Easy | Moderate | Hard |
|---|---|---|---|---|
| WS3D [R4] | 534 boxes + weak labels | 84.04 | 75.10 | 73.29 |
| WS3D [R5] | 534 boxes + weak labels | 85.04 | 75.94 | 74.38 |
| MixSup | 534 boxes + weak labels | 86.37 | 76.20 | 72.36 |
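For context on how MixSup's cluster-level labels are formed, the paper describes coarse labels obtained from three BEV clicks around object corners, completed into a parallelogram. A minimal geometric sketch follows; the corner-ordering convention (the second click being the shared corner) is our assumption:

```python
import numpy as np

def parallelogram_from_clicks(p1, p2, p3):
    """Complete a BEV parallelogram from three clicked corners.
    Assumes p2 is the corner shared by the two clicked edges, so the
    missing corner is p4 = p1 + p3 - p2; returns corners in cyclic order."""
    p1, p2, p3 = (np.asarray(p, dtype=float) for p in (p1, p2, p3))
    p4 = p1 + p3 - p2
    return np.stack([p1, p2, p3, p4])

# Example: three clicks around an axis-aligned car footprint
print(parallelogram_from_clicks([0.0, 0.0], [4.5, 0.0], [4.5, 1.9]))
```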

Q3 & Q4: SAR and CCL

Thank you for raising questions about contents that were not clearly elucidated. We sincerely apologize for any inconvenience caused by this. Since SAM's over-segmentation and nuScenes' imprecise point-pixel projection result in mediocre 3D instance masks, we leverage the spatial separability of objects in point clouds to refine the segmentation with Separability-Aware Refinement (SAR), which is demonstrated in the response to Q7. In SAR, the Connected Components Labeling (CCL) is a conventional operation using Euclidean distance for connectivity as outlined in https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csgraph.connected_components.html. Our code will be open-sourced for reference.
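As a concrete illustration of the CCL operation mentioned above, here is a minimal sketch (our own example built around the linked scipy routine, not the authors' released code) that clusters points by Euclidean connectivity; the 0.5 m radius is an illustrative assumption:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def euclidean_ccl(points_xyz, radius=0.5):
    """Label connected components of a point set, where two points are
    connected if their Euclidean distance is below `radius`."""
    points_xyz = np.asarray(points_xyz)
    n = len(points_xyz)
    pairs = np.array(list(cKDTree(points_xyz).query_pairs(r=radius)))
    if len(pairs) == 0:
        return np.arange(n)  # every point is its own component
    rows = np.concatenate([pairs[:, 0], pairs[:, 1]])
    cols = np.concatenate([pairs[:, 1], pairs[:, 0]])
    adj = coo_matrix((np.ones(len(rows), dtype=np.int8), (rows, cols)), shape=(n, n))
    _, labels = connected_components(adj, directed=False)
    return labels
```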

Q5: pre-trained semantic segmentation model

The model we employed is HTC [R6] pre-trained on the nuImages dataset. In automated coarse labeling, we solely utilize its semantic segmentation head to obtain semantics for each pixel. There is no need for concern regarding data leakage, as the authors of nuScenes have explicitly stated that the overlap between nuImages and nuScenes is negligible: https://github.com/nutonomy/nuscenes-devkit/issues/550#issuecomment-769571712

Comment

Q6: Distinct advantages when juxtaposed against other methodologies utilizing SAM for segmenting objects from LiDAR point clouds.

To the best of our knowledge, it is the first initiative harnessing SAM for instance segmentation in outdoor scenes. Notably, ACL achieves performance on par with the recent fully supervised panoptic segmentation models without the need for any 3D annotations. Compared to these methods, our Automated Coarse Labeling (ACL) innovatively leverages the inherent spatial separability of point clouds to refine instance segmentation, enabling the mitigation of the projection error (shown in response to Q7).

In contrast, [R1] is confined to detecting the "car" category, while [R2] and [R3] focus on pretraining via pixel-point projection for semantic segmentation without addressing instance segmentation. It is crucial to emphasize that ACL is proposed to provide coarse labels for MixSup.

We are delighted to see exceptional works like [R2] and [R3] playing a similar role and potentially integrating with MixSup. Finally, we appreciate your reminder and will add a detailed comparison with them in the revised version.

Q7: Inaccurate calibration

We sincerely appreciate your suggestion for experiments regarding inaccurate calibration. Throughout our experimentation, we identified inherent inaccuracies in the calibration of nuScenes, leading to discrepancies in the pixel-point projection of foreground objects. Hence, we introduced the SAR module to mitigate the degradation in segmentation caused by projection errors.

To further investigate the impact of projection discrepancies on automated coarse labeling, we introduced random noise to each camera's position. The performance for foreground objects on nuScenes val split panoptic segmentation is summarized in the table below. The results indicate that SAR-derived coarse instance masks exhibit a certain degree of resilience to calibration inaccuracies, attributed to the refinement introduced by the SAR module.

| Noise of Camera Position (cm) | SAR | PQ | SQ | RQ |
|---|---|---|---|---|
| 0 | x | 44.9 | 74.6 | 59.8 |
| -5~5 | x | 43.8 | 74.6 | 58.3 |
| -10~10 | x | 41.4 | 74.5 | 55.2 |
| 0 | ✓ | 63.7 | 82.6 | 76.9 |
| -5~5 | ✓ | 62.8 | 82.3 | 76.1 |
| -10~10 | ✓ | 60.6 | 81.4 | 74.3 |
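A minimal sketch of how the camera-position perturbation in the table above could be simulated (the 4x4 camera-to-ego extrinsic convention and the per-axis uniform noise are assumptions on our side):

```python
import numpy as np

def jitter_camera_position(cam_to_ego, max_offset_cm=5.0, seed=0):
    """Perturb the translation of a 4x4 camera-to-ego extrinsic matrix by a
    uniform offset in [-max_offset_cm, max_offset_cm] per axis, before the
    point-to-pixel projection used in automated coarse labeling."""
    rng = np.random.default_rng(seed)
    noisy = np.asarray(cam_to_ego, dtype=np.float64).copy()
    noisy[:3, 3] += rng.uniform(-max_offset_cm, max_offset_cm, size=3) / 100.0  # cm -> m
    return noisy
```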

Q8: Analyze some of the latest entrants in the domain of weakly and semi-supervised 3D object detectors

  • We highly value your suggestion and have enriched the comparisons with relevant works. To incorporate more entrants, we compare with a recent work, CoIn [R7], published at ICCV 2023, which uses a 5% annotation cost on the KITTI validation split, evaluated with Car 3D AP R40. It's worth noting that the 3.3% annotation cost setting for MixSup in this experiment is borrowed from our other experiments. Despite having a lower annotation cost than CoIn, we achieve higher AP at both moderate and hard difficulties. We will include the comparison in the revised manuscript.
| Annotation cost | Method | Detector | Easy | Moderate | Hard |
|---|---|---|---|---|---|
| 5% | CoIn [R7] | CenterPoint | 81.64 | 67.48 | 58.32 |
| 3.3% | MixSup | CenterPoint | 80.02 | 69.12 | 65.62 |
  • It's also worthwhile to note that MixSup is orthogonal and complementary to semi-supervised learning, such as a simple self-training as highlighted in Sec 5.5 Table 6 in the paper. We will delve into the combination of MixSup and other semi-supervised training schemes in future work.

Q9: Formatting tables.

Thank you for your generous suggestion. We will modify tables in the subsequent manuscript.

Q10 Discussing the limitations and potential negative social impact of this work.

Regarding the limitations of MixSup: as a novel label-efficient paradigm, it is orthogonal to other label-efficient methods. Our work has not explored integration with semi-supervised methods, which provides an avenue for potential performance enhancements. Moreover, our proposed automated coarse labeling, when combined with other exceptional 3D segmentation models, could yield higher-quality coarse labels. We believe MixSup holds significant promise in reducing annotation costs for the community, contributing to the conservation of both human and environmental resources. However, because the cost savings in annotation come with a certain performance trade-off, practical deployment may raise risks of compromising driving safety.

We are more than willing to continue engaging in in-depth discussions. If you have any further questions, please feel free to comment at any time.

Comment

We have updated the response to Q2 and Q8. We are open to any further discussions at any time.

Comment

I would like to thank the authors for their efforts during the rebuttal.

I have read the authors' responses, as well as other reviewers' comments and the authors' responses to them. I am glad to see that many of the unclear statements have been addressed, some missing information has been supplemented, and additional experimental results have been provided during the rebuttal.

I have an overall favorable opinion and will upgrade the rating. The authors are suggested to incorporate revisions and new updates in their revised manuscript.

Official Review (Rating: 6)

The authors proposed MixSup for efficient LiDAR-based 3D object detection. It mainly consists of two contributions: (1) a cluster-level coarse labeling scheme that only labels the 3 bounding box corners in bird's eye view, which the authors claim takes only 14% of the labeling time of full 3D labeling; (2) a learning strategy that utilizes both a few full 3D labels and many coarse labels, by training the classification / proposal stage with only coarse labels and the regression stage with full 3D labels. The authors show that, on multiple datasets and with multiple different detection models, the proposed method can achieve the performance of the fully supervised counterpart that uses all full 3D labels.

Strengths

  • The paper is overall easy to follow.
  • The paper focuses on efficient learning for 3D detection, especially with an emphasis on autonomous driving applications. This active subarea holds significant promise to reduce the cost in autonomous vehicles.
  • The proposed method is overall simple but with good performance in various settings.
  • The authors also explored the automatic coarse labeling setting utilizing a SAM model.

Weaknesses

  1. Discussing and comparing with [Meng et al. 2020; 2021]. As pointed out in the related works in the paper, there are a bunch of existing works trying to reduce the labeling cost in the 3D detection task, and one of the closest works to this paper is probably [Meng et al. 2020; 2021], where they explored a similar setup: training models with many weakly labeled and few fully labeled 3D data. As a reader, I would expect to see further discussion and comparison between the proposed work and this work (e.g. the difference in labeling cost, the performance difference, and why the proposed method is a better approach). I do read that in Sec 4.5 and the main results, the authors did compare the proposed method with some of the label-efficient frameworks, but it looks like the comparison with [Meng et al. 2020; 2021] is missed out.

  2. The hypothesis is not fully backed up: in the last paragraph of Sec 3 (pilot study), the authors concluded: "This phenomenon suggests that LiDAR-based detectors indeed only need a very limited number of accurate labels for geometry estimation. Massive data is only necessary for semantic learning." If I understand correctly, the pilot experiments only provide evidence to some degree in the first half, but the latter half "Massive data is only necessary for semantic learning" seems not fully grounded. An evaluation of how well the model can perform with different amounts of semantic labels could better support the claim.

  3. The study on the labeling costs is not fully grounded. In Sec 5.3, the authors claim "The average time cost of a coarse cluster label is only 14% of an accurate box label." In Sec 4.2, the authors also claim "In addition, it is also non-trivial for annotators to make an accurate center click." However, evidence is missing for these claims. Where does the "14% cost" come from? How much harder / more inaccurate is it for the annotator to click the center? Do you perform a user study with a reasonable number of annotators? How similar are the coarse labels from the annotators to the simulated coarse labels used in the experiments? I would suggest the authors include more details to back the claim.

  4. Presentation:

    a) I would suggest the authors be more specific about "semantic" and "geometry" from the beginning of the paper. If I understand correctly, by "semantic" the authors meant to coarsely identify object locations and types, and by "geometry" they meant to accurately regress the discovered object's location, dimensions, and heading. It is a bit confusing when reading the 3rd paragraph of the introduction.

    b) The writing in Sec 4 is somewhat confusing and unclear. Firstly, Figure 3 is not referred to from anywhere in the main text, and the caption is not self-contained, so it provides little help for understanding the idea. Sec 4.2 is also confusing: what does "assignment" mean? If I understand correctly after carefully reading the latter parts, what the authors mean is how to properly supervise the proposal/classification stage of the detector. The section is not self-contained in this sense, i.e., a reader without good knowledge of how these detectors are designed will have an even harder time understanding the method.

[Meng et al. 2020] Qinghao Meng, Wenguan Wang, Tianfei Zhou, Jianbing Shen, Luc Van Gool, and Dengxin Dai. Weakly supervised 3d object detection from lidar point cloud. In ECCV, pp. 515–531. Springer, 2020.
[Meng et al. 2021] Qinghao Meng, Wenguan Wang, Tianfei Zhou, Jianbing Shen, Yunde Jia, and Luc Van Gool. Towards a weakly supervised framework for 3d point cloud object detection and annotation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(8):4454–4468, 2021.
[Liang et al. 2021] Liang, H., Jiang, C., Feng, D., Chen, X., Xu, H., Liang, X., ... & Van Gool, L. (2021). Exploring geometry-aware contrast and clustering harmonization for self-supervised 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 3293–3302).

Questions

Please see the weaknesses section. Additionally,

  1. In Table 4, it looks like PointContrast and ProposalContrast have negative performance gains compared with training on 10% of frames. But from the GCC-3D paper [Liang et al. 2021], PointContrast usually improves over the from-scratch baseline. The results reported here look inconsistent with previous literature. I am wondering how these baselines were trained?

Ethics Concerns

N/A

Comment

References

  • [R1] Meng, Qinghao, et al. "Weakly supervised 3d object detection from lidar point cloud." European Conference on computer vision. Cham: Springer International Publishing, 2020.
  • [R2] Meng, Qinghao, et al. "Towards a weakly supervised framework for 3d point cloud object detection and annotation." IEEE Transactions on Pattern Analysis and Machine Intelligence 44.8 (2021): 4454-4468.
  • [R3] Fan, Lue, et al. "Fully sparse 3d object detection." Advances in Neural Information Processing Systems 35 (2022): 351-363.
  • [R4] Xu, Runsen, et al. "MV-JAR: Masked Voxel Jigsaw and Reconstruction for LiDAR-Based Self-Supervised Pre-Training." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
  • [R5] Fan, Lue, et al. "Embracing single stride 3d object detector with sparse transformer." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.
  • [R6] Liang, Hanxue, et al. "Exploring geometry-aware contrast and clustering harmonization for self-supervised 3d object detection." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
  • [R7] Yin, Tianwei, et al. "Center-based 3d object detection and tracking." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021.
Comment

Thank the authors for carrying out the rebuttal!

I have read through the rebuttal and the comments from other reviewers. The responses resolved most of my concerns, so I have raised my rating to lean toward acceptance. The authors are suggested to include the discussion and analysis in the final version.

Comment

W3: The study on the labeling costs is not fully grounded.

We highly value your questions and try to resolve your reasonable concerns from the following aspects.

  • As described in Sec 5.3, we conducted a study by asking three experienced annotators to label 100 frames from different sequences of nuScenes. “The average time cost of a coarse cluster label is only 14% of an accurate box label” is based on the statistics gathered from the annotators.

  • In the original paper, we simulate the coarse labels by randomly expanding GT boxes by 0~10% in each dimension, without evaluating the similarity with human coarse annotations. Thank you so much for pointing this out! To be more rigorous, here we first evaluate the similarity between human-annotated coarse labels (100-frame subset) and ground-truth labels, which is 81.22% in terms of segmentation mIoU. Then we add noise to ground-truth labels to simulate more realistic coarse labels, controlling the coarse labels to have 80%–82% mIoU with the ground-truth labels, just like the human-annotated coarse labels. In particular, we use the following two noise settings:

    • Randomly shifting center locations by $-0.1\sim0.1$ m and expanding 0%~50% in each dimension (Noise ①). This noise yields 81.91% mIoU with the ground-truth labels.
    • Randomly shifting center locations by $-0.2\sim0.2$ m, expanding 0%~20% in each dimension, and rotating $-15^\circ\sim15^\circ$ in heading (Noise ②). This noise yields 80.60% mIoU with the ground-truth labels.

    We then conduct experiments on nuScenes using such noisy coarse labels, with results shown in the following table. Performance with Noise ① and ② is similar to the performance in the original paper. Thus, our simulated coarse labels are reliable for the experiments in our paper (a small simulation sketch is given after the tables in this response).

| Detector | Noise | mAP | NDS |
|---|---|---|---|
| CenterPoint (100%) | - | 62.41 | 68.20 |
| CenterPoint (MixSup) | original paper | 60.73 | 66.46 |
| CenterPoint (MixSup) | Noise ① | 60.23 | 65.99 |
| CenterPoint (MixSup) | Noise ② | 60.21 | 66.28 |
  • As for the center-click annotation, we unintentionally overlooked the human labeling study because it is not a core part of our method. We cannot launch and finish a labeling study during the limited rebuttal period without the labeling tools in WS3D. We are truly sorry for that. However, we demonstrate that center-click annotation has some limitations.
    • For the inevitable instances with very partial point clouds, accurately clicking the center is challenging. Conversely, MixSup's cluster-level label only requires clicking around the visible part of an object with three points, ensuring that the generated parallelogram encompasses the partial point clouds.
    • Some methods need more than centers, such as the recent state-of-the-art FSD [R3]. It adopts a segmentation network in the first stage to obtain foreground points, requiring point-level supervision. Here we attach the performance on Waymo validation split with L2 APH using FSD as the base detector to show the effectiveness of MixSup, where the percentage in parentheses indicates the performance ratio to the fully supervised counterpart.
| Detector | Mean | Vehicle | Pedestrian | Cyclist |
|---|---|---|---|---|
| FSD (100% frames) | 71.27 | 70.09 | 69.79 | 73.93 |
| FSD (MixSup) | 68.57 (96.21%) | 66.08 (94.28%) | 66.53 (95.33%) | 73.09 (98.86%) |
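As promised above, a small sketch of the coarse-label simulation (Noise ① / Noise ②). This is our own illustration; the `[x, y, z, l, w, h, heading]` box layout and applying the center shift to all three axes are assumptions:

```python
import numpy as np

def simulate_coarse_labels(gt_boxes, max_shift=0.1, max_expand=0.5, max_rot_deg=0.0, seed=0):
    """Turn GT boxes (N, 7) = [x, y, z, l, w, h, heading] into simulated coarse
    labels by randomly shifting centers, expanding dimensions, and optionally
    rotating headings."""
    rng = np.random.default_rng(seed)
    boxes = np.asarray(gt_boxes, dtype=np.float64).copy()
    n = len(boxes)
    boxes[:, 0:3] += rng.uniform(-max_shift, max_shift, size=(n, 3))           # center shift (m)
    boxes[:, 3:6] *= 1.0 + rng.uniform(0.0, max_expand, size=(n, 3))           # expand 0..max_expand
    boxes[:, 6] += np.deg2rad(rng.uniform(-max_rot_deg, max_rot_deg, size=n))  # heading (rad)
    return boxes

# Noise 1: simulate_coarse_labels(gt, max_shift=0.1, max_expand=0.5)
# Noise 2: simulate_coarse_labels(gt, max_shift=0.2, max_expand=0.2, max_rot_deg=15.0)
```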

W4: Presentation.

We sincerely appreciate your meticulous reading of our work. Your understanding of "semantic," "geometry," and "assignment" is highly accurate. We definitely believe good papers necessitate high-quality writing and are truly sorry for the unclear expressions. We highly value your suggestions and will make revisions in the upcoming manuscript.

Q1: PointContrast and the ProposalContrast have negative performance gains.

We apologize for any confusion caused. In our paper, we took a recent method, MV-JAR [R4] (CVPR 2023), as a comparison method. The results of PointContrast and ProposalContrast in our paper are those reported by MV-JAR. For a direct comparison, we compare MixSup with the results in the original ProposalContrast paper as follows. Using 1% boxes and coarse labels, MixSup is slightly better than these methods with 20% frame annotations. Due to the different settings, we cannot perfectly align the annotation cost with theirs, but the following results suggest that MixSup achieves promising results.

| Detector | Annotation cost | Mean L2 APH (Waymo val) |
|---|---|---|
| PVRCNN + GCC-3D [R6] | 20% frames | 58.18 |
| PVRCNN + ProposalContrast | 20% frames | 59.28 |
| PVRCNN + MixSup (ours) | 1% boxes + coarse labels | 59.74 |

We are more than willing to continue engaging in in-depth discussions. If you have any further questions, please feel free to comment at any time.

Comment

We sincerely appreciate your insightful questions and we believe they hold significant value for our work. We try to resolve your concerns below.

W1: Further discussion and comparison between [R1, R2] and this work.

We apologize for unintentionally overlooking the comparison as our focus was primarily on Waymo and nuScenes. Here we supplement a discussion and experimental comparison with WS3D [Meng et al. 2020/2021] as follows.

  • WS3D has a specifically designed detection pipeline and cannot be generalized to other detectors. In contrast, MixSup can be integrated with various detectors.
  • The coarse cluster-level labels proposed in our method can be obtained through foundational models such as SAM, as we demonstrated in Automated Coarse Labeling.
  • For the annotation cost, we are truly sorry that we cannot provide a rigorous quantitative comparison with center-click in WS3D because the cost largely depends on labeling tools and protocols, but we do not have the labeling tools in WS3D. However, we provide the analysis and discussion in the following W3.
  • At the same annotation cost, our method has better performance. In the following table, we report the performance of Car detection on the KITTI val split, since WS3D does not use the Waymo or nuScenes datasets.
| Label-efficient method | Annotation | Easy | Moderate | Hard |
|---|---|---|---|---|
| WS3D [R1] | 534 boxes + weak labels | 84.04 | 75.10 | 73.29 |
| WS3D [R2] | 534 boxes + weak labels | 85.04 | 75.94 | 74.38 |
| MixSup | 534 boxes + weak labels | 86.37 | 76.20 | 72.36 |

W2: "Massive data is only necessary for semantic learning" seems not fully grounded.

We greatly appreciate your insightful feedback. We supplement the experiments to enhance this claim.

Specifically, we decrease the data amount (#. training frames) from 100% to 10% in Waymo and train a popular detector CenterPoint [R7]. Such a decrease in data amount has a negative impact on both geometry learning and semantic learning. To mitigate the impact on geometry learning and only focus on semantic learning, we aggressively relax the IoU thresholds in evaluation from 0.7 / 0.5 / 0.5 to 0.5 / 0.25 / 0.25 for vehicle/pedestrian/cyclist, respectively.

Let's first look at the following table: there are huge performance gaps between using 100% of the data and using 10%. Obviously, these gaps are caused by the degradation of both semantic learning and geometry estimation.

| Data amount | IoU thresholds | Vehicle L2 AP / APH | Pedestrian L2 AP / APH | Cyclist L2 AP / APH |
|---|---|---|---|---|
| 100% | 0.7 / 0.5 / 0.5 (Normal) | 65.42 / 64.92 | 66.49 / 60.53 | 69.28 / 68.12 |
| 10% | 0.7 / 0.5 / 0.5 (Normal) | 55.92 / 55.39 | 56.97 / 49.07 | 51.52 / 50.46 |
| Performance Gaps | - | 9.50 / 9.53 | 9.52 / 11.46 | 17.76 / 17.66 |

Then let's look at the following table, where we relax the IoU thresholds for evaluation to mitigate the negative impact introduced by the degradation of geometry estimation. The overall performance becomes much higher after relaxing the IoU thresholds. However, there are still significant gaps between using 100% data and using 10% data. In particular, for Pedestrian and Cyclist, the gaps (between 10% and 100%) do not even get smaller after we relax the IoU thresholds. Thus, the performance gaps between 100% data and 10% data should be mainly caused by the degradation of semantic learning. In other words, semantic learning is sensitive to the data amount.

| Data amount | IoU thresholds | Vehicle L2 AP / APH | Pedestrian L2 AP / APH | Cyclist L2 AP / APH |
|---|---|---|---|---|
| 100% | 0.5 / 0.25 / 0.25 (relaxed) | 87.15 / 86.20 | 82.99 / 74.62 | 74.01 / 72.70 |
| 10% | 0.5 / 0.25 / 0.25 (relaxed) | 81.99 / 80.67 | 73.61 / 62.14 | 56.50 / 55.20 |
| Performance Gaps | - | 5.16 / 5.53 | 9.38 / 12.48 | 17.51 / 17.50 |

From another point of view, in the main paper, we have decreased the data amount of well-classified patches from 100% to 10% to reveal the impact on geometry estimation, as shown in the following table. Compared with the aforementioned performance change caused by semantic learning degradation, the performance change in the following table is negligible. Thus, we draw the conclusion that “LiDAR-based detectors indeed only need a very limited number of accurate labels for geometry estimation. Massive data is only necessary for semantic learning”.

The following table shows performance with varying data amounts on the well-classified dataset (copied from the main paper).

| Data amount | IoU thresholds for three classes | Vehicle L2 AP / APH | Pedestrian L2 AP / APH | Cyclist L2 AP / APH |
|---|---|---|---|---|
| 100% | 0.7 / 0.5 / 0.5 | 64.19 / 63.74 | 65.23 / 58.02 | 67.04 / 65.99 |
| 10% | 0.7 / 0.5 / 0.5 | 63.37 / 62.89 | 64.78 / 57.96 | 66.26 / 65.14 |
| Performance Gaps | - | 0.82 / 0.85 | 0.45 / 0.06 | 0.78 / 0.85 |

The discussion above will be included in the edited version.

Comment

Dear reviewers and ACs,

We sincerely appreciate your dedicated efforts in the review process! The constructive suggestions definitely make our manuscript better! We are pleased that the reviewers highlight our paper as having “significant promise to reduce the cost”, being “firmly rooted in a well-articulated motivation and logical progression”, “with good performance in various settings”, “intuitive and efficient”, “well written and pleasant to read”, and “illustrations are well-done”.

We have uploaded the revised manuscript and added most of the revisions and discussions in the main text and the appendix.

Considering the official reviewing guidance, which states that “unlike previous years, there will be no second stage of discussion between authors and reviewers” and the current discussion stage will last until Nov 22, we are looking forward to participating in further discussion with the reviewers to enhance the quality of our work.

Best Regards

Authors

AC Meta-Review

All three reviewers upgraded their ratings to positive after the rebuttal. The original reviews raised main concerns about 1) comparisons with existing work; 2) more discussion of the labeling cost; and 3) some issues of technical clarity. In the rebuttal, the authors addressed most issues, and the reviewers appreciate the contributions. Hence, an acceptance rating is recommended, and the authors should incorporate the reviewers' comments in the final version.

Why Not a Higher Rating

Although all the reviews are positive after the rebuttal, there are still many critical analyses and discussions that need to be included in the paper.

Why Not a Lower Rating

N/A

Final Decision

Accept (poster)