NeurIPS 2024 (Poster)
Overall: 5.3/10 (4 reviewers; ratings 5, 4, 6, 6; min 4, max 6, std 0.8)
Confidence: 4.0 · Correctness: 3.0 · Contribution: 2.8 · Presentation: 3.0

Just Add $100 More: Augmenting Pseudo-LiDAR Point Cloud for Resolving Class-imbalance Problem

Submitted: 2024-05-13 · Updated: 2024-11-06

TL;DR

A low-cost yet effective data augmentation framework for alleviating class imbalance in 3D object detection.

Keywords

Autonomous Driving · Class Imbalance · Data Augmentation

Reviews and Discussion

Review (Rating: 5)

This paper proposes a new method that augments pseudo-LiDAR point clouds to resolve the class-imbalance problem. It consists of generating 3D models of minority classes from videos or miniaturized models via 3D reconstruction (NeRFs). These models are then sampled according to the target LiDAR and added to a real LiDAR point cloud. LiDAR intensities are generated using a CycleGAN. Experiments have been conducted on several datasets: KITTI, nuScenes, and Lyft. They show that the proposed method is competitive with the state of the art and outperforms it on minority classes.

Strengths

The paper is well written and easy to follow. The related-works section is well written and provides a good overview of state-of-the-art methods. However, I suggest adding one general paragraph on data augmentation and its main techniques. The proposed model combines several high-level techniques such as 3D reconstruction and CycleGAN. The overall architecture is well explained and the intuition behind the method is clear. Due to the limited space, some details are given in the supplementary material.

The experimental part is two-fold: 1) the comparison with state-of-the-art methods, and 2) the ablation study. The proposed model improves on GT-Aug performance by about 1% on both mAP and NDS. The improvement holds across several models, which demonstrates the genericity of the proposed augmentation pipeline. Evaluating more SOTA models would strengthen this point further.

Weaknesses

Regarding the comparison with state-of-the-art methods, the proposed method is evaluated on several datasets and compared with two data augmentation methods: GT-Aug (vanilla synthesis-based LiDAR data augmentation) and Real-Aug. GT-Aug is more a baseline than a competitor, and from my point of view, the only true competitor in this experiment is Real-Aug.

Questions

An ablation study is proposed with several experiments. It is interesting to see that using 3DGS instead of NeRF improves the data augmentation process. It would also be interesting to see the impact of each step of the pipeline on the final performance by replacing each step with a simpler one. What happens if the luminance generation part is removed or replaced by a simpler method? What happens if the 3D reconstruction step is performed with fewer images? What happens if the object-level data alignment is removed and replaced by a general alignment?

Limitations

/

Author Response

Thank you for your time and comments! Please see our response below.


[S1] Additional Paragraph for Related Work This is a great suggestion. We will add more paragraphs to the Related Work Section, summarizing the literature on data augmentation techniques.


[W1] Additional Comparison against Other Approaches This is a good question. We chose Real-Aug as our main competitor because it is the latest work and shows promising scores on the public nuScenes leaderboard. However, we fully agree with the reviewer that additional comparison with other augmentation techniques is needed. Thus, as shown in Table VII below, we conduct experiments to analyze the data augmentation effect of other recent approaches, including LiDAR-Aug [G], CA-Aug [F], and 3D-VField [H]. Our experiment confirms that all approaches help improve detection accuracy over the baseline. Notably, Real-Aug and (our proposed) PGT-Aug generally outperform the other approaches. We further provide qualitative analysis in Figure III of our rebuttal PDF file.

Table VII. Performance comparison with other data augmentation approaches (AP_Car at IoU 0.7, 40 recall positions).

| Method | Easy | Mod. | Hard |
| --- | --- | --- | --- |
| Baseline | 88.08 | 74.85 | 70.55 |
| GT-Aug | 87.80 | 78.36 | 75.41 |
| LiDAR-Aug [G] | 87.75 | 78.24 | 75.35 |
| 3D-VField [H] | 87.05 | 77.13 | 75.55 |
| CA-Aug [F] | 88.82 | 78.66 | 75.75 |
| Real-Aug (Our impl.) | 88.13 | 78.97 | 76.06 |
| PGT-Aug | 90.00 | 79.45 | 76.35 |

Experiment Details We compare PGT-Aug with other 3D data augmentation methods on the public KITTI [8] Car benchmark. We create pseudo LiDAR point clouds for our method by randomly sampling ten cars from the CO3D dataset. The instances used are as follows: 106_12650_23736, 106_12662_23043, 157_17286_33548, 185_19982_37678, 194_20899_41094, 216_22796_47484, 216_22827_48422, 421_58405_112551, 206_21799_45886, and 421_58407_112553.

[F] Context-Aware Data Augmentation for LiDAR 3D Object Detection (ICIP 2023)
[G] LiDAR-Aug: A General Rendering-based Augmentation Framework for 3D Object Detection (CVPR 2021)
[H] 3D-VField: Adversarial Augmentation of Point Clouds for Domain Generalization in 3D Object Detection (CVPR 2022)


[Q1] More Ablation Studies We conducted the following additional experiments: 1) without luminance generation (constant intensity), 2) instance generation with varying numbers of input images, and 3) without object-level data alignment (random sampling, without the rigid motion model). As the second row of Table VIII shows, removing luminance (intensity) generation from the pseudo LiDAR leads to a performance drop in downstream detection tasks. Also, as the percentage of multi-view images used during 3D reconstruction decreases, the overall detection performance decreases accordingly. Finally, we conducted experiments on general alignment by replacing object-level alignment with random sampling of points from dense RGB point clouds and by removing the motion model. Both variants show suboptimal performance compared to PGT-Aug.

Table VIII. Ablations on intensity generation, the number of input images, and data alignment.

| Method | Car | Ped | Barrier | T.C. | Bus | C.V. | Trailer | Truck | Motor | Bicycle | mAP | NDS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ours | 85.4 | 85.4 | 68.0 | 71.1 | 72.1 | 24.2 | 40.4 | 59.8 | 70.3 | 58.3 | 63.52 | 69.11 |
| Constant intensity | 85.4 | 85.0 | 68.1 | 71.5 | 72.6 | 23.6 | 38.9 | 60.0 | 68.3 | 54.2 | 62.75 | 68.75 |
| 25% of images | 85.3 | 85.0 | 67.3 | 71.1 | 73.4 | 23.1 | 40.1 | 59.2 | 70.2 | 56.4 | 63.12 | 68.78 |
| 50% of images | 85.5 | 84.9 | 68.5 | 71.4 | 72.0 | 23.1 | 41.4 | 59.8 | 69.5 | 56.4 | 63.26 | 69.03 |
| No motion model | 85.4 | 85.2 | 67.8 | 71.5 | 71.9 | 23.7 | 41.0 | 60.1 | 69.0 | 56.6 | 63.22 | 68.80 |
| Random sampling | 85.4 | 85.1 | 67.2 | 71.4 | 73.2 | 22.8 | 39.2 | 59.8 | 68.1 | 55.4 | 62.76 | 68.56 |
Review (Rating: 4)

The paper presents a novel pipeline for LiDAR-based object detectors that addresses the class-imbalance problem by generating pseudo-LiDAR samples (from multi-view images of miniatures and public videos of an object) and augmenting them during training to balance the performance gap across classes. The augmentation technique proposed in this paper demonstrates its superiority and generality on the nuScenes, KITTI, and Lyft datasets.

Strengths

  1. The augmentation technique proposed in this paper demonstrates its superiority and generality through performance improvements in extensive experiments conducted on popular benchmarks, i.e., nuScenes, KITTI, and Lyft, especially for datasets with large domain gaps captured by different LiDAR configurations.
  2. The paper is well-written and easy to follow, especially the part explaining the background.
  3. It presents good experimental results and intuitive visualizations, convincingly demonstrating its effectiveness.

Weaknesses

  1. There is a lack of comparative experiments with other methods that target class-imbalance problems.
  2. The motivation of this paper is not clear. There is a need to discuss the necessity of using data augmentation methods to solve class-imbalance problems. Why can't we use loss-based or strategy-based methods to handle class-imbalance issues?
  3. The details of the framework are not clear. For instance, there is an Intensity Domain Alignment module in the framework, but what is it in detail? E.g., what is its structure, and how does it work?
  4. Is PGT-Aug unfriendly to the majority classes? As shown in Table 2, PGT-Aug's performance is close to that of Real-Aug for the majority classes.

Questions

  1. There are many works dealing with long-tail problems or class imbalance that are not based on data augmentation. Can the authors discuss their application to this problem?
  2. Compared to Real-Aug, what is the time cost of the proposed method (PGT-Aug), given that it achieves less than a 1-point increase in both mAP and NDS?

Limitations

The authors discussed the existence of domain discrepancies in both datasets and categories.

Author Response

Thank you for your time and comments! Please see our response below.


[W1, Q1] Comparison with other methods that target class-imbalance problems As the reviewer pointed out, there are comparative methods that deal with class imbalance using loss-based approaches [D, E] without adding data. We attach the comparison results with [D, E] in Table V. To match their baselines, we conducted experiments on the PointPillars model. Additionally, we re-implemented Class-Balanced Loss [D] with beta = 0.999 and resampled the number of objects. While we find that loss-balancing methods such as Dynamic Weight Average [E] and CB Loss [D] are effective in enhancing minority-class performance, PGT-Aug, a data-augmentation-based method, brings the largest performance gain among these approaches. Also, [E] experimentally shows that its GT-sampling-based data augmentation was more effective than its loss-based method in improving detection performance.

Table V. Comparison with other methods that address class-imbalance problems.

| Method | Car | Ped | Barrier | T.C. | Truck | Bus | C.V. | Trailer | Motor | Bicycle | mAP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CBLoss [D] | 82.7 | 74.7 | 54.0 | 52.1 | 51.2 | 61.7 | 17.9 | 30.5 | 48.7 | 20.5 | 49.4 |
| DWA [E] | 81.0 | 72.3 | 50.2 | 50.1 | 49.0 | 63.4 | 10.7 | 34.3 | 32.9 | 6.9 | 44.6 |
| PGT-Aug | 83.0 | 71.8 | 54.8 | 51.1 | 54.9 | 69.7 | 20.2 | 39.5 | 49.6 | 14.5 | 50.9 |
| PGT-Aug + CBLoss | 82.7 | 74.7 | 56.5 | 55.9 | 54.4 | 68.7 | 20.9 | 34.1 | 53.5 | 20.5 | 52.2 |

[D] Class-Balanced Loss Based on Effective Number of Samples (CVPR 2019)
[E] Resolving Class Imbalance for LiDAR-based Object Detector by Dynamic Weight Average and Contextual Ground Truth Sampling (WACV 2023)


[W2, Q1] The motivation of this paper In 3D object detection, many studies have aimed to address the class-imbalance problem by modifying model structures (CBGS) or adjusting loss functions (DWA). The most widely used method is over-sampling, known as GT-Aug, which inserts objects from other frames into the current frame. However, this method is limited to inserting objects from a predefined pool (the training set), which restricts learning intra-class diversity, and generating or collecting 3D data for various objects has been extremely challenging and expensive. Recent advancements in differentiable 3D reconstruction techniques, such as NeRF and Gaussian Splatting, have made it possible to reconstruct dense 3D points at a lower cost. By leveraging these 3D reconstruction techniques, we propose a novel and practical pipeline that overcomes the limitations of traditional insertion methods, particularly in terms of cost and diversity.
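For reference, the following is a minimal, hypothetical sketch of the GT-Aug-style insertion described above (all function and variable names are illustrative, not the paper's released code): objects cropped from other frames, or generated offline as in PGT-Aug, are pasted into the current scene at collision-free poses.

```python
import numpy as np

def insert_objects(scene_points, object_bank, scene_boxes, num_insert=5, rng=None):
    """Paste pre-stored objects into a LiDAR scene (GT-Aug-style over-sampling).

    scene_points: (N, 4) array of x, y, z, intensity.
    object_bank:  list of dicts with 'points' (M, 4) and 'box' (7,) =
                  (x, y, z, l, w, h, yaw), cropped or generated offline.
    scene_boxes:  (K, 7) boxes already annotated in the scene.
    """
    rng = rng or np.random.default_rng()
    boxes = [np.asarray(b) for b in scene_boxes]
    for i in rng.permutation(len(object_bank))[:num_insert]:
        obj = object_bank[i]
        if any(_bev_overlap(obj["box"], b) for b in boxes):
            continue  # reject candidate poses that collide with existing objects
        scene_points = np.vstack([scene_points, obj["points"]])
        boxes.append(np.asarray(obj["box"]))
    return scene_points, np.stack(boxes)

def _bev_overlap(a, b):
    # coarse axis-aligned bird's-eye-view test; real pipelines use rotated IoU
    return abs(a[0] - b[0]) < (a[3] + b[3]) / 2 and abs(a[1] - b[1]) < (a[4] + b[4]) / 2
```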


[W3] The details of the framework Due to page limits, we elaborated on the details of our Intensity Domain Alignment network at line 559 of the supplementary material. In summary, we adopt the CycleGAN framework to learn an unpaired translation between RGB and intensity. We designed the generators and discriminators with PointNeXt encoders. The network receives nuScenes long-tail class samples as real data, and we create fake data by translating, rotating, resizing, and projecting long-tail class RGB samples to the same (x, y, z, l, w, h, θ) as the real data. Our generator is trained to generate fake intensity values from RGB features, while the discriminator is trained to discriminate between real and fake intensity. A simplified sketch of this objective is shown below.
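To make this concrete, here is a simplified, hypothetical sketch of the adversarial objective (a toy per-point MLP stands in for the PointNeXt encoders, and the cycle-consistency term of the full CycleGAN is omitted; all names are illustrative assumptions):

```python
import torch
import torch.nn as nn

class PointMLP(nn.Module):
    """Stand-in per-point encoder; the actual model uses PointNeXt encoders."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(c_in, 64), nn.ReLU(), nn.Linear(64, c_out))
    def forward(self, x):                  # x: (B, N, c_in) per-point features
        return self.net(x)

G = PointMLP(3, 1)   # generator: per-point RGB -> fake intensity
D = PointMLP(1, 1)   # discriminator: intensity -> real/fake logit
bce = nn.BCEWithLogitsLoss()

def gan_step(rgb, real_intensity):
    """rgb: (B, N, 3) colors of a pose-aligned pseudo-LiDAR object;
    real_intensity: (B, N, 1) intensities of a real object of the same class."""
    fake = G(rgb)                                        # (B, N, 1) fake intensity
    d_real, d_fake = D(real_intensity), D(fake.detach())
    d_loss = bce(d_real, torch.ones_like(d_real)) + \
             bce(d_fake, torch.zeros_like(d_fake))       # discriminator objective
    g_loss = bce(D(fake), torch.ones_like(d_fake))       # generator tries to fool D
    return d_loss, g_loss
```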


[W4] The effect of PGT-Aug on the majority classes Our primary objective is to generate and insert minority classes, rather than majority classes, to relieve the class-imbalance issue. Thus, we anticipated that minority-class performance would improve while majority-class performance would either remain stable or improve slightly due to the synergy of addressing the imbalance. To verify the effectiveness of the pipeline on majority classes, we conducted experiments on the KITTI dataset, as shown in Table VI. We reconstructed 3D RGB point clouds of 10 cars from the CO3D dataset [I] (see Figure III in Rebuttal-Supp.) and created about 16,000 pseudo-LiDAR instances of the Car class. We apply pseudo LiDARs along with GT LiDARs during augmentation, and PGT-Aug largely outperforms GT-Aug, Real-Aug, and other augmentation methods on the car detection benchmark. Due to the writing limit, please refer to Experiment Details (jNRg).

Table VI. The effect of PGT-Aug on the majority class (AP_Car at IoU 0.7, 40 recall positions).

| Method | Easy | Mod. | Hard |
| --- | --- | --- | --- |
| Baseline | 88.08 | 74.85 | 70.55 |
| GT-Aug | 87.80 | 78.36 | 75.41 |
| LiDAR-Aug [G] | 87.75 | 78.24 | 75.35 |
| 3D-VField [H] | 87.05 | 77.13 | 75.55 |
| CA-Aug [F] | 88.82 | 78.66 | 75.75 |
| Real-Aug (Our impl.) | 88.13 | 78.97 | 76.06 |
| PGT-Aug | 90.00 | 79.45 | 76.35 |

[F] Context-Aware Data Augmentation for LiDAR 3D Object Detection (ICIP 2023)
[G] LiDAR-Aug: A General Rendering-based Augmentation Framework for 3D Object Detection (CVPR 2021)
[H] 3D-VField: Adversarial Augmentation of Point Clouds for Domain Generalization in 3D Object Detection (CVPR 2022)
[I] Common Objects in 3D: Large-Scale Learning and Evaluation of Real-life 3D Category Reconstruction (ICCV 2021)


[Q2] Memory usage and inference time If Object-level Domain Alignment were performed during detector training, it would take additional time compared to Real-Aug. However, in 3D object detection methods, the objects to be inserted are stored before training, then loaded and inserted during training. Loading these objects from disk to memory in order to insert them into the scene takes time, and the memory and time complexity per batch is O(n). Therefore, the time cost is the same as Real-Aug. Even though real-time performance was not a major consideration, we will add this discussion to the paper, following this valuable comment.

Review (Rating: 6)

This paper introduces Pseudo Ground Truth Augmentation (PGT-Aug), a novel data augmentation technique for LiDAR-based 3D object detection. The goal of PGT-Aug is to address class imbalance in training datasets by generating diverse point clouds for minority-class objects.

  • PGT-Aug: a novel, cost-effective pipeline for LiDAR-based object detectors that addresses the class-imbalance problem by
    • (i) generating pseudo-LiDAR samples, and
    • (ii) augmenting them during training to balance the performance gap across classes.
  • Reducing the domain gap: spatial distribution matching and data-driven intensity adjustments align the generated samples with real LiDAR data.
  • A novel map-aware augmentation technique: placing an object into appropriate locations in the given scene.

Strengths

  1. The writing is good. A cool manuscript name.
  2. Ample experiments.

Weaknesses

  1. Code is provided, but it lacks documentation and is difficult to use for understanding and visualizing the results.
  2. Possible lack of novelty (not sure); see details in question 4.

Questions

  1. About data collection: for public videos from YouTube, how do you deal with dynamic objects?
  2. In lines 138-140, I know Plenoxels and 3DGS are view-dependent representations, but why are they not fully visible or uniformly high-density?
  3. What degree of spherical harmonic coefficients do the authors use? Why not just use zero-degree SH to solve question 2?
  4. There is now much work on autonomous driving simulators, object-level NeRF/3DGS reconstruction, and Shape-GPT/Mesh-GPT that can also clone and modify small objects. Can you discuss the differences and advantages of your work relative to theirs?
  5. Does the nuScenes test set contain information about the added objects?

I'm willing to change the grade if my concerns are addressed.

Limitations

The authors have stated the limitations clearly in the article.

Author Response

Thank you for your time and comments! Please see our response below.


[W1] Code Documentation Our apologies. We will revise the current documentation and make it easy to follow so that researchers can use and understand our code easily. Plus, we will add a Jupyter Notebook-based tutorial covering (i) how to reproduce our model, (ii) how to use our code, and (iii) how to visualize the figures shown in the paper.


[Q1] Data Collection from Videos with Dynamic Objects This is a good question. Our method relies on SfM to optimize camera poses for 3D reconstruction. However, SfM does not triangulate well with images containing dynamic objects. This is why we reconstruct the 3D shape of stationary objects and then use a rigid-body motion model to represent dynamic objects. Still, this is an interesting direction worth exploring in future work.


[Q2, Q3] View-Consistent Representation This is a great question. We agree that a zero-degree spherical harmonic would have a similar effect to ours. However, we do not use it because (1) we empirically observed that our approach is more robust at capturing the original object's color and areas of dark shade (see Figure I in the rebuttal supplementary); (2) we want our model to be more generally applicable to various generative models, which may or may not use spherical harmonics; and (3) in terms of FID score, we observe our approach is generally better. As shown in Table IV, we compare ours with a variant model (with Plenoxel's SH coefficients set to 0) to assess the quality of the generated pseudo-LiDAR point clouds in terms of FID. Our experiment further confirms that our approach generates point clouds more similar to actual LiDAR points. We will add a discussion on this.

Table IV. FID score comparison between SH coefficient 0 and ours (lower is better).

| Method | Truck | Bus | C.V. | Trailer | Motor | Bicycle | Average FID |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SH coefficient 0 | 14.9 | 13.2 | 8.0 | 7.2 | 2.3 | 3.0 | 8.1 |
| Ours | 13.2 | 13.2 | 7.6 | 7.3 | 2.1 | 2.1 | 7.7 |
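For clarity, the FID values in Table IV follow the standard Fréchet distance between Gaussian fits of two feature distributions; a minimal sketch is below (the point-cloud feature extractor producing `feat_real` / `feat_fake` is an assumption, as it is not specified here):

```python
import numpy as np
from scipy import linalg

def fid(feat_real, feat_fake):
    """feat_*: (N, D) features extracted from real / pseudo-LiDAR objects."""
    mu_r, mu_f = feat_real.mean(axis=0), feat_fake.mean(axis=0)
    cov_r = np.cov(feat_real, rowvar=False)
    cov_f = np.cov(feat_fake, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f)    # matrix square root of the product
    if np.iscomplexobj(covmean):
        covmean = covmean.real               # drop numerical imaginary noise
    return float(np.sum((mu_r - mu_f) ** 2) + np.trace(cov_r + cov_f - 2.0 * covmean))
```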

[W2, Q4] Differences from previous simulator works or generative models By leveraging current 3D reconstruction methods, our main goal is to create a novel and practical pipeline that overcomes the limitations of existing 3D object detection, such as class-imbalance problems and lack of annotations, in terms of cost and diversity. In other words, our pipeline is not limited to a specific generative model and can be applied to any model. The reasons we used explicit 3D representation models (3DGS or Plenoxel) in the paper instead of the models mentioned by the reviewer are as follows. According to [B] (see the section Limitations and Failure Cases in its supplementary material), text-to-3D generation models (Shape-GPT, Mesh-GPT, etc.) tend to collapse modes when the target image distribution is overly concentrated in a single peak, resulting in abnormal 3D objects that are strongly tied to specific views (the Janus problem). Also, as shown in Figure II of Rebuttal-Supp., a recent multi-view 3D reconstruction pipeline, DUSt3R [C], often fails to recover the details of miniature-scale objects. However, in order to place the restored objects at various positions, it is necessary to restore the entire shape of the object and create a bounding box. Therefore, we chose models that can reliably reconstruct the entire shape and perform robustly on relatively small objects.

[B] Taming Mode Collapse in Score Distillation for Text-to-3D Generation (CVPR 2024)
[C] DUSt3R: Geometric 3D Vision Made Easy (CVPR 2024)


[Q5] NuScenes Setting All detectors were tested under the same conditions, meaning no further information about our generated objects was used during the evaluation process, i.e., our model only uses the generated objects during training.

Review (Rating: 6)

This paper introduces Pseudo Ground Truth augmentation (PGT-Aug) to address class imbalance in LiDAR-based 3D object detection. PGT-Aug generates diverse pseudo LiDAR point clouds from low-cost miniatures or real-world videos and involves three steps: 3D instance reconstruction, object-level domain alignment, and context-aware placement. Extensive experiments on nuScenes, KITTI, and Lyft benchmarks demonstrate its effectiveness, especially for datasets with large domain gaps.

Strengths

  • Originality & Practical Impact: Introduces PGT-Aug, a novel, cost-effective pipeline for addressing class imbalance in LiDAR-based object detectors using pseudo-LiDAR samples from videos and miniatures.
  • Quality & Clarity: Offers a thorough methodology with clear explanations and robust evaluations, enhancing the accuracy and robustness of object detectors.

Weaknesses

  • As seen in line 532, this paper does not collect as much data as large datasets like KITTI do. I question the benefits and improvements brought by this work in terms of data collection.
  • The proposed method requires the use of many pre-trained models, such as the unpaired domain transfer model (L179) and a rigid-body motion model (L217). First, the computational efficiency and real-time applicability of these added models need to be addressed. Given the added complexity of these models, understanding the computational trade-offs and optimizations required for practical deployment (e.g., memory usage and inference time) is essential, but these aspects are not sufficiently covered in the paper. Second, the impact of the performance of these pre-trained models on the proposed method needs to be discussed and evaluated.
  • The authors use YouTube videos and "crawled data using the following keywords on Google" (L533). They should obtain permission from the data/content owners. Simply citing the sources in the paper is clearly not sufficient.

Minor issue:

  • "and r is 0.1" should be better move to L.190, since there are no r in Eq. (3)

Overall, I really like the interesting idea in this paper. If the authors can address my issues in the rebuttal, I am willing to raise my score.

Questions

How was the map information (L200) obtained?

Limitations

No potential negative societal impact of their work.

Author Response

Thank you for your time and comments! Please see our response below.


[W1] Data Size Our dataset is created to provide pseudo-LiDAR point clouds of minority-class objects, which can be augmented into typical driving datasets (e.g., nuScenes, KITTI, and Lyft) to compensate for the class-imbalance problem. Thus, the volume of our dataset is naturally smaller than that of these datasets but (we believe) sufficient to compensate for the class imbalance, as reported in our experiments. Moreover, we would emphasize that (i) we provide an automated pipeline to generate such pseudo-LiDAR point clouds, enabling the community to use it to produce continuously growing datasets with various objects, and (ii) we can generate view-dependent pseudo-LiDAR point clouds that can flexibly be placed anywhere in the scene, significantly improving data efficiency. We will clarify this in the final version of this paper.


[W2-1] Efficiency and Real-time Applicability of Pre-trained Models This is a good question. We would emphasize that (1) pseudo-LiDAR point clouds are generated and stored offline, and (2) these generated point clouds are loaded (from memory) and augmented during the training of 3D object detectors. Thus, the need to run pre-trained models (e.g., the unpaired domain transfer model) in real time is less significant. Further, in Tables I and II below, we analyze the processing time of each model and the memory usage for generating pseudo-LiDAR point clouds of different classes, i.e., construction vehicles, trucks, trailers, etc. This confirms that point clouds can be generated efficiently through our pipeline, taking less than 300 ms in total. We will discuss this in detail in the final version of this paper.

Table I. Average processing time per instance (ms).

| Stage | C.V. | Truck | Trailer | Motor | Bicycle | Bus |
| --- | --- | --- | --- | --- | --- | --- |
| Intensity Estimation | 150 | 140 | 250 | 40 | 34 | 178 |
| View-Dependent Point Sampling | 67 | 44 | 40 | 30 | 6 | 78 |
| Rigid Body Motion | 8.80 | 8.65 | 8.60 | 8.53 | 8.55 | 9.09 |

Table II. Average memory usage (MB).

| Class | C.V. | Truck | Trailer | Motor | Bicycle | Bus |
| --- | --- | --- | --- | --- | --- | --- |
| Memory usage | 4.006 | 4.052 | 4.098 | 4.013 | 4.012 | 4.074 |

[W2-2] Ablation Studies with Pre-trained Models To demonstrate the impact of the pre-trained models (the unpaired domain transfer model and the rigid-body motion model), we further conducted the following ablation studies: (1) we compare ours with a variant without luminance generation (i.e., using constant intensity) to measure the impact of the unpaired domain transfer model, and (2) we compare ours with a variant without the rigid-body motion model. As Table III below shows, both experiments confirm the impact of these pre-trained models: performance degrades when either is removed. We will discuss this more thoroughly.

Table III. Ablations on luminance generation and the motion model.

| Method | Car | Ped | Barrier | T.C. | Bus | C.V. | Trailer | Truck | Motor | Bicycle | mAP | NDS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ours | 85.4 | 85.4 | 68.0 | 71.1 | 72.1 | 24.2 | 40.4 | 59.8 | 70.3 | 58.3 | 63.52 | 69.11 |
| (1) Constant intensity | 85.4 | 85.0 | 68.1 | 71.5 | 72.6 | 23.6 | 38.9 | 60.0 | 68.3 | 54.2 | 62.75 | 68.75 |
| (2) Without rigid-body motion model | 85.4 | 85.2 | 67.8 | 71.5 | 71.9 | 23.7 | 41.0 | 60.1 | 69.0 | 56.6 | 63.22 | 68.80 |

[W3] Data/content permission issue This is a good comment. We take this copyright issue seriously. First of all, we will not download and re-release those video sources; instead, we will release a list of YouTube links. Further, as the reviewer suggests, we have contacted the copyright holders to obtain permission to share their video links publicly. We will make sure to resolve this copyright issue upon publication.


[W4] Minor Comment We agree with the reviewer. We will move "and r is 0.1" to L190.


[Q1] Map information The nuScenes dataset provides a BEV map annotating commonly observed map features such as road segments, lanes, crosswalks, walkways, stop lines, and parking lots. For all scenes, we generate an ego-vehicle-centered map in a top-down coordinate system; over 34k maps are generated in total. Note that a range of 102.4 m x 102.4 m is used for each ego-centered map (the effective forward sensing range of the ego-vehicle is 51.2 m). The KITTI and Lyft datasets do not provide such map information (as mentioned in L291), so we utilize a LiDAR-based ground segmentation method called Patchwork++ [A] to estimate the ground. We will further clarify this in the final version of this paper.

[A] Patchwork++: Fast and robust ground segmentation method for 3D LiDAR scans (IROS 2022)
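As an illustration of the ego-centered map generation described above, here is a minimal, hypothetical sketch of cropping a 102.4 m x 102.4 m ego-centered window from a global BEV raster (nearest-neighbor lookup only; coordinate conventions and all names are assumptions, and the actual pipeline would rely on the nuScenes map API):

```python
import numpy as np

def ego_centered_crop(global_mask, ego_xy, yaw, meters=102.4, resolution=0.2):
    """global_mask: (H, W) drivable-area raster in a global frame at
    `resolution` m/px; returns a (512, 512) crop centered on the ego pose."""
    size = int(meters / resolution)                   # 512 px
    half = size // 2
    ii, jj = np.meshgrid(np.arange(size), np.arange(size), indexing="ij")
    x_e = (jj - half) * resolution                    # ego-frame meters
    y_e = (ii - half) * resolution
    c, s = np.cos(yaw), np.sin(yaw)
    gx = ego_xy[0] + c * x_e - s * y_e                # rotate into global frame
    gy = ego_xy[1] + s * x_e + c * y_e
    u = (gx / resolution).astype(int)                 # global pixel indices
    v = (gy / resolution).astype(int)
    valid = (u >= 0) & (u < global_mask.shape[1]) & (v >= 0) & (v < global_mask.shape[0])
    out = np.zeros((size, size), dtype=global_mask.dtype)
    out[valid] = global_mask[v[valid], u[valid]]
    return out
```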

Comment

I appreciate the rebuttal and additional experiments. I have read the comments from other reviewers as well as the corresponding rebuttal. The rebuttal has largely addressed my concerns. In particular, I am very satisfied with the experiments and discussion related to [W1] and [W2-1].

Good work! Therefore, I have raised my rating to weak accept. I really look forward to seeing the revised version in the near future.

Comment

We are pleased to hear that our response has addressed your concerns. Thank you for your valuable comments, which will help improve the quality of the paper. If you have any further questions, please do not hesitate to let us know.
Thank you very much.

Author Response

We sincerely thank all the reviewers for their time and their thoughtful comments and questions. We are encouraged that the reviewers find the paper "very well-written and easy to follow" (R-iAKq, R-eXGm, R-kswe, R-jNRg); that our method is described as a "novel, cost-effective pipeline" (R-iAKq) with "superiority and generality" (R-kswe), "well explained and a clear intuition behind the method" (R-jNRg), and a "thorough methodology" (R-iAKq); and that our experiments are commended as "ample" (R-eXGm), "extensive" (R-kswe), "good" (R-kswe), backed by "robust evaluations" (R-iAKq) and "intuitive visualizations" (R-kswe), demonstrating "genericity" (R-jNRg) and "effectiveness" (R-kswe).

We attempted our best to address the questions as time allowed. We believe the comments and revisions have strengthened the paper, and we thank all the reviewers for their help. Please find individual responses to your questions below.

Summary of major changes:

  1. We add new experiments on KITTI [8] and nuScenes [9] to demonstrate our method's superiority over other augmentation methods and its effectiveness in solving class-imbalance problems, respectively.
  2. We provide additional ablation studies to show the impact of each individual model on the proposed pipeline.
  3. We add detailed visualizations and explanations to highlight our contributions and facilitate a clear understanding of the proposed method.
Final Decision

This paper received mixed ratings of two Weak Accepts, one Borderline Reject, and one Borderline Accept. Three reviewers shared positive opinions and see the merits of this work. Reviewer kswe (who suggested Borderline Reject) had concerns about the comparative experiments, motivation, framework details, and the effect on the majority classes. The authors provided detailed responses during the rebuttal phase, while Reviewer kswe did not respond. After reading the authors' rebuttal and the reviewers' comments, the AC believes the concerns of Reviewer kswe have been largely addressed. The AC agrees with the other three reviewers and suggests Accept. The authors are advised to include the important results provided in the rebuttal in the camera-ready version.