PaperHub
Overall rating: 5.3 / 10 (Poster; 4 reviewers; lowest 5, highest 6, std 0.4)
Individual ratings: 5, 6, 5, 5
Confidence: 3.5
Correctness: 2.8 · Contribution: 2.5 · Presentation: 2.8
NeurIPS 2024

Fine-grained Image-to-LiDAR Contrastive Distillation with Visual Foundation Models

Submitted: 2024-05-11 · Updated: 2025-01-02


Keywords
self-supervised learning · cross-modal distillation · 3D representation learning

Reviews and Discussion

Official Review
Rating: 5

This work aims to tackle the image-to-LiDAR contrastive learning problem for LiDAR-based point cloud segmentation. Previous approaches designed the cross-modal contrastive learning objective for model pretraining, using superpixels and superpoints as guidance.

In this work, the authors observe that the superpixel-driven contrastive loss tends to suffer from a "self-conflict" issue during representation learning. A weakly-supervised contrastive distillation method is proposed, which generates semantic superpixels/superpoints using the Segment Anything Model (SAM). Additionally, to address the imbalanced class distribution of LiDAR scene categories during representation learning, a density and category-aware sampling strategy is proposed that adjusts the sampling probabilities of different anchor points using the weak semantic labels.

The overall framework is named OLIVINE, which adopts three optimization objectives:

  • Weakly-supervised contrastive distillation using coarse semantic labels to identify positive pairs by category.
  • Self-supervised contrastive distillation applied to randomly sampled point-pixel pairs.
  • A regularization framework based on the von Mises-Fisher (vMF) distribution to ensure semantic consistency.

The proposed OLIVINE method is evaluated on the nuScenes, SemanticKITTI, and KITTI object detection datasets. The results show consistent improvements over existing approaches.

Strengths

(+) This work aims to improve image-to-LiDAR self-supervised representation learning on LiDAR-based point cloud datasets, which is a current research hotspot, especially for applications related to autonomous driving and robotics.

(+) The proposed method has exhibited promising performance on mainstream benchmarks, including nuScenes linear probing, nuScenes fine-tuning, SemanticKITTI fine-tuning, and KITTI object detection.

Weaknesses

(-) The weakly-supervised contrastive distillation method has been used in previous literature, such as [R1] and [R2]. Adding semantic categories seems not to cause a major improvement over class-agnostic masks, as the Segment Anything Model is able to segment rather complete and semantically consistent objects and backgrounds. Additionally, using weak labels (which might be erroneous) could introduce additional errors during pretraining.

(-) The motivation for using the von Mises-Fisher (vMF) distribution to enforce consistency regularization for image-to-LiDAR representation learning is not clear enough to demonstrate its superiority. A more detailed explanation and theoretical justification would strengthen this aspect of the work.

(-) Compared to some of the most related works, for example, [R1] and [R3], the scale and depth regarding the experiments (for example, downstream fine-tuning on other datasets than SemanticKITTI) could be further enhanced.


References:

  • [R1] Youquan Liu, et al. “Segment Any Point Cloud Sequences by Distilling Vision Foundation Models,” NeurIPS, 2023.
  • [R2] Ayça Takmaz, et al. “OpenMask3D: Open-Vocabulary 3D Instance Segmentation,” NeurIPS, 2023.
  • [R3] Gilles Puy, et al. “Revisiting the Distillation of Image Representations into Point Clouds for Autonomous Driving,” arXiv, 2023.

Questions

  • Q1: As mentioned in Weakness 1, the semantic masks generated by the Segment Anything Model could inevitably involve errors (e.g., wrong segmentation results). How do the authors handle the propagated errors during image-to-LiDAR representation learning?

  • Q2: As mentioned in Weakness 2, could the authors provide more details on the hyperparameter settings for the vMF distribution and the reasoning behind their chosen values? Adding a more detailed explanation and theoretical justification would be even better.

  • Q3: As mentioned in Weakness 3, having more thorough experimental analyses on other LiDAR-based point cloud datasets, such as SemanticPOSS, Waymo, SynLiDAR, etc., could further consolidate the findings and conclusions drawn in the manuscript.

  • Q4: As most 2D and 3D representation learning approaches (MoCo, SimCLR, Seal, etc.) do, having empirical analyses of models under out-of-distribution datasets is recommended.

  • [Minor]: The computational cost of the proposed multi-modal contrastive distillation approach is not thoroughly analyzed, which is crucial for real-time applications in autonomous driving.

  • [Minor]: The generalizability of OLIVINE to other types of sensors (for example, hybrid-solid LiDARs) or environments (for example, off-board environments) beyond the evaluated datasets is not discussed.

  • [Minor]: “NuScenes” should be revised to “nuScenes”.

Limitations

The authors mentioned "Semantic Label Accuracy" as one of their limitations. As also discussed in Weakness 1, more analyses are needed to address the impact of inaccuracies in the weak labels generated by the Segment Anything Model. These inaccuracies could propagate errors during the image-to-LiDAR representation learning process, potentially affecting the overall performance of the proposed method.

Additionally, while the von Mises-Fisher distribution is used for consistency regularization, the motivation and theoretical foundation for its use are not fully elaborated. A deeper exploration of its advantages and potential drawbacks in this context would be beneficial.

The computational cost associated with the multi-modal contrastive distillation approach is another important aspect that is not thoroughly analyzed. For practical applications, especially in real-time scenarios such as autonomous driving, it is crucial to understand the resource requirements and efficiency of the proposed method.

Lastly, the scalability and generalizability of OLIVINE to other sensor types and different environments have not been extensively discussed. Exploring its applicability in diverse settings and with various sensor configurations would provide a more comprehensive evaluation of its robustness and versatility.

Comment

Proposition 2: The representation of samples in the same class can vary significantly across different batches during contrastive distillation, and semantic-guided consistency regularization helps to learn structured features.

Justification: Without regularization, the representation of samples within the same class can vary significantly across different batches during contrastive distillation. This variance arises due to random sampling and the influence of negative samples in different batches. The weakly-supervised contrastive loss is defined as:

$$\mathcal{L}_{\mathrm{sup}} = - \frac{1}{M_s} \sum_{i=1}^{M_s} \log \left[ \frac{1}{|A(i)|} \sum_{a\in A(i)} \frac{\exp(\langle\mathbf{G}^{\mathrm{3D}}_i,\mathbf{G}^{\mathrm{2D}}_a \rangle/\tau)}{\sum_{j=1}^{M_s} \exp(\langle\mathbf{G}^{\mathrm{3D}}_i,\mathbf{G}^{\mathrm{2D}}_j \rangle /\tau)}\right]$$

The features of negative samples $\mathbf{G}^{\mathrm{2D}}_j$ vary across batches, leading to different optimization paths for each mini-batch. This introduces variability in the learned representations $\mathbf{G}^{\mathrm{3D}}_i$ for samples of the same class $k$.
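
For concreteness, the following is a minimal PyTorch-style sketch of how such a label-driven point-to-pixel contrastive loss can be computed; it is our own illustration rather than the authors' released code, and the tensor names (`g3d`, `g2d`, `labels`) are hypothetical.

```python
import torch
import torch.nn.functional as F

def weakly_supervised_contrastive_loss(g3d, g2d, labels, tau=0.07):
    """g3d, g2d: (M, C) features of M sampled point-pixel pairs from the 3D and
    2D branches; labels: (M,) weak semantic labels obtained from a VFM."""
    g3d = F.normalize(g3d, dim=1)                 # assume cosine similarities
    g2d = F.normalize(g2d, dim=1)
    logits = g3d @ g2d.t() / tau                  # (M, M) point-to-pixel similarities
    pos_mask = labels.unsqueeze(0) == labels.unsqueeze(1)   # same-label pairs A(i)
    prob = torch.softmax(logits, dim=1)           # softmax over all pixels j
    # average the probability mass assigned to the positives of each anchor,
    # then take the negative log, mirroring L_sup above
    mean_pos_prob = (prob * pos_mask).sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
    return -(mean_pos_prob.clamp(min=1e-12).log()).mean()
```

The class-agnostic, self-supervised branch corresponds to the special case where only the matched pixel of each point is treated as a positive.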

When we do not use semantic-guided consistency regularization, the within-class variance for class $k$ across different batches is:

$$\sigma_W^2 = \frac{1}{|B|} \sum_{B} \frac{1}{M_k} \sum_{i=1}^{M_k^B} \|g_i^k - \mu_k^B\|^2$$

For ease of reading, we use $g_i$ to refer to the point feature $\mathbf{G}^{\mathrm{3D}}_i$, and $\mu_k^B$ is the mean feature vector for class $k$ in batch $B$. Due to the batch-wise variability in negative samples, $\mu_k^B$ can differ significantly across batches, leading to high within-class variance.

By minimizing the KL divergence, we align the feature vectors $g_i$ of class $k$ with the mean direction $\mu_k$, reducing the spread of feature vectors within the same class. The within-class variance with regularization is:

$$\sigma_W^2 = \frac{1}{K} \sum_{k=1}^K \frac{1}{M_k} \sum_{i=1}^{M_k} \|g_i^k - \mu_k\|^2$$

Since $\mu_k$ is consistent across batches due to the regularization, the within-class variance is significantly reduced. This results in structured feature representations, enhancing class separability and improving performance in downstream tasks.
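
The following toy numpy snippet (synthetic 2-D features, purely illustrative) sketches the effect described above: when per-batch class means drift, the variance of same-class features measured around a common mean is much larger than the variance measured around each batch's own mean.

```python
import numpy as np

rng = np.random.default_rng(0)

def within_class_var(feats, mean):
    # mean squared distance of the features from the given mean vector
    return np.mean(np.sum((feats - mean) ** 2, axis=1))

# two "batches" of unit-norm features for one class, drawn around different
# directions to mimic batch-dependent optimization paths
b1 = rng.normal(loc=[1.0, 0.0], scale=0.1, size=(64, 2))
b1 /= np.linalg.norm(b1, axis=1, keepdims=True)
b2 = rng.normal(loc=[0.7, 0.7], scale=0.1, size=(64, 2))
b2 /= np.linalg.norm(b2, axis=1, keepdims=True)

per_batch = np.mean([within_class_var(b1, b1.mean(axis=0)),
                     within_class_var(b2, b2.mean(axis=0))])
pooled = within_class_var(np.vstack([b1, b2]), np.vstack([b1, b2]).mean(axis=0))
print(per_batch, pooled)  # the pooled (cross-batch) variance is noticeably larger
```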


Proposition 3: Learning structural representation during pretraining can benefit downstream tasks.

Justification: Structured features are those well-aligned within the same class (low within-class variance $\sigma_W^2$) and well-separated between different classes (high between-class variance $\sigma_B^2$).

With semantic-guided consistency regularization, the feature vectors $g_i^k$ for class $k$ are closely aligned with the mean direction $\mu_k$. This alignment reduces the within-class variance $\sigma_W^2$. Weakly-supervised contrastive learning pushes apart feature vectors of different classes, increasing the separation between class means $\mu_k$. This increases the between-class variance $\sigma_B^2$.

Taking a linear classifier as an example, the decision boundary is determined by the separation between class means. A higher $\sigma_B^2$ and a lower $\sigma_W^2$ result in clearer decision boundaries, reducing classification errors.

Consider a simple linear classifier with weight vector $w$ and bias $b$. The decision function is:

$$f(x) = w^T x + b$$

The decision boundary is given by:

$$w^T x + b = 0$$

For well-structured features, the margin (the distance between the decision boundary and the nearest samples) is maximized. The margin $\gamma$ for class $k$ can be expressed as:

$$\gamma = \frac{w^T (\mu_k - \mu)}{\|w\|}$$

Higher between-class variance ($\sigma_B^2$) and lower within-class variance ($\sigma_W^2$) increase this margin, leading to better classification performance.
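
A tiny numeric check of this margin formula (synthetic numbers, purely illustrative): with the classifier weights held fixed, moving a class mean farther from the overall mean enlarges $\gamma$.

```python
import numpy as np

w = np.array([1.0, 1.0])        # fixed classifier weights
mu = np.array([0.0, 0.0])       # overall feature mean

for mu_k in (np.array([0.5, 0.5]), np.array([2.0, 2.0])):
    gamma = w @ (mu_k - mu) / np.linalg.norm(w)
    print(mu_k, round(float(gamma), 3))   # 0.707 vs. 2.828
```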

[Known issues] If the equations do not display correctly, please refresh the page or try using a different browser.

Comment

Proposition 1: The features of each class $k$ can be modeled as a von Mises-Fisher (vMF) distribution. This means that for class $k$, the feature vectors $g_i$ lie on a unit hypersphere and are centered around a mean direction $\mu_k$ with a concentration parameter $\kappa_k$.

Justification: To show that the features of each class can be effectively modeled by a vMF distribution, we use maximum likelihood estimation (MLE) to determine the parameters $\mu_k$ and $\kappa_k$ that best fit the given set of feature vectors.

For a set of $M_k$ feature vectors $\{g_i\}_{i=1}^{M_k}$ from class $k$, the likelihood function for the vMF distribution is:

$$L(\mu_k, \kappa_k) = \prod_{i=1}^{M_k} f(g_i; \mu_k, \kappa_k) = \prod_{i=1}^{M_k} \mathcal{K}_{C}(\kappa_k) \exp(\kappa_k \mu_k^T g_i)$$

Taking the natural logarithm of the likelihood function, we get the log-likelihood:

$$\log L(\mu_k, \kappa_k) = \sum_{i=1}^{M_k} \log f(g_i; \mu_k, \kappa_k) = M_k \log \mathcal{K}_{C}(\kappa_k) + \kappa_k \sum_{i=1}^{M_k} \mu_k^T g_i$$

Substituting the expression for $\mathcal{K}_{C}(\kappa_k)$, we get:

$$\log L(\mu_k, \kappa_k) = M_k \left[ \log \left( \frac{\kappa_k^{C/2-1}}{(2\pi)^{C/2} I_{C/2-1}(\kappa_k)} \right) + \frac{\kappa_k}{M_k} \sum_{i=1}^{M_k} \mu_k^T g_i \right]$$

$$\log L(\mu_k, \kappa_k) = M_k \left[ (C/2-1) \log \kappa_k - \log I_{C/2-1}(\kappa_k) - \frac{C}{2} \log(2\pi) + \frac{\kappa_k}{M_k} \sum_{i=1}^{M_k} \mu_k^T g_i \right]$$

To maximize the log-likelihood, $\mu_k$ is set to the normalized sum of the feature vectors: $\mu_k = \frac{\sum_{i=1}^{M_k} g_i}{\|\sum_{i=1}^{M_k} g_i\|}$

The derivative of the log-likelihood with respect to $\kappa_k$ is:

$$\frac{\partial \log L(\mu_k, \kappa_k)}{\partial \kappa_k} = M_k \left[ \frac{C/2-1}{\kappa_k} - \frac{I_{C/2}(\kappa_k)}{I_{C/2-1}(\kappa_k)} + \frac{1}{M_k} \sum_{i=1}^{M_k} \mu_k^T g_i \right]$$

Setting this derivative to zero, we get:

$$\frac{C/2-1}{\kappa_k} - \frac{I_{C/2}(\kappa_k)}{I_{C/2-1}(\kappa_k)} + \frac{1}{M_k} \sum_{i=1}^{M_k} \mu_k^T g_i = 0$$

Solving for $\kappa_k$, we obtain:

$$\kappa_k = \frac{\|\sum_{i=1}^{M_k} g_i\| \left(C - \|\sum_{i=1}^{M_k} g_i\|^2\right)}{1 - \|\sum_{i=1}^{M_k} g_i\|^2}$$

This equation allows us to compute the concentration parameter $\kappa_k$ based on the alignment of the feature vectors. The concentration parameter $\kappa_k$ is larger when the distribution is more tightly clustered around the mean direction, and smaller when the features are more uniformly spread across the hypersphere.

By maximizing the likelihood function for the vMF distribution, we have shown that the parameters $\mu_k$ and $\kappa_k$ can be estimated to model the distribution of feature vectors for each class. The mean direction $\mu_k$ denotes the central direction of the feature cluster, and the concentration parameter $\kappa_k$ controls the tightness of this clustering. Moreover, the way we estimate the parameters of the vMF distribution via EMA is also consistent with the above theoretical derivation.
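
For reference, a short numpy sketch (our illustration under stated assumptions, not the paper's implementation) of these closed-form estimates for one class of L2-normalized features; it uses the common approximation of the MLE for $\kappa_k$ expressed through the mean resultant length $\bar{r} = \|\sum_i g_i\| / M_k$.

```python
import numpy as np

def estimate_vmf_params(features):
    """features: (M_k, C) array of unit-norm feature vectors for one class."""
    s = features.sum(axis=0)          # resultant vector sum_i g_i
    r = np.linalg.norm(s)
    mu_k = s / r                      # mean direction: the normalized feature sum
    m_k, c = features.shape
    r_bar = r / m_k                   # mean resultant length, in (0, 1)
    # widely used approximation of the maximum likelihood estimate of kappa
    kappa_k = r_bar * (c - r_bar ** 2) / (1.0 - r_bar ** 2)
    return mu_k, kappa_k
```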

[Known issues] If the equations do not display correctly, please refresh the page or try using a different browser.

Author Response

Thank you for your time and effort in reviewing our submission and for your valuable comments. In the following, we address your concerns and correct potential misunderstandings.

Q: The weakly-supervised contrastive distillation method has been used in previous literature [R1, R2].

A: We believe there may be a misunderstanding regarding the mentioned methods. Seal [R1] generates semantically coherent superpixels for distinct objects and backgrounds in the 3D scene. However, it does not infer semantic labels or use them to supervise contrastive distillation. Consequently, superpoints and superpixels within the same category may still be mistakenly considered negative pairs during pretraining. [R2] is NOT relevant to contrastive distillation or weakly-supervised learning.

Q: Adding semantic categories seems not to cause a major improvement over class-agnostic masks, as the SAM is able to segment rather complete and semantically consistent objects and backgrounds.

A: Although Seal [R1] uses VFMs to generate semantically coherent superpixels, it can still mistakenly treat superpoints and superpixels of the same category as negative pairs during contrastive distillation. Our method explicitly defines the points and pixels with the same semantic label as positive pairs during weakly-supervised contrastive learning. Besides, our method can achieve better performance with weak labels generated by stronger VFMs. We refer you to Table M1 of the uploaded PDF file for the additional results.

Q: Using weak labels could introduce additional errors during pretraining.

A: We acknowledge there is a trade-off. While the weak labels might be erroneous, they enable semantic-guided image-to-LiDAR contrastive distillation and indeed yield state-of-the-art performance on downstream tasks. Similarly, the superpixels widely used in previous image-to-LiDAR knowledge transfer methods can also be inaccurate but effective.

Q: The motivation for using the vMF distribution is not clear enough. A more detailed explanation and theoretical justification would strengthen this aspect.

A: Thank you for your valuable suggestion. We provide more explanations as follows.

  • The representation of samples in the same class can vary significantly across different batches during the contrastive distillation, so the model will struggle to learn stable semantic features. By making point features of the same class closely aligned, our method aims to create a more consistent and structured feature space.
  • The vMF distribution is defined on a hypersphere, making it well suited for directional data in feature space. The concentration parameter can be dynamically adapted during training to refine the feature alignment process. Early in training, a lower $\kappa$ might allow for more exploration, while later stages can benefit from a higher $\kappa$ to solidify the learned representations.
  • Due to the space limitation of each response, we will provide more theoretical justification in another window.

Q: The experiments could be enhanced on more datasets.

A: Following your valuable suggestion, we conducted experiments on more datasets. We refer you to Table M2 in the uploaded PDF file.

Q: How do the authors handle the propagated errors during image-to-LiDAR representation learning?

A: Thank you for raising this insightful point. We acknowledge that the inaccuracy of the labels generated by the SAM is a limitation of the current pipeline. We are actively developing a new label disambiguation module for future work. This module will utilize the learned feature similarities to refine the coarse labels. We believe that weak labels can help mitigate self-conflict, while structured semantic representations can assist in refining these labels. Together, they mutually reinforce each other, ultimately leading to more robust and accurate representations.

Q: Could authors provide details on the hyperparameter settings for the vMF distribution and the reason behind chosen values?

A: Regarding the vMF distribution, we did not set many hyperparameters. The mean direction and the concentration parameter are learned from the statistical values of features via the EMA (Exponential Moving Average) algorithm. The smoothness coefficient for the moving average is empirically set to 0.0001.
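
For illustration, a hypothetical sketch of such an EMA update for one class; the function and variable names (including the per-batch estimates `batch_mu_k` and `batch_kappa_k`) are our own, and only the momentum value 0.0001 comes from the answer above.

```python
import torch
import torch.nn.functional as F

def ema_update_vmf(mu_k, kappa_k, batch_mu_k, batch_kappa_k, momentum=1e-4):
    """Blend running vMF statistics with estimates from the current mini-batch."""
    mu_k = F.normalize((1 - momentum) * mu_k + momentum * batch_mu_k, dim=0)
    kappa_k = (1 - momentum) * kappa_k + momentum * batch_kappa_k
    return mu_k, kappa_k
```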

Q: As most 2D and 3D representation learning approaches do, having empirical analyses of models under out-of-distribution datasets is recommended.

A: Following your valuable suggestion, we added experiments on the nuScenes-C dataset. We refer you to Table M3 in the uploaded PDF file.

Q: The computational cost of the proposed approach is not analyzed, which is crucial for real-time applications.

A: Thanks for your comments. Our approach only provides pre-trained weights, which do not affect the inference speed of the model on downstream tasks. As shown in the table below, our OLIVINE does not require noticeably more GPU memory or training time compared to other pre-training methods.

| Method  | GPU Memory (GB) | Training Time (Hours) |
|---------|-----------------|-----------------------|
| PPKT    | 7.6             | 35.7                  |
| SLidR   | 10.7            | 38.9                  |
| OLIVINE | 8.1             | 36.5                  |

Q: The generalizability of OLIVINE to other types of sensors or environments is not discussed.

A: Following your suggestion, we have supplemented experiments on another six datasets, which demonstrate the generalizability of OLIVINE to some extent (see Table M2 of the uploaded PDF file). Since the computational resources are limited, we will further enhance the experiments part after the rebuttal.

Q: "NuScenes" should be "nuScenes".

A: Thanks for your careful review. We will correct the typo in the final version.

References:
[R1] Liu et al. Segment Any Point Cloud Sequences by Distilling Vision Foundation Models. NeurIPS2023.
[R2] Takmaz et al. OpenMask3D: Open-Vocabulary 3D Instance Segmentation. NeurIPS2023.

Comment

Thanks to the authors for putting tremendous effort into addressing the raised concerns.

Having read the authors' rebuttal as well as the other reviewers' comments, I believe the key issues of this work regarding the following aspects have been addressed or partially addressed:

  • The motivation has been re-stated and is now more straightforward.
  • The scale of experiments has been largely improved; a substantial amount of downstream tasks on a diverse set of datasets were added, which provide a more comprehensive and convincing evaluation of the proposed method against previous methods.
  • Several clarifications regarding technical details were provided, which resolved the related issues.

In addition to the above modifications, the authors also attempted to provide some theoretical analyses. However, since I am not an expert in machine learning theory, I leave more room to the ACs and other reviewers to validate the correctness of these theoretical analyses.

One key concern remaining is that the authors may over-claim the contribution of using "semantic superpixels" over the "class-agnostic superpixels". As stated in the previous review: the use of semantic categories seems not to cause a major improvement over class-agnostic masks. Therefore, the authors are suggested to re-elaborate the claim on this aspect to avoid possible "over-claim" issues.

Taking into consideration the authors' rebuttal and other reviewers' comments, I would like to upgrade the rating from Borderline Reject to Borderline Accept.

Meanwhile, I am looking forward to more discussions with the authors and other reviewers during the discussion period.

Comment

Thank you for the positive feedback provided and the time devoted to this review. We are glad that our efforts have addressed your concerns. Next, we will address your remaining concerns.


Comments: The authors may over-claim the contribution of using "semantic superpixels" over the "class-agnostic superpixels".

Response: We believe there may be a misunderstanding regarding our proposed methods. We would like to clarify the following points to address your concerns:

  • We have not claimed that using semantic superpixels is our contribution. In fact, our method does not rely on superpixels at all, which is different from previous methods [R1-R4].
  • Previous methods [R1-R4] use superpixels to pool 3D point features and 2D pixel features, learning with a superpixel-to-superpoint contrastive loss. In contrast, our method directly uses the features of individual points and pixels for contrastive distillation.
  • The semantic labels can be flexibly utilized in multiple aspects of the proposed method, such as weakly-supervised contrastive distillation, semantic-guided consistency regularization, and category-aware anchor point sampling. These aspects cannot be effectively addressed using only the class-agnostic superpixels.

Comments: The use of semantic categories seems not to cause a major improvement over class-agnostic masks.

Response: Extensive experiments demonstrate that this approach substantially outperforms superpixels-based (class-agnostic mask-based) methods [R1-R4] in various downstream tasks.

  • Our method achieves a significant improvement over superpixels-based pretraining methods on nuScenes and SemanticKITTI datasets. As shown in the table below, our method outperforms Seal [R1] by a significant margin, achieving an improvement of 5.14% under the setting of linear probing. The full results are available in Table M1 of the uploaded PDF file.
  • Following your suggestions, we have added experiments on six additional LiDAR-based point cloud datasets and one out-of-distribution dataset. And our proposed OLIVINE consistently outperforms the superpixels-based methods on all datasets. For the full results, please refer to Tables M2 and M3 of the uploaded PDF file.

| Method    | LP    | 1%    | 5%    | 10%   | 25%   | 100%  |
|-----------|-------|-------|-------|-------|-------|-------|
| Random    | 8.10  | 30.30 | 47.84 | 56.15 | 65.48 | 74.66 |
| PPKT      | 35.90 | 37.80 | 53.74 | 60.25 | 67.14 | 74.52 |
| SLidR     | 38.80 | 38.30 | 52.49 | 59.84 | 66.91 | 74.79 |
| ST-SLidR  | 40.48 | 40.75 | 54.69 | 60.75 | 67.70 | 75.14 |
| HVDistill | 39.50 | 42.70 | 56.60 | 62.90 | 69.30 | 76.60 |
| Seal      | 44.95 | 45.84 | 55.64 | 62.97 | 68.41 | 75.60 |
| Ours      | 50.09 | 50.60 | 60.25 | 65.07 | 70.15 | 76.69 |

Thanks again for your diligence as a reviewer. It has been a great pleasure communicating with you. Please feel free to share any additional comments or feedback on the manuscript.

References:
[R1] Image-to-lidar self-supervised distillation for autonomous driving data.
[R2] Self-supervised image-to-point distillation via semantically tolerant contrastive loss.
[R3] Segment Any Point Cloud Sequences by Distilling Vision Foundation Models.
[R4] HVDistill: Transferring Knowledge from Images to Point Clouds via Unsupervised Hybrid-View Distillation.

Official Review
Rating: 6

Annotating point clouds with semantic classes can be expensive and time consuming. The authors of this work propose a new pretraining strategy for weakly supervising point cloud segmentation using image-based supervision (i.e., image-to-LIDAR knowledge transfer). The proposed approach improves upon traditional contrastive pretraining strategies by leveraging visual foundation models (VFMs, e.g., SAM) to provide weak supervision for associating LiDAR points with corresponding pixels that have matching semantic classes. The authors also model features using von Mises-Fisher distributions to further encourage feature semantic clustering, and improve upon sampling by incorporating spatial distances of points and class frequency. This approach shows impressive state-of-the-art performance on pretraining across two benchmark datasets (SemanticKITTI and nuScenes). The ablation study also carefully highlights the impact of each of the contributions. This work will likely serve as a healthy addition to the image-to-point knowledge transfer community.

Strengths

  1. The authors identify an, evidently, common problem in point and pixel contrastive learning and address this issue with the proposed method. Namely, the authors (or perhaps Mahmoud et al. [36], see limitations section) recognize that prior works do not ensure semantic consistency when performing contrastive learning for image to point cloud knowledge transfer. This issue causes objects of the same class (e.g., car) to be pushed apart in feature space, simply because they are not part of the same super pixel. The authors address this using weakly contrastive distillation to ensure semantic consistency across anchor points.
  2. State-of-the-art results by pretraining on nuScenes and SemanticKITTI across a wide range of annotation data limitations.
  3. A detailed and thorough evaluation on multiple benchmarks and an extensive ablation study providing insightful results. The ablation study, in particular, shows the impact of weakly-supervised labels, separate projection heads, different distributions for modeling semantic features, and various sampling strategies.
  4. The authors are tackling an interesting and challenging problem of improving knowledge transfer across modalities. This research area is of particular importance given the decreased interest and investment in annotating campaigns by the community, and increased interest in self-supervised methods.

Weaknesses

  1. The related work section does not adequately differentiate this approach from prior works.
  • While the related work section does cite relevant works, it does not identify how the shortcomings of any of these works are addressed in this paper. Moreover, the related work section does not isolate how this paper is different, unique, or better than any of the existing approaches at image to point cloud knowledge transfer.
  • In particular, I found Mahmoud et al.'s [36] approach, for feature similarity from pixels to points and class balancing, sharing many commonalities with the proposed approach, hence, a detailed comparison may be warranted.
  • Liu et al. [33] also leverage VFMs like SAM for semantic segmentation to improve image to point cloud knowledge transfer, which is strikingly similar to the proposed approach. Detailed comparisons would greatly help clarify these commonalities, and it would strengthen the reader's confidence in the novelty of the proposed approach.

Minor:

  • L148: Which existing methods make the semantic unmatching mistake? While this may have been briefly mentioned [36] in the introduction, there was no clear statement with multiple cited works to support this claim. Consider citing these (uncited) prior works to provide evidence for this claim.
  • Tables 2 and 3 could be combined; it seems somewhat unnecessary to keep them separate.
  • Typographical/grammatical errors: L90, L155, etc.

Questions

  1. What is the impact of using Grounded SAM vs other VFMs for this approach?
  2. Which existing methods make the semantic unmatching mistake mentioned in L148?

Limitations

Yes, the authors clearly describe the limitations of the approach as it pertains to the (1) accuracy of the pseudo-labels derived from the VFM, (2) the diversity of the training data impacting environment adaptation, and (3) the dependency on highly calibrated cameras and LiDAR sensors to ensure knowledge transfer.

Author Response

Thanks for your time and effort in reviewing our paper, the valuable comments, and the favorable recommendation.

Q: To differentiate this approach from prior works.

A: We agree with you that it's necessary to highlight the shortcomings of previous works and the novelty of OLIVINE in the related work section. Here, we would like to clarify the following points:

  • Previous works [R1-R5] have not solved the self-conflict problem properly. Especially, Seal [R4] generates semantically coherent superpixels for distinct objects and backgrounds in the 3D scene. However, the superpoints and superpixels with the same category may still be mistakenly considered negative pairs during contrastive learning. By contrast, our method explicitly defines the points and pixels with the same semantic labels as positive pairs during weakly-supervised contrastive learning.
  • Our pipeline performs knowledge distillation on two levels: self-supervised and weakly-supervised contrastive learning. To achieve this, we develop two different heads in both the image and point cloud branches to decouple the learned representation. Previous methods [R1-R5] have only attempted self-supervised contrastive distillation and have not explored using labels to guide contrastive distillation.
  • The representation of samples in the same class can vary significantly across different batches during the contrastive distillation, so the model will struggle to learn stable semantic features. By making point features of the same class closely aligned, our method aims to create a more consistent and structured feature space.
  • Existing methods [R2-R5] are highly dependent on the generated superpixels. Superpixels balance asymmetries between areas with denser coverage of points and sparser areas in the contrastive loss. However, we do not need this process at all and ensure a uniform representation of both spatial and categorical dimensions by employing a novel sampling strategy.

Q: A detailed comparison with Mahmoud et al [36].

A: Thanks for your suggestion. The main differences between ours and ST-SLidR [R3] are:

  • ST-SLidR [R3] reduces the contribution of false negative samples based on superpixel-to-superpixel similarity, using 2D self-supervised features to determine semantic similarities between superpixels. By contrast, our method directly estimates the semantic labels of images with VFMs, and defines pixels and points with the same label as positive pairs.
  • Regarding class balancing, ST-SLidR [R3] assigns higher weights to over-represented anchors that exhibit high similarities to most negative samples. By contrast, our approach directly adjusts the sampling probability of anchor points using easily accessible semantic labels.

In summary, our OLIVINE offers a more direct and effective way to mitigate the effect of false negative samples and class imbalance.

Q: Liu et al. [33] also leverage VFMs. Detailed comparisons help clarify these commonalities.

A: Thanks for your suggestions to compare Seal [R4] and OLIVINE. We would like to clarify the following points:

  • To avoid over-segmenting semantically coherent areas, Seal [R4] generates superpixels using VFMs instead of the traditional method SLIC. In contrast, our method does not rely on superpixels. Although we also use VFMs, we leverage them to obtain coarse semantic labels for fine-grained contrastive distillation.
  • In method Seal [R4], the superpoints and superpixels with the same category may still be mistakenly considered negative pairs during contrastive learning. Our method explicitly defines the points and pixels with the same semantic labels as positive pairs during weakly-supervised contrastive learning.
  • The semantic labels generated by VFMs, rather than superpixels, can be flexibly utilized in multiple aspects of the knowledge transfer process, such as weakly-supervised contrastive distillation, semantic-guided consistency regularization, and category-aware anchor point sampling. These aspects cannot be effectively addressed using only the class-agnostic superpixels.

Q: L148: Which methods make semantic unmatching mistake?

A: Thank you for your valuable feedback. Existing methods [R1, R2, R4, R5] may mistakenly treat unmatched (super)points and (super)pixels in the same category as negative pairs during contrastive distillation. We will cite these methods to support this claim in the revised manuscript to provide clear evidence.

Q: Tables 2 and 3 could be combined. Typo errors...

A: We appreciate your attention to detail. We have combined Tables 2 and 3 to streamline the presentation and carefully corrected the typo errors.

Q: Impact of Grounded SAM vs other VFMs for this approach?

A: Thanks for your question. Our response is as follows:

  • Grounded SAM supports text prompts by combining Grounding DINO and SAM. Other VFMs that enable text prompts can also be applied in OLIVINE.
  • The precision of the semantic labels significantly impacts the effectiveness of OLIVINE. Stronger VFMs provide more accurate semantic labels, leading to better learned representations. As shown in the table below, the potential of our method can be further unleashed by using a stronger VFM, namely SEEM [R6].

| VFMs            | LP    | 1%    | 5%    | 10%   | 25%   |
|-----------------|-------|-------|-------|-------|-------|
| Grounded-SAM    | 47.30 | 46.12 | 57.51 | 63.04 | 69.39 |
| Grounded-SAM-HQ | 47.84 | 48.03 | 58.51 | 64.08 | 69.52 |
| SEEM            | 50.09 | 50.60 | 60.25 | 65.07 | 70.15 |

Ref:
[R1] Learning from 2d: Contrastive pixel-to-point knowledge transfer for 3d pretraining.
[R2] Image-to-lidar self-supervised distillation for autonomous driving data.
[R3] Self-supervised image-to-point distillation via semantically tolerant contrastive loss.
[R4] Segment Any Point Cloud Sequences by Distilling Vision Foundation Models.
[R5] HVDistill: Transferring Knowledge from Images to Point Clouds via Unsupervised Hybrid-View Distillation.
[R6] Segment Everything Everywhere All at Once.

Comment

Hello Authors,

I have reread the paper, the other reviews, and the authors’ comments. Thank you for the thorough rebuttal and responses to each of our questions and concerns. The additional tables and experiments are detailed and insightful. My primary concerns related to an incomplete comparison to related work, similarities to R3, and comparisons to other VFMs have been adequately addressed. I am now more confident in maintaining my original rating of weak accept.

Comment

Dear Reviewer uKMo,

Thanks again for the time and energy you committed and your valuable comments. Your meticulous review and thoughtful critiques truly reflect your deep domain expertise and diligence as a reviewer. It has been a pleasure communicating and exchanging ideas with you.

Please feel free to share any additional comments or feedback on the manuscript.

Warm regards,

Authors

Official Review
Rating: 5

In this paper, the authors introduced a novel approach for improving 3D representation learning by leveraging VFMs to generate semantic labels for weakly-supervised pixel-to-point contrastive distillation. The proposed method addressed the self-conflict issue in traditional contrastive learning and presented a density and category-aware sampling strategy to ensure balanced learning. This approach showed the better performance over existing methods on nuScenes and SemanticKITTI datasets.

Strengths

First of all, the motivation of the paper seems meaningful and pragmatic from the perspective of better semantic understanding (by leveraging VFMs) and balanced learning (by using density and category frequency).

The key idea of integrating VFMs with existing multi-modal SSL, allowing semantic labels to be generated to deal with the self-conflict issue, is very intuitive. More specifically, the model is trained with three objectives: weakly-supervised contrastive distillation using pseudo labels to identify positive pairs by category, self-supervised contrastive learning applied to randomly sampled point-pixel pairs, and a regularization based on the von Mises-Fisher distribution to ensure semantic consistency.

In the experimental section, the proposed method achieved SoTA results on two kinds of downstream tasks (segmentation and detection), demonstrating its effectiveness. The ablation study is highly analytical for each module.

Weaknesses

One concern is the validity of the proposed SSL method for better representation learning. First of all, it is not clear whether the reduced effectiveness with larger data is due to the model's insufficient size (capability) or to limitations of the proposed method itself. Also, an explanation is required to determine whether the lower detection performance gain is due to the ineffectiveness of the proposed method, despite its use of object semantics. If necessary, the reasons for varying performance improvements across different downstream tasks should be described in terms of the mechanism of the proposed learning pipeline.

The experimental analysis and technical description are not sufficiently specific in some respects. The category-aware sampling is not specified in detail. There is no detailed description of the performance variation across sampled data groups, or of the extent of improvement over existing methods when learning from a fraction of the entire dataset (1%, 5%, 10%, ...).

Questions

There are some vague sentences and grammatical errors in the paper. I recommend that the authors revise the paper.

Limitations

I have mentioned all my comments, including reasons and suggestions, in the sections above. I recommend that the authors address all of these concerns and improve the completeness of the paper. If the rebuttal period resolves the above-mentioned concerns, I will gladly raise my score.

Author Response

We sincerely appreciate the reviewer's time and effort in reviewing our paper. Thanks for your valuable comments and recognition of our work. In the following, we will comprehensively address your concerns.


Comment: The validity of the proposed SSL method. It is not clear whether the reduced effectiveness with larger data is due to the model’s insufficient size or limitations in the proposed model.

Response: Thanks for your comments. We would like to clarify the following points to address your concerns:

  • When the available training data is limited, the benefits of pre-trained model weights on downstream tasks are more pronounced. This phenomenon is widely observed in self-supervised learning. When downstream task data is limited, these representations are crucial because they provide a strong starting point, capturing features that the model wouldn't learn from the small labeled dataset alone. As the amount of labeled data increases, the model can learn these features directly from the labeled data, making the initial representations from pretraining less critical.
  • We also conducted experiments with a stronger 3D backbone, namely WaffleIron [R1] (see Table B1). The effect of the pre-trained weights becomes less obvious when downstream tasks are trained on sufficient data, so the reduced effectiveness with larger data is not due to the capacity of the backbone.
  • We can further improve the performance with more accurate semantic labels generated by a stronger VFM such as SEEM [R2]. As shown in Table B2, our method achieves a 2.03% mIoU improvement on nuScenes with the full training data.
  • You might mean that the improvement compared to other pretraining methods is not obvious. We believe the main value of self-supervised methods is to improve performance when annotation resources are limited, and OLIVINE outperforms existing methods significantly in that regime.

Table B1: Performance for 3D backbone WaffleIron

| Method | 1%    | 10%   | 100%  |
|--------|-------|-------|-------|
| Random | 33.26 | 58.13 | 77.60 |
| Ours   | 50.14 | 66.43 | 78.21 |

Table B2: Comparison of various pre-training techniques.

| Method | LP    | 1%    | 5%    | 10%   | 25%   | 100%  |
|--------|-------|-------|-------|-------|-------|-------|
| Random | 8.10  | 30.30 | 47.84 | 56.15 | 65.48 | 74.66 |
| Seal   | 44.95 | 45.84 | 55.64 | 62.97 | 68.41 | 75.60 |
| Ours   | 50.09 | 50.60 | 60.25 | 65.07 | 70.15 | 76.69 |

Comment: An explanation for the lower detection performance gain is required. The reasons for varying performance improvements across different downstream tasks should be described in terms of the mechanism of the proposed learning pipeline.

Response: Thanks for your insightful questions. We provide the following explanations to address your concerns:

  • We observed a 2.0% mAP improvement with the SECOND and a 1.5% mAP improvement with the PV-RCNN, surpassing previous pretraining methods. These improvements were achieved by fine-tuning on full training data, so the enhancements may appear less significant compared to using limited labels.
  • Compared to the semantic segmentation task, the model architecture for object detection is more complex. Besides the 3D backbone, 3D detectors typically project features to a BEV plane, followed by a 2D convolutional network and RoI operations. These crucial components were not pre-trained, which may limit the overall performance gain from our pre-training approach.
  • It's important to note that semantic segmentation and object detection use different metrics and scales, making direct performance comparisons improper. The nature of these tasks and their evaluation criteria inherently lead to varying degrees of improvement when applying our proposed method.

Comment: The experimental analysis and technical description are not sufficiently specific. The category-aware sampling is not specified in detail. There is no detailed description of the performance variation across sampled data groups drawn from the entire dataset.

Response: Thanks for pointing out this issue. Category-aware and density-aware sampling determine the sampling probability of a point by its category frequency and distance information, respectively. These are part of a hybrid strategy we refer to as density and category-aware sampling (DCAS). Following your suggestion, we have added a comparison of sampling strategies using 1%, 5%, 10%, 25%, and 100% of the annotated data from nuScenes. The results are presented in the table below. We found that the density and category-aware sampling strategy consistently achieves the best performance on downstream tasks, effectively leveraging both spatial distribution and category frequency.

| Sampling                          | 1%    | 5%    | 10%   | 25%   |
|-----------------------------------|-------|-------|-------|-------|
| Random                            | 44.91 | 56.01 | 62.58 | 68.74 |
| Density-aware                     | 45.33 | 56.60 | 62.74 | 68.96 |
| Category-aware                    | 45.74 | 56.98 | 62.89 | 69.18 |
| DCAS (Density and Category-aware) | 46.12 | 57.51 | 63.04 | 69.39 |
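
For illustration only, a rough numpy sketch of how such a hybrid sampling probability could be formed; the specific weighting below is our assumption rather than the paper's exact formulation, with rarer weak-label classes and farther (sparser) points sampled more often as anchors.

```python
import numpy as np

def dcas_sampling_probs(weak_labels, distances, eps=1e-6):
    """weak_labels: (N,) class ids from the VFM; distances: (N,) range of each
    point from the sensor. Returns a sampling probability over the N points."""
    _, inverse, counts = np.unique(weak_labels, return_inverse=True,
                                   return_counts=True)
    freq = counts[inverse] / len(weak_labels)              # per-point class frequency
    category_weight = 1.0 / (freq + eps)                   # up-weight rare classes
    density_weight = distances / (distances.mean() + eps)  # up-weight far, sparse points
    weights = category_weight * density_weight
    return weights / weights.sum()
```

The resulting probabilities can then be passed to, for example, `numpy.random.choice` to draw the anchor points.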

Comment: There are some vague sentences and grammatical errors in the paper.

Response: Thank you for your feedback. We appreciate your attention to detail. We have thoroughly reviewed the manuscript, revised the vague sentences, and corrected the grammatical errors.

We genuinely hope that these clarifications address your concerns. Thanks again for your valuable time and feedback. We will include the results and analysis in the revised manuscript.


References:
[R1] Puy et al. Using a Waffle Iron for Automotive Point Cloud Semantic Segmentation. ICCV2023.
[R2] Zou et al. SEEM: Segment Everything Everywhere All at Once. NeurIPS2023.

Comment

Thank you for your response. I think the authors have provided feedback that addresses most of my concerns, and the additional experiments are informative. After reading the other reviewers' and the authors' comments, I will keep my current rating.

Comment

Dear Reviewer 54Gn,

Thank you for your response and for taking the time to carefully review our rebuttal. We greatly appreciate your recognition of our efforts to address your concerns and the value you found in the additional experiments we conducted. Your detailed and thoughtful review demonstrates a profound expertise in this domain. I have thoroughly enjoyed the opportunity to learn from your perspective.

Please feel free to share any further comments or suggestions.

Warm regards,

Authors

Official Review
Rating: 5

The paper addresses the "self-conflict" issue in contrastive image-to-LiDAR knowledge transfer, where features of semantically similar but unmatched points and pixels are unintentionally dissociated, compromising representation integrity. To solve this, Visual Foundation Models are employed to generate semantic labels for weakly-supervised pixel-to-point contrastive distillation. The method includes structuring the feature space with von Mises-Fisher distributions for consistency and adjusting sampling probabilities to handle spatial and category imbalances. Extensive experiments demonstrate that this approach significantly outperforms traditional methods in various downstream tasks.

Strengths

  1. The paper uses Visual Foundation Models to generate semantic labels, resolving the "self-conflict" issue and improving representation integrity.
  2. The paper proposes a density and category-aware sampling method, ensuring balanced learning and better representation of minority categories.

Weaknesses

  1. The overall architecture is similar to Seal [1], limiting its novelty except for the sampling strategy. Providing more clarification about the differences from Seal would be beneficial.
  2. The improvement in fine-tuning results compared to the state-of-the-art is marginal.

[1] Segment any point cloud sequences by distilling vision foundation models

Questions

Please refer to the weaknesses section.

Limitations

The paper discusses the limitations in the appendix.

Author Response

We sincerely appreciate the reviewer's time and effort in reviewing our paper. In the following, we will comprehensively address your concerns.


Comment: The overall architecture is similar to Seal, limiting its novelty except for the sampling strategy. Providing more clarification about the differences from Seal would be beneficial.

Response: Our overall framework significantly differs from the existing method Seal [R1]. We would like to clarify the following points to highlight the novelty of our method:

  • The purposes of using VFMs in Seal [R1] and our method are completely different. To avoid over-segmenting semantically coherent areas, Seal [R1] generates superpixels using visual foundation models (VFMs) instead of the traditional method SLIC [R2]. In contrast, our method does not rely on superpixels. Although we also use VFMs, we leverage them to obtain coarse semantic labels for fine-grained contrastive distillation.
  • Although the more precise superpixels generated by VFMs could mitigate the self-conflict problems to some extent, such a method does not solve the problem thoroughly. The superpoints and superpixels with the same category may still be mistakenly considered negative pairs during contrastive learning. Our method explicitly defines the points and pixels with the same semantic labels as positive pairs during weakly-supervised contrastive learning.
  • Our pipeline performs knowledge distillation on two levels: self-supervised and weakly-supervised contrastive learning. To achieve this, we develop two different heads in both the image and point cloud branches to decouple the learned representation. Previous methods like Seal have only attempted self-supervised contrastive distillation and have not explored using labels to guide contrastive distillation.
  • We explicitly model the features of each class as a von Mises-Fisher (vMF) distribution, promoting feature consistency within the same category. This approach cultivates a meaningful and structured feature space, an aspect that Seal does not explore.
  • Existing methods like Seal [R1] are highly dependent on the generated superpixels. Superpixels balance asymmetries between areas with denser coverage of points and sparser areas in the contrastive loss. However, we do not need this process at all and ensure a uniform representation of both spatial and categorical dimensions by employing a novel sampling strategy.

We genuinely hope that these clarifications provide a clearer perspective on our research and its merits. Thanks again for your valuable time and feedback. We will further clarify the novelty and the differences from related methods such as Seal [R1] in detail in the revised manuscript.


Comment: The improvement in fine-tuning results compared to the state-of-the-art is marginal.

Response: We agree with you that the improvement in downstream tasks compared to the state-of-the-art is not significant. However, we would like to clarify the following points:

  • As stated in the manuscript, we believe that employing stronger visual foundation models for more precise semantic labels can lead to better 3D representations. Therefore, we obtained coarse labels with a stronger VFM, namely SEEM, and evaluated the learned 3D representation. As shown in the table below, our method outperforms Seal [R1] by a significant margin, achieving an improvement of 5.14% under the setting of linear probing.
  • We have completely open-sourced the code for OLIVINE, whereas existing state-of-the-art methods like HVDistill and Seal have NOT yet made their training code available. We believe this contributes positively to the image-to-point knowledge transfer community by promoting transparency and enabling further research.
  • Our method is compatible with existing techniques. For example, the semantic temporal consistency proposed in Seal [R1] and BEV-based contrastive distillation [R3] can also be integrated into our pipeline. We plan to explore these aspects further once the source code for these works is released.

[Table A1] Comparison of various pre-training techniques for semantic segmentation tasks using either finetuning or linear probing.

| Method    | LP    | 1%    | 5%    | 10%   | 25%   | 100%  |
|-----------|-------|-------|-------|-------|-------|-------|
| Random    | 8.10  | 30.30 | 47.84 | 56.15 | 65.48 | 74.66 |
| PPKT      | 35.90 | 37.80 | 53.74 | 60.25 | 67.14 | 74.52 |
| SLidR     | 38.80 | 38.30 | 52.49 | 59.84 | 66.91 | 74.79 |
| ST-SLidR  | 40.48 | 40.75 | 54.69 | 60.75 | 67.70 | 75.14 |
| HVDistill | 39.50 | 42.70 | 56.60 | 62.90 | 69.30 | 76.60 |
| Seal      | 44.95 | 45.84 | 55.64 | 62.97 | 68.41 | 75.60 |
| Ours      | 50.09 | 50.60 | 60.25 | 65.07 | 70.15 | 76.69 |

References:
[R1] Liu et al. Segment Any Point Cloud Sequences by Distilling Vision Foundation Models. NeurIPS2023.
[R2] Achanta et al. SLIC Superpixels Compared to State-of-the-Art Superpixel Methods. TPAMI2012.
[R3] Zhang et al. HVDistill: Transferring Knowledge from Images to Point Clouds via Unsupervised Hybrid-View Distillation. IJCV2024.

Comment

Thank you for the detailed rebuttal. It has addressed most of my concerns. As a result, I will maintain my current rating.

Comment

Dear Reviewer 2WGA,

Thank you for taking the time to review our rebuttal and for your constructive feedback throughout the process. We are glad that we could address most of your concerns.

We will actively participate in the Author-Reviewer discussion session. Please feel free to share any additional comments or feedback on the manuscript.

Warm regards,

Authors

Author Response

We sincerely thank all reviewers for your time and constructive comments.


We are glad that the reviewers see the value in our work:

  1. "The paper addresses the self-conflict issue in contrastive image-to-LiDAR knowledge transfer ... significantly outperforms traditional methods in various downstream tasks" (Reviewer 2WGA);
  2. "the motivation of the paper seems to be meaningful and pragmatic in the perspective of the better semantic understanding" (Reviewer 54Gn);
  3. "The key idea is very intuitive how to integrate VFMs with existing multi-modal SSL" (Reviewer 54Gn);
  4. "The ablation study is highly analytical for each level module" (Reviewer 54Gn);
  5. "This work will likely serve as a healthy addition to the image-to-point knowledge transfer community" (Reviewer uKMo).

We would like to emphasize the uniqueness and advantages of our approach over existing ones:

  1. Previous works [R1-R5] have not solved the self-conflict problem properly. Especially, Seal [R4] generates semantically coherent superpixels for distinct objects and backgrounds in the 3D scene. However, the superpoints and superpixels with the same category may still be mistakenly considered negative pairs during contrastive learning. By contrast, our method explicitly defines the points and pixels with the same semantic labels as positive pairs during weakly-supervised contrastive learning.
  2. Our pipeline performs knowledge distillation on two levels: self-supervised and weakly-supervised contrastive learning. To achieve this, we develop two different heads in both the image and point cloud branches to decouple the learned representation. Previous methods [R1-R5] have only attempted self-supervised contrastive distillation and have not explored using labels to guide contrastive distillation.
  3. The representation of samples in the same class can vary significantly across different batches during the contrastive distillation, so the model will struggle to learn stable semantic features. By making point features of the same class closely aligned, our method aims to create a more consistent and structured feature space.
  4. Existing methods [R2-R5] are highly dependent on the generated superpixels. Superpixels balance asymmetries between areas with denser coverage of points and sparser areas in the contrastive loss. However, we do not need this process at all and ensure a uniform representation of both spatial and categorical dimensions by employing a novel sampling strategy.
  5. ST-SLidR [R3] reduces the contribution of false negative samples based on superpixel-to-superpixel similarity, using 2D self-supervised features to determine semantic similarities between superpixels. By contrast, our method directly estimates the semantic labels of images with VFMs, and defines pixels and points with the same label as positive pairs.
  6. The purposes of using VFMs in Seal [R4] and our method are completely different. To avoid over-segmenting semantically coherent areas, Seal [R4] generates superpixels using visual foundation models (VFMs) instead of the traditional method SLIC [R6]. In contrast, our method does not rely on superpixels. Although we also use VFMs, we leverage them to obtain coarse semantic labels for fine-grained contrastive distillation.

Following the reviewers' valuable comments and suggestions, we have made these efforts:

  1. We have achieved further improvements on downstream tasks using semantic labels generated by stronger VFMs, as suggested by Reviewer 2WGA.
  2. We have highlighted the novelty of our method and clarified the differences from previous methods, as suggested by Reviewers 2WGA and uKMo.
  3. We discussed the reasons for varying performance improvements across different downstream tasks, considering the mechanism of the proposed learning pipeline, as suggested by Reviewer 54Gn.
  4. We have supplemented the experiment analysis and provided a technical description of sampling strategies, as suggested by Reviewer 54Gn.
  5. We have added relevant citations to support our claim in L148, as suggested by Reviewer uKMo.
  6. We have combined Tables 2 and 3 to streamline the presentation and carefully corrected typographical errors, as suggested by Reviewers uKMo and svRz.
  7. We have compared the effects of different VFMs for generating semantic labels, as suggested by Reviewer uKMo.
  8. We have provided a detailed explanation and theoretical justification for the application of the vMF distribution, as suggested by Reviewer svRz.
  9. We have added experiments on six additional LiDAR-based point cloud datasets and one out-of-distribution dataset, as suggested by Reviewer svRz.
  10. We have reported the computational cost of the proposed pretraining method, as suggested by Reviewer svRz.

Finally, we extend our gratitude to the PCs, ACs, and all the reviewers for their dedicated time and effort in this review process. We look forward to engaging in discussions with you over the next few days.


References:
[R1] Liu et al. Learning from 2d: Contrastive pixel-to-point knowledge transfer for 3d pretraining. arxiv2021.
[R2] Sautier et al. Image-to-lidar self-supervised distillation for autonomous driving data. CVPR2022.
[R3] Mahmoud et al. Self-supervised image-to-point distillation via semantically tolerant contrastive loss. CVPR2023.
[R4] Liu et al. Segment Any Point Cloud Sequences by Distilling Vision Foundation Models. NeurIPS2023.
[R5] Zhang et al. HVDistill: Transferring Knowledge from Images to Point Clouds via Unsupervised Hybrid-View Distillation. IJCV2024.

Comment

Dear Reviewers,

We sincerely thank you for your thoughtful evaluations during the rebuttal stage. We are pleased that our detailed responses have addressed most of your concerns. We appreciate your recognition of the additional experiments and the clarification we have made to the paper.

We are grateful for the time you have invested in reviewing our work and for your consistent recognition of its value. Based on your valuable feedback, we will carefully revise the manuscript, incorporating the additional experiments and analyses into the final version.

Best regards,

The Authors

Final Decision

After the discussion phase all reviewers recommend acceptance, noting the clear motivation/presentation, compelling results, and thorough evaluation. A substantial rebuttal was submitted that helped to address a majority of the reviewer concerns, including adding additional details, theoretical justifications, and experiments. As such, the ACs reached a decision to accept the paper. Please take the reviewer feedback into account when preparing the camera ready version.