Rethinking Self-Supervised Learning: An Instance-wise Similarity Perspective
We propose a novel self-supervised learning approach that learns an appropriately sparse IwS matrix in the representation space.
Reviews and Discussion
This paper proposes Sparse-CL, a new contrastive self-supervised learning method that, as opposed to most methods, allows positive pairs across views from different samples. The number of such positive pairs is regularized with a constraint on the instance-wise similarity matrix. Sparse-CL is evaluated on various linear classification benchmarks such as CIFAR and ImageNet, and demonstrates competitive performance in similar setups compared to concurrent methods.
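For concreteness, below is a minimal PyTorch-style sketch of the kind of objective described above: an alignment term on same-sample views plus a sparsity penalty on the cross-view similarity (IwS) matrix, so that off-diagonal entries are not uniformly repelled. The function name, the L1 form of the penalty, and the temperature are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def sparse_contrastive_loss(z1, z2, lam=0.1, temperature=0.1):
    # z1, z2: representations of two augmented views, shape (N, D).
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)

    # Cross-view similarity matrix: plays the role of the IwS matrix in
    # representation space. Diagonal entries are the usual same-sample
    # positives; off-diagonal entries are cross-sample similarities.
    sim = z1 @ z2.T / temperature                 # (N, N)

    # Alignment: pull the two views of each sample together.
    alignment = -sim.diagonal().mean()

    # Sparsity: instead of repelling every off-diagonal pair (as InfoNCE
    # effectively does), penalize the overall magnitude of the matrix so
    # that only a few cross-sample entries stay large.
    sparsity = sim.abs().mean()

    return alignment + lam * sparsity
```

With lam set to 0 only the alignment term remains, which is the lambda=0.0 baseline questioned later in this review.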
Strengths
- The idea of considering different samples as positive pairs in a contrastive loss is interesting and fixes the issue of repelling examples from similar concepts. There is an underlying graph of connections between concepts, and the contrastive loss does not take that into account, which is a good motivation for this work.
- The sparsity constraint is also a good idea. Indeed, discovering the graph might be very difficult, and letting the system discover it with properly designed loss constraints seems to be a solution.
- The results on small datasets are promising and the method achieves very good performance against competitors.
Weaknesses
- I disagree that the IwS matrix of SimSiam is a matrix full of ones. In practice, the "critical issues" mentioned with SimSiam are not observed, and I am not sure that this can be considered a problem.
- The results on ImageNet are good in a comparable setting, but far from impressive. For example, SwAV is compared without multi-crop, which is part of the method. Moreover, recent breakthroughs with the transformer architecture lead to much better results than what is reported in the paper: DINO reported 75% linear-evaluation accuracy in 2021, and the best DINOv2 model is at 86% accuracy.
- Sparse-CL with lambda=0.0 achieves 71.5% on CIFAR-100, which is already better than every other method. How do you explain that? Is the setup really comparable with the other methods?
- The explanations in the paragraphs "Input space" and "Representation space" are redundant and should be independent of the choice of method (here, a MoCo-style method). Maybe just say: is but in representation.
Questions
Do you have a way of measuring what your method brings in practice in terms of distance between concepts in representation space, compared to classical methods? Maybe using k-NN? It might be possible that other methods already compute a graph of concepts automatically.
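One way to run such a k-NN check is the minimal sketch below, assuming `features` and `labels` are NumPy arrays extracted from a frozen encoder on a labelled evaluation set; the function name, the cosine metric, and k are illustrative choices, not something proposed in the paper.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def concept_purity(features, labels, k=20):
    """Fraction of each sample's k nearest neighbours (in representation
    space) that share its label -- a rough proxy for how well the learned
    graph of concepts matches the ground-truth classes."""
    features = features / np.linalg.norm(features, axis=1, keepdims=True)
    nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(features)
    _, idx = nn.kneighbors(features)           # idx[:, 0] is the sample itself
    neighbour_labels = labels[idx[:, 1:]]      # (N, k)
    return (neighbour_labels == labels[:, None]).mean()
```

A method whose representation space already encodes a graph of related concepts would be expected to score higher here than one trained with purely instance-wise positives.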
Would it be possible to adapt the method to work with redundancy reduction methods (Barlow Twins, VICReg) ?
The paper proposes to understand self-supervised learning from the perspective of instance-wise similarity (IwS). From this perspective, the paper identifies limitations in current self-supervised learning approaches, including contrastive learning and Siamese methods. To address these limitations, the paper introduces sparse contrastive learning, which learns an appropriately sparse IwS matrix in the representation space. The proposed method is validated through experiments on ImageNet and CIFAR datasets, showing superior performance compared to other state-of-the-art methods.
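For concreteness, one plausible form of a continuous IwS entry in representation space, consistent with the later remark that Eqn. 3 yields values between 0 and 1, is a rescaled cosine similarity between representations; this is an assumed form, not necessarily the paper's Eqn. 3.

```latex
% Assumed form of a continuous IwS entry (the paper's Eqn. 3 may differ):
S_{ij} \;=\; \frac{1}{2}\left(1 + \frac{z_i^{\top} z_j}{\lVert z_i \rVert \,\lVert z_j \rVert}\right) \;\in\; [0, 1]
```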
Strengths
- The work is well-motivated, aiming to bridge the discrepancy between IwS matrices in input and representation spaces.
- The paper is well-organized, and the proposed method is explained with visual illustrations which aid in understanding the concept of IwS and the proposed Sparse CL approach.
- The authors provide extensive experiments on standard classification benchmarks, including CIFAR-10, CIFAR-100, ImageNet-100, and ImageNet-1k, which substantiate the claimed benefits of Sparse CL.
Weaknesses
- Contrastive methods in self-supervised learning adopt instance discrimination as the pretext task, and the focus of this line of research (and most previous methods) is how to handle the positive and negative pairs, or the diagonal and off-diagonal entries in the IwS matrix, respectively. However, the authors claim that studying from the perspective of IwS provides a novel framework, which might not be true. In addition, the proposed method fails to deal with false positives, as there might be 0s in the diagonal of the IwS matrix due to semantic inconsistency caused by strong data augmentation. This scenario should be taken into consideration since this paper focuses on IwS.
- A theoretical analysis could be conducted of the alignment and sparsity terms of the Sparse CL loss. The InfoNCE loss, used in SimCLR, MoCo, and other contrastive methods, can also be decomposed into two terms similar to the proposed loss (see the sketch after this list). The authors should theoretically discuss the relationships between these losses to better demonstrate the advantages of the proposed method.
- The paper could benefit from a broader evaluation on other tasks beyond classification to further validate the generalization ability of the proposed method. For example, the pretrained model can be transferred to object detection and segmentation tasks, which is commonly used to evaluate the performance of self-supervised learning methods.
- Minor points: Fig. 1(d) shows an all-one IwS matrix for Siamese methods, which is not appropriate, as this is only the situation of mode collapse, and Siamese methods have already addressed this problem. Fig. 4 shows a binary 0/1 matrix for Sparse CL in representation space, while Eqn. 3 computes a continuous similarity value between 0 and 1. How is this conversion made?
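For reference, the standard decomposition alluded to in the second point above: the InfoNCE loss for a positive pair splits into an alignment term and a log-sum-exp repulsion term over all candidates.

```latex
% Standard decomposition of InfoNCE for a positive pair (i, i^+),
% with temperature-scaled similarities s_{ij} = z_i^{\top} z_j / \tau:
\mathcal{L}_{\mathrm{InfoNCE}}
  = -\log \frac{\exp(s_{ii^+})}{\sum_{j} \exp(s_{ij})}
  = \underbrace{-\, s_{ii^+}}_{\text{alignment}}
  \;+\; \underbrace{\log \sum_{j} \exp(s_{ij})}_{\text{repulsion / uniformity}}
```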
Questions
Please check the weaknesses above.
This paper studies the effect of the regularization term on the IwS matrix. By controlling the coefficient of the regularization term, the proposed Sparse CL method is capable of controlling the sparsity of the representation IwS. The paper shows that the method is effective on the downstream classification task.
Strengths
- This paper designs a loss function to control the sparsity of the IwS matrix, which makes good use of inter-image information.
- The empirical results and analysis prove the effectiveness of their method on classification tasks.
Weaknesses
- The IwS matrix of the Siamese network seems to be wrong. The Siamese network does not apply constraints on the off-diagonal entries, so it should not be an all-one matrix. It has been shown that non-contrastive SSL implicitly reduces the similarity of off-diagonal samples [1].
- The goal of SSL is to learn generalizable representations rather than to improve classification performance. Therefore, the soundness of this paper could be further improved by providing experimental results on other downstream evaluations such as kNN classification, semantic segmentation, and object detection.
- The authors provide neither a methodological nor an empirical comparison with existing inter-image self-supervised learning methods such as [2].
Ref:
[1] Zhuo, Zhijian, et al. "Towards a Unified Theoretical Understanding of Non-contrastive Learning via Rank Differential Mechanism." ICLR, 2023.
[2] Xie, Jiahao, et al. "Delving into inter-image invariance for unsupervised visual representations." IJCV, 2022.
Questions
From the sensitivity analysis, we can see that the performance is sensitive to the regularization coefficient. Given the input images, is it possible to estimate the input IwS so that we know the desired sparsity of the representation IwS?
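One possible way to obtain such an estimate, sketched below under explicit assumptions: build a proxy input-space IwS matrix from fixed input descriptors and measure how many off-diagonal entries exceed a threshold. The choice of descriptors, the threshold value, and the function name are all hypothetical, not part of the paper.

```python
import torch
import torch.nn.functional as F

def estimate_input_sparsity(x_feats, threshold=0.5):
    """Rough estimate of the desired sparsity of the representation IwS.

    x_feats: (N, ...) input-space descriptors (e.g. flattened pixels or
    features from a fixed, pretrained encoder -- an illustrative choice).
    Returns the fraction of off-diagonal pairs whose cosine similarity
    exceeds `threshold`, i.e. how many cross-sample entries of the input
    IwS matrix one might want to keep non-zero in representation space.
    """
    z = F.normalize(x_feats.flatten(1).float(), dim=1)
    sim = z @ z.T                                   # (N, N) input-space IwS proxy
    off_diag = ~torch.eye(len(z), dtype=torch.bool)
    return (sim[off_diag] > threshold).float().mean().item()
```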