Convex Distillation: Efficient Compression of Deep Networks via Convex Optimization
We introduce a convex neural network distillation method that compresses large pre-trained deep networks without requiring labeled data or fine-tuning, combining convex and non-convex model architectures for efficient deployment on edge devices.
Abstract
Reviews and Discussion
This paper presents a new distillation method that efficiently compresses models via convex optimization, eliminating the intermediate non-convex activation functions.
Strengths
The manuscript explains the background knowledge very clearly, and the motivation for undertaking this work is clear.
Weaknesses
- The related work section lacks coverage of the most recent work, and prior work is discussed only briefly.
- The notation in Eq. 1 and Eq. 3 is used incorrectly.
- Some of the text in the figures is too small.
- The novelty of the paper is not sufficient.
- The experiments lack validation results on large datasets such as ImageNet. Also, using only ResNet18 and MobileNet V3 for the experiments is not convincing enough.
- The results in Fig. 4 do not intuitively show the superiority of the proposed approach.
- There is a lack of comparisons with other methods; only ablation experiments are performed.
Questions
- Why does the claim that “while DNNs have the capacity to memorize the training dataset, they often end up learning basic solutions that generalize well to test datasets” “motivate using smaller, compressed models”?
This paper proposes a convex distillation method by combining the representational power of large non-convex DNNs with the favorable optimization landscape of convex NNs. The proposed method can distill the models in a label-free manner without requiring post-compression fine-tuning on the training data. Experiments on several image classification datasets show that convex student models can achieve high compression rates without sacrificing accuracy and outperform non-convex compression methods in low-sample and high-compression regimes.
Strengths
1. A simple yet effective method to distill classification models via convex networks.
2. An effective distillation acceleration tool and polishing are used to improve the convex solver.
Weaknesses
1. Activation matching is not novel in knowledge distillation.
2. The experimental comparison is not sufficient to support the effectiveness of the proposed method; comparisons against SOTA KD methods for a fair evaluation are missing.
3. It is not clear how all blocks are distilled. If the proposed convex distillation is performed block-wise, handling the entire network requires a complex and time-consuming distillation procedure.
Questions
See the weaknesses.
This article proposes a novel distillation technique that efficiently compresses deep neural network models through convex optimization. This method eliminates intermediate non-convex activation functions and uses only the intermediate activations of the original model, enabling distillation without the need for labeled data, and achieving comparable performance to the original model without fine-tuning. Experimental results show that this approach not only maintains model performance when compressing image classification models on multiple standard datasets but also performs better and optimizes faster compared to traditional non-convex distillation methods. This work opens up new avenues for future research at the intersection of convex optimization and deep learning.
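To make the summary above concrete, here is a minimal, heavily simplified sketch of label-free activation matching with a convex student. It is an illustration under assumed choices, not the paper's formulation: the student for one block is just a single linear map, the fit is ridge regression (a convex problem with a closed-form solution), and the function name `fit_linear_student`, the regularization weight `lam`, and the toy activation shapes are all invented for the example.

```python
import numpy as np

def fit_linear_student(h_in, h_out, lam=1e-3):
    """Fit a linear student W so that h_in @ W approximates h_out.

    Ridge regression is convex with the closed-form solution
    W = (h_in^T h_in + lam * I)^{-1} h_in^T h_out. Only unlabeled inputs
    are needed: the targets are the teacher's own activations.
    """
    d = h_in.shape[1]
    gram = h_in.T @ h_in + lam * np.eye(d)
    return np.linalg.solve(gram, h_in.T @ h_out)

# Toy stand-ins for the activations entering and leaving one teacher block,
# recorded on unlabeled data.
rng = np.random.default_rng(0)
h_in = rng.standard_normal((1024, 64))
h_out = rng.standard_normal((1024, 32))

W = fit_linear_student(h_in, h_out)
rel_err = np.linalg.norm(h_in @ W - h_out) / np.linalg.norm(h_out)
print(f"relative activation-matching error: {rel_err:.3f}")
```

Because the objective is convex, the fit needs no labels, no backpropagation through the teacher, and no post-fit fine-tuning, which is the property the reviews refer to; the paper's convex student networks are of course more expressive than this single linear layer.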
Strengths
- The paper introduces a novel approach to knowledge distillation that leverages convex optimization for efficient compression of deep neural networks.
- The authors provide extensive empirical evidence to support their claims, demonstrating the effectiveness of their method across multiple standard datasets and in various scenarios.
- The authors provide Google Colab code with very detailed experimental instructions.
Weaknesses
- Since the method relies on existing convex neural network packages, the originality and the amount of work are limited.
- There is an issue with the network configuration. For datasets with small images such as CIFAR-10, the ResNet configuration designed for ImageNet should not be applied: downsampling by 4x from the start results in feature maps that are too small.
- The experiments were conducted only on small datasets and very small networks. Can they be scaled up to larger datasets such as ImageNet?
Questions
See the Weaknesses: is there a more reasonable network configuration, and could more comprehensive experiments be provided?
This paper introduces Convex Distillation, a model compression method that replaces the non-convex layers of deep neural networks with convex approximations. By leveraging convex optimization, the method achieves efficient compression without the need for labeled data or post-compression fine-tuning.
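As a rough picture of the block-wise replacement the reviews discuss, the sketch below walks over a toy teacher, records each block's input and output activations on unlabeled data, and fits a convex surrogate per block. Everything here is an assumption for illustration: the teacher is a small MLP, and each surrogate is a bias-free linear layer obtained by least squares, which is far weaker than the convex students described in the paper.

```python
import torch
import torch.nn as nn

# Illustrative teacher: a small stack of non-convex blocks (sizes invented).
teacher = nn.Sequential(
    nn.Sequential(nn.Linear(64, 128), nn.ReLU()),
    nn.Sequential(nn.Linear(128, 128), nn.ReLU()),
    nn.Linear(128, 10),
).eval()

unlabeled = torch.randn(2048, 64)  # unlabeled inputs; no targets are used

student_layers = []
h = unlabeled
with torch.no_grad():
    for block in teacher:
        h_out = block(h)  # teacher activations for this block
        # Convex surrogate for the block: a least-squares linear fit of the
        # block's input activations onto its output activations.
        W = torch.linalg.lstsq(h, h_out).solution  # shape (in, out)
        layer = nn.Linear(W.shape[0], W.shape[1], bias=False)
        layer.weight.copy_(W.T)
        student_layers.append(layer)
        h = h_out  # the next surrogate is fitted on the teacher's activations

student = nn.Sequential(*student_layers).eval()

with torch.no_grad():
    t_out = teacher(unlabeled)
    s_out = student(unlabeled)
    print("relative output mismatch:",
          ((s_out - t_out).norm() / t_out.norm()).item())
```

Each per-block fit touches only the teacher's activations on unlabeled inputs, so no labels or post-compression fine-tuning appear anywhere in the loop; how the actual convex students are parameterized and solved is specific to the paper.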
Strengths
- The paper is well written and easy to follow.
- The idea of bridging convex optimization, distillation, and compression is interesting.
Weaknesses
- The practical contributions of convex optimization to model compression are limited. The convexity conversion is only valid and tested up to 3-layer DNNs, which significantly restricts the objective landscape. For simple tasks this may be fine, but for more complex tasks it often leads to sub-optimal performance.
- The experimental results are not sufficient to justify the efficacy of the proposed method. Only small datasets are included, and the ResNet18 baseline does not appear to be well tuned (accuracy below 90%).
Questions
See the weakness.
This paper proposes a distillation technique that compresses neural networks using convex optimisation. Reviewers appreciated the proposed method and the motivation for the work. However, multiple reviewers were concerned about the lack of experiments on ImageNet-sized datasets and of comparisons to the state of the art in distillation.
All reviewers scored "Reject" for this submission, and the authors did not provide a response to address the experimental concerns, so I see no grounds for acceptance.
Additional comments from the reviewer discussion
The authors did not provide a response, so there was no further discussion.
Reject