PaperHub
ICLR 2024 · Decision: Rejected · 4 reviewers
Average rating: 5.8 / 10 (individual ratings: 5, 8, 5, 5; min 5, max 8, std 1.3)
Average confidence: 3.5

Maestro: Uncovering Low-Rank Structures via Trainable Decomposition

Submitted: 2023-09-22 · Updated: 2024-02-11

Abstract

Keywords
low-rank approximation, efficient model training, trainable decomposition

Reviews and Discussion

Official Review (Rating: 5)

The authors introduce Maestro, a technique designed for efficient layer-wise low-rank factorization during training. This method incorporates an ordered drop strategy combined with group lasso regularization, encouraging the progressive adoption of lower-rank weights during training. The evaluation is conducted on CIFAR10, MNIST, and Multi-30k, comparing Maestro against various low-rank approaches and several pruning and quantization techniques. Furthermore, the paper offers multiple ablation studies and provides theoretical analysis for specific problems.

Strengths

  1. The paper is easy to follow.
  2. The theoretical properties of the proposed method are sound.
  3. The algorithm seems reasonable.

Weaknesses

The algorithm seems reasonable to me. However, for the experiments, ImageNet results are missing. As an important benchmark, ImageNet is often used to compare performance across compression-related methods. For instance, Cuttlefish presented its ResNet-50 results on the ImageNet dataset. To highlight effectiveness, it would be beneficial to include evaluations on the ImageNet dataset. Additionally, tests on larger models would enhance the comprehensiveness of the study.

Is the #GMACs the training cost? If not, please show the training cost.

Questions

How does the proposed method perform on ViT and other larger models using the ImageNet dataset?

Comment

We thank the reviewer for their feedback and appreciate their comments. Let us reply point-by-point.

Wrt scaling to ILSVRC, we have conducted experiments on ResNet-50 trained on ImageNet to showcase how our method performs. However, due to unforeseen technical difficulties with the filesystem of our GPU infrastructure, we only have preliminary results, up to epoch 60. We will continue the training process and can update the manuscript upon its completion. For what it's worth, the current results show similar convergence dynamics to Pufferfish, with training reaching 70.05% accuracy at 47% of the MACs and 32% of the original parameters. Moreover, we have results for TinyImageNet and Multi-30K, both of which should hint at the scaling ability of our technique. Last, while we did our best to have the results in time, we would like to point out that similar works have been published at ICLR without scaling their evaluation to ImageNet levels (Khodak et al., ICLR'21).

Method     | Accuracy                   | GMACs      | Parameters (M)
Pufferfish | 76.34% (~70% on epoch 60)  | 3.6 (87%)  | 15.2 (59.5%)
Cuttlefish | 76.44%                     | 3.6 (87%)  | 14.7 (57.4%)
Ours       | 70.05% (on epoch 60)       | 1.96 (47%) | 8.2 (32%)

Moreover, we could not conduct additional training of ViTs on ImageNet due to limited time and resources. Since the ResNet-50 on ImageNet experiment was also requested by Reviewer ZdCx, we have prioritised this. Nevertheless, we have had no indication of scalability issues so far; therefore, we anticipate similar behaviour.

Wrt GMACs, they correspond to inference cost in Tables 2-4. However, we have quantified the training cost in GMACs in Tables 11-14 of the appendix. We have now clarified this in the manuscript and included a pointer to the appendix. We hope these answer the reviewer's raised points and that they will consider raising their score accordingly.

[1] Khodak, M., Tenenholtz, N., Mackey, L., & Fusi, N. (2021). Initialization and regularization of factorized neural layers. ICLR'21.

Official Review (Rating: 8)

This work mainly focuses on incorporating trainable low-rank layer decompositions in deep-learning models. The authors propose MAESTRO, which progressively finds the optimal rank of each layer during the training by imposing importance ordering via the existing Ordered Dropout technique. The redundant ranks are zeroed out by using the hierarchical group lasso term as the regularizer in the loss function. MAESTRO accounts for data distributions and the target function rather than applying SVD on pre-learned model weights.

Strengths

The novelty of the work lies in applying the existing Ordered Dropout technique from Federated Learning (FjORD) to optimally order the heterogeneous ranks of various layers in DNNs based on an importance criterion, which results in discovering layer-wise low-rank decompositions. In contrast to uniform dropout across the width of each layer (FjORD), MAESTRO independently decomposes each layer to uncover the optimal rank. The authors provide applications of MAESTRO to various layer types (CNNs, FC layers, and Transformers).

The paper is easy to understand and is well-structured. The experiments are comprehensive and justify theoretical insights.

Weaknesses

  1. The paper suffers from typos. The authors are encouraged to review and proofread the draft.
  • Page 1: …find progressively
  • Page 2: …novelly fuse
  • Page 3: …have been proposed… (multiple instances)
  • Page 4: …HMA…
  • Page 5: …orthoghonal…
  2. It is recommended that the authors explore a better illustration for Figure 1. For instance, there is not much visual difference between the factorized mapping and the ordered representation when printed in black and white. It might also help to provide a better illustration of the Ordered Dropout process (it is challenging to understand from symbols alone without any reference in the figure caption; in its current form, the figure assumes readers are already familiar with OD). Since MAESTRO provides layer-wise decomposition and is generally applicable to various DNN layers, it might be useful to incorporate the various layer types of the DNN network (Sec 3.2) in Figure 1 as an overall summary of the proposed work and its applicability.

Questions

Suggestions are provided in the above section.

Comment

We would like to thank the reviewer for their valuable feedback and insightful comments.

In the new manuscript, we have fixed all of the indicated typos and have further changed Figure 1 to better illustrate the inner workings of Maestro without the need for prior knowledge of OD.

As requested, the current version of Figure 1 is readable even when printed in black/white. Regarding the various DNN layers, we do not visualize them separately because their implementation is equivalent to that of linear layers: as discussed in Section 3.2, these layer types can be represented as special cases of a linear mapping, which we then decompose with our technique. If the reviewer believes that our presentation would benefit from further visualization of CNNs, RNNs, and Transformers, we are happy to include such figures in the camera-ready version.

We appreciate both of these suggestions and believe that they ameliorate the quality and legibility of our manuscript.

Official Review (Rating: 5)

This paper proposes a low-rank compression scheme for deep neural networks, which factorizes fully connected, convolutional, and attention layers in the form A=UV, and progressively reduces the rank of the U and V matrices. For convolutional layers the factorization is applied to the unrolled 2D matrix, while for attention layers it is applied to the Q, K, V matrices. They use ordered dropout and hierarchical group-lasso to facilitate the reduction of the rank of U/V matrices.

Strengths

Unlike unstructured pruning methods, low-rank compression can preserve the dense structure of matrices, which can extract more performance from GPUs. For the training of transformers on the Multi30k dataset shown in Table 3, the proposed method is able to reduce the number of parameters by more than half compared to the baseline (Pufferfish), while also reducing the perplexity.

Weaknesses

Low-rank compression and Lasso have been around for a very long time, and the only novelty seems to be the use of ordered dropout. The improvement over existing methods is marginal for the experiments with CNNs. The proposed method is obviously very sensitive to the choice of the Lasso coefficient lambda, but there is no theory behind how it can be chosen effectively.

Questions

How is the initial factorized mapping performed without SVD? How is the initial maximal rank r chosen?

How does the proposed method compare with other structured pruning methods?

Typos:
  • p.4: "multi-head attention (HMA)" → "multi-head attention (MHA)"
  • p.5: "we one could leverage" → "one could leverage"
  • p.5: "Singular Value Decomposition (SVD)" — why define this here when it has been repeatedly used in previous sections?

Comment

We would like to thank the reviewer for their time and effort in reviewing our manuscript.

First, we have fixed the indicated typos. Thank you for raising these.

Please allow us to reply to the rest of the reviewer's concerns below:

The only novelty seems to be the use of ordered dropout

We appreciate the reviewer's perspective, but we do not think this renders our contribution "marginal." We have non-trivially extended ordered dropout to low-rank decompositions and have further allowed for non-uniform ranks across layers, contrary to FjORD. This has been acknowledged and mentioned as a strong point by Reviewer ZdCx. Moreover, we use the lasso penalty specifically to consolidate knowledge optimally, so that the network can leverage this structure and compress the model more efficiently. As a result, we believe this is a significant step in this area of research.
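For illustration only (this is a minimal sketch of the nested-group form of a hierarchical group-lasso penalty over ordered rank components, not our exact implementation; the tensor names U and V are placeholders):

```python
import torch

def hierarchical_group_lasso(U: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Nested-group lasso over ordered rank components.

    U has shape (out_features, r) and V has shape (r, in_features).
    Group k contains every component with index >= k, so later (less
    important) components appear in more groups and are driven to zero first.
    """
    r = U.shape[1]
    penalty = U.new_zeros(())
    for k in range(r):
        # Concatenate the k-th-and-beyond columns of U and rows of V into one group.
        group = torch.cat([U[:, k:].reshape(-1), V[k:, :].reshape(-1)])
        penalty = penalty + torch.linalg.norm(group)
    return penalty
```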

The proposed method is obviously very sensitive to the choice of the Lasso coefficient lambda, but there is no theory behind how it can be chosen effectively.

Thank you for raising this important point. We acknowledge that while our method can operate with various values of the Lasso coefficient lambda, there is indeed a degree of sensitivity to its selection. Our experiments have shown that Maestro can function effectively even when the Lasso coefficient is absent (set to zero), demonstrating the flexibility of our method. However, we do agree that the choice of lambda can influence the performance and efficiency of the method to some extent. To address this, we provide guidelines for selecting an appropriate lambda value, along with an empirical hyperparameter-tuning procedure.

Specifically, one can select the value of $\lambda_{gl}$ by starting with a large value and gradually decreasing the Lasso coefficient by a factor of 2 until the required criteria (footprint, accuracy) are met. This heuristic finds a close-to-optimal value for the hyperparameter, while the total cost of training remains comparable to, or below, the baseline of training the whole model. Please see Tables 11-14 for our lambda sweeps and the optimal values per dataset. The theoretical underpinning behind the selection of $\lambda_{gl}$ is that we want the loss component related to the hierarchical lasso to be lower than the learning component, so that it does not dominate the learning objective, but not so low that it fails to sparsify the network.
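As an illustrative sketch of this tuning heuristic (the helper `train_and_evaluate` and the threshold values are hypothetical placeholders, not part of our pipeline):

```python
def tune_lambda_gl(train_and_evaluate, lambda_init=1.0,
                   min_accuracy=0.93, max_params_ratio=0.35, max_trials=8):
    """Start from a large group-lasso coefficient and halve it until the
    accuracy and footprint criteria are both satisfied (or the budget runs out)."""
    lam = lambda_init
    for _ in range(max_trials):
        accuracy, params_ratio = train_and_evaluate(lambda_gl=lam)
        if accuracy >= min_accuracy and params_ratio <= max_params_ratio:
            return lam  # first coefficient meeting both criteria
        lam /= 2.0  # too aggressive: relax the sparsification pressure
    return lam
```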

Nevertheless, it is important to highlight that Maestro does not specifically require the group Lasso penalty to operate; we utilize it primarily to enhance training efficiency. As demonstrated in Figure 4, we can implement post-training pruning (or even pruning during training, which is not currently shown in our manuscript) even when the Lasso coefficient is set to zero, and still achieve competitive performance across a wide range of computational and memory constraints, thanks to the nature of ordered dropout (a minimal sketch of this pruning step is given below). Please let us know whether we have sufficiently answered your concern.
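The sketch below shows, under illustrative shapes and names (not our released code), what post-training pruning of a single factorized mapping amounts to: because components are ordered by importance, keeping only the first k columns of U and rows of V gives a smaller operating point without retraining.

```python
import torch

def truncate_to_rank(U: torch.Tensor, V: torch.Tensor, k: int):
    """Keep only the first k (most important) ordered rank components of W ≈ U @ V."""
    return U[:, :k], V[:k, :]

# Sweep rank budgets for one factorized mapping and report the parameter
# footprint of each operating point (no retraining involved).
n, m, r = 256, 512, 256
U, V = torch.randn(n, r), torch.randn(r, m)
for k in (256, 128, 64, 32):
    U_k, V_k = truncate_to_rank(U, V, k)
    params = U_k.numel() + V_k.numel()
    print(f"rank {k:3d}: {params} parameters ({100 * params / (n * m):.1f}% of dense)")
```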

How is the initial factorized mapping performed without SVD? How is the initial maximal rank r chosen?

In our manuscript, we describe initializing our network using standard neural network initialization techniques, followed by factorization using SVD to obtain the Maestro layer. Additionally, we explore an alternative approach where each layer is pre-factorized as illustrated in Figure 1, with each matrix initialized using standard neural network initialization methods (e.g., the default initialization in PyTorch). Both approaches lead to the same final results. We decided to use SVD as the default initialization method for Maestro with future tasks in mind. For instance, this is particularly relevant when starting from pre-trained model weights: by decomposing the weights using SVD, the initial output of the full network remains unchanged. Moreover, SVD can be viewed as an approximation to Maestro that only considers the current weights and is computationally inexpensive to compute once, while Maestro also accounts for both the data and the full network dynamics.

Regarding the initial maximal rank, for a linear mapping from $\mathbb{R}^m$ to $\mathbb{R}^n$, we set $r = \min\{m, n\}$. This ensures that our network initially has the same capacity as the original network.
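For illustration, here is a minimal sketch of this initialization path (an assumption of how it could be written, not our exact code): the dense weight is factorized via SVD at full rank r = min(m, n), so the factorized layer initially computes the same mapping. How the singular values are split between the two factors is one possible choice made here for the example.

```python
import torch

def svd_factorize(W: torch.Tensor):
    """Factorize a dense weight W (n x m) into U (n x r), V (r x m) with
    r = min(n, m), so that U @ V reproduces W up to numerical precision."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)  # U: n x r, S: r, Vh: r x m
    sqrt_S = torch.sqrt(S)
    # Split the singular values evenly between the two factors (one possible choice).
    return U * sqrt_S, sqrt_S.unsqueeze(1) * Vh

W = torch.randn(64, 128)                     # dense layer weight
U, V = svd_factorize(W)
assert torch.allclose(U @ V, W, atol=1e-4)   # the initial output is unchanged
```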

Comment

How does the proposed method compare with other structured pruning methods?

Thank you for this insightful question. We would like to emphasize that Maestro can be viewed more broadly, not just as a tool for low-rank pruning. We specifically chose low-rank pruning to demonstrate Maestro's capabilities. In this context, we have compared Maestro against established low-rank baselines, such as Pufferfish and Cuttlefish. Our findings show that Maestro either matches or surpasses the performance of these baselines. A key advantage of Maestro is its fully automated approach in selecting the rank for each layer, unlike other methods.

As a future work, we are planning to extend Maestro to include other structured pruning approaches, such as channel-based pruning. Therefore, we believe a more comprehensive comparison with other structured pruning methods would be more beneficial after this extension. This will allow for a fairer comparison, as it would help us to differentiate whether the benefits arise from the nature of the pruning (e.g., rank-based or channel-based) or from the method used to determine the extent of pruning.

For now, we have also added commentary in the related work section, positioning our work against structured pruning solutions (e.g., OTOv2), but due to time limitations, we are unable to run additional experiments during the rebuttal period.

Official Review (Rating: 5)

The paper proposes MAESTRO, which is a trainable low-rank approximation technique for deep neural networks. It proposes a progressive shrinking approach that decomposes the weights of each layer into low-rank components using an extended version of Ordered Dropout. This allows for efficient compression and trade-off between model size and accuracy. The method is evaluated on various models, datasets, and modalities, showing superior performance compared to other compression methods.

Strengths

  • The paper extends the Ordered Dropout technique to handle non-uniformity in the search space by allowing different ranks per layer.

  • It introduces a trainable aspect to the decomposition, which enables the model to reflect the data distribution.

  • It provides a latency-accuracy trade-off mechanism for deploying the network on constrained devices.

Weaknesses

  • The citation style seems incorrect. It should include the authors' names in place of numerical references.

  • Why is the method named "Maestro"? The name is never explained and seems weird to me.

  • The proposed technique appears as a logical improvement from Ordered Dropout. Its effectiveness, however, is primarily demonstrated through toy architectures and datasets, such as ResNet18 and Cifar10. For the method to gain practical and impactful validation, I recommend conducting additional experiments on more complex datasets like ImageNet to substantiate its superiority.

  • Building on the previous point, there are alternative methods that report better accuracy with more compact architectures. For instance, the OTOv2 framework:

Chen, Tianyi, et al. "Only train once: A one-shot neural network training and pruning framework." Advances in Neural Information Processing Systems 34 (2021): 19637-19651.

It structurally prunes the model during training (hence still training-efficient), and it achieves 93.3% accuracy with only 0.55M parameters on Cifar10 using VGG16. This is in contrast to the 93.10% accuracy with 2.20M parameters reported by the proposed method. This comparison casts doubt on the practical utility and the advantages of the low-rank based method presented.

Questions

See the weaknesses part above.

Comment

We would like to thank the reviewer for their valuable feedback and time invested in reviewing our manuscript.

Wrt the manuscript presentation, we have now changed the citation style to conform to the ICLR style.

Wrt the title, we did not feel that the name needed further explanation since Maestro is not an abbreviation. However, to give some context to the reviewer, our interpretation of the name is that the method coordinates the ranks used for training/inference of a model, much like a maestro conducts an orchestra (akin to orchestrator components in distributed systems).

Wrt running experiments on ImageNet to validate the scalability of our contribution, we have started training ResNet-50 on ImageNet. However, due to unforeseen technical difficulties with the filesystem of our GPU infrastructure, we only have preliminary results, up to epoch 60. We will continue the training process and can update the manuscript upon its completion. For what it's worth, the current results show similar convergence dynamics to Pufferfish, with training reaching 70.05% accuracy at 47% of the MACs and 32% of the original parameters. Moreover, we have results for TinyImageNet and Multi-30K, both of which should hint at the scaling ability of our technique. Last, while we did our best to have the results in time, we would like to point out that similar works have been published at ICLR without scaling their evaluation to ImageNet levels (Khodak et al., ICLR'21).

Method     | Accuracy                   | GMACs      | Parameters (M)
Pufferfish | 76.34% (~70% on epoch 60)  | 3.6 (87%)  | 15.2 (59.5%)
Cuttlefish | 76.44%                     | 3.6 (87%)  | 14.7 (57.4%)
Ours       | 70.05% (on epoch 60)       | 1.96 (47%) | 8.2 (32%)

Last, we have positioned our work against the OTOv2 framework, as requested. Indicatively, OTOv2 involves a much more involved (and costly) training process due to its HSPG solving step, and it does not offer a method to further trade off accuracy for model size after training. In contrast, Maestro enables a more efficient training method along with flexible gains upon deployment, without the need to retrain (one-shot). Regarding the mentioned figures, let us first note that we are comparing VGG-16 (OTOv2) vs. VGG-19 (Maestro). Be that as it may, the original accuracy of our vanilla VGG-19 is 92.94% (vs. 93.2%). Compared to OTOv2's operating point of 26.8% of MACs and 5.5% of parameters at +0.1 pp of accuracy, our closest operating points are:

  • 33% of MACs, 11% of parameters at +0.16 pp of accuracy
  • 20% of MACs, 6% of parameters at -0.24 pp of accuracy

While competitive, we may not be beating this baseline on end accuracy. However, we believe the reduced training overhead and the deployment flexibility make Maestro an advantageous approach in many cases.

[1] Khodak, M., Tenenholtz, N., Mackey, L., & Fusi, N. (2021). Initialization and regularization of factorized neural layers. ICLR'21.

Comment

Dear Reviewers, We thank you for your thorough evaluations and constructive feedback on our submission. We are encouraged by your recognition of several key strengths of our work and have diligently addressed the concerns raised. Below, we highlight the main strengths as identified in your reviews:

  1. Novel Usage of Ordered Dropout (Reviewer ZdCx): We are glad that you appreciated our novel extension of the Ordered Dropout technique to handle non-uniformity in the search space, allowing different ranks per layer.
  2. Trainable Decomposition (Reviewer ZdCx): We thank you for your recognition of the trainable aspect of our decomposition technique as a means to reflect data distribution. This aspect is central to our approach, enabling dynamic adaptability and efficiency.
  3. Latency-Accuracy Trade-Off (Reviewer ZdCx): We thank you for highlighting the practical applicability of our method, particularly its capacity to provide a latency-accuracy trade-off for deploying networks on constrained devices.
  4. Preservation of Dense Matrix Structure (Reviewer Mad7): Your acknowledgment of how our low-rank compression technique preserves the dense structure of matrices, thereby enhancing GPU performance, particularly in transformer training, is greatly appreciated.
  5. Application to Various DNN Layers (Reviewer CAHp): We are grateful for your recognition of our method's novelty in applying Ordered Dropout for discovering layer-wise low-rank decompositions in various DNN layers. Your comments affirm the broad applicability and innovation of our work.
  6. Comprehensive Experiments and Clarity (Reviewer CAHp): Your appreciation of our paper's structure and the comprehensive nature of our experiments is highly encouraging. We tried to ensure clarity and thoroughness in our research.
  7. Theoretical Soundness (Reviewer LrqS): We are pleased that you found our theoretical approach sound and the paper easy to follow. We have strived to develop an algorithm that is both theoretically robust and practically applicable.

In response to the weaknesses and questions raised, we have made comprehensive revisions to our manuscript (annotated in blue), ensuring that each concern is addressed in detail. Please see our detailed responses to your concerns, submitted individually to the respective reviewers.

We believe these enhancements, driven by your insightful feedback, significantly strengthen our submission.

Thank you once again for your invaluable feedback and guidance.

AC Meta-Review

The paper introduces MAESTRO, an approach for training neural networks such that the weight matrices eventually become low-rank. As opposed to factorization-based approaches, the paper uses a variation of Ordered Dropout (previously proposed in Horvath et al. '21). Theory reveals that for linear data and simple models (linear autoencoders) the method recovers the expected SVD/PCA. Empirical results show that the method performs well compared to (some) existing pruning-based methods.

While the method is sensible, some reviewers raised concerns regarding

  • the novelty of the method (it being a fairly straightforward application of ordered dropout to encourage low-rankness)
  • the lack of comparisons on challenging baselines. One reviewer in particular raised the question of the lack of experimental results for ImageNet. This is an important drawback, setting it a bit behind many published works in this area. During the response period, the authors were able to obtain partial results, but it appears that at least at the moment, performance of their method still lags (considerably) behind methods such as Cuttlefish and Pufferfish.

The authors are encouraged to address these points while preparing future revisions.

Why Not a Higher Score

The majority of reviewers rated it below the bar for acceptance (and I tend to agree with this assessment.)

Why Not a Lower Score

N/A

Final Decision

Reject