CoMERA: Computing- and Memory-Efficient Training via Rank-Adaptive Tensor Optimization
Abstract
Reviews and Discussion
The paper introduces CoMERA, a novel training method for large AI models that focuses on optimizing both computing and memory efficiency through rank-adaptive tensor compression. CoMERA aims to reduce training costs and environmental impact by achieving high compression ratios and maintaining accuracy. Key contributions include a multi-objective optimization formulation, performance optimization techniques for tensor-compressed training on GPUs, and empirical validation showing significant speedup and memory savings compared to existing methods.
Strengths
- The experimental results demonstrate significant improvements in training speed and memory efficiency, outperforming recent methods like GaLore and LTE.
- By addressing both computing and environmental costs, the method has practical implications for large-scale AI model training, making tensor models more practically useful in machine learning.
- In the part of multi-objective optimization, the formulation balances compression ratio and model accuracy, providing a customizable framework for different resource requirements.
Weaknesses
While the method shows impressive results on tested models, scalability to even larger models and diverse architectures remains an area for further exploration.
Questions
- For learning TT-ranks, this work imposes diagonal matrices D (shown in Eq. 6) with l_1 norm regularization. This technique for tensor network structure search was also utilized in the recent paper:
Zheng, Yu-Bang, et al. “SVDinsTN: A Tensor Network Paradigm for Efficient Structure Search from Regularized Modeling Perspective.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
- In line 122, could you provide more explanation of what the linear scalarization means? Does it refer solely to Eq. (7), or is there additional context?
- Could you provide more intuition regarding Eq. (11), i.e., the achievement scalarization? While the two scalarizations in the paper may be standard methods in multi-objective optimization, a more intuitive explanation would help general readers unfamiliar with this field grasp the main idea quickly.
- In lines 171-173, the design of the “tensor-index level” is unclear. Please improve the clarity of this part if possible.
- In the contraction section, how many core tensors are involved in the contraction? If the number of involved tensors in the contraction is not large, why not use the default contraction path searching algorithms integrated into einsum, e.g., ‘dp’ or ‘optimal’? What benefits can be achieved from handcrafting a new contraction path for the current model compression task?
- Could you provide more explanation in Sec. 4.3? Why can CUDA graphs improve tensor computation? What specific efforts were made in this work? Is using CUDA graphs straightforward, or does it require advanced programming techniques, such as designing a more GPU-suitable contraction order?
- In the comparison with GaLore, why was a 3090 used instead of a 4090, which was used in the original GaLore paper?
- Regarding the training time, only the time per epoch is given. Is it possible to provide the overall running time? Are more epochs required if tensor networks are utilized?
- Will the codes be released once the paper is accepted?
Limitations
The authors have adequately addressed the limitations.
Responses to Weaknesses:
Weakness 1: Scalability to larger models.
We are conducting larger experiments. The preliminary result is shown in Figure 1 in the attached PDF, and details are in the main author rebuttal. We pre-train CodeBERT-Large with 357 million parameters on CodeSearchNet, a 20GB dataset. Compared to uncompressed training, the tensor-compressed model shows a similar convergence curve, reaches a similar training loss, and achieves a training speedup on a single RTX 3090 GPU. Compression and speedup results are in the following table.
| Pre-training results of CodeBERT-large | | |
|---|---|---|
| compression ratio | overall | 4.25 |
| | tensorized layers | 9.77 |
| training speedup | sequence length 128 | 2.3 |
| | sequence length 512 | 1.9 |
Responses to Questions:
Question 1: Compare diagonal D with l_1 norm in the recent paper
We will cite it and discuss the differences. The referenced paper uses sparsity of diagonal matrices to search for a compact TT representation of given data, which is very different from our end-to-end training. We would like to highlight the differences in the following.
- The referenced paper searches for a TT representation of given data. It is the same as finding a low-rank approximation for a given tensor. In contrast, our work uses TT to compress weights during end-to-end training, and we don’t have any prior information on the tensor (i.e., the model parameters). The tensor cores and diagonal entries are determined during training.
- Our work formulates the problem as a more generic multi-objective problem and uses a two-stage algorithm to solve it. The formulation in that paper is similar to the linear scalarization approach in our early stage. Our work further uses the achievement scalarization in the late stage to find a model close to our preferred model performance and size. A minimal code sketch of the diagonal rank-gating idea is given below.
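For illustration only, here is a simplified PyTorch-style sketch (not our exact implementation) of how a trainable diagonal gate with an l1 penalty can control the rank of a two-core TT-compressed layer; the layer shapes, the gate placement, and the factorization below are assumptions chosen just for this example:

```python
import torch
import torch.nn as nn

class RankGatedTTLinear(nn.Module):
    """Hypothetical 2-core TT linear layer with a trainable rank gate."""
    def __init__(self, m=(16, 48), n=(16, 48), rank=30):
        super().__init__()
        # Two TT cores: G1 of shape (m1, n1, r) and G2 of shape (r, m2, n2).
        self.G1 = nn.Parameter(0.02 * torch.randn(m[0], n[0], rank))
        self.G2 = nn.Parameter(0.02 * torch.randn(rank, m[1], n[1]))
        # Diagonal gate d (in the spirit of the diagonal matrix D in Eq. (6));
        # entries driven to zero by the l1 penalty effectively prune TT ranks.
        self.d = nn.Parameter(torch.ones(rank))
        self.m, self.n = m, n

    def forward(self, x):                              # x: (batch, m1*m2)
        b = x.shape[0]
        x = x.view(b, self.m[0], self.m[1])
        G1d = self.G1 * self.d                         # gate the rank dimension
        y = torch.einsum('bij,iar->bjar', x, G1d)      # contract over m1
        y = torch.einsum('bjar,rjc->bac', y, self.G2)  # contract over m2 and r
        return y.reshape(b, self.n[0] * self.n[1])

    def rank_penalty(self):
        return self.d.abs().sum()                      # l1 term added to the loss
```

One simple way to realize the regularization is `loss = task_loss + lam * layer.rank_penalty()`, with `lam` a hypothetical penalty weight.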
Question 2: Explanation of linear scalarization
Yes, the linear scalarization refers solely to Eq. (7). More details are in the response to the next question.
Question 3: Intuition for scalarization?
We will include more intuition. The linear scalarization minimizes a weighted sum of the objectives. Solving it provides a Pareto point, but it is hard to control which point is obtained. The achievement scalarization finds a Pareto-optimal solution that is close to a given target. When one objective is too far away from its target, we mainly optimize that objective by increasing its weight. This gives us the achievement scalarization problem in Eq. (11).
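As a purely illustrative sketch (the exact weights, targets, and size measure used in Eq. (7) and Eq. (11) are not reproduced here, and the function names are hypothetical), the two scalarized objectives can be written as:

```python
import torch

def linear_scalarization(perf_loss, size_proxy, lam=1e-3):
    # Early stage (in the spirit of Eq. (7)): minimize a weighted sum of the
    # two objectives; the weight lam controls how gently ranks are pruned.
    return perf_loss + lam * size_proxy

def achievement_scalarization(perf_loss, size_proxy, perf_target, size_target,
                              w_perf=1.0, w_size=1.0):
    # Late stage (in the spirit of Eq. (11)): the objective that is furthest
    # from its target dominates the loss, steering optimization toward the
    # preferred (performance, size) point.
    gaps = torch.stack([w_perf * (perf_loss - perf_target),
                        w_size * (size_proxy - size_target)])
    return gaps.max()
```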
Question 4: Design of the “tensor-index level”.
Thanks! We will revise that. The “tensor-index level” optimization aims to reduce redundant computation. Among the tensor indices, two different rows may share values. For instance, (2,3,1,3) and (2,3,2,4) share the prefix (2,3), so we only need to compute the partial result for (2,3) once. In this design, we focus on the unique indices required for the lookup and only compute those unique indices.
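To make this concrete, below is a hypothetical sketch for a 4-index TTM embedding lookup (the core shapes, the 2+2 prefix/suffix split, and all names are assumptions for illustration, not our exact design); the key point is that each unique index prefix is contracted only once and then reused:

```python
import torch

def ttm_lookup_with_prefix_reuse(idx, cores):
    """Sketch: TTM embedding lookup that reuses shared index prefixes.

    idx: (batch, 4) factorized row indices, e.g. rows (2,3,1,3), (2,3,2,4).
    cores[k]: TTM core of shape (r_{k-1}, I_k, e_k, r_k), with r_0 = r_4 = 1.
    """
    def pair_contract(sub_idx, core_a, core_b):
        # Contract two cores for each *unique* index pair only once.
        uniq, inv = torch.unique(sub_idx, dim=0, return_inverse=True)
        a = core_a[:, uniq[:, 0]]                     # (r_in, u, e_a, r_mid)
        b = core_b[:, uniq[:, 1]]                     # (r_mid, u, e_b, r_out)
        out = torch.einsum('puer,rufq->puefq', a, b)  # (r_in, u, e_a, e_b, r_out)
        return out, inv

    left, inv_l = pair_contract(idx[:, :2], cores[0], cores[1])   # shared prefixes, e.g. (2,3)
    right, inv_r = pair_contract(idx[:, 2:], cores[2], cores[3])  # shared suffixes
    # Recombine per sample by indexing into the cached partial results.
    emb = torch.einsum('pbefq,qbghs->befgh', left[:, inv_l], right[:, inv_r])
    return emb.flatten(1)                             # (batch, e1*e2*e3*e4)
```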
Question 5: Why not using contraction path search in einsum?
For TT with d tensor cores, each contraction has d+1 cores involved and there are in total d+2 coupled einsums, as in Eq. (19) (20) (21). The algorithm in “einsum” only searches for the path for a single einsum operation. However, our path optimization tries to minimize the total costs of all d+2 coupled contraction paths, which cannot be handled by the searching options in “einsum”. We also show in Proposition 4.1 that the contraction path for forward is already near-optimal. Similar results can be shown for back-propagation.
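To illustrate why these contractions are coupled, here is a simplified left-to-right forward contraction of a TT linear layer in PyTorch (the core shapes, the ordering, and the function name are illustrative assumptions; the optimized d+2 coupled paths and the backward reuse of intermediates in the paper are not reproduced here):

```python
import torch

def tt_linear_forward(x, cores):
    """Left-to-right forward contraction of a TT linear layer (illustrative).

    Assumptions: x has shape (batch, m1, ..., md) and cores[k] has shape
    (r_{k-1}, m_k, n_k, r_k) with r_0 = r_d = 1. Each intermediate `y` below
    is the kind of result that back-propagation can reuse, which is why the
    forward/backward contractions are coupled and a per-expression path
    search (einsum's 'dp' or 'optimal') cannot optimize them jointly.
    """
    b = x.shape[0]
    y = x.reshape(b, -1, 1)                           # (b, m1*...*md, r0=1)
    for core in cores:
        r_prev, m_k, n_k, r_k = core.shape
        y = y.reshape(b, m_k, -1, r_prev)             # split off mode m_k
        y = torch.einsum('bmsr,rmnt->bsnt', y, core)  # absorb m_k, emit n_k
        y = y.reshape(b, -1, r_k)                     # fold n_k into the rest
    return y.reshape(b, -1)                           # (b, n1*...*nd)
```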
Question 6: Details about Cuda Graph
Thanks! We will include more explanations about CUDA Graphs in Section 4.3. A CUDA Graph eliminates the overhead of launching many kernels sequentially. It is more suitable for CoMERA since tensor-compressed training launches many more small kernels than uncompressed training. Standard CUDA Graphs do not allow dynamic control flow or dynamic shapes. We specifically revised some of our code to make it compatible with CUDA Graphs. For instance, the inputs are padded to a fixed length, and the late-stage achievement scalarization in Section 3.2 is split into two graphs.
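For reference, a minimal whole-step capture following the standard PyTorch CUDA Graphs recipe looks like the sketch below (the model, shapes, and optimizer are placeholders, not our actual code; the two-graph split of the late-stage objective is not shown):

```python
import torch

# Placeholder model/optimizer; fixed shapes are required for graph capture.
model = torch.nn.Linear(768, 768).cuda()
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

static_input = torch.zeros(32, 768, device='cuda')    # padded to a fixed length
static_target = torch.zeros(32, 768, device='cuda')

# Warm up a few steps on a side stream before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        optimizer.zero_grad(set_to_none=True)
        loss_fn(model(static_input), static_target).backward()
        optimizer.step()
torch.cuda.current_stream().wait_stream(s)

# Capture one full training step into a graph.
g = torch.cuda.CUDAGraph()
optimizer.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    static_loss = loss_fn(model(static_input), static_target)
    static_loss.backward()
    optimizer.step()

# Training loop: copy each new (padded) batch into the static buffers and
# replay, so the many small kernels launch as one pre-recorded graph.
# The captured backward refills the static .grad tensors in place.
for data, target in [(torch.randn(32, 768), torch.randn(32, 768))]:
    static_input.copy_(data)
    static_target.copy_(target)
    g.replay()
```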
Question 7: Why was a 3090 used instead of a 4090 for GaLore comparison?
Thanks! We don’t have a 4090 GPU in our lab. All comparisons are run on the same system and the same 3090 GPU.
Question 8: Overall training time
Thanks! Empirically, CoMERA is 2-3X faster over the whole training process than uncompressed training for transformers on a single GPU, but we do not have theoretical guarantees about the number of epochs, although the numbers of epochs are similar in our experiments. The detailed explanations are in the main author rebuttal. Some key points are summarized below:
- Training NNs is a highly non-convex optimization in compressed and uncompressed formats, making theoretical analysis very complicated. Overall training time depends on (1) the number of epochs and (2) time per epoch. While we observed consistent 2-3X speedup in terms of (2), point (1) is highly case-dependent for almost all non-convex optimization solvers.
- CoMERA has a similar empirical convergence behavior to the uncompressed training on our tested cases. We observe that on both 6-encoder transformer and DLRM, shown in Figure 6 on paper and Figure 2 in the PDF.
- On all tested transformers, CoMERA generally takes 2-3X less training time because it has similar convergence curves to the uncompressed model, but each epoch of the tensorized model is 2-3X faster than standard training.
Although our method has similar convergence behavior compared with standard training, we think that it could be misleading to draw any formal conclusion now without theoretical proof (which may or may not exist).
Question 9: Code release?
We are trying to figure out some potential IP issues. We hope to release the codes if the IP issue can be cleared.
Dear Reviewer,
Thank you very much for your thorough review and fruitful comments! We have carefully read all your feedback and addressed your concerns and questions. We would greatly appreciate it if you could take some time to review our responses.
We will stay online these days and are happy to address any further questions or concerns you may have. We look forward to your continued feedback on our work.
Thank you again for your time and consideration!
I appreciated the authors’ response and the clarifications provided. However, I have a few additional comments:
1. Regarding the referenced paper: The reference seems to focus on general tensor networks rather than specifically on the TT (Tensor Train) format. While your work indeed targets a different set of tasks, the similarity between the two papers primarily lies in the use of sparse diagonal matrices to determine ranks. However, your response mentioning ‘finding a low-rank approximation for a given tensor’ could lead to confusion. I recommend refining this point to better differentiate your work from the referenced CVPR paper.
2. Regarding the use of GPUs: The response mentioning the lack of a 4090 GPU in your lab is not entirely satisfying. If the choice of using a 3090 GPU is purely due to availability, it’s important to clarify this in your paper. Additionally, please ensure that the claims in your paper are accurately framed, especially when comparing your results with those of GaLore. Otherwise, there could be a risk of over-claiming or causing misunderstandings, which would be unfair to the GaLore work.
3. Regarding the release of code: The uncertainty surrounding the release of your code due to potential IP issues is concerning. Not providing the code could significantly reduce the contribution and impact of your paper within the research community. If possible, please provide more clarity on this matter or consider alternative ways to share your work’s findings with the community.
Thanks a lot for your further feedback! We hope our explanations successfully address your concerns.
- Compare with SVDinsTN. Thanks! The SVDinsTN in the reference paper and our work both use diagonal matrices to control tensor ranks, but these two works are entirely different.
The two works target two completely different problems. SVDinsTN aims to find a compact tensor network representation of a given tensor, whereas our work aims to adaptively determine ranks and achieve real speedup for tensor-compressed models during end-to-end training without any prior information, which is of great importance given the huge energy cost of current large AI models.
It is not surprising that both works (and possibly other works) use sparse diagonal matrices and l1 terms to control tensor ranks. Using diagonal matrices to control matrix ranks is very common, as in the SVD. The l1 norm is also widely used to induce sparsity in various models, such as compressed sensing and Lasso regression. It is natural to combine these two techniques to control tensor ranks, regardless of the tensor format. In our work, we did not claim this as a novel contribution. Our main contributions are: (1) a multi-objective optimization approach to achieve a good balance between model size and performance in end-to-end rank-adaptive training, and (2) numerical and GPU optimization to achieve a real 2-3X speedup on GPU. This can have a high impact: since large AI models consume so many resources, even a small reduction can save substantial money and energy. Besides the results in the paper, we have demonstrated the great promise of this method in pre-training CodeBERT in the above response.
The main differences between the two works are summarized in the following table. We will include the discussions in our paper.
| | CoMERA | SVDinsTN |
|---|---|---|
| target problem | End-to-end tensor compressed training for memory and computation reduction | Find a compact tensor network representation of given data |
| problem formulation | a multi-objective problem | a single objective problem with regularizations |
| solver | a two-stage approach with two scalarization methods to solve multi-objective problem to balance model performance and size | a proximal alternating minimization method to solve the single objective problem with l1 regularizations |
| tensor format | focus on Tensor-Train and can be easily applied to all general tensor networks | general tensor networks |
| performance optimization | numerical and GPU optimization methods for real speedup on GPUs | N/A |
- GPU. We rent a 4090 GPU to run experiments. The results are in Table 1. Compared to the RTX 3090, training on the RTX 4090 uses similar memory and takes less training time, and CoMERA is still the fastest method and consumes the least memory among all three techniques. The memory savings are almost the same as the results reported in Figure 1 in our paper. The speed-up factors are almost identical for batch sizes 32, 64, and 128. For batch size 1, our method is 1.2X faster than GaLore on the RTX 4090 and 1.7X faster on the RTX 3090. The difference arises because the RTX 4090 significantly accelerates matrix multiplications at batch size 1, while it does not accelerate smaller tensor contractions as much. We tested the times of typical matrix multiplications in CoMERA and GaLore on the RTX 3090 and RTX 4090. The results are in Table 2. We find that the r=30 matrix multiplication on the RTX 3090 has a similar speedup for both batch sizes, whereas the same multiplication on the RTX 4090 only has a speedup for batch 32 and does not have any speedup for batch 1. We would like to note that this might be because different GPU platforms have different backend overheads, which can become more dominant as the computation decreases to batch=1. We will continue optimizing GPU-level kernels to accelerate small tensor contractions and expect to see a similar speedup. We will replace results in our paper with new results on RTX 4090.
Table 1. Training comparison of CoMERA with GaLore and LTE on RTX 4090 GPU.
| | | CoMERA | GaLore | LTE |
|---|---|---|---|---|
| batch 1 | time (min) | 37.1 | 44.8 | N/A |
| | memory (MB) | 182 | 1674 | N/A |
| batch 32 | time (min) | 3.4 | 6.8 | 11.1 |
| | memory (MB) | 1780 | 3632 | 4636 |
| batch 64 | time (min) | 3.4 | 6.3 | 9.1 |
| | memory (MB) | 3784 | 4682 | 6628 |
| batch 128 | time (min) | 3.4 | 6.0 | 8.2 |
| | memory (MB) | 7002 | 8048 | 10964 |
Table 2. Time comparison of matrix multiplications on the RTX 3090 GPU and the RTX 4090 GPU. The multiplication is done between matrices of sizes (batch×128)×768 and 768×r. The time is in seconds.

| | | r=30 | r=768 |
|---|---|---|---|
| batch 1 | RTX 3090 | 0.34 (4.6X) | 1.58 |
| | RTX 4090 | 0.22 (1.0X) | 0.22 |
| batch 32 | RTX 3090 | 0.55 (5.4X) | 2.98 |
| | RTX 4090 | 0.27 (4.5X) | 1.21 |
- Code. Thanks! We will release a version of codes with confidential IP information removed.
Dear Reviewer Rz1r,
Thanks a lot for your detailed technical comments and your follow-up discussion. We fully understand that you may be super busy with many deadlines at this moment. As the discussion window will close in 1 day, I would highly appreciate it if you could confirm whether your comments have been addressed.
In our original rebuttal, we addressed the main weakness (lack of large-scale experiments) by providing the pre-training result of CodeBERT to show the significant training cost reduction. We also provided details to address your 9 questions (e.g., explanation of the various scalarization techniques, the difference of our contraction sequence optimization from that in einsum, the CUDA Graph optimization, and the overall training time).
In the recent discussion, we have further provided details regarding the comparison with SVDinsTN, results on the RTX 4090 (which do not change much from the RTX 3090), and the code release issue.
If our responses have well addressed your comments, we would highly appreciate it if you can acknowledge this. If some of our responses need further clarification, please feel free to let us know! We are staying online to provide potential further clarifications in a timely manner.
Thanks again for your detailed technical comments & engagement in the discussion.
Warm regards, The authors.
Thank you for the detailed response.
“The SVDinsTN in the reference paper and our work both use diagonal matrices to control tensor ranks, but these two works are entirely different.”
-- Yes, it's true. I agree that the two works are different since they target two different tasks. That is why I put this concern in the question part rather than the weakness part. Thank you for the clarification. I mentioned it again in my last reply because the authors said the referenced paper was for TT (tensor-train). It is a confusing point since I checked that paper, and it seems to be about general tensor networks rather than the specific TT model.
"We rent a 4090 GPU to run experiments." -- Thank you. I have this concern since in the GaLore paper the main claim is given with the setting of 4090 rather than 3090. It naturally made me to have the question why not to use the same hardware for comparison? Is there any deeper concern from the side of authors? The first-round response from authors is unexpected since I don't realize that 4090 is so difficult to have in the experiment like other more expensive GPUs. But the good thing is that you finally successfully get it.
"We will replace results in our paper with new results on RTX 4090."
-- Thanks, but I don't think it's necessary. It is good to see that the proposed tensor-based method works well on a weaker device (3090) than the one used in other works like GaLore (4090). My concern is mainly about a fair and clearer comparison. Putting the new results in the supp. would be very much appreciated.
"We will release a version of codes with confidential IP information removed."
-- Thank you for the promise. It is very important for reproducibility and potential impact for the community.
Dear Reviewer Rz1r,
Thanks a lot for your further follow-up discussion.
I'm glad that we are on the same page: (1) the SVDinsTN and our work are entirely different. (2) The use of RTX 3090 GPU rather than 4090 is not a big issue.
Just two tiny additional notes: (1) in our first rebuttal, we indeed pointed out (see our table) that SVDinsTN considers general tensor networks (rather than the tensor train used in our work); (2) in our paper we implemented all methods (including GaLore and LTE) on the RTX 3090 to ensure a fair comparison. We will make these points more explicit in our revised paper to avoid similar misunderstandings in the future.
Thanks for the quick response.
"(1) the SVDinsTN and our work are entirely different. "
-- For further clarification, I respectfully adjust your claim to the following:
The SVDinsTN and our work (CoMERA) target entirely different tasks but use a similar technical trick (i.e., imposing diagonal matrices with a sparsity constraint for rank selection).
This is closer to my point.
Furthermore, I appreciate again the authors' effort regarding the promised revisions to improve the clarity of the paper and to release the code.
Best,
Dear Reviewer Rz1r,
We completely agree with you. We are happy that all of your technical concerns have been addressed.
This paper leverages the tensor decomposition concept in model training and tries to address two questions: the first is how to enable rank adaptivity, and the second is how to obtain real efficiency. This paper achieves a 2-3× speedup per training epoch compared with standard training with on-par accuracy.
Strengths
- An easy paper to follow
- Although tensor decomposition is not new, they successfully enable real speed-up with several performance optimization techniques.
- Real speed up and on-par accuracy on small models.
Weaknesses
- The problem formulation in 3.1 and how to solve the l0 norm is not quite new.
- Contraction Path Optimization is also not a new topic. The authors should cite previous work, e.g. [1] and discuss differences, and highlight the uniqueness.
- Does this training take more time in terms of convergence? And could the authors discuss more about the training difficulty of tensorized layers versus normal layers?
[1] Gu, Jiaqi, et al. "Heat: Hardware-efficient automatic tensor decomposition for transformer compression." arXiv preprint arXiv:2211.16749 (2022).
Questions
See weakness before.
Limitations
N/A
Weakness 1: “The problem formulation in 3.1 and how to solve the l0 norm is not quite new.”
Response: Thanks a lot for the comment!
- Our novelty is to formulate the problem of balancing performance and size as a multi-objective problem. The linear scalarization for gently pruning ranks and the achievement scalarization for meeting deployment requirements are applied in the early stage and the late stage, respectively, to solve the problem. To our knowledge, this is an early attempt to apply such a multi-objective optimization approach to satisfy deployment requirements. While the formulation involves l0 and l1 regularization as in single-objective optimization, the objectives are different from those in the existing literature, and the mathematical analysis is entirely different.
- Using the l1 relaxation alone can cause both theoretical and practical issues:
- First, the l1 relaxation cannot effectively control the tensor ranks because magnitudes of TT factors can keep growing. Hence, an extra l2 norm regularization term for the tensor cores is added. Mathematically, the new problem with the l2 norm regularization is equivalent to a new constrained multi-objective optimization problem (10) in the paper whose tensor cores are bounded. The equivalence and related theoretical analysis are shown in Proposition 3.1 and its proof.
- Second, the l1 relaxation does not correctly reflect the model size in the achievement scalarization problem in late-stage. Thus, we propose to use the l0 norm for the comparison between performance gap and size gap. This l0-based measure is used as a switching condition between Eq. (13) and Eq. (14), then the l1 norm is used in numerical implementation inside the solver of Eq. (13) and (14). This completely differs from an l1-relaxed single-objective optimization which does not require l0-norm any more and does not require switching.
Weakness 2: Difference from Contraction Path Optimization in [1] HEAT paper.
Response: Thanks! The contraction optimization in [1] is very different from ours. We will cite it and discuss differences. The key differences are summarized in the following table and described below:
- Post-training compression VS compression in training. The paper [1] considered compression after training, where trained model parameters were already given. Ours considers end-to-end tensor-compressed training, where no model parameters are known prior to training. It can reach computation savings and training acceleration, whereas post-training compression cannot.
- Different tensor formats. The paper [1] uses CP, Tucker, and Tensor-Train-Matrix formats to compress linear layers. In contrast, we use the Tensor-Train format for linear layers, and TT matrix for embedding tables.
- One contraction path in forward VS d+2 coupled contraction paths in forward and backward. The paper [1] only discussed single-path optimization for forward propagation in the CP format. We have optimized d+2 contraction paths jointly in both forward- and back-propagation in the TT format. Since these contractions can be coupled, we have also minimized the overall computation cost by reusing intermediate results. Finally, we have also provided a theoretical analysis, Proposition 4.1, demonstrating that the proposed path is near-optimal for large batch sizes.
| | CoMERA | HEAT |
|---|---|---|
| Training | end-to-end compressed training | post-training compression |
| Tensor format | TT for linear layers and TT matrix for embedding | CP, Tucker, and TT matrix for linear layers |
| Contraction path | jointly optimize d+2 coupled paths for forward and back in TT | one path for forward-propagation in CP |
Weakness 3.1: Does this training take more time in terms of convergence?
Response: Thanks! In short, empirically our method is 2-3X faster in the whole training process than uncompressed training when training transformers on a single GPU, but we do not have theoretical guarantees about the number of epochs although we observed that they used a similar number of epochs. We have provided detailed explanations in the main author's rebuttal. Some key points are summarized below:
- Training neural networks is a highly non-convex optimization problem in both compressed and uncompressed formats, making theoretical convergence analysis very complicated. The overall training time depends on (1) the number of epochs and (2) time per epoch. While we observed consistent 2-3X speedup in terms of (2), point (1) is highly case-dependent for almost all non-convex optimization solvers.
- Our CoMERA has a similar empirical convergence behavior to the uncompressed training on our tested cases. We observe that on both 6-encoder transformer and DLRM, shown in Figure 6 on paper and Figure 2 in the attached PDF file.
- On all tested transformers, CoMERA generally takes 2-3X less training time because it has similar convergence curves to the uncompressed model, but each epoch of the tensorized model is 2-3X faster than standard training.
Although our method has similar (and sometimes better) convergence behavior compared with standard training, we think that it could be misleading to draw any formal conclusion at this moment without a theoretical proof (which may or may not exist).
Weakness 3.2: And could the author discuss more about the training difficulty of tensorized layers and normal layers.
Response: From our experiments, we did not observe significant difficulties in training tensorized layers. Adam is used to train the tensor-compressed model. The training hyperparameters, such as learning rate and weight decay, differ from standard training because of the different training parameters and loss landscapes, but the convergence behavior is similar. We will provide more details on training tensor-compressed models in the revised paper and conduct more analysis in the future.
Dear Reviewer,
Thank you very much for your thorough review and fruitful comments! We have carefully read all your feedback and addressed your concerns and questions. We would greatly appreciate it if you could take some time to review our responses.
We will stay online these days and are happy to address any further questions or concerns you may have. We look forward to your continued feedback on our work.
Thank you again for your time and consideration!
Dear Reviewer 1f6L,
We sincerely appreciate your fruitful comments and suggestions for our paper CoMERA: Computing- and Memory-Efficient Training via Rank-Adaptive Tensor Optimization.
We have carefully addressed all your comments in our response above. In particular, we have clarified the novelty of our work, provided a comparison with HEAT, and discussed the convergence and overall training time of the proposed algorithm. Additionally, beyond the results mentioned in the paper, we have demonstrated the significant potential of this method in pre-training CodeBERT, a model with 357 million parameters, as detailed in the main author rebuttal.
As the discussion period is drawing to a close in a few days, we would be very grateful if you could take a moment to review our responses. If our replies have satisfactorily addressed your concerns, we greatly appreciate it if you could acknowledge this in the discussion thread. If there are any remaining questions or concerns, please do not hesitate to reach out to us. We will stay online these days and are ready to respond promptly.
Thank you again for your time and consideration. We look forward to your feedback.
Thank you so much for your rebuttal and for addressing my questions.
This paper formulates an optimization framework to search for ranks inside the tensor-train format. I still think this question/problem setup is well explored in efficient-ML works (like searching the pruning ratio, etc.). Although the optimization description is extensive, it is not new to me. Moreover, the claimed benefits are not validated on a large model, and I suppose that on large models (especially pre-training), a tight compression ratio is needed to reach comparable accuracy (I saw unmatched accuracy on CodeBERT).
However, I am very happy to see this work deliver a real training speedup via contraction path order optimization and some CUDA implementation optimization. It would be great to see codes open-sourced.
I would like to keep my rating to 6. Thank you so much.
Dear Reviewer 1f6L,
Thanks for your quick follow-up and insightful discussion.
Your comment on GaLore is insightful. Indeed, the comparison should not focus only on the compression ratio and GPU-hours. We completely agree that GaLore has a better convergence guarantee. Specifically, it will keep the same accuracy as standard training, under the assumption that the SVD compression of gradients does not lose much gradient information. Meanwhile, we would also like to gently point out that GaLore does not guarantee convergence and may face divergence issues when it uses the same high compression ratio as CoMERA, since the gradient information loss is huge.
It is also worth noting that practical LLM pre-training iterations are normally bounded by a given limited budget (e.g., dollar amounts or GPU-hours). As a result, many large-size LLMs are pre-trained for only 1 or 2 epochs in practice. With the same limited budget of GPU-hours, CoMERA can train for more iterations than standard pre-training and GaLore, achieving better accuracy.
Regarding the testing accuracy of CoMERA: the testing accuracy of LLMs is evaluated on multiple downstream tasks, which are quite different from the pre-training dataset. As a result, a model (even if it has a better training loss) often behaves very differently on various downstream tasks: it may have better testing accuracy on some tasks but worse testing accuracy on others. This is very normal, and it is exactly the case for our tensorized BERT-Large model: our model outperformed the baseline BERT-Large on two downstream tasks (SST2 and MRPC) and underperformed the baseline BERT-Large on one downstream task, SQuAD. Currently our model beats standard BERT-Large on two out of three downstream tasks (while using fewer GPU-hours in pre-training), and we look forward to its future performance as our pre-training methods keep evolving.
Thanks again for your timely discussion. Your participation in the discussion is highly appreciated: it will enable a fair and thorough evaluation of our work. Please feel free to let us know if you have further questions.
Best regards, The authors.
Dear Reviewer 1f6L,
Thanks a lot for your discussion and detailed technical discussion.
We agree that there is plenty of work about searching tensor ranks and architectures in the machine learning community. Almost all of it is about tensor data compression and post-training model compression. We would like to remark that searching the architecture and ranks during training is a much more challenging and rarely investigated task: the model parameters that we need to compress are not given in advance, and we cannot afford a training run for each architecture/rank setting due to the huge training cost. This is completely different from tensor data compression and post-training compression, where one can easily evaluate the quality of a rank/architecture setting by doing a cheap forward propagation. This is also why compressed training (even in the simpler low-rank matrix format) is a much more challenging and more important task.
Regarding the compression ratio and accuracy in tensor-compressed pre-training, we would also like to remark on three key features that are different in the LLM domain:
(1) It is natural that compressed models have a larger training loss, since their optimization feasible set is smaller than that of uncompressed training. However, a slightly larger training loss does not mean worse testing accuracy, because tensorized models have shown smaller generalization gaps (i.e., the difference between training accuracy and testing accuracy) than uncompressed models in many cases. As an example, let us consider BERT-large pre-training using the same model architecture on the WikiText dataset. Standard pre-training produced a training loss of 1.26, and our CoMERA produced a training loss of 1.45 (which is worse by 0.19). However, the compressed model produced by CoMERA outperformed standard BERT-Large in 2 of 3 downstream testing tasks due to its smaller generalization gap, as shown below:
Testing accuracy of standard BERT and tensorized BERT on various downstream tasks.
| Models | Accuracy on SST2 | Accuracy on MRPC | Accuracy on SQuAD |
|---|---|---|---|
| Standard BERT-Large | 91.74% | 86.00% | 90.68% |
| Tensorized BERT-Large (ours) | 92.10% | 86.82% | 88.76% |
Since the downstream testing datasets of CodeBERT were not released to the public by its developers, we could not do the same downstream testing at this moment (we are trying to generate a few downstream testing datasets for CodeBERT by ourselves, but it takes time). However, we can make a prediction based on the available training results. Right now, the difference between the training loss of CodeBERT under standard pre-training and the training loss produced by our CoMERA is only 0.12, which is smaller than the loss difference (0.19) on the WikiText dataset. Considering the smaller generalization gap of tensorized models, it is reasonable to predict that our compressed CodeBERT-large model will have better (or at least similar) testing accuracy on downstream code generation tasks, if those testing datasets become available.
(2) Different from regular neural network training, which often uses a small dataset so that the network is over-parameterized and can have a large compression ratio, LLMs (even small-size LLMs like BERT) are pre-trained with a super large dataset and the network is under-parameterized. Meanwhile, when those LLMs are released by their developers (often giant tech companies), many tricks have been tried to make sure that these LLMs cannot be dramatically compressed, otherwise customers would quickly switch to much smaller LLMs. As a result, even the 4.25X overall compression ratio is very large for LLM pre-training. Note that (i) we only compressed part of the layers (with a 9.77X compression ratio on these layers); the embedding tables and output linear layers use a transposed (tied) architecture and thus we did not compress them; (ii) the LLM compression ratio can be further boosted if we combine CoMERA with quantization techniques.
(3) The importance of the speedup factor (1.9-2.3X) and the overall compression ratio (4.25X) on CodeBERT-large should not be underestimated. It means that we can reduce the GPU-hours of pre-training by roughly 2X. Such a GPU-hour reduction can make a huge impact in the LLM community, since current LLM pre-training costs millions of US dollars per training run. As a comparison, the recently released and very popular GaLore (an oral paper at ICML 2024) has much less memory saving than CoMERA (as shown in our paper) and actually slows down the training.
Again, we would like to thank Reviewer 1f6L for his/her technical feedback, which will greatly improve the quality of our work. We hope that the above technical details and potential impact in the LLM community can be considered.
Thank you for your quick response.
First, the evaluated test accuracy on standard BERT already shows that CoMERA may lead to an accuracy loss on SQuAD (90.68% compared to 88.76%). The results cannot convince me that CoMERA can yield on-par accuracy with the claimed speedup on large models. I do not claim that the reported speedup is insignificant; my concern is whether sacrificing accuracy in the pre-training stage is meaningful and whether the claimed compression ratio may not hold for large models.
Second, your comment on GaLore is not right and fairly assessed. GaLore theoretically guarantees the same performance as full-rank training, but CoMERA cannot.
I am still doubtful about your statement about the fine-tuning accuracy. Based on my understanding, SQuAD is a relatively harder task than SST2 and MRPC (https://delvify.ai/glue/). For easier tasks, the compressed model may perform well at test time, while for harder tasks, the compressed model shows reduced accuracy (90.68 --> 88.76, as in your case). So my major concern remains that the claimed compression ratio and speedup may only be feasible for simple tasks and may not hold for harder tasks/pre-training, which weakens the claim that you can train LLMs from scratch (as you compare with GaLore). The accuracy numbers further increase my concern on this point.
Also, I am not very confident that LLM weight matrices show low-rank characteristics, and enforcing the model's weight matrices into a low-rank format from the start of training may not be a good choice.
My score is between 5 and 6, and I'd like to see the authors report the claimed speedup with a loss comparable to the baseline in the pre-training stage of larger models, instead of the speedup on smaller models or under considerable loss, in a future version.
Dear Reviewer 1f6L,
Thanks a lot for your further follow-up discussion: we highly appreciate it, especially considering that you may also have many deadlines at this critical moment.
Your concern is very valid. Meanwhile, we would like to share more details and our own experience about the evaluation and pre-training of LLMs.
Regarding SQuAD: you are VERY right, this is a more challenging downstream task. But the reason for the performance difference across various downstream tasks is not just the difficulty of a specific task like SQuAD. For instance, if we target a different compression ratio, we find that the accuracy on SQuAD can be higher than the baseline, while the accuracy on the easier task MRPC may drop a little. As a result, we think that the main reasons are as follows:
--(1) We tested the pre-trained model on downstream tasks without any fine-tuning. The performance on various downstream tasks definitely can be further improved if fine-tuning is performed using each downstream dataset.
--(2) Since we did not conduct task-specific fine-tuning, we are actually evaluating a pre-trained model directly on multiple very different downstream tasks. Mathematically, this is equivalent to solving an optimization problem with one objective function f0 (the pre-training loss), and then evaluating its solution quality on three different objective functions f1, f2, and f3 (the downstream tasks). Without fine-tuning (i.e., without optimizing f1, f2, and f3 individually), it is very normal that some tasks show superior performance while others show slightly degraded performance.
Regarding pre-training larger LLMs like LLaMA-1B (probably you mean LLaMA-7B?): it is indeed our goal to test our method on large models like LLaMA. However, as an academic group, we are constrained by the computing budget. Here is one budget data point: pre-training GPT-2 (with 762 million parameters) would need 100,000 A100 GPU-hours (equivalent to 250,000 USD based on market cloud computing prices). Based on the scaling law released by Google DeepMind (Hoffmann, et al., “Training compute-optimal large language models,” arXiv:2203.15556, 2022), we estimate that the pre-training budget of LLaMA-1B would be around 430,000 USD, which is far beyond the financial capability of most (if not all) academic groups. Pre-training LLaMA-7B would cost millions of USD.
We are trying very hard to get more computing resources to validate our ideas on LLaMA-scale models, which is super challenging for an academic group. To be honest, starting up a company may be a better option to achieve this goal (that's why we said that we had some IP concerns when replying to a reviewer's question about a full code release). We will be very happy to share our results with the community in the future, if we can get testing results on LLaMA-scale models.
Thanks again for your insightful discussions. Your active participation in the discussion is highly appreciated.
The manuscript presents techniques for efficient training from scratch based on tensor decompositions. The authors propose several modifications to the basic training approach to improve accuracy as well as several optimizations for tensor-compressed training, achieving training speedup. In the experiments, the proposed methodology is used to compress a transformer model and a recommender system model.
Strengths
The authors present a new approach for training a model with tensorized weights, which allows for adaptation of ranks during training. They also managed to optimize the process and achieve a real speed-up, which is not an easy task when working with tensorized models. The described methods are useful and have the potential to significantly contribute to further advancements in efficient computation with tensor decomposition formats.
Weaknesses
- There are several places where a comparison with the baseline time (e.g., uncompressed) would be appropriate, such as in Figures 5 and 8.
- Experiments with the transformers are limited. Training a model on MNLI from scratch may not accurately reflect the behaviour of the method for language models, as MNLI is a relatively simple classification dataset that might not require a large number of parameters. It would be beneficial if the paper included an experiment or a simulation of some kind in a pre-training setting for language models, such as pre-training RoBERTa on C4.
Questions
- I am a little confused about how to adapt the rank to achieve further compression. It is not clear from the text what the procedure is and when it is applied during training. Specifically, it is not explained how the ranks are chosen and at what point in time the procedure is applied. I would recommend incorporating a new paragraph into the paper that provides a more detailed explanation of the procedure, as well as some practical advice on how to apply it in various scenarios. This could help readers better understand the potential uses and benefits of this approach.
- As for the transformer model, I am also interested in the model's upper compression limit that still maintains a training speedup. I would imagine maintaining good performance is only possible with lower compression ratios (say 2-5x).
Limitations
Responses to Weaknesses:
Weakness 1: Baseline time in Figures 5 and 8.
Response: Thanks a lot! We will include the baseline time in the figures. For your convenience, we attach the results in the following Table 1 and Table 2. Table 1 shows the time and memory cost for embedding-lookup forward-propagation and back-propagation. The uncompressed embedding with sparse gradients is faster than our approach since our TTM embedding table requires extra computation. However, the embedding without sparse gradients is slower than ours because of updating massive gradients. The proposed TTM embedding table uses much less memory: 7X less than the embedding with sparse gradients, and 15X less than the embedding without sparse gradients. Table 2 shows the time and memory cost of training DLRM using CoMERA and standard training. Standard training is faster than CoMERA since DLRM is an embedding-intensive model, and the computation in the embedding table is a lookup rather than matrix multiplications. However, CoMERA uses much less memory, saving 6.9X, 4.8X, and 3.1X memory for batch sizes 10000, 20000, and 40000, respectively.
Table 1. Time and memory for embedding lookups
| | | CoMERA embedding | uncompressed w/ sparse gradients | uncompressed w/o sparse gradients |
|---|---|---|---|---|
| batch 10000 | time (s) | 0.48 | 0.06 | 2.43 |
| | memory (MB) | 670 | 5279 | 10558 |
| batch 20000 | time (s) | 0.82 | 0.11 | 2.48 |
| | memory (MB) | 799 | 5284 | 10569 |
| batch 40000 | time (s) | 1.42 | 0.22 | 2.47 |
| | memory (MB) | 896 | 5294 | 10592 |
Table 2. Time and memory for DLRM training
| | | CoMERA w/ optimization | CoMERA w/o optimization | uncompressed |
|---|---|---|---|---|
| batch size 10000 | time (s) | 807 | 1344 | 420 |
| | memory (MB) | 2612 | 9259 | 18261 |
| batch size 20000 | time (s) | 794 | 1182 | 423 |
| | memory (MB) | 3947 | 18385 | 19005 |
| batch size 40000 | time (s) | 791 | N/A | 424 |
| | memory (MB) | 6629 | N/A | 20459 |
Weakness 2: Lack of pre-training results on larger models
Response: Thank you! Scaling up our approach to larger models and datasets is important future work. We are conducting larger experiments, and the preliminary result is shown in the following table and Figure 1 in the attached PDF. The details can be found in the main author rebuttal. Here are some key results.
- We pre-train the CodeBERT-Large model. It has 24 encoder blocks and in total 357 million parameters, whose architecture is similar to BERT Large. The pre-training is done on the CodeSearchNet, a 20GB dataset. Compared to uncompressed training, the tensor-compressed model shows a similar convergence curve and reaches a similar training loss, while compressing the whole model 4.25 times and linear layers 9.77 times. The tensor-compressed training is about 2.3X faster in sequence length 128 and 1.9X faster in sequence length 512 than uncompressed training on a single RTX 3090 GPU. The results demonstrate that our approach can scale up to larger models. We will investigate more in the future and are optimistic about the results.
| Pre-training results of CodeBERT-large | | |
|---|---|---|
| compression ratio | overall | 4.25 |
| | tensorized layers | 9.77 |
| training speedup | sequence length 128 | 2.3 |
| | sequence length 512 | 1.9 |
Responses to Questions:
Question 1: How to adapt the rank to achieve further compression.
Response: Thank you! We will add more details about how to adapt the rank to achieve further compression. Our rank-adaptive approach uses the multi-objective formulation and consists of two stages:
- In the early stage, we solve the multi-objective problem using linear scalarization. The early stage starts from the beginning of the training process and prunes tensor ranks gently without hurting the model performance.
- After the early stage converges, we may continue training the model in the optional late stage. During the late stage, the multi-objective optimization problem is solved by the achievement scalarization approach to find a model close to our preferred performance and size for specific deployment requirements (e.g., on an FPGA). A high-level sketch of this two-stage schedule is given below.
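The following pseudocode-style sketch illustrates the two stages (all names, targets, the l1-based size proxy, and the exact late-stage objective are illustrative assumptions, not our precise formulation; `gates` are assumed to be the diagonal rank-gate parameters already registered inside `model`):

```python
import torch

def two_stage_training(model, gates, loader, epochs_early, epochs_late,
                       lam=1e-3, perf_target=0.5, size_target=1e6):
    """Illustrative two-stage rank-adaptive training loop."""
    opt = torch.optim.Adam(model.parameters())
    for epoch in range(epochs_early + epochs_late):
        late = epoch >= epochs_early
        for x, y in loader:
            perf = torch.nn.functional.cross_entropy(model(x), y)
            size = sum(g.abs().sum() for g in gates)      # rank/size proxy
            if not late:
                # Early stage: linear scalarization, gentle rank pruning.
                loss = perf + lam * size
            else:
                # Optional late stage: achievement-style objective that pulls
                # the model toward the preferred (performance, size) target.
                loss = torch.maximum(perf - perf_target,
                                     lam * (size - size_target))
            opt.zero_grad()
            loss.backward()
            opt.step()
```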
Question 2: Model upper compression limit that still maintains training speedup?
Response: In short, speedup can be achieved by CoMERA even when the compression ratio is close to 1. Consider an m-by-n linear layer with an input of size b-by-m, where the weight is represented by TT cores with internal TT rank r. The computation costs of the uncompressed linear layer and the TT linear layer are about O(bmn) and O(b(m+n)r), respectively, when the batch size b is large. TT compression therefore reduces computation roughly when (m+n)r < mn, i.e., when r < mn/(m+n); however, the compression ratio is close to 1 for such large ranks. We demonstrate the above analysis by testing the training time of the six-encoder transformer on MNLI. The following Table 3 shows the per-epoch training time of CoMERA on the MNLI dataset for different compression ratios. The acceleration is more obvious for larger compression ratios. When the compression ratio is greater than 1, CoMERA always achieves a speedup; when the compression ratio approaches 1, the time of CoMERA approaches that of uncompressed training. We will include the results and discussions in our paper.
Table 3. Per epoch training time on MNLI for various compression ratios
| | | rank 30 | rank 120 | rank 240 | rank 360 | rank 480 | uncompressed |
|---|---|---|---|---|---|---|---|
| compression ratio | | 50 | 4.9 | 2.2 | 1.5 | 1.1 | N/A |
| time (min) | batch 32 | 7.2 | 7.79 | 11.16 | 13.94 | 17.46 | 18.5 |
| | batch 64 | 6.4 | 6.73 | 9.7 | 12.44 | 15.53 | 16.6 |
| | batch 128 | 5.5 | 6.21 | 9.07 | 11.56 | 14.35 | 16.4 |
In our pre-training of CodeBERT-large, we still get 2.3X speedup when the overall compression ratio is 4.25.
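To make the cost comparison above concrete, here is a tiny back-of-the-envelope script based on the O(bmn) vs. O(b(m+n)r) estimates (constants and kernel-launch overhead are ignored, so this is only a rough FLOP ratio, not a measured speedup):

```python
def tt_speedup_estimate(m=768, n=768, b=128 * 128, r=30):
    """Rough FLOP ratio behind the r < mn/(m+n) rule of thumb."""
    dense_flops = b * m * n          # uncompressed linear layer
    tt_flops = b * (m + n) * r       # TT-compressed linear layer, rank r
    return dense_flops / tt_flops

# For a 768x768 layer: r=30 gives ~12.8x fewer FLOPs, and r=384 is the
# break-even point since mn/(m+n) = 768*768/1536 = 384.
print(tt_speedup_estimate(r=30), tt_speedup_estimate(r=384))
```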
Dear Reviewer,
Thank you very much for your thorough review and fruitful comments! We have carefully read all your feedback and addressed your concerns and questions. We would greatly appreciate it if you could take some time to review our responses.
We will stay online these days and are happy to address any further questions or concerns you may have. We look forward to your continued feedback on our work.
Thank you again for your time and consideration!
Dear Reviewer hvAM,
We sincerely appreciate your valuable comments and suggestions for our paper CoMERA: Computing- and Memory-Efficient Training via Rank-Adaptive Tensor Optimization.
We have carefully addressed all your comments in our response above:
- For the baseline times, the results are provided in the above response and will be included in our revision.
- Regarding the scalability of our approach, we have demonstrated the significant potential of this method in pre-training CodeBERT, a model with 357 million parameters, as detailed in the above response and main author rebuttal.
- We have added more explanations in our responses regarding our rank-adaptive training for further compression and will include additional details in our revised manuscript.
- We conducted experiments to illustrate the relationship between compression ratios and speedup. The results, along with some mathematical analysis, are presented in our responses.
As the discussion period is drawing to a close in a few days, we would be very grateful if you could take a moment to review our responses. If our replies have satisfactorily addressed your concerns, we greatly appreciate it if you could acknowledge this in the discussion thread. If there are any remaining questions or concerns, please do not hesitate to reach out to us. We will stay online these days and are ready to respond promptly.
Thank you again for your time and consideration. We look forward to your feedback.
Thank you for your response. Based on your rebuttal I increased my score from 5 to 6.
This work proposes using low-rank tensor train decomposition to accelerate deep learning model training and save memory usage. In the algorithm, both embedding tables in recommendation systems and large linear weights are written as a tensor train, and rank-adaptive optimization is used to adaptively reduce the rank without sacrificing the model accuracy. Multiple techniques, including TT embedding lookup, contraction path optimization, and CUDA graphs are combined to improve the efficiency. Experimental results show good memory saving and speedup without sacrificing model accuracy.
Strengths
originality: the idea to adaptively change the TT rank during training looks new. In addition to the training algorithm, multiple performance optimization techniques make the paper solid. In particular, the TT embedding lookup algorithm that does sampling and contraction in an interleaved way is also useful.
Weaknesses
presentation: section 4.2 is hard to understand. For contraction path optimization, it would be good to visualize the process using tensor diagrams.
limitation of the proposed algorithm: it would be good to discuss the limitation of the algorithm in detail. In particular, I believe low-rank weights can help only under relatively small datasets/tasks, under which the low-rankness is a good regularization. Under large dataset/foundation model setting, I suspect the work can beat the baseline algorithms. In particular, the algorithm is also different from previous algorithms that use low-rank. In previous works, low-rank approximation is only applied on gradients rather than weights. To claim that the work can help foundation model training, we need larger datasets.
Questions
- line 42: replies -> relies
- How to choose a reasonable initial TT rank?
- how does accuracy compare between CoMERA, GaLore, and LTE?
Limitations
n/a
Responses to weaknesses:
Weakness 1: presentation: section 4.2 is hard to understand. For contraction path optimization, it would be good to visualize the process using tensor diagrams.
Response: Thanks a lot for the suggestion! We prepared the tensor diagrams to visualize the contraction paths, but removed them from the paper because of the page limits. We have included the tensor diagrams in the attached PDF, see Figure 3. We will include this diagram in the revision.
Weakness 2: limitation of the proposed algorithm: it would be good to discuss the limitation of the algorithm in detail. In particular, I believe low-rank weights can help only under relatively small datasets/tasks, under which the low-rankness is a good regularization. Under large dataset/foundation model setting, I suspect the work can beat the baseline algorithms. In particular, the algorithm is also different from previous algorithms that use low-rank. In previous works, low-rank approximation is only applied on gradients rather than weights. To claim that the work can help foundation model training, we need larger datasets.
Response: Thank you for the important questions!
- The tensor-compression approach can achieve a higher compression ratio and better speedup on relatively small datasets. On larger datasets and models, a higher rank may be required to maintain the model performance. We are conducting larger experiments, and the preliminary result is shown in the following table and Figure 1 in the attached PDF file. We pre-train the CodeBERT-Large model, released by Microsoft. It has 24 encoder blocks and 357 million parameters in total, and its architecture is similar to BERT-Large. The pre-training is done on CodeSearchNet, a 20GB dataset widely used for pre-training LLMs for automatic code generation. All linear layers in the encoders are compressed into the TT format. The embedding table and the final linear layer are not compressed because CodeBERT enforces their weights to be the same, while CoMERA uses different tensor formats for embeddings (TTM) and for linear layers (TT). Compared to uncompressed training, the tensor-compressed model shows a similar convergence curve and reaches a similar training loss, while compressing the whole model 4.25 times and the linear layers 9.77 times. The tensor-compressed training is about 2.3X faster at sequence length 128 and 1.9X faster at sequence length 512 than uncompressed training on a single RTX 3090 GPU. The results demonstrate that our approach can scale up to larger models and tasks. We will investigate more in the future and are optimistic about the results on large models and datasets.
| Pre-training results of CodeBERT-large | | |
|---|---|---|
| compression ratio | overall | 4.25 |
| | tensorized layers | 9.77 |
| training speedup | sequence length 128 | 2.3 |
| | sequence length 512 | 1.9 |
- Low-rank gradient approximation works, like GaLore, represent gradients by low-rank matrices, reducing the memory cost of the first and second moments in the optimizer. However, GaLore still uses the full model and applies full back-propagation to compute gradients. Finding the projector and projecting the gradients into the compact space also bring extra computation overhead. In contrast, our approach directly compresses the weights, and the resulting gradients automatically have a compact low-rank tensorized form. Hence, our tensor-compression approach does not have this computation overhead and can reach better memory savings and speedup. A comparison is shown in Section 5.3 and Figure 1 in the paper. We will include more details to compare our method with low-rank gradient approximation methods; a rough illustration of the difference is sketched below.
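As an illustration only (the TT factorization of a 768x768 layer into cores of shapes (16,16,r) and (r,48,48), and the GaLore projection rank, are assumptions chosen just for this example), the following back-of-the-envelope count shows where each approach spends weight, gradient, and optimizer-state memory:

```python
def memory_elements(m=768, n=768, r=30, galore_rank=128):
    """Rough element counts for one m-by-n layer (illustrative assumptions)."""
    dense = m * n
    # GaLore-style: full weight and full back-prop gradient are kept; only the
    # Adam moments live in a projected galore_rank-by-n space.
    galore = {'weight': dense, 'grad': dense,
              'adam_moments': 2 * galore_rank * n}
    # CoMERA-style: only the TT cores are stored and differentiated, so the
    # weights, gradients, and Adam moments are all small automatically.
    tt = 16 * 16 * r + r * 48 * 48
    comera = {'weight': tt, 'grad': tt, 'adam_moments': 2 * tt}
    return galore, comera

print(memory_elements())   # roughly 590K dense elements vs. 77K TT elements
```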
Responses to questions:
Question 1: line 42: replies -> relies
Response: Thanks! We will fix the typo.
Question 2: How to choose a reasonable initial TT rank?
Response: Thank you for the interesting question! Our work adaptively determines the TT ranks during training. In general, it is very challenging to choose, prior to training, initial TT ranks that achieve good compression without sacrificing model performance. In practice, we start with relatively large ranks and let our method gradually prune them during training. This choice gives good experimental results, as shown in Section 5.
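As an illustration of this start-large-then-prune strategy, here is a generic sketch. The mask parameterization, the L1 weight, and the pruning threshold are assumptions for exposition, not CoMERA's exact rank-adaptive formulation.

```python
# Generic rank-pruning sketch (assumptions only): start with a generous rank,
# penalize per-rank mask entries with L1, and prune entries that shrink to ~0.
import torch

r_init = 64                                    # deliberately generous initial TT rank
mask = torch.nn.Parameter(torch.ones(r_init))  # one learnable scalar per rank slice
lam, threshold = 1e-3, 1e-2                    # assumed hyper-parameters

def regularized_loss(task_loss: torch.Tensor) -> torch.Tensor:
    # The L1 penalty drives unneeded rank entries toward zero during training.
    return task_loss + lam * mask.abs().sum()

def active_rank() -> int:
    # Rank slices whose mask magnitude stays below the threshold can be pruned.
    return int((mask.abs() > threshold).sum())
```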
Question 3: how does accuracy compare between CoMERA, GaLore, and LTE?
Response: CoMERA and GaLore achieve almost the same validation accuracy (about 64%) on the MNLI dataset, whereas LTE does not converge on this task with its default settings.
Dear Reviewer,
Thank you very much for your thorough review and fruitful comments! We have carefully read all your feedback and addressed your concerns and questions. We would greatly appreciate it if you could take some time to review our responses.
We will stay online over the coming days and are happy to address any further questions or concerns you may have. We look forward to your continued feedback on our work.
Thank you again for your time and consideration!
Dear Reviewer 6ExQ,
We sincerely appreciate your valuable comments and suggestions for our paper CoMERA: Computing- and Memory-Efficient Training via Rank-Adaptive Tensor Optimization.
We have carefully addressed all your comments in our response above. Regarding the tensor diagrams for the contraction paths, we have included our previously prepared diagrams in the attached PDF and will add them to the revision to improve readability. Regarding the scalability of our approach, we have demonstrated its significant potential by pre-training CodeBERT, a model with 357 million parameters, as detailed in the response above and in the main author rebuttal. Additionally, we have discussed the differences between our approach and previous low-rank methods, such as GaLore, and answered your other questions in the response above.
As the discussion period is drawing to a close in a few days, we would be very grateful if you could take a moment to review our responses. If our replies have satisfactorily addressed your concerns, we would greatly appreciate it if you could acknowledge this in the discussion thread. If there are any remaining questions or concerns, please do not hesitate to reach out to us. We will stay online over the coming days and are ready to respond promptly.
Thank you again for your time and consideration. We look forward to your feedback.
Dear Reviewer 6ExQ,
Thanks a lot for your constructive comments about our submission CoMERA.
We fully understand that you may be busy with many deadlines at the moment. As the discussion window will close in one day, we would highly appreciate it if you could read our rebuttal.
In summary, we have (1) provided a detailed tensor-network diagram in the attached PDF file, (2) shown promising compression and up to 2.3X training speedup in pre-training CodeBERT-Large, and (3) addressed other minor issues such as typos, the initial rank setting, and the accuracy of GaLore and LTE.
If our responses have addressed your comments well, we would highly appreciate it if you could acknowledge this. If any of our responses need further clarification, please feel free to let us know! We are staying online and are happy to provide further clarification in a timely manner.
I would like to thank the authors for the detailed feedback. I've decided to raise my score to 6.
Dear Reviewer 6ExQ,
Thanks for your participation in the discussion and for recognizing our work. Your review feedback helped a lot to improve the quality of our work.
Common Concerns
We would like to thank the reviewers for their fruitful suggestions and comments. We have addressed ALL review comments (see the response to each reviewer).
Here we summarize some common concerns raised during the review process.
Scalability of tensor-compressed training to larger models
A few reviewers were concerned about whether tensor-compressed training scales up to larger models and tasks. This is a problem we are actively investigating. We are conducting larger experiments, and the preliminary results are shown in the following table and in Figure 1 of the attached PDF file. We pre-train the CodeBERT-Large model released by Microsoft, which has 24 encoder blocks and 357 million parameters in total, with an architecture similar to BERT-Large. Pre-training is done on CodeSearchNet, a 20GB dataset widely used for pre-training LLMs for automatic code generation. All linear layers in the encoders are compressed into TT format. The embedding table and the final linear layer are not compressed because CodeBERT enforces them to share the same weights, while CoMERA uses different tensor formats for the TTM embedding table and for linear layers. Compared to uncompressed training, the tensor-compressed model shows a similar convergence curve and reaches a similar training loss, while compressing the whole model by 4.25X and the linear layers by 9.77X. The tensor-compressed training is about 2.3X faster at sequence length 128 and 1.9X faster at sequence length 512 than uncompressed training on a single RTX 3090 GPU. These results indicate that our approach can scale to larger models and tasks; a back-of-the-envelope parameter-count sketch for a single TT-compressed layer follows the table below. We will investigate this further and are optimistic about the results on large models and datasets.
| Pre-training results of CodeBERT-Large | | |
|---|---|---|
| compression ratio | overall | 4.25 |
| compression ratio | tensorized layers | 9.77 |
| training speedup | sequence length 128 | 2.3 |
| training speedup | sequence length 512 | 1.9 |
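To make the reported compression ratios concrete, below is a back-of-the-envelope parameter-count sketch for a single TT-compressed linear layer. The factorization and the TT ranks are hypothetical illustrations, not the actual CodeBERT-Large / CoMERA settings.

```python
# Sketch only: parameter count of one TT(-matrix) layer vs. its dense version.
def tt_linear_params(in_factors, out_factors, ranks):
    """Cores G_k have shape (r_{k-1}, in_k, out_k, r_k), with ranks = [r_0, ..., r_d]
    and r_0 = r_d = 1."""
    return sum(
        ranks[k] * in_factors[k] * out_factors[k] * ranks[k + 1]
        for k in range(len(in_factors))
    )

in_factors, out_factors = [8, 8, 16], [8, 8, 16]   # assumed 1024 -> 1024 layer
ranks = [1, 32, 32, 1]                             # assumed TT ranks

dense = 1024 * 1024
tt = tt_linear_params(in_factors, out_factors, ranks)
print(dense, tt, dense / tt)   # per-layer compression ratio
```

The whole-model ratio (4.25X above) is lower than the per-layer ratio because uncompressed parts such as the embedding table are counted as well.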
Overall training time and total epochs.
The paper mainly compares the per-epoch time of CoMERA and uncompressed training. The reviewers also wonder about the overall training time and the number of epochs. We address the common concern here.
In short, our method is empirically 2-3X faster than uncompressed training when training transformers on a single GPU. We do not have theoretical guarantees on the number of epochs, although we observed that compressed and uncompressed training used a similar number of epochs. Detailed explanations are provided below.
- Training neural networks is a highly non-convex optimization problem in both compressed and uncompressed formats, which makes theoretical convergence analysis very complicated. Even existing theoretical analyses of uncompressed training are done for simplified cases, such as two-layer neural networks with infinite width. The overall training time depends on (1) the number of epochs and (2) the time per epoch. While we observed a 2-3X speedup in terms of (2), point (1) is highly case-dependent for almost all non-convex optimization solvers.
- Our CoMERA shows empirical convergence behavior similar to uncompressed training on our tested cases. For the 6-encoder transformer, CoMERA converges a little more slowly than standard training at the beginning, as shown in Figure 6 of the paper, but eventually achieves a higher validation accuracy. For the DLRM task, CoMERA needs fewer iterations than standard training, as shown in Figure 2 of the attached PDF file. We will also add that figure to our revised manuscript.
- On all tested transformers, CoMERA generally takes 2-3X less overall training time, because it has convergence curves similar to the uncompressed model while each epoch of the tensorized model is 2-3X faster than standard training.
Although our method shows similar (and sometimes better) convergence behavior compared with standard training, we think it could be misleading to draw a general conclusion at this moment without a theoretical proof (which may or may not exist).
Figures in PDF file (attached)
Figure 1. Training loss of CodeBERT-large.
- Figure 1 shows the empirical convergence curve of tensor-compressed training on the CodeBERT-Large model, which has 24 encoder blocks and 357 million parameters in total before compression.
Figure 2. The validation normalized cross-entropy loss of training DLRM model.
- It presents the validation loss of CoMERA and of uncompressed training on the DLRM task. CoMERA converges slightly faster in terms of iterations than uncompressed training on this task.
Figure 3. Tensor diagrams for the contraction paths of TT forward- and back-propagation.
- The figure visualizes the proposed contraction paths for TT forward- and back-propagation, as detailed in Section 4.2.
The reviewers all evaluate the paper as borderline accept / accept, with most improving their evaluation after discussions with the authors. The authors had extensive discussions and pushed hard to get the evaluations improved. I rate my confidence as "less certain" because ultimately none of the reviewers seemed particularly excited about the paper in their initial evaluations, but it seems logical to accept the paper given the evaluation changes.