TranSpa: Towards Efficient Structured Sparse Training for Transformers
Abstract
Reviews and Discussion
The paper introduces TranSpa, an efficient structured sparsification method tailored for transformers used in both language and vision AI models. Unlike previous methods that focus on individual transformer components, TranSpa considers the correlations between weight matrices and their rows and columns, applying a coupled sparsification approach. By introducing a new granularity level, TranSpa selectively removes less significant parts of the transformer, optimizing its structure and reducing computational costs. Empirical results demonstrate that TranSpa achieves a 1.6x reduction in model size with minimal accuracy loss (0.6 perplexity increase) in GPT-2 training from scratch, and offers a 50% memory reduction and a 1.6x training speedup for sparse LLaMA-1B models.
Strengths
1. Innovative Sparsification: By considering correlations within weight matrices, TranSpa's coupled sparsification maintains essential model structure.
2. Efficient Granularity: TranSpa's new sparsification granularity enables precise pruning of less critical components, optimizing model efficiency.
3. Resource Savings: TranSpa reduces GPU memory usage by 50% and speeds up training by 21% in large models such as LLaMA-1B, addressing scalability.
4. Versatile Application: Effective in both pre-training and fine-tuning stages, TranSpa adapts well across various training scenarios.
Weaknesses
1. One limitation of this paper is that, while it emphasizes the correlation between weight matrices, it does not verify whether this correlation exists only between adjacent weight matrices. Further analysis is needed to confirm whether non-adjacent matrices also exhibit correlations, as this could affect the effectiveness and generality of TranSpa's sparsification approach.
2. Another limitation is the lack of time-complexity analysis for key computations, such as the computation of $v$ and the other quantities used in importance estimation. Including these analyses would clarify the computational overhead introduced by TranSpa and provide a more comprehensive picture of its efficiency.
Questions
Please see the weaknesses
The authors propose TranSpa, an efficient structured sparsification method that accelerates both training and inference of transformers. TranSpa couples weight matrices, estimates their importance accordingly, and removes coupled row/column pairs in each coupled weight matrix. The proposed method is evaluated across different scenarios, including pre-training, fine-tuning, and inference. Experimental results demonstrate that TranSpa effectively speeds up transformer training and inference while maintaining performance.
Strengths
- This work proposes to evaluate the importance of weight matrices in coupled pairs and to remove row/column pairs within each coupled pair, improving performance.
- It is carefully written.
- Experimental results demonstrate a significant reduction in training time and memory usage.
Weaknesses
- The experimental section lacks some details. For example, Table 1 presents training times but omits FLOPs savings, while Table 2 provides FLOPs savings without showing time savings.
- The training times for several baselines, such as Monarch in Table 1, are missing, which complicates the comparison between the baselines and TranSpa.
- The estimation of weight importance is based on the loss, as outlined in Eq. (6). It is unclear how importance evaluation and sparsification are conducted when no loss is available during inference (a generic sketch of such a loss-based score is given after this list for context).
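For context, a minimal NumPy sketch of a generic loss-based (first-order Taylor) importance score for a coupled pair. This is an illustration under stated assumptions, not the paper's Eq. (6); the names `W1`, `W2`, and `coupled_importance` are hypothetical.

```python
import numpy as np

# Generic first-order Taylor ("loss-aware") importance for a coupled pair
# (W1, W2); shown only as context -- the paper's Eq. (6) may differ in its
# exact form. g1 and g2 are gradients of the training loss w.r.t. W1 and W2
# on a calibration batch, which is exactly what a loss-free inference
# setting does not provide.
def coupled_importance(W1, g1, W2, g2):
    s1 = np.abs(W1 * g1).sum(axis=0)   # one score per column of W1
    s2 = np.abs(W2 * g2).sum(axis=1)   # one score per matching row of W2
    return s1 + s2                     # joint score for each coupled pair

rng = np.random.default_rng(0)
W1, g1 = rng.standard_normal((8, 16)), rng.standard_normal((8, 16))
W2, g2 = rng.standard_normal((16, 4)), rng.standard_normal((16, 4))
print(coupled_importance(W1, g1, W2, g2).shape)  # (16,) -> 16 coupled pairs
```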
Questions
- Can you clarify how the importance of weights is evaluated and how sparsification is conducted during inference?
- In Table 1, why are some pre-training experiments conducted on 8 cards while others use 4 cards? This inconsistency makes it difficult to compare time savings.
The authors propose a novel granularity, called the "coupled weight matrix", to evaluate the importance of structural components within a model and apply structured sparsity based on component importance. A "coupled weight matrix" refers to a pair of weight matrices whose product appears in the model's computation. By removing the less important rows of one matrix and the corresponding columns of the other, the authors introduce sparsity into the model.
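A minimal NumPy sketch of this coupled row/column removal, assuming a pair `(W1, W2)` whose product `W1 @ W2` enters the forward pass; the norm-based score below is only a stand-in for the paper's importance estimate, not the authors' code.

```python
import numpy as np

# Illustrative only: for a coupled pair (W1, W2) whose product W1 @ W2
# enters the forward pass, removing column j of W1 together with row j of
# W2 keeps the product's shape and only drops the rank-1 term
# outer(W1[:, j], W2[j, :]).
rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 8, 16, 8
W1 = rng.standard_normal((d_in, d_hidden))
W2 = rng.standard_normal((d_hidden, d_out))

# Toy importance score per coupled column/row pair: the size of the rank-1
# contribution it makes to W1 @ W2 (a stand-in for a loss-based estimate).
scores = np.linalg.norm(W1, axis=0) * np.linalg.norm(W2, axis=1)
keep = np.argsort(scores)[d_hidden // 2:]        # keep the top 50%

W1_s, W2_s = W1[:, keep], W2[keep, :]            # coupled removal
assert (W1_s @ W2_s).shape == (W1 @ W2).shape    # shape is preserved
rel_err = np.linalg.norm(W1 @ W2 - W1_s @ W2_s) / np.linalg.norm(W1 @ W2)
print(f"relative error after coupled pruning: {rel_err:.3f}")
```

Because the pruned index set is shared, the remaining matrices multiply as ordinary dense GEMMs, which is what makes this kind of structured sparsity translate into real speedups.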
Strengths
- A novel granularity is proposed to evaluate the importance of structural components, offering a broader perspective on model optimization.
- Experimental results appear promising, demonstrating that this approach can effectively accelerate training and inference through structured sparsity while achieving high performance.
Weaknesses
- In Eq. 5, the two matrices appearing in the latter part do not fit the definition of a "coupled weight matrix."
- Most models incorporate position embeddings in multi-head attention. For example, the attention mechanism in LLaMA can be formulated as $q_m^\top k_n = x_m^\top W_Q^\top R_{n-m} W_K x_n$, where $R_{n-m}$ is a rotation matrix with an angle determined by the relative positions of the two tokens. Thus, $W_Q$ and $W_K$ do not constitute a "coupled weight matrix."
- In current models such as LLaMA or Mistral, the FFN is defined as $\mathrm{FFN}(x) = W_{\text{down}}\big(\mathrm{SiLU}(W_{\text{gate}}x) \odot W_{\text{up}}x\big)$. The authors do not discuss this common structure in the article, and I do not observe a "coupled weight matrix" within this FFN (see the sketch after this list).
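For reference, a minimal sketch of the gated FFN referred to above; the weight names `W_gate`, `W_up`, `W_down` and all shapes are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))   # SiLU(x) = x * sigmoid(x)

# Gated FFN of the LLaMA/Mistral kind (weight names are illustrative):
# W_gate, W_up, and W_down all share the intermediate dimension d_ff, so a
# structured removal along d_ff ties three matrices together rather than a
# single pair.
d_model, d_ff = 8, 32
rng = np.random.default_rng(0)
W_gate = rng.standard_normal((d_model, d_ff))
W_up = rng.standard_normal((d_model, d_ff))
W_down = rng.standard_normal((d_ff, d_model))

def ffn(x):
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

x = rng.standard_normal((4, d_model))
print(ffn(x).shape)   # (4, 8)
```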
Questions
- I note that your experiments include LLaMA. How did you identify the "coupled weight matrix" within LLaMA's FFN?
This paper proposes TranSpa, an efficient structured sparse training approach for transformers. Experiments on various models in pre-training and fine-tuning show significant improvements in training speed, accuracy, and cost reduction compared to existing methods.
Strengths
TranSpa proposes coupled estimation and coupled sparsification in Sec. 3. With inspiration from tSVD, TranSpa implements structured pruning that can be easily translated into real acceleration (see the sketch below).
TranSpa brings weight reduction and speedups in training and inference, e.g., a 1.6× size reduction and 0.6 lower perplexity for GPT-2, and a 50% GPU memory usage reduction with a 1.6× inference throughput speedup for LLaMA-1B. TranSpa also demonstrates better performance when scaling to LLaMA-2-7B fine-tuning tasks.
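A small sketch contrasting the tSVD intuition with structured row/column removal; the rank/width budget `r` and the norm-based column selection are purely illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

# Illustrative contrast: truncated SVD (tSVD) keeps the top-r singular
# directions of a weight matrix, while structured sparsity keeps a subset
# of actual rows/columns, so the smaller matrices run as ordinary dense
# GEMMs without custom sparse kernels.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
r = 16

U, S, Vt = np.linalg.svd(W, full_matrices=False)
W_lowrank = (U[:, :r] * S[:r]) @ Vt[:r, :]          # rank-r approximation

keep = np.argsort(np.linalg.norm(W, axis=0))[-r:]   # top-r columns by norm
W_structured = W[:, keep]                           # dense (64, 16) slice
print(W_lowrank.shape, W_structured.shape)
```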
Weaknesses
While the authors apply TranSpa to various architectures (GPT-2, LLaMA-1B, DeiT), the results are not entirely convincing:
- Table 1 compares TranSpa with the GPT-2 baseline. However, GPT-2 is trained for twice as many epochs, and the authors do not provide evidence that the TranSpa model has fully trained and converged with half the epochs. This weakens the credibility of the claimed training speedup.
- Table 2 lacks crucial comparisons with PEFT methods for LLM training. The LoRA series is missing data on memory and training time. Additionally, it is unclear why LoRA has the same number of parameters as the baseline.
- Table 3 shows that TranSpa significantly underperforms on small tasks such as Winograd (69 → 60), while many studies have shown that LoRA can achieve performance similar to Full-FT. This raises concerns about the pre-training results, which are typically considered more challenging than a quick 1–3-epoch SFT.
- Table 5 switches to the much smaller CIFAR-10 dataset, despite Table 3 showing comparisons on ImageNet. This inconsistency raises questions about the quality of the loss curve on ImageNet.
The method of TranSpa is not complex; however, why and how it can preserve accuracy compared to full FT is not fully discussed in the paper.
Questions
see comments above
I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.