SS1: Accelerating Inference with Fast and Expressive Sketch Structured Transform
Linear transform with latency benefits and a better parameter-quality tradeoff
Abstract
Reviews and Discussion
The paper introduces a structured randomized parameter sharing scheme (SS1) for computational complexity reduction. A couple of coalescing techniques are suggested, where the chunk size does not affect the approximation error. The proposed method is easy to combine with the quantization method to achieve further gain. Experimental results demonstrate that SS1 improves the throughput of GPT-2.
Strengths
Overall, the paper suggests good additional features to Random Parameter Sharing (RPS).
- The idea of reducing the computational complexity with RPS is interesting and adequately tackles a limitation of RPS.
- The proposed method achieves actual latency improvement in some cases.
- The paper also covers the GPU kernel implementation; the details of the forward and backward kernels are discussed well.
Weaknesses
However, a few points should be addressed to improve the presentation.
- Experimental results lack an important baseline--the comparison between SS1 and the existing RPS methods should be considered since SS1 is built upon RPS.
- The latency and accuracy comparison of SS1 with other structured matrices is limited to small-sized models. Hence, the performance of SS1 compared to other methods on large models like LLM is unclear.
- Writing quality needs to be improved, especially for the mathematical expressions and algorithms. I am willing to increase the score if the presentation of the algorithms, theorems, and proofs meet the quality of the top-tier ML conference.
- Some mathematical expressions are confusing, e.g., a vector and its elements are sometimes denoted by the same letter.
- Algorithms 1 and 2 are hard to follow. Adding comments to each important line or block might help readers understand what they are for.
- The abbreviations used in Table 1 are not properly introduced in the text.
- Proofs in the Appendix contain typos in many places and should be written with more professional nuance.
- It is hard to follow how Algorithm 2 finds the solution of Equation 10. More details regarding Algorithm 2 should be provided.
Questions
Major questions
- How does Algorithm 2 guarantee the minimization of the Frobenius norm error in Equation 10? More details should be provided.
- How are K- and N-coalescing used in the actual SS1 implementation?
- What are the latency and accuracy of Llama-3-8b with low-rank or Monarch matrices?
- How much is SS1 better/worse than the existing RPS methods in terms of accuracy and latency?
Minor questions
- In Equation 10, I suggest denoting the Frobenius norm by $\|\cdot\|_F$, not $\|\cdot\|_2$, where the latter is usually reserved for the operator 2-norm induced by the vector 2-norm.
- Do the hyperparameters $B_K$ and $B_N$ vary from GPU to GPU? How does the inference throughput change over those parameters?
- Typo in Line 220 “[1,2])”
- I suppose “SSL” in Table 1 is a typo and should be SS1.
Limitations
SS1 requires a specialized kernel to be effectively utilized on GPUs. Different types of layers (e.g., convolution layer) may require different kernel implementations.
We thank you for the suggestions to improve the paper. Incorporating the suggestions, we have made several writing changes to the paper that are listed in the common rebuttal. We hope to have addressed your concerns and incorporated your suggestions to your satisfaction. Specific concerns are addressed below:
Comparison with RPS:
Existing RPS methods focus on reducing parameter memory but not FLOPs. Thus the latency of RPS methods is generally slightly worse than the full-model latency, regardless of the amount of compression. Even so, we performed RPS experiments using ROAST for matrix compression. The results, shown below, are as expected. We will add these results to the final paper.
| Compression | SS1 Quality | SS1 Latency | ROAST Quality | ROAST Latency |
|---|---|---|---|---|
| 2x | 19.45 | 154 | 19.87 | 238 |
| 4x | 19.99 | 148 | 20.2 | 237 |
| 8x | 20.68 | 145 | 20.94 | 222 |
Performance of SS1 compared to other methods on large models (LLMs):
While we do not have the computing power to train 1B+ LLM models, our rigorous tests on different domains, model architectures, and settings encourage us to believe that the results we see are indeed general and will translate even to bigger models.
Writing quality improvements: We apologize for the issues with our presentation. We have fixed all the issues listed here and more. The detailed summary of major changes is presented in the common rebuttal.
How does Algorithm 2 guarantee the minimization of the Frobenius norm error in Equation 10? We have improved our explanation of the projection from a full matrix to an SS1-compressed matrix, which should help avoid a lot of confusion. We provide a brief explanation here (we have also added this explanation to the appendix and moved Algorithm 2 there).
Firstly, note that parameter sharing in SS1 happens only inside a single neuron. So, there is an independent RPS instance for each neuron, and projecting the weight matrix amounts to independently projecting each neuron. Consider a single neuron $w \in \mathbb{R}^{K}$ under RPS with shared parameters $z \in \mathbb{R}^{K/c}$ and sharing matrix $S$, so that the reconstructed weight is $Sz$. Given $w$, the best $z$ minimizing $\|w - Sz\|_2$ is nothing but the solution of linear regression, i.e. $z = (S^\top S)^{-1} S^\top w$. Also, since $S$ is a sparse matrix with exactly one non-zero in each row, $S^\top S$ is a diagonal matrix. The hash function defined in the paper ensures that all elements of the diagonal are non-zero; thus, it is invertible. In fact, if you view $S$ as a mapping matrix that maps each element of $w$ (i.e., range $K$) to some element of $z$ (range $K/c$), then the value of each element of $z$ is just the aggregation of all the elements of $w$ that map to it (computed via $S^\top w$), normalized by the number of elements of $w$ that map to it (computed via $(S^\top S)^{-1}$). This is straightforward to implement and can be done in a blocked manner for the entire matrix $W$. This is presented in Algorithm 2.
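For intuition, here is a minimal NumPy sketch of this per-neuron projection, assuming an unsigned 0/1 mapping matrix; the function name, the simple stand-in hash, and the single-neuron scope are illustrative assumptions, not the paper's blocked Algorithm 2:

```python
import numpy as np

def project_neuron(w, h, num_slots):
    """Project a full weight vector w (length K) onto shared parameters z
    (length K/c): each shared slot becomes the mean of the full weights
    hashed into it, i.e. z = (S^T S)^{-1} S^T w for a 0/1 mapping matrix S."""
    sums = np.zeros(num_slots)
    counts = np.zeros(num_slots)
    np.add.at(sums, h, w)       # S^T w : aggregate the weights per shared slot
    np.add.at(counts, h, 1.0)   # diag(S^T S) : how many weights share each slot
    return sums / counts        # every slot is hit, so counts > 0

# Toy usage: K = 8 inputs, 4x compression -> 2 shared slots for this neuron.
K, c = 8, 4
h = np.arange(K) % (K // c)     # stand-in hash; the paper's hash also hits every slot
w = np.random.default_rng(0).standard_normal(K)
z = project_neuron(w, h, K // c)
```

Applying this independently to every neuron (column of $W$), in blocks, is what the blocked whole-matrix version described above does.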
B_K and B_N parameters: implementation, whether they vary with the GPU, and whether they affect throughput. The GEMM operation has three block parameters $B_M$, $B_N$, and $B_K$. In the implementation of SS1, our coalescing parameters $B_K$ and $B_N$ behave like standard GEMM block parameters, i.e., they determine what block sizes to bring into shared memory when performing the matrix multiplication. What is interesting is that, algorithmically, they also determine how the hash function and the SS1 recipe in general should behave, which is the novelty in SS1 that ensures the algorithm is hardware-friendly. These parameters are specific to each GPU and need GPU-specific optimization. Also, the choice of these parameters has a huge impact on throughput, as is true for GEMM operations in general.
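To make the role of these parameters concrete, here is a schematic pure-Python tiled GEMM (standard blocking only, not the SS1 CUDA kernel); on a GPU each tile below would be staged in shared memory, and, as noted above, SS1 additionally ties its hash function to the same $B_K$ and $B_N$ block structure:

```python
import numpy as np

def tiled_matmul(X, W, B_M=64, B_N=64, B_K=32):
    """Schematic tiled GEMM: Y = X @ W with X of shape (M, K) and W of shape (K, N).
    Each (B_M x B_K) tile of X and (B_K x B_N) tile of W corresponds to one
    shared-memory load; the loops just make the blocking explicit."""
    M, K = X.shape
    K2, N = W.shape
    assert K == K2
    Y = np.zeros((M, N), dtype=X.dtype)
    for m0 in range(0, M, B_M):
        for n0 in range(0, N, B_N):
            acc = np.zeros((min(B_M, M - m0), min(B_N, N - n0)), dtype=X.dtype)
            for k0 in range(0, K, B_K):
                x_tile = X[m0:m0 + B_M, k0:k0 + B_K]   # input tile
                w_tile = W[k0:k0 + B_K, n0:n0 + B_N]   # weight tile
                acc += x_tile @ w_tile
            Y[m0:m0 + B_M, n0:n0 + B_N] = acc
    return Y
```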
Llama-3-8b experiments: We were able to run quality experiments for low-rank. We could not find a ready implementation of the Monarch projection for rectangular matrices in the original repository. We will add complete results in the final paper. The results are very similar across the different methods.
| Model | #param | MMLU | Winogrande | Latency |
|---|---|---|---|---|
| Original | 8B | 65.05 | 76.1 | 378.29 |
| SS1 | 6.1B | 61.26 | 69.93 | 341.59 |
| Monarch | 6.2B | NA | NA | 346.8 |
| LowRank | 6.1B | 62.01 | 68.98 | 339.42 |
We want to note that this is just a proof of concept showing that SS1 can be used in the era of pretrained models as well. The main experiments remain those where we train SS1 from scratch to evaluate quality vs. latency.
We hope to have addressed your clarifications and incorporated your suggestions to your satisfaction. If you are satisfied with our changes, please consider giving us an updated score.
Thank you for your response. The presentation changes stated in the global response addressed the majority of my concerns regarding the presentation. I am inclined to raise the score, as long as the revised manuscript is updated correspondingly so that readers do not need to deal with the ambiguous notations and presentation.
Dear Reviewer, we assure you that we have made the suggested changes to the manuscript in both the main paper and the appendix, and these will be reflected in the final submission. We have also requested the AC, PC, and SAC to allow submitting the paper PDF during the review period as a special case. We are awaiting their decision.
The paper introduces the Sketch Structured Transform (SS1), a novel approach designed to enhance the efficiency of tensor multiplication in deep learning models. SS1 leverages structured yet random parameter sharing to reduce computational load while maintaining the expressive power of the models. The authors provide empirical evidence demonstrating SS1's superior quality-efficiency tradeoffs compared to existing methods like Monarch and LowRank. Key findings include significant improvements in inference speed for models such as GPT2, BERT, and Llama-3-8B, even without finetuning. The combination of SS1 with quantization further enhances efficiency, and SS1 also proves effective in reducing inference latency in CPU workloads, such as the DLRM MLPerf Model.
Strengths
- SS1 introduces a unique method of structured yet random parameter sharing that effectively reduces computational requirements while preserving model expressivity.
- Extensive experiments demonstrate SS1's superior performance across a variety of models and applications, including significant improvements in inference throughput for GPT2, BERT, and Llama-3-8B.
- SS1 can be applied to pre-trained models, allowing them to be projected onto SS1 and finetuned for faster deployment, which is highly practical for real-world applications.
Weaknesses
- The details of GPU optimization, such as K- and N-coalescing, add complexity to the implementation, which might limit widespread adoption.
Questions
None
Limitations
None
We thank the reviewer for the support of our paper. We have made writing changes to improve the manuscript and run additional experiments as suggested by other reviewers. They are listed in the common rebuttal. Specific clarifications are answered below:
K- and N-coalescing add complexity to the implementation:
All CUDA kernels optimize block parameters for performance. K- and N-coalescing is similar to the block-size selection generally required for matrix multiplication. Libraries such as cuBLAS perform this optimization and cache the parameters for reuse, hiding the complexity of the kernels from the end user. Similarly, we perform these optimizations in the code, and the end user does not need to worry about these parameters.
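As a sketch of what this "optimize once and cache" flow could look like from the Python side (the helper name, candidate list, and timing loop are our illustrative assumptions, not the released implementation):

```python
import time

_BLOCK_PARAM_CACHE = {}  # (device_name, M, K, N) -> best (B_N, B_K)

def pick_block_params(run_kernel, device_name, M, K, N,
                      candidates=((64, 32), (128, 32), (64, 64), (128, 64))):
    """Time the kernel for a few candidate (B_N, B_K) pairs once per device/shape
    and cache the winner, so the end user never sees these parameters."""
    key = (device_name, M, K, N)
    if key not in _BLOCK_PARAM_CACHE:
        best, best_t = None, float("inf")
        for b_n, b_k in candidates:
            t0 = time.perf_counter()
            run_kernel(M, K, N, b_n, b_k)   # user-supplied benchmark closure
            dt = time.perf_counter() - t0
            if dt < best_t:
                best, best_t = (b_n, b_k), dt
        _BLOCK_PARAM_CACHE[key] = best
    return _BLOCK_PARAM_CACHE[key]
```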
We hope to have addressed your concerns to your satisfaction. We are happy to provide further clarifications if needed.
The authors propose the Sketch Structured Transform (SS1), a randomized parameter-sharing method that reduces FLOPs and latency while maintaining model quality. SS1 ties parameters within a single neuron weight and can be implemented to reduce input dimensions before multiplying with a compressed weight vector. This method is GPU-friendly and can be integrated with existing RPS methods, providing control over parameter memory and computation. SS1 can also be applied to pre-trained models using a projection function, preserving the benefits of existing models and enabling fast deployment.
Strengths
- The authors created a kernel that can work on available HW and reduce workload latency. Moreover, they open source the kernel!
- The basic idea is clear as described in section 3.1; the rest is kind of messy.
Weaknesses
- Very hard to follow. The notation is abused many times (c is the compression rate, the size of a super group, etc.).
- The algorithms are unclear, for instance Algorithm 2. If we wish to find the Z that minimizes the Frobenius norm between the compressed weight and the original weight, we somehow need to search over the different options. If you are simply using brute force, then just say so and push the algorithm to the appendix. Note that using only W and not checking the activation values when defining the compression scheme will be sub-optimal, as you might reduce the MSE $\|W - \mathrm{SS1}(Z, I(k))\|$ while $\|WX - \mathrm{SS1}(Z, I(k))X\|$ (where X is the input to the linear layer) can be further reduced.
- Generally, the paper is just poorly written, which is a shame, as it seems to contain very interesting ideas.
Questions
- Can you explain Section 4 at a high level? What are you trying to prove and why? Specifically, to the best of my understanding, you compress/quantize only the weights and the activations undergo reduction. So why should we care about the variance of the dot product of compressed unit-norm vectors?
- The same goes for Theorem 2: can you explain in what sense you mean unbiased? Is it under permutation of the hash function? If so, I am not sure I understand why it is important if we pick one permutation and use it for the entire training (I mean the weights are tied once).
Limitations
The paper shows promising results on small-scale models. For Llama-3-8B the results are less promising, as only a 1.1x speedup was achieved by applying SS1 to only some of the layers (selected based on a calibration set). The limitations are partially discussed.
We thank the reviewer for their support of our paper. We apologize for the writing issues and assure the reviewer that we have taken due care to fix them in both the main paper and the appendix. The list of changes made is presented in the common rebuttal. We clarify some specific concerns below.
Notation overload of c: The symbol $c$ in the text always refers to the compression factor (a factor of $c=4$ means a compression of 4x). In the text, we have said that a supergroup contains $c$ groups. Having said that, we have refactored Section 3 with standardized notation and explanations to improve the readability of the section and to avoid confusion.
Algorithm 2: We do not need brute force to solve Equation 10, i.e., to minimize $\|W - \mathrm{SS1}(Z, I(k))\|_F$; the solution is exactly what Algorithm 2 computes. We found an easier way to explain the projection mechanism and have updated the manuscript accordingly. We hope that alleviates the concerns regarding Algorithm 2. We also explain it here:
Firstly, note that parameter sharing in SS1 happens only inside a single neuron. So, there is an independent RPS instance for each neuron, and projecting the weight matrix amounts to independently projecting each neuron. Consider a single neuron $w \in \mathbb{R}^{K}$ under RPS with shared parameters $z \in \mathbb{R}^{K/c}$ and sharing matrix $S$, so that the reconstructed weight is $Sz$. Given $w$, the best $z$ minimizing $\|w - Sz\|_2$ is nothing but the solution of linear regression, i.e. $z = (S^\top S)^{-1} S^\top w$. Also, since $S$ is a sparse matrix with exactly one non-zero in each row, $S^\top S$ is a diagonal matrix. The hash function defined in the paper ensures that all elements of the diagonal are non-zero; thus, it is invertible. In fact, if you view $S$ as a mapping matrix that maps each element of $w$ (i.e., range $K$) to some element of $z$ (range $K/c$), then the value of each element of $z$ is just the aggregation of all the elements of $w$ that map to it (computed via $S^\top w$), normalized by the number of elements of $w$ that map to it (computed via $(S^\top S)^{-1}$). This is straightforward to implement and can be done in a blocked manner for the entire matrix $W$. This is presented in Algorithm 2.
Other presentation issues We have refactored section 3, added detailed comments in algorithm 1, simplified the projection explanation, and made the notation clear and consistent. We hope the reviewer finds the new version friendlier.
Section 4 explanation
Recent literature on RPS shows that the quality of the learned linear model under compression is directly related to the quality of inner product preservation under compression. Thus, inner product preservation is an important problem to analyze in this regard. [1]
The goal of Theorem 1 in Section 4 is to understand whether projection (the basis for RPS) and quantization (the basis for quantization-based compression) can be combined to obtain better compression rates while preserving quality. While there is general consensus from empirical observations that compression methods can be combined, Theorem 1 is, to our knowledge, the first theoretical exposition on this topic; it also provides guidelines on how much compression to perform with each individual method.
The goal of Theorem 2 in Section 4 is to understand the impact the coalescing factors have on inner product preservation (and hence on the quality of the learned model).
The unbiasedness and variance are with respect to the randomness of the permutations (implemented via hash functions). While it is true that we fix these permutations at the start, the randomness is still important: it guarantees that, with high probability, we do not choose a "bad" permutation that would compromise the quality of the learned model.
[1] Desai, A. and Shrivastava, A.. In defense of parameter sharing for model-compression. International conference on learning representations 2024
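To illustrate what this means operationally, below is a small Monte Carlo check using a count-sketch-style projection (random hash plus random sign) as a stand-in; we are not claiming this is the exact SS1 construction, only demonstrating "unbiased over the randomness of the hash functions" and the variance that the theorems reason about:

```python
import numpy as np

def sketch_inner_product(x, y, m, rng):
    """One draw of a count-sketch-style estimate of <x, y>: hash K coordinates
    into m buckets with random signs, then take the inner product of the sketches."""
    K = x.shape[0]
    h = rng.integers(0, m, size=K)          # random bucket assignment
    s = rng.choice([-1.0, 1.0], size=K)     # random signs
    sx = np.zeros(m); sy = np.zeros(m)
    np.add.at(sx, h, s * x)
    np.add.at(sy, h, s * y)
    return sx @ sy

rng = np.random.default_rng(0)
K, m, trials = 512, 64, 20000
x = rng.standard_normal(K); x /= np.linalg.norm(x)   # unit-norm vectors, as in the theorems
y = rng.standard_normal(K); y /= np.linalg.norm(y)
estimates = np.array([sketch_inner_product(x, y, m, rng) for _ in range(trials)])
print("true:", x @ y, "mean estimate:", estimates.mean(), "variance:", estimates.var())
# The mean matches <x, y> (unbiasedness over the hash randomness); the variance
# shrinks as m grows, which is the quantity the theorems bound.
```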
We hope to have addressed your clarifications and incorporated your suggestions to your satisfaction. We are happy to provide further clarifications if needed.
I would like to thank the reviewers for their detailed responses. The additional explanations provided in the answers to Reviewer ZODN were very helpful, and they addressed all of my questions. I trust that the authors will revise the manuscript accordingly, and I am raising my score to 6.
This paper introduces Sketch Structured Transform (SS1), a hardware-efficient structured matrix for improving the latency of linear layers in neural networks. The paper theoretically analyzes the effectiveness of SS1 in random projections, optionally combined with quantization. Experiments show favorable performance compared to alternatives such as Monarch and low-rank structures, as the compression ratio is varied.
Strengths
- Due to its use of the sketch matrix, SS1 is conceptually novel and distinct compared to other popular and performant structures such as Monarch and Low Rank, which only use batched matrix multiplies.
- SS1 shows promising performance relative to Monarch and low rank.
- SS1 is designed to be hardware efficient, so that runtime overhead is low and latency improvements can be realized even at fairly small scale like GPT-2 Small.
Weaknesses
The experiment section does not convincingly demonstrate the superiority of SS1 over Monarch and Low Rank for the following reasons:
- The presentation of the results makes it hard to parse how SS1 compares to alternatives as a function of latency, which is the metric the paper aims to optimize for. For example, in Table 1, it would be much better to plot perplexity or loss as a function of latency for each structure. It would only be fair to claim SS1 outperforms alternatives if its loss vs latency curve stays below those of others.
- It does not make sense to control for the model's hidden dimension and only vary the compression ratio, as is done in the experiments. For example, one can increase the compression ratio (number of blocks) in Monarch or decrease the rank in low rank while scaling up the hidden dimension and achieve the same latency. It's possible this is a favorable trade off for the other structures, but the experiment section does not investigate this possibility. Indeed, recent work [1] has shown Monarch with more blocks can perform better for the same number of parameters or FLOPs by simultaneously scaling up the hidden dimension. The truly meaningful comparison would be a scaling law plot of loss vs latency with one curve for each pair (structure, compression ratio).
- I strongly suspect the learning rate is not well-tuned in the GPT experiments, making it difficult to trust the results. The same learning rate was used for all models. This value is well-tuned for the original dense GPT-2 model, but it is suboptimal for structured models, as shown in [1]: Monarch with more blocks and low rank with lower ranks would generally require higher learning rates; otherwise their performance can be significantly underestimated.
Questions
- At a conceptual level, SS1 seems similar to a banded matrix, which also has hardware-efficient implementations. How does its performance compare to a banded matrix?
- Have you tried tuning the learning rates for each model in the GPT-2 experiments? Alternatively, you can use the structure-aware learning rate rescaling introduced in [1] to ensure a good value is used for each model without having to tune it.
[1] Qiu et al. 2024. Compute Better Spent: Replacing Dense Layers with Structured Matrices
Limitations
The authors discussed some limitations of their work.
Dear reviewer, we would be happy to perform comparative experiments on banded matrices while we prepare the rebuttal, as time permits.
Can you please share links to a fast implementation of banded weight matrices? In our preliminary search, we have not found a good implementation for banded weight matrices. The closest we found was the genbmm package (https://blog.rush-nlp.com/visualizing-banded-sparse-matrices.html), but it seems to only multiply two banded matrices, whereas our setting requires multiplying one banded matrix (the weight) with a full matrix (the input).
We appreciate the reviewer rating our paper well on counts of soundness and contribution. We have made writing changes to improve the manuscript and run additional experiments as suggested by other reviewers. They are listed in the common rebuttal.
Plots of latency vs. perplexity: Thanks for the suggestion. We plot latency vs. perplexity for the GPT experiment in Figure 3 (page 15), and we can see that the SS1 curve is indeed the best, lying below the other curves. We do not use such plots in the main paper since we want to show more latency results for larger models, for which we do not have quality numbers, in the table in the main paper.
About changing dimensions along with compression rates
The direction of increasing the hidden dimension is definitely meaningful, not just for baselines such as low-rank and Monarch but also for our proposed SS1. However, it makes the experimental space of models explode. Our experimental setup is chosen primarily to provide a practical and fair setting to compare the different alternatives. In fact, sticking to the base model dimensions and evaluating various compression methods is quite standard in papers that fundamentally compare different methods; see, e.g., [1,2,3].
[1] Tanaka, H., Kunin, D., Yamins, D.L. and Ganguli, S., 2020. Pruning neural networks without any data by iteratively conserving synaptic flow. Advances in neural information processing systems, 33, pp.6377-6389.
[2]Frankle, J., Dziugaite, G.K., Roy, D.M. and Carbin, M., 2020. Pruning neural networks at initialization: Why are we missing the mark?. arXiv preprint arXiv:2009.08576.
[3] Desai, A. and Shrivastava, A.. In defense of parameter sharing for model-compression. International conference on learning representations 2024
Learning rate tuning: You are correct; our learning rate is tuned for the base models and reused across the structured models. This again is a common way to compare different methods [1,2,3]. While it is not optimal for every method, it avoids hyperparameter tuning for each method x setting combination. Having said that, we understand the concern and quickly tested whether learning rate tuning would change the findings. We find that learning rate tuning improves results for both Monarch and SS1 and does not change the relative ordering.
| Method | learning rate | loss | perplexity |
|---|---|---|---|
| Monarch - 8x | 6e-4 | 3.119 | 22.83 |
| | 9e-4 | 3.135 | 23.23 |
| | 2e-3 | 3.017 | 20.65 |
| | 8e-3 | NA | NA |
| SS1-8x | 2e-3 | 2.99 | 20.13 |
| Lowrank-8x | 2e-3 | 3.161 | 23.67 |
The findings do not change with learning rate tuning.
Suggestion on structure-aware tuning: The cited work by Qiu et al. is indeed interesting and will be useful for our future research. Thanks for pointing it out.
Banded matrix comparison: We would love to compare against a banded matrix. However, as mentioned in our official comment, we did not find a good implementation of a banded matrix multiplied with a full matrix. Please let us know if you have any references; we would be happy to include a comparative study in our camera-ready. Conceptually, SS1 and banded matrices are very different. A banded matrix is motivated via sparsity, which drops input signals for the neuron computation, whereas SS1 is motivated via parameter sharing, which sketches the input to reduce dimensionality. In previous literature, sketching has generally been superior in quality to sparsity [1].
[1] Desai, A. and Shrivastava, A.. In defense of parameter sharing for model-compression. International conference on learning representations 2024
We hope to have addressed your clarifications and incorporated your suggestions to your satisfaction. We kindly ask you to consider giving us an updated score.
I appreciate the additional experiments. I'm raising my score in light of them. That said, I believe investigating optimally balancing dimension vs compression rates, and adopting structure-aware learning rates (or simply tuning them), will be very important for delivering practical impacts. These hyperparameter choices become extremely important for large-scale training where a small improvement in a hyperparameter can translate to a significant saving of compute. Indeed, by adjusting the Monarch learning rate, you have decreased its perplexity by ~10% and significantly shrunk the gap with SS1. I hope the authors can carefully discuss these considerations in the final paper.
This paper introduces the sketch structured transform (SS1) method for fast inference. SS1 is a form of randomized parameter sharing (RPS) with connections to low-rank/dimensionality-reduction methods. Theory is used to elucidate SS1’s compatibility with quantization -- it is also compatible with other RPS methods. In empirical experiments with popular NLP and CV setups, relative to competing methods, SS1 can preserve performance better as it reduces parameter count and inference latency. SS1 is also shown to have flexibility in the stage it is applied at -- it is applicable to models without any finetuning, before finetuning, and during training.
Strengths
Originality: The RPS method SS1 is introduced, and it is shown to reduce parameters, FLOPs, and latency. The method is made complementary with tiled matrix multiplication via schemes to coalesce values along multiple axes. Theorems are introduced to helpfully demonstrate SS1’s compatibility with quantization and its hyperparameter flexibility.
Quality: The overall approach is straightforward and intuitive, and it is tested in a range of scenarios that vary the model, dataset domain, and application timing (before or after training). It is compared against strong benchmark methods.
Clarity: The SS1 method is clearly explained and positioned relative to prior work. The figures, equations, and algorithms motivate design choices and illustrate implementation details. The paper is mostly well written.
Significance: This paper addresses efficient inference for matrix-multiplication-focused models (e.g., LLMs), a crucially important topic. The proposed method is applicable to pretrained models, giving it broad relevance.
Weaknesses
The paper doesn't have any major weaknesses. I address some minor weaknesses below (see Questions). These mostly deal with improving the presentation. I would be happy to raise my score if some of them are addressed.
Questions
Line 285: this seems incorrect. The quality with SS1 seems worse in Table 2 (right).
Would the benefits of SS1 over other approaches hold at larger model sizes (1B+)?
Line 123: I think this equivalence only holds for points with unit length.
Line 155: Can you please rewrite this sentence to better clarify why the input would have to be read multiple times?
Line 153: should this be “” instead of “”? The latter notation is used on this line, which is inconsistent with the output dimension appearing first in the weight matrix notation on line 139.
Is equation 9 missing the inner product notation?
- Line 202: why is $B_K B_N$ subtracted in the indexing?
Line 263: this is a figure, not a table.
Limitations
The limitations are well articulated.
We thank the reviewer for the support of our paper. We have made writing changes to improve the manuscript and run additional experiments as suggested by other reviewers. They are listed in the common rebuttal.
Please find the clarifications requested below:
Line 285: In Table 2 (right), we can see that the SS1 model with 74M parameters has better (lower) perplexity than GPT-S-6x with 76M parameters. Thus SS1 models with fewer parameters can outperform the standard model.
Would the benefits of SS1 over other approaches hold at larger model sizes (1B+)? We believe so. While we do not have the computing power to train 1B+ LLM models, our rigorous tests on different domains, model architectures, and settings encourage us to believe that the results we see are indeed general and will translate even to bigger models.
Line 123: The equivalence is due to the relation between inner products and norms. Inner products => norms: since $\|x\|^2 = \langle x, x \rangle$, norms are just inner products of a vector with itself. Norms => inner products: since $2\langle x, y \rangle = \|x\|^2 + \|y\|^2 - \|x - y\|^2$. Thus, if norms are preserved, then inner products are also preserved, and vice versa. More discussion can be found in the book [1] (page 12, first paragraph).
[1] Woodruff, D.P., 2014. Sketching as a tool for numerical linear algebra. Foundations and Trends® in Theoretical Computer Science, 10(1–2), pp.1-157.
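Spelled out, the identity used above is the standard polarization argument:

```latex
\|x - y\|^2 = \langle x - y,\; x - y \rangle = \|x\|^2 - 2\langle x, y \rangle + \|y\|^2
\quad\Longrightarrow\quad
\langle x, y \rangle = \tfrac{1}{2}\bigl(\|x\|^2 + \|y\|^2 - \|x - y\|^2\bigr).
```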
Line 155: Let us clarify the line here. Claim: if we do not perform K-coalescing, we might end up loading the input multiple times to compute the value of a single neuron. Argument: consider a neuron computed as $\langle z, Sx \rangle$, where the dimensions of $z$ and $x$ are large enough that they cannot be cached or stored in shared memory. Under certain hash functions that generate $S$, $x[i]$ and $x[i+1]$ (for any $i$) can get multiplied with $z[j]$ and $z[k]$ where $j$ and $k$ are far apart (in fact, under the randomness of the hash function, this event is highly likely). In such a case, if we take a single pass over $z$, we will have to read the cache line containing $x[i]$ and $x[i+1]$ at least twice.
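A toy way to quantify this locality argument (the hash constructions below are illustrative, not the actual SS1 recipe): count how many input cache lines must be touched when sweeping over $z$ slot by slot, under an unconstrained hash versus a coalesced one.

```python
import numpy as np

K, c, line = 4096, 8, 16            # input dim, compression factor, elements per cache line
m = K // c                          # length of z
rng = np.random.default_rng(0)

h_rand = rng.integers(0, m, size=K) # unconstrained hash: x[i] -> arbitrary slot of z
h_coal = np.arange(K) // c          # coalesced stand-in: consecutive x's share nearby slots

def cache_line_loads(h):
    """Total x cache lines touched when sweeping z slot by slot and gathering
    the x entries hashed to each slot (a proxy for repeated input loads)."""
    lines = np.arange(K) // line
    return sum(len(np.unique(lines[h == j])) for j in range(m))

print("unconstrained:", cache_line_loads(h_rand))  # ~K: each slot gathers from ~c scattered lines
print("coalesced:    ", cache_line_loads(h_coal))  # ~m: each slot stays within a single line
```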
Taking cue from the need for clarification along with notation and presentation issues pointed out, we have refactored our explanation in section 3 in manuscript while standardizing the notation. We believe the current version is clearer and will help the readers.
Line 153, Equation 9: Thanks for pointing out the inconsistency in handling W. We have fixed the presentation such that we always use $W$ as a $K \times N$ matrix and all expressions are written as vector and matrix multiplications (we have removed the inner product notation) for clarity.
Line 202: $B_K B_N$ is subtracted in the indexing in accordance with the hash function treatment of ROAST [2], a recent RPS method. It is a simple way to avoid index overflows when reading a block of size $B_K \times B_N$ starting from the indexed location.
[2] Desai, A., Zhou, K. and Shrivastava, A., 2023, July. Hardware-aware compression with random operation access specific tile (ROAST) hashing. In the International Conference on Machine Learning (pp. 7732-7749). PMLR.
Line 263: fixed.
We hope to have addressed your clarifications and incorporated your suggestions to your satisfaction. We kindly ask you to consider giving us an updated score as per your indication in the review. We are happy to provide additional clarifications if needed.
Thanks again for the support of our paper.
Dear Reviewer, We hope to have addressed the requested clarifications and incorporated your suggestions. Please let us know if the responses are satisfactory since the discussion period deadline is fast approaching. If so, we request that you update your score as indicated in the review. We are happy to provide further details and answer any more questions.
I have read the other reviews and the authors' response -- thank you for the clarifications. I am raising my score to 6 based on my understanding that the presentation has been improved.
Could the authors please create figures similar to the new one in the authors' general rebuttal for the larger model sizes they explored (i.e., GPT2-M and GPT2-L)? Also, there's a typo in Table 1 -- GPT2-L is not listed, and GPT2-M is listed twice.
Please make the following additional changes:
- On Line 285, make clear that you are comparing to "GPT-S-6x" and not "GPT-S-4x", as the latter is better in terms of perplexity.
- It looks like Line 123 says that inner products between points and distances between points are "equivalent", which is not correct. Could you rewrite this to avoid this confusion?
Dear Reviewer, Thanks for the additional feedback and updated score. We will incorporate this in the final version (Table 1, Line 285, and Line 123). The scatter plots are generated for the GPT2-S model, as we have quality numbers for it, whereas the larger models are tested only for latency. We can add scatter plots for the other datasets in the paper and add more baselines to them (e.g., RPS).
This isn't critical, but please let me know why the larger models are only tested for latency (and not also quality). The submission might have addressed this, but I don't recall.
For a good comparative study at different compression rates, we need the models to be trained with correctly sized datasets. Larger models need larger datasets and thus more compute (and costs) to train. If larger models are trained with not-so-large datasets, then we cannot measure the impact on expressivity due to structured matrices since compressed models are still expressive enough to absorb all the variation in the dataset.
While the paper's scores are borderline, the reviewers' consensus on the paper's good soundness and contribution is encouraging. Several suggestions for improving the presentation of the paper were made, which we have incorporated in our updated manuscript. Reviewers AurA and ZoDN are willing to increase their scores if the presentation is improved. We hope they find the changes satisfactory.
Presentation Changes: (Major text changes are marked in blue in uploaded pdf)
- We have standardized the notation used, i.e., using small-case letters for scalars, small-case boldface for vectors, and upper-case boldface for matrices throughout the paper and appendix.
- We have removed the inner product notation and stuck to matrix notation in most cases to avoid confusion. Also, some inconsistencies w.r.t. the shape of W have been corrected.
- We have simplified the discussion on projection of matrices on SS1 and moved the algorithm to the appendix, where additional details are presented.
- We have fixed the typos in the appendix and main text.
- The condition in Theorem 1 on the norms of vectors x and y has been corrected. It is $\|x\| \le 1$ and $\|y\| \le 1$.
- Table 1 abbreviations are introduced in the preceding paragraph.
- A latency vs. perplexity plot for GPT2-S with SS1, Monarch, and low-rank is added to the appendix (attached as a PDF to this rebuttal).
We think the proposed changes have improved our paper and thank the reviewers for their careful review and suggestions.
We also provide additional experiments as requested by a few reviewers.
- SS1 vs. RPS: As expected, we find that RPS does not reduce the latency of the model regardless of the compression. Thus SS1 has a better quality-latency tradeoff.
- Does learning rate tuning change the results?: We perform learning rate tuning for Monarch as suggested by the reviewer and find that it does improve the quality of Monarch models. However, the same is also true for SS1, and the relative ordering of the methods does not change with learning rate tuning.
The additional experiments support the superiority of SS1 over alternatives. We will add these results to the final manuscript.
Overall, we thank the reviewers for their time and careful consideration of our paper. We urge them to reconsider their scores if their concerns have been reasonably alleviated.
UPDATE: We just found that we do not have an option to upload the updated manuscript. Since some of the changes requested by the reviewers relate to presentation, and their scores are contingent on that (e.g., reviewer ZoDN), is it possible to share the updated manuscript?
Dear PC, SAC, AC: We found that we do not have an option to upload the updated manuscript after making significant changes to incorporate the reviewer comments. Since some of the changes requested by the reviewers relate to the presentation, and their scores are contingent on that (e.g., reviewer ZoDN), is it possible to share the updated manuscript?
The authors present a method for accelerating inference in ML models using sketching, called Structured Sketch Transform (SS1). The method balances the efficiency benefits of randomized parameter sharing via sketching with structured transformations that can be efficiently implemented on modern hardware. The method can be flexibly applied – it is applicable to models without any finetuning, before finetuning, and during training – and it combines with other approaches to efficient inference, e.g. quantization. The authors present theoretical analysis justifying the gains in combination with quantization, and experiments show favorable performance compared to alternatives such as Monarch and low-rank structures, as the compression ratio is varied.
Initial reviews of the paper were appreciative of the authors’ method’s generality and novelty, but most reviewers noted issues with presentation, notation, etc. that made it challenging to read the work in parts. After the rebuttal and discussion, reviewers uniformly raised their scores to weak accept, conditional on the authors making significant promised improvements to the presentation. Reviewers also highlighted an unresolved concern about lack of convincing empirical demonstration on LLMs beyond small-scale, especially relative to Monarch’s evaluation at GPT2-M scale, in light of new experiments for the rebuttal on GPT2-S. However, on balance, reviewers advocate for acceptance based on the novelty of the authors' approach and its flexibility, and the AC concurs with this judgment and recommends acceptance. The authors should incorporate all feedback from reviewers into the final revision in order to improve the presentation and readability of the work, and add the experiments that were presented during the rebuttal.