Strided Transformers for Partially-Parallelized Inference

Amil Merchant,Ekin Dogus Cubuk,Stefano Ermon

OpenReview PDF

提交: 2023-09-21更新: 2024-03-26

TL;DR

Strided auto-regressive dependencies for partially-parallelized inference

摘要

关键词

transformersauto-regressiveinference

评审与讨论

审稿意见

评分: 3置信度: 42023-10-19

This paper introduces a partially-parallelized Transformer model called the Strided Transformer. Unlike traditional autoregressive Transformers, which require a token to be fully predicted before predicting the next token, the Strided Transformer feeds the predicted token simultaneously to several subsequent steps and inserts it into different layers. In this way, strided Transformer supports parrallel computation of different time steps during decoding, which accelerate the decoding and reduces the memory load.

优点

The proposed approach is innovative and intuitive.
Based on the current results, the Strided Transformer achieves a favorable speedup with only a relatively small cost of accuracy. Its performance appears promising.

缺点

This work is too hasty and lacks sufficient experiments to support the conclusion. I would raise my score if the authors could provide a fully prepared version during the rebuttal stage.
There is a lack of discussion and comparison with related work, such as non-autoregressive [1] or semi-autoregressive [2] sequence generation approaches.
The authors report only the theoretical decoding speedup, without evaluating its performance in a real-world environment.
Reporting test losses on WMT benchmarks is not a standard practice. Please implement greedy or beam search and provide the corresponding BLEU scores for a more meaningful evaluation.

[1] Gu, Jiatao, et al. "Non-autoregressive neural machine translation." arXiv preprint arXiv:1711.02281 (2017).

[2] Wang, Chunqi, Ji Zhang, and Haiqing Chen. "Semi-autoregressive neural machine translation." arXiv preprint arXiv:1808.08583 (2018).

问题

Why do some results in Table 1 appear inconsistent? For instance, the Layer 6 Stride 1 setting has a test loss of 0.681 in line 2, but a test loss of 0.696 in line 4.

审稿意见

评分: 3置信度: 42023-10-29

The authors introduce the Strided Transformer architecture, which reduces token dependencies in the initial layers of the Transformer model. This design allows for partial parallelization during the decoding process and exhibits good theoretical speedups in both memory-bound and compute-bound settings.

优点

The proposed Strided Transformer architecture exhibits good theoretical speedup properties.

缺点

The work is incomplete, and the experiment is not sound.

The authors only provide a theoretical evaluation of speedup. It seems they did not implement or perform experiments.
Table 1 only presents test loss as a metric for performance degradation, without including any generation results such as BLEU scores to assess the quality of output.
The work is not ready for publication, especially given that it consists of only 8 pages with one page containing only two figures.

问题

The idea is quite fancy. Comprehensive experiments are needed.

伦理问题详情

N/A

审稿意见

评分: 1置信度: 52023-11-05

This paper introduces a partially parallelized inference for Transformer decoders called Strided Transformers. However, the authors completely ignored previous work on (partially) parallel inference with Transformers.

优点

Parallel decoding with Transformers is an important problem that is worth exploring.

缺点

The manuscript completely ignored previous literature on parallel decoding with Transformers, non-autoregressive Transformers, partially parallel decoding with Transformers, and accelerating Transformer decoding.
The paper is not very well written. The methodology is not very clearly described and the Figures are not very clear as illustration. Many sentences are not very natural or easily understandable too.
The experiments are not convincing because they are not conducted on standard datasets nor compared with meaningful baselines.

问题

N/A