FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré. Advances in Neural Information Processing Systems, Vol. 35, 2022. DOI: 10.48550/arXiv.2205.14135 - Introduces FlashAttention, an I/O-aware exact attention algorithm that tiles the attention computation to reduce reads and writes between GPU high-bandwidth memory and on-chip SRAM, substantially speeding up attention in transformers; directly relevant to LLM inference acceleration.
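As a rough illustration of the tiling idea only (not the paper's fused CUDA kernel), the NumPy sketch below computes exact attention block by block with an online softmax, so the full N x N score matrix is never materialized. The function name `tiled_attention` and the `block_size` parameter are illustrative choices, not names from the paper.

```python
import numpy as np

def tiled_attention(Q, K, V, block_size=64):
    """Block-wise exact attention with online softmax rescaling.

    Processes K/V in tiles so the full N x N score matrix is never
    materialized, which is the core idea behind I/O-aware attention.
    """
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    row_max = np.full(n, -np.inf)   # running max of scores per query row
    row_sum = np.zeros(n)           # running softmax denominator per row

    for start in range(0, K.shape[0], block_size):
        Kb = K[start:start + block_size]
        Vb = V[start:start + block_size]
        scores = (Q @ Kb.T) * scale             # (n, block) scores for this tile

        new_max = np.maximum(row_max, scores.max(axis=1))
        correction = np.exp(row_max - new_max)  # rescale earlier partial results
        p = np.exp(scores - new_max[:, None])   # tile's unnormalized probabilities

        row_sum = row_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ Vb
        row_max = new_max

    return out / row_sum[:, None]
```

The result matches a naive `softmax(Q K^T / sqrt(d)) V` up to floating-point error; the speedups reported in the paper come from performing this tiling inside a single fused GPU kernel, which the sketch does not attempt to reproduce.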
Fast Inference from Transformers via Speculative Decoding. Yaniv Leviathan, Matan Kalman, Yossi Matias. Proceedings of the International Conference on Machine Learning (ICML), 2023. DOI: 10.48550/arXiv.2211.17192 - Introduces speculative decoding, in which a smaller draft model proposes several tokens that the larger target model verifies in a single forward pass, reducing per-token generation latency without changing the target model's output distribution.
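A minimal greedy sketch of the draft-and-verify loop is given below, assuming hypothetical `draft_next` and `target_argmax` callables that stand in for the two models; the paper's full speculative sampling additionally uses a rejection scheme so that sampled outputs follow the target distribution exactly.

```python
from typing import Callable, List

def speculative_decode_greedy(
    draft_next: Callable[[List[int]], int],          # hypothetical: draft model's next token
    target_argmax: Callable[[List[int]], List[int]], # hypothetical: target's argmax at every position
    prompt: List[int],
    num_tokens: int,
    gamma: int = 4,
) -> List[int]:
    """Simplified greedy speculative decoding (illustrative, not the paper's sampler)."""
    seq = list(prompt)
    while len(seq) - len(prompt) < num_tokens:
        # 1) Draft model proposes gamma tokens autoregressively (cheap).
        proposed, ctx = [], list(seq)
        for _ in range(gamma):
            t = draft_next(ctx)
            proposed.append(t)
            ctx.append(t)

        # 2) Target model scores the prompt plus proposals in one forward pass.
        preds = target_argmax(seq + proposed)

        # 3) Accept the longest prefix of proposals that matches the target.
        n_accept = 0
        for i, tok in enumerate(proposed):
            if preds[len(seq) - 1 + i] == tok:
                n_accept += 1
            else:
                break
        seq += proposed[:n_accept]

        # 4) Always take one token from the target (correction or bonus token).
        seq.append(preds[len(seq) - 1])
    return seq[: len(prompt) + num_tokens]
```

Because several accepted tokens cost only one target-model forward pass, per-token latency drops whenever the draft model agrees with the target often enough to offset its own (much cheaper) decoding cost.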