Efficient Memory Management for Large Language Model Serving with PagedAttention (the vLLM paper), Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, 2023, arXiv preprint arXiv:2309.06180, DOI: 10.48550/arXiv.2309.06180 - Introduces PagedAttention, a memory-management technique that stores the KV cache in fixed-size, non-contiguously allocated blocks, significantly improving the efficiency of continuous batching for LLM inference (a toy sketch of the block-table idea follows these references).
Text Generation Inference: A Production-Ready Framework for LLM Serving, Olivier Dehaene, Félix Marty, João Gante, Quentin Lhoest, Victor Sanh, et al., 2023 (Hugging Face Blog) - Details the architecture and features of Hugging Face's Text Generation Inference, which includes an optimized continuous batching implementation for production workloads (a minimal continuous-batching loop is sketched below).
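To make the first reference concrete, here is a minimal, hypothetical sketch of the block-table idea behind PagedAttention, assuming a fixed block size and a simple free list; it is not vLLM's actual implementation, and all class and method names are invented for illustration. The point it shows is that a sequence's KV cache lives in non-contiguous fixed-size blocks that are allocated only as the sequence grows, which keeps fragmentation bounded.

```python
# Toy sketch of a PagedAttention-style block table (hypothetical; not vLLM code).
# The KV cache is carved into fixed-size physical blocks; each sequence keeps a
# logical->physical block table, so its cache need not be contiguous in memory.

class BlockAllocator:
    """Manages a pool of fixed-size physical KV-cache blocks."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # free physical block ids

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("KV cache exhausted; request must wait or be preempted")
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


class Sequence:
    """Tracks one request's token count and its logical->physical block table."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.num_tokens = 0
        self.block_table: list[int] = []  # logical block index -> physical block id

    def append_token(self) -> None:
        # A new physical block is needed only when the last block is full, so at
        # most block_size - 1 slots per sequence are ever left unused.
        if self.num_tokens % self.allocator.block_size == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def release(self) -> None:
        for block_id in self.block_table:
            self.allocator.free(block_id)
        self.block_table.clear()


if __name__ == "__main__":
    allocator = BlockAllocator(num_blocks=8, block_size=16)
    seq = Sequence(allocator)
    for _ in range(40):          # 40 tokens -> ceil(40 / 16) = 3 blocks
        seq.append_token()
    print(seq.block_table)       # non-contiguous physical block ids, e.g. [7, 6, 5]
    seq.release()
```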
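The second reference centers on continuous (iteration-level) batching. The following toy loop, again with invented names rather than TGI's actual API, illustrates the scheduling idea: admission and eviction happen at every decoding step, so finished sequences free their slots immediately and waiting requests join without waiting for the whole batch to drain.

```python
# Toy sketch of a continuous-batching scheduler loop (hypothetical; not TGI code).
from collections import deque
import random

MAX_BATCH_SIZE = 4

def decode_step(seq: dict) -> bool:
    """Pretend to generate one token; return True if the sequence has finished."""
    seq["generated"] += 1
    return seq["generated"] >= seq["max_new_tokens"]

def serve(requests: list[dict]) -> None:
    waiting = deque(requests)
    running: list[dict] = []
    steps = 0
    while waiting or running:
        # Admit new requests up to the batch limit at every iteration -- the key
        # difference from static batching, where admission happens only between batches.
        while waiting and len(running) < MAX_BATCH_SIZE:
            running.append(waiting.popleft())
        # One decoding iteration over the current batch; finished sequences drop out.
        running = [seq for seq in running if not decode_step(seq)]
        steps += 1
    print(f"served {len(requests)} requests in {steps} decoding steps")

if __name__ == "__main__":
    reqs = [{"id": i, "generated": 0, "max_new_tokens": random.randint(2, 10)}
            for i in range(10)]
    serve(reqs)
```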