NVIDIA Triton Inference Server User Guide, NVIDIA Corporation, 2024 (NVIDIA Corporation) - Details a high-performance inference serving solution supporting dynamic batching and efficient GPU utilization, capabilities critical for scaling the throughput of RAG inference components.
Horizontal Pod Autoscaler, Kubernetes Authors, 2024 (The Kubernetes Project) - Official documentation describing how Kubernetes automatically scales the number of pods in a deployment based on observed CPU utilization or custom metrics, central to autoscaling RAG components.
Milvus: A Purpose-Built Vector Database for Scalable Similarity Search, Jianguo Li, Kai Wang, Xiaomeng Huang, Xiangyu Li, Tao Li, Haojie Zuo, Kun Liu, Jing Li, Yan Liang, Yuhua Zou, Guoliang Li, Jun Jiang, 2021, Proceedings of the VLDB Endowment, Vol. 14 (VLDB Endowment), DOI: 10.14778/3476249.3476269 - Presents the architecture and scaling mechanisms of Milvus, a distributed vector database designed for high-throughput similarity search, relevant for scaling vector database components in RAG.