In-Datacenter Performance Analysis of a Tensor Processing Unit, Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, et al., 2017, Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA) (ACM), DOI: 10.1145/3079856.3080246 - Describes the architecture and performance of Google's first-generation Tensor Processing Unit (TPU), detailing its systolic array design for neural network acceleration.
NVIDIA Volta Architecture In-Depth, Stephen Jones and David B. Kirk, 2017 (NVIDIA Corporation) - Detailed technical whitepaper describing the NVIDIA Volta GPU architecture, including the introduction and design principles of Tensor Cores for deep learning acceleration.
TVM: An Automated End-to-End Optimizing Compiler for Deep Learning, Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy, 2018, 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18) (USENIX Association) - Introduces TVM, an open-source deep learning compiler stack that automatically optimizes and generates code for diverse hardware backends, including CPUs, GPUs, and specialized accelerators, covering the software mapping aspect.