In-Datacenter Performance Analysis of a Tensor Processing Unit, Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, et al., 2017, Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17) (ACM), DOI: 10.1145/3079856.3080246 - This foundational paper introduces Google's Tensor Processing Unit (TPU) and details its architecture and performance characteristics for neural network workloads.
NVIDIA H100 Tensor Core GPU Architecture, NVIDIA, 2022 (NVIDIA) - Official whitepaper describing the architecture of NVIDIA's H100 GPU, including its Tensor Cores, memory hierarchy (HBM3), and interconnects (NVLink), all critical for LLM deployment.
A Survey of Deep Learning Accelerators: Architectural Innovations and Open Challenges, Vahid Esmaeilzadeh, Babak Falsafi, and Hadi Esmaeilzadeh, 2021, ACM Computing Surveys, Vol. 54 (Association for Computing Machinery (ACM)), DOI: 10.1145/3472017 - Offers a comprehensive overview of diverse deep learning hardware accelerators, covering CPUs, GPUs, FPGAs, and ASICs, and discusses their architectural innovations and open challenges.