NVIDIA Volta GV100 GPU Architecture, NVIDIA Corporation, 2017 (NVIDIA Corporation) - Introduces the NVIDIA Volta architecture and its Tensor Cores, detailing their principles and initial capabilities for accelerating deep learning workloads.
AMD Instinct MI200 Series: The CDNA2 Architecture, AMD, 2021 (AMD) - Describes the CDNA2 architecture powering AMD Instinct MI200 GPUs, including the design and functionality of its Matrix Core units (MFMA instructions).
CUDA C++ Programming Guide, NVIDIA, 2023 (NVIDIA) - Provides comprehensive guidance on programming NVIDIA GPUs, detailing how to utilize Tensor Cores through CUDA intrinsics such as the warp-level WMMA API, along with best practices for performance optimization (a minimal sketch follows this list).
TVM: An Automatic End-to-End Optimizing Compiler for Deep Learning, Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy, 2018, 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18) (USENIX Association) - Introduces TVM, an automatic compiler framework that optimizes deep learning workloads for various hardware, including specialized matrix units, through a modular pass infrastructure.
Triton: An Intermediate Language and Compiler for GPU Programming, Philippe Tillet, H. T. Kung, David Cox, 2019, Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL '19) (ACM), DOI: 10.1145/3315508.3329973 - Presents Triton, an intermediate language and compiler designed to simplify and optimize high-performance kernel generation for GPUs, particularly for operations leveraging Tensor Cores.
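To make the CUDA-intrinsics entry above concrete, the sketch below shows the warp-level WMMA API documented in the CUDA C++ Programming Guide. The kernel name, fixed 16x16x16 tile shape, and single-warp launch are illustrative assumptions, not taken from any of the cited works; WMMA requires a GPU of compute capability 7.0 (Volta) or newer.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a single 16x16 output tile: C = A * B,
// with A and B stored in FP16 and the accumulator kept in FP32.
__global__ void wmma_tile_gemm(const half *A, const half *B, float *C) {
    // Fragments are opaque, warp-distributed register tiles.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);               // zero the accumulator tile
    wmma::load_matrix_sync(a_frag, A, 16);           // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // D = A*B + C on Tensor Cores
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```

A single warp is enough to drive the hardware for one tile, e.g. `wmma_tile_gemm<<<1, 32>>>(dA, dB, dC);`; production kernels tile larger matrices across many warps and stage operands through shared memory, as the programming guide describes.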