MLIR: A Compiler Infrastructure for the End of Moore's Law, Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasilache, Oleksandr Zinenko, 2020. arXiv preprint arXiv:2002.11054. - Describes MLIR's design principles, explaining the advantages of multi-level IR and an extensible type system for domain-specific optimizations, including quantization.
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference, Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, Dmitry Kalenichenko, 2018. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE. DOI: 10.1109/CVPR.2018.00287. - The foundational paper introducing the affine quantization scheme with scale and zero point for neural networks, including the mathematical underpinnings of the quantize, dequantize, and requantize operations; the core relations are sketched after this entry.
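As a reading aid, a minimal sketch of the affine scheme from this paper, in its notation of real value r, integer value q, scale S, and zero point Z. The requantize line is our composition of the first two for a single fixed-point multiplier M; the clamp bounds q_min and q_max are determined by the storage integer type.

```latex
% Affine (asymmetric) quantization: real r, integer q, scale S, zero point Z.
r \approx S\,(q - Z)                                % dequantize
q = \mathrm{clamp}\!\left(\mathrm{round}\!\left(\tfrac{r}{S}\right) + Z,\; q_{\min},\; q_{\max}\right)   % quantize
% Requantize from (S_1, Z_1) to (S_2, Z_2) with multiplier M = S_1 / S_2:
q_2 = \mathrm{clamp}\!\left(\mathrm{round}\!\left(M\,(q_1 - Z_1)\right) + Z_2,\; q_{\min},\; q_{\max}\right)
```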
MLIR Quantization Dialect Guide, LLVM Project Developers, 2024. LLVM Foundation. - Official documentation for the MLIR quantization ('quant') dialect, with specific examples of dedicated quantized types and operations for representing quantization explicitly in the IR; a small illustrative snippet follows this entry.
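For orientation, a minimal sketch of what such IR can look like, assuming the upstream quant dialect's `!quant.uniform` type and its `quant.qcast`/`quant.dcast` cast operations; the scale and zero-point values here are illustrative, not taken from the guide.

```mlir
// Quantize f32 to a uniform quantized type (storage type i8, expressed
// type f32, scale 0.0039, zero point -128), then dequantize back to f32.
func.func @roundtrip(%arg0: tensor<4xf32>) -> tensor<4xf32> {
  %q = quant.qcast %arg0
      : tensor<4xf32> to tensor<4x!quant.uniform<i8:f32, 0.0039:-128>>
  %r = quant.dcast %q
      : tensor<4x!quant.uniform<i8:f32, 0.0039:-128>> to tensor<4xf32>
  return %r : tensor<4xf32>
}
```

Keeping the quantization parameters in the type, rather than as attributes on each op, is what lets ordinary MLIR passes propagate and verify them without special-casing quantized tensors.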
TVM: An Automated End-to-End Optimizing Compiler for Deep Learning, Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy, 2018. Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI '18), USENIX Association. DOI: 10.5555/3342335.3342371. - Introduces TVM, a deep learning compiler framework whose IR is designed to facilitate a range of optimizations, including low-precision and quantized execution, providing context for compiler design.