Training Deep Nets with Sublinear Memory Cost, Tianqi Chen, Bing Xu, Chiyuan Zhang, Carlos Guestrin, 2016, arXiv preprint arXiv:1604.06174, DOI: 10.48550/arXiv.1604.06174 - This paper introduced gradient checkpointing (also known as activation checkpointing or re-materialization) to reduce memory consumption during deep neural network training.
jax.checkpoint, JAX developers, 2024 - Official documentation for the jax.checkpoint transformation (aliased as jax.remat), providing practical usage details and examples for memory optimization in JAX; a minimal usage sketch appears after this list.
The Reversible Residual Network: Backpropagation Without Storing Activations, Aidan N. Gomez, Mengye Ren, Raquel Urtasun, Roger B. Grosse, 2017, Advances in Neural Information Processing Systems 30 (NIPS 2017), Curran Associates Inc. - This paper introduces Reversible Residual Networks, an architectural design that further reduces memory by allowing exact reconstruction of activations during the backward pass, extending the principles of recomputation.
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He, 2020, SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE, DOI: 10.1109/SC41405.2020.00024 - This paper presents ZeRO, a comprehensive set of memory optimization techniques for large-scale distributed model training, which integrates activation checkpointing within its framework.
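As a companion to the jax.checkpoint entry above, here is a minimal sketch of activation checkpointing in JAX. Only jax.checkpoint, jax.grad, and the jax.numpy / jax.random calls are actual library APIs; the toy layer, loss function, and tensor sizes are hypothetical and chosen purely for illustration.

```python
# Minimal sketch of activation checkpointing with jax.checkpoint (jax.remat).
# The tanh layer stack, loss, and sizes below are hypothetical examples.
import jax
import jax.numpy as jnp


def layer(params, x):
    w, b = params
    return jnp.tanh(x @ w + b)


# Wrapping the layer with jax.checkpoint tells autodiff not to store its
# intermediate activations; they are recomputed during the backward pass.
checkpointed_layer = jax.checkpoint(layer)


def loss(all_params, x):
    for params in all_params:
        x = checkpointed_layer(params, x)
    return jnp.sum(x ** 2)


key = jax.random.PRNGKey(0)
dim = 128
all_params = [
    (jax.random.normal(jax.random.fold_in(key, i), (dim, dim)) / jnp.sqrt(dim),
     jnp.zeros((dim,)))
    for i in range(4)
]
x = jax.random.normal(key, (8, dim))

# Gradients are computed as usual; peak activation memory is reduced at the
# cost of recomputing each checkpointed layer's forward pass once more.
grads = jax.grad(loss)(all_params, x)
print(jax.tree_util.tree_map(lambda g: g.shape, grads))
```

In this sketch, each wrapped layer discards its intermediates after the forward pass and recomputes them when the gradient needs them, trading extra compute for a smaller activation footprint, which is the recomputation idea introduced by Chen et al. above.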