Learn advanced techniques for quantizing Large Language Models (LLMs) and deploying them for optimized inference performance. This course covers state-of-the-art quantization methods, deployment frameworks, performance analysis, and optimization strategies tailored for LLMs on various hardware platforms.
Prerequisites: proficiency in Python, familiarity with machine learning fundamentals, and a working knowledge of LLM basics.
Level: Advanced
Advanced Quantization Techniques
Implement and compare various LLM quantization methods including low-bit (sub-4-bit), mixed-precision, and post-training quantization algorithms like GPTQ and AWQ.
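As a taste of the underlying mechanics, the sketch below shows symmetric per-channel weight quantization and dequantization in plain Python. The bit width, rounding scheme, and function names are illustrative assumptions; this is not GPTQ or AWQ, which add error-compensating updates and activation-aware scaling on top of this basic round-to-nearest step.

```python
# Minimal sketch of symmetric round-to-nearest weight quantization.
# All names and the 4-bit choice are illustrative, not a specific algorithm.

def quantize_channel(weights, bits=4):
    """Symmetrically quantize one weight row to signed integers."""
    qmax = 2 ** (bits - 1) - 1                 # e.g. 7 for int4
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize_channel(q, scale):
    """Recover approximate floating-point weights."""
    return [v * scale for v in q]

row = [0.12, -0.5, 0.33, 0.9, -0.07]
q, scale = quantize_channel(row, bits=4)
recovered = dequantize_channel(q, scale)
max_err = max(abs(a - b) for a, b in zip(row, recovered))
```

Methods like GPTQ improve on this by quantizing columns sequentially and adjusting the remaining weights to compensate for each rounding error, which matters most at sub-4-bit widths.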
Quantization Calibration
Apply advanced calibration techniques to minimize accuracy loss during LLM quantization.
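One core calibration decision is how to set the clipping threshold from calibration activations. The sketch below contrasts naive min-max calibration with percentile clipping on data containing rare outliers; the 99.9th-percentile choice and all names are assumptions for illustration, not a specific paper's method.

```python
# Illustrative sketch: choosing a quantization range from calibration data.
import random

random.seed(0)
# Simulated calibration activations: mostly small values plus rare outliers.
acts = [random.gauss(0.0, 1.0) for _ in range(10_000)] + [40.0, -55.0]

def minmax_threshold(values):
    """Clip at the largest observed magnitude."""
    return max(abs(v) for v in values)

def percentile_threshold(values, pct=99.9):
    """Clip at a high percentile, discarding rare outliers."""
    s = sorted(abs(v) for v in values)
    return s[min(len(s) - 1, int(len(s) * pct / 100))]

t_minmax = minmax_threshold(acts)      # dominated by the -55.0 outlier
t_pct = percentile_threshold(acts)     # ignores the rare outliers

# With 8-bit symmetric quantization, the step size is threshold / 127;
# a smaller threshold gives finer resolution for the bulk of the values.
step_minmax = t_minmax / 127
step_pct = t_pct / 127
```

The trade-off: percentile clipping saturates the outliers it excludes, but the finer step size for typical values usually reduces overall quantization error.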
Performance Analysis
Evaluate the performance (latency, throughput, memory usage) and accuracy trade-offs of quantized LLMs.
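A measurement harness for this kind of evaluation can be sketched as follows. `fake_generate` is a placeholder standing in for a real quantized-model generation call; the warmup count, iteration count, and reported metrics are illustrative choices.

```python
# Minimal latency/throughput benchmark sketch for an inference call.
import statistics
import time

def fake_generate(prompt_tokens=32, new_tokens=64):
    """Placeholder for a real model forward/generate call."""
    time.sleep(0.001)
    return new_tokens

def benchmark(fn, warmup=3, iters=20):
    for _ in range(warmup):            # warm up caches/compilation before timing
        fn()
    latencies, tokens = [], 0
    for _ in range(iters):
        start = time.perf_counter()
        tokens += fn()
        latencies.append(time.perf_counter() - start)
    return {
        "p50_ms": statistics.median(latencies) * 1e3,
        "throughput_tok_s": tokens / sum(latencies),
    }

stats = benchmark(fake_generate)
```

In practice you would report tail latency (p95/p99) alongside the median, and pair these numbers with an accuracy metric such as perplexity or task scores to expose the trade-off.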
Hardware-Specific Optimization
Optimize quantized LLM inference for different hardware targets, including CPUs and GPUs.
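The flavor of this work can be sketched as a per-target configuration dispatch: different hardware favors different weight formats and kernels. Every table entry below is an illustrative assumption, not a recommendation for any specific device.

```python
# Illustrative dispatch sketch: map a hardware target to an inference config.
# All entries are assumptions for illustration only.

def pick_config(target: str) -> dict:
    configs = {
        "cpu": {"weight_dtype": "int8", "compute_dtype": "fp32",
                "note": "int8 GEMM kernels on modern x86"},
        "gpu": {"weight_dtype": "int4", "compute_dtype": "fp16",
                "note": "weight-only int4 with fused dequantize-then-GEMM"},
    }
    if target not in configs:
        raise ValueError(f"unknown target: {target}")
    return configs[target]

gpu_cfg = pick_config("gpu")
```

Real deployments extend this idea with runtime feature detection (instruction sets, GPU compute capability) rather than a static string key.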
Deployment Frameworks
Utilize specialized frameworks and libraries (e.g., TensorRT-LLM, vLLM, TGI, ONNX Runtime) for deploying quantized LLMs efficiently.
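Serving frameworks such as vLLM and TGI expose an OpenAI-compatible HTTP API, so a client interacts with a quantized model the same way regardless of backend. The sketch below only builds the JSON request body such a client would send; the model id and endpoint URL are placeholders.

```python
# Hypothetical client-side sketch: build a completion request for an
# OpenAI-compatible serving endpoint. Model id and URL are placeholders.
import json

payload = {
    "model": "my-awq-quantized-model",   # placeholder model id
    "prompt": "Explain quantization in one sentence.",
    "max_tokens": 64,
    "temperature": 0.0,
}
body = json.dumps(payload)
# A client would POST `body` to e.g. http://localhost:8000/v1/completions
```

Keeping the client protocol-compatible makes it straightforward to benchmark the same workload across TensorRT-LLM, vLLM, and TGI backends.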
Deployment Strategies
Implement deployment strategies for serving quantized LLMs, considering scaling and resource management.
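Resource planning for serving starts with a back-of-envelope memory estimate: weight memory shrinks with quantization, while KV-cache memory grows with concurrency and sequence length. The model dimensions below are placeholders, roughly the shape of a 7B model, chosen only to make the arithmetic concrete.

```python
# Back-of-envelope capacity-planning sketch for serving a quantized LLM.
# All dimensions are illustrative placeholders, not a specific model's specs.

def weight_bytes(n_params, bits_per_weight):
    """Memory for model weights at a given bit width."""
    return n_params * bits_per_weight / 8

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """fp16 KV-cache memory for one sequence; factor 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

w = weight_bytes(7e9, 4)                  # 4-bit weights: 3.5e9 bytes
kv = kv_cache_bytes(32, 32, 128, 4096)    # one full-length fp16 sequence
```

Estimates like these drive scaling decisions: the KV cache, not the quantized weights, often dominates memory at high concurrency, which is why serving systems invest in cache paging and quantized KV formats.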
© 2025 ApX Machine Learning