Performance Profiling and Debugging in Distributed Environments
Dapper, a Large-Scale Distributed Systems Tracing Infrastructure, Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, Chandan Shanbhag, 2010, Google Technical Report (Google, Inc.) - Foundational paper introducing the concepts of distributed tracing, essential for understanding modern observability tools.
Site Reliability Engineering: How Google Runs Production Systems, Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy, 2016 (O'Reilly Media) - Comprehensive guide to building and operating reliable distributed systems, including chapters on monitoring, debugging, and incident response.
OpenTelemetry Documentation, OpenTelemetry Authors, 2024 - Official guide for implementing and utilizing OpenTelemetry for distributed tracing, metrics, and log collection across various services.
NVIDIA Nsight Systems Documentation, NVIDIA Corporation, 2024 (NVIDIA Corporation) - Provides detailed instructions and best practices for profiling GPU-accelerated applications, crucial for optimizing LLM and retriever inference.