Advanced Monitoring, Logging, and Alerting for Distributed RAG
OpenTelemetry Documentation, OpenTelemetry Authors, 2025 - A guide to instrumenting applications for distributed traces, metrics, and logs using the OpenTelemetry standard.
Site Reliability Engineering: How Google Runs Production Systems, Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy (eds.), 2016 (O'Reilly Media) - A text covering principles of monitoring, alerting, Service Level Objectives (SLOs), and operational excellence for distributed systems.
Designing Machine Learning Systems, Chip Huyen, 2022 (O'Reilly Media) - An overview of building and operating ML systems, including sections on monitoring, logging, alerting, and MLOps practices applicable to AI systems.