Site Reliability Engineering: How Google Runs Production Systems, Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy (eds.), 2016 (O'Reilly Media) - A guide to SRE practices, including monitoring, alerting, and system health for distributed systems.
Prometheus Documentation, The Prometheus Authors, 2024 - Official documentation for the open-source monitoring system, explaining metrics collection, querying, and alerting.
OpenTelemetry Documentation, The OpenTelemetry Authors, 2025 - Official documentation for the observability framework, detailing distributed tracing, metrics, and logging for cloud applications.