Site Reliability Engineering: How Google Runs Production Systems, Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy (eds.), 2016 (O'Reilly Media) - Presents best practices for monitoring, alerting, incident response, and managing operational costs.
Prometheus Documentation, The Prometheus Authors, 2024 (Cloud Native Computing Foundation) - Official documentation covering Prometheus architecture, metric collection, and alerting.
Cloud FinOps: Collaborative, Real-Time Cloud Value Decision Making, 2nd ed., J.R. Storment and Mike Fuller, 2023 (O'Reilly Media) - Offers a framework and practices for managing cloud spending, including monitoring, alerting, and cost anomaly detection.