Engineering MLOps: Rapidly build, test, and manage production-ready machine learning life cycles at scale, Emmanuel Raj, 2021 (Packt Publishing) - This book provides a holistic view of the MLOps lifecycle, detailing methods for robust deployment, monitoring, and automated recovery of machine learning models.
Site Reliability Engineering: How Google Runs Production Systems, Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy (eds.), 2016 (O'Reilly Media) - A foundational text that establishes principles for operating reliable systems, including the use of SLOs, comprehensive monitoring, and automated incident response, which are prerequisites for effective rollbacks.