Site Reliability Engineering: How Google Runs Production Systems, Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy, 2016 (O'Reilly Media, Inc.) - A foundational guide to Site Reliability Engineering that explains Service Level Indicators, Service Level Objectives, and error budgets. These explanations provide the background for applying the concepts to ML systems.
Building Machine Learning Powered Applications: Going from Idea to Product, Emmanuel Ameisen, 2020 (O'Reilly Media) - Discusses the ML product development cycle, including strategies for monitoring and maintaining models once deployed, which is directly relevant to establishing production ML SLOs.
Fairness and Machine Learning: Limitations and Opportunities, Solon Barocas, Moritz Hardt, and Arvind Narayanan, 2023 (MIT Press) - A comprehensive guide to fairness in machine learning, with definitions and discussions of fairness metrics that can serve as SLIs for bias-related Service Level Objectives.