Distributed Systems: Principles and Paradigms, Andrew S. Tanenbaum and Maarten van Steen, 2007 (Pearson) - A foundational textbook on the principles of distributed computing, covering communication, processes, naming, consistency, fault tolerance, and security; it provides the theoretical grounding for building distributed systems.
Site Reliability Engineering: How Google Runs Production Systems, Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy (eds.), 2016 (O'Reilly Media) - Describes Site Reliability Engineering practices at Google, with guidance on managing large, complex distributed systems for availability, performance, and incident response; relevant to running RAG systems in production.