Learning from Delayed Rewards, Christopher John Cornish Hellaby Watkins, 1989 (University of Cambridge) - The original PhD thesis that introduced the Q-learning algorithm, establishing the principles of off-policy temporal-difference control.
Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, 2018 (The MIT Press) - A standard textbook in reinforcement learning that explains Q-learning in depth, covering its update rule, the full algorithm, and its theoretical foundations.