Trust Region Policy Optimization, John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel, 2015, Proceedings of the 32nd International Conference on Machine Learning (ICML), Vol. 37 (PMLR) - This foundational paper introduces Trust Region Policy Optimization (TRPO), detailing its theoretical basis, constrained optimization, and practical approximations.
Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, 2018 (MIT Press) - A textbook covering reinforcement learning, including policy gradient and actor-critic methods, providing context for stability challenges in policy optimization.
A Natural Policy Gradient, Sham M. Kakade, 2001, Advances in Neural Information Processing Systems, Vol. 14 (NeurIPS) - This paper presents the theoretical basis for natural policy gradients using the Fisher Information Matrix, influencing TRPO's approach to constrained optimization.