Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, 2018 (MIT Press) - A comprehensive textbook providing a detailed treatment of policy gradient methods, the role of baselines in variance reduction, and the principles of actor-critic architectures.
Policy Gradient Methods for Reinforcement Learning with Function Approximation, Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour, 1999, Advances in Neural Information Processing Systems, Vol. 12 (NeurIPS), DOI: 10.5555/3008751.3008851 - Introduces the policy gradient theorem and formally shows that subtracting a state-dependent baseline, such as the value function, does not bias the policy gradient estimate while reducing its variance.
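The baseline idea from this paper can be sketched in a few lines: each score-function term is scaled by the return minus a state-dependent baseline, which leaves the expected gradient unchanged but typically shrinks its variance. A minimal NumPy illustration follows; the function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def policy_gradient_terms(log_prob_grads, returns, baselines):
    """Per-timestep REINFORCE terms: grad log pi(a_t|s_t) * (G_t - b(s_t)).

    Subtracting a state-dependent baseline b(s_t) does not change the
    expected gradient, but it usually reduces the variance of the estimate.
    """
    advantages = returns - baselines
    return log_prob_grads * advantages[:, None]

# Toy example: 3 timesteps, 2 policy parameters.
log_prob_grads = np.array([[0.5, -1.0], [0.2, 0.3], [-0.4, 0.1]])
returns = np.array([10.0, 9.0, 7.5])    # Monte Carlo returns G_t
baselines = np.array([9.5, 8.8, 7.9])   # value-function estimates V(s_t)

grad_estimate = policy_gradient_terms(log_prob_grads, returns, baselines).sum(axis=0)
print(grad_estimate)
```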
Asynchronous Methods for Deep Reinforcement Learning, Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu, 2016, Proceedings of the 33rd International Conference on Machine Learning, Vol. 48 (PMLR) - A seminal paper demonstrating successful deep reinforcement learning with an actor-critic architecture, using the value function as a baseline to stabilize training and reduce variance through asynchronous updates.
High-Dimensional Continuous Control Using Generalized Advantage Estimation, John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel, 2015, International Conference on Learning Representations (ICLR), DOI: 10.48550/arXiv.1506.02438 - Introduces Generalized Advantage Estimation (GAE), a method for achieving a good bias-variance trade-off in the advantage estimate, which is important for advanced policy gradient methods.
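GAE amounts to a single backward recursion over the temporal-difference errors delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), with lambda controlling the bias-variance trade-off: lambda = 0 recovers one-step TD advantages, lambda = 1 recovers Monte Carlo returns minus the value baseline. A minimal sketch with illustrative names, assuming a single episode with a bootstrap value for the final state:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one episode.

    `values` has length len(rewards) + 1 (includes the bootstrap value for
    the final state). lam=0 gives one-step TD advantages; lam=1 gives
    Monte Carlo returns minus the value baseline.
    """
    deltas = rewards + gamma * values[1:] - values[:-1]   # TD errors delta_t
    advantages = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

rewards = np.array([1.0, 0.0, 1.0, 1.0])
values = np.array([0.8, 0.9, 0.7, 0.6, 0.0])   # V(s_0..s_T), terminal value 0
print(gae_advantages(rewards, values))
```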