Asynchronous Methods for Deep Reinforcement Learning, Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, Koray Kavukcuoglu, 2016 (ICML 2016), DOI: 10.48550/arXiv.1602.01783 - Introduces the Asynchronous Advantage Actor-Critic (A3C) algorithm, a method for stable and efficient deep reinforcement learning through asynchronous updates. It also implicitly describes the synchronous variant (A2C).
Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, 2018 (MIT Press) - A standard textbook on reinforcement learning, providing theoretical foundations for policy gradients, value functions, and actor-critic methods.
High-Dimensional Continuous Control Using Generalized Advantage Estimation, John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, Pieter Abbeel, 2016 (International Conference on Learning Representations, ICLR), DOI: 10.48550/arXiv.1506.02438 - Presents Generalized Advantage Estimation (GAE), an approach that reduces the variance of advantage estimates at the cost of some bias and can be used with actor-critic algorithms like A2C and A3C.
Proximal Policy Optimization Algorithms, John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov, 2017 (arXiv preprint arXiv:1707.06347), DOI: 10.48550/arXiv.1707.06347 - Introduces Proximal Policy Optimization (PPO), a widely used policy gradient algorithm that builds upon ideas from A2C and TRPO, offering a balance of performance and ease of implementation.