Off-Policy Deep Reinforcement Learning without Exploration, Scott Fujimoto, David Meger, Doina Precup, 2019. Proceedings of the 36th International Conference on Machine Learning, PMLR, Vol. 97. DOI: 10.48550/arXiv.1812.02900 - The original paper introducing Batch-Constrained Q-learning (BCQ), which uses a generative model and a perturbation network to stay close to the dataset's behavior policy.
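To make the mechanism concrete, here is a minimal, hypothetical PyTorch sketch of BCQ's action-selection rule, not the authors' code: `vae_decode` (the generative model's decoder), `xi` (the perturbation network), `q` (the Q-network), and the hyperparameters `n_samples` and `phi` are all assumed placeholder names.

```python
# Hypothetical sketch of BCQ-style action selection (not the authors' code).
# Assumes pre-trained networks: a conditional VAE decoder `vae_decode`,
# a perturbation network `xi`, and a Q-network `q` (all torch.nn.Modules).
import torch

def bcq_select_action(state, vae_decode, xi, q, n_samples=10, phi=0.05, max_action=1.0):
    """Pick an action for one state by perturbing VAE samples and maximizing Q."""
    # Repeat the state so n candidate actions can be scored in one batch.
    states = state.unsqueeze(0).repeat(n_samples, 1)        # (n, state_dim)
    # Sample candidate actions from the generative model of the batch data.
    candidates = vae_decode(states)                          # (n, action_dim)
    # Apply the perturbation network, constrained to +/- phi * max_action.
    perturbed = candidates + phi * max_action * torch.tanh(xi(states, candidates))
    perturbed = perturbed.clamp(-max_action, max_action)
    # Greedy selection under the learned Q-function.
    q_values = q(states, perturbed).squeeze(-1)              # (n,)
    return perturbed[q_values.argmax()]
```

Constraining candidates to (perturbed) samples from the dataset's generative model is what keeps the greedy step from querying Q on out-of-distribution actions.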
Conservative Q-Learning for Offline Reinforcement Learning, Aviral Kumar, Aurick Zhou, George Tucker, Sergey Levine, 2020. Advances in Neural Information Processing Systems (NeurIPS), Vol. 33. DOI: 10.48550/arXiv.2006.04779 - The foundational paper on Conservative Q-Learning (CQL), detailing its objective to mitigate Q-value overestimation for out-of-distribution actions through a regularization term.
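A short sketch of that regularization term for a discrete-action critic may help; this is an illustrative reading of the CQL(H) penalty, not the reference implementation, and `q_values`, `actions`, and `alpha` are assumed names.

```python
# Illustrative sketch of the CQL(H) penalty for a discrete-action Q-network.
# `q_values` holds Q(s, .) for a batch of dataset states, shape (batch, num_actions);
# `actions` holds the dataset (behavior) actions, shape (batch,), dtype long.
import torch

def cql_penalty(q_values, actions):
    """Soft-maximum of Q over all actions minus Q at the dataset action, averaged."""
    # Pushes down Q on out-of-distribution actions (logsumexp over actions)...
    logsumexp = torch.logsumexp(q_values, dim=1)                  # (batch,)
    # ...while pushing up Q on actions actually present in the dataset.
    data_q = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)  # (batch,)
    return (logsumexp - data_q).mean()

# The full critic loss would then be the usual TD error plus a weighted penalty:
# loss = td_loss + alpha * cql_penalty(q_values, actions)
```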
D4RL: Datasets for Deep Data-Driven Reinforcement Learning, Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, Sergey Levine, 2020. arXiv preprint. DOI: 10.48550/arXiv.2004.07219 - Introduces the D4RL benchmark suite, a collection of standardized datasets widely used for evaluating and comparing offline reinforcement learning algorithms across various environments.
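For reference, loading one of the benchmark datasets looks roughly like the following, assuming the d4rl and gym packages are installed; `halfcheetah-medium-v2` is one of the published dataset names.

```python
# Minimal sketch of loading a D4RL dataset.
import gym
import d4rl  # registers the offline environments with gym on import

env = gym.make('halfcheetah-medium-v2')
# Transition tuples formatted for Q-learning: observations, actions,
# rewards, next_observations, terminals (each a numpy array).
dataset = d4rl.qlearning_dataset(env)
print(dataset['observations'].shape, dataset['actions'].shape)
```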