Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, William Fedus, Barret Zoph, Noam Shazeer, 2022, Journal of Machine Learning Research, Vol. 23. DOI: 10.48550/arXiv.2101.03961 - Demonstrates how sparse MoE layers enable scaling Transformer models to trillions of parameters while keeping per-token computational cost constant, introducing the Switch routing mechanism, which dispatches each token to a single expert.
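To make the "constant per-token cost" idea concrete, here is a minimal NumPy sketch of Switch-style top-1 routing, not the authors' implementation: the router picks one expert per token, so adding more experts grows total parameters without growing per-token compute. All sizes and the tiny FFN experts are hypothetical.

```python
import numpy as np

def switch_layer(tokens, w_router, experts):
    """Illustrative Switch-style layer: top-1 routing sends each token to
    exactly one expert, so per-token FLOPs do not depend on len(experts)."""
    logits = tokens @ w_router                      # [num_tokens, num_experts]
    logits -= logits.max(-1, keepdims=True)         # stabilize softmax
    probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
    expert_idx = probs.argmax(-1)                   # top-1: one expert per token
    gate = probs[np.arange(len(tokens)), expert_idx]
    out = np.empty_like(tokens)
    for e, expert in enumerate(experts):            # dispatch tokens to their expert
        mask = expert_idx == e
        if mask.any():
            out[mask] = expert(tokens[mask]) * gate[mask, None]
    return out

# Hypothetical configuration: 8 experts, model dim 16, each expert a small FFN.
rng = np.random.default_rng(0)
d, n_experts = 16, 8
experts = [
    (lambda w1, w2: (lambda x: np.maximum(x @ w1, 0) @ w2))(
        rng.normal(size=(d, 4 * d)) * 0.1, rng.normal(size=(4 * d, d)) * 0.1
    )
    for _ in range(n_experts)
]
w_router = rng.normal(size=(d, n_experts)) * 0.1
y = switch_layer(rng.normal(size=(32, d)), w_router, experts)
print(y.shape)  # (32, 16): each token paid for only one expert's FFN
```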
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts, Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V. Le, Yonghui Wu, Zhifeng Chen, Claire Cui, 2022, International Conference on Machine Learning (ICML). DOI: 10.48550/arXiv.2112.06905 - Presents GLaM, a large-scale MoE language model, focusing on efficient training and inference strategies for very sparse MoE architectures.
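As a companion to the top-1 sketch above, the snippet below illustrates top-2 gating in the spirit of GLaM's sparse MoE layers (a sketch with hypothetical shapes, not the paper's code): each token selects its two highest-scoring experts and mixes them with renormalized gate weights, so only a small fraction of the experts run per token.

```python
import numpy as np

def top2_gates(tokens, w_router):
    """Illustrative top-2 gating: return, for each token, the indices of its
    two best experts and the renormalized mixture weights over that pair."""
    logits = tokens @ w_router
    logits -= logits.max(-1, keepdims=True)         # stabilize softmax
    probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
    top2 = np.argsort(probs, axis=-1)[:, -2:]       # indices of the 2 best experts
    g = np.take_along_axis(probs, top2, axis=-1)
    g /= g.sum(-1, keepdims=True)                   # renormalize over the chosen pair
    return top2, g

# Hypothetical sizes: 64 experts, model dim 16, 4 tokens.
rng = np.random.default_rng(1)
d, n_experts, n_tokens = 16, 64, 4
idx, gates = top2_gates(rng.normal(size=(n_tokens, d)),
                        rng.normal(size=(d, n_experts)) * 0.1)
print(idx.shape, gates.shape)  # (4, 2) (4, 2): only 2 of 64 experts run per token
```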