RoFormer: Enhanced Transformer with Rotary Position Embedding, Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, Yunfeng Liu, 2021, arXiv preprint, DOI: 10.48550/arXiv.2104.09864 - Presents Rotary Position Embedding (RoPE), a positional encoding method that rotates query and key vectors by position-dependent angles so attention depends on relative position, enhancing context-length capabilities; RoPE is mentioned directly in the section.
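To make the rotation concrete, here is a minimal NumPy sketch of the interleaved-pair RoPE formulation; the function name `rotary_embed` and the (seq_len, dim) input layout are assumptions for this example, not the paper's reference code.

```python
import numpy as np

def rotary_embed(x, base=10000.0):
    """Apply rotary position embedding (RoPE) to a sequence of vectors.

    x: array of shape (seq_len, dim), dim even. Each consecutive pair of
    features is rotated by a position-dependent angle, so dot products
    between rotated queries and keys depend only on relative position.
    """
    seq_len, dim = x.shape
    half = dim // 2
    # Per-pair inverse frequencies, 10000^(-2i/d), as in the RoFormer paper.
    inv_freq = base ** (-np.arange(half) * 2.0 / dim)
    # Rotation angle for every (position, feature-pair) combination.
    theta = np.outer(np.arange(seq_len), inv_freq)   # (seq_len, half)
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:, 0::2], x[:, 1::2]                  # even / odd feature pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because rotating a query by angle m·θ and a key by n·θ leaves their dot product depending only on m − n, attention scores become a function of relative position alone, which is the property the paper exploits.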
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean, 2017, International Conference on Learning Representations (ICLR), DOI: 10.48550/arXiv.1701.06538 - The foundational paper introducing sparsely-gated Mixture-of-Experts layers, which increase model capacity without a proportional increase in computational cost by routing each input to only a few expert subnetworks.
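As a rough illustration of the gating idea only (the paper's full method also adds tunable noise and load-balancing losses), here is a minimal top-k sketch for a single token; `moe_layer` and its argument shapes are invented for the example.

```python
import numpy as np

def moe_layer(x, experts, w_gate, k=2):
    """Sparsely-gated mixture-of-experts applied to one token vector.

    x: input of shape (dim,); experts: list of callables mapping dim -> dim;
    w_gate: gating matrix of shape (dim, num_experts). Only the top-k
    experts by gate score are evaluated, so total capacity grows with the
    number of experts while per-token compute stays roughly fixed.
    """
    logits = x @ w_gate                        # gate score per expert
    top = np.argsort(logits)[-k:]              # indices of the k best experts
    gate = np.exp(logits[top] - logits[top].max())
    gate /= gate.sum()                         # softmax over the selected experts
    # Weighted combination of only the chosen experts' outputs.
    return sum(g * experts[i](x) for g, i in zip(gate, top))
```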
Distilling the Knowledge in a Neural Network, Geoffrey Hinton, Oriol Vinyals, Jeff Dean, 2015, arXiv preprint, DOI: 10.48550/arXiv.1503.02531 - Introduces knowledge distillation, a method where a smaller "student" model learns from a larger "teacher" model, useful for transferring knowledge during architectural changes.
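A hedged sketch of the distillation objective follows: the temperature-softened targets and the T² scaling follow the paper, while the blending weight `alpha` and the function name are illustrative choices for this example.

```python
import numpy as np

def distillation_loss(student_logits, teacher_logits, label, T=2.0, alpha=0.5):
    """Hinton-style knowledge distillation loss for one example.

    Blends cross-entropy on the hard label with KL divergence between
    temperature-softened teacher and student distributions. The T**2
    factor keeps soft-target gradients comparable across temperatures.
    """
    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    p_teacher = softmax(teacher_logits / T)
    p_student = softmax(student_logits / T)
    # Soft-target term: KL(teacher || student) at temperature T, scaled by T^2.
    soft = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student))) * T**2
    # Hard-label cross-entropy on the un-tempered student distribution.
    hard = -np.log(softmax(student_logits)[label])
    return alpha * soft + (1 - alpha) * hard
```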