Language Models are Unsupervised Multitask Learners, Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, 2019, Technical Report (OpenAI) - Describes the GPT-2 model and its training, including the practice of scaling the weights of residual layers at initialization, which helps control signal magnitudes through the residual stream and relates to final-layer initialization strategies.
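A minimal sketch of the residual scaling described in the GPT-2 report: weights are drawn from a small-standard-deviation normal and the residual-path output projections are shrunk by 1/√N, where N is the number of residual layers. The function name and the choice of which modules to pass in are hypothetical; this is illustrative, not the authors' code.

```python
import math
import torch.nn as nn

def scale_residual_init(proj: nn.Linear, n_residual_layers: int, std: float = 0.02) -> None:
    """GPT-2-style initialization for a residual-path output projection:
    sample from N(0, std) and shrink by 1/sqrt(N) so that the summed
    contributions of N residual layers keep a roughly constant scale."""
    nn.init.normal_(proj.weight, mean=0.0, std=std / math.sqrt(n_residual_layers))
    if proj.bias is not None:
        nn.init.zeros_(proj.bias)
```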
LLaMA: Open and Efficient Foundation Language Models, Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample, 2023, arXiv preprint, DOI: 10.48550/arXiv.2302.13971 - Details the architecture and training of the LLaMA model, explicitly mentioning the use of a small standard deviation (e.g., 0.02) for weight initialization, a common practice for modern large language models.
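A minimal sketch of the small-standard-deviation initialization the annotation mentions, assuming a model built from plain nn.Linear and nn.Embedding submodules; this is an illustrative pattern, not the exact LLaMA training code.

```python
import torch.nn as nn

def init_small_std(module: nn.Module, std: float = 0.02) -> None:
    """Initialize weights from N(0, std), with std = 0.02 by default,
    a common choice for large language models; intended to be applied
    recursively via model.apply()."""
    if isinstance(module, (nn.Linear, nn.Embedding)):
        nn.init.normal_(module.weight, mean=0.0, std=std)
    if isinstance(module, nn.Linear) and module.bias is not None:
        nn.init.zeros_(module.bias)

# Usage: model.apply(init_small_std)
```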
torch.nn.init, PyTorch Developers, 2022, Documentation (PyTorch) - Official documentation for PyTorch's initialization module, providing functions and guidelines for initializing neural network weights, relevant for practical implementation.
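A short usage example with calls from torch.nn.init (the trailing underscore marks in-place operations); the layer shape is arbitrary, and each call overwrites the previous values, so in practice one initializer is chosen per tensor.

```python
import torch.nn as nn

layer = nn.Linear(768, 768)

# In-place initializers from torch.nn.init; shown back to back for illustration.
nn.init.normal_(layer.weight, mean=0.0, std=0.02)           # small-std normal, as above
nn.init.xavier_uniform_(layer.weight)                       # Glorot/Xavier uniform
nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")  # He initialization for ReLU nets
nn.init.zeros_(layer.bias)                                  # zero the bias
```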