Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, 2017. Advances in Neural Information Processing Systems, Vol. 30 (Neural Information Processing Systems). DOI: 10.5555/3295222.3295349 - Foundational paper introducing the Transformer architecture and the self-attention mechanism, highlighting its quadratic complexity with respect to sequence length (illustrated in the first sketch after this list).
Longformer: The Long-Document Transformer, Iz Beltagy, Matthew E. Peters, and Arman Cohan, 2020. arXiv preprint arXiv:2004.05150. DOI: 10.48550/arXiv.2004.05150 - Introduces the Longformer model, which combines sliding-window and global attention patterns to efficiently process very long sequences (see the second sketch after this list).
Big Bird: Transformers for Longer Sequences, Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed, 2020. Advances in Neural Information Processing Systems, Vol. 33 (Curran Associates, Inc.). DOI: 10.5555/3455793.3455928 - Presents BigBird, a sparse attention mechanism that combines global, local, and random attention to achieve linear complexity while maintaining strong performance on long-sequence tasks (see the third sketch after this list).
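
A minimal NumPy sketch of the scaled dot-product self-attention introduced in Vaswani et al., not the authors' reference implementation: the explicit n × n score matrix is what makes compute and memory grow quadratically with sequence length. The function name and projection matrices are illustrative only.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence x of shape (n, d).

    The score matrix has shape (n, n), so both memory and compute grow
    quadratically with the sequence length n.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v               # project to queries/keys/values
    scores = q @ k.T / np.sqrt(k.shape[-1])           # (n, n) pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ v                                # weighted sum of values

# Toy usage: a sequence of 8 tokens with 16-dimensional embeddings.
rng = np.random.default_rng(0)
n, d = 8, 16
x = rng.standard_normal((n, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (8, 16)
```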
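
A rough sketch, under simplifying assumptions, of the kind of attention pattern Longformer uses: a sliding local window plus a few globally attending tokens. The function `longformer_style_mask` and its parameters are hypothetical names, not the paper's or any library's API, and a real implementation avoids materializing a dense n × n mask.

```python
import numpy as np

def longformer_style_mask(n, window, global_tokens):
    """Boolean attention mask combining a sliding local window with a few
    globally attending positions. True entries mark allowed query-key pairs;
    their count grows linearly with n for fixed window and global set.
    """
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True                 # local sliding window
    for g in global_tokens:
        mask[g, :] = True                     # global token attends everywhere
        mask[:, g] = True                     # and every token attends to it
    return mask

# Toy usage: 16 tokens, window of 2 on each side, token 0 (e.g. a [CLS]-like token) global.
mask = longformer_style_mask(16, window=2, global_tokens=[0])
print(mask.sum(), "allowed query-key pairs out of", mask.size)
```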
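
In the same spirit, a hedged sketch of a BigBird-style mask combining the three components the paper describes (global, local, and random attention). `bigbird_style_mask` is again an illustrative name; production implementations use blocked sparse kernels rather than a dense boolean matrix.

```python
import numpy as np

def bigbird_style_mask(n, window, global_tokens, num_random, seed=0):
    """Boolean mask combining global, sliding-window, and random attention.
    Each row keeps O(window + num_random + len(global_tokens)) entries,
    so the total number of allowed pairs scales linearly with n.
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True                                          # local window
        mask[i, rng.choice(n, size=num_random, replace=False)] = True  # random links
    for g in global_tokens:
        mask[g, :] = True                                              # global rows
        mask[:, g] = True                                              # global columns
    return mask

# Toy usage: 32 tokens, window of 3, two global tokens, three random keys per query.
mask = bigbird_style_mask(32, window=3, global_tokens=[0, 1], num_random=3)
print("density:", mask.mean())  # fraction of allowed pairs, well below 1.0
```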