Language Identification and Filtering



References

Bag of Tricks for Efficient Text Classification, Armand Joulin, Edouard Grave, Piotr Bojanowski, Tomas Mikolov, 2017 Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (Association for Computational Linguistics) DOI: 10.18653/v1/E17-2068 - Research paper describing the fastText text classifier, a fast linear model over bag-of-n-gram features. The same architecture underlies fastText's pretrained language identification models.
Compact Language Detector 2 (CLD2), Google, 2014 - Official GitHub repository for CLD2, providing the source code and technical documentation for its fast language detection implementation.
N-gram-based Text Categorization, William B. Cavnar, John M. Trenkle, 1994 Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval - Foundational paper on using character n-gram frequency profiles for text categorization, including language identification, which influenced many subsequent methods.
The Pile: An 800GB Dataset of Diverse Text for Language Modeling, Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy, 2021 arXiv preprint DOI: 10.48550/arXiv.2101.00027 - Describes the construction of a large, diverse dataset for training language models, including the data cleaning steps applied, such as language filtering.