Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman, Jeff Ullman, 2020 (Cambridge University Press) - This widely recognized textbook offers a thorough exposition of shingling, Jaccard similarity, MinHash, and Locality-Sensitive Hashing (LSH), which are core methods for detecting near-duplicates in large datasets.
Detecting Near-Duplicates for Web Crawling, Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, Geoffrey Zweig, 2000, Proceedings of the 5th International Conference on World Wide Web (ACM), DOI: 10.1145/336796.337050 - A seminal paper that introduced MinHash and its application to efficiently identify near-duplicate documents, a crucial step for managing redundancy in large web-scraped corpora, highly relevant to LLM data pipelines.