Apache Spark Documentation, Apache Spark Community, 2024 (Apache Software Foundation) - Official documentation for Apache Spark, providing comprehensive guides on its architecture, APIs (including PySpark DataFrames), distributed data processing, and optimization techniques relevant for building scalable pipelines.
Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman, 2020 (Cambridge University Press) - A widely cited textbook whose chapter on finding similar items explains the theoretical foundations and practical applications of MinHash and Locality-Sensitive Hashing (LSH) for efficiently detecting near-duplicate items in large datasets, a core technique for data deduplication.
Dask Documentation, Dask Developers, 2024 (Coiled) - Official documentation for Dask, covering its DataFrame API, distributed computing capabilities, and integration with the Python ecosystem, valuable for designing scalable data preprocessing workflows.