Apache Arrow Documentation, Apache Arrow Project, 2024 - Official documentation for Apache Arrow, detailing its in-memory columnar format, data types, and multi-language interoperability, essential for efficient data interchange and zero-copy reads.
Apache Parquet Documentation, Apache Parquet Project, 2022 (Apache Software Foundation) - Official documentation for Apache Parquet, describing its on-disk columnar storage, compression techniques, and encoding strategies for large-scale persistent datasets.
C-Store: A Column-Oriented DBMS, Michael Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Samuel Madden, Elizabeth J. O'Neil, Patrick E. O'Neil, Alex Rasin, Nga Tran, Stanley B. Zdonik, 2005, Proceedings of the 31st International Conference on Very Large Data Bases (ACM), DOI: 10.1109/VLDB.2005.1509709 - A foundational academic paper introducing the principles and advantages of column-oriented database management systems, which directly inform the design of modern columnar data formats like Apache Arrow and Parquet.
Hugging Face Datasets Library Documentation, Hugging Face, 2024 - Official documentation for the Hugging Face datasets library, which provides tools for managing and loading large text datasets for LLM training, often utilizing Apache Arrow internally for performance.