Data Versioning and Experiment Tracking for Machine Learning
Chapter 1: The Need for Reproducibility in Machine Learning
Challenges in Managing ML Projects
Why Git Alone Is Not Sufficient
Defining Reproducibility in ML
Components of a Reproducible ML Workflow
Introduction to Data Versioning Concepts
Introduction to Experiment Tracking Concepts
Chapter 2: Versioning Data with DVC
Data Versioning Strategies
Introducing Data Version Control (DVC)
Setting Up DVC in a Project
Tracking Data Files and Directories
Storing and Retrieving Data Versions
Connecting DVC to Remote Storage (S3, GCS, Azure Blob)
Switching Between Data Versions
Hands-on Practical: Versioning a Dataset
Chapter 3: Tracking Experiments with MLflow
The Importance of Experiment Tracking
Introducing MLflow Tracking
Logging Parameters and Metrics
Logging Artifacts (Models, Plots, Files)
Organizing Runs with Experiments
Comparing Experiment Runs
Hands-on Practical: Tracking a Training Run
Chapter 4: Integrating DVC and MLflow for Reproducible Workflows
Connecting Data Versions to Experiments
Structuring Projects for Integration
Logging DVC Metadata in MLflow
Reproducing DVC Pipelines
Tracking DVC Pipeline Metrics
Combining DVC Pipelines and MLflow Tracking
Best Practices for Integrated Workflows
Hands-on Practical: Building an Integrated Pipeline