This hands-on exercise guides you through the fundamental DVC workflow. It covers initializing DVC in a project, tracking a dataset, configuring remote storage, pushing data, modifying it, and switching between versions.

Prerequisites

Before you begin, ensure you have the following installed:

- Git: DVC relies heavily on Git. If you don't have it, download and install it from git-scm.com.
- DVC: Install DVC using pip. To include support for common cloud storage options (like AWS S3, Google Cloud Storage, Azure Blob), you can install optional dependencies. This exercise uses a local directory as remote storage, so the basic installation is sufficient; the extras become useful once you move to cloud storage:

      pip install "dvc[s3,gs,azure]"  # Or just: pip install dvc

- Python (Optional): We'll use a simple Python script to create some dummy data, but you can also create a sample CSV file manually.

1. Project Setup

First, create a new directory for our practice project and navigate into it. Then, initialize both Git and DVC.

    mkdir dvc-practice
    cd dvc-practice

    # Initialize Git repository
    git init
    Initialized empty Git repository in /path/to/dvc-practice/.git/

    # Initialize DVC
    dvc init
    Initialized DVC repository.

    You can now commit the changes to git.

    +---------------------------------------------------------------------+
    |                                                                     |
    |        DVC has enabled anonymous aggregate usage analytics.         |
    |     Read the analytics documentation (and how to opt-out) here:     |
    |             <https://dvc.org/doc/user-guide/analytics>              |
    |                                                                     |
    +---------------------------------------------------------------------+

    What's next?
    ------------
    - Check out the documentation: <https://dvc.org/doc>
    - Get help and share ideas: <https://dvc.org/chat>
    - Star us on GitHub: <https://github.com/iterative/dvc>

Running dvc init performs several actions:

- It creates a .dvc directory to store DVC's internal information and configuration.
- It creates .dvcignore, which works like .gitignore but for DVC-specific patterns.
- It creates .dvc/.gitignore, which keeps the data cache (.dvc/cache) and other internal paths out of Git.

Let's commit these initial setup files to Git (dvc init normally stages them already, so the git add is harmless if redundant):

    git add .dvc .dvcignore
    git commit -m "Initialize DVC"

2. Prepare Sample Data

We need some data to version. Let's create a data directory and add a simple CSV file inside it. You can create data/samples.csv manually or use this small Python script:

    # create_data.py
    import os
    import csv

    # Create the data directory if it does not exist yet
    os.makedirs('data', exist_ok=True)

    header = ['id', 'value']
    data = [
        [1, 10.5],
        [2, 15.2],
        [3, 20.0]
    ]

    # newline='' lets the csv module control line endings itself
    with open('data/samples.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(data)

    print("Created data/samples.csv")

Save this as create_data.py and run it:

    python create_data.py
    Created data/samples.csv

You should now have a data directory containing samples.csv.

3. Track Data with DVC

Now, let's tell DVC to start tracking this dataset.

    dvc add data/samples.csv

You'll see output indicating that DVC is processing the file. This command does three main things:

- It copies data/samples.csv into DVC's cache (located inside .dvc/cache). The cached file is named based on its content hash (typically MD5).
- It creates a small text file named data/samples.csv.dvc.
This file acts as a pointer or metadata file containing information about the original data, including its hash.
- It creates or updates data/.gitignore to ensure the actual data file (data/samples.csv) isn't tracked by Git.

Let's examine the .dvc file:

    cat data/samples.csv.dvc

The output will look something like this (the hash will differ based on the exact file content, and recent DVC versions also record fields such as the file size):

    outs:
    - md5: a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6  # Example hash
      path: samples.csv

This file tells DVC that samples.csv (relative to the .dvc file's location) is associated with the specified MD5 hash.

4. Commit Data Placeholder to Git

The key step now is to commit the .dvc placeholder file and the data/.gitignore file to Git. The actual data (data/samples.csv) remains untracked by Git because it's listed in data/.gitignore.

    git add data/samples.csv.dvc data/.gitignore
    git commit -m "Track initial dataset using DVC"

Your Git history now records the state of your data (via the .dvc file) at this point, without storing the large data file itself.

5. Configure Remote Storage

To share data or back it up, we need to configure remote storage. For simplicity in this exercise, we'll use a directory on your local filesystem outside the project directory to simulate a remote location. In a real project, you'd typically use S3, GCS, Azure Blob, or another supported backend.

    # Create a directory to act as remote storage (adjust path if needed)
    mkdir -p /tmp/dvc-practice-storage

    # Configure this directory as a DVC remote named 'myremote'
    # The -d flag makes it the default remote
    dvc remote add -d myremote /tmp/dvc-practice-storage

This command updates the DVC configuration file located at .dvc/config. Let's commit this configuration change:

    git add .dvc/config
    git commit -m "Configure local DVC remote storage"

Now, your project knows where to push and pull data from.

6. Push Data to Remote Storage

With the remote configured, we can push the tracked data (currently residing in the local cache) to the remote storage.

    dvc push

DVC checks the .dvc files associated with the current Git commit, finds the corresponding data files in the local cache (.dvc/cache), and uploads them to the myremote location (/tmp/dvc-practice-storage). If you inspect the /tmp/dvc-practice-storage directory, you'll find subdirectories named after the first two characters of the MD5 hash, each containing a data file named by the remaining characters of the hash. (DVC 3.x nests these subdirectories under files/md5/.)

7. Modify Data and Track Changes

Let's simulate updating our dataset. Modify data/samples.csv by adding a new row or changing a value. For instance, add the line 4,25.8 to the end of the file, so that it reads (the last row is new):

    id,value
    1,10.5
    2,15.2
    3,20.0
    4,25.8

Now, ask DVC about the status of your data:

    dvc status

DVC compares the current data/samples.csv file with the hash stored in data/samples.csv.dvc and reports that it has been modified.

To track this new version, use dvc add again:

    dvc add data/samples.csv

This updates the hash inside data/samples.csv.dvc to reflect the new content and copies the modified file to the cache.

Commit the updated .dvc file to Git to record this new version:

    git add data/samples.csv.dvc
    git commit -m "Update dataset (v2)"

Finally, push the new version of the data to the remote storage:

    dvc push

Only the new data file (corresponding to the new hash) will be uploaded.

8. Simulate Retrieving Data

Imagine you've cloned this repository on a new machine, or you've accidentally deleted your local data or cache. Let's simulate this:

    # Remove the data file from the working directory
    rm data/samples.csv

    # (Optional but illustrative) Remove the local DVC cache
    rm -rf .dvc/cache

If you run git status, Git won't notice the missing data/samples.csv because it's ignored.
However, dvc status will show that the data file tracked by data/samples.csv.dvc is missing from the workspace.

To restore the data corresponding to the current Git commit (which points to v2 of the data), use dvc pull:

    dvc pull

DVC consults data/samples.csv.dvc, finds the required hash, downloads the corresponding file from myremote storage into the local cache (if not already there), and places a copy (or link, depending on configuration) at data/samples.csv. Verify that data/samples.csv is now restored with the v2 content (including the row 4,25.8).

9. Switching Between Data Versions

This is where the connection between Git and DVC becomes apparent. Your Git history tracks different versions of the .dvc pointer files. You can check out an older commit to work with an older version of the data.

First, find a commit that still points at the initial dataset version (you can use git log). In this exercise the previous commit, "Configure local DVC remote storage", qualifies: the .dvc file did not change between "Track initial dataset using DVC" and that commit.

    # Check out the previous commit (adjust 'HEAD~1' if needed)
    git checkout HEAD~1
    Note: switching to 'HEAD~1'.
    ...
    HEAD is now at <commit_hash> Configure local DVC remote storage

Now, your data/samples.csv.dvc file contains the original hash. However, the actual data/samples.csv file in your workspace is still the v2 version restored in the previous step. Use dvc checkout to synchronize your workspace data with the .dvc files in the currently checked-out Git commit:

    dvc checkout
    M       data/samples.csv

The M indicates that DVC modified the file. If you deleted the local cache in step 8, the original version is no longer cached and dvc checkout cannot restore it; run dvc pull instead, which downloads the file from myremote and then checks it out.

Check the content of data/samples.csv.
It should now contain the original data (without the row 4,25.8).

To get back to the latest version:

    # Switch back to the main branch (or your working branch)
    git checkout main  # Or your branch name, e.g., master

    # Synchronize data with the latest .dvc file
    dvc checkout
    M       data/samples.csv

Verify that data/samples.csv again contains the v2 data.

Summary

In this hands-on exercise, you successfully used DVC's core commands:

- dvc init: Initialized DVC in a Git repository.
- dvc add: Started tracking data files, creating .dvc metadata files.
- git commit: Saved versions of the .dvc files (data pointers) in Git history.
- dvc remote add: Configured a location (local directory in this case) to store data.
- dvc push: Uploaded data files from the local cache to remote storage.
- dvc pull: Downloaded data files from remote storage based on .dvc files.
- git checkout: Switched between different code/data pointer versions.
- dvc checkout: Synchronized the workspace data to match the version specified by the .dvc files in the current Git commit.

You've seen how Git tracks the code and the references to data versions, while DVC manages the actual data files and their synchronization with remote storage. This combination allows you to version large datasets effectively alongside your code, forming a foundation for reproducible machine learning projects.
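
To make the pointer-plus-cache idea concrete, here is a minimal Python sketch of content-addressed storage. It is a toy model, not DVC's implementation: the function names (file_md5, toy_add, toy_checkout, demo) are hypothetical, the cache layout is simplified to the two-character prefix directories described in step 6, and the hash DVC records for a given file may differ from a plain MD5 depending on the DVC version.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the MD5 hex digest of a file's bytes, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def toy_add(data_file: Path, cache_dir: Path) -> str:
    """Copy data_file into cache_dir/<hash[:2]>/<hash[2:]> and return the hash.

    The returned hash plays the role of the pointer stored in a .dvc file.
    """
    md5 = file_md5(data_file)
    target = cache_dir / md5[:2] / md5[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():  # identical content is stored only once
        target.write_bytes(data_file.read_bytes())
    return md5


def toy_checkout(md5: str, data_file: Path, cache_dir: Path) -> None:
    """Overwrite data_file with the cached content that the pointer names."""
    data_file.write_bytes((cache_dir / md5[:2] / md5[2:]).read_bytes())


def demo() -> tuple:
    """Add two versions of a file, then restore the first from the cache."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        cache = root / "cache"
        data = root / "samples.csv"

        data.write_text("id,value\n1,10.5\n")
        v1 = toy_add(data, cache)  # pointer for version 1

        data.write_text("id,value\n1,10.5\n2,15.2\n")
        v2 = toy_add(data, cache)  # pointer for version 2

        toy_checkout(v1, data, cache)  # switch the workspace back to v1
        return v1, v2, data.read_text()


if __name__ == "__main__":
    print(demo())
```

In this model, committing the returned hash to Git corresponds to committing the .dvc file, and toy_checkout corresponds to what dvc checkout does after git checkout has selected which pointer is current.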