Data Version Control
Why Data Version Control Matters
In traditional software development, Git is the standard tool for code version control. However, in ML/AI projects, data is equally a core asset, yet it often lacks systematic version management.
- Reproducibility of ML experiments: A model's performance is jointly determined by the code, hyperparameters, and training data. Without the ability to trace back data versions, experiment results cannot be reproduced. Data version control ensures that every experiment is tied to a specific data snapshot.
- Dataset iteration: In real-world projects, datasets evolve continuously — new annotations are added, labels are corrected, and samples are expanded. Without version control, it is difficult to track the state of a dataset at different points in time.
- Team collaboration: When multiple people collaborate, they need to work with a consistent dataset. Manually copying data easily leads to version confusion; version control tools guarantee that everyone pulls the same version of the data.
DVC (Data Version Control)
DVC is currently the most popular open-source data version control tool, developed by Iterative.ai and designed specifically for ML projects.
Core Concept
DVC's design philosophy: use Git to manage lightweight metadata files (.dvc files), while storing the actual large files in remote storage. This leverages Git's version control capabilities without bloating the repository with large files. The workflow is as follows:
- Use dvc add to track large files; DVC generates corresponding .dvc metadata files
- Commit the .dvc files to Git
- Use dvc push to upload the actual data to remote storage
- Others retrieve the code and data via git pull + dvc pull
Basic Commands
# Initialize DVC in a Git repository
dvc init
# Track data files or directories
dvc add data/training_set.csv
dvc add data/images/
# Commit the generated .dvc files and .gitignore to Git
git add data/training_set.csv.dvc data/images.dvc data/.gitignore
git commit -m "Add training data v1"
# Configure remote storage
dvc remote add -d myremote s3://my-bucket/dvc-storage
# Push data to remote storage / Pull data from remote
dvc push
dvc pull
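# Optionally tag this commit so the data version below can be checked out by name
git tag v1.0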
# Check out a historical version of the data
git checkout v1.0
dvc checkout
.dvc File Structure
A .dvc file is a YAML-formatted metadata file that uniquely identifies a data version through an MD5 hash:
outs:
- md5: a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6
  size: 1073741824
  hash: md5
  path: training_set.csv
When the data changes, the hash value changes accordingly. Once the updated .dvc file is committed to Git, a mapping between the code version and the data version is established.
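Beyond the command line, DVC also ships a Python API (dvc.api) that resolves files through exactly this code-to-data mapping. The snippet below is a minimal sketch; the file path and the v1.0 tag are assumptions carried over from the earlier examples.
import dvc.api

# Open the dataset exactly as it existed at the Git revision "v1.0"
# (rev accepts any Git revision: tag, branch, or commit hash)
with dvc.api.open("data/training_set.csv", rev="v1.0") as f:
    first_line = f.readline()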
Remote Storage
DVC supports multiple remote storage backends, configured via dvc remote add:
- Amazon S3: dvc remote add -d myremote s3://bucket/path
- Google Cloud Storage: dvc remote add -d myremote gs://bucket/path
- Azure Blob Storage: dvc remote add -d myremote azure://container/path
- SSH / Local filesystem: dvc remote add -d myremote ssh://user@host/path
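Whatever backend is used, a tracked file resolves to a content-addressed location inside that remote. As a small illustration, DVC's Python API can report that location; the path and remote name below are assumptions matching the earlier examples.
import dvc.api

# Resolve the remote-storage URL (e.g. an s3:// path derived from the file's hash)
url = dvc.api.get_url("data/training_set.csv", remote="myremote")
print(url)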
DVC Pipeline
DVC can also define data processing pipelines through dvc.yaml:
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - src/preprocess.py
      - data/raw/
    outs:
      - data/processed/
  train:
    cmd: python src/train.py --lr 0.001
    deps:
      - src/train.py
      - data/processed/
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false
Key advantage: DVC automatically detects changes in dependencies and only re-executes the necessary stages (dvc repro), avoiding redundant computation. Use dvc dag to visualize the pipeline's DAG structure.
DVC Experiments
DVC provides experiment tracking capabilities to record and compare results across different hyperparameter settings:
dvc exp run --set-param train.lr=0.01 # Run an experiment with modified parameters
dvc exp show # View all experiment results
dvc exp diff exp-abc123 exp-def456 # Compare two experiments
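Note that --set-param assumes the hyperparameters live in a parameters file (typically params.yaml) referenced by the stage's params section, rather than being hardcoded in cmd as in the earlier dvc.yaml example. Below is a minimal sketch of how train.py might read such a parameter through DVC's Python API; the train.lr key mirrors the command above and is otherwise an assumption.
import dvc.api

# Load parameters from params.yaml; `dvc exp run --set-param train.lr=0.01`
# rewrites that file before the stage is executed
params = dvc.api.params_show()
learning_rate = params["train"]["lr"]
print(f"Training with lr={learning_rate}")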
Other Data Version Control Tools
Git LFS (Large File Storage)
Git LFS is an open-source Git extension, developed by GitHub, that replaces large files in the repository with small pointer files and is tightly integrated into the Git workflow.
git lfs install
git lfs track "*.h5"
git add .gitattributes model.h5
git commit -m "Add model file"
It is suitable for files up to a few hundred MB, especially for teams already using GitHub/GitLab that do not want to introduce additional tools. Limitations: suboptimal support for very large-scale datasets, and relatively high storage costs.
Delta Lake
An open-source storage layer by Databricks that provides ACID transaction support for data lakes. Built on the Parquet format, it achieves version control through a Transaction Log. Core features include ACID transactions, Time Travel (querying historical data by version number or timestamp), and Schema Evolution. It is well suited for structured data at the TB scale and above within the Spark ecosystem.
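As an illustration, Time Travel can be used from PySpark roughly as follows; this is a sketch that assumes the delta-spark package is installed and uses a hypothetical table path.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Create a Spark session with the Delta Lake extensions enabled
builder = (
    SparkSession.builder.appName("delta-time-travel")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Read the table as it existed at version 3 ...
df_v3 = spark.read.format("delta").option("versionAsOf", 3).load("/data/events")
# ... or as it existed at a given timestamp
df_jan = (
    spark.read.format("delta")
    .option("timestampAsOf", "2024-01-01")
    .load("/data/events")
)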
LakeFS
LakeFS provides Git-like operations (branch, commit, merge) for managing data in data lakes. It is compatible with the S3 API and integrates seamlessly with existing data infrastructure. It is a good fit for teams that need to safely test data changes in production environments.
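Because LakeFS exposes the S3 API, an ordinary S3 client can read from a specific branch by addressing repository/branch/path through the LakeFS endpoint. A hedged sketch with boto3; the endpoint, credentials, repository, branch, and object key are all placeholders.
import boto3

# Point a standard S3 client at the LakeFS S3 gateway
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",   # placeholder LakeFS endpoint
    aws_access_key_id="<LAKEFS_ACCESS_KEY>",
    aws_secret_access_key="<LAKEFS_SECRET_KEY>",
)

# Bucket = repository, key prefix = branch:
# read the data as it exists on an experiment branch
obj = s3.get_object(Bucket="my-repo", Key="experiment-branch/data/training_set.csv")
data = obj["Body"].read()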
Hugging Face Datasets
Hugging Face Hub offers a dataset hosting and version management service for the ML community, built on Git LFS, with a user-friendly API.
from datasets import load_dataset
dataset = load_dataset("squad", revision="v2.0") # Load a specific version
Data Management Best Practices
Separate Data from Code
Large files should not be stored directly in a Git repository. Recommended project structure:
project/
  src/          # Code -> Git
  configs/      # Configs -> Git
  data/         # Data -> DVC / Remote storage
  models/       # Models -> DVC / Remote storage
  dvc.yaml      # Pipeline definition -> Git
  dvc.lock      # Pipeline lock file -> Git
Never Commit Large Files to Git
Once a large file is committed to Git history, the repository size will not shrink even if the file is subsequently deleted (Git retains the full history). Data directories should be excluded in .gitignore, and DVC or Git LFS should be used for management.
Dataset Documentation (Datasheets for Datasets)
Every dataset should have clear documentation that records: source and collection method, scale and distribution, annotation guidelines, known biases, and licensing. This originates from the "Datasheets for Datasets" framework proposed by Gebru et al.
Data Lineage
Data Lineage tracks the complete transformation path of data from its original source to its final use: Where did this data come from? What processing steps did it go through? DVC Pipelines inherently provide a degree of Data Lineage capability; for more complex scenarios, tools such as Apache Atlas and Amundsen can be used.
Data Management in MLOps: Feature Store
Feature Store is a critical infrastructure component in MLOps — a centralized feature management platform that addresses a core question: How can features be efficiently shared and reused across a team? Key capabilities include:
- Feature registration and discovery: Search for and reuse existing feature definitions
- Online/offline consistency: Ensure that the same feature computation logic is used for both training (offline) and inference (online)
- Feature versioning: Track the change history of feature definitions and feature values
- Point-in-time correctness: Prevent data leakage by ensuring that only feature values available before a given point in time are used during training
Common implementations include Feast (open-source, lightweight), Tecton (commercial, fully managed), Hopsworks (Spark/Flink integration), and Databricks Feature Store. Feature Stores complement data version control: DVC manages raw data and dataset versions, while Feature Stores manage the feature lifecycle.
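As a concrete illustration of point-in-time correctness, Feast builds training sets by joining an entity dataframe (with event timestamps) against stored feature values, returning only values that were available before each timestamp. A minimal sketch, assuming a local Feast repository and a hypothetical driver_stats feature view.
from datetime import datetime

import pandas as pd
from feast import FeatureStore

# Path to the Feast feature repository (assumption)
store = FeatureStore(repo_path="feature_repo")

# Entities and the timestamps at which the training labels were observed
entity_df = pd.DataFrame(
    {
        "driver_id": [1001, 1002],
        "event_timestamp": [datetime(2024, 1, 1), datetime(2024, 1, 2)],
    }
)

# Point-in-time-correct join: only feature values known before each
# event_timestamp are used, preventing leakage into the training set
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["driver_stats:conv_rate", "driver_stats:avg_daily_trips"],
).to_df()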