Data Version Control

Why Data Version Control Matters

In traditional software development, Git is the standard tool for code version control. In ML/AI projects, however, data is just as much a core asset as code, yet it often lacks systematic version management.

  • Reproducibility of ML experiments: A model's performance is jointly determined by the code, hyperparameters, and training data. Without the ability to trace back data versions, experiment results cannot be reproduced. Data version control ensures that every experiment is tied to a specific data snapshot.
  • Dataset iteration: In real-world projects, datasets evolve continuously — new annotations are added, labels are corrected, and samples are expanded. Without version control, it is difficult to track the state of a dataset at different points in time.
  • Team collaboration: When multiple people collaborate, they need to work with a consistent dataset. Manually copying data easily leads to version confusion; version control tools guarantee that everyone pulls the same version of the data.

DVC (Data Version Control)

DVC is one of the most widely used open-source data version control tools. Developed by Iterative.ai, it is designed specifically for ML projects.

Core Concept

DVC's design philosophy: use Git to manage lightweight metadata files (.dvc files), while storing the actual large files in remote storage. This leverages Git's version control capabilities without bloating the repository with large files. The workflow is as follows:

  1. Use dvc add to track large files; DVC generates corresponding .dvc metadata files
  2. Commit the .dvc files to Git
  3. Use dvc push to upload the actual data to remote storage
  4. Others retrieve the code and data via git pull + dvc pull

Basic Commands

# Initialize DVC in a Git repository
dvc init

# Track data files or directories
dvc add data/training_set.csv
dvc add data/images/

# Commit the generated .dvc files and .gitignore to Git
git add data/training_set.csv.dvc data/images.dvc .gitignore
git commit -m "Add training data v1"

# Configure remote storage
dvc remote add -d myremote s3://my-bucket/dvc-storage

# Push data to remote storage / Pull data from remote
dvc push
dvc pull

# Check out a historical version of the data
git checkout v1.0
dvc checkout

.dvc File Structure

A .dvc file is a YAML-formatted metadata file that uniquely identifies a data version through an MD5 hash:

outs:
- md5: a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6
  size: 1073741824
  hash: md5
  path: training_set.csv

When the data changes, the hash value changes accordingly. Once the updated .dvc file is committed to Git, a mapping between the code version and the data version is established.
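The hash-based change detection described above can be sketched in a few lines of Python. This is a simplified illustration of the idea, not DVC's actual implementation:

```python
import hashlib

def file_md5(path, chunk_size=8 * 1024 * 1024):
    """Hash a file in chunks so large files never load fully into memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def has_changed(path, recorded_md5):
    """Compare the file's current hash with the one recorded in the .dvc file."""
    return file_md5(path) != recorded_md5
```

Because the cache is keyed by content hash, identical content is stored only once, and checking out an old version is just a matter of restoring the file the hash points to.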

Remote Storage

DVC supports multiple remote storage backends, configured via dvc remote add:

  • Amazon S3: dvc remote add -d myremote s3://bucket/path
  • Google Cloud Storage: dvc remote add -d myremote gs://bucket/path
  • Azure Blob Storage: dvc remote add -d myremote azure://container/path
  • SSH / Local filesystem: dvc remote add -d myremote ssh://user@host/path

DVC Pipeline

DVC can also define data processing pipelines through dvc.yaml:

stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - src/preprocess.py
      - data/raw/
    outs:
      - data/processed/
  train:
    cmd: python src/train.py --lr 0.001
    deps:
      - src/train.py
      - data/processed/
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false

Key advantage: DVC automatically detects changes in dependencies and only re-executes the necessary stages (dvc repro), avoiding redundant computation. Use dvc dag to visualize the pipeline's DAG structure.
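The skip-unchanged-stages idea behind dvc repro can be sketched as follows. This is a simplified illustration using content fingerprints of the dependencies; names like run_stage are hypothetical and not part of DVC's API:

```python
import hashlib

def fingerprint(paths):
    """Combine the content hashes of all dependency files into one fingerprint."""
    h = hashlib.md5()
    for p in sorted(paths):
        with open(p, "rb") as f:
            h.update(f.read())
    return h.hexdigest()

def run_stage(name, deps, command, lock):
    """Re-run `command` only if the dependencies' fingerprint has changed."""
    fp = fingerprint(deps)
    if lock.get(name) == fp:
        return False  # unchanged since last run: skip
    command()
    lock[name] = fp  # dvc.lock plays this record-keeping role in real DVC
    return True
```

In real DVC the recorded fingerprints live in dvc.lock, which is why that file belongs in Git alongside dvc.yaml.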

DVC Experiments

DVC provides experiment tracking capabilities to record and compare results across different hyperparameter settings:

dvc exp run --set-param train.lr=0.01   # Run an experiment with modified parameters
dvc exp show                             # View all experiment results
dvc exp diff exp-abc123 exp-def456       # Compare two experiments

Other Data Version Control Tools

Git LFS (Large File Storage)

Git LFS is an open-source Git extension, originally developed by GitHub, that replaces actual large files with small pointer files in the repository, tightly integrated into the Git workflow.

git lfs install
git lfs track "*.h5"
git add .gitattributes model.h5
git commit -m "Add model file"

It is suitable for files up to a few hundred MB, especially for teams already using GitHub/GitLab that do not want to introduce additional tools. Limitations: suboptimal support for very large-scale datasets, and relatively high storage costs.
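The pointer file Git LFS stores in place of the real content follows a small published text format (version line, SHA-256 object ID, size). A minimal sketch of building one:

```python
import hashlib

def lfs_pointer(data: bytes) -> str:
    """Build the pointer text Git LFS commits in place of the real file."""
    oid = hashlib.sha256(data).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(data)}\n"
    )
```

The actual file content lives on the LFS server, addressed by the oid, while Git history only ever sees this few-line pointer.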

Delta Lake

An open-source storage layer by Databricks that provides ACID transaction support for data lakes. Built on the Parquet format, it achieves version control through a Transaction Log. Core features include ACID transactions, Time Travel (querying historical data by version number or timestamp), and Schema Evolution. It is well suited for structured data at the TB scale and above within the Spark ecosystem.

LakeFS

LakeFS provides Git-like operations (branch, commit, merge) for managing data in data lakes. It is compatible with the S3 API and integrates seamlessly with existing data infrastructure. It is a good fit for teams that need to safely test data changes in production environments.

Hugging Face Datasets

Hugging Face Hub offers a dataset hosting and version management service for the ML community, built on Git LFS, with a user-friendly API.

from datasets import load_dataset
dataset = load_dataset("squad", revision="v2.0")  # Load a specific version

Data Management Best Practices

Separate Data from Code

Large files should not be stored directly in a Git repository. Recommended project structure:

project/
  src/           # Code -> Git
  configs/       # Configs -> Git
  data/          # Data -> DVC / Remote storage
  models/        # Models -> DVC / Remote storage
  dvc.yaml       # Pipeline definition -> Git
  dvc.lock       # Pipeline lock file -> Git

Never Commit Large Files to Git

Once a large file is committed to Git history, the repository size will not shrink even if the file is subsequently deleted (Git retains the full history). Data directories should be excluded in .gitignore, and DVC or Git LFS should be used for management.
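DVC helps with this exclusion automatically: dvc add appends an entry for each tracked file or directory to a .gitignore next to it. The resulting file looks roughly like this (illustrative entries; the exact contents depend on what you track):

```
# data/.gitignore — maintained by `dvc add`
/training_set.csv
/images
```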

Dataset Documentation (Datasheets for Datasets)

Every dataset should have clear documentation that records: source and collection method, scale and distribution, annotation guidelines, known biases, and licensing. This originates from the "Datasheets for Datasets" framework proposed by Gebru et al.

Data Lineage

Data Lineage tracks the complete transformation path of data from its original source to its final use: Where did this data come from? What processing steps did it go through? DVC Pipelines inherently provide a degree of Data Lineage capability; for more complex scenarios, tools such as Apache Atlas and Amundsen can be used.

Data Management in MLOps: Feature Store

Feature Store is a critical infrastructure component in MLOps — a centralized feature management platform that addresses a core question: How can features be efficiently shared and reused across a team? Key capabilities include:

  • Feature registration and discovery: Search for and reuse existing feature definitions
  • Online/offline consistency: Ensure that the same feature computation logic is used for both training (offline) and inference (online)
  • Feature versioning: Track the change history of feature definitions and feature values
  • Point-in-time correctness: Prevent data leakage by ensuring that only feature values available before a given point in time are used during training

Common implementations include Feast (open-source, lightweight), Tecton (commercial, fully managed), Hopsworks (Spark/Flink integration), and Databricks Feature Store. Feature Stores complement data version control: DVC manages raw data and dataset versions, while Feature Stores manage the feature lifecycle.
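Point-in-time correctness, in particular, can be illustrated with a small pure-Python lookup: for each training example, use only the latest feature value recorded at or before the example's own timestamp. This is a minimal sketch of the idea (real feature stores such as Feast perform this join for you; the function name here is hypothetical):

```python
from bisect import bisect_right

def point_in_time_lookup(feature_history, as_of):
    """Return the latest feature value recorded at or before `as_of`.

    feature_history: list of (timestamp, value) pairs sorted by timestamp.
    ISO-8601 date strings compare correctly as plain strings.
    """
    times = [t for t, _ in feature_history]
    i = bisect_right(times, as_of) - 1
    if i < 0:
        return None  # no feature value was available yet at `as_of`
    return feature_history[i][1]

# Feature values with the time at which each became known
history = [("2024-01-01", 10.0), ("2024-01-10", 12.5), ("2024-01-20", 30.0)]

# A training example timestamped 2024-01-15 must only see the 01-10 value;
# using the later 30.0 value would leak future information into training.
assert point_in_time_lookup(history, "2024-01-15") == 12.5
```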
