Docker
Common commands:
# Check NVIDIA driver
nvidia-smi
# Check Docker GPU support
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
# 1. Stop and remove the current container
docker-compose down
# 2. Remove the old image (force cleanup)
docker rmi doc-recognition:latest
# 3. Clean Docker build cache (optional but recommended)
docker builder prune -f
# 4. Rebuild the image (without cache)
docker-compose build --no-cache
# 5. Start a new container
docker-compose up -d
# 6. Check logs to confirm successful startup
docker-compose logs -f
DEBUG: Troubleshooting steps when cuDNN cannot be found
# From the project directory, open a shell inside the running container
docker-compose exec recognition bash
# 1. Check nvidia-smi
nvidia-smi
# 2. Check whether PaddlePaddle can detect CUDA
python3 -c "import paddle; print('PaddlePaddle:', paddle.__version__); print('CUDA compiled:', paddle.is_compiled_with_cuda())"
# 3. Test creating a GPU tensor (this will fail if cuDNN is missing)
python3 << 'EOF'
import paddle
try:
    paddle.set_device('gpu:0')
    x = paddle.randn([2, 2])
    print("✓ GPU is working correctly!")
except Exception as e:
    print(f"✗ GPU error: {e}")
EOF
# Exit
exit
Docker Core Concepts
Image vs. Container vs. Registry
| Concept | Analogy | Description |
|---|---|---|
| Image | Class | A read-only template containing the runtime environment and code |
| Container | Instance | A running instance of an image; read-write |
| Registry | GitHub | Stores and distributes images (e.g., Docker Hub, GHCR) |
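The three concepts map onto an everyday workflow; a minimal sketch (the ghcr.io path below is a hypothetical registry destination):
# Pull an image from a registry (Docker Hub), run a container from it,
# then retag the image and push it to another registry
docker pull python:3.10-slim
docker run -it --rm python:3.10-slim python3
docker tag python:3.10-slim ghcr.io/<your-user>/python:3.10-slim
docker push ghcr.io/<your-user>/python:3.10-slim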
Dockerfile Best Practices
Basic Structure
# Choose an appropriate base image
FROM python:3.10-slim
# Set working directory
WORKDIR /app
# Copy dependency files first (to leverage layer caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Then copy project code
COPY . .
# Expose port
EXPOSE 8000
# Start command
CMD ["python", "main.py"]
Multi-stage Builds
Multi-stage builds can significantly reduce the final image size by separating build-time dependencies from runtime dependencies:
# ====== Build stage ======
FROM python:3.10 AS builder
WORKDIR /build
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt
# ====== Runtime stage ======
FROM python:3.10-slim AS runtime
# Copy only the installed packages, without build tools
COPY --from=builder /install /usr/local
COPY . /app
WORKDIR /app
CMD ["python", "main.py"]
Result: Build-stage tools like gcc and build-essential do not end up in the final image.
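To confirm the savings, one option is to build the builder stage on its own and compare sizes (myapp is a placeholder tag):
# Build only the builder stage, then the full multi-stage image, and compare their sizes
docker build --target builder -t myapp:builder .
docker build -t myapp:runtime .
docker images | grep myapp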
Layer Caching Optimization
Docker builds layer by layer -- if a layer has not changed, the cache is used. Optimization principles:
- Place infrequently changing layers first: base image -> system packages -> Python dependencies -> code (a cache-reuse sketch follows the RUN example below)
- Copy dependency files before code: requirements.txt changes rarely; code changes often
- Merge RUN commands to reduce layers:
# Good: one RUN, one layer
RUN apt-get update && \
apt-get install -y --no-install-recommends git curl && \
rm -rf /var/lib/apt/lists/*
# Bad: three RUNs, three layers
RUN apt-get update
RUN apt-get install -y git curl
RUN rm -rf /var/lib/apt/lists/*
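A rough sketch of the cache-reuse behaviour (myapp and main.py are placeholders): when only application code changes, the dependency-install layer should be served from cache on the rebuild.
# First build installs dependencies and copies code
docker build -t myapp:v1 .
# Change application code only
echo "# minor change" >> main.py
# Rebuild: the pip install layer is reported as cached; only COPY . . and later layers rerun
docker build -t myapp:v2 .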
.dockerignore
Create a .dockerignore file to exclude unnecessary files, speeding up builds and reducing context size:
# .dockerignore
__pycache__/
*.pyc
.git/
.env
*.egg-info/
dist/
build/
.vscode/
*.md
data/
wandb/
outputs/
checkpoints/
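One way to see the effect is to watch the build-context size reported at the start of a build; the exact wording depends on the Docker version and builder in use:
docker build -t myapp .
# Classic builder: "Sending build context to Docker daemon  ..."
# BuildKit: "[internal] load build context ... transferring context: ..."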
Docker Compose: Multi-Container Orchestration
Basic Configuration
# docker-compose.yml
version: "3.8"
services:
api:
build: .
ports:
- "8000:8000"
volumes:
- ./data:/app/data
environment:
- MODEL_PATH=/app/models/best.pt
depends_on:
- redis
restart: unless-stopped
redis:
image: redis:7-alpine
ports:
- "6379:6379"
volumes:
- redis_data:/data
worker:
build: .
command: python worker.py
depends_on:
- redis
deploy:
replicas: 2
volumes:
redis_data:
Service Dependencies and Health Checks
services:
db:
image: postgres:15
healthcheck:
test: ["CMD-SHELL", "pg_isready -U postgres"]
interval: 5s
timeout: 5s
retries: 5
api:
build: .
depends_on:
db:
condition: service_healthy # wait until the database is ready
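Once the stack is running, the reported health state can be checked, for example (the container name depends on your compose project naming):
# Inspect the health status of the database container; docker-compose ps also shows "(healthy)" in the State column
docker inspect --format '{{.State.Health.Status}}' <db_container_id>
docker-compose ps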
Common Compose Commands
docker-compose up -d # Start all services in the background
docker-compose down # Stop and remove containers
docker-compose logs -f api # View logs for a specific service
docker-compose exec api bash # Open a shell inside a container
docker-compose build --no-cache # Rebuild without cache
docker-compose ps # View service status
GPU Containers
NVIDIA Container Toolkit
The NVIDIA Container Toolkit (formerly nvidia-docker2) enables Docker containers to access host GPUs.
Installation (Ubuntu):
# Add NVIDIA repository
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# Install
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Usage:
# Command line
docker run --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
docker run --gpus '"device=0,1"' my-training-image # specify GPUs
# Docker Compose
services:
train:
image: my-training-image
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all # or specify a number like 1
capabilities: [gpu]
AI Project Docker Template
Python + CUDA + PyTorch
# Use NVIDIA's official PyTorch base image
FROM nvcr.io/nvidia/pytorch:24.01-py3
# Or build from a CUDA base image
# FROM nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04
# RUN apt-get update && apt-get install -y python3 python3-pip
# RUN pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
WORKDIR /workspace
# Install project dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy project code
COPY . .
# Set environment variables
ENV PYTHONUNBUFFERED=1
ENV CUDA_VISIBLE_DEVICES=0
# Training entry point
CMD ["python", "train.py"]
Common CUDA Base Image Options
| Image Type | Description | Size |
|---|---|---|
| nvidia/cuda:12.1.0-base-ubuntu22.04 | CUDA runtime only | ~120MB |
| nvidia/cuda:12.1.0-runtime-ubuntu22.04 | CUDA runtime + libraries (cuBLAS, etc.) | ~800MB |
| nvidia/cuda:12.1.0-devel-ubuntu22.04 | Includes the compiler toolchain (nvcc) | ~3.5GB |
| nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04 | Includes cuDNN + compiler toolchain | ~4.5GB |
| nvcr.io/nvidia/pytorch:24.01-py3 | NVIDIA NGC image with PyTorch pre-installed | ~15GB |
Selection guidelines: Use runtime for inference, devel for training or compiling custom operators, and NGC images for rapid experimentation.
Command Cheat Sheet
Image Management
docker build -t myimage:v1 . # Build an image
docker images # List local images
docker rmi myimage:v1 # Remove an image
docker tag myimage:v1 registry/myimage:v1 # Tag an image
docker push registry/myimage:v1 # Push an image
docker pull pytorch/pytorch:latest # Pull an image
docker image prune -f # Clean up unused images
Container Management
docker run -it --rm myimage bash # Interactive run
docker run -d -p 8000:8000 myimage # Run in background with port mapping
docker run -v $(pwd)/data:/data myimage # Mount a volume
docker ps # List running containers
docker ps -a # List all containers
docker stop <container_id> # Stop a container
docker rm <container_id> # Remove a container
docker exec -it <container_id> bash # Enter a running container
docker logs -f <container_id> # View logs
System Maintenance
docker system df # View Docker disk usage
docker system prune -a --volumes # Clean up all unused resources (use with caution)
docker builder prune -f # Clean build cache
docker network ls # List networks
docker volume ls # List volumes
Debugging Tips
# View processes inside a container
docker top <container_id>
# View container resource usage
docker stats
# View image layer history
docker history myimage:v1
# Copy files from container to host
docker cp <container_id>:/app/output.txt ./output.txt
# Create a new image from a container (save state after debugging)
docker commit <container_id> debug-snapshot:v1