Docker
Common commands:
# Check NVIDIA driver
nvidia-smi
# Check Docker GPU support
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
# 1. Stop and remove the current container
docker-compose down
# 2. Remove the old image (force cleanup)
docker rmi doc-recognition:latest
# 3. Clean Docker build cache (optional but recommended)
docker builder prune -f
# 4. Rebuild the image (without cache)
docker-compose build --no-cache
# 5. Start a new container
docker-compose up -d
# 6. Check logs to confirm successful startup
docker-compose logs -f
DEBUG: Troubleshooting steps when cuDNN cannot be found
# From the project directory, open a shell inside the running container
docker-compose exec recognition bash
# 1. Check nvidia-smi
nvidia-smi
# 2. Check whether PaddlePaddle can detect CUDA
python3 -c "import paddle; print('PaddlePaddle:', paddle.__version__); print('CUDA compiled:', paddle.is_compiled_with_cuda())"
# 3. Test creating a GPU tensor (this will fail if cuDNN is missing)
python3 << 'EOF'
import paddle
try:
    paddle.set_device('gpu:0')
    x = paddle.randn([2, 2])
    print("✓ GPU is working correctly!")
except Exception as e:
    print(f"✗ GPU error: {e}")
EOF
# Exit
exit
Docker Core Concepts
Image vs. Container vs. Registry
| Concept | Analogy | Description |
|---|---|---|
| Image | Class | A read-only template containing the runtime environment and code |
| Container | Instance | A running instance of an image; read-write |
| Registry | GitHub | Stores and distributes images (e.g., Docker Hub, GHCR) |
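The three concepts map onto an everyday workflow; a minimal sketch (the ghcr.io path below is a hypothetical registry destination):
# Pull an image from a registry (Docker Hub), run a container from it,
# then retag the image and push it to another registry
docker pull python:3.10-slim
docker run -it --rm python:3.10-slim python3
docker tag python:3.10-slim ghcr.io/<your-user>/python:3.10-slim
docker push ghcr.io/<your-user>/python:3.10-slim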
Dockerfile Best Practices
Basic Structure
# Choose an appropriate base image
FROM python:3.10-slim
# Set working directory
WORKDIR /app
# Copy dependency files first (to leverage layer caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Then copy project code
COPY . .
# Expose port
EXPOSE 8000
# Start command
CMD ["python", "main.py"]
Multi-stage Builds
Multi-stage builds can significantly reduce the final image size by separating build-time dependencies from runtime dependencies:
# ====== Build stage ======
FROM python:3.10 AS builder
WORKDIR /build
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt
# ====== Runtime stage ======
FROM python:3.10-slim AS runtime
# Copy only the installed packages, without build tools
COPY --from=builder /install /usr/local
COPY . /app
WORKDIR /app
CMD ["python", "main.py"]
Result: Build-stage tools like gcc and build-essential do not end up in the final image.
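To confirm the savings, one option is to build the builder stage on its own and compare sizes (myapp is a placeholder tag):
# Build only the builder stage, then the full multi-stage image, and compare their sizes
docker build --target builder -t myapp:builder .
docker build -t myapp:runtime .
docker images | grep myapp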
Layer Caching Optimization
Docker builds layer by layer -- if a layer has not changed, the cache is used. Optimization principles:
- Place infrequently changing layers first: base image -> system packages -> Python dependencies -> code (a cache-reuse sketch follows the RUN example below)
- Copy dependency files before code: requirements.txt changes rarely; code changes often
- Merge RUN commands to reduce layers:
# Good: one RUN, one layer
RUN apt-get update && \
apt-get install -y --no-install-recommends git curl && \
rm -rf /var/lib/apt/lists/*
# Bad: three RUNs, three layers
RUN apt-get update
RUN apt-get install -y git curl
RUN rm -rf /var/lib/apt/lists/*
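A rough sketch of the cache-reuse behaviour (myapp and main.py are placeholders): when only application code changes, the dependency-install layer should be served from cache on the rebuild.
# First build installs dependencies and copies code
docker build -t myapp:v1 .
# Change application code only
echo "# minor change" >> main.py
# Rebuild: the pip install layer is reported as cached; only COPY . . and later layers rerun
docker build -t myapp:v2 .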
.dockerignore
Create a .dockerignore file to exclude unnecessary files, speeding up builds and reducing context size:
# .dockerignore
__pycache__/
*.pyc
.git/
.env
*.egg-info/
dist/
build/
.vscode/
*.md
data/
wandb/
outputs/
checkpoints/
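One way to see the effect is to watch the build-context size reported at the start of a build; the exact wording depends on the Docker version and builder in use:
docker build -t myapp .
# Classic builder: "Sending build context to Docker daemon  ..."
# BuildKit: "[internal] load build context ... transferring context: ..."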
Docker Compose: Multi-Container Orchestration
Basic Configuration
# docker-compose.yml
version: "3.8"
services:
api:
build: .
ports:
- "8000:8000"
volumes:
- ./data:/app/data
environment:
- MODEL_PATH=/app/models/best.pt
depends_on:
- redis
restart: unless-stopped
redis:
image: redis:7-alpine
ports:
- "6379:6379"
volumes:
- redis_data:/data
worker:
build: .
command: python worker.py
depends_on:
- redis
deploy:
replicas: 2
volumes:
redis_data:
Service Dependencies and Health Checks
services:
db:
image: postgres:15
healthcheck:
test: ["CMD-SHELL", "pg_isready -U postgres"]
interval: 5s
timeout: 5s
retries: 5
api:
build: .
depends_on:
db:
condition: service_healthy # wait until the database is ready
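Once the stack is running, the reported health state can be checked, for example (the container name depends on your compose project naming):
# Inspect the health status of the database container; docker-compose ps also shows "(healthy)" in the State column
docker inspect --format '{{.State.Health.Status}}' <db_container_id>
docker-compose ps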
Common Compose Commands
docker-compose up -d # Start all services in the background
docker-compose down # Stop and remove containers
docker-compose logs -f api # View logs for a specific service
docker-compose exec api bash # Open a shell inside a container
docker-compose build --no-cache # Rebuild without cache
docker-compose ps # View service status
GPU Containers
NVIDIA Container Toolkit
The NVIDIA Container Toolkit (formerly nvidia-docker2) enables Docker containers to access host GPUs.
Installation (Ubuntu):
# Add NVIDIA repository
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# Install
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Usage:
# Command line
docker run --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
docker run --gpus '"device=0,1"' my-training-image # specify GPUs
# Docker Compose
services:
train:
image: my-training-image
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all # or specify a number like 1
capabilities: [gpu]
AI Project Docker Template
Python + CUDA + PyTorch
# Use NVIDIA's official PyTorch base image
FROM nvcr.io/nvidia/pytorch:24.01-py3
# Or build from a CUDA base image
# FROM nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04
# RUN apt-get update && apt-get install -y python3 python3-pip
# RUN pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
WORKDIR /workspace
# Install project dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy project code
COPY . .
# Set environment variables
ENV PYTHONUNBUFFERED=1
ENV CUDA_VISIBLE_DEVICES=0
# Training entry point
CMD ["python", "train.py"]
Common CUDA Base Image Options
| Image Type | Description | Size |
|---|---|---|
| nvidia/cuda:12.1.0-base-ubuntu22.04 | CUDA runtime only | ~120MB |
| nvidia/cuda:12.1.0-runtime-ubuntu22.04 | CUDA runtime + libraries (cuBLAS, etc.) | ~800MB |
| nvidia/cuda:12.1.0-devel-ubuntu22.04 | Includes the compiler toolchain (nvcc) | ~3.5GB |
| nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04 | Includes cuDNN + compiler toolchain | ~4.5GB |
| nvcr.io/nvidia/pytorch:24.01-py3 | NVIDIA NGC image with PyTorch pre-installed | ~15GB |
Selection guidelines: Use runtime for inference, devel for training or compiling custom operators, and NGC images for rapid experimentation.
Command Cheat Sheet
Image Management
docker build -t myimage:v1 . # Build an image
docker images # List local images
docker rmi myimage:v1 # Remove an image
docker tag myimage:v1 registry/myimage:v1 # Tag an image
docker push registry/myimage:v1 # Push an image
docker pull pytorch/pytorch:latest # Pull an image
docker image prune -f # Clean up unused images
Container Management
docker run -it --rm myimage bash # Interactive run
docker run -d -p 8000:8000 myimage # Run in background with port mapping
docker run -v $(pwd)/data:/data myimage # Mount a volume
docker ps # List running containers
docker ps -a # List all containers
docker stop <container_id> # Stop a container
docker rm <container_id> # Remove a container
docker exec -it <container_id> bash # Enter a running container
docker logs -f <container_id> # View logs
System Maintenance
docker system df # View Docker disk usage
docker system prune -a --volumes # Clean up all unused resources (use with caution)
docker builder prune -f # Clean build cache
docker network ls # List networks
docker volume ls # List volumes
Debugging Tips
# View processes inside a container
docker top <container_id>
# View container resource usage
docker stats
# View image layer history
docker history myimage:v1
# Copy files from container to host
docker cp <container_id>:/app/output.txt ./output.txt
# Create a new image from a container (save state after debugging)
docker commit <container_id> debug-snapshot:v1