Fine-tuning
- SFT (Supervised Fine-Tuning): Traditional fine-tuning using labeled data pairs.
- PEFT (LoRA, Adapter): Cost-efficient fine-tuning methods.
- Alignment:
  - RLHF: Fine-tuning with reinforcement learning (the core technique behind ChatGPT).
  - DPO (Direct Preference Optimization): A recently popular simplified alternative to RLHF that eliminates the reinforcement learning step.
Transfer Learning
The central idea of transfer learning is to apply knowledge learned from one task (source domain) to another related task (target domain). Transfer learning is typically carried out through the following approaches:
- Pre-training: Training a general-purpose model on a large-scale dataset
- Fine-tuning: Updating all or part of the model's parameters
- Zero-Shot / Few-Shot Learning: Applying the pre-trained model to new tasks with no labeled examples, or with only a handful.
- Domain Adaptation: Addressing the problem of distribution mismatch between source and target data. For example, transferring a model trained on synthetic images to real-world photographs.
Transfer learning is applicable to virtually all deep learning architectures:
| Architecture Type | What Gets Transferred | Typical Applications |
|---|---|---|
| CNN (Vision) | Convolutional kernels (filters for extracting edges, shapes, and textures) | Medical image recognition, autonomous driving object detection |
| Transformer (NLP) | Attention mechanism weights, language representation capabilities | Translation, sentiment analysis, code generation, legal document understanding |
| RNN (Sequential) | Temporal sequence logic of hidden states | Speech recognition, stock prediction fine-tuning |
| GNN (Graph Neural Networks) | Structural feature embeddings of nodes and edges | Drug molecule discovery, social network recommendations |
Transfer learning is a cornerstone of modern AI because data is scarce, compute is expensive, and pre-trained models offer stronger generalization capabilities.
Full Fine-tuning
Suppose the pre-trained model parameters are \(\Theta\). We divide the network into two parts:
- \(\theta_{backbone}\): The feature extraction layers (e.g., the convolutional layers in ResNet).
- \(\theta_{head}\): The classification output layers (the final fully connected layers).
Full fine-tuning uses the entire pre-trained weight set as initialization and updates all parameters on the new task:
\[\theta \leftarrow \theta - \eta \nabla_{\theta} L(\theta)\]
(Note: Here \(\theta = \{\theta_{backbone}, \theta_{head}\}\), meaning all parameters participate in the gradient \(\nabla\) computation.)
Rationale: This approach assumes a significant gap between the source and target domains, or that the dataset is large enough to allow the model to fully adapt to the new task.
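As a minimal PyTorch sketch of this strategy (the ResNet-18 backbone, 10-class head, and learning rate are illustrative assumptions, not prescriptions):

```python
import torch
import torch.nn as nn
from torchvision import models

# Full fine-tuning: pre-trained weights as initialization, all parameters updated.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 10)  # new head for a 10-class target task

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # every parameter in theta
criterion = nn.CrossEntropyLoss()

def train_step(x, y):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()   # gradients flow through backbone and head alike
    optimizer.step()  # theta <- theta - eta * grad
    return loss.item()
```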
Partial Fine-tuning
Partial fine-tuning assumes that low-level features (edges, textures) are universal, and only the high-level semantic features closer to the output need to be fine-tuned.
Update rule:
\[\theta_{high} \leftarrow \theta_{high} - \eta \nabla_{\theta_{high}} L, \qquad \theta_{low} \text{ fixed}\]
Rationale: By setting \(\frac{\partial L}{\partial \theta_{low}} = 0\) (manually blocking gradient flow), general feature extraction capabilities are forcibly preserved, reducing training overhead.
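A hedged sketch of the freezing step, using torchvision's ResNet layer names (the cut-off after `layer2` is an arbitrary illustrative choice):

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 10)

# Freeze the low-level layers; torchvision's ResNet names its stages conv1/bn1/layer1..layer4.
for name, param in model.named_parameters():
    if name.startswith(("conv1", "bn1", "layer1", "layer2")):
        param.requires_grad = False  # blocks gradient flow: dL/d(theta_low) = 0

# Hand the optimizer only the parameters that remain trainable.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```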
Differential Learning Rate Fine-tuning
Differential learning rate fine-tuning applies different learning rates \(\eta\) to different layers. Typically: \(\eta_{backbone} \ll \eta_{head}\).
Update rule:
\[\theta_{head} \leftarrow \theta_{head} - \eta_{head} \nabla_{\theta_{head}} L, \qquad \theta_{backbone} \leftarrow \theta_{backbone} - \eta_{backbone} \nabla_{\theta_{backbone}} L\]
with \(\eta_{backbone} \ll \eta_{head}\).
Rationale: This is a weighted trust mechanism. We place great confidence in the knowledge already embedded in the base model (hence updating it very slowly), while allowing the newly added output layers to rapidly adapt to new classes.
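In PyTorch this maps directly onto optimizer parameter groups; the two learning rates below are illustrative:

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 10)  # fresh head

# One optimizer, two parameter groups with very different learning rates.
optimizer = torch.optim.AdamW([
    {"params": model.fc.parameters(), "lr": 1e-3},           # eta_head: adapt fast
    {"params": [p for n, p in model.named_parameters()
                if not n.startswith("fc.")], "lr": 1e-5},    # eta_backbone << eta_head
])
```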
Fixed Feature Extractor
The fixed feature extractor approach leaves the pre-trained layer weights entirely unchanged, treating the model as a "black-box" feature transformer, and trains only the newly attached classification head.
Update rule:
\[\theta_{head} \leftarrow \theta_{head} - \eta \nabla_{\theta_{head}} L, \qquad \theta_{backbone} \text{ fixed}\]
Rationale: In PyTorch, this is implemented via param.requires_grad = False. The computation graph stops backpropagating gradients at the classification head boundary, yielding maximum computational efficiency.
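A minimal sketch of the pattern (same illustrative torchvision setup as above):

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze every pre-trained weight: the backbone becomes a black-box feature transformer.
for param in model.parameters():
    param.requires_grad = False

# The new head is created after freezing, so it stays trainable by default.
model.fc = nn.Linear(model.fc.in_features, 10)
optimizer = torch.optim.AdamW(model.fc.parameters(), lr=1e-3)  # only theta_head updates
```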
LLM PEFT
LLM PEFT refers to Parameter-Efficient Fine-Tuning (PEFT), the mainstream approach to fine-tuning large language models as of 2026.
Most development no longer starts from scratch but builds upon existing models:
- Pre-training (Student Phase): Learning general grammar and commonsense knowledge, acquiring the ability to predict the next token.
- Instruction Fine-tuning (SFT - Expert Phase): Teaching the model to "follow instructions" by training on question-answer pairs to adapt it to specific domains (e.g., electrical engineering).
- Alignment (Polishing Phase): Using techniques like RLHF to make responses safer, more helpful, and better aligned with human preferences.
Technology Stack
The technology stack and tools for PEFT include:
- Hardware layer: CUDA — NVIDIA's parallel computing platform, which lets deep learning frameworks dispatch computations such as matrix multiplication to the GPU.
- Compute engine: PyTorch — the dominant deep learning framework today, providing tensor computation and automatic differentiation.
- Optimization layer: Unsloth — an acceleration library specifically optimized for LoRA (Low-Rank Adaptation). By implementing hand-tuned, more efficient mathematical operators, it reduces VRAM usage by over 70% and improves speed by 2-3x. It makes fine-tuning 7B or even 70B models feasible on consumer-grade GPUs (e.g., RTX 4090).
- Management layer: Hugging Face Transformers — provides a unified interface so that whether you want to use Meta's Llama, Google's Gemma, or Alibaba's Qwen, the code is nearly identical.
- Hugging Face Datasets — offers standardized data loading. Regardless of whether your raw data is in JSON or CSV, it can quickly convert it into the format required by the model.
- Application layer: Python / Jupyter Notebook.
Training Pipeline
Computers cannot understand text — they can only process numbers. Therefore, tokenization is the essential first step. Tokenization algorithms break sentences into minimal units called "tokens" (e.g., splitting "Learning" into "Learn" and "ing"), assigning each unit a unique numeric ID. The tokenization efficiency of different models directly affects their ability to handle long texts.
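A quick sketch with Hugging Face's tokenizer API (`gpt2` is just an example checkpoint; any model's tokenizer behaves analogously):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

tokens = tokenizer.tokenize("Fine-tuning is fun")  # subword pieces (exact split depends on the vocabulary)
ids = tokenizer.encode("Fine-tuning is fun")       # each piece mapped to a unique numeric ID
print(tokens, ids)
```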
During training, the loss function serves as the "yardstick" for measuring model performance. It uses mathematical formulas to compute the gap between predicted results and ground-truth answers. If the model's prediction is wildly off (e.g., predicting "banana" as the unit of resistance), the loss spikes sharply. The core objective of training is to continuously adjust parameters via gradient descent to minimize the loss, driving predictions toward accuracy.
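The idea in miniature, using cross-entropy, the standard next-token loss (the toy vocabulary and random logits are fabricated for illustration):

```python
import torch
import torch.nn.functional as F

vocab_size = 8                                           # toy vocabulary
logits = torch.randn(1, vocab_size, requires_grad=True)  # model's scores for the next token
target = torch.tensor([3])                               # ID of the ground-truth token

loss = F.cross_entropy(logits, target)  # spikes when the true token gets low probability
loss.backward()                         # gradients that gradient descent uses to adjust parameters
```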
When a model has finished learning and is deployed for use, the process is called inference. Unlike training, inference does not require updating parameters or computing gradients, so its demands on compute and VRAM are much lower.
To enable massive models to run on commodity hardware or even mobile devices, quantization is a critical technique. It compresses the model's original high-precision parameters (e.g., 16-bit floating point) down to 4-bit or even lower. Although this introduces a very slight loss in precision, it dramatically reduces VRAM usage, allowing large models that would otherwise require high-end GPUs to run smoothly on affordable hardware.
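With Hugging Face Transformers plus the bitsandbytes backend, 4-bit loading is a configuration object (the checkpoint name is an illustrative assumption, and bitsandbytes must be installed):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantized load: weights stored in 4 bits, computation done in 16-bit precision.
config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",      # illustrative checkpoint
    quantization_config=config,
    device_map="auto",              # place layers on available GPUs/CPU automatically
)
```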
LoRA
LoRA stands for Low-Rank Adaptation.
- Low-Rank: A mathematical concept meaning that instead of updating the model's enormous, full-rank weight matrix, we decompose the update into two "narrow" matrices (with a very small rank \(r\)).
- Adaptation: Refers to its purpose — not training a model from scratch, but adapting an already trained model to a specific new task or style.
In deep learning, a model is essentially a collection of massive weight matrices. Traditional fine-tuning requires updating every single number in these matrices. However, today's models are simply too large. For example, with Llama 3 70B, even updating just 1% of the parameters is infeasible on consumer GPUs (because training requires storing each parameter's "gradients" and "momentum," consuming several times more VRAM than the parameters themselves).
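A rough back-of-the-envelope calculation makes the point (assuming fp16 weights and gradients and a standard Adam optimizer with two fp32 states per parameter; real setups vary):

```python
params = 70e9                           # Llama 3 70B
weights_gb = params * 2 / 1e9           # fp16 weights: 2 bytes each   -> ~140 GB
grads_gb = params * 2 / 1e9             # fp16 gradients               -> ~140 GB
adam_gb = params * 8 / 1e9              # two fp32 Adam states         -> ~560 GB
print(weights_gb + grads_gb + adam_gb)  # ~840 GB before activations
```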
Researchers quickly discovered that to teach a model a new skill (such as writing poetry or code), you don't actually need to change all parameter directions. The essential modification is low-rank in nature. For instance, although a model may have 10,000 dimensions, only about 8 core dimensions may actually matter. LoRA captures only those critical 8 dimensions while leaving the remaining 9,992 untouched.
Mathematical Formulation
Suppose the model has an original weight matrix \(W_0\) of dimension \(d \times k\). You need to compute a weight update \(\Delta W\), yielding the fine-tuned weight \(W = W_0 + \Delta W\). Here, \(\Delta W\) is the same size as \(W_0\). If \(W_0\) is \(4096 \times 4096\), then \(\Delta W\) also has \(16,777,216\) parameters.
LoRA assumes that \(\Delta W\) can be decomposed into the product of two much smaller matrices:
\[\Delta W = A B\]
where:
- Matrix A has dimensions \(d \times r\)
- Matrix B has dimensions \(r \times k\)
This \(r\) is the so-called LoRA Rank.
For example, suppose \(d=4096, k=4096\), and we set \(r=8\):
- Traditional fine-tuning: Parameter count \(= 4096 \times 4096 = 16,777,216\)
- LoRA: Parameter count \(= (4096 \times 8) + (8 \times 4096) = 65,536\)
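The mechanism is compact enough to sketch directly in PyTorch. This is a minimal illustrative LoRA linear layer, not the `peft` library's implementation; the \(\alpha / r\) scaling factor follows the common convention:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W0 plus a trainable low-rank update A @ B."""
    def __init__(self, d, k, r=8, alpha=16):
        super().__init__()
        self.W0 = nn.Parameter(torch.randn(d, k), requires_grad=False)  # frozen pre-trained weight
        self.A = nn.Parameter(torch.randn(d, r) * 0.01)  # d x r
        self.B = nn.Parameter(torch.zeros(r, k))         # r x k, zero-init so delta_W starts at 0
        self.scale = alpha / r

    def forward(self, x):
        delta_W = self.A @ self.B                        # d x k, but rank at most r
        return x @ (self.W0 + self.scale * delta_W).T    # x @ W.T with W = W0 + delta_W
```

Only `A` and `B` receive gradients, so the optimizer state scales with \(dr + rk\) parameters per layer rather than \(dk\).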
If you increase \(r\) from 8 to 64 or higher, the most immediate effect is increased VRAM usage (higher training overhead). The larger \(r\) is, the wider matrices \(A\) and \(B\) become, containing more parameters.
- You need more VRAM to store the gradients and optimizer states for these parameters.
- If you're running on a 24GB card, \(r=8\) might work fine, but pushing it to \(r=256\) could trigger an OOM (Out of Memory) error.
A higher \(r\) means \(\Delta W\) can capture more detail. If your task is highly complex (e.g., teaching the model an entirely new language), a high \(r\) is more beneficial. For simpler tasks (e.g., just adjusting the tone of voice), a low \(r\) suffices.
A higher \(r\) also means longer training times. Although LoRA is fast, doubling the parameter count increases the floating-point operations (FLOPs) required per training step, slowing down training.
A higher \(r\) increases the risk of overfitting: think of a low \(r\) as a small notepad — you can only jot down the key points, which encourages good generalization. A high \(r\) is a huge blackboard: you might end up memorizing meaningless noise from the training set, causing the model to perform worse in practice (overfitting).
Choosing r
As \(r\) increases, the parameter count in LoRA matrices \(A\) and \(B\) grows, leading to proportionally larger gradient and optimizer state storage. However, even with \(r\) set to 128 or 256, the VRAM footprint is typically still far smaller than that of full parameter tuning, because full parameter tuning requires storing gradients and optimizer states for every single original parameter, whereas LoRA only does so for those two narrow matrices.
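In practice, \(r\) is a one-line setting in Hugging Face's `peft` library. The sketch below assumes `model` is a causal LM already loaded with Transformers, and the `target_modules` names follow Llama-style attention layers:

```python
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,                                  # the LoRA rank discussed above
    lora_alpha=16,                        # scaling factor (alpha / r)
    target_modules=["q_proj", "v_proj"],  # which weight matrices get A, B pairs
    lora_dropout=0.05,
)
model = get_peft_model(model, config)     # wraps the base model; only A, B train
model.print_trainable_parameters()        # typically well under 1% of the total
```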
Although LoRA is highly effective and cost-efficient, full parameter tuning remains irreplaceable in the following three core scenarios:
- Injecting a large volume of entirely new, domain-specific knowledge
- Changing the model's output format or task paradigm
- When data is abundant and compute resources are plentiful