Data Cleaning and Preprocessing
This note covers the core methods of data cleaning and preprocessing in ML/AI engineering. In a typical machine learning workflow, data preparation often accounts for 60%-80% of the total effort, and its quality directly determines the upper bound of model performance.
Why Data Cleaning Matters
There is a classic saying in the industry: Garbage In, Garbage Out. No matter how advanced a model architecture is, if the input data has serious quality issues, the model will not perform well.
A key insight in practice is that data quality > model complexity. A simple linear model trained on clean data often outperforms a deep neural network trained on dirty data. Andrew Ng's concept of Data-Centric AI embodies exactly this philosophy: rather than repeatedly tuning model architectures and hyperparameters, it is better to invest effort in improving data quality.
Therefore, data cleaning is not an optional "preparatory step" but rather the highest-leverage component in the ML pipeline.
Common Data Quality Issues
After obtaining raw data, the following categories of problems typically need to be investigated:
- Missing Values: Some fields are empty or NaN, commonly caused by logging failures, users not filling in fields, etc.
- Duplicates: The same record appears multiple times, which can cause the model to overfit on those samples.
- Outliers: Values far beyond the normal range, potentially due to sensor malfunction or data entry errors.
- Label Noise: Incorrect labels in classification tasks, which directly degrades the quality of the supervisory signal.
- Class Imbalance: A significant disparity in the ratio of positive to negative samples (e.g., less than 1% positive samples in fraud detection), causing the model to bias toward the majority class.
There is no one-size-fits-all solution for these problems; handling them requires judgment based on the specific business context. The key techniques are discussed in detail below.
Handling Missing Values
Deletion
The most straightforward approach, suitable when the proportion of missing data is small:
- Listwise Deletion: If a row contains any missing value, the entire row is discarded. Simple, but may result in significant data loss.
- Column Deletion: If a column has an excessively high missing rate (e.g., above 50%), consider dropping the feature entirely.
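Both deletion strategies can be sketched with pandas; the small DataFrame and the 50% threshold below are illustrative:

```python
import numpy as np
import pandas as pd

# Toy data: one column is mostly missing
df = pd.DataFrame({
    "age": [25, np.nan, 31, 45],
    "income": [50000, 62000, np.nan, 58000],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
})

# Listwise deletion: drop any row containing a missing value
rows_kept = df.dropna()

# Column deletion: drop columns whose missing rate exceeds 50%
missing_rate = df.isna().mean()
cols_kept = df.loc[:, missing_rate <= 0.5]
```

Note how aggressive listwise deletion is here: only one of four rows survives, while column deletion preserves all rows at the cost of a feature.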
Statistical Imputation
Fill missing values using column-level statistics:
- Mean Imputation: Suitable for continuous features with a normal distribution. \(x_{\text{fill}} = \bar{x}\)
- Median Imputation: More robust for skewed distributions. \(x_{\text{fill}} = \text{median}(x)\)
- Mode Imputation: Suitable for categorical features. \(x_{\text{fill}} = \text{mode}(x)\)
These methods are simple and fast but ignore correlations between features.
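sklearn's SimpleImputer implements all three strategies; the tiny skewed column below is a made-up illustration of why the median is often preferred:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One missing entry in a column with an extreme value (100.0)
X = np.array([[1.0], [2.0], [np.nan], [100.0]])

# Mean imputation: pulled upward by the outlier
mean_filled = SimpleImputer(strategy="mean").fit_transform(X)

# Median imputation: robust to the skew
median_filled = SimpleImputer(strategy="median").fit_transform(X)
```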
KNN Imputation
Uses the K-Nearest Neighbors algorithm to estimate missing values based on the values of the K most similar neighbors. This approach captures local relationships between features but scales poorly with dataset size.
```python
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5)
X_filled = imputer.fit_transform(X)
```

Model-Based Imputation
Treats the feature with missing values as the target variable and trains a model using the remaining features to predict the missing entries. Common methods include MICE (Multiple Imputation by Chained Equations) and IterativeImputer. This approach generally yields the best results but also incurs the highest computational cost.
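sklearn's IterativeImputer (which must be enabled explicitly, as it is still marked experimental) follows this idea: each feature with missing values is regressed on the others. The correlated two-column data below is a contrived example:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Two correlated features: the second is roughly 2x the first
X = np.array([[1.0, 2.1],
              [2.0, 3.9],
              [3.0, np.nan],
              [4.0, 8.2]])

imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
# The missing entry is predicted from the correlated first column,
# landing near 6 rather than at the column mean (~4.7)
```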
Outlier Detection
IQR Method
A classic method based on the interquartile range. Compute the first quartile \(Q_1\) and third quartile \(Q_3\), and let \(IQR = Q_3 - Q_1\). Then the normal range is defined as:
\[ [Q_1 - 1.5 \cdot IQR,\ Q_3 + 1.5 \cdot IQR] \]
Points falling outside this range are considered outliers. This method makes no distributional assumptions and is broadly applicable.
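The rule takes a few lines of NumPy; the sample array is illustrative:

```python
import numpy as np

x = np.array([10, 12, 11, 13, 12, 11, 95.0])  # 95 is an obvious outlier

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = x[(x < lower) | (x > upper)]
```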
Z-Score Method
Assumes the data is approximately normally distributed and computes the standardized score for each sample:
\[ z = \frac{x - \mu}{\sigma} \]
Points with \(|z| > 3\) are typically flagged as outliers. Simple and intuitive, but less effective for non-normally distributed data.
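A small synthetic demonstration (note that with very few samples the maximum attainable \(|z|\) is bounded by \((n-1)/\sqrt{n}\), so the 3-sigma rule needs a reasonably large sample):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(10, 1, 100), [30.0]])  # 30 is far off

z = (x - x.mean()) / x.std()
outliers = x[np.abs(z) > 3]
```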
Isolation Forest
A tree-based unsupervised anomaly detection algorithm. The core idea is that anomalous points, being "sparse," are easier to isolate through random partitioning of the feature space (requiring fewer splits). Well-suited for high-dimensional data.
```python
from sklearn.ensemble import IsolationForest

clf = IsolationForest(contamination=0.05)
labels = clf.fit_predict(X)  # -1 indicates an anomaly
```
DBSCAN
The density-based clustering algorithm DBSCAN was not originally designed for anomaly detection, but it labels points in low-density regions as noise (label=-1), making it a natural fit for detecting outliers. It works well for data with clear spatial clustering structure.
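A sketch with sklearn's DBSCAN; the two tight clusters plus one isolated point are fabricated, and `eps`/`min_samples` would need tuning on real data:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense clusters plus one isolated point
X = np.array([[0, 0], [0, 0.1], [0.1, 0],
              [5, 5], [5, 5.1], [5.1, 5],
              [20, 20.0]])

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
noise = X[labels == -1]  # points labeled -1 are treated as outliers
```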
Data Standardization and Normalization
When features have vastly different scales (e.g., age in 0-100 vs. income in 0-1,000,000), the convergence speed of gradient descent-based algorithms is severely affected. Standardization and normalization are designed to address this problem.
Min-Max Scaling (Normalization)
Linearly scales data to the \([0, 1]\) interval:
\[ x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \]
Suitable for data with well-defined boundaries and no significant outliers (e.g., image pixel values). The downside is its sensitivity to extreme values: a single outlier can compress the range of all other data points.
Z-Score Standardization
Transforms data to have zero mean and unit standard deviation:
\[ x' = \frac{x - \mu}{\sigma} \]
The most commonly used method. Applicable to most ML algorithms, especially those that rely on distance or gradient computations, such as SVM, Logistic Regression, and Neural Networks.
Robust Scaling
Uses the median and interquartile range instead of the mean and standard deviation:
\[ x' = \frac{x - \text{median}(x)}{IQR(x)} \]
Robust to outliers. Preferred when the data contains many outliers and you do not want to perform outlier removal beforehand.
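A side-by-side comparison on data with one extreme value makes the difference concrete (the array is illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# One extreme value to show how each scaler reacts
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

minmax = MinMaxScaler().fit_transform(X)     # inliers squeezed near 0
zscore = StandardScaler().fit_transform(X)   # zero mean, unit variance
robust = RobustScaler().fit_transform(X)     # inliers keep their spread
```

With Min-Max scaling the four inliers all land below 0.05 because the outlier defines \(x_{\max}\); RobustScaler, centered on the median with IQR = 2, maps them to \([-1, 0.5]\).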
Selection Guidelines
| Scenario | Recommended Method |
|---|---|
| Data has a well-defined range with no outliers | Min-Max Scaling |
| General-purpose, approximately normal data | Z-Score Standardization |
| Data contains many outliers | Robust Scaling |
| Tree-based models (Decision Tree, XGBoost, etc.) | Standardization is generally unnecessary |
Feature Encoding
Machine learning models can only process numerical inputs, so categorical features must be encoded as numbers.
One-Hot Encoding
Converts categorical variables into binary vectors. Suitable for unordered features with a small number of categories (e.g., color, city). A large number of categories leads to a dimensionality explosion.
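With pandas this is a one-liner via `get_dummies`; the `color` column is a made-up example:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Each category becomes its own 0/1 indicator column
encoded = pd.get_dummies(df, columns=["color"])
```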
Label Encoding
Maps each category to an integer (e.g., Red=0, Green=1, Blue=2). Suitable for ordinal categories (e.g., education level: elementary < middle school < university). Using it for unordered categories introduces a spurious ordinal relationship.
Target Encoding
Replaces category values with statistics of the target variable (e.g., the mean). Suitable for high-cardinality categorical features. Care must be taken to avoid data leakage; it is typically used with cross-validation or smoothing.
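One common smoothing scheme shrinks each category's mean toward the global mean in proportion to how few samples it has; this is a minimal sketch (the `city` data and the smoothing strength `alpha` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B", "C"],
    "label": [1, 1, 0, 0, 0, 1],
})

global_mean = df["label"].mean()
stats = df.groupby("city")["label"].agg(["mean", "count"])

# Smoothed encoding: rare categories are pulled toward the global mean
alpha = 2.0  # smoothing strength (a hyperparameter)
smoothed = (stats["count"] * stats["mean"] + alpha * global_mean) / (stats["count"] + alpha)
df["city_encoded"] = df["city"].map(smoothed)
```

To avoid leakage in practice, the statistics should be computed on training folds only and merely applied to validation/test data.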
Text Data Preprocessing
Tokenization
Splits raw text into a sequence of tokens. English has natural whitespace delimiters, while Chinese requires dedicated segmentation tools (e.g., jieba). In the era of LLMs, subword-level tokenizers such as BPE (Byte Pair Encoding) and SentencePiece have become the mainstream approach.
Stop Words Removal
Removes high-frequency but low-information words such as "the," "is," etc. This is useful in traditional NLP pipelines (e.g., TF-IDF + SVM), but in deep learning and Transformer architectures, stop word removal is generally not performed, as the attention mechanism can learn which words are important on its own.
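In a traditional pipeline the step is just a set-membership filter after tokenization; the tiny hand-picked stop word list below is illustrative (libraries like NLTK ship curated lists):

```python
# A minimal hand-picked stop word list for illustration
stop_words = {"the", "is", "a", "of", "and"}

text = "the quick brown fox is a friend of the lazy dog"
tokens = text.split()  # naive whitespace tokenization
filtered = [t for t in tokens if t not in stop_words]
```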
Stemming and Lemmatization
- Stemming: Rule-based suffix stripping, e.g., running -> run. Fast but crude.
- Lemmatization: Dictionary-based reduction to the base form, e.g., better -> good. More accurate but slower.
Both techniques are primarily used in traditional NLP pipelines. Modern pretrained model tokenizers already have built-in capabilities for handling morphological variations.
Image Data Preprocessing
Resize
Resizes all images to a uniform spatial resolution, such as \(224 \times 224\) (the standard input for ResNet) or \(384 \times 384\). This is necessary because most CNN architectures require fixed-size inputs. It is important to choose an appropriate interpolation method (bilinear, bicubic, etc.).
Normalize Pixel Values
Raw pixel values range from \([0, 255]\) and typically need to be normalized to \([0, 1]\) or standardized to a specific mean and standard deviation. For example, ImageNet-pretrained models commonly use:
```python
# Standard ImageNet normalization parameters (per RGB channel)
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]
```
Color Space Conversion
Perform color space conversions as required by the task, such as RGB to grayscale (to reduce computational cost) or RGB to HSV (more effective for certain visual tasks). Specialized domains like medical imaging may also involve specific window width and window level adjustments.
Data Pipeline
In production engineering, data preprocessing should not consist of scattered scripts but should be organized into a reproducible, version-controlled pipeline.
The ETL Paradigm
ETL (Extract-Transform-Load) is the classic paradigm in data engineering:
- Extract: Retrieve raw data from data sources (databases, APIs, file systems).
- Transform: Execute all preprocessing steps including cleaning, standardization, and feature engineering.
- Load: Write the processed data to the target storage (feature stores, training datasets, etc.).
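The three stages map naturally onto three functions; this is a minimal in-memory sketch in which `extract` and `load` are stand-ins for real data sources and sinks:

```python
import json

def extract():
    # Stand-in for reading from a database, API, or file system
    return [{"age": "34", "income": "52000"},
            {"age": "", "income": "61000"}]

def transform(rows):
    # Cleaning: drop rows with missing age, cast strings to ints
    return [{"age": int(r["age"]), "income": int(r["income"])}
            for r in rows if r["age"]]

def load(rows, path):
    # Stand-in for writing to a feature store or training dataset
    with open(path, "w") as f:
        json.dump(rows, f)

clean = transform(extract())
```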
sklearn Pipeline
sklearn provides the Pipeline class, which chains multiple preprocessing steps and a model into a single unit, ensuring that the exact same processing flow is used during both training and inference:
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test)
```
The advantage of this approach is that when pipe.fit() is called, the imputer and scaler are fit only on the training set and only apply transform on the test set, thereby preventing data leakage.
Reproducibility
Reproducibility of data pipelines is a fundamental requirement for production-grade ML. Key practices include:
- Fix random seeds: Ensure that data splitting, random sampling, and other stochastic steps are reproducible.
- Version control: Use tools like DVC (Data Version Control) to version both data and pipelines.
- Separate configuration from code: Extract hyperparameters, file paths, and other configuration items from the code to facilitate experiment management.
- Logging: Record data statistics (sample counts, missing rates, distributions, etc.) for each run to aid in debugging.
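The seed-fixing practice can be checked directly: with a fixed `random_state`, repeated splits are bit-for-bit identical (the toy arrays below are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Fixing random_state makes the split reproducible across runs
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)

assert (a_train == b_train).all() and (a_test == b_test).all()
```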
A well-designed data pipeline should be idempotent: given the same input, it produces exactly the same output regardless of how many times it is run.