Data Cleaning and Preprocessing
This note covers the core methods of data cleaning and preprocessing in ML/AI engineering. In a typical machine learning workflow, data preparation often accounts for 60%-80% of the total effort, and its quality directly determines the upper bound of model performance.
Why Data Cleaning Matters
There is a classic saying in the industry: Garbage In, Garbage Out. No matter how advanced a model architecture is, if the input data has serious quality issues, the model will not perform well.
A key insight in practice is that data quality > model complexity. A simple linear model trained on clean data often outperforms a deep neural network trained on dirty data. Andrew Ng's concept of Data-Centric AI embodies exactly this philosophy: rather than repeatedly tuning model architectures and hyperparameters, it is better to invest effort in improving data quality.
Therefore, data cleaning is not an optional "preparatory step" but rather the highest-leverage component in the ML pipeline.
Common Data Quality Issues
After obtaining raw data, the following categories of problems typically need to be investigated:
- Missing Values: Some fields are empty or NaN, commonly caused by logging failures, users not filling in fields, etc.
- Duplicates: The same record appears multiple times, which can cause the model to overfit on those samples.
- Outliers: Values far beyond the normal range, potentially due to sensor malfunction or data entry errors.
- Label Noise: Incorrect labels in classification tasks, which directly degrades the quality of the supervisory signal.
- Class Imbalance: A significant disparity in the ratio of positive to negative samples (e.g., less than 1% positive samples in fraud detection), causing the model to bias toward the majority class.
There is no one-size-fits-all solution for these problems; handling them requires judgment based on the specific business context. The key techniques are discussed in detail below.
Handling Missing Values
Deletion
The most straightforward approach, suitable when the proportion of missing data is small:
- Listwise Deletion: If a row contains any missing value, the entire row is discarded. Simple, but may result in significant data loss.
- Column Deletion: If a column has an excessively high missing rate (e.g., above 50%), consider dropping the feature entirely.
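Both deletion strategies can be sketched with pandas; the small DataFrame and the 50% threshold below are illustrative:

```python
import numpy as np
import pandas as pd

# Toy data: one column is mostly missing
df = pd.DataFrame({
    "age": [25, np.nan, 31, 45],
    "income": [50000, 62000, np.nan, 58000],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
})

# Listwise deletion: drop any row containing a missing value
rows_kept = df.dropna()

# Column deletion: drop columns whose missing rate exceeds 50%
missing_rate = df.isna().mean()
cols_kept = df.loc[:, missing_rate <= 0.5]
```

Note how aggressive listwise deletion is here: only one of four rows survives, while column deletion preserves all rows at the cost of a feature.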
Statistical Imputation
Fill missing values using column-level statistics:
- Mean Imputation: Suitable for continuous features with a normal distribution. \(x_{\text{fill}} = \bar{x}\)
- Median Imputation: More robust for skewed distributions. \(x_{\text{fill}} = \text{median}(x)\)
- Mode Imputation: Suitable for categorical features. \(x_{\text{fill}} = \text{mode}(x)\)
These methods are simple and fast but ignore correlations between features.
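sklearn's SimpleImputer implements all three strategies; the tiny skewed column below is a made-up illustration of why the median is often preferred:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One missing entry in a column with an extreme value (100.0)
X = np.array([[1.0], [2.0], [np.nan], [100.0]])

# Mean imputation: pulled upward by the outlier
mean_filled = SimpleImputer(strategy="mean").fit_transform(X)

# Median imputation: robust to the skew
median_filled = SimpleImputer(strategy="median").fit_transform(X)
```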
KNN Imputation
Uses the K-Nearest Neighbors algorithm to estimate missing values based on the values of the K most similar neighbors. This approach captures local relationships between features but scales poorly with dataset size.
```python
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5)
X_filled = imputer.fit_transform(X)
```

Model-Based Imputation
Treats the feature with missing values as the target variable and trains a model using the remaining features to predict the missing entries. Common methods include MICE (Multiple Imputation by Chained Equations) and IterativeImputer. This approach generally yields the best results but also incurs the highest computational cost.
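sklearn's IterativeImputer (which must be enabled explicitly, as it is still marked experimental) follows this idea: each feature with missing values is regressed on the others. The correlated two-column data below is a contrived example:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Two correlated features: the second is roughly 2x the first
X = np.array([[1.0, 2.1],
              [2.0, 3.9],
              [3.0, np.nan],
              [4.0, 8.2]])

imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
# The missing entry is predicted from the correlated first column,
# landing near 6 rather than at the column mean (~4.7)
```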
Outlier Detection
IQR Method
A classic method based on the interquartile range. Compute the first quartile \(Q_1\) and third quartile \(Q_3\), and let \(IQR = Q_3 - Q_1\). Then the normal range is defined as:
\[ [Q_1 - 1.5 \cdot IQR,\ Q_3 + 1.5 \cdot IQR] \]
Points falling outside this range are considered outliers. This method makes no distributional assumptions and is broadly applicable.
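The rule takes a few lines of NumPy; the sample array is illustrative:

```python
import numpy as np

x = np.array([10, 12, 11, 13, 12, 11, 95.0])  # 95 is an obvious outlier

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = x[(x < lower) | (x > upper)]
```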
Z-Score Method
Assumes the data is approximately normally distributed and computes the standardized score for each sample:
\[ z = \frac{x - \mu}{\sigma} \]
Points with \(|z| > 3\) are typically flagged as outliers. Simple and intuitive, but less effective for non-normally distributed data.
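A small synthetic demonstration (note that with very few samples the maximum attainable \(|z|\) is bounded by \((n-1)/\sqrt{n}\), so the 3-sigma rule needs a reasonably large sample):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(10, 1, 100), [30.0]])  # 30 is far off

z = (x - x.mean()) / x.std()
outliers = x[np.abs(z) > 3]
```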
Isolation Forest
A tree-based unsupervised anomaly detection algorithm. The core idea is that anomalous points, being "sparse," are easier to isolate through random partitioning of the feature space (requiring fewer splits). Well-suited for high-dimensional data.
```python
from sklearn.ensemble import IsolationForest

clf = IsolationForest(contamination=0.05)
labels = clf.fit_predict(X)  # -1 indicates an anomaly
```
DBSCAN
The density-based clustering algorithm DBSCAN was not originally designed for anomaly detection, but it labels points in low-density regions as noise (label=-1), making it a natural fit for detecting outliers. It works well for data with clear spatial clustering structure.
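A sketch with sklearn's DBSCAN; the two tight clusters plus one isolated point are fabricated, and `eps`/`min_samples` would need tuning on real data:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense clusters plus one isolated point
X = np.array([[0, 0], [0, 0.1], [0.1, 0],
              [5, 5], [5, 5.1], [5.1, 5],
              [20, 20.0]])

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
noise = X[labels == -1]  # points labeled -1 are treated as outliers
```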
Data Standardization and Normalization
When features have vastly different scales (e.g., age in 0-100 vs. income in 0-1,000,000), the convergence speed of gradient descent-based algorithms is severely affected. Standardization and normalization are designed to address this problem.
Min-Max Scaling (Normalization)
Linearly scales data to the \([0, 1]\) interval:
\[ x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \]
Suitable for data with well-defined boundaries and no significant outliers (e.g., image pixel values). The downside is its sensitivity to extreme values: a single outlier can compress the range of all other data points.
Z-Score Standardization
Transforms data to have zero mean and unit standard deviation:
\[ x' = \frac{x - \mu}{\sigma} \]
The most commonly used method. Applicable to most ML algorithms, especially those that rely on distance or gradient computations, such as SVM, Logistic Regression, and Neural Networks.
Robust Scaling
Uses the median and interquartile range instead of the mean and standard deviation:
\[ x' = \frac{x - \text{median}(x)}{IQR(x)} \]
Robust to outliers. Preferred when the data contains many outliers and you do not want to perform outlier removal beforehand.
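A side-by-side comparison on data with one extreme value makes the difference concrete (the array is illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# One extreme value to show how each scaler reacts
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

minmax = MinMaxScaler().fit_transform(X)     # inliers squeezed near 0
zscore = StandardScaler().fit_transform(X)   # zero mean, unit variance
robust = RobustScaler().fit_transform(X)     # inliers keep their spread
```

With Min-Max scaling the four inliers all land below 0.05 because the outlier defines \(x_{\max}\); RobustScaler, centered on the median with IQR = 2, maps them to \([-1, 0.5]\).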
Selection Guidelines
| Scenario | Recommended Method |
|---|---|
| Data has a well-defined range with no outliers | Min-Max Scaling |
| General-purpose, approximately normal data | Z-Score Standardization |
| Data contains many outliers | Robust Scaling |
| Tree-based models (Decision Tree, XGBoost, etc.) | Standardization is generally unnecessary |
Feature Encoding
Machine learning models can only process numerical inputs, so categorical features must be encoded as numbers.
One-Hot Encoding
Converts categorical variables into binary vectors. Suitable for unordered features with a small number of categories (e.g., color, city). A large number of categories leads to a dimensionality explosion.
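With pandas this is a one-liner via `get_dummies`; the `color` column is a made-up example:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Each category becomes its own 0/1 indicator column
encoded = pd.get_dummies(df, columns=["color"])
```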
Label Encoding
Maps each category to an integer (e.g., Red=0, Green=1, Blue=2). Suitable for ordinal categories (e.g., education level: elementary < middle school < university). Using it for unordered categories introduces a spurious ordinal relationship.
Target Encoding
Replaces category values with statistics of the target variable (e.g., the mean). Suitable for high-cardinality categorical features. Care must be taken to avoid data leakage; it is typically used with cross-validation or smoothing.
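One common smoothing scheme shrinks each category's mean toward the global mean in proportion to how few samples it has; this is a minimal sketch (the `city` data and the smoothing strength `alpha` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B", "C"],
    "label": [1, 1, 0, 0, 0, 1],
})

global_mean = df["label"].mean()
stats = df.groupby("city")["label"].agg(["mean", "count"])

# Smoothed encoding: rare categories are pulled toward the global mean
alpha = 2.0  # smoothing strength (a hyperparameter)
smoothed = (stats["count"] * stats["mean"] + alpha * global_mean) / (stats["count"] + alpha)
df["city_encoded"] = df["city"].map(smoothed)
```

To avoid leakage in practice, the statistics should be computed on training folds only and merely applied to validation/test data.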
Text Data Preprocessing
Tokenization
Splits raw text into a sequence of tokens. English has natural whitespace delimiters, while Chinese requires dedicated segmentation tools (e.g., jieba). In the era of LLMs, subword-level tokenizers such as BPE (Byte Pair Encoding) and SentencePiece have become the mainstream approach.
Stop Words Removal
Removes high-frequency but low-information words such as "the," "is," etc. This is useful in traditional NLP pipelines (e.g., TF-IDF + SVM), but in deep learning and Transformer architectures, stop word removal is generally not performed, as the attention mechanism can learn which words are important on its own.
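In a traditional pipeline the step is just a set-membership filter after tokenization; the tiny hand-picked stop word list below is illustrative (libraries like NLTK ship curated lists):

```python
# A minimal hand-picked stop word list for illustration
stop_words = {"the", "is", "a", "of", "and"}

text = "the quick brown fox is a friend of the lazy dog"
tokens = text.split()  # naive whitespace tokenization
filtered = [t for t in tokens if t not in stop_words]
```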
Stemming and Lemmatization
- Stemming: Rule-based suffix stripping, e.g., running -> run. Fast but crude.
- Lemmatization: Dictionary-based reduction to the base form, e.g., better -> good. More accurate but slower.
Both techniques are primarily used in traditional NLP pipelines. Modern pretrained model tokenizers already have built-in capabilities for handling morphological variations.
Image Data Preprocessing
Resize
Resizes all images to a uniform spatial resolution, such as \(224 \times 224\) (the standard input for ResNet) or \(384 \times 384\). This is necessary because most CNN architectures require fixed-size inputs. It is important to choose an appropriate interpolation method (bilinear, bicubic, etc.).
Normalize Pixel Values
Raw pixel values range from \([0, 255]\) and typically need to be normalized to \([0, 1]\) or standardized to a specific mean and standard deviation. For example, ImageNet-pretrained models commonly use:
```python
# Standard ImageNet normalization parameters (per RGB channel)
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]
```
Color Space Conversion
Perform color space conversions as required by the task, such as RGB to grayscale (to reduce computational cost) or RGB to HSV (more effective for certain visual tasks). Specialized domains like medical imaging may also involve specific window width and window level adjustments.
Data Pipeline
In production engineering, data preprocessing should not consist of scattered scripts but should be organized into a reproducible, version-controlled pipeline.
The ETL Paradigm
ETL (Extract-Transform-Load) is the classic paradigm in data engineering:
- Extract: Retrieve raw data from data sources (databases, APIs, file systems).
- Transform: Execute all preprocessing steps including cleaning, standardization, and feature engineering.
- Load: Write the processed data to the target storage (feature stores, training datasets, etc.).
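The three stages map naturally onto three functions; this is a minimal in-memory sketch in which `extract` and `load` are stand-ins for real data sources and sinks:

```python
import json

def extract():
    # Stand-in for reading from a database, API, or file system
    return [{"age": "34", "income": "52000"},
            {"age": "", "income": "61000"}]

def transform(rows):
    # Cleaning: drop rows with missing age, cast strings to ints
    return [{"age": int(r["age"]), "income": int(r["income"])}
            for r in rows if r["age"]]

def load(rows, path):
    # Stand-in for writing to a feature store or training dataset
    with open(path, "w") as f:
        json.dump(rows, f)

clean = transform(extract())
```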
sklearn Pipeline
sklearn provides the Pipeline class, which chains multiple preprocessing steps and a model into a single unit, ensuring that the exact same processing flow is used during both training and inference:
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test)
```
The advantage of this approach is that when pipe.fit() is called, the imputer and scaler are fit only on the training set and only apply transform on the test set, thereby preventing data leakage.
Reproducibility
Reproducibility of data pipelines is a fundamental requirement for production-grade ML. Key practices include:
- Fix random seeds: Ensure that data splitting, random sampling, and other stochastic steps are reproducible.
- Version control: Use tools like DVC (Data Version Control) to version both data and pipelines.
- Separate configuration from code: Extract hyperparameters, file paths, and other configuration items from the code to facilitate experiment management.
- Logging: Record data statistics (sample counts, missing rates, distributions, etc.) for each run to aid in debugging.
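The seed-fixing practice can be checked directly: with a fixed `random_state`, repeated splits are bit-for-bit identical (the toy arrays below are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Fixing random_state makes the split reproducible across runs
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)

assert (a_train == b_train).all() and (a_test == b_test).all()
```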
A well-designed data pipeline should be idempotent: given the same input, it produces exactly the same output regardless of how many times it is run.