Feature Engineering in Practice
Introduction
Feature engineering is the process of transforming raw data into model-ready features, and it is one of the most impactful steps in a machine learning project. This article covers feature types, encoding methods, scaling, feature selection, and feature extraction.
1. Feature Types
| Type | Examples | Processing |
|---|---|---|
| Numeric | Age, income, temperature | Scaling, binning, interaction |
| Categorical | City, education, color | Encoding (one-hot, label) |
| Text | Reviews, titles | TF-IDF, word embeddings |
| Temporal | Dates, timestamps | Extract year/month/day/weekday/hour |
| Ordinal | Ratings (1-5 stars) | Order-preserving encoding |
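Before picking a processing strategy, it helps to split the columns by type. A minimal pandas sketch, assuming a loaded DataFrame df:

import pandas as pd

numeric_cols = df.select_dtypes(include="number").columns
categorical_cols = df.select_dtypes(include=["object", "category"]).columns
datetime_cols = df.select_dtypes(include="datetime").columns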
2. Categorical Feature Encoding
2.1 One-Hot Encoding
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
# pandas
df_encoded = pd.get_dummies(df, columns=["city"], drop_first=True)
# sklearn
ohe = OneHotEncoder(drop='first', sparse_output=False)
encoded = ohe.fit_transform(df[["city"]])
Suited for: low cardinality (< 20), unordered categories.
Issue: high-cardinality categories cause dimensionality explosion.
2.2 Label Encoding
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df["city_encoded"] = le.fit_transform(df["city"])
Suited for: tree-based models, which can split on arbitrary integer codes. Note that LabelEncoder assigns codes in alphabetical order and is designed for target labels; for genuinely ordinal features (e.g., education: elementary < middle < university), use OrdinalEncoder with an explicit category order, as sketched below.
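A minimal order-preserving sketch with OrdinalEncoder; the three-level education scale is an illustrative assumption:

from sklearn.preprocessing import OrdinalEncoder

# Position in the list becomes the integer code, preserving the intended order
oe = OrdinalEncoder(categories=[["elementary", "middle", "university"]])
df["education_encoded"] = oe.fit_transform(df[["education"]]).ravel()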
2.3 Target Encoding
Replace categories with statistics of the target variable:
# Use category-level target means
target_means = df.groupby("city")["target"].mean()
df["city_target_enc"] = df["city"].map(target_means)
# Add smoothing to prevent overfitting on small categories
def target_encode_smoothed(df, col, target, alpha=10):
    global_mean = df[target].mean()
    stats = df.groupby(col)[target].agg(['mean', 'count'])
    smoothed = (stats['count'] * stats['mean'] + alpha * global_mean) / (stats['count'] + alpha)
    return df[col].map(smoothed)
Suited for: high-cardinality categorical features. To avoid target leakage, compute the encoding within each training fold during cross-validation, as sketched below.
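A minimal out-of-fold sketch reusing the smoothing above; the fold count, smoothing strength, and random seed are illustrative:

import numpy as np
from sklearn.model_selection import KFold

def target_encode_oof(df, col, target, n_splits=5, alpha=10):
    encoded = np.zeros(len(df))
    global_mean = df[target].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    for train_idx, val_idx in kf.split(df):
        # Fit the encoding on the training fold only
        stats = df.iloc[train_idx].groupby(col)[target].agg(['mean', 'count'])
        smoothed = (stats['count'] * stats['mean'] + alpha * global_mean) / (stats['count'] + alpha)
        # Categories unseen in the training fold fall back to the global mean
        encoded[val_idx] = df.iloc[val_idx][col].map(smoothed).fillna(global_mean).values
    return encoded

df["city_target_enc_oof"] = target_encode_oof(df, "city", "target")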
2.4 Frequency Encoding
freq = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq)
2.5 Encoding Method Selection
| Method | Cardinality | Model Type | Information Preserved |
|---|---|---|---|
| One-Hot | Low (< 20) | Linear models | Unordered information |
| Label | Ordinal | Tree models | Order information |
| Target | High | All | Relationship with target |
| Frequency | High | All | Distribution information |
| Embedding | Very high | Deep learning | Learned representation (see sketch below) |
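For the embedding row, a minimal PyTorch sketch, assuming the integer codes produced in section 2.2; the embedding dimension of 8 is an illustrative choice, and in practice the vectors are trained jointly with the downstream network:

import torch
import torch.nn as nn

n_categories = df["city_encoded"].nunique()
emb = nn.Embedding(num_embeddings=n_categories, embedding_dim=8)

codes = torch.tensor(df["city_encoded"].values, dtype=torch.long)
city_vectors = emb(codes)  # shape: (n_rows, 8)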
3. Numeric Feature Scaling
3.1 Standardization
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
Suited for: scale-sensitive models such as linear regression, SVMs, and neural networks; it rescales each feature to zero mean and unit variance.
3.2 Min-Max Scaling
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
Suited for: models requiring a specific range (e.g., [0, 1]) such as neural networks and image pixels.
3.3 Robust Scaling
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)
Suited for: data with outliers (uses median and IQR instead of mean and standard deviation).
3.4 Log Transformation
# Handle right-skewed distributions
import numpy as np
df["log_income"] = np.log1p(df["income"])  # log(1+x) handles zeros
# Box-Cox transformation (automatically finds the optimal power transform)
from scipy.stats import boxcox
df["income_bc"], lambda_opt = boxcox(df["income"] + 1)  # Box-Cox requires strictly positive input
4. Feature Selection
4.1 Filter Methods
Evaluate each feature independently based on statistical metrics:
from sklearn.feature_selection import mutual_info_classif, SelectKBest, f_classif
# Mutual information
mi_scores = mutual_info_classif(X, y)
mi_df = pd.DataFrame({"feature": X.columns, "MI": mi_scores}).sort_values("MI", ascending=False)
# F-test
selector = SelectKBest(f_classif, k=20)
X_selected = selector.fit_transform(X, y)
selected_features = X.columns[selector.get_support()]
| Method | Suited For | Measures |
|---|---|---|
| Mutual Information (MI) | Classification/Regression | Nonlinear dependence |
| F-test | Classification | Linear relationship |
| Chi-squared test | Classification + categorical features | Independence |
| Variance | Any | Information content (near-zero variance carries little signal; see sketch below) |
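A minimal sketch of the variance filter from the table, using scikit-learn's VarianceThreshold; the threshold value is illustrative:

from sklearn.feature_selection import VarianceThreshold

vt = VarianceThreshold(threshold=0.01)  # drop near-constant features
X_reduced = vt.fit_transform(X)
kept_features = X.columns[vt.get_support()]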
4.2 Wrapper Methods
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
# Recursive Feature Elimination
rfe = RFE(
    estimator=RandomForestClassifier(n_estimators=100),
    n_features_to_select=20,
    step=5,  # drop 5 features per iteration
)
rfe.fit(X, y)
selected = X.columns[rfe.support_]
4.3 Embedded Methods
Leverage the model's own feature importance:
# L1 regularization zeroes out uninformative coefficients (Lasso targets
# regression; for classification, use LogisticRegression(penalty="l1"))
from sklearn.linear_model import LassoCV
lasso = LassoCV(cv=5)
lasso.fit(X, y)
selected = X.columns[lasso.coef_ != 0]
# Tree model feature importance
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier().fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
top_features = importances.nlargest(20).index
5. Feature Extraction
5.1 PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=0.95) # retain 95% variance
X_pca = pca.fit_transform(X_scaled)
print(f"Original dimensions: {X.shape[1]}, After PCA: {X_pca.shape[1]}")
5.2 Autoencoder
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128),
            nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z  # reconstruction and latent features
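A minimal training-and-extraction sketch; the optimizer, learning rate, epoch count, and latent size are illustrative, and X_tensor is assumed to be a scaled float tensor of the features:

import torch

model = Autoencoder(input_dim=X_tensor.shape[1], latent_dim=16)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(50):
    optimizer.zero_grad()
    recon, _ = model(X_tensor)
    loss = loss_fn(recon, X_tensor)  # reconstruction error drives training
    loss.backward()
    optimizer.step()

# The latent codes serve as the extracted features
with torch.no_grad():
    _, X_latent = model(X_tensor)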
6. Feature Construction
6.1 Numeric Interactions
# Polynomial features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_poly = poly.fit_transform(X[["age", "income"]])
# Produces: age, income, age*income
# Manual construction
df["income_per_age"] = df["income"] / (df["age"] + 1)
df["bmi"] = df["weight"] / (df["height"] / 100) ** 2
6.2 Temporal Features
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day_of_week"] = df["date"].dt.dayofweek
df["hour"] = df["date"].dt.hour
df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)
# Cyclical encoding (avoid treating December and January as distant)
df["month_sin"] = np.sin(2 * np.pi * df["month"] / 12)
df["month_cos"] = np.cos(2 * np.pi * df["month"] / 12)
6.3 Group Statistics
# User-level aggregation features
user_stats = df.groupby("user_id").agg({
    "amount": ["mean", "sum", "count", "std"],
    "category": "nunique",
    "date": ["min", "max"],
}).reset_index()
user_stats.columns = ["_".join(col).strip("_") for col in user_stats.columns]
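These aggregates are typically joined back onto the row-level table so each record carries its user's statistics:

df = df.merge(user_stats, on="user_id", how="left")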
7. AutoML Feature Engineering
| Tool | Capability |
|---|---|
| Featuretools | Automated deep feature synthesis |
| tsfresh | Automatic time series feature extraction |
| AutoFeat | Automatic feature construction + selection |
| FLAML / AutoGluon | End-to-end AutoML |
# Featuretools example
import featuretools as ft
es = ft.EntitySet(id="data")
es.add_dataframe(dataframe=df, dataframe_name="orders", index="order_id")
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="orders",
    max_depth=2,
)
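Similarly, a minimal tsfresh sketch for the table above, assuming a long-format time series table ts_df with hypothetical columns id and time:

from tsfresh import extract_features

# One row per observation; tsfresh derives hundreds of statistical
# features (autocorrelation, entropy, peak counts, ...) per series id
ts_features = extract_features(ts_df, column_id="id", column_sort="time")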
References
- Alice Zheng & Amanda Casari, Feature Engineering for Machine Learning (O'Reilly)
- Max Kuhn & Kjell Johnson, Feature Engineering and Selection: A Practical Approach for Predictive Models (CRC Press)
- scikit-learn preprocessing documentation: https://scikit-learn.org/stable/modules/preprocessing.html