Feature Engineering in Practice

Introduction

Feature engineering is the process of transforming raw data into model-ready features, and it is one of the most impactful steps in a machine learning project. This article covers feature types, encoding methods, scaling, feature selection, and feature extraction.


1. Feature Types

| Type | Examples | Typical Processing |
| --- | --- | --- |
| Numeric | Age, income, temperature | Scaling, binning, interactions |
| Categorical | City, education, color | Encoding (one-hot, label) |
| Text | Reviews, titles | TF-IDF, word embeddings |
| Temporal | Dates, timestamps | Extract year/month/day/weekday/hour |
| Ordinal | Ratings (1-5 stars) | Order-preserving encoding |
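
Scaling and interactions are covered in later sections; binning has no example elsewhere, so here is a minimal sketch with pd.cut (the bin edges and labels are arbitrary, for illustration only):

import pandas as pd

df = pd.DataFrame({"age": [15, 23, 37, 54, 71]})
# Arbitrary edges; each age falls into one labeled bucket
df["age_bin"] = pd.cut(df["age"], bins=[0, 18, 35, 60, 120],
                       labels=["minor", "young", "middle", "senior"])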

2. Categorical Feature Encoding

2.1 One-Hot Encoding

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# pandas
df_encoded = pd.get_dummies(df, columns=["city"], drop_first=True)

# sklearn
ohe = OneHotEncoder(drop='first', sparse_output=False)
encoded = ohe.fit_transform(df[["city"]])

Suited for: low cardinality (< 20), unordered categories.

Issue: high-cardinality categories cause dimensionality explosion.

2.2 Label Encoding

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df["city_encoded"] = le.fit_transform(df["city"])

Suited for: tree-based models, which split on thresholds and do not assume the integer codes are meaningful. Note that LabelEncoder is intended for target labels and assigns codes in sorted (alphabetical) order; for genuinely ordinal features (e.g., education: elementary < middle < university), use OrdinalEncoder with an explicit category order, as sketched below.
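
A minimal order-preserving sketch with OrdinalEncoder (the category list is assumed for illustration):

from sklearn.preprocessing import OrdinalEncoder

# Explicit order so codes respect elementary < middle < university
oe = OrdinalEncoder(categories=[["elementary", "middle", "university"]])
df["edu_encoded"] = oe.fit_transform(df[["education"]]).ravel()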

2.3 Target Encoding

Replace categories with statistics of the target variable:

# Use category-level target means
target_means = df.groupby("city")["target"].mean()
df["city_target_enc"] = df["city"].map(target_means)

# Add smoothing to prevent overfitting on small categories
def target_encode_smoothed(df, col, target, alpha=10):
    global_mean = df[target].mean()
    stats = df.groupby(col)[target].agg(['mean', 'count'])
    smoothed = (stats['count'] * stats['mean'] + alpha * global_mean) / (stats['count'] + alpha)
    return df[col].map(smoothed)

Suited for: high-cardinality categorical features. Because the encoding uses the target itself, compute it out-of-fold during cross-validation to avoid target leakage, as sketched below.
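
A minimal out-of-fold sketch, assuming a DataFrame df with "city" and "target" columns as in the examples above:

import numpy as np
from sklearn.model_selection import KFold

def target_encode_oof(df, col, target, n_splits=5):
    encoded = np.zeros(len(df))
    global_mean = df[target].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    for train_idx, val_idx in kf.split(df):
        # Category means computed on the training fold only,
        # then applied to the held-out fold
        fold_means = df.iloc[train_idx].groupby(col)[target].mean()
        encoded[val_idx] = df.iloc[val_idx][col].map(fold_means).fillna(global_mean)
    return encoded

df["city_target_oof"] = target_encode_oof(df, "city", "target")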

2.4 Frequency Encoding

freq = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq)

Suited for: high-cardinality features where how common a category is carries signal; it adds only a single column and requires no target.

2.5 Encoding Method Selection

| Method | Cardinality | Model Type | Information Preserved |
| --- | --- | --- | --- |
| One-Hot | Low (< 20) | Linear models | Unordered category identity |
| Label | Ordinal | Tree models | Order information |
| Target | High | All | Relationship with target |
| Frequency | High | All | Distribution information |
| Embedding | Very high | Deep learning | Learned representation |
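
The embedding row has no example elsewhere in this article; a minimal PyTorch sketch (vocabulary size and dimension are arbitrary):

import torch
import torch.nn as nn

# Map each of 10,000 integer category codes to a learned 16-dim vector
embedding = nn.Embedding(num_embeddings=10_000, embedding_dim=16)

city_codes = torch.tensor([3, 17, 3, 256])  # label-encoded categories
city_vectors = embedding(city_codes)        # shape (4, 16), trained with the model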

3. Numeric Feature Scaling

3.1 Standardization

\[ x' = \frac{x - \mu}{\sigma} \]

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Suited for: models that assume approximately normal data (linear regression, SVM, neural networks).

3.2 Min-Max Scaling

\[ x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \]

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

Suited for: models requiring a specific range (e.g., [0, 1]) such as neural networks and image pixels.

3.3 Robust Scaling

\[ x' = \frac{x - Q_2}{Q_3 - Q_1} \]

from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)

Suited for: data with outliers (uses median and IQR instead of mean and standard deviation).

3.4 Log Transformation

import numpy as np
from scipy.stats import boxcox

# Handle right-skewed distributions
df["log_income"] = np.log1p(df["income"])  # log(1+x) handles zeros

# Box-Cox finds the optimal power transform automatically;
# the +1 shift is needed because Box-Cox requires strictly positive inputs
df["income_bc"], lambda_opt = boxcox(df["income"] + 1)

4. Feature Selection

4.1 Filter Methods

Evaluate each feature independently based on statistical metrics:

from sklearn.feature_selection import mutual_info_classif, SelectKBest, f_classif

# Mutual information
mi_scores = mutual_info_classif(X, y)
mi_df = pd.DataFrame({"feature": X.columns, "MI": mi_scores}).sort_values("MI", ascending=False)

# F-test
selector = SelectKBest(f_classif, k=20)
X_selected = selector.fit_transform(X, y)
selected_features = X.columns[selector.get_support()]

| Method | Suited For | Measures |
| --- | --- | --- |
| Mutual Information (MI) | Classification/Regression | Nonlinear dependence |
| F-test | Classification | Linear relationship |
| Chi-squared test | Classification + categorical features | Independence |
| Variance | Any | Information content (low variance = useless) |
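
The chi-squared and variance rows also map directly to scikit-learn; a minimal sketch (note that chi2 requires non-negative features, such as counts or one-hot columns):

from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2

# Chi-squared test: non-negative features only
X_chi2 = SelectKBest(chi2, k=20).fit_transform(X, y)

# Drop near-constant features
X_var = VarianceThreshold(threshold=0.01).fit_transform(X)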

4.2 Wrapper Methods

from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

# Recursive Feature Elimination
rfe = RFE(
    estimator=RandomForestClassifier(n_estimators=100),
    n_features_to_select=20,
    step=5
)
rfe.fit(X, y)
selected = X.columns[rfe.support_]

4.3 Embedded Methods

Leverage the model's own feature importance:

# L1 regularization (automatic feature selection)
# Note: LassoCV is for regression targets; for classification,
# use LogisticRegression(penalty="l1", solver="liblinear") instead
from sklearn.linear_model import LassoCV

lasso = LassoCV(cv=5)
lasso.fit(X, y)
selected = X.columns[lasso.coef_ != 0]

# Tree model feature importance
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier().fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
top_features = importances.nlargest(20).index

5. Feature Extraction

5.1 PCA

from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)  # retain 95% variance
X_pca = pca.fit_transform(X_scaled)
print(f"Original dimensions: {X.shape[1]}, After PCA: {X_pca.shape[1]}")

5.2 Autoencoder

import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128),
            nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z
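
The class above only defines the network; a minimal training sketch (optimizer settings and epoch count are arbitrary), where the latent code z serves as the extracted features:

import torch

model = Autoencoder(input_dim=X_scaled.shape[1], latent_dim=16)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
X_tensor = torch.tensor(X_scaled, dtype=torch.float32)

for epoch in range(50):
    recon, z = model(X_tensor)
    loss = loss_fn(recon, X_tensor)  # reconstruction error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

with torch.no_grad():
    _, X_latent = model(X_tensor)    # latent codes = extracted features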

6. Feature Construction

6.1 Numeric Interactions

# Polynomial features
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_poly = poly.fit_transform(X[["age", "income"]])
# Produces: age, income, age*income

# Manual construction
df["income_per_age"] = df["income"] / (df["age"] + 1)
df["bmi"] = df["weight"] / (df["height"] / 100) ** 2

6.2 Temporal Features

df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day_of_week"] = df["date"].dt.dayofweek
df["hour"] = df["date"].dt.hour
df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)

# Cyclical encoding (avoid treating December and January as distant)
df["month_sin"] = np.sin(2 * np.pi * df["month"] / 12)
df["month_cos"] = np.cos(2 * np.pi * df["month"] / 12)

6.3 Group Statistics

# User-level aggregation features
user_stats = df.groupby("user_id").agg({
    "amount": ["mean", "sum", "count", "std"],
    "category": "nunique",
    "date": ["min", "max"]
}).reset_index()
user_stats.columns = ["_".join(col).strip("_") for col in user_stats.columns]
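
To use these aggregates as features, merge them back onto the row-level table (a sketch, assuming the df and user_stats from above):

# Attach user-level statistics to every row for that user
df = df.merge(user_stats, on="user_id", how="left")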

7. AutoML Feature Engineering

| Tool | Capability |
| --- | --- |
| Featuretools | Automated deep feature synthesis |
| tsfresh | Automatic time series feature extraction |
| AutoFeat | Automatic feature construction + selection |
| FLAML / AutoGluon | End-to-end AutoML |

# Featuretools example
import featuretools as ft

es = ft.EntitySet(id="data")
es.add_dataframe(dataframe=df, dataframe_name="orders", index="order_id")

feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="orders",
    max_depth=2,
)

