Feature Engineering

Feature engineering is the process of transforming raw data into features that are better suited for machine learning models. In traditional machine learning, feature engineering is often more important than model selection -- "data and features determine the upper bound of machine learning performance, while models and algorithms merely approach that bound."

Overview of Feature Engineering

Why Is Feature Engineering Important?

  • Improved model performance: Good features can enable simple models to match the performance of complex ones
  • Faster training: Removing redundant features significantly accelerates training
  • Enhanced interpretability: Meaningful features make model decisions easier to understand
  • Reduced overfitting: Removing noisy features improves generalization

General Feature Engineering Workflow

  1. Data exploration and understanding
  2. Missing value handling
  3. Feature encoding
  4. Feature scaling
  5. Feature construction
  6. Feature selection
  7. Validation and iteration

Feature Selection

Feature selection aims to identify the most valuable subset of original features, eliminating redundant and noisy ones.

Filter Methods

Filter methods are model-independent and evaluate the relevance of features to the target variable based solely on statistical measures.

Pearson Correlation Coefficient:

Measures the linear correlation between feature \(X\) and target \(Y\):

\[ r_{XY} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}} \]

\(|r_{XY}|\) close to 1 indicates strong linear correlation. Note that the Pearson coefficient only captures linear relationships.
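
As a minimal sketch, the coefficient can be computed with scipy.stats.pearsonr (the synthetic data here is purely illustrative):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
x = rng.normal(size=200)            # illustrative feature
y = 2 * x + rng.normal(size=200)    # target with a linear dependence on x

r, p_value = pearsonr(x, y)         # correlation coefficient and two-sided p-value
print(f"r = {r:.3f}, p = {p_value:.3g}")
```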

Mutual Information:

Mutual information captures arbitrary dependencies (including nonlinear ones) between feature \(X\) and target \(Y\):

\[ I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)p(y)} \]

For continuous features, KNN-based estimators can be used (e.g., sklearn.feature_selection.mutual_info_classif for classification targets or mutual_info_regression for regression targets).
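
A minimal sketch using scikit-learn's estimator (the breast-cancer dataset is used only for illustration):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

# KNN-based mutual information estimate between each feature and the class label
mi = mutual_info_classif(X, y, random_state=0)
top = np.argsort(mi)[::-1][:5]      # indices of the five most informative features
print(top, mi[top].round(3))
```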

Variance Threshold:

Remove features with variance below a threshold. A feature that barely varies is unlikely to carry useful information:

\[ \text{Var}(X) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 < \tau \]
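
A small sketch with sklearn.feature_selection.VarianceThreshold (the toy matrix is illustrative):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0.0, 1.0, 5.2],
              [0.0, 2.1, 3.9],
              [0.0, 0.3, 4.4]])     # the first column is constant (zero variance)

selector = VarianceThreshold(threshold=0.0)   # keep only features with variance > 0
X_reduced = selector.fit_transform(X)
print(selector.get_support())                 # [False  True  True]
```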

Other Filter Methods:

  • Chi-squared test: For categorical features with categorical targets
  • ANOVA F-test: For continuous features with categorical targets
  • Fisher Score: Ratio of between-class variance to within-class variance

Wrapper Methods

Wrapper methods treat feature selection as a search problem, using model performance as the evaluation criterion.

Recursive Feature Elimination (RFE):

  1. Train the model using all features
  2. Evaluate the importance of each feature (e.g., absolute coefficient values in linear models)
  3. Remove the least important feature
  4. Repeat steps 1--3 until the target number of features is reached

RFE has a relatively high time complexity of \(O(d \cdot T_{\text{train}})\), where \(d\) is the number of features and \(T_{\text{train}}\) is the cost of a single model fit.
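
A hedged sketch with sklearn.feature_selection.RFE; the dataset and the target of 10 features are illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Recursively drop the feature with the smallest |coefficient| until 10 remain
rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10, step=1)
rfe.fit(X, y)
print(rfe.support_)    # boolean mask of the selected features
print(rfe.ranking_)    # rank 1 = selected, larger ranks were eliminated earlier
```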

Other Wrapper Methods:

  • Forward Selection: Start with an empty set and add the best feature at each step
  • Backward Elimination: Start with the full set and remove the worst feature at each step
  • Genetic Algorithms: Use evolutionary strategies to search for the optimal feature subset

Embedded Methods

Embedded methods perform feature selection as part of the model training process.

L1 Regularization (Lasso):

\[ \min_{\mathbf{w}} \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \alpha\|\mathbf{w}\|_1 \]

L1 regularization shrinks the coefficients of unimportant features to exactly zero, achieving sparsity. A larger \(\alpha\) leads to more zero coefficients.
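
A minimal sketch combining Lasso with SelectFromModel; scaling first keeps the L1 penalty comparable across features (the alpha value and the dataset are illustrative):

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

# Keep only the features whose Lasso coefficients remain non-zero
selector = make_pipeline(
    StandardScaler(),
    SelectFromModel(Lasso(alpha=0.1)),
)
X_selected = selector.fit_transform(X, y)
print(X.shape, "->", X_selected.shape)
```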

Tree-Based Feature Importance:

  • Split-based importance: Sum of information gain from splits on each feature across all trees
  • Permutation-based importance: The decrease in model performance when a feature's values are randomly shuffled
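
Both kinds of importance can be sketched with scikit-learn (a random forest and the breast-cancer dataset are used only as an example):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

print(model.feature_importances_[:5])   # split-based (impurity) importance

# Permutation importance: performance drop when a feature is shuffled on held-out data
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean[:5])
```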

| Method Type | Advantages | Disadvantages | Typical Methods |
| --- | --- | --- | --- |
| Filter | Fast computation, model-independent | Ignores feature interactions | Correlation, mutual information, chi-squared |
| Wrapper | Considers feature interactions | High computational cost, prone to overfitting | RFE, forward selection |
| Embedded | Balances efficiency and effectiveness | Depends on a specific model | Lasso, tree-based importance |

Feature Construction

Feature construction is the process of creating new features from existing ones based on domain knowledge and data characteristics.

Polynomial Features

For features \(x_1, x_2\), generate second-degree polynomial features:

\[ \phi(\mathbf{x}) = [1, x_1, x_2, x_1^2, x_1 x_2, x_2^2] \]

sklearn.preprocessing.PolynomialFeatures automatically generates polynomial features of a specified degree.

Caveat: High-degree polynomials cause a combinatorial explosion (\(d\) features at degree \(k\) produce \(\binom{d+k}{k}\) features) and should be combined with regularization.
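
A minimal sketch (assuming a recent scikit-learn with get_feature_names_out):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])                       # [x1, x2]

poly = PolynomialFeatures(degree=2, include_bias=True)
print(poly.fit_transform(X))                     # [[1. 2. 3. 4. 6. 9.]]
print(poly.get_feature_names_out(["x1", "x2"]))  # ['1' 'x1' 'x2' 'x1^2' 'x1 x2' 'x2^2']
```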

Interaction Features

Manually create interaction features with business meaning, for example:

  • E-commerce: total spending / number of purchases = average order value
  • Finance: total debt / annual income = debt-to-income ratio
  • Advertising: clicks / impressions = click-through rate

Temporal Feature Extraction

Extract multiple meaningful features from timestamps:

  • Calendar features: year, month, day, hour, day of week, weekend flag
  • Time-difference features: days since last purchase, days since registration
  • Rolling window features: mean, max, and standard deviation over the past 7/30 days
  • Sine/cosine encoding: Encode cyclical features to preserve periodicity:
\[ x_{\sin} = \sin\left(\frac{2\pi \cdot \text{hour}}{24}\right), \quad x_{\cos} = \cos\left(\frac{2\pi \cdot \text{hour}}{24}\right) \]
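
The calendar and sine/cosine encodings above can be sketched with pandas (the column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"timestamp": pd.to_datetime(["2024-01-05 08:30", "2024-01-06 23:10"])})

df["hour"] = df["timestamp"].dt.hour
df["dayofweek"] = df["timestamp"].dt.dayofweek
df["is_weekend"] = (df["dayofweek"] >= 5).astype(int)

# Sine/cosine encoding keeps hour 23 and hour 0 close together
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
print(df)
```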

Text Features

  • Bag of Words (BoW): Term frequency counts
  • TF-IDF (see the sketch after this list):
\[ \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \log\frac{N}{\text{DF}(t)} \]
  • Word embeddings: Averaged pre-trained embeddings from Word2Vec, GloVe, etc.
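
A minimal TF-IDF sketch with scikit-learn; note that TfidfVectorizer uses a smoothed IDF variant by default, so the weights differ slightly from the formula above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "feature engineering improves model performance",
    "feature selection removes redundant features",
]

vectorizer = TfidfVectorizer()        # tokenize, count terms, apply IDF weighting
X = vectorizer.fit_transform(docs)    # sparse matrix of shape (n_docs, n_terms)
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))
```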

Feature Encoding

Converting non-numeric features into numerical form is a necessary step for modeling.

One-Hot Encoding

Converts categorical features into binary vectors. For example, color \(\{\text{red}, \text{green}, \text{blue}\}\) is encoded as:

  • Red: \([1, 0, 0]\)
  • Green: \([0, 1, 0]\)
  • Blue: \([0, 0, 1]\)

Best suited for: Unordered categorical features with low cardinality (\(< 20\)). High-cardinality features lead to the curse of dimensionality.
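
A minimal sketch, assuming scikit-learn ≥ 1.2 (where the dense-output flag is named sparse_output); handle_unknown="ignore" maps unseen categories to the all-zero vector at inference time:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
encoded = encoder.fit_transform(df[["color"]])
print(encoder.categories_)   # categories are sorted alphabetically
print(encoded)
```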

Label Encoding

Maps each category to an integer: red \(\to 0\), green \(\to 1\), blue \(\to 2\).

Note: Label encoding introduces an artificial ordering, so it is typically only appropriate for ordinal categories (e.g., education level: high school < bachelor's < master's < doctorate) or for tree-based models, which are far less sensitive to an arbitrary ordering.

Target Encoding

Replaces category values with the conditional mean of the target variable:

\[ \text{TE}(x_k) = \frac{n_k \cdot \bar{y}_k + m \cdot \bar{y}_{\text{global}}}{n_k + m} \]

where \(n_k\) is the count for category \(k\), \(\bar{y}_k\) is the target mean for category \(k\), \(m\) is a smoothing parameter, and \(\bar{y}_{\text{global}}\) is the global target mean.

Risk: Highly susceptible to target leakage; encoding must be performed within cross-validation folds.
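
A hedged sketch of smoothed, out-of-fold target encoding (the helper name, columns, and smoothing parameter m are illustrative, not a library API):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def target_encode_cv(train, col, target, m=10.0, n_splits=5):
    """Smoothed target encoding computed out-of-fold to limit target leakage (illustrative helper)."""
    global_mean = train[target].mean()
    encoded = pd.Series(np.nan, index=train.index)
    for fit_idx, enc_idx in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(train):
        fit = train.iloc[fit_idx]
        stats = fit.groupby(col)[target].agg(["mean", "count"])
        smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
        encoded.iloc[enc_idx] = train.iloc[enc_idx][col].map(smoothed).fillna(global_mean).to_numpy()
    return encoded

df = pd.DataFrame({"city": ["A", "A", "B", "B", "B", "C"] * 10,
                   "y":    [1, 0, 1, 1, 0, 0] * 10})
df["city_te"] = target_encode_cv(df, "city", "y")
print(df.head())
```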

WoE (Weight of Evidence)

Primarily used in credit scoring:

\[ \text{WoE}_i = \ln\frac{p(\text{Good}_i)}{p(\text{Bad}_i)} = \ln\frac{n_{i,\text{good}}/N_{\text{good}}}{n_{i,\text{bad}}/N_{\text{bad}}} \]

The associated Information Value (IV) measures a feature's predictive power:

\[ \text{IV} = \sum_i \left(\frac{n_{i,\text{good}}}{N_{\text{good}}} - \frac{n_{i,\text{bad}}}{N_{\text{bad}}}\right) \cdot \text{WoE}_i \]

| IV Range | Predictive Power |
| --- | --- |
| \(< 0.02\) | Useless |
| \(0.02 - 0.1\) | Weak |
| \(0.1 - 0.3\) | Moderate |
| \(0.3 - 0.5\) | Strong |
| \(> 0.5\) | Suspicious (possible overfitting) |
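
A small sketch of the WoE/IV computation (the eps term is a common smoothing trick added here to avoid division by zero; it is not part of the formula above):

```python
import numpy as np
import pandas as pd

def woe_iv(feature, target, eps=0.5):
    """WoE per category and total IV; target: 1 = good, 0 = bad (illustrative helper)."""
    df = pd.DataFrame({"x": feature, "y": target})
    grouped = df.groupby("x")["y"].agg(good="sum", total="count")
    grouped["bad"] = grouped["total"] - grouped["good"]
    p_good = (grouped["good"] + eps) / (grouped["good"].sum() + eps)
    p_bad = (grouped["bad"] + eps) / (grouped["bad"].sum() + eps)
    grouped["woe"] = np.log(p_good / p_bad)
    iv = ((p_good - p_bad) * grouped["woe"]).sum()
    return grouped[["woe"]], iv

x = pd.Series(["A", "A", "B", "B", "C", "C", "C"])
y = pd.Series([1, 1, 1, 0, 0, 0, 1])
woe_table, iv = woe_iv(x, y)
print(woe_table)
print(f"IV = {iv:.3f}")
```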

Feature Scaling

Differences in units and numerical ranges across features can severely affect model performance.

StandardScaler (Standardization)

Scales features to zero mean and unit variance:

\[ x' = \frac{x - \mu}{\sigma} \]

Best suited for: Most models, especially distance-based models (SVM, KNN) and gradient-optimized models (linear regression, neural networks).

MinMaxScaler (Normalization)

Scales features to the \([0, 1]\) interval:

\[ x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \]

Best suited for: Models requiring bounded inputs (e.g., neural networks with certain activation functions). Sensitive to outliers.

RobustScaler

Uses the median and interquartile range, making it more robust to outliers:

\[ x' = \frac{x - \text{median}(X)}{\text{IQR}(X)} \]

where \(\text{IQR} = Q_3 - Q_1\).

| Scaling Method | Preserves Distribution Shape | Sensitive to Outliers | Best Suited For |
| --- | --- | --- | --- |
| StandardScaler | Yes | Yes | General purpose, gradient-based models |
| MinMaxScaler | Yes | Very sensitive | Bounded inputs, image data |
| RobustScaler | Yes | No | Data with outliers |

Note: Tree-based models (decision trees, Random Forest, XGBoost, etc.) are not sensitive to feature scaling.
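
A minimal comparison of the three scalers on a column containing an outlier (the values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])   # 100.0 is an outlier

for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    print(type(scaler).__name__, scaler.fit_transform(X).ravel().round(2))
```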


Missing Value Handling

Common Imputation Strategies

| Strategy | Method | Best Suited For |
| --- | --- | --- |
| Deletion | Drop rows or columns with missing values | Low missing rate (\(< 5\%\)) |
| Mean/Median | Fill with the feature's mean or median | Numeric features, missing at random |
| Mode | Fill with the most frequent value | Categorical features |
| Constant | Fill with a specific value (e.g., \(-1\), "Unknown") | Missingness itself is informative |
| Interpolation | Linear or spline interpolation | Time series data |
| KNN imputation | Fill with values from nearest neighbors | Data where similar samples exist |
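
A minimal sketch of simple and KNN imputation with scikit-learn (the toy matrix is illustrative):

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0]])

print(SimpleImputer(strategy="median").fit_transform(X))   # fill with the column median
print(KNNImputer(n_neighbors=2).fit_transform(X))          # fill from the 2 nearest rows
```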

MICE (Multiple Imputation by Chained Equations)

MICE is a multiple imputation method that iteratively models each feature with missing values conditionally on all other features:

  1. Perform an initial fill for all missing values (e.g., mean imputation)
  2. For each feature \(X_j\) with missing values:
       • Treat \(X_j\) as the target variable with the other features as inputs
       • Train a regression model on the non-missing samples
       • Predict the missing values using the trained model
  3. Repeat step 2 for multiple rounds until convergence
  4. Repeat the entire process \(M\) times to generate \(M\) complete datasets

The strength of MICE lies in its ability to leverage correlations among features and quantify uncertainty through multiple imputation.
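
scikit-learn's IterativeImputer implements the chained-equations idea (returning a single imputation by default); a minimal sketch:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, required to expose IterativeImputer
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0, 3.0],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, np.nan],
              [10.0, 11.0, 12.0]])

# Each feature with missing values is iteratively regressed on the other features
imputer = IterativeImputer(max_iter=10, random_state=0)
print(imputer.fit_transform(X).round(2))
```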


Automated Feature Engineering

Featuretools

Featuretools is an automated feature engineering library based on Deep Feature Synthesis (DFS):

  • Define entity sets and relationships (e.g., user-order-product relationships)
  • Automatically generate aggregation features (e.g., average order amount per user) and transformation features (e.g., time differences)
  • Supports custom primitives
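
A hedged sketch assuming the Featuretools 1.x API; the dataframes, primitives, and column names are illustrative:

```python
import featuretools as ft
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2]})
orders = pd.DataFrame({"order_id": [10, 11, 12],
                       "customer_id": [1, 1, 2],
                       "amount": [20.0, 35.0, 15.0],
                       "order_date": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-01-15"])})

es = ft.EntitySet(id="shop")
es = es.add_dataframe(dataframe_name="customers", dataframe=customers, index="customer_id")
es = es.add_dataframe(dataframe_name="orders", dataframe=orders, index="order_id",
                      time_index="order_date")
es = es.add_relationship("customers", "customer_id", "orders", "customer_id")

# DFS stacks aggregation primitives across the customer-order relationship
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="customers",
                                      agg_primitives=["mean", "count"])
print(feature_matrix)
```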

AutoFeat

AutoFeat automatically constructs and selects nonlinear features:

  1. Generate a large pool of candidate features through arithmetic operations (addition, subtraction, multiplication, division, logarithm, square root, etc.)
  2. Use L1 regularization to select the most useful subset
  3. Produce interpretable feature expressions
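
The same generate-then-prune idea can be sketched without the library itself: build a pool of simple nonlinear candidates and let an L1-regularized model keep the useful ones (a simplified illustration, not the AutoFeat implementation):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
n = X.shape[1]

# Step 1: candidate pool of squares and pairwise products
products = np.column_stack([X[:, i] * X[:, j] for i in range(n) for j in range(i + 1, n)])
X_pool = np.column_stack([X, X ** 2, products])

# Step 2: L1 regularization prunes the pool down to a useful subset
lasso = LassoCV(cv=5).fit(StandardScaler().fit_transform(X_pool), y)
print("candidates:", X_pool.shape[1], "selected:", int(np.sum(lasso.coef_ != 0)))
```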

Feature Engineering Pipeline Best Practices

Preventing Data Leakage

  • All transformations must be fit on the training set and then applied to the test set
  • Use sklearn.pipeline.Pipeline to ensure the correct ordering
  • Target encoding must be performed within cross-validation folds

A typical processing order:

Raw data
  → Missing value handling (MICE / simple imputation)
  → Feature encoding (One-Hot / Target Encoding)
  → Feature construction (polynomial / interaction features)
  → Feature scaling (StandardScaler / RobustScaler)
  → Feature selection (Embedded / Filter)
  → Model training
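
A hedged sketch of such a pipeline with scikit-learn; the column names are illustrative:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = ["age", "income"]          # illustrative column names
categorical = ["city"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical),
])

model = Pipeline([("preprocess", preprocess),
                  ("clf", LogisticRegression(max_iter=1000))])

# Fit on the training split only, so test-set statistics never leak into the transformers:
# model.fit(X_train, y_train); model.score(X_test, y_test)
```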

Key Considerations

  • Split data first, then engineer features: Avoid letting test set information influence training
  • Document all transformations: Ensure that the production environment can reproduce the training-time feature processing
  • Monitor feature drift: Continuously monitor changes in feature distributions after deployment
  • Keep it simple: More features are not always better; excessive features can lead to overfitting and maintenance difficulties

Feature Engineering Requirements by Model Type

| Model | Needs Scaling | Needs Encoding | Handles Missing Values | Needs Feature Selection |
| --- | --- | --- | --- | --- |
| Linear / Logistic Regression | Yes | Yes | No | Recommended |
| SVM | Yes | Yes | No | Recommended |
| KNN | Yes | Yes | No | Recommended |
| Decision Tree | No | No (Label OK) | Partial | Optional |
| Random Forest | No | No (Label OK) | Partial | Optional |
| XGBoost / LightGBM | No | Partial | Yes | Optional |
| Neural Networks | Yes | Yes | No | Recommended |
