Feature Engineering

Feature engineering is the process of transforming raw data into features that are better suited for machine learning models. In traditional machine learning, feature engineering is often more important than model selection -- "data and features determine the upper bound of machine learning performance, while models and algorithms merely approach that bound."

Overview of Feature Engineering

Why Is Feature Engineering Important?

  • Improved model performance: Good features can enable simple models to match the performance of complex ones
  • Faster training: Removing redundant features significantly accelerates training
  • Enhanced interpretability: Meaningful features make model decisions easier to understand
  • Reduced overfitting: Removing noisy features improves generalization

General Feature Engineering Workflow

  1. Data exploration and understanding
  2. Missing value handling
  3. Feature encoding
  4. Feature scaling
  5. Feature construction
  6. Feature selection
  7. Validation and iteration

Feature Selection

Feature selection aims to identify the most valuable subset of original features, eliminating redundant and noisy ones.

Filter Methods

Filter methods are model-independent and evaluate the relevance of features to the target variable based solely on statistical measures.

Pearson Correlation Coefficient:

Measures the linear correlation between feature \(X\) and target \(Y\):

\[ r_{XY} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}} \]

\(|r_{XY}|\) close to 1 indicates strong linear correlation. Note that the Pearson coefficient only captures linear relationships.
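
As a minimal sketch, the coefficient can be computed with scipy.stats.pearsonr (the synthetic data here is purely illustrative):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
x = rng.normal(size=200)            # illustrative feature
y = 2 * x + rng.normal(size=200)    # target with a linear dependence on x

r, p_value = pearsonr(x, y)         # correlation coefficient and two-sided p-value
print(f"r = {r:.3f}, p = {p_value:.3g}")
```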

Mutual Information:

Mutual information captures arbitrary dependencies (including nonlinear ones) between feature \(X\) and target \(Y\):

\[ I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)p(y)} \]

For continuous features, KNN-based estimators can be used (e.g., sklearn.feature_selection.mutual_info_classif for classification targets or mutual_info_regression for regression targets).
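
A minimal sketch using scikit-learn's estimator (the breast-cancer dataset is used only for illustration):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

# KNN-based mutual information estimate between each feature and the class label
mi = mutual_info_classif(X, y, random_state=0)
top = np.argsort(mi)[::-1][:5]      # indices of the five most informative features
print(top, mi[top].round(3))
```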

Variance Threshold:

Remove features with variance below a threshold. A feature that barely varies is unlikely to carry useful information:

\[ \text{Var}(X) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 < \tau \]
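
A small sketch with sklearn.feature_selection.VarianceThreshold (the toy matrix is illustrative):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0.0, 1.0, 5.2],
              [0.0, 2.1, 3.9],
              [0.0, 0.3, 4.4]])     # the first column is constant (zero variance)

selector = VarianceThreshold(threshold=0.0)   # keep only features with variance > 0
X_reduced = selector.fit_transform(X)
print(selector.get_support())                 # [False  True  True]
```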

Other Filter Methods:

  • Chi-squared test: For categorical features with categorical targets
  • ANOVA F-test: For continuous features with categorical targets
  • Fisher Score: Ratio of between-class variance to within-class variance

Wrapper Methods

Wrapper methods treat feature selection as a search problem, using model performance as the evaluation criterion.

Recursive Feature Elimination (RFE):

  1. Train the model using all features
  2. Evaluate the importance of each feature (e.g., absolute coefficient values in linear models)
  3. Remove the least important feature
  4. Repeat steps 1--3 until the target number of features is reached

RFE has a relatively high time complexity of \(O(d \cdot T_{\text{train}})\), where \(d\) is the number of features and \(T_{\text{train}}\) is the cost of a single model fit.
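
A hedged sketch with sklearn.feature_selection.RFE; the dataset and the target of 10 features are illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Recursively drop the feature with the smallest |coefficient| until 10 remain
rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10, step=1)
rfe.fit(X, y)
print(rfe.support_)    # boolean mask of the selected features
print(rfe.ranking_)    # rank 1 = selected, larger ranks were eliminated earlier
```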

Other Wrapper Methods:

  • Forward Selection: Start with an empty set and add the best feature at each step
  • Backward Elimination: Start with the full set and remove the worst feature at each step
  • Genetic Algorithms: Use evolutionary strategies to search for the optimal feature subset

Embedded Methods

Embedded methods perform feature selection as part of the model training process.

L1 Regularization (Lasso):

\[ \min_{\mathbf{w}} \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \alpha\|\mathbf{w}\|_1 \]

L1 regularization shrinks the coefficients of unimportant features to exactly zero, achieving sparsity. A larger \(\alpha\) leads to more zero coefficients.
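
A minimal sketch combining Lasso with SelectFromModel; scaling first keeps the L1 penalty comparable across features (the alpha value and the dataset are illustrative):

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

# Keep only the features whose Lasso coefficients remain non-zero
selector = make_pipeline(
    StandardScaler(),
    SelectFromModel(Lasso(alpha=0.1)),
)
X_selected = selector.fit_transform(X, y)
print(X.shape, "->", X_selected.shape)
```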

Tree-Based Feature Importance:

  • Split-based importance: Sum of information gain from splits on each feature across all trees
  • Permutation-based importance: The decrease in model performance when a feature's values are randomly shuffled
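
Both kinds of importance can be sketched with scikit-learn (a random forest and the breast-cancer dataset are used only as an example):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

print(model.feature_importances_[:5])   # split-based (impurity) importance

# Permutation importance: performance drop when a feature is shuffled on held-out data
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean[:5])
```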

| Method Type | Advantages | Disadvantages | Typical Methods |
| --- | --- | --- | --- |
| Filter | Fast computation, model-independent | Ignores feature interactions | Correlation, mutual information, chi-squared |
| Wrapper | Considers feature interactions | High computational cost, prone to overfitting | RFE, forward selection |
| Embedded | Balances efficiency and effectiveness | Depends on a specific model | Lasso, tree-based importance |

Feature Construction

Feature construction is the process of creating new features from existing ones based on domain knowledge and data characteristics.

Polynomial Features

For features \(x_1, x_2\), generate second-degree polynomial features:

\[ \phi(\mathbf{x}) = [1, x_1, x_2, x_1^2, x_1 x_2, x_2^2] \]

sklearn.preprocessing.PolynomialFeatures automatically generates polynomial features of a specified degree.

Caveat: High-degree polynomials cause a combinatorial explosion (\(d\) features at degree \(k\) produce \(\binom{d+k}{k}\) features) and should be combined with regularization.
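
A minimal sketch (assuming a recent scikit-learn with get_feature_names_out):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])                       # [x1, x2]

poly = PolynomialFeatures(degree=2, include_bias=True)
print(poly.fit_transform(X))                     # [[1. 2. 3. 4. 6. 9.]]
print(poly.get_feature_names_out(["x1", "x2"]))  # ['1' 'x1' 'x2' 'x1^2' 'x1 x2' 'x2^2']
```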

Interaction Features

Manually create interaction features with business meaning, for example:

  • E-commerce: total spending / number of purchases = average order value
  • Finance: total debt / annual income = debt-to-income ratio
  • Advertising: clicks / impressions = click-through rate

Temporal Feature Extraction

Extract multiple meaningful features from timestamps:

  • Calendar features: year, month, day, hour, day of week, weekend flag
  • Time-difference features: days since last purchase, days since registration
  • Rolling window features: mean, max, and standard deviation over the past 7/30 days
  • Sine/cosine encoding: Encode cyclical features to preserve periodicity:
\[ x_{\sin} = \sin\left(\frac{2\pi \cdot \text{hour}}{24}\right), \quad x_{\cos} = \cos\left(\frac{2\pi \cdot \text{hour}}{24}\right) \]
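
The calendar and sine/cosine encodings above can be sketched with pandas (the column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"timestamp": pd.to_datetime(["2024-01-05 08:30", "2024-01-06 23:10"])})

df["hour"] = df["timestamp"].dt.hour
df["dayofweek"] = df["timestamp"].dt.dayofweek
df["is_weekend"] = (df["dayofweek"] >= 5).astype(int)

# Sine/cosine encoding keeps hour 23 and hour 0 close together
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
print(df)
```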

Text Features

  • Bag of Words (BoW): Term frequency counts
  • TF-IDF (see the sketch after this list):
\[ \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \log\frac{N}{\text{DF}(t)} \]
  • Word embeddings: Averaged pre-trained embeddings from Word2Vec, GloVe, etc.
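
A minimal TF-IDF sketch with scikit-learn; note that TfidfVectorizer uses a smoothed IDF variant by default, so the weights differ slightly from the formula above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "feature engineering improves model performance",
    "feature selection removes redundant features",
]

vectorizer = TfidfVectorizer()        # tokenize, count terms, apply IDF weighting
X = vectorizer.fit_transform(docs)    # sparse matrix of shape (n_docs, n_terms)
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))
```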

Feature Encoding

Converting non-numeric features into numerical form is a necessary step for modeling.

One-Hot Encoding

Converts categorical features into binary vectors. For example, color \(\{\text{red}, \text{green}, \text{blue}\}\) is encoded as:

  • Red: \([1, 0, 0]\)
  • Green: \([0, 1, 0]\)
  • Blue: \([0, 0, 1]\)

Best suited for: Unordered categorical features with low cardinality (\(< 20\)). High-cardinality features lead to the curse of dimensionality.
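
A minimal sketch, assuming scikit-learn ≥ 1.2 (where the dense-output flag is named sparse_output); handle_unknown="ignore" maps unseen categories to the all-zero vector at inference time:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
encoded = encoder.fit_transform(df[["color"]])
print(encoder.categories_)   # categories are sorted alphabetically
print(encoded)
```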

Label Encoding

Maps each category to an integer: red \(\to 0\), green \(\to 1\), blue \(\to 2\).

Note: Label encoding introduces an artificial ordering, so it is typically only appropriate for ordinal categories (e.g., education level: high school < bachelor's < master's < doctorate) or for tree-based models, which are far less sensitive to an arbitrary ordering.

Target Encoding

Replaces category values with the conditional mean of the target variable:

\[ \text{TE}(x_k) = \frac{n_k \cdot \bar{y}_k + m \cdot \bar{y}_{\text{global}}}{n_k + m} \]

where \(n_k\) is the count for category \(k\), \(\bar{y}_k\) is the target mean for category \(k\), \(m\) is a smoothing parameter, and \(\bar{y}_{\text{global}}\) is the global target mean.

Risk: Highly susceptible to target leakage; encoding must be performed within cross-validation folds.
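
A hedged sketch of smoothed, out-of-fold target encoding (the helper name, columns, and smoothing parameter m are illustrative, not a library API):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def target_encode_cv(train, col, target, m=10.0, n_splits=5):
    """Smoothed target encoding computed out-of-fold to limit target leakage (illustrative helper)."""
    global_mean = train[target].mean()
    encoded = pd.Series(np.nan, index=train.index)
    for fit_idx, enc_idx in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(train):
        fit = train.iloc[fit_idx]
        stats = fit.groupby(col)[target].agg(["mean", "count"])
        smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
        encoded.iloc[enc_idx] = train.iloc[enc_idx][col].map(smoothed).fillna(global_mean).to_numpy()
    return encoded

df = pd.DataFrame({"city": ["A", "A", "B", "B", "B", "C"] * 10,
                   "y":    [1, 0, 1, 1, 0, 0] * 10})
df["city_te"] = target_encode_cv(df, "city", "y")
print(df.head())
```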

WoE (Weight of Evidence)

Primarily used in credit scoring:

\[ \text{WoE}_i = \ln\frac{p(\text{Good}_i)}{p(\text{Bad}_i)} = \ln\frac{n_{i,\text{good}}/N_{\text{good}}}{n_{i,\text{bad}}/N_{\text{bad}}} \]

The associated Information Value (IV) measures a feature's predictive power:

\[ \text{IV} = \sum_i \left(\frac{n_{i,\text{good}}}{N_{\text{good}}} - \frac{n_{i,\text{bad}}}{N_{\text{bad}}}\right) \cdot \text{WoE}_i \]

| IV Range | Predictive Power |
| --- | --- |
| \(< 0.02\) | Useless |
| \(0.02 - 0.1\) | Weak |
| \(0.1 - 0.3\) | Moderate |
| \(0.3 - 0.5\) | Strong |
| \(> 0.5\) | Suspicious (possible overfitting) |
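
A small sketch of the WoE/IV computation (the eps term is a common smoothing trick added here to avoid division by zero; it is not part of the formula above):

```python
import numpy as np
import pandas as pd

def woe_iv(feature, target, eps=0.5):
    """WoE per category and total IV; target: 1 = good, 0 = bad (illustrative helper)."""
    df = pd.DataFrame({"x": feature, "y": target})
    grouped = df.groupby("x")["y"].agg(good="sum", total="count")
    grouped["bad"] = grouped["total"] - grouped["good"]
    p_good = (grouped["good"] + eps) / (grouped["good"].sum() + eps)
    p_bad = (grouped["bad"] + eps) / (grouped["bad"].sum() + eps)
    grouped["woe"] = np.log(p_good / p_bad)
    iv = ((p_good - p_bad) * grouped["woe"]).sum()
    return grouped[["woe"]], iv

x = pd.Series(["A", "A", "B", "B", "C", "C", "C"])
y = pd.Series([1, 1, 1, 0, 0, 0, 1])
woe_table, iv = woe_iv(x, y)
print(woe_table)
print(f"IV = {iv:.3f}")
```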

Feature Scaling

Differences in units and numerical ranges across features can severely affect model performance.

StandardScaler (Standardization)

Scales features to zero mean and unit variance:

\[ x' = \frac{x - \mu}{\sigma} \]

Best suited for: Most models, especially distance-based models (SVM, KNN) and gradient-optimized models (linear regression, neural networks).

MinMaxScaler (Normalization)

Scales features to the \([0, 1]\) interval:

\[ x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \]

Best suited for: Models requiring bounded inputs (e.g., neural networks with certain activation functions). Sensitive to outliers.

RobustScaler

Uses the median and interquartile range, making it more robust to outliers:

\[ x' = \frac{x - \text{median}(X)}{\text{IQR}(X)} \]

where \(\text{IQR} = Q_3 - Q_1\).

| Scaling Method | Preserves Distribution Shape | Sensitive to Outliers | Best Suited For |
| --- | --- | --- | --- |
| StandardScaler | Yes | Yes | General purpose, gradient-based models |
| MinMaxScaler | Yes | Very sensitive | Bounded inputs, image data |
| RobustScaler | Yes | No | Data with outliers |

Note: Tree-based models (decision trees, Random Forest, XGBoost, etc.) are not sensitive to feature scaling.
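
A minimal comparison of the three scalers on a column containing an outlier (the values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])   # 100.0 is an outlier

for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    print(type(scaler).__name__, scaler.fit_transform(X).ravel().round(2))
```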


Missing Value Handling

Common Imputation Strategies

| Strategy | Method | Best Suited For |
| --- | --- | --- |
| Deletion | Drop rows or columns with missing values | Low missing rate (\(< 5\%\)) |
| Mean/Median | Fill with the feature's mean or median | Numeric features, missing at random |
| Mode | Fill with the most frequent value | Categorical features |
| Constant | Fill with a specific value (e.g., \(-1\), "Unknown") | Missingness itself is informative |
| Interpolation | Linear or spline interpolation | Time series data |
| KNN imputation | Fill with values from nearest neighbors | Data where similar samples exist |
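
A minimal sketch of simple and KNN imputation with scikit-learn (the toy matrix is illustrative):

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0]])

print(SimpleImputer(strategy="median").fit_transform(X))   # fill with the column median
print(KNNImputer(n_neighbors=2).fit_transform(X))          # fill from the 2 nearest rows
```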

MICE (Multiple Imputation by Chained Equations)

MICE is a multiple imputation method that iteratively models each feature with missing values conditionally on all other features:

  1. Perform an initial fill for all missing values (e.g., mean imputation)
  2. For each feature \(X_j\) with missing values:
       • Treat \(X_j\) as the target variable with the other features as inputs
       • Train a regression model on the non-missing samples
       • Predict the missing values using the trained model
  3. Repeat step 2 for multiple rounds until convergence
  4. Repeat the entire process \(M\) times to generate \(M\) complete datasets

The strength of MICE lies in its ability to leverage correlations among features and quantify uncertainty through multiple imputation.
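
scikit-learn's IterativeImputer implements the chained-equations idea (returning a single imputation by default); a minimal sketch:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, required to expose IterativeImputer
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0, 3.0],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, np.nan],
              [10.0, 11.0, 12.0]])

# Each feature with missing values is iteratively regressed on the other features
imputer = IterativeImputer(max_iter=10, random_state=0)
print(imputer.fit_transform(X).round(2))
```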


Automated Feature Engineering

Featuretools

Featuretools is an automated feature engineering library based on Deep Feature Synthesis (DFS):

  • Define entity sets and relationships (e.g., user-order-product relationships)
  • Automatically generate aggregation features (e.g., average order amount per user) and transformation features (e.g., time differences)
  • Supports custom primitives
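
A hedged sketch assuming the Featuretools 1.x API; the dataframes, primitives, and column names are illustrative:

```python
import featuretools as ft
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2]})
orders = pd.DataFrame({"order_id": [10, 11, 12],
                       "customer_id": [1, 1, 2],
                       "amount": [20.0, 35.0, 15.0],
                       "order_date": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-01-15"])})

es = ft.EntitySet(id="shop")
es = es.add_dataframe(dataframe_name="customers", dataframe=customers, index="customer_id")
es = es.add_dataframe(dataframe_name="orders", dataframe=orders, index="order_id",
                      time_index="order_date")
es = es.add_relationship("customers", "customer_id", "orders", "customer_id")

# DFS stacks aggregation primitives across the customer-order relationship
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="customers",
                                      agg_primitives=["mean", "count"])
print(feature_matrix)
```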

AutoFeat

AutoFeat automatically constructs and selects nonlinear features:

  1. Generate a large pool of candidate features through arithmetic operations (addition, subtraction, multiplication, division, logarithm, square root, etc.)
  2. Use L1 regularization to select the most useful subset
  3. Produce interpretable feature expressions
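
The same generate-then-prune idea can be sketched without the library itself: build a pool of simple nonlinear candidates and let an L1-regularized model keep the useful ones (a simplified illustration, not the AutoFeat implementation):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
n = X.shape[1]

# Step 1: candidate pool of squares and pairwise products
products = np.column_stack([X[:, i] * X[:, j] for i in range(n) for j in range(i + 1, n)])
X_pool = np.column_stack([X, X ** 2, products])

# Step 2: L1 regularization prunes the pool down to a useful subset
lasso = LassoCV(cv=5).fit(StandardScaler().fit_transform(X_pool), y)
print("candidates:", X_pool.shape[1], "selected:", int(np.sum(lasso.coef_ != 0)))
```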

Feature Engineering Pipeline Best Practices

Preventing Data Leakage

  • All transformations must be fit on the training set and then applied to the test set
  • Use sklearn.pipeline.Pipeline to ensure the correct ordering
  • Target encoding must be performed within cross-validation folds

A typical processing order:

Raw data
  → Missing value handling (MICE / simple imputation)
  → Feature encoding (One-Hot / Target Encoding)
  → Feature construction (polynomial / interaction features)
  → Feature scaling (StandardScaler / RobustScaler)
  → Feature selection (Embedded / Filter)
  → Model training
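
A hedged sketch of such a pipeline with scikit-learn; the column names are illustrative:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = ["age", "income"]          # illustrative column names
categorical = ["city"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical),
])

model = Pipeline([("preprocess", preprocess),
                  ("clf", LogisticRegression(max_iter=1000))])

# Fit on the training split only, so test-set statistics never leak into the transformers:
# model.fit(X_train, y_train); model.score(X_test, y_test)
```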

Key Considerations

  • Split data first, then engineer features: Avoid letting test set information influence training
  • Document all transformations: Ensure that the production environment can reproduce the training-time feature processing
  • Monitor feature drift: Continuously monitor changes in feature distributions after deployment
  • Keep it simple: More features are not always better; excessive features can lead to overfitting and maintenance difficulties

Feature Engineering Requirements by Model Type

| Model | Needs Scaling | Needs Encoding | Handles Missing Values | Needs Feature Selection |
| --- | --- | --- | --- | --- |
| Linear / Logistic Regression | Yes | Yes | No | Recommended |
| SVM | Yes | Yes | No | Recommended |
| KNN | Yes | Yes | No | Recommended |
| Decision Tree | No | No (Label OK) | Partial | Optional |
| Random Forest | No | No (Label OK) | Partial | Optional |
| XGBoost / LightGBM | No | Partial | Yes | Optional |
| Neural Networks | Yes | Yes | No | Recommended |
