
Data Science

Data science is an interdisciplinary field that extracts knowledge and insights from data through statistics, computer science, and domain expertise. In the AI/ML domain, data science serves as the cornerstone for building high-quality models — the quality and handling of data often matter more than the model itself in determining final outcomes.

The Relationship Between Data Science and AI

Data Is the Fuel of AI

The success of modern AI depends heavily on data. Whether in traditional machine learning or deep learning, model performance is ultimately bounded by data quality. There is a classic saying in the industry: "Garbage in, garbage out" — if the input data is garbage, the model's output will be worthless.

The role of data science within the AI/ML pipeline is as follows:

Raw Data → Data Collection → Data Cleaning → EDA → Feature Engineering → Model Training → Model Evaluation → Deployment
           \___________________________________Data Science____________________________________/

From Raw Data to Models: The Data Science Workflow

A complete data science workflow typically consists of the following steps:

| Phase | Core Tasks | Common Tools |
|---|---|---|
| Problem Definition | Clarify business objectives and formulate them as ML problems | |
| Data Collection | Web scraping, APIs, database queries, sensors | Scrapy, SQL, Spark |
| Data Cleaning | Handling missing values, duplicates, and outliers | Pandas, NumPy |
| Exploratory Data Analysis | Statistical summaries, visualization, hypothesis testing | Matplotlib, Seaborn |
| Feature Engineering | Feature selection, extraction, and transformation | Scikit-learn, Featuretools |
| Model Building | Algorithm selection, model training | Scikit-learn, XGBoost, PyTorch |
| Model Evaluation | Cross-validation, metric analysis | Scikit-learn |
| Deployment & Monitoring | Model serving, data drift monitoring | MLflow, Docker, Kubernetes |

In real-world industrial settings, the data preparation phase (collection, cleaning, EDA, and feature engineering) typically consumes 60%–80% of the total project time. This underscores the critical importance of data science skills for the success of AI projects.


Data Types and Structures

Understanding data types is a prerequisite for all data work. Different types of data require different processing strategies and model architectures.

Classification by Degree of Structure

| Type | Definition | Examples | Common Storage |
|---|---|---|---|
| Structured Data | Tabular data with a fixed schema | User tables, transaction records, sensor readings | SQL databases, CSV |
| Semi-structured Data | Hierarchically organized but not strictly tabular | JSON, XML, log files | NoSQL (MongoDB), Elasticsearch |
| Unstructured Data | No predefined structure | Text, images, audio, video | Object storage (S3), file systems |

In the AI domain, structured data is typically handled by traditional ML models (e.g., XGBoost, Random Forest), while unstructured data relies more on deep learning models (e.g., CNNs for images, Transformers for text).

Feature Types

In machine learning, each column of data is called a feature. Features can be classified by their mathematical properties:

| Feature Type | Description | Examples | Common Processing Methods |
|---|---|---|---|
| Numerical | Continuous or discrete values | Age, income, temperature | Standardization, normalization |
| Categorical | Unordered category labels | Gender, city, color | One-hot Encoding, Label Encoding |
| Ordinal | Categories with a natural ordering | Education level (high school < bachelor's < master's), rating (1–5 stars) | Ordinal Encoding |
| Temporal | Timestamps or time series | Order date, heartbeat signal | Extract year/month/week/hour, sliding window |
| Text | Natural language text | Reviews, news headlines | TF-IDF, Word2Vec, BERT Embedding |

Numerical standardization is one of the most common preprocessing operations. Common methods include:

  • Min-Max Normalization: Scales data to the \([0, 1]\) interval:
\[ x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \]
  • Z-score Standardization: Transforms data to have zero mean and unit standard deviation:
\[ x' = \frac{x - \mu}{\sigma} \]

Standardization is especially important for distance-based algorithms (e.g., KNN, SVM) and models optimized with gradient descent (e.g., neural networks), because inconsistent feature scales can cause certain features to dominate model learning.
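
As a minimal sketch of both operations, the snippet below applies Scikit-learn's `MinMaxScaler` and `StandardScaler` to a small feature matrix; the numbers (two features on very different scales) are made up purely for illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical feature matrix: two features on very different scales (age, income)
X = np.array([[25,  30_000],
              [32, 120_000],
              [47,  56_000],
              [51, 210_000]], dtype=float)

# Min-Max normalization: each column is mapped to [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

# Z-score standardization: each column gets zero mean and unit variance
X_zscore = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_zscore.mean(axis=0), X_zscore.std(axis=0))  # approximately 0 and 1 per column
```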


Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is the process of systematically exploring data before modeling. Its purpose is to understand data distributions, discover patterns, detect anomalies, and validate hypotheses, thereby informing subsequent feature engineering and model selection.

Statistical Summaries and Distributions

The first step is to understand the basic statistics of each feature:

| Statistic | Formula | Purpose |
|---|---|---|
| Mean | \(\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i\) | Measures the central tendency of the data |
| Median | The middle value after sorting (the \(\frac{N+1}{2}\)-th value for odd \(N\)) | A measure of central tendency that is robust to outliers |
| Standard Deviation (Std) | \(\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2}\) | Measures the dispersion of the data |
| Skewness | \(\frac{1}{N}\sum\left(\frac{x_i - \bar{x}}{\sigma}\right)^3\) | Measures the asymmetry of the distribution |
| Kurtosis | \(\frac{1}{N}\sum\left(\frac{x_i - \bar{x}}{\sigma}\right)^4 - 3\) | Measures the heaviness of the distribution tails |

If the absolute value of skewness exceeds 1, the distribution is heavily skewed and may require a log transformation or Box-Cox transformation to correct.
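
These statistics can be read off directly with Pandas. The sketch below uses a synthetic, right-skewed `income` column (log-normal, for illustration only) and shows how a log transform reduces the skewness:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed feature (log-normal "income"), for illustration only
rng = np.random.default_rng(0)
df = pd.DataFrame({"income": rng.lognormal(mean=10, sigma=1, size=1_000)})

print(df.describe())        # count, mean, std, quartiles
print(df["income"].skew())  # skewness well above 1 -> heavily right-skewed
print(df["income"].kurt())  # excess kurtosis (heavy tails)

# A log transform pulls the distribution much closer to symmetric
df["log_income"] = np.log1p(df["income"])
print(df["log_income"].skew())
```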

Correlation Analysis

The Pearson correlation coefficient measures the linear correlation between two variables:

\[ r_{xy} = \frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N}(x_i - \bar{x})^2 \cdot \sum_{i=1}^{N}(y_i - \bar{y})^2}} \]

Here \(r \in [-1, 1]\), and values of \(|r|\) closer to 1 indicate stronger linear correlation. It is important to note that the Pearson correlation coefficient captures only linear relationships. For nonlinear relationships, Spearman's rank correlation or Mutual Information can be used instead.

In feature selection, highly correlated features (e.g., \(|r| > 0.9\)) indicate information redundancy, and one of the correlated features is typically removed to avoid multicollinearity.
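
A common way to apply this rule with Pandas is sketched below: compute the correlation matrix, then flag one feature from each pair whose absolute correlation exceeds the threshold (the synthetic data and the 0.9 cutoff are illustrative):

```python
import numpy as np
import pandas as pd

# Synthetic data in which x2 is nearly a copy of x1
rng = np.random.default_rng(42)
df = pd.DataFrame({"x1": rng.normal(size=500)})
df["x2"] = 0.98 * df["x1"] + rng.normal(scale=0.05, size=500)
df["x3"] = rng.normal(size=500)

corr = df.corr()  # Pearson by default; pass method="spearman" for rank correlation
print(corr.round(2))

# Keep only the upper triangle, then drop one feature from each pair with |r| > 0.9
upper = corr.abs().where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)  # expected: ['x2']
```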

Visualization Methods

| Chart Type | Use Case | Information Revealed |
|---|---|---|
| Histogram | Univariate distribution | Distribution shape, degree of skewness |
| Box Plot | Univariate distribution + outliers | Quartiles, outliers (points beyond 1.5 × IQR) |
| Scatter Plot | Bivariate relationships | Correlation between variables, clustering tendencies |
| Heatmap | Multivariate correlation | Correlation matrix among features |
| Pair Plot | Multivariate relationships | Pairwise relationships among all features |
| Violin Plot | Grouped distribution comparison | Distribution differences of a feature across categories |

Missing Value Analysis

The strategy for handling missing values depends on the missing data mechanism:

  • MCAR (Missing Completely At Random): Missingness is unrelated to any variable. Rows can be directly deleted or values imputed.
  • MAR (Missing At Random): Missingness is related to other observed variables. Conditional imputation based on related variables is needed.
  • MNAR (Missing Not At Random): Missingness is related to the missing value itself. This is the hardest case to handle and requires domain knowledge.

Common handling methods include: deleting rows/columns with missing values, imputation with mean/median/mode, model-based imputation (e.g., KNN Imputation), and treating missingness as an independent feature (Missing Indicator).
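
As a minimal sketch of these options, the snippet below uses Scikit-learn's `SimpleImputer` and `KNNImputer` plus a hand-made missing indicator on a tiny hypothetical table:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Tiny hypothetical table with missing entries
df = pd.DataFrame({
    "age":    [25, np.nan, 47, 51, 36],
    "income": [30_000, 120_000, np.nan, 210_000, 56_000],
})
print(df.isna().sum())  # missing count per column

# Median imputation (robust to outliers)
median_filled = pd.DataFrame(
    SimpleImputer(strategy="median").fit_transform(df), columns=df.columns)

# KNN imputation: fill a gap using the most similar rows
knn_filled = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns)

# Missing indicator: keep "was missing" as a feature of its own
df["age_missing"] = df["age"].isna().astype(int)
```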

Outlier Detection

Common methods for outlier detection:

  • IQR Method: A value is considered an outlier if \(x < Q_1 - 1.5 \times \text{IQR}\) or \(x > Q_3 + 1.5 \times \text{IQR}\), where \(\text{IQR} = Q_3 - Q_1\).
  • Z-score Method: A value is considered an outlier if \(|z| > 3\) (i.e., more than 3 standard deviations from the mean).
  • Isolation Forest: Builds an ensemble of randomly constructed trees; anomalous points are easier to "isolate", requiring fewer splits on average.

Outliers do not always need to be removed — in scenarios such as fraud detection, outliers are precisely the targets we aim to identify.
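
A minimal sketch of the IQR and Z-score rules on a synthetic series (the two injected extreme values are for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic data with two injected extreme values
rng = np.random.default_rng(7)
s = pd.Series(np.concatenate([rng.normal(50, 5, 200), [120, -30]]))

# IQR rule: outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(s[(s < lower) | (s > upper)])

# Z-score rule: more than 3 standard deviations from the mean
z = (s - s.mean()) / s.std()
print(s[z.abs() > 3])
```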


Feature Engineering

Feature engineering is the process of transforming raw data into features that a model can learn from efficiently. Good feature engineering can significantly boost model performance, sometimes even more so than switching to a more complex model architecture. As Andrew Ng once said: "Applied machine learning is basically feature engineering."

Feature Selection

The goal of feature selection is to identify the most valuable subset from all available features, removing irrelevant and redundant ones.

| Method Category | Principle | Representative Methods | Pros and Cons |
|---|---|---|---|
| Filter Methods | Model-independent; ranks features by statistical metrics | Variance threshold, mutual information, chi-squared test, Pearson correlation | Fast, but ignores feature interactions |
| Wrapper Methods | Uses model performance as the evaluation criterion | Forward selection, backward elimination, Recursive Feature Elimination (RFE) | Good performance, but computationally expensive |
| Embedded Methods | Selection happens automatically during model training | L1 regularization (Lasso), tree-based feature importance | Balances efficiency and performance |

L1 Regularization (Lasso) produces sparse weights, automatically shrinking the coefficients of unimportant features to zero:

\[ J(\theta) = \text{Loss}(\theta) + \lambda \sum_{j=1}^{d} |\theta_j| \]
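
A small sketch of this effect with Scikit-learn's `Lasso` on synthetic data; the feature counts, the `alpha` value, and the dataset are illustrative, and the exact number of surviving features depends on `alpha`:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# 20 features, only 5 of which actually drive the target
X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)   # L1 penalties assume comparable feature scales

lasso = Lasso(alpha=1.0).fit(X, y)
print("non-zero coefficients:", np.sum(lasso.coef_ != 0))
print("selected feature indices:", np.flatnonzero(lasso.coef_))
```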

Feature Extraction

Feature extraction uses mathematical transformations to map original high-dimensional features into a lower-dimensional space while preserving as much important information as possible.

PCA (Principal Component Analysis) is the most classic linear dimensionality reduction method. Its core idea is to find the directions of maximum variance in the data (i.e., the principal components) and project the data onto these directions:

  1. Center the data matrix: \(X' = X - \bar{X}\)
  2. Compute the covariance matrix: \(C = \frac{1}{N} X'^T X'\)
  3. Perform eigenvalue decomposition on the covariance matrix and select the eigenvectors corresponding to the \(k\) largest eigenvalues
  4. Project the data onto the subspace spanned by these \(k\) eigenvectors
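
In practice these steps are rarely implemented by hand; a minimal sketch with Scikit-learn's `PCA` on the Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data                    # 150 samples x 4 features
X = StandardScaler().fit_transform(X)   # centering (step 1) plus scaling

pca = PCA(n_components=2)               # keep the top-2 principal components
X_2d = pca.fit_transform(X)             # eigen-decomposition + projection (steps 2-4)

print(X_2d.shape)                       # (150, 2)
print(pca.explained_variance_ratio_)    # share of variance kept by each component
```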

t-SNE (t-Distributed Stochastic Neighbor Embedding) is a nonlinear dimensionality reduction method that excels at visualizing high-dimensional data in two or three dimensions. It constructs probability distributions in both the high-dimensional and low-dimensional spaces, then minimizes the KL divergence between the two distributions. t-SNE is commonly used to visualize word embeddings, image feature spaces, and clustering results.

UMAP (Uniform Manifold Approximation and Projection) is a more recent nonlinear dimensionality reduction method that is often used in place of t-SNE: it preserves global structure better while maintaining local structure, and it is computationally faster. For large-scale datasets, UMAP is generally a better choice than t-SNE.

Autoencoder is a neural network-based nonlinear dimensionality reduction method. By training an encoder-decoder architecture, the encoder compresses the input into a low-dimensional representation (the bottleneck), and the decoder reconstructs the input from this representation. The output of the bottleneck layer serves as the extracted features.
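
A minimal PyTorch sketch of this idea is shown below; the layer sizes, the bottleneck dimension, and the random placeholder data are all illustrative assumptions:

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Compress 20-dimensional inputs into a 3-dimensional bottleneck and reconstruct them."""
    def __init__(self, in_dim=20, bottleneck=3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 16), nn.ReLU(),
                                     nn.Linear(16, bottleneck))
        self.decoder = nn.Sequential(nn.Linear(bottleneck, 16), nn.ReLU(),
                                     nn.Linear(16, in_dim))

    def forward(self, x):
        z = self.encoder(x)               # low-dimensional representation
        return self.decoder(z)            # reconstruction of the input

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

X = torch.randn(256, 20)                  # placeholder data for illustration
for _ in range(100):
    loss = loss_fn(model(X), X)           # reconstruction error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

features = model.encoder(X).detach()      # bottleneck output = extracted features
```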

Feature Transformation

| Transformation | Use Case | Formula / Description |
|---|---|---|
| Log Transform | Right-skewed distributions | \(x' = \log(x + 1)\) |
| Polynomial Transform | Capturing nonlinear relationships | \((x_1, x_2) \to (x_1, x_2, x_1^2, x_1 x_2, x_2^2)\) |
| Binning | Discretizing continuous variables | Group age into: young / middle-aged / elderly |
| Box-Cox Transform | Making data more normally distributed | \(x' = \frac{x^\lambda - 1}{\lambda}, \ \lambda \neq 0\) |
| Target Encoding | Encoding categorical features as statistics of the target variable | Encode a city as the average house price in that city |
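
The sketch below applies three of these transformations (log, binning, polynomial) with Pandas and Scikit-learn; the column values and bin edges are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame({"age":    [18, 25, 34, 52, 67, 80],
                   "income": [12_000, 30_000, 56_000, 90_000, 40_000, 25_000]})

# Log transform for a right-skewed feature
df["log_income"] = np.log1p(df["income"])

# Binning: discretize age into ordered groups
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 60, 120],
                         labels=["young", "middle-aged", "elderly"])

# Polynomial features: (x1, x2) -> (x1, x2, x1^2, x1*x2, x2^2)
poly = PolynomialFeatures(degree=2, include_bias=False)
expanded = poly.fit_transform(df[["age", "income"]])
print(poly.get_feature_names_out())
```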

Feature Importance and Interpretability

After model training, understanding which features contribute most to predictions is crucial:

  • Tree-based feature importance: Based on the sum of split gains across all trees for a given feature
  • Permutation Importance: Randomly shuffle the values of a feature and observe the resulting drop in model performance
  • SHAP (SHapley Additive exPlanations): Based on Shapley values from game theory, SHAP assigns a contribution value to each feature for every individual sample. It is currently the most popular tool for model interpretability
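
A short sketch comparing the first two approaches with Scikit-learn; the breast-cancer dataset and the random forest are only a convenient example:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Tree-based importance: accumulated split gain per feature
print(model.feature_importances_[:5])

# Permutation importance: performance drop after shuffling each feature on held-out data
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean[:5])
```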

Class Imbalance

In real-world scenarios (such as fraud detection, disease diagnosis, and anomaly detection), the ratio of positive to negative samples is often severely imbalanced (e.g., fraudulent transactions may account for only 0.1%). In such cases, models tend to predict all samples as the majority class, resulting in extremely poor recognition of the minority class.

Handling Methods

(1) Data-Level Approaches

| Method | Principle | Pros and Cons |
|---|---|---|
| Random Oversampling | Randomly duplicate minority class samples | Simple, but prone to overfitting |
| SMOTE | Generate new samples by interpolating between minority class samples | Mitigates overfitting, but may introduce noise |
| Random Undersampling | Randomly remove majority class samples | Simple, but loses information |
| Tomek Links | Remove majority class samples on the decision boundary | Cleans the decision boundary; often combined with other methods |

The specific steps of SMOTE (Synthetic Minority Over-sampling Technique):

  1. For each minority class sample \(x_i\), find its \(k\) nearest neighbors (of the same class)
  2. Randomly select one neighbor \(x_{nn}\)
  3. Generate a new sample by random interpolation between \(x_i\) and \(x_{nn}\): \(x_{new} = x_i + \lambda \cdot (x_{nn} - x_i)\), where \(\lambda \in [0, 1]\)
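
Assuming the third-party imbalanced-learn package is available, a minimal sketch of SMOTE on a synthetic dataset with roughly 1% positive samples looks like this:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic dataset with roughly 1% positive samples
X, y = make_classification(n_samples=10_000, n_features=10, weights=[0.99],
                           random_state=0)
print(Counter(y))          # heavily imbalanced

X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y_res))      # classes balanced by interpolation-based oversampling
```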

(2) Algorithm-Level Approaches

  • Cost-sensitive Learning: Assigns different misclassification costs to different classes. In the loss function, the minority class receives a higher weight:
\[ L = -\sum_{i=1}^{N} w_{y_i} \cdot [y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)] \]

Here, \(w\) for the minority class is much larger than for the majority class. Most frameworks (e.g., Scikit-learn's class_weight='balanced') support automatic weight computation.

  • Focal Loss: Proposed by Facebook AI Research in the RetinaNet paper, Focal Loss reduces the loss weight of easily classified samples, allowing the model to focus on hard-to-classify samples:
\[ FL(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t) \]

Here \(p_t\) is the predicted probability of the true class; the larger \(\gamma\) is, the more strongly easy samples are down-weighted.
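
A minimal NumPy sketch of the binary focal loss above (for the plain cost-sensitive case, Scikit-learn's class_weight='balanced' mentioned earlier is usually sufficient); the alpha and gamma defaults follow common practice but are otherwise arbitrary:

```python
import numpy as np

def focal_loss(y_true, y_prob, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss, averaged over samples."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    p_t = np.where(y_true == 1, y_prob, 1 - y_prob)     # probability of the true class
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)   # per-class weighting
    return np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t))

y_true = np.array([1, 0, 0, 1])
y_prob = np.array([0.9, 0.1, 0.4, 0.3])   # easy, easy, harder, hard examples
print(focal_loss(y_true, y_prob))          # the hard examples dominate the loss
```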

Evaluation Metrics for Imbalanced Data

In imbalanced settings, accuracy is a highly misleading metric. For example, in a dataset with 1000 samples of which only 10 are positive, a model that predicts all samples as negative still achieves 99% accuracy.

The following metrics should be used instead:

| Metric | Use Case | Description |
|---|---|---|
| Precision | When false positives are costly | The proportion of truly positive samples among those predicted as positive |
| Recall | When false negatives are costly | The proportion of positive samples that are correctly identified |
| F1-Score | Balanced trade-off | The harmonic mean of Precision and Recall |
| AUC-ROC | Threshold-independent evaluation | Area under the TPR vs. FPR curve across different thresholds |
| AUC-PR (PR Curve) | Severely imbalanced scenarios | Area under the Precision vs. Recall curve; more sensitive than AUC-ROC in these settings |
| MCC (Matthews Correlation Coefficient) | Overall quality assessment | A correlation coefficient that accounts for all four confusion matrix outcomes |
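
A minimal sketch computing these metrics with Scikit-learn on a synthetic dataset with roughly 1% positive samples; the model choice and the data are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, weights=[0.99], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
y_prob = clf.predict_proba(X_te)[:, 1]

print("precision:", precision_score(y_te, y_pred))
print("recall:   ", recall_score(y_te, y_pred))
print("F1:       ", f1_score(y_te, y_pred))
print("AUC-ROC:  ", roc_auc_score(y_te, y_prob))
print("AUC-PR:   ", average_precision_score(y_te, y_prob))  # area under the PR curve
print("MCC:      ", matthews_corrcoef(y_te, y_pred))
```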

Data Pipeline

In production environments, data must flow through a series of automated processing stages from raw data sources to the final model or analytics system. This automated workflow is known as a data pipeline.

ETL vs. ELT

| Characteristic | ETL | ELT |
|---|---|---|
| Full Name | Extract-Transform-Load | Extract-Load-Transform |
| Transformation Timing | Transformed in a staging layer before loading | Transformed in the target system after loading |
| Use Case | Traditional data warehouses | Cloud data lakes, big data platforms |
| Compute Resources | Relies on the ETL server | Leverages the target system's compute power |
| Representative Tools | Informatica, Talend | dbt, Snowflake, BigQuery |

Batch vs. Streaming

| Characteristic | Batch Processing | Stream Processing |
|---|---|---|
| Data Processing Mode | Scheduled bulk processing | Real-time per-record or micro-batch processing |
| Latency | Minutes to hours | Milliseconds to seconds |
| Use Case | Report generation, model training | Real-time recommendations, fraud detection |
| Representative Tools | Spark Batch, Hadoop MapReduce | Kafka Streams, Flink, Spark Streaming |

In AI/ML scenarios, model training typically uses batch processing (requiring large volumes of historical data), while model inference may require stream processing (e.g., a real-time recommendation system needs to return results within milliseconds of a user click).

Data Quality Monitoring

Data quality is the most easily overlooked yet most impactful aspect of ML systems. Common data quality issues include:

  • Data Drift: The distribution of input data changes over time. For example, user behavior may shift dramatically during a pandemic, causing a recommendation model to fail.
  • Concept Drift: The relationship between inputs and outputs changes. For example, a keyword's connotation may shift from positive to negative.
  • Schema Changes: An upstream system modifies the data format or field semantics.
  • Data Latency / Missing Data: A data source fails to produce data during a certain time period.

Monitoring approaches include: statistical comparison (rate of change in mean and variance), distribution tests (KS test, PSI), and data quality rule engines (Great Expectations, Deequ).
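
As one common formulation of PSI (bin by the reference quantiles, then compare bin proportions between the training-time and production distributions), a small NumPy sketch might look like the following; the 0.25 threshold is a widely used rule of thumb rather than a universal standard:

```python
import numpy as np

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a reference sample and a new sample."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    p_exp = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    # Clip new data into the reference range so every value falls into some bin
    p_act = np.histogram(np.clip(actual, edges[0], edges[-1]), bins=edges)[0] / len(actual) + eps
    return np.sum((p_act - p_exp) * np.log(p_act / p_exp))

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, 10_000)   # distribution seen at training time
live_feature  = rng.normal(0.5, 1.2, 10_000)   # drifted distribution in production

print(psi(train_feature, live_feature))        # rule of thumb: > 0.25 suggests significant drift
```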

A robust ML system must monitor both model performance metrics and input data quality metrics, and trigger model retraining when significant drift is detected.

