Data Science
Data science is an interdisciplinary field that extracts knowledge and insights from data through statistics, computer science, and domain expertise. In the AI/ML domain, data science serves as the cornerstone for building high-quality models — the quality and handling of data often matter more than the model itself in determining final outcomes.
The Relationship Between Data Science and AI
Data Is the Fuel of AI
The success of modern AI depends heavily on data. Whether in traditional machine learning or deep learning, model performance is ultimately bounded by data quality. There is a classic saying in the industry: "Garbage in, garbage out" — if the input data is garbage, the model's output will be worthless.
The role of data science within the AI/ML pipeline is as follows:
Raw Data → Data Collection → Data Cleaning → EDA → Feature Engineering → Model Training → Model Evaluation → Deployment
\______________________________Data Science______________________________/
From Raw Data to Models: The Data Science Workflow
A complete data science workflow typically consists of the following steps:
| Phase | Core Tasks | Common Tools |
|---|---|---|
| Problem Definition | Clarify business objectives and formulate them as ML problems | — |
| Data Collection | Web scraping, APIs, database queries, sensors | Scrapy, SQL, Spark |
| Data Cleaning | Handling missing values, duplicates, and outliers | Pandas, NumPy |
| Exploratory Data Analysis | Statistical summaries, visualization, hypothesis testing | Matplotlib, Seaborn |
| Feature Engineering | Feature selection, extraction, and transformation | Scikit-learn, Featuretools |
| Model Building | Algorithm selection, model training | Scikit-learn, XGBoost, PyTorch |
| Model Evaluation | Cross-validation, metric analysis | Scikit-learn |
| Deployment & Monitoring | Model serving, data drift monitoring | MLflow, Docker, Kubernetes |
In real-world industrial settings, the data preparation phase (collection, cleaning, EDA, and feature engineering) typically consumes 60%–80% of the total project time. This underscores the critical importance of data science skills for the success of AI projects.
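As an illustration, the phases above can be strung together in a few lines of Python. The tiny churn table below is synthetic and its column names are invented for illustration; a real project would pull this from the data collection phase:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Data collection: a tiny synthetic table standing in for a real source
df = pd.DataFrame({
    "age": [25, 32, None, 45, 51, 29, 38, 60],
    "income": [30_000, 52_000, 41_000, 78_000, None, 35_000, 61_000, 90_000],
    "churned": [0, 0, 1, 0, 1, 0, 0, 1],
})

# Data cleaning: impute missing values with each column's median
df = df.fillna(df.median(numeric_only=True))

# Model building and evaluation on a held-out split
X, y = df[["age", "income"]], df["churned"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(model.predict(X_test))
```

Even this toy version shows the typical shape of the work: most of the lines deal with data, not with the model.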
Data Types and Structures
Understanding data types is a prerequisite for all data work. Different types of data require different processing strategies and model architectures.
Classification by Degree of Structure
| Type | Definition | Examples | Common Storage |
|---|---|---|---|
| Structured Data | Tabular data with a fixed schema | User tables, transaction records, sensor readings | SQL databases, CSV |
| Semi-structured Data | Hierarchically organized but not strictly tabular | JSON, XML, log files | NoSQL (MongoDB), Elasticsearch |
| Unstructured Data | No predefined structure | Text, images, audio, video | Object storage (S3), file systems |
In the AI domain, structured data is typically handled by traditional ML models (e.g., XGBoost, Random Forest), while unstructured data relies more on deep learning models (e.g., CNNs for images, Transformers for text).
Feature Types
In machine learning, each column of data is called a feature. Features can be classified by their mathematical properties:
| Feature Type | Description | Examples | Common Processing Methods |
|---|---|---|---|
| Numerical | Continuous or discrete values | Age, income, temperature | Standardization, normalization |
| Categorical | Unordered category labels | Gender, city, color | One-hot Encoding, Label Encoding |
| Ordinal | Categories with a natural ordering | Education level (high school < bachelor's < master's), rating (1–5 stars) | Ordinal Encoding |
| Temporal | Timestamps or time series | Order date, heartbeat signal | Extract year/month/week/hour, sliding window |
| Text | Natural language text | Reviews, news headlines | TF-IDF, Word2Vec, BERT Embedding |
Numerical standardization is one of the most common preprocessing operations. Common methods include:
- Min-Max Normalization: Scales data to the \([0, 1]\) interval: \(x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}\)
- Z-score Standardization: Transforms data to have zero mean and unit standard deviation: \(x' = \frac{x - \mu}{\sigma}\), where \(\mu\) is the mean and \(\sigma\) the standard deviation
Standardization is especially important for distance-based algorithms (e.g., KNN, SVM) and models optimized with gradient descent (e.g., neural networks), because inconsistent feature scales can cause certain features to dominate model learning.
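Both methods are available in scikit-learn; a minimal comparison on a three-value feature:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [5.0], [9.0]])

x_minmax = MinMaxScaler().fit_transform(x)    # scaled to [0, 1]
x_zscore = StandardScaler().fit_transform(x)  # zero mean, unit std

print(x_minmax.ravel())  # [0.  0.5 1. ]
print(x_zscore.ravel())  # symmetric around 0
```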
Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is the process of systematically exploring data before modeling. Its purpose is to understand data distributions, discover patterns, detect anomalies, and validate hypotheses, thereby informing subsequent feature engineering and model selection.
Statistical Summaries and Distributions
The first step is to understand the basic statistics of each feature:
| Statistic | Formula | Purpose |
|---|---|---|
| Mean | \(\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i\) | Measures the central tendency of the data |
| Median | The \(\frac{N+1}{2}\)-th value after sorting (for odd \(N\); the mean of the two middle values for even \(N\)) | A robust measure of central tendency against outliers |
| Standard Deviation (Std) | \(\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2}\) | Measures the dispersion of the data |
| Skewness | \(\frac{1}{N}\sum\left(\frac{x_i - \bar{x}}{\sigma}\right)^3\) | Measures the symmetry of the distribution |
| Kurtosis | \(\frac{1}{N}\sum\left(\frac{x_i - \bar{x}}{\sigma}\right)^4 - 3\) | Measures the thickness of the distribution tails |
If the absolute value of skewness exceeds 1, the distribution is heavily skewed and may require a log transformation or Box-Cox transformation to correct.
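A quick check of this rule with SciPy, using a synthetic right-skewed (log-normal) feature:

```python
import numpy as np
from scipy.stats import skew

# Log-normal data is strongly right-skewed by construction
x = np.random.default_rng(0).lognormal(mean=0.0, sigma=1.0, size=10_000)

print(skew(x))            # well above 1: heavily skewed
print(skew(np.log1p(x)))  # log(x + 1) brings skewness down substantially
```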
Correlation Analysis
The Pearson correlation coefficient measures the linear correlation between two variables:
\(r = \frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{N}(y_i - \bar{y})^2}}\)
Here \(r \in [-1, 1]\), and values of \(|r|\) closer to 1 indicate stronger linear correlation. It is important to note that the Pearson correlation coefficient captures only linear relationships. For nonlinear relationships, Spearman's rank correlation or Mutual Information can be used instead.
In feature selection, highly correlated features (e.g., \(|r| > 0.9\)) indicate information redundancy, and one of the correlated features is typically removed to avoid multicollinearity.
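A sketch of this redundancy check with pandas, on synthetic data where one feature nearly duplicates another:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=200),  # near-duplicate of "a"
    "b": rng.normal(size=200),                       # independent feature
})

corr = df.corr(method="pearson")

# Keep only the upper triangle so each pair is inspected once,
# then flag features highly correlated with an earlier feature
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [c for c in upper.columns if (upper[c].abs() > 0.9).any()]
print(redundant)  # ['a_copy']
```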
Visualization Methods
| Chart Type | Use Case | Information Revealed |
|---|---|---|
| Histogram | Univariate distribution | Distribution shape, degree of skewness |
| Box Plot | Univariate distribution + outliers | Quartiles, outliers (points beyond 1.5 times the IQR) |
| Scatter Plot | Bivariate relationships | Correlation between variables, clustering tendencies |
| Heatmap | Multivariate correlation | Correlation matrix among features |
| Pair Plot | Multivariate relationships | Pairwise relationships among all features |
| Violin Plot | Grouped distribution comparison | Distribution differences of a feature across categories |
Missing Value Analysis
The strategy for handling missing values depends on the missing data mechanism:
- MCAR (Missing Completely At Random): Missingness is unrelated to any variable. Rows can be directly deleted or values imputed.
- MAR (Missing At Random): Missingness is related to other observed variables. Conditional imputation based on related variables is needed.
- MNAR (Missing Not At Random): Missingness is related to the missing value itself. This is the hardest case to handle and requires domain knowledge.
Common handling methods include: deleting rows/columns with missing values, imputation with mean/median/mode, model-based imputation (e.g., KNN Imputation), and treating missingness as an independent feature (Missing Indicator).
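Two of these strategies, median imputation and a missing indicator, can be combined in one step with scikit-learn's SimpleImputer:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [np.nan], [3.0], [np.nan], [5.0]])

# add_indicator=True appends a binary "was missing" column per affected feature
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_out = imputer.fit_transform(X)

# Column 0: imputed values (median of observed values is 3.0)
# Column 1: 1.0 where the original value was missing
print(X_out)
```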
Outlier Detection
Common methods for outlier detection:
- IQR Method: A value is considered an outlier if \(x < Q_1 - 1.5 \times \text{IQR}\) or \(x > Q_3 + 1.5 \times \text{IQR}\), where \(\text{IQR} = Q_3 - Q_1\).
- Z-score Method: A value is considered an outlier if \(|z| > 3\) (i.e., more than 3 standard deviations from the mean).
- Isolation Forest: Builds an ensemble of randomly constructed trees; anomalous points are more easily "isolated", i.e., separated from the rest of the data in fewer random splits.
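The IQR rule above can be written out directly with NumPy; the sample array here is invented for illustration:

```python
import numpy as np

x = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])  # 102 is an outlier

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = x[(x < lower) | (x > upper)]
print(outliers)  # [102]
```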
Outliers do not always need to be removed — in scenarios such as fraud detection, outliers are precisely the targets we aim to identify.
Feature Engineering
Feature engineering is the process of transforming raw data into features that a model can learn from efficiently. Good feature engineering can significantly boost model performance, sometimes even more so than switching to a more complex model architecture. As Andrew Ng once said: "Applied machine learning is basically feature engineering."
Feature Selection
The goal of feature selection is to identify the most valuable subset from all available features, removing irrelevant and redundant ones.
| Method Category | Principle | Representative Methods | Pros and Cons |
|---|---|---|---|
| Filter Methods | Model-independent, ranking based on statistical metrics | Variance threshold, mutual information, chi-squared test, Pearson correlation | Fast, but ignores feature interactions |
| Wrapper Methods | Uses model performance as the evaluation criterion | Forward selection, backward elimination, Recursive Feature Elimination (RFE) | Good performance, but computationally expensive |
| Embedded Methods | Automatic selection during model training | L1 regularization (Lasso), tree-based feature importance | Balances efficiency and performance |
L1 Regularization (Lasso) produces sparse weights, automatically shrinking the coefficients of unimportant features to zero:
\(\min_{w} \frac{1}{N}\sum_{i=1}^{N}(y_i - w^T x_i)^2 + \alpha \|w\|_1\)
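A small demonstration of this sparsity effect with scikit-learn, on synthetic data where only the first two features carry signal:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only features 0 and 1 influence the target; features 2-4 are pure noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_.round(2))  # nonzero only for the first two features
```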
Feature Extraction
Feature extraction uses mathematical transformations to map original high-dimensional features into a lower-dimensional space while preserving as much important information as possible.
PCA (Principal Component Analysis) is the most classic linear dimensionality reduction method. Its core idea is to find the directions of maximum variance in the data (i.e., the principal components) and project the data onto these directions:
- Center the data matrix: \(X' = X - \bar{X}\)
- Compute the covariance matrix: \(C = \frac{1}{N} X'^T X'\)
- Perform eigenvalue decomposition on the covariance matrix and select the eigenvectors corresponding to the \(k\) largest eigenvalues
- Project the data onto the subspace spanned by these \(k\) eigenvectors
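The four steps above can be sketched directly in NumPy (in practice, scikit-learn's PCA wraps the same computation); the synthetic correlated data is illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
# Mix independent noise through an upper-triangular matrix to correlate the columns
X = rng.normal(size=(100, 3)) @ np.array([[2.0, 0.5, 0.1],
                                          [0.0, 1.0, 0.2],
                                          [0.0, 0.0, 0.3]])
k = 1

X_centered = X - X.mean(axis=0)               # 1. center the data matrix
C = (X_centered.T @ X_centered) / len(X)      # 2. covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)          # 3. eigendecomposition (ascending eigenvalues)
top_k = eigvecs[:, -k:]                       #    eigenvectors of the k largest eigenvalues
Z = X_centered @ top_k                        # 4. project onto the k-dimensional subspace

print(Z.shape)  # (100, 1)
```

The variance of the projected data equals the largest eigenvalue, which is exactly the "direction of maximum variance" idea.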
t-SNE (t-Distributed Stochastic Neighbor Embedding) is a nonlinear dimensionality reduction method that excels at visualizing high-dimensional data in two or three dimensions. It constructs probability distributions in both the high-dimensional and low-dimensional spaces, then minimizes the KL divergence between the two distributions. t-SNE is commonly used to visualize word embeddings, image feature spaces, and clustering results.
UMAP (Uniform Manifold Approximation and Projection) is a nonlinear dimensionality reduction method often used in place of t-SNE: it better preserves global structure while maintaining local structure, and is computationally faster. For large-scale datasets, UMAP is generally a better choice than t-SNE.
An autoencoder is a neural-network-based nonlinear dimensionality reduction method. By training an encoder-decoder architecture, the encoder compresses the input into a low-dimensional representation (the bottleneck), and the decoder reconstructs the input from this representation. The output of the bottleneck layer serves as the extracted features.
Feature Transformation
| Transformation | Use Case | Formula / Description |
|---|---|---|
| Log Transform | Right-skewed distributions | \(x' = \log(x + 1)\) |
| Polynomial Transform | Capturing nonlinear relationships | \((x_1, x_2) \to (x_1, x_2, x_1^2, x_1 x_2, x_2^2)\) |
| Binning | Discretizing continuous variables | Group age into: young / middle-aged / elderly |
| Box-Cox Transform | Making data more normally distributed | \(x' = \frac{x^\lambda - 1}{\lambda}, \lambda \neq 0\) |
| Target Encoding | Encoding categorical features as statistics of the target variable | Encode a city as the average house price in that city |
Feature Importance and Interpretability
After model training, understanding which features contribute most to predictions is crucial:
- Tree-based feature importance: Based on the sum of split gains across all trees for a given feature
- Permutation Importance: Randomly shuffle the values of a feature and observe the resulting drop in model performance
- SHAP (SHapley Additive exPlanations): Based on Shapley values from game theory, SHAP assigns a contribution value to each feature for every individual sample. It is currently the most popular tool for model interpretability
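Permutation importance is implemented in scikit-learn; a sketch on synthetic data where only the first feature matters:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 5.0 * X[:, 0] + rng.normal(scale=0.1, size=300)  # only feature 0 drives y

model = RandomForestRegressor(random_state=0).fit(X, y)

# Shuffle each feature n_repeats times and measure the drop in score
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
print(result.importances_mean.round(2))  # feature 0 dominates
```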
Class Imbalance
In real-world scenarios (such as fraud detection, disease diagnosis, and anomaly detection), the ratio of positive to negative samples is often severely imbalanced (e.g., fraudulent transactions may account for only 0.1%). In such cases, models tend to predict all samples as the majority class, resulting in extremely poor recognition of the minority class.
Handling Methods
(1) Data-Level Approaches
| Method | Principle | Pros and Cons |
|---|---|---|
| Random Oversampling | Randomly duplicate minority class samples | Simple, but prone to overfitting |
| SMOTE | Generate new samples by interpolating between minority class samples | Mitigates overfitting, but may introduce noise |
| Random Undersampling | Randomly remove majority class samples | Simple, but loses information |
| Tomek Links | Remove majority class samples on the decision boundary | Cleans the decision boundary; often combined with other methods |
The specific steps of SMOTE (Synthetic Minority Over-sampling Technique) are:
- For each minority class sample \(x_i\), find its \(k\) nearest neighbors (of the same class)
- Randomly select one neighbor \(x_{nn}\)
- Generate a new sample by random interpolation between \(x_i\) and \(x_{nn}\): \(x_{new} = x_i + \lambda \cdot (x_{nn} - x_i)\), where \(\lambda \in [0, 1]\)
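A simplified sketch of this interpolation step in NumPy; the function name smote_sample is invented here, and production code would typically use imblearn.over_sampling.SMOTE instead:

```python
import numpy as np

def smote_sample(minority: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Generate one synthetic minority sample by interpolation (simplified SMOTE)."""
    i = rng.integers(len(minority))
    x_i = minority[i]
    # k nearest minority-class neighbors of x_i (index 0 of argsort is x_i itself)
    dists = np.linalg.norm(minority - x_i, axis=1)
    neighbors = np.argsort(dists)[1:k + 1]
    x_nn = minority[rng.choice(neighbors)]
    lam = rng.uniform(0.0, 1.0)
    return x_i + lam * (x_nn - x_i)  # random point on the segment x_i -> x_nn

rng = np.random.default_rng(0)
minority = rng.normal(size=(20, 2))
x_new = smote_sample(minority, k=5, rng=rng)
print(x_new)  # lies between two existing minority samples
```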
(2) Algorithm-Level Approaches
- Cost-sensitive Learning: Assigns different misclassification costs to different classes. In the loss function, the minority class receives a higher weight: \(L = -\frac{1}{N}\sum_{i=1}^{N} w_{y_i} \log \hat{p}_{i,y_i}\), where \(\hat{p}_{i,y_i}\) is the predicted probability of sample \(i\)'s true class \(y_i\).
Here, \(w\) for the minority class is much larger than for the majority class. Most frameworks (e.g., Scikit-learn's class_weight='balanced') support automatic weight computation.
- Focal Loss: Proposed by Facebook AI Research in the RetinaNet paper, Focal Loss reduces the loss weight for easily classified samples, allowing the model to focus on hard-to-classify samples: \(FL(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)\), where \(p_t\) is the predicted probability of the true class, \(\gamma > 0\) down-weights easy samples, and \(\alpha_t\) balances the classes.
Evaluation Metrics for Imbalanced Data
In imbalanced settings, accuracy is a highly misleading metric. For example, in a dataset with 1000 samples of which only 10 are positive, a model that predicts all samples as negative still achieves 99% accuracy.
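This failure mode is easy to verify with scikit-learn's metric functions:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, f1_score

y_true = np.array([1] * 10 + [0] * 990)  # 10 positives out of 1000
y_pred = np.zeros(1000, dtype=int)       # "model" that predicts all-negative

print(accuracy_score(y_true, y_pred))                 # 0.99 -- looks great
print(recall_score(y_true, y_pred, zero_division=0))  # 0.0  -- catches nothing
print(f1_score(y_true, y_pred, zero_division=0))      # 0.0
```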
The following metrics should be used instead:
| Metric | Use Case | Description |
|---|---|---|
| Precision | When false positives are costly | The proportion of truly positive samples among those predicted as positive |
| Recall | When false negatives are costly | The proportion of positive samples that are correctly identified |
| F1-Score | Balanced trade-off | The harmonic mean of Precision and Recall |
| AUC-ROC | Threshold-independent evaluation | Area under the TPR vs. FPR curve across different thresholds |
| AUC-PR (PR Curve) | Severely imbalanced scenarios | Area under the Precision vs. Recall curve; more sensitive than AUC-ROC |
| MCC (Matthews Correlation Coefficient) | Overall quality assessment | A correlation coefficient that accounts for all four confusion matrix outcomes |
Data Pipeline
In production environments, data must flow through a series of automated processing stages from raw data sources to the final model or analytics system. This automated workflow is known as a data pipeline.
ETL vs. ELT
| Characteristic | ETL | ELT |
|---|---|---|
| Full Name | Extract-Transform-Load | Extract-Load-Transform |
| Transformation Timing | Transformed in a staging layer before loading | Transformed in the target system after loading |
| Use Case | Traditional data warehouses | Cloud data lakes, big data platforms |
| Compute Resources | Relies on the ETL server | Leverages the target system's compute power |
| Representative Tools | Informatica, Talend | dbt, Snowflake, BigQuery |
Batch vs. Streaming
| Characteristic | Batch Processing | Stream Processing |
|---|---|---|
| Data Processing Mode | Scheduled bulk processing | Real-time per-record or micro-batch processing |
| Latency | Minutes to hours | Milliseconds to seconds |
| Use Case | Report generation, model training | Real-time recommendations, fraud detection |
| Representative Tools | Spark Batch, Hadoop MapReduce | Kafka Streams, Flink, Spark Streaming |
In AI/ML scenarios, model training typically uses batch processing (requiring large volumes of historical data), while model inference may require stream processing (e.g., a real-time recommendation system needs to return results within milliseconds of a user click).
Data Quality Monitoring
Data quality is the most easily overlooked yet most impactful aspect of ML systems. Common data quality issues include:
- Data Drift: The distribution of input data changes over time. For example, user behavior may shift dramatically during a pandemic, causing a recommendation model to fail.
- Concept Drift: The relationship between inputs and outputs changes. For example, a keyword's connotation may shift from positive to negative.
- Schema Changes: An upstream system modifies the data format or field semantics.
- Data Latency / Missing Data: A data source fails to produce data during a certain time period.
Monitoring approaches include: statistical comparison (rate of change in mean and variance), distribution tests (KS test, PSI), and data quality rule engines (Great Expectations, Deequ).
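As one concrete monitoring sketch, a two-sample Kolmogorov-Smirnov test from SciPy can compare a training-time feature distribution against live data; the 0.01 significance threshold here is an arbitrary illustrative choice:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)  # training distribution
live_feature = rng.normal(loc=0.5, scale=1.0, size=5_000)   # serving data: mean drifted

# Small p-value: the two samples are unlikely to share a distribution
stat, p_value = ks_2samp(train_feature, live_feature)
drifted = p_value < 0.01
print(drifted)  # True -> would trigger an alert / retraining
```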
A robust ML system must monitor both model performance metrics and input data quality metrics, and trigger model retraining when significant drift is detected.