Data Science
Data science is an interdisciplinary field that extracts knowledge and insights from data through statistics, computer science, and domain expertise. In the AI/ML domain, data science serves as the cornerstone for building high-quality models — the quality and handling of data often matter more than the model itself in determining final outcomes.
The Relationship Between Data Science and AI
Data Is the Fuel of AI
The success of modern AI depends heavily on data. Whether in traditional machine learning or deep learning, model performance is ultimately bounded by data quality. There is a classic saying in the industry: "Garbage in, garbage out" — if the input data is garbage, the model's output will be worthless.
The role of data science within the AI/ML pipeline is as follows:
Raw Data → Data Collection → Data Cleaning → EDA → Feature Engineering → Model Training → Model Evaluation → Deployment
\______________________________Data Science______________________________/
From Raw Data to Models: The Data Science Workflow
A complete data science workflow typically consists of the following steps:
| Phase | Core Tasks | Common Tools |
|---|---|---|
| Problem Definition | Clarify business objectives and formulate them as ML problems | — |
| Data Collection | Web scraping, APIs, database queries, sensors | Scrapy, SQL, Spark |
| Data Cleaning | Handling missing values, duplicates, and outliers | Pandas, NumPy |
| Exploratory Data Analysis | Statistical summaries, visualization, hypothesis testing | Matplotlib, Seaborn |
| Feature Engineering | Feature selection, extraction, and transformation | Scikit-learn, Featuretools |
| Model Building | Algorithm selection, model training | Scikit-learn, XGBoost, PyTorch |
| Model Evaluation | Cross-validation, metric analysis | Scikit-learn |
| Deployment & Monitoring | Model serving, data drift monitoring | MLflow, Docker, Kubernetes |
In real-world industrial settings, the data preparation phase (collection, cleaning, EDA, and feature engineering) typically consumes 60%–80% of the total project time. This underscores the critical importance of data science skills for the success of AI projects.
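As an illustration, the phases above can be strung together in a few lines of Python. The tiny churn table below is synthetic and its column names are invented for illustration; a real project would pull this from the data collection phase:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Data collection: a tiny synthetic table standing in for a real source
df = pd.DataFrame({
    "age": [25, 32, None, 45, 51, 29, 38, 60],
    "income": [30_000, 52_000, 41_000, 78_000, None, 35_000, 61_000, 90_000],
    "churned": [0, 0, 1, 0, 1, 0, 0, 1],
})

# Data cleaning: impute missing values with each column's median
df = df.fillna(df.median(numeric_only=True))

# Model building and evaluation on a held-out split
X, y = df[["age", "income"]], df["churned"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(model.predict(X_test))
```

Even this toy version shows the typical shape of the work: most of the lines deal with data, not with the model.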
Data Types and Structures
Understanding data types is a prerequisite for all data work. Different types of data require different processing strategies and model architectures.
Classification by Degree of Structure
| Type | Definition | Examples | Common Storage |
|---|---|---|---|
| Structured Data | Tabular data with a fixed schema | User tables, transaction records, sensor readings | SQL databases, CSV |
| Semi-structured Data | Hierarchically organized but not strictly tabular | JSON, XML, log files | NoSQL (MongoDB), Elasticsearch |
| Unstructured Data | No predefined structure | Text, images, audio, video | Object storage (S3), file systems |
In the AI domain, structured data is typically handled by traditional ML models (e.g., XGBoost, Random Forest), while unstructured data relies more on deep learning models (e.g., CNNs for images, Transformers for text).
Feature Types
In machine learning, each column of data is called a feature. Features can be classified by their mathematical properties:
| Feature Type | Description | Examples | Common Processing Methods |
|---|---|---|---|
| Numerical | Continuous or discrete values | Age, income, temperature | Standardization, normalization |
| Categorical | Unordered category labels | Gender, city, color | One-hot Encoding, Label Encoding |
| Ordinal | Categories with a natural ordering | Education level (high school < bachelor's < master's), rating (1–5 stars) | Ordinal Encoding |
| Temporal | Timestamps or time series | Order date, heartbeat signal | Extract year/month/week/hour, sliding window |
| Text | Natural language text | Reviews, news headlines | TF-IDF, Word2Vec, BERT Embedding |
Numerical standardization is one of the most common preprocessing operations. Common methods include:
- Min-Max Normalization: Scales data to the \([0, 1]\) interval: \(x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}\)
- Z-score Standardization: Transforms data to have zero mean and unit standard deviation: \(x' = \frac{x - \mu}{\sigma}\), where \(\mu\) is the mean and \(\sigma\) the standard deviation
Standardization is especially important for distance-based algorithms (e.g., KNN, SVM) and models optimized with gradient descent (e.g., neural networks), because inconsistent feature scales can cause certain features to dominate model learning.
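Both methods are available in scikit-learn; a minimal comparison on a three-value feature:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [5.0], [9.0]])

x_minmax = MinMaxScaler().fit_transform(x)    # scaled to [0, 1]
x_zscore = StandardScaler().fit_transform(x)  # zero mean, unit std

print(x_minmax.ravel())  # [0.  0.5 1. ]
print(x_zscore.ravel())  # symmetric around 0
```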
Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is the process of systematically exploring data before modeling. Its purpose is to understand data distributions, discover patterns, detect anomalies, and validate hypotheses, thereby informing subsequent feature engineering and model selection.
Statistical Summaries and Distributions
The first step is to understand the basic statistics of each feature:
| Statistic | Formula | Purpose |
|---|---|---|
| Mean | \(\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i\) | Measures the central tendency of the data |
| Median | The \(\frac{N+1}{2}\)-th value after sorting (for odd \(N\); the mean of the two middle values for even \(N\)) | A robust measure of central tendency against outliers |
| Standard Deviation (Std) | \(\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2}\) | Measures the dispersion of the data |
| Skewness | \(\frac{1}{N}\sum\left(\frac{x_i - \bar{x}}{\sigma}\right)^3\) | Measures the symmetry of the distribution |
| Kurtosis | \(\frac{1}{N}\sum\left(\frac{x_i - \bar{x}}{\sigma}\right)^4 - 3\) | Measures the thickness of the distribution tails |
If the absolute value of skewness exceeds 1, the distribution is heavily skewed and may require a log transformation or Box-Cox transformation to correct.
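A quick check of this rule with SciPy, using a synthetic right-skewed (log-normal) feature:

```python
import numpy as np
from scipy.stats import skew

# Log-normal data is strongly right-skewed by construction
x = np.random.default_rng(0).lognormal(mean=0.0, sigma=1.0, size=10_000)

print(skew(x))            # well above 1: heavily skewed
print(skew(np.log1p(x)))  # log(x + 1) brings skewness down substantially
```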
Correlation Analysis
The Pearson correlation coefficient measures the linear correlation between two variables:
\(r = \frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{N}(y_i - \bar{y})^2}}\)
Here \(r \in [-1, 1]\), and values of \(|r|\) closer to 1 indicate stronger linear correlation. It is important to note that the Pearson correlation coefficient captures only linear relationships. For nonlinear relationships, Spearman's rank correlation or Mutual Information can be used instead.
In feature selection, highly correlated features (e.g., \(|r| > 0.9\)) indicate information redundancy, and one of the correlated features is typically removed to avoid multicollinearity.
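A sketch of this redundancy check with pandas, on synthetic data where one feature nearly duplicates another:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=200),  # near-duplicate of "a"
    "b": rng.normal(size=200),                       # independent feature
})

corr = df.corr(method="pearson")

# Keep only the upper triangle so each pair is inspected once,
# then flag features highly correlated with an earlier feature
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [c for c in upper.columns if (upper[c].abs() > 0.9).any()]
print(redundant)  # ['a_copy']
```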
Visualization Methods
| Chart Type | Use Case | Information Revealed |
|---|---|---|
| Histogram | Univariate distribution | Distribution shape, degree of skewness |
| Box Plot | Univariate distribution + outliers | Quartiles, outliers (points beyond 1.5 times the IQR) |
| Scatter Plot | Bivariate relationships | Correlation between variables, clustering tendencies |
| Heatmap | Multivariate correlation | Correlation matrix among features |
| Pair Plot | Multivariate relationships | Pairwise relationships among all features |
| Violin Plot | Grouped distribution comparison | Distribution differences of a feature across categories |
Missing Value Analysis
The strategy for handling missing values depends on the missing data mechanism:
- MCAR (Missing Completely At Random): Missingness is unrelated to any variable. Rows can be directly deleted or values imputed.
- MAR (Missing At Random): Missingness is related to other observed variables. Conditional imputation based on related variables is needed.
- MNAR (Missing Not At Random): Missingness is related to the missing value itself. This is the hardest case to handle and requires domain knowledge.
Common handling methods include: deleting rows/columns with missing values, imputation with mean/median/mode, model-based imputation (e.g., KNN Imputation), and treating missingness as an independent feature (Missing Indicator).
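Two of these strategies, median imputation and a missing indicator, can be combined in one step with scikit-learn's SimpleImputer:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [np.nan], [3.0], [np.nan], [5.0]])

# add_indicator=True appends a binary "was missing" column per affected feature
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_out = imputer.fit_transform(X)

# Column 0: imputed values (median of observed values is 3.0)
# Column 1: 1.0 where the original value was missing
print(X_out)
```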
Outlier Detection
Common methods for outlier detection:
- IQR Method: A value is considered an outlier if \(x < Q_1 - 1.5 \times \text{IQR}\) or \(x > Q_3 + 1.5 \times \text{IQR}\), where \(\text{IQR} = Q_3 - Q_1\).
- Z-score Method: A value is considered an outlier if \(|z| > 3\) (i.e., more than 3 standard deviations from the mean).
- Isolation Forest: Builds an ensemble of randomly constructed trees; anomalous points are more easily "isolated", i.e., separated from the rest of the data in fewer random splits.
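The IQR rule above can be written out directly with NumPy; the sample array here is invented for illustration:

```python
import numpy as np

x = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])  # 102 is an outlier

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = x[(x < lower) | (x > upper)]
print(outliers)  # [102]
```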
Outliers do not always need to be removed — in scenarios such as fraud detection, outliers are precisely the targets we aim to identify.
Feature Engineering
Feature engineering is the process of transforming raw data into features that a model can learn from efficiently. Good feature engineering can significantly boost model performance, sometimes even more so than switching to a more complex model architecture. As Andrew Ng once said: "Applied machine learning is basically feature engineering."
Feature Selection
The goal of feature selection is to identify the most valuable subset from all available features, removing irrelevant and redundant ones.
| Method Category | Principle | Representative Methods | Pros and Cons |
|---|---|---|---|
| Filter Methods | Model-independent, ranking based on statistical metrics | Variance threshold, mutual information, chi-squared test, Pearson correlation | Fast, but ignores feature interactions |
| Wrapper Methods | Uses model performance as the evaluation criterion | Forward selection, backward elimination, Recursive Feature Elimination (RFE) | Good performance, but computationally expensive |
| Embedded Methods | Automatic selection during model training | L1 regularization (Lasso), tree-based feature importance | Balances efficiency and performance |
L1 Regularization (Lasso) produces sparse weights, automatically shrinking the coefficients of unimportant features to zero:
\(\min_{w} \frac{1}{N}\sum_{i=1}^{N}(y_i - w^T x_i)^2 + \alpha \|w\|_1\)
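A small demonstration of this sparsity effect with scikit-learn, on synthetic data where only the first two features carry signal:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only features 0 and 1 influence the target; features 2-4 are pure noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_.round(2))  # nonzero only for the first two features
```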
Feature Extraction
Feature extraction uses mathematical transformations to map original high-dimensional features into a lower-dimensional space while preserving as much important information as possible.
PCA (Principal Component Analysis) is the most classic linear dimensionality reduction method. Its core idea is to find the directions of maximum variance in the data (i.e., the principal components) and project the data onto these directions:
- Center the data matrix: \(X' = X - \bar{X}\)
- Compute the covariance matrix: \(C = \frac{1}{N} X'^T X'\)
- Perform eigenvalue decomposition on the covariance matrix and select the eigenvectors corresponding to the \(k\) largest eigenvalues
- Project the data onto the subspace spanned by these \(k\) eigenvectors
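The four steps above can be sketched directly in NumPy (in practice, scikit-learn's PCA wraps the same computation); the synthetic correlated data is illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
# Mix independent noise through an upper-triangular matrix to correlate the columns
X = rng.normal(size=(100, 3)) @ np.array([[2.0, 0.5, 0.1],
                                          [0.0, 1.0, 0.2],
                                          [0.0, 0.0, 0.3]])
k = 1

X_centered = X - X.mean(axis=0)               # 1. center the data matrix
C = (X_centered.T @ X_centered) / len(X)      # 2. covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)          # 3. eigendecomposition (ascending eigenvalues)
top_k = eigvecs[:, -k:]                       #    eigenvectors of the k largest eigenvalues
Z = X_centered @ top_k                        # 4. project onto the k-dimensional subspace

print(Z.shape)  # (100, 1)
```

The variance of the projected data equals the largest eigenvalue, which is exactly the "direction of maximum variance" idea.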
t-SNE (t-Distributed Stochastic Neighbor Embedding) is a nonlinear dimensionality reduction method that excels at visualizing high-dimensional data in two or three dimensions. It constructs probability distributions in both the high-dimensional and low-dimensional spaces, then minimizes the KL divergence between the two distributions. t-SNE is commonly used to visualize word embeddings, image feature spaces, and clustering results.
UMAP (Uniform Manifold Approximation and Projection) is a nonlinear dimensionality reduction method often used in place of t-SNE: it better preserves global structure while maintaining local structure, and is computationally faster. For large-scale datasets, UMAP is generally a better choice than t-SNE.
An autoencoder is a neural-network-based nonlinear dimensionality reduction method. By training an encoder-decoder architecture, the encoder compresses the input into a low-dimensional representation (the bottleneck), and the decoder reconstructs the input from this representation. The output of the bottleneck layer serves as the extracted features.
Feature Transformation
| Transformation | Use Case | Formula / Description |
|---|---|---|
| Log Transform | Right-skewed distributions | \(x' = \log(x + 1)\) |
| Polynomial Transform | Capturing nonlinear relationships | \((x_1, x_2) \to (x_1, x_2, x_1^2, x_1 x_2, x_2^2)\) |
| Binning | Discretizing continuous variables | Group age into: young / middle-aged / elderly |
| Box-Cox Transform | Making data more normally distributed | \(x' = \frac{x^\lambda - 1}{\lambda}, \lambda \neq 0\) |
| Target Encoding | Encoding categorical features as statistics of the target variable | Encode a city as the average house price in that city |
Feature Importance and Interpretability
After model training, understanding which features contribute most to predictions is crucial:
- Tree-based feature importance: Based on the sum of split gains across all trees for a given feature
- Permutation Importance: Randomly shuffle the values of a feature and observe the resulting drop in model performance
- SHAP (SHapley Additive exPlanations): Based on Shapley values from game theory, SHAP assigns a contribution value to each feature for every individual sample. It is currently the most popular tool for model interpretability
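Permutation importance is implemented in scikit-learn; a sketch on synthetic data where only the first feature matters:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 5.0 * X[:, 0] + rng.normal(scale=0.1, size=300)  # only feature 0 drives y

model = RandomForestRegressor(random_state=0).fit(X, y)

# Shuffle each feature n_repeats times and measure the drop in score
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
print(result.importances_mean.round(2))  # feature 0 dominates
```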
Class Imbalance
In real-world scenarios (such as fraud detection, disease diagnosis, and anomaly detection), the ratio of positive to negative samples is often severely imbalanced (e.g., fraudulent transactions may account for only 0.1%). In such cases, models tend to predict all samples as the majority class, resulting in extremely poor recognition of the minority class.
Handling Methods
(1) Data-Level Approaches
| Method | Principle | Pros and Cons |
|---|---|---|
| Random Oversampling | Randomly duplicate minority class samples | Simple, but prone to overfitting |
| SMOTE | Generate new samples by interpolating between minority class samples | Mitigates overfitting, but may introduce noise |
| Random Undersampling | Randomly remove majority class samples | Simple, but loses information |
| Tomek Links | Remove majority class samples on the decision boundary | Cleans the decision boundary; often combined with other methods |
The specific steps of SMOTE (Synthetic Minority Over-sampling Technique) are:
- For each minority class sample \(x_i\), find its \(k\) nearest neighbors (of the same class)
- Randomly select one neighbor \(x_{nn}\)
- Generate a new sample by random interpolation between \(x_i\) and \(x_{nn}\): \(x_{new} = x_i + \lambda \cdot (x_{nn} - x_i)\), where \(\lambda \in [0, 1]\)
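A simplified sketch of this interpolation step in NumPy; the function name smote_sample is invented here, and production code would typically use imblearn.over_sampling.SMOTE instead:

```python
import numpy as np

def smote_sample(minority: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Generate one synthetic minority sample by interpolation (simplified SMOTE)."""
    i = rng.integers(len(minority))
    x_i = minority[i]
    # k nearest minority-class neighbors of x_i (index 0 of argsort is x_i itself)
    dists = np.linalg.norm(minority - x_i, axis=1)
    neighbors = np.argsort(dists)[1:k + 1]
    x_nn = minority[rng.choice(neighbors)]
    lam = rng.uniform(0.0, 1.0)
    return x_i + lam * (x_nn - x_i)  # random point on the segment x_i -> x_nn

rng = np.random.default_rng(0)
minority = rng.normal(size=(20, 2))
x_new = smote_sample(minority, k=5, rng=rng)
print(x_new)  # lies between two existing minority samples
```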
(2) Algorithm-Level Approaches
- Cost-sensitive Learning: Assigns different misclassification costs to different classes. In the loss function, the minority class receives a higher weight: \(L = -\frac{1}{N}\sum_{i=1}^{N} w_{y_i} \log \hat{p}_{i,y_i}\), where \(\hat{p}_{i,y_i}\) is the predicted probability of sample \(i\)'s true class \(y_i\).
Here, \(w\) for the minority class is much larger than for the majority class. Most frameworks (e.g., Scikit-learn's class_weight='balanced') support automatic weight computation.
- Focal Loss: Proposed by Facebook AI Research in the RetinaNet paper, Focal Loss reduces the loss weight for easily classified samples, allowing the model to focus on hard-to-classify samples: \(FL(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)\), where \(p_t\) is the predicted probability of the true class, \(\gamma > 0\) down-weights easy samples, and \(\alpha_t\) balances the classes.
Evaluation Metrics for Imbalanced Data
In imbalanced settings, accuracy is a highly misleading metric. For example, in a dataset with 1000 samples of which only 10 are positive, a model that predicts all samples as negative still achieves 99% accuracy.
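This failure mode is easy to verify with scikit-learn's metric functions:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, f1_score

y_true = np.array([1] * 10 + [0] * 990)  # 10 positives out of 1000
y_pred = np.zeros(1000, dtype=int)       # "model" that predicts all-negative

print(accuracy_score(y_true, y_pred))                 # 0.99 -- looks great
print(recall_score(y_true, y_pred, zero_division=0))  # 0.0  -- catches nothing
print(f1_score(y_true, y_pred, zero_division=0))      # 0.0
```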
The following metrics should be used instead:
| Metric | Use Case | Description |
|---|---|---|
| Precision | When false positives are costly | The proportion of truly positive samples among those predicted as positive |
| Recall | When false negatives are costly | The proportion of positive samples that are correctly identified |
| F1-Score | Balanced trade-off | The harmonic mean of Precision and Recall |
| AUC-ROC | Threshold-independent evaluation | Area under the TPR vs. FPR curve across different thresholds |
| AUC-PR (PR Curve) | Severely imbalanced scenarios | Area under the Precision vs. Recall curve; more sensitive than AUC-ROC |
| MCC (Matthews Correlation Coefficient) | Overall quality assessment | A correlation coefficient that accounts for all four confusion matrix outcomes |
Data Pipeline
In production environments, data must flow through a series of automated processing stages from raw data sources to the final model or analytics system. This automated workflow is known as a data pipeline.
ETL vs. ELT
| Characteristic | ETL | ELT |
|---|---|---|
| Full Name | Extract-Transform-Load | Extract-Load-Transform |
| Transformation Timing | Transformed in a staging layer before loading | Transformed in the target system after loading |
| Use Case | Traditional data warehouses | Cloud data lakes, big data platforms |
| Compute Resources | Relies on the ETL server | Leverages the target system's compute power |
| Representative Tools | Informatica, Talend | dbt, Snowflake, BigQuery |
Batch vs. Streaming
| Characteristic | Batch Processing | Stream Processing |
|---|---|---|
| Data Processing Mode | Scheduled bulk processing | Real-time per-record or micro-batch processing |
| Latency | Minutes to hours | Milliseconds to seconds |
| Use Case | Report generation, model training | Real-time recommendations, fraud detection |
| Representative Tools | Spark Batch, Hadoop MapReduce | Kafka Streams, Flink, Spark Streaming |
In AI/ML scenarios, model training typically uses batch processing (requiring large volumes of historical data), while model inference may require stream processing (e.g., a real-time recommendation system needs to return results within milliseconds of a user click).
Data Quality Monitoring
Data quality is the most easily overlooked yet most impactful aspect of ML systems. Common data quality issues include:
- Data Drift: The distribution of input data changes over time. For example, user behavior may shift dramatically during a pandemic, causing a recommendation model to fail.
- Concept Drift: The relationship between inputs and outputs changes. For example, a keyword's connotation may shift from positive to negative.
- Schema Changes: An upstream system modifies the data format or field semantics.
- Data Latency / Missing Data: A data source fails to produce data during a certain time period.
Monitoring approaches include: statistical comparison (rate of change in mean and variance), distribution tests (KS test, PSI), and data quality rule engines (Great Expectations, Deequ).
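As one concrete monitoring sketch, a two-sample Kolmogorov-Smirnov test from SciPy can compare a training-time feature distribution against live data; the 0.01 significance threshold here is an arbitrary illustrative choice:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)  # training distribution
live_feature = rng.normal(loc=0.5, scale=1.0, size=5_000)   # serving data: mean drifted

# Small p-value: the two samples are unlikely to share a distribution
stat, p_value = ks_2samp(train_feature, live_feature)
drifted = p_value < 0.01
print(drifted)  # True -> would trigger an alert / retraining
```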
A robust ML system must monitor both model performance metrics and input data quality metrics, and trigger model retraining when significant drift is detected.