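Exploratory Data Analysis

Overview

Exploratory data analysis (EDA) is a key step for understanding a dataset before modeling. Descriptive statistics and visualization reveal patterns, anomalies, and relationships in the data, which in turn guide feature engineering and model selection.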

1. EDA Workflow

Load data
  → Descriptive statistics (describe)
  → Univariate analysis
  → Bivariate analysis
  → Multivariate analysis
  → Hypothesis generation
  → Directions for feature engineering

2. Descriptive Statistics

import pandas as pd
import numpy as np

df = pd.read_csv("data.csv")

# Basic information
print(df.shape)
print(df.dtypes)
print(df.describe())             # summary statistics for numeric columns
print(df.describe(include='O'))  # summary statistics for object (string) columns

# Missing values per column, most affected columns first
print(df.isnull().sum().sort_values(ascending=False))

# Split the columns by data type
numeric_cols = df.select_dtypes(include=[np.number]).columns
categorical_cols = df.select_dtypes(include=['object', 'category']).columns
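
The per-column checks above can be gathered into a single overview table. The following is a minimal sketch, not a fixed recipe: the helper name `column_overview` and the toy data are hypothetical and only stand in for `data.csv`.

```python
import numpy as np
import pandas as pd


def column_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count and share, distinct values."""
    overview = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": df.isna().mean() * 100,
        "n_unique": df.nunique(dropna=True),
    })
    # Most incomplete columns first
    return overview.sort_values("pct_missing", ascending=False)


# Hypothetical toy data standing in for data.csv
toy = pd.DataFrame({
    "age": [25, 32, np.nan, 41],
    "income": [3000.0, np.nan, np.nan, 5200.0],
    "city": ["A", "B", "B", None],
})
overview = column_overview(toy)
print(overview)
```

Sorting by the missing share puts the columns that most need a decision (drop, impute, or investigate) at the top.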

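Key statistics:

Statistic	Meaning	What to look for
count	Number of non-null values	Missing values
mean / median	Central tendency	A large gap between the two suggests skew
std	Dispersion	Variability
min / max	Extremes	Outliers
Q1 / Q3	Quartiles	Shape of the distribution
skewness	Asymmetry	> 1 right-skewed, < -1 left-skewed
kurtosis	Tail weight	> 3 heavy tails (pandas `kurt()` reports excess kurtosis, so compare against 0 there)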
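
`describe()` does not report skewness or kurtosis, so they have to be computed separately. The sketch below uses synthetic data (an assumption for illustration) to show the two pandas calls and how the excess-kurtosis convention relates to the "> 3" rule of thumb in the table.

```python
import numpy as np
import pandas as pd

# Synthetic columns: one symmetric, one clearly right-skewed
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "symmetric": rng.normal(size=5000),
    "right_skewed": rng.exponential(size=5000),
})

shape_stats = pd.DataFrame({
    "skewness": toy.skew(),
    "excess_kurtosis": toy.kurt(),  # Fisher definition: a normal distribution gives about 0
})
# Add 3 to compare against the "> 3" threshold (Pearson definition)
shape_stats["kurtosis"] = shape_stats["excess_kurtosis"] + 3
print(shape_stats.round(2))
```

The thresholds are rules of thumb. With small samples both statistics are noisy, so a histogram is still worth a look.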

3. Univariate Analysis

3.1 Numeric Variables

import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Histogram
axes[0].hist(df["age"], bins=30, edgecolor='black')
axes[0].set_title("Age Distribution")

# Box plot
sns.boxplot(y=df["income"], ax=axes[1])
axes[1].set_title("Income Box Plot")

# Kernel density estimate (KDE) plot
sns.kdeplot(df["score"], fill=True, ax=axes[2])
axes[2].set_title("Score Density")

plt.tight_layout()
plt.show()

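How to read a box plot:

                 ┌───────┬───────┐
    ├────────────┤       │       ├────────────┤        ◇
                 └───────┴───────┘
                 Q1    median    Q3
                 │◄──── IQR ────►│

    Box: Q1 to Q3 (the interquartile range, IQR); the line inside is the median
    Whiskers: the most extreme points within 1.5 * IQR of the box
    ◇ Outlier: a point below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR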
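
The 1.5 * IQR rule from the box plot can also be applied directly to a column to list the flagged values. This is a minimal sketch; the helper name `iqr_outliers` and the sample values are hypothetical.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask: True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Hypothetical incomes with one extreme value
income = pd.Series([30, 32, 35, 36, 38, 40, 41, 43, 45, 200])
mask = iqr_outliers(income)
print(income[mask].tolist())
```

A flagged value is only a candidate. Whether it is a data error or a genuine extreme needs domain context.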

3.2 Categorical Variables

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Bar chart of category frequencies
df["category"].value_counts().plot.bar(ax=axes[0])
axes[0].set_title("Category Counts")

# Pie chart (only when there are few categories)
df["gender"].value_counts().plot.pie(autopct='%1.1f%%', ax=axes[1])
axes[1].set_title("Gender Distribution")

plt.tight_layout()
plt.show()
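
Bar and pie charts become hard to read when a column has many rare levels. One option is to fold the rare levels into a single bucket before plotting. The sketch below assumes a 5% cutoff and an "other" label; both are arbitrary choices for illustration, not values from the original text.

```python
import pandas as pd

# Hypothetical categorical column with a long tail
category = pd.Series(["a"] * 60 + ["b"] * 30 + ["c"] * 6 + ["d"] * 3 + ["e"] * 1)

share = category.value_counts(normalize=True)   # relative frequency per level
rare = share[share < 0.05].index                # levels below the assumed 5% cutoff
grouped = category.where(~category.isin(rare), "other")

print(grouped.value_counts())
```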

4. Bivariate Analysis

4.1 Numeric vs Numeric

# Scatter plot
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x="age", y="income", hue="gender", alpha=0.6)
plt.title("Age vs Income")

# Correlation matrix (Pearson by default)
corr_matrix = df[numeric_cols].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='RdBu_r', center=0, fmt='.2f')
plt.title("Correlation Matrix")
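
With many columns, a heat map is easier to act on when the strongest pairs are also listed explicitly. The sketch below ranks pairs by absolute correlation and compares Pearson with Spearman on synthetic data. The helper `top_pairs` and the toy columns are hypothetical.

```python
from itertools import combinations

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
x = rng.normal(size=500)
toy = pd.DataFrame({
    "x": x,
    "linear": 2 * x + rng.normal(scale=0.5, size=500),  # linear relationship
    "monotonic": np.exp(x),                             # monotonic but nonlinear
    "noise": rng.normal(size=500),                      # unrelated
})


def top_pairs(corr: pd.DataFrame) -> pd.Series:
    """List each unordered column pair once, strongest absolute correlation first."""
    pairs = {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
    return pd.Series(pairs).sort_values(key=lambda s: s.abs(), ascending=False)


pearson = top_pairs(toy.corr(method="pearson"))
spearman = top_pairs(toy.corr(method="spearman"))
print(pearson.round(2))
print(spearman.round(2))
```

Pearson measures linear association, so it understates the `x`/`monotonic` relationship. Spearman works on ranks and reports it as perfect.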

4.2 Numeric vs Categorical

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Grouped box plot
sns.boxplot(data=df, x="education", y="income", ax=axes[0])
axes[0].set_title("Income by Education")

# Violin plot
sns.violinplot(data=df, x="gender", y="score", ax=axes[1])
axes[1].set_title("Score by Gender")
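
A grouped summary table is a useful companion to the grouped plots, because it shows the group sizes that a box plot hides. This is a small sketch on hypothetical data.

```python
import pandas as pd

# Hypothetical data: income by education level
toy = pd.DataFrame({
    "education": ["HS", "HS", "BSc", "BSc", "BSc", "MSc", "MSc"],
    "income": [30, 34, 45, 50, 52, 60, 90],
})

summary = (
    toy.groupby("education")["income"]
       .agg(["count", "median", "mean", "std"])
       .sort_values("median", ascending=False)
)
print(summary)
```

Groups with very few rows deserve caution, since their median and spread are unstable.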

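4.3 Categorical vs Categorical

# Cross-tabulation, row-normalized so each bar shows the purchase share within a gender
cross_tab = pd.crosstab(df["gender"], df["purchased"], normalize='index')
cross_tab.plot.bar(stacked=True)
plt.title("Purchase Rate by Gender")

# Heat map of raw counts; open a new figure so it is not drawn over the bar chart
plt.figure(figsize=(10, 8))
sns.heatmap(pd.crosstab(df["city"], df["product"]), annot=True, fmt='d', cmap='YlOrRd')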
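
A cross-tabulation shows whether two categorical variables appear related. A chi-square test of independence is one common way to check that the pattern is more than noise. It is not part of the original walkthrough, so treat this as an optional sketch. It uses `scipy.stats.chi2_contingency` on hypothetical counts.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: 50 customers per gender with different purchase rates
toy = pd.DataFrame({
    "gender": ["F"] * 50 + ["M"] * 50,
    "purchased": [1] * 35 + [0] * 15 + [1] * 15 + [0] * 35,
})

# The test needs raw counts, not the normalized table used for plotting
counts = pd.crosstab(toy["gender"], toy["purchased"])
chi2, p_value, dof, expected = chi2_contingency(counts)
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.4f}")
```

For 2x2 tables SciPy applies Yates' continuity correction by default. The test also assumes reasonably large expected counts in every cell.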

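5. Multivariate Analysis

5.1 PCA Visualization

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardize so every feature has zero mean and unit variance.
# Assumes the numeric columns have no missing values (PCA rejects NaN) and that
# df["label"] is numeric; the label is excluded so it does not leak into the projection.
feature_cols = numeric_cols.drop("label", errors="ignore")
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[feature_cols])

# Project onto the first two principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=df["label"], cmap='viridis', alpha=0.6)
plt.colorbar(scatter)
plt.xlabel(f"PC1 ({pca.explained_variance_ratio_[0]:.1%})")
plt.ylabel(f"PC2 ({pca.explained_variance_ratio_[1]:.1%})")
plt.title("PCA Visualization")

# Share of variance explained by each retained component
plt.figure()
plt.bar(range(1, len(pca.explained_variance_ratio_) + 1),
        pca.explained_variance_ratio_)
plt.xlabel("Principal Component")
plt.ylabel("Variance Explained")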
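
With `n_components=2` the bar chart shows only two bars. To judge how many components a dataset really needs, one option is to fit PCA with all components and look at the cumulative explained variance. The sketch below uses synthetic data driven by two latent factors. The loading matrix `W` and the 95% target are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))

# Hypothetical loadings: 6 observed features built from 2 latent factors
W = np.array([
    [1.0, 1.0, 1.0, 0.0, 0.0, 0.5],
    [0.0, 0.0, 0.5, 1.0, 1.0, 1.0],
])
X = latent @ W + 0.05 * rng.normal(size=(300, 6))   # small measurement noise

X_scaled = StandardScaler().fit_transform(X)
pca_full = PCA().fit(X_scaled)                      # keep every component

cumulative = np.cumsum(pca_full.explained_variance_ratio_)
# Smallest number of components whose cumulative share reaches 95%
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)
print(cumulative.round(3), n_needed)
```

If the first two components explain only a small share, the 2D scatter plot hides most of the structure and should be read with caution.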

5.2 t-SNE Visualization

from sklearn.manifold import TSNE

# perplexity must be smaller than the number of samples
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)

plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=df["label"],
                      cmap='tab10', alpha=0.6, s=10)
plt.colorbar(scatter)
plt.title("t-SNE Visualization")

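PCA vs t-SNE:

Aspect	PCA	t-SNE
Method	Linear dimensionality reduction	Nonlinear dimensionality reduction
Structure preserved	Global	Local
Speed	Fast	Slow
Reproducible	Yes	Depends on the random seed
New data	Can be projected directly	Requires recomputing the embedding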
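
The "new data" row can be checked against the scikit-learn API. A fitted `PCA` exposes `transform`, so unseen rows can be projected onto the axes already learned. `TSNE` offers only `fit_transform`. A minimal sketch on random data (the arrays are placeholders):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))   # placeholder training data
X_new = rng.normal(size=(10, 5))      # placeholder unseen rows

pca = PCA(n_components=2).fit(X_train)
new_points_2d = pca.transform(X_new)  # reuses the axes learned from X_train

# scikit-learn's TSNE has no transform method; embedding new rows means refitting
tsne_has_transform = hasattr(TSNE, "transform")
print(new_points_2d.shape, tsne_has_transform)
```

Because a t-SNE embedding must be recomputed with the new rows included, point positions are not comparable between runs.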

5.3 Pair Plot

# Pairwise relationships between variables, with KDE plots on the diagonal
sns.pairplot(df[["age", "income", "score", "label"]],
             hue="label", diag_kind="kde")

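6. Time Series EDA

from statsmodels.tsa.seasonal import seasonal_decompose

# Parse the dates and use them as the index
df["date"] = pd.to_datetime(df["date"])
df.set_index("date", inplace=True)

fig, axes = plt.subplots(2, 1, figsize=(12, 8))

# Raw series
df["value"].plot(ax=axes[0], title="Time Series")

# Rolling statistics over a 30-row window (30 days for daily data)
df["value"].rolling(30).mean().plot(ax=axes[1], label="30-day MA")
df["value"].rolling(30).std().plot(ax=axes[1], label="30-day Std")
axes[1].legend()
axes[1].set_title("Rolling Statistics")

# Seasonal decomposition; decomp.plot() draws its own figure.
# period=365 assumes daily data with yearly seasonality, at least two full
# years of observations, and no missing values.
decomp = seasonal_decompose(df["value"], period=365)
decomp.plot()
plt.show()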
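
`seasonal_decompose` needs a `period`, and 365 is only right for daily data with a yearly cycle. When the cycle length is not known in advance, autocorrelation at different lags gives a rough hint. The sketch below uses a synthetic daily series with a weekly cycle (an assumption for illustration) and pandas' `Series.autocorr`.

```python
import numpy as np
import pandas as pd

# Synthetic daily series: weekly cycle plus noise
dates = pd.date_range("2023-01-01", periods=140, freq="D")
rng = np.random.default_rng(1)
weekly = 10 + 3 * np.sin(2 * np.pi * np.arange(140) / 7)
series = pd.Series(weekly + rng.normal(scale=0.5, size=140), index=dates)

# Autocorrelation for lags 1..10; the peak suggests the cycle length
acf = pd.Series({lag: series.autocorr(lag=lag) for lag in range(1, 11)})
best_lag = int(acf.idxmax())
print(acf.round(2))
print("candidate period:", best_lag)
```

Multiples of the true period also score high, so the lag range should be kept narrow or the peaks inspected by eye. A strong trend should be removed first, because it inflates the autocorrelation at every lag.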

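7. EDA Tools

Tool	Strengths	Typical call
matplotlib	Low-level plotting, flexible	`plt.plot()`
seaborn	Statistical plots, good defaults	`sns.boxplot()`
plotly	Interactive charts	`px.scatter()`
ydata-profiling (formerly pandas-profiling)	One-click automated EDA	`ProfileReport(df)`
sweetviz	Dataset comparison	`sv.compare()`
D-Tale	Interactive UI	`dtale.show(df)`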

One-click EDA

from ydata_profiling import ProfileReport

profile = ProfileReport(df, title="Data Report", explorative=True)
profile.to_file("report.html")

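8. EDA Best Practices

Practice	Notes
Start with the big picture	Check `shape`, `dtypes`, and `describe` first
Pay attention to missing data	Missingness patterns often point to data collection problems
Check distributions	Normal? Skewed? Multimodal?
Identify outliers	An outlier may be an error, or it may be a signal
Record findings	Use a notebook to log each finding and hypothesis
Visualize first	Charts are more intuitive than tables of numbers
Iterate	EDA is not a one-off step; you may come back to it after modeling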

References

"Python for Data Analysis" - Wes McKinney
"Storytelling with Data" - Cole Nussbaumer Knaflic
seaborn documentation: https://seaborn.pydata.org
matplotlib documentation: https://matplotlib.org
