Exploratory Data Analysis

Introduction

Exploratory Data Analysis (EDA) is a critical step for understanding data before modeling. Through descriptive statistics and visualization, it reveals patterns, anomalies, and relationships in the data, guiding subsequent feature engineering and model selection.


1. EDA Workflow

Load Data
  → Descriptive Statistics (Describe)
  → Univariate Analysis
  → Bivariate Analysis
  → Multivariate Analysis
  → Hypothesis Generation
  → Feature Engineering Direction

2. Descriptive Statistics

import pandas as pd
import numpy as np

df = pd.read_csv("data.csv")

# Basic info
print(df.shape)
print(df.dtypes)
print(df.describe())           # numeric column stats
print(df.describe(include='O'))  # categorical column stats

# Missing values
print(df.isnull().sum().sort_values(ascending=False))

# Data type distribution
numeric_cols = df.select_dtypes(include=[np.number]).columns
categorical_cols = df.select_dtypes(include=['object', 'category']).columns
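Raw null counts are easier to act on when shown next to the fraction of rows affected. The following is a minimal sketch on a small made-up DataFrame; the `toy` data and the `missing_summary` helper are illustrative assumptions, not part of any library.

```python
import numpy as np
import pandas as pd

# Illustrative data only; it stands in for the df loaded from data.csv.
toy = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [50_000, 62_000, np.nan, np.nan],
    "city": ["A", "B", "B", None],
})

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and fractions, worst column first."""
    summary = pd.DataFrame({
        "n_missing": frame.isnull().sum(),
        "frac_missing": frame.isnull().mean(),
    })
    return summary.sort_values("frac_missing", ascending=False)

summary = missing_summary(toy)
print(summary)
```

Here `income` comes out on top with half of its values missing, which is the kind of column worth investigating before any modeling.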

Key statistics:

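| Statistic | Meaning | What to look for |
|-----------|---------|------------------|
| count | Non-null count | Missing values |
| mean / median | Central tendency | A large gap between the two suggests skewness |
| std | Dispersion | Variability |
| min / max | Extreme values | Outliers |
| Q1 / Q3 | Quartiles | Distribution shape |
| skewness | Asymmetry | > 0 right-skewed, < 0 left-skewed; an absolute value above 1 indicates strong skew |
| kurtosis | Tail weight | > 3 heavy tails (> 0 for excess kurtosis, which is what pandas `kurt()` reports) |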
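To see how these statistics behave on skewed data, here is a small sketch using a synthetic log-normal sample. The data is generated purely for illustration, and the log transform at the end is one common remedy rather than a required step.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (log-normal); purely illustrative.
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=10.0, sigma=1.0, size=5_000))

mean_value = income.mean()
median_value = income.median()
skewness = income.skew()         # > 0 means a longer right tail
excess_kurtosis = income.kurt()  # pandas reports excess kurtosis (normal = 0)

print(f"mean={mean_value:,.0f}  median={median_value:,.0f}")
print(f"skew={skewness:.2f}  excess kurtosis={excess_kurtosis:.2f}")

# Taking logs of a log-normal sample gives a roughly symmetric distribution.
log_skewness = np.log(income).skew()
print(f"skew after log transform={log_skewness:.2f}")
```

The mean sits well above the median, and both skewness and excess kurtosis are positive, which matches the reading of the table for a right-skewed, heavy-tailed variable.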

3. Univariate Analysis

3.1 Numeric Variables

import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Histogram
axes[0].hist(df["age"], bins=30, edgecolor='black')
axes[0].set_title("Age Distribution")

# Box plot
sns.boxplot(y=df["income"], ax=axes[1])
axes[1].set_title("Income Box Plot")

# KDE density plot
sns.kdeplot(df["score"], fill=True, ax=axes[2])
axes[2].set_title("Score Density")

plt.tight_layout()
plt.show()

Box plot interpretation:

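                Q1     median     Q3
                ┌─────────┬─────────┐
    ├───────────┤         │         ├───────────┤     ◇
                └─────────┴─────────┘
    whisker     │◄────── IQR ──────►│     whisker     outlier

    IQR = Q3 - Q1
    Whiskers extend to the most extreme points within 1.5 * IQR of the box.
    ◇ Outliers: points below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR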
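The 1.5 * IQR rule that a box plot uses to mark outliers can also be applied directly to a column. This is a minimal sketch on made-up values; pandas' default linear interpolation is assumed for the quartiles.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # illustrative data

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Points outside these fences are drawn as individual markers on a box plot.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence})")
print(outliers.tolist())
```

For this sample the quartiles are 3.25 and 7.75, the fences are -3.5 and 14.5, and only the value 100 is flagged. Whether a flagged point is an error or a genuine signal still needs to be judged from context.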

3.2 Categorical Variables

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Frequency bar chart
df["category"].value_counts().plot.bar(ax=axes[0])
axes[0].set_title("Category Counts")

# Pie chart (when few categories)
df["gender"].value_counts().plot.pie(autopct='%1.1f%%', ax=axes[1])
axes[1].set_title("Gender Distribution")
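Before plotting a categorical column, it often helps to look at the frequency table behind the chart. The sketch below uses a made-up series; the 15% cutoff for "rare" levels is an arbitrary assumption chosen for illustration.

```python
import pandas as pd

category = pd.Series(["a", "b", "a", "c", "a", "b", "a", "d"], name="category")

counts = category.value_counts()                # absolute frequencies, descending
shares = category.value_counts(normalize=True)  # proportions that sum to 1

freq_table = pd.DataFrame({"count": counts, "share": shares})
print(freq_table)

# Assumed rule of thumb: treat levels with a share below 15% as rare.
rare_levels = shares[shares < 0.15].index.tolist()
print(sorted(rare_levels))
```

A column with many rare levels tends to produce an unreadable pie chart, so a bar chart (or grouping rare levels into an "other" bucket) is usually the safer choice there.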

4. Bivariate Analysis

4.1 Numeric vs Numeric

# Scatter plot
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x="age", y="income", hue="gender", alpha=0.6)
plt.title("Age vs Income")

# Correlation matrix
corr_matrix = df[numeric_cols].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='RdBu_r', center=0, fmt='.2f')
plt.title("Correlation Matrix")

Correlation coefficient interpretation:

| \(\lvert r \rvert\) Range | Strength |
|------------|----------|
| 0.0 - 0.3 | Weak |
| 0.3 - 0.7 | Moderate |
| 0.7 - 1.0 | Strong |
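With many numeric columns, a heatmap gets crowded, and a ranked list of pairs can be easier to scan. The sketch below builds such a list from a correlation matrix and labels each pair with the strength bands from the table. The synthetic `toy` data is an assumption made for illustration: `income` is constructed to track `age`, and `score` is independent noise.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500
age = rng.normal(40, 10, n)
toy = pd.DataFrame({
    "age": age,
    "income": 1_000 * age + rng.normal(0, 2_000, n),  # built to follow age closely
    "score": rng.normal(70, 5, n),                    # independent noise
})

corr = toy.corr()

# Take the upper triangle only so that each pair appears exactly once.
rows, cols = np.triu_indices(len(corr), k=1)
pairs = pd.DataFrame({
    "var_a": corr.index[rows],
    "var_b": corr.columns[cols],
    "r": corr.to_numpy()[rows, cols],
})

# Label |r| with the Weak / Moderate / Strong bands.
pairs["strength"] = pd.cut(
    pairs["r"].abs(),
    bins=[0.0, 0.3, 0.7, 1.0],
    labels=["Weak", "Moderate", "Strong"],
    include_lowest=True,
)
pairs = pairs.sort_values("r", key=lambda s: s.abs(), ascending=False, ignore_index=True)
print(pairs)
```

The band edges are conventions rather than hard rules, so the labels are best read as a first filter for which pairs to plot.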

Note

The correlation coefficient measures only linear relationships. Two variables may have a strong nonlinear relationship yet a correlation coefficient near 0.
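A short numerical check makes this concrete. In the sketch below, `y = x**2` is fully determined by `x` yet has a Pearson correlation of essentially zero, while a monotonic but curved relationship gets a perfect rank (Spearman) correlation despite a Pearson value well below 1. The Spearman coefficient is computed here as the Pearson correlation of the ranks, which avoids any extra dependency.

```python
import numpy as np
import pandas as pd

x = pd.Series(np.linspace(-1.0, 1.0, 201))

# Case 1: y depends entirely on x, but symmetrically rather than linearly.
y_quadratic = x ** 2
r_quadratic = x.corr(y_quadratic)  # Pearson by default

# Case 2: monotonic but strongly curved.
y_exp = np.exp(5 * x)
r_exp = x.corr(y_exp)
rho_exp = x.rank().corr(y_exp.rank())  # Spearman = Pearson on ranks

print(f"quadratic:   |Pearson r| = {abs(r_quadratic):.3f}")
print(f"exponential: Pearson r = {r_exp:.3f}, Spearman rho = {rho_exp:.3f}")
```

This is why a scatter plot should accompany the correlation matrix: a near-zero coefficient does not rule out a relationship.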

4.2 Numeric vs Categorical

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Grouped box plot
sns.boxplot(data=df, x="education", y="income", ax=axes[0])
axes[0].set_title("Income by Education")

# Violin plot
sns.violinplot(data=df, x="gender", y="score", ax=axes[1])
axes[1].set_title("Score by Gender")
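The numbers behind a grouped box plot can be tabulated with a `groupby`. This is a minimal sketch on a made-up DataFrame; the education levels and incomes are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "education": ["HS", "HS", "BSc", "BSc", "BSc", "MSc", "MSc"],
    "income": [30_000, 34_000, 48_000, 52_000, 50_000, 70_000, 66_000],
})

# One row per group: size, center, and spread of the numeric variable.
group_stats = (
    toy.groupby("education")["income"]
       .agg(["count", "mean", "median", "std"])
       .sort_values("median")
)
print(group_stats)
```

Checking `count` alongside the medians is worthwhile, because a striking difference between groups means little when one group has only a handful of rows.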

4.3 Categorical vs Categorical

# Cross-tabulation
cross_tab = pd.crosstab(df["gender"], df["purchased"], normalize='index')
cross_tab.plot.bar(stacked=True)
plt.title("Purchase Rate by Gender")

# Heatmap
sns.heatmap(pd.crosstab(df["city"], df["product"]), annot=True, fmt='d', cmap='YlOrRd')
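The `normalize='index'` argument used above turns each row of the cross-tabulation into proportions, which is what makes the stacked bars comparable across groups of different sizes. A small sketch on made-up data (the `toy` values are illustrative assumptions):

```python
import pandas as pd

toy = pd.DataFrame({
    "gender": ["F", "F", "F", "F", "M", "M", "M", "M", "M", "M"],
    "purchased": [1, 1, 1, 0, 1, 0, 0, 0, 0, 1],
})

# Raw counts: how many rows fall into each (gender, purchased) cell.
counts = pd.crosstab(toy["gender"], toy["purchased"])

# Row-normalized: each row sums to 1, so column 1 is the purchase rate per gender.
rates = pd.crosstab(toy["gender"], toy["purchased"], normalize="index")

print(counts)
print(rates.round(3))
```

Looking at `counts` and `rates` together shows both the rate and how many observations support it.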

5. Multivariate Analysis

5.1 PCA Visualization

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

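# Standardize. PCA cannot handle NaNs, so gaps are filled first (median
# imputation is only a simple placeholder). The target column "label" is
# excluded so that it does not leak into the components.
features = df[numeric_cols].drop(columns=["label"], errors="ignore")
features = features.fillna(features.median())
scaler = StandardScaler()
X_scaled = scaler.fit_transform(features)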

# PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=df["label"], cmap='viridis', alpha=0.6)
plt.colorbar(scatter)
plt.xlabel(f"PC1 ({pca.explained_variance_ratio_[0]:.1%})")
plt.ylabel(f"PC2 ({pca.explained_variance_ratio_[1]:.1%})")
plt.title("PCA Visualization")

# Explained variance ratio
plt.figure()
plt.bar(range(len(pca.explained_variance_ratio_)), 
        pca.explained_variance_ratio_)
plt.xlabel("Principal Component")
plt.ylabel("Variance Explained")

5.2 t-SNE Visualization

from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)

plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=df["label"], 
                      cmap='tab10', alpha=0.6, s=10)
plt.colorbar(scatter)
plt.title("t-SNE Visualization")

PCA vs t-SNE:

| Aspect | PCA | t-SNE |
|--------|-----|-------|
| Method | Linear dimensionality reduction | Nonlinear dimensionality reduction |
| Structure preserved | Global | Local |
| Speed | Fast | Slow |
| Reproducible | Yes | Depends on random seed |
| New data | Can project directly | Requires recomputation |

5.3 Pair Plot

# Pairwise variable relationship plot
sns.pairplot(df[["age", "income", "score", "label"]], 
             hue="label", diag_kind="kde")

6. Time Series EDA

from statsmodels.tsa.seasonal import seasonal_decompose

# Time trends: parse dates, index by them, and make sure rows are in time order
df["date"] = pd.to_datetime(df["date"])
df = df.set_index("date").sort_index()

fig, axes = plt.subplots(2, 1, figsize=(12, 7))

# Raw series
df["value"].plot(ax=axes[0], title="Time Series")

# Rolling statistics
df["value"].rolling(30).mean().plot(ax=axes[1], label="30-day MA")
df["value"].rolling(30).std().plot(ax=axes[1], label="30-day Std")
axes[1].legend()
axes[1].set_title("Rolling Statistics")
plt.tight_layout()

# Seasonal decomposition. decomp.plot() draws its own figure. This assumes
# daily data with no missing values and at least two full years of history
# (2 * period observations).
decomp = seasonal_decompose(df["value"], period=365)
decomp.plot()
plt.show()
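The rolling statistics above start with a run of missing values, because a window of 30 needs 30 observations before it produces a result. The sketch below shows this on a tiny made-up series with a window of 3, and how `min_periods` changes the behavior.

```python
import pandas as pd

dates = pd.date_range("2024-01-01", periods=6, freq="D")
series = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=dates)  # illustrative data

# Default: the result is NaN until the window is completely filled.
ma_strict = series.rolling(3).mean()

# min_periods=1: average whatever observations are available so far.
ma_partial = series.rolling(3, min_periods=1).mean()

print(pd.DataFrame({"value": series, "strict": ma_strict, "partial": ma_partial}))
```

The strict version leaves the first two entries empty, while the partial version fills them with averages over fewer points. The partial averages at the start are noisier, so they should be read with some care on a plot.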

7. EDA Tools

| Tool | Features | Usage |
|------|----------|-------|
| matplotlib | Basic plotting, flexible | `plt.plot()` |
| seaborn | Statistical plots, aesthetically pleasing | `sns.boxplot()` |
| plotly | Interactive charts | `px.scatter()` |
| ydata-profiling (formerly pandas-profiling) | One-click automated EDA | `ProfileReport(df)` |
| sweetviz | Comparative analysis | `sv.compare()` |
| D-Tale | Interactive UI | `dtale.show(df)` |

One-Click EDA

from ydata_profiling import ProfileReport

profile = ProfileReport(df, title="Data Report", explorative=True)
profile.to_file("report.html")

8. EDA Best Practices

| Practice | Description |
|----------|-------------|
| See the big picture first | Start with shape, dtypes, describe |
| Focus on missing data | Missing patterns often hint at data collection issues |
| Check distributions | Normal? Skewed? Multimodal? |
| Identify anomalies | Outliers may be errors or signals |
| Document findings | Record every finding and hypothesis in notebooks |
| Visualization first | Charts are more intuitive than numbers |
| Iterate | EDA is not a one-time activity; you may return after modeling |

References

  • "Python for Data Analysis" - Wes McKinney
  • "Storytelling with Data" - Cole Nussbaumer Knaflic
  • seaborn Official Documentation: https://seaborn.pydata.org
  • matplotlib Official Documentation: https://matplotlib.org
