Exploratory Data Analysis
Introduction
Exploratory Data Analysis (EDA) is a critical step for understanding data before modeling. Through descriptive statistics and visualization, it reveals patterns, anomalies, and relationships in the data, guiding subsequent feature engineering and model selection.
1. EDA Workflow
Load Data
→ Descriptive Statistics (Describe)
→ Univariate Analysis
→ Bivariate Analysis
→ Multivariate Analysis
→ Hypothesis Generation
→ Feature Engineering Direction
2. Descriptive Statistics
import pandas as pd
import numpy as np
df = pd.read_csv("data.csv")
# Basic info
print(df.shape)
print(df.dtypes)
print(df.describe()) # numeric column stats
print(df.describe(include=['object', 'category'])) # categorical column stats
# Missing values
print(df.isnull().sum().sort_values(ascending=False))
# Data type distribution
numeric_cols = df.select_dtypes(include=[np.number]).columns
categorical_cols = df.select_dtypes(include=['object', 'category']).columns
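Raw missing counts are easier to compare across columns when paired with a percentage. The following is a minimal sketch on a made-up frame; the helper name `missing_summary` and the demo columns are assumptions for illustration, not part of any library.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (counts / len(frame) * 100).round(1),
    })
    # Keep only columns that actually have gaps
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)

# Hypothetical data, for illustration only
demo = pd.DataFrame({
    "age": [25, np.nan, 31, 47],
    "income": [np.nan, np.nan, 52000, np.nan],
    "city": ["Oslo", "Lima", "Pune", "Kobe"],
})
print(missing_summary(demo))
```

Columns without missing values are filtered out, so on a wide table the output stays short.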
Key statistics:
| Statistic | Meaning | Focus |
|---|---|---|
| count | Non-null count | Missing values |
| mean / median | Central tendency | Discrepancy → skewness |
| std | Dispersion | Variability |
| min / max | Extreme values | Outliers |
| Q1 / Q3 | Quartiles | Distribution shape |
| skewness | Asymmetry | > 0 right-skewed, < 0 left-skewed; beyond ±1 is strongly skewed |
| kurtosis | Tail weight | > 3 heavy tails (> 0 for excess kurtosis, which pandas `kurt()` reports) |
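`describe()` does not report skewness or kurtosis, so they have to be computed separately. A minimal sketch on synthetic data (the column names `symmetric` and `right_skewed` are made up for illustration) that appends both to the summary table:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic columns: one roughly normal, one log-normal (long right tail)
demo = pd.DataFrame({
    "symmetric": rng.normal(loc=50, scale=10, size=5000),
    "right_skewed": rng.lognormal(mean=0.0, sigma=1.0, size=5000),
})

summary = demo.describe().T
summary["skew"] = demo.skew()
# pandas kurt() uses Fisher's definition: a normal distribution scores about 0
summary["excess_kurtosis"] = demo.kurt()
print(summary[["mean", "50%", "skew", "excess_kurtosis"]])
```

In the right-skewed column the mean sits above the median, which is the mean/median discrepancy listed in the table.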
3. Univariate Analysis
3.1 Numeric Variables
import matplotlib.pyplot as plt
import seaborn as sns
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# Histogram
axes[0].hist(df["age"], bins=30, edgecolor='black')
axes[0].set_title("Age Distribution")
# Box plot
sns.boxplot(y=df["income"], ax=axes[1])
axes[1].set_title("Income Box Plot")
# KDE density plot
sns.kdeplot(df["score"], fill=True, ax=axes[2])
axes[2].set_title("Score Density")
plt.tight_layout()
plt.show()
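Box plot interpretation:
| Element | Meaning |
|---|---|
| Box | Spans Q1 to Q3; its length is the IQR |
| + (marker or line inside the box) | Median |
| Whiskers | Extend to the most extreme points within 1.5*IQR of the box |
| ◇ | Outliers (more than 1.5*IQR below Q1 or above Q3) |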
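The 1.5*IQR rule can also be applied numerically to list the points a box plot would draw as outliers. This is a minimal sketch; the helper name `iqr_outliers` and the sample values are assumptions for illustration.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying more than k * IQR outside the box."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Hypothetical incomes (in thousands) with one extreme value
incomes = pd.Series([32, 35, 38, 40, 41, 43, 45, 47, 50, 250])
print(iqr_outliers(incomes))
```

Whether a flagged point is an error or a genuine signal still has to be judged from context; the rule only marks candidates.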
3.2 Categorical Variables
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
# Frequency bar chart
df["category"].value_counts().plot.bar(ax=axes[0])
axes[0].set_title("Category Counts")
# Pie chart (when few categories)
df["gender"].value_counts().plot.pie(autopct='%1.1f%%', ax=axes[1])
axes[1].set_title("Gender Distribution")
4. Bivariate Analysis
4.1 Numeric vs Numeric
# Scatter plot
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x="age", y="income", hue="gender", alpha=0.6)
plt.title("Age vs Income")
# Correlation matrix
corr_matrix = df[numeric_cols].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='RdBu_r', center=0, fmt='.2f')
plt.title("Correlation Matrix")
Correlation coefficient interpretation:
| \(\lvert r \rvert\) Range | Strength |
|---|---|
| 0.0 - 0.3 | Weak |
| 0.3 - 0.7 | Moderate |
| 0.7 - 1.0 | Strong |
Note
The correlation coefficient measures only linear relationships. Two variables may have a strong nonlinear relationship yet a correlation coefficient near 0.
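A small synthetic case makes the point concrete: below, `y` is fully determined by `x`, yet the Pearson coefficient is essentially zero because the relationship is symmetric rather than linear. The data is made up for illustration.

```python
import numpy as np
import pandas as pd

# y depends on x exactly, but through a parabola instead of a line
x = np.linspace(-1, 1, 201)
demo = pd.DataFrame({"x": x, "y": x ** 2})

pearson = demo["x"].corr(demo["y"])  # Pearson is the default method
print(f"Pearson r = {pearson:.3f}")
```

A scatter plot of the same two columns shows the dependence immediately, which is why the correlation matrix should be read alongside the plots and not in place of them.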
4.2 Numeric vs Categorical
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Grouped box plot
sns.boxplot(data=df, x="education", y="income", ax=axes[0])
axes[0].set_title("Income by Education")
# Violin plot
sns.violinplot(data=df, x="gender", y="score", ax=axes[1])
axes[1].set_title("Score by Gender")
4.3 Categorical vs Categorical
# Cross-tabulation
cross_tab = pd.crosstab(df["gender"], df["purchased"], normalize='index')
cross_tab.plot.bar(stacked=True)
plt.title("Purchase Rate by Gender")
# Heatmap
sns.heatmap(pd.crosstab(df["city"], df["product"]), annot=True, fmt='d', cmap='YlOrRd')
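`normalize='index'` turns each row of the cross-tabulation into proportions that sum to 1, so the stacked bars can be read as per-group rates. A minimal sketch with made-up data, comparing raw counts to row-normalized rates:

```python
import pandas as pd

# Hypothetical purchases, for illustration only
demo = pd.DataFrame({
    "gender": ["F", "F", "F", "M", "M", "M", "M", "F"],
    "purchased": [1, 0, 1, 0, 0, 1, 0, 1],
})

counts = pd.crosstab(demo["gender"], demo["purchased"])
rates = pd.crosstab(demo["gender"], demo["purchased"], normalize="index")
print(counts)
print(rates)
```

Rates make groups of different sizes comparable, while counts show how much data sits behind each rate; it usually helps to look at both.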
5. Multivariate Analysis
5.1 PCA Visualization
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
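# Standardize the features; exclude the target so it does not leak into the projection
# (PCA and t-SNE raise on NaN, so impute or drop incomplete rows beforehand)
feature_cols = numeric_cols.drop("label", errors="ignore")
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[feature_cols])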
# PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=df["label"], cmap='viridis', alpha=0.6)
plt.colorbar(scatter)
plt.xlabel(f"PC1 ({pca.explained_variance_ratio_[0]:.1%})")
plt.ylabel(f"PC2 ({pca.explained_variance_ratio_[1]:.1%})")
plt.title("PCA Visualization")
# Explained variance ratio
plt.figure()
plt.bar(range(len(pca.explained_variance_ratio_)),
pca.explained_variance_ratio_)
plt.xlabel("Principal Component")
plt.ylabel("Variance Explained")
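With `n_components=2` the explained-variance chart has only two bars. To judge how many components are worth keeping, one option is to fit PCA with every component and look at the cumulative ratio. The sketch below uses synthetic data, and the 95% threshold is an arbitrary assumption, not a rule.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))
# Six observed features driven by two latent factors plus small noise
X = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(300, 6))

X_scaled = StandardScaler().fit_transform(X)
pca_full = PCA().fit(X_scaled)  # n_components=None keeps every component

cumulative = np.cumsum(pca_full.explained_variance_ratio_)
# Smallest number of components whose cumulative ratio reaches the threshold
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print(cumulative.round(3), n_keep)
```

Plotting `cumulative` as a line gives the usual scree-style view of the same information.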
5.2 t-SNE Visualization
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)
plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=df["label"],
cmap='tab10', alpha=0.6, s=10)
plt.colorbar(scatter)
plt.title("t-SNE Visualization")
PCA vs t-SNE:
| Dimension | PCA | t-SNE |
|---|---|---|
| Method | Linear dimensionality reduction | Nonlinear dimensionality reduction |
| Structure preserved | Global | Local |
| Speed | Fast | Slow |
| Reproducible | Yes | Depends on random seed |
| New data | Can project directly | Requires recomputation |
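The "New data" row can be checked directly in scikit-learn: a fitted `PCA` has a `transform` method for unseen rows, while `TSNE` offers only `fit_transform`. A minimal sketch on random data (the array names are made up for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))
X_new = rng.normal(size=(10, 5))

# PCA learns fixed axes, so unseen rows can be projected onto them
pca = PCA(n_components=2).fit(X_train)
X_new_pca = pca.transform(X_new)
print(X_new_pca.shape)

# scikit-learn's TSNE has no transform(); adding rows means refitting
# the embedding on the combined data
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
print(hasattr(tsne, "transform"))
```

This is one reason PCA fits more naturally into a preprocessing pipeline, while t-SNE is mostly used for one-off visual inspection.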
5.3 Pair Plot
# Pairwise variable relationship plot
sns.pairplot(df[["age", "income", "score", "label"]],
hue="label", diag_kind="kde")
6. Time Series EDA
# Time trends
df["date"] = pd.to_datetime(df["date"])
df.set_index("date", inplace=True)
fig, axes = plt.subplots(2, 1, figsize=(12, 7))
# Raw series
df["value"].plot(ax=axes[0], title="Time Series")
# Rolling statistics (rolling(30) spans 30 rows, i.e. 30 days only for daily data)
df["value"].rolling(30).mean().plot(ax=axes[1], label="30-day MA")
df["value"].rolling(30).std().plot(ax=axes[1], label="30-day Std")
axes[1].legend()
axes[1].set_title("Rolling Statistics")
plt.tight_layout()
# Seasonal decomposition (draws its own figure; needs no NaNs and at least 2 * period observations)
from statsmodels.tsa.seasonal import seasonal_decompose
decomp = seasonal_decompose(df["value"], period=365)
decomp.plot()
plt.show()
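An integer window such as `rolling(30)` counts rows, not days, so the "30-day" label is accurate only when the series has exactly one observation per day. On a datetime index, pandas also accepts a time offset such as `"30D"`. The sketch below uses a synthetic series with a deliberate gap to show the difference.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2024-01-01", periods=90, freq="D")
series = pd.Series(np.arange(90, dtype=float), index=dates)
series = series.drop(series.index[40:60])  # simulate a 20-day gap

by_rows = series.rolling(30).mean()      # last 30 observations, whatever their dates
by_time = series.rolling("30D").mean()   # observations from the last 30 calendar days

check_day = dates[69]
print(by_rows[check_day], by_time[check_day])
```

Right after the gap, the row-based window still reaches back to values from before the gap, while the time-based window averages only the recent observations.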
7. EDA Tools
| Tool | Features | Usage |
|---|---|---|
| matplotlib | Basic plotting, flexible | plt.plot() |
| seaborn | Statistical plots, aesthetically pleasing | sns.boxplot() |
| plotly | Interactive charts | px.scatter() |
| ydata-profiling (formerly pandas-profiling) | One-click automated EDA | ProfileReport(df) |
| sweetviz | Comparative analysis | sv.compare() |
| D-Tale | Interactive UI | dtale.show(df) |
One-Click EDA
from ydata_profiling import ProfileReport
profile = ProfileReport(df, title="Data Report", explorative=True)
profile.to_file("report.html")
8. EDA Best Practices
| Practice | Description |
|---|---|
| See the big picture first | Start with shape, dtypes, describe |
| Focus on missing data | Missing patterns often hint at data collection issues |
| Check distributions | Normal? Skewed? Multimodal? |
| Identify anomalies | Outliers may be errors or signals |
| Document findings | Record every finding and hypothesis in notebooks |
| Visualization first | Charts are more intuitive than numbers |
| Iterate | EDA is not a one-time activity; you may return after modeling |
References
- "Python for Data Analysis" - Wes McKinney
- "Storytelling with Data" - Cole Nussbaumer Knaflic
- seaborn Official Documentation: https://seaborn.pydata.org
- matplotlib Official Documentation: https://matplotlib.org