Time Series Analysis
Time series analysis encompasses a collection of statistical methods and machine learning techniques for data points ordered in time. From stock price prediction to weather forecasting, and from server load monitoring to supply chain management, its industrial applications are broad.
Learning path: Stationarity testing → Classical statistical models → Exponential smoothing → ML feature engineering → Deep learning methods → Evaluation and practice
Overview of Time Series Analysis
Basic Concepts
A time series \(\{y_t\}_{t=1}^{T}\) is a sequence of data observed at equally spaced time points. The core components of a time series include:
| Component | Description | Example |
|---|---|---|
| Trend | Long-term upward or downward direction | GDP growing year over year |
| Seasonality | Regular fluctuations with a fixed period | Retail sales surging every Christmas |
| Cyclicity | Fluctuations without a fixed period | Business cycles (recessions and booms) |
| Noise | Unpredictable random fluctuations | Measurement errors |
Stationarity
Stationarity is the most fundamental concept in time series analysis. A strictly stationary process has a joint distribution that is invariant to time shifts. In practice, weak stationarity (wide-sense stationarity) is the more commonly used notion; it requires:
- Constant mean: \(\mathbb{E}[y_t] = \mu\) for all \(t\)
- Finite and constant variance: \(\text{Var}(y_t) = \sigma^2 < \infty\)
- Autocovariance depends only on lag: \(\text{Cov}(y_t, y_{t+h}) = \gamma(h)\), depending only on the lag \(h\)
Stationarity tests and remedies:
- ADF test (Augmented Dickey-Fuller): Tests for the presence of a unit root; a small p-value leads to rejection of the "non-stationary" null hypothesis
- KPSS test: The null hypothesis is stationarity; using both tests together is more reliable
- Differencing: A non-stationary series can often be made stationary by differencing it \(d\) times; the first difference is \(\Delta y_t = y_t - y_{t-1}\)
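As a minimal sketch of how these checks look in practice with statsmodels (the random-walk data is an illustrative assumption):

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller, kpss

rng = np.random.default_rng(42)
y = np.cumsum(rng.normal(size=500))  # a random walk: unit root, non-stationary

# ADF: H0 = unit root. A small p-value rejects non-stationarity.
adf_p = adfuller(y)[1]
# KPSS: H0 = stationarity. A small p-value rejects stationarity.
kpss_p = kpss(y, regression="c", nlags="auto")[1]
print(f"ADF p={adf_p:.3f}, KPSS p={kpss_p:.3f}")  # both should flag non-stationarity

# First-order differencing usually fixes a random walk
dy = np.diff(y)  # Delta y_t = y_t - y_{t-1}
print(f"ADF p after differencing: {adfuller(dy)[1]:.3f}")  # small -> stationary
```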
Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF)
The ACF measures the linear correlation between a time series and its lagged version: \(\rho(h) = \frac{\text{Cov}(y_t, y_{t+h})}{\text{Var}(y_t)} = \frac{\gamma(h)}{\gamma(0)}\).
The PACF measures the direct linear relationship between \(y_t\) and \(y_{t+h}\) after removing the effects of intermediate lags.
ACF and PACF plots are essential tools for selecting model orders:
| Model | ACF pattern | PACF pattern |
|---|---|---|
| AR(p) | Tails off (exponential decay) | Cuts off after lag \(p\) |
| MA(q) | Cuts off after lag \(q\) | Tails off |
| ARMA(p,q) | Tails off | Tails off |
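A small simulation makes the table concrete; the sketch below generates an AR(2) process with statsmodels (the coefficients 0.6 and 0.3 are arbitrary illustrative values) and plots its ACF and PACF:

```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# AR(2): y_t = 0.6 y_{t-1} + 0.3 y_{t-2} + eps_t
# ArmaProcess takes lag polynomials, so the AR coefficients are negated.
y = ArmaProcess(np.array([1, -0.6, -0.3]), np.array([1])).generate_sample(nsample=1000)

fig, axes = plt.subplots(2, 1, figsize=(8, 6))
plot_acf(y, lags=30, ax=axes[0])   # should tail off gradually
plot_pacf(y, lags=30, ax=axes[1])  # should cut off after lag 2
plt.tight_layout()
plt.show()
```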
Classical Methods
AR / MA / ARMA / ARIMA Models
AR(p) -- Autoregressive model: The current value is a linear combination of the past \(p\) values plus white noise:
\(y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \varepsilon_t\)
MA(q) -- Moving Average model: The current value is a linear combination of the past \(q\) noise terms:
\(y_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}\)
ARMA(p,q): Combines AR and MA:
\(y_t = c + \sum_{i=1}^{p} \phi_i y_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}\)
ARIMA(p,d,q): Adds differencing to ARMA to handle non-stationary series. \(d\) is the order of differencing.
Box-Jenkins Methodology
The Box-Jenkins methodology is a systematic procedure for selecting and fitting ARIMA models:
- Identification: Examine ACF/PACF plots to determine candidate values for \(p\), \(d\), \(q\)
- Estimation: Fit model parameters using Maximum Likelihood Estimation (MLE)
- Diagnostics: Check whether residuals are white noise (Ljung-Box test)
- Forecasting: If diagnostics pass, use the model for prediction
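The four steps map onto statsmodels directly; in the sketch below, the series and the (1, 1, 1) order are placeholders (in practice the order comes from the identification step):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.stats.diagnostic import acorr_ljungbox

rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=300))  # placeholder series; use real data here

# Estimation: parameters are fit by MLE
model = ARIMA(y, order=(1, 1, 1)).fit()
print(model.summary())

# Diagnostics: large Ljung-Box p-values -> residuals look like white noise
print(acorr_ljungbox(model.resid, lags=[10, 20]))

# Forecasting: point forecasts with confidence intervals
fc = model.get_forecast(steps=12)
print(fc.predicted_mean)
print(fc.conf_int())
```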
Seasonal Decomposition
For seasonal series, ARIMA extends to Seasonal ARIMA, SARIMA(p,d,q)(P,D,Q)\(_s\), where \(s\) is the seasonal period (e.g., \(s = 12\) for monthly data with yearly seasonality).
Classical additive/multiplicative decomposition:
- Additive model: \(y_t = T_t + S_t + R_t\) (trend + seasonality + residual)
- Multiplicative model: \(y_t = T_t \times S_t \times R_t\)
STL decomposition (Seasonal and Trend decomposition using Loess) is a more robust method that can handle changing seasonality.
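A minimal STL sketch with statsmodels, assuming monthly data with period 12 (the synthetic series is for illustration only):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import STL

# Synthetic monthly series: linear trend + yearly seasonality + noise
idx = pd.date_range("2015-01-01", periods=120, freq="MS")
t = np.arange(120)
y = pd.Series(0.5 * t + 10 * np.sin(2 * np.pi * t / 12)
              + np.random.default_rng(1).normal(0, 2, size=120), index=idx)

res = STL(y, period=12, robust=True).fit()  # robust=True downweights outliers
res.plot()  # panels: observed, trend, seasonal, residual
plt.show()
```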
Exponential Smoothing
Simple Exponential Smoothing (SES)
Suitable for series with no trend and no seasonality: \(\hat{y}_{t+1} = \alpha y_t + (1 - \alpha)\hat{y}_t\), where \(0 < \alpha \le 1\) is the smoothing parameter.
A larger \(\alpha\) gives more weight to recent observations; a smaller \(\alpha\) produces smoother forecasts.
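In statsmodels this is a one-liner; the sketch fixes \(\alpha = 0.3\) for illustration (omit smoothing_level to have it estimated by MLE):

```python
import numpy as np
from statsmodels.tsa.holtwinters import SimpleExpSmoothing

y = np.array([112., 118., 132., 129., 121., 135., 148., 148., 136., 119.])  # toy data

fit = SimpleExpSmoothing(y).fit(smoothing_level=0.3, optimized=False)
print(fit.forecast(3))  # SES produces a flat forecast at the last smoothed level
```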
Double Exponential Smoothing (Holt's Method)
Adds a trend component:
\(\ell_t = \alpha y_t + (1 - \alpha)(\ell_{t-1} + b_{t-1})\)
\(b_t = \beta(\ell_t - \ell_{t-1}) + (1 - \beta) b_{t-1}\)
\(\hat{y}_{t+h} = \ell_t + h b_t\)
where \(\ell_t\) is the level component and \(b_t\) is the trend component.
Holt-Winters Method
Further incorporates a seasonal component \(S_t\), with both additive and multiplicative variants:
| Method | Use case | Seasonality behavior |
|---|---|---|
| Holt-Winters additive | Constant seasonal amplitude | \(S_t\) is added to the forecast |
| Holt-Winters multiplicative | Seasonal amplitude grows with trend | \(S_t\) multiplies the forecast |
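Both variants are exposed through statsmodels' ExponentialSmoothing; a sketch assuming monthly data with period 12 and a seasonal amplitude that grows with the trend (hence the multiplicative choice):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Illustrative series whose seasonal swings grow with the level
idx = pd.date_range("2015-01-01", periods=96, freq="MS")
t = np.arange(96)
y = pd.Series((100 + t) * (1 + 0.2 * np.sin(2 * np.pi * t / 12)), index=idx)

fit = ExponentialSmoothing(
    y, trend="add", seasonal="mul", seasonal_periods=12
).fit()
print(fit.forecast(12))  # forecast for the next year
```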
Prophet
Overview
Prophet is an open-source time series forecasting tool from Meta (formerly Facebook), designed specifically for business time series. It is robust to missing values, outliers, and trend changes.
Core Model
Prophet decomposes a time series into three additive components plus noise, \(y(t) = g(t) + s(t) + h(t) + \varepsilon_t\):
| Component | Description | Implementation |
|---|---|---|
| \(g(t)\): Trend | Long-term growth trend | Piecewise linear or logistic growth curve with automatic changepoint detection |
| \(s(t)\): Seasonality | Periodic patterns | Fourier series: \(s(t) = \sum_{n=1}^{N}\left(a_n \cos\frac{2\pi nt}{P} + b_n \sin\frac{2\pi nt}{P}\right)\) |
| \(h(t)\): Holidays | Holiday/special event effects | User-provided holiday list; model estimates effect sizes |
Advantages of Prophet:
- User-friendly for non-data-scientists with intuitive parameters
- Automatically handles missing data and outliers
- Allows manual addition of changepoints and holidays
- Built-in uncertainty intervals
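A minimal usage sketch; the column names ds and y are required by the library, while the CSV path and forecast horizon are placeholders:

```python
import pandas as pd
from prophet import Prophet

# Prophet expects a DataFrame with columns ds (datetime) and y (value)
df = pd.read_csv("daily_sales.csv", parse_dates=["ds"])  # placeholder path

m = Prophet()  # yearly/weekly seasonality handled automatically for daily data
m.fit(df)

future = m.make_future_dataframe(periods=90)  # extend 90 days past the history
forecast = m.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())

m.plot(forecast)             # forecast with uncertainty band
m.plot_components(forecast)  # trend / weekly / yearly panels
```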
Machine Learning Methods
Feature Engineering
Transforming time series into tabular data is the key to applying traditional ML models:
| Feature type | Example | Description |
|---|---|---|
| Sliding window (lag features) | \(y_{t-1}, y_{t-2}, \dots, y_{t-k}\) | Values from the past \(k\) time steps |
| Rolling statistics | Moving average, rolling standard deviation | Captures local trends and volatility |
| Date/time features | Month, day of week, hour, is_holiday | Encodes temporal periodicity |
| Difference features | \(y_t - y_{t-1}\), \(y_t - y_{t-7}\) | Captures changes |
| Fourier features | \(\sin(2\pi t / P)\), \(\cos(2\pi t / P)\) | Encodes seasonality |
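A pandas sketch covering most of these feature types; the lag and window choices are arbitrary, and everything is shifted so that the row for time \(t\) only uses information available before \(t\):

```python
import numpy as np
import pandas as pd

def make_features(df: pd.DataFrame, col: str = "y") -> pd.DataFrame:
    """Turn a univariate series (datetime index) into a tabular feature matrix."""
    out = df.copy()
    for lag in (1, 2, 7, 14):                      # lag features
        out[f"lag_{lag}"] = out[col].shift(lag)
    past = out[col].shift(1)                       # only past values -> no leakage
    out["roll_mean_7"] = past.rolling(7).mean()    # rolling statistics
    out["roll_std_7"] = past.rolling(7).std()
    out["diff_1"] = past.diff(1)                   # difference features
    out["diff_7"] = past.diff(7)
    out["month"] = out.index.month                 # date/time features
    out["dayofweek"] = out.index.dayofweek
    t = np.arange(len(out))                        # Fourier features, weekly period
    out["sin_7"] = np.sin(2 * np.pi * t / 7)
    out["cos_7"] = np.cos(2 * np.pi * t / 7)
    return out.dropna()

idx = pd.date_range("2024-01-01", periods=100, freq="D")
df = pd.DataFrame({"y": np.random.default_rng(0).normal(size=100).cumsum()}, index=idx)
print(make_features(df).head())
```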
XGBoost / LightGBM for Time Series
Gradient boosted trees perform exceptionally well in time series competitions:
- Advantages: No stationarity assumption required, automatically handles nonlinearity, can incorporate external features
- Caveats: Must use time-ordered cross-validation (no random splitting) to avoid data leakage
- Multi-step forecasting: Recursive forecasting (predict step-by-step, feeding predictions as next inputs) or direct multi-output
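A sketch of time-ordered cross-validation with LightGBM; the synthetic features stand in for the output of a feature pipeline like make_features above, and the hyperparameters are illustrative:

```python
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 5)), columns=[f"f{i}" for i in range(5)])
y = pd.Series(2.0 * X["f0"] + rng.normal(size=500))

scores = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    # Every fold trains strictly on the past and tests on the future
    model = lgb.LGBMRegressor(n_estimators=200, learning_rate=0.05)
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    pred = model.predict(X.iloc[test_idx])
    scores.append(mean_absolute_error(y.iloc[test_idx], pred))
print(np.round(scores, 3))
```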
Deep Learning Methods
LSTM for Time Series
LSTM (Long Short-Term Memory) is naturally suited for sequence modeling:
- Encoder-decoder architecture for multi-step forecasting
- Can handle multivariate time series
- Drawbacks: slow training, sensitive to hyperparameters
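A minimal PyTorch sketch of a windowed LSTM forecaster; the window length, hidden size, and one-step horizon are illustrative choices:

```python
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    """Map a window of past values to an h-step forecast (sketch)."""
    def __init__(self, n_features: int = 1, hidden: int = 64, horizon: int = 1):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, horizon)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])  # forecast from the last hidden state

model = LSTMForecaster()
x = torch.randn(32, 48, 1)   # 32 windows of 48 past observations each
print(model(x).shape)        # torch.Size([32, 1])
```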
Temporal Fusion Transformer (TFT)
Google's TFT (2021) combines several advanced techniques:
- Variable selection network: Automatically identifies important features
- Temporal attention: Captures both short- and long-term dependencies
- Interpretability: Provides feature importance scores and temporal attention weights
- Achieved SOTA on multiple benchmark datasets
PatchTST
Nie et al. (2023) proposed segmenting time series into patches (similar to how ViT processes images):
- Splits long sequences into fixed-length patches
- Each patch serves as a token input to the Transformer
- Dramatically reduces computational complexity while preserving long-range dependencies
- Channel-independence strategy improves multivariate forecasting
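The patching idea itself is a tensor reshape; a sketch with hypothetical patch length 16 and stride 8 on a length-96 series:

```python
import torch

batch, length = 32, 96
patch_len, stride = 16, 8       # illustrative hyperparameters
x = torch.randn(batch, length)  # one channel; channel-independence means each
                                # variable is patched and encoded separately
patches = x.unfold(1, patch_len, stride)  # unfold(dim, size, step)
print(patches.shape)  # torch.Size([32, 11, 16]): 11 tokens instead of 96 time steps
```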
Evaluation Methods
Common Evaluation Metrics
| Metric | Formula | Characteristics |
|---|---|---|
| MAE | \(\frac{1}{T}\sum_{t=1}^T \|y_t - \hat{y}_t\|\) | Intuitive; less sensitive to outliers than RMSE |
| RMSE | \(\sqrt{\frac{1}{T}\sum_{t=1}^T (y_t - \hat{y}_t)^2}\) | Amplifies large errors |
| MAPE | \(\frac{100\%}{T}\sum_{t=1}^T \left\|\frac{y_t - \hat{y}_t}{y_t}\right\|\) | Percentage error, but unstable when \(y_t \approx 0\) |
| sMAPE | \(\frac{200\%}{T}\sum_{t=1}^T \frac{\|y_t - \hat{y}_t\|}{\|y_t\| + \|\hat{y}_t\|}\) | Symmetric version of MAPE |
| MASE | \(\frac{\text{MAE}}{\text{MAE}_{\text{naive}}}\) | Error scaled by a naive forecast's in-sample MAE; values below 1 beat the naive baseline, suitable for cross-series comparison |
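All five metrics are a few lines of NumPy; in the MASE sketch, the seasonal period m defaults to 1, i.e., scaling by the plain naive forecast:

```python
import numpy as np

def mae(y, yhat):   return np.mean(np.abs(y - yhat))
def rmse(y, yhat):  return np.sqrt(np.mean((y - yhat) ** 2))
def mape(y, yhat):  return 100 * np.mean(np.abs((y - yhat) / y))
def smape(y, yhat): return 200 * np.mean(np.abs(y - yhat) / (np.abs(y) + np.abs(yhat)))

def mase(y, yhat, y_train, m=1):
    """Scale test MAE by the in-sample MAE of the (seasonal-)naive forecast."""
    scale = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return mae(y, yhat) / scale

y_true = np.array([100., 110., 120.])
y_pred = np.array([98., 115., 118.])
print(mae(y_true, y_pred), rmse(y_true, y_pred), smape(y_true, y_pred))
```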
Backtesting
Time series evaluation must respect temporal ordering:
- Rolling window validation: A fixed-size window slides forward; the model is retrained and evaluated at each step
- Expanding window validation: The training set progressively grows while the prediction window moves forward
- No future data leakage: Strictly ensure all training data precedes the prediction time point
Expanding window validation illustration:
Fold 1: [=====Train=====][Test]
Fold 2: [======Train======][Test]
Fold 3: [=======Train=======][Test]
Fold 4: [========Train========][Test]
→ Time direction
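scikit-learn's TimeSeriesSplit implements exactly this expanding-window scheme; a sketch with 20 stand-in observations:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

t = np.arange(20)  # stand-in for 20 time-ordered observations
for fold, (tr, te) in enumerate(TimeSeriesSplit(n_splits=4).split(t), start=1):
    print(f"Fold {fold}: train={tr.min()}..{tr.max()}, test={te.min()}..{te.max()}")
# Every test index is strictly later than every training index -> no leakage
```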
Method Selection Guide
| Scenario | Recommended method | Rationale |
|---|---|---|
| Small data, univariate | ARIMA / Exponential smoothing | Few parameters, less prone to overfitting |
| Business forecasting (with seasonality/holidays) | Prophet | Easy to use, interpretable |
| Rich external features | XGBoost / LightGBM | Strong feature integration capability |
| Long sequences, multivariate, large data | Transformer-based (TFT/PatchTST) | Powerful modeling capacity |
| Uncertainty estimation needed | GP / Bayesian methods / Prophet | Built-in uncertainty quantification |