Skip to content

High-performance 2D particle filter for stochastic volatility estimation. Intel MKL-accelerated, AVX2-optimized, ~36k ticks/sec at N=4000.

License

Notifications You must be signed in to change notification settings

Tugbars/Particle-Filter-2D

Repository files navigation

PF2D: Self-Calibrating Particle Filter with PMMH

High-performance particle filter in C with Intel MKL acceleration. 41× faster than FilterPy (NumPy golden standard). Features self-calibration via ESS-driven adaptation, online parameter re-estimation via PMMH, and automatic CPU profiling for Intel hybrid architectures (P-core/E-core optimization).

Tracks a 2D latent state [price, log_volatility] using Sequential Monte Carlo methods.

Benchmark Results

Tested on Intel i9-14900KF (8 P-cores, 16 threads), 4000 particles, 1000 ticks:

Library Latency (μs) Throughput vs pf2d
pf2d (batch) 17.2 58,188/sec baseline
pf2d (loop) 17.9 55,751/sec 1.0× slower
particles (Chopin) 211.4 4,730/sec 12.3× slower
FilterPy (NumPy) 710.9 1,407/sec 41.4× slower

→ pf2d is 41× faster than FilterPy and 12× faster than the academic reference implementation.

Note: mesured with benchmark_comparison.py in python folder


Why PF2D Over Traditional Filters?

Feature EKF/UKF PF2D
Distribution assumption Gaussian only Any (particles)
Nonlinearity handling Linearization/sigma points Exact (Monte Carlo)
Heavy tails
Multimodal states ✓ (multiple regimes)
Self-calibration ✓ (ESS-driven σ_vol scaling)
Online parameter updates Manual/offline ✓ (PMMH)

PF2D is essentially a better UKF that:

  • Makes no Gaussian assumptions
  • Handles regime switching natively
  • Self-calibrates via adaptive layer
  • Re-estimates parameters online via PMMH

The Parameter Drift Problem

The stochastic volatility model has parameters that can become stale:

log_vol[t+1] = (1-θ) × log_vol[t] + θ × μ_v + σ_v × ε
Parameter Symbol What it controls
mu_vol μ_v Long-term mean volatility level
sigma_vol σ_v How erratically volatility moves
drift Expected return per tick
theta_vol θ Mean-reversion speed (fixed)

The problem: Parameters calibrated in one regime fail in another.

Solution: PMMH re-estimates parameters when conditions change.


PMMH: Online Parameter Re-Estimation

Why PMMH?

Approach Problem
Offline batch MLE Too slow, stale by completion
Online gradient descent Gets stuck in local optima
Grid search Curse of dimensionality
PMMH ✓ Handles intractable likelihood, ✓ Bayesian posterior, ✓ Fast enough

PMMH (Particle Marginal Metropolis-Hastings) uses the particle filter itself to estimate the likelihood of different parameter values, then uses MCMC to find the posterior distribution.

PMMH Optimization Journey

We optimized PMMH from 3.3x to 7.6x speedup over scalar baseline:

Optimization Time Speedup μ_v Error
Scalar baseline 1853 ms 1x 0.086
MKL VSL/VML 984 ms 1.9x 0.086
+ Division elimination ~900 ms
+ Adaptive ESS=0.5 0.878 ❌
+ Adaptive ESS=0.25 0.481 ❌
+ Adaptive ESS=0.1 0.243
+ Adaptive ESS=0.05 984 ms 6.1x 0.019
+ Float precision 812 ms 7.6x 0.019

Key PMMH Optimizations

1. Division Elimination

Hot loop had expensive division:

// Before: 20-30 cycles per division
double vol = exp(log_vol);
double inv_vol = 1.0 / vol;  // SLOW
double z = ret * inv_vol;

// After: vectorized, no division
noise[i] = -2.0 * log_vol[i];
vdExp(np, noise, weights_exp);  // inv_var = exp(-2*log_vol)
double log_w = -0.5 * ret_sq * weights_exp[i] - log_vol[i];

2. Adaptive Resampling with ESS Threshold

Resampling is expensive (builds CDF, random gather). Only do it when necessary:

double ess = (sum_w * sum_w) / sum_w_sq;  // Effective Sample Size

if (ess < N * 0.05) {
    // Resample (expensive)
} else {
    // Pointer swap (free!)
}

Tuning the threshold was critical:

ESS Threshold Error Why
0.5 0.878 Too aggressive, skips needed resamples
0.25 0.481 Still too aggressive
0.1 0.243 Getting better
0.05 0.019 Optimal — resamples when truly needed

3. Float Precision

Single precision doubles SIMD throughput (8 floats vs 4 doubles per AVX register):

#ifndef PMMH_USE_DOUBLE
    #define PMMH_USE_FLOAT
#endif

#ifdef PMMH_USE_FLOAT
    typedef float pmmh_real;
    #define pmmh_vExp vsExp
    #define pmmh_RngGaussian vsRngGaussian
#else
    typedef double pmmh_real;
    #define pmmh_vExp vdExp
    #define pmmh_RngGaussian vdRngGaussian
#endif

Result: No accuracy loss, 23% faster.

4. Arena Allocation

All particle arrays in single contiguous block:

// Single malloc, single TLB entry
char *arena = mkl_malloc(total_size, 64);
s->log_vol     = (pmmh_real*)(arena + 0 * r_size);
s->log_vol_new = (pmmh_real*)(arena + 1 * r_size);
s->weights     = (pmmh_real*)(arena + 2 * r_size);
// ... etc

Benefits:

  • Single TLB entry for all arrays
  • Predictable memory layout for prefetcher
  • Cache-line aligned (prevents false sharing)

5. Prefetching for Random Gather

Resampling gathers particles by random ancestor index:

for (int i = 0; i < np; i++) {
    // Prefetch future random access
    if (i + 8 < np) {
        _mm_prefetch(&log_vol_new[ancestors[i + 8]], _MM_HINT_T0);
    }
    log_vol_curr[i] = log_vol_new[ancestors[i]];
}

PMMH Final Configuration

#define PMMH_N_PARTICLES     256
#define PMMH_N_ITERATIONS    300
#define PMMH_N_BURNIN        100
#define PMMH_ESS_THRESHOLD   0.05   // Tuned for accuracy
#define PMMH_USE_FLOAT              // 2x SIMD throughput

Final performance: 812ms for 16 parallel chains, 7.6x speedup, 0.019 μ_v error.


PF2D_ADAPTIVE: Self-Calibration Layer

Between parameter updates, the particle filter may drift. The adaptive layer provides continuous self-tuning.

Feature 1: ESS-Driven σ_vol Scaling

ESS persistently lowσ_vol_scale *= 1.02  (widen exploration)
ESS persistently highσ_vol_scale *= 0.99  (tighten)
ESS healthyσ_vol_scale1.0    (decay to baseline)

Why: Low ESS means particles aren't tracking well. Widening σ_vol helps particles explore more of the state space.

Feature 2: Volatility Regime Detection

vol_short_ema / vol_long_ema > 1.5Enter high-vol mode
vol_short_ema / vol_long_ema < 1.2Exit high-vol mode (hysteresis)

High-vol mode adjusts:

  • Lower resample threshold (more aggressive resampling)
  • Wider kernel bandwidths

Feature 3: PMMH Integration

// When PMMH completes: reset scaling to prevent double-adaptation
void pf2d_adaptive_reset_after_pmcmc(PF2D *pf) {
    pf->adaptive.sigma_vol_scale = 1.0;
    pf->adaptive.low_ess_streak = 0;
    pf->adaptive.high_ess_streak = 0;
}

Why reset after PMMH? New parameters will naturally improve tracking. If we keep the scaled σ_vol, we over-compensate.


PF2D Optimizations

Optimization History

Version Mean P50 Throughput Change
Baseline 73.7 μs 92.1 μs 13,574/sec
MKL vectorized 37.8 μs 23.1 μs 26,468/sec +95%
Fused parallel regions 35.7 μs 20.7 μs 27,995/sec +6%
ICDF RNG method 33.4 μs 19.1 μs 29,988/sec +7%
P-core affinity 32.2 μs 17.1 μs 31,057/sec +4%
Block processing 31.0 μs 16.0 μs 32,922/sec +6%
Single-pass variance 30.0 μs 15.6 μs 33,214/sec +1%
Float precision 28.0 μs 12.6 μs 36,198/sec +9%

Key Optimizations

Fused Parallel Regions

Eliminated 3 OpenMP barrier syncs by fusing RNG + physics + weights into single parallel region.

Savings: ~6 μs/tick

ICDF vs Box-Muller RNG

Method Operations Performance
Box-Muller log + sqrt + sin + cos Baseline
ICDF erfinv polynomial +7% faster

LUT Regime Sampling

O(1) lookup instead of O(R) search:

int lut_idx = (int)(u * 1023);  // 1024-entry LUT
int regime = regime_lut[lut_idx];

Intel Hybrid CPU Configuration

Intel 12th-14th gen CPUs have P-cores (fast) and E-cores (slow). Default MKL uses all cores, causing E-core bottleneck.

pf2d_mkl_config_14900kf();  // 16 P-core threads only

Impact: +30-50% performance.


State Space Model

State Vector

x_t = [price_t, log_vol_t]

Dynamics (per regime r)

price_t   = price_{t-1} + drift_r + exp(log_vol_{t-1}) × ε₁
log_vol_t = (1 - θ_r) × log_vol_{t-1} + θ_r × μ_r + σ_r × ε₂

where ε₁, ε₂ ~ N(0, 1)

Regime Parameters

Parameter Symbol Description Typical Values
drift Price drift per tick 0.0 to 0.001
theta_vol θ Mean-reversion speed 0.02 to 0.20
mu_vol μ Long-run log-volatility -4.6 (≈1% vol)
sigma_vol σ Vol-of-vol 0.03 to 0.20
rho ρ Price-vol correlation -0.5 to 0.0

Quick Start

C API

#include "particle_filter_2d.h"
#include "pf2d_adaptive.h"
#include "pmmh_mkl.h"

int main() {
    // Configure for Intel hybrid CPU
    pf2d_mkl_config_14900kf(0);
    
    // Create filter: 4000 particles, 4 regimes
    PF2D* pf = pf2d_create(4000, 4);
    
    // Configure regimes
    pf2d_set_regime_params(pf, 0, 0.001, 0.02, -4.6, 0.05, 0.0);
    pf2d_set_regime_params(pf, 1, 0.000, 0.05, -4.8, 0.03, 0.0);
    pf2d_set_regime_params(pf, 2, 0.000, 0.10, -3.5, 0.10, 0.0);
    pf2d_set_regime_params(pf, 3, 0.000, 0.20, -3.0, 0.20, 0.0);
    
    // Initialize
    pf2d_initialize(pf, 100.0, 0.01, -4.6, 0.5);
    pf2d_adaptive_init(pf);
    pf2d_warmup(pf);
    
    // Main loop
    for (int t = 0; t < n_ticks; t++) {
        PF2DOutput out = pf2d_update(pf, observations[t], &regime_probs);
        pf2d_adaptive_tick(pf, &out);
        
        // Use out.price_mean, out.vol_mean, out.regime_probs, etc.
    }
    
    pf2d_destroy(pf);
}

Running PMMH for Parameter Re-estimation

#include "pmmh_mkl.h"

// When you need to re-estimate parameters:
PMMHPrior prior = {
    .mean = {0.0005, -3.5, 0.15},  // drift, mu_vol, sigma_vol
    .std  = {0.001,  1.5,  0.15}
};

PMMHResult result = pmmh_run_parallel(
    returns, n_obs,           // observation window
    &prior,
    256,                      // particles
    300,                      // iterations  
    100,                      // burnin
    16,                       // chains
    0.02                      // theta_vol (fixed)
);

// Apply new parameters to PF2D
pf2d_set_regime_params(pf, regime, 
    result.posterior_mean.drift,
    0.02,  // theta fixed
    result.posterior_mean.mu_vol,
    result.posterior_mean.sigma_vol,
    0.0);

pf2d_adaptive_reset_after_pmcmc(pf);

printf("PMMH: μ_v=%.3f±%.3f, σ_v=%.3f±%.3f, accept=%.1f%%\n",
       result.posterior_mean.mu_vol, result.posterior_std.mu_vol,
       result.posterior_mean.sigma_vol, result.posterior_std.sigma_vol,
       result.acceptance_rate * 100);

Python API

from pf2d import ParticleFilter2D, create_default_filter
import numpy as np

pf = create_default_filter(n_particles=4000)
pf.initialize(price0=100.0, log_vol0=np.log(0.01))
pf.warmup()

# Batch processing
price_est, vol_est, ess = pf.run(observations)

# Streaming
for obs in observation_stream:
    result = pf.update(obs)

Build

Windows (Visual Studio)

mkdir build && cd build
cmake .. -G "Visual Studio 17 2022" -A x64
cmake --build . --config Release

Linux

mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j

Requirements

  • Intel MKL 2024+
  • OpenMP
  • C11 compiler
  • CMake 3.16+

Files

├── particle_filter_2d.h      # Core PF2D header
├── particle_filter_2d.c      # Core implementation
├── pf2d_adaptive.h           # Self-calibration header
├── pf2d_adaptive.c           # Self-calibration implementation
├── pmmh_mkl.h                # PMMH with MKL optimizations
├── python/
│   └── pf2d.py               # Python bindings

References

  • Andrieu, Doucet, Holenstein (2010). Particle Markov chain Monte Carlo methods.
  • Gordon, Salmond, Smith (1993). Novel approach to nonlinear/non-Gaussian Bayesian state estimation.
  • Doucet, Johansen (2009). A tutorial on particle filtering and smoothing.

License

GPL-3.0 license

About

High-performance 2D particle filter for stochastic volatility estimation. Intel MKL-accelerated, AVX2-optimized, ~36k ticks/sec at N=4000.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published