PF2D: Self-Calibrating Particle Filter with PMMH

High-performance particle filter in C with Intel MKL acceleration. 41× faster than FilterPy (NumPy golden standard). Features self-calibration via ESS-driven adaptation, online parameter re-estimation via PMMH, and automatic CPU profiling for Intel hybrid architectures (P-core/E-core optimization).

Tracks a 2D latent state [price, log_volatility] using Sequential Monte Carlo methods.

Benchmark Results

Tested on Intel i9-14900KF (8 P-cores, 16 threads), 4000 particles, 1000 ticks:

Library	Latency (μs)	Throughput	vs pf2d
pf2d (batch)	17.2	58,188/sec	baseline
pf2d (loop)	17.9	55,751/sec	1.0× slower
particles (Chopin)	211.4	4,730/sec	12.3× slower
FilterPy (NumPy)	710.9	1,407/sec	41.4× slower

→ pf2d is 41× faster than FilterPy and 12× faster than the academic reference implementation.

Note: mesured with benchmark_comparison.py in python folder

Why PF2D Over Traditional Filters?

Feature	EKF/UKF	PF2D
Distribution assumption	Gaussian only	Any (particles)
Nonlinearity handling	Linearization/sigma points	Exact (Monte Carlo)
Heavy tails	✗	✓
Multimodal states	✗	✓ (multiple regimes)
Self-calibration	✗	✓ (ESS-driven σ_vol scaling)
Online parameter updates	Manual/offline	✓ (PMMH)

PF2D is essentially a better UKF that:

Makes no Gaussian assumptions
Handles regime switching natively
Self-calibrates via adaptive layer
Re-estimates parameters online via PMMH

The Parameter Drift Problem

The stochastic volatility model has parameters that can become stale:

log_vol[t+1] = (1-θ) × log_vol[t] + θ × μ_v + σ_v × ε

Parameter	Symbol	What it controls
`mu_vol`	μ_v	Long-term mean volatility level
`sigma_vol`	σ_v	How erratically volatility moves
`drift`	—	Expected return per tick
`theta_vol`	θ	Mean-reversion speed (fixed)

The problem: Parameters calibrated in one regime fail in another.

Solution: PMMH re-estimates parameters when conditions change.

PMMH: Online Parameter Re-Estimation

Why PMMH?

Approach	Problem
Offline batch MLE	Too slow, stale by completion
Online gradient descent	Gets stuck in local optima
Grid search	Curse of dimensionality
PMMH	✓ Handles intractable likelihood, ✓ Bayesian posterior, ✓ Fast enough

PMMH (Particle Marginal Metropolis-Hastings) uses the particle filter itself to estimate the likelihood of different parameter values, then uses MCMC to find the posterior distribution.

PMMH Optimization Journey

We optimized PMMH from 3.3x to 7.6x speedup over scalar baseline:

Optimization	Time	Speedup	μ_v Error
Scalar baseline	1853 ms	1x	0.086
MKL VSL/VML	984 ms	1.9x	0.086
+ Division elimination	~900 ms	—	—
+ Adaptive ESS=0.5	—	—	0.878 ❌
+ Adaptive ESS=0.25	—	—	0.481 ❌
+ Adaptive ESS=0.1	—	—	0.243
+ Adaptive ESS=0.05	984 ms	6.1x	0.019 ✓
+ Float precision	812 ms	7.6x	0.019 ✓

Key PMMH Optimizations

1. Division Elimination

Hot loop had expensive division:

// Before: 20-30 cycles per division
double vol = exp(log_vol);
double inv_vol = 1.0 / vol;  // SLOW
double z = ret * inv_vol;

// After: vectorized, no division
noise[i] = -2.0 * log_vol[i];
vdExp(np, noise, weights_exp);  // inv_var = exp(-2*log_vol)
double log_w = -0.5 * ret_sq * weights_exp[i] - log_vol[i];

2. Adaptive Resampling with ESS Threshold

Resampling is expensive (builds CDF, random gather). Only do it when necessary:

double ess = (sum_w * sum_w) / sum_w_sq;  // Effective Sample Size

if (ess < N * 0.05) {
    // Resample (expensive)
} else {
    // Pointer swap (free!)
}

Tuning the threshold was critical:

ESS Threshold	Error	Why
0.5	0.878	Too aggressive, skips needed resamples
0.25	0.481	Still too aggressive
0.1	0.243	Getting better
0.05	0.019	Optimal — resamples when truly needed

3. Float Precision

Single precision doubles SIMD throughput (8 floats vs 4 doubles per AVX register):

#ifndef PMMH_USE_DOUBLE
    #define PMMH_USE_FLOAT
#endif

#ifdef PMMH_USE_FLOAT
    typedef float pmmh_real;
    #define pmmh_vExp vsExp
    #define pmmh_RngGaussian vsRngGaussian
#else
    typedef double pmmh_real;
    #define pmmh_vExp vdExp
    #define pmmh_RngGaussian vdRngGaussian
#endif

Result: No accuracy loss, 23% faster.

4. Arena Allocation

All particle arrays in single contiguous block:

// Single malloc, single TLB entry
char *arena = mkl_malloc(total_size, 64);
s->log_vol     = (pmmh_real*)(arena + 0 * r_size);
s->log_vol_new = (pmmh_real*)(arena + 1 * r_size);
s->weights     = (pmmh_real*)(arena + 2 * r_size);
// ... etc

Benefits:

Single TLB entry for all arrays
Predictable memory layout for prefetcher
Cache-line aligned (prevents false sharing)

5. Prefetching for Random Gather

Resampling gathers particles by random ancestor index:

for (int i = 0; i < np; i++) {
    // Prefetch future random access
    if (i + 8 < np) {
        _mm_prefetch(&log_vol_new[ancestors[i + 8]], _MM_HINT_T0);
    }
    log_vol_curr[i] = log_vol_new[ancestors[i]];
}

PMMH Final Configuration

#define PMMH_N_PARTICLES     256
#define PMMH_N_ITERATIONS    300
#define PMMH_N_BURNIN        100
#define PMMH_ESS_THRESHOLD   0.05   // Tuned for accuracy
#define PMMH_USE_FLOAT              // 2x SIMD throughput

Final performance: 812ms for 16 parallel chains, 7.6x speedup, 0.019 μ_v error.

PF2D_ADAPTIVE: Self-Calibration Layer

Between parameter updates, the particle filter may drift. The adaptive layer provides continuous self-tuning.

Feature 1: ESS-Driven σ_vol Scaling

ESS persistently low  → σ_vol_scale *= 1.02  (widen exploration)
ESS persistently high → σ_vol_scale *= 0.99  (tighten)
ESS healthy           → σ_vol_scale → 1.0    (decay to baseline)

Why: Low ESS means particles aren't tracking well. Widening σ_vol helps particles explore more of the state space.

Feature 2: Volatility Regime Detection

vol_short_ema / vol_long_ema > 1.5 → Enter high-vol mode
vol_short_ema / vol_long_ema < 1.2 → Exit high-vol mode (hysteresis)

High-vol mode adjusts:

Lower resample threshold (more aggressive resampling)
Wider kernel bandwidths

Feature 3: PMMH Integration

// When PMMH completes: reset scaling to prevent double-adaptation
void pf2d_adaptive_reset_after_pmcmc(PF2D *pf) {
    pf->adaptive.sigma_vol_scale = 1.0;
    pf->adaptive.low_ess_streak = 0;
    pf->adaptive.high_ess_streak = 0;
}

Why reset after PMMH? New parameters will naturally improve tracking. If we keep the scaled σ_vol, we over-compensate.

PF2D Optimizations

Optimization History

Version	Mean	P50	Throughput	Change
Baseline	73.7 μs	92.1 μs	13,574/sec	—
MKL vectorized	37.8 μs	23.1 μs	26,468/sec	+95%
Fused parallel regions	35.7 μs	20.7 μs	27,995/sec	+6%
ICDF RNG method	33.4 μs	19.1 μs	29,988/sec	+7%
P-core affinity	32.2 μs	17.1 μs	31,057/sec	+4%
Block processing	31.0 μs	16.0 μs	32,922/sec	+6%
Single-pass variance	30.0 μs	15.6 μs	33,214/sec	+1%
Float precision	28.0 μs	12.6 μs	36,198/sec	+9%

Key Optimizations

Fused Parallel Regions

Eliminated 3 OpenMP barrier syncs by fusing RNG + physics + weights into single parallel region.

Savings: ~6 μs/tick

ICDF vs Box-Muller RNG

Method	Operations	Performance
Box-Muller	log + sqrt + sin + cos	Baseline
ICDF	erfinv polynomial	+7% faster

LUT Regime Sampling

O(1) lookup instead of O(R) search:

int lut_idx = (int)(u * 1023);  // 1024-entry LUT
int regime = regime_lut[lut_idx];

Intel Hybrid CPU Configuration

Intel 12th-14th gen CPUs have P-cores (fast) and E-cores (slow). Default MKL uses all cores, causing E-core bottleneck.

pf2d_mkl_config_14900kf();  // 16 P-core threads only

Impact: +30-50% performance.

State Space Model

State Vector

x_t = [price_t, log_vol_t]

Dynamics (per regime r)

price_t   = price_{t-1} + drift_r + exp(log_vol_{t-1}) × ε₁
log_vol_t = (1 - θ_r) × log_vol_{t-1} + θ_r × μ_r + σ_r × ε₂

where ε₁, ε₂ ~ N(0, 1)

Regime Parameters

Parameter	Symbol	Description	Typical Values
`drift`	—	Price drift per tick	0.0 to 0.001
`theta_vol`	θ	Mean-reversion speed	0.02 to 0.20
`mu_vol`	μ	Long-run log-volatility	-4.6 (≈1% vol)
`sigma_vol`	σ	Vol-of-vol	0.03 to 0.20
`rho`	ρ	Price-vol correlation	-0.5 to 0.0

Quick Start

C API

#include "particle_filter_2d.h"
#include "pf2d_adaptive.h"
#include "pmmh_mkl.h"

int main() {
    // Configure for Intel hybrid CPU
    pf2d_mkl_config_14900kf(0);
    
    // Create filter: 4000 particles, 4 regimes
    PF2D* pf = pf2d_create(4000, 4);
    
    // Configure regimes
    pf2d_set_regime_params(pf, 0, 0.001, 0.02, -4.6, 0.05, 0.0);
    pf2d_set_regime_params(pf, 1, 0.000, 0.05, -4.8, 0.03, 0.0);
    pf2d_set_regime_params(pf, 2, 0.000, 0.10, -3.5, 0.10, 0.0);
    pf2d_set_regime_params(pf, 3, 0.000, 0.20, -3.0, 0.20, 0.0);
    
    // Initialize
    pf2d_initialize(pf, 100.0, 0.01, -4.6, 0.5);
    pf2d_adaptive_init(pf);
    pf2d_warmup(pf);
    
    // Main loop
    for (int t = 0; t < n_ticks; t++) {
        PF2DOutput out = pf2d_update(pf, observations[t], &regime_probs);
        pf2d_adaptive_tick(pf, &out);
        
        // Use out.price_mean, out.vol_mean, out.regime_probs, etc.
    }
    
    pf2d_destroy(pf);
}

Running PMMH for Parameter Re-estimation

#include "pmmh_mkl.h"

// When you need to re-estimate parameters:
PMMHPrior prior = {
    .mean = {0.0005, -3.5, 0.15},  // drift, mu_vol, sigma_vol
    .std  = {0.001,  1.5,  0.15}
};

PMMHResult result = pmmh_run_parallel(
    returns, n_obs,           // observation window
    &prior,
    256,                      // particles
    300,                      // iterations  
    100,                      // burnin
    16,                       // chains
    0.02                      // theta_vol (fixed)
);

// Apply new parameters to PF2D
pf2d_set_regime_params(pf, regime, 
    result.posterior_mean.drift,
    0.02,  // theta fixed
    result.posterior_mean.mu_vol,
    result.posterior_mean.sigma_vol,
    0.0);

pf2d_adaptive_reset_after_pmcmc(pf);

printf("PMMH: μ_v=%.3f±%.3f, σ_v=%.3f±%.3f, accept=%.1f%%\n",
       result.posterior_mean.mu_vol, result.posterior_std.mu_vol,
       result.posterior_mean.sigma_vol, result.posterior_std.sigma_vol,
       result.acceptance_rate * 100);

Python API

from pf2d import ParticleFilter2D, create_default_filter
import numpy as np

pf = create_default_filter(n_particles=4000)
pf.initialize(price0=100.0, log_vol0=np.log(0.01))
pf.warmup()

# Batch processing
price_est, vol_est, ess = pf.run(observations)

# Streaming
for obs in observation_stream:
    result = pf.update(obs)

Build

Windows (Visual Studio)

mkdir build && cd build
cmake .. -G "Visual Studio 17 2022" -A x64
cmake --build . --config Release

Linux

mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j

Requirements

Intel MKL 2024+
OpenMP
C11 compiler
CMake 3.16+

Files

├── particle_filter_2d.h      # Core PF2D header
├── particle_filter_2d.c      # Core implementation
├── pf2d_adaptive.h           # Self-calibration header
├── pf2d_adaptive.c           # Self-calibration implementation
├── pmmh_mkl.h                # PMMH with MKL optimizations
├── python/
│   └── pf2d.py               # Python bindings

References

Andrieu, Doucet, Holenstein (2010). Particle Markov chain Monte Carlo methods.
Gordon, Salmond, Smith (1993). Novel approach to nonlinear/non-Gaussian Bayesian state estimation.
Doucet, Johansen (2009). A tutorial on particle filtering and smoothing.

License

GPL-3.0 license

Name		Name	Last commit message	Last commit date
Latest commit History 69 Commits
1D		1D
2D		2D
3-RBPF		3-RBPF
kelly_integration		kelly_integration
python		python
test		test
.gitattributes		.gitattributes
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
README.md		README.md
build_and_deploy.bat		build_and_deploy.bat

License

Tugbars/Particle-Filter-2D

Folders and files

Latest commit

History

Repository files navigation

PF2D: Self-Calibrating Particle Filter with PMMH

Benchmark Results

Why PF2D Over Traditional Filters?

The Parameter Drift Problem

PMMH: Online Parameter Re-Estimation

Why PMMH?

PMMH Optimization Journey

Key PMMH Optimizations

1. Division Elimination

2. Adaptive Resampling with ESS Threshold

3. Float Precision

4. Arena Allocation

5. Prefetching for Random Gather

PMMH Final Configuration

PF2D_ADAPTIVE: Self-Calibration Layer

Feature 1: ESS-Driven σ_vol Scaling

Feature 2: Volatility Regime Detection

Feature 3: PMMH Integration

PF2D Optimizations

Optimization History

Key Optimizations

Fused Parallel Regions

ICDF vs Box-Muller RNG

LUT Regime Sampling

Intel Hybrid CPU Configuration

State Space Model

State Vector

Dynamics (per regime r)

Regime Parameters

Quick Start

C API

Running PMMH for Parameter Re-estimation

Python API

Build

Windows (Visual Studio)

Linux

Requirements

Files

References

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages