High-performance particle filter in C with Intel MKL acceleration. 41× faster than FilterPy (NumPy golden standard). Features self-calibration via ESS-driven adaptation, online parameter re-estimation via PMMH, and automatic CPU profiling for Intel hybrid architectures (P-core/E-core optimization).
Tracks a 2D latent state [price, log_volatility] using Sequential Monte Carlo methods.
Tested on Intel i9-14900KF (8 P-cores, 16 threads), 4000 particles, 1000 ticks:
| Library | Latency (μs) | Throughput | vs pf2d |
|---|---|---|---|
| pf2d (batch) | 17.2 | 58,188/sec | baseline |
| pf2d (loop) | 17.9 | 55,751/sec | 1.0× slower |
| particles (Chopin) | 211.4 | 4,730/sec | 12.3× slower |
| FilterPy (NumPy) | 710.9 | 1,407/sec | 41.4× slower |
→ pf2d is 41× faster than FilterPy and 12× faster than the academic reference implementation.
Note: mesured with benchmark_comparison.py in python folder
| Feature | EKF/UKF | PF2D |
|---|---|---|
| Distribution assumption | Gaussian only | Any (particles) |
| Nonlinearity handling | Linearization/sigma points | Exact (Monte Carlo) |
| Heavy tails | ✗ | ✓ |
| Multimodal states | ✗ | ✓ (multiple regimes) |
| Self-calibration | ✗ | ✓ (ESS-driven σ_vol scaling) |
| Online parameter updates | Manual/offline | ✓ (PMMH) |
PF2D is essentially a better UKF that:
- Makes no Gaussian assumptions
- Handles regime switching natively
- Self-calibrates via adaptive layer
- Re-estimates parameters online via PMMH
The stochastic volatility model has parameters that can become stale:
log_vol[t+1] = (1-θ) × log_vol[t] + θ × μ_v + σ_v × ε
| Parameter | Symbol | What it controls |
|---|---|---|
mu_vol |
μ_v | Long-term mean volatility level |
sigma_vol |
σ_v | How erratically volatility moves |
drift |
— | Expected return per tick |
theta_vol |
θ | Mean-reversion speed (fixed) |
The problem: Parameters calibrated in one regime fail in another.
Solution: PMMH re-estimates parameters when conditions change.
| Approach | Problem |
|---|---|
| Offline batch MLE | Too slow, stale by completion |
| Online gradient descent | Gets stuck in local optima |
| Grid search | Curse of dimensionality |
| PMMH | ✓ Handles intractable likelihood, ✓ Bayesian posterior, ✓ Fast enough |
PMMH (Particle Marginal Metropolis-Hastings) uses the particle filter itself to estimate the likelihood of different parameter values, then uses MCMC to find the posterior distribution.
We optimized PMMH from 3.3x to 7.6x speedup over scalar baseline:
| Optimization | Time | Speedup | μ_v Error |
|---|---|---|---|
| Scalar baseline | 1853 ms | 1x | 0.086 |
| MKL VSL/VML | 984 ms | 1.9x | 0.086 |
| + Division elimination | ~900 ms | — | — |
| + Adaptive ESS=0.5 | — | — | 0.878 ❌ |
| + Adaptive ESS=0.25 | — | — | 0.481 ❌ |
| + Adaptive ESS=0.1 | — | — | 0.243 |
| + Adaptive ESS=0.05 | 984 ms | 6.1x | 0.019 ✓ |
| + Float precision | 812 ms | 7.6x | 0.019 ✓ |
Hot loop had expensive division:
// Before: 20-30 cycles per division
double vol = exp(log_vol);
double inv_vol = 1.0 / vol; // SLOW
double z = ret * inv_vol;
// After: vectorized, no division
noise[i] = -2.0 * log_vol[i];
vdExp(np, noise, weights_exp); // inv_var = exp(-2*log_vol)
double log_w = -0.5 * ret_sq * weights_exp[i] - log_vol[i];Resampling is expensive (builds CDF, random gather). Only do it when necessary:
double ess = (sum_w * sum_w) / sum_w_sq; // Effective Sample Size
if (ess < N * 0.05) {
// Resample (expensive)
} else {
// Pointer swap (free!)
}Tuning the threshold was critical:
| ESS Threshold | Error | Why |
|---|---|---|
| 0.5 | 0.878 | Too aggressive, skips needed resamples |
| 0.25 | 0.481 | Still too aggressive |
| 0.1 | 0.243 | Getting better |
| 0.05 | 0.019 | Optimal — resamples when truly needed |
Single precision doubles SIMD throughput (8 floats vs 4 doubles per AVX register):
#ifndef PMMH_USE_DOUBLE
#define PMMH_USE_FLOAT
#endif
#ifdef PMMH_USE_FLOAT
typedef float pmmh_real;
#define pmmh_vExp vsExp
#define pmmh_RngGaussian vsRngGaussian
#else
typedef double pmmh_real;
#define pmmh_vExp vdExp
#define pmmh_RngGaussian vdRngGaussian
#endifResult: No accuracy loss, 23% faster.
All particle arrays in single contiguous block:
// Single malloc, single TLB entry
char *arena = mkl_malloc(total_size, 64);
s->log_vol = (pmmh_real*)(arena + 0 * r_size);
s->log_vol_new = (pmmh_real*)(arena + 1 * r_size);
s->weights = (pmmh_real*)(arena + 2 * r_size);
// ... etcBenefits:
- Single TLB entry for all arrays
- Predictable memory layout for prefetcher
- Cache-line aligned (prevents false sharing)
Resampling gathers particles by random ancestor index:
for (int i = 0; i < np; i++) {
// Prefetch future random access
if (i + 8 < np) {
_mm_prefetch(&log_vol_new[ancestors[i + 8]], _MM_HINT_T0);
}
log_vol_curr[i] = log_vol_new[ancestors[i]];
}#define PMMH_N_PARTICLES 256
#define PMMH_N_ITERATIONS 300
#define PMMH_N_BURNIN 100
#define PMMH_ESS_THRESHOLD 0.05 // Tuned for accuracy
#define PMMH_USE_FLOAT // 2x SIMD throughputFinal performance: 812ms for 16 parallel chains, 7.6x speedup, 0.019 μ_v error.
Between parameter updates, the particle filter may drift. The adaptive layer provides continuous self-tuning.
ESS persistently low → σ_vol_scale *= 1.02 (widen exploration)
ESS persistently high → σ_vol_scale *= 0.99 (tighten)
ESS healthy → σ_vol_scale → 1.0 (decay to baseline)Why: Low ESS means particles aren't tracking well. Widening σ_vol helps particles explore more of the state space.
vol_short_ema / vol_long_ema > 1.5 → Enter high-vol mode
vol_short_ema / vol_long_ema < 1.2 → Exit high-vol mode (hysteresis)High-vol mode adjusts:
- Lower resample threshold (more aggressive resampling)
- Wider kernel bandwidths
// When PMMH completes: reset scaling to prevent double-adaptation
void pf2d_adaptive_reset_after_pmcmc(PF2D *pf) {
pf->adaptive.sigma_vol_scale = 1.0;
pf->adaptive.low_ess_streak = 0;
pf->adaptive.high_ess_streak = 0;
}Why reset after PMMH? New parameters will naturally improve tracking. If we keep the scaled σ_vol, we over-compensate.
| Version | Mean | P50 | Throughput | Change |
|---|---|---|---|---|
| Baseline | 73.7 μs | 92.1 μs | 13,574/sec | — |
| MKL vectorized | 37.8 μs | 23.1 μs | 26,468/sec | +95% |
| Fused parallel regions | 35.7 μs | 20.7 μs | 27,995/sec | +6% |
| ICDF RNG method | 33.4 μs | 19.1 μs | 29,988/sec | +7% |
| P-core affinity | 32.2 μs | 17.1 μs | 31,057/sec | +4% |
| Block processing | 31.0 μs | 16.0 μs | 32,922/sec | +6% |
| Single-pass variance | 30.0 μs | 15.6 μs | 33,214/sec | +1% |
| Float precision | 28.0 μs | 12.6 μs | 36,198/sec | +9% |
Eliminated 3 OpenMP barrier syncs by fusing RNG + physics + weights into single parallel region.
Savings: ~6 μs/tick
| Method | Operations | Performance |
|---|---|---|
| Box-Muller | log + sqrt + sin + cos | Baseline |
| ICDF | erfinv polynomial | +7% faster |
O(1) lookup instead of O(R) search:
int lut_idx = (int)(u * 1023); // 1024-entry LUT
int regime = regime_lut[lut_idx];Intel 12th-14th gen CPUs have P-cores (fast) and E-cores (slow). Default MKL uses all cores, causing E-core bottleneck.
pf2d_mkl_config_14900kf(); // 16 P-core threads onlyImpact: +30-50% performance.
x_t = [price_t, log_vol_t]
price_t = price_{t-1} + drift_r + exp(log_vol_{t-1}) × ε₁
log_vol_t = (1 - θ_r) × log_vol_{t-1} + θ_r × μ_r + σ_r × ε₂
where ε₁, ε₂ ~ N(0, 1)
| Parameter | Symbol | Description | Typical Values |
|---|---|---|---|
drift |
— | Price drift per tick | 0.0 to 0.001 |
theta_vol |
θ | Mean-reversion speed | 0.02 to 0.20 |
mu_vol |
μ | Long-run log-volatility | -4.6 (≈1% vol) |
sigma_vol |
σ | Vol-of-vol | 0.03 to 0.20 |
rho |
ρ | Price-vol correlation | -0.5 to 0.0 |
#include "particle_filter_2d.h"
#include "pf2d_adaptive.h"
#include "pmmh_mkl.h"
int main() {
// Configure for Intel hybrid CPU
pf2d_mkl_config_14900kf(0);
// Create filter: 4000 particles, 4 regimes
PF2D* pf = pf2d_create(4000, 4);
// Configure regimes
pf2d_set_regime_params(pf, 0, 0.001, 0.02, -4.6, 0.05, 0.0);
pf2d_set_regime_params(pf, 1, 0.000, 0.05, -4.8, 0.03, 0.0);
pf2d_set_regime_params(pf, 2, 0.000, 0.10, -3.5, 0.10, 0.0);
pf2d_set_regime_params(pf, 3, 0.000, 0.20, -3.0, 0.20, 0.0);
// Initialize
pf2d_initialize(pf, 100.0, 0.01, -4.6, 0.5);
pf2d_adaptive_init(pf);
pf2d_warmup(pf);
// Main loop
for (int t = 0; t < n_ticks; t++) {
PF2DOutput out = pf2d_update(pf, observations[t], ®ime_probs);
pf2d_adaptive_tick(pf, &out);
// Use out.price_mean, out.vol_mean, out.regime_probs, etc.
}
pf2d_destroy(pf);
}#include "pmmh_mkl.h"
// When you need to re-estimate parameters:
PMMHPrior prior = {
.mean = {0.0005, -3.5, 0.15}, // drift, mu_vol, sigma_vol
.std = {0.001, 1.5, 0.15}
};
PMMHResult result = pmmh_run_parallel(
returns, n_obs, // observation window
&prior,
256, // particles
300, // iterations
100, // burnin
16, // chains
0.02 // theta_vol (fixed)
);
// Apply new parameters to PF2D
pf2d_set_regime_params(pf, regime,
result.posterior_mean.drift,
0.02, // theta fixed
result.posterior_mean.mu_vol,
result.posterior_mean.sigma_vol,
0.0);
pf2d_adaptive_reset_after_pmcmc(pf);
printf("PMMH: μ_v=%.3f±%.3f, σ_v=%.3f±%.3f, accept=%.1f%%\n",
result.posterior_mean.mu_vol, result.posterior_std.mu_vol,
result.posterior_mean.sigma_vol, result.posterior_std.sigma_vol,
result.acceptance_rate * 100);from pf2d import ParticleFilter2D, create_default_filter
import numpy as np
pf = create_default_filter(n_particles=4000)
pf.initialize(price0=100.0, log_vol0=np.log(0.01))
pf.warmup()
# Batch processing
price_est, vol_est, ess = pf.run(observations)
# Streaming
for obs in observation_stream:
result = pf.update(obs)mkdir build && cd build
cmake .. -G "Visual Studio 17 2022" -A x64
cmake --build . --config Releasemkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j- Intel MKL 2024+
- OpenMP
- C11 compiler
- CMake 3.16+
├── particle_filter_2d.h # Core PF2D header
├── particle_filter_2d.c # Core implementation
├── pf2d_adaptive.h # Self-calibration header
├── pf2d_adaptive.c # Self-calibration implementation
├── pmmh_mkl.h # PMMH with MKL optimizations
├── python/
│ └── pf2d.py # Python bindings
- Andrieu, Doucet, Holenstein (2010). Particle Markov chain Monte Carlo methods.
- Gordon, Salmond, Smith (1993). Novel approach to nonlinear/non-Gaussian Bayesian state estimation.
- Doucet, Johansen (2009). A tutorial on particle filtering and smoothing.
GPL-3.0 license