This repository has been archived by the owner on Jul 7, 2022. It is now read-only.

Performance boost through normal approximations #49

Merged: 40 commits, Apr 27, 2020
3c1dbd9
Start to set up python module structure
ckoerber Apr 23, 2020
03afafb
Updated requirements
ckoerber Apr 23, 2020
f2efc48
New submodules
ckoerber Apr 23, 2020
d78c52e
Added logistic functions
ckoerber Apr 23, 2020
dccd4c5
renamed utils to utilities
ckoerber Apr 23, 2020
f7c60a3
First draft of abstract model class
ckoerber Apr 23, 2020
9a8e919
Doc updates for model class
ckoerber Apr 23, 2020
af21754
Shorter notation for typevars
ckoerber Apr 23, 2020
e8e99bb
Reshuffled module to simplify for users
ckoerber Apr 23, 2020
c63a1d0
First implementation of SIR model
ckoerber Apr 23, 2020
46cfe66
Flatten fit array if shape is 1D
ckoerber Apr 23, 2020
37bc30b
Iterate over dates instead of numbers
ckoerber Apr 23, 2020
15d3ab2
First unittests for SIR model
ckoerber Apr 23, 2020
0b6697a
Added policy implementation test
ckoerber Apr 23, 2020
eb4d568
Added test for logistic function
ckoerber Apr 23, 2020
2094cec
Changed freq to days_per_step
ckoerber Apr 23, 2020
9000a02
Added unit conversion test
ckoerber Apr 23, 2020
8b0c6d3
Added SEIR model plus test
ckoerber Apr 23, 2020
77ff9b5
Added setup.py
ckoerber Apr 23, 2020
2e92b96
Added comparison commit hash
ckoerber Apr 23, 2020
57b7f8d
Started to work on presentation
ckoerber Apr 23, 2020
bf4d8b0
Fixed model parameter typos
ckoerber Apr 24, 2020
a1cf08d
Cast dates to index if possible
ckoerber Apr 24, 2020
642f90c
Fit to ensemble now allows ignoring outliers
ckoerber Apr 24, 2020
01aa005
Additional tools to plot bands
ckoerber Apr 24, 2020
f89c62c
Module reproduces error propagation
ckoerber Apr 24, 2020
c01573f
Debbugged fitting
ckoerber Apr 24, 2020
97e3537
Added z-factor back in for plotting gvars
ckoerber Apr 24, 2020
a4c854f
Copy dicts for plotting to not overwrite future styling options
ckoerber Apr 24, 2020
9e61a21
Finalized analysis
ckoerber Apr 24, 2020
a6228d5
Add readme for new module
ckoerber Apr 24, 2020
c5643f0
Added notebook which compares normal vs general Bayesian parameter in…
ckoerber Apr 24, 2020
8746984
First draft of notes
ckoerber Apr 26, 2020
e0dbe4e
Typos, phrasing & example for gvar
ckoerber Apr 27, 2020
0e97442
Included correlations in posterior
ckoerber Apr 27, 2020
9bb0928
Implemented vent and icu admits and census (not tested)
ckoerber Apr 27, 2020
f323c27
Moved admits and census computation entirely to post process and enab…
ckoerber Apr 27, 2020
e9ea031
New census implementation
ckoerber Apr 27, 2020
899d01f
Added new vent fits to notebook
ckoerber Apr 27, 2020
8193550
Added executed notebook.
ckoerber Apr 27, 2020
1 change: 1 addition & 0 deletions .gitignore
@@ -4,3 +4,4 @@ __pycache__/
.DS_Store
output/*
!output/
*.egg-info
99 changes: 99 additions & 0 deletions README-normal-extension.md
@@ -0,0 +1,99 @@
# Normal extension of Bayesian analysis

The folder `bayes_chime/normal` contains an extension of the Bayesian analysis of the SEIR model.
In this extension, it is assumed that all data and model parameters are normally distributed.
This simplifies the propagation of errors and makes it possible to compute approximate posterior distributions analytically.
In particular, propagating parameter distributions through a 200-day simulation takes about `20ms`, while estimating the posterior distribution is on the order of `1s`.
Tests indicated that, for a given parameter distribution, general Bayesian forecasts agree with their normal approximations within uncertainty.


## Install the module
You can install this module locally using pip:

```bash
pip install [--user] [-e] .
```

## How to use the module

### How to propagate uncertainties
```python
from gvar import gvar
from bayes_chime.normal.models import SEIRModel

# define fixed model parameters
xx = {"market_share": 0.26, "region_pop": 1200000, ...}

# define model parameter distributions
pp = {"incubation_days": gvar(4.6, 1.5), "recovery_days": gvar(15.3, 3.5)}

# set up the model
seir = SEIRModel()
df = seir.propagate_uncertainties(xx, pp)
```
The `gvar` variable represents a Gaussian random number characterized by its mean and standard deviation.
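Under the hood, this is first-order (linear) error propagation. As an illustration of the idea, here is a minimal plain-Python sketch for a single parameter (not part of the module and not using `gvar` itself):

```python
import math

def propagate(f, df, mean, sdev):
    """First-order error propagation for a scalar function of one
    normally distributed parameter: sigma_f = |f'(mean)| * sigma."""
    return f(mean), abs(df(mean)) * sdev

# f(p) = exp(p) with p ~ N(1.0, 0.1); the derivative is again exp(p)
f_mean, f_sdev = propagate(math.exp, math.exp, 1.0, 0.1)
print(f_mean, f_sdev)  # exp(1) and 0.1 * exp(1)
```

`gvar` performs this propagation automatically, for arbitrarily many correlated parameters, by tracking analytic derivatives through the computation.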

### How to compute posteriors
```python
... # after code above

from lsqfit import nonlinear_fit

yy = gvar(data_mean, data_sdev)  # arrays of data means and standard deviations

fit = nonlinear_fit(data=(xx, yy), prior=pp, fcn=seir.fit_fcn)

print(fit) # Fit statistics
print("Posterior:", fit.p)
```

The notebook `notebooks/How-to-use-normal-approximations-module.ipynb` explains the general usage of the module.

## Technical details

### Uncertainty propagation and posteriors

This module makes use of two libraries developed by [Peter Lepage](https://physics.cornell.edu/peter-lepage)

* [`gvar`](https://gvar.readthedocs.io) and
* [`lsqfit`](https://lsqfit.readthedocs.io)

The random numbers represented by `gvar`s act like `float`s but automatically compute analytic derivatives in successive computations.
This makes it possible to propagate errors through the whole computation.
Because normal distributions are self-conjugate, a Gaussian likelihood combined with a Gaussian prior yields a Gaussian posterior.
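The self-conjugacy statement can be made concrete in one dimension. A minimal sketch (plain Python, illustrative numbers, not the module's API) of updating a Gaussian prior with one Gaussian observation:

```python
def gaussian_posterior(mu0, sigma0, y, sigma_y):
    """Posterior of p ~ N(mu0, sigma0^2) after observing y = p + noise,
    noise ~ N(0, sigma_y^2). The posterior is again Gaussian."""
    w0 = 1.0 / sigma0 ** 2            # prior precision
    wy = 1.0 / sigma_y ** 2           # data precision
    var = 1.0 / (w0 + wy)             # posterior variance
    mean = var * (w0 * mu0 + wy * y)  # precision-weighted average
    return mean, var ** 0.5

mean, sdev = gaussian_posterior(mu0=0.0, sigma0=1.0, y=2.0, sigma_y=1.0)
print(mean, sdev)  # mean 1.0, sdev ~0.707
```

The posterior mean is the precision-weighted average of prior mean and datum; `lsqfit` generalizes this to many correlated parameters and nonlinear models.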

To determine posterior distributions, the `nonlinear_fit` function from `lsqfit` uses the saddle-point approximation for the kernel of the posterior function (expand the residuals up to second order and evaluate at the point where the first derivative vanishes).
Thus, computing the posterior effectively boils down to finding the parameters at which the first derivative of the `chi**2` vanishes.
For this it uses regular minimization routines.
More details are specified in the [appendix A of the linked publication](https://arxiv.org/pdf/1406.2279.pdf).
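The procedure can be sketched for a one-parameter toy model. This is a plain-Python illustration with finite differences standing in for `lsqfit`'s analytic derivatives, and illustrative numbers, not the module's API:

```python
import math

def chi2(p):
    # chi^2 of one datum y for the nonlinear model f(p) = exp(p),
    # plus a Gaussian prior on p (illustrative numbers)
    y, sigma_y = 1.0, 0.2
    p0, sigma_p = 0.0, 1.0
    return ((y - math.exp(p)) / sigma_y) ** 2 + ((p - p0) / sigma_p) ** 2

def minimize(f, p, h=1e-5, steps=50):
    """Newton iteration on f'(p) = 0 using central finite differences."""
    for _ in range(steps):
        d1 = (f(p + h) - f(p - h)) / (2 * h)
        d2 = (f(p + h) - 2 * f(p) + f(p - h)) / h ** 2
        p -= d1 / d2
    return p

p_bar = minimize(chi2, 0.5)

# In the quadratic approximation the posterior is normal with variance
# 2 / chi2''(p_bar), since the posterior kernel is exp(-chi2 / 2)
h = 1e-4
d2 = (chi2(p_bar + h) - 2 * chi2(p_bar) + chi2(p_bar - h)) / h ** 2
sdev = math.sqrt(2.0 / d2)
```

Here the mean of the normal posterior is the minimizer `p_bar`, and its width follows from the curvature of `chi**2` at that point.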

### Kernel functions

To utilize the above libraries, the kernel operations must be written using the syntax of `gvar`s and `lsqfit`.
Model parameters are categorized either as independent (fixed) parameters or as distribution (variable) parameters.
Variable parameters follow normal distributions; fixed parameters are plain numbers.

This module abstracts compartment models like SIR and SEIR so that new models can easily be implemented.
E.g., after inheriting from `bayes_chime.normal.models.base.CompartmentModel`, one only has to provide a `simulation_step` method and can use the existing API to run simulations:
```python
def simulation_step(data, **pars):
    susceptible = data["susceptible"]
    exposed = data["exposed"]
    infected = data["infected"]
    recovered = data["recovered"]

    infected += pars["beta"] * infected * susceptible / pars["total"]
    ...  # update the remaining compartments analogously

    return updated_data
```


## Cross checks

This module is tested against the non-Bayesian `penn_chime` module. The tests can be run from the repo root with
```bash
pytest
```
Furthermore, the `How-to-use` notebook fits posterior distributions generated with the main module and compares the propagation of uncertainties.
Empty file added bayes_chime/__init__.py
Empty file.
Empty file added bayes_chime/normal/__init__.py
Empty file.
9 changes: 9 additions & 0 deletions bayes_chime/normal/doc/.gitignore
@@ -0,0 +1,9 @@
notes.aux
notes.log
xetex_import.sty
notes.bbl
notes.blg
notes.out
notes.synctex.gz
notesNotes.bib
notes.pdf
12 changes: 12 additions & 0 deletions bayes_chime/normal/doc/notes.bib
@@ -0,0 +1,12 @@
@article{Bouchard:2014ypa,
author = "Bouchard, C.M. and Lepage, G. Peter and Monahan, Christopher and Na, Heechang and Shigemitsu, Junko",
archivePrefix = "arXiv",
doi = "10.1103/PhysRevD.90.054506",
eprint = "1406.2279",
journal = "Phys. Rev. D",
pages = "054506",
primaryClass = "hep-lat",
title = "{$B_s \to K \ell \nu$ form factors from lattice QCD}",
volume = "90",
year = "2014"
}
242 changes: 242 additions & 0 deletions bayes_chime/normal/doc/notes.tex
@@ -0,0 +1,242 @@
\batchmode
\documentclass[paper=a4, fontsize=12pt, prl, notitlepage]{revtex4-1}
\usepackage[ssp]{xetex_import}
\linespread{1.4}
\errorstopmode

\begin{document}

\title{Math behind Bayesian uncertainty propagation using normal approximations}
\author{Christopher Körber}
\affiliation{%
Department of Physics,
University of California,
Berkeley, CA 94720, USA\\
Nuclear Science Division, %
LBNL,
Berkeley, CA 94720, USA
}
\date{\today}
\begin{abstract}
This document addresses the math used to propagate errors and to infer posterior distributions (Bayesian statistics) used in \href{https://github.com/pennsignals/chime_sims/pull/49}{the performance boost through normal approximations PR in the \texttt{pennsignals/chime\_sims} repo}.
\end{abstract}

\maketitle

%==============================================================================
% Content
%==============================================================================

\section{Assumptions}
The general theme of this document is: \textit{casting everything to normal distributions simplifies computations}.

In general, the techniques in PR \#49 make two approximations
\begin{enumerate}
\item All model and data \textit{input} distributions can be sufficiently approximated with multivariate normal distributions
\item The posterior distributions can be sufficiently approximated by a multivariate normal distribution
\end{enumerate}

\section{Error propagation}
The \texttt{gvar} module is used to analytically propagate errors.
The equation used to compute the standard deviation $\sigma_f$ of a function $f(\vec p)$ at a given point in the parameter space $\bar {\vec p}$ is
\begin{equation}
\sigma_f^2(\bar {\vec p})
=
\left[
\sum_{i,j=1}^N
\frac{\partial f(\vec p)}{\partial p_i}
\left(\Sigma_p\right)_{ij}
\frac{\partial f(\vec p)}{\partial p_j}
\right]_{\vec p = \bar {\vec p}}
\, ,
\end{equation}
where $\Sigma_p$ is the covariance matrix of the normally distributed parameters $\vec p$.
In the uncorrelated case, this simplifies to
\begin{equation}
\sigma_f^2(\bar {\vec p})
=
\left[
\sum_{i=1}^N
\left(\frac{\partial f(\vec p)}{\partial p_i}\right)^2
\sigma_{p_i}^2
\right]_{\vec p = \bar {\vec p}}
\, .
\end{equation}

The module \texttt{gvar} is capable of tracking such derivatives since its implemented functions are aware of their analytical derivatives.

\subsection{Example}
For example, suppose
\begin{align}
f(x, y)
&=
x y^2
&& \Rightarrow &
\sigma_f^2(\bar{\vec p})
&=
y^4 (\Sigma_p)_{xx}
+ 2 x y^3 \left[ (\Sigma_p)_{xy} + (\Sigma_p)_{yx} \right]
+ 4 x^2 y^2 (\Sigma_p)_{yy}
\end{align}
Thus, if $x, y$ follow a multivariate normal distribution of mean $\bar{\vec p}$ and covariance $\Sigma_p$, one finds
\begin{align}
\bar{\vec p}
&=
\begin{pmatrix}
1 \\ -2
\end{pmatrix}
\, , &
\Sigma_p
&=
\begin{pmatrix}
2 & 1 \\ 1 & 1
\end{pmatrix}
&
\Rightarrow
\sigma_f(\bar{\vec p})
&=
\sqrt{32 - 32 + 16} = 4
\end{align}
The corresponding \texttt{gvar} code returns
\begin{lstlisting}[style=python]
from gvar import gvar

mean = [1, -2]
cov = [[2, 1], [1, 1]]
x, y = gvar(mean, cov)
f = x * y ** 2
print(f)

> 4.0(4.0)
\end{lstlisting}


\section{Computation of the posterior}

This section explains how the \texttt{lsqfit} module approximates the posterior distribution $P(\vec p|D,M)$ given data $D$, an input model $M$, and its corresponding priors $P(\vec p| M)$.

\subsection{Definitions}

The posterior distribution $P(\vec p|D, M)$ is proportional to the prior times the probability of the data given the model and parameters $P(D|\vec p, M)$ (the likelihood)
\begin{align}
P(\vec p|D, M) &=
\frac{P(D|\vec p, M)P(\vec p | M)}{P(D|M)}
\propto
P(D|\vec p, M)P(\vec p | M)
\, .
\end{align}
The marginal likelihood of the data $D$ given the model $M$ is obtained by integrating over the whole parameter space
\begin{equation}
P(D|M)
=
\int d \vec p P(D|\vec p, M)P(\vec p | M) \, .
\end{equation}
Because $P(D|M)$ does not depend on the parameters $\vec p$, it only sets the normalization of the posterior, so constant factors can be neglected in the computation.

The likelihood of the data given the model and parameters is a multivariate normal distribution whose exponent contains a $\chi^2$ function
\begin{equation}
P(D|\vec p, M)
\sim
\exp\left\{
- \frac{1}{2}
\sum_{i,j=1}^N
\left[y_i - \vec f_M(x_i, \vec p)\right]
\left(\Sigma_y^{-1}\right)_{ij}
\left[y_j - \vec f_M(x_j, \vec p)\right]
\right\}
=
\exp\left\{
- \frac{1}{2}
\chi^2_D(\vec p)
\right\}
\, ,
\end{equation}
where $\Sigma_y$ is the covariance matrix of the data and $f_M(x_j, \vec p)$ the model function evaluated at point $x_j$ which aims to describe the data point $y_j$.

Maximizing the likelihood as a function of $\vec p$ corresponds to minimizing the exponent, which is the standard $\chi^2$-minimization procedure.
Computing the posterior distribution function for a given prior $P(\vec p| M)$ is called Bayesian inference.
Normal approximations of the posterior distribution are somewhere in the middle of a full Bayesian treatment and regular $\chi^2$-minimization.

\subsection{Normal approximation of the posterior}
Including the multivariate normal prior distribution with mean $\vec p_0$ and covariance $\Sigma_{p_0}$, the posterior distribution is proportional to
\begin{align}
P(\vec p|D, M)
&\sim
\exp\left\{
- \frac{1}{2}
\chi^2_D(\vec p)
- \frac{1}{2}
\chi^2_M(\vec p)
\right\}
=
\exp\left\{
- \frac{1}{2}
\chi^2_{DM}(\vec p)
\right\}
\, , &
\chi^2_M(\vec p)
&=
\left[\vec p - \vec p_0\right]^T \cdot
\Sigma_{p_0}^{-1}
\left[\vec p - \vec p_0\right]\, .
\end{align}
In short, the \texttt{lsqfit} module approximates the posterior by expanding the function $\chi^2_{DM}(\vec p)$ up to second order in $\vec p$ around the point $\bar{\vec p}$ where the first derivative vanishes (a stationary, almost always minimal, point)
\begin{align}
\chi^2_{DM}(\vec p)
& \approx
\chi^2_{DM}(\bar{\vec p})
+
\left[\vec p - \bar{\vec p}\right]^T
\Sigma_{DM}^{-1}(\bar{\vec p})
\left[\vec p - \bar{\vec p}\right]
\, , & &
\left.\frac{\partial \chi^2_{DM}(\vec p)}{\partial p_\alpha}\right|_{\vec p = \bar{\vec p}} = 0 \, \quad \forall_\alpha \, .
\end{align}
In this approximation, the posterior is again a multivariate normal distribution of mean $\bar{\vec p}$ (same as maximal likelihood estimation) with covariance $\Sigma_{DM}(\bar{\vec p})$.
The \texttt{nonlinear\_fit} method numerically computes the vector $\bar{\vec p}$ which maximizes the posterior (fitting) and analytically computes and evaluates the covariance matrix $\Sigma_{DM}(\bar{\vec p})$ at this point.
Appendix A of \cite{Bouchard:2014ypa} describes how $\Sigma_{DM}(\bar{\vec p})$ is estimated using derivatives of the residuals with respect to the prior parameters.
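For the linear one-parameter model $f(x_i, p) = p$ with a single datum $y$ of uncertainty $\sigma_y$, the approximation is exact and can be written in closed form:
\begin{align}
\chi^2_{DM}(p)
&=
\frac{(y - p)^2}{\sigma_y^2}
+
\frac{(p - p_0)^2}{\sigma_{p_0}^2}
\, , &
\bar p
&=
\Sigma_{DM}
\left(
\frac{y}{\sigma_y^2} + \frac{p_0}{\sigma_{p_0}^2}
\right)
\, , &
\Sigma_{DM}
&=
\left(
\frac{1}{\sigma_y^2} + \frac{1}{\sigma_{p_0}^2}
\right)^{-1}
\, .
\end{align}
The posterior mean is the precision-weighted average of the datum and the prior mean, and the posterior variance $\Sigma_{DM}$ is the inverse of the summed precisions.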
%
% \subsubsection{Example}
%
% Suppose the fit function is a linear function $f(x_i, \vec p) = p^{(0)} + p^{(1)} x_i$.
% In this case, the normal approximation of the posterior is exact
% \begin{align}
% \chi^2_{DM}(\vec p)
% & =
% \sum_{i,j=1}^N
% \left[y_i - p^{(0)} - p^{(1)} x_i \right]
% \left(\Sigma_y^{-1}\right)_{ij}
% \left[y_j - p^{(0)} - p^{(1)} x_j\right]
% +
% \left[\begin{pmatrix} p^{(0)} \\ p^{(1)} \end{pmatrix} - \vec p_0 \right]^T
% \left(\Sigma_p^{-1}\right)_{ij}
% \left[\begin{pmatrix} p^{(0)} \\ p^{(1)} \end{pmatrix} - \vec p_0 \right]
% \\
% & =
% \left[
% \sum_{i,j=1}^N y_i \left(\Sigma_y^{-1}\right)_{ij} y_j
% + \vec p_0^T
% \left(\Sigma_p^{-1}\right)
% \vec p_0
% \right]
% +
% \left[\begin{pmatrix} p^{(0)} \\ p^{(1)} \end{pmatrix} - \bar{\vec p} \right]
% \left(\Sigma_{DM}^{-1}\right)_{ij}
% \left[\begin{pmatrix} p^{(0)} \\ p^{(1)} \end{pmatrix} - \bar{\vec p} \right]^T
% \, ,
% \end{align}
% with
% \begin{align}
% \bar{\vec p} &= \ldots
% \\
% \Sigma_{DM}^{-1} &= \ldots
% \end{align}

%==============================================================================
% End Content
%==============================================================================
\bibliography{notes.bib}{}
\bibliographystyle{plainurl}

\batchmode
\end{document}