Binary segmentation

Background

Binary segmentation is an extremely fast heuristic algorithm for computing changepoint models in sequence data sets (e.g., time series, genomics, etc). Its asymptotic complexity for N data and K segments is

O( N log K ) in the typical/average case, which occurs when each split results in two segments of similar sizes (e.g., splitting 100 data points into 50 and 50).
O( N K ) in the worst case, which occurs when each split results in two segments of very different sizes (e.g., splitting 100 data points into 99 and 1).

Related work

changepoint::cpt.mean(method=”BinSeg”) uses C code that can compute binary segmentation for various statistical models (Normal change in mean with constant variance, Normal change in mean and variance, Poisson change in mean and variance, etc). It supports input of min segment length constraints. Its quadratic O(K^2) output size (K x K changepoint matrix) can be prohibitive for large K. It outputs segment-specific model parameters (e.g., segment means) for only the selected model (but does not support output of segment-specific parameters for all models from 1 to K segments).
binsegRcpp::binseg_normal uses C++ code that computes binary segmentation for only one statistical model (Normal change in mean with constant variance). It does not support input of min segment length constraints. Its linear O(K) output size (data table with K rows) works even for very large K. Using this output the coef method supports computing segment-specific parameters (e.g., segment means) for any models from 1 to K segments. In addition it uses a std::multiset from the C++ Standard Template Library for efficient storage/lookup of split candidates, and it uses an object-oriented design with various classes that help avoid repetition.
gfpop::gfpop uses C++ code to compute optimal segmentation for various statistical models. It is optimal in the sense of the distribution-specific loss function – it is guaranteed to compute changepoints which result in the minimum loss (whereas binary segmentation may not). The supported distributions are:
- “mean” Normal change in mean with constant variance.
- “variance” Normal change in variance with constant mean.
- “poisson” Poisson change in mean and variance.
- “exp” Exponential change in mean and variance.
- “negbin” Negative binomial change in mean with constant variance.

Overall we have the following table which compares the features of the binary segmentation packages:

Feature	changepoint	binsegRcpp
distributions	many	one
min seg length	yes	no
output size	quadratic	linear
output segment params	one model	all models

So changepoint has some advantages, and binsegRcpp has other strengths.

Coding project: new BinarySegmentation package

The goal is to code a new BinarySegmentation package that combines the useful features which are currently only available in either binsegRcpp or changepoint. It should

support observation-specific positive real weights for the cost function.
support input of min segment length constraints.
support at least the same distributions as gfpop, ideally all of the distributions in changepoint as well.
use C++ std::multiset for efficient storage/lookup of split candidates.
use an object-oriented design with various classes that help avoid repetition.
for max number of segments = K, output data table with K rows (one per model size), which includes segment-specific model parameters (e.g., segment means), with a coef method for converting that output to a data table with one row per segment.

The source code for computing the distribution-specific loss function in the changepoint package is in src/cost_general_functions.c. The table below shows the mapping from gfpop distributions types and changepoint cost function names:

gfpop type	changepoint cost
mean	mean_norm
variance	var_norm
poisson	meanvar_poisson
exp	meanvar_exp
negbin	NA

Expected impact

This new fast and fully-featured implementation of binary segmentation will be very useful for useRs with sequential data and changepoint detection problems.

Mentors

Please get in touch after completing at least one of the tests below.

EVALUATING MENTOR: TODO
Co-mentor: Toby Dylan Hocking <[email protected]>
Co-mentor: Rebecca Killick <[email protected]>

Tests

Do one or several — doing more hard tests makes you more likely to be selected.

Easy: TODO
Medium: TODO
Hard: TODO

Solutions of tests

Students, please post a link to your test results here.

EXAMPLE STUDENT 1 NAME, LINK TO GITHUB PROFILE, LINK TO TEST RESULTS.

Please refrain from editing this Footer. If you want to edit this page, please scroll to the top of the web page and click the Edit button in the upper right.

Binary segmentation

Background

Related work

Coding project: new BinarySegmentation package

Expected impact

Mentors

Tests

Solutions of tests

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally