Skip to content

Binary segmentation

Toby Dylan Hocking edited this page Nov 4, 2020 · 13 revisions

Background

Binary segmentation is an extremely fast heuristic algorithm for computing changepoint models in sequence data sets (e.g., time series, genomics, etc). Its asymptotic complexity for N data and K segments is

  • O( N log K ) in the typical/average case, which occurs when each split results in two segments of similar sizes (e.g., splitting 100 data points into 50 and 50).
  • O( N K ) in the worst case, which occurs when each split results in two segments of very different sizes (e.g., splitting 100 data points into 99 and 1).

Related work

  • changepoint::cpt.mean(method=”BinSeg”) uses C code that can compute binary segmentation for various statistical models (Normal change in mean with constant variance, Normal change in mean and variance, Poisson change in mean and variance, etc). It supports input of min segment length constraints. Its quadratic O(K^2) output size (K x K changepoint matrix) can be prohibitive for large K. It outputs segment-specific model parameters (e.g., segment means) for only the selected model (but does not support output of segment-specific parameters for all models from 1 to K segments).
  • binsegRcpp::binseg_normal uses C++ code that computes binary segmentation for only one statistical model (Normal change in mean with constant variance). It does not support input of min segment length constraints. Its linear O(K) output size (data table with K rows) works even for very large K. Using this output the coef method supports computing segment-specific parameters (e.g., segment means) for any models from 1 to K segments. In addition it uses a std::multiset from the C++ Standard Template Library for efficient storage/lookup of split candidates, and it uses an object-oriented design with various classes that help avoid repetition.
  • gfpop::gfpop uses C++ code to compute optimal segmentation for various statistical models. It is optimal in the sense of the distribution-specific loss function – it is guaranteed to compute changepoints which result in the minimum loss (whereas binary segmentation may not). The supported distributions are:
    • “mean” Normal change in mean with constant variance.
    • “variance” Normal change in variance with constant mean.
    • “poisson” Poisson change in mean and variance.
    • “exp” Exponential change in mean and variance.
    • “negbin” Negative binomial change in mean with constant variance.

Overall we have the following table which compares the features of the binary segmentation packages:

Feature changepoint binsegRcpp
distributions many one
min seg length yes no
output size quadratic linear
output segment params one model all models

So changepoint has some advantages, and binsegRcpp has other strengths.

Coding project: new BinarySegmentation package

The goal is to code a new BinarySegmentation package that combines the useful features which are currently only available in either binsegRcpp or changepoint. It should

  • support observation-specific positive real weights for the cost function.
  • support input of min segment length constraints.
  • support at least the same distributions as gfpop, ideally all of the distributions in changepoint as well.
  • use C++ std::multiset for efficient storage/lookup of split candidates.
  • use an object-oriented design with various classes that help avoid repetition.
  • for max number of segments = K, output data table with K rows (one per model size), which includes segment-specific model parameters (e.g., segment means), with a coef method for converting that output to a data table with one row per segment.

The source code for computing the distribution-specific loss function in the changepoint package is in src/cost_general_functions.c. The table below shows the mapping from gfpop distributions types and changepoint cost function names:

gfpop type changepoint cost
mean mean_norm
variance var_norm
poisson meanvar_poisson
exp meanvar_exp
negbin NA

Expected impact

This new fast and fully-featured implementation of binary segmentation will be very useful for useRs with sequential data and changepoint detection problems.

Mentors

Please get in touch after completing at least one of the tests below.

Tests

Do one or several — doing more hard tests makes you more likely to be selected.

  • Easy: TODO
  • Medium: TODO
  • Hard: TODO

Solutions of tests

Students, please post a link to your test results here.

  • EXAMPLE STUDENT 1 NAME, LINK TO GITHUB PROFILE, LINK TO TEST RESULTS.
Clone this wiki locally