An R package for computing constrained optimal segmentation models
install.packages("devtools")
devtools::install_github("tdhock/coseg")
Linear time exact C++ solver for the optimal change points using the Poisson loss and the PeakSeg constraint.
There are two R functions that solve the Segment Neighborhood problem (best models with at most 1,…,S segments). The constraints are first segment mean <= second segment mean >= third segment mean, etc. For S segments and N data points, complexity is O(S N log N) memory and O(S N log N) time.
- PeakSegPDPA is a direct interface to the C++ code. Returns matrices for the optimal cost and number of intervals stored by the algorithm at each step. Input parameter is the maximum number of segments S, and outputs are the optimal models from 1 to S segments.
- PeakSegPDPAchrom is a more user-friendly interface, which takes a data.frame with columns chrom, chromStart, chromEnd, count as input, and returns data.frames as output. Input parameter is the maximum number of peaks P (implying S = P*2+1 segments), and outputs are the optimal models from 0 to P peaks.
There are two R functions that solve the Optimal Partitioning problem (best model for a given penalty parameter). Input parameter is a non-negative penalty constant (bigger for fewer peaks, smaller for more peaks), and output is the model that minimizes the penalized cost using the specified penalty constant. For N data points, complexity is O(N log N) memory and time. The first segment mean is constrained to be down (mu_1 <= mu_2), as is the last (muN-1 >= mu_N).
- PeakSegFPOP is a direct interface to the C++ code. Returns matrices for the optimal cost and number of intervals stored by the algorithm at each step.
- PeakSegFPOPchrom is a more user-friendly interface, which takes a data.frame with columns chrom, chromStart, chromEnd, count as input, and returns data.frames as output.
Note that as defined in the PeakSeg ICML’15 paper, the PeakSeg problem has strict inequality constraints (segment mean 1 < segment mean 2 > segment mean 3, etc). However, the solvers in this package use non-strict inequality constraints (segment mean 1 <= segment mean 2 >= segment mean 3, etc), and so may return a model with adjacent segment means that are equal. In this case, the minimum of the PeakSeg problem with strict inequality constraints is undefined. For example,
library(coseg)
data.vec <- as.integer(c(1, 2, 3))
fit <- PeakSegPDPA(data.vec, max.segments=3L)
For the data 1, 2, 3 the optimal model with at most 3 segments actually has only 2 segments:
> fit$mean.mat[3,] [1] 1.0 2.5 2.5
This implies that the optimal model with strict inequality constraints is undefined for these data. For example, the model with means 1, 3, 2 satisfies the strict inequality constraint, and has a PoissonLoss of
> PoissonLoss(c(1,2,3), c(1,3,2)) [1] 1.723334
But that model is not optimal, because there is another model that satisfies the strict inequality constraints, but has a better cost. For example,
> PoissonLoss(c(1,2,3), c(1,2.9,2.1)) [1] 1.644766
But that model is not optimal, for the same reason (exercise for the reader).
If you want to use the problem.* functions to perform optimal segmentation and peak calling on the whole genome, please read the PeakSegFPOP tutorial. R is limited by memory, so it is better to use the PeakSegFPOP command line program to compute optimal peak models for large bedGraph files, since it stores the optimal cost in a temporary file on disk (rather than in memory).
An on-disk implementation of PeakSegFPOP is available as a command line program.
implementation | time | memory | disk |
---|---|---|---|
command line | O(N log N) | O(log N) | O(N log N) |
R pkg coseg | O(N log N) | O(N log N) | 0 |
Note that although both implementations are O(N log N) time complexity for N data points, the command line program is slower due to disk read/write overhead.