GitHub - at82/coseg: Constrained Optimal Segmentation

An R package for computing constrained optimal segmentation models

Installation

install.packages("devtools")
devtools::install_github("tdhock/coseg")

PeakSeg solvers

Linear time exact C++ solver for the optimal change points using the Poisson loss and the PeakSeg constraint.

PeakSeg Segment Neighborhood problem

There are two R functions that solve the Segment Neighborhood problem (best models with at most 1,…,S segments). The constraints are first segment mean <= second segment mean >= third segment mean, etc. For S segments and N data points, complexity is O(S N log N) memory and O(S N log N) time.

PeakSegPDPA is a direct interface to the C++ code. Returns matrices for the optimal cost and number of intervals stored by the algorithm at each step. Input parameter is the maximum number of segments S, and outputs are the optimal models from 1 to S segments.
PeakSegPDPAchrom is a more user-friendly interface, which takes a data.frame with columns chrom, chromStart, chromEnd, count as input, and returns data.frames as output. Input parameter is the maximum number of peaks P (implying S = P*2+1 segments), and outputs are the optimal models from 0 to P peaks.

PeakSeg Optimal Partitioning problem

There are two R functions that solve the Optimal Partitioning problem (best model for a given penalty parameter). Input parameter is a non-negative penalty constant (bigger for fewer peaks, smaller for more peaks), and output is the model that minimizes the penalized cost using the specified penalty constant. For N data points, complexity is O(N log N) memory and time. The first segment mean is constrained to be down (mu_1 <= mu_2), as is the last (mu_N-1 >= mu_N).

PeakSegFPOP is a direct interface to the C++ code. Returns matrices for the optimal cost and number of intervals stored by the algorithm at each step.
PeakSegFPOPchrom is a more user-friendly interface, which takes a data.frame with columns chrom, chromStart, chromEnd, count as input, and returns data.frames as output.

Note about inequality constraints

Note that as defined in the PeakSeg ICML’15 paper, the PeakSeg problem has strict inequality constraints (segment mean 1 < segment mean 2 > segment mean 3, etc). However, the solvers in this package use non-strict inequality constraints (segment mean 1 <= segment mean 2 >= segment mean 3, etc), and so may return a model with adjacent segment means that are equal. In this case, the minimum of the PeakSeg problem with strict inequality constraints is undefined. For example,

library(coseg)
data.vec <- as.integer(c(1, 2, 3))
fit <- PeakSegPDPA(data.vec, max.segments=3L)

For the data 1, 2, 3 the optimal model with at most 3 segments actually has only 2 segments:

> fit$mean.mat[3,]
[1] 1.0 2.5 2.5

This implies that the optimal model with strict inequality constraints is undefined for these data. For example, the model with means 1, 3, 2 satisfies the strict inequality constraint, and has a PoissonLoss of

> PoissonLoss(c(1,2,3), c(1,3,2))
[1] 1.723334

But that model is not optimal, because there is another model that satisfies the strict inequality constraints, but has a better cost. For example,

> PoissonLoss(c(1,2,3), c(1,2.9,2.1))
[1] 1.644766

But that model is not optimal, for the same reason (exercise for the reader).

problem.* functions

If you want to use the problem.* functions to perform optimal segmentation and peak calling on the whole genome, please read the PeakSegFPOP tutorial. R is limited by memory, so it is better to use the PeakSegFPOP command line program to compute optimal peak models for large bedGraph files, since it stores the optimal cost in a temporary file on disk (rather than in memory).

Related work

An on-disk implementation of PeakSegFPOP is available as a command line program.

implementation	time	memory	disk
command line	O(N log N)	O(log N)	O(N log N)
R pkg coseg	O(N log N)	O(N log N)	0

Note that although both implementations are O(N log N) time complexity for N data points, the command line program is slower due to disk read/write overhead.

Name		Name	Last commit message	Last commit date
Latest commit History 234 Commits
R		R
data		data
man		man
src		src
tests		tests
.DS_Store		.DS_Store
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
.travis.yml		.travis.yml
DESCRIPTION		DESCRIPTION
NAMESPACE		NAMESPACE
NEWS		NEWS
README.org		README.org
coseg.Rproj		coseg.Rproj

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Installation

PeakSeg solvers

PeakSeg Segment Neighborhood problem

PeakSeg Optimal Partitioning problem

Note about inequality constraints

problem.* functions

Related work

About

Releases

Packages

Languages

at82/coseg

Folders and files

Latest commit

History

Repository files navigation

Installation

PeakSeg solvers

PeakSeg Segment Neighborhood problem

PeakSeg Optimal Partitioning problem

Note about inequality constraints

problem.* functions

Related work

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages