
Efficient translation of MS data from Python to R #51

@jorainer

Re-evaluating the performance of converting the MS data from a Python list of matchms.Spectrum objects to an R list of numeric matrices. To support the subset operator [ for the new MsBackendPy we need to iterate over an index of spectra instead of over the full list.

To prepare the data:

library(SpectriPy)
library(Spectra)
library(MsBackendMgf)

fl <- system.file("extdata", "mgf", "test.mgf", package = "SpectriPy")
s <- Spectra(fl, source = MsBackendMgf())
## Convert the data to Python and assign to the py Python environment
py_set_attr(py, "s_p", rspec_to_pyspec(s))

The py$s_p variable (or the s_p variable if accessed from within Python) is now a list of matchms.Spectrum objects. Three different functions to extract the data for a predefined set of spectra are shown below:

  • pd1: iterate over the full object in Python using reticulate calls and subset the results in R.
  • pd2: iterate over the index using reticulate calls.
  • pd3: assign the index to a variable in Python and execute a custom Python command that iterates (in Python); this call needs an additional lapply loop to set the column names of the matrices.

## iterate over the full object subsetting the result in R
pd1 <- function(x, i) {
    x <- py_get_attr(py, x)
    res <- iterate(x, function(z) {
        m <- z$peaks$to_numpy
        colnames(m) <- c("mz", "intensity")
        m
    }, simplify = FALSE)
    res[i]
}

## iterating over the index
pd2 <- function(x, i) {
    x <- py_get_attr(py, x)
    res <- iterate(r_to_py(as.integer(i) - 1L), function(i) {
        m <- x[i]$peaks$to_numpy
        colnames(m) <- c("mz", "intensity")
        m
    }, simplify = FALSE)
    res
}
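Note that both pd2() and pd3() convert the R indices to zero-based integers (as.integer(i) - 1L) before handing them to Python, since Python sequences are indexed from 0 while R vectors are indexed from 1. A minimal Python sketch of that conversion:

```python
# R indices are 1-based; Python list indexing is 0-based. pd2() and pd3()
# therefore subtract 1 from the index (as.integer(i) - 1L) before
# iterating on the Python side.
r_indices = [1, 2, 3, 4, 5]              # what idx <- 1:5 would pass from R
py_indices = [i - 1 for i in r_indices]  # zero-based equivalents
print(py_indices)                        # → [0, 1, 2, 3, 4]
```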

## use py_run_string to execute a custom Python loop - assigning first the
## index as a variable.
pd3 <- function(x, i) {
    py$i_ <- r_to_py(as.integer(i) - 1L)
    cmd <- paste0("_res_ = list()\n",
                  "for i in i_:\n",
                  "  _res_.append(", x, "[i].peaks.to_numpy)\n")
    res <- py_to_r(py_run_string(cmd, local = TRUE, convert = FALSE)[["_res_"]])
    lapply(res, function(z) {
        colnames(z) <- c("mz", "intensity")
        z
    })
}
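For illustration, this is the Python command that pd3() assembles for x = "s_p" (a sketch of the string-building only; actually running the command requires the matchms objects set up above, with _res_ and i_ being the temporary variable names used by pd3()):

```python
# Reconstruct the command string that pd3() builds on the R side with
# paste0() for x = "s_p"; i_ holds the zero-based indices assigned via
# py$i_, and _res_ collects one peaks array per requested spectrum.
x = "s_p"
cmd = ("_res_ = list()\n"
       "for i in i_:\n"
       "  _res_.append(" + x + "[i].peaks.to_numpy)\n")
print(cmd)
```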

Running this for a subset of 20 spectra:

library(microbenchmark)

idx <- 1:20
microbenchmark(
    pd1("s_p", idx),
    pd2("s_p", idx),
    pd3("s_p", idx)
)
Unit: microseconds
            expr      min        lq      mean   median       uq       max neval cld
 pd1("s_p", idx) 3318.639 3360.9410 3968.0201 3432.772 3947.492 21702.829   100 a  
 pd2("s_p", idx) 1080.050 1110.7780 1225.6545 1144.179 1223.493  4303.152   100  b 
 pd3("s_p", idx)  256.585  281.0395  334.4408  296.791  314.645  3629.156   100   c

Version 3 with the custom Python command is thus the fastest, even though a second loop to set the column names is needed. We repeat with all 100 spectra:

idx <- 1:100
microbenchmark(
    pd1("s_p", idx),
    pd2("s_p", idx),
    pd3("s_p", idx)
)
Unit: milliseconds
            expr      min       lq     mean   median       uq       max neval cld
 pd1("s_p", idx) 3.272338 3.338710 3.423525 3.370418 3.457476  4.499518   100 a  
 pd2("s_p", idx) 5.315825 5.430013 6.696067 5.486508 5.667932 36.454650   100  b 
 pd3("s_p", idx) 1.094341 1.130274 1.168435 1.143476 1.173361  1.543985   100   c

Also here pd3() outperforms the other two options; notably, pd2() is now even slower than pd1().

While possibly easier to maintain, the reticulate-based approaches perform worse, most likely because of the additional translation of commands and/or data between R and Python.
