
Efficient translation of MS data from Python to R #51

@jorainer

Re-evaluating the performance of converting the MS data from a Python list of matchms.Spectrum objects to an R list of numeric matrices. To support the subset operator [ for the new MsBackendPy we need to iterate over an index of spectra instead of over the full list.

To prepare the data:

library(SpectriPy)
library(Spectra)
library(MsBackendMgf)

fl <- system.file("extdata", "mgf", "test.mgf", package = "SpectriPy")
s <- Spectra(fl, source = MsBackendMgf())
## Convert the data to Python and assign to the py Python environment
py_set_attr(py, "s_p", rspec_to_pyspec(s))

The py$s_p variable (or the s_p variable if accessed from within Python) is now a list of matchms.Spectrum objects. Three different functions to extract the data for a predefined set of spectra are shown below:

  • pd1: iterate over the full object in Python using reticulate calls and subset the results in R.
  • pd2: iterate over the index using reticulate calls.
  • pd3: assign the index to a variable in Python and execute a custom Python command that iterates (in Python); this call needs an additional lapply loop to set the column names of the matrices.

## iterate over the full object subsetting the result in R
pd1 <- function(x, i) {
    x <- py_get_attr(py, x)
    res <- iterate(x, function(z) {
        m <- z$peaks$to_numpy
        colnames(m) <- c("mz", "intensity")
        m
    }, simplify = FALSE)
    res[i]
}

## iterating over the index
pd2 <- function(x, i) {
    x <- py_get_attr(py, x)
    res <- iterate(r_to_py(as.integer(i) - 1L), function(i) {
        m <- x[i]$peaks$to_numpy
        colnames(m) <- c("mz", "intensity")
        m
    }, simplify = FALSE)
    res
}
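Note that both pd2() and pd3() convert the R indices to zero-based integers (as.integer(i) - 1L) before handing them to Python, since Python sequences are indexed from 0 while R vectors are indexed from 1. A minimal Python sketch of that conversion:

```python
# R indices are 1-based; Python list indexing is 0-based. pd2() and pd3()
# therefore subtract 1 from the index (as.integer(i) - 1L) before
# iterating on the Python side.
r_indices = [1, 2, 3, 4, 5]              # what idx <- 1:5 would pass from R
py_indices = [i - 1 for i in r_indices]  # zero-based equivalents
print(py_indices)                        # → [0, 1, 2, 3, 4]
```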

## use py_run_string to execute a custom Python loop - assigning first the
## index as a variable.
pd3 <- function(x, i) {
    py$i_ <- r_to_py(as.integer(i) - 1L)
    cmd <- paste0("_res_ = list()\n",
                  "for i in i_:\n",
                  "  _res_.append(", x, "[i].peaks.to_numpy)\n")
    res <- py_to_r(py_run_string(cmd, local = TRUE, convert = FALSE)[["_res_"]])
    lapply(res, function(z) {
        colnames(z) <- c("mz", "intensity")
        z
    })
}
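For illustration, this is the Python command that pd3() assembles for x = "s_p" (a sketch of the string-building only; actually running the command requires the matchms objects set up above, with _res_ and i_ being the temporary variable names used by pd3()):

```python
# Reconstruct the command string that pd3() builds on the R side with
# paste0() for x = "s_p"; i_ holds the zero-based indices assigned via
# py$i_, and _res_ collects one peaks array per requested spectrum.
x = "s_p"
cmd = ("_res_ = list()\n"
       "for i in i_:\n"
       "  _res_.append(" + x + "[i].peaks.to_numpy)\n")
print(cmd)
```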

Running this for a subset of 20 spectra:

library(microbenchmark)

idx <- 1:20
microbenchmark(
    pd1("s_p", idx),
    pd2("s_p", idx),
    pd3("s_p", idx)
)
Unit: microseconds
            expr      min        lq      mean   median       uq       max neval cld
 pd1("s_p", idx) 3318.639 3360.9410 3968.0201 3432.772 3947.492 21702.829   100 a  
 pd2("s_p", idx) 1080.050 1110.7780 1225.6545 1144.179 1223.493  4303.152   100  b 
 pd3("s_p", idx)  256.585  281.0395  334.4408  296.791  314.645  3629.156   100   c

Version 3 with the custom Python command is thus the fastest, even though a second loop to set the column names is needed. We repeat with all 100 spectra:

idx <- 1:100
microbenchmark(
    pd1("s_p", idx),
    pd2("s_p", idx),
    pd3("s_p", idx)
)
Unit: milliseconds
            expr      min       lq     mean   median       uq       max neval cld
 pd1("s_p", idx) 3.272338 3.338710 3.423525 3.370418 3.457476  4.499518   100 a  
 pd2("s_p", idx) 5.315825 5.430013 6.696067 5.486508 5.667932 36.454650   100  b 
 pd3("s_p", idx) 1.094341 1.130274 1.168435 1.143476 1.173361  1.543985   100   c

Also here pd3() outperforms the other two options; notably, pd2() is now even slower than pd1().

While possibly easier to maintain, the reticulate-based approaches perform worse, most likely because of the additional translation of commands and/or data between R and Python.
