-
Notifications
You must be signed in to change notification settings - Fork 7
Description
Re-evaluating the performance to convert the MS data from a Python list of matchms.Spectrum objects to an R list of numeric matrix. To support the subset operator [ For the new MsBackendPy we need to iterate over an index of spectra, instead of the full list.
To prepare the data:
library(SpectriPy)
library(Spectra)
library(MsBackendMgf)
fl <- system.file("extdata", "mgf", "test.mgf", package = "SpectriPy")
s <- Spectra(fl, source = MsBackendMgf())
## Convert the data to Python and assign to the py Python environment
py_set_attr(py, "s_p", rspec_to_pyspec(s))The py$s_p variable (or s_p variable if accessed from within Python is now a list of matchms.Spectrum objects. Three different functions to extract the data for a predefined set of spectra are below:
pd1: iterate over the full object in Python, using reticulate calls and subset the results.pd2: iterate over the index using reticulate calls.pd3: assign the index to a variable in Python and execute a custom Python command to iterate (in Python); this call needs an additionallapplyloop to set the column names of the matrices.
## iterate over the full object subsetting the result in R
pd1 <- function(x, i) {
x <- py_get_attr(py, x)
res <- iterate(x, function(z) {
m <- z$peaks$to_numpy
colnames(m) <- c("mz", "intensity")
m
}, simplify = FALSE)
res[i]
}
## iterating over the index
pd2 <- function(x, i) {
x <- py_get_attr(py, x)
res <- iterate(r_to_py(as.integer(i) -1L), function(i) {
m <- x[i]$peaks$to_numpy
colnames(m) <- c("mz", "intensity")
m
}, simplify = FALSE)
res
}
## use py_run_string to execute a custom Python loop - assigning first the
## index as a variable.
pd3 <- function(x, i) {
py$i_ <- r_to_py(as.integer(i) - 1L)
cmd <- paste0("_res_ = list()\n",
"for i in i_:\n",
" _res_.append(", x, "[i].peaks.to_numpy)\n")
res <- py_to_r(py_run_string(cmd, local = TRUE, convert = FALSE)[["_res_"]])
lapply(res, function(z) {
colnames(z) <- c("mz", "intensity")
z
})
}Running this for a set of spectra:
library(microbenchmark)
idx <- 1:20
microbenchmark(
pd1("s_p", idx),
pd2("s_p", idx),
pd3("s_p", idx)
)
Unit: microseconds
expr min lq mean median uq max neval cld
pd1("s_p", idx) 3318.639 3360.9410 3968.0201 3432.772 3947.492 21702.829 100 a
pd2("s_p", idx) 1080.050 1110.7780 1225.6545 1144.179 1223.493 4303.152 100 b
pd3("s_p", idx) 256.585 281.0395 334.4408 296.791 314.645 3629.156 100 c
Version 3 with the custom Python command is thus the fastest - even if a second loop to set the column names is needed. We repeat with all 100 spectra:
idx <- 1:100
microbenchmark(
pd1("s_p", idx),
pd2("s_p", idx),
pd3("s_p", idx)
)
Unit: milliseconds
expr min lq mean median uq max neval cld
pd1("s_p", idx) 3.272338 3.338710 3.423525 3.370418 3.457476 4.499518 100 a
pd2("s_p", idx) 5.315825 5.430013 6.696067 5.486508 5.667932 36.454650 100 b
pd3("s_p", idx) 1.094341 1.130274 1.168435 1.143476 1.173361 1.543985 100 cAlso here pd3() outperforms the other two options.
While eventually easier to maintain, the reticulate Python expressions have a lower performance, most likely because of additional translations of the commands and/or data.