Add support for (very) large-scale data #794

jorainer · 2025-04-28T07:25:43Z

This PR adds a new xcms result object, XcmsExperimentHdf5, that stores all preprocessing results in a local file in HDF5 format. Because no data or results are kept in memory, memory demand is reduced to a minimum.

For an example and performance evaluation of this backend check the large_scale branch on Metabonaut: https://github.com/rformassspectrometry/Metabonaut/tree/large_scale_data

There are also other, smaller changes in this PR, including:

- improve performance of chromatogram call by reading only spectra within the retention time ranges
- linking to Metabonaut
- update all documentation to roxygen/markdown. Fix links in documentation.

Apologies for the very large PR - there are unit tests for all functions, so the code should be OK. So maybe just discuss/evaluate the concept?

- `chromatograms()` to throw an error if parameter `mz` or `rt` contain missing values.
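
The missing-value check mentioned above could look roughly like the following. This is a minimal sketch in base R, not the xcms implementation: the function name `check_ranges` and the error message are assumptions made for illustration.

```r
## Hypothetical validator (not part of xcms): reject m/z or retention time
## ranges that contain missing values before any data is read.
check_ranges <- function(mz = numeric(), rt = numeric()) {
    if (anyNA(mz) || anyNA(rt))
        stop("'mz' and 'rt' must not contain missing values")
    invisible(TRUE)
}

## Valid ranges pass silently.
check_ranges(mz = c(100.1, 100.2), rt = c(10, 20))
```

Failing early like this avoids a less informative error later, when the ranges are used to subset spectra.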

sneumann

Hi, awesome work. Some cosmetic questions here and there. Yours, Steffen

sneumann · 2025-05-16T16:12:55Z

NAMESPACE

@@ -291,156 +294,6 @@ exportClasses(
    "FilterIntensityParam",
    "ChromPeakAreaParam"
 )
-## Param methods


Were those removed entirely ? I see a few new exports elsewhere, but only a minority of these here

I removed all of the getter/setter methods for the *Param classes (please let me know if there are any left - I would like to remove all).

Why I removed these:

- they clutter the documentation with seldom, if ever, used functions
- users will almost never use these functions. It's easier to just create a new param object than to change the value of a previously created one.

sneumann · 2025-05-16T16:21:41Z

NEWS.md

+
+## Changes in version 4.7.1
+
+- Change the naming convention of chromatographic peaks to include also the MS


That starts to put semantics into the identifier. I am extra careful with that, code needs to check MS level <=9. Are there any other semantics that could creep around the corner and need consideration ? Ion Mobility ? Probably not, remains an attribute of the peaks. Can you still rely on the leading digit in older datasets ? CP2001 could also be the 2001st peak in an old dataset ...

The chrom peak identifiers were always just thought of as identifiers within the same data set. So, IMHO, it does not matter what values these identifiers have (whether it's a running number or even just a random number, as long as it is unique). I agree that adding semantics to identifiers is not ideal - but for the new XcmsExperimentHdf5 I needed something that allows me to define chromatographic peak IDs separately for each process/file. Peak detection is performed on a per-sample (file) basis and each of these processes stores the identified peaks to the HDF5 file without knowing anything about the other processes. That's why I came up with this solution - to still have unique chrom peak IDs within the same data set. I'm of course open to any other working solution :)
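
To illustrate the idea of IDs that stay unique even though each file is processed independently, here is a minimal base R sketch. The helper `make_peak_ids` and the ID layout (`CP`, MS level, `S` plus sample index, `P` plus running number) are assumptions for illustration only and are not the naming scheme actually used by xcms.

```r
## Hypothetical helper: build peak IDs from the MS level, the sample index
## and a per-sample running number. The layout is illustrative only.
make_peak_ids <- function(ms_level, sample_index, n_peaks) {
    ## A single leading digit for the MS level only works for levels 1-9.
    stopifnot(ms_level >= 1L, ms_level <= 9L)
    sprintf("CP%dS%03dP%06d", ms_level, sample_index, seq_len(n_peaks))
}

## Each per-file process only knows its own sample index and peak count,
## yet the combined IDs do not collide.
ids <- unlist(lapply(1:3, function(i) make_peak_ids(1L, i, n_peaks = 4L)))
stopifnot(anyDuplicated(ids) == 0L)
```

The literal `S` and `P` separators keep the fields unambiguous in this sketch even if a sample index or running number exceeds its zero-padded width.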

sneumann · 2025-05-16T16:22:43Z

NEWS.md

+- Optimization and performance improvements for extraction of chromatographic
+  data. This includes using `MsCoreUtils::reduce()`.
+- Restructure and clean-up of documentation.
+- Don't export unnecessary get/set methods for `Param` classes.


Ok. Methods are not necessary or their export ? Could we have any old users of those functions ?

Hard to tell if there are any users out there - I guess very few if any (we never really promoted the idea of using getter/setter methods of the parameter classes). I would change and see what happens - my guess is users come forward and will open issues, then we can address it.

sneumann · 2025-05-16T16:30:25Z

R/DataClasses.R

 #' peak detection in purely chromatographic data.
 #'
 #' @references
 #' Colin A. Smith, Elizabeth J. Want, Grace O'Maille, Ruben Abagyan and
 #' Gary Siuzdak. "XCMS: Processing Mass Spectrometry Data for Metabolite
 #' Profiling Using Nonlinear Peak Alignment, Matching, and Identification"
-#' \emph{Anal. Chem.} 2006, 78:779-787.
+#' *Anal. Chem.* 2006, 78:779-787.


Future citations could include the DOI.

good point. I'll add the DOI for all references.

I added the DOIs in commit c120f61

sneumann · 2025-05-16T16:33:11Z

R/DataClasses.R

 #' p
-#'
+NULL


I'd love to learn what these two NULL do, and what's the difference between 'em.

The NULL after a roxygen documentation block just ensures that the documentation file (Rd) is created - without also adding the next function to it. For general documentation or collections/topics of functions I use that, i.e., write the roxygen documentation and add a NULL at the end. Then, I add all related functions/methods/classes to the documentation using #' @rdname <name of the general, topic, documentation>
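
A small sketch of that pattern follows. The topic name `peak-helpers` and the two functions are made up for illustration and do not exist in xcms; only the roxygen2 tags (`@name`, `@rdname`, `@param`, `@export`) are real.

```r
#' @title Peak helper functions
#'
#' @description
#' General documentation topic for a group of related helpers. The `NULL`
#' below ends this roxygen block, so the topic is not attached to the
#' function that follows.
#'
#' @name peak-helpers
NULL

#' @rdname peak-helpers
#'
#' @param x `numeric` vector of intensities.
#'
#' @export
peak_apex <- function(x) which.max(x)

#' @rdname peak-helpers
#'
#' @export
peak_height <- function(x) max(x, na.rm = TRUE)
```

With this layout roxygen2 would be expected to write a single Rd file for the `peak-helpers` topic that documents both functions.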

sneumann · 2025-05-16T21:23:54Z

R/XcmsExperimentHdf5.R

+                expandMz = expandMz, ppm = ppm, skipFilled = skipFilled,
+                peaks = peaks, chromPeakColumns = chromPeakColumns)
+        })
+        res <- Spectra:::.concatenate_spectra(res)


Should that be exported by Spectra in the future ?

I have to check the function - could be that it's already exported.

This one without leading dot ?
https://github.com/rformassspectrometry/Spectra/blob/82f773a3a8a305341b9baed3216632599649719d/NAMESPACE#L14

yes. I will update in the next PR.

sneumann · 2025-05-16T21:31:08Z

R/functions-XCMSnExp.R

@@ -577,35 +584,6 @@ dropGenericProcessHistory <- function(x, fun) {
    idxs
 }

-#' @rdname adjustRtime
-adjustRtimePeakGroups <- function(object, param = PeakGroupsParam(),


This is entirely gone now ?

on the contrary - I converted that to a method - so I added dedicated implementations for XcmsExperiment, XcmsExperimentHdf5 and XCMSnExp

sneumann · 2025-05-16T21:42:31Z

R/methods-xcmsFragments.R

@@ -3,6 +3,7 @@

 ############################################################
 ## show
+#' @rdname hidden_aliases
 setMethod("show", "xcmsFragments", .xcmsFragments.show)


How useful is xcmsFragments nowadays ? Would this devel cycle be a good time to deprecate it and mark for removal in the next cycle ?

I would be OK with that. I don't know what this class actually is/was used for... But we should maybe start deprecation in a separate branch/PR.

sneumann · 2025-05-16T21:52:21Z

R/ramp.R

@@ -1,228 +0,0 @@
-## # rampInit <- function() {


Rest in pieces :-)

sneumann · 2025-05-16T22:01:39Z

tests/testthat/test_XcmsExperimentHdf5-functions.R

@@ -0,0 +1,1083 @@
+library(rhdf5)


should the library() be in the parent directory, for all *hdf5 tests ?

I wanted to keep the HDF5 functionality as separate as possible thus I did not add it into the main testthat.R file. I find it cleaner if the unit test file specifically sets the stage for its R source file.

- Add DOI for all references.
- Add parameter `force.overwrite` to `findChromPeaks()` to allow overwriting of a HDF5 result file.
- Rename `.is_chrom_peaks_within_mz_rt()` to `.is_chrom_peak_within_mz_rt()`.

sneumann · 2025-05-26T13:13:05Z

Excellent, thanks for the hard work and answering questions. Will merge next. Yours, Steffen

jorainer added 30 commits October 1, 2024 10:48

Add initial ideas and code

770849c

feat: add some more functions related to HDF5 files.

78e0a74

feat: add additional functionaliy related to HDF5 files

4e397d7

feat: add XcmsExperimentHdf class and first methods

176d835

fix: small fixes.

908a700

feat: add refineChromPeaks,XcmsExperimentHdf5

e780523

feat: add .h5_chrom_peaks function

c4e25d0

feat: add chromPeaks,XcmsExperimentHdf5 method

7dd8597

feat: add parameter columns to chromPeaks for XcmsExperiment

1b5ef93

feat: add correspondence functionality

f8fb0e0

feat: add featureValues,XcmsExperimentHdf5

71129d6

feat: add functionality to get chrom peaks for features

e72e1c2

Merge branch 'devel' into large_scale

eafadff

Add functionality related to retention time alignment

3dc7b28

feat: add adjustRtime and related functions for XcmsExperimentHdf5

6834d1c

Small fixes and improvements

9e1a1a6

Add chromPeakData,XcmsExperimentHdf5 function

9255c63

Performance improvement for chromPeakData,XcmsExperimentHdf5

4ef9c2b

feat: add chromatogram,XcmsExperimentHdf5

7bd035d

tests: fix unit test

6d5b344

Add first gap filling functionality

365042c

feat: add featureArea for XcmsExperimentHdf5

ad0650d

tests: fix unit tests

0663595

Merge remote-tracking branch 'origin/devel' into large_scale

4ed04b4

feat: add fillChromPeaks,XcmsExperimentHdf5

d62c5c6

feat: add dropFilledChromPeaks,XcmsExperimentHdf5

fe2c689

feat: add filterMsLevel,XcmsExperimentHdf5

b591cf7

feat: add filterRt,XcmsExperimentHdf5

26021cc

feat: add filterIsolationWindow,XcmsExperimentHdf5

a8fe30b

feat: add functions to convert between xcms result objects

65095fe

jorainer and others added 17 commits March 13, 2025 16:47

Merge branch 'AnnotatedDataFrame' into large_scale

e87279b

feat: add filterFeatureDefinitions,XcmsExperimentHdf5

50766c1

docs: update documentation to markdown

d6fc2d5

Merge branch 'devel' into large_scale

0788b28

feat: add manualChromPeaks,XcmsExperimentHdf5 method

6fa9868

refactor: use reduce from MsCoreUtils

01c0b7a

fix: smaller fixes

00383e5

feat: add chromPeakSummary,XcmsExperimentHdf5

82bb3b4

bump x.y.z version to even y prior to creation of RELEASE_3_21 branch

71642c4

bump x.y.z version to odd y following creation of RELEASE_3_21 branch

67d516c

docs: restructure documentation and move aliases to hidden aliases

87af4ec

docs: small fixes to the vignettes

3ec9362

docs: fix aliases and add Metabonaut links to vignettes

92288c1

docs: update NEWS

b471611

docs: small fix in vignette

d7831ab

Merge branch 'devel' into large_scale

e4ec435

ci: update R version

e1da916

jorainer requested review from philouail and sneumann April 28, 2025 08:19

jorainer added 4 commits May 5, 2025 06:40

fix: chromatograms() to throw error if NA ranges provided

3006638

- `chromatograms()` to throw an error if parameter `mz` or `rt` contain missing values.

ci: set environment variables to fix data.table install

3353c92

ci: install gettext on macOS

0d6f80b

fix: disable inclusion of libintl.h in massifquant

6997398

sneumann reviewed May 16, 2025


jorainer added 2 commits May 19, 2025 18:27

address Steffen's comments

c120f61

- Add DOI for all references.
- Add parameter `force.overwrite` to `findChromPeaks()` to allow overwriting of a HDF5 result file.
- Rename `.is_chrom_peaks_within_mz_rt()` to `.is_chrom_peak_within_mz_rt()`.

review: show supported values in error message

de9ae22

jorainer requested a review from sneumann May 22, 2025 05:48

sneumann merged commit f8fa7f3 into devel May 26, 2025
2 of 3 checks passed

jorainer deleted the large_scale branch May 27, 2025 06:24

