Skip to content

Add support for (very) large-scale data #794

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 64 commits into from
May 26, 2025
Merged

Add support for (very) large-scale data #794

merged 64 commits into from
May 26, 2025

Conversation

jorainer
Copy link
Collaborator

This PR adds a new xcms result object XcmsExperimentHdf5 that stores all preprocessing results to a local file in HDF5 format. Thus, no data/results are kept in memory reducing the memory demand to a minimum.

For an example and performance evaluation of this backend check the large_scale branch on Metabonaut: https://github.com/rformassspectrometry/Metabonaut/tree/large_scale_data

There are also other, smaller, changes in this PR including:

  • improve performance of chromatogram call by reading only spectra within the retention time ranges
  • linking to Metabonaut
  • update all documentations to roxygen/markdown. Fix links in documentation.

Apologies for the very large PR - there are unit tests for all functions, so the code should be OK. So maybe just discuss/evaluate the concept?

@jorainer jorainer requested review from philouail and sneumann April 28, 2025 08:19
Copy link
Owner

@sneumann sneumann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, awesome work. Some cosmetic questions here and there. Yours, Steffen

@@ -291,156 +294,6 @@ exportClasses(
"FilterIntensityParam",
"ChromPeakAreaParam"
)
## Param methods
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Were those removed entirely ? I see a few new exports elsewhere, but only a minority of these here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed all of the getter/setter methods for the *Param classes (please let me know if there are any left - I would like to remove all).

Why I removed these:

  • it clutters the documentation with seldomly, if at all, used functions
  • users will almost never use these functions. It's easier to just create a new param object than changing the value of a previously created one.


## Changes in version 4.7.1

- Change the naming convention of chromatographic peaks to include also the MS
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That starts to put semantics into the identifier. I am extra careful with that, code needs to check MS level <=9. Are there any other semantics that could creep around the corner and need consideration ? Ion Mobility ? Probably not, remains an attribute of the peaks. Can you still rely on the leading digit in older datasets ? CP2001 could also be the 2001st peak in an old dataset ...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The chrom peak identifiers were always just thought as an identifier within the same data set. So, IMHO, it does not matter what values these identifiers have (whether it's a running number or even just a random number, as long as it is unique). I agree that adding semantics to identifiers is not ideal - but for the new XcmsExperimentHdf5 I needed something that allows me to define chromatographic peaks IDs separately for each process/file. Peak detection is performed on a per sample (file) basis and each of these processes stores the identified peaks to the HDF5 file without knowing anything about the other projects. That's why I came up with this solution - to still have unique chrom peak IDs within the same data set. I'm of course open to any other working solution :)

- Optimization and performance improvements for extraction of chromatographic
data. This includes using `MsCoreUtils::reduce()`.
- Restructure and clean-up of documentation.
- Don't export unnecessary get/set methods for `Param` classes.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Methods are not neccessary or their export ? Could we have any old users of those functions ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hard to tell if there are any users out there - I guess very few if any (we never really promoted the idea of using getter/setter methods of the parameter classes). I would change and see what happens - my guess is users come forward and will open issues, then we can address it.

#' peak detection in purely chromatographic data.
#'
#' @references
#' Colin A. Smith, Elizabeth J. Want, Grace O'Maille, Ruben Abagyan and
#' Gary Siuzdak. "XCMS: Processing Mass Spectrometry Data for Metabolite
#' Profiling Using Nonlinear Peak Alignment, Matching, and Identification"
#' \emph{Anal. Chem.} 2006, 78:779-787.
#' *Anal. Chem.* 2006, 78:779-787.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Future citations could include the DOI.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point. I'll add the DOI for all references.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the DOIs in commit c120f61

#' p
#'
NULL
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd love to learn what these two NULL do, and what's the difference between 'em.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The NULL after a roxygen documention just ensures that the documentation file (Rd) is created - without adding also the next function to it. For general documentation or collection/topics of function I use that, i.e., write the roxygen documentation and add a NULL at the end. Then, I add all related functions/methods/classes to the documentation using #' @rdname <name of the general, topic, documentation>

expandMz = expandMz, ppm = ppm, skipFilled = skipFilled,
peaks = peaks, chromPeakColumns = chromPeakColumns)
})
res <- Spectra:::.concatenate_spectra(res)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should that be exported by Spectra in the future ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to check the function - could be that it's already exported.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. I will update in the next PR.

@@ -577,35 +584,6 @@ dropGenericProcessHistory <- function(x, fun) {
idxs
}

#' @rdname adjustRtime
adjustRtimePeakGroups <- function(object, param = PeakGroupsParam(),
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is entirely gone now ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on the contrary - I converted that to a method - so I added dedicated implementations for XcmsExperiment, XcmsExperimentHdf5 and XCMSnExp

@@ -3,6 +3,7 @@

############################################################
## show
#' @rdname hidden_aliases
setMethod("show", "xcmsFragments", .xcmsFragments.show)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How useful is xcmsFragments nowadays ? Would this devel cycle be a good time to deprecate it and mark for removal in the next cycle ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be OK with that. I don't know what this class actually is/was used for... But we should maybe start deprecation in a separate branch/PR.

@@ -1,228 +0,0 @@
## # rampInit <- function() {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest in pieces :-)

@@ -0,0 +1,1083 @@
library(rhdf5)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the library() be in the parent directory, for all *hdf5 tests ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to keep the HDF5 functionality as separate as possible thus I did not add it into the main testthat.R file. I find it cleaner if the unit test file specifically sets the stage for its R source file.

jorainer added 2 commits May 19, 2025 18:27
- Add DOI for all references.
- Add parameter `force.overwrite` to `findChromPeaks()` to allow overwriting of
  a HDF5 result file.
- Rename `.is_chrom_peaks_within_mz_rt()` to `.is_chrom_peak_within_mz_rt()`.
@jorainer jorainer requested a review from sneumann May 22, 2025 05:48
@sneumann
Copy link
Owner

Excellent, thanks for the hard work and answering questions. Will merge next. Yours, Steffen

@sneumann sneumann merged commit f8fa7f3 into devel May 26, 2025
2 of 3 checks passed
@jorainer jorainer deleted the large_scale branch May 27, 2025 06:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants