-
Notifications
You must be signed in to change notification settings - Fork 82
Add support for (very) large-scale data #794
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- `chromatograms()` to throw an error if parameter `mz` or `rt` contain missing values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, awesome work. Some cosmetic questions here and there. Yours, Steffen
@@ -291,156 +294,6 @@ exportClasses( | |||
"FilterIntensityParam", | |||
"ChromPeakAreaParam" | |||
) | |||
## Param methods |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Were those removed entirely ? I see a few new exports elsewhere, but only a minority of these here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed all of the getter/setter methods for the *Param
classes (please let me know if there are any left - I would like to remove all).
Why I removed these:
- it clutters the documentation with seldomly, if at all, used functions
- users will almost never use these functions. It's easier to just create a new param object than changing the value of a previously created one.
|
||
## Changes in version 4.7.1 | ||
|
||
- Change the naming convention of chromatographic peaks to include also the MS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That starts to put semantics into the identifier. I am extra careful with that, code needs to check MS level <=9. Are there any other semantics that could creep around the corner and need consideration ? Ion Mobility ? Probably not, remains an attribute of the peaks. Can you still rely on the leading digit in older datasets ? CP2001 could also be the 2001st peak in an old dataset ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The chrom peak identifiers were always just thought as an identifier within the same data set. So, IMHO, it does not matter what values these identifiers have (whether it's a running number or even just a random number, as long as it is unique). I agree that adding semantics to identifiers is not ideal - but for the new XcmsExperimentHdf5
I needed something that allows me to define chromatographic peaks IDs separately for each process/file. Peak detection is performed on a per sample (file) basis and each of these processes stores the identified peaks to the HDF5 file without knowing anything about the other projects. That's why I came up with this solution - to still have unique chrom peak IDs within the same data set. I'm of course open to any other working solution :)
- Optimization and performance improvements for extraction of chromatographic | ||
data. This includes using `MsCoreUtils::reduce()`. | ||
- Restructure and clean-up of documentation. | ||
- Don't export unnecessary get/set methods for `Param` classes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Methods are not neccessary or their export ? Could we have any old users of those functions ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hard to tell if there are any users out there - I guess very few if any (we never really promoted the idea of using getter/setter methods of the parameter classes). I would change and see what happens - my guess is users come forward and will open issues, then we can address it.
#' peak detection in purely chromatographic data. | ||
#' | ||
#' @references | ||
#' Colin A. Smith, Elizabeth J. Want, Grace O'Maille, Ruben Abagyan and | ||
#' Gary Siuzdak. "XCMS: Processing Mass Spectrometry Data for Metabolite | ||
#' Profiling Using Nonlinear Peak Alignment, Matching, and Identification" | ||
#' \emph{Anal. Chem.} 2006, 78:779-787. | ||
#' *Anal. Chem.* 2006, 78:779-787. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Future citations could include the DOI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point. I'll add the DOI for all references.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the DOIs in commit c120f61
#' p | ||
#' | ||
NULL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd love to learn what these two NULL do, and what's the difference between 'em.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The NULL
after a roxygen documention just ensures that the documentation file (Rd) is created - without adding also the next function to it. For general documentation or collection/topics of function I use that, i.e., write the roxygen documentation and add a NULL
at the end. Then, I add all related functions/methods/classes to the documentation using #' @rdname <name of the general, topic, documentation>
expandMz = expandMz, ppm = ppm, skipFilled = skipFilled, | ||
peaks = peaks, chromPeakColumns = chromPeakColumns) | ||
}) | ||
res <- Spectra:::.concatenate_spectra(res) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should that be exported by Spectra in the future ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have to check the function - could be that it's already exported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one without leading dot ?
https://github.com/rformassspectrometry/Spectra/blob/82f773a3a8a305341b9baed3216632599649719d/NAMESPACE#L14
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. I will update in the next PR.
@@ -577,35 +584,6 @@ dropGenericProcessHistory <- function(x, fun) { | |||
idxs | |||
} | |||
|
|||
#' @rdname adjustRtime | |||
adjustRtimePeakGroups <- function(object, param = PeakGroupsParam(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is entirely gone now ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on the contrary - I converted that to a method - so I added dedicated implementations for XcmsExperiment
, XcmsExperimentHdf5
and XCMSnExp
@@ -3,6 +3,7 @@ | |||
|
|||
############################################################ | |||
## show | |||
#' @rdname hidden_aliases | |||
setMethod("show", "xcmsFragments", .xcmsFragments.show) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How useful is xcmsFragments nowadays ? Would this devel cycle be a good time to deprecate it and mark for removal in the next cycle ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be OK with that. I don't know what this class actually is/was used for... But we should maybe start deprecation in a separate branch/PR.
@@ -1,228 +0,0 @@ | |||
## # rampInit <- function() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rest in pieces :-)
@@ -0,0 +1,1083 @@ | |||
library(rhdf5) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should the library() be in the parent directory, for all *hdf5 tests ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to keep the HDF5 functionality as separate as possible thus I did not add it into the main testthat.R file. I find it cleaner if the unit test file specifically sets the stage for its R source file.
- Add DOI for all references. - Add parameter `force.overwrite` to `findChromPeaks()` to allow overwriting of a HDF5 result file. - Rename `.is_chrom_peaks_within_mz_rt()` to `.is_chrom_peak_within_mz_rt()`.
Excellent, thanks for the hard work and answering questions. Will merge next. Yours, Steffen |
This PR adds a new xcms result object
XcmsExperimentHdf5
that stores all preprocessing results to a local file in HDF5 format. Thus, no data/results are kept in memory reducing the memory demand to a minimum.For an example and performance evaluation of this backend check the large_scale branch on Metabonaut: https://github.com/rformassspectrometry/Metabonaut/tree/large_scale_data
There are also other, smaller, changes in this PR including:
chromatogram
call by reading only spectra within the retention time rangesApologies for the very large PR - there are unit tests for all functions, so the code should be OK. So maybe just discuss/evaluate the concept?