Add window-wise RANSAC to reduce RAM requirements #66
Conversation
Whoops, accidentally submitted the PR before writing a description! Fixing that now. Also, apologies for putting so much of this into a single commit: I know that makes it hard to review! I couldn't think of a better way to do this piece-wise though, since apart from the utility functions the rest of the code is interdependent.
gonna do a proper review later, but regarding the weird matprep correlation --> is that maybe because the data are already mean-centered? ... i.e., mean(x) is already around 0, so x - mean(x) is unnecessary and you can just square x directly? Just a hunch ... could be completely off.
That sounds like a good explanation. That said, is it generally going to be true that the real or predicted signal for each channel during any given 5-second window is centered on zero? I should probably compare the results directly with normal Pearson correlation to see how much of a difference it makes. Also, the "bad-by-correlation" code uses the normal numpy `np.corrcoef`. Regardless, if we ultimately decide that Pearson (or some other method of correlation) would make a better default, then we'll probably need to adjust the RANSAC bad-correlation threshold as well, since the current one is tuned for the current calculation method. That makes me think this question might be better suited for a future issue/PR.

EDIT: Just tested directly, and it looks like the results are pretty similar (although still somewhat divergent) between the correlation method MatPREP uses and the one we were using before (`np.corrcoef`).
Thanks Austin, could you perhaps adjust the PR title to be more descriptive of what is truly happening in terms of code? The current title
Make PyPREP's RANSAC produce identical results to MATLAB PREP
is ambiguous :-)
pyprep/utils.py (Outdated)

@@ -121,6 +144,37 @@ def _get_random_subset(x, size, rand_state):
        return sample


    def _correlate_arrays(a, b, axis=1):
Can we call this `_mat_correlate_arrays`? Or something like that? So that we can better see that this comes from MATLAB PREP?
So far I've only used the `_mat` prefix for functions that mimic the behaviour of built-in MATLAB functions. I can add a `_matprep` prefix to differentiate, but I was thinking for my follow-up PR that I'd make `_correlate_arrays` like `_predict_median_signals` and add a `matlab_strict` flag to switch between PyPREP (`np.corrcoef`) and MatPREP behaviour.
but I was thinking for my follow-up PR that I'd make _correlate_arrays like _predict_median_signals and add a matlab_strict flag to switch between PyPREP (np.corrcoef) and MatPREP behaviour.
okay, as long as you don't forget about that, that sounds better 👍
I was thinking that maybe it would be better to use the whole "matlab" prefix instead of just "mat". I understand it's more verbose, but one could confuse "mat" with being associated with a matrix.
In any case, this is just a minor suggestion; it's not that relevant.
I don't know, it was a shot in the dark.
yes, good point :)
that does suggest that mean(x) is indeed close to zero, and it makes little difference whether it's subtracted from x or not. I like that idea.
Me too. I'm going to make the MatPREP correlation method a `matlab_strict` option, then.
Looks good to me. Sorry for joining in late. I have been really busy with uni and work. Really looking forward to testing this implementation against a lot of files I processed through MATLAB's PREP.
@a-hurst are you taking a break from this PR to come back later? Or are there any blocking items that you need help with?
Since I've broken this down into a bunch of separate PRs now, the only new thing this PR adds is support for window-wise RANSAC, which is slower than the current method but has much lower RAM demands for longer recordings. To finish this up, I'm going to rebase on the current master branch and see if I can improve performance at all. I'll try and do the former today!
sounds great! --> but even if it's a bit slower, it'd still be a win for users with low-RAM machines 👍
When I pull this together, should I make window-wise the new default then, with channel-wise still as an option for people with high-RAM workstations who want to process as quickly as possible?
Sounds good!
Codecov Report
@@ Coverage Diff @@
## master #66 +/- ##
==========================================
+ Coverage 97.31% 97.49% +0.17%
==========================================
Files 7 7
Lines 634 678 +44
==========================================
+ Hits 617 661 +44
Misses 17 17
Continue to review full report at Codecov.
@sappelhoff Okay, so I think this is ready for review now (I haven't updated the documentation yet). The main thing I wanted to get feedback on is the repurposing of the `channel_wise` argument. However, I realize that repurposing an existing argument like this is technically a breaking change. Apart from that, my main changes are described in the updated PR description.
Let me know what you think!
generally yes, but with the immense changes that are coming in version 0.4 anyhow, I think a breaking change is acceptable here
another good argument why breaking would be fine here
+1 to None 👍 would be interested in your opinion as well here @yjmantilla |
@@ -362,6 +362,44 @@ def split_list(mylist, chunk_size):
        ]


    def print_progress(current, end, start=None, stepsize=1, every=0.1):
why did you opt to code up your own progress-tracking function instead of using `tqdm`? To avoid an additional dependency?
lol I had no idea `tqdm` existed until you mentioned it. I'd recently written more-or-less the same code for a slower R script, so it was just a question of porting it over, documenting it, and adding tests.
It'd be fun to see how fast/slow your function is compared to what tqdm says:
Overhead is low -- about 60ns per iteration (80ns with tqdm.gui), and is unit tested against performance regression. By comparison, the well-established ProgressBar has an 800ns/iter overhead.
Huh. My guess is that their code is probably better optimized, but that my function is still faster than ProgressBar (which I also didn't know existed until just now): I made no attempt to optimize my approach, but I don't think I'm making any particularly expensive or wasteful function calls per loop, and I'm not trying to write as much to the terminal on every update.
pyprep/find_noisy_channels.py (Outdated)

    Whether RANSAC should predict signals for whole chunks of channels
    at once instead of predicting signals for each RANSAC window
    individually. Channel-wise RANSAC generally has higher RAM demands
    than window-wise RANSAC (especially if `max_chunk_size` is
    ``None``), but can be faster on systems with lots of RAM to spare.
    Defaults to ``False``.
Okay I think I am a little confused by this.
We can run RANSAC in the following ways:
- over all channels and over the whole time --> this is how I coded it back then
- needs huge amounts of RAM
- but if that RAM is available, it's fast
- one channel at a time over the whole time --> this is what @yjmantilla implemented (if I recall correctly)
- much less RAM needed
- significantly slower
- calculate RANSAC over all channels but over windows
- this is what MATLAB PREP does, and what autoreject does
- implemented by @a-hurst
- Probably the best default
There would be a fourth potential option -> running single channels over windows ... so that'd be the slowest possible, but the lowest RAM ever (can probably even run on a raspberry pi 🤣 ).
Overall I am not sure how many of these options we should offer and how to document them.
Right now, I think one option should be fine: the matlab/autoreject way of doing all channels at once but over windows ... and if too little RAM is there, we just need to make the windows smaller ... right? (or would that affect the accuracy at some point?)
what are your opinions @a-hurst @yjmantilla
I'd thought about that too (chunking-by-channel in window-wise RANSAC), but wasn't sure how to best merge the chunking code with the windowing stuff without a massive rewrite.
Right now, window-wise still takes less RAM than one-at-a-time channel-wise unless (channel count x 5 seconds) is larger than the length of the whole recording (i.e., one-channel-at-a-time only takes less RAM if a 32-channel recording is under 2.6 minutes, a 64-channel is under 5.3 mins, a 128-channel is under 10.6 mins, etc.). Thus, in most cases window-wise should be good enough on the RAM front. Still, I wouldn't be opposed to getting that down further: with enough optimization, I could probably get PyPREP running (slowly) without swapping on my old iMac G4!
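The break-even arithmetic above can be sketched out directly. The memory model here is the simplification implied by the comment (window-wise holds all channels for one window, single-channel channel-wise holds one channel for the whole recording), not pyprep's actual RAM accounting:

```python
def channel_wise_uses_less_ram(n_channels, recording_secs, window_secs=5.0):
    """Rule of thumb from the comment above (a sketch, not pyprep's code).

    Window-wise RANSAC holds roughly (n_channels x window_secs) worth of
    predicted samples at once; one-channel-at-a-time channel-wise RANSAC
    holds roughly (1 x recording_secs). So channel-wise only wins when the
    recording is shorter than n_channels * window_secs.
    """
    return recording_secs < n_channels * window_secs


# Break-even recording lengths quoted above, in minutes:
for n_chans in (32, 64, 128):
    print(n_chans, round(n_chans * 5.0 / 60.0, 1))  # 2.7, 5.3, 10.7 min
```

These reproduce the figures in the comment (2.6, 5.3, and 10.6 minutes, up to rounding), which is why window-wise should be the lower-RAM option for almost any realistic recording.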
I think window-wise is better. I remember talking about this in another thread. In the usual use-case, window-wise will consume less RAM than the approach I took.
I think it would be nice to offer a low-RAM version even though it's the slowest; maybe fitting the largest chunk of channels could be done automatically, like it is done currently in the channel-wise version. But it's not a priority. All the work @a-hurst is doing is quite amazing; the priority should be the MATLAB PREP comparison.
So +1 on making window-wise the default
Thanks @yjmantilla! As for the MATLAB PREP comparison, I just got the MATLAB side of that working today (automated MatPREP on GitHub actions, saving relevant data as artifacts for later testing) so expect some preliminary stuff on that end shortly :)
please adjust this docstr according to https://github.com/sappelhoff/pyprep/pull/66/files#r624438768 for consistency :)
lol that commit was after I'd spent a good 30-40 minutes trying to figure out why `pop_saveset` was cryptically telling me I didn't have write access to the artifacts folder.
Turns out, if you use any keyword arguments in a MATLAB function call (e.g. `'savemode', 'onefile'`), that somehow means that positional arguments beyond the first one no longer work and you have to use keyword arguments for everything else (i.e. `pop_saveset(EEG, outpath, 'savemode', 'onefile');` doesn't work, even though `pop_saveset(EEG, 'filename', outpath, 'savemode', 'onefile');` does and `pop_saveset(EEG, outpath);` does too). And I thought passing argument names as strings was bad enough...
@sappelhoff Just thinking of how to implement the public API for this now: do you think these kinds of settings should be folded into the existing RANSAC options, or kept as separate parameters? EDIT: Also, should `find_all_bads` get these parameters as well?
Probably separate parameters 🤔 and yes, the find_all_bads method should have these params as well.
It does not, and even if it did I would do a fork.
I agree with this. I don't think many people used that channel-wise argument, and people are aware this is an early-stage Python package, so I think we should do our best to make things intuitive and logically coherent up until the point we are ready to say "this is it, pyprep is operating as it should be"
@sappelhoff I've now added a proper public API for the new RANSAC settings. I'm still open to adding chunking support for window-wise as you suggested, but I think that might be better to do in a future PR given the amount of change in this one as-is: from a cursory look at what's involved, I'd have to refactor both channel-wise and window-wise RANSAC a fair bit to make that work (see all the wrapping try/except RAM-checking that's currently not used at all for window-wise after the initial check further up). While working on this PR I had some thoughts on how to clean that up too, but again I didn't want to overload this with too many peripheral changes.
pyprep/find_noisy_channels.py (Outdated)

-    ransac : bool
-        To detect channels by ransac or not.
+    ransac : bool, optional
+        Whether RANSAC should be for bad channel detection, in addition to
Suggested change:
-    Whether RANSAC should be for bad channel detection, in addition to
+    Whether RANSAC should be used for bad channel detection, in addition to
pyprep/find_noisy_channels.py (Outdated)

    Whether RANSAC should predict signals for whole chunks of channels
    at once instead of predicting signals for each RANSAC window
    individually. Channel-wise RANSAC generally has higher RAM demands
Suggested change:
-    Whether RANSAC should predict signals for whole chunks of channels
-    at once instead of predicting signals for each RANSAC window
-    individually. Channel-wise RANSAC generally has higher RAM demands
+    Whether RANSAC should predict signals for chunks of channels over the entire
+    signal length ("channel-wise RANSAC", see `max_chunk_size` parameter).
+    If ``False``, RANSAC will instead predict signals for all channels at once
+    but over a number of smaller time windows instead of over the entire
+    signal length ("window-wise RANSAC").
+    Channel-wise RANSAC generally has higher RAM demands
Nice and clear, I like it!
@@ -38,6 +38,19 @@ class PrepPipeline:
    ransac : bool, optional
        Whether or not to use RANSAC for noisy channel detection in addition to
        the other methods in :class:`~pyprep.NoisyChannels`. Defaults to True.
    channel_wise : bool, optional
@@ -32,6 +32,19 @@ class Reference:
    ransac : bool, optional
        Whether or not to use RANSAC for noisy channel detection in addition to
        the other methods in :class:`~pyprep.NoisyChannels`. Defaults to True.
    channel_wise : bool, optional
once more, see: https://github.com/sappelhoff/pyprep/pull/66/files#r624438768
@@ -70,10 +72,17 @@ def find_bad_by_ransac(
    The duration (in seconds) of each RANSAC correlation window. Defaults to
    5 seconds.
    channel_wise : bool, optional
        Whether RANSAC should be performed one channel at a time (lower RAM
Cool! I have only one major comment for the docstr, the rest looks good to go!
Fixed in the latest commit, should be good to go now 🙂
you had a tiny typo and I think you forgot to update one of the many docstrs. These suggestions should fix it. I'll accept them and then merge -> but feel free to double check :-)
Thanks a lot! One leap closer to 0.4 👏
PR Description

(Updated) This PR does a few things:

Currently Missing
- `matlab_strict` mode (handled in an earlier PR)

Other Notes
- On my test recording (`sub_002/ses-01` from that open meditation dataset), window-wise RANSAC is about 36% slower than the existing RANSAC approach (192 vs 141 seconds). However, RAM usage during RANSAC is now much lower (maxed out at 2.85 GB), which is nice but also suggests that window-wise RANSAC could be improved to take better advantage of RAM on beefier systems.

Merge Checklist
- Use `closes #<issue-number>` to automatically close an issue