Optimize run_spin_precession! and run_spin_excitation! for CPU #443
Conversation
Generally, allowing scalar indexing makes the GPU suffer.
When stepping through the sequence in time, we do need to access arrays such as seq.Gx / seq.Gy / seq.Gz, seq.ADC, and sig at a specific index for each time step, hence the scalar indexing. My understanding from https://discourse.julialang.org/t/overcoming-slow-scalar-operations-on-gpu-arrays/49554/5 is that this is OK, since the number of spins is expected to be much greater than the length of seq.Δt inside each block, and nspins is the dimension of the array we are broadcasting along. Although, as I mentioned earlier, the for-loop stepping through time is probably inefficient for the GPU in general, and we should change back to using a matrix for a future BlochKernel sim method. At one point, I did have it so that each scalar index was turned into an array and broadcasted, for example:

Bz_2 .= x .* seq.Gx[seq_idx,:] .+ y .* seq.Gy[seq_idx,:] .+ z .* seq.Gz[seq_idx,:] .+ p.Δw / T(2π * γ)

instead of:

Bz_2 .= x .* seq.Gx[seq_idx] .+ y .* seq.Gy[seq_idx] .+ z .* seq.Gz[seq_idx] .+ p.Δw / T(2π * γ)

But this was slightly slower (both on the CPU and GPU) than plain scalar indexing, so I changed it back.
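To make the comparison concrete, here is a minimal, self-contained sketch of the two variants above (toy data standing in for KomaMRICore's seq and phantom arrays; variable names follow the snippets in the comment):

```julia
# Illustrative only: nspins >> nt, and nspins is the broadcast dimension.
nspins, nt = 10_000, 8
x  = rand(nspins); y = rand(nspins); z = rand(nspins)
Gx = rand(nt);     Gy = rand(nt);    Gz = rand(nt)
Bz_2 = zeros(nspins)

for seq_idx in 1:nt
    # Scalar indexing: Gx[seq_idx] is a single number, broadcast over all nspins.
    Bz_2 .= x .* Gx[seq_idx] .+ y .* Gy[seq_idx] .+ z .* Gz[seq_idx]

    # Variant that was tried and reverted: index with a range to get a
    # 1-element array and broadcast it; measured slightly slower on CPU and GPU.
    # Bz_2 .= x .* Gx[seq_idx:seq_idx] .+ y .* Gy[seq_idx:seq_idx] .+ z .* Gz[seq_idx:seq_idx]
end
```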
Another possible approach for run_spin_precession would be to keep the matrix-based approach, but pre-allocate the matrix with size nspins x seq.Δt for the longest block in the sequence, and then use a view of this matrix for blocks with a smaller seq.Δt. I think this would still cut memory usage down considerably. I'm fairly sure the main reason the time is faster on the CPU is that we now only calculate Mxy when seq.ADC is true for that time step, whereas before we calculated it for every time step, and this is the most expensive calculation. We could still do this using a matrix, however. I can play around a bit with this approach and see how the times compare.
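A hypothetical sketch of that pre-allocate-and-view idea (block lengths and sizes are made up for illustration):

```julia
# Pre-allocate one matrix sized for the longest block, then take views for
# shorter blocks instead of reallocating per block.
block_lengths = [8, 5, 3]            # stand-in for length(seq.Δt) per block
nspins = 10_000
Bz_full = zeros(nspins, maximum(block_lengths))  # single up-front allocation

for nt in block_lengths
    Bz = @view Bz_full[:, 1:nt]      # no allocation; reuses the shared buffer
    Bz .= 0                          # ... fill and use for this block ...
end
```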
Yeah, I'm not sure what's best :/. Let's discuss it tomorrow!
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##           master     #443      +/-   ##
==========================================
+ Coverage   90.52%   90.75%   +0.22%
==========================================
  Files          51       53       +2
  Lines        2786     2855      +69
==========================================
+ Hits         2522     2591      +69
  Misses        264      264
Flags with carried forward coverage won't be shown.
I've updated the file structure of KomaMRICore to match what we discussed. The default run_spin_precession! and run_spin_excitation! functions are now in a new file SimulationMethod.jl and are dispatched at the most abstract level for the sim_method, backend, and prealloc parameters. The specialized CPU implementation of run_spin_precession! is in a separate file, BlochCPU.jl, and is dispatched to for sim_method::Bloch, backend::KA.CPU, and prealloc::BlochCPUPrealloc. I also added an optimized implementation of run_spin_excitation!, which improved memory usage a bit more for the MRI Lab benchmark but didn't seem to affect the simulation time. Let me know if there are any more changes you want made before merging!
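A hedged sketch of that dispatch layout (signatures simplified and struct bodies are placeholders, not the real KomaMRICore code):

```julia
import KernelAbstractions as KA

struct Bloch end                       # stands in for the Bloch sim method type
abstract type PreallocResult end
struct DefaultPrealloc  <: PreallocResult end
struct BlochCPUPrealloc <: PreallocResult end  # CPU-specific scratch buffers

# SimulationMethod.jl: fallback at the most abstract level.
run_spin_precession!(sim_method, backend::KA.Backend, prealloc::PreallocResult) =
    "default matrix-based implementation"

# BlochCPU.jl: selected only for Bloch + CPU backend + CPU preallocations.
run_spin_precession!(sim_method::Bloch, backend::KA.CPU, prealloc::BlochCPUPrealloc) =
    "optimized CPU implementation"
```

Julia's multiple dispatch picks the most specific applicable method, so the CPU path is chosen automatically whenever all three arguments match.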
Doing some more profiling this morning, I was surprised to see some lines allocating more than expected. It turns out that quite a few statements were not fully broadcasted. Apparently, even when multiplying or dividing by a scalar, it still helps to use ./ or .* instead of / or *; I think this makes it easier for the compiler to fuse the entire statement into a single broadcast. Making these changes, I was able to get memory usage to decrease by a factor of ~15 from where it was immediately before! However, simulation time only decreased by about 10%, so at this point we are clearly compute-bound rather than memory-bound.
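The fusion effect described above can be seen in a small example (toy arrays, not KomaMRICore code): Julia only fuses a broadcast expression when every operation in it is dotted, so a single undotted scalar operation splits the expression and allocates a temporary.

```julia
a = rand(10_000); b = similar(a)

# Partially dotted: `2 * a` allocates a temporary array before the fused .+/.=
b .= 2 * a .+ 1

# Fully dotted: the whole right-hand side fuses into one allocation-free loop.
b .= 2 .* a .+ 1

# Equivalent shorthand: @. dots every call and operator in the expression.
@. b = 2 * a + 1
```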
Hi, it looks good! There are a few details it would be nice to change before merging:

(1) Can we call the previous Bloch method for

This is not for now, but:

(2) To organize the pre-allocations, maybe we can put them in structs. For example, I think it makes sense to have
For the excitation, for now we only use B_new (I would like to add a trapezoidal integration later). For the precession, the term exp/cis(phi) is, I think, the same as the Spinor alpha, with beta = 0. Mag is not a very good name for the struct (as it could also be used for B_new and B_old); it is a vector in the SU(2) representation, so VectorSU2. By putting them inside structs we can use functions defined for them, for example
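A rough sketch of the proposed structs (field names are illustrative; VectorSU2 is the name suggested above, not an existing KomaMRI type):

```julia
# Pair of complex components in the SU(2) representation.
struct VectorSU2{T<:AbstractVector}
    α::T   # plays the role of the Spinor alpha; for precession, cis(ϕ) with β = 0
    β::T
end

# Pre-allocated effective-field buffers, grouped per the suggestion above.
struct BlochCPUPrealloc{T<:AbstractVector}
    B_old::VectorSU2{T}   # effective field at the previous time step
    B_new::VectorSU2{T}   # effective field at the current step (used by excitation)
end

# Example: for pure precession the rotation is (cis(ϕ), 0).
ϕ = rand(4)
precession = VectorSU2(cis.(ϕ), zeros(ComplexF64, 4))
```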
Just waiting for the CI to pass to merge! |
I've updated the run_spin_precession function to no longer use a matrix of size nspins x seq.Δt, in order to reduce memory use and (hopefully) simulation time as well. Instead, it now allocates arrays of size 1 x nspins, and steps through time to update them. The arrays are also pre-allocated at the beginning of run_sim_time_iter, so they can be re-used from block-to-block.
Interestingly, running the MRI Lab benchmark on my computer, I saw significantly faster simulation times on the CPU (roughly 4x faster) and a ~80% decrease in memory usage. However, GPU performance for Metal was much slower, going from ~6 seconds for the MRI Lab benchmark to ~46 seconds. I haven't looked deeply into why this is yet, but I suspect this approach of stepping through time is better suited to the CPU than the GPU. When I write a new BlochKernel simulation method using KernelAbstractions, I think we should switch back to the matrix approach, so we will have one method specialized for the CPU and another for the GPU. For the GPU method, we can have one thread for each entry in the nspins x seq.Δt matrix and compute the effective field all at once. Some parts need to be sequential in time, but I think we can reduce across the time dimension in O(log(seq.Δt)) cycles.
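The O(log(seq.Δt)) idea above relies on the per-step time updates being associative. A toy illustration for pure precession, where per-step phase increments simply add, so the sequential time loop is a prefix sum along the time dimension (an operation a GPU can evaluate with a parallel scan):

```julia
# Toy data: 4 spins × 8 time steps of phase increments (not KomaMRICore code).
Δϕ  = rand(4, 8)
ϕ   = cumsum(Δϕ, dims=2)    # accumulated phase at every time step
Mxy = cis.(-ϕ)              # transverse magnetization phasor (unit magnitude)
```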
I was originally planning to have a separate pull request for optimizing run_spin_excitation, but I think it's simpler to just include it in this pull request and merge them together, so I'll work on that next. Other issues fixed:
In Benchmark.yml, the job is not run for external pull requests, since the Buildkite action is not run for them.