Thanks for the post! The issue here is that Dagger will never be as fast as Base.Threads when the problem is highly regular, the problem size is small, and only multithreading is being used; the reasons for this are multi-fold.
Now, you make the valid point that you would expect a speedup from Dagger on this problem, because you have 2 cores (forget the 4 threads; this problem is memory-bound anyway, so the extra processor ports won't help). That would be ideal, but this problem is really just too small to overcome Dagger's per-task and scheduler overhead.

Of course, we don't want Dagger to have this much overhead! I'm constantly working on reducing Dagger's overhead to improve performance on smaller problems like these, and my hope is that in 1-2 years we'll see <50µs per spawned task. But accomplishing this requires time-intensive performance engineering, and I'm only one person with only so much time to dedicate to optimizations. So if you'd like to pitch in and help me improve performance at this scale, I'd love the help!

**Addendum**

I tried playing with this example on the 32-core (64-thread) AWS VM I've been doing some Dagger benchmarks on, to see at what scale Dagger performs well on this problem. The first thing I found is that, actually,
So there's no free lunch with any parallelization method. 😄 I also don't know why Dagger performs so much better here; maybe because of work stealing and task pinning to threads? Anyway, here's a larger run with problem size 100M:
So Dagger can definitely provide a speedup, assuming your problem size is large and your task count is at the right level to give low overhead and maximum parallelism.
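To illustrate keeping the task count at the right level, here is a minimal sketch of a chunked dot product with Dagger. The function name `dagger_dot` and the `chunksize` default are my own choices for illustration, not anything from the thread or the Dagger API; the idea is simply that each spawned task should process enough elements to amortize the per-task scheduling overhead discussed above.

```julia
using Dagger, LinearAlgebra

# Hypothetical sketch: split the vectors into large chunks so that each
# Dagger task does enough work to amortize its scheduling overhead, then
# sum the partial dot products of the chunks.
function dagger_dot(x, y; chunksize=1_000_000)
    n = length(x)
    tasks = [Dagger.@spawn dot(view(x, r), view(y, r))
             for r in Iterators.partition(1:n, chunksize)]
    return sum(fetch, tasks)
end
```

With a small `chunksize` the overhead dominates (as in the original benchmark); with chunks of roughly a million elements, the per-task cost becomes a small fraction of the useful work.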
I plan on adding parallel support to my tensor decomposition package BlockTensorDecomposition.jl and wanted to test out and understand Dagger.jl before I do.
Following an example similar to the parallel version of `sum` described in the Base.Threads documentation, I wanted to implement a parallel version of a dot product. On my 2-core (4 logical processors) laptop, a basic implementation using threads does speed up the dot product (although not quite the 2-4x I would expect), but using Dagger.jl is slower than both Threads and the serial implementation of `dot`.

Is there something I am not understanding about using these tools? Or is the problem too small for the amount of overhead in Dagger.jl? Below is my implementation and output.
parallel_dot.jl
Output
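(The original `parallel_dot.jl` listing and output were collapsed above and are not shown here. As a point of reference, a minimal threaded dot product along the lines the post describes might look like the sketch below; the function name `threaded_dot` and the chunk-per-thread split are assumptions, not the poster's actual code.)

```julia
using LinearAlgebra, Base.Threads

# Hypothetical sketch: split the index range into one chunk per thread,
# compute a partial dot product per chunk, then sum the partials.
# Indexing `partials` by the loop variable (not threadid()) keeps the
# result correct regardless of how iterations are scheduled.
function threaded_dot(x, y)
    n = length(x)
    nchunks = nthreads()
    partials = zeros(eltype(x), nchunks)
    @threads for i in 1:nchunks
        lo = div((i - 1) * n, nchunks) + 1
        hi = div(i * n, nchunks)
        partials[i] = dot(view(x, lo:hi), view(y, lo:hi))
    end
    return sum(partials)
end
```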