Understanding why Dagger.jl is slower then Base.Threads and even serial code #651

njericha · 2025-08-19T22:47:40Z

njericha
Aug 19, 2025

I plan on adding parallel support to my tensor decomposition package BlockTensorDecomposition.jl and wanted to test out and understand Dagger.jl before I do.

Following a similar example to a parallel version of sum as described in Base.Threads, I wanted to impliment a parallel version of a dot product. On my 2 core (4 logical processors) laptop, a basic implimentation using threads does speed up the dot product (although not quite 2-4 times faster as I would expect), but using Dagger.jl is slower than both Threads, and the serial implimentation of dot.

Is there something I am not understanding about using these tools? Or is the problem too small for the amount of overhead in Dagger.jl? Below is my implimentation and output.

parallel_dot.jl

using Dagger
using Base.Threads
using BenchmarkTools

@show nthreads()

n = 1000000
x = collect(1:n)
y = collect(2:n+1)

function my_dot(x, y)
    total = 0
    for (xi, yi) in zip(x,y)
        total += xi * yi
    end
    return total
end

my_dot(x, y) # compile
println("Regular Dot Product")
@btime my_dot(x, y)

function parallel_dot_threads(x, y)
    chunk_size = length(x) ÷ nthreads()
    chunks_x = Iterators.partition(x, chunk_size)
    chunks_y = Iterators.partition(y, chunk_size)
    tasks = map(xy -> (Threads.@spawn my_dot(xy[1], xy[2])), zip(chunks_x, chunks_y))
    totals = fetch.(tasks)
    return sum(totals)
end

parallel_dot_threads(x, y) # compile
println("Threads Dot Product")
@btime parallel_dot_threads(x, y)

function parallel_dot_dagger(x, y)
    chunk_size = length(x) ÷ nthreads()
    chunks_x = Iterators.partition(x, chunk_size)
    chunks_y = Iterators.partition(y, chunk_size)
    tasks = map(xy -> (Dagger.@spawn my_dot(xy[1], xy[2])), zip(chunks_x, chunks_y))
    totals = fetch.(tasks)
    return sum(totals)
end

parallel_dot_dagger(x, y) # compile
println("Dagger Dot Product")
@btime parallel_dot_dagger(x, y)

Output

nthreads() = 4
Regular Dot Product
  982.100 μs (1 allocation: 16 bytes)
Threads Dot Product
  626.600 μs (36 allocations: 3.16 KiB)
Dagger Dot Product
  1.659 ms (2607 allocations: 124.48 KiB)
333334333334000000

jpsamaroo · 2025-08-20T19:18:53Z

jpsamaroo
Aug 20, 2025
Maintainer

Thanks for the post! The issue here is that Dagger will never be as fast as Base.Threads when the problem is highly regular, problem size is small, and only multithreading is being used. The reason for this is multi-fold:

Dagger is built on top of Base.Threads (specifically Tasks, but basically Threads.@spawn), which means we're just adding overhead
Small problem sizes (and relatedly, higher spawned task counts) cause Dagger's per-task overhead to dominate overall runtime; this effect goes away as problem size increases
Very regular computations like these are trivial for Julia's thread scheduler to handle, and don't really benefit from Dagger's work balancing, work stealing, and scheduler logic

Now, you make the valid point that you would expect a speedup from Dagger on this problem, because you have 2 cores (forget the 4 threads, this problem is memory-bound anyway, so the extra processor ports won't help) - that would be ideal, but this problem is really just too small to overcome Dagger's task and scheduler overhead, which on master adds >= 200us per task. If you're running this problem with 4 Julia threads, that means you have 4 Dagger tasks (since you split the input by nthreads()), so you have ~800us of expected overhead, which about matches what you're seeing here (~600us + ~800us ~= 1.2ms, so quite close).

Of course, we don't want Dagger to have this much overhead! I'm constantly working on reducing Dagger's overhead to enhance performance on smaller problems like these, and my hope is that in 1-2 years we'll expect to see <50us per task spawned. But accomplishing this requires time-intensive performance engineering, and I'm only one person with only so much time to dedicate to optimizations. So if you'd like to help out and help me improve performance at this scale, I'd love the help!

Addendum

I tried playing with this example on the 32-core (64-thread) AWS VM I've been doing some Dagger benchmarks on, to see if I could find the scale that Dagger performs well on this problem. The first thing I found is that, actually, Base.Threads has the same problems as Dagger at small scales (here, still 1M problem size, 64 Julia threads, ran 3 times each):

Regular Dot Product
  594.468 μs (0 allocations: 0 bytes)
333334333334000000
  597.218 μs (0 allocations: 0 bytes)
333334333334000000
  596.969 μs (0 allocations: 0 bytes)
333334333334000000

Threads Dot Product
  15.324 ms (396 allocations: 38.41 KiB)
333334333334000000
  13.342 ms (396 allocations: 38.41 KiB)
333334333334000000
  13.312 ms (396 allocations: 38.41 KiB)
333334333334000000

Dagger Dot Product
  12.932 ms (142330 allocations: 9.21 MiB)
333334333334000000
  12.220 ms (159376 allocations: 10.74 MiB)
333334333334000000
  11.634 ms (175789 allocations: 12.71 MiB)
333334333334000000

So there's no free lunch with any parallelization method 😄 I also don't know why Dagger performs so much better here, maybe because of work stealing and task pinning to threads?

Anyway, here's a larger run with problem size 100M:

Regular Dot Product
  70.184 ms (0 allocations: 0 bytes)
677921401802298880
  70.175 ms (0 allocations: 0 bytes)
677921401802298880
  69.467 ms (0 allocations: 0 bytes)
677921401802298880

Threads Dot Product
  23.724 ms (396 allocations: 38.41 KiB)
677921401802298880
  23.134 ms (396 allocations: 38.41 KiB)
677921401802298880
  23.450 ms (396 allocations: 38.41 KiB)
677921401802298880

Dagger Dot Product
  32.878 ms (216544 allocations: 17.10 MiB)
677921401802298880
  33.088 ms (322587 allocations: 28.63 MiB)
677921401802298880
  33.319 ms (234196 allocations: 18.65 MiB)
677921401802298880

So Dagger can definitely provide a speedup, assuming your problem size is large and task count is at the right level to provide low-overhead and max parallelism.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Understanding why Dagger.jl is slower then Base.Threads and even serial code #651

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Uh oh!

Understanding why Dagger.jl is slower then Base.Threads and even serial code #651

Uh oh!

njericha Aug 19, 2025

Replies: 1 comment

Uh oh!

jpsamaroo Aug 20, 2025 Maintainer

Addendum

njericha
Aug 19, 2025

jpsamaroo
Aug 20, 2025
Maintainer