
ericphanson (Owner)

I tried to see if AllocArrays would work with Lux/Boltz, since it seems to be working for ObjectDetector. There, the only issue I hit was a performance problem: mixing UnsafeArrays with plain matrices caused matmuls to dispatch to generic matmul instead of BLAS. This is kinda similar to the issue from JuliaLang/julia#57799 (though technically quite different), and I came up with a better workaround this time: we can adapt the function within the with_allocator block to homogenize the array types. As long as adapting is fast relative to the runtime, it's kinda fine.
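To illustrate the homogenization idea: here's a minimal, hypothetical sketch (the `homogenize` function below is an illustrative stand-in, not this PR's actual Adapt-based code) of walking a parameter NamedTuple and materializing every array leaf into a plain `Array`, so all arguments end up with one concrete array type and matmuls hit a single BLAS-backed method.

```julia
# Hypothetical sketch: recursively materialize array leaves into plain Arrays
# so that every argument shares one concrete array type.
homogenize(x::AbstractArray) = convert(Array, x)
homogenize(x::NamedTuple) = map(homogenize, x)
homogenize(x::Tuple) = map(homogenize, x)
homogenize(x) = x  # leave non-array leaves (scalars, activation functions, ...) alone

ps = (W = view(rand(Float32, 4, 4), 1:2, 1:2),  # an array view, not a Matrix
      b = rand(Float32, 2),
      σ = tanh)
ps2 = homogenize(ps)
@assert ps2.W isa Matrix{Float32}  # the view was materialized
@assert ps2.σ === tanh             # non-array leaf passed through
```

In the PR the same effect is achieved with Adapt's recursive structure traversal rather than hand-written methods like these.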

However, this means the function we evaluate does not persist, so it might not work well for ObjectDetector, which updates the model struct even on the forward pass via its W parameter. So here I have the terrible name CachedUpdatelessFunction, where "Cached" means it reuses allocations (not values), "Updateless" means it won't persist updates to the model, and "Function" means a generic callable (it doesn't have to be an ML model).
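The "updateless" semantics can be sketched in isolation. This is a conceptual stand-in, not AllocArrays' implementation: `deepcopy` here plays the role of adapting the model into the allocator's arrays, and the wrapper name is hypothetical. The point is that any mutation of the temporary adapted copy is discarded after the call.

```julia
# Conceptual sketch of "updateless" call semantics (hypothetical type; in the
# PR, adaptation into the allocator's arrays plays the role of deepcopy here).
struct UpdatelessFunction{F}
    f::F
end

function (uf::UpdatelessFunction)(args...)
    f_tmp = deepcopy(uf.f)   # temporary copy of the model; mutations won't persist
    out = f_tmp(args...)     # run on the temporary copy
    return deepcopy(out)     # copy results out so nothing aliases temporary buffers
end

# A callable that mutates its own state on every forward pass,
# like ObjectDetector's W update:
mutable struct Counter
    n::Int
end
(c::Counter)(x) = (c.n += 1; x + c.n)

c = Counter(0)
uc = UpdatelessFunction(c)
uc(1)             # the copy's counter increments, then the copy is discarded
@assert c.n == 0  # the wrapped model was not updated
```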

With this, I get

```julia
using Boltz, Lux, JLD2, AllocArrays, Adapt, Random, BenchmarkTools

model = Vision.VGG(13; pretrained=true)
model_aa = AllocArrays.CachedUpdatelessFunction(model)

batch = rand(Float32, 224, 224, 3, 20)
ps, st = Lux.setup(Random.default_rng(), model)

println("Benchmark with Array:")
display(@benchmark $model($batch, $ps, $(Lux.testmode(st))))

println("Benchmark with AllocArray:")
display(@benchmark $model_aa($batch, $ps, $(Lux.testmode(st))))
```

yielding

```
julia> include("script.jl");
Benchmark with Array:
BenchmarkTools.Trial: 3 samples with 1 evaluation per sample.
 Range (min … max):  1.914 s …  1.970 s  ┊ GC (min … max): 4.06% … 3.62%
 Time  (median):     1.938 s             ┊ GC (median):    3.78%
 Time  (mean ± σ):   1.940 s ± 28.213 ms ┊ GC (mean ± σ):  3.82% ± 0.22%

  █                       █                               █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.91 s         Histogram: frequency by time        1.97 s <

 Memory estimate: 2.33 GiB, allocs estimate: 476.
Benchmark with AllocArray:
BenchmarkTools.Trial: 3 samples with 1 evaluation per sample.
 Range (min … max):  1.967 s …  1.991 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.977 s             ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.978 s ± 12.323 ms ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                      █                                █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.97 s         Histogram: frequency by time        1.99 s <

 Memory estimate: 152.69 KiB, allocs estimate: 759.
```

So we were able to eliminate GC and most allocations, though possibly at a small runtime cost. Maybe we are still not hitting some optimized path.

This feels a bit clunky and hard to describe currently, and it moves Adapt from an extension to a regular dep, so I'm not sure we want this, but I thought I'd put it up.
